Meta Learning via Learned Loss


Typically, loss functions, regularization mechanisms and other important aspects of training parametric models are chosen heuristically from a limited set of options. In this paper, we take the first step towards automating this process, with the view of producing models which train faster and more robustly. Concretely, we present a meta-learning method for learning parametric loss functions that can generalize across different tasks and model architectures. We develop a pipeline for “meta-training” such loss functions, targeted at maximizing the performance of the model trained under them. The loss landscape produced by our learned losses significantly improves upon the original task-specific losses in both supervised and reinforcement learning tasks. Furthermore, we show that our meta-learning framework is flexible enough to incorporate additional information at meta-train time. This information shapes the learned loss function such that the environment does not need to provide this information during meta-test time.

1 Introduction

Inspired by the remarkable capability of humans to quickly learn and adapt to new tasks, the concept of learning to learn, or meta-learning, recently became popular within the machine learning community (l2l; rl2; maml). We can classify learning to learn methods into roughly two categories: approaches that learn representations that can generalize and are easily adaptable to new tasks (maml), and approaches that learn how to optimize models (l2l; rl2).

Figure 1: Framework overview: The learned meta-loss is used as a learning signal to optimize the optimizee $f_\theta$, which can be a regressor, a classifier or a control policy.

In this paper we investigate the second type of approach. We propose a learning framework that is able to learn any parametric loss function—as long as its output is differentiable with respect to its parameters. Such learned functions can be used to efficiently optimize models for new tasks.

Specifically, the purpose of this work is to encode learning strategies into a parametric loss function, or a meta-loss, which generalizes across multiple training contexts or tasks. Inspired by inverse reinforcement learning (ng2000algorithms), our work combines the learning to learn paradigm of meta-learning with the generality of learning loss landscapes. We construct a unified, fully differentiable framework that can learn optimizee-independent loss functions to provide a strong learning signal for a variety of learning problems, such as classification, regression or reinforcement learning. Our framework involves an inner and an outer optimization loop. In the inner loop, a model or optimizee is trained with gradient descent using the loss produced by our learned meta-loss function. Fig. 1 shows the pipeline for updating the optimizee with the meta-loss. The outer loop optimizes the meta-loss function by minimizing a task loss, such as a standard regression or reinforcement-learning loss, induced by the updated optimizee.

The contributions of this work are as follows: i) we present a framework for learning adaptive, high-dimensional loss functions through back-propagation that create loss landscapes amenable to efficient optimization with gradient descent. We show that our learned meta-loss functions improve over directly learning via the task loss itself, while maintaining the generality of the task loss. ii) We present several ways our framework can incorporate extra information that helps shape the loss landscapes at meta-train time. This extra information can take on various forms, such as exploratory signals or expert demonstrations for RL tasks. After the meta-loss function has been trained, the task-specific losses are no longer required: the training of optimizees can be performed entirely with the meta-loss function alone, without the extra information given at meta-train time. In this way, our meta-loss can find more efficient ways to optimize the original task loss.

We apply our meta-learning approach to a diverse set of problems, demonstrating our framework's flexibility and generality: regression, image classification, behavior cloning, and model-based and model-free reinforcement learning. Our experiments include empirical evaluations for each of these problems.

2 Related Work

Meta-learning originates from the concept of learning to learn (Schmidhuber:87long; bengio:synaptic; ThrunP98). Recently, there has been wide interest in improving learning speed and generalization to new tasks through meta-learning. Consider gradient-based learning approaches that update the parameters $\theta$ of an optimizee $f_\theta$ with inputs $x$ as follows:

$$\theta_{\text{new}} = h_\psi\big(\theta, \nabla_\theta \mathcal{L}_\phi(f_\theta(x))\big),$$

where we take the gradient of a loss function $\mathcal{L}_\phi$, parametrized by $\phi$, with respect to the optimizee's parameters $\theta$, and use a gradient transform $h_\psi$, parametrized by $\psi$, to compute the new model parameters $\theta_{\text{new}}$. In this context, we can divide related work on meta-learning into learning model parameters $\theta$ that can be easily adapted to new tasks (maml; mendonca2019guided; gupta2018meta; yu2018one), learning optimizer policies $h_\psi$ that transform parameter updates with respect to known loss or reward functions (maclaurin2015gradient; l2l; li2016learning; franceschi2017forward; meier2018online; rl2), or learning loss/reward function representations $\mathcal{L}_\phi$ (sung2017learning; epg; zou2019reward). Alternatively, in unsupervised learning settings, meta-learning has been used to learn unsupervised rules that can be transferred between tasks (metz2018learning; cactus).

Our framework falls into the category of learning loss landscapes. Similar to the works of sung2017learning and epg, we aim to learn loss function parameters that can be applied to various optimizee models, e.g. regressors, classifiers or agent policies. Our learned loss functions are independent of the model parameters that are to be optimized, and thus can be easily transferred to other optimizee models. This is in contrast to methods that meta-learn model parameters directly (e.g. maml; mendonca2019guided), where the learned representation cannot be separated from the original model of the optimizee; such methods are orthogonal and complementary to ours. The idea of learning loss landscapes or reward functions in the reinforcement learning (RL) setting can be traced back to the field of inverse reinforcement learning (ng2000algorithms; an04, IRL). However, in contrast to IRL we do not require expert demonstrations (although we can incorporate them). Instead, we use task losses as a measure of the effectiveness of our loss function when using it to update an optimizee.

Closest to our method are the works on evolved policy gradients (epg), teacher networks (WuTXFQLL18), meta-critics (sung2017learning) and meta-gradient RL (xusilver). In contrast to using an evolutionary approach (e.g. epg), we design a differentiable framework and describe a way to optimize the loss function with gradient descent in both supervised and reinforcement learning settings. WuTXFQLL18 propose that instead of learning a differentiable loss function directly, a teacher network is trained to predict parameters of a manually designed loss function, whereas each new loss function class requires a new teacher network design and training. In xusilver, discount and bootstrapping parameters are learned online to optimize a task-specific meta-objective. Our method does not require manual design of the loss function parameterization or choosing particular parameters that have to be optimized, as our loss functions are learned entirely from data. Finally, in work by sung2017learning a meta-critic is learned to provide a task-conditional value function, used to train an actor policy. Although training a meta-critic in the supervised setting reduces to learning a loss function as in our work, in the reinforcement learning setting we show that it is possible to use learned loss functions to optimize policies directly with gradient descent.

3 Meta-Learning via Learned Loss

In this work, we aim to learn a loss function, which we call a meta-loss, that is subsequently used to train an optimizee, e.g. a classifier, a regressor or a control policy. More concretely, we aim to learn a meta-loss function $M_\phi$ with parameters $\phi$ that outputs the loss value $\mathcal{L}_{\text{learned}}$ used to train an optimizee $f_\theta$ with parameters $\theta$ via gradient descent:

$$\theta_{\text{new}} = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{learned}}, \quad \text{where} \quad \mathcal{L}_{\text{learned}} = M_\phi\big(y, f_\theta(x)\big),$$

where $y$ can be ground-truth target information in supervised learning settings, or goal and state information in reinforcement learning settings. In short, we aim to learn a loss function that can be used as depicted in Algorithm 2. Towards this goal, we propose an algorithm to learn the meta-loss function parameters $\phi$ via gradient descent.
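To make the meta-test usage concrete, the following is a minimal runnable sketch of this update loop. The `meta_loss_grad` below is a hypothetical stand-in for the gradient of a trained meta-loss network $M_\phi$ (for a real meta-network, this gradient would come from automatic differentiation); the linear model and data are illustrative only.

```python
import numpy as np

def meta_test_train(meta_loss_grad, x, y, theta, alpha=0.1, steps=100):
    """Meta-test sketch: repeatedly update the optimizee's parameters with
    gradients of the (learned) meta-loss; the task loss is never used."""
    for _ in range(steps):
        theta = theta - alpha * meta_loss_grad(theta, x, y)
    return theta

# Hypothetical stand-in for a trained meta-loss gradient: here the gradient
# of a squared-error loss for a linear model f_theta(x) = theta * x.
def meta_loss_grad(theta, x, y):
    return np.mean(2.0 * (theta * x - y) * x)

x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x                         # ground truth: theta* = 3
theta = meta_test_train(meta_loss_grad, x, y, theta=0.0)
```

With a well-behaved loss gradient, the optimizee converges to the ground-truth parameter without ever evaluating a task loss at test time.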

The key challenge is to derive a training signal for learning the loss parameters $\phi$. In the following, we describe our approach to addressing this challenge, which we call Meta-Learning via Learned Loss (ML).

3.1 ML for Supervised Learning

We start with supervised learning settings, in which our framework aims to learn a meta-loss function that produces the loss value given the ground-truth target $y$ and the predicted target $f_\theta(x)$. For clarity, we constrain the following presentation to learning a meta-loss network that produces a loss value for training a regressor via gradient descent; however, the methodology trivially generalizes to classification tasks.

Our meta-learning framework starts with randomly initialized model parameters $\theta$ and loss parameters $\phi$. The current loss parameters are used to produce the loss value $\mathcal{L}_{\text{learned}} = M_\phi(y, f_\theta(x))$. To optimize the model parameters we need the gradient of this loss value with respect to $\theta$. Using the chain rule, we can decompose this into the gradient of the meta-loss network with respect to the predictions of model $f_\theta$, times the gradient of the model with respect to its parameters:

$$\theta_{\text{new}} = \theta - \alpha \nabla_\theta M_\phi\big(y, f_\theta(x)\big) = \theta - \alpha \nabla_f M_\phi\big(y, f_\theta(x)\big)\, \nabla_\theta f_\theta(x).$$

Once we have updated the model parameters $\theta_{\text{new}}$ using the current meta-loss network $M_\phi$, we want to measure how much learning progress has been made with loss parameters $\phi$, and optimize $\phi$ via gradient descent. Note that the new model parameters $\theta_{\text{new}}$ are implicitly a function of the loss parameters $\phi$, because changing $\phi$ would lead to a different $\theta_{\text{new}}$. In order to evaluate $\theta_{\text{new}}$, and through it the loss parameters $\phi$, we introduce the notion of a task loss $\mathcal{L}_T$ at meta-train time. For instance, we use the mean-squared-error (MSE) loss, which is typically used for regression tasks, as a task loss. We now optimize the loss parameters $\phi$ by taking the gradient of $\mathcal{L}_T$ with respect to $\phi$:

$$\phi_{\text{new}} = \phi - \eta \nabla_\phi \mathcal{L}_T\big(y, f_{\theta_{\text{new}}}(x)\big), \quad \text{with} \quad \nabla_\phi \mathcal{L}_T = \nabla_{\theta_{\text{new}}} \mathcal{L}_T\, \nabla_\phi \theta_{\text{new}},$$

where we first apply the chain rule and observe that the gradient with respect to the meta-loss parameters $\phi$ requires the new model parameters $\theta_{\text{new}}$. We then expand $\theta_{\text{new}}$ as one gradient step on $\theta$ based on the meta-loss $M_\phi$, making the dependence on $\phi$ explicit.
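This outer-loop gradient can be checked on a scalar example. The sketch below instantiates the chain rule with a toy model $f_\theta(x) = \theta x$ and a one-parameter meta-loss $M_\phi = \phi\,(f_\theta(x) - y)^2$ (a stand-in for a meta-network, not the paper's architecture), and verifies the analytic gradient of the task loss with respect to $\phi$ against finite differences.

```python
# Toy instantiation of the outer update: model f_theta(x) = theta * x,
# meta-loss M_phi(y, yhat) = phi * (yhat - y)^2 with a single scalar
# meta-parameter phi (an illustrative stand-in for a meta-network).
x, y = 1.0, 2.0
theta, phi, alpha = 0.0, 1.5, 0.1

def inner_step(theta, phi):
    grad_theta = 2.0 * phi * (theta * x - y) * x   # d M_phi / d theta
    return theta - alpha * grad_theta              # theta_new(phi)

def task_loss(theta_new):
    return (theta_new * x - y) ** 2                # MSE task loss L_T

# Analytic outer gradient: chain rule through the inner update,
# dL_T/dphi = (dL_T/dtheta_new) * (dtheta_new/dphi).
theta_new = inner_step(theta, phi)
dL_dtheta_new = 2.0 * (theta_new * x - y) * x
dtheta_new_dphi = -alpha * 2.0 * (theta * x - y) * x
analytic = dL_dtheta_new * dtheta_new_dphi

# Finite-difference check of the same quantity.
eps = 1e-6
numeric = (task_loss(inner_step(theta, phi + eps))
           - task_loss(inner_step(theta, phi - eps))) / (2 * eps)
```

In a full implementation, this gradient would be obtained by back-propagating through the inner update with automatic differentiation rather than by hand.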

Optimization of the loss parameters $\phi$ can happen either after each inner gradient step (where inner refers to using the current loss parameters $\phi$ to update $\theta$), or after $N$ inner gradient steps with the current meta-loss network $M_\phi$.

The latter option requires back-propagation through the chain of all optimizee update steps. In practice, we notice that updating the meta-parameters after each inner gradient step works better. We reset $\theta$ after $N$ inner gradient steps. We summarize the meta-train phase in Algorithm 1, with one inner gradient step.

Algorithm 1 (ML at meta-train):
1: randomly initialize $\phi$
2: while not done do
3:     randomly initialize $\theta$
4:     $\mathcal{L}_{\text{learned}} \leftarrow M_\phi(y, f_\theta(x))$
5:     $\theta_{\text{new}} \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\text{learned}}$
6:     $\mathcal{L}_T \leftarrow$ task loss of $f_{\theta_{\text{new}}}$
7:     $\phi \leftarrow \phi - \eta \nabla_\phi \mathcal{L}_T$
8: end while

Algorithm 2 (ML at meta-test):
1: $\phi \leftarrow$ meta-trained loss parameters
2: randomly initialize $\theta$
3: for $j = 1, \dots, N$ do
4:     $\mathcal{L}_{\text{learned}} \leftarrow M_\phi(y, f_\theta(x))$
5:     $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\text{learned}}$
6: end for
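The meta-train loop can be exercised end-to-end on a scalar toy problem: model $f_\theta(x) = \theta x$, meta-loss $M_\phi = \phi\,(f_\theta(x) - y)^2$ with a single meta-parameter $\phi$. All quantities here are illustrative assumptions, not the paper's networks; the point is that the outer loop tunes $\phi$ so that a single inner gradient step makes as much progress as possible on the task loss.

```python
# Meta-train sketch: outer gradient descent on phi, one inner step on theta
# per outer iteration, with theta re-initialized each time.
x, y, alpha, eta = 1.0, 2.0, 0.1, 1.0
phi = 1.5

for _ in range(200):                              # outer loop (meta-train)
    theta = 0.0                                   # re-initialize optimizee
    grad_theta = 2.0 * phi * (theta * x - y) * x  # d M_phi / d theta
    theta_new = theta - alpha * grad_theta        # inner step
    # outer gradient of task loss L_T = (theta_new*x - y)^2 w.r.t. phi,
    # chained through the inner update
    dL_dphi = (2.0 * (theta_new * x - y) * x) \
              * (-alpha * 2.0 * (theta * x - y) * x)
    phi -= eta * dL_dphi

# Meta-test: ONE inner step with the learned phi nearly solves the task.
theta_new = 0.0 - alpha * 2.0 * phi * (0.0 * x - y) * x
final_task_loss = (theta_new * x - y) ** 2
```

In this degenerate example the learned $\phi$ effectively acts as an optimal step size; a real meta-loss network can shape the entire loss landscape, not just its scale.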

3.2 ML for Reinforcement Learning

In this section, we introduce several modifications that allow us to apply the ML framework to reinforcement learning problems. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \rho_0, \gamma, T)$ be a finite-horizon Markov Decision Process (MDP), where $\mathcal{S}$ and $\mathcal{A}$ are state and action spaces, $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is a state-transition probability function or system dynamics, $\mathcal{R}(s_t, a_t)$ a reward function, $\rho_0(s_0)$ an initial state distribution, $\gamma$ a reward discount factor, and $T$ a horizon. Let $\tau = (s_0, a_0, \dots, s_T, a_T)$ be a trajectory of states and actions and $R(\tau) = \sum_{t=0}^{T} \gamma^t \mathcal{R}(s_t, a_t)$ the trajectory return. The goal of reinforcement learning is to find parameters $\theta$ of a policy $\pi_\theta(a \mid s)$ that maximize the expected discounted reward over trajectories induced by the policy, $\mathbb{E}_{\pi_\theta}[R(\tau)]$, where $s_0 \sim \rho_0$, $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$ and $a_t \sim \pi_\theta(\cdot \mid s_t)$. In what follows, we show how to train a meta-loss network to perform effective policy updates in a reinforcement learning scenario. To apply our ML framework, we replace the optimizee $f_\theta$ from the previous section with a stochastic policy $\pi_\theta(a \mid s)$. We present two applications of ML to RL.
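As a small concrete reference for the notation, the trajectory return $R(\tau)$ can be computed as follows (a generic helper, not code from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Trajectory return R(tau) = sum_t gamma^t * r_t,
    accumulated right-to-left for clarity."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```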

ML for Model-Based Reinforcement Learning

Model-based RL (MBRL) attempts to learn a policy by first learning a dynamics model $\hat{\mathcal{P}}$. Intuitively, if the model $\hat{\mathcal{P}}$ is accurate, we can use it to optimize the policy parameters $\theta$. As we typically do not know the dynamics model a priori, MBRL algorithms iterate between using the current approximate dynamics model $\hat{\mathcal{P}}$ to optimize the policy $\pi_\theta$ such that it maximizes the reward under $\hat{\mathcal{P}}$, and using the optimized policy to collect more data, which in turn is used to update the model $\hat{\mathcal{P}}$. In this context, we aim to learn a loss function that is used to optimize the policy parameters through our meta-network $M_\phi$.

Similar to the supervised learning setting, we use the current meta-parameters $\phi$ to optimize the policy parameters $\theta$ under the current dynamics model $\hat{\mathcal{P}}$:

$$\theta_{\text{new}} = \theta - \alpha \nabla_\theta M_\phi(\tau, g), \quad \tau \sim \big(\pi_\theta, \hat{\mathcal{P}}\big),$$

where $\tau$ is the sampled trajectory and the variable $g$ captures task-specific information, such as the goal state of the agent. To optimize $\phi$ we again need to define a task loss, which in the MBRL setting can be defined as the (negated) reward achieved by $\pi_{\theta_{\text{new}}}$ when rolled out under the current dynamics model $\hat{\mathcal{P}}$. To update $\phi$, we compute the gradient of the task loss with respect to $\phi$, which involves differentiating all the way through the reward function, the dynamics model and the policy that was updated using the meta-loss $M_\phi$. The pseudo-code in Algorithm 3 (Appendix A) illustrates the MBRL learning loop. In Algorithm 5 (Appendix A), we show the policy optimization procedure during meta-test time. Notably, we have found that in practice the dynamics model is no longer needed for policy optimization at meta-test time: the meta-network learns to implicitly represent the gradients of the dynamics model and can produce a loss to optimize the policy directly.
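The key mechanism, differentiating the task loss through the learned dynamics model and the policy, can be sketched with scalar stand-ins (a linear "learned model", a linear policy, and a one-step rollout; all names and values below are illustrative assumptions):

```python
# MBRL sketch: the task loss is evaluated by rolling the policy through a
# LEARNED dynamics model and differentiating through it.
s0, g = 0.0, 1.0                 # start state and goal
model_gain = 0.5                 # learned model: s' = s + model_gain * u
theta, lr = 0.0, 0.5             # linear policy u = theta * (g - s)

for _ in range(100):
    u = theta * (g - s0)                     # policy action
    s1 = s0 + model_gain * u                 # simulated rollout in the model
    # task loss L_T = (s1 - g)^2; its gradient chains through the
    # model (factor model_gain) and the policy (factor g - s0)
    dL_dtheta = 2.0 * (s1 - g) * model_gain * (g - s0)
    theta -= lr * dL_dtheta
```

Here the optimum is $\theta = 2$, exactly compensating the model gain; the gradient never touches the true dynamics, only the learned model, which is what makes the procedure differentiable end to end.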

Ml for Model-Free Reinforcement Learning

Finally, we consider the model-free reinforcement learning (MFRL) case, where we learn a policy without learning a dynamics model. In this case, we can define a surrogate objective, which is independent of the dynamics model, as our task-specific loss (reinforce; sutton00; SchulmanHWA15):

$$\mathcal{L}_T(\theta_{\text{new}}) = -\mathbb{E}_{\tau \sim \pi_{\theta_{\text{new}}}}\Big[\sum_{t=0}^{T} \log \pi_{\theta_{\text{new}}}(a_t \mid s_t)\, R(\tau)\Big],$$

whose gradient with respect to $\theta_{\text{new}}$ recovers the standard policy gradient.
Similar to the MBRL case, the task loss is indirectly a function of the meta-parameters $\phi$ that are used to update the policy parameters. Although we evaluate the task loss on full trajectory rewards, we perform the policy updates from Eq. 2 using stochastic gradient descent (SGD) on the meta-loss with mini-batches of experience of batch size $B$, similar to epg. The inputs of the meta-loss network are the sampled states, sampled actions, task information and policy probabilities of the sampled actions: $M_\phi\big(s, a, g, \pi_\theta(a \mid s)\big)$. In this way, we enable efficient optimization of very high-dimensional policies with SGD, provided only with trajectory-based rewards. In contrast to the MBRL setting above, the rollouts used for task-loss evaluation are real system rollouts instead of simulated rollouts. At test time, we use the same policy update procedure as in the MBRL setting, see Algorithm 5 (Appendix A).
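A minimal sketch of this surrogate objective on a two-armed bandit with a softmax policy is shown below; the fixed "sampled" batch, returns and learning rate are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Policy-gradient surrogate L = -E[ log pi(a) * R ] on a 2-armed bandit.
logits = np.zeros(2)
actions = np.array([0, 0, 1, 1])          # a fixed "sampled" batch
returns = np.array([1.0, 1.0, 0.0, 0.0])  # arm 0 pays off, arm 1 does not

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(50):
    probs = softmax(logits)
    # gradient of -log pi(a) w.r.t. logits is (probs - onehot(a));
    # weighting by the return gives the surrogate gradient
    grad = np.zeros(2)
    for a, R in zip(actions, returns):
        grad += (probs - np.eye(2)[a]) * R
    grad /= len(actions)
    logits -= 0.5 * grad

final_probs = softmax(logits)
```

Descending the surrogate shifts probability mass onto the high-return action, which is exactly the learning signal the meta-loss network is trained to reproduce and improve upon.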

3.3 Shaping the ML loss by adding extra loss information at meta-train time

So far, we have discussed using standard task losses, such as the MSE loss for regression or reward functions for RL settings. However, it is possible to provide more information about the task at meta-train time, which can influence the learning of the loss landscape. We can design our task losses to incorporate extra penalties; for instance, we can extend the MSE loss with an additional term $\mathcal{L}_{\text{extra}}$ and weight the two terms with $\beta$ and $\gamma$:

$$\mathcal{L}_T = \beta\, \mathcal{L}_{\text{MSE}} + \gamma\, \mathcal{L}_{\text{extra}}.$$
In our work, we experiment with four different types of extra loss information at meta-train time. For supervised learning, we show that adding extra information through $\mathcal{L}_{\text{extra}} = \|\theta - \theta^*\|^2$, where $\theta^*$ are the optimal regression parameters, can help shape a convex loss landscape for otherwise non-convex optimization problems; we also show how we can use $\mathcal{L}_{\text{extra}}$ to induce a physics prior in robot model learning. For reinforcement learning tasks, we demonstrate that by providing additional rewards in the task loss during meta-train time, we can encourage the trained meta-loss to learn exploratory behaviors; and, also for reinforcement learning tasks, we show how expert demonstrations can be incorporated to learn loss functions that generalize to new tasks. In all settings, the additional information shapes the learned loss function such that the environment does not need to provide this information during meta-test time.
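A shaped task loss of this form can be sketched in a few lines; here `theta_star` stands for the extra, meta-train-only information, and the weights are arbitrary illustrative choices:

```python
import numpy as np

# Shaped task loss L_T = beta * MSE(y, yhat) + gamma * ||theta - theta_star||^2.
# theta_star (the optimal parameters) is extra information that the
# environment will NOT provide at meta-test time.
def shaped_task_loss(y, yhat, theta, theta_star, beta=1.0, gamma=0.5):
    mse = np.mean((yhat - y) ** 2)
    shaping = np.sum((theta - theta_star) ** 2)
    return beta * mse + gamma * shaping

loss = shaped_task_loss(
    y=np.zeros(2), yhat=np.ones(2),
    theta=np.array([1.0, 0.0]), theta_star=np.zeros(2))
```

Since the shaping term enters only the outer-loop objective, the meta-loss network absorbs its effect and needs no access to `theta_star` when it is later used to train optimizees.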

4 Experiments

(a) Meta-Train Tasks
(b) Meta-Test Tasks
(c) Meta-Train
(d) Meta-Test
Figure 2: Meta-learning for regression (top) and binary classification (bottom) tasks. (a) meta-train task, (b) meta-test tasks, (c) performance of the meta-network on the meta-train task as a function of (outer) meta-train iterations in blue, as compared to SGD using the task-loss directly in orange, (d) average performance of meta-loss on meta-test tasks as a function of the number of gradient update steps

In this section we evaluate the applicability and the benefits of the learned meta-loss from two different viewpoints. First, we study the benefits of using standard task losses, such as the mean-squared-error loss for regression, to train the meta-loss (Section 4.1). We analyze how a learned meta-loss compares to a standard task loss in terms of generalization and convergence speed. Second, we study the benefit of adding extra information at meta-train time to shape the loss landscape (Section 4.2).

4.1 Learning to mimic and improve over known task losses

First, we analyze how well our meta-learning framework can learn to mimic and improve over standard task losses for both supervised and reinforcement learning settings. For these experiments, the meta-network is parameterized by a neural network with two hidden layers of 40 neurons each.

Meta-Loss for Supervised Learning

In this set of experiments, we evaluate how well our meta-learning framework can learn loss functions for regression and classification tasks. In particular, we perform experiments on sine function regression and binary classification of digits (see details in Appendix D). At meta-train time, we randomly draw one task for meta-training (see Fig. 2(a)), and at meta-test time we randomly draw sets of test tasks for regression and for classification (Fig. 2(b)). For the sine regression, tasks are drawn according to the details in Appendix D, and we initialize our model to a simple feedforward NN with 2 hidden layers and 40 hidden units each; for the binary classification task, the model is initialized via the LeNet architecture (lenet). For both experiments we use a fixed learning rate for both the inner ($\alpha$) and outer ($\eta$) gradient update steps. We average results across multiple random seeds, where each seed controls the initialization of both the initial model and meta-network parameters, as well as the random choice of meta-train/test task(s), and visualize them in Fig. 2.

We compare the performance of SGD with the task loss $\mathcal{L}_T$ directly (in orange) to SGD using the learned meta-network $M_\phi$ (in blue), both with the same learning rate. In Fig. 2(c) we show the average performance of the meta-network as it is being learned, as a function of (outer) meta-train iterations, in blue. In both regression and classification tasks, the meta-loss eventually leads to better performance on the meta-train task than the task loss. In Fig. 2(d) we evaluate SGD using $\mathcal{L}_T$ vs. SGD using $M_\phi$ on previously unseen (and out-of-distribution) meta-test tasks as a function of the number of gradient steps. Even on these novel test tasks, our learned meta-loss leads to improved performance compared to the task loss.

Learning Reward functions for Model-based Reinforcement Learning

(a) train (blue), test (orange) tasks
(b) Meta vs Task Loss Pointmass
(c) Meta vs Task Loss Reacher
Figure 3: ML for MBRL: results are averaged across 10 runs. We can see in (a) that the ML loss generalizes well, the loss was trained on the blue trajectories and tested on the orange ones for the PointmassGoal task. ML loss also significantly speeds up learning when compared to the task loss at meta-test time on the PointmassGoal (b) and the ReacherGoal (c) environments.

In the MBRL example, the tasks consist of a free movement task of a point mass in a 2D space, an environment we call PointmassGoal, and a reaching task with a 2-link 2D manipulator, which we call the ReacherGoal environment (see Appendix B for details). The task distribution consists of different target positions that either the point mass or the arm should reach. During meta-train time, a model of the system dynamics, represented by a neural network, is learned from samples of the currently optimal policy. The task loss during meta-train time is the final distance from the goal when rolling out the policy in the learned dynamics model $\hat{\mathcal{P}}$. Taking its gradient requires differentiation through the learned model (see Algorithm 3, Appendix A). The input to the meta-network is the state-action trajectory of the current rollout and the desired target position. The meta-network outputs a loss signal together with the learning rate to optimize the policy. Fig. 3(a) shows the qualitative reaching performance of a policy optimized with the meta-loss during meta-test on PointmassGoal. The meta-loss network was trained only on tasks in the right quadrant (blue trajectories) and tested on tasks in the left quadrant (orange trajectories) of the plane, showing the generalization capability of the meta-loss. Fig. 3(b) and 3(c) show a comparison in terms of final distance to the target position at test time. The performance of policies trained with the meta-loss is compared to policies trained with the task loss, in this case the final distance to the target. The curves show results for 10 different goal positions (including goal positions where the meta-loss needs to generalize). When optimizing with the task loss, we use the dynamics model learned during meta-train time, as in this case differentiation through the model is required at test time. As mentioned in Section 3.2, this is not needed when using the meta-loss.

Learning Reward functions for Model-free Reinforcement Learning

In the following, we move to evaluating on model-free RL tasks. Fig. 4 shows results for two continuous control tasks based on OpenAI Gym MuJoCo environments (mujoco): ReacherGoal and AntGoal (see Appendix C for details).

(a) ReacherGoal
(b) AntGoal
(c) ReacherGoal
(d) AntGoal
Figure 4: ML for model-free RL: results are averaged across tasks. (a+b) Policy learning on new task with ML loss compared to PPO objective performance during meta-test time. The learned loss leads to faster learning at meta-test time. (c+d) Using the same ML loss, we can optimize policies of different architectures, showing that our learned loss maintains generality.

Fig. 4(a) and Fig. 4(b) show the meta-test time performance for the ReacherGoal and the AntGoal environments, respectively. We can see that the ML loss significantly improves optimization speed in both scenarios compared to PPO. In our experiments, we observed that on average ML requires 5 times fewer samples to reach 80% of task performance in terms of our metrics for the model-free tasks.

To test the capability of the meta-loss to generalize across different architectures, we first meta-train on an architecture with two layers and then meta-test the same meta-loss on architectures with a varying number of layers. Fig. 4 (c+d) show the meta-test time comparison for the ReacherGoal and the AntGoal environments in the model-free setting for four different model architectures. Each curve shows the average and standard deviation over ten different tasks in each environment. Our comparison clearly indicates that the meta-loss can be effectively re-used across multiple architectures, with mild variation in performance compared to the overall variance of the corresponding task optimization.

4.2 Shaping loss landscapes by adding extra information at meta-train time

This set of experiments shows that our meta-learner is able to learn loss functions that incorporate extra information available only during meta-train time. The learned loss will be shaped such that optimization is faster when using the meta-loss compared to using a standard loss.

Illustration: Shaping loss

We start by illustrating loss shaping on a sine frequency regression example in which we fit a single frequency parameter, chosen for simplicity of visualization.

(a) Sine: learned vs task loss
(b) Sine: meta-test time
(c) Reacher: inverse dynamics
(d) Sawyer: inverse dynamics
Figure 5: Meta-test time evaluation of the shaped meta-loss (ML), i.e. trained with shaping ground-truth (extra) information at meta-train time: a) Comparison of learned ML loss (top) and MSE loss (bottom) landscapes for fitting the frequency of a sine function. The red lines indicate the ground-truth values of the frequency. b) Comparing optimization performance of: ML loss trained with (green), and without (blue) ground-truth frequency values; MSE loss (orange). The ML loss learned with the ground-truth values outperforms both the non-shaped ML loss and the MSE loss. c-d) Comparing performance of inverse dynamics model learning for ReacherGoal (c) and Sawyer arm (d). ML loss trained with (green) and without (blue) ground-truth inertia matrix is compared to MSE loss (orange). The shaped ML loss outperforms the MSE loss in all cases.

For this illustration we generate training data $\{(x_i, y_i)\}$ by drawing samples from the ground-truth function $y = \sin(\omega_0 x)$ with frequency $\omega_0$. We create a model $\hat{y} = \sin(\omega x)$ and aim to optimize the parameter $\omega$, with the goal of recovering the value $\omega_0$. Fig. 5(a) (bottom) shows the loss landscape for optimizing $\omega$ when using the MSE loss. The target frequency $\omega_0$ is indicated by a vertical red line. As noted by sines, the landscape of this loss is highly non-convex and difficult to optimize with conventional gradient descent.
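This non-convexity is easy to reproduce numerically. The sketch below compares the MSE landscape over $\omega$ with the convex shaping loss $(\omega - \omega_0)^2$; the grid ranges and sample counts are illustrative choices, not the paper's exact setup.

```python
import numpy as np

# MSE landscape over the frequency of sin(omega * x) vs. the convex
# shaping loss (omega - omega0)^2 used at meta-train time.
omega0 = 5.0
x = np.linspace(-1.0, 1.0, 200)
y = np.sin(omega0 * x)

omegas = np.linspace(0.1, 12.0, 500)
mse = np.array([np.mean((np.sin(w * x) - y) ** 2) for w in omegas])
shaped = (omegas - omega0) ** 2

# A convex curve has non-negative second differences everywhere;
# the MSE landscape violates this (it has curved-down regions).
d2_mse = np.diff(mse, 2)
d2_shaped = np.diff(shaped, 2)
```

Both landscapes share the same global minimum at $\omega_0$, but only the shaped one is convex, which is exactly the property the learned loss inherits from the shaping term.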

Here, we show that by utilizing additional information about the ground-truth value of the frequency at meta-train time, we can learn a better-shaped loss. Specifically, during meta-train time, our task-specific loss is the squared distance to the ground-truth frequency, $\mathcal{L}_T = (\omega - \omega_0)^2$, which we call the shaping loss. The inputs of the meta-network are the training targets $y$ and the predicted function values $\hat{y}$, similar to the inputs of the mean-squared loss. After meta-training concludes, our learned loss function produces a convex loss landscape, as depicted in Fig. 5(a) (top).

To analyze how the shaping loss impacts model optimization at meta-test time, we compare three loss functions (Fig. 5(b)): 1) directly using the standard MSE loss (orange), 2) the ML loss trained via the MSE loss as task loss (blue), and 3) the ML loss trained via the shaping loss (green). The comparison makes evident that without shaping the loss landscape, the optimization is prone to getting stuck in a local optimum.

(a) Trajectory ML vs. iLQR
(b) MountainCar: meta-test time
(c) Train and test targets
(d) ML vs. Task loss at test
Figure 6: (a) MountainCar trajectory for policy optimized with iLQR compared to ML loss with extra information. (b) optimization performance during meta-test time for policies optimized with iLQR compared to ML with and without extra information. (c+d) ReacherGoal with expert demonstrations available during meta-train time. (c) shows the targets in end-effector space. The four blue dots show the training targets for which expert demonstrations are available, the orange dots show the meta-test targets. In (d) we show the reaching performance of a policy trained with the shaped ML loss at meta-test time, compared to the performance of training simply on the behavioral cloning objective and testing on test targets.

Shaping loss via physics prior for inverse dynamics learning

Next, we show the benefits of shaping our ML loss via ground-truth parameter information for a robotics application. Specifically, we aim to learn and shape a meta-loss that improves sample efficiency for learning (inverse) dynamics models, i.e. a mapping $(q, \dot{q}, \ddot{q}_{\text{des}}) \mapsto \tau$, where $q$, $\dot{q}$, $\ddot{q}_{\text{des}}$ are vectors of joint angular positions, velocities and desired accelerations, and $\tau$ is a vector of joint torques.

Rigid-body dynamics (RBD) provides an analytical solution for computing the (inverse) dynamics, which can generally be written as

$$\tau = M(q)\,\ddot{q} + F(q, \dot{q}),$$
where the inertia matrix $M(q)$ and the force term $F(q, \dot{q})$ are computed analytically (featherstone2014rigid). Learning an inverse dynamics model using neural networks can increase expressiveness compared to RBD, but requires many data samples that are expensive to collect. Here we follow the approach of (lutter2019deep) and attempt to learn the inverse dynamics via a neural network that predicts the inertia matrix $M(q)$. To improve sample efficiency, we apply our method by shaping the loss landscape during meta-train time using the ground-truth inertia matrix provided by a simulator; specifically, the task loss penalizes the distance between the predicted and ground-truth inertia matrices. During meta-test time, we use our trained meta-loss, shaped with the physics prior (the inertia matrix exposed by the simulator), to optimize the inverse dynamics neural network. In Fig. 5(c) we show the prediction performance of the inverse dynamics model during meta-test time on new trajectories of the ReacherGoal environment. We compare optimization during meta-test time using the meta-loss trained with the physics prior and the meta-loss trained without it (i.e. via the MSE loss) against optimization with the MSE loss directly. Fig. 5(d) shows a similar comparison for the Sawyer environment, a simulator of the 7-degrees-of-freedom Sawyer anthropomorphic robot arm (sawyer). Inverse dynamics learning using the meta-loss with the physics prior achieves the best prediction performance on both robots. ML without the physics prior performs worst on the ReacherGoal environment; in this case the task loss, formulated only in the action space, did not provide enough information to learn a loss that is useful for optimization. For the Sawyer, training with the MSE loss leads to slower optimization, though the asymptotic performance of MSE and ML is the same; only the shaped ML loss outperforms both.
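The structured inverse-dynamics idea can be sketched as follows; the Cholesky-style inertia parameterization and the placeholder force term are assumptions for illustration, not the paper's network.

```python
import numpy as np

# Structured inverse dynamics in the spirit of (lutter2019deep): a "network"
# predicts the inertia matrix M(q); torques follow tau = M(q) qdd + F(q, qd).
def predicted_inertia(q, L=np.array([[1.0, 0.0], [0.3, 0.8]])):
    # Cholesky-style parameterization L @ L.T guarantees that M is
    # symmetric positive definite, as the physics prior requires.
    return L @ L.T

def inverse_dynamics(q, qd, qdd):
    M = predicted_inertia(q)
    F = 0.1 * qd                   # placeholder Coriolis/gravity term
    return M @ qdd + F

q, qd, qdd = np.zeros(2), np.ones(2), np.array([1.0, -1.0])
tau = inverse_dynamics(q, qd, qdd)

# Meta-train-time shaping: penalize the distance between the predicted
# and the simulator's ground-truth inertia matrix (extra information).
M_gt = np.eye(2)
shaping_loss = np.sum((predicted_inertia(q) - M_gt) ** 2)
```

At meta-test time only the torque prediction error would be available; the shaping term on $M(q)$ exists solely in the meta-train objective.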

Shaping Loss via intermediate goal states for RL

We analyze loss landscape shaping on the MountainCar environment (Moore1990), a classical control problem in which an under-actuated car has to drive up a steep hill. The propulsion force generated by the car does not allow steady climbing of the hill, so greedy minimization of the distance to the goal often results in a failure to solve the task. The state space is two-dimensional, consisting of the position and velocity of the car; the action space consists of a one-dimensional torque. In our experiments, we provide intermediate goal positions during meta-train time, which are not available during meta-test time. The meta-network incorporates this behavior into its loss, leading to improved exploration during meta-test time, as can be seen in Fig. 6-a when compared to a classical iLQR-based trajectory optimization (Tassa2014). Fig. 6-b shows the average distance between the car and the goal at the last rollout time step over several iterations of policy updates with ML (with and without extra information) and iLQR. As we observe, ML with extra information can successfully bring the car to the goal in a small number of updates, whereas iLQR and ML without extra information are not able to solve this task.

Shaping loss via expert information during meta-train time

Expert information, such as demonstrations of a task, is another way of adding relevant information during meta-train time and thus shaping the loss landscape. In learning from demonstration (LfD) (pomerleau1991efficient; ng2000algorithms; BillardCDS08), expert demonstrations are used for initializing robotic policies. In our experiments, we aim to mimic the availability of an expert at meta-test time by training our meta-network to optimize a behavioral cloning objective at meta-train time. We provide the meta-network with expert state-action trajectories during meta-train time, which could be human demonstrations or, as in our experiments, trajectories optimized using iLQR. During meta-train time, the task loss is the behavioral cloning objective. Fig. 5(d) shows the results of our experiments in the ReacherGoal environment.
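As a concrete (assumed) form of this objective, a plain mean-squared behavioral cloning loss over expert state-action pairs can serve as the task loss; the exact formulation used in the experiments may differ.

```python
import torch

def behavioral_cloning_loss(policy_actions, expert_actions):
    """Task loss used at meta-train time: match the expert's actions on
    expert-visited states (a standard mean-squared BC objective)."""
    return ((policy_actions - expert_actions) ** 2).mean()

# Example: actions the policy proposes on expert states vs. expert actions.
pi_a = torch.zeros(4, 2)       # stand-in policy actions
exp_a = torch.ones(4, 2)       # stand-in expert actions
bc = behavioral_cloning_loss(pi_a, exp_a)
```

Because this objective is only needed at meta-train time, the expert is not required at meta-test time; its influence survives inside the learned loss.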

5 Conclusions

In this work we presented a framework to meta-learn a loss function entirely from data. We showed how the meta-learned loss can become well-conditioned and suitable for efficient optimization with gradient descent. When using the learned meta-loss we observe significant speed improvements in regression, classification and benchmark reinforcement learning tasks. Furthermore, we showed that by introducing additional guiding information at meta-train time we can train our meta-loss to develop exploratory strategies that significantly improve performance at meta-test time.

We believe that the ML framework is a powerful tool to incorporate prior experience and transfer learning strategies to new tasks. In future work, we plan to look at combining multiple learned meta-loss functions in order to generalize over different families of tasks. We also plan to further develop the idea of introducing additional curiosity rewards during training time to improve the exploration strategies learned by the meta-loss.


Appendix A MFRL and MBRL algorithms details

1:  randomly initialize the meta-loss parameters
2:  randomly initialize the dynamics model
3:  while not done do
4:      randomly initialize the policy parameters
5:      sample a training task
6:      roll out the policy under the learned dynamics model
7:      compute the learned loss on the model rollouts
8:      update the policy to maximize the reward under the learned dynamics model
9:      execute the updated policy in the environment and compute the task-loss
10:     update the meta-loss parameters using the task-loss gradient
11:     update the dynamics model with the newly collected samples
12:  end while
Algorithm 3 ML for MBRL (meta-train)

1:  set the number of inner steps N
2:  randomly initialize the meta-loss parameters
3:  while not done do
4:      randomly initialize the policy
5:      sample training tasks
6:      roll out the policy on the sampled tasks
7:      for each of the N inner steps do
8:          compute the learned loss on the collected rollouts
9:          update the policy by gradient descent on the learned loss
10:         compute the task-loss
11:     end for
12:     compute the gradient of the task-loss with respect to the meta-loss parameters
13:     update the meta-loss parameters
14:  end while
Algorithm 4 ML for MFRL (meta-train)

1:  randomly initialize the policy
2:  for each update step do
3:      roll out the policy and compute the learned loss
4:      update the policy by gradient descent on the learned loss
5:  end for
Algorithm 5 ML for RL (meta-test)

We notice that in practice, including the policy’s distribution parameters directly in the meta-loss inputs, e.g. mean and standard deviation of a Gaussian policy, works better than including the probability estimate , as it provides a direct way to update the distribution parameters using back-propagation through the meta-loss.

Appendix B Experiments: MBRL

The forward model of the dynamics is represented in both cases by a neural network; the input to the network is the current state and action, and the output is the next state of the environment.
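A minimal sketch of such a forward model and one training step follows; the dimensions match the PointmassGoal description below, while the random transition buffer is a stand-in assumption.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2  # e.g. PointmassGoal: (x, y, vx, vy), accel

# Forward dynamics model: (s_t, a_t) -> s_{t+1}
dyn_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                          nn.Linear(64, STATE_DIM))
opt = torch.optim.Adam(dyn_model.parameters(), lr=1e-3)

# Stand-in transition buffer (s, a, s') collected from the environment.
s = torch.randn(256, STATE_DIM)
a = torch.randn(256, ACTION_DIM)
s_next = torch.randn(256, STATE_DIM)

pred = dyn_model(torch.cat([s, a], dim=-1))
loss = nn.functional.mse_loss(pred, s_next)
opt.zero_grad()
loss.backward()
opt.step()
```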

The Pointmass state space is four-dimensional: for PointmassGoal the states are the 2D positions and velocities, and the actions are 2D accelerations.

The ReacherGoal environment for the MBRL experiments is a lower-dimensional variant of the MFRL environment. It has a four-dimensional state, consisting of the positions and angular velocities of the joints; the torque is two-dimensional. The dynamics model is updated once every 100 outer iterations with the samples collected by the policy from the last inner optimization step of that outer iteration, i.e., the latest policy.

Appendix C Experiments: MFRL

The ReacherGoal environment is a 2-link 2D manipulator that has to reach a specified goal location with its end-effector. The task distribution (at meta-train and meta-test time) consists of an initial link configuration and random goal locations within the reach of the manipulator. The performance metric for this environment is the mean trajectory sum of negative distances to the goal, averaged over 10 tasks. As the trajectory reward for the task-loss (see Eq. 7) we use the sum of negative distances of the end-effector to the goal, specified as a 2D Cartesian position. The environment has an eleven-dimensional observation specifying the angles of each link, the direction from the end-effector to the goal, the Cartesian coordinates of the target and the Cartesian velocities of the end-effector.
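The trajectory reward above reduces to summing negative Euclidean distances along the rollout. A small sketch (array shapes are assumptions):

```python
import numpy as np

def trajectory_reward(ee_positions, goal):
    """Sum of negative end-effector-to-goal distances over one rollout,
    i.e. the reward form described for ReacherGoal.

    ee_positions: (T, 2) end-effector positions along the trajectory
    goal:         (2,) Cartesian goal position
    """
    d = np.linalg.norm(ee_positions - goal, axis=-1)
    return -d.sum()

# Two-step toy trajectory: distances 0 and 5 give a reward of -5.
r = trajectory_reward(np.array([[0.0, 0.0], [3.0, 4.0]]),
                      np.array([0.0, 0.0]))  # -> -5.0
```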

The AntGoal environment requires a four-legged agent to run to a goal location. The task distribution consists of random goals initialized on a circle around the initial position. The performance metric for this environment is the mean trajectory sum of differences between the initial and the current distances to the goal, averaged over 10 tasks. Similar to the previous environment, the reward uses the distance from the center of the creature's torso to the goal, specified as a 2D Cartesian position. In contrast to ReacherGoal, this environment has a larger state space that describes the Cartesian position, velocity and orientation of the torso, as well as the angles and angular velocities of all eight joints. Note that in both environments, the meta-network receives the goal information as part of the state.

Appendix D Experiments: Regression and Classification Details

For the sine task at meta-train time, we draw data points from a sine function with fixed amplitude and phase ranges; at meta-test time we draw data points from sine functions with amplitudes and phases outside the meta-train ranges. We initialize our model as a simple feedforward NN with two hidden layers of 40 hidden units each; for the binary classification task, the model is initialized via the LeNet architecture. For both regression and classification experiments we use a fixed learning rate for both the inner and outer gradient update steps. We average results across random seeds, where each seed controls the initialization of both the initial model and the meta-network parameters, as well as the random choice of meta-train/test task(s), and visualize them in Fig. 2. The task losses are the respective standard objectives for regression and for classification meta-learning.
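Sampling one such sine regression task can be sketched as follows; the amplitude and phase ranges here are assumptions following the standard sine-fitting benchmark, not the paper's exact values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sine_task(amp_range=(0.1, 5.0), phase_range=(0.0, np.pi),
                     n_points=100):
    """Draw one regression task y = A * sin(x - phi).

    The ranges are illustrative defaults; meta-test tasks would be drawn
    from ranges outside those used at meta-train time.
    """
    A = rng.uniform(*amp_range)
    phi = rng.uniform(*phase_range)
    x = rng.uniform(-5.0, 5.0, size=(n_points, 1))
    return x, A * np.sin(x - phi)

x, y = sample_sine_task()
```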


  1. For simple gradient descent:
  2. Alternatively, this gradient computation can be performed using automatic differentiation.
  3. Our framework is implemented using the open-source libraries Higher (grefenstette2019generalized) for convenient second-order derivative computations and Hydra (hydra2020) for simplified handling of experiment configurations.
  4. In contrast to the original Ant environment, we remove external forces from the state.