# Zero-shot task adaptation by homoiconic meta-mapping

Andrew K. Lampinen
Department of Psychology
Stanford University
lampinen@stanford.edu
James L. McClelland
Department of Psychology
Stanford University
mcclelland@stanford.edu
###### Abstract

How can deep learning systems flexibly reuse their knowledge? Toward this goal, we propose a new class of challenges, and a class of architectures that can solve them. The challenges are meta-mappings, which involve systematically transforming task behaviors to adapt to new tasks zero-shot. The key to meeting these challenges is representing the task being performed in such a way that this task representation is itself transformable. We therefore draw inspiration from functional programming and recent work in meta-learning to propose a class of Homoiconic Meta-Mapping (HoMM) approaches that represent data points and tasks in a shared latent space, and learn to infer transformations of that space. HoMM approaches can be applied to any type of machine learning task, including supervised learning and reinforcement learning. We demonstrate the utility of this perspective by exhibiting zero-shot remapping of behavior to adapt to new tasks.

*ICLR preprint.*

## 1 Introduction

Humans are able to use and reuse knowledge more flexibly than most deep learning models can (Lake et al., 2017; Marcus, 2018). The problem of rapid learning has been partially addressed by meta-learning systems (Santoro et al., 2016; Finn et al., 2017, 2018; Stadie et al., 2018; Botvinick et al., 2019; see also section 7). However, humans can use their knowledge of a task to adapt flexibly when the task changes. In particular, they can often perform an altered task zero-shot, that is, without seeing any data at all. For example, once we learn to play a game, we can immediately switch to playing in order to lose, and can perform reasonably on our first attempt.

In this paper, we propose a new class of tasks based on this idea: meta-mappings, i.e. mappings between tasks (see below). As noted above, this type of transfer is easily accessible to humans (Lake et al., 2017), but is generally inaccessible to deep-learning models. To address this challenge, we propose architectures that take a functional perspective on meta-learning and exploit the idea of homoiconicity. (A homoiconic programming language is one in which programs in the language can be manipulated by programs in the language, just as data can.) By treating task behaviors as functions, we can regard both data and learned task behaviors as transformable. This yields the ability not only to learn to solve new tasks, but to learn how to transform these solutions in response to changing task demands. We demonstrate that our architectures can flexibly remap their behavior to address the meta-mapping challenge. By allowing the network to treat its task representations recursively as data points, and transform them to produce new task representations, our approach achieves this flexibility parsimoniously. We suggest that approaches like ours will be key to building more intelligent and flexible deep learning systems.

## 2 Meta-mapping

We propose the meta-mapping challenge. We define a meta-mapping as a task, or mapping, that takes a task as an input, output, or both. These include mapping from tasks to language (explaining), mapping from language to tasks (following instructions), and mapping from tasks to tasks (adapting behavior). While the first two categories have been partially addressed in prior work (e.g. Hermann et al., 2017; Co-Reyes et al., 2019), the latter is more novel. (We discuss the relationship between our work and prior work in section 7.) This adaptation can be cued in several ways, including examples of the mapping (after winning and losing at poker, try to lose at blackjack) or natural-language instructions (“try to lose at blackjack”).

## 3 Homoiconic meta-mapping (HoMM) architecture

To address these challenges, we propose HoMM architectures, composed of two components:

1. Input/output systems: domain-specific encoders and decoders (vision, language, etc.) that map into a shared embedding space Z.

2. A meta-learning system that a) learns to embed tasks into the shared embedding space Z, b) learns to use these task embeddings to perform task-appropriate behavior, c) learns to embed meta-mappings into the same space, and d) learns to use these meta-mapping embeddings to transform basic task embeddings in a meta-mapping-appropriate way.

These architectures are homoiconic because they use a completely shared representational space for individual data points, tasks, and meta-mappings. Why is this useful? The primary advantage is that it parsimoniously allows for arbitrary mappings between these entities. In addition to basic tasks, the system can learn to perform meta-mappings to follow instructions or change behavior. That is, it can transform task representations using the same components it uses to transform basic data points. (See also appendix E.1.)

Without training on meta-mappings, of course, the system will not be able to execute them well. However, as we will show, if it is trained on a broad enough set of such mappings, it will be able to generalize to new instances drawn from the same meta-mapping distribution. For instances that fall outside its data distribution, however, or for optimal performance, it may require some retraining. This reflects the structure of human behavior – we are able to adapt rapidly when new knowledge is relatively consistent with our prior knowledge, but learning an entirely new paradigm (such as calculus for a new student) can be quite slow (cf. Kumaran et al., 2016; Botvinick et al., 2019).

More formally, we treat functions and data as entities of the same type. From this perspective, the data points that one function receives can themselves be functions (indeed, any data point can be represented as a constant function that outputs that data point). The key insight is that our architecture can then transform data points (where “data” is a quite flexible term: the approach is agnostic to whether the learning is supervised or reinforcement learning, whether inputs are images or natural language, etc.) to perform basic tasks, as is standard in machine learning, but it can also transform these task functions to adapt to new tasks. This is related to the concepts of homoiconicity, defined above, and higher-order functions. Under this perspective, basic tasks and meta-mappings from task to task are really the same type of problem. The functions at one level of abstraction (the basic tasks) become inputs and outputs for higher-level functions at the next level of abstraction (meta-mappings between tasks).

Specifically, we embed each input, target, or mapping into a shared representational space Z. This means that single data points are embedded in the same space as the representation of a function or an entire dataset. Inputs are embedded into Z by a deep network I. Model outputs are decoded from Z by a deep network O. Target outputs are encoded into Z by a deep network T.

Given this, the task of mapping inputs to outputs can be framed as trying to find a transformation of the representational space that takes the (embedded) inputs from the training set to embeddings that will decode to the target outputs. These transformations are performed by a system with the following components (see fig. 1): M – the meta network, which collapses a dataset of (input embedding, target embedding) pairs into a single function embedding; H – the hyper network, which maps a function embedding to a set of parameters; and F – the transformation itself, implemented by a deep network parameterized by the output of H.

##### Basic meta-learning:

To perform a basic task, the input and target encoders (I and T) are used to embed individual pairs from an example dataset D1, forming a dataset of example (input, output) embedding tuples (fig. 1a). These examples are fed to M, which produces a function embedding (via a deep neural network, with several layers of parallel processing across examples, followed by an element-wise max across examples, and several more layers). This function embedding is mapped through the hyper network H to parameterize F, which is then used to process a dataset of embedded probe inputs; O maps the resultant embeddings to outputs. This system can be trained end-to-end on target outputs for the probes. Having two distinct datasets forces generalization at the meta-learning level; see appendix A.1. See appendix F.2 for the detailed architecture and hyper-parameters.

More explicitly, suppose we have a dataset D1 of example (input, target) pairs (x_0, y_0), (x_1, y_1), ..., and some input x from a probe dataset D2. The system would predict a corresponding output ŷ as:

 ŷ = O(F_{z_func}(I(x)))

where z_func is the meta-learner’s representation of the function underlying the examples in D1:

 F_{z_func} is parameterized by H(z_func), where z_func = M({(I(x_0), T(y_0)), ...})

Then, given some loss function L defined on a single target output y and an actual model output ŷ, we define our total loss, computed on the probe dataset D2, as:

 E_{(x,y)∈D2} [ L(y, O(F_{z_func}(I(x)))) ]

The system can then be trained end-to-end on this loss to adjust the weights of I, T, O, M, and H.
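This forward computation can be sketched in plain numpy. This is a minimal illustration of the computational structure only; the layer sizes, the single-linear-layer F, and the toy example task below are our own choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = 16  # dimensionality of the shared representational space (illustrative)

def mlp_params(sizes, rng):
    """Random weights for a small MLP with the given layer sizes."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0)  # ReLU on hidden layers
    return x

# Encoders/decoder: I embeds inputs, T embeds targets, O decodes outputs.
I_net = mlp_params([4, 32, Z], rng)
T_net = mlp_params([1, 32, Z], rng)
O_net = mlp_params([Z, 32, 1], rng)
# Meta network M: per-example processing, elementwise max, then more layers.
M_pre = mlp_params([2 * Z, 32, Z], rng)
M_post = mlp_params([Z, Z], rng)
# Hyper network H: maps a function embedding to the parameters of F
# (F is a single linear layer here, so H outputs Z*Z weights + Z biases).
H_net = mlp_params([Z, 32, Z * Z + Z], rng)

def embed_task(xs, ys):
    """z_func = M({(I(x_i), T(y_i))})."""
    pairs = np.concatenate([mlp(I_net, xs), mlp(T_net, ys)], axis=1)
    return mlp(M_post, mlp(M_pre, pairs).max(axis=0, keepdims=True))

def apply_task(z_func, x):
    """y_hat = O(F_{z_func}(I(x))), with F's parameters produced by H(z_func)."""
    theta = mlp(H_net, z_func)[0]
    W, b = theta[:Z * Z].reshape(Z, Z), theta[Z * Z:]
    return mlp(O_net, mlp(I_net, x[None]) @ W + b)

# An example dataset D1 for one toy "task": regress the product of the first two inputs.
xs = rng.normal(size=(10, 4))
ys = xs[:, :1] * xs[:, 1:2]
z_func = embed_task(xs, ys)                     # task embedding from examples
y_hat = apply_task(z_func, rng.normal(size=4))  # prediction for a probe input
print(z_func.shape, y_hat.shape)  # (1, 16) (1, 1)
```

Training would then backpropagate the probe loss through O, H, M, T, and I end-to-end, as described above.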

##### Meta-mapping:

For example, suppose we have an embedding z_game1 for the task of playing some game, and we want to switch to trying to lose this game. We can generate a meta-mapping embedding z_meta from example pairs of embeddings generated by the system when it is trying to win and lose various games: z_meta = M({(z_game2_win, z_game2_lose), ...}). We can then generate a new task embedding ẑ_{game1,lose}:

 ẑ_{game1,lose} = F_{z_meta}(z_game1), where F_{z_meta} is parameterized by H(z_meta)

This can be interpreted as the system’s guess at a losing strategy for game 1. To train a meta-mapping, we minimize the loss in the latent space between this guessed embedding and the embedding of the target task. (The gradients do not update the example function embeddings, only the weights of M and H, due to memory constraints; allowing such updates might be useful in more complex applications.) Whether or not we have such a target embedding, we can evaluate how well the system loses with this strategy, by stepping back down a level of abstraction and actually having it play the game via this embedding (fig. 1c). This is how we evaluate meta-mapping performance – by evaluating the loss of transformed task embeddings on the respective target tasks.
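The structural point here, that the same meta and hyper networks operate at both levels of abstraction, can be sketched with toy stand-ins. Everything below (the linear networks, the "negation" relation between winning and losing embeddings) is an illustrative placeholder, not the trained system.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = 16  # data, tasks, and meta-mappings all share this space

W_M = rng.normal(0, 0.1, (2 * Z, Z))  # toy stand-in for the meta network M
W_H = rng.normal(0, 0.1, (Z, Z * Z))  # toy stand-in for the hyper network H

def meta_embed(pairs):
    """M: process each (source, target) pair, then elementwise max over examples."""
    return np.maximum(pairs @ W_M, 0).max(axis=0)

def hyper(z):
    """H: map an embedding to the parameters of a linear transformation F."""
    return (z @ W_H).reshape(Z, Z)

# Example pairs (z_game_i_win, z_game_i_lose) for games the system knows.
win = rng.normal(size=(3, Z))
lose = -win  # toy "losing" relation: negation in embedding space
z_meta = meta_embed(np.concatenate([win, lose], axis=1))  # embed the meta-mapping itself

z_game1 = rng.normal(size=Z)                 # embedding of playing game 1
z_game1_lose_hat = z_game1 @ hyper(z_meta)   # guessed embedding for losing at game 1
print(z_game1_lose_hat.shape)  # (16,)
```

Note that `meta_embed` consumes pairs of task embeddings here exactly as it would consume pairs of data-point embeddings at the basic level; no new machinery is introduced for the meta level.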

Alternatively, we could map from language to a meta-mapping embedding, rather than inducing it from examples of the meta-mapping. This corresponds to the human ability to change behavior in response to instructions. The key feature of our architecture – the fact that tasks, data, and language are all embedded in a shared space – allows for substantial flexibility within a unified framework. Furthermore, our approach is parsimonious: because it uses the same meta-learner for both basic tasks and meta-mappings, this increased flexibility does not require any added parameters. (At least in principle; in practice, of course, increasing network size might be more beneficial for HoMM architectures performing meta-mappings as well as basic tasks, compared to those performing only basic tasks.)

## 4 Learning multivariate polynomials

As a proof of concept, we first evaluated the system on the task of learning low-degree polynomials in 4 variables (though the model was given no prior inductive bias toward polynomial forms). For a given polynomial, the model would see a few example (input, output) points, and be evaluated on its outputs for held-out points. This yields an infinite family of base-level tasks (the vector space of all such polynomials), as well as many families of meta-mappings over tasks (for example, multiplying polynomials by a constant, squaring them, or permuting their input variables). This allows us to examine not only the ability of the system to learn to learn polynomials from data, but also its ability to adapt its learned representations in accordance with these meta-mappings. Details of the architecture and training can be found in appendix F.
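A minimal sketch of such a task distribution and of one meta-mapping family follows; the degree-2 bound and standard-normal coefficients are our illustrative assumptions, not necessarily the paper's exact settings.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def random_polynomial(n_vars=4, rng=rng):
    """Coefficients for a random polynomial of degree <= 2 in n_vars variables."""
    # Monomials: the constant, each x_i, and each product x_i * x_j (i <= j).
    monomials = ([()] + [(i,) for i in range(n_vars)]
                 + list(itertools.combinations_with_replacement(range(n_vars), 2)))
    return monomials, rng.normal(size=len(monomials))

def evaluate(monomials, coeffs, x):
    return sum(c * np.prod([x[i] for i in m]) for m, c in zip(monomials, coeffs))

def permute_inputs(monomials, coeffs, perm):
    """One meta-mapping family: permute the polynomial's input variables."""
    return [tuple(sorted(perm[i] for i in m)) for m in monomials], coeffs

mons, cs = random_polynomial()
x = rng.normal(size=4)
perm = [1, 0, 2, 3]  # swap the first two inputs
pmons, pcs = permute_inputs(mons, cs, perm)
# The permuted polynomial evaluated at x equals the original at the permuted point:
print(evaluate(pmons, pcs, x), evaluate(mons, cs, x[np.array(perm)]))
```

Base-level example datasets are then just a few (x, evaluate(mons, cs, x)) pairs per sampled polynomial, and meta-mapping examples are pairs of (source, transformed) polynomials.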

##### Basic meta-learning:

First, we show that the system is able to achieve the basic goal of learning a held-out polynomial from a few data points in fig. 1(a) (with good sample-efficiency, see supp. fig. 7).

Furthermore, the system is able to perform meta-mappings over polynomials in order to flexibly reconfigure its behavior (fig. 2(a)). We train the system to perform a variety of mappings, for example switching the first two inputs of the polynomial, adding 3 to the polynomial, or squaring the polynomial. We then test its ability to generalize to held-out mappings from examples, for example a held-out input permutation, or an unseen additive shift. The system is able both to apply learned meta-mappings to held-out polynomials, and to apply held-out meta-mappings it has not been trained on, simply by seeing examples of the mapping.

## 5 A stochastic learning setting: simple card games

We next explored the setting of simple card games, where the agent is dealt a hand and must bet. There are three possible bets (including “don’t bet”), and depending on the opponent’s hand the agent either wins or loses the amount bet. This task doesn’t require long-term planning, but does incorporate some aspects of reinforcement learning, namely stochastic feedback on only the action chosen. We considered five games that are simplified analogs of various real card games (see Appendix F.1.2). We also considered several binary options that could be applied to the games, including trying to lose instead of trying to win, or switching which suit was more valuable. These are challenging manipulations; for instance, trying to lose requires completely inverting a learned Q-function.

In order to adapt the HoMM architecture, we made a very simple change. Instead of providing the system with (input, target) tuples to embed, we provided it with (state, action, reward) tuples, and trained it to predict rewards for each bet in each state. (A full RL framework is not strictly necessary here because there is no temporal aspect to the tasks; however, because the outcome is only observed for the action you take, it is not a standard supervised task.) The hand is explicitly provided to the network for each example, but which game is being played is implicitly captured in the training examples, without any explicit cues. That is, the system must learn to play directly from seeing a set of (state, action, reward) tuples which implicitly capture the structure and stochasticity of the game. We also trained the system to make meta-mappings, for example switching from trying to win a game to trying to lose. Details of the architecture and training can be found in appendix F.
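The shape of the data the system sees might be sketched as follows; the hand representation, payoff rule, and random behavior policy below are invented placeholders, not the paper's actual games.

```python
import numpy as np

rng = np.random.default_rng(0)
BETS = np.array([0, 1, 2])  # possible bet amounts, including "don't bet"

def deal():
    """A toy hand: (rank, suit) with rank in 0..3 and suit in 0..1."""
    return rng.integers(0, 4), rng.integers(0, 2)

def play(hand, bet, losers=False):
    """Stochastic reward: win or lose the amount bet, depending on the opponent's hand."""
    opp = deal()
    win = hand > opp  # compare hands lexicographically (toy rule)
    reward = bet if win else -bet
    return -reward if losers else reward  # the "losers" variant inverts the objective

def example_tuples(n, losers=False):
    """A dataset that implicitly identifies the game: no explicit task cue is given."""
    data = []
    for _ in range(n):
        hand = deal()
        bet = int(rng.choice(BETS))  # behavior policy: random bets
        data.append((hand, bet, play(hand, bet, losers)))
    return data

D1 = example_tuples(10)
print(len(D1), D1[0])
```

The system only ever sees such (state, action, reward) tuples; which game (and which variant, e.g. `losers=True`) generated them must be inferred from the tuples themselves.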

##### Basic meta-learning:

First, we show that the system is able to play a held-out game from examples in fig. 1(b). We compare two different hold-out sets: 1) train on half the tasks at random, or 2) specifically hold out all the “losers” variations of the “straight flush” game. In either case, the meta-learning system achieves performance well above chance (0) on the held-out tasks, although it is slightly worse at generalizing to the targeted hold-out, despite having more training tasks in that case. Note that the sample complexity in terms of number of training tasks is not that high: even training on 20 randomly selected tasks leads to good generalization to the held-out tasks. Furthermore, the task embeddings generated by the system are semantically organized; see appendix D.

Furthermore, the system is able to perform meta-mappings (mappings over tasks) in order to flexibly reconfigure its behavior. For example, if the system is trained to map games to their losers variations, it can generalize this mapping to a game it has not been trained to map, even if the source or target of that mapping is held out from training. In fig. 2(b) we demonstrate this by taking the mapped embedding and evaluating the reward received by playing the targeted game with it. This task is more difficult than simply learning to play a held out game from examples, because the system will actually receive no examples of the target game (when it is held out). Furthermore, in the case of the losers mapping, leaving the strategy unchanged would produce a large negative reward, and chance performance would produce 0 reward, so the results are quite good.

## 6 An extension via language

Language is fundamental to human flexibility. Often the examples of a meta-mapping are implicit in prior knowledge about the world that is cued by language. For example, “try to lose at go” does not give explicit examples of the “lose” meta-mapping, but rather relies on prior knowledge of what losing means. This is a much more efficient way to cue a known meta-mapping. To replicate this, we trained the HoMM system with both meta-mappings based on examples and meta-mappings based on language. In the language-based meta-mappings, a language input identifying the meta-mapping (but not the basic task to apply it to) is encoded by a language encoder, and then provided as the input to H (instead of an output from M). The meta-mapping then proceeds as normal: H parameterizes F, which is used to transform the embedding of the input task to produce an embedding for the target.
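The interchangeability of the two routes can be sketched abstractly: whether the meta-mapping embedding comes from the meta network or from a language encoder, the same hyper network consumes it. The vectors below are random placeholders for the real encoders' outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = 16  # shared embedding space

# One shared hyper network H (a toy linear stand-in), used for BOTH routes.
W_H = rng.normal(0, 0.1, (Z, Z * Z))
def H(z):
    return (z @ W_H).reshape(Z, Z)  # parameters of a linear transformation F

# Route 1: meta-mapping embedding induced from examples (placeholder for M's output).
z_from_examples = rng.normal(size=Z)
# Route 2: the same kind of embedding produced by a language encoder from "try to lose".
z_from_language = rng.normal(size=Z)

z_task = rng.normal(size=Z)  # embedding of the basic task to transform
z_new_ex = z_task @ H(z_from_examples)
z_new_lang = z_task @ H(z_from_language)
print(z_new_ex.shape, z_new_lang.shape)  # (16,) (16,)
```

Because both routes produce a vector in the same space Z, no architectural change is needed to switch between example-based and language-based cueing.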

This language-cued meta-mapping approach also yields good performance (fig. 4). However, examples of the meta-mapping are slightly better, especially for meta-mappings not seen during training, presumably because examples provide a richer description. In Appendix A.2 we show that using language to specify a meta-mapping performs better than using language to directly specify the target task, presumably because the transformed task embedding provides a richer representation of the target task.

## 7 Discussion

##### Related work:

Our work is an extrapolation from the rapidly-growing literature on meta-learning (e.g. Vinyals et al., 2016; Santoro et al., 2016; Finn et al., 2017, 2018; Stadie et al., 2018; Botvinick et al., 2019). It is also related to the literature on continual learning, or more generally tools for avoiding catastrophic interference based on changes to the architecture (e.g. Fernando et al., 2017; Rusu et al., 2016), loss (e.g. Kirkpatrick et al., 2016; Zenke et al., 2017; Aljundi et al., 2019), or external memory (e.g. Sprechmann et al., 2018). We also connect to a different perspective on continual learning in appendix B. Recent work has also begun to blur the separation between these approaches, for example by meta-learning in an online setting (Finn et al., 2019). Our work is specifically inspired by the algorithms that attempt to have the system learn to adapt to a new task via activations rather than weight updates, either from examples (e.g. Wang et al., 2016; Duan et al., 2016), or a task input (e.g. Borsa et al., 2019).

Our architecture builds directly on prior work on HyperNetworks (Ha et al., 2016) – networks which parameterize other networks – and other recent applications thereof, such as guessing parameters for a model to accelerate model search (e.g. Brock et al., 2018; Zhang et al., 2019), and meta-learning (e.g. Li et al., 2019; Rusu et al., 2019). Our work is also related to the longer history of work on different time-scales of weight adaptation (Hinton and Plaut, 1982; Kumaran et al., 2016) that has more recently been applied to meta-learning contexts (e.g. Ba et al., 2016; Munkhdalai and Yu, 2017; Garnelo et al., 2018) and continual learning (e.g. Hu et al., 2019). It is more abstractly related to work on learning to propose architectures (e.g. Zoph and Le, 2016; Cao et al., 2019), and to models that learn to select and compose skills to apply to new tasks (e.g. Andreas et al., 2016b, a; Tessler et al., 2016; Reed and de Freitas, 2015; Chang et al., 2019). In particular, some of the work in domains like visual question answering has explicitly explored the idea of building a classifier conditioned on a question (Andreas et al., 2016b, 2017), which is related to one of the possible computational paths through our architecture. Work in model-based reinforcement learning has also partly addressed how to transfer knowledge between different reward functions (e.g. Laroche and Barlier, 2017); our approach is more general. Indeed, our insights could be combined with model-based approaches: for example, our approach could be used to adapt a task embedding, which would then be used by a learned planning model.

There has also been other recent interest in task (or function) embeddings. Achille et al. (2019) recently proposed computing embeddings for visual tasks from the Fisher information of the parameters in a model partly tuned on the task. They show that this captures some interesting properties of the tasks, including some types of semantic relationships, and can help identify models that can perform well on a task. Rusu and colleagues recently suggested a similar meta-learning framework where latent codes are computed for a task which can be decoded to a distribution over parameters (Rusu et al., 2019). Other recent work has tried to learn representations for skills (e.g. Eysenbach et al., 2019) or tasks (e.g. Hsu et al., 2019) for exploration and representation learning. Our perspective can be seen as a generalization of these that allows for remapping of behavior via meta-mappings. To the best of our knowledge, none of this prior work has explored zero-shot performance of a task via meta-mappings.

##### Future Directions:

We think that the general perspective of considering meta-mappings will yield many fruitful future directions. We hope that our work will inspire more exploration of behavioral adaptation, in areas beyond the simple domains we considered here. To this end, we suggest the creation of meta-learning datasets which include information not only about tasks, but about the relationships between them. For example, reinforcement learning tasks which involve executing instructions (e.g. Hermann et al., 2017; Co-Reyes et al., 2019) can be usefully interpreted from this perspective. Furthermore, we think our work provides a novel perspective on the types of flexibility that human intelligence exhibits, and thus hope that it may have implications for cognitive science.

We do not necessarily believe that the particular architecture we have suggested is the best architecture for addressing these problems, although it has a number of desirable characteristics. However, the modularization of the architecture makes it easy to modify. (We compare some variations in appendix E.) For example, although we only considered task networks that are feed-forward and of a fixed depth, this could be replaced with a recurrent architecture to allow more adaptive computation, or even a more complex architecture (e.g. Reed and de Freitas, 2015; Graves et al., 2016). Our work also opens the possibility of doing unsupervised learning over function representations for further learning, which relates to long-standing ideas in cognitive science about how humans represent knowledge (Clark and Karmiloff-Smith, 1993).

## 8 Conclusions

We see our proposal as a logical progression from the fundamental idea of meta-learning – that there is a continuum between data and tasks. This naturally leads to the idea of manipulating task representations just like we manipulate data. We’ve shown that this approach yields considerable flexibility, most importantly the meta-mapping ability to adapt zero-shot to a new task. We hope that these results will lead to the development of more powerful and flexible deep-learning models.

#### Acknowledgements

We would like to acknowledge Noah Goodman, Surya Ganguli, Katherine Hermann, Erin Bennett, and Arianna Yuan for stimulating questions and suggestions on this project.

## References

• Achille et al. (2019) Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C., Soatto, S., and Perona, P. (2019). Task2Vec: Task Embedding for Meta-Learning. arXiv preprint.
• Aljundi et al. (2019) Aljundi, R., Rohrbach, M., and Tuytelaars, T. (2019). Selfless sequential learning. In International Conference on Learning Representations, pages 1–17.
• Andreas et al. (2016a) Andreas, J., Klein, D., and Levine, S. (2016a). Modular Multitask Reinforcement Learning with Policy Sketches. arXiv preprint.
• Andreas et al. (2016b) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016b). Learning to Compose Neural Networks for Question Answering. arXiv preprint.
• Andreas et al. (2017) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2017). Deep Compositional Question Answering with Neural Module Networks.
• Ba et al. (2016) Ba, J., Hinton, G., Mnih, V., Leibo, J. Z., and Ionescu, C. (2016). Using Fast Weights to Attend to the Recent Past. In Advances in Neural Information Processing Systems, pages 1–10.
• Baars (2005) Baars, B. J. (2005). Global workspace theory of consciousness: Toward a cognitive neuroscience of human experience. Progress in Brain Research, 150:45–53.
• Borsa et al. (2019) Borsa, D., Quan, J., Mankowitz, D., Hasselt, H. V., Silver, D., and Schaul, T. (2019). Universal Successor Features Approximators. In International Conference on Learning Representations, number 2017, pages 1–24.
• Botvinick et al. (2019) Botvinick, M., Ritter, S., Wang, J. X., Kurth-nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement Learning , Fast and Slow. Trends in Cognitive Sciences, pages 1–15.
• Brock et al. (2018) Brock, A., Lim, T., Ritchie, J. M., and Weston, N. (2018). SMASH: One-Shot Model Architecture Search through HyperNetworks. In International Conference on Learning Representations.
• Cao et al. (2019) Cao, S., Wang, X., and Kitani, K. M. (2019). Learnable Embedding Space for Efficient Neural Architecture Compression. In International Conference on Learning Representations, pages 1–17.
• Chang et al. (2019) Chang, M. B., Gupta, A., Levine, S., and Griffiths, T. L. (2019). Automatically Composing Representation Transformations as a Means for Generalization. In International Conference on Learning Representations, pages 1–23.
• Clark and Karmiloff-Smith (1993) Clark, A. and Karmiloff-Smith, A. (1993). The Cognizer’s Innards: A Psychological and Philosophical Perspective on the Development of Thought. Mind & Language, 8(4):487–519.
• Co-Reyes et al. (2019) Co-Reyes, J. D., Gupta, A., Suvansh, S., Altieri, N., Andreas, J., DeNero, J., Abbeel, P., and Levine, S. (2019). Guiding policies with language via meta-learning. In International Conference on Learning Representations, pages 1–17.
• Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. (2016). RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning. arXiv preprint, pages 1–14.
• Eysenbach et al. (2019) Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. (2019). Diversity is all you need: learning skills without a reward function. In International Conference on Learning Representations, pages 1–22.
• Fernando et al. (2017) Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. (2017). PathNet: Evolution Channels Gradient Descent in Super Neural Networks. arXiv.
• Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th Annual Conference on Machine Learning.
• Finn et al. (2019) Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. (2019). Online Meta-Learning. arXiv preprint.
• Finn et al. (2018) Finn, C., Xu, K., and Levine, S. (2018). Probabilistic Model-Agnostic Meta-Learning. arXiv preprint.
• Garnelo et al. (2018) Garnelo, M., Rosenbaum, D., Maddison, C. J., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. M. A. (2018). Conditional Neural Processes. arXiv preprint.
• Graves et al. (2016) Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Gómez Colmenarejo, S., Grefenstette, E., Ramalho, T., Agapiou, J., Badia, A. P., Moritz Hermann, K., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.
• Ha et al. (2016) Ha, D., Dai, A., and Le, Q. V. (2016). HyperNetworks. arXiv.
• Hermann et al. (2017) Hermann, K. M., Hill, F., Green, S., Wang, F., Faulkner, R., Soyer, H., Szepesvari, D., Czarnecki, W. M., Jaderberg, M., Teplyashin, D., Wainwright, M., Apps, C., and Hassabis, D. (2017). Grounded Language Learning in a Simulated 3D World. arXiv preprint, pages 1–22.
• Hinton and Plaut (1982) Hinton, G. E. and Plaut, D. C. (1982). Using Fast Weights to Deblur Old Memories. Proceedings of the 9th Annual Conference of the Cognitive Science Society, (1987).
• Hlavac (2018) Hlavac, M. (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
• Hsu et al. (2019) Hsu, K., Levine, S., and Finn, C. (2019). Unsupervised Learning Via Meta-Learning. In International Conference on Learning Representations.
• Hu et al. (2019) Hu, W., Lin, Z., Liu, B., Tao, C., Tao, Z., Zhao, D., and Yan, R. (2019). Overcoming catastrophic forgetting for continual learning via model adaptation. In International Conference on Learning Representations, pages 1–13.
• Johnson et al. (2016) Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. (2016). Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv, pages 1–16.
• Kirkpatrick et al. (2016) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2016). Overcoming catastrophic forgetting in neural networks. arXiv preprint.
• Kumaran et al. (2016) Kumaran, D., Hassabis, D., and McClelland, J. L. (2016). What Learning Systems do Intelligent Agents Need? Complementary Learning Systems Theory Updated. Trends in Cognitive Sciences, 20(7):512–534.
• Lake et al. (2017) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building Machines that learn and think like people. Behavioral and Brain Sciences, pages 1–55.
• Lampinen and Ganguli (2019) Lampinen, A. K. and Ganguli, S. (2019). An analytic theory of generalization dynamics and transfer learning in deep linear networks. In ICLR, pages 1–20.
• Lampinen and McClelland (2018) Lampinen, A. K. and McClelland, J. L. (2018). One-shot and few-shot learning of word embeddings. arXiv preprint.
• Laroche and Barlier (2017) Laroche, R. and Barlier, M. (2017). Transfer Reinforcement Learning with Shared Dynamics. In Proceedings of the Thirty First AAAI Conference on Artificial Intelligence, pages 2147–2153.
• van der Maaten and Hinton (2008) van der Maaten, L. and Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
• Li et al. (2019) Li, H., Dong, W., Mei, X., Ma, C., Huang, F., and Hu, B.-G. (2019). LGM-Net: Learning to Generate Matching Networks for Few-Shot Learning. Proceedings of the 36th International Conference on Machine Learning.
• Marcus (2018) Marcus, G. (2018). Deep Learning: A Critical Appraisal. arXiv preprint, pages 1–27.
• McCloskey and Cohen (1989) McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation, 24.
• Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
• Munkhdalai and Yu (2017) Munkhdalai, T. and Yu, H. (2017). Meta Networks. arXiv preprint.
• Reed and de Freitas (2015) Reed, S. and de Freitas, N. (2015). Neural Programmer-Interpreters. arXiv preprint, pages 1–12.
• Rumelhart and Todd (1993) Rumelhart, D. E. and Todd, P. M. (1993). Learning and connectionist representations. Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience, pages 3–30.
• Rusu et al. (2016) Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016). Progressive neural networks. arXiv preprint.
• Rusu et al. (2019) Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. (2019). Meta-Learning with Latent Embedding Optimization. International Conference on Learning Representations, pages 1–17.
• Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, volume 48.
• Sprechmann et al. (2018) Sprechmann, P., Jayakumar, S. M., Rae, J. W., Pritzel, A., Uria, B., Vinyals, O., Hassabis, D., Pascanu, R., and Blundell, C. (2018). Memory-based parameter Adaptation. In International Conference on Learning Representations.
• Stadie et al. (2018) Stadie, B. C., Yang, G., Houthooft, R., Chen, X., Duan, Y., Wu, Y., Abbeel, P., and Sutskever, I. (2018). Some Considerations on Learning to Explore via Meta-Reinforcement Learning. arXiv preprint.
• Tessler et al. (2016) Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. (2016). A Deep Hierarchical Approach to Lifelong Learning in Minecraft. arXiv preprint.
• Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching Networks for One Shot Learning. Advances in Neural Information Processing Systems.
• Wang et al. (2016) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint, pages 1–17.
• Zenke et al. (2017) Zenke, F., Poole, B., and Ganguli, S. (2017). Continual Learning Through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning.
• Zhang et al. (2019) Zhang, C., Ren, M., and Urtasun, R. (2019). Graph HyperNetworks for neural architecture search. In International Conference on Learning Representations, pages 1–17.
• Zoph and Le (2016) Zoph, B. and Le, Q. V. (2016). Neural Architecture Search with Reinforcement Learning. arXiv preprint, pages 1–16.

The supplemental material is organized as follows: In section A we clarify some definitional details, and discuss the value of meta-mappings in detail by comparing to other methods of performing a new task. In section B we describe a continual-learning-like perspective based on our approach. In section C we provide supplemental figures. In section D we show t-SNE results for the cards domain. In section E we provide some lesion studies. In section F we list details of the datasets and architectures we used, and provide links to the source code for all models, experiments, and analyses. In section G we provide means and bootstrap CIs corresponding to the major figures in the paper.

## Appendix A Clarifying meta-mapping

### a.1 Clarifying hold-outs

There are several distinct types of hold-outs in the basic training of our architecture:

1. On each basic task, some of the data ($D_1$) is fed to the meta-network while some ($D_2$) is held out. This encourages the model to actually infer the underlying function, rather than just memorizing the examples.

2. There are also truly held-out tasks that the system has never seen in training. These are the held-out tasks that we evaluate on at the end of training and that are plotted in the “Held out” sections in the main plots.

This applies analogously to the meta-mappings: each time a meta-mapping is trained, some basic tasks are used as examples while others are held out to encourage generalization. There are also meta-mappings which have never been encountered during training; we evaluate on these at the end of training, and they are the meta-mappings plotted in the “held out” section of the relevant plots. We also evaluate the old (and new) meta-mappings on the new basic tasks that have never been trained.

### a.2 Why meta-map from tasks to tasks?

One alternative to meta-mapping would be to produce an embedding for a new task directly from a natural-language description of that task. To address this possibility, we trained a version of the model where we included training the language system to produce embeddings for the basic tasks (while simultaneously training the system on all the other objectives, such as performing the tasks from examples, in order to provide the strongest possible structuring of the system’s knowledge for the strongest possible comparison). We compare this model’s performance on held-out tasks to that of systems learning from examples of the new task directly, or from meta-mapping; see fig. 5.

These results demonstrate the advantage of meta-mapping. While learning from examples is still better given enough data, it requires potentially-expensive data collection and does not allow zero-shot adaptation. Performing the new task from a language description alone uses only the implicit knowledge in the model’s weights, and likely because of this it does not generalize well to the difficult held-out tasks. Meta-mapping performs substantially better, while relying only on cached prior knowledge, viz. prior task-embedding(s) and a description of the meta-mapping (either in the form of examples or natural language). That is, meta-mapping has the advantage of requiring no new data collection, like performing from language alone, but results in much better performance by leveraging a richer description of the new task constructed using the system’s knowledge of a prior task and the new task’s relationship to it.

### a.3 A definitional note

When we discussed meta-mappings in the main text, we equivocated between tasks and behaviors for the sake of brevity. For a perfect model, this is somewhat justifiable, because each task will have a corresponding optimal behavior, and the system’s embedding of the task will be precisely the embedding which produces this optimal behavior. However, behavior-irrelevant details of the task, like the color of the board, may not be embedded, so this should not really be thought of as a task-to-task mapping. This problem is exacerbated when the system is imperfect, e.g. during learning. It is thus more precise to distinguish between a ground-truth meta-mapping, which maps tasks to tasks, and the computational approach to achieving that meta-mapping, which really maps between representations that combine both task and behavior.

## Appendix B Continual learning

##### Continual learning:

Although the meta-learning approach is effective for rapidly adapting to a new task, it is unreasonable to think that our system must consider every example it has seen at each inference step. We would like to be able to store our knowledge more efficiently, and allow for further refinement. Furthermore, we would like the system to be able to adapt to new tasks (for which its guessed solution isn’t perfect) without catastrophically interfering with prior tasks (McCloskey and Cohen, 1989).

A very simple solution to these problems is naturally suggested by our architecture. Specifically, task embeddings can be cached so that they don’t have to be regenerated at each inference step. This also allows optimization of these embeddings without altering the other parameters in the architecture, thus allowing fine-tuning on a task without seeing more examples, and without interfering with performance on any other task (cf. Rumelhart and Todd, 1993; Lampinen and McClelland, 2018). This is like the procedure of Rusu et al. (2019), except considered across episodes. That is, we can see the meta-learning step as a “warm start” for an optimization procedure over embeddings that are cached in memory (cf. Kumaran et al., 2016). While this is not a traditional continual learning perspective, we think it provides an interesting perspective on the issue. It might in fact be much more memory-efficient to store an embedding per task, compared to storing an extra “importance” parameter for every parameter in our model, as in e.g. elastic weight consolidation (Kirkpatrick et al., 2016). It also provides a stronger guarantee of non-interference.

To test this idea, we pre-trained the system on 100 polynomial tasks, and then introduced 100 new tasks. We trained on these new tasks by starting from the meta-network’s “guess” at the correct task embedding, and then optimizing this embedding without altering the other parameters. The results are shown in fig. 6. The meta-network embeddings offer good immediate performance, and substantially accelerate the optimization process, compared to a randomly-initialized embedding (see supp. fig. 10 for a more direct comparison). Furthermore, this ability to learn is due to training, not simply the expressiveness of the architecture, as is shown by attempting the same with an untrained network.
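The embedding-only optimization described above can be sketched in miniature. This is a toy linear stand-in with our own variable names, not the paper's architecture (in particular, we collapse the hyper-network pipeline into a single frozen weight matrix `W`): only the cached task embedding `z` is updated, so no shared weights change and interference with other tasks is impossible by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n_out = 8, 4
W = rng.normal(size=(n_out, d_z))   # frozen network weights (never updated)
z = rng.normal(size=d_z) * 0.1      # warm start: the meta-network's guessed embedding
target = rng.normal(size=n_out)     # outputs required by the new task

def loss(z):
    return 0.5 * np.sum((W @ z - target) ** 2)

losses = [loss(z)]
for _ in range(500):
    # gradient step on the embedding only; W is untouched throughout
    z -= 0.01 * (W.T @ (W @ z - target))
    losses.append(loss(z))
```

Starting from the meta-network's guess rather than a random `z` is exactly the "warm start" framing in the text: the loop above only has to refine an already-reasonable embedding.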

## Appendix D Card game t-SNE

We performed t-SNE (van der Maaten and Hinton, 2008) on the task embeddings of the system at the end of learning the card game tasks, to evaluate the organization of knowledge in the network. In fig. 12 we show these embeddings for just the basic tasks. The embeddings show systematic grouping by game attributes. In fig. 13 we show the embeddings of the meta and basic tasks, showing the organization of the meta-tasks by type.

## Appendix E Architecture experiments

In this section we consider a few variations of the architecture, to justify the choices made in the paper.

### e.1 Shared Z vs. separate task-embedding and data-embedding space

Instead of having a shared latent space Z in which both data and tasks are embedded, why not have a separate embedding space for data, tasks, and so on? There are a few conceptual reasons why we chose a shared Z, including its greater parameter efficiency, the fact that humans seem to represent conscious knowledge of different kinds in a shared space (Baars, 2005), and the fact that this representation could allow for zero-shot adaptation to new computational pathways through the latent space, analogously to the zero-shot language translation results reported by Johnson and colleagues (Johnson et al., 2016). In this section, we further show that training with a separate task-encoding space worsens performance; see fig. 14. This seems to be primarily because learning in the shared Z accelerates and de-noises the learning process; see fig. 15. (It is therefore worth noting that running this model for longer could result in convergence to the same asymptotic generalization performance.)

### e.2 Hyper network vs. conditioned task network

Instead of having the task network F parameterized by the hyper network H, we could simply have a task network with learned weights which takes a task embedding as another input. Here, we show that this architecture fails to learn the meta-mapping tasks, although it can successfully perform the basic tasks. We suggest that this is because it is harder for this architecture to prevent interference between the comparatively larger number of basic tasks and the smaller number of meta-tasks. While it might be possible to succeed with this architecture, it was more difficult in the hyper-parameter space we searched.

## Appendix F Detailed methods

### f.1 Datasets

#### f.1.1 Polynomials

We randomly sampled the train and test polynomials as follows:

1. Sample the number of relevant variables uniformly at random, from 0 (i.e. a constant polynomial) to the total number of variables.

2. Sample the subset of variables that are relevant from all the variables.

3. For each term combining the relevant variables (including the intercept), include the term with probability 0.5; if included, give it a random coefficient.

The data points on which these polynomials were evaluated were sampled uniformly, independently for each variable and for each polynomial. The datasets were resampled every 50 epochs of training.
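The sampling steps above can be sketched as follows. The four-variable setting is taken from the four-digit permutations listed below; the coefficient distribution and the maximum term degree are not specified in this text, so the standard normal and degree cap of 2 here are our assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
NUM_VARS = 4     # matches the 4-digit input permutations below
MAX_DEGREE = 2   # assumption: monomials up to degree 2 over the relevant variables

def sample_polynomial():
    # step 1: number of relevant variables, from 0 (a constant) to NUM_VARS
    k = rng.integers(0, NUM_VARS + 1)
    # step 2: which variables are relevant
    relevant = sorted(rng.choice(NUM_VARS, size=k, replace=False))
    # step 3: include each term (incl. the intercept, the empty tuple) w.p. 0.5
    terms = [()] + [t for d in range(1, MAX_DEGREE + 1)
                    for t in itertools.combinations_with_replacement(relevant, d)]
    return {t: rng.normal() for t in terms if rng.random() < 0.5}

def evaluate(poly, x):
    # a polynomial is a dict mapping variable-index tuples to coefficients
    return sum(c * np.prod([x[i] for i in term]) for term, c in poly.items())
```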

Meta-tasks: The task-embedding classification meta-tasks were:

• Classifying polynomials as constant/non-constant.

• Classifying polynomials as zero/non-zero intercept.

• For each variable, identifying whether that variable was relevant to the polynomial.

Meta-mappings: We trained on 20 meta-mapping tasks, and held out 16 related meta-mappings:

• Squaring polynomials (where applicable).

• Adding a constant (trained: -3, -1, 1, 3, held-out: 2, -2).

• Multiplying by a constant (trained: -3, -1, 3, held-out: 2, -2).

• Permuting inputs (trained: 1320, 1302, 3201, 2103, 3102, 0132, 2031, 3210, 2301, 1203, 1023, 2310, held-out: 0312, 0213, 0321, 3012, 1230, 1032, 3021, 0231, 0123, 3120, 2130, 2013).
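At the level of ground-truth tasks, the "permute inputs" meta-mapping sends a function f to the function that applies f to reordered arguments. A minimal sketch, under one natural reading of the digit strings above (digit i names the old variable that moves into position i; the convention is our assumption):

```python
def permute_inputs(f, perm):
    """Return the task obtained by permuting f's input variables."""
    return lambda xs: f([xs[i] for i in perm])

# e.g. the permutation written "1032" above, applied to an example base task
f = lambda xs: xs[0] + 2 * xs[1] ** 2
g = permute_inputs(f, [1, 0, 3, 2])
```

The meta-mapping challenge is for the network to implement this transformation over learned task embeddings, rather than over symbolic formulas as here.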

Language: We encoded the meta-tasks as token sequences, as follows:

• Classifying polynomials as constant/non-constant: [‘‘is’’, ‘‘constant’’]

• Classifying polynomials as zero/non-zero intercept: [‘‘is’’, ‘‘intercept_nonzero’’]

• For each variable, identifying whether that variable was relevant to the polynomial: [‘‘is’’, <variable-name>, ‘‘relevant’’]

• Squaring polynomials: [‘‘square’’]

• Adding a constant: [‘‘add’’, <value>]

• Multiplying by a constant: [‘‘multiply’’, <value>]

• Permuting inputs:

[‘‘permute’’, <variable-name>, <variable-name>, <variable-name>,
<variable-name>]


All sequences were front-padded with “<PAD>” to the length of the longest sequence.
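The front-padding step can be sketched as (the helper name is our own):

```python
def encode(sequences):
    """Front-pad each token sequence with "<PAD>" to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [["<PAD>"] * (max_len - len(s)) + s for s in sequences]

batch = encode([["square"], ["add", "2"], ["is", "x0", "relevant"]])
```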

#### f.1.2 Card games

Our card games were played with two suits and four values per suit. In our setup, each hand in a game has a win probability (proportional to how it ranks against all other possible hands). The agent is dealt a hand, and then must choose to bet 0, 1, or 2 (the three actions available to it). We considered a variety of games which depend on different features of the hand:

• High card: Highest card wins.

• Pairs: Same as high card, except pairs are more valuable, and same-suit pairs are even more valuable.

• Straight flush: Adjacent values in the same suit are most valuable; e.g. a 4 and 3 in the most valuable suit wins every time (a “royal flush”).

• Match: The hand whose cards differ least in value (a suit difference counts as 0.5 points) wins.

• Blackjack: The hand’s value increases with the sum of the cards until it crosses 5, at which point the player “goes bust,” and the value becomes negative.
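The blackjack rule above can be sketched as follows; the text only says the value "becomes negative" past a sum of 5, so the specific bust penalty (the negated sum) is our assumption:

```python
def blackjack_value(hand):
    """Value of a hand (a tuple of card values) under the blackjack game."""
    total = sum(hand)
    # value increases with the sum until it crosses 5, then the player goes bust
    return total if total <= 5 else -total
```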

We also considered three binary attributes that could be altered to produce variants of these games:

• Losers: Try to lose instead of winning! Reverses the ranking of hands.

• Suits rule: Instead of suits being less important than values, they are more important (essentially flipping the role of suit and value in most games).

• Switch suit: Switches which of the suits is more valuable.

Any combination of these options can be applied to any of the 5 games, yielding 40 possible games. The systems were trained with the full 40 possible games, but after training we discovered that the “suits rule” option does not substantially alter the games we chose (in the sense that the probability of a hand winning in the two variants of a game is very highly correlated), so we have omitted it from our analyses.
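Each binary attribute acts as a transformation of a game's hand-evaluation function, which is what makes these toggles natural meta-mappings. A sketch for the "losers" attribute (function names are our own):

```python
def toggle_losers(value_fn):
    """Reverse the ranking of hands: try to lose instead of winning."""
    return lambda hand: -value_fn(hand)  # negating values reverses the ranking

high_card = lambda hand: max(hand)       # the base "high card" game
losers_high_card = toggle_losers(high_card)
```

Composing such toggles over the 5 base games yields the 40 game variants described above.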

Meta-tasks: For meta-tasks, we gave the network 8 task-embedding classification tasks (one-vs-all classification of each of the 5 game types, and of each of the 3 attributes), and 3 meta-mapping tasks (each of the 3 attributes).

Language: We encoded the meta-tasks in language by sequences of the form [‘‘toggle’’, <attribute-name>] for the meta-mapping tasks, and [‘‘is’’, <attribute-or-game-name>] for the classification tasks.

### f.2 Model & training

1. A training dataset of (input, target) pairs is embedded by the input encoder $I$ and target encoder $T$ to produce a set of paired embeddings. Another set of (possibly unlabeled) inputs is provided and embedded.

2. The meta network $M$ maps the set of embedded (input, target) pairs to a function embedding.

3. The hyper network $H$ maps the function embedding to parameters for the task network $F$, which is used to transform the second set of inputs to a set of output embeddings.

4. The output embeddings are decoded by the output decoder $O$ to produce a set of outputs.

5. The system is trained end-to-end to minimize the loss on these outputs.

The model is trained to minimize

$$\mathbb{E}_{(x,y)\in D_2}\left[\mathcal{L}\big(y,\, O(F_{D_1}(I(x)))\big)\right]$$

where $F_{D_1}$ is the transformation the meta-learner guesses for the training dataset $D_1$:

$$F_{D_1} \text{ is parameterized by } H\big(M(\{(I(x_i), T(y_i)) \text{ for } (x_i, y_i) \in D_1\})\big)$$
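The steps above can be sketched in miniature with linear stand-ins for each network. The names I, T, M, H, F, O follow the text, but the linear maps, dimensions, and mean-pooling are our own simplifications, not the paper's deep architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_z = 3, 2, 8

I = rng.normal(size=(d_z, d_in)) / d_in      # input encoder
T = rng.normal(size=(d_z, d_out)) / d_out    # target encoder
O = rng.normal(size=(d_out, d_z)) / d_z      # output decoder
H = rng.normal(size=(d_z * d_z, d_z)) / d_z  # hyper network: embedding -> F's weights

def forward(D1_x, D1_y, x_query):
    # steps 1-2: embed the (input, target) pairs and pool them (a stand-in for M)
    z_task = np.mean([I @ x + T @ y for x, y in zip(D1_x, D1_y)], axis=0)
    # step 3: the hyper network H emits the parameters of the task network F
    F = (H @ z_task).reshape(d_z, d_z)
    # steps 3-4: F transforms the query's embedding; O decodes it to an output
    return O @ (F @ (I @ x_query))

out = forward([np.ones(d_in)] * 4, [np.ones(d_out)] * 4, np.ones(d_in))
```

The essential structural point survives the simplification: the task is represented as a point `z_task` in the same space as data embeddings, and `F`'s weights are a function of that point.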

1. A meta-dataset of (source-task-embedding, target-task-embedding) pairs, $D_1$, is collected. Another dataset $D_2$ (possibly containing only source-task embeddings) is provided. (All embeddings included in $D_1$ and $D_2$ during training are for basic tasks that have themselves been trained, to ensure that there is useful signal. During evaluation, the embeddings in $D_1$ are for tasks that have been trained on, but those in $D_2$ may be new.)

2. The meta network $M$ maps this set of (source, target) task-embedding pairs to a function embedding.

3. The hyper network $H$ maps the function embedding to parameters for $F$, which is used to transform the second set of task embeddings to a set of output embeddings.

4. The system is trained to minimize loss between these mapped embeddings and the target embeddings.

The model is trained to minimize

$$\mathbb{E}_{(z_{source},\, z_{target})\in D_2}\left[\mathcal{L}\big(z_{target},\, F_{D_1}(z_{source})\big)\right]$$

where $\mathcal{L}$ is the loss, and $F_{D_1}$ is the transformation the meta-learner guesses for the training dataset $D_1$:

$$F_{D_1} \text{ is parameterized by } H\big(M(\{(z_{source}, z_{target}) \in D_1\})\big)$$
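The homoiconic move is that this is the same machinery as for basic tasks, applied when the "data points" are themselves task embeddings. A self-contained sketch (again with a linear H and mean-pooling as a stand-in for M, which are our simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)
d_z = 8
H = rng.normal(size=(d_z * d_z, d_z)) / d_z  # hyper network, shared with basic tasks

def meta_map(D1_pairs, z_source_new):
    # M: pool the (source, target) task-embedding pairs into a mapping embedding
    z_mm = np.mean([z_s + z_t for z_s, z_t in D1_pairs], axis=0)
    # H parameterizes F, exactly as it does for functions over data embeddings
    F = (H @ z_mm).reshape(d_z, d_z)
    # the guessed embedding for the transformed version of the new task
    return F @ z_source_new

pairs = [(rng.normal(size=d_z), rng.normal(size=d_z)) for _ in range(4)]
z_new = meta_map(pairs, rng.normal(size=d_z))
```

Because `z_new` lives in the same space as ordinary task embeddings, it can be fed straight back through the basic-task pipeline to produce zero-shot behavior on the transformed task.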

Note that there are several distinct kinds of hold-out in the training of this system; see section A.1.

Language-cued meta-tasks: The procedure is analogous to the example-based meta-tasks, except that the input to the hyper network $H$ is the embedding of the language input, rather than the output of $M$. The systems that were trained from language were also trained with the example-based meta-tasks.

#### f.2.1 Detailed hyper-parameters

See table 1 for a detailed architectural description and the hyperparameters for each experiment. Hyperparameters were generally found by a heuristic search, in which mostly the optimizer, learning-rate annealing schedule, and number of training epochs were varied, rather than the architectural parameters. Some of the parameters take the values they do for fairly arbitrary reasons; e.g., the continual-learning experiments were run with an earlier set of polynomial hyperparameters, before the hyperparameter search for the polynomial data was complete, so some parameters differ between these.

Each epoch consisted of a separate learning step on each task (both basic and meta), in a random order. On each task, the meta-learner received only a subset (the “batch size” above) of the examples to generate a function embedding, and had to generalize to the remainder of the examples in the dataset. The embeddings of the tasks for the meta-learner were computed once per epoch, so as the network learned over the course of the epoch these embeddings would get “stale,” but this did not seem to be too detrimental.

The results reported in the figures in this paper are averages across multiple runs, with different trained and held-out tasks (in the polynomial case) and different network initializations (in all cases), to ensure the robustness of the findings.

### f.3 Source repositories

The full code for the experiments and analyses can be found on GitHub.

## Appendix G Numerical results

In this section we provide the mean values and bootstrap confidence intervals corresponding to the major figures in the paper, as well as the baseline results in those figures. Tables were generated with stargazer (Hlavac, 2018).

### g.2 Cards
