Learned Optimizers that Scale and Generalize
Abstract
Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We introduce a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead. We achieve this by introducing a novel hierarchical RNN architecture, with minimal per-parameter overhead, augmented with additional architectural features that mirror the known structure of optimization tasks. We also develop a meta-training ensemble of small, diverse optimization tasks capturing common properties of loss landscapes. The optimizer learns to outperform RMSProp/ADAM on problems in this corpus. More importantly, it performs comparably or better when applied to small convolutional neural networks, despite seeing no neural networks in its meta-training set. Finally, it generalizes to train Inception V3 and ResNet V2 architectures on the ImageNet dataset for thousands of steps, optimization problems that are of a vastly different scale than those it was trained on. We release an open source implementation of the meta-training algorithm.
1 Introduction
Optimization is a bottleneck for almost all tasks in machine learning, as well as in many other fields, including engineering, design, operations research, and statistics. Advances in optimization therefore have broad impact. Historically, optimization has been performed using hand-designed algorithms. Recent results in machine learning show that, given sufficient data, well-trained neural networks often outperform hand-tuned approaches on supervised tasks. This raises the tantalizing possibility that neural networks may be able to outperform hand-designed optimizers.
Despite the promise of this approach, previous work on learned RNN optimizers for gradient descent has failed to produce neural network optimizers that generalize to new problems, or that continue to make progress on the problems for which they were meta-trained when run for large numbers of steps (see Figure 2). Current neural network optimizers are additionally too costly in both memory and computation to scale to larger problems.
We address both of these issues. Specifically, we improve upon existing learned optimizers by:

Developing a meta-training set that consists of an ensemble of small tasks with diverse loss landscapes.

Introducing a hierarchical RNN architecture with lower memory and compute overhead, and which is capable of capturing inter-parameter dependencies.

Incorporating features motivated by successful hand-designed optimizers into the RNN, so that it can build on existing techniques. These include dynamically adapted input and output scaling, momentum at multiple time scales, and a cross between Nesterov momentum and RNN attention mechanisms.

Improving the meta-optimization pipeline, for instance by introducing a meta-objective that better encourages exact convergence of the optimizer, and by drawing the number of optimization steps during training from a heavy-tailed distribution.
2 Related work
Learning to learn has a long history in psychology (Ward, 1937; Harlow, 1949; Kehoe, 1988; Lake et al., 2016). Inspired by it, machine learning researchers have proposed meta-learning techniques for optimizing the process of learning itself. Schmidhuber (1987), for example, considers networks that are able to modify their own weights. This leads to end-to-end differentiable systems which allow, in principle, for extremely general update strategies to be learned. There are many works related to this idea, including (Sutton, 1992; Naik & Mammone, 1992; Thrun & Pratt, 1998; Hochreiter et al., 2001; Santoro et al., 2016).
A series of papers from Bengio et al. (1990, 1992, 1995) presents methods for learning parameterized local neural network update rules that avoid backpropagation. Runarsson & Jonsson (2000) extend this to more complex update models. The result of meta-learning in these cases is an algorithm, i.e. a local update rule.
Andrychowicz et al. (2016) learn to learn by gradient descent by gradient descent. Rather than trying to distill a global objective into a local rule, their work focuses on learning how to integrate gradient observations over time in order to achieve fast learning of the model. The componentwise structure of the algorithm allows a single learned algorithm to be applied to new problems of different dimensionality. While Andrychowicz et al. (2016) consider the issue of transfer to different datasets and model structures, they focus on transferring to problems of the same class. In fact, they report negative results when transferring optimizers, meta-trained to optimize neural networks with logistic functions, to networks with ReLU functions.
Li & Malik (2017) proposed an approach similar to Andrychowicz et al. (2016), around the same time, but they rely on policy search to compute the meta-parameters of the optimizer. That is, they learn to learn by gradient descent by reinforcement learning.
Zoph & Le (2017) also meta-train a controller RNN, but this time to produce a string in a custom domain-specific language (DSL) for describing neural network architectures. An architecture matching the produced configuration (the “child” network) is instantiated and trained in the ordinary way. In this case the meta-learning happens only at the network architecture level.
Ravi & Larochelle (2017) modify the optimizer of Andrychowicz et al. (2016) for 1- and 5-shot learning tasks. They use test error to optimize the meta-learner. These tasks have the nice property that the recurrent neural networks only need to be unrolled for a small number of steps.
Wang et al. (2016) show that it is possible to learn to solve reinforcement learning tasks by reinforcement learning. They demonstrate their approach on several examples from the bandits and cognitive science literature. A related approach was proposed by Duan et al. (2016).
Finally, Chen et al. (2016) also learn reinforcement learning, but by supervised meta-training of the meta-learner. They apply their methods to black-box function optimization tasks, such as Gaussian process bandits, simple low-dimensional controllers, and hyperparameter tuning.
3 Architecture
At a high level, a hierarchical RNN is constructed to act as a learned optimizer, with its architecture matched to the parameters in the target problem. The hierarchical RNN’s parameters (called meta-parameters) are shared across all target problems, so despite having an architecture that adapts to the target problem, it can be applied to new problems. At each optimization step, the learned optimizer receives the gradients for every parameter along with some additional quantities derived from the gradients, and outputs an update to the parameters. Figure 1 gives an overview.
3.1 Hierarchical architecture
In order to effectively scale to large problems, the optimizer RNN must stay quite small while maintaining enough flexibility to capture inter-parameter dependencies that shape the geometry of the loss surface. Optimizers that account for this second-order information are often particularly effective (e.g. quasi-Newton approaches). We propose a novel hierarchical architecture to enable both low per-parameter computational cost, and aggregation of gradient information and coordination of update steps across parameters (Figure 1). At the lowest level of the hierarchy, we have a small Parameter RNN that receives direct per-parameter (scalar) gradient inputs. One level up, we have an intermediate Tensor RNN that incorporates information from a subset of the Parameter RNNs (where the subsets are problem specific). For example, consider a feedforward fully-connected neural network. There would be a Tensor RNN for each layer of the network, where each layer contains an ($n \times m$) weight matrix and therefore $nm$ Parameter RNNs.
At the highest level of the hierarchy is a Global RNN which receives output from every Tensor RNN. This allows the Parameter RNN to have very few hidden units with larger Tensor and Global RNNs keeping track of problem-level information. The Tensor and Global RNNs can also serve as communication channels between Parameter and Tensor RNNs respectively. The Tensor RNN outputs are fed as biases to the Parameter RNN, and the new parameter state is averaged and fed as input to the Tensor RNN. Similarly, the Global RNN state is fed as a bias to each Tensor RNN, and the output of the Tensor RNNs is averaged and fed as input to the Global RNN (Figure 1).
The architecture used in the experimental results has a Parameter RNN hidden state size of 10, and a Tensor and Global RNN state size of 20 (the architecture used by Andrychowicz et al. (2016) had a two layer RNN for each parameter, with 20 units per layer). These sizes showed the best generalization to ConvNets and other complex test problems. Experimentally, we found that we could make the Parameter RNN as small as 5, and the Tensor RNN as small as 10 and still see good performance on most problems. We also found that the performance decreased slightly even on simple test problems if we removed the Global RNN entirely. We used a GRU architecture (Cho et al., 2014) for all three of the RNN levels.
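The information flow between the three levels can be illustrated with a toy sketch. It is not the released implementation: every hidden state is a single scalar, and `toy_cell` is a hypothetical tanh cell standing in for a GRU. Only the wiring follows the description above: tensor state as a bias to its parameters, averaged parameter states as input to the tensor level, global state as a bias to every tensor, and averaged tensor states as input to the global level.

```python
import math


def toy_cell(state, x, bias):
    """Toy scalar recurrent cell; a stand-in for a GRU (assumption)."""
    return math.tanh(0.5 * state + 0.5 * x + bias)


def hierarchical_step(grads, param_h, tensor_h, global_h):
    """Advance all three levels by one step.

    grads, param_h: one list per tensor, one scalar per parameter.
    tensor_h: one scalar state per tensor. global_h: a single scalar state.
    """
    # Parameter level: own gradient as input, the tensor state as a bias.
    new_param_h = [
        [toy_cell(h, g, tensor_h[t]) for h, g in zip(param_h[t], grads[t])]
        for t in range(len(grads))
    ]
    # Tensor level: averaged new parameter states as input, global state as bias.
    new_tensor_h = [
        toy_cell(tensor_h[t], sum(new_param_h[t]) / len(new_param_h[t]), global_h)
        for t in range(len(grads))
    ]
    # Global level: averaged new tensor states as input.
    new_global_h = toy_cell(global_h, sum(new_tensor_h) / len(new_tensor_h), 0.0)
    return new_param_h, new_tensor_h, new_global_h


# Two tensors: one with two parameters, one with a single parameter.
grads = [[1.0, 1.0], [0.0]]
state = ([[0.0, 0.0], [0.0]], [0.0, 0.0], 0.0)
for _ in range(2):
    state = hierarchical_step(grads, *state)
param_h, tensor_h, global_h = state
```

In this toy run only the first tensor receives non-zero gradients, yet after two steps the second tensor's state is non-zero: the signal has travelled up to the global level and back down, which is the cross-parameter communication the hierarchy is meant to provide.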
3.2 Features inspired by optimization literature
The best performing neural networks often have knowledge about task structure baked into their design. Examples of this include convolutional models for image processing (Krizhevsky et al., 2012; He et al., 2016), causal models (RNNs) for modeling causal time series data, and the merging of neural value functions with Monte Carlo tree search in AlphaGo (Silver et al., 2016).
We similarly incorporate knowledge of effective strategies for optimization into our network architecture. We emphasize that these are not arbitrary design choices. The features below are motivated by results in optimization and recurrent network literature. They are also individually important to the ability of the learned optimizer to generalize to new problems, as is illustrated by the ablation study in Section 5.5 and Figure 6.
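Let $\ell(\theta)$ be the loss of the target problem, where $\theta = \{\theta_1, \ldots, \theta_{N_T}\}$ is the set of all parameter tensors $\theta_t$ (e.g. all weight matrices and bias vectors in a neural network). At each training iteration $n$, each parameter tensor $t$ is updated as $\theta_t^{n+1} = \theta_t^n + \Delta\theta_t^n$, where the update step $\Delta\theta_t^n$ is set by the learned optimizer (Equation 5 below).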
3.2.1 Attention and Nesterov Momentum
Nesterov momentum (Nesterov, 1983a) is a powerful optimization approach, where parameter updates are based not on the gradient evaluated at the current iterate $\theta^n$, but rather at a location $\phi^n$ which is extrapolated ahead of the current iterate. Similarly, attention mechanisms have proven extremely powerful in recurrent translation models (Bahdanau et al., 2015), decoupling the iteration of RNN dynamics from the observed portion of the input sequence. Motivated by these successes, we incorporate an attention mechanism that allows the optimizer to explore new regions of the loss surface by computing gradients away (or ahead) from the current parameter position. At each training step $n$ the attended location is set as $\phi_t^{n+1} = \theta_t^n + \Delta\phi_t^n$, where the offset $\Delta\phi_t^n$ is further described by Equation 6 below. Note that the attended location is an offset from the previous parameter location $\theta^n$ rather than the previous attended location $\phi^n$.
The gradient of the loss with respect to the attended parameter values will provide the only input to the learned optimizer, though it will be further transformed before being passed to the hierarchical RNN. For every parameter tensor $t$, $g_t^n = \partial \ell / \partial \phi_t^n$.
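As a rough illustration of this bookkeeping, the sketch below tracks a parameter location and an attended location on a one-dimensional toy loss. The function names, the quadratic loss, and the fixed `lookahead_rule` are all hypothetical; in the paper the two offsets are produced by the learned optimizer rather than by a hand-written rule.

```python
def run_attended_descent(theta0, num_steps, grad_fn, propose_offsets):
    """Iterate parameter and attended locations.

    grad_fn: gradient of the loss at a point.
    propose_offsets: maps a gradient to (delta_theta, delta_phi); a
        hand-written stand-in for the learned optimizer's outputs.
    """
    theta = theta0
    phi = theta0  # assumption: the first gradient is taken at the initial point
    for _ in range(num_steps):
        g = grad_fn(phi)  # the only gradient the optimizer sees
        delta_theta, delta_phi = propose_offsets(g)
        # Both new locations are offsets from the *previous* parameter value.
        theta, phi = theta + delta_theta, theta + delta_phi
    return theta, phi


def quadratic_grad(x):
    """Gradient of the toy loss 0.5 * x**2."""
    return x


def lookahead_rule(g, lr=0.1, lookahead=1.5):
    """Hypothetical fixed rule: attend slightly further along the step."""
    return -lr * g, -lookahead * lr * g


theta, phi = run_attended_descent(1.0, 50, quadratic_grad, lookahead_rule)
```

Note that the gradient is never evaluated at `theta` itself after the first step; the parameters move using information gathered at the attended point.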
3.2.2 Momentum on multiple timescales
Momentum with an exponential moving average is typically motivated in terms of averaging away minibatch noise or high frequency oscillations, and is often a very effective feature (Nesterov, 1983b; Tseng, 1998). We provide the learned optimizer with exponential moving averages $\bar{g}_{ts}$ of the gradients on several timescales, where $s$ indexes the timescale of the average. The update equation for the moving average is
(1) $\bar{g}_{ts}^n = \bar{g}_{ts}^{n-1}\,\sigma(\beta_{gt}^n)^{1/2^s} + g_t^n\,\bigl(1 - \sigma(\beta_{gt}^n)^{1/2^s}\bigr)$,
where $\sigma$ indicates the sigmoid function, and where the momentum logit $\beta_{gt}^n$ for the shortest timescale is output by the RNN, and the remaining timescales each increase by a factor of two from that baseline.
By comparing the moving averages at multiple timescales, the learned optimizer has access to information about how rapidly the gradient is changing with training time (a measure of loss surface curvature), and about the degree of noise in the gradient.
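A minimal sketch of gradient averaging at several timescales follows. It assumes that doubling the timescale is realized by raising the base decay to the power $1/2^s$ (an exponential average with decay $d$ has timescale $-1/\log d$); the helper names and the gradient sequence are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decay_rates(momentum_logit, num_scales):
    """Decay per timescale from a single logit.

    Assumption: the timescale doubles with each index s by raising the
    base decay to the power 1 / 2**s.
    """
    base = sigmoid(momentum_logit)
    return [base ** (1.0 / 2 ** s) for s in range(num_scales)]


def update_averages(averages, grad, decays):
    """One exponential-moving-average step at every timescale."""
    return [a * d + grad * (1.0 - d) for a, d in zip(averages, decays)]


decays = decay_rates(0.0, 3)          # base decay is sigmoid(0) = 0.5
averages = [0.0, 0.0, 0.0]
for grad in [1.0, 1.0, -1.0, 1.0]:    # made-up gradient sequence
    averages = update_averages(averages, grad, decays)
```

Comparing the entries of `averages` after a sign flip in the gradient shows the effect described above: the short-timescale average reacts quickly while the longer ones lag, so their disagreement carries information about curvature and noise.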
3.2.3 Dynamic input scaling
We would like our optimizer to be invariant to parameter scale. Additionally, RNNs are most easily trained when their inputs are well conditioned, and have a similar scale as their latent state. In order to aid each of these goals, we rescale the average gradients in a fashion similar to what is done in RMSProp (Tieleman & Hinton, 2012), ADAM (Kingma & Ba, 2015), and SMORMS3 (Funk, 2015),
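(2) $\bar{\lambda}_{ts}^n = \bar{\lambda}_{ts}^{n-1}\,\sigma(\beta_{\lambda t}^n)^{1/2^s} + (\bar{g}_{ts}^n)^2\,\bigl(1 - \sigma(\beta_{\lambda t}^n)^{1/2^s}\bigr)$,
(3) $m_{ts}^n = \bar{g}_{ts}^n \big/ \sqrt{\bar{\lambda}_{ts}^n}$,
where $\bar{\lambda}_{ts}^n$ is a running average of the squared averaged gradient, $m_{ts}^n$ is the scaled averaged gradient, and the momentum logit $\beta_{\lambda t}^n$ for the shortest timescale will be output by the RNN, similar to how the timescales for momentum are computed in the previous section.
It may be useful for the learned optimizer to have access to how gradient magnitudes are changing with training time. We therefore provide as further input a measure of relative gradient magnitudes at each averaging scale $s$. Specifically, we provide the relative log gradient magnitudes,
(4) $\gamma_{ts}^n = \log \bar{\lambda}_{ts}^n - \mathbb{E}_s\bigl[\log \bar{\lambda}_{ts}^n\bigr]$.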
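The rescaling and the relative log magnitudes can be sketched in a few lines. This is an illustration under stated assumptions, not the released code: the constant `EPS` and the function names are hypothetical, and the second moment is taken over the averaged gradient as the text describes.

```python
import math

EPS = 1e-16  # assumption: small constant to avoid division by zero and log(0)


def update_second_moment(lam, g_bar, decay):
    """Running average of the squared averaged gradient."""
    return lam * decay + g_bar ** 2 * (1.0 - decay)


def scaled_gradient(g_bar, lam):
    """Averaged gradient divided by its running root-mean-square."""
    return g_bar / math.sqrt(lam + EPS)


def relative_log_magnitudes(lams):
    """Log magnitude at each timescale minus the mean over timescales."""
    logs = [math.log(lam + EPS) for lam in lams]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]


# Rescaling the gradient by 1000 leaves the scaled input unchanged.
small = scaled_gradient(2.0, update_second_moment(0.0, 2.0, 0.9))
large = scaled_gradient(2000.0, update_second_moment(0.0, 2000.0, 0.9))
```

Here `small` and `large` agree up to rounding, which is the scale invariance this section asks for. The relative log magnitudes sum to zero across timescales by construction, so they convey only how magnitudes differ between timescales, not their absolute size.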
3.2.4 Decomposition of output into direction and step length
Another aspect of RMSProp and ADAM is that the learning rate corresponds directly to the characteristic step length. This is true because the gradient is scaled by a running estimate of its standard deviation, and after scaling has a characteristic magnitude of 1. The length of update steps therefore scales linearly with the learning rate, but is invariant to any scaling of the gradients.
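We enforce a similar decomposition of the parameter updates into update directions $d_\theta^n$ and $d_\phi^n$ for parameters and attended parameters, with corresponding step lengths $\exp(\eta_\theta^n)$ and $\exp(\eta_\phi^n)$,
(5) $\Delta\theta_t^n = \exp(\eta_\theta^n)\,\dfrac{d_\theta^n}{\|d_\theta^n\| / N_t}$,
(6) $\Delta\phi_t^n = \exp(\eta_\phi^n)\,\dfrac{d_\phi^n}{\|d_\phi^n\| / N_t}$,
where $N_t$ is the number of elements in the parameter tensor $\theta_t$. The directions $d_\theta^n$ and $d_\phi^n$ are read directly out of the RNN (though see Appendix B.1 for subtleties).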
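The decomposition can be sketched as follows: the raw direction is normalized by its norm divided by the number of elements, then multiplied by the exponentiated log step length. The choice of the Euclidean norm and the zero-direction guard are assumptions of this sketch, and `parameter_update` is a hypothetical name.

```python
import math


def parameter_update(direction, log_step_length):
    """Rescale a raw direction so its size is set only by the step length.

    Assumption: the norm is Euclidean, divided by the element count.
    """
    n = len(direction)
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return [0.0] * n  # no direction, no update
    scale = math.exp(log_step_length) / (norm / n)
    return [scale * d for d in direction]


update = parameter_update([3.0, 4.0], math.log(0.01))
same_update = parameter_update([300.0, 400.0], math.log(0.01))
```

Both calls give the same update up to rounding: multiplying the raw direction by 100 changes nothing, while changing the log step length rescales the update directly.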
Relative learning rate
We want the performance of the optimizer to be invariant to parameter scale. This requires that the optimizer judge the correct step length from the history of gradients, rather than memorizing the range of step lengths that were useful in its meta-training ensemble. The RNN therefore controls step length by outputting a multiplicative (additive after taking a logarithm) change $\Delta\eta_\theta^n$, rather than by outputting the step length directly,
(7) $\eta_\theta^n = \Delta\eta_\theta^n + \bar{\eta}_\theta^{n-1}$,
(8) $\bar{\eta}_\theta^n = \gamma\,\bar{\eta}_\theta^{n-1} + (1 - \gamma)\,\eta_\theta^n$,
where for stability reasons, the log step length $\eta_\theta^n$ is specified relative to an exponential running average $\bar{\eta}_\theta^n$ with meta-learned momentum $\gamma$. The attended parameter log step length $\eta_\phi^n$ is related to $\eta_\theta^n$ by a meta-learned constant offset $c$,
(9) $\eta_\phi^n = \eta_\theta^n + c$.
To further force the optimizer to dynamically adapt the learning rate rather than memorizing a learning rate trajectory, the learning rate is initialized from a log uniform distribution from $10^{-6}$ to $10^{-2}$. We emphasize that the RNN has no direct access to the learning rate, so it must adjust it based purely on its observations of the statistics of the gradients.
In order to aid in coordination across parameters, we do provide the RNN as an input the relative log learning rate of each parameter, compared to the remaining parameters, $\eta_{rel}^n = \eta_\theta^n - \mathbb{E}_{ti}\bigl[\eta_{\theta ti}^n\bigr]$.
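The step-length bookkeeping described in this subsection can be sketched as below. The function names are hypothetical, the values of `gamma` and `offset_c` are arbitrary stand-ins for meta-learned quantities, and the sampling bounds are placeholders chosen for the example rather than the values used in the paper.

```python
import math
import random


def step_length_update(delta_eta, eta_bar_prev, gamma, offset_c):
    """Return (eta_theta, eta_bar, eta_phi), all log step lengths."""
    eta_theta = delta_eta + eta_bar_prev  # change relative to the running average
    eta_bar = gamma * eta_bar_prev + (1.0 - gamma) * eta_theta
    eta_phi = eta_theta + offset_c        # attended step length
    return eta_theta, eta_bar, eta_phi


def sample_initial_log_step(rng, low, high):
    """Log-uniform initial step length between caller-chosen bounds."""
    return rng.uniform(math.log(low), math.log(high))


def relative_log_steps(etas):
    """Each log step length minus the mean over all parameters."""
    mean = sum(etas) / len(etas)
    return [eta - mean for eta in etas]


rng = random.Random(0)
eta_bar = sample_initial_log_step(rng, low=1e-4, high=1e-1)  # placeholder bounds
eta_theta, eta_bar, eta_phi = step_length_update(
    0.5, eta_bar, gamma=0.9, offset_c=0.1
)
```

Because the RNN output enters only as a change relative to the running average, the same output produces the same relative adjustment whatever the randomly drawn starting point was, which is what discourages memorizing an absolute learning rate.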
3.3 Optimizer inputs and outputs
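As described in the preceding sections, the full set of Parameter RNN inputs for each tensor $t$ are $x_t^n = \{m_t^n, \gamma_t^n, \eta_{rel}^n\}$, corresponding to the scaled averaged gradients, the relative log gradient magnitudes, and the relative log learning rate.
The full set of Parameter RNN outputs for each tensor $t$ are $y_t^n = \{d_\theta^n, d_\phi^n, \Delta\eta_\theta^n, \beta_g^n, \beta_\lambda^n\}$, corresponding to the parameter and attention update directions, the change in step length, and the momentum logits. Each of the outputs in $y_t^n$ is read out via a learned affine transformation of the Parameter RNN hidden state. The readout biases are clamped to 0 for $d_\theta^n$ and $d_\phi^n$. The RNN update equations are then:
(10) $h_{Param}^{n+1} = \mathrm{ParamRNN}(x^n, h_{Param}^n, h_{Tensor}^n)$
(11) $h_{Tensor}^{n+1} = \mathrm{TensorRNN}(x^n, h_{Param}^{n+1}, h_{Tensor}^n, h_{Global}^n)$
(12) $h_{Global}^{n+1} = \mathrm{GlobalRNN}(x^n, h_{Param}^{n+1}, h_{Tensor}^{n+1}, h_{Global}^n)$
(13) $y^n = W h_{Param}^n + b$
where $h^n$ is the hidden state for each level of the RNN, as described in Section 3.1, and $W$ and $b$ are learned weights of the affine transformation from the lowest level hidden state to output.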
3.4 Compute and memory cost
The computational cost of the learned optimizer is $O(N_P B + N_P K_P^2 + N_T K_T^2 + K_G^2)$, where $B$ is the minibatch size, $N_P$ is the total number of parameters, $N_T$ is the number of parameter tensors, and $K_P$, $K_T$, and $K_G$ are the latent sizes for Parameter, Tensor, and Global RNNs respectively. Typically, we are in the regime where $N_P K_P^2 \gg N_T K_T^2 \gg K_G^2$, in which case the computational cost simplifies to $O(N_P B + N_P K_P^2)$. Note that as the minibatch size is increased, the computational cost of the learned optimizer approaches that of vanilla SGD, as the cost of computing the gradient dominates the cost of computing the parameter update.
The memory cost of the learned optimizer is $O(N_P + N_P K_P + N_T K_T + K_G)$, which similarly to computational cost typically reduces to $O(N_P K_P)$. So long as the latent size of the Parameter RNN can be kept small, the memory overhead will also remain small.
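A small worked example makes the scaling argument concrete. It reads the cost as one gradient term proportional to parameters times minibatch size, plus one term per RNN instance proportional to the square of its latent size; that breakdown, the constant factors of 1, and the problem size are assumptions of this sketch. The latent sizes 10, 20, and 20 are the ones quoted in Section 3.1.

```python
def cost_terms(batch_size, num_params, num_tensors, k_param, k_tensor, k_global):
    """Order-of-magnitude operation counts per optimization step."""
    return {
        "gradient": num_params * batch_size,
        "parameter_rnn": num_params * k_param ** 2,
        "tensor_rnn": num_tensors * k_tensor ** 2,
        "global_rnn": k_global ** 2,
    }


def relative_overhead(terms):
    """Optimizer cost divided by the cost of computing the gradient."""
    rnn = terms["parameter_rnn"] + terms["tensor_rnn"] + terms["global_rnn"]
    return rnn / terms["gradient"]


# Hypothetical problem: one million parameters in ten tensors.
small_batch = cost_terms(64, 10**6, 10, k_param=10, k_tensor=20, k_global=20)
large_batch = cost_terms(1024, 10**6, 10, k_param=10, k_tensor=20, k_global=20)
```

In this example the Parameter RNN term is several orders of magnitude larger than the Tensor and Global RNN terms, and `relative_overhead` shrinks as the minibatch grows because only the gradient term depends on the minibatch size.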
We show experimental results for computation time in Section 5.6.
4 Meta-training
The RNN optimizer is meta-trained by a standard optimizer on an ensemble of target optimization tasks. We call this process meta-training, and the parameters of the RNN optimizer the meta-parameters.
4.1 Meta-training set
Previous learned optimizers have failed to generalize beyond the problem on which they were meta-trained. In order to address this, we meta-train the optimizer on an ensemble of small problems, which have been chosen to capture many commonly encountered properties of loss landscapes and stochastic gradients. By meta-training on small toy problems, we also avoid memory issues we would encounter by meta-training on very large, real-world problems.
Except where otherwise indicated, all target problems were designed to have a global minimum of zero (in some cases a constant offset was added to make the minimum zero). The code defining each of these problems is included in the open source release. See Appendix A.
4.1.1 Exemplar problems from literature
We included a set of 2-dimensional problems which have appeared in optimization literature (Surjanovic & Bingham, 2013) as toy examples of various loss landscape pathologies. These consisted of Rosenbrock, Ackley, Beale, Booth, Styblinski-Tang, Matyas, Branin, Michalewicz, and log-sum-exp functions.
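For readers unfamiliar with these test functions, three of them are written out below in their standard textbook form. The exact parameterization, scaling, and offsets used in the released problem set may differ; these definitions are only meant to show the kind of landscape involved.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Narrow curved valley; minimum 0 at (a, a**2)."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def booth(x, y):
    """Convex quadratic; minimum 0 at (1, 3)."""
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2


def matyas(x, y):
    """Shallow plate-shaped bowl; minimum 0 at (0, 0)."""
    return 0.26 * (x ** 2 + y ** 2) - 0.48 * x * y
```

All three already have a global minimum of zero, so they fit the convention stated above without an added offset.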
4.1.2 Well behaved problems
We included a number of well-behaved convex loss functions, consisting of quadratic bowls of varying dimension with randomly generated coupling matrices, and logistic regression on randomly generated, generally linearly separable data. For the logistic regression problem, when the data is not fully linearly separable, the global minimum is greater than 0.
4.1.3 Noisy gradients and minibatch problems
For problems with randomly generated data, such as logistic regression, we fed in minibatches of various sizes, from 10 to 200. We also used a minibatch quadratic task, where the minibatch loss consisted of the squared inner product of the parameters with random input vectors.
For full-batch problems, we sometimes added normally distributed noise with standard deviations from 0.1 to 2.0 in order to simulate noisy minibatch loss.
4.1.4 Slow convergence problems
We included several tasks where optimization could proceed only very slowly, despite the small problem size. This included a many-dimensional oscillating valley whose global minimum lies at infinity, and a problem with a loss consisting of very strong coupling terms between parameters in a sequence. We additionally included a task where the loss only depends on the minimum- and maximum-valued parameters, so that gradients are extremely sparse and the loss has discontinuous gradients.
4.1.5 Transformed problems
We also included a set of problems which transform the previously defined target problems in ways which map to common situations in optimization.
To simulate problems with sparse gradients, one transformation sets a large fraction of the gradient entries to 0 at each training step. To simulate problems with different scaling across parameters, we added a transformation which performs a linear change of variables so as to change the relative scale of parameters. To simulate problems with different steepness profiles over the course of learning, we added a transformation which applied monotonic transformations (such as raising to a power) to the final loss. Finally, to simulate complex tasks with diverse parts, we added a multi-task transformation, which summed the loss and concatenated the parameters from a diverse set of problems.
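The four transformations can be sketched as small wrappers around a loss function. This is an illustrative sketch rather than the released code: the function names, the `sum_of_squares` base problem, and the particular scales and powers are all made up for the example.

```python
import random


def sum_of_squares(params):
    """Toy base problem with minimum 0 at the origin."""
    return sum(p * p for p in params)


def sparsify_gradient(grad, zero_fraction, rng):
    """Zero each gradient entry independently with probability zero_fraction."""
    return [0.0 if rng.random() < zero_fraction else g for g in grad]


def rescale_parameters(loss_fn, scales):
    """Linear change of variables: the loss sees scales[i] * params[i]."""
    def transformed(params):
        return loss_fn([s * p for s, p in zip(scales, params)])
    return transformed


def power_transform(loss_fn, power):
    """Monotonic transform of a non-negative loss."""
    def transformed(params):
        return loss_fn(params) ** power
    return transformed


def multitask(loss_fns, sizes):
    """Sum of losses, each reading its own slice of a concatenated vector."""
    def transformed(params):
        total, start = 0.0, 0
        for loss_fn, size in zip(loss_fns, sizes):
            total += loss_fn(params[start:start + size])
            start += size
        return total
    return transformed


badly_scaled = rescale_parameters(sum_of_squares, [10.0, 0.1])
flattened = power_transform(sum_of_squares, 0.5)
combined = multitask([badly_scaled, flattened], [2, 2])
```

Because each wrapper returns another loss function, the transformations compose freely, which is one plausible way a small set of base problems can be expanded into a much more varied ensemble.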
4.2 Meta-objective
For the meta-training loss, used to train the meta-parameters of the optimizer, we used the average log loss across all training problems,
(14) $L(\psi) = \dfrac{1}{N} \sum_{n=1}^{N} \Bigl[\log\bigl(\ell(\theta^n(\psi)) + \epsilon\bigr) - \log\bigl(\ell(\theta^0) + \epsilon\bigr)\Bigr]$,
where the second term is a constant, and where $\psi$ is the full set of meta-parameters for the learned optimizer, consisting of $\psi = \{P_{Param}, P_{Tensor}, P_{Global}, \gamma, c\}$, where $P_{\cdot}$ indicates the GRU weights and biases for the Parameter, Tensor, or Global RNN, $\gamma$ is the learning rate momentum and $c$ is the attended step offset (Section 3.2.4).
Minimizing the average log function value, rather than the average function value, better encourages exact convergence to minima and precise dynamic adjustment of learning rate based on gradient history (Figure 6). The average logarithm also more closely resembles minimizing the final function value, while still providing a meta-learning signal at every training step, since very small values of $\ell(\theta)$ make an outsized contribution to the average after taking the logarithm.
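The difference between averaging the loss and averaging its logarithm can be seen on two made-up trajectories. In this sketch `meta_objective` is a hypothetical helper, and `eps` is an assumed small constant that keeps the logarithm finite when the loss reaches zero.

```python
import math


def meta_objective(losses, initial_loss, eps=1e-10):
    """Average log loss along a trajectory, relative to the initial loss.

    eps is an assumed small constant that keeps the logarithm finite at 0.
    """
    baseline = math.log(initial_loss + eps)
    return sum(math.log(loss + eps) - baseline for loss in losses) / len(losses)


# Two made-up trajectories with nearly the same average loss.
stalls = [0.1, 0.01, 0.01, 0.01]
converges = [0.1, 0.03, 1e-4, 1e-6]

mean_gap = abs(sum(stalls) / 4 - sum(converges) / 4)
log_gap = meta_objective(stalls, 1.0) - meta_objective(converges, 1.0)
```

The plain averages of the two trajectories are almost identical, so an average-loss objective would barely distinguish them, while the log objective clearly prefers the trajectory that keeps converging toward the minimum.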
4.3 Partial unrolling
Meta-learning gradients were computed via backpropagation through partial unrolling of optimization of the target problem, similarly to Andrychowicz et al. (2016). Note that Andrychowicz et al. (2016) dropped second derivative terms from their backpropagation, due to limitations of Torch. We compute the full gradient in TensorFlow, including second derivatives.
4.4 Heavy-tailed distribution over training steps
In order to encourage the learned optimizer to generalize to long training runs, both the number of partial unrollings and the number of optimization steps within each partial unroll were drawn from a heavy-tailed exponential distribution. The resulting distribution is shown in Appendix C.1.
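One way to picture this sampling scheme is the sketch below, which draws both the number of partial unrolls and the steps per unroll from a plain exponential distribution as a stand-in. The function name and the two mean values are hypothetical, and the distribution actually used in meta-training (Appendix C.1) may be parameterized differently.

```python
import random


def sample_unroll_schedule(rng, mean_unrolls, mean_steps_per_unroll):
    """Return a list with the number of optimization steps in each partial unroll."""
    num_unrolls = 1 + int(rng.expovariate(1.0 / mean_unrolls))
    return [
        1 + int(rng.expovariate(1.0 / mean_steps_per_unroll))
        for _ in range(num_unrolls)
    ]


rng = random.Random(0)
schedules = [sample_unroll_schedule(rng, 5.0, 20.0) for _ in range(1000)]
total_steps = [sum(schedule) for schedule in schedules]
```

Because the total run length is a sum over a random number of random-length unrolls, `total_steps` has a long right tail: most runs are short, but a few are several times longer than the average, so the optimizer is occasionally meta-trained on long trajectories.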
4.5 Meta-optimization
The optimizers were meta-trained for at least 40M meta-iterations (each meta-iteration consists of loading a random problem from the meta-training set, running the learned optimizer on that target problem, computing the meta-gradient, and then updating the meta-parameters). The meta-objective was minimized with asynchronous RMSProp across 1000 workers, with a learning rate of $10^{-6}$.
5 Experiments
5.1 Failures of existing learned optimizers
Previous learned optimizer architectures like Andrychowicz et al. (2016) perform well on the problems on which they are meta-trained. However, they do not generalize well to new architectures or scale well to longer timescales. Figure 2 shows the performance of an optimizer meta-trained on a 2-layer perceptron with sigmoid activations when applied to the same problem type with ReLU activations and to a new problem type (a 2-layer convolutional network). In both cases, the same dataset (MNIST) and minibatch size (64) were used. In contrast, our optimizer, which has not been meta-trained on this dataset or any neural network problems, shows performance comparable with ADAM and RMSProp, even for numbers of iterations not seen during meta-training (Section 4.4).
5.2 Performance on training set problems
The learned optimizer matches or outperforms ADAM and RMSProp on problem types from the meta-training set (Figure 3). The exact setup for each problem type can be seen in the Python code in the supplementary materials.
5.3 Generalization to new problem types
The meta-training problem set did not include any convolutional or fully-connected layers. Despite this, we see comparable performance to ADAM, RMSProp, and SGD with momentum on simple convolutional multilayer networks and multilayer fully connected networks both in terms of final loss and number of iterations to convergence (Figure 3(a) and Figure 2).
We also tested the learned optimizer on Inception V3 (Szegedy et al., 2016) and on ResNet V2 (He et al., 2016). Figure 3(b) shows the learned optimizer is able to stably train these networks for the first 10K to 20K steps, with performance similar to traditional optimizers tuned for the specific problem. Unfortunately, we find that later in training the learned optimizer stops making effective progress, and the loss approaches a constant (approximately 6.5 for Inception V3). Addressing this issue would be a goal of future work.
5.4 Performance is robust to choice of learning rate
One time-consuming aspect of training neural networks with current optimizers is choosing the right learning rate for the problem. While the learned optimizer is also sensitive to initial learning rate, it is much more robust. Figure 5 shows the learned optimizer’s training loss curve on a quadratic problem with different initial learning rates compared to those same learning rates on other optimizers.
5.5 Ablation experiments
The design choices described in Section 3 matter for the performance of the optimizer. We ran experiments in which we removed different features and re-meta-trained the optimizer from scratch. We kept the features which, on average, made performance better on a variety of test problems. Specifically, we kept all of the features described in 3.2 such as attention (3.2.1), momentum on multiple timescales (gradient scl) (3.2.2), dynamic input scaling (variable scl decay) (3.2.3), and a relative learning rate (relative lr) (3.2.4). We found it was important to take the logarithm of the meta-objective (log obj) as described in 4.2. In addition, we found it helpful to let the RNN learn its own initial weights (trainable weight init) and an accumulation decay for multiple gradient timescales (inp decay). Though all features had an effect, some features were more crucial than others in terms of consistently improved performance. Figure 6 shows one test problem (a 2-layer convolutional network) on which all final features of the learned optimizer matter.
5.6 Wall clock comparison
In experiments, for small minibatches, we significantly underperform ADAM and RMSProp in terms of wall clock time. However, consistent with the prediction in 3.4, since our overhead is constant with respect to minibatch size, we see that the relative overhead can be made small by increasing the minibatch size.
6 Conclusion
We have shown that RNN-based optimizers meta-trained on small problems can scale and generalize to early training on large problems like ResNet and Inception on the ImageNet dataset. To achieve these results, we introduced a novel hierarchical architecture that reduces memory overhead and allows communication across parameters, and augmented it with additional features shown to be useful in previous optimization and recurrent neural network literature. We also developed an ensemble of small optimization problems that capture common and diverse properties of loss landscapes. Although the wall clock time for optimizing new problems lags behind simpler optimizers, we see the difference decrease with increasing batch size. Having shown the ability of RNN-based optimizers to generalize to new problems, we look forward to future work on optimizing the optimizers.
References
 Andrychowicz et al. (2016) Andrychowicz, Marcin, Denil, Misha, Gomez, Sergio, Hoffman, Matthew W, Pfau, David, Schaul, Tom, Shillingford, Brendan, and de Freitas, Nando. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
 Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
 Bengio et al. (1995) Bengio, S., Bengio, Y., and Cloutier, J. On the search for new learning rules for ANNs. Neural Processing Letters, 2(4):26–30, 1995.
 Bengio et al. (1990) Bengio, Yoshua, Bengio, Samy, and Cloutier, Jocelyn. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
 Bengio et al. (1992) Bengio, Yoshua, Bengio, Samy, Cloutier, Jocelyn, and Gecsei, Jan. On the optimization of a synaptic learning rule. In Conference on Optimality in Biological and Artificial Networks, 1992.
 Chen et al. (2016) Chen, Yutian, Hoffman, Matthew W., Colmenarejo, Sergio Gomez, Denil, Misha, Lillicrap, Timothy P., and de Freitas, Nando. Learning to learn for global optimization of black box functions. arXiv Report 1611.03824, 2016.
 Cho et al. (2014) Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, and Bengio, Yoshua. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 Duan et al. (2016) Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter, Sutskever, Ilya, and Abbeel, Pieter. RL$^2$: Fast reinforcement learning via slow reinforcement learning. Technical report, UC Berkeley and OpenAI, 2016.
 Funk (2015) Funk, Simon. RMSprop loses to SMORMS3 - beware the epsilon!, 2015. URL sifter.org/~simon/journal/20150420.html.
 Harlow (1949) Harlow, Harry F. The formation of learning sets. Psychological review, 56(1):51, 1949.
 He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016.
 Hochreiter et al. (2001) Hochreiter, Sepp, Younger, A Steven, and Conwell, Peter R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001.
 Kehoe (1988) Kehoe, E James. A layered network model of associative learning: learning to learn and configuration. Psychological review, 95(4):411, 1988.
 Kingma & Ba (2015) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Lake et al. (2016) Lake, Brenden M, Ullman, Tomer D, Tenenbaum, Joshua B, and Gershman, Samuel J. Building machines that learn and think like people. arXiv Report 1604.00289, 2016.
 Li & Malik (2017) Li, Ke and Malik, Jitendra. Learning to optimize. In International Conference on Learning Representations, 2017.
 Naik & Mammone (1992) Naik, Devang K and Mammone, RJ. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks, volume 1, pp. 437–442. IEEE, 1992.
 Nesterov (1983a) Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983a.
 Nesterov (1983b) Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983b.
 Ravi & Larochelle (2017) Ravi, Sachin and Larochelle, Hugo. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
 Runarsson & Jonsson (2000) Runarsson, Thomas Philip and Jonsson, Magnus Thor. Evolution and design of distributed learning rules. In IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks, pp. 59–63. IEEE, 2000.
 Santoro et al. (2016) Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, 2016.
 Schmidhuber (1987) Schmidhuber, Jurgen. Evolutionary Principles in Self-Referential Learning. On Learning how to Learn: The Meta-Meta-Meta-…-Hook. PhD thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
 Silver et al. (2016) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Surjanovic & Bingham (2013) Surjanovic, Sonja and Bingham, Derek. Optimization test functions and datasets, 2013. URL http://www.sfu.ca/~ssurjano/optimization.html.
 Sutton (1992) Sutton, Richard S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Association for the Advancement of Artificial Intelligence, pp. 171–176, 1992.
 Szegedy et al. (2016) Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
 Thrun & Pratt (1998) Thrun, Sebastian and Pratt, Lorien. Learning to learn. Springer Science and Business Media, 1998.
 Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.
 Tseng (1998) Tseng, Paul. An incremental gradient (projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
 Wang et al. (2016) Wang, Jane X., Kurth-Nelson, Zeb, Tirumala, Dhruva, Soyer, Hubert, Leibo, Joel Z., Munos, Rémi, Blundell, Charles, Kumaran, Dharshan, and Botvinick, Matt. Learning to reinforcement learn. arXiv Report 1611.05763, 2016.
 Ward (1937) Ward, Lewis B. Reminiscence and rote learning. Psychological Monographs, 49(4):i, 1937.
 Zoph & Le (2017) Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
Appendix
Appendix A Code
The code for the metatraining procedure and metatrain problem set is available at https://git.io/v5oq5.
Appendix B Additional details of RNN architecture
B.1 Shortcut connection
Since we expect to be the primary driver of update step direction, and in order to further reduce the information which must be stored in the Parameter RNN hidden state, we included a metatrainable linear projection from the average rescaled gradients and the update directions and .
Appendix C Additional details of metatraining process
C.1 Heavy-tailed distribution over training steps
Appendix D Architecture updates
The Inception V3 experiment in Figure 3(b) used a slightly newer version of the learned optimizer codebase. The changes were:
D.1 Parameter noise during training
Because the metatraining problems in Section 4.1 are small, the learned optimizer is often able to optimize a problem almost exactly early in the unrolled optimization, after which the metaloss becomes relatively uninformative. To better simulate tasks that take many steps to optimize, small Gaussian noise is added to the parameters during each optimization step. This effectively moves the loss landscape underneath the optimizer, providing a more informative learning signal after many unrolls and forcing the learned optimizer to be robust to a new type of noise. Specifically, the parameter update becomes
(15)  
(16) 
where the noise scale is drawn from a log uniform distribution between and for each problem.
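The noise injection described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the bounds LOW and HIGH, and the choice to treat the noise scale as a standard deviation are all assumptions made for illustration, and the actual bounds of the log-uniform distribution are not reproduced here.

```python
import math
import random


def sample_noise_scale(rng, low, high):
    """Draw a noise scale log-uniformly from [low, high] (once per problem)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def noisy_update(params, step, noise_scale, rng):
    """Apply the optimizer's step, then perturb each parameter with Gaussian noise.

    Assumption: noise_scale is used as the standard deviation of the noise.
    """
    return [p + d + rng.gauss(0.0, noise_scale) for p, d in zip(params, step)]


rng = random.Random(0)
# LOW and HIGH are placeholder bounds chosen only for this illustration.
LOW, HIGH = 1e-6, 1e-2
sigma = sample_noise_scale(rng, LOW, HIGH)

params = [0.5, -1.0, 2.0]
step = [-0.1, 0.2, 0.0]  # update proposed by the optimizer
new_params = noisy_update(params, step, sigma, rng)
print(sigma, new_params)
```

Sampling the scale once per problem, rather than once per step, keeps the noise level fixed within an unroll while still exposing the optimizer to a range of noise levels across problems.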
D.2 Momentum from previous timescale
In Equation 3 we scale the average gradients by a running estimate of the root-mean-square magnitude of . This is a mismatch with Adam, where the average gradient is scaled by a running estimate of the root-mean-square magnitude of the non-averaged gradients. To be consistent with Adam, and to encourage better use of the dynamic range of (as defined in the text body, it spends much of its time with values near or ), we modify Equation 3 to normalize the average gradient by from the immediately faster timescale,
(17) 
and where we define the average gradient at the fastest timescale to be the raw gradient.
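One possible reading of this modification is sketched below for a single scalar parameter. The sketch is an interpretation of the prose, not a reproduction of Equation 17: the function name, the reuse of one decay rate per timescale for both running averages, the update order, and the constant eps are assumptions.

```python
import math


def multi_timescale_step(grad, avgs, sq_avgs, decays, eps=1e-8):
    """Update running averages at each timescale and return rescaled gradients.

    The average gradient at timescale s is normalized by a running RMS
    estimate of the average gradient from the immediately faster timescale.
    The "average gradient" faster than the first timescale is the raw gradient.
    """
    faster_avg = grad  # fastest timescale: the raw gradient itself
    scaled = []
    for s, beta in enumerate(decays):
        # Running mean square of the faster timescale's average gradient.
        sq_avgs[s] = beta * sq_avgs[s] + (1.0 - beta) * faster_avg ** 2
        # Running average of the gradient at this timescale.
        avgs[s] = beta * avgs[s] + (1.0 - beta) * grad
        scaled.append(avgs[s] / math.sqrt(sq_avgs[s] + eps))
        faster_avg = avgs[s]
    return scaled


decays = [0.5, 0.9]  # hypothetical decay rates, fastest timescale first
avgs = [0.0, 0.0]
sq_avgs = [0.0, 0.0]
for _ in range(200):
    scaled = multi_timescale_step(2.0, avgs, sq_avgs, decays)
print(scaled)
```

For the first timescale this reduces to an Adam-like rule, since the average gradient is divided by the running root-mean-square of the raw gradients.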
D.3 No normalization of step length
To simplify interactions between parameters, we no longer force a normalization of the parameter and attention update directions and . We do still decompose the update into the product of a learning rate and a step. Since the attended update direction can now take on a different magnitude, the separate attention log learning rate is no longer required and is eliminated. Equations 5 and 6 thus become
(18)  
(19) 
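The effect of dropping the normalization can be illustrated with a small sketch. The normalized variant below rescales the direction to unit root-mean-square length, which is an assumed stand-in for the earlier form rather than a reproduction of Equations 5 and 6; the function names are likewise hypothetical.

```python
import math


def update_normalized(direction, log_lr):
    """Assumed earlier form: only the orientation of the direction matters."""
    rms = math.sqrt(sum(d * d for d in direction) / len(direction))
    return [math.exp(log_lr) * d / rms for d in direction]


def update_unnormalized(direction, log_lr):
    """Revised form: learning rate times step, with no forced normalization."""
    return [math.exp(log_lr) * d for d in direction]


direction = [3.0, 4.0]
bigger = [10.0 * d for d in direction]

# The normalized update ignores the magnitude of the direction,
# while the unnormalized update scales with it.
print(update_normalized(direction, 0.0), update_normalized(bigger, 0.0))
print(update_unnormalized(direction, 0.0), update_unnormalized(bigger, 0.0))
```

Because the unnormalized step carries its own magnitude, the network producing the direction can express relative step sizes directly, which is why a separate log learning rate for the attended update becomes redundant.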
D.4 More stable metatraining hyperparameters
The distribution over metaloss gradients is observed to be asymmetrical and heavy-tailed. This combination is known to cause biased parameter updates in RMSProp and Adam, since both optimizers underweight the contribution from extremely rare, extremely large gradients. To reduce this tendency, we changed the mean-square-gradient momentum term in the RMSProp metaoptimizer (Section 4.5) from 0.9 to 0.999.
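The underweighting can be seen in a toy calculation. The sketch below is a plain scalar RMSProp-style normalization written for illustration; the function name, the initial mean-square value, and the gradient sequence are assumptions and are unrelated to the metatraining code.

```python
import math


def rmsprop_steps(grads, decay, ms_init=1.0, eps=1e-12):
    """Return g / sqrt(running mean square) for each gradient in sequence."""
    ms = ms_init
    steps = []
    for g in grads:
        ms = decay * ms + (1.0 - decay) * g * g
        steps.append(g / math.sqrt(ms + eps))
    return steps


# Fifty typical gradients of magnitude 1, then one rare gradient 1000x larger.
grads = [1.0] * 50 + [1000.0]
for decay in (0.9, 0.999):
    steps = rmsprop_steps(grads, decay)
    print(f"decay={decay}: typical step {steps[0]:.3f}, "
          f"step on the rare gradient {steps[-1]:.2f}")
```

With a decay of 0.9, the rare gradient is 1000 times larger than a typical one but produces a step only about 3.2 times larger, because it immediately inflates its own normalizer. With a decay of 0.999 the same gradient produces a step about 31.6 times larger, so its contribution is suppressed less.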