Metacontrol for Adaptive Imagination-Based Optimization
Abstract
Many machine learning systems are built to solve the hardest examples of a particular task, which often makes them large and expensive to run, especially with respect to the easier examples, which might require much less computation. For an agent with a limited computational budget, this “one-size-fits-all” approach may result in the agent wasting valuable computation on easy examples, while not spending enough on hard examples. Rather than learning a single, fixed policy for solving all instances of a task, we introduce a metacontroller which learns to optimize a sequence of “imagined” internal simulations over predictive models of the world in order to construct a more informed, and more economical, solution. The metacontroller component is a model-free reinforcement learning agent, which decides both how many iterations of the optimization procedure to run and which model to consult on each iteration. The models (which we call “experts”) can be state transition models, action-value functions, or any other mechanism that provides information useful for solving the task, and can be learned on-policy or off-policy in parallel with the metacontroller. When the metacontroller, controller, and experts were trained with “interaction networks” (Battaglia et al., 2016) as expert models, our approach was able to solve a challenging decision-making problem under complex nonlinear dynamics. The metacontroller learned to adapt the amount of computation it performed to the difficulty of the task, and learned how to choose which experts to consult by factoring in both their reliability and individual computational resource costs. This allowed the metacontroller to achieve a lower overall cost (task loss plus computational cost) than more traditional fixed policy approaches. These results demonstrate that our approach is a powerful framework for using rich forward models for efficient model-based reinforcement learning.
1 Introduction
While there have been significant recent advances in deep reinforcement learning (Mnih et al., 2015; Silver et al., 2016) and control (Lillicrap et al., 2015; Levine et al., 2016), most efforts train a network that performs a fixed sequence of computations. Here we introduce an alternative in which an agent uses a metacontroller to choose which, and how many, computations to perform. It “imagines” the consequences of potential actions proposed by an actor module, and refines them internally, before executing them in the world. The metacontroller adaptively decides which expert models to use to evaluate candidate actions, and when it is time to stop imagining and act. The learned experts may be state transition models, action-value functions, or any other function that is relevant to the task, and can vary in their accuracy and computational costs. Our metacontroller’s learned policy can exploit the diversity of its pool of experts by trading off between their costs and reliability, allowing it to automatically identify which expert is most worthwhile.
We draw inspiration from research in cognitive science and neuroscience which has studied how people use a meta-level of reasoning in order to control the use of their internal models and allocation of their computational resources. Evidence suggests that humans rely on rich generative models of the world for planning (Gläscher et al., 2010), control (Wolpert & Kawato, 1998), and reasoning (Hegarty, 2004; Johnson-Laird, 2010; Battaglia et al., 2013), that they adapt the amount of computation they perform with their model to the demands of the task (Hamrick et al., 2015), and that they trade off between multiple strategies of varying quality (Lee et al., 2014; Lieder et al., 2014; Lieder & Griffiths, in revision; Kool et al., in press).
Our imagination-based optimization approach is related to classic artificial intelligence research on bounded-rational metareasoning (Horvitz, 1988; Russell & Wefald, 1991; Hay et al., 2012), which formulates a meta-level MDP for selecting computations to perform, where the computations have a known cost. We also build on classic work by Schmidhuber (1990a, b), which used an RL controller with a recurrent neural network (RNN) world model to evaluate and improve upon candidate controls online.
Recently Andrychowicz et al. (2016) used a fully differentiable deep network to learn to perform gradient descent optimization, and Tamar et al. (2016) used a convolutional neural network for performing value iteration online in a deep learning setting. In other similar work, Fragkiadaki et al. (2015) made use of “visual imaginations” for action planning. Our work is also related to recent notions of “conditional computation” (Bengio, 2013; Bengio et al., 2015), which adaptively modifies network structure online, and “adaptive computation time” (Graves, 2016) which allows for variable numbers of internal “pondering” iterations to optimize computational cost.
Our work’s key contribution is a framework for learning to optimize via a metacontroller which manages an adaptive, imagination-based optimization loop. This represents a hybrid RL system where a model-free metacontroller constructs its decisions using an actor policy to manage model-free and model-based experts. Our experimental results demonstrate that a metacontroller can flexibly allocate its computational resources on a case-by-case basis to achieve greater performance than more rigid fixed policy approaches, using more computation when it is required by a more difficult task.
2 Model
We consider a class of fully observed, one-shot decision-making tasks (i.e., continuous, contextual bandits). The performance objective is to find a control $c$ which, given an initial state $x$, minimizes some loss function $L$ between a known future goal state $x^*$ and the result of a forward process, $f(x, c)$. The performance loss $L_P$ is the (negative) utility of executing the control in the world, and is related to the optimal solution $c^*$ as follows:
(1)  $L_P(c, x, x^*) = L(x^*, f(x, c))$
(2)  $c^* = \arg\min_c L_P(c, x, x^*)$
However, (2) defines only the optimal solution, not how to achieve it.
2.1 Optimizing Performance
We consider an iterative optimization procedure that takes the initial state $x$ and the goal state $x^*$ as input and returns an approximation of the optimal control $c^*$ in order to minimize (1). The optimization procedure consists of a controller, which iteratively proposes controls, and an expert, which evaluates how good those controls are. On the $n$th iteration, the controller takes as input $x$, $x^*$, and information about the history of previously proposed controls and evaluations, $h_{n-1}$, and returns a proposed control $c_n$ that aims to improve on previously proposed controls. An expert takes the proposed control and provides some information about the quality of the control, which we call an opinion. This opinion is added to the history, which is passed back to the controller, and the loop continues for $N$ steps, after which a final control $c_N$ is proposed.
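The propose/evaluate loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `optimize`, `toy_controller`, and `toy_expert`, the scalar states, and the additive forward process are all assumptions made for the example, and the controller here is a hand-written heuristic rather than a learned network.

```python
from typing import Callable, List, Tuple

# One history entry: (proposed control, expert opinion).
History = List[Tuple[float, float]]
Controller = Callable[[float, float, History], float]
Expert = Callable[[float, float, float], float]


def optimize(controller: Controller, expert: Expert,
             x: float, x_goal: float, n_steps: int) -> Tuple[float, History]:
    """Run n_steps propose/evaluate rounds, then return a final control."""
    history: History = []
    for _ in range(n_steps):
        c = controller(x, x_goal, history)   # controller proposes a control
        opinion = expert(x, x_goal, c)       # expert gives its opinion
        history.append((c, opinion))         # opinion is added to the history
    return controller(x, x_goal, history), history


def toy_expert(x: float, x_goal: float, c: float) -> float:
    """Hypothetical state-transition expert: predicts the outcome x + c."""
    return x + c


def toy_controller(x: float, x_goal: float, history: History) -> float:
    """Hand-written stand-in for a learned controller."""
    if not history:
        return 0.0                           # first proposal: do nothing
    last_c, predicted = history[-1]
    # Move the control halfway toward closing the predicted gap to the goal.
    return last_c + 0.5 * (x_goal - predicted)


final_c, hist = optimize(toy_controller, toy_expert,
                         x=1.0, x_goal=3.0, n_steps=4)
print(final_c, [opinion for _, opinion in hist])
```

In this toy setting the ideal control is 2.0, and each additional round of consultation halves the remaining gap, so the quality of the final proposal depends directly on how many evaluations are allowed.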
Standard optimization methods use principled heuristics for proposing controls. In gradient descent, for example, controls are proposed by adjusting $c$ in the direction of the gradient of the reward with respect to the control. In Bayesian optimization, controls are proposed based on selection criteria such as “probability of improvement”, or a meta-selection criterion for choosing among several basic selection criteria (Hoffman et al., 2011; Shahriari et al., 2014). Rather than choosing one of several controllers, our work learns a single controller and instead focuses on selecting from multiple experts (see Sec. 2.2). In some cases the forward process $f$ is known and inexpensive to compute, and thus the optimization procedure uses $f$ itself as the expert. However, in many real-world settings, $f$ is expensive or non-stationary and so it can be advantageous to use an approximation of $f$ (e.g., a state transition model), of $L_P$ (e.g., an action-value function), or any other quantity that gives some information about $f$ or $L_P$.
2.2 Optimizing Computational Cost
Given a controller and one or more experts, there are two important decisions to be made. First, how many optimization iterations should be performed? The approximate solution usually improves with more iterations, but each iteration costs computational resources. However, most traditional optimizers either ignore the cost of computation or select the number of iterations using simple heuristics. Because they do not balance the cost of computation against the performance loss, the overall effectiveness of these approaches is subject to the skill and preferences of the practitioners who use them. Second, which expert should be used on each step of the optimization? Some experts may be accurate but expensive to compute in terms of time, energy and/or money, while others may be crude, yet cheap. Moreover, the reliability of the experts may not be known a priori, further limiting the effectiveness of the optimization procedure. Our use of a metacontroller addresses these issues by jointly optimizing over the choices of how many steps to take and which experts to use.
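We consider a family of optimizers which use the same controller, $\pi^C$, but vary in their expert evaluators, $\{E_1, \ldots, E_K\}$. Assuming that the controller and experts are deterministic functions, the number of iterations $N$ and the sequence of experts $\mathbf{k} = (k_0, \ldots, k_{N-1})$ exactly determine the final control and performance loss $L_P$. This means we have transformed the performance optimization over $c$ into an optimization over $N$ and $\mathbf{k}$: $\arg\min_{\mathbf{k}, N} L_P(c(x, x^*, N, \mathbf{k}), x, x^*)$, where the notation $c(x, x^*, N, \mathbf{k})$ is used to emphasize that the control is a function of $x$, $x^*$, $N$, and $\mathbf{k}$.
If each optimizer has an associated computational cost $\tau_k$, then $N$ and $\mathbf{k}$ also exactly determine the computational resource loss of the optimization run, $L_R(N, \mathbf{k})$. The total loss $L_T$ is then the sum of $L_P$ and $L_R$, each of which is a function of $N$ and $\mathbf{k}$,
(3)  $L_T(x, x^*, N, \mathbf{k}) = L_P(c(x, x^*, N, \mathbf{k}), x, x^*) + L_R(N, \mathbf{k})$
(4)  $\phantom{L_T(x, x^*, N, \mathbf{k})} = L(x^*, f(x, c(x, x^*, N, \mathbf{k}))) + \sum_{n=0}^{N-1} \tau_{k_n}$
and the optimal solution is defined as $N^*, \mathbf{k}^* = \arg\min_{\mathbf{k}, N} L_T(x, x^*, N, \mathbf{k})$. Optimizing $L_T$ is difficult because of the recursive dependency on the history, $h_{n-1}$, and because the discrete choices of $N$ and $\mathbf{k}$ mean $L_T$ is not differentiable.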
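The trade-off between performance loss and computational cost can be made concrete with a toy calculation. The sketch below assumes a single expert with a fixed per-step cost `tau` and a made-up performance curve `0.5 ** n`; both are hypothetical choices for illustration and are not taken from the experiments. It finds the number of iterations that minimizes the total loss by brute force.

```python
from typing import Callable, Tuple


def best_num_iterations(perf_loss: Callable[[int], float],
                        tau: float, max_n: int) -> Tuple[int, float]:
    """Return (n, total) minimizing perf_loss(n) + n * tau over 0..max_n."""
    totals = [perf_loss(n) + n * tau for n in range(max_n + 1)]
    n_best = min(range(max_n + 1), key=totals.__getitem__)
    return n_best, totals[n_best]


def toy_perf_loss(n: int) -> float:
    """Hypothetical performance loss that halves with each iteration."""
    return 0.5 ** n


for tau in (0.01, 0.1, 0.3):
    n_best, total = best_num_iterations(toy_perf_loss, tau, max_n=10)
    print(f"tau={tau}: best n={n_best}, total loss={total:.4f}")
```

As the cost per step rises, the best number of iterations falls. A brute-force search like this needs the whole loss curve in advance, which is exactly what an agent facing a new problem instance does not have.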
To optimize $L_T$ we recast it as an RL problem where the objective is to jointly optimize task performance and computational cost. As shown in Figure 1a, the metacontroller agent is comprised of a controller $\pi^C$, a pool of experts $\{E_1, \ldots, E_K\}$, a manager $\pi^M$, and a memory $\mu$. The manager is a meta-level policy (Russell & Wefald, 1991; Hay et al., 2012) over actions indexed by $k \in \{0, \ldots, K\}$, which determine whether to terminate the optimization procedure ($k = 0$) or to perform another iteration of the optimization procedure with the $k$th expert. Specifically, on the $n$th iteration the controller produces a new control $c_n$ based on the history of controls, experts, and evaluations. The manager, also relying on this history, independently decides whether to end the optimization procedure (i.e., to execute the control in the world) or to perform another iteration and evaluate the proposed control with the $k_n$th expert (i.e., to ponder, after Graves (2016)). The memory then updates the history $h_n$ by concatenating $k_n$, $c_n$, and the expert's opinion $e_n$ with the previous history $h_{n-1}$. Coming back to the notion of imagination-based optimization, we suggest that this iterative optimization process is analogous to imagining what will happen (using one or more approximate world models) before actually executing that action in the world. For further details, see Appendix A, and for an algorithmic illustration of the metacontroller agent, see Algorithm 1 in the appendix.
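A rough sketch of the loop described above follows. It is an illustration under stated assumptions, not a transcription of the paper's algorithm: the function and variable names are hypothetical, the `max_steps` cap is added so the loop always terminates, and the controller, manager, and expert are hand-written toy functions where the paper uses learned networks.

```python
from typing import Callable, List, Sequence, Tuple

# One history entry: (expert index, proposed control, expert opinion).
History = List[Tuple[int, float, float]]


def metacontrol(controller: Callable[[float, float, History], float],
                experts: Sequence[Callable[[float, float, float], float]],
                costs: Sequence[float],
                manager: Callable[[float, float, History], int],
                x: float, x_goal: float,
                max_steps: int = 10) -> Tuple[float, History, float]:
    """Ponder until the manager picks action 0 (execute) or max_steps is hit."""
    history: History = []
    resource_loss = 0.0
    for _ in range(max_steps):
        c = controller(x, x_goal, history)
        k = manager(x, x_goal, history)         # 0 = execute, k >= 1 = ponder
        if k == 0:
            return c, history, resource_loss
        opinion = experts[k - 1](x, x_goal, c)  # consult the k-th expert
        resource_loss += costs[k - 1]           # pay that expert's cost
        history.append((k, c, opinion))         # memory update
    return controller(x, x_goal, history), history, resource_loss


def toy_expert(x: float, x_goal: float, c: float) -> float:
    return x + c                                # predicted outcome


def toy_controller(x: float, x_goal: float, history: History) -> float:
    if not history:
        return 0.0
    _, last_c, predicted = history[-1]
    return last_c + 0.5 * (x_goal - predicted)


def toy_manager(x: float, x_goal: float, history: History) -> int:
    # Execute once the last predicted outcome is close enough to the goal.
    if history and abs(x_goal - history[-1][2]) < 0.3:
        return 0
    return 1


control, hist, cost = metacontrol(toy_controller, [toy_expert], [0.25],
                                  toy_manager, x=1.0, x_goal=3.0)
print(control, len(hist), cost)
```

Note that the manager decides from the history alone, without seeing the control proposed on the current step, which mirrors the "independently decides" wording above.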
We also define two special cases of the metacontroller for baseline comparisons. The iterative agent does not have a manager and uses only a single expert. Its number of iterations is preset to a single $N$. The reactive agent, , is a special case of the iterative agent, where the number of iterations is fixed to $N = 0$. This implies that proposed controls are executed immediately in the world, and are not evaluated by an expert. For algorithmic illustrations of the iterative and reactive agents, see Algorithms 2 and 3 in the appendix.
2.3 Neural Network Implementation
We use standard deep learning building blocks, e.g., multi-layer perceptrons (MLPs), RNNs, etc., to implement the controller, experts, manager, and memory, because they are effective at approximating complex functions via gradient-based and reinforcement learning, but other approaches could be used as well. In particular, we constructed our implementation to be able to make control decisions in complex dynamical systems, such as controlling the movement of a spaceship (Figure 1b-c), though we note that our approach is not limited to such physical reasoning tasks. Here we used mean-squared error (MSE) for our loss function $L$ and Adam (Kingma & Ba, 2014) as the training optimizer.
Experts
We implemented the experts as MLPs and “interaction networks” (INs) (Battaglia et al., 2016), which are well-suited to predicting complex dynamical systems like those in our experiments below. Each expert has parameters , i.e. , and may be trained either on-policy using the outputs of the controller (as is the case in this paper), or off-policy by any data that pairs states and controls with future states or reward outcomes. The objective for each expert may be different depending on what the expert outputs. For example, the objective could be the loss between the goal and future states, , which is what we use in our experiments. Or, it could be the loss between and an action-value function that predicts directly, . See Appendix B.1 for details.
Controller and Memory
We implemented the controller as an MLP with parameters , i.e. , and we implemented the memory as a Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) with parameters . The memory embeds the history as a fixed-length vector, i.e. . The controller and memory were trained jointly to optimize (1). However, this objective includes , which is often unknown or not differentiable. We overcame this by approximating with a differentiable critic analogous to those used in policy gradient methods (e.g. Silver et al., 2014; Lillicrap et al., 2015; Heess et al., 2015). See Appendices B.2 and B.3 for details.
Manager
We implemented the manager as a stochastic policy that samples from a categorical distribution whose weights are produced by an MLP with parameters , i.e. . We trained the manager to minimize (3) using REINFORCE (Williams, 1992), but other deep RL algorithms could be used instead. See Appendix B.4 for details.
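Sampling from such a categorical policy can be illustrated as follows. This is a sketch only: the logits are hypothetical stand-ins for the output of the manager's MLP, and the convention that index 0 means "execute" while index $k \geq 1$ means "ponder with expert $k$" is assumed here for the example.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits: Sequence[float], rng: random.Random) -> int:
    """Sample an action index from the categorical distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [0.0, 2.0, -1.0]   # hypothetical: execute, expert 1, expert 2
probs = softmax(logits)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_action(logits, rng)] += 1
print([round(p, 3) for p in probs], counts)
```

Because actions are sampled rather than chosen greedily, the manager keeps exploring less-favoured options during training, which is what a score-function estimator such as REINFORCE relies on.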
3 Experiments
To evaluate our metacontroller agent, we measured its ability to learn to solve a class of physics-based tasks that are surprisingly challenging. Each episode consisted of a scene which contained a spaceship and multiple planets (Figure 1b-c). The spaceship’s goal was to rendezvous with its mothership near the center of the system in exactly 11 time steps, but it only had enough fuel to fire its thrusters once. The planets were static but the gravitational force they exerted on the spaceship induced complex nonlinear dynamics on the motion over the 11 steps. The spaceship’s action space was continuous, up to some maximum magnitude, and represented the instantaneous Cartesian velocity vector imparted by its thrusters. Further details are in Appendix C.
We trained the reactive, iterative, and metacontroller agents on five versions of the spaceship task involving different numbers of planets.
3.1 Reactive and Iterative Agents
Figure 2 shows the performance on the test set of the reactive and iterative agents for different numbers of ponder steps. The reactive agent performed poorly on the task, especially when the task was more difficult. With the five planets dataset, it was only able to achieve a performance loss of on average (see Figure 1 for a depiction of the magnitude of the loss). In contrast, the iterative agent with the true simulation expert performed much better, reaching ceiling performance on the datasets with one and two planets, and achieving a performance loss of on the five planets dataset. The IN and MLP experts also improve over the reactive agent, with a minimum performance loss of and on the five planets dataset, respectively.
Figure 2 also highlights how important the choice of expert is. When using the true simulation and IN experts, the iterative agent performs well. With the MLP expert, however, performance is substantially diminished. But despite the poor performance of the MLP expert, there is still some benefit to pondering with it. With even just a few steps, the MLP iterative agent outperforms its reactive counterpart. Comparing the reactive agent with the iterative agent is somewhat unfair, because the iterative agent has more parameters due to the expert and the memory. Still, given that performance also tends to improve between one and two ponder steps (and beyond), it is clear that pondering, even with a highly inaccurate model, can lead to better performance than a model-free reactive approach.
3.2 Metacontroller with One Expert
Though the iterative agents achieve impressive results, they expend more computation than necessary. For example, in the one and two planet conditions, the IN and true simulation iterative agents received little performance benefit from pondering more than two or three steps, while for the four and five planet conditions they required at least five to eight steps before their performance converged. When computational resources have no cost, the number of steps is of no concern, but when they have some cost it is important to be economical.
Because the metacontroller learns to choose its number of pondering steps, it can balance its performance loss against the cost of computation. Figure 3 (top row, middle and right subplots) shows that the IN and true simulation expert metacontrollers take fewer ponder steps as the ponder cost $\tau$ increases, tracking closely the minimum of the iterative agent’s cost curve (i.e., the metacontroller points are always near the iterative agent curves’ minima). This adaptive behavior emerges automatically from the manager’s learned policy, and avoids the need to perform a hyperparameter search to find the best number of iterations for a given $\tau$.
The metacontroller does not simply choose an average number of ponder steps to take per episode: it actually tailors this choice to the difficulty of each episode. Figure 4 shows how the number of ponder steps the IN metacontroller chooses in each episode depends on that episode’s difficulty, as measured by the episode’s loss under the reactive agent. For more difficult episodes, the metacontroller tends to take more ponder steps, as indicated by the positive slopes of the best fit lines, and this proportionality persists across the different levels of $\tau$ in each subplot.
The ability to adapt its choice of number of ponder steps on a per-episode basis is very valuable because it allows the metacontroller to spend additional computation only on those episodes which require it. The total costs of the IN and true simulation metacontrollers are 11% and 15% lower (median) than the best achievable costs of their corresponding iterative agents, respectively, across the range of $\tau$ values we tested (see Figure 7 in the Appendix for details).
There can even be a benefit to using a metacontroller when there are no computational resource costs. Consider the rightmost points in Figure 3 (bottom row, middle and right subplots), which show the performance loss for the IN and true simulation metacontrollers when $\tau$ is low. Remarkably, these points still outperform the best achievable iterative agents. This suggests that there can be an advantage to stopping pondering once a good solution is found, and more generally demonstrates that the metacontroller’s learning process can lead to strategies that are superior to those available to less flexible agents.
The metacontroller with the MLP expert had very poor average performance and high variance on the five planet condition (Figure 3, top left subplot), which is why we restricted our focus in this section to how the metacontrollers with IN and true simulation experts behaved. The MLP’s poor performance is crucial, however, for the following section (3.3), which analyzes how a multiple-expert metacontroller manages experts which vary greatly in their reliability.
3.3 Metacontroller with Two Experts
When we allow the manager to additionally choose between two experts, rather than only relying on a single expert, we find a similar pattern of results in terms of the number of ponder steps (Figure 5, left). Additionally, the metacontroller is successfully able to identify the more reliable IN network and consequently uses it a majority of the time, except in a few cases where the cost of the IN network is extremely high relative to the cost of the MLP network (Figure 5, right). This pattern of results makes sense given the good performance (described in the previous section) of the metacontroller with the IN expert compared to the poor performance of the metacontroller with the MLP expert. The manager should not generally rely on the MLP expert because it is simply not a reliable source of information.
However, the metacontroller has more difficulty finding an optimal balance between the two experts on a step-by-step basis: the addition of a second expert did not yield much of an improvement over the single-expert metacontroller, with only % of the different versions (trained with different $\tau$ values for the two experts) achieving a lower loss than the best iterative controller. We believe the mixed performance of the metacontroller with multiple experts is partially due to an entropy term which we used to encourage the manager’s policy to be nondeterministic (see Appendix B.4). In particular, for high values of $\tau$, the optimal thing to do is to always execute immediately without pondering. However, because of the entropy term, the manager is encouraged to have a nondeterministic policy and therefore is likely to ponder more than it should (and to use experts that are more unreliable) even when this is suboptimal in terms of the total loss (3).
Despite the fact that the metacontroller with multiple experts does not result in a substantial improvement over that which uses a single expert, we emphasize that the manager is able to identify and use the more reliable expert the majority of the time. It is also still able to choose a variable number of steps according to how difficult the task is (Figure 5, left). This, in and of itself, is an improvement over more traditional optimization methods, which would require that the expert is hand-picked ahead of time and that the number of steps is determined heuristically.
4 Discussion
In this paper, we have presented an approach to adaptive, imagination-based optimization in neural networks. Our approach is able to flexibly choose which computations to perform as well as how many computations need to be performed, approximately solving a speed-accuracy tradeoff that depends on the difficulty of the task. In this way, our approach learns to rely on whatever source of information is most useful and most efficient. Additionally, by consulting the experts on-the-fly, our approach allows agents to test out actions to ensure that their consequences are not disastrous before actually executing them.
While the experiments in this paper involve a one-shot decision task, our approach lays a foundation that can be built upon to support more complex situations. For example, rather than applying a force only on the first time step, we could turn the problem into one of trajectory optimization for continuous control by asking the controller to produce a sequence of forces. In the case of planning, our approach could potentially be combined with methods like Monte Carlo Tree Search (MCTS) (Coulom, 2006), where our experts would be akin to having several different rollout policies to choose from, and our controller would be akin to the tree policy. While most MCTS implementations will run rollouts until a fixed amount of time has passed, our approach would allow the manager to adaptively choose the number of rollouts to perform and which policies to perform the rollouts with. Our method could also be used to naturally augment existing model-free approaches such as DQN (Mnih et al., 2015) with online model-based optimization by using the model-free policy as a controller and adding additional experts in the form of state-transition models. An interesting extension would be to compare our metacontroller architecture with a naïve model-based controller that performs gradient-based optimization to produce the final control. We expect that our metacontroller architecture might require fewer model evaluations and be more robust to model inaccuracies than the gradient-based method, because our method has access to the full history of proposed controls and evaluations whereas traditional gradient-based methods do not.
Although we rely on differentiable experts in our metacontroller architecture, we do not utilize the gradient information from these experts. An interesting extension to our work would be to pass this gradient information through to the manager and controller (as in Andrychowicz et al. (2016)), which would likely improve performance further, especially in the more complex situations discussed here. Another possibility is to train some or all of the experts inline with the controller and metacontroller, rather than independently, which could allow their learned functionality to be more tightly integrated with the rest of the optimization loop, at the expense of their generality and ability to be repurposed for other uses.
To conclude, we have demonstrated how neural network-based agents can use metareasoning to adaptively choose what to think about, how to think about it, and for how long to think. Our method is directly inspired by human cognition and suggests a way to make agents much more flexible and adaptive than they currently are, both in decision-making tasks such as the one described here and in planning and control settings more broadly.
Acknowledgments
We would like to thank Matt Hoffman, Andrea Tacchetti, Tom Erez, Nando de Freitas, Guillaume Desjardins, Joseph Modayil, Hubert Soyer, Alex Graves, David Reichert, Theo Weber, Jon Scholz, Will Dabney, and others on the DeepMind team for helpful discussions and feedback.
Appendix A Metacontroller Details
Here, we give the precise definitions of the metacontroller agent. As described in the main text, the iterative and reactive agents are special cases of the metacontroller agent, and are therefore not discussed here.
The metacontroller agent is comprised of the following components:

A history-sensitive controller, , which is a policy that maps goal and initial states, and a history, , to controls, whose aim is to minimize (1).

A pool of experts . Each expert maps goal states, input states, and actions to opinions. Opinions can be either states-only (), states and rewards (), or rewards-only (). The expert corresponds to the evaluator for the optimization routine, i.e., an approximation of the forward process .

A manager, , which is a policy that decides whether to send a proposed control to the world () or to the expert for evaluation, in order to minimize (3). This formulation is based on that used by metareasoning systems (Russell & Wefald, 1991; Hay et al., 2012). Details on the corresponding MDP are given in Appendix A.1.

A memory, , which is a function that maps the prior history , as well as the most recent manager choice, proposed control, and expert evaluation , to an updated history , which is then made available to the manager and controller on subsequent iterations. The history at step is a recursively defined tuple which is the concatenation of the prior history with the most recently proposed control, expert evaluation, and expert identity: where represents an empty initial history. Similarly, the finite set of histories up to step is: where .
The metacontroller produces:
(5) 
where . This function is summarized in Algorithm 1. The other agents (iterative and reactive), as mentioned in the main text, are simpler versions of the metacontroller agent and are summarized in Algorithms 2 and 3.
A.1 Meta-Level MDP
To implement the manager for the metacontroller agent, we draw inspiration from the metareasoning literature (Russell & Wefald, 1991; Hay et al., 2012) and formulate the problem as a finite-horizon Markov Decision Process (MDP) over the decision of whether to perform another iteration of the optimization procedure or to execute a control in the world.

The state space consists of goal states, external states, and internal histories, .

The action space contains discrete actions, , which correspond to execute () and ponder (), where ponder (after Graves (2016)) refers to performing an iteration of the optimization procedure with the expert.

The (deterministic) state transition model is,
where and and,

The (deterministic) reward function maps the current state, current action, and next state to realvalued loss:
where .
We approximate the solution to this MDP with a stochastic manager policy . The manager chooses actions proportional to the immediate reward for taking action in state plus the expected sum of future rewards. This construction imposes a tradeoff between accuracy and resources, incentivizing the agent to ponder longer and with more accurate (and potentially expensive) experts when the problem is harder.
Appendix B Gradients
B.1 Experts
Training the experts is a straightforward supervised learning problem (Figure 6c). The gradient is:
(6) 
where is the expert and is the loss function for the expert. For example, in the case of an action-value function expert, this loss function might be . In the case of an expert that predicts the final state using a model of the system dynamics, it might be .
B.2 Critic
The critic, , is an approximate model of the performance loss, , (1), which is used to backpropagate gradients to the controller and memory. This means the critic can either be an action-value function, which approximates directly, or a model of the system dynamics composed with a known loss function between the goal and future states, . We train the critic, , using the same procedure used to train the experts (Figure 6d). A good expert may even be used as the critic.
B.3 Controller and Memory
As shown in Figure 6a, we trained the controller and memory using backpropagation through time (BPTT) with an actor-critic architecture. Specifically, rather than assuming is known and differentiable, we use a critic and backpropagate through it (Heess et al., 2015):
(7) 
where is the critic, is the maximum number of iterations the controller can use, and:
(8) 
where we are using the notation to indicate summed gradients, following Pascanu et al. (2013). Since has already been produced by the manager it can be treated as a constant and will produce an unbiased estimate of the gradient. This is convenient because it allows for training the controller and manager separately, or testing the controller’s behavior with arbitrary actions post-training.
B.4 Manager
As discussed in the main text, we used the REINFORCE algorithm (Williams, 1992) to train the manager (Figure 6b). One potential issue, however, is that when training the controller and manager simultaneously, the controller will incur a high cost early on in training and thus the manager will learn to always choose the execute action. To discourage the manager from learning what is essentially a deterministic policy, we included a regularization term based on the entropy (Williams & Peng, 1991; Mnih et al., 2016):
is the full return given by (3) and is the strength of the regularization term.
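One common way to realize an entropy-regularized REINFORCE objective is as a per-episode surrogate loss whose gradient gives the policy update. The sketch below is an assumed form for illustration, not the equation used in the paper: the names `manager_surrogate_loss` and `beta`, the sign conventions, and the absence of a baseline are all choices made for the example.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def manager_surrogate_loss(step_probs: Sequence[Sequence[float]],
                           actions: Sequence[int],
                           total_loss: float,
                           beta: float) -> float:
    """REINFORCE-style surrogate for one episode (assumed form).

    Descending its gradient lowers the log-probability of action sequences
    that produced a high total loss, while the -beta * entropy term rewards
    keeping the policy stochastic.
    """
    log_prob = sum(math.log(p[a]) for p, a in zip(step_probs, actions))
    total_entropy = sum(entropy(p) for p in step_probs)
    return total_loss * log_prob - beta * total_entropy


# Hypothetical two-step episode: ponder with expert 1, then execute.
step_probs = [[0.5, 0.5], [0.5, 0.5]]
actions = [1, 0]
loss = manager_surrogate_loss(step_probs, actions, total_loss=2.0, beta=0.1)
print(round(loss, 4))
```

A larger `beta` pushes the policy toward uniform action probabilities, which is consistent with the concern raised in Section 3.3 that the regularizer can cause more pondering than is optimal.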
Appendix C Spaceship Task
C.1 Datasets
We generated five datasets, each containing scenes with a different number of planets (ranging from a single planet to five planets). Each dataset consisted of 100,000 training scenes and 1,000 testing scenes. The target in each scene was always located at the origin, and each scene always had a sun with a mass of 100 units. The sun was located between 100 and 200 distance units away from the target, with this distance sampled uniformly at random. The other planets had a mass between 20 and 50 units, and were located 100 to 250 distance units away from the target, sampled uniformly at random. The spaceship had a mass between 1 and 9 units, and was located 150 to 250 distance units away from the target. The planets were always fixed (i.e., they could not move), and the spaceship always started at the beginning of each episode with zero velocity.
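The sampling scheme above could be sketched as follows. Several details are assumptions made for this example rather than facts from the text: directions are drawn uniformly in $[0, 2\pi)$, masses are drawn from continuous uniform distributions, the sun is counted as one of the `n_planets` bodies, and the function names and dictionary layout are hypothetical.

```python
import math
import random
from typing import Any, Dict, Tuple


def place(rng: random.Random, r_min: float, r_max: float) -> Tuple[float, float]:
    """Sample a position whose distance from the origin is uniform in
    [r_min, r_max]; the uniformly random direction is an assumption."""
    r = rng.uniform(r_min, r_max)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(angle), r * math.sin(angle))


def make_scene(n_planets: int, rng: random.Random) -> Dict[str, Any]:
    """Build one scene; n_planets counts the sun (an assumption)."""
    planets = [{"mass": 100.0, "pos": place(rng, 100.0, 200.0)}]  # the sun
    for _ in range(n_planets - 1):
        planets.append({"mass": rng.uniform(20.0, 50.0),
                        "pos": place(rng, 100.0, 250.0)})
    ship = {"mass": rng.uniform(1.0, 9.0),
            "pos": place(rng, 150.0, 250.0),
            "vel": (0.0, 0.0)}                  # ship starts at rest
    return {"target": (0.0, 0.0), "planets": planets, "ship": ship}


rng = random.Random(0)
scene = make_scene(3, rng)
print(len(scene["planets"]), scene["ship"]["mass"])
```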
C.2 Environment
We simulated our scenes using a physical simulation of gravitational dynamics. The planets were always stationary (i.e., they were not acted upon by any of the objects in the scene) but acted upon the spaceship with a force of:
(9) 
where is the force vector of the planet on the spaceship, is a gravitational constant, is the mass of the planet, is the mass of the spaceship, is the distance between the centers of masses of the planet and the spaceship, is the location of the planet, and is the location of the spaceship. We simulated this environment using the Euler method, i.e.:
(10) 
where , , and are the acceleration, velocity, and position of the spaceship, respectively; is a damping constant; is the control force applied to the spaceship; and is the step size. Note that we set to zero for all timesteps except the first.
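The dynamics described in this section can be sketched as follows. The exact forms of Equations (9) and (10) are not reproduced here, so this is an illustration under assumptions: gravity is taken to be the usual inverse-square attraction directed from the spaceship toward the planet, damping is taken to be a velocity-proportional term, the position update uses the newly computed velocity, and all function and parameter names (`G`, `damping`, `dt`, and so on) are hypothetical.

```python
import math


def gravity_force(G, m_planet, m_ship, x_planet, x_ship):
    # Assumed inverse-square attraction: magnitude G * m_p * m_s / r^2,
    # pointing from the spaceship toward the planet.
    dx = x_planet[0] - x_ship[0]
    dy = x_planet[1] - x_ship[1]
    r = math.hypot(dx, dy)
    scale = G * m_planet * m_ship / r ** 3
    return (scale * dx, scale * dy)


def euler_step(x, v, m_ship, planets, control, G, damping, dt):
    # planets is a list of (mass, position) pairs; planets never move.
    fx, fy = control
    for m_planet, x_planet in planets:
        gx, gy = gravity_force(G, m_planet, m_ship, x_planet, x)
        fx += gx
        fy += gy
    ax, ay = fx / m_ship, fy / m_ship
    # Assumed damping: proportional to the previous velocity.
    vx = v[0] + dt * ax - dt * damping * v[0]
    vy = v[1] + dt * ay - dt * damping * v[1]
    return (x[0] + dt * vx, x[1] + dt * vy), (vx, vy)


def simulate(x0, m_ship, planets, control, G, damping, dt, num_steps):
    # The control force is applied on the first timestep only, as in the text.
    x, v = x0, (0.0, 0.0)
    for t in range(num_steps):
        c = control if t == 0 else (0.0, 0.0)
        x, v = euler_step(x, v, m_ship, planets, c, G, damping, dt)
    return x, v
```

Because the control acts only at the first step, the agent's single action is effectively an initial impulse; everything afterwards is determined by gravity and damping.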
Appendix D Implementation Details
We used TensorFlow (Abadi et al., 2015) to implement and train all versions of the model.
D.1 Architecture
In our implementation of the controller, we used a two-layer MLP with 100 units per layer. The first layer used ReLU activations and the second layer used a multiplicative interaction similar to van den Oord et al. (2016), which we found to work better in practice. In our implementation of the memory, we used a single LSTM layer of size 100. In our implementation of the manager, we used an MLP of two fully connected layers of 100 units each, with ReLU nonlinearities.
We constructed three different experts to test the various controllers. The true simulation expert was the same as the world model, and consisted of a simulation for 11 timesteps with (see Appendix C). The IN expert was an interaction network (Battaglia et al., 2016), which has previously been shown to be able to learn to predict n-body dynamics accurately for simple systems. The IN consists of a relational module and an object module. In our case, the relational module was composed of 4 hidden layers of 150 nodes each, outputting “effects” encodings of size 100. These effects, together with the relational model input, are then used as input to the object model, which contained a single hidden layer of 100 nodes. The object model outputs the velocity of the spaceship, and we trained it to predict the velocity on every timestep of the spaceship’s trajectory. The MLP expert was an MLP that predicted the final location of the spaceship and had the same architecture as the controller.
As discussed in Appendix B, we used a critic to train the controller and memory. We always used the IN expert as the critic, except in the case when the true simulation expert was used, in which case we also used the true simulation as the critic.
D.2 Training Procedure
All weights were initialized uniformly at random between 0 and 0.01. An iteration of training consisted of gradient updates over a minibatch of size 1000; in total, we ran training for 100,000 iterations. We additionally used a waterfall schedule for each of the learning rates during training, such that after 1000 iterations, if the loss was not decreasing, we would decay the step size by 5%.
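One way to read the initialization and the waterfall schedule is sketched below. The text does not say how "not decreasing" was measured, so this sketch assumes that the mean loss over each block of 1000 iterations is compared with the mean over the previous block; the names `init_weights` and `WaterfallSchedule` are hypothetical.

```python
import random


def init_weights(rng, n_in, n_out, low=0.0, high=0.01):
    # Weights drawn uniformly at random between 0 and 0.01, as in the text.
    return [[rng.uniform(low, high) for _ in range(n_out)] for _ in range(n_in)]


class WaterfallSchedule:
    """Decay the learning rate by 5% whenever the loss stops decreasing."""

    def __init__(self, learning_rate, check_every=1000, decay=0.95):
        self.learning_rate = learning_rate
        self.check_every = check_every
        self.decay = decay
        self._window = []
        self._previous_mean = None

    def update(self, loss):
        # Record one iteration's loss and return the current learning rate.
        self._window.append(loss)
        if len(self._window) < self.check_every:
            return self.learning_rate
        mean = sum(self._window) / len(self._window)
        # Assumption: "not decreasing" means the block mean did not drop.
        if self._previous_mean is not None and mean >= self._previous_mean:
            self.learning_rate *= self.decay
        self._previous_mean = mean
        self._window = []
        return self.learning_rate
```

A separate schedule instance would be kept for each learning rate, since the text applies the schedule to each of them.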
We trained the controller and memory together using the Adam optimizer (Kingma & Ba, 2014) with gradients clipped to a maximum global norm of 10 (Pascanu et al., 2013). The manager was trained simultaneously, but using a different learning rate than the controller and memory. The IN and MLP experts were also trained simultaneously, but again with different learning rates. Learning rates were determined using a grid search over a small number of values, and are given in Table 1 for the iterative agent, in Table 2 for the metacontroller with one expert, and in Table 3 for the metacontroller with two experts.
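Clipping by global norm rescales all gradients jointly so that their combined L2 norm does not exceed the threshold. The following is a minimal pure-Python sketch of the idea, with gradients represented as lists of flat lists; the function name is hypothetical. TensorFlow offers `tf.clip_by_global_norm` for the same purpose, though the text does not state which routine was used.

```python
import math


def clip_by_global_norm(grads, max_norm):
    # grads: one flat list of floats per parameter tensor.
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        # Already within the threshold: return an unchanged copy.
        return [list(tensor) for tensor in grads], global_norm
    # Rescale every gradient by the same factor so the joint norm is max_norm.
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads], global_norm
```

Because every tensor is scaled by the same factor, the direction of the overall update is preserved; only its length changes.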
The iterative agent was trained to take a fixed number of ponder steps, ranging from 0 (i.e., the reactive agent) to 10. The metacontrollers were allowed to take a variable number of ponder steps up to a maximum of 10. For the metacontroller with a single expert, we trained the manager using a ponder cost of zero and 20 additional ponder cost values spaced logarithmically between 0.00004 and 0.4 (inclusive). For the metacontroller with multiple experts, we trained the manager on a grid of pairs of ponder cost values, where each expert could have a ponder cost of zero or one of 6 values spaced logarithmically between 0.00004 and 0.2 (inclusive). In all cases, the entropy penalty for the metacontroller was .
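The logarithmically spaced cost grids can be reproduced with a few lines of Python. This is an illustrative sketch; the helper name `log_spaced` and the variable names are hypothetical.

```python
import math


def log_spaced(low, high, count):
    # count values from low to high (inclusive), evenly spaced in log10.
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (count - 1)
    return [10.0 ** (log_low + i * step) for i in range(count)]


# Single expert: zero plus 20 log-spaced values between 0.00004 and 0.4.
single_expert_costs = [0.0] + log_spaced(0.00004, 0.4, 20)

# Multiple experts: zero or one of 6 log-spaced values between 0.00004 and 0.2.
per_expert_costs = [0.0] + log_spaced(0.00004, 0.2, 6)
```

Rounded to five decimal places, the single-expert grid begins 0.00000, 0.00004, 0.00006, 0.00011 and ends at 0.40000.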
D.3 Convergence
Reactive agent.
Training for the reactive agents was straightforward and converged reliably on all datasets.
Iterative agent.
For the iterative agent with the interaction network or true simulation experts, convergence was also reliable for small numbers of ponder steps. Convergence was somewhat less reliable for larger numbers of ponder steps. We believe this is because for some scenes, a larger number of ponder steps was more than necessary to solve the task (as is evidenced by the plateauing performance in Figure 2). So, the iterative agent had to effectively “remember” what the best control was while it took the last few ponder steps, which is a more complicated and difficult task to perform.
For the iterative agent with the MLP expert, convergence was more variable, especially when the task was harder, as can be seen in the variable performance on the five planets dataset in Figure 2 (left). We believe this is because the MLP agent was so poor, and that convergence would have been more reliable with a better agent.
Metacontroller with a single expert.
The metacontroller agent with a single expert converged more reliably than the corresponding iterative agent (see the bottom row of Figure 3). As mentioned in the previous paragraph, the iterative agent had to take more steps than actually necessary, causing it to perform less well for larger numbers of ponder steps, whereas the metacontroller agent had the flexibility of stopping when it had found a good control. On the other hand, we found that the metacontroller agent sometimes performed too many ponder steps for large values of the ponder cost (see Figures 3 and 7). We believe this is due to the entropy term added to the Reinforce loss: when the ponder cost is very high, the optimal thing to do is to behave deterministically and always execute (never ponder); however, the entropy term encouraged the policy to be nondeterministic. We plan to explore different training regimes in future work to alleviate this problem, for example by annealing the entropy term to zero over the course of training.
Metacontroller with multiple experts.
The metacontroller agent with multiple experts was somewhat more difficult to train, especially for high ponder cost of the interaction network expert. For example, note how the proportion of steps using the MLP expert does not decrease monotonically in Figure 5 (right) with increasing cost for the MLP expert. We believe this is also an unexpected result of using the entropy term: in all of these cases, the optimal thing to do actually is to rely on the MLP expert 100% of the time, yet the entropy term encourages the policy to be nondeterministic. Future work will explore these difficulties further by using experts that complement each other better (i.e., so there is not one that is wholly better than the other).
Experts.
The experts themselves always converged quickly and reliably, and trained much faster than the rest of the network.
Table 1: Learning rates for the iterative agent, by dataset and number of ponder steps.
True sim.  MLP  IN
Dataset  # Ponder Steps
one planet  0  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
one planet  1  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
one planet  2  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
one planet  3  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
one planet  4  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
one planet  5  1e-03  1e-03  3e-03  5e-04  5e-04  1e-03
one planet  6  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
one planet  7  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
one planet  8  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
one planet  9  5e-04  1e-03  3e-03  5e-04  5e-04  1e-03
one planet  10  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
two planets  0  1e-03  1e-03  3e-03  1e-03  3e-03  3e-03
two planets  1  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
two planets  2  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
two planets  3  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
two planets  4  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
two planets  5  1e-03  1e-03  1e-03  1e-03  1e-03  1e-03
two planets  6  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
two planets  7  5e-04  1e-03  3e-03  5e-04  5e-04  1e-03
two planets  8  1e-03  1e-03  3e-03  5e-04  5e-04  1e-03
two planets  9  1e-03  1e-03  3e-03  5e-04  3e-03  3e-03
two planets  10  5e-04  1e-03  3e-03  1e-03  5e-04  1e-03
three planets  0  1e-03  1e-03  3e-03  1e-03  1e-03  3e-03
three planets  1  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
three planets  2  1e-03  5e-04  3e-03  1e-03  1e-03  1e-03
three planets  3  1e-03  1e-03  1e-03  5e-04  1e-03  1e-03
three planets  4  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
three planets  5  1e-03  1e-03  1e-03  5e-04  5e-04  1e-03
three planets  6  1e-03  5e-04  3e-03  5e-04  1e-03  1e-03
three planets  7  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
three planets  8  1e-03  1e-03  3e-03  1e-03  5e-04  1e-03
three planets  9  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
three planets  10  1e-03  5e-04  3e-03  1e-03  1e-03  1e-03
four planets  0  1e-03  5e-04  3e-03  5e-04  1e-03  1e-03
four planets  1  1e-03  5e-04  3e-03  1e-03  1e-03  1e-03
four planets  2  1e-03  5e-04  3e-03  1e-03  1e-03  1e-03
four planets  3  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
four planets  4  1e-03  5e-04  3e-03  1e-03  1e-03  1e-03
four planets  5  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
four planets  6  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
four planets  7  5e-04  1e-03  1e-03  1e-03  1e-03  1e-03
four planets  8  5e-04  1e-03  3e-03  1e-03  1e-03  1e-03
four planets  9  1e-03  1e-03  3e-03  1e-03  5e-04  1e-03
four planets  10  1e-03  1e-03  3e-03  1e-03  5e-04  1e-03
five planets  0  1e-03  1e-03  3e-03  5e-04  1e-03  3e-03
five planets  1  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
five planets  2  5e-04  1e-03  3e-03  5e-04  1e-03  1e-03
five planets  3  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
five planets  4  5e-04  1e-03  3e-03  5e-04  1e-03  1e-03
five planets  5  1e-03  5e-04  3e-03  1e-03  1e-03  1e-03
five planets  6  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
five planets  7  1e-03  1e-03  3e-03  1e-03  1e-03  3e-03
five planets  8  5e-04  1e-03  3e-03  1e-03  1e-03  3e-03
five planets  9  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03
five planets  10  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03
Table 2: Learning rates for the metacontroller with one expert. The first column is the ponder cost.
True sim.  MLP  IN
0.00000  5e-04  5e-04  5e-04  1e-03  3e-03  1e-03  5e-04  1e-04  1e-03
0.00004  1e-03  1e-04  1e-03  5e-05  3e-03  5e-04  1e-03  1e-03  1e-03
0.00006  5e-04  5e-05  1e-03  5e-04  3e-03  1e-03  5e-04  5e-05  1e-03
0.00011  1e-03  1e-04  1e-03  1e-04  3e-03  1e-03  5e-04  5e-04  1e-03
0.00017  5e-04  1e-04  1e-03  1e-03  3e-03  1e-03  1e-03  5e-05  1e-03
0.00028  1e-03  1e-03  1e-03  1e-03  3e-03  1e-03  5e-04  5e-05  1e-03
0.00045  1e-03  1e-03  5e-04  1e-04  3e-03  1e-03  1e-03  5e-05  1e-03
0.00073  1e-03  1e-04  1e-03  1e-04  3e-03  1e-03  1e-03  5e-05  1e-03
0.00119  1e-03  5e-05  1e-03  1e-04  5e-04  1e-03  5e-04  5e-04  1e-03
0.00193  1e-03  5e-05  1e-03  5e-05  3e-03  5e-04  1e-03  5e-05  1e-03
0.00314  1e-03  1e-04  1e-03  1e-04  3e-03  5e-04  1e-03  1e-04  1e-03
0.00510  1e-03  5e-05  1e-03  5e-05  3e-03  1e-03  1e-03  5e-05  1e-03
0.00828  1e-03  5e-04  1e-03  5e-04  3e-03  5e-04  1e-03  1e-03  1e-03
0.01344  1e-03  5e-05  1e-03  5e-05  3e-03  5e-04  5e-04  5e-05  1e-03
0.02182  1e-03  1e-04  1e-03  1e-04  3e-03  5e-04  1e-03  1e-04  1e-03
0.03543  1e-03  1e-04  1e-03  1e-04  3e-03  1e-03  1e-03  1e-04  1e-03
0.05754  1e-03  5e-04  1e-03  5e-04  3e-03  5e-04  1e-03  1e-04  1e-03
0.09343  1e-03  5e-05  1e-03  5e-05  3e-03  1e-03  1e-03  1e-04  1e-03
0.15171  1e-03  1e-04  1e-03  5e-04  3e-03  5e-04  1e-03  1e-04  1e-03
0.24634  1e-03  5e-05  1e-03  1e-03  3e-03  1e-03  1e-03  1e-03  1e-03
0.40000  1e-03  1e-03  1e-03  1e-03  3e-03  5e-04  1e-03  1e-03  1e-03
Table 3: Learning rates for the metacontroller with two experts. The first two columns are the ponder costs of the two experts.
IN + MLP
0.00000  0.00000  1e-03  5e-05  1e-03  1e-03
0.00000  0.00121  1e-03  5e-04  1e-03  1e-03
0.00000  0.00663  1e-03  1e-03  1e-03  1e-03
0.00000  0.03641  1e-03  5e-05  1e-03  1e-03
0.00000  0.20000  1e-03  5e-05  1e-03  1e-03
0.00000  0.30000  5e-04  1e-04  1e-03  1e-03
0.00000  0.40000  5e-04  5e-05  1e-03  1e-03
0.00121  0.00000  1e-03  1e-04  1e-03  1e-03
0.00121  0.00121  1e-03  5e-05  1e-03  1e-03
0.00121  0.00663  1e-03  1e-03  1e-03  1e-03
0.00121  0.03641  1e-03  1e-04  1e-03  1e-03
0.00121  0.20000  1e-03  5e-04  1e-03  1e-03
0.00121  0.30000  5e-04  5e-05  1e-03  1e-03
0.00121  0.40000  1e-03  1e-04  1e-03  1e-03
0.00663  0.00000  1e-03  1e-03  1e-03  1e-03
0.00663  0.00121  5e-04  5e-05  1e-03  1e-03
0.00663  0.00663  5e-04  1e-04  1e-03  1e-03
0.00663  0.03641  1e-03  1e-04  1e-03  1e-03
0.00663  0.20000  5e-04  5e-04  1e-03  1e-03
0.00663  0.30000  5e-04  1e-03  1e-03  1e-03
0.00663  0.40000  5e-04  1e-04  1e-03  1e-03
0.03641  0.00000  1e-03  5e-04  1e-03  1e-03
0.03641  0.00121  1e-03  5e-04  1e-03  1e-03
0.03641  0.00663  1e-03  1e-03  1e-03  1e-03
0.03641  0.03641  1e-03  5e-04  1e-03  1e-03
0.03641  0.20000  1e-03  1e-04  1e-03  1e-03
0.03641  0.30000  1e-03  5e-05  1e-03  1e-03
0.03641  0.40000  1e-03  1e-04  1e-03  1e-03
0.20000  0.00000  1e-03  5e-04  1e-03  1e-03
0.20000  0.00121  1e-03  5e-04  1e-03  1e-03
0.20000  0.00663  1e-03  5e-04  1e-03  1e-03
0.20000  0.03641  1e-03  1e-04  1e-03  1e-03
0.20000  0.20000  5e-04  1e-03  1e-03  1e-03
0.20000  0.30000  1e-03  5e-05  1e-03  1e-03
0.20000  0.40000  1e-03  5e-04  1e-03  1e-03
0.30000  0.00000  5e-04  1e-04  1e-03  1e-03
0.30000  0.00121  5e-04  1e-03  1e-03  1e-03
0.30000  0.00663  1e-03  1e-03  1e-03  1e-03
0.30000  0.03641  1e-03  5e-04  1e-03  1e-03
0.30000  0.20000  1e-03  1e-03  1e-03  1e-03
0.30000  0.30000  1e-03  1e-04  1e-03  1e-03
0.30000  0.40000  1e-03  5e-05  1e-03  1e-03
0.40000  0.00000  1e-03  1e-03  1e-03  1e-03
0.40000  0.00121  5e-04  1e-03  1e-03  1e-03
0.40000  0.00663  1e-03  5e-04  1e-03  1e-03
0.40000  0.03641  5e-04  1e-04  1e-03  1e-03
0.40000  0.20000  1e-03  1e-03  1e-03  1e-03
0.40000  0.30000  5e-04  1e-03  1e-03  1e-03
0.40000  0.40000  5e-04  5e-04  1e-03  1e-03
Footnotes
 Available from: https://www.github.com/deepmind/spaceship_dataset
References
 Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. arXiv:1606.04474, 2016.
 Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 2016.
 Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.
 Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv:1511.06297, 2015.
 Yoshua Bengio. Deep learning of representations: Looking forward. arXiv:1305.0445, 2013.
 Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72–83. Springer, 2006.
 Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning Visual Predictive Models of Physics for Playing Billiards. Proceedings of the International Conference on Learning Representations (ICLR 2016), pp. 1–12, 2015. URL http://arxiv.org/abs/1511.07404.
 Jan Gläscher, Nathaniel Daw, Peter Dayan, and John P. O’Doherty. States versus rewards: Dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4):585–595, 2010.
 Alex Graves. Adaptive computation time for recurrent neural networks. arXiv:1603.08983, 2016.
 Jessica B. Hamrick, Kevin A. Smith, Thomas L. Griffiths, and Edward Vul. Think again? The amount of mental simulation tracks uncertainty in the outcome. In Proceedings of the 37th Annual Conference of the Cognitive Science Society, 2015.
 Nicholas Hay, Stuart J. Russell, David Tolpin, and Solomon Eyal Shimony. Selecting computations: Theory and applications. Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, 2012.
 Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 2015.
 Mary Hegarty. Mechanical reasoning by mental simulation. Trends in Cognitive Sciences, 8(6):280–285, 2004.
 Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Matthew W Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pp. 327–336, 2011.
 Eric J. Horvitz. Reasoning about beliefs and actions under computational resource constraints. In Uncertainty in Artificial Intelligence, Vol. 3, 1988.
 Philip N. Johnson-Laird. Mental models and human reasoning. Proceedings of the National Academy of Sciences, 107(43):18243–18250, 2010.
 Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 Wouter Kool, Fiery A. Cushman, and Samuel J. Gershman. When does model-based control pay off? PLOS Computational Biology, in press.
 Sang Wan Lee, Shinsuke Shimojo, and John P. O’Doherty. Neural computations underlying arbitration between model-based and model-free learning. Neuron, 81:687–699, 2014.
 Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17:1–40, 2016.
 Falk Lieder and Thomas L. Griffiths. Strategy selection as rational metareasoning. in revision.
 Falk Lieder, Dillon Plunkett, Jessica B. Hamrick, Stuart J. Russell, Nicholas J. Hay, and Thomas L. Griffiths. Algorithm selection by rational metareasoning as a model of human strategy selection. Advances in Neural Information Processing Systems, 27:2870–2878, 2014.
 Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.
 Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Stuart Russell and Eric Wefald. Principles of metareasoning. Artificial Intelligence, 49(1):361–395, 1991.
 Jürgen Schmidhuber. An online algorithm for dynamic reinforcement learning and planning in reactive environments. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 1990a.
 Jürgen Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. Advances in Neural Information Processing Systems, 1990b.
 Bobak Shahriari, Ziyu Wang, Matthew W Hoffman, Alexandre Bouchard-Côté, and Nando de Freitas. An entropy search portfolio for Bayesian optimization. arXiv:1406.4625, 2014.
 David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. Proceedings of the 31st International Conference on Machine Learning, 2014.
 David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Aviv Tamar, Sergey Levine, and Pieter Abbeel. Value Iteration Networks. Advances in Neural Information Processing Systems, 2016. URL http://arxiv.org/abs/1602.02867.
 Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
 D.M. Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8):1317–1329, 1998.