Recurrent Environment Simulators
Abstract
Models that can simulate how environments change in response to actions can be used by agents to plan and act efficiently. We improve on previous environment simulators from high-dimensional pixel observations by introducing recurrent neural networks that are able to make temporally and spatially coherent predictions for hundreds of timesteps into the future. We present an in-depth analysis of the factors affecting performance, providing the most extensive attempt to advance the understanding of the properties of these models. We address the issue of computational inefficiency with a model that does not need to generate a high-dimensional image at each timestep. We show that our approach can be used to improve exploration and is adaptable to many diverse environments, namely 10 Atari games, a 3D car racing environment, and complex 3D mazes.
1 Introduction
In order to plan and act effectively, agent-based systems require an ability to anticipate the consequences of their actions within an environment, often for an extended period into the future. Agents can be equipped with this ability by having access to models that can simulate how the environment changes in response to their actions. The need for environment simulation is widespread: in psychology, model-based predictive abilities form sensorimotor contingencies that are seen as essential for perception \citep{oregan01sensorimotor}; in neuroscience, environment simulation forms part of deliberative planning systems used by the brain \citep{niv09reinforcement}; and in reinforcement learning, the ability to imagine the future evolution of an environment is needed to form predictive state representations \citep{littman06predictive} and for Monte Carlo planning \citep{sutton98reinforcement}.
Simulating an environment requires models of temporal sequences that must possess a number of properties to be useful: the models should make predictions that are accurate and temporally and spatially coherent over long time periods, and should allow for flexibility in the policies and action sequences that are used. In addition, these models should be general-purpose and scalable, and able to learn from high-dimensional perceptual inputs and from diverse and realistic environments. A model that achieves these desiderata can empower agent-based systems with a vast array of abilities, including counterfactual reasoning \citep{pearl09causality}, intuitive physical reasoning \citep{mccloskey83intuitive}, model-based exploration, episodic control \citep{lengyel07hippocampal}, intrinsic motivation \citep{oudeyer07intrinsic}, and hierarchical control.
Deep neural networks have recently enabled significant advances in simulating complex environments, allowing for models that consider high-dimensional visual inputs across a wide variety of domains \citep{wahlstrom15pixels, watter15embed, sun2015learning, patraucean15spatio}. The model of \citet{oh15action} represents the state-of-the-art in this area, demonstrating high long-term accuracy in deterministic and discrete-action environments.
Despite these advances, there are still several challenges and open questions. Firstly, the properties of these simulators in terms of generalisation and sensitivity to the choices of model structure and training are poorly understood. Secondly, accurate prediction for long time periods into the future remains difficult to achieve. Finally, these models are computationally inefficient, since they require the prediction of a high-dimensional image each time an action is executed, which is unnecessary in situations where the agent is interested only in the final prediction after taking several actions.
In this paper we advance the state-of-the-art in environment modelling. We build on the work of \citet{oh15action}, and develop alternative architectures and training schemes that significantly improve performance, and provide in-depth analysis to advance our understanding of the properties of these models. We also introduce a simulator that does not need to predict visual inputs after every action, reducing the computational burden in the use of the model. We test our simulators on three diverse and challenging families of environments, namely Atari 2600 games, a first-person game where an agent moves in randomly generated 3D mazes, and a 3D car racing environment; and show that they can be used for model-based exploration.
2 Recurrent Environment Simulators
An environment simulator is a model that, given a sequence of actions and corresponding observations of the environment, is able to predict the effect of subsequent actions, for example by forming predictions of future observations or state representations of the environment.
Our starting point is the recurrent simulator of \citet{oh15action}, which is the state-of-the-art in simulating deterministic environments with visual observations (frames) and discrete actions. This simulator is a recurrent neural network with the following backbone structure:
\begin{align*}
\mathbf{s}_t = f(\mathbf{s}_{t-1}, C(\hat{\mathbf{x}}_{t-1}/\mathbf{x}_{t-1})), \qquad \hat{\mathbf{x}}_t = D(\mathbf{s}_t, \mathbf{a}_{t-1})\,.
\end{align*}
In this equation, $\mathbf{s}_t$ is a hidden state representation of the environment, and $f$ a nonlinear deterministic state transition function. The symbol $\hat{\mathbf{x}}_{t-1}/\mathbf{x}_{t-1}$ indicates the selection of the predicted frame $\hat{\mathbf{x}}_{t-1}$ or real frame $\mathbf{x}_{t-1}$, producing two types of state transition called prediction-dependent transition and observation-dependent transition respectively. $C$ is an encoding function consisting of a series of convolutions, and $D$ is a decoding function that combines the state $\mathbf{s}_t$ with the action $\mathbf{a}_{t-1}$ through a multiplicative interaction, and then transforms it using a series of full convolutions to form the predicted frame $\hat{\mathbf{x}}_t$.
The model is trained to minimise the mean squared error between the observed time series of frames, corresponding to the evolution of the environment, and its prediction. In a probabilistic framework, this corresponds to maximising the log-likelihood in the graphical model depicted in Fig. 1(a). In this graph, the link from $\hat{\mathbf{x}}_t$ to $\mathbf{x}_t$ represents stochastic dependence, as $\mathbf{x}_t$ is formed by adding to $\hat{\mathbf{x}}_t$ a Gaussian noise term with zero mean and unit variance, whilst all remaining links represent deterministic dependences. The dashed lines indicate that only one of the two links is active, depending on whether the state transition is prediction-dependent or observation-dependent.
The model is trained using stochastic gradient descent, in which each minibatch consists of a set of segments of length $\tau + T$ randomly subsampled from the training sequences. For each segment in the minibatch, the model uses the first $\tau$ observations to evolve the state (warm-up) and forms predictions of the last $T$ observations only. Training comprises three phases differing in the use of prediction-dependent or observation-dependent transitions (after the first $\tau$ transitions) and in the value of the prediction length $T$. In the first phase, the model uses observation-dependent transitions; in the second and third phases, the model uses prediction-dependent transitions with increasing prediction lengths. During evaluation or usage, the model can only use prediction-dependent transitions.
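As a concrete illustration, the segment-based rollout described above can be sketched as follows. Here `C`, `f`, and `D` are toy stand-ins rather than the paper's convolutional and recurrent components, and all names and shapes are illustrative:

```python
import numpy as np

# Toy stand-ins for the encoder C, transition f, and decoder D
# (illustrative only; the real model uses convolutions and an LSTM).
def C(x):
    return 0.5 * x                      # "encoding" of a frame

def f(s, z):
    return np.tanh(s + z)               # deterministic state transition

def D(s, a):
    return s + 0.1 * a                  # action-conditioned "decoding"

def rollout(frames, actions, tau, pdt_mask, s0):
    """Evolve the state over a segment of length tau + T.

    The first tau transitions use observed frames (warm-up); for the
    remaining T steps, pdt_mask[k] = True selects a prediction-dependent
    transition and False an observation-dependent one. Predictions are
    only formed for the last T observations, as in the training objective.
    """
    s, preds, x_in = s0, [], frames[0]
    for t in range(1, len(frames)):
        s = f(s, C(x_in))
        x_hat = D(s, actions[t - 1])
        if t >= tau:
            preds.append(x_hat)
            x_in = x_hat if pdt_mask[t - tau] else frames[t]
        else:
            x_in = frames[t]
    return preds
```

Swapping the mask entries switches the same rollout between the two transition types without changing the warm-up phase.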
Action-Dependent State Transition
A notable feature of the model of \citet{oh15action} described above is that the actions influence the state transitions only indirectly through the predictions or the observations. Allowing the actions to condition the state transitions directly could potentially enable the model to incorporate action information more effectively. We therefore propose the following backbone structure:
\begin{align*}
\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{a}_{t-1}, C(\hat{\mathbf{x}}_{t-1}/\mathbf{x}_{t-1})), \qquad \hat{\mathbf{x}}_t = D(\mathbf{s}_t)\,.
\end{align*}
In the graphical model representation, this corresponds to replacing the link from $\mathbf{a}_{t-1}$ to $\hat{\mathbf{x}}_t$ with a link from $\mathbf{a}_{t-1}$ to $\mathbf{s}_t$, as in Fig. 1(b).
Short-Term versus Long-Term Accuracy
The last two phases in the training scheme of \citet{oh15action} described above are used to address the poor accuracy that recurrent neural networks trained using only observation-dependent transitions display when asked to predict several timesteps ahead. However, the paper neither analyses nor discusses alternative training schemes.
In principle, the highest accuracy should be obtained by training the model as closely as possible to the way it will be used, and therefore by using a number of prediction-dependent transitions which is as close as possible to the number of timesteps the model will be asked to predict for. However, prediction-dependent transitions increase the complexity of the objective function, such that alternative schemes are most often used \citep{talvitie14model,bengio15scheduled,oh15action}. Current training approaches are guided by the belief that using the observation $\mathbf{x}_{t-1}$, rather than the prediction $\hat{\mathbf{x}}_{t-1}$, to form the state $\mathbf{s}_t$ has the effect of reducing the propagation of the errors made in the predictions, which are higher at earlier stages of the training, enabling the model to correct itself from the mistakes made up to timestep $t-1$. For example, \citet{bengio15scheduled} introduce a scheduled sampling approach where at each timestep the type of state transition is sampled from a Bernoulli distribution, with parameter annealed from an initial value corresponding to using only observation-dependent transitions to a final value corresponding to using only prediction-dependent transitions, according to a schedule selected by validation.
Our analysis of different training schemes on Atari, which considered the interplay among warm-up length $\tau$, prediction length $T$, and the number of prediction-dependent transitions, suggests that, rather than having a corrective effect, observation-dependent transitions should be seen as restricting the time interval over which the model exercises its predictive abilities, and therefore as focussing its resources. Indeed, we found that the higher the number of consecutive prediction-dependent transitions, the more the model is encouraged to focus on learning the global dynamics of the environment, which results in higher long-term accuracy. The highest long-term accuracy is always obtained by a training scheme that uses only prediction-dependent transitions, even at the early stages of the training. Focussing on learning the global dynamics comes at the price of shifting model resources away from learning the precise details of the frames, leading to a decrease in short-term accuracy. Therefore, for complex games for which reasonable long-term accuracy cannot be obtained, training schemes that mix prediction-dependent and observation-dependent transitions are preferable. It follows from this analysis that the percentage of consecutive prediction-dependent transitions, rather than just the percentage of such transitions, should be considered when designing training schemes.
From this viewpoint, the poor results obtained by \citet{bengio15scheduled} when using only prediction-dependent transitions can be explained by the difference in the type of tasks considered. Indeed, unlike our case, in which the model is tolerant to some degree of error, such as blurriness in earlier predictions, the discrete problems considered in \citet{bengio15scheduled} are such that one prediction error at earlier timesteps can severely affect predictions at later timesteps, so that the model needs to be highly accurate short-term in order to perform reasonably longer-term. Also, \citet{bengio15scheduled} treated the prediction used to form the state as a fixed quantity, rather than as a function of the model parameters, and therefore did not perform exact maximum likelihood.
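For illustration, the scheduled sampling choice between transition types can be sketched as below. The linear annealing of the Bernoulli parameter `eps` is one illustrative schedule, not the one used in the cited work, where the schedule is selected by validation:

```python
import numpy as np

def scheduled_mask(step, total_steps, T, rng):
    """Sample per-timestep transition choices as in scheduled sampling.

    eps is the probability of an observation-dependent transition; here it
    is annealed linearly from 1 (all observation-dependent) to 0 (all
    prediction-dependent) over training.
    """
    eps = max(0.0, 1.0 - step / total_steps)   # linear schedule (one option)
    return rng.random(T) >= eps                # True -> prediction-dependent
```

At the start of training every transition is observation-dependent; by the end, every transition is prediction-dependent.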
Prediction-Independent State Transition
In addition to potentially enabling the model to incorporate action information more effectively, allowing the actions to directly influence the state dynamics has another crucial advantage: it allows us to consider the case of a state transition that does not depend on the frame, i.e. of the form $\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{a}_{t-1})$, corresponding to removing the dashed links from $\hat{\mathbf{x}}_{t-1}$ and from $\mathbf{x}_{t-1}$ to $\mathbf{s}_t$ in Fig. 1(b). We shall call such a model a prediction-independent simulator, referring to its ability to evolve the state without using the prediction during usage. Prediction-independent state transitions for high-dimensional observation problems have also been considered by Srivastava et al. (2015).
A prediction-independent simulator can dramatically increase computational efficiency in situations in which the agent is interested in the effect of a sequence of actions rather than of a single action. Indeed, such a model does not need to project from the lower-dimensional state space into the higher-dimensional observation space through the set of convolutions, and vice versa, at each timestep.
3 Prediction-Dependent Simulators
We analyse simulators with state transition of the form $\mathbf{s}_t = f(\mathbf{s}_{t-1}, \mathbf{a}_{t-1}, C(\hat{\mathbf{x}}_{t-1}/\mathbf{x}_{t-1}))$ on three families of environments with different characteristics and challenges, namely Atari 2600 games from the arcade learning environment \citep{bellemare13arcade}, a first-person game where an agent moves in randomly generated 3D mazes \citep{beattie16deepmind}, and a 3D car racing environment called TORCS \citep{wymann13torcs}. We use two evaluation protocols. In the first one, the model is asked to predict 100 or 200 timesteps into the future using actions from the test data. In the second one, a human uses the model as an interactive simulator. The first protocol enables us to determine how the model performs within the action policy of the training data, whilst the second protocol enables us to explore how the model generalises to other action policies.
As state transition, we used the following action-conditioned long short-term memory (LSTM) \citep{hochreiter97long}:
Encoding: $\mathbf{z}_{t-1} = C(\hat{\mathbf{x}}_{t-1}/\mathbf{x}_{t-1})$ \hfill (1)
Action fusion: $\mathbf{v}_{t-1} = \mathbf{W}^{h}\mathbf{h}_{t-1} \odot \mathbf{W}^{a}\mathbf{a}_{t-1}$ \hfill (2)
Gate update: $\mathbf{i}_t = \sigma(\mathbf{W}^{iv}\mathbf{v}_{t-1} + \mathbf{W}^{iz}\mathbf{z}_{t-1})$, $\mathbf{f}_t = \sigma(\mathbf{W}^{fv}\mathbf{v}_{t-1} + \mathbf{W}^{fz}\mathbf{z}_{t-1})$, $\mathbf{o}_t = \sigma(\mathbf{W}^{ov}\mathbf{v}_{t-1} + \mathbf{W}^{oz}\mathbf{z}_{t-1})$ \hfill (3)
Cell update: $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}^{cv}\mathbf{v}_{t-1} + \mathbf{W}^{cz}\mathbf{z}_{t-1})$ \hfill (4)
State update: $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$ \hfill (5)
where $\odot$ denotes the Hadamard product, $\sigma$ the logistic sigmoid function, $\mathbf{a}_{t-1}$ is a one-hot vector representation of the action, and the $\mathbf{W}$ are parameter matrices. In Eqs. (2)–(5), $\mathbf{h}_t$ and $\mathbf{c}_t$ are the LSTM state and cell forming the model state $\mathbf{s}_t$; and $\mathbf{i}_t$, $\mathbf{f}_t$, and $\mathbf{o}_t$ are the input, forget, and output gates respectively (for simplicity, we omit the biases in their updates). The vectors $\mathbf{h}_t$ and $\mathbf{v}_{t-1}$ had dimension 1024 and 2048 respectively. Details about the encoding and decoding functions $C$ and $D$ for the three families of environments can be found in Appendix B.1, B.2 and B.3. We used a warm-up phase of length $\tau$, and we did not backpropagate the gradient to this phase.
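A minimal numpy sketch of one such transition step is given below, assuming the gate and cell updates above; the dimensions are small and illustrative (the paper uses 1024/2048), and the parameter dictionary `p` is a hypothetical container for the matrices:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(h, c, z, a, p):
    """One action-conditioned LSTM transition, a sketch of Eqs. (2)-(5).

    h, c: LSTM state and cell; z: encoded frame; a: one-hot action;
    p: dict of parameter matrices (biases omitted, as in the text).
    """
    v = (p['Wh'] @ h) * (p['Wa'] @ a)                         # action fusion
    i = sigmoid(p['Wiv'] @ v + p['Wiz'] @ z)                  # input gate
    f = sigmoid(p['Wfv'] @ v + p['Wfz'] @ z)                  # forget gate
    o = sigmoid(p['Wov'] @ v + p['Woz'] @ z)                  # output gate
    c_new = f * c + i * np.tanh(p['Wcv'] @ v + p['Wcz'] @ z)  # cell update
    h_new = o * np.tanh(c_new)                                # state update
    return h_new, c_new

# Small illustrative dimensions: state 8, fusion vector 16, encoding 12,
# 4 discrete actions.
rng = np.random.default_rng(0)
nh, nv, nz, na = 8, 16, 12, 4
p = {'Wh': rng.standard_normal((nv, nh)),
     'Wa': rng.standard_normal((nv, na))}
for g in ('i', 'f', 'o', 'c'):
    p[f'W{g}v'] = rng.standard_normal((nh, nv))
    p[f'W{g}z'] = rng.standard_normal((nh, nz))
```

Note that, because the output gate and $\tanh$ are both bounded, each component of the new state lies strictly inside $(-1, 1)$.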
3.1 Atari
We considered the 10 Atari games Freeway, Ms Pacman, Qbert, Seaquest, Space Invaders, Bowling, Breakout, Fishing Derby, Pong, and Riverraid. Of these, the first five were analysed in \citet{oh15action} and are used for comparison. The remaining five were chosen to better test the ability of the model in environments with other challenging characteristics, such as scrolling backgrounds (Riverraid), small/thin objects that are key aspects of the game (lines in Fishing Derby, ball in Pong and Breakout), and sparse-reward games that require very long-term predictions (Bowling). We used training and test datasets consisting of five and one million 210×160 RGB images respectively, with actions chosen from a trained DQN agent \citep{mnih15dqn} according to an $\epsilon$-greedy policy. Such a large number of training frames ensured that our simulators did not strongly overfit to the training data (see training and test lines in Figs. 2 and 3, and the discussion in Appendix B.1).
Short-Term versus Long-Term Accuracy
Below we summarise our results on the interplay among warm-up length $\tau$, prediction length $T$, and number of prediction-dependent transitions – the full analysis is given in Appendix B.1.1.
The warm-up and prediction lengths $\tau$ and $T$ regulate the degree of accuracy in two different ways. 1) The value of $\tau$ determines how far into the past the model can access information – this is the case irrespective of the type of transition used, although when using prediction-dependent transitions information about the most recent timesteps of the environment needs to be inferred. Accessing information far back into the past can be necessary even when the model is used to perform one-step ahead prediction only. 2) The higher the value of $T$ and the number of prediction-dependent transitions, the more the corresponding objective function encourages long-term accuracy. This is achieved by guiding the one-step ahead prediction error in such a way that further predictions will not be strongly affected, and by teaching the model to make use of information from the far past. The more precise the model is in performing one-step ahead prediction, the less such guidance should be required. Therefore, models with very accurate convolutional and transition structures should need less encouragement.
Increasing the percentage of consecutive prediction-dependent transitions increases long-term accuracy, often at the expense of short-term accuracy. We found that using only observation-dependent transitions leads to poor performance in most games. Increasing the number of consecutive prediction-dependent transitions produces an increase in long-term accuracy, but also a decrease in short-term accuracy, usually corresponding to a reduction in sharpness. For games that are too complex, although the lowest long-term prediction error is still achieved by using only prediction-dependent transitions, reasonable long-term accuracy cannot be obtained, and training schemes that mix prediction-dependent and observation-dependent transitions are therefore preferable.
To illustrate these results, we compare the following training schemes for prediction length $T = 15$:
\begin{itemize}[leftmargin=*]
\item \textbf{0\% PDT}: Only observation-dependent transitions.
\item \textbf{33\% PDT}: Observation-dependent and prediction-dependent transitions for the first 10 and last 5 timesteps respectively.
\item \textbf{0\%-20\%-33\% PDT}: Only observation-dependent transitions in the first 10,000 parameter updates; observation-dependent transitions for the first 12 timesteps and prediction-dependent transitions for the last 3 timesteps for the subsequent 100,000 parameter updates; observation-dependent transitions for the first 10 timesteps and prediction-dependent transitions for the last 5 timesteps for the remaining parameter updates (adaptation of the training scheme of \citet{oh15action} to $T = 15$).
\item \textbf{46\% PDT Alt.}: Alternate between observation-dependent and prediction-dependent transitions from one timestep to the next.
\item \textbf{46\% PDT}: Observation-dependent and prediction-dependent transitions for the first 8 and last 7 timesteps respectively.
\item \textbf{67\% PDT}: Observation-dependent and prediction-dependent transitions for the first 5 and last 10 timesteps respectively.
\item \textbf{0\%-100\% PDT}: Only observation-dependent transitions in the first 1000 parameter updates; only prediction-dependent transitions in the subsequent parameter updates.
\item \textbf{100\% PDT}: Only prediction-dependent transitions.
\end{itemize}
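Assuming prediction length $T = 15$, the single-phase schemes above correspond to the following per-timestep transition masks; the helper function illustrates the notion of consecutive prediction-dependent transitions emphasised in the analysis:

```python
T = 15  # prediction length used in the comparison

# True = prediction-dependent transition (PDT), False = observation-dependent.
schemes = {
    '0% PDT':       [False] * T,
    '33% PDT':      [False] * 10 + [True] * 5,
    '46% PDT Alt.': [t % 2 == 1 for t in range(T)],
    '46% PDT':      [False] * 8 + [True] * 7,
    '67% PDT':      [False] * 5 + [True] * 10,
    '100% PDT':     [True] * T,
}

def longest_consecutive_pdt(mask):
    """Length of the longest run of consecutive PDTs, the quantity the
    analysis suggests matters most for long-term accuracy."""
    best = run = 0
    for m in mask:
        run = run + 1 if m else 0
        best = max(best, run)
    return best
```

Note that 46% PDT and 46% PDT Alt. contain the same number of prediction-dependent transitions (7 of 15), but differ sharply in the longest consecutive run (7 versus 1), which is what separates their long-term behaviour.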
For completeness, we also consider a training scheme as in \citet{oh15action}, which consists of three phases with increasing prediction length and 500,000, 250,000, and 750,000 parameter updates respectively. In the first phase the state is formed by using the observed frame $\mathbf{x}_{t-1}$, whilst in the two subsequent phases the state is formed by using the predicted frame $\hat{\mathbf{x}}_{t-1}$.
In Figs. 2 and 3 we show the prediction error averaged over 10,000 sequences.
These figures clearly show that long-term accuracy generally improves with an increasing number of consecutive prediction-dependent transitions. When using alternating (46% PDT Alt.), rather than consecutive (46% PDT), prediction-dependent transitions, long-term accuracy is worse, as we are effectively asking the model to predict at most two timesteps ahead. We can also see that using more prediction-dependent transitions produces lower short-term accuracy and/or slower short-term convergence. Finally, the figures show that using a training phase with only observation-dependent transitions that is too long, as in \citet{oh15action}, can be detrimental: the model reaches at best a performance similar to the 46% PDT Alt. training scheme (the sudden drop in prediction error corresponds to transitioning to the second training phase), but is most often worse.
By looking at the predicted frames, we noticed that, in games containing balls and paddles, using only observation-dependent transitions gives rise to errors in reproducing the dynamics of these objects. Such errors decrease with an increasing number of prediction-dependent transitions. In other games, using only observation-dependent transitions causes the model to fail in representing moving objects, except for the agent in most cases. Training schemes containing more prediction-dependent transitions encourage the model to focus more on learning the dynamics of the moving objects and less on details that would only increase short-term accuracy, giving rise to more globally accurate but less sharp predictions. Finally, in games that are too complex, the strong emphasis on long-term accuracy produces predictions that are overall not sufficiently good.
More specifically, from the videos available at
The trading off of long-term for short-term accuracy when using more prediction-dependent transitions is particularly evident in the videos of Seaquest: the higher the number of such transitions, the better the model learns the dynamics of the game, with new fish appearing in the right location more often. However, this comes at the price of reduced sharpness, mostly in representing the fish.
This trade-off causes problems in Breakout, Ms Pacman, Qbert, and Space Invaders, so that schemes that also use observation-dependent transitions are preferable for these games. For example, in Breakout, the model fails at representing the ball, making the predictions not sufficiently good. Notice that the prediction error (see Fig. 15) is misleading in terms of desired performance, as the 100% PDT training scheme performs as well as the other mixing schemes for long-term accuracy – this highlights the difficulties in evaluating the performance of these models.
Increasing the prediction length $T$ increases long-term accuracy when using prediction-dependent transitions. In Fig. 4, we show the effect of using different prediction lengths $T$ on the training schemes 0% PDT, 67% PDT, and 100% PDT for Pong and Seaquest. In Pong, with the 0% PDT training scheme, using higher $T$ improves long-term accuracy: this is a game for which this scheme gives reasonable accuracy and the model is able to benefit from a longer history. This is however not the case for Seaquest (or other games, as shown in Appendix B.1.1). On the other hand, with the 100% PDT training scheme, using higher $T$ improves long-term accuracy in most games (the difference is more pronounced between the two smaller values of $T$ than between the two larger ones), but decreases short-term accuracy. Similarly to above, reduced short-term accuracy corresponds to reduced sharpness: from the videos available at we can see, for example, that the moving caught fish in Fishing Derby, the fish in Seaquest, and the ball in Pong are less sharp for higher $T$.
Truncated backpropagation still enables an increase in long-term accuracy. Due to memory constraints, we could only backpropagate gradients over sequences of length up to 20. To use larger $T$, we split the prediction sequence into subsequences and performed parameter updates separately for each subsequence. For example, to use $T = 30$ we split the prediction sequence into two successive subsequences of length 15, performed parameter updates over the first subsequence, initialised the state of the second subsequence with the final state from the first subsequence, and then performed parameter updates over the second subsequence. This approach corresponds to a form of truncated backpropagation through time \citep{williams95gradient} – the extreme of this strategy (splitting the entire training sequence into subsequences and carrying the state across them) was used by \citet{zarembe14recurrent}.
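The subsequence-splitting procedure can be sketched as follows; `update_fn` is a hypothetical stand-in for one gradient step:

```python
def bptt_updates(frames, actions, k, n, update_fn, s0):
    """BPTT(k, n): split a length k*n prediction sequence into n successive
    subsequences of length k. Each subsequence gets its own parameter
    update, and the final state of one subsequence initialises the next, so
    the gradient is truncated at subsequence boundaries.

    update_fn is a stand-in for one gradient step; it receives a chunk of
    frames/actions and an initial state, and returns the final state.
    """
    s = s0
    for j in range(n):
        s = update_fn(frames[j * k:(j + 1) * k],
                      actions[j * k:(j + 1) * k], s)
    return s
```

With k = 15 and n = 2 this reproduces the BPTT(15, 2) setting described above: two updates per sequence, state carried across, gradients never flowing between the halves.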
In Fig. 5, we show the effect of using 2 and 5 subsequences of length 15 (indicated by BPTT(15, 2) and BPTT(15, 5)) on the training schemes 0% PDT, 33% PDT, and 100% PDT for Pong and Seaquest. We can see that the 0% PDT and 33% PDT training schemes display no difference in accuracy for different numbers of subsequences. On the other hand, with the 100% PDT training scheme, using more than one subsequence improves long-term accuracy (the difference is more pronounced between one and two subsequences than between two and five), but decreases short-term accuracy (the difference is small at convergence between one and two subsequences, but big between two and five). The decrease in accuracy with 5 subsequences is drastic in some games.
For Riverraid, using more than one subsequence with the 33% PDT and 100% PDT training schemes improves long-term accuracy dramatically, as shown in Fig. 6, as it enables correct prediction after a jet loss. Interestingly, for the 100% PDT training scheme, using a single subsequence with a longer warm-up (black line) does not give the same amount of gain as using BPTT(15, 2), even if the history length is the same. This would seem to suggest that some of the improvement in BPTT(15, 2) is due to encouraging longer-term accuracy, indicating that this can be achieved even when not fully backpropagating the gradient.
From the videos available at , we can see that with five subsequences the predictions in some of the Fishing Derby videos are faded, whilst in Pong the model can suddenly switch from one dynamics to another for the ball and the opponent's paddle.
In conclusion, using higher $T$ through truncated backpropagation can improve performance. However, in schemes that use many prediction-dependent transitions, a high value of $T$ can lead to poor predictions.
Evaluation through Human Play
Whilst we cannot expect our simulators to generalise to structured sequences of actions never chosen by the DQN and not present in the training data, such as moving the agent up and down the alley in Bowling, it is reasonable to expect some degree of generalisation in the action-wise simple environments of Breakout, Freeway and Pong.
We tested these three games by having humans use the models as interactive simulators. We generally found that models trained using only prediction-dependent transitions were more fragile to states of the environment not experienced during training, such that the humans were able to play these games for longer with simulators trained with mixed training schemes. This seems to indicate that models with higher long-term test accuracy are at higher risk of overfitting to the training policy.
In Fig. 7(a), we show some salient frames from a game of Pong played by a human for 500 timesteps (the corresponding video is available at PongHPlay). The game starts with score (2,0), after which the opponent scores five times, whilst the human player scores twice. As we can see, the scoring is updated correctly and the game dynamics is accurate. In Fig. 7(b), we show some salient frames from a game of Breakout played by a human for 350 timesteps (the corresponding video is available at BreakoutHPlay). As for Pong, the scoring is updated correctly and the game dynamics is accurate. These images demonstrate some degree of generalisation of the model to a human style of play.
Evaluation of State Transitions Structures
In Appendix B.1.2 and B.1.3 we present an extensive evaluation of different action-dependent state transitions, including convolutional transformations for the action fusion and the gate and cell updates, and different ways of incorporating action information. We also present a comparison between action-dependent and action-independent state transitions.
Some action-dependent state transitions give better performance than the baseline (Eqs. (1)–(5)) in some games. For example, we found that increasing the state dimension from 1024 to the dimension of the convolved frame, namely 2816, might be preferable. Interestingly, this is not due to an increase in the number of parameters, as the same gain is obtained using convolutions for the gate and cell updates. These results seem to suggest that high-dimensional sparse state transition structures could be a promising direction for further improvement. Regarding different ways of incorporating action information, we found that local incorporation, such as augmenting the frame with action information, and indirect action influence give worse performance than direct and global action influence, but that there are several ways of incorporating action information directly and globally that give similar performance.
3.2 3D Environments
Both TORCS and the 3D maze environments highlight the need to learn dynamics that are temporally and spatially coherent: TORCS exposes the need to learn fast-moving dynamics and consistency under motion, whilst 3D mazes are partially observed and therefore require the simulator to build an internal representation of its surroundings using memory, as well as learn basic physics, such as rotation, momentum, and the solid properties of walls.
TORCS. The data was generated using an artificial agent controlling a fast car without opponents (more details are given in Appendix B.2).
When using actions from the test set (see Fig. 49 and the corresponding video at TORCS), the simulator was able to produce accurate predictions for up to several hundred timesteps. As the car moved around the racing track, the simulator was able to predict the appearance of new features in the background (towers, sitting areas, lamp posts, etc.), as well as model the jerky motion of the car caused by our choices of random actions. Finally, the instruments (speedometer and rpm) were correctly displayed.
The simulator was good enough to be used interactively for several hundred frames, using actions provided by a human. This showed that the model had learnt well how to deal with the car hitting the wall on the right side of the track. Some salient frames from the game are shown in Fig. 8 (the corresponding video can be seen at TORCSHPlay).
3D Mazes. We used an environment that consists of randomly generated 3D mazes, containing textured surfaces with occasional paintings on the walls: the mazes were all of the same size, but differed in the layout of rooms and corridors, and in the locations of paintings (see Fig. 11(b) for an example of layout). More details are given in Appendix B.3.
When using actions from the test set, the simulator was able to predict frames very reasonably even after 200 steps. In Fig. 9 we compare predicted frames to the real frames at several timesteps (the corresponding video can be seen at 3DMazes). We can see that the wall layout is better predicted when walls are closer to the agent, and that corridors and far-away walls are not as long as they should be. The lighting on the ceiling is correct on all the frames shown.
When using the simulator interactively with actions provided by a human, we could test that the simulator had learnt consistent aspects of the maze: when walking into walls, the model maintained their position and layout (in one case we were able to walk through a painting on the wall – paintings are rare in the dataset and hence it is not unreasonable that they would not be maintained when stress-testing the model in this way). When taking spins, the wall configurations were the same as previously generated and not regenerated afresh, as shown in Fig. 10 (see also 3DMazesHPLay). The coherence of the maze was good for nearby walls, but not at the end of long corridors.
3.3 Model-Based Exploration
The search for exploration strategies better than $\epsilon$-greedy is an active area of research. Various solutions have been proposed, such as density-based or optimistic exploration \citep{auer02finite}. \citet{oh15action} considered a memory-based approach that steers the agent towards previously unobserved frames. In this section, we test our simulators using a similar approach, but select a group of actions rather than a single action at a time. Furthermore, rather than a fixed 2D environment, we consider the more challenging 3D mazes environment. This also enables us to present a qualitative analysis, as we can exactly measure and plot the proportion of the maze visited over time. Our aim is to be quantitatively and qualitatively better than random exploration (using the dithering value that led to the best possible random agent).
We used a 3D maze simulator to predict the outcome of sequences of actions, chosen with a hard-coded policy. Our algorithm (see below) performed Monte-Carlo simulations with randomly selected sequences of actions of fixed length. At each timestep, we stored the last 10 observed frames in an episodic memory buffer and compared predicted frames to those in memory.
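A sketch of this Monte-Carlo selection is given below; `simulate` and the distance-to-memory score are illustrative stand-ins for the simulator and the novelty measure, not the exact implementation:

```python
import numpy as np

def choose_action_sequence(simulate, memory, n_actions, d, n_sims, rng):
    """Monte-Carlo action selection for model-based exploration (a sketch).

    Samples n_sims random action sequences of length d, simulates each one,
    and returns the sequence whose predicted final frame is farthest (in
    mean squared error) from the closest frame in the episodic memory
    buffer.
    """
    best_seq, best_score = None, -np.inf
    for _ in range(n_sims):
        seq = rng.integers(n_actions, size=d)       # random action sequence
        pred = simulate(seq)                        # predicted final frame
        score = min(float(np.mean((pred - m) ** 2)) for m in memory)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```

With 100 simulations and sequences of 6 actions, this matches the setting reported below; committing to a whole sequence at a time is what produces the smoother trajectories.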
Our method (see Fig. 11(a)) covered a considerably larger portion of the maze area than random exploration over the same number of timesteps. These results were obtained with 100 Monte Carlo simulations and sequences of 6 actions (more details are given in Appendix B.4). Comparing typical paths chosen by the random explorer and by our explorer (see Fig. 11(b)), we see that our explorer has much smoother trajectories.
This is a good local exploration strategy that leads to faster movement through corridors. To transform this into a good global exploration strategy, our explorer would have to be augmented with a better memory in order to avoid going down the same corridor twice. These sorts of smooth local exploration strategies could also be useful in navigation problems.
4 Prediction-Independent Simulators
A prediction-independent simulator has state transitions that depend only on the previous state and action, and therefore do not require the high-dimensional predictions. In the Atari environment, for example, this avoids having to project from the state space of dimension 1024 into the observation space of dimension 100,800 (210 × 160 × 3) through the decoding function, and vice versa through the encoding function; in the structure used, this saves around 200 million flops at each timestep.
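The 200-million-flop figure can be sanity-checked with back-of-the-envelope arithmetic, assuming (as a simplification; the actual maps are convolutional, so this is only an order-of-magnitude check) one dense projection between the 1024-dimensional state and the 100,800-dimensional observation in each direction:

```python
state_dim = 1024
obs_dim = 210 * 160 * 3           # 100,800 pixel values per RGB frame

# A dense projection in one direction costs state_dim * obs_dim
# multiply-accumulates; decoding plus encoding roughly doubles it.
per_direction = state_dim * obs_dim
total = 2 * per_direction
print(per_direction, total)       # about 103M and 206M
```

The result, roughly 206 million multiply-accumulates per timestep, matches the "around 200 million flops" figure quoted above.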
For the state transition, we found that a working structure was to use Eqs. (1)–(5) with different parameters for the warm-up and prediction phases. As for the prediction-dependent simulator, we used a warm-up phase, but we did backpropagate the gradient back to timestep five in order to learn the encoding function.
Our analysis on Atari (see Appendix C) suggests that the prediction-independent simulator is much more sensitive to changes in the state transition structure and in the training scheme than the prediction-dependent simulator. We found that short prediction lengths gave much worse long-term accuracy than with the prediction-dependent simulator. This problem could be alleviated by using longer prediction lengths through truncated backpropagation.
Fig. 12 shows a comparison of the prediction-dependent and prediction-independent simulators trained through two subsequences of length 15 (we indicate this as BPTT(15, 2), even though in the prediction-independent simulator we did backpropagate the gradient to the warm-up phase).
Looking at the videos available at PISimulators, we can notice that the prediction-independent simulator tends to give worse long-term predictions. In Fishing Derby, for example, the model tends in the long term to create smaller fish in addition to the fish present in the real frames. Nevertheless, for some difficult games the prediction-independent simulator achieves better performance than the prediction-dependent simulator. Further investigation of alternative state transitions and training schemes would be needed to reach the same overall level of accuracy as the prediction-dependent simulator.
5 Discussion
In this paper we have introduced an approach to simulating action-conditional dynamics and demonstrated that it is highly adaptable to different environments, ranging from Atari games to a 3D car racing environment and mazes. We showed state-of-the-art results on Atari, and demonstrated the feasibility of live human play in all three task families. The system is able to capture complex and long-term interactions, and displays a degree of spatial and temporal coherence that has, to our knowledge, not been demonstrated on high-dimensional time-series data such as these.
We have presented an in-depth analysis of the effect of different training approaches on short-term and long-term prediction capabilities, and showed that moving towards schemes in which the simulator relies less on past observations to form future predictions has the effect of focusing model resources on learning the global dynamics of the environment, leading to dramatic improvements in long-term prediction. However, this requires a distribution of resources that impacts short-term performance, which can harm the overall performance of the model in some games. This trade-off also makes the model less robust to states of the environment not seen during training. Alleviating this problem would require the design of more sophisticated model architectures than the ones considered here. Whilst it is also expected that more ad-hoc architectures would be less sensitive to different training approaches, we believe that guiding the noise, as well as teaching the model to make use of past information through the objective function, would still be beneficial for improving long-term prediction.
Complex environments have compositional structure, such as independently moving objects and other phenomena that only rarely interact. In order for our simulators to better capture this compositional structure, we may need to develop specialised functional forms and memory stores that are better suited to dealing with independent representations and their interlinked interactions and relationships. More homogeneous deep network architectures such as the one presented here are clearly not optimal for these domains, as can be seen in Atari environments such as Ms Pacman, where the system has trouble keeping track of multiple independently moving ghosts. Whilst the LSTM memory and our training scheme have proven able to capture long-term dependencies, alternative memory structures are required in order, for example, to learn spatial coherence at a more global level than the one displayed by our model in the 3D mazes, so as to support navigation.
In the case of action-conditional dynamics, the policy-induced data distribution does not cover the state space and might in fact be non-stationary over an agent's lifetime. This can cause some regions of the state space to be over-sampled, while the regions we might actually care about the most, those just around the agent's policy state distribution, are under-represented. In addition, this induces biases in the data that will ultimately prevent the model from learning the environment dynamics correctly. As verified by the experiments in this paper, both on live human play and on model-based exploration, this problem is not yet as pressing as might be expected in some environments. However, our simulators displayed limitations and faults due to the specificities of the training data, such as predicting an event based on the recognition of a particular sequence of actions that always co-occurred with this event in the training data, rather than on the recognition of the real causes.
Finally, a limitation of our approach is that, however capable it might be, it is a deterministic model designed for deterministic environments. Clearly, most real-world environments involve noisy state transitions, and future work will have to address extending the techniques developed in this paper to generative temporal models.
Acknowledgments
The authors would like to thank David Barber for helping with the graphical model interpretation, Alex Pritzel for preparing the DQN data, Yori Zwols and Frederic Besse for helping with the implementation of the model, and Oriol Vinyals, Yee Whye Teh, Junhyuk Oh, and the anonymous reviewers for useful discussions and feedback on the manuscript.
Appendix A Data, Preprocessing and Training Algorithm
When generating the data, each selected action was repeated for 4 timesteps and only the 4th frame was recorded for the analysis. The RGB images were preprocessed by subtracting mean pixel values (calculated separately for each color channel and over an initial set of 2048 frames only) and by dividing each pixel value by 255.
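The preprocessing step can be sketched as follows (the function name is ours; we divide by 255 first and subtract the mean in scaled units, which is equivalent to the order described above):

```python
import numpy as np

def preprocess(frames, mean_frames=2048):
    """Per-channel mean subtraction and [0, 1] scaling.

    `frames`: uint8 array of shape (N, H, W, 3). As described above, the
    mean is computed separately per colour channel and over only the first
    `mean_frames` frames.
    """
    x = frames.astype(np.float32) / 255.0
    # keepdims lets the (1, 1, 1, 3) channel mean broadcast over all frames
    channel_mean = x[:mean_frames].mean(axis=(0, 1, 2), keepdims=True)
    return x - channel_mean
```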
As the stochastic gradient algorithm, we used centered RMSProp \citepgraves13generating.
Appendix B Prediction-Dependent Simulators
As a baseline for the single-step simulators, we used the following state transition:
Encoding:  
Action fusion:  
Gate update:  
Cell update:  
State update: 
with vectors of dimension 1024 and 2048, respectively.
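The five stages above (encoding, action fusion, gate update, cell update, state update) form an action-conditioned LSTM. As a hedged sketch of their general shape, with the encoded frame taken as given and all weight names our own assumptions rather than the paper's exact Eqs. (1)–(5):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transition(z, a, h, c, W):
    """One step of an action-conditioned LSTM-style state transition.

    A sketch only: `z` is the encoded frame, `a` a one-hot action, and
    (h, c) the previous LSTM state. A multiplicative action fusion `v`
    drives otherwise standard LSTM gate and cell updates.
    """
    v = (W['Wh'] @ h) * (W['Wa'] @ a)                     # action fusion
    i = sigmoid(W['Wvi'] @ v + W['Wzi'] @ z)              # input gate
    f = sigmoid(W['Wvf'] @ v + W['Wzf'] @ z)              # forget gate
    o = sigmoid(W['Wvo'] @ v + W['Wzo'] @ z)              # output gate
    c = f * c + i * np.tanh(W['Wvc'] @ v + W['Wzc'] @ z)  # cell update
    h = o * np.tanh(c)                                    # state update
    return h, c
```

In the model described above, `h` would have dimension 1024 and `v` dimension 2048; tiny dimensions suffice to check the wiring.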
B.1 Atari
We used a trained DQN agent (the scores are given in the table below) to generate training and test datasets consisting of 5,000,000 and 1,000,000 (210 × 160) RGB images respectively, with actions chosen according to an ε-greedy policy. Such a large number of training frames was necessary to prevent our simulators from strongly overfitting to the training data. This would be the case with, for example, one million training frames, as shown in Fig. 13 (the corresponding video can be seen at MSPacman). The ghosts are in frightened mode at timestep 1 (first image), and have returned to chase mode at timestep 63 (second image). The simulator is able to predict the exact time of return to chase mode without sufficient history, which suggests that the sequence was memorised.
Game Name       DQN Score
Bowling         51.84
Breakout        396.25
Fishing Derby   19.30
Freeway         33.38
Ms Pacman       2,963.31
Pong            20.88
Qbert           14,865.43
Riverraid       13,593.49
Seaquest        17,250.31
Space Invaders  2,952.09
The encoding consisted of 4 convolutional layers with 64, 32, 32 and 32 filters, with stride 2, and padding 0, 1, 1, 0 and 1, 1, 1, 0 for the height and width respectively. Every layer was followed by a randomized rectified linear function (RReLU) \citepxu15empirical. The output tensor of the convolutional layers was then flattened into a vector of dimension 2816. The decoding consisted of one fully-connected layer with 2816 hidden units followed by 4 full convolutional layers with the inverse symmetric structure of the encoding transformation: 32, 32, 32 and 64 filters, with stride 2, and padding 0, 1, 1, 0 and 0, 1, 1, 1. Each full convolutional layer (except the last one) was followed by an RReLU.
In Fig. 14, we show one example of successful prediction at timesteps 100 and 200 for each game.
Short-Term Versus Long-Term Accuracy
In Figures 1519, we show the prediction error obtained with the training schemes described in Sec. 3.1 for all games. Below we discuss the main findings for each game.
Bowling. Bowling is one of the easiest games to model. A simulator trained using only observation-dependent transitions gives quite accurate predictions. However, using only prediction-dependent transitions reduces the error in updating the score and predicting the ball direction.
Breakout. Breakout is a difficult game to model. A simulator trained with only prediction-dependent transitions predicts the paddle movement very accurately but almost always fails to represent the ball. A simulator trained with only observation-dependent transitions struggles much less to represent the ball but does not predict the paddle and ball positions as accurately, and the ball also often disappears after hitting the paddle. Interestingly, the long-term prediction error (bottom-right of Fig. 15(b)) for the 100%PDT training scheme is the lowest, as when not representing the ball the predicted frames look closer to the real frames than when representing the ball incorrectly. A big improvement in the model's ability to represent the ball could be obtained by preprocessing the frames with max-pooling as done for DQN, as this increases the ball size. We believe that a more sophisticated convolutional structure would be even more effective, but we did not succeed in discovering such a structure.
Fishing Derby. In Fishing Derby, long-term accuracy is disastrous with the 0%PDT training scheme and good with the 100%PDT training scheme. Short-term accuracy is better with schemes using more observation-dependent transitions than with the 100% or 0%-100%PDT training schemes, especially at low numbers of parameter updates.
Freeway. Together with Bowling, Freeway is one of the easiest games to model, but more parameter updates are required for convergence than for Bowling. The 0%PDT training scheme gives good accuracy, although sometimes the chicken disappears or its position is incorrectly predicted; this happens extremely rarely with the 100%PDT training scheme. In both schemes, the score is often wrongly updated in the warning phase.
Ms Pacman. Ms Pacman is a very difficult game to model and accurate prediction can only be obtained for a few timesteps into the future. The movement of the ghosts, especially when in frightened mode, is regulated by the position of Ms Pacman according to complex rules. Furthermore, the DQN ε-greedy policy does not enable the agent to explore certain regions of the state space. As a result, the simulator can predict the movement of Ms Pacman well, but fails to predict the long-term movement of the ghosts when in frightened mode, or when in chase mode later in the episodes.
Pong. With the 0%PDT training scheme, the model often incorrectly predicts the direction of the ball when hit by the agent or by the opponent. Quite rarely, the ball disappears when hit by the agent. With the 100%PDT training scheme, the direction of the ball is much more accurately predicted, but the ball more often disappears when hit by the agent, and the ball and paddles are generally less sharp.
Qbert. Qbert is a game for which the 0%PDT training scheme is unable to predict accurately beyond the very short term, as after a few frames only the background is predicted. The more prediction-dependent transitions are used, the less sharply the agent and the moving objects are represented.
Riverraid. In Riverraid, prediction with the 0%PDT training scheme is very poor, as this scheme causes no generation of new objects or background. With all schemes, the model fails to predict the frames that follow a jet loss, which is why the prediction error increases sharply after around timestep 13 in Fig. 18(b). The long-term prediction error is lower with the 100%PDT training scheme, as with this scheme the simulator is more accurate before, and sometimes after, a jet loss. The problem of incorrect prediction after a jet loss disappears when using BPTT(15,2) with prediction-dependent transitions.
Seaquest. In Seaquest, with the 0%PDT training scheme, the existing fish disappear after a few timesteps and no new fish ever appear from the sides of the frame. The higher the number of prediction-dependent transitions, the less sharply the fish are represented, but the more accurately their dynamics and appearance from the sides of the frame can be predicted.
Space Invaders. Space Invaders is a very difficult game to model and accurate prediction can only be obtained for a few timesteps into the future. The 0%PDT training scheme is unable to predict accurately beyond the very short term. The 100%PDT training scheme struggles to represent the bullets.
In Figs. 20–24 we show the effect of using different prediction lengths with the training schemes 0%PDT, 67%PDT, and 100%PDT for all games.
In Figs. 25–29 we show the effect of using different prediction lengths through truncated backpropagation with the training schemes 0%PDT, 33%PDT, and 100%PDT for all games.
Different Action-Dependent State Transitions
In this section we compare the baseline state transition
Encoding:  
Action fusion:  
Gate update:  
Cell update:  
State update: 
where the vectors have dimension 1024 and 2048 respectively (this model has around 25 million (25M) parameters), with alternatives using unconstrained or convolutional transformations, for a fixed prediction length and the 0%-100%PDT training scheme.
More specifically, in Figs. 30–34 we compare the baseline transition with the following alternatives:
 Base2816:

The vectors and have the same dimension as , namely 2816. This model has around 80M parameters.
 and 2816:

Have a separate gating for in the cell update, i.e.
This model has around 30 million parameters. We also considered removing the linear projection of , i.e.
without RReLU after the last convolution and with vectors and of dimensionality 2816. This model has around 88M parameters.
 and –2816:

Remove in the gate updates, i.e.
with one of the following cell updates
 , –, and –2816:

Substitute with in the gate updates, i.e.
with one of the following cell updates
As we can see from the figures, there is no other transition that is clearly preferable to the baseline, with the exception of Fishing Derby, for which transitions with 2816 hidden dimensionality perform better and converge earlier in terms of number of parameter updates.
In Figs. 35–39 we compare the baseline transition with the following convolutional alternatives (where, to apply the convolutional transformations, the vectors of dimensionality 2816 are reshaped into tensors):
 and 2:

Convolutional gate and cell updates, i.e.
where the symbol denotes either one convolution with 32 filters of size 3 × 3, with stride 1 and padding 1 (so as to preserve the input size), or two such convolutions with an RReLU nonlinearity in between. These two models have around 16M parameters.
 DA and 2DA:

As above but with different action fusion parameters for the gate and cell updates, i.e.
These two models have around 40M parameters.
 –2816–2:

As ’–2816’ with convolutional gate and cell updates, i.e.
where denotes two convolutions as above. This model has around 16M parameters.
 –2816–DA and –2816–2DA:

As above but with different parameters for the gate and cell updates, and one or two convolutions. These two models have around 48M parameters.
 –2816–2A:

As ’–2816’ with convolutional action fusion, gate and cell updates, i.e.
where indicates two convolutions as above. This model has around 8M parameters.
Action Incorporation
In Figs. 40–44 we compare different ways of incorporating the action into action-dependent state transitions, using a fixed prediction length and the 0%-100%PDT training scheme. More specifically, we compare the baseline structure with the following alternatives:
 :

Multiplicative/additive interaction of the action with , i.e. . This model has around 25M parameters.
 :

Multiplicative interaction of the action with the encoded frame , i.e.
This model has around 22M parameters.
 :

Multiplicative interaction of the action with both and in the following way
This model has around 19M parameters. We also considered having different matrices for the gate and cell updates (denoted in the figures as ’’). This model has around 43M parameters.
 :

Alternative multiplicative interaction of the action with and
This model has around 28M parameters. We also considered having different matrices for the gate and cell updates (denoted in the figures as ’’). This model has around 51M parameters.
 As Input:

Consider the action as an additional input, i.e.
This model has around 19M parameters.
 :

Combine the action with the frame, by replacing the encoding with
where the symbol indicates an augmenting operation: the frame is augmented with as many full-zero or full-one matrices as there are actions, producing a larger tensor. As the output of the first convolution can be written in terms of the filter strides, with this augmentation the action has a local linear interaction. This model has around 19M parameters.
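A minimal sketch of the augmenting operation described above (the function name and channel-last layout are our assumptions):

```python
import numpy as np

def augment_with_action(frame, action, n_actions):
    """Append one constant plane per action to the frame (one-hot planes).

    The plane for the chosen action is all ones, the others all zeros, so
    the first convolution sees the action as a locally linear input, as
    described above. `frame` has shape (H, W, C).
    """
    h, w, _ = frame.shape
    planes = np.zeros((h, w, n_actions), dtype=frame.dtype)
    planes[:, :, action] = 1.0
    return np.concatenate([frame, planes], axis=-1)
```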
As we can see from the figures, ’’ is generally considerably worse than the other structures.
Action-Independent versus Action-Dependent State Transition
In Fig. 45, we compare the baseline structure with an action-independent one as in [Oh et al.(2015)Oh, Guo, Lee, Lewis, and Singh], using a fixed prediction length and the 0%-100%PDT training scheme.
As we can see, having an action-independent state transition generally gives worse performance in the games with higher error. An interesting disadvantage of such a structure is its inability to predict the moving objects around the agent in Seaquest; this can be noticed in the videos available at Seaquest, which show poor modelling of the fish. This structure also makes it more difficult to correctly update the score in some games, such as Seaquest and Fishing Derby.
Human Play
In Fig. 46, we show the results of a human playing Freeway for 2000 timesteps (the corresponding video is available at FreewayHPlay). The model is able to update the score correctly up to (14,0). At that point the score starts flashing and changing color as a warning that the game is about to reset. The model is not able to predict the score correctly in this warning phase, due to bias in the data (DQN always achieves a score above 20 at this point in the game), but the flashing starts at the right time, as does the resetting of the game.
B.2 3D Car Racing
We generated 10 million and one million (180 × 180) RGB images for training and testing respectively, with an agent trained with the asynchronous advantage actor-critic algorithm (Fig. 2 in \citepmnih16asynchronous). The agent could choose among the three actions accelerate straight, accelerate left, and accelerate right, according to an ε-greedy policy, with ε selected at random between 0 and 0.5, independently for each episode. We added a 4th 'do nothing' action when generating actions at random. Smaller ε led to longer episodes (around 1500 frames), while larger ε led to shorter episodes (around 200 frames).
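The per-episode ε-greedy generation scheme above can be sketched as follows (the `policy_action` callable stands in for the trained A3C agent, which is an assumption of this sketch):

```python
import numpy as np

def episode_actions(policy_action, n_steps, n_actions, rng):
    """Generate one episode's actions with per-episode random epsilon.

    Epsilon is drawn uniformly in [0, 0.5] once per episode; at each step
    the agent's action is replaced by a uniformly random action with
    probability epsilon, as described above.
    """
    eps = rng.uniform(0.0, 0.5)
    actions = []
    for t in range(n_steps):
        if rng.random() < eps:
            actions.append(int(rng.integers(n_actions)))
        else:
            actions.append(int(policy_action(t)))
    return actions
```

Note that `n_actions` here would include the extra 'do nothing' action used only for random choices.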
We could use the same number of convolutional layers, filters and kernel sizes as in Atari, with no padding.
Fig. 49 shows predicted and real frames side by side over a long sequence of actions. We found that this quality of prediction was very common.
When using our model as an interactive simulator, we observed that the car would slow down slightly when selecting no action, but fail to stop. Since the model had never seen the agent completely release the accelerator for more than a few consecutive actions, it makes sense that it would fail to handle this case appropriately.
B.3 3D Mazes
Unlike for Atari and TORCS, we could rely on agents with random policies to generate interesting sequences. The agent could choose one of five actions: forward, backward, rotate left, rotate right, or do nothing. During an episode, the agent alternated between a random walk for 15 steps and spinning in place for 15 steps (roughly a complete spin). This encourages coherent learning of the predicted frames after a spin. The random walk used dithering of 0.7, meaning that new actions were chosen with a probability of 0.7 at every timestep. The training and test datasets were made of 7,600 and 1,100 episodes, respectively. All episodes were of length 900 frames, resulting in 6,840,000 and 990,000 (48 × 48) RGB images for training and testing respectively.
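A sketch of this alternating random-walk/spin policy, under the stated five-action set (the specific action indices, and the choice of always rotating right during the spin phase, are our assumptions):

```python
import numpy as np

# Assumed action indices: 0 forward, 1 backward, 2 rotate-left,
# 3 rotate-right, 4 do-nothing.
def maze_episode_actions(n_steps, dither=0.7, block=15, rng=None):
    """Alternate a dithered random walk (15 steps) with spinning (15 steps)."""
    rng = rng or np.random.default_rng()
    actions, current = [], int(rng.integers(5))
    for t in range(n_steps):
        if (t // block) % 2 == 1:
            actions.append(3)             # spin phase: keep rotating right
        else:
            if rng.random() < dither:     # resample action with prob. 0.7
                current = int(rng.integers(5))
            actions.append(current)
    return actions
```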
We adapted the encoding by having only convolutions with 64 filters of size , stride 2, and padding 0, 1, and 2. The decoding transformation was adapted accordingly.
B.4 Model-based Exploration
We observed that increasing the number of Monte Carlo simulations beyond 100 made little to no difference, probably because with five possible actions the number of possible action sequences is so large that we quickly get diminishing returns with every new simulation.
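This diminishing-returns argument can be made concrete: with five actions and sequences of six, the space of distinct sequences dwarfs the number of simulations per step:

```python
n_actions, seq_len, n_sims = 5, 6, 100
n_sequences = n_actions ** seq_len        # 15,625 distinct sequences
coverage = n_sims / n_sequences           # fraction sampled per step
print(n_sequences, round(coverage, 4))
```

Each step samples well under 1% of the sequence space, so additional simulations mostly resample qualitatively similar trajectories.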
Significantly increasing the sequence length of actions beyond six led to a large decrease in performance. To explain this, we observed that the average prediction error more than doubled over the longer horizon, while the average minimum and maximum distances did not vary significantly; for deep simulations we therefore ended up with more noise than signal in our predictions, and our decisions were no better than random.
Fig. 50 shows some examples of trajectories chosen by our explorer. Note that all these trajectories are much smoother than those of our baseline agent.
Appendix C Prediction-Independent Simulators
In this section we compare different action-dependent state transitions and prediction lengths for the prediction-independent simulator.
More specifically, in Fig. 51 we compare the state transition
Encoding: 