Temporal Difference Variational Auto-Encoder
One motivation for learning generative models of environments is to use them as simulators for model-based reinforcement learning. Yet, it is intuitively clear that when time horizons are long, rolling out single step transitions is inefficient and often prohibitive. In this paper, we propose a generative model that learns state representations containing explicit beliefs about states several time steps in the future and that can be rolled out directly in these states without executing single step transitions. The model is trained on pairs of temporally separated time points, using an analogue of temporal difference learning used in reinforcement learning, taking the belief about possible futures at one time point as a bootstrap for training the belief at an earlier time. While we focus purely on the study of the model rather than its use in reinforcement learning, the model architecture we design respects agents’ constraints as it builds the representation online.
Having a model of the environment allows us to consider consequences of different actions without acting in the real environment and to formulate a plan (Browne et al., 2012; Silver et al., 2016; Betts, 1998; Bertsekas et al., 1995) or simply learn from simulated experience (Sutton, 1991). In a general agent setting, when a model is not available, the usual approach is to learn one and use one of the model-based methods, such as those referenced above. This has proven difficult but has had some success on relatively simple domains (Deisenroth & Rasmussen, 2011; Racanière et al., 2017; Buesing et al., 2018; Lee et al., 2018; Ha & Schmidhuber, 2018).
Planning can be most successful if the typical time separation of the events over which one plans is not much smaller than the time scale of the task. For example in the classical planning of air transport, one plans over events such as loading cargo and flying to another city, not in terms of one-second time intervals. Yet, most environment model research has focused on building accurate simulators of environments (reviewed below). It seems that we need a different kind of model: a model that can bridge large time scales in one, or a small number of computational steps (see some approaches below). What are the properties that such models should satisfy? This question won’t be settled until these models are successfully used in planning, which is a whole research topic in itself. To make progress and to be able to focus on model building, we propose a set of properties that are desirable for models used by agents. Then, we build a model that satisfies them and test it in experiments.
We first review some approaches to sequence modelling which include environment simulators. One of the most straightforward methods is to use a recurrent network to make a deterministic prediction of the next frame (Oh et al., 2015; Chiappa et al., 2017). However, most environments are complex and/or stochastic and require stochastic modelling. The most direct method is an extension of the previous ones, conditioning various types of generative models on the past frames (Graves, 2013; Kalchbrenner et al., 2016; Van Den Oord et al., 2016). Finally, another approach is to build a latent variable model that can make decisions in latent space, processing them using a neural network to generate observations (Chung et al., 2015; Lee et al., 2018; Archer et al., 2015; Fraccaro et al., 2016; Liu et al., 2017; Buesing et al., 2018; Serban et al., 2017; Oord et al., 2017). Depending on the architecture, some of these models are able to roll forward in latent space, and only generate pixels if needed for some purpose such as visualization. In these models the recurrent computations are carried out at a rate of one time step per input time step. Some works consider skipping a part of the computation for a number of time steps, or various forms of directly predicting the future more than one step ahead (Koutnik et al., 2014; Chung et al., 2016; Zheng et al., 2016; Venkatraman et al., 2017). Our model differs from these in several significant aspects. First, it does not consider fixed time skips but has a belief about future states at flexible time separations. Second, it can roll forward in these states. Finally, it is able to learn from two time points without needing to back-propagate between them. The model satisfies the following list of proposed desired properties:
1. The model should be able to make predictions about states many time steps into the future and roll forward samples of these states, without considering single step transitions. Consider a task that requires planning over a set of events. If these events are separated by large time delays, a single step planner is significantly more expensive and inefficient than a planner that plans directly in terms of these events. Furthermore, the model should predict states, not observations (for example predicting not only position but also motion).
2. The model should be able to learn from temporally separated time points without the need to back-propagate through all the time points between them. This could have several advantages: The separation could be large and back-propagating would be expensive in terms of computational speed and memory and difficult due to the vanishing gradient problem. Bridging different time points directly could also allow many such pairs to be processed in a short amount of time.
3. The model should be deep and layered. The basic principle behind deep learning is not only to execute several computations in sequence (as opposed to shallow computation of one or two steps), but also to create an operation - a layer - that is replicated many times in depth. That is, rather than designing these operations by hand, we create one operation and replicate it, and let the system learn to make this operation represent different things in different layers. For example, in convolutional networks such an operation is the triplet of convolution, batch normalization and rectified-linear non-linearity.
4. The model needs to build state representations from observations online. This is required because an agent should use the representations to take actions.
5. The state representation should contain explicit beliefs about various parts of the environment and in a way that is easy to access. For example when moving around a room, the model should have beliefs about parts of the room it has not seen, or how what it has seen has changed. Extracting these beliefs should not require long computation such as long forward rolls from the model.
6. To build a state and to act the model should be able to make the best use of the available information. Therefore there should be a deterministic pathway from observations to states. A simple example of this is the Hidden Markov Model, which models a stochastic process, but the parameters of which, representing probabilities of different states, are obtained deterministically from observations.
2 Temporal difference variational auto-encoder (td-VAE).
Here we explain the single layer version of the model. It consists of two parts: an aggregation network and a belief network. The aggregation network is the computation that builds the state that the agent can use to act. The belief network represents explicit beliefs about the state of the world and provides the training signal for itself and for the aggregation network.
The aggregation network we use is simply a recurrent neural network (RNN) that consumes observations (including actions and rewards if present), and the state that the agent uses to act is the internal state of the RNN. Mathematically, the state $b_t$ at time $t$ is computed from the previous state $b_{t-1}$, the input $x_t$, and the previous action $a_{t-1}$ and reward $r_{t-1}$ as $b_t = f(b_{t-1}, x_t, a_{t-1}, r_{t-1})$. In practice, if the observations are images, they are often processed using a convolutional network followed by an LSTM. From now on we will omit actions and rewards unless otherwise stated.
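A minimal sketch of this online aggregation, assuming a plain tanh RNN cell in place of the convolutional network plus LSTM mentioned above (all names and sizes are illustrative):

```python
import numpy as np

def init_params(obs_dim, state_dim, rng):
    # Hypothetical parameter shapes for a plain tanh RNN cell.
    return {
        "W": rng.standard_normal((state_dim, state_dim)) * 0.1,
        "U": rng.standard_normal((state_dim, obs_dim)) * 0.1,
        "c": np.zeros(state_dim),
    }

def aggregate_step(params, b_prev, x_t):
    # b_t = f(b_{t-1}, x_t): one online update of the agent's state.
    return np.tanh(params["W"] @ b_prev + params["U"] @ x_t + params["c"])

def build_state(params, observations, state_dim):
    # Consume the observation stream one frame at a time (online).
    b = np.zeros(state_dim)
    for x_t in observations:
        b = aggregate_step(params, b, x_t)
    return b

rng = np.random.default_rng(0)
params = init_params(obs_dim=4, state_dim=8, rng=rng)
obs = rng.standard_normal((20, 4))   # 20 frames of a toy observation stream
b_T = build_state(params, obs, state_dim=8)
```

The point of the loop is that the state is available at every time step, which is what an acting agent requires.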
Next, we build the belief network in a way that forces the state to contain beliefs about the world. We express these beliefs by a probability distribution $p_B(z_t \mid b_t)$ over a latent variable $z_t$. As an example, imagine that an agent enters a room and has seen only a part of it. The variable $z_t$ could represent the possible contents of the room. The elements of the vector $z_t$ that represent what has been observed, or what is predicted from observations, should have low uncertainty under $p_B(z_t \mid b_t)$, and the remaining elements should have high uncertainty.
We are going to train the model by considering two time points $t_1$ and $t_2 > t_1$ that are separated by a random amount drawn from some range. Except for the recurrent aggregation network above, we want all the updates to depend only on the variables at these two time points. The model is displayed and explained in Figure 1. We write its specifications here and derive them in the next paragraph. We let $b_{t_1}, b_{t_2}$ and $z_{t_1}, z_{t_2}$ denote the states and latent variables at times $t_1, t_2$. The model is defined by the following equations:

$z_{t_2} \sim p_B(z_{t_2} \mid b_{t_2})$ (1)

$z_{t_1} \sim q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$ (2)

$p(z_{t_2} \mid z_{t_1}), \qquad p(x_{t_2} \mid z_{t_2})$ (3)

$\mathcal{L} = \log p_B(z_{t_2} \mid b_{t_2}) + \log q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2}) - \log p_B(z_{t_1} \mid b_{t_1}) - \log p(z_{t_2} \mid z_{t_1}) - \log p(x_{t_2} \mid z_{t_2})$ (4)
Here $p_B(z \mid b)$ denotes the agent’s belief distribution, $p(x_{t_2} \mid z_{t_2})$ is a reconstructive distribution of $x_{t_2}$ from $z_{t_2}$, $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$ is an inference distribution and $\mathcal{L}$ is the loss that we minimize. $p(z_{t_2} \mid z_{t_1})$ is at the same time a forward model of the environment that we can use to roll samples forward, skipping steps. Equations (1-4) form a type of variational auto-encoder (as discussed below) which is trained using back-propagation as usual. Next we derive this model.
We aim for $z_{t_1} \sim p_B(z_{t_1} \mid b_{t_1})$ to represent the model’s belief about the world, having seen observations up to time $t_1$ (online). First, $z_{t_1}$ should represent the current frame $x_{t_1}$, thus aiming to maximize $\log p(x_{t_1} \mid z_{t_1})$, where $p(x \mid z)$ is a ‘decoder’ distribution. Next, it should contain beliefs about the future states. Consider a later time $t_2$. We could theoretically get the most precise state of the world if we considered all the past and the future frames that will be encountered and encoded this information into $z_{t_1}$. There would still be uncertainty left in our knowledge, and therefore we can only have a belief over $z_{t_1}$. If we had this distribution, we could train $p_B(z_{t_1} \mid b_{t_1})$ to predict it (note that $b_{t_1}$ is computed online only from the observations up to time $t_1$). Now, in practice we cannot consider all future inputs to compute this belief (and in addition it might be difficult to aggregate all the inputs that we do consider). Therefore we use our belief $p_B(z_{t_2} \mid b_{t_2})$ at time $t_2$ to estimate it. This belief contains information about frames observed between $t_1$ and $t_2$ and also expresses beliefs about what the future after $t_2$ will be. We do not know which future will happen, and therefore we take a sample $z_{t_2} \sim p_B(z_{t_2} \mid b_{t_2})$ from this belief as a possible future. Now, assuming this sampled future, we can infer what $z_{t_1}$ would have been using a function $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$ and train the belief $p_B(z_{t_1} \mid b_{t_1})$ to predict the belief $q$, that is, trying to minimize $\mathrm{KL}(q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2}) \,\|\, p_B(z_{t_1} \mid b_{t_1}))$. Now, in order for $z_{t_1}$ to represent a sample of the belief in the future, we train it to reconstruct this future. That is, assuming that $b_{t_2}$ contains the belief about the state of the world at $t_2$, with $z_{t_2}$ being a specific instance of this belief, we can train $z_{t_1}$ to represent the corresponding state at time $t_1$ by trying to reconstruct $z_{t_2}$ from $z_{t_1}$. The reconstruction is expressed as the distribution $p(z_{t_2} \mid z_{t_1})$, with the loss $-\log p(z_{t_2} \mid z_{t_1})$ to be minimized. Adding these losses and the input reconstruction term $-\log p(x_{t_2} \mid z_{t_2})$ gives us the loss (4).
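Under diagonal-Gaussian parameterizations of all the distributions, the per-pair loss can be sketched as follows. The distribution parameters would in practice come from the corresponding networks; the names and the toy values here are illustrative:

```python
import numpy as np

def gauss_logprob(x, mu, logvar):
    # Log-density of a diagonal Gaussian, summed over dimensions.
    return -0.5 * np.sum(logvar + np.log(2 * np.pi)
                         + (x - mu) ** 2 / np.exp(logvar))

def tdvae_pair_loss(z1, z2, x2, pB1, pB2, q, pT, pD):
    # Each distribution is a (mu, logvar) pair already computed by the
    # corresponding network; z2 ~ p_B(.|b2), z1 ~ q(.|z2, b1, b2).
    return (gauss_logprob(z2, *pB2) + gauss_logprob(z1, *q)
            - gauss_logprob(z1, *pB1) - gauss_logprob(z2, *pT)
            - gauss_logprob(x2, *pD))

d = 3
zeros = np.zeros(d)
z1, z2, x2 = zeros, zeros, zeros
same = (zeros, zeros)            # standard normal for every distribution
loss = tdvae_pair_loss(z1, z2, x2, same, same, same, same, same)
# Here q matches p_B(.|b1) and p_B(.|b2) matches the forward model at the
# sampled points, so only the reconstruction term -log p_D(x2|z2) survives.
```

In practice the loss is averaged over sampled $(t_1, t_2)$ pairs and minimized by back-propagation through the reparameterized samples.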
2.1 Relationship to temporal difference learning of reinforcement learning
In reinforcement learning the state of an agent represents a belief about the sum of discounted rewards $R_t = \sum_{\tau} \gamma^{\tau} r_{t+\tau}$. In the classic setting, the agent only models the mean of this distribution, represented by the value function $V(s_t)$ or the action dependent function $Q(s_t, a_t)$ (Sutton & Barto, 1998). Recently, in (Bellemare et al., 2017), a full distribution over the belief of $R_t$ has been considered. To train the beliefs $V$ or $Q$, one does not usually wait to get all the rewards to compute $R_t$. Instead one uses a belief at some future time $t'$ as a bootstrap to train $V$ or $Q$ at $t$ (temporal difference). In policy gradients one trains $V(s_t)$ to be close to $r_t + \gamma r_{t+1} + \cdots + \gamma^{t'-t} V(s_{t'})$ and in Q-learning one trains $Q(s_t, a_t)$ to be close to $r_t + \gamma \max_a Q(s_{t+1}, a)$.
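The bootstrapped target described above can be illustrated with tabular TD(0) on a toy chain, where the belief at a later state stands in for the unobserved remaining rewards:

```python
import numpy as np

# TD(0) on a deterministic 3-state chain: s0 -> s1 -> terminal.
# Reward 0 on the first transition, 1 on the second; gamma = 1, so the
# true values are V(s0) = V(s1) = 1.
V = np.zeros(2)
alpha, gamma = 0.5, 1.0
for _ in range(100):
    # s0 -> s1, reward 0: bootstrap from the belief V[1] at the later time.
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # s1 -> terminal, reward 1: the terminal value is 0.
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
```

The update never waits for the full return; it trusts the later belief, exactly the bootstrap idea td-VAE applies to state beliefs instead of value beliefs.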
In our case, the model expresses belief about possible future states instead of the sum of discounted rewards. The model trains the belief at time $t_1$ using beliefs at various time points in the future. It accomplishes this by (variationally) auto-encoding a sample $z_{t_2}$ of the future state into a sample $z_{t_1}$, using the approximate posterior inference distribution $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$ and the decoding distribution $p(z_{t_2} \mid z_{t_1})$. The auto-encoding mapping translates between states at $t_1$ and $t_2$, forcing beliefs at the two time steps to be consistent. Sample $z_{t_1}$ forms the target for training the belief $p_B(z_{t_1} \mid b_{t_1})$ (which appears as a prior distribution over $z_{t_1}$ in the language of VAEs).
2.2 Relationship to common forms of VAE and to filtering and smoothing of the field of state estimation.
Mathematically, td-VAE forms a two layer variational auto-encoder with the following structure (see Figure 1 of the supplement for a diagram). Variable $z_{t_2}$ is in the first layer and $z_{t_1}$ in the second. First, $z_{t_2}$ is sampled from the approximate posterior inference distribution $p_B(z_{t_2} \mid b_{t_2})$, which depends on the observations through $b_{t_2}$. Next, $z_{t_1}$ is sampled from the second layer approximate posterior inference distribution $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$. The prior for $z_{t_1}$ is $p_B(z_{t_1} \mid b_{t_1})$. Variable $z_{t_1}$ is decoded into the space of $z_{t_2}$ using $p(z_{t_2} \mid z_{t_1})$, which forms the prior over $z_{t_2}$. Variable $z_{t_2}$ is decoded into the input space using $p(x_{t_2} \mid z_{t_2})$. The critical element that allows us to learn the state model from two temporally separated time points is the use of $p_B$ as both the posterior at time $t_2$ and the prior at time $t_1$. At $t_2$ it forms the (bootstrap) information about the world that is to be predicted from the information at time $t_1$.
In the language of state space estimation, $p_B(z_t \mid b_t)$ forms a filtering distribution, which depends only on the variables observed so far. The distribution $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$ forms the smoothing distribution, correcting the belief at earlier times based on future information. Finally, $p(z_{t_2} \mid z_{t_1})$ is the forward model.
We show that the model is able to learn the state and roll forward in jumps on the following simple example. We consider sequences of length 20 of images of MNIST digits. For a given sequence, a random digit from the dataset is chosen, as well as a direction of movement: left or right. Then, at each time step in the sequence, the digit moves by one pixel in the chosen direction, as shown in Figure 2, left. We train the model with $t_1$ and $t_2$ separated by a random amount drawn from a fixed interval. We describe the functional forms we use below. We would like to see whether the model at a given time point $t_1$ can roll out a simulated experience in time steps $t_2$, $t_3$, … with $t_{i+1} - t_i > 1$, without considering the inputs in between these time points. Note that it is not sufficient to predict the future inputs $x_{t_2}, x_{t_3}, \ldots$, as they do not contain information about whether the digit moves left or right. We need to sample a state $z$ that contains this information.
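The data described above can be generated along these lines (a sketch, with a solid block standing in for an MNIST digit; function names are illustrative):

```python
import numpy as np

def moving_sequence(digit, length=20, rng=None):
    # digit: (H, W) image; each frame shifts it one pixel left or right
    # (with wraparound) in a direction fixed for the whole sequence.
    rng = rng if rng is not None else np.random.default_rng()
    direction = rng.choice([-1, 1])
    frames = [np.roll(digit, shift=t * direction, axis=1)
              for t in range(length)]
    return np.stack(frames), direction

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
digit[10:18, 13:16] = 1.0     # stand-in for an MNIST digit
seq, direction = moving_sequence(digit, length=20, rng=rng)
```

Note that a single frame never reveals `direction`; recovering it requires aggregating at least two frames, which is exactly why a state, not a frame, must be predicted.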
We sample a sequence from the model (do a rollout) as follows. The state $b_t$ at the current time step $t$ is obtained online using the aggregation recurrent network from observations up to time $t$. Next, the belief $p_B(z_t \mid b_t)$ is computed and a sample $z_t$ from this belief is produced. This represents the current frame as well as beliefs about future states. The belief about the current input can be decoded from $z_t$ using the decoding distribution $p(x_t \mid z_t)$ and is generally very close to the input $x_t$. To produce the sequence, we sequentially sample $z_{t_{i+1}} \sim p(z_{t_{i+1}} \mid z_{t_i})$, starting with $z_{t_1} = z_t$. We decode each such $z$ using the decoder distribution $p(x \mid z)$. The resulting sequences are shown in Figure 2, right. We see that indeed the model can roll the samples forward in steps of more than one elementary time step (the sampled digits move by more than one pixel) and that it preserves the direction of motion, demonstrating that it rolls forward a state.
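The rollout loop can be sketched as follows, with a hypothetical deterministic linear map standing in for the learned forward model:

```python
import numpy as np

def rollout(z_t, forward, n_steps, rng):
    # Roll the state forward in jumps: each call to `forward` returns the
    # (mu, logvar) of the transition directly at the next jump point.
    states = [z_t]
    for _ in range(n_steps):
        mu, logvar = forward(states[-1])
        z_next = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        states.append(z_next)
    return states

# Toy forward model: a fixed rotation with no noise, standing in for the
# learned jumpy transition (logvar of -inf makes the sample deterministic).
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
forward = lambda z: (A @ z, np.full(2, -np.inf))

rng = np.random.default_rng(0)
states = rollout(np.array([1.0, 0.0]), forward, n_steps=4, rng=rng)
```

The key property is that `forward` is queried once per jump, never once per elementary time step.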
Why does the model do this? We train the model on two time points $t_1$ and $t_2$ (with $t_2 > t_1$). The sample $z_{t_1}$ at $t_1$ tries to reconstruct the frame at $t_1$ and also to reconstruct a future sample $z_{t_2}$. Therefore, it puts the information about the future into $z_{t_1}$, which thus effectively contains the motion information. The belief $p_B(z_{t_1} \mid b_{t_1})$ tries to predict this information based on its inputs up to and including time $t_1$, which is rather easy for a recurrent network, and therefore $b_{t_1}$ will contain motion information. This is of course happening at every time step, including $t_2$. That means that $z_{t_2}$ contains not only information about $x_{t_2}$ but also the motion information. Thus $z_{t_1}$ tries to reconstruct not only information about frame $x_{t_2}$ but also the motion information stored in $z_{t_2}$. Thus rolling forward from $z_{t_1}$ gives us a sample of the state at $t_2$, which can subsequently be rolled forward.
3 Deep Model
In the previous section we provided the framework for learning models by bridging two temporally separated time points. Experience and intuitive considerations suggest that for these models to be powerful, they need to be deep. We can of course use deep networks for the various distributions used in the model. However, it would be more appealing to have layers of beliefs, ideally with higher layers representing more invariant or abstract information. In fact, one of the principles of deep learning is not only to have several steps of computation, but to have a layer that is replicated several times to form a deep network. Here we form such a deep network. In addition we suggest specific functional forms, derived from those that generally give rise to strong versions of variational auto-encoders.
The first part to extend to layers is the RNN that aggregates observations to produce the state $b_t$. Here we simply use a deep LSTM, but with a given layer $l$ receiving inputs also from layer $l+1$ from the previous time step. This is so that the higher layers can influence the lower ones (and vice versa). For $l = 1, \ldots, L$:

$b_t^l = \mathrm{LSTM}(b_{t-1}^l, b_t^{l-1}, b_{t-1}^{l+1})$

setting $b_t^0 = x_t$ and omitting the input from above at the top layer $l = L$.
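A minimal numpy sketch of this update order, assuming plain tanh cells in place of the LSTM cells (parameter names and sizes are illustrative):

```python
import numpy as np

def deep_step(params, b_prev, x_t):
    # One time step of an L-layer recurrent stack. Layer l sees its own
    # previous state, the current state of the layer below, and the
    # *previous* state of the layer above (top-down feedback).
    L = len(params)
    b_new = []
    below = x_t
    for l in range(L):
        above = b_prev[l + 1] if l + 1 < L else np.zeros_like(b_prev[l])
        p = params[l]
        b_l = np.tanh(p["W"] @ b_prev[l] + p["U"] @ below + p["V"] @ above)
        b_new.append(b_l)
        below = b_l
    return b_new

rng = np.random.default_rng(0)
dim, obs_dim, L = 6, 4, 3
params = [{"W": rng.standard_normal((dim, dim)) * 0.1,
           "U": rng.standard_normal((dim, dim if l else obs_dim)) * 0.1,
           "V": rng.standard_normal((dim, dim)) * 0.1} for l in range(L)]
b = [np.zeros(dim) for _ in range(L)]
for x_t in rng.standard_normal((10, obs_dim)):
    b = deep_step(params, b, x_t)
```

Because the top-down input comes from the previous time step, one bottom-up sweep per time step suffices, keeping the computation online.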
We create a deep version of the belief part of the model as a stacking of the shallow one, as shown in Figure 3. In the usual spirit of deep directed models, the model samples downwards, generating higher level representations before the lower level ones (closer to pixels). The model implements deep inference, that is, the posterior distribution of one layer depends on the samples from the posterior distribution at previously sampled layers. The order of inference is a choice, and we use the same direction as that of generation, from higher to lower layers as done for example in Gregor et al. (2016); Kingma et al. (2016); Rasmus et al. (2015). We implement the dependence of various distributions on latent variables sampled so far using a recurrent neural network that summarizes all such variables (in a given group of distributions). Note that we don’t share the weights between different layers. Given these choices, we can allow all connections consistent with the model. The functional forms we used are described in the next subsection.
In the first experiment we would like to demonstrate that the model can build a state even when very little information is present in a given frame, and that it can sample states far into the future. For this we consider a one dimensional sequence obtained from a noisy harmonic oscillator, as shown in Figure 4 (the first and the fourth rows). The frequencies, initial positions and initial velocities are chosen at random from some range. At every update, noise is added to the position and the velocity of the oscillator, but the energy is approximately preserved. The model observes a noisy version of the current position. Attempting to predict the input, which consists of one value, 100 time steps in the future would be uninformative. First, we would not know what the frequency or the magnitude of the signal is from this prediction. Furthermore, because the oscillator updates are noisy, the phase information is lost over such a long period and it is impossible to make an accurate prediction even of the input. Instead we should try to predict as much as possible about the state, which consists of the frequency, the magnitude and the position, and it is only the position that cannot be accurately predicted.
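The data-generating process described above can be sketched as follows (an illustrative sketch; the paper's exact parameter ranges and noise levels are not reproduced):

```python
import numpy as np

def oscillator_sequence(T, freq, process_noise, obs_noise, rng, dt=1.0):
    # Phase-space rotation (which preserves energy exactly) plus per-step
    # noise on position and velocity; only a noisy position is observed.
    x, v = rng.standard_normal(), rng.standard_normal()
    obs = []
    for _ in range(T):
        c, s = np.cos(freq * dt), np.sin(freq * dt)
        x, v = c * x + s * v, -s * x + c * v
        x += process_noise * rng.standard_normal()
        v += process_noise * rng.standard_normal()
        obs.append(x + obs_noise * rng.standard_normal())
    return np.array(obs)

rng = np.random.default_rng(0)
seq = oscillator_sequence(T=200, freq=0.3, process_noise=0.05,
                          obs_noise=0.1, rng=rng)
```

With the noise terms set to zero, the positions satisfy the exact linear recurrence $x_{k+2} = 2\cos(\omega\,dt)\,x_{k+1} - x_k$, which is what makes the clean system predictable and the noisy one predictable only up to phase.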
The single step true model of the environment is a frequency dependent linear mapping between (position, velocity) vectors at consecutive time steps with added noise. For a single frequency, the single step update would be well modeled by a Kalman filter. However, we are interested in a model that can skip time steps and work for all frequencies. We would also like to use the standard forms of variational auto-encoders, which use diagonal variance Gaussian distributions. The correct belief about position and velocity has full covariance. To implement such a belief, we need at least two layers of latent variables. The specific functional forms we use are in the supplement, but we summarize them briefly here. The aggregation RNN is an LSTM. There are two layers; the $z$'s at the second layer are sampled first, and their results are passed to the lower layer networks. The belief ($p_B$), inference ($q$) and forward ($p(z_{t_2} \mid z_{t_1})$) distributions are feed-forward networks, and the decoder simply extracts the first component from the $z$ of the first layer. We also feed the time interval $t_2 - t_1$ into $q$ and the forward model. We train on sequences of fixed length, with $t_2 - t_1$ taking values chosen at random from a fixed interval.
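The claim that two layers of diagonal Gaussians can represent a full-covariance belief can be checked numerically. In this hypothetical sketch, both components of the lower-layer latent (standing in for position and velocity) share a dependence on a scalar top-layer latent:

```python
import numpy as np

# Two stacked diagonal-Gaussian layers: z2 is scalar, and both components
# of z1 depend on it. Although each conditional is diagonal, the marginal
# over z1 has full covariance.
rng = np.random.default_rng(0)
n = 200_000
z2 = rng.standard_normal(n)                     # top layer: N(0, 1)
mu1 = np.stack([1.0 * z2, 1.0 * z2], axis=1)    # bottom-layer mean from z2
z1 = mu1 + 0.1 * rng.standard_normal((n, 2))    # diagonal conditional noise
cov = np.cov(z1.T)
# The off-diagonal entry of cov is close to 1.0: the components correlate.
```

Marginalizing out the shared top-layer variable is what produces the off-diagonal covariance that a single diagonal Gaussian cannot express.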
We would like to see if the model can correctly predict future states directly. We analyze what the model has learned as follows. We pick a time $t_1$ and sample $z_{t_1}$ from the belief distribution $p_B(z_{t_1} \mid b_{t_1})$ at $t_1$. Then we choose a time interval to skip; we choose two intervals for demonstration, a shorter one and a longer one. We sample from the forward model $p(z_{t_2} \mid z_{t_1})$ to obtain a sample $z_{t_2}$ at $t_2$. To see the content of this state, we then roll forward repeatedly with time step one and plot the result. This is shown in Figure 4. We see that indeed the state is predicted correctly, containing the correct frequency and magnitude of the signal. We also see that the position (phase) is predicted well for the shorter interval and less accurately for the longer one (at which point the noisiness of the system makes it unpredictable).
In the final experiment we would like to analyze the model on a more visually complex domain. We use sequences of frames seen by an agent solving tasks in the DeepMind Lab environment (Beattie et al., 2016). We would like to demonstrate that the model holds explicit beliefs about various possible future states (desired property number 5 of the introduction) and that it is able to make a rollout in jumps. We suggest functional forms inspired by a strong version of VAE, namely convolutional DRAW: we use convolutional LSTMs for all the circles in Figure 3 and make the model 16 layers deep (except for the forward updating LSTMs, which are fully connected, of depth 4; Figure 3 shows two layers).
We consider time skips $t_2 - t_1$ sampled uniformly from an interval and analyze the content of the belief $p_B(z_t \mid b_t)$, Figure 5. We take three samples $z_t$ from $p_B(z_t \mid b_t)$ and look for similarities and differences. These samples should represent three instances of possible futures. First, to see what they represent about the current time, we decode them using $p(x_t \mid z_t)$, and display the result in Figure 5, left. We see that they decode to approximately the same frame. To see what they represent about the future, we make five samples from the forward model $p(z_{t_2} \mid z_{t_1})$ for each $z_{t_1} = z_t$, and decode them, as shown in Figure 5, right. We see that for a given $z_{t_1}$, the samples (images in a given row) decode to similar frames. However, different $z_{t_1}$'s decode to different frames. That means $z_t$ represents beliefs about different possible futures directly.
Finally, we would like to see what a rollout looks like. To get a visual sense of it, we train on time separations chosen uniformly from an interval, on a task where the agent tends to move forward and rotate. Figure 6 shows four rollouts from the model. We see that the motion appears to go forward and into corridors and that it skips several time steps (real single step motion is slower).
In this paper we built a generative model that satisfies a number of properties that are potentially useful for building the state representation of an agent and for planning. Our model is able to make predictions about states several time steps into the future and roll (sample) forward in terms of them, skipping time points in between. It is able to learn from pairs of time points separated by a large and flexible amount directly, without the need to back-propagate through all the time steps between them. We showed how the model forms a two layer variational auto-encoder, implementing a form of temporal difference learning by auto-encoding the future into the past, and how different elements of the auto-encoder correspond to the filtering, smoothing and forward models of state estimation. We introduced a deep/layered version of the system, argued that it is appropriate even in the case of a noisy harmonic oscillator, and proposed functional forms of the deep system for both one dimensional inputs and for visually complex cases. We demonstrated these properties in proof of principle experiments. In the future we hope to investigate and improve functional forms and training regimes, apply the model to more difficult settings, and investigate a number of possible uses in reinforcement learning, such as representation learning and planning.
Appendix A td-VAE as a two layer VAE
Figure 7 shows td-VAE in a format of two layer VAE. The explanation is in the caption.
Appendix B td-VAE functional forms used for MNIST experiment.
Here we explain the computations for the shallow model of td-VAE. The aggregation RNN is an LSTM that consumes observations $x_t$ and updates the state $b_t$. The updates are the standard LSTM updates:

$i_t = \sigma(W_i [b_{t-1}, x_t] + b_i)$

$f_t = \sigma(W_f [b_{t-1}, x_t] + b_f)$

$o_t = \sigma(W_o [b_{t-1}, x_t] + b_o)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [b_{t-1}, x_t] + b_c)$

$b_t = o_t \odot \tanh(c_t)$
Here the square bracket denotes concatenation and the $W$'s and $b$'s are weights and biases. We use the same letters for all of them for simplicity, but they all have different values. We will use the same notation for the rest of the supplement.
All the remaining distributions (the belief $p_B(z \mid b)$, the inference $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$, the forward model $p(z_{t_2} \mid z_{t_1})$ and the decoder $p(x \mid z)$) are the same functions (with different weights): a one layer MLP with rectified linear units:

$(\mu, \log \sigma^2) = W_2\, \mathrm{relu}(W_1 [\text{inputs}] + b_1) + b_2$

where the output is a vector that is a concatenation of the mean and the logarithm of the variance of a Gaussian distribution. If there are several inputs, they are concatenated.
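A minimal sketch of such a Gaussian-output MLP (names and sizes are illustrative):

```python
import numpy as np

def gaussian_mlp(params, inputs):
    # One-layer MLP with ReLU whose output is split into the mean and the
    # log-variance of a diagonal Gaussian; multiple inputs are concatenated.
    x = np.concatenate(inputs)
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])
    out = params["W2"] @ h + params["b2"]
    mu, logvar = np.split(out, 2)
    return mu, logvar

def sample(mu, logvar, rng):
    # Reparameterized sample from the diagonal Gaussian.
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
in_dim, hid, z_dim = 10, 32, 4
params = {"W1": rng.standard_normal((hid, in_dim)) * 0.1,
          "b1": np.zeros(hid),
          "W2": rng.standard_normal((2 * z_dim, hid)) * 0.1,
          "b2": np.zeros(2 * z_dim)}
mu, logvar = gaussian_mlp(params, [np.ones(6), np.ones(4)])
z = sample(mu, logvar, rng)
```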
Appendix C Deep model mathematical definition.
We write the computations defining the deep model, Figure 3 of the text. The computations at a given layer $l$ (proceeding from the top layer downwards) are as follows:
Here the $f$'s are functions with internal weights that are different for each instance and for each layer, and whose outputs parameterize probability distributions. The top layer states are initialized as learned biases, and the inputs from the layer above are absent at the top layer.
After all the layers are executed, the reconstruction loss $-\log p(x_{t_2} \mid z_{t_2})$ is computed and all the losses are added together into the total loss $\mathcal{L}$, which is the loss that we minimize.
These are the computations which we use for the DeepMind Lab experiments; they can be done by executing each layer sequentially before proceeding to the next one. More generally, however, we can first compute the $z_{t_2}^l$ in all the layers, then proceed to all the computations at time $t_1$, making them dependent on the $z_{t_2}^l$'s in all the layers, and then compute the forward model, having it depend on all the $z_{t_1}^l$'s. This is the setting we use for the harmonic oscillator.
Appendix D Deep model functional forms.
For the model trained on the one dimensional input coming from the harmonic oscillator we use the following functional forms. The aggregation RNN is an LSTM. The belief net is a one layer MLP with tanh nonlinearity:

$(\mu, \log \sigma^2) = W_2 \tanh(W_1 b + b_1) + b_2$

where the output is a vector that is a concatenation of the mean and the logarithm of the variance of a Gaussian distribution.
For the inference distribution $q(z_{t_1} \mid z_{t_2}, b_{t_1}, b_{t_2})$ and the forward model distribution $p(z_{t_2} \mid z_{t_1})$, we use a function of the same form that additionally receives $\delta t$, the time separation between the points; the multiplication between a function of $\delta t$ and the hidden units is point-wise.
For the decoder we simply extract the first component of $z$ at the first layer for the mean of the observation, and use a learned bias as the variance. This simple form stabilizes the training.
For the DeepMind Lab experiments, all the circles in Figure 3 are LSTMs. For the rightward moving blue circles, we use fully connected LSTMs, and for the remaining ones we use convolutional ones. Convolutional LSTM updates are the same as the fully connected ones above, except: 1. The states are three dimensional (or four if we include the batch dimension): spatial plus feature dimensions. 2. The weight operators are convolutional from spatial layers to spatial layers, and fully connected followed by broadcasting from one dimensional to three dimensional layers.
We use fully connected LSTM of size and convolutional layers of size . All kernel sizes are . The decoder layer has an extra canvas layer, similar to DRAW.
- Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
- Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
- Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
- Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
- John T Betts. Survey of numerical methods for trajectory optimization. Journal of guidance, control, and dynamics, 21(2):193–207, 1998.
- Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
- Lars Buesing, Theophane Weber, Sebastien Racaniere, SM Eslami, Danilo Rezende, David P Reichert, Fabio Viola, Frederic Besse, Karol Gregor, Demis Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
- Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
- Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
- Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472, 2011.
- Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pp. 2199–2207, 2016.
- Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.
- David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
- Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
- Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
- Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork rnn. arXiv preprint arXiv:1402.3511, 2014.
- Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
- Hao Liu, Lirong He, Haoli Bai, and Zenglin Xu. Efficient structured inference for stochastic recurrent neural networks. 2017.
- Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.
- Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
- Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5694–5705, 2017.
- Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
- Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pp. 3295–3301, 2017.
- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
- Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- Arun Venkatraman, Nicholas Rhinehart, Wen Sun, Lerrel Pinto, Martial Hebert, Byron Boots, Kris Kitani, and J Bagnell. Predictive-state decoders: Encoding the future into recurrent networks. In Advances in Neural Information Processing Systems, pp. 1172–1183, 2017.
- Stephan Zheng, Yisong Yue, and Jennifer Hobbs. Generating long-term trajectories using deep hierarchical networks. In Advances in Neural Information Processing Systems, pp. 1543–1551, 2016.