Stochastic Variational Video Prediction
Abstract
Predicting the future in real-world settings, particularly from raw sensory observations such as images, is exceptionally challenging. Real-world events can be stochastic and unpredictable, and the high dimensionality and complexity of natural images require the predictive model to build an intricate understanding of the natural world. Many existing methods tackle this problem by making simplifying assumptions about the environment. One common assumption is that the outcome is deterministic and there is only one plausible future. This can lead to low-quality predictions in real-world settings with stochastic dynamics. In this paper, we develop a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables. To the best of our knowledge, our model is the first to provide effective stochastic multi-frame prediction for real-world videos. We demonstrate the capability of the proposed method in predicting detailed future frames of videos on multiple real-world datasets, both action-free and action-conditioned. We find that our proposed method produces substantially improved video predictions when compared to the same model without stochasticity, and to other stochastic video prediction methods. Our SV2P implementation will be open sourced upon publication.
1 Introduction
Understanding the interaction dynamics of objects and predicting what happens next is one of the key human capabilities that we rely on heavily to make decisions in everyday life (Bubic et al., 2010). A model that can accurately predict future observations of complex sensory modalities such as vision must internally represent the complex dynamics of real-world objects and people, and therefore is more likely to acquire a representation that can be used for a variety of visual perception tasks, such as object tracking and action recognition (Srivastava et al., 2015; Lotter et al., 2017; Denton & Birodkar, 2017). Furthermore, such models can be inherently useful themselves, for example, to allow an autonomous agent or robot to decide how to interact with the world to bring about a desired outcome (Oh et al., 2015; Finn & Levine, 2017).
However, modeling future distributions over images is a challenging task, given the high dimensionality of the data and the complex dynamics of the environment. Hence, it is common to make various simplifying assumptions. One particularly common assumption is that the environment is deterministic and that there is only one possible future (Chiappa et al., 2017; Srivastava et al., 2015; Boots et al., 2014; Lotter et al., 2017). Models conditioned on the actions of an agent frequently make this assumption, since the world is more deterministic in these settings (Oh et al., 2015; Finn et al., 2016). However, most real-world prediction tasks, including the action-conditioned settings, are in fact not deterministic, and a deterministic model can lose many of the nuances that are present in real physical interactions. Given the stochastic nature of video prediction, any deterministic model is obliged to predict a statistic of all the possible outcomes. For example, deterministic models trained with a mean squared error loss function generate the expected value of all the possibilities for each pixel independently, which is inherently blurry (Mathieu et al., 2016).
Our main contribution in this paper is a stochastic variational method for video prediction, named SV2P, that predicts a different plausible future for each sample of its latent random variables. We also provide a stable training procedure for a neural-network-based implementation of this method. To the best of our knowledge, SV2P is the first latent variable model to successfully predict multiple frames in real-world settings. Our model also supports action-conditioned predictions, while still being able to predict stochastic outcomes of ambiguous actions, as exemplified in our experiments. We evaluate SV2P on multiple real-world video datasets, as well as a carefully designed toy dataset that highlights the importance of stochasticity in video prediction (see Figure 1). In both our qualitative and quantitative comparisons, SV2P produces substantially improved video predictions when compared to the same model without stochasticity, with respect to standard metrics such as PSNR and SSIM. The stochastic nature of SV2P is most apparent when viewing the predicted videos. Therefore, we highly encourage the reader to check the project website https://goo.gl/iywUHc to view the actual videos of the experiments. The TensorFlow (Abadi et al., 2016) implementation of this project will be open sourced upon publication.
2 Related Work
A number of prior works have addressed video frame prediction while assuming deterministic environments (Ranzato et al., 2014; Srivastava et al., 2015; Vondrick et al., 2015; Xingjian et al., 2015; Boots et al., 2014; Lotter et al., 2017). In this work, we build on the deterministic video prediction model proposed by Finn et al. (2016), which generates the future frames by predicting the motion flow of dynamically masked out objects extracted from the previous frames. Similar transformation-based models were also proposed by De Brabandere et al. (2016) and Liu et al. (2017). Prior work has also considered alternative objectives for deterministic video prediction models to mitigate the blurriness of the predicted frames and produce sharper predictions (Mathieu et al., 2016; Vondrick & Torralba, 2017). Despite the adversarial objective, Mathieu et al. (2016) found that injecting noise did not lead to stochastic predictions, even for predicting a single frame. Oh et al. (2015) and Chiappa et al. (2017) make sharp video predictions by assuming deterministic outcomes in video games given the actions of the agents. However, this assumption does not hold in real-world settings, which almost always have stochastic dynamics.
Autoregressive models have been proposed for modeling the joint distribution of the raw pixels (Kalchbrenner et al., 2017). Although these models predict sharp images of the future, their training and inference time is extremely high, making them difficult to use in practice. Reed et al. (2017) proposed a parallelized multi-scale algorithm that significantly improves the training and prediction time but still requires more than a minute to generate one second of video on a GPU. Our comparisons suggest that the predictions from these models are sharp, but noisy, and that our method produces substantially better predictions, especially for longer horizons.
Another approach for stochastic prediction uses generative adversarial networks (GANs) (Goodfellow et al., 2014), which have been used for video generation and prediction (Tulyakov et al., 2017; Li et al., 2017). Vondrick et al. (2016) and Chen et al. (2017) applied adversarial training to predict video from a single image. Although GANs generate sharp images, they tend to suffer from mode collapse (Goodfellow, 2016), particularly in conditional generation settings (Zhu et al., 2017).
Variational autoencoders (VAEs) (Kingma & Welling, 2014) have also been explored for stochastic prediction tasks. Xue et al. (2016) predict a single stochastic frame using cross convolutional networks in a VAE-like architecture. Shu et al. (2016) use conditional VAEs and Gaussian mixture priors for stochastic prediction. Both of these works have been evaluated solely on synthetic datasets with simple moving sprites and no object interaction. Real images significantly complicate video prediction due to the diversity and variety of stochastic events that can occur. Fragkiadaki et al. (2017) compared various architectures for multimodal motion forecasting and one-frame video prediction, including variational inference and straightforward sampling from the prior. Unlike these prior models, our focus is on designing a multi-frame video prediction model to produce stochastic predictions of the future. Multi-frame prediction is dramatically harder than single-frame prediction, since complex events such as collisions require multiple frames to fully resolve, and single-frame predictions can simply ignore this complexity. We believe our approach is the first latent variable model to successfully demonstrate stochastic multi-frame video prediction on real-world datasets.
3 Stochastic Variational Video Prediction (SV2P)
In order to construct our stochastic variational video prediction model, we first formulate a probabilistic graphical model that explains the stochasticity in the video. Since our goal is to perform conditional video prediction, the predictions are conditioned on a set of $c$ context frames $x_0, \ldots, x_{c-1}$ (e.g., if we are conditioning on one frame, $c = 1$), and our goal is to sample from $p(x_{c:T} \mid x_{0:c-1})$, where $x_i$ denotes the $i$-th frame of the video (Figure 2).
Video prediction is stochastic as a consequence of the latent events that are not observable from the context frames alone. For example, when a robot's arm pushes a toy on a table, the unknown weight of that toy affects how it moves. We therefore introduce a vector of latent variables $z$ into our model, distributed according to a prior $z \sim p(z)$, and build a model $p(x_{c:T} \mid x_{0:c-1}, z)$. This model is still stochastic but uses a more general representation, such as a conditional Gaussian, to explain just the noise in the image, while $z$ accounts for the more complex stochastic phenomena. We can then factorize this model to $\prod_{t=c}^{T} p_\theta(x_t \mid x_{0:t-1}, z)$. Learning then involves training the parameters $\theta$ of these factors, which we assume to be shared between all the time steps.
At inference time we need to estimate values for the true posterior $p(z \mid x_{0:T})$, which is intractable due to its dependency on $p(x_{0:T})$. We overcome this problem by approximating the posterior with an inference network $q_\phi(z \mid x_{0:T})$ that outputs the parameters of a conditionally Gaussian distribution $\mathcal{N}(\mu_\phi(x_{0:T}), \sigma_\phi(x_{0:T}))$. This network is trained using the reparameterization trick (Kingma & Welling, 2014), according to:
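$$z = \mu_\phi(x_{0:T}) + \sigma_\phi(x_{0:T}) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (1)$$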
Here, $\theta$ and $\phi$ are the parameters of the generative model and inference network, respectively. To learn these parameters, we can optimize the variational lower bound, as in the variational autoencoder (VAE) (Kingma & Welling, 2014):
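$$\mathcal{L}(x) = -\mathbb{E}_{q_\phi(z \mid x_{0:T})}\left[\log p_\theta(x_{c:T} \mid x_{0:c-1}, z)\right] + D_{KL}\left(q_\phi(z \mid x_{0:T}) \,\|\, p(z)\right) \qquad (2)$$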
where $D_{KL}$ is the Kullback-Leibler divergence between the approximated posterior and the assumed prior $p(z)$, which in our case is the standard Gaussian $\mathcal{N}(0, I)$.
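As a minimal sketch of how the reparameterized sample and the two loss terms might be computed, the NumPy snippet below assumes a diagonal Gaussian posterior, uses the standard closed-form KL divergence to $\mathcal{N}(0, I)$, and stands in a summed squared error for the reconstruction term (a fixed-variance Gaussian likelihood up to constants). The function names and the 64-dimensional latent are illustrative assumptions, not part of the SV2P implementation.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def variational_loss(x_true, x_pred, mu, sigma):
    """Single-sample estimate: reconstruction error plus KL divergence.

    The squared error is an assumed stand-in for the negative log-likelihood.
    """
    reconstruction = np.sum((x_true - x_pred) ** 2)
    return reconstruction + kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
mu = np.zeros(64)       # hypothetical posterior mean
sigma = np.ones(64)     # hypothetical posterior standard deviation
z = sample_latent(mu, sigma, rng)
```

Because the noise `eps` is drawn independently of `mu` and `sigma`, gradients can flow through both parameters when the same computation is expressed in an automatic differentiation framework.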
In Equation 2, the first term on the RHS represents the reconstruction loss while the second term represents the divergence of the variational posterior from the prior on the latent variable. It is important to emphasize that the approximated posterior is conditioned on all of the frames, including the future frames $x_{c:T}$. This is feasible during training, since $x_{c:T}$ is available at training time, while at test time we can sample the latents from the assumed prior. Since the aim in our method is to recover latent variables that correspond to events which might explain the variability in the videos, we found that it is in fact crucial to condition the inference network on future frames. At test time, the latent variables are simply sampled from the prior which corresponds to a smoothing-like inference process. In principle, we could also perform a filtering-like inference procedure of the form $q_\phi(z \mid x_{0:t-1})$ for time step $t$ to infer the most likely latent variables based only on the conditioning frames, instead of sampling from the prior, which could produce more accurate predictions at test time. However, it would be undesirable to use a filtering process at training time: in order to incentivize the forward prediction network to make use of the latent variables, they must contain some information that is useful for predicting future frames that is not already present in the context frames. If they are predicted entirely from the context frames, no such information is present, and indeed we found that a purely filtering-based model simply ignores the latent variables.
So far, we have assumed that the latent events are constant over the entire video. We can relax this assumption by conditioning prediction on a time-variant latent variable $z_t$ that is sampled at every time step from $p(z)$. The generative model then becomes $p_\theta(x_t \mid x_{0:t-1}, z_t)$ and, assuming a fixed posterior, the inference model will be approximated by $q_\phi(z_t \mid x_{0:T})$, where the model parameters $\phi$ are shared across time. In practice, the only difference between these two formulations is the frequency of sampling $z$ from $p(z)$ and $q_\phi(z \mid x_{0:T})$. In the time-invariant version, we sample $z$ once per video, whereas with the time-variant latent, sampling happens every frame. The main benefit of the time-variant latent variable is better generalization beyond $T$, since the model does not have to encode all the events of the video in one vector $z$. We provide an empirical comparison of these formulations in Section 5.2.
In action-conditioned settings, we modify the generative model to be conditioned on the action vector $a_t$. This results in $p_\theta(x_t \mid x_{0:t-1}, z, a_t)$ as the generative model while keeping the posterior approximation intact. Conditioning the outcome on actions can decrease future variability; however, it will not eliminate it if the environment is inherently stochastic or the actions are ambiguous. In this case, the model is still capable of predicting stochastic outcomes in a narrower range of possibilities.
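The difference between the two formulations can be sketched as follows. This is an illustrative NumPy example under stated assumptions: `toy_predictor` is a hypothetical placeholder for the generative network, and the latent dimension and frame size are arbitrary.

```python
import numpy as np

def sample_latents(num_steps, latent_dim, time_variant, rng):
    """Return one latent per predicted frame, shape (num_steps, latent_dim)."""
    if time_variant:
        # A fresh z_t ~ N(0, I) at every time step.
        return rng.standard_normal((num_steps, latent_dim))
    # A single z ~ N(0, I) shared by all time steps.
    z = rng.standard_normal(latent_dim)
    return np.tile(z, (num_steps, 1))

def rollout(context, latents, predict_next):
    """Predict one frame per latent, feeding earlier predictions back in."""
    frames = list(context)
    for z_t in latents:
        frames.append(predict_next(frames, z_t))
    return np.stack(frames[len(context):])

def toy_predictor(frames, z_t):
    # Hypothetical placeholder dynamics: shift the last frame by the latent mean.
    return frames[-1] + z_t.mean()

rng = np.random.default_rng(0)
context = [np.zeros((4, 4))]  # one context frame (c = 1)
z_fixed = sample_latents(5, 8, time_variant=False, rng=rng)
z_per_frame = sample_latents(5, 8, time_variant=True, rng=rng)
video = rollout(context, z_fixed, toy_predictor)
```

Since the rollout consumes one latent per step, predicting past the training horizon with the time-variant latent only requires drawing more samples, whereas the time-invariant latent reuses the same vector for every additional frame.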
3.1 Model Architecture
To model the approximated posterior $q_\phi(z \mid x_{0:T})$ we used a deep convolutional neural network as shown in the top row of Figure 3. Since we assumed a diagonal Gaussian distribution for $q_\phi(z \mid x_{0:T})$, this network outputs the mean $\mu_\phi$ and standard deviation $\sigma_\phi$ of the approximated posterior. Since the entire inference network is convolutional, the predicted parameters are single-channel response maps. We assume the entries of these response maps are pairwise independent, forming the latent vector $z$. The latent value is then sampled using Equation 1. As discussed before, this sampling happens every frame for the time-varying latent, and once per video in the time-invariant case.
For $p_\theta(x_t \mid x_{0:t-1}, z)$, we used the CDNA architecture proposed by Finn et al. (2016), which is a deterministic convolutional recurrent network that predicts the next frame $x_t$ given the previous frame $x_{t-1}$ and an optional action $a_t$. This model constructs the next frames by predicting the motions of segments of the image (i.e., objects) and then merging these predictions via masking. Although this model directly outputs pixels, it is partially appearance-invariant and can generalize to unseen objects (Finn et al., 2016). To condition on the latent value, we modify the CDNA architecture by stacking $z$ as an additional channel on the tiled action $a_t$.
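The stacking step can be illustrated with a short NumPy sketch. The 8x8 spatial size, the 4-dimensional action, and the function name are assumptions made for the example; where the resulting tensor enters the CDNA network is not shown here.

```python
import numpy as np

def condition_on_action_and_latent(action, z_map):
    """Tile the action over the spatial grid and append z as one extra channel.

    action: (action_dim,) vector a_t.
    z_map:  (height, width) single-channel latent response map.
    Returns an array of shape (height, width, action_dim + 1).
    """
    height, width = z_map.shape
    tiled_action = np.broadcast_to(action, (height, width, action.shape[0]))
    return np.concatenate([tiled_action, z_map[..., None]], axis=-1)

rng = np.random.default_rng(0)
action = np.array([0.1, -0.2, 0.0, 0.5])  # hypothetical 4-D action
z_map = rng.standard_normal((8, 8))       # hypothetical 8x8 latent map
features = condition_on_action_and_latent(action, z_map)
```

Every spatial location sees the same action vector, while the latent channel can differ per location.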
3.2 Training Procedure
Our model can be trained end-to-end. However, our experiments show that naïve training usually results in the model ignoring the latent variables and converging to a suboptimal deterministic solution (Figure 4). Therefore, we train the model end-to-end in three phases, as follows:

Training the generative network: In this phase, the inference network is disabled and the latent value $z$ is randomly sampled from $\mathcal{N}(0, I)$. The intuition behind this phase is to train the generative model to predict the future frames deterministically (i.e., modeling $p_\theta(x_t \mid x_{0:t-1})$).

Training the inference network: In the second phase, the inference network is trained to estimate the approximate posterior $q_\phi(z \mid x_{0:T})$; however, the KL-loss is set to $0$. This means that the model can use the latent value without being penalized for diverging from $p(z)$. As seen in Figure 4, this phase results in very low reconstruction error; however, it is not usable at test time since $D_{KL}\left(q_\phi(z \mid x_{0:T}) \,\|\, p(z)\right) \gg 0$ and sampling $z$ from the assumed prior will be inaccurate.

Divergence reduction: In the last phase, the KL-loss is added, resulting in a sudden drop of KL-divergence and an increase of reconstruction error. The reconstruction loss converging to a value lower than the first phase and the KL-loss converging to zero are indicators of successful training. This means that $z$ can be sampled from $p(z)$ at test time for effective stochastic prediction.
To gradually transition from the second phase to the third, we add a multiplier $\beta$ to the KL-loss that is set to zero during the first two phases and then increased slowly in the last phase. This is similar to the $\beta$ hyperparameter in Higgins et al. (2016) that is used to balance latent channel capacity and independence constraints with reconstruction accuracy.
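The three-phase schedule above could be sketched as follows. The phase lengths, the final multiplier value, and the linear shape of the ramp are all assumptions for illustration; the text only states that the multiplier is zero in the first two phases and then increased slowly.

```python
def kl_multiplier(step, phase1_steps, phase2_steps, ramp_steps, beta_max):
    """KL weight: zero during phases 1 and 2, then a linear ramp up to beta_max."""
    phase3_start = phase1_steps + phase2_steps
    if step < phase3_start:
        return 0.0
    progress = min(1.0, (step - phase3_start) / ramp_steps)
    return beta_max * progress

def use_inference_network(step, phase1_steps):
    """Phase 1 samples z from N(0, I); later phases sample z from q_phi."""
    return step >= phase1_steps

# Hypothetical schedule: 100 steps per phase, 50-step ramp, final weight 1e-3.
schedule = [kl_multiplier(s, 100, 100, 50, 1e-3) for s in range(300)]
```

A training loop would then use the inference network only when `use_inference_network` returns true, and weight the KL term of the loss by `kl_multiplier` at each step.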
We found that this training procedure is quite stable and the model almost always converges to the desired parameters. To demonstrate this stability, we trained the model with and without the proposed training procedure, five times each. Figure 4 shows the average and standard deviation of reconstruction loss at the end of these training sessions. Naïve training results in a slightly better error compared to Finn et al. (2016), but with high variance. When following the proposed training algorithm, the model consistently converges to a much lower reconstruction error.
4 Stochastic movement dataset
To highlight the importance of stochasticity in video prediction, we created a toy video dataset with intentionally stochastic motion. Each video in this dataset is four frames long. The first frame contains a random shape (triangle, rectangle or circle) with random size and color, centered in the frame, which then randomly moves in one of eight directions (up, down, left, right, up-left, up-right, down-left, down-right). Each frame is and the background is static gray. The main intuition behind this design is that, given only the first frame, a model can figure out the shape, color, and size of the moving object, but not its movement direction.
We train Finn et al. (2016) and SV2P to predict the future frames, given only the first frame. Figure 1 shows the video predictions from these two models. Since Finn et al. (2016) is a deterministic model with mean squared error as loss, it predicts the average of all possible outcomes, as expected. In contrast, SV2P predicts different possible futures for each sample of the latent variable $z \sim \mathcal{N}(0, I)$.
To demonstrate that the inference network is working properly and that the latent variable does indeed learn to store the information necessary for stochastic prediction (i.e., the direction of movement), we include predicted futures when $z \sim q_\phi(z \mid x_{0:T})$. By estimating the correct parameters of the latent distribution, using the inference network, the model always generates the right outcome. However, this cannot be used in practice, since the inference network requires access to all the frames, including the ones in the future. Instead, $z$ will be sampled from the assumed prior $p(z)$.
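A dataset of this kind could be generated along the following lines. This is a sketch under stated assumptions rather than the generator used for the paper: the frame size, the shape size range, and the per-frame displacement `step` are hypothetical parameters.

```python
import numpy as np

# (d_row, d_col): up, down, left, right, and the four diagonals.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def shape_mask(kind, size, frame_size, center):
    """Boolean mask of a shape with half-extent `size` centered at `center`."""
    rows, cols = np.mgrid[0:frame_size, 0:frame_size]
    dr, dc = rows - center[0], cols - center[1]
    if kind == "rectangle":
        return (np.abs(dr) <= size) & (np.abs(dc) <= size)
    if kind == "circle":
        return dr**2 + dc**2 <= size**2
    if kind == "triangle":
        # Upright triangle: the width grows linearly from apex to base.
        return (np.abs(dr) <= size) & (np.abs(dc) <= (dr + size) / 2)
    raise ValueError(kind)

def make_video(rng, frame_size=64, num_frames=4, step=4):
    """Return (video, direction); video has shape (num_frames, H, W, 3)."""
    kind = rng.choice(["triangle", "rectangle", "circle"])
    size = int(rng.integers(4, 9))
    color = rng.uniform(0.0, 1.0, size=3)
    d_row, d_col = DIRECTIONS[rng.integers(len(DIRECTIONS))]
    video = np.full((num_frames, frame_size, frame_size, 3), 0.5)  # gray
    for t in range(num_frames):
        center = (frame_size // 2 + t * step * d_row,
                  frame_size // 2 + t * step * d_col)
        video[t][shape_mask(kind, size, frame_size, center)] = color
    return video, (d_row, d_col)

rng = np.random.default_rng(0)
video, direction = make_video(rng)
```

The first frame is identical for all eight directions of a given shape, so the direction cannot be inferred from the context frame alone.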
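The averaging behavior of a mean squared error objective can be checked numerically on a simplified stand-in for this dataset. In the sketch below, the eight equally likely futures are represented by eight one-hot vectors (an assumption made for brevity, not the actual frames).

```python
import numpy as np

# Eight equally likely futures: one bright pixel at one of eight positions.
outcomes = np.eye(8)

def expected_mse(prediction, outcomes):
    """Mean squared error averaged over all equally likely outcomes."""
    return np.mean((outcomes - prediction) ** 2)

# The per-pixel average of all futures: every pixel is equally dim ("blur").
mean_prediction = outcomes.mean(axis=0)
mse_of_mean = expected_mse(mean_prediction, outcomes)

# Committing to any single sharp future scores worse on average.
mse_of_single = expected_mse(outcomes[0], outcomes)
```

The blurred average achieves the lower expected error, which is why a deterministic model trained with this loss has no incentive to commit to one sharp outcome.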
5 Experiments
To evaluate SV2P, we test it on three real-world video datasets by comparing it to the CDNA model (Finn et al., 2016), as a deterministic baseline, as well as a baseline that outputs the last seen frame as the prediction. We also compare SV2P with an autoregressive stochastic model, video pixel networks (VPN) (Kalchbrenner et al., 2017). We use the parallel multi-resolution implementation of VPN from Reed et al. (2017), which is an order of magnitude faster than the original VPN, but still requires more than a minute to generate one second of video. In all of these experiments, we plot the results of sampling the latent once per video (SV2P time-invariant latent) and once per frame (SV2P time-variant latent). We strongly encourage readers to view https://goo.gl/iywUHc for videos of the results, which are more illustrative than printed frames.
5.1 Datasets
We quantitatively and qualitatively evaluate SV2P on the following real-world datasets:

BAIR robot pushing dataset (Ebert et al., 2017): This dataset contains action-conditioned videos collected by a Sawyer robotic arm pushing a variety of objects. All of the videos in this dataset have similar tabletop settings with a static background. An interesting property of this dataset is that the arm movements are quite unpredictable in the absence of actions (compared to the robot pushing dataset (Finn et al., 2016), in which the arm moves to the center of the bin). For this dataset, we train the models to predict the next ten frames given the first two, both in action-conditioned and action-free settings.

Human3.6M (Ionescu et al., 2014): Humans and animals are among the most interesting sources of stochasticity in natural videos, since they behave in complex ways as a consequence of unpredictable intentions. To study human motion prediction, we use the Human3.6M dataset, which consists of actors performing various actions in a room. We used the preprocessing and testing format of Finn et al. (2016): a 10 Hz frame rate and 10-frame prediction given the previous ten.

Robotic pushing prediction (Finn et al., 2016): We use the robot pushing prediction dataset to compare SV2P with another stochastic prediction method, video pixel networks (VPNs) (Kalchbrenner et al., 2017). VPNs demonstrated excellent results on this dataset in prior work, and therefore the robot pushing dataset provides a strong point of comparison. However, in contrast to our method, VPNs do not include latent stochastic variables that represent random events, and rely on an expensive autoregressive architecture. In this experiment, the models have been trained to predict the next ten frames, given the first two.
5.2 Quantitative Comparison
In our quantitative evaluation, we aim to understand whether the range of possible futures captured by our stochastic model includes the true future. Models that are more stochastic do not necessarily score better on average standard metrics such as PSNR (Huynh-Thu & Ghanbari, 2008) and SSIM (Wang et al., 2004). However, if we are interested primarily in understanding whether the true outcome is within the set of predictions, we can instead evaluate the score of the best sample among multiple random samples from the prior. We argue that this is a better metric for stochastic models, since it allows us to understand if uncertain futures contain the true outcome. Figure 5 illustrates how this metric changes with different numbers of samples. By predicting more possible futures, the probability of predicting the true outcome increases, and therefore it is more likely to get a sample with higher PSNR compared to the ground truth. Of course, as with all video prediction metrics, it is imperfect, and is only suitable for understanding the performance of the model when combined with a visual examination of the qualitative results in Section 5.3.
To use this metric, we sample 100 latent values from the prior, use them to predict 100 videos, and show the result of the sample with the highest PSNR. For a fair comparison to VPN, we use the same best out of 100 samples for our stochastic baseline. Since even the fast implementation of VPN is quite slow, we limit the comparison with VPN to only the last dataset, with 256 test samples.
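The best-of-N selection can be sketched in a few lines of NumPy. The example scores each sampled video as a whole with a single PSNR value and uses noisy copies of a random target as stand-in "predictions"; both choices are assumptions made for illustration, not the evaluation code behind the reported figures.

```python
import numpy as np

def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for values scaled to [0, max_value]."""
    mse = np.mean((prediction - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value**2 / mse)

def best_of_n_psnr(samples, target):
    """Score every sampled video against the ground truth and keep the best."""
    scores = np.array([psnr(s, target) for s in samples])
    best = int(np.argmax(scores))
    return best, scores[best]

rng = np.random.default_rng(0)
target = rng.uniform(size=(10, 16, 16, 3))  # stand-in ground-truth video
# Stand-in for 100 sampled predictions: the target corrupted by varying noise.
noise_levels = np.linspace(0.05, 0.5, 100)
samples = [np.clip(target + s * rng.standard_normal(target.shape), 0.0, 1.0)
           for s in noise_levels]
best_index, best_score = best_of_n_psnr(samples, target)
```

Because the maximum is taken over a growing set, the best-of-N score can only stay the same or improve as more samples are drawn.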
Figure 6 displays the quantitative comparison of the predictions on all of the datasets. In this graph, the top row is a PSNR comparison and the bottom row is SSIM, while each column represents a different dataset. To evaluate the generalization of the models beyond what they have been trained for, we generate more frames than what the models observed during training time. The length of the training sequences is marked by a vertical separator in all of the graphs, and the results beyond this line represent extrapolation to longer sequences.
Overall, SV2P with both time-variant and time-invariant latent sampling outperforms all of the other baselines, predicting higher quality videos with higher PSNR and SSIM. Time-varying latent sampling is more stable beyond the time horizon used during training (Figure 6b). One possible explanation for this behaviour is that the time-invariant latent has to include the information required for predicting all the frames and therefore, beyond the training horizon, it collapses. This issue is mitigated by a time-variant latent variable, which takes a different value at each time step. One other interesting observation is that the time-invariant model outperforms the time-variant model on the Human3.6M dataset. In this dataset, the most important latent event, the action performed by the actor, is consistent across the whole video, which is easier to capture using a time-invariant latent.
5.3 Qualitative Comparison
We can better understand the performance of the proposed model by visual examination of the qualitative results. We highlight some of the most important and observable differences in predictions by different models in the qualitative figures for the BAIR action-free, BAIR action-conditioned, Human3.6M, and robot pushing experiments (the videos of these experiments can be found at the project website, https://goo.gl/iywUHc). In all of these figures, the x-axis is time (i.e., each row is one video). The first row is the ground truth video, and the second row is the result of Finn et al. (2016). The result of sampling the latent from the approximated posterior is provided in the third row. For stochastic methods, we show the best (highest PSNR) and worst (lowest PSNR) predictions out of 100 samples (as discussed in Section 5.2), as well as two random predicted videos from our model.
The BAIR action-free figure illustrates two examples from the BAIR robot pushing dataset in the action-free setting. As a consequence of the high stochasticity in the movement of the arm in the absence of actions, Finn et al. (2016) only blurs the arm out, while SV2P predicts varied but coherent movements of the arm. Note that, although each predicted movement of the arm is random, it is still in the valid range of possible outcomes (i.e., there is no sudden jump of the arm nor random movement of the objects). The proposed model also learned how to move objects in cases where they have been pushed by the predicted movements of the arm, as can be seen in the zoomed images of both samples.
In the action-conditioned setting (the BAIR action-conditioned figure), the differences are more subtle: the range of possible outcomes is narrower, but we can still observe stochasticity in the behavior of the pushed objects. Interactions between the arm and objects are uncertain due to ambiguity in depth, friction, and mass, and SV2P is able to capture some of this variation. Since these variations are subtle and occupy a smaller part of the images, we illustrate them with zoomed insets in the same figure. Some examples of varied object movements can be found in the last three rows of the right example. SV2P also generates sharper outputs compared to Finn et al. (2016), as is evident in the left example.
The Human3.6M figure displays two examples from the Human3.6M dataset. In the absence of actions, but given more context frames, Finn et al. (2016) manages to separate the foreground from the background, but cannot accurately predict what happens next. This results in distorted or blurred foregrounds. On the other hand, SV2P predicts a variety of different outcomes, and moves the actor accordingly. Note that PSNR and SSIM measure reconstruction loss with respect to the ground truth and may not generally indicate a better prediction. For some applications, a prediction with lower PSNR/SSIM might have higher quality and be more interesting. A good example is the prediction with the worst PSNR in the right example of the Human3.6M figure, where the model predicts that the actor is spinning in his chair with relatively high quality. However, this output has the lowest PSNR compared to the ground truth.
Finally, the robot pushing figure demonstrates results on the Google robot pushing dataset. The qualitative results in that figure and the quantitative results in Figure 6 both indicate that SV2P produces substantially better predictions than VPNs. The quantitative results suggest that our best-of-100 metric is a reasonable measure of performance: the VPN predictions are noisier, but simply increasing noise is not sufficient to increase the quality of the best sample. The stochasticity in our predictions is more coherent, corresponding to differences in object or arm motion, while much of the stochasticity in the VPN predictions resembles noise in the image, as well as visible artifacts when predicting for substantially longer time horizons.
6 Conclusion
We proposed stochastic variational video prediction (SV2P), an approach for multi-step video prediction based on variational inference. Our primary contributions include an effective stochastic prediction method with latent variables, a network architecture that succeeds on natural videos, and a training procedure that provides for stable optimization. The source code for our method will be released upon acceptance. We evaluated our proposed method on three real-world datasets in action-conditioned and action-free settings, as well as one toy dataset which has been carefully designed to highlight the importance of stochasticity in video prediction. Both qualitative and quantitative results indicate higher quality predictions compared to other deterministic and stochastic baselines.
SV2P can be expanded in numerous ways. First, the current inference network design is fully convolutional, which exposes multiple limitations, such as unmodeled spatial correlations between the latent variables. The model could be improved by incorporating the spatial correlation induced by the convolutions into the prior, using a learned structured prior in place of the standard spherical Gaussian. A time-variant posterior approximation, which would reflect the new information that is revealed as the video progresses, is another possible SV2P improvement. However, as discussed in Section 3, this requires incentivizing the inference network to incorporate the latent information at training time. This would allow time-variant latent distributions, which are more aligned with generative neural models for time series (Johnson et al., 2016; Gao et al., 2016; Krishnan et al., 2017).
Another exciting direction for future research would be to study how stochastic predictions can be used to act in the real world, producing model-based reinforcement learning methods that can execute risk-sensitive behaviors from raw image observations. Accounting for risk in this way could be especially important in safety-critical settings, such as robotics.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
 Boots et al. (2014) Byron Boots, Arunkumar Byravan, and Dieter Fox. Learning predictive models of a depth camera & manipulator from raw execution traces. In International Conference on Robotics and Automation (ICRA), 2014.
 Bubic et al. (2010) Andreja Bubic, D Yves Von Cramon, and Ricarda I Schubotz. Prediction, cognition and the brain. Frontiers in human neuroscience, 4, 2010.
 Chen et al. (2017) Baoyang Chen, Wenmin Wang, Jinzhuo Wang, Xiongtao Chen, and Weimian Li. Video imagination from a single image with transformation generation. arXiv preprint arXiv:1706.04124, 2017.
 Chiappa et al. (2017) Silvia Chiappa, Sébastien Racanière, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
 De Brabandere et al. (2016) Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
 Denton & Birodkar (2017) Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.
 Ebert et al. (2017) Frederik Ebert, Chelsea Finn, Alex X. Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In Conference on Robot Learning (CoRL), 2017.
 Finn & Levine (2017) Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In International Conference on Robotics and Automation (ICRA), 2017.
 Finn et al. (2016) Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.
 Fragkiadaki et al. (2017) Katerina Fragkiadaki, Jonathan Huang, Alex Alemi, Sudheendra Vijayanarasimhan, Susanna Ricco, and Rahul Sukthankar. Motion prediction under multimodality with conditional stochastic networks. CoRR, abs/1705.02082, 2017.
 Gao et al. (2016) Yuanjun Gao, Evan W Archer, Liam Paninski, and John P Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, pp. 163–171, 2016.
 Goodfellow (2016) Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (ICLR), 2016.
 Huynh-Thu & Ghanbari (2008) Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of PSNR in image/video quality assessment. Electronics letters, 2008.
 Ionescu et al. (2014) Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7), 2014.
 Johnson et al. (2016) Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
 Kalchbrenner et al. (2017) Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. International Conference on Machine Learning (ICML), 2017.
 Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.
 Krishnan et al. (2017) Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.
 Li et al. (2017) Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, and Lawrence Carin. Video generation from text. arXiv preprint arXiv:1710.00421, 2017.
 Liu et al. (2017) Ziwei Liu, Raymond Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. International Conference on Computer Vision (ICCV), 2017.
 Lotter et al. (2017) William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. International Conference on Learning Representations (ICLR), 2017.
 Mathieu et al. (2016) Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR), 2016.
 Oh et al. (2015) Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, 2015.
 Ranzato et al. (2014) Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
 Reed et al. (2017) Scott E. Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. International Conference on Machine Learning (ICML), 2017.
 Shu et al. (2016) Rui Shu, James Brofos, Frank Zhang, Hung Hai Bui, Mohammad Ghavamzadeh, and Mykel Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, 2016.
 Srivastava et al. (2015) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.
 Tulyakov et al. (2017) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.
 Vondrick & Torralba (2017) Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In Computer Vision and Pattern Recognition (CVPR), 2017.
 Vondrick et al. (2015) Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
 Vondrick et al. (2016) Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, 2016.
 Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 2004.
 Xingjian et al. (2015) SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, 2015.
 Xue et al. (2016) Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.
 Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. International Conference on Computer Vision (ICCV), 2017.