Improved Conditional VRNNs for Video Prediction
Abstract
Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational AutoEncoder. While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics on three different datasets.
1 Introduction
†Correspondence to lluis.enric.castrejon.subira@umontreal.ca
We investigate the task of video prediction, a particular instantiation of self-supervision [6, 8] where generative models learn to predict future frames in a video. Training such models does not require any annotated data, yet the models need to capture a notion of the complex dynamics of real-world phenomena (such as physical interactions) to generate coherent sequences.
[Figure: qualitative comparison of predictions. Columns: Context, Predicted Frames. Rows: GT, SVG-LP [4], Ours.]
Uncertainty is an inherent difficulty associated with video prediction, as many future outcomes are plausible for a given sequence of observations [1, 4]. Predictions from deterministic models rapidly degrade over time as uncertainty grows, converging to an average of the possible future outcomes [32]. To address this issue, probabilistic latent variable models such as Variational AutoEncoders (VAEs) [18, 29], and more specifically Variational Recurrent Neural Networks (VRNNs) [2], have been proposed for video prediction [1, 4]. These models define a prior distribution over a set of latent variables, allowing different samples from these latents to capture multiple outcomes.
It has been empirically observed that VAE and VRNN-based models produce blurry predictions [20, 21]. This tendency is usually attributed to the use of a similarity metric in pixel space [20, 24] such as Mean Squared Error (corresponding to a log-likelihood loss under a fully factorized Gaussian distribution). This has led to alternative models such as VAE-GAN [20, 21], which extends the traditional VAE objective with an adversarial loss in order to obtain more visually compelling generations.
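To make the parenthetical remark concrete, the short derivation below shows why Mean Squared Error matches a fully factorized Gaussian log-likelihood. The symbols are notation assumed here for illustration: x is the target frame with N pixels, \hat{x} is the prediction, and \sigma^2 is a fixed variance shared by all pixels.

```latex
\begin{aligned}
\log p(x \mid \hat{x})
  &= \sum_{i=1}^{N} \log \mathcal{N}\!\left(x_i ;\, \hat{x}_i,\, \sigma^2\right) \\
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(x_i - \hat{x}_i\right)^2
     \;-\; \frac{N}{2} \log\!\left(2\pi\sigma^2\right)
\end{aligned}
```

With \sigma fixed, the second term is a constant, so maximizing this log-likelihood is equivalent to minimizing the sum of squared pixel errors, which is N times the MSE.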
In addition, the lack of expressive latent distributions has been shown to lead to poor model fitting [12]. Training VAEs involves defining an approximate posterior distribution over the latent variables which models their probability after the generated data has been observed. If the approximate posterior is too constrained, it will not be able to match the true posterior distribution and this will prevent the model from accurately fitting the training data. On the other hand, the prior distribution over the latent variables can be interpreted as a model of uncertainty.
The decoder or likelihood network needs to transform latent samples into data observations covering all plausible outcomes. Given a simple prior, this transformation can be very complex and require high capacity networks. We hypothesize that the reduced expressiveness of current VRNN models is limiting the quality of their predictions and investigate two main directions to improve video prediction models. First, we propose to scale the capacity of the likelihood network. We empirically demonstrate that by using a high capacity decoder we can ease the latent modeling problem and better fit the data.
Second, we introduce more flexible posterior and prior distributions [30]. Current video prediction models usually rely on one shallow level of latent variables and the prior and approximate posterior are parameterized using diagonal Gaussian distributions [1]. We extend the VRNN formulation by proposing a hierarchical variant that uses multiple levels of latents per timestep.
Models leveraging a hierarchy of latents are known to be hard to optimize as they are required to backpropagate through a stack of stochastic latent variables, usually resulting in models that only make use of a small subset of the latents [18, 23, 30]. We mitigate this problem by using a warmup regime for the KL loss [31] and a dense connectivity pattern [13, 22] between the input and latent variables. Specifically, each stochastic latent variable is connected to the input and to all subsequent stochastic levels in the hierarchy. Our empirical findings confirm that only with these techniques is our model able to take advantage of different layers in a latent hierarchy.
We validate our hierarchical VRNN on three datasets with varying levels of future uncertainty and realism: Stochastic Moving MNIST [4], the BAIR Push Dataset [7] and Cityscapes [3]. When compared to current state-of-the-art models [4, 21], our approach performs favorably under several metrics. In particular, on the BAIR Push Dataset our hierarchical VRNN improves on SVG-LP [4], the previous best VAE-based model, both in Fréchet Video Distance (FVD) [34] and in LPIPS score [41]. It also achieves an FVD similar to that of the SAVP VAE-GAN model [21], while improving on this baseline in terms of LPIPS.
2 Related Work
Initial video prediction approaches relied on deterministic models. Ranzato et al. [27] divided frames into patches and predicted their evolution in time given previous neighboring patches. In [32] Srivastava et al. used LSTM networks on pretrained image embeddings to predict the future. Similarly, Oh et al. [25] used LSTMs on CNN representations to predict frames in Atari games when given the player actions.
ConvLSTMs [40] adapt the LSTM equations to spatial feature maps by replacing matrix multiplications with convolutions. They were originally used for precipitation nowcasting and are commonly used for video prediction.
Other works have proposed to disentangle the motion and context of the frames to generate [35, 33, 5]. They assume that a scene can be decomposed as multiple objects, which allows them to use a fixed representation for the background. Our approach does not follow this modeling assumption and instead tries to capture the uncertainty in the future.
Autoregressive models [15, 28] approximate the full joint data distribution over pixels, which allows them to capture complex pixel dependencies but at the expense of making their inference mechanism slow and not scalable to high resolutions. Latent variable models using GANs [9] were proposed in [37, 36, 33]. Training pure GAN video models is still an open research direction: training is unstable and most models require auxiliary losses.
A successful approach so far has been based on VAE [18, 29]/VRNN [2] models. SV2P [1] proposed to capture sequence uncertainty in a single set of latent variables kept fixed for each predicted sequence. SVG [4] adopted the VRNN formulation [2], introducing per-step latent variables (SVG-FP) and a variant with a learned prior (SVG-LP), which makes the prior at a certain timestep a function of previous frames. In recent work, SAVP [21] proposed to use the VAE-GAN [20] framework for video, a hybrid model that offers a trade-off between VAEs and GANs. Our model extends the VRNN formulation by introducing a hierarchy of latents to better approximate the data likelihood.
There are multiple works addressing hierarchical VAEs for non-sequential data [26, 23, 31, 17]. While hierarchical VAEs can model more flexible latent distributions, training them is usually difficult due to the multiple layers of conditional latents [30]. Ladder Variational Autoencoders [31] proposed a series of techniques to partially alleviate this issue. IAF [17] used a similar architecture to Ladder VAEs and extended it with a novel normalizing flow. Recent work [22] has trained very deep hierarchical models that produce visually compelling samples. We extend hierarchical latent variable models to sequential data and apply them to video prediction. Concurrent work [19] has proposed a fully invertible model for video.
3 Preliminaries
We follow previous work in video prediction [4]. Given a set of context frames and the future frames that follow them, our goal is to learn a generative model that maximizes the probability of the future frames conditioned on the context.
VRNN follows the VAE formalism and introduces a set of latent variables to capture the variations in the observed variables at each timestep. It defines a likelihood model and a prior distribution which are parametrized in an autoregressive manner; i.e., at each timestep the observed and latent variables are conditioned on the past latent samples and observed frames. VRNN therefore uses a learned prior [2, 4]. Taking into account the temporal structure of the data, the probability of the future frames given the context is factorized as
(1) 
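As a sketch of the factorization described above, the block below writes out one standard VRNN-style form. The notation is assumed here for illustration rather than taken from the text: c denotes the context frames, x_t the frame at timestep t out of T future frames, z_t the latent variables at that timestep, and \theta and \psi the parameters of the likelihood and prior networks.

```latex
\begin{aligned}
p(x, z \mid c) &= \prod_{t=1}^{T}
   p_\theta\!\left(x_t \mid z_{\le t}, x_{<t}, c\right)\,
   p_\psi\!\left(z_t \mid z_{<t}, x_{<t}, c\right), \\
p(x \mid c) &= \int p(x, z \mid c)\, dz .
\end{aligned}
```

Each factor conditions only on past frames and past latents, which is what makes the prior a learned, autoregressive one rather than a fixed distribution.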
Computing this probability requires marginalizing over the latent variables, which is computationally intractable. Instead, VRNN relies on Variational Inference [14] and defines an amortized approximate posterior that approximates the true posterior distribution. We can then derive the evidence lower bound (ELBO), a lower bound on the marginal log-likelihood:
(2) 
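For reference, a common way to write the ELBO of a VRNN-style model is sketched below. The notation is an assumption made for this sketch: c is the context, x_t and z_t are the frame and latents at timestep t, and \theta, \psi and \phi parametrize the likelihood, the prior and the approximate posterior respectively.

```latex
\log p(x \mid c) \;\ge\; \sum_{t=1}^{T}
  \mathbb{E}_{q_\phi\left(z_{\le t} \mid x_{\le t}, c\right)}
  \Big[
    \log p_\theta\!\left(x_t \mid z_{\le t}, x_{<t}, c\right)
    - D_{\mathrm{KL}}\!\big(
        q_\phi\!\left(z_t \mid z_{<t}, x_{\le t}, c\right)
        \,\big\|\,
        p_\psi\!\left(z_t \mid z_{<t}, x_{<t}, c\right)
      \big)
  \Big]
```

The first term rewards accurate reconstruction of each frame, while the KL term keeps the posterior, which sees the frame to generate, close to the prior, which only sees the past.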
VRNN can be optimized to fit the training data by maximizing the ELBO using stochastic backpropagation and the reparameterization trick [18, 29].
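The two ingredients of this optimization can be sketched for diagonal Gaussian latents as follows. This is a minimal pure-Python illustration, not the authors' implementation; the function names and the use of log-variances as the parametrization are assumptions made here.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise.

    Writing the sample this way keeps z a deterministic function of
    (mu, log_var) given eps, which is the reparameterization trick.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total
```

In an actual model both sets of parameters would be produced by neural networks, and gradients would flow through `mu` and `log_var` via automatic differentiation.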
4 Hierarchical VRNN
We now introduce a hierarchical version of the VRNN model. At each timestep, we consider several levels of latent variables. We then further factorize the latent prior over these levels as
(3) 
The sampling process of the latent variable at a given level now depends on the latent variables from previous timesteps for that level and on the latent variables of the previous levels at the current timestep. Similarly, we can write the approximate posterior as:
(4) 
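The per-level factorization of the prior and the approximate posterior described above can be sketched as follows. The notation is assumed for illustration: z_t^l is the latent at level l (out of L) and timestep t, z_t^{<l} are the levels above it at the same timestep, z_{<t}^l are the past latents of that level, c is the context and x_t the frame at timestep t.

```latex
\begin{aligned}
p_\psi\!\left(z_t \mid z_{<t}, x_{<t}, c\right)
  &= \prod_{l=1}^{L}
     p_\psi\!\left(z_t^{l} \mid z_t^{<l}, z_{<t}^{l}, x_{<t}, c\right), \\
q_\phi\!\left(z_t \mid z_{<t}, x_{\le t}, c\right)
  &= \prod_{l=1}^{L}
     q_\phi\!\left(z_t^{l} \mid z_t^{<l}, z_{<t}^{l}, x_{\le t}, c\right).
\end{aligned}
```

The only difference between the two lines is the conditioning on frames: the prior sees frames up to t-1, while the posterior also sees the frame to generate.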
Using Eq. 3 and Eq. 4, we can rewrite the ELBO as
(5) 
Refer to the Appendix for a full derivation of the ELBO.
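As a rough guide to the shape of the hierarchical bound, a per-level form is sketched below under assumed notation (z_t^l for the latent at level l and timestep t, L levels, c for the context, and \theta, \psi, \phi for the likelihood, prior and posterior parameters). It follows from the fact that the KL divergence between two products of conditionals decomposes into a sum of expected per-level KL terms.

```latex
\log p(x \mid c) \;\ge\; \sum_{t=1}^{T}
  \mathbb{E}_{q_\phi\left(z_{\le t} \mid x_{\le t}, c\right)}
  \Big[
    \log p_\theta\!\left(x_t \mid z_{\le t}, x_{<t}, c\right)
    - \sum_{l=1}^{L}
      D_{\mathrm{KL}}\!\big(
        q_\phi\!\left(z_t^{l} \mid z_t^{<l}, z_{<t}^{l}, x_{\le t}, c\right)
        \,\big\|\,
        p_\psi\!\left(z_t^{l} \mid z_t^{<l}, z_{<t}^{l}, x_{<t}, c\right)
      \big)
  \Big]
```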
4.1 Dense Latent Connectivity
Training a hierarchy of latent variables is known to be challenging as it requires backpropagating through multiple stochastic layers. Usually this results in models that only use one specific level of the hierarchy [18, 23, 30]. To ease the optimization we use a dense connectivity pattern between latent levels both for the prior and the approximate posterior, following [13, 22].
Fig 2 illustrates the dense connection of the learned prior (refer to the Appendix for the approximate posterior). For each latent level, the prior and posterior are implemented using recurrent neural networks which take as input a deterministic transformation of the input frame (red arrows in Fig 2) and all the latent variables from the previous levels (green arrows in Fig 2). In addition, each latent variable has a direct connection to the output variables.
4.2 Model Parametrization
We now describe an instantiation of the VRNN model that we will use in the experiments, illustrated in Fig. 3. First we compute features for each context frame and use them to initialize the hidden state of the prior/posterior/decoder networks, all of which have recurrent components. At a given timestep, the model takes as input the latent variable samples together with the embedding of the previously generated frame and outputs the next frame. During training we draw latent samples from the approximate posterior distribution and maximize the ELBO. To generate unseen sequences, we sample from the learned prior. Note that since we have multiple levels of conditional latents we use ancestral sampling to generate the latents, i.e., we first sample from the top level of the hierarchy and we then sequentially sample the lower levels conditioning on the sampled values of the previous layers in the hierarchy.
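The ancestral sampling procedure described above can be sketched in a few lines. This is a toy scalar illustration, not the model itself: `toy_prior_params` is a hypothetical stand-in for the per-level prior network, and its coefficients are arbitrary. What it does illustrate is the order of sampling and the fact that each level receives the samples of all previous levels.

```python
import math
import random


def toy_prior_params(level, frame_feat, upper_samples):
    """Hypothetical stand-in for a per-level prior network.

    Returns (mu, log_var) computed from the frame feature and from the
    samples of ALL previous levels, not only the adjacent one.
    """
    mu = 0.1 * frame_feat + 0.5 * sum(upper_samples)
    log_var = -1.0 - 0.1 * level
    return mu, log_var


def ancestral_sample(num_levels, frame_feat, rng):
    """Sample the hierarchy top-down, one level at a time."""
    samples = []
    for level in range(num_levels):
        # Condition on every sample drawn so far at this timestep.
        mu, log_var = toy_prior_params(level, frame_feat, samples)
        z = mu + math.exp(0.5 * log_var) * rng.gauss(0.0, 1.0)
        samples.append(z)
    return samples
```

For example, `ancestral_sample(3, 1.0, random.Random(0))` draws one value per level for a three-level hierarchy; the returned list would then be passed to the decoder.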
Frame Encoder We use a 2D CNN with ResNet [11] blocks and max-pooling layers to represent input frames.
Prior/Approximate Posterior We parametrize both the prior and the posterior as a hierarchy of diagonal Normal distributions, where the distribution parameters are recurrent functions of samples from i) previous levels in the hierarchy and ii) the frame encoder features. Each level in the hierarchy operates at a different spatial resolution, with the top level features operating at a 1x1 resolution, i.e., not having a spatial topology. At a given timestep, the parameters for a specific latent level are given by a ConvLSTM that consumes i) a previous hidden state, ii) samples from the previous levels in the hierarchy, iii) the feature map of a frame with the same spatial resolution as the ConvLSTM. For the prior network, the input frame embedding corresponds to the previously generated frame, while for the posterior the input comes from the frame to generate.
Likelihood/Frame Decoder At each timestep, the decoder takes a representation of the previously generated frame and the latent samples and generates the next frame according to the likelihood model. The decoder consists of ConvLSTMs interleaved with transposed convolutions that upscale the feature maps back to the input resolution.
Initial State The initial states of our prior, posterior and decoder/likelihood models are functions of the context. We use small CNNs to initialize each of the ConvLSTM layers used in the VAE components.
5 Experiments
All our models are trained with Adam [16] on Nvidia DGX-1s. We use a learning rate warmup [10] starting with an initial learning rate of 2e-5 that is linearly increased at each training step until reaching 1.6e-4 after 5 epochs. We use β1 = 0.5 and β2 = 0.9 and a weight decay of 1e-4. We train the autoregressive components of our models using teacher forcing [39].
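The learning rate warmup can be sketched as a simple function of the training step. The sketch assumes the rates are 2e-5 and 1.6e-4 and that the increase is applied per optimizer step; `steps_per_epoch` is a hypothetical parameter, since the number of steps per epoch is not stated here.

```python
def warmup_lr(step, steps_per_epoch, lr_start=2e-5, lr_end=1.6e-4,
              warmup_epochs=5):
    """Linearly interpolate from lr_start to lr_end over the warmup period."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if step >= warmup_steps:
        # After warmup the rate stays at its final value in this sketch.
        return lr_end
    return lr_start + (lr_end - lr_start) * step / warmup_steps
```

With 100 steps per epoch, for instance, the rate would reach its final value at step 500 and sit halfway between the two rates at step 250.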
Our models are also trained using beta warmup [31], which consists in gradually increasing the weight of the KL divergence in the ELBO, turning the model from an unregularized Autoencoder into a VAE progressively. VAEs trained with beta warmup usually encode more information in the latent variables. Refer to the Appendix for a complete description of our models.
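A minimal sketch of beta warmup is given below. The linear shape of the schedule and the `warmup_steps` parameter are assumptions made for illustration; the text only states that the KL weight is increased gradually.

```python
def beta_weight(step, warmup_steps):
    """KL weight that grows linearly from 0 to 1 over warmup_steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def warmup_loss(recon_nll, kl, step, warmup_steps):
    """Negative ELBO with the KL term scaled by the current beta.

    At beta = 0 this is a plain autoencoder reconstruction loss; at
    beta = 1 it is the usual (negative) ELBO.
    """
    return recon_nll + beta_weight(step, warmup_steps) * kl
```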
5.1 Ablation Study
We first investigate the importance of each VRNN component, namely the likelihood, the prior and the posterior. We focus on the BAIR Push dataset [7] with 64x64 color sequences of a robotic arm interacting with children's toys in a sandbox. Similarly to previous works [21], we use trajectories 256 to 511 as our test set and the rest for training, resulting in 43264 train and 256 test sequences. At training time we randomly subsample 12 frames from each train sequence, use the first 2 frames as the context, and learn to predict the remaining 10 frames. To evaluate the different model variations, we report the training objective (ELBO) obtained for the training set and the test set.
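The split and the clip sampling can be sketched as follows. Two assumptions are made here and are not confirmed by the text: the 12 frames are taken as a contiguous window, and the total of 43520 trajectories used in the usage note is simply the sum of the stated train and test counts.

```python
import random


def split_indices(num_trajectories, test_start=256, test_end=511):
    """Hold out trajectories test_start..test_end (inclusive) for testing."""
    test = list(range(test_start, test_end + 1))
    train = [i for i in range(num_trajectories)
             if i < test_start or i > test_end]
    return train, test


def sample_clip(sequence, rng, clip_len=12, num_context=2):
    """Pick a random contiguous window and split it into context/targets."""
    start = rng.randint(0, len(sequence) - clip_len)  # inclusive bounds
    clip = sequence[start:start + clip_len]
    return clip[:num_context], clip[num_context:]
```

Calling `split_indices(43520)` yields 43264 training and 256 test indices, and `sample_clip` on any sequence of at least 12 frames returns 2 context frames and 10 frames to predict.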
5.1.1 Scaling the Likelihood Model
Table 1
Model              Parameters  Train/Test ELBO (↓)
1-ConvLSTM         62.22M      3237/3826
3-ConvLSTM         86.64M      1948/2355
6-ConvLSTM         93.81M      1279.21/1731.31
+ higher capacity  194.15M     1113.31/1589.72
We assess the importance of the likelihood model. For this purpose, we build a VRNN with a single level of latent variables and modify the number of ConvLSTM layers in the decoder. Our aim is to investigate whether increasing the capacity of the mapping from the latents to the observations results in better predictions.
In this experiment, our baseline likelihood model has one LSTM at 1x1 spatial resolution. We then gradually replace convolutional layers in the decoder with ConvLSTM layers, which increases the amount of information that can be carried from previous timesteps and, by extension, also increases the overall likelihood model capacity. We compare to a model with 3 ConvLSTM layers at resolutions 1x1, 4x4 and 8x8 and a model with 6 ConvLSTM layers at 1x1, 4x4, 8x8, 16x16, 32x32 and 64x64. Additionally, we also increase the size of the ConvLSTM layers for the model with 6 layers as another way of adding capacity.
Results can be found in Table 1. We observe that, as a general trend, both the training and test ELBO decrease as we increase the model capacity, which suggests that current video prediction models might operate in an underfitting regime and benefit from higher capacity decoders.
5.1.2 More Flexible Prior and Posterior
We now investigate the importance of having more flexible prior and approximate posterior distributions and augment the 6-ConvLSTM VRNN model with a hierarchy of latent variables. For all models, we fix the frame encoder and likelihood model (see Footnote 1) and change the networks that estimate the learned prior and the approximate posterior over the latent variables. All these models use a dense connectivity pattern and beta warmup.
Footnote 1: To add the multiple levels of latents in the decoder we need to modify the likelihood network and slightly increase the number of parameters. However, most of the added capacity when adding a new level of latents goes towards the prior and posterior networks.
Table 2
Model           Parameters  Train/Test ELBO (↓)

Top half: levels of latents
1               166.55M     1141.85/1536.93
1-8             220.60M     989.39/1313.02
1-8-32          230.74M     883.10/1162.24
1-8-16-32       245.19M     956.63/1256.22

Bottom half: training techniques
Naive Training  224.18M     1127.33/1440.58
BW              224.18M     1101.39/1440.62
Dense           230.74M     1182.60/1547.05
BW and Dense    230.74M     883.10/1162.24
We compare a VRNN baseline with a single level of latents with no spatial topology to a model with two levels of latents at resolutions 1x1 and 8x8 (1-8), three levels of latents at 1x1, 8x8 and 32x32 (1-8-32), and four levels of latents (1-8-16-32) in the top half of Table 2. All models are trained with beta warmup and dense latent connectivity. We observe that in general adding more levels of latents with higher resolution reduces the train and test ELBOs, supporting the hypothesis that a more flexible prior and posterior leads to a better data fit. However, we observe diminishing returns past 3 levels, as our 1-8-16-32 model does not outperform the 3-level model. We attribute this to the difficulties in training deep hierarchies of latents, which remains a challenging optimization problem.
To further highlight the difficulties in training hierarchies of latents, we investigate the importance of using beta warmup [31] and having a dense connectivity between latents. The results of this experiment can be found in the bottom half of Table 2. We observe that these techniques are required for our 1-8-32 model to make use of the hierarchy of latents and improve upon the single level model.
This is analyzed in more detail in Fig 4, where we visualize the KL between the prior and the posterior distributions for the test sequences of the BAIR Push dataset for the 1-8-32 model and the variant without warmup or dense connectivity (Naive training). We consider a channel to be active if its average KL is higher than 0.01 following [22], and consider that a unit with a KL higher than 0.15 is maximally activated. We observe that without these techniques the model only uses a few latents of the top level in the hierarchy. However, when using beta warmup and a dense connectivity most of the latents are active across levels.
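The activity criterion above amounts to two thresholds on the per-channel average KL. A minimal sketch follows; the function name and the dictionary it returns are choices made here for illustration.

```python
def classify_channels(avg_kl, active_threshold=0.01, max_threshold=0.15):
    """Count latent channels that are active or maximally activated.

    avg_kl holds one average KL value per latent channel, averaged over
    the evaluated sequences.
    """
    active = sum(1 for kl in avg_kl if kl > active_threshold)
    saturated = sum(1 for kl in avg_kl if kl > max_threshold)
    return {"active": active,
            "maximally_activated": saturated,
            "total": len(avg_kl)}
```

Applied per level of the hierarchy, the ratio of active to total channels gives a quick check of whether a level is being used at all.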
5.2 Comparisons to Previous Approaches
Next, we compare our single latent level VRNN (Ours w/o Hier), and our hierarchical VRNN with 3 levels of latents (Ours w/ Hier) to previous video approaches on Stochastic Moving MNIST [4], BAIR Push [7] and the Cityscapes [3] datasets.
5.2.1 Evaluation and Metrics
Defining evaluation metrics for video prediction is an open research question. In general we want models to predict sequences that are plausible, look realistic and cover all possible outcomes. Unfortunately, we are not aware of any metric that reflects all these aspects.
To measure coverage and plausibility we adopt the evaluation protocol proposed in [4, 21]. For each ground truth test sequence, we sample a set of random predictions from the model, conditioned on the initial frames of the test sequence. Then we find the sample that best matches the ground truth sequence according to a given metric and report that metric value. Some common metric choices are Mean Squared Error (MSE), Structural Similarity (SSIM) [38] or Peak Signal-to-Noise Ratio (PSNR). In practice, these metrics have been shown to not correlate well with human judgment as they tend to prefer blurry predictions over sharper but imperfect generations [41, 21, 34]. LPIPS [41], on the other hand, is a perceptual metric that employs CNN features and has better correlation to human judgment. For this evaluation we choose to produce 100 samples following previous work and use SSIM and LPIPS as metrics. We have empirically observed that with 100 samples the results stay fairly consistent across different samplings. We report the metric average over the test set.
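The best-of-N protocol can be sketched as below. The sketch is generic in the metric: `sample_fn` and `metric` are hypothetical callables supplied by the caller, and `mean_squared_error` on flat lists is included only as a toy metric so the example is self-contained.

```python
def mean_squared_error(a, b):
    """Toy metric on flat lists of pixel values (lower is better)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_sample_score(ground_truth, samples, metric, higher_is_better):
    """Score every sample against the ground truth and keep the best."""
    scores = [metric(ground_truth, s) for s in samples]
    return max(scores) if higher_is_better else min(scores)


def evaluate_best_of_n(test_set, sample_fn, metric, higher_is_better,
                       num_samples=100):
    """Average the best-of-N score over (context, ground_truth) pairs."""
    per_sequence = []
    for context, ground_truth in test_set:
        samples = [sample_fn(context) for _ in range(num_samples)]
        per_sequence.append(
            best_sample_score(ground_truth, samples, metric, higher_is_better))
    return sum(per_sequence) / len(per_sequence)
```

For SSIM one would pass `higher_is_better=True`, and for LPIPS or MSE `higher_is_better=False`.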
Additionally, we also use the recently proposed Fréchet Video Distance (FVD), which measures sample realism. FVD uses features from a 3D CNN and has also been shown to correlate well with human perception [34]. FVD compares populations of samples to assess whether they were both generated by the same distribution (it does not compare pairs of ground truth/generated frames directly). We form the ground truth population by using all the test sequences with their context. For the predicted population we randomly sample one video out of those generated for each test sequence. We repeat this process 5 times and report the mean of the FVD scores obtained, which stay fairly stable across samplings.
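The population sampling used for FVD can be sketched as follows. The FVD computation itself is not implemented here: `fvd_fn` is a hypothetical callable that takes two populations of videos and returns a score, and would in practice wrap a 3D CNN feature extractor.

```python
import random


def fvd_protocol(ground_truth_videos, generated_per_sequence, fvd_fn, rng,
                 repeats=5):
    """Average fvd_fn over several randomly drawn predicted populations.

    generated_per_sequence[i] holds all videos generated for test
    sequence i; one of them is picked at random in each repeat.
    """
    scores = []
    for _ in range(repeats):
        picked = [rng.choice(candidates)
                  for candidates in generated_per_sequence]
        scores.append(fvd_fn(ground_truth_videos, picked))
    return sum(scores) / len(scores)
```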
Table 3
Model          FVD (↓)  LPIPS (↓)     SSIM (↑)
SVG-LP [4]     90.81    0.153 ± 0.03  0.668 ± 0.04
Ours w/o Hier  63.81    0.102 ± 0.04  0.763 ± 0.09
Ours w/ Hier   57.17    0.103 ± 0.03  0.760 ± 0.08
[Figure 5: Context and predicted frames. First group of rows: GT, SAVP [21], SVG-LP [4], Ours w/ Hier. Second group of rows: GT, SVG-LP [4], Ours w/ Hier.]
5.2.2 Stochastic Moving MNIST
Stochastic Moving MNIST is a synthetic dataset proposed in [4] which consists of black and white sequences of MNIST digits moving over a black background and bouncing off the frame borders. As opposed to the original Moving MNIST dataset [32] with completely deterministic motion, Stochastic Moving MNIST has uncertain digit trajectories: the digits bounce off the border with a random new trajectory. We train two variants of our model and compare to the SVG-LP baseline [4], for which we use a pretrained model from the official codebase. All models are trained using 5 frames of context and 10 future frames to predict. To evaluate the models, we follow the procedure in [4] described in Section 5.2.1.
We report the results of the experiment in Table 3. We observe that both versions of our model (with and without the latent hierarchy) outperform the SVG-LP baseline by a significant margin on all metrics. Note that LPIPS and FVD might not be suited to this dataset as they use features from CNNs trained on real-world images, but we report them for completeness. Visually, our samples (found in the Appendix) reproduce the digits more faithfully with reduced degradation over time. There are small differences between the two versions of our model, suggesting that the extra expressiveness of the hierarchical model is not necessary in this synthetic dataset.
5.2.3 BAIR Push

We compare our VRNN models to SVG-LP [4] and SAVP [21]. We use their official implementations and pretrained models to reproduce their results. We use the experimental setup of previous works [4, 21], using 2 context frames and generating 28 frames.
Results can be found in Fig 6. When the robotic arm is interacting with an object, SVG-LP tends to generate blurry predictions characterized by a high FVD score. SAVP exhibits a lower FVD as it produces more realistic-looking predictions. However, SAVP does not have a better coverage of the ground truth sequences compared to SVG-LP as measured by LPIPS and SSIM. By inspecting the SAVP samples we notice that the SAVP generations tend to be sharper but sometimes they exhibit temporal inconsistencies or implausible interactions (see Fig 5). Our VRNN model without the latent hierarchy (Ours w/o Hier) obtains better scores than SVG-LP, the previous best VAE-type model. This supports the importance of having a high-capacity likelihood model. In addition, our hierarchical VRNN further improves both the FVD and LPIPS metrics, suggesting that the hierarchy of latents helps in modeling the data. In particular, our hierarchical VRNN improves on SVG-LP, the previous best VAE-based model, in terms of both FVD and LPIPS. It also achieves an FVD similar to that of the SAVP VAE-GAN model, while outperforming it in terms of LPIPS.
5.2.4 Cityscapes
The Cityscapes dataset contains sequences recorded from a car driving around multiple cities under different conditions. Cityscapes is a challenging dataset: while contiguous frames are locally similar, uncertainty grows significantly with time. Compared to previous datasets, the backgrounds in Cityscapes do not stay constant across time.
We consider sequences with 30 frames from the training set cities for a total of 1877 train sequences and randomly select 256 test sequences. We use 2 context and 10 prediction frames to train the models. At test time we predict 28 frames following the BAIR Push experimental protocol. We preprocess the videos by taking a 1024x1024 center crop of the original sequences and resizing them to 128x128 pixels. For evaluating the models we use the standard setup where we generate 100 samples per test sequence and report FVD, SSIM and LPIPS metrics. Since none of the baselines from previous experiments are trained on Cityscapes, we use the official SVG implementation (that defines models with 128x128 inputs) and train an SVG-LP model. We train all models for 100 epochs.
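The preprocessing step can be sketched on a single-channel frame stored as a list of rows. The resizing method is not specified in the text; the sketch assumes area averaging, where a 1024x1024 crop reduced by a factor of 8 gives 128x128. Both function names are choices made here.

```python
def center_crop(image, size):
    """Return the centered size x size window of a 2D list of pixels."""
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def downsample_area(image, factor):
    """Shrink a square image by averaging non-overlapping factor x factor blocks."""
    out_size = len(image) // factor
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            block = [image[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```

Under these assumptions a frame would be processed as `downsample_area(center_crop(frame, 1024), 8)`, applied per color channel; in practice an image library would be used instead of nested lists.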
Results for this experiment can be found in Fig. 7. SVG-LP has trouble modelling motion in the dataset, usually predicting a static image similar to the last context frame. In contrast, our model without a hierarchy of latents is able to model the changing scene. When adding hierarchical latents our model is able to capture more fine-grained details, and as a result, it produces more visually appealing samples with a boost in all metrics. We note that the SSIM scores for SVG-LP match those of our models at later timesteps in the prediction; however, this does not translate to better samples, as can be seen in Fig. 5 or in the Appendix. This further indicates that SSIM might not be suitable to evaluate video prediction models.
6 Conclusions
We propose a hierarchical VRNN for video prediction that features an improved likelihood model and a hierarchy of latent variables. Our approach compares favorably to current state-of-the-art models in terms of the Fréchet Video Distance, LPIPS and SSIM metrics, producing visually appealing and coherent samples. Our results demonstrate that current video prediction models benefit from increased capacity, opening the door to further gains with bigger and more flexible generative models.
References
 [1] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
 [2] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
 [4] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1182–1191, 2018.
 [5] E. L. Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.
 [6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [7] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
 [8] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
 [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [10] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [12] M. D. Hoffman and M. J. Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
 [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
 [14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 [15] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1771–1779. JMLR.org, 2017.
 [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [17] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
 [18] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [19] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. Videoflow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
 [20] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 [21] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
 [22] L. Maaløe, M. Fraccaro, V. Liévin, and O. Winther. Biva: A very deep hierarchy of latent variables for generative modeling. arXiv preprint arXiv:1902.02102, 2019.
 [23] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
 [24] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations, 2016.
 [25] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871, 2015.
 [26] R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.
 [27] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
 [28] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2912–2921. JMLR.org, 2017.
 [29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [30] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016), 2016.
 [31] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
 [32] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015.
 [33] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018.
 [34] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
 [35] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
 [36] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
 [37] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
 [38] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
 [39] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
 [40] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
 [41] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.