Improved Conditional VRNNs for Video Prediction

Lluís Castrejón              Nicolas Ballas              Aaron Courville
Mila - Université de Montréal           Facebook AI Research
Canadian Institute for Advanced Research (CIFAR)

Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder. While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics in three different datasets.


1 Introduction

We investigate the task of video prediction, a particular instantiation of self-supervision [6, 8] where generative models learn to predict future frames in a video. Training such models does not require any annotated data, yet the models need to capture a notion of the complex dynamics of real-world phenomena (such as physical interactions) to generate coherent sequences.

[Figure 1 panels: Context | Predicted Frames]

Figure 1: Can generative models predict the future? We propose an improved VAE model for video prediction. Our model uses hierarchical latents and a higher capacity likelihood network to improve upon previous VAE approaches, generating more visually appealing samples that remain coherent for longer temporal horizons.

Uncertainty is an inherent difficulty associated with video prediction, as many future outcomes are plausible for a given sequence of observations [1, 4]. Predictions from deterministic models rapidly degrade over time as uncertainty grows, converging to an average of the possible future outcomes [32]. To address this issue, probabilistic latent variable models such as Variational Auto-Encoders (VAEs) [18, 29], and more specifically Variational Recurrent Neural Networks (VRNNs) [2], have been proposed for video prediction [1, 4]. These models define a prior distribution over a set of latent variables, allowing different samples from these latents to capture multiple outcomes.

It has been empirically observed that VAE and VRNN-based models produce blurry predictions [20, 21]. This tendency is usually attributed to the use of a similarity metric in pixel space [20, 24] such as Mean Squared Error (corresponding to a log-likelihood loss under a fully factorized Gaussian distribution). This has led to alternative models such as VAE-GAN [20, 21], which extends the traditional VAE objective with an adversarial loss in order to obtain more visually compelling generations.

In addition, the lack of expressive latent distributions has been shown to lead to poor model fitting [12]. Training VAEs involves defining an approximate posterior distribution over the latent variables which models their probability after the generated data has been observed. If the approximate posterior is too constrained, it will not be able to match the true posterior distribution and this will prevent the model from accurately fitting the training data. On the other hand, the prior distribution over the latent variables can be interpreted as a model of uncertainty.

The decoder or likelihood network needs to transform latent samples into data observations covering all plausible outcomes. Given a simple prior, this transformation can be very complex and require high capacity networks. We hypothesize that the reduced expressiveness of current VRNN models is limiting the quality of their predictions and investigate two main directions to improve video prediction models. First, we propose to scale the capacity of the likelihood network. We empirically demonstrate that by using a high capacity decoder we can ease the latent modeling problem and better fit the data.

Second, we introduce more flexible posterior and prior distributions [30]. Current video prediction models usually rely on one shallow level of latent variables and the prior and approximate posterior are parameterized using diagonal Gaussian distributions [1]. We extend the VRNN formulation by proposing a hierarchical variant that uses multiple levels of latents per timestep.

Models leveraging a hierarchy of latents are known to be hard to optimize as they are required to backpropagate through a stack of stochastic latent variables, usually resulting in models that only make use of a small subset of the latents [18, 23, 30]. We mitigate this problem by using a warmup regime for the KL loss [31] and a dense connectivity pattern [13, 22] between the input and latent variables. Specifically, each stochastic latent variable is connected to the input and to all subsequent stochastic levels in the hierarchy. Our empirical findings confirm that our model is able to take advantage of the different layers in the latent hierarchy only when these techniques are used.

We validate our hierarchical VRNN on three datasets with varying levels of future uncertainty and realism: Stochastic Moving MNIST [4], the BAIR Push Dataset [7] and Cityscapes [3]. When compared to current state-of-the-art models [4, 21], our approach performs favorably under several metrics. In particular, on the BAIR Push Dataset our hierarchical VRNN improves both the Fréchet Video Distance (FVD) [34] and the LPIPS score [41] over SVG-LP [4], the previous best VAE-based model. It also achieves an FVD similar to that of the SAVP VAE-GAN model [21], while improving on this baseline in terms of LPIPS.

2 Related Work

Initial video prediction approaches relied on deterministic models. Ranzato et al. [27] divided frames into patches and predicted their evolution in time given previous neighboring patches. In [32] Srivastava et al. used LSTM networks on pretrained image embeddings to predict the future. Similarly, Oh et al. [25] used LSTMs on CNN representations to predict frames in Atari games when given the player actions.

ConvLSTMs [40] adapt the LSTM equations to spatial feature maps by replacing matrix multiplications with convolutions. They were originally used for precipitation nowcasting and are commonly used for video prediction.

Other works have proposed to disentangle the motion and context of the frames to generate [35, 33, 5]. They assume that a scene can be decomposed as multiple objects, which allows them to use a fixed representation for the background. Our approach does not follow this modeling assumption and instead tries to capture the uncertainty in the future.

Autoregressive models [15, 28] approximate the full joint data distribution over pixels, which allows them to capture complex pixel dependencies but at the expense of making their inference mechanism slow and not scalable to high resolutions. Latent variable models using GANs [9] were proposed in  [37, 36, 33]. Training pure GAN video models is still an open research direction: training is unstable and most models require auxiliary losses.

A successful approach so far has been based on VAE [18, 29]/VRNN [2] models. SV2P [1] proposed to capture sequence uncertainty in a single set of latent variables kept fixed for each predicted sequence. SVG [4] adopted the VRNN formulation [2], introducing per-step latent variables (SVG-FP) and a variant with a learned prior (SVG-LP), which makes the prior at a certain timestep a function of previous frames. In recent work, SAVP [21] proposed to use the VAE-GAN [20] framework for video, a hybrid model that offers a trade-off between VAEs and GANs. Our model extends the VRNN formulation by introducing a hierarchy of latents to better approximate the data likelihood.

There are multiple works addressing hierarchical VAEs for non-sequential data [26, 23, 31, 17]. While hierarchical VAEs can model more flexible latent distributions, training them is usually difficult due to the multiple layers of conditional latents [30]. Ladder Variational Autoencoders [31] proposed a series of techniques to partially alleviate this issue. IAF [17] used a similar architecture to Ladder VAEs and extended it with a novel normalizing flow. Recent work [22] has trained very deep hierarchical models that produce visually compelling samples. We extend hierarchical latent variable models to sequential data and apply them to video prediction. Concurrent work  [19] has proposed a fully invertible model for video.

3 Preliminaries

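We follow previous work in video prediction [4]. Given context frames $\mathbf{c} = (c_1, \dots, c_k)$ and the following future frames $\mathbf{x} = (x_1, \dots, x_T)$, our goal is to learn a generative model that maximizes the probability $p(\mathbf{x} \mid \mathbf{c})$.

VRNN follows the VAE formalism and introduces a set of latent variables $\mathbf{z} = (z_1, \dots, z_T)$ to capture the variations in the observed variables $\mathbf{x}$ at each timestep $t$. It defines a likelihood model $p_\theta(x_t \mid z_{\le t}, x_{<t}, \mathbf{c})$ and a prior distribution $p_\theta(z_t \mid z_{<t}, x_{<t}, \mathbf{c})$ which are parametrized in an autoregressive manner; i.e., at each timestep observed and latent variables are conditioned on the past latent samples and observed frames. VRNN therefore uses a learned prior [2, 4]. Taking into account the temporal structure of the data, the joint probability of frames and latents is factorized as

$$p(\mathbf{x}, \mathbf{z} \mid \mathbf{c}) = \prod_{t=1}^{T} p_\theta(x_t \mid z_{\le t}, x_{<t}, \mathbf{c}) \, p_\theta(z_t \mid z_{<t}, x_{<t}, \mathbf{c}). \tag{1}$$

Computing $p(\mathbf{x} \mid \mathbf{c})$ requires marginalizing over the latent variables $\mathbf{z}$, which is computationally intractable. Instead, VRNN relies on Variational Inference [14] and defines an amortized approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{c}) = \prod_{t=1}^{T} q_\phi(z_t \mid z_{<t}, x_{\le t}, \mathbf{c})$ that approximates the true posterior distribution $p_\theta(\mathbf{z} \mid \mathbf{x}, \mathbf{c})$. We can then derive the evidence lower bound (ELBO), a lower bound on the marginal log-likelihood $\log p(\mathbf{x} \mid \mathbf{c})$:

$$\log p(\mathbf{x} \mid \mathbf{c}) \ge \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{\le t} \mid x_{\le t}, \mathbf{c})} \Big[ \log p_\theta(x_t \mid z_{\le t}, x_{<t}, \mathbf{c}) - D_{KL}\big( q_\phi(z_t \mid z_{<t}, x_{\le t}, \mathbf{c}) \,\|\, p_\theta(z_t \mid z_{<t}, x_{<t}, \mathbf{c}) \big) \Big]. \tag{2}$$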

VRNN can be optimized to fit the training data by maximizing the ELBO using stochastic backpropagation and the reparameterization trick [18, 29].
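
As an illustration of the quantities involved, the following is a minimal pure-Python sketch of a single-timestep ELBO term for diagonal Gaussian prior and posterior distributions. It is a sketch under stated assumptions, not the training code of the paper: the function names are hypothetical, the likelihood is taken to be a fully factorized Gaussian with fixed unit variance (the case in which the log-likelihood reduces to a scaled MSE), and vectors are plain Python lists.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def gaussian_log_likelihood(x, mean, sigma=1.0):
    """Log-density of x under a factorized Gaussian (assumed fixed sigma)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (xi - mi) ** 2 / (2.0 * sigma ** 2) for xi, mi in zip(x, mean))


def elbo_step(x, x_pred, mu_q, sigma_q, mu_p, sigma_p):
    """One timestep of the ELBO: reconstruction log-likelihood minus the KL term."""
    kl = kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)
    return gaussian_log_likelihood(x, x_pred) - kl


if __name__ == "__main__":
    rng = random.Random(0)
    z = reparameterize([0.0, 1.0], [1.0, 0.5], rng)  # posterior sample
    print(len(z), elbo_step([0.0], [0.0], [0.0], [1.0], [1.0], [1.0]))
```

Because the sample is written as a deterministic function of the distribution parameters and external noise, gradients can flow through `mu` and `sigma` in an automatic differentiation framework; the list-based version here only illustrates the arithmetic.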

4 Hierarchical VRNN

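We now introduce a hierarchical version of the VRNN model. At each timestep, we consider $L$ levels of latent variables $z_t = (z_t^1, \dots, z_t^L)$. We then further factorize the latent prior as

$$p_\theta(z_t \mid z_{<t}, x_{<t}, \mathbf{c}) = \prod_{l=1}^{L} p_\theta(z_t^l \mid z_t^{<l}, z_{<t}^l, x_{<t}, \mathbf{c}). \tag{3}$$

The sampling process of the latent variable $z_t^l$ now depends on the latent variables from previous time steps for that level, $z_{<t}^l$, and on the latent variables of the previous levels at the current timestep, $z_t^{<l}$. Similarly, we can write the approximate posterior as:

$$q_\phi(z_t \mid z_{<t}, x_{\le t}, \mathbf{c}) = \prod_{l=1}^{L} q_\phi(z_t^l \mid z_t^{<l}, z_{<t}^l, x_{\le t}, \mathbf{c}). \tag{4}$$

Using Eq. 3 and Eq. 4, we can rewrite the ELBO as

$$\log p(\mathbf{x} \mid \mathbf{c}) \ge \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{\le t} \mid x_{\le t}, \mathbf{c})} \Big[ \log p_\theta(x_t \mid z_{\le t}, x_{<t}, \mathbf{c}) - \sum_{l=1}^{L} D_{KL}\big( q_\phi(z_t^l \mid z_t^{<l}, z_{<t}^l, x_{\le t}, \mathbf{c}) \,\|\, p_\theta(z_t^l \mid z_t^{<l}, z_{<t}^l, x_{<t}, \mathbf{c}) \big) \Big]. \tag{5}$$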

Refer to the Appendix for a full derivation of the ELBO.

Figure 2: Graphical model for the learned prior with the dense latent connectivity pattern. Arrows in red show the connections from the input at the previous timestep to current latent variables. Arrows in green highlight skip connections between latent variables and connections to outputs. Arrows in black indicate recurrent temporal connections. We empirically observe that this dense-connectivity pattern eases the training of latent hierarchies.
Figure 3: Model Parametrization. Our model uses a CNN to encode frames individually. The representation of the context frames is used to initialize the states of the prior, posterior and likelihood networks, all of which use recurrent networks. At each timestep, the decoder receives an encoding of the previous frame, a set of latent variables (either from the prior or the posterior) and its previous hidden state and predicts the next frame in the sequence.

4.1 Dense Latent Connectivity

Training a hierarchy of latent variables is known to be challenging as it requires backpropagating through multiple stochastic layers. Usually this results in models that only use one specific level of the hierarchy [18, 23, 30]. To ease the optimization we use a dense connectivity pattern between latent levels both for the prior and the approximate posterior, following [13, 22].

Fig. 2 illustrates the dense connectivity of the learned prior (refer to the Appendix for the approximate posterior). For each latent level, the prior and posterior are implemented using recurrent neural networks which take as input a deterministic transformation of the input frame ($x_{t-1}$ for the prior; red arrows in Fig. 2) and all the latent variables from the previous levels (green arrows in Fig. 2). In addition, each latent variable has a direct connection to the output variables $x_t$.

4.2 Model Parametrization

We now describe an instantiation of the VRNN model that we will use in the experiments, illustrated in Fig. 3. First we compute features for each context frame and use them to initialize the hidden state of the prior/posterior/decoder networks, all of which have recurrent components. At a given timestep $t$, the model takes as input the latent variable samples $z_t$ together with the embedding of the previously generated frame $x_{t-1}$ and outputs the next frame $x_t$. During training we draw latent samples from the approximate posterior distribution $q_\phi$ and maximize the ELBO. To generate unseen sequences, we sample from the learned prior $p_\theta$. Note that since we have multiple levels of conditional latents we use ancestral sampling to generate $z_t$, i.e., we first sample from the top level of the hierarchy and we then sequentially sample the lower levels conditioning on the sampled values of the previous layers in the hierarchy.
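
The ancestral sampling procedure with dense latent connectivity can be sketched as follows. This is a toy illustration only: `prior_params` is a hypothetical stand-in for the per-level recurrent prior networks (its arithmetic is arbitrary), latents are scalars rather than feature maps, and the recurrent state is omitted.

```python
import random


def prior_params(level, frame_feature, previous_samples):
    """Hypothetical stand-in for the prior network of one latent level.

    Dense connectivity: the parameters depend on the frame feature and on
    the samples of ALL previously sampled levels, not only the nearest one.
    """
    mu = 0.1 * (level + 1) * frame_feature + 0.5 * sum(previous_samples)
    sigma = 1.0 / (level + 1)
    return mu, sigma


def ancestral_sample(frame_feature, num_levels, rng):
    """Sample the levels in order, conditioning each on the earlier samples."""
    samples = []
    for level in range(num_levels):
        mu, sigma = prior_params(level, frame_feature, samples)
        samples.append(mu + sigma * rng.gauss(0.0, 1.0))
    return samples


if __name__ == "__main__":
    print(ancestral_sample(frame_feature=1.0, num_levels=3, rng=random.Random(0)))
```

The top level is drawn first, and every subsequent level sees the values already drawn, which mirrors the top-down order described above.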

Frame Encoder We use a 2D CNN with ResNet [11] blocks and max-pooling layers to represent input frames.

Prior/Approximate Posterior We parametrize both the prior and the posterior as a hierarchy of diagonal Normal distributions $\mathcal{N}(\mu, \sigma)$, where the parameters $\mu$ and $\sigma$ are recurrent functions of samples from i) previous levels in the hierarchy and ii) the frame encoder features. Each level in the hierarchy operates at a different spatial resolution, with the top level features operating at a 1x1 resolution, i.e., not having a spatial topology. At a given timestep $t$, the parameters for a specific latent level $l$ are given by a ConvLSTM that consumes i) a previous hidden state, ii) samples from the previous levels in the hierarchy $z_t^{<l}$, iii) the feature map of a frame with the same spatial resolution as the ConvLSTM. For the prior network, the input frame embedding corresponds to the previously generated frame $x_{t-1}$, while for the posterior the input comes from the frame to generate $x_t$.

Likelihood/Frame Decoder At each timestep $t$, the decoder takes a representation of the previously generated frame $x_{t-1}$ and the samples $z_t$ and generates $x_t$ according to $p_\theta(x_t \mid z_{\le t}, x_{<t}, \mathbf{c})$. The decoder consists of ConvLSTMs interleaved with transposed convolutions that upscale the feature maps back to the input resolution.

Initial State The initial states of our prior, posterior and decoder/likelihood models are functions of the context. We use small CNNs to initialize each of the ConvLSTMs layers used in the VAE components.

5 Experiments

All our models are trained with Adam [16] on Nvidia DGX-1s. We use a learning rate warmup [10], starting with an initial learning rate of 2e-5 that is linearly increased at each training step until reaching 1.6e-4 after 5 epochs. We use Adam parameters $\beta_1$ = 0.5 and $\beta_2$ = 0.9 and a weight decay of 1e-4. We train the autoregressive components of our models using teacher forcing [39].
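
A linear learning rate warmup with the values quoted above can be sketched as below. The function name and the `steps_per_epoch` argument are hypothetical; the number of optimizer steps per epoch depends on the batch size and dataset and is treated here as an input.

```python
def warmup_learning_rate(step, steps_per_epoch, lr_start=2e-5, lr_end=1.6e-4, warmup_epochs=5):
    """Linearly interpolate from lr_start to lr_end over the warmup period."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if step >= warmup_steps:
        return lr_end  # constant learning rate once warmup is finished
    return lr_start + (lr_end - lr_start) * step / warmup_steps


if __name__ == "__main__":
    for step in (0, 250, 500, 1000):
        print(step, warmup_learning_rate(step, steps_per_epoch=100))
```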

Our models are also trained using beta warmup [31], which consists in gradually increasing the weight of the KL divergence in the ELBO, turning the model from an unregularized Autoencoder into a VAE progressively. VAEs trained with beta warmup usually encode more information in the latent variables. Refer to the Appendix for a complete description of our models.
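
Beta warmup can be sketched as a schedule on the KL weight. The text does not state the shape or length of the schedule, so the linear ramp, the `warmup_steps` parameter and the function names below are assumptions made for illustration.

```python
def kl_weight(step, warmup_steps, beta_max=1.0):
    """Assumed linear ramp of the KL weight from 0 to beta_max."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def warmup_loss(reconstruction_nll, kl, step, warmup_steps):
    """Negative ELBO with a warmed-up KL weight.

    At step 0 the loss is a pure reconstruction (autoencoder) loss; once the
    weight reaches 1 it is the standard negative ELBO.
    """
    return reconstruction_nll + kl_weight(step, warmup_steps) * kl


if __name__ == "__main__":
    for step in (0, 5, 10, 20):
        print(step, warmup_loss(10.0, 4.0, step, warmup_steps=10))
```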

5.1 Ablation Study

We first investigate the importance of each VRNN component, namely the likelihood, the prior and the posterior. We focus on the BAIR Push dataset [7] with 64x64 color sequences of a robotic arm interacting with children toys in a sandbox. Similarly to previous works [21], we use trajectories 256 to 511 as our test set and the rest for training, resulting in the 43264 train and 256 test sequences. At training we randomly subsample 12 frames from each train sequence, use the first 2 frames as the context, and learn to predict the remaining 10 frames. To evaluate the different model variations, we report the training objective (ELBO) obtained for the training set and the test set.
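
The data split and clip sampling described above can be sketched as follows. Several details are assumptions: trajectories are taken to be indexed from 0, the total count 43520 is simply 43264 + 256, and "randomly subsample 12 frames" is interpreted as drawing a random contiguous window of 12 frames.

```python
import random


def split_bair(num_trajectories=43520, test_start=256, test_end=511):
    """Hold out trajectories test_start..test_end (inclusive) as the test set."""
    test = list(range(test_start, test_end + 1))
    held_out = set(test)
    train = [i for i in range(num_trajectories) if i not in held_out]
    return train, test


def sample_training_clip(sequence, rng, clip_len=12, num_context=2):
    """Draw a contiguous clip and split it into context and target frames."""
    start = rng.randint(0, len(sequence) - clip_len)  # inclusive bounds
    clip = sequence[start:start + clip_len]
    return clip[:num_context], clip[num_context:]


if __name__ == "__main__":
    train, test = split_bair()
    print(len(train), len(test))
    context, targets = sample_training_clip(list(range(30)), random.Random(0))
    print(context, targets)
```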

5.1.1 Scaling the Likelihood Model

| Model | Parameters | Train/Test ELBO (↓) |
|---|---|---|
| 1-ConvLSTM | 62.22M | 3237/3826 |
| 3-ConvLSTM | 86.64M | 1948/2355 |
| 6-ConvLSTM | 93.81M | 1279.21/1731.31 |
| + higher capacity | 194.15M | 1113.31/1589.72 |
Table 1: Ablation - Likelihood We compare models with different number of recurrent layers for the likelihood network. We observe that the model performance increases monotonically as we add more ConvLSTMs. We further increase the size of the recurrent hidden states for the 6-ConvLSTM model (+ higher capacity variant), also leading to a better data fit. These results suggest that current video prediction models might underfit the data because of reduced likelihood capacity.

We assess the importance of the likelihood model . For this purpose, we build a VRNN with a single level of latent variables and modify the number of ConvLSTM layers in the decoder. Our aim is to investigate whether increasing the capacity of the mapping from latent to the observations results in better predictions.

In this experiment, our baseline likelihood model has one LSTM at 1x1 spatial resolution. We then gradually replace convolutional layers in the decoder with ConvLSTM layers, which increases the amount of information that can be carried from previous timesteps and, by extension, also increases the overall likelihood model capacity. We compare to a model with 3 ConvLSTM layers at resolutions 1x1, 4x4 and 8x8 and a model with 6 ConvLSTM layers at 1x1, 4x4, 8x8, 16x16, 32x32 and 64x64. Additionally, we also increase the size of the ConvLSTM layers for the model with 6 layers as another way of adding capacity.

Results can be found in Table 1. We observe that, as a general trend, both the training and test ELBO decrease as we increase the model capacity, which suggests that current video prediction models might operate in an underfitting regime and benefit from higher capacity decoders.

5.1.2 More Flexible Prior and Posterior

We now investigate the importance of having more flexible prior and approximate posterior distributions and augment the 6-ConvLSTM VRNN model with a hierarchy of latent variables. For all models, we fix the frame encoder and likelihood model¹ and change the networks that estimate the learned prior $p_\theta$ and the approximate posterior $q_\phi$ over the latent variables. All these models use a dense connectivity pattern and beta warmup.

¹ To add the multiple levels of latents in the decoder we need to modify the likelihood network and slightly increase the number of parameters. However, most of the added capacity when adding a new level of latents goes towards the prior and posterior networks.

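| Model | Parameters | Train/Test ELBO (↓) |
|---|---|---|
| 1 | 166.55M | 1141.85/1536.93 |
| 1-8 | 220.60M | 989.39/1313.02 |
| 1-8-32 | 230.74M | 883.10/1162.24 |
| 1-8-16-32 | 245.19M | 956.63/1256.22 |

| Model | Parameters | Train/Test ELBO (↓) |
|---|---|---|
| Naive Training | 224.18M | 1127.33/1440.58 |
| BW | 224.18M | 1101.39/1440.62 |
| Dense | 230.74M | 1182.60/1547.05 |
| BW and Dense | 230.74M | 883.10/1162.24 |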
Table 2: Ablation - Hierarchy of Latents Top half: We compare a VRNN baseline with a single level of latents with no spatial topology (1), a model with two levels of latents at resolutions 1x1 and 8x8 (1-8), our full model with three levels of latents at 1x1, 8x8 and 32x32 (1-8-32), and a model with 4 levels of latents (1-8-16-32). Adding more levels of latents leads to a better fit, with reduced ELBOs. However, adding extra levels of latents without increasing the spatial resolution reduces the performance of the model due to the difficulties in training hierarchical latent variable models. Bottom half: To highlight the difficulties in training hierarchies of latents, we investigate the effects of using beta warmup (BW) [31] and having a dense connectivity (Dense) between latents when training the 1-8-32 model. Without these techniques the hierarchy of latents does not bring any benefit compared to the VRNN with 1 level of latent.

We compare a VRNN baseline with a single level of latents with no spatial topology to a model with two levels of latents at resolutions 1x1 and 8x8 (1-8), a model with three levels of latents at 1x1, 8x8 and 32x32 (1-8-32), and a model with four levels of latents (1-8-16-32) in the top half of Table 2. All models are trained with beta warmup and dense latent connectivity. We observe that in general adding more levels of latents with higher resolution reduces the train and test ELBOs, supporting the hypothesis that a more flexible prior and posterior leads to a better data fit. However, we observe diminishing returns past 3 levels, as our 1-8-16-32 model does not outperform the 3-level model. We attribute this to the difficulties in training deep hierarchies of latents, which remains a challenging optimization problem.

To further highlight the difficulties in training hierarchies of latents, we investigate the importance of using beta warmup [31] and having a dense connectivity between latents. The results of this experiment can be found in the bottom half of Table 2. We observe that these techniques are required to make our 1-8-32 model make use of the hierarchy of latents and improve upon the single level model.

Figure 4: Average normalized KL per latent channel. We visualize the mean normalized KL for each latent channel for models from Table 2. Without beta warmup and dense connectivity the hierarchy of latents is underutilized, with most information being encoded in a few latents of the top level. In contrast, the same model with these techniques utilizes all latent levels.

This is analyzed in more detail in Fig 4, where we visualize the KL between the prior and the posterior distributions for the test sequences of the BAIR Push dataset for the 1-8-32 model and the variant without warmup or dense connectivity (Naive training). We consider a channel to be active if its average KL is higher than 0.01 following [22], and consider that a unit with a KL higher than 0.15 is maximally activated. We observe that without these techniques the model only uses a few latents of the top level in the hierarchy. However, when using beta warmup and a dense connectivity most of the latents are active across levels.
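
The channel activity criterion can be sketched as below. The thresholds 0.01 and 0.15 come from the text; dividing by the upper threshold and clipping at 1 is an assumed way of obtaining a "normalized" KL for visualization, and the function name is hypothetical.

```python
def channel_activity(avg_kl_per_channel, active_threshold=0.01, max_threshold=0.15):
    """Flag active channels and compute an (assumed) normalized KL in [0, 1]."""
    active = [kl > active_threshold for kl in avg_kl_per_channel]
    # Channels at or above max_threshold are treated as maximally activated.
    normalized = [min(kl / max_threshold, 1.0) for kl in avg_kl_per_channel]
    return active, normalized


if __name__ == "__main__":
    active, normalized = channel_activity([0.0, 0.02, 0.3])
    print(active, normalized)
    print("fraction of active channels:", sum(active) / len(active))
```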

5.2 Comparisons to Previous Approaches

Next, we compare our single latent level VRNN (Ours w/o Hier), and our hierarchical VRNN with 3 levels of latents (Ours w/ Hier) to previous video approaches on Stochastic Moving MNIST  [4], BAIR Push [7] and the Cityscapes  [3] datasets.

5.2.1 Evaluation and Metrics

Defining evaluation metrics for video prediction is an open research question. In general we want models to predict sequences that are plausible, look realistic and cover all possible outcomes. Unfortunately, we are not aware of any metric that reflects all these aspects.

To measure coverage and plausibility we adopt the evaluation protocol proposed in [4, 21]. For each ground truth test sequence, we sample random predictions from the model which are conditioned on the test sequence initial frames. Then we find the sample that best matches the ground truth sequence according to a given metric and report that metric value. Some common metric choices are Mean-Square Error (MSE), Structural Similarity (SSIM) [38] or Peak Signal-to-Noise Ratio (PSNR). In practice, these metrics have been shown to not correlate well with human judgement as they tend to prefer blurry predictions over sharper but imperfect generations [41, 21, 34]. LPIPS [41], on the other hand, is a perceptual metric that employs CNN features and has better correlation to human judgment. For this evaluation we choose to produce 100 samples following previous work and use SSIM and LPIPS as metrics. We have empirically observed that using 100 samples the results stay fairly consistent across different samplings. We report the metric average over the test set.
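
The best-of-N protocol can be sketched as follows. All names are hypothetical: `sample_fn` stands for drawing one prediction from a model given the context, `metric` for any of the similarity measures above, and videos are flattened into plain lists of numbers so that a toy MSE can be used.

```python
def best_of_n(ground_truth, samples, metric, higher_is_better):
    """Score of the sample that best matches the ground truth."""
    scores = [metric(ground_truth, sample) for sample in samples]
    return max(scores) if higher_is_better else min(scores)


def evaluate(test_set, sample_fn, metric, higher_is_better, num_samples=100):
    """Average best-of-N score over (context, ground_truth) test pairs."""
    best_scores = []
    for context, ground_truth in test_set:
        samples = [sample_fn(context) for _ in range(num_samples)]
        best_scores.append(best_of_n(ground_truth, samples, metric, higher_is_better))
    return sum(best_scores) / len(best_scores)


def mse(a, b):
    """Toy metric on flat lists of pixel values (lower is better)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    toy_test_set = [([1.0, 2.0], [1.0, 2.0]), ([0.0, 0.0], [1.0, 1.0])]
    print(evaluate(toy_test_set, lambda context: context, mse, False, num_samples=3))
```

For SSIM one would pass `higher_is_better=True`, whereas for LPIPS or MSE the best sample is the one with the lowest value.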

Additionally, we also use the recently proposed Fréchet Video Distance (FVD), which measures sample realism. FVD uses features from a 3D CNN and has also been shown to correlate well with human perception [34]. FVD compares populations of samples to assess whether they were both generated by the same distribution (it does not compare pairs of ground truth/generated frames directly). We form the ground truth population by using all the test sequences with their context. For the predicted population we randomly sample one video out of the 100 generated samples for each test sequence. We repeat this process 5 times and report the mean of the FVD scores obtained, which stay fairly stable across samplings.
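
The sampling protocol around FVD can be sketched as below. The FVD computation itself is not implemented: `fvd_fn` is a hypothetical callable that takes two populations of videos and returns a score, and the toy function in the usage example is only a placeholder.

```python
import random


def mean_fvd_over_draws(real_videos, generated_per_sequence, fvd_fn, rng, num_repeats=5):
    """Average FVD over repeated random draws of one prediction per test sequence.

    generated_per_sequence[i] holds all samples generated for test sequence i.
    """
    scores = []
    for _ in range(num_repeats):
        predicted = [rng.choice(candidates) for candidates in generated_per_sequence]
        scores.append(fvd_fn(real_videos, predicted))
    return sum(scores) / num_repeats


if __name__ == "__main__":
    def toy_distance(real, predicted):
        # Placeholder for FVD: absolute difference of population means.
        return abs(sum(real) / len(real) - sum(predicted) / len(predicted))

    real = [1.0, 2.0]
    generated = [[1.0, 1.5], [2.0, 2.5]]
    print(mean_fvd_over_draws(real, generated, toy_distance, random.Random(0)))
```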

| Model | FVD (↓) | LPIPS (↓) | SSIM (↑) |
|---|---|---|---|
| SVG-LP [4] | 90.81 | 0.153 ± 0.03 | 0.668 ± 0.04 |
| Ours w/o Hier | 63.81 | 0.102 ± 0.04 | 0.763 ± 0.09 |
| Ours w/ Hier | 57.17 | 0.103 ± 0.03 | 0.760 ± 0.08 |
Table 3: Stochastic Moving MNIST. We compute the FVD metric between samples from different models and test sequences as well as the average LPIPS and SSIM of the best sample for each test sequence. Our models outperform the SVG-LP baseline on all metrics by a significant margin. While our model with hierarchical latent variables obtains a better FVD score, both variants obtain comparable results in this relatively simple dataset.

[Figure 5 layout: each row shows context frames followed by predicted frames. BAIR Push rows: SAVP [21], SVG-LP [4], Ours w/ Hier. Cityscapes rows: SVG-LP [4], Ours w/ Hier.]

Figure 5: Selected Samples for BAIR Push and Cityscapes. We show a sequence for BAIR Push and Cityscapes and random generations from our model and baselines. On BAIR Push we observe that the SAVP predictions are crisp but sometimes depict inconsistent arm-object interactions. SVG-LP produces blurry predictions in uncertain areas such as occluded parts of the background or those showing object interactions. Our model generates plausible interactions with reduced blurriness relative to SVG-LP. On Cityscapes, the SVG-LP baseline is unable to model any motion. Our model, using a hierarchy of latents, generates more visually compelling predictions. More samples can be found in the Appendix.

5.2.2 Stochastic Moving MNIST

Stochastic Moving MNIST is a synthetic dataset proposed in [4] which consists of black and white sequences of MNIST digits moving over a black background and bouncing off the frame borders. As opposed to the original Moving MNIST dataset [32] with completely deterministic motion, Stochastic Moving MNIST has uncertain digit trajectories - the digits bounce off the border with a random new trajectory. We train two variants of our model and compare to the SVG-LP baseline [4], for which we use a pretrained model from the official codebase. All models are trained using 5 frames of context and 10 future frames to predict. To evaluate the models, we follow the procedure in [4] described in section 5.2.1.

We report the results of the experiment in Table 3. We observe that both versions of our model (with and without the latent hierarchy) outperform the SVG-LP baseline by a significant margin on all metrics. Note that LPIPS and FVD might not be suited to this dataset as they use features from CNNs trained on real world images, but we report them for completeness. Visually, our samples (found in the Appendix) reproduce the digits more faithfully with reduced degradation over time. There are small differences between the two versions of our model, suggesting that the extra expressiveness of the hierarchical model is not necessary in this synthetic dataset.

5.2.3 BAIR Push

| Model | FVD (↓) | LPIPS (↓) | SSIM (↑) |
|---|---|---|---|
| SVG-LP [4] | 256.62 | | |
| SAVP [21] | 143.43 | | |
| Ours w/o Hier | 149.22 | | 0.829 ± 0.06 |
| Ours w/ Hier | 143.40 | 0.055 ± 0.03 | |
Figure 6: BAIR Push - Results. Left: We show the evolution in time of the Average LPIPS and SSIM of the best predicted sample per test sequence. Right: We report the Average FVD, SSIM and LPIPS of the best sample for each test sequence. Compared to SVG-LP, both our model with a single level of latents and the hierarchical models improve all metrics. Compared to SAVP, we obtain better LPIPS and SSIM. Our model with a single level of latents performs better in SSIM but worse on perceptual metrics. When adding the hierarchy of latents, our model matches the FVD of SAVP and improves the LPIPS, indicating samples of similar visual quality and better coverage of the ground-truth sequences.

| Model | FVD (↓) | LPIPS (↓) | SSIM (↑) |
|---|---|---|---|
| SVG-LP [4] | 1300.26 | | |
| Ours w/o Hier | 682.08 | | |
| Ours w/ Hier | 567.51 | 0.264 ± 0.07 | 0.628 ± 0.10 |
Figure 7: Cityscapes - Quantitative Results We report FVD, SSIM and LPIPS scores on Cityscapes at 128x128 resolution for the SVG-LP  [4] baseline and two variants of our model. Increasing the capacity of the likelihood model brings an improvement in all metrics over the SVG baseline. When adding a hierarchy of latents we observe further improvements, validating its usefulness. Even though SVG matches our models in SSIM at later timesteps, this does not correlate well with human judgement, as the generated SVG samples show more blurriness (see Fig. 5).

We compare our VRNN models to SVG-LP [4] and SAVP [21]. We use their official implementations and pretrained models to reproduce their results. We use the experimental setup of previous works  [4, 21], using 2 context frames and generating 28 frames.

Results can be found in Fig. 6. When the robotic arm is interacting with an object, SVG-LP tends to generate blurry predictions, which is reflected in a high FVD score. SAVP exhibits a lower FVD as it produces more realistic-looking predictions. However, SAVP does not have a better coverage of the ground truth sequences compared to SVG-LP as measured by LPIPS and SSIM. By inspecting the SAVP samples we notice that the SAVP generations tend to be sharper but sometimes exhibit temporal inconsistencies or implausible interactions (see Fig. 5). Our VRNN model without the latent hierarchy (w/o Hier) obtains better scores than SVG-LP, the previous best VAE-type model. This supports the importance of having a high-capacity likelihood model. In addition, our hierarchical VRNN further improves both the FVD and LPIPS metrics, suggesting that the hierarchy of latents helps model the data. In particular, our hierarchical VRNN lowers the FVD from 256.62 to 143.40 and also improves the LPIPS relative to SVG-LP, the previous best VAE-based model. It also achieves an FVD similar to that of the SAVP VAE-GAN model (143.40 vs. 143.43), while outperforming it in terms of LPIPS.
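
As a small worked example, the FVD gap can be expressed as a relative reduction using the FVD values reported in Fig. 6. Reading "improvement" as a relative reduction is an assumption made here for illustration, and the helper name is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    # FVD values from the BAIR Push results (lower is better).
    fvd_svg_lp, fvd_savp, fvd_ours_hier = 256.62, 143.43, 143.40
    print(f"vs. SVG-LP: {relative_reduction(fvd_svg_lp, fvd_ours_hier):.1f}% lower FVD")
    print(f"vs. SAVP:   {relative_reduction(fvd_savp, fvd_ours_hier):.2f}% lower FVD")
```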

5.2.4 Cityscapes

The Cityscapes dataset contains sequences recorded from a car driving around multiple cities under different conditions. Cityscapes is a challenging dataset - while contiguous frames are locally similar, uncertainty grows significantly with time. Compared to previous datasets, the backgrounds in Cityscapes do not stay constant across time.

We consider sequences with 30 frames from the training set cities for a total of 1877 train sequences and randomly select 256 test sequences. We use 2 context and 10 prediction frames to train the models. At test time we predict 28 frames following the BAIR Push experimental protocol. We preprocess the videos by taking a 1024x1024 center crop of the original sequences and resizing them to 128x128 pixels. For evaluating the models we use the standard setup where we generate 100 samples per test sequence and report FVD, SSIM and LPIPS metrics. Since none of the baselines from previous experiments are trained on Cityscapes, we use the official SVG implementation (that defines models with 128x128 inputs) and train an SVG-LP model. We train all models for 100 epochs.
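
The preprocessing step (center crop followed by resizing) can be sketched as follows. The interpolation method is not specified in the text, so block averaging is used here as an assumed stand-in; frames are single-channel nested lists to keep the example dependency-free, and the function names are hypothetical.

```python
def center_crop(frame, size):
    """Take a size x size crop from the center of a 2D frame (list of rows)."""
    height, width = len(frame), len(frame[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def downsample_mean(frame, out_size):
    """Resize a square frame by averaging non-overlapping blocks (assumed method)."""
    factor = len(frame) // out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            block = [frame[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def preprocess(frame, crop_size=1024, out_size=128):
    """Center crop to crop_size, then downsample to out_size."""
    return downsample_mean(center_crop(frame, crop_size), out_size)


if __name__ == "__main__":
    # Small 4x8 toy frame instead of a full-resolution image.
    toy = [[float(col) for col in range(8)] for _ in range(4)]
    print(preprocess(toy, crop_size=4, out_size=2))
```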

Results for this experiment can be found in Fig. 7. SVG-LP has trouble modelling motion in the dataset, usually predicting a static image similar to the last context frame. In contrast, our model without a hierarchy of latents is able to model the changing scene. When adding hierarchical latents our model is able to capture more fine-grained details, and as a result, it produces more visually appealing samples with a boost in all metrics. We note that the SSIM scores for SVG-LP match those of our models at later timesteps in the prediction; however, this does not translate to better samples, as can be seen in Fig. 5 or in the Appendix. This further indicates that SSIM might not be suitable to evaluate video prediction models.

6 Conclusions

We propose a hierarchical VRNN for video prediction that features an improved likelihood model and a hierarchy of latent variables. Our approach compares favorably to current state-of-the-art models in terms of the Fréchet Video Distance, LPIPS and SSIM metrics, producing visually appealing and coherent samples. Our results demonstrate that current video prediction models benefit from increased capacity, opening the door to further gains with bigger and more flexible generative models.


References

  • [1] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
  • [2] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
  • [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [4] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1182–1191, 2018.
  • [5] E. L. Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.
  • [6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [7] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
  • [8] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [10] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [12] M. D. Hoffman and M. J. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
  • [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • [15] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1771–1779. JMLR.org, 2017.
  • [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
  • [18] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [19] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
  • [20] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
  • [21] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
  • [22] L. Maaløe, M. Fraccaro, V. Liévin, and O. Winther. BIVA: A very deep hierarchy of latent variables for generative modeling. arXiv preprint arXiv:1902.02102, 2019.
  • [23] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
  • [24] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations, 2016.
  • [25] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in neural information processing systems, pages 2863–2871, 2015.
  • [26] R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.
  • [27] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
  • [28] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2912–2921. JMLR.org, 2017.
  • [29] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • [30] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016), 2016.
  • [31] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
  • [32] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In International conference on machine learning, pages 843–852, 2015.
  • [33] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018.
  • [34] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  • [35] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
  • [36] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
  • [37] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
  • [38] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [39] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
  • [40] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
  • [41] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.