FutureGAN: Anticipating the Future Frames of Video Sequences using Spatio-Temporal 3d Convolutions in Progressively Growing Autoencoder GANs

Sandra Aigner and Marco Körner
Technical University of Munich
Munich, Germany
{sandra.aigner, marco.koerner}@tum.de
Abstract

We propose a new Autoencoder GAN model, FutureGAN, that predicts future frames of a video sequence given a sequence of past frames. Our approach extends the recently introduced progressive growing of GANs (PGGAN) architecture of Karras et al. (2018). During training, the resolution of the input and output frames is gradually increased by progressively adding layers to both the discriminator and the generator network. To learn representations that effectively capture the spatial and temporal components of a frame sequence, we use spatio-temporal 3d convolutions. We achieve promising results for frame resolutions of up to 128×128 px over a variety of datasets ranging from synthetic to natural frame sequences, while theoretically not being limited to a specific frame resolution. The FutureGAN learns to generate plausible futures, building representations that effectively capture the spatial and the temporal transformations of the input frames. A great advantage of our architecture, compared to the majority of other video prediction models, is its simplicity. The model receives solely the raw pixel values as input and generates output frames effectively, without relying on additional constraints, conditions, or complex pixel-based error loss metrics.

 

Preprint. Work in progress.

1 Introduction

Automatically predicting a plausible future for an observed video sequence has become a popular field of research in machine learning and computer vision. Teaching machines to anticipate future events based on internal representations of the environment, hence simulating an important part of the human decision-making process, is relevant for many tasks. In robotics, as well as in autonomous driving, these predictions can be used for planning, especially in reinforcement learning settings. Furthermore, generated future sequences, as an additional input to an agent, can lead to better decisions, or at least to faster execution. Traditional computer vision tasks, such as object recognition, detection, and tracking, can benefit from the implicitly learned representations Mathieu et al. (2016).

There are several research branches addressing the pixel-level generation of future video sequences. Early, often purely deterministic, approaches tend to insufficiently model the uncertainty of the output, which leads to blurry predictions. To overcome this effect, some methods decompose the sequential input into its static and dynamic components and learn separate representations. Recently, models that explicitly add stochasticity to tackle this issue managed to beat the previous state of the art. Our approach builds on the idea of generating video predictions using generative adversarial networks (GANs) Goodfellow et al. (2014) to appropriately model the uncertainty of the multi-modal output in pixel space. GANs learn to model the underlying data distribution implicitly by utilizing a critic, the discriminator network, during training. While being trained, the critic constantly provides feedback to the generator on whether generated samples look real or not. This forces the generator to output samples similar to the real ones. Although GAN-based video prediction methods usually manage to preserve sharpness in the generated frames effectively, there are two major drawbacks. Firstly, GANs are hard to train. Secondly, GANs often suffer from mode collapse Salimans et al. (2016), where the generator learns to fool the discriminator by producing images of only a limited set of modes. This means the resulting generative model will not be able to capture the full underlying data distribution.

To overcome the problems arising when training GANs, our approach extends a recently proposed GAN method for high-resolution image generation, progressive growing of GANs (PGGAN) Karras et al. (2018). The basic principle is to gradually increase the image resolution by progressively adding layers in both the generator and the discriminator network. In this paper, we exhibit the power of this training strategy for the task of generating the future frames of video sequences. We show experimental results on three datasets ranging from toy datasets, such as MovingMNIST Srivastava et al. (2015) and MsPacman Cooper (2016), to real-world datasets, such as the KITTI Tracking dataset Geiger et al. (2012). We further investigate the generalization capabilities of the proposed method by recursively feeding the generated sequences back in as inputs. To quantitatively evaluate the resulting video sequences, we compare our methods to other video prediction approaches, as well as to a naive baseline of simply copying the last frame.

The primary contribution of this paper is to provide a simple, yet effective GAN-based video prediction model. Our proposed method is trained to predict multiple future frames at once, while the problems that typically arise when training GANs are avoided. Contrary to other approaches, our networks solely use the raw pixel value information as input, without relying on additional priors or conditional information. Our models are theoretically not limited to a specific image resolution, as layers are added progressively during training. Further, we use the recently proposed Wasserstein GAN with gradient penalty (WGAN-GP) loss Gulrajani et al. (2017) rather than pure pixel-based error loss metrics to effectively increase the quality of the generated frame sequences. We show that our method is able to generate plausible, although not always correct, futures given a sequence of input frames. The generated video predictions show that the FutureGAN model is able to learn representations of spatial and temporal transformations for various datasets. Our experiments indicate that the model generalizes reasonably well to predict deeper into the future, even though it was trained on fewer frames. Karras et al. (2018) have shown that their method enables a stabilized training process without apparent mode collapse effects in practice. The experiments conducted in this paper verify this observation for the task of video prediction as well.

2 Related Work

Since 2014, predicting the future frames of a video given either a single input frame or a sequence of input frames has become a widely researched topic. Ranzato et al. (2014) were the first to provide a baseline model for video prediction using deep neural networks, adopting methods used in natural language modeling. Since then, various other approaches have been introduced. Most of these combine the raw pixel values of the input frame(s) with temporal components Srivastava et al. (2015); Lotter et al. (2017); Wang et al. (2017); Oliu et al. (2017); Liu et al. (2017); Vukotić et al. (2016); Kalchbrenner et al. (2017); Goroshin et al. (2015), dynamically learned filters De Brabandere et al. (2016), latent variables Goroshin et al. (2015), or explicitly incorporated time dependency Vukotić et al. (2016). Lotter et al. (2017), for example, utilize long short-term memory (LSTM) units to learn video representations with methods inspired by predictive coding to generate frames one time step ahead. Others tackled the problem by learning separate representations for the static and dynamic components of a video, or by adding action- or geometry-based conditions, such as pose, optical flow, or depth information Finn et al. (2016); Xue et al. (2016); Mahjourian et al. (2017); Patraucean et al. (2016); Byeon et al. (2017); Hao et al. (2018); Oh et al. (2015).

The most promising results, especially for long-term predictions, have been achieved just recently by approaches that explicitly include stochasticity in their models Xue et al. (2016); Denton and Birodkar (2017); Denton and Fergus (2018); Walker et al. (2016); Babaeizadeh et al. (2018); Lee et al. (2018). Babaeizadeh et al. (2018), for example, combine the pixel value input with a set of latent variables and optionally use the action generator of Finn et al. (2016) to predict a different possible future for each set of latent variables. Those methods, to our knowledge, produce the best outputs so far, because they address the uncertainty in predicting video frames directly. The effect of generating blurry predictions for increasing numbers of time steps is prevented by generating a set of possible futures rather than simply averaging over all modes.

Another attempt to address the multi-modal nature of the video prediction output, and thus to reduce the blurring effect, is to train the generative models in an adversarial setting. Our approach follows this research branch. As Mathieu et al. (2016) first showed, networks trained with an adversarial loss term tend to produce sharper results compared to networks trained only on pixel error-based loss metrics, such as the L2 loss. The idea of using GANs for making video predictions evolved further when traditional image generation GANs were extended to generate image sequences from a set of random latent variables Vondrick et al. (2016); Saito et al. (2017). Vondrick et al. (2016) use their two-stream foreground and background separated network to generate a sequence of 32 frames using layer-wise spatial and temporal up-sampling with 3d convolutions Tran et al. (2015). By changing the generator’s input from random latent variables to the pixel values of an input image, predictions are made in a conditional GAN setting. Kratzwald et al. (2017) build on the approach of Vondrick et al. (2016). Instead of having two separate network streams for foreground and background, they jointly predict the dynamic and static patterns by extending a Wasserstein GAN (WGAN) Arjovsky et al. (2017) for images. For video generation and prediction, they combine an application-specific L2 loss with an adversarial loss term.

In contrast to our approach, many GAN-based video prediction methods add additional information, such as temporal, spatial, geometry or action-based conditions Denton and Birodkar (2017); Bhattacharjee and Das (2017); Tulyakov et al. (2017); Chen et al. (2017); Xiong et al. (2018); Liang et al. (2017); Lu et al. (2017); Villegas et al. (2017a, b); Vondrick et al. (2017); Zeng et al. (2017). Lee et al. (2018), for instance, use a variational autoencoder (VAE) LSTM-based generative model in an adversarial setting to predict the transformation encoding between the previous and the next frame. Optionally, the two learned representations are later combined with a stochastic latent code to predict multiple plausible futures. Many of the GAN-based approaches use deterministic autoencoder (AE)-based networks with LSTM units and train them in an adversarial setting, adding an adversarial term in the loss function Lotter et al. (2016).

Most closely related to our approach are Mathieu et al. (2016); Kratzwald et al. (2017); Vondrick et al. (2016); Bhattacharjee and Das (2017), but the applied losses and the training strategies differ. Our approach extends the idea of using GANs in a multi-scale setting for video prediction. We build on the idea of Karras et al. (2018) to progressively grow the networks to increase the image resolution gradually. The idea of a multi-scale or multi-stage GAN setting for video frame prediction has previously been addressed by either having separate networks or layer-wise up-sampling operations Mathieu et al. (2016); Bhattacharjee and Das (2017); Vondrick et al. (2016, 2017); Kratzwald et al. (2017). Adding the layers progressively during training as the image resolution is increased is new in this context.

3 FutureGAN Model

Figure 1: FutureGAN framework. We initialize our model to take a sequence of frames at 4×4 resolution and output frames of the same resolution. During training, layers are added progressively for each resolution step. The resolution of the input frames matches the resolution of the current state of the networks.

In the following, we describe our proposed FutureGAN architecture and the general training strategy. The framework is based on the idea of training a generative model in an adversarial setting and therefore consists of two separate networks. Our generator network is trained to predict a sequence of future video frames given a sequence of past frames. The second network, the discriminator, is trained to distinguish between generated sequences and real sequences from the training dataset. The discriminator is alternately fed real and generated sequences and calculates a score indicating whether a sequence appears real or not. An output score close to 0 indicates that the discriminator rates a given sequence as probably fake; the higher the output score for a given sequence, the more realistic it appears to the network. The generator network updates its weight parameters according to the feedback it receives from the discriminator, trying to generate sequences that will fool the discriminator.

Our FutureGAN architecture extends the recently introduced PGGAN model of Karras et al. (2018), which achieved impressive results for the task of generating high-resolution images from a set of random latent variables. The core idea of this approach is to start training the networks on low-resolution images. As training proceeds, the resolution is gradually increased by adding layers progressively to both networks. Figure 1 illustrates this concept for our FutureGAN model. See Appendix A Table A.1 for details of the network building blocks. Karras et al. (2018) propose several improvements to the training procedure that tackle the problems which typically arise when training GANs. To stabilize training by constraining the signal magnitudes and the competition between the two networks, the PGGAN model uses normalization in both the discriminator and the generator network. Firstly, a weight scaling procedure is proposed to equalize the dynamic range, and thus the learning speed, for all weights. Secondly, the authors suggest a pixel-wise normalization of the feature vectors in the generator. According to Karras et al. (2018), this prevents the escalation of signal magnitudes in the generator and discriminator that results from an unhealthy competition between the two networks. To increase the variation in the generator’s outputs, and thus to prevent mode collapse, Karras et al. (2018) add a mini-batch standard deviation layer in one of the last layers of the discriminator.

To benefit from these improvements, we adopt the key aspects of both the model architecture and the training strategy and extend them in our FutureGAN. We modify the proposed networks to capture both the spatial and temporal components of the input sequence by exchanging the spatial 2d convolutions for spatio-temporal 3d convolutions. Spatio-temporal upsampling is realized via transposed 3d convolutions. Instead of zero padding for treating the border pixels of the frames, we use replication padding. To introduce non-linearity into the networks, leaky rectified linear units (LReLU) follow each convolution operation of the hidden layers. The number of feature maps in each layer is initially 512; beyond a certain frame resolution, it is halved with every further resolution step.
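The spatial and temporal output sizes of these layers follow standard convolution arithmetic; replication padding changes only the border values, not the output shape. A small helper (hypothetical, not from the paper) illustrates why the 3×3×3 convolutions with padding 1 preserve the frame-sequence shape, and how a spatial stride of 2 realizes downsampling:

```python
def conv3d_out_shape(in_shape, kernel, stride=(1, 1, 1), pad=(1, 1, 1)):
    """Output (depth, height, width) of a 3d convolution.

    Replication padding, as used in FutureGAN, affects the output size
    exactly like zero padding of the same width; only the padded values
    differ.
    """
    return tuple(
        (i + 2 * p - k) // s + 1
        for i, k, s, p in zip(in_shape, kernel, stride, pad)
    )

# A 3x3x3 convolution with padding 1 preserves the sequence shape:
print(conv3d_out_shape((6, 4, 4), (3, 3, 3)))                    # -> (6, 4, 4)
# A spatial stride of 2 halves height and width (spatial downsampling):
print(conv3d_out_shape((6, 8, 8), (3, 3, 3), stride=(1, 2, 2)))  # -> (6, 4, 4)
```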

3.1 Generator Network

We design the generator of our FutureGAN model to process a video sequence and generate the future frames of this sequence. The input of the generator network is a sequence of past video frames, and its output a sequence of future video frames; the temporal depths of the input and the output sequence are fixed. To properly transform the content of the input sequence into an output sequence that represents a plausible future, we use an AE-based structure for our generator network. We extend the original PGGAN generator by adding an encoder part to learn representations of the information contained in the input sequence. The decoder part of our generator network consists of the basic components of the PGGAN generator. To properly encode and decode both the spatial and temporal components of the input frames, we use 3d convolutions and transposed 3d convolutions instead of 2d convolutions. An exemplary generator structure is illustrated in Appendix A Table A.2 for frame sequences of resolution 128×128.

Generator core module

Our FutureGAN generator’s core module consists of an encoder and a decoder part, operating on frames of the initial spatial resolution of 4×4. The encoder part is composed of two convolutional layers with 3×3×3 [depth (number of frames) × height × width] convolutions, each producing 512 feature maps. A third convolutional layer with 6×1×1 convolutions, again calculating 512 feature maps, follows. This third layer encodes the temporal information of the whole input sequence, resulting in an output of shape 512×1×4×4. The decoder part of this module starts with a temporal decoding layer using transposed 6×1×1 convolutions. After this layer, we add two convolutional layers with 3×3×3 convolutions, producing an output of shape 512×6×4×4. Building blocks are added to both the encoder and the decoder part during training to increase the spatial frame resolution. The blocks added for higher resolutions each contain two convolutional layers with 3×3×3 convolutions, together with spatial upsampling layers in the decoder part and spatial downsampling layers in the encoder part, respectively. For processing input and producing output frames of shape channels×depth×height×width, we use 1×1×1 convolutions. This operation corresponds to mapping the color channels to the 512 feature vectors and vice versa. Each convolutional layer of this module, except for the convolutional output layer, is followed by an LReLU activation and a pixel-wise feature vector normalization layer.

Pixel-wise feature vector normalization

For a stabilized training process, Karras et al. (2018) incorporate pixel-wise feature vector normalization after the activations of the convolutional layers. Based on a variant of the local response normalization Krizhevsky et al. (2012), the feature vector is normalized to unit length in each pixel. We slightly modify this layer of the PGGAN model to operate on both the spatial and temporal elements of the feature maps. The procedure can be described as $b_{t,x,y} = a_{t,x,y} / \sqrt{\tfrac{1}{N} \sum_{j=0}^{N-1} (a^{j}_{t,x,y})^2 + \epsilon}$, where $\epsilon = 10^{-8}$, $N$ is the number of feature maps, $a_{t,x,y}$ is the original, and $b_{t,x,y}$ the normalized feature vector of the pixel $(t, x, y)$.
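The normalization above can be sketched in a few lines of numpy; this is an illustrative re-implementation under the formula given here, not the paper's code. The channel axis plays the role of the feature vector, and the temporal axis is simply carried along:

```python
import numpy as np

def pixelwise_feature_norm(a, eps=1e-8):
    """Normalize the feature vector in each (t, y, x) position to
    approximately unit mean-square, as in PGGAN's pixel-wise
    normalization, extended over the temporal axis.

    a: array of shape (batch, channels, depth, height, width)
    """
    # mean of squared activations over the channel (feature map) axis
    denom = np.sqrt(np.mean(a ** 2, axis=1, keepdims=True) + eps)
    return a / denom

a = np.random.randn(2, 512, 6, 4, 4)
b = pixelwise_feature_norm(a)
# every per-pixel feature vector now has mean squared value ~1
print(np.allclose(np.mean(b ** 2, axis=1), 1.0, atol=1e-4))  # -> True
```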

3.2 Discriminator Network

The discriminator of our FutureGAN model is designed to distinguish between real and generated frame sequences. As input, the discriminator network alternately receives frames from the training set, representing the ground truth, and frames produced by the generator. The output of the discriminator network is a scalar score that reflects the discriminator’s estimated probability of the given input being real rather than fake. We set the labels for real sequences to 1 and the labels for fake sequences to 0. Our discriminator closely resembles the discriminator of the PGGAN model. To learn representations that capture the spatial and temporal information of the frame sequence, we use 3d convolutions instead of 2d convolutions. The discriminator’s structure is shown in detail in Appendix A Table A.3, exemplary for frame sequences of resolution 128×128.

Discriminator core module

Except for the core module, i.e., the output block, the discriminator of our FutureGAN model is basically a mirror image of the generator’s encoder part. The core module operates on frames of the initial spatial resolution of 4×4. We first use a mini-batch standard deviation layer, followed by a convolutional layer with 3×3×3 convolutions, calculating 512 feature maps, and an LReLU activation. A third layer follows, using convolutions that span the remaining temporal depth and the full 4×4 spatial extent to encode the spatial and temporal information into an output of shape 512×1×1×1. After a fully-connected layer and a linear layer, the discriminator outputs a single scalar score. Building blocks are prepended to the core module of the discriminator during training to increase the spatial frame resolution. The blocks added for higher resolutions each contain two convolutional layers with 3×3×3 convolutions and a spatial downsampling layer. The color channels of the input frames of shape channels×depth×height×width are mapped to the 512 feature vectors by 1×1×1 convolutions.

Mini-batch standard deviation

To increase variation, Karras et al. (2018) insert a mini-batch standard deviation layer, producing an additional feature map, in one of the last layers of the discriminator. This layer computes the standard deviation for each feature in each spatial location over the mini-batch. Averaging these values over all features and spatial locations produces a single scalar value. To obtain the additional feature map, this value is replicated for every spatial location in the mini-batch. We modify this layer to calculate this constant feature map over the temporal depth as well as the spatial locations.
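The computation just described reduces to three numpy operations; the following is an illustrative sketch of the statistic itself (the trainable layer in the paper concatenates this map to its input features):

```python
import numpy as np

def minibatch_stddev_map(x):
    """Extra feature map for the discriminator, as in PGGAN,
    extended over the temporal axis.

    x: (batch, channels, depth, height, width)
    returns a constant map of shape (batch, 1, depth, height, width)
    """
    # standard deviation over the mini-batch for every feature
    # in every temporal and spatial location
    std = np.std(x, axis=0)              # (C, D, H, W)
    # average into a single scalar value
    scalar = np.mean(std)
    # replicate over all temporal and spatial locations
    n, _, d, h, w = x.shape
    return np.full((n, 1, d, h, w), scalar)

x = np.random.randn(4, 8, 6, 4, 4)
extra = minibatch_stddev_map(x)
print(extra.shape)  # -> (4, 1, 6, 4, 4)
```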

3.3 Training Procedure

Our training procedure is similar to the one used by Karras et al. (2018) to train the PGGAN model. We initialize our networks to start the training process at a frame resolution of 4×4. This resolution is gradually increased by a factor of 2 as soon as the model has observed a specific number of frame sequences from the training dataset.
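Assuming the 4×4 starting resolution and the 128×128 final resolution used in the experiments, the doubling schedule visits the following spatial resolutions (a trivial sketch, not the paper's code):

```python
def resolution_schedule(start=4, final=128):
    """Spatial resolutions visited during progressive growing,
    doubling at every resolution step."""
    res = start
    schedule = [res]
    while res < final:
        res *= 2
        schedule.append(res)
    return schedule

print(resolution_schedule())  # -> [4, 8, 16, 32, 64, 128]
```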

Adding layers for increased resolutions

To ensure a smooth transition when layers are added to the networks, we stick closely to the procedure of Karras et al. (2018). Adding new layers to the networks is performed in two steps. The first step is the transition phase, where the layers operating on the frames of the next resolution are treated as a residual block whose weight increases linearly from 0 to 1. While the model is trained in the transition phase, interpolated inputs are fed into both networks, making the input frames match the resolution of the current state of the networks. The second step is the stabilization phase, where the networks are trained for a specified number of iterations before the resolution is doubled again. Unlike the original paper, we do not grow the two networks simultaneously, but separate the two steps for the generator and the discriminator: we start by transitioning the generator network, then stabilize it, and repeat the procedure for the discriminator network.
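The residual fade-in of the transition phase amounts to a linear blend of the old pathway (upsampled to the new resolution) and the new layers; a minimal sketch of that blending step, with hypothetical names:

```python
import numpy as np

def fade_in(low_res_branch, new_branch, alpha):
    """Blend the output of a newly added block into the network.

    During the transition phase, the new layers act as a residual
    block whose weight alpha grows linearly from 0 to 1.
    low_res_branch: output of the old pathway, upsampled to the new size
    new_branch:     output of the newly added layers (same shape)
    """
    return (1.0 - alpha) * low_res_branch + alpha * new_branch

old = np.zeros((1, 3, 6, 8, 8))
new = np.ones((1, 3, 6, 8, 8))
# a quarter of the way through the transition phase:
print(float(fade_in(old, new, 0.25).mean()))  # -> 0.25
```

At alpha = 0 the network behaves exactly like the old, lower-resolution network; at alpha = 1 the new block has fully taken over.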

Weight scaling

To stabilize the training of their PGGAN model, Karras et al. (2018) add a weight-scaling layer on top of all layers. This layer scales the weights at runtime to $\hat{w}_i = w_i / c$, where $w_i$ are the layer weights and $c$ is the per-layer normalization constant from He’s initializer He et al. (2015). Using this layer in a network equalizes the dynamic range, and therefore the learning speed, for all weights. Our FutureGAN fully adopts this weight scaling strategy.
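In common implementations of this idea, weights are stored with unit variance and multiplied by the He constant $\sqrt{2/\text{fan\_in}}$ in every forward pass; the sketch below follows that convention (an assumption about the implementation, not code from the paper):

```python
import numpy as np

def he_constant(fan_in, gain=np.sqrt(2.0)):
    """Per-layer normalization constant from He's initializer."""
    return gain / np.sqrt(fan_in)

def scaled_weights(w, fan_in):
    """Runtime weight scaling: weights are stored with unit variance
    and rescaled by the He constant in every forward pass, which
    equalizes the effective learning speed across layers."""
    return w * he_constant(fan_in)

# a 3x3x3 convolution with 512 input and output feature maps
w = np.random.randn(512, 512, 3, 3, 3)   # unit-variance weights
fan_in = 512 * 3 * 3 * 3
print(scaled_weights(w, fan_in).std())    # close to sqrt(2 / fan_in)
```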

WGAN-GP loss with epsilon penalty

We use the same loss function as is used to optimize the PGGAN model. This loss function consists of the WGAN-GP loss Gulrajani et al. (2017) and an additional term to prevent the loss from drifting, the epsilon-penalty term.

The WGAN-GP loss with epsilon penalty for the discriminator is defined as

$$L_D = \underset{\tilde{x} \sim \mathbb{P}_g}{\mathbb{E}}\big[D(\tilde{x})\big] - \underset{x \sim \mathbb{P}_r}{\mathbb{E}}\big[D(x)\big] + \lambda \underset{\hat{x} \sim \mathbb{P}_{\hat{x}}}{\mathbb{E}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big] + \epsilon_{\mathrm{drift}} \underset{x \sim \mathbb{P}_r}{\mathbb{E}}\big[D(x)^2\big], \qquad (1)$$

where $\mathbb{P}_r$ is the data distribution, $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $\epsilon_{\mathrm{drift}}$ is the epsilon-penalty coefficient, and $\lambda$ is the gradient-penalty coefficient. $\mathbb{P}_{\hat{x}}$ is implicitly defined by sampling uniformly along straight lines between pairs of points sampled from the data distribution $\mathbb{P}_r$ and the generator distribution $\mathbb{P}_g$.

The WGAN(-GP) loss for the generator is defined as

$$L_G = -\underset{\tilde{x} \sim \mathbb{P}_g}{\mathbb{E}}\big[D(\tilde{x})\big]. \qquad (2)$$
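Given discriminator scores and the gradient norms at the interpolated samples (which an autodiff framework computes in practice), the two losses are simple arithmetic. A numpy sketch of Eqs. (1) and (2), illustrative only:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, grad_norms, lam=10.0, eps_drift=0.001):
    """WGAN-GP discriminator loss with epsilon penalty, Eq. (1).

    d_real, d_fake: discriminator scores for real and generated sequences
    grad_norms:     norms of the discriminator gradient at interpolated
                    samples (supplied by autodiff in a real training loop)
    """
    wasserstein = np.mean(d_fake) - np.mean(d_real)
    gradient_penalty = lam * np.mean((grad_norms - 1.0) ** 2)
    drift = eps_drift * np.mean(d_real ** 2)
    return wasserstein + gradient_penalty + drift

def generator_loss(d_fake):
    """WGAN generator loss, Eq. (2): maximize the critic score on fakes."""
    return -np.mean(d_fake)

d_real = np.array([1.0, 2.0])
d_fake = np.array([0.5, -0.5])
# gradient norms of exactly 1 make the gradient penalty vanish
print(discriminator_loss(d_real, d_fake, np.array([1.0, 1.0])))  # -> -1.4975
print(generator_loss(d_fake))                                    # -> -0.0
```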

4 Experiments and Evaluation

To evaluate our proposed FutureGAN model, we conducted experiments on three different datasets. We used two synthetic toy datasets, MovingMNIST Srivastava et al. (2015) and MsPacman Cooper (2016). Our third dataset is a natural image dataset, the KITTI Tracking dataset Geiger et al. (2012). The experiments and evaluations on the MovingMNIST and the KITTI Tracking dataset are described in the following. Additional results for the MsPacman dataset can be found in Appendix B.

The general setting is that we trained our model to predict a future sequence of six frames, conditioned on a past sequence of also six frames. The penalty coefficients of the WGAN-GP loss with epsilon penalty were set to $\lambda = 10$ and $\epsilon_{\mathrm{drift}} = 0.001$, as proposed by the original authors. For optimization, we used the ADAM optimizer Kingma and Ba (2015) with $\beta_1 = 0$ and $\beta_2 = 0.99$. Our initial learning rate was set heuristically and decayed by a constant factor with every resolution step. In each resolution step, we adjusted the batch size dynamically during training according to the available GPU RAM. These settings remained fixed throughout all of our experiments. We implemented our FutureGAN using the PyTorch framework. The experiments were carried out on an NVIDIA DGX-1 cluster of 8 Tesla P100 GPUs, each with 16 GB of RAM, and on single NVIDIA Tesla P100 GPUs, also with 16 GB of RAM.

4.1 MovingMNIST

To verify the effectiveness of our model architecture in general, we utilized the MovingMNIST dataset as a toy example. We generated a set of 4500 frame sequences for training, each of length 36. Each MovingMNIST frame, of resolution 64×64, displays two bouncing digits of distinct classes. Our generator network was trained to predict six future frames while being conditioned on six input frames, thus a total of 13499 frame sequences was used for training. For testing, we generated another set of 2250 frame sequences of length 36, resulting in a test set of 6750 sequences. The training took about five days in total until the networks arrived at the desired resolution.

Figure 2: Qualitative results for the MovingMNIST test set. (a) Ground truth, (b) FutureGAN (ours), (c) FC-LSTM Srivastava et al. (2015).

In Figure 2, we show a qualitative comparison of our FutureGAN model to the FC-LSTM model of Srivastava et al. (2015), which is an AE-based LSTM approach. Note that we took the image results for this network from the corresponding paper. Although our network produces a different digit, for instance a 2 instead of a 3 in sequence one, the objects appear sharper than those generated by the FC-LSTM network. We further observe that our model seems to be capable of encoding the temporal components of the input frame sequences.

4.2 KITTI Tracking

To further investigate how our FutureGAN is able to scale to real-world scenarios, we trained our network to predict the future frames of sequences from the KITTI dataset. We used the KITTI Tracking sequences, split into training and testing sets, as provided by Geiger et al. (2012). The dataset contains 21 frame sequences for training and 29 frame sequences for testing, both of varying sequence length. In total, our training set consisted of 657 frame sequences for predicting, again, six future frames based on six past frames. Our test set contained 598 frame sequences. To match our intended network resolution of 128×128, we used nearest neighbor interpolation to downsample the frames from their original resolution. No cropping operation was used to preprocess the data.

Figure 3: Qualitative results for the KITTI Tracking test set. (a) Ground truth, (b) FutureGAN (ours), (c) MS-GAN Bhattacharjee and Das (2017).

For quantitative evaluation and for a comparison of our FutureGAN model to a different GAN approach, we calculate the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR) between the ground truth and the generated frame sequences. Firstly, we compare our FutureGAN to a naive baseline of simply copying the last frame of the input frame sequence. Secondly, we compare our model to an approach that is most closely related to ours, the multi-stage GAN (MS-GAN) of Bhattacharjee and Das (2017). The quantitative measures for the comparison with MS-GAN are shown in Figure 5. Note that we took the image results, PSNR, and SSIM values for the MS-GAN model from the corresponding paper. For a qualitative comparison, we refer to Figure 3.
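PSNR is defined as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MAX is the maximum possible pixel value. A minimal reference implementation (an illustration of the metric, not the evaluation code used for the paper):

```python
import numpy as np

def psnr(ground_truth, prediction, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two frames."""
    mse = np.mean((ground_truth.astype(np.float64)
                   - prediction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")     # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.full((128, 128), 100.0)
pred = np.full((128, 128), 110.0)    # constant pixel error of 10
print(round(psnr(gt, pred), 2))      # -> 28.13
```

Higher values indicate a prediction closer to the ground truth; SSIM additionally accounts for structural agreement and is typically computed with a library such as scikit-image.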

Figure 4: Results of long-term predictions for the KITTI Tracking test set. (a) Ground truth, (b) FutureGAN.
(a) PSNR
(b) SSIM
Figure 5: Per frame and average PSNR and SSIM values between the ground truth and the generated frames for the KITTI Tracking test set.

In Figure 4, we show qualitative results for long-term predictions of 18 future video frames. We achieve this by recursively feeding the six output frames produced by the generator network back in as inputs. Notice that there is no obvious loss of image quality. The quantitative measures listed in Figure 5 support this impression. In some cases, it even seems as if the quality increases with the number of predicted frames. Looking at the generated future frames in detail, we found that, especially in scenarios with little to no camera movement, the generator can easily predict plausible futures.
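The recursive scheme is straightforward: each predicted chunk becomes the next input. A toy sketch with a stand-in predictor (the real predictor is the trained generator; `shift_six` is purely illustrative):

```python
def predict_long_term(predict_six, past_frames, steps=3):
    """Recursively extend a 6-frame predictor to 6 * steps frames by
    feeding each predicted chunk back in as the next input.

    predict_six: function mapping a list of 6 frames to 6 future frames
    """
    current = list(past_frames)
    future = []
    for _ in range(steps):
        chunk = predict_six(current)
        future.extend(chunk)
        current = chunk          # generated frames become the new input
    return future

# Toy predictor: each "frame" is an integer, the predictor counts on.
shift_six = lambda frames: [f + 6 for f in frames]
result = predict_long_term(shift_six, [0, 1, 2, 3, 4, 5])
print(result[0], result[-1], len(result))  # -> 6 23 18
```

With three recursion steps, six conditioning frames yield the 18-frame predictions shown in Figure 4.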

4.3 Failure Cases

We experienced several cases where our model fails to generate a correct future. In the experiments with the MovingMNIST dataset, the generator sometimes exchanges a digit shown in the input sequence for a different, seemingly randomly chosen, digit. On rare occasions, the generator predicts two digits of the same class; since we only included sequences containing distinct digits, this is an odd behavior. During the experiments on the KITTI dataset, we observed that our generator mainly fails to generate plausible futures for frames with extreme lighting conditions.

5 Conclusion and Future Directions

We show that our proposed FutureGAN model achieves promising results for the task of predicting the future frames of a video sequence. By gradually adding layers to the networks during training, the resolution of both the input frames and the generated output frames is increased progressively. The training method proposed by Karras et al. (2018) has the effect that detailed information, typically contained in higher-resolution images, is learned in the late training steps. This strategy can be interpreted as splitting the learned task into small, simplified sub-tasks. Our experimental results on the MovingMNIST and the KITTI Tracking dataset show that the FutureGAN learns to capture the temporal and spatial components effectively using this training procedure. Although the predictions do not always seem to display the correct future, the generated frames still appear plausible. We observe that our network identifies moving objects in the input frames and transforms these objects based on its learned internal representations. As our FutureGAN model is still at a relatively low resolution of 128×128, we expect it to learn even better representations when trained on higher resolutions.

For future directions, we suggest adding the sequence of past frames to the input of the discriminator, thus conditioning the discriminator network as well. We believe this will prevent the generator from abruptly changing the appearance of objects that were already present in the input frames. The discriminator would likely learn quickly to focus on sudden changes between two frames; the generator would therefore be forced to produce future frames that retain more of the information captured in the input sequence.

Acknowledgements

The authors are very grateful for the computing resources provided by the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities (BAdW) and their excellent team, with special thanks to Yu Wang. Further, we gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU, used for prototyping this research. Finally, we thank our colleagues of the Computer Vision Research Group and our colleague Lloyd Hughes for providing valuable feedback.

References

  • Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.
  • Babaeizadeh et al. [2018] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic Variational Video Prediction. In ICLR, 2018.
  • Bhattacharjee and Das [2017] P. Bhattacharjee and S. Das. Temporal Coherency based Criteria for Predicting Video Frames using Deep Multi-stage Generative Adversarial Networks. In NIPS. Curran Associates, Inc., 2017.
  • Byeon et al. [2017] W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos. Fully Context-Aware Video Prediction. CoRR, abs/1710.08518, 2017.
  • Chen et al. [2017] B. Chen, W. Wang, J. Wang, and X. Chen. Video Imagination from a Single Image with Transformation Generation. In MM, 2017.
  • Cooper [2016] M. Cooper. Adversarial Video Generation - GitHub Repository. https://github.com/dyelax/Adversarial_Video_Generation, 2016. MsPacman Dataset.
  • De Brabandere et al. [2016] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic Filter Networks. In NIPS. Curran Associates, Inc., 2016.
  • Denton and Fergus [2018] E. Denton and R. Fergus. Stochastic Video Generation with a Learned Prior. CoRR, abs/1802.07687, 2018.
  • Denton and Birodkar [2017] E. L. Denton and V. Birodkar. Unsupervised Learning of Disentangled Representations from Video. In NIPS. Curran Associates, Inc., 2017.
  • Finn et al. [2016] C. Finn, I. Goodfellow, and S. Levine. Unsupervised Learning for Physical Interaction through Video Prediction. In NIPS. Curran Associates, Inc., 2016.
  • Geiger et al. [2012] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. In NIPS, 2014.
  • Goroshin et al. [2015] R. Goroshin, M. Mathieu, and Y. LeCun. Learning to Linearize Under Uncertainty. In NIPS. Curran Associates, Inc., 2015.
  • Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved Training of Wasserstein GANs. In NIPS. Curran Associates, Inc., 2017.
  • Hao et al. [2018] Z. Hao, X. Huang, and S. Belongie. Controllable Video Generation with Sparse Trajectories. In CVPR, 2018.
  • He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
  • Kalchbrenner et al. [2017] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video Pixel Networks. In ICML, 2017.
  • Karras et al. [2018] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR, 2018.
  • Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
  • Kratzwald et al. [2017] B. Kratzwald, Z. Huang, D. P. Paudel, A. Dinesh, and L. Van Gool. Improving Video Generation for Multi-functional Applications. CoRR, abs/1711.11453, 2017.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS. Curran Associates, Inc., 2012.
  • Lee et al. [2018] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic Adversarial Video Prediction. CoRR, abs/1804.01523, 2018.
  • Liang et al. [2017] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual Motion GAN for Future-Flow Embedded Video Prediction. In ICCV, 2017.
  • Liu et al. [2017] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video Frame Synthesis Using Deep Voxel Flow. In ICCV, 2017.
  • Lotter et al. [2016] W. Lotter, G. Kreiman, and D. Cox. Unsupervised Learning of Visual Structure using Predictive Generative Networks. In ICLR, 2016.
  • Lotter et al. [2017] W. Lotter, G. Kreiman, and D. Cox. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. In ICLR, 2017.
  • Lu et al. [2017] C. Lu, M. Hirsch, and B. Schölkopf. Flexible Spatio-Temporal Networks for Video Prediction. In CVPR, 2017.
  • Mahjourian et al. [2017] R. Mahjourian, M. Wicke, and A. Angelova. Geometry-Based Next Frame Prediction from Monocular Video. In IV, 2017.
  • Mathieu et al. [2016] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
  • Oh et al. [2015] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In NIPS. Curran Associates, Inc., 2015.
  • Oliu et al. [2017] M. Oliu, J. Selva, and S. Escalera. Folded Recurrent Neural Networks for Future Video Prediction. CoRR, abs/1712.00311, 2017.
  • Patraucean et al. [2016] V. Patraucean, A. Handa, and R. Cipolla. Spatio-Temporal Video Autoencoder with Differentiable Memory. In ICLR, 2016.
  • Ranzato et al. [2014] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (Language) Modeling: A Baseline for Generative Models of Natural Videos. CoRR, abs/1412.6604, 2014.
  • Saito et al. [2017] M. Saito, E. Matsumoto, and S. Saito. Temporal Generative Adversarial Nets With Singular Value Clipping. In ICCV, 2017.
  • Salimans et al. [2016] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved Techniques for Training GANs. In NIPS. Curran Associates, Inc., 2016.
  • Srivastava et al. [2015] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised Learning of Video Representations using LSTMs. In ICML, pages 843–852, 2015.
  • Tran et al. [2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, 2015.
  • Tulyakov et al. [2017] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing Motion and Content for Video Generation. CoRR, abs/1707.04993, 2017.
  • Villegas et al. [2017a] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing Motion and Content for Natural Video Sequence Prediction. In ICLR, 2017a.
  • Villegas et al. [2017b] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to Generate Long-term Future via Hierarchical Prediction. In ICML, 2017b.
  • Vondrick et al. [2016] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating Videos with Scene Dynamics. In NIPS. Curran Associates, Inc., 2016.
  • Vondrick et al. [2017] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating the Future with Adversarial Transformers. In CVPR, 2017.
  • Vukotić et al. [2016] V. Vukotić, A.-L. Pintea, C. Raymond, G. Gravier, and J. Van Gemert. One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network. In NCCV, 2016.
  • Walker et al. [2016] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An Uncertain Future: Forecasting from Static Images using Variational Autoencoders. In ECCV, 2016.
  • Wang et al. [2017] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. In NIPS. Curran Associates, Inc., 2017.
  • Xiong et al. [2018] W. Xiong, W. Luo, L. Ma, W. Liu, and J. Luo. Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks. In CVPR, 2018.
  • Xue et al. [2016] T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks. In NIPS. Curran Associates, Inc., 2016.
  • Zeng et al. [2017] K.-H. Zeng, W. B. Shen, D.-A. Huang, M. Sun, and J. C. Niebles. Visual Forecasting by Imitating Dynamics in Natural Sequences. In ICCV, 2017.

Appendix

Appendix A Network Structure

A.1 Network Building Blocks

Input:                 Conv3d, k = (1,1,1), s = (1,1,1), pad = (0,0,0), WeightScale, LReLU (0.2), FeatureNorm (generator only)
Conv:                  Conv3d, k = (3,3,3), s = (1,1,1), pad = (1,1,1) (replication), WeightScale, LReLU (0.2), FeatureNorm (generator only)
G-Spatial-Downsample:  Conv3d, k = (1,2,2), s = (1,2,2), pad = (0,0,0), WeightScale, LReLU (0.2), FeatureNorm (generator only)
Temporal-Downsample:   Conv3d, k = (n,1,1), s = (1,1,1), pad = (0,0,0), WeightScale, LReLU (0.2), FeatureNorm (generator only)
Temporal-Upsample:     Conv3dTranspose, k = (n,1,1), s = (1,1,1), pad = (0,0,0), WeightScale, LReLU (0.2), FeatureNorm (generator only)
G-Output:              Conv3d, k = (1,1,1), s = (1,1,1), pad = (0,0,0), WeightScale
D-Spatial-Downsample:  AvgPool3d, k = (1,2,2), s = (1,2,2), pad = (0,0,0)
D-Output:              Conv3d, k = (n,4,4), s = (1,1,1), pad = (0,0,0), WeightScale

Table A.1: Network building blocks (kernel size k, stride s, and padding given as (time, height, width)).
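The WeightScale and FeatureNorm operations listed in Table A.1 follow the progressive-growing recipe of Karras et al. (2018). As a hedged illustration (numpy only, with assumed function names, not the model code), they can be sketched as:

```python
import numpy as np

# WeightScale implements the equalized learning rate: weights are drawn
# with unit variance and rescaled at runtime by the He-initialization
# constant sqrt(2 / fan_in). FeatureNorm is the pixelwise feature
# normalization applied in the generator, dividing each feature vector
# by its root-mean-square over the channel axis.

def weight_scale(w):
    """w: (out_ch, in_ch, k_t, k_h, k_w); scale by sqrt(2 / fan_in)."""
    fan_in = np.prod(w.shape[1:])
    return w * np.sqrt(2.0 / fan_in)

def feature_norm(x, eps=1e-8):
    """x: (batch, channels, time, H, W); normalize over the channel axis."""
    return x / np.sqrt(np.mean(x ** 2, axis=1, keepdims=True) + eps)

x = np.random.randn(1, 512, 6, 4, 4)
y = feature_norm(x)
# after normalization, each pixel's mean squared channel value is ~1
```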

A.2 Generator Structure

Encoder part (feature map sizes given as channels × height × width; the time dimension is shown only for the latent code):

128×128:  Input → 128×128×128; 2× Conv → 128×128×128; Spatial-Downsample → 256×64×64
64×64:    2× Conv → 256×64×64; Spatial-Downsample → 512×32×32
32×32:    2× Conv → 512×32×32; Spatial-Downsample → 512×16×16
16×16:    2× Conv → 512×16×16; Spatial-Downsample → 512×8×8
8×8:      2× Conv → 512×8×8; Spatial-Downsample → 512×4×4
4×4:      2× Conv → 512×4×4; Temporal-Downsample → 512×1×4×4

Decoder part:

4×4:      Temporal-Upsample → 512×4×4; 2× 3dConv → 512×4×4
8×8:      NearestNeighbor-Spatial-Upsample → 512×8×8; 2× 3dConv → 512×8×8
16×16:    NearestNeighbor-Spatial-Upsample → 512×16×16; 2× 3dConv → 512×16×16
32×32:    NearestNeighbor-Spatial-Upsample → 512×32×32; 2× 3dConv → 512×32×32
64×64:    NearestNeighbor-Spatial-Upsample → 512×64×64; 2× 3dConv → 256×64×64
128×128:  NearestNeighbor-Spatial-Upsample → 256×128×128; 2× 3dConv → 128×128×128; G-Output → c×128×128

Table A.2: Generator network (shown for input and output sequences of six frames at a resolution of 128×128 pixels with c color channels).
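The spatial sizes in Table A.2 can be verified with the standard convolution output-size formula. The following sketch (illustrative helper name; shapes given as (time, height, width)) reproduces the encoder's progression from six 128×128 input frames down to the 1×4×4 latent code:

```python
# Each Spatial-Downsample with kernel (1,2,2) and stride (1,2,2) halves the
# spatial dimensions, and the final Temporal-Downsample with kernel (n,1,1)
# collapses the n remaining time steps into one.

def conv3d_out_shape(shape, k, s, p=(0, 0, 0)):
    """shape: (T, H, W); standard conv output size per dimension."""
    return tuple((d + 2 * pi - ki) // si + 1
                 for d, ki, si, pi in zip(shape, k, s, p))

shape = (6, 128, 128)                    # six input frames at 128x128
for _ in range(5):                       # five spatial downsampling stages
    shape = conv3d_out_shape(shape, k=(1, 2, 2), s=(1, 2, 2))
# shape is now (6, 4, 4)
shape = conv3d_out_shape(shape, k=(6, 1, 1), s=(1, 1, 1))
# shape is now (1, 4, 4): the temporal bottleneck of the latent code
```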

A.3 Discriminator Structure

128×128:  Input → 128×128×128; 2× Conv → 128×128×128; D-Spatial-Downsample → 256×64×64
64×64:    2× Conv → 256×64×64; D-Spatial-Downsample → 512×32×32
32×32:    2× Conv → 512×32×32; D-Spatial-Downsample → 512×16×16
16×16:    2× Conv → 512×16×16; D-Spatial-Downsample → 512×8×8
8×8:      2× Conv → 512×8×8; D-Spatial-Downsample → 512×4×4
4×4:      MinibatchStd → 513×4×4; Conv → 512×4×4; D-Output → 512×1×1×1
Score:    FullyConnected + Linear → 1

Table A.3: Discriminator network (shown for input sequences of six frames at a resolution of 128×128 pixels with c color channels).
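The MinibatchStd layer in Table A.3, adopted from Karras et al. (2018), explains the jump from 512 to 513 channels. A hedged numpy sketch (illustrative function name, not the model code):

```python
import numpy as np

# The standard deviation of every feature is computed across the batch,
# averaged into a single scalar, and appended as one extra constant feature
# map. This lets the discriminator detect a generator that collapses to
# low sample diversity.

def minibatch_std(x):
    """x: (batch, channels, time, H, W) -> same with one extra channel."""
    std = np.std(x, axis=0)                  # per-feature std over the batch
    extra = np.full(x.shape[:1] + (1,) + x.shape[2:], std.mean())
    return np.concatenate([x, extra], axis=1)

x = np.random.randn(4, 512, 1, 4, 4)
# minibatch_std(x) has 513 channels, matching the table
```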

Appendix B Results for the MsPacman Dataset

To show that our FutureGAN is able to scale to a more difficult toy dataset, we trained our network to predict the future frames of sequences from the MsPacman dataset. We used the MsPacman sequences, split into a training and a testing set, as provided by Cooper [2016]. The dataset contains 517 sequences for training and 45 sequences for testing, both of varying sequence length. For predicting, again, six future frames based on six past frames, our training set consisted of 38364 frame sequences in total. To match our network resolution of 128×128 pixels, we used the same preprocessing as for the KITTI Tracking dataset: we downsampled the frames from their original resolution using nearest-neighbor interpolation, without further cropping. Figure B.1 shows exemplary qualitative results for the prediction of six future frames based on an input of six past frames.

Figure B.1: Qualitative results for the MsPacman test set. (a) Ground truth, (b) FutureGAN.
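The preprocessing described above can be sketched in a few lines of numpy (illustrative helper name and example input size; the actual pipeline may use an image library for the resizing):

```python
import numpy as np

# Nearest-neighbor downsampling of a single frame to the assumed network
# resolution of 128x128, without cropping: each output pixel simply copies
# the nearest source pixel via integer index mapping.

def nearest_neighbor_resize(frame, out_h=128, out_w=128):
    """frame: (H, W, C) array -> (out_h, out_w, C)."""
    h, w = frame.shape[:2]
    rows = np.arange(out_h) * h // out_h     # source row per output row
    cols = np.arange(out_w) * w // out_w     # source column per output column
    return frame[rows][:, cols]

frame = np.zeros((300, 400, 3), dtype=np.uint8)  # arbitrary example size
small = nearest_neighbor_resize(frame)
# small has shape (128, 128, 3), regardless of the input resolution
```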