Learning to Forecast Videos of Human Activity with Multi-granularity Models and Adaptive Rendering
We propose an approach for forecasting video of complex human activity involving multiple people. Direct pixel-level prediction is too simple to handle the appearance variability in complex activities. Hence, we develop novel intermediate representations. An architecture combining a hierarchical temporal model for predicting human poses and encoder-decoder convolutional neural networks for rendering target appearances is proposed. Our hierarchical model captures interactions among people by adopting a dynamic group-based interaction mechanism. Next, our appearance rendering network encodes the targets’ appearances by learning adaptive appearance filters using a fully convolutional network. Finally, these filters are placed in encoder-decoder neural networks to complete the rendering. We demonstrate that our model can generate videos that are superior to state-of-the-art methods, and can handle complex human activity scenarios in video forecasting.
1 Introduction
We may not be able to play soccer like Lionel Messi, but perhaps we can train deep networks to hallucinate imagery suggesting that we can. Consider the images in Fig. 1. In this paper we describe research toward synthesizing realistic sequences that forecast the appearance of people performing complex actions. The model can predict the future poses of people, and use sample appearance images to generate novel views of people that can capture fine details such as imaginary numbers that appear on the backs of people’s clothing.
Future prediction is a fundamental and important problem in many domains. Determining what will happen next can enable myriad applications. Recent attempts to model this predictive process exist across a variety of research fields. Within robotics, work has explored predicting the consequences of interactions between an agent and its environment [7]. In natural language processing, approaches [20, 24] have been proposed to tackle tasks such as text-to-image or image-to-text synthesis. Accurate generative models of video sequences are a core part of visual understanding and have received renewed attention from the vision community [32, 30].
In this paper, we focus on learning how to forecast videos of human actions in complex scenarios. Sports videos are an ideal setting for this study: they involve multiple targets and are rich in interactions, motion blur, and appearance variation. Understanding the patterns present in sports videos and extracting cues for predicting subsequent frames are of key importance here. Moreover, developing generative models that can realize the substantial variability in image content that arises from human body articulation and appearance variation is a challenge.
We address these challenges by developing a video forecasting approach with two main novel components. Human body pose is a natural intermediate representation for this forecasting, and is hence utilized in many previous methods for synthesizing human motion and video [2, 6, 32]. We follow this paradigm, predicting body poses and using them to generate video sequences of future human motion.
First, since we address complex video forecasting, we develop a novel hierarchical recurrent neural network structure that can model multiple people as well as their interactions. This structure captures levels of detail ranging from group-level dynamics down to predictions on individual human body joints. The first layer of our model performs group inference and predicts future poses by leveraging an interaction context. We devise a dynamic group-based interaction mechanism in which people dynamically change groups according to the likelihood of interacting with the people in each group, where the likelihood is estimated using both pose and location information. The second layer is a structured spatio-temporal LSTM [17] that predicts finer adjustments to the first-layer results, refining the prediction for each human joint.
After pose prediction, the core task is to generate realistic images of a particular person striking this pose. Simple networks  may generate blurry and distorted images. Stylistic methods  have shown great success in generating realistic images, but lack control over the appearance of the generated images. Our task requires the model to be able to generate images of a person with a specific appearance. Inspired by , we propose a novel appearance rendering network which encodes appearance into convolutional filters. These filters are operationalized using a fully convolutional network, and utilized in an image-to-image translation structure that transfers the desired appearance to the generated image.
To sum up, we contribute a new state of the art generative model that (1) focuses on forecasting videos of complex human activities involving multiple people; (2) dynamically infers group memberships; and (3) performs adaptive appearance rendering to create accurate depictions of human figures in these forecasted poses.
2 Related Work
Video forecasting: Data-driven video prediction has seen a renaissance in recent years. One major branch of methods uses RNN-based models such as encoder-decoder LSTMs for direct pixel-level video prediction [23, 29, 22, 21]. Another type of approach  models future frames in a probabilistic manner. These methods successfully synthesized low-resolution videos with relatively simple semantics, such as moving MNIST digits or human action videos with very regular, smooth motion.
Subsequent work has attempted to improve the quality of predicted video in terms of resolution and diversity in human activity. Earlier efforts focused on prediction at optical-flow timescales; later work moved on to more complex motions (e.g., [31, 19]).
Predicting video frames directly in low-level pixel space is difficult, and these types of approaches tend to generate blurry or distorted future frames. To tackle this problem, hierarchical models [32, 30] adopt intermediate representations. These models generate future frames in two stages: first, future poses are generated; then, binary pose images are transformed into realistic frames. This type of approach can alleviate image blur; however, the quality of generation largely depends on the image generation network. Simple generation networks can still produce blurry images as shown in .
Further difficulties arise in generating accurate human poses. Previous generative approaches use simplistic temporal pose models. Within the field of 3D action recognition, human pose sequences are subjected to spatio-temporal analysis [17, 18]. Specifically, structure-based spatio-temporal LSTMs are effective for robust processing of human body joint position data [17].
Modeling human interactions: In this paper, we propose an architecture for predicting the future of multi-person video. We introduce a novel human-human interaction mechanism as well as a flexible image-to-image translation model. Previous work on human-human interactions includes the SocialLSTM [1], a generic data-driven approach for modeling interaction among pedestrians. Jain et al. [13] proposed a rich RNN mixture structured as a spatio-temporal graph for modeling object-object interactions across time. Deng et al. [5] proposed a structure learning model where pair-wise interactions are learned and relations among persons are determined by imposing gates.
Generative image models: Image-to-image translation has achieved great success since the emergence of GANs [9]. Recent work produces promising results using GAN-based models [12, 34]. Stylized images can be generated by feed-forward networks with the help of a perceptual loss [15]. The recent work of [4] proposes a structure to disentangle style and content for style transfer, in which styles are encoded using a stylebank (a set of convolution filters). Visual analogy making [25, 27] generates or searches for a new image analogous to an input one, based on previously given example pairs.
In summary, our approach builds on the substantial body of related work in pose analysis, group interaction, and style/analogy-based image generation. We contribute a hierarchical method for pose prediction from the person-interaction down to the body joint level, combined with a novel adaptive appearance rendering model for image generation.
3 Forecasting Complex Human Activity
We propose a method for generating videos of complex human activities. An overview of the method is shown in Fig. 2. The input to our method is a video sequence of multiple people. Human poses are obtained using state-of-the-art techniques. From there, we first forecast poses with our multi-granularity model (Sec. 3.1). The predicted poses are rendered into images with our adaptive rendering technique (Sec. 3.2). This image synthesis technique is general, and can be utilized in other paradigms (e.g. inserting novel people, appearance adaptation), which are explored in our experiments in Sec. 4.
3.1 Multi-granularity Pose Prediction
We propose a multi-granularity model for predicting the future poses of multiple targets. This is a hierarchical model that reasons over groups of people and uses this reasoning to predict future poses. The predictive process is illustrated in Fig. 2(b). The first layer is equipped with a group-based dynamic interaction mechanism for modeling inter-person interactions. The second, intra-person layer is a refiner spatio-temporal LSTM that refines the generation from the first layer.
Group Interaction Mechanism
For complex human activity, analyzing relations among people can be beneficial. Not all people in a scene interact with each other, hence a mechanism for automatically inferring relations is important. As shown in Fig. 3, which is produced by our group-based interaction mechanism, our model learns to assign all people to groups. The model learns to group together people who interact strongly, so that information aggregated over each group can help better predict future poses for its members.
Given the people in a scene, we define a set of groups that represent potential interactions, and we track each group and its size at every time step. For each person, the temporal pose sequence is first processed by a person-level LSTM to obtain a representation of that person at each time step.
Group membership is initialized at the initial time step by arbitrarily placing two people in one group, with the remaining people spread into solo groups. Then, at each time step, every person decides their group affiliation for the next time step by choosing to join (or stay in) the group whose members have the strongest interaction with them. To this end, an interaction score is computed for each pair of people at each time step.
The interaction score is a scalar that measures the degree of interaction between two people. It is obtained by applying a state-to-score transformation with learned weights and biases, followed by the sigmoid function. Given the interaction scores, each person then decides their group for the next time step by taking an argmax over the candidate groups.
Note that since argmax is not differentiable, we use a softmax with low temperature to approximate it (cf. Gumbel-Softmax [14]). In short, if the output of the argmax is represented as a one-hot vector, then the softmax output approaches this vector as the temperature goes to zero. When measuring the score for a person's current group, an additional term is included that serves as a smoothing term, increasing the probability that the person keeps their group unchanged.
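The low-temperature approximation can be written out explicitly. As a sketch under assumed notation (the score vector $z$ over candidate groups and the temperature $\tau$ are illustrative symbols, not necessarily those of the model), the relaxed selection and its limit are:

```latex
\mathrm{softmax}_\tau(z)_k = \frac{\exp(z_k/\tau)}{\sum_{k'} \exp(z_{k'}/\tau)},
\qquad
\lim_{\tau \to 0^{+}} \mathrm{softmax}_\tau(z) = \mathrm{onehot}\Big(\arg\max_k z_k\Big)
```

The limit holds whenever the maximum of $z$ is unique; for a small but positive $\tau$ the output is a nearly one-hot vector that remains differentiable with respect to $z$.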
Groups also maintain their information via group-level LSTMs, as shown by the group nodes in Fig. 2(b). Once all group memberships are determined for a time step, each group updates its state with one step of its LSTM.
Here the group-level LSTM cell takes the group's previous state together with the states of its current members, which learned projection weights map into the space of group states. The resulting group state then serves as the interaction context for every member of the group at that time step.
In summary, we use the input pose sequences to predict which groups people in the scene belong to. Each person has affinity for people with related pose sequences. Each group has a feature representation based on the people who have been in the group over time. We use these group feature representations as interaction context for our pose prediction tasks.
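The group selection step can be illustrated with a small NumPy sketch. It is not the model's actual implementation: the function name `soft_group_assignment`, the averaging of pairwise scores over group members, the additive `stay_bonus` standing in for the smoothing term, and the default values are all assumptions made for illustration.

```python
import numpy as np

def soft_group_assignment(pair_scores, membership, stay_bonus=0.1, temperature=0.05):
    """Soft (differentiable-style) group choice for every person.

    pair_scores: (N, N) interaction scores in [0, 1] between people.
    membership:  (N, G) one-hot matrix of current group assignments.
    Returns an (N, G) matrix whose rows are near one-hot group choices.
    """
    scores = np.array(pair_scores, dtype=float)
    np.fill_diagonal(scores, 0.0)                     # ignore self-interaction
    membership = np.asarray(membership, dtype=float)

    # Assumption: a group's score for person i is the mean pairwise score
    # between i and the other members of that group (0 for empty groups).
    summed = scores @ membership                       # (N, G)
    others = membership.sum(axis=0)[None, :] - membership
    group_scores = np.where(others > 0, summed / np.maximum(others, 1.0), 0.0)

    # Assumed smoothing term: a bonus for staying in the current group.
    group_scores = group_scores + stay_bonus * membership

    # Low-temperature softmax as a stand-in for argmax.
    logits = group_scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

pair_scores = np.array([[0.0, 0.2, 0.9],
                        [0.2, 0.0, 0.1],
                        [0.9, 0.1, 0.0]])
membership = np.array([[1, 0, 0],
                       [1, 0, 0],
                       [0, 1, 0]])
probs = soft_group_assignment(pair_scores, membership)
```

In this toy scene, person 0 interacts much more strongly with person 2 than with its current group mate, so its row of `probs` concentrates on person 2's group, while person 1 stays where it is.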
Hierarchical Pose Prediction
We generate future poses using the predicted group memberships and encodings of observed pose sequences. The generation is a hierarchical process in a recurrent neural network framework: given an observed input sequence of poses, the model outputs predicted poses for the subsequent time steps.
For each person, the recurrent network takes as input the encoding of the observed poses and the state of the person's group, and generates a new encoding by forwarding one step. This new encoding is used to generate the next pose. The group states are then updated using all new encodings. The process is repeated until all future poses have been generated.
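The generation loop described above can be sketched as follows. This is a simplified illustration, not the model itself: `rollout_poses`, `step_fn` and `decode_fn` are hypothetical names, the group state is replaced by the mean encoding of the group's members rather than a group-level LSTM, group memberships are held fixed, and the toy weights are random.

```python
import numpy as np

def rollout_poses(states, membership, step_fn, decode_fn, n_future):
    """Autoregressively generate n_future poses for N people.

    states:     (N, D) person encodings after observing the input poses.
    membership: (N, G) one-hot group assignment (held fixed here for brevity).
    step_fn:    maps (states, context) -> new states, i.e. one recurrent step.
    decode_fn:  maps states -> poses of shape (N, P).
    """
    poses = []
    for _ in range(n_future):
        # Stand-in for the group state: mean encoding of each group's members.
        sizes = np.maximum(membership.sum(axis=0), 1.0)           # (G,)
        group_states = (membership.T @ states) / sizes[:, None]   # (G, D)
        context = membership @ group_states                       # (N, D)
        states = step_fn(states, context)                         # forward one step
        poses.append(decode_fn(states))                           # next pose
    return np.stack(poses, axis=1)                                # (N, n_future, P)

rng = np.random.default_rng(0)
N, G, D, P = 4, 2, 8, 6
W, U, V = rng.normal(size=(D, D)), rng.normal(size=(D, D)), rng.normal(size=(D, P))
membership = np.eye(G)[[0, 0, 1, 1]]          # two groups of two people
future = rollout_poses(
    rng.normal(size=(N, D)),
    membership,
    step_fn=lambda s, c: np.tanh(s @ W + c @ U),
    decode_fn=lambda s: s @ V,
    n_future=5,
)
```

The key point is the ordering inside the loop: the interaction context is refreshed from the latest encodings before each step, so every predicted pose is conditioned on the group state implied by the previous prediction.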
To allow for finer adjustment of each pose joint, the second layer of our hierarchical model, the refiner LSTM, takes spatial relations among joints into consideration using a spatio-temporal LSTM [17]. With the first-layer prediction as an extra input, the fine-granularity LSTM produces refinement vectors for joints based on (1) the states of the current joint at the previous time step and (2) the states of the previous joint at the current time step. The spatial order is defined based on the kinematic tree. This produces the final generated sequence of poses.
Two-stage Training and Loss Function
The multi-granularity LSTM is trained with a two-stage scheme. In the first stage, only the person-level LSTM with the interaction mechanism is trained, so that it produces a reliable first-phase output. The loss in this stage is the pose MSE loss of the first-layer model.
After the first stage, we train the whole model jointly with a loss that includes the pose MSE loss of the model's final output.
3.2 Adaptive Rendering Network
After getting the pose predictions for each target from the first part of the architecture, the next step of our model is to synthesize for each target a realistic image of the target in the predicted pose. We represent the pose of every person using a posemap image in which white body joint points are drawn on a black background canvas. To accomplish this goal, we propose an adaptive rendering structure where the appearance filters are adaptively computed from an input reference image using a fully convolutional neural network (FCN). By incorporating this FCN into an encoder-decoder network a realistic image of a target consistent with the desired action and appearance can be generated.
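The posemap representation described above is simple to construct. The following is a minimal sketch; the function name `draw_posemap`, the dot radius, and the 64 x 64 canvas are illustrative assumptions rather than values taken from the experiments.

```python
import numpy as np

def draw_posemap(joints, height, width, radius=2):
    """Draw white joint dots on a black canvas.

    joints: iterable of (x, y) pixel coordinates.
    Returns a uint8 image of shape (height, width) with values 0 or 255.
    """
    canvas = np.zeros((height, width), dtype=np.uint8)
    ys, xs = np.mgrid[0:height, 0:width]
    for x, y in joints:
        # Fill a small disc around each joint location.
        mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
        canvas[mask] = 255
    return canvas

posemap = draw_posemap([(10, 20), (30, 5)], height=64, width=64)
```

Because the posemap carries no appearance information at all, everything about clothing and identity has to come from the reference image, which is what motivates the adaptive rendering branch.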
Fig. 2(c) shows our adaptive rendering network (Ada-R Network) architecture, which consists of two branches: an encoder-decoder branch and an adaptive rendering branch. The network requires two input images: a posemap image, and a reference image which provides the appearance of the same person in a previous frame. The goal of the network is to generate a realistic image of a person whose pose is consistent with the posemap and whose appearance is consistent with the reference image.
Encoder-Decoder: Instead of training an encoder-decoder network which can reconstruct input images, our encoder-decoder branch shown in Fig. 2 is a sketch image model.
We use the same input size and encoder-decoder structure as in [12]: both generator and discriminator use modules of the form convolution-BatchNorm-ReLU [11]. The encoder consists of 8 convolutional layers with stride 2, and symmetrically the decoder consists of fractionally strided convolutional layers. We also explore a more compact encoder-decoder network by reducing the number of convolutional and fractionally strided convolutional layers in our encoder and decoder to 5.
Adaptive Rendering: The encoder-decoder network takes binary posemap images as inputs which do not contain any information about the uniform or clothes of the person. Hence, we propose to use another network to learn appearance information. By combining these two networks together we are able to generate realistic images of a person wearing the desired clothing. Here we introduce our Ada-R network.
To transfer the desired appearance to the encoder-decoder branch, we replace the last convolutional filter in the encoder-decoder branch with our adaptive appearance transfer filter. The adaptive appearance filter, which encodes the appearance information of a person, is derived from an input appearance reference image using a fully convolutional network (FCN).
Note that rendering one person's posemap sequence requires only one reference image, which can simply be the first input frame for that person. The realistic motion sequence is obtained by performing adaptive appearance rendering frame by frame. In the rendering procedure, the encoder network first maps the posemap image to a feature map. The adaptive appearance filter is then applied to this feature map by convolution, yielding an appearance-adapted feature map. Finally, the decoder network produces the image of the person with the desired appearance from the adapted feature map.
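The central operation, convolving the encoder feature map with filters that were computed from the reference image, can be sketched in NumPy. This is only an illustration of the filtering step: `apply_adaptive_filters` is a hypothetical name, the filters are passed in as a plain array instead of being produced by an FCN, and the encoder and decoder are omitted.

```python
import numpy as np

def apply_adaptive_filters(feature_map, filters):
    """Apply per-sample filters to an encoder feature map.

    feature_map: (C, H, W) features produced from the posemap.
    filters:     (K, C, kh, kw) adaptive filters with odd kh and kw
                 (in the full model these would come from the reference image).
    Returns (K, H, W) features, using zero padding to keep the spatial size.
    As in most deep learning libraries, this is a cross-correlation.
    """
    c, h, w = feature_map.shape
    k, c_f, kh, kw = filters.shape
    assert c == c_f, "filter depth must match the number of feature channels"
    ph, pw = kh // 2, kw // 2
    padded = np.pad(feature_map, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((k, h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[:, i:i + kh, j:j + kw]                  # (C, kh, kw)
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4, 5))
identity = np.zeros((2, 2, 3, 3))
identity[0, 0, 1, 1] = 1.0     # output channel 0 copies input channel 0
identity[1, 1, 1, 1] = 1.0     # output channel 1 copies input channel 1
adapted = apply_adaptive_filters(features, identity)
```

Unlike an ordinary convolutional layer, whose weights are fixed after training, the filters here change with every reference image, which is how a single trained network can render different people.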
We propose three FCN architectures, all of which share the same encoder-decoder structure. The first FCN has 5 convolutional layers, and the second and third each have 3 convolutional layers; each outputs its own set of adaptive filters (see Tab. 2).
Our network is trained in an adversarial setting, where the Ada-R network is the generator and a discriminator is introduced to discriminate between real and generated images. The loss of the Ada-R network is defined between the target image that we try to produce and the image that the Ada-R network generates, and it includes an appearance transfer loss.
The appearance transfer loss is composed of a pixel-level MSE loss between the generated image and the target image, together with content and style losses defined as in Gatys et al. [8].
The content and style losses are computed on feature maps taken from layers of a pretrained VGG-19 network [28]; one set of layers is used for the content loss and another for the style loss. The style loss is based on the Gram matrix of the feature maps, which captures correlations of the color distribution of the two input images.
The final objective combines these losses in the adversarial setting.
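For reference, the content and style losses in the style of Gatys et al. can be written as follows. The notation is assumed for illustration: $\hat{x}$ is the generated image, $x$ the target image, $F^{l}(\cdot) \in \mathbb{R}^{N_l \times M_l}$ the feature map of VGG-19 layer $l$ with $N_l$ channels and $M_l$ spatial positions, $\mathcal{S}$ the set of style layers, and $w_l$ per-layer weights.

```latex
\mathcal{L}_{\text{content}} = \frac{1}{2} \sum_{i,j} \big( F^{l}_{ij}(\hat{x}) - F^{l}_{ij}(x) \big)^{2},
\qquad
G^{l}_{ij}(x) = \sum_{k} F^{l}_{ik}(x)\, F^{l}_{jk}(x),
\qquad
\mathcal{L}_{\text{style}} = \sum_{l \in \mathcal{S}} \frac{w_l}{4 N_l^{2} M_l^{2}} \sum_{i,j} \big( G^{l}_{ij}(\hat{x}) - G^{l}_{ij}(x) \big)^{2}
```

The Gram matrix discards spatial layout and keeps only channel co-activation statistics, which is why it is suited to comparing appearance independently of pose.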
4 Experiments
We demonstrate our model on the Volleyball dataset [10]. We run person detection [26] and tracking [16] to get tracklets of each player in each clip. The OpenPose detector [3] is then used to obtain the corresponding pose sequence for each tracklet. We follow the data split of the original dataset, and preprocessing filters out instances with fewer than 10 joints and clips containing fewer than 10 targets. This yields 1262 clips for training and 790 clips for testing. Images of players are cropped and then resized to a fixed resolution. Our model is trained to observe players in 6 input frames and predict their future for the next 5 frames.
Training Details: For the multi-granularity LSTM, the state sizes of the person-, group-, and joint-level LSTMs are 256, 256, and 128, respectively. Pose data are normalized to lie between 0 and 1. We train the model with an initial learning rate of 1e-5. The hyperparameter in Eq. 5 is set to 0.1. To prevent gradient explosion in the low-temperature softmax, we use the training strategy suggested by Jang et al. [14] and also clip gradients. To train our Ada-R network, we compute the content loss at layer relu4-2 and the style loss at layers relu1-2, relu2-2, relu3-2, relu4-2 and relu5-2 of the pre-trained VGG-19 network. We set the learning rate to 1e-3 and weight the loss terms so that the content and style losses are on a similar scale. To make training stable, in each iteration the generator is updated twice and the discriminator once.
4.1 Results of Pose Prediction
We compare our multi-granularity LSTM with two baseline models: (1) a vanilla LSTM without interaction among targets; (2) a model adapted from SocialLSTM [1] by replacing the trajectory prediction in the original work with pose and location prediction and using social pooling as the group interaction mechanism. We also compare variants of our model: (1) MG w/o refine: our multi-granularity model without the refinement layer; (2) MG: our whole multi-granularity model.
We evaluate the performance of future pose generation by measuring the distance between the prediction and the exact pose estimation. MSE is a standard metric for this, but it is sensitive to localization error: a prediction will have a high MSE even if every joint is off by only a small number of pixels, and in such cases MSE provides limited intuition as to the quality of generation. We therefore define a score that measures whether a joint is correctly predicted within some tolerable range of the exact pose estimation, using a piecewise function of the distance (measured by a norm) between the predicted and estimated joint positions.
The thresholds of the piecewise function should be determined according to the size of the posemap, so that a high-scoring prediction is reasonably close to the desired target. In our experiments, a joint prediction with a 5-pixel error receives the full score.
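To make the idea of a tolerance-based joint score concrete, here is one possible piecewise form. It is a hypothetical example, not the exact function used in the evaluation: the linear ramp, the Euclidean norm, the name `joint_scores`, and the outer threshold `zero_beyond=10.0` are assumptions; only the 5-pixel full-score tolerance comes from the text.

```python
import numpy as np

def joint_scores(pred, target, full_within=5.0, zero_beyond=10.0):
    """Score each joint in [0, 1] from its pixel distance to the target.

    pred, target: (..., 2) arrays of joint coordinates in pixels.
    Full score up to `full_within` pixels, then a linear decay that
    reaches zero at `zero_beyond` pixels (assumed shape of the decay).
    """
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(target, float), axis=-1)
    ramp = (zero_beyond - dist) / (zero_beyond - full_within)
    return np.clip(ramp, 0.0, 1.0)

pred = np.zeros((3, 2))
target = np.array([[3.0, 4.0],     # distance 5   -> full score
                   [7.5, 0.0],     # distance 7.5 -> halfway down the ramp
                   [0.0, 12.0]])   # distance 12  -> no score
scores = joint_scores(pred, target)
```

Compared with MSE, a score of this kind saturates for small errors, so a pose whose joints are all within tolerance is counted as fully correct regardless of sub-threshold jitter.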
Quantitative measures of our multi-granularity model and the comparisons with baselines are summarized in Tab. 1. The result shows that our multi-granularity LSTM outperforms baselines on predicting future pose. Our one-layer multi-granularity model can generate poses closer to the exact future pose estimation than the model adapted from SocialLSTM, implying our dynamic group-based interaction mechanism is more effective than modeling interactions of nearby people. The refiner layer is able to further improve the prediction result. The comparison with vanilla LSTM shows that considering interactions among targets helps produce better future poses in multi-person scenes.
4.2 Results of Adaptive Rendering
We evaluate the generation using two quantitative measures and show qualitative results. We compare our approach against a visual analogy making (VAM) baseline [25] which, like our main model, uses 8 convolutional layers in the encoder and 8 fractionally strided convolutional layers in the decoder, and is trained using an adversarial loss and an MSE loss. We also compare different architectures of our model (details shown in Tab. 2): (1) the 8-5-10 model; (2) the 8-3-56 model; (3) the 8-3-10 model; (4) the 5-5-10 model. To compare different architectures of our Ada-R network, we use posemap images generated from pose estimation results as inputs. To compare different models for pose prediction, we use posemap images built from the poses predicted by the models in Sec. 4.1 as inputs and use our 8-5-10 model to generate images. Reference images are obtained by cropping players given the detection results and resizing the crops. Our appearance rendering network generates images of the same size.
We adopt two evaluation metrics: (1) action classification over sequences; (2) MSE and PSNR over sequences. An action classifier is trained on real video sequences over the 9 action classes in the training set and tested on sequences generated by different models for the test set. Since the actions in this dataset are highly unbalanced, we report action classification accuracy on the overall dataset, as well as accuracy excluding the majority action, standing. Quantitative measures are shown in Tab. 3 and Tab. 4. Visualizations are provided in Fig. 5 and Fig. 6.
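For reference, MSE and PSNR between a generated frame and its target can be computed as in the sketch below. The helper name `mse_and_psnr` and the 8-bit peak value of 255 are assumptions; for a sequence, the per-frame values would be averaged.

```python
import numpy as np

def mse_and_psnr(generated, target, max_value=255.0):
    """Return (MSE, PSNR in dB) between two images of the same shape."""
    generated = np.asarray(generated, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = float(np.mean((generated - target) ** 2))
    # PSNR = 10 * log10(MAX^2 / MSE); identical images give infinite PSNR.
    psnr = float("inf") if mse == 0 else float(10.0 * np.log10(max_value ** 2 / mse))
    return mse, psnr

target = np.zeros((4, 4))
generated = np.full((4, 4), 25.5)   # constant error of 10% of the peak value
mse, psnr = mse_and_psnr(generated, target)
```

Since PSNR is a monotone function of MSE for a fixed peak value, the two metrics always rank methods identically on a single frame; they can differ only after averaging over frames.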
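Table 2: Architectures of the Ada-R network variants.
| | 8-5-10 | 8-3-56 | 8-3-10 | 5-5-10 |
|---|---|---|---|---|
| layer # in E-D | 8 | 8 | 8 | 5 |
| conv layer # in FCN | 5 | 3 | 3 | 5 |
| ada-filters # learnt | 10 | 56 | 10 | 10 |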
The experimental results in Tab. 3 suggest that our Ada-R model can generate realistic sequences with more obvious motions while visual analogy making cannot capture the finer changes in poses and generate sequences with stable motions. Our multi-granularity LSTM can better forecast the future poses of players than the two baselines, vanilla LSTM and SocialLSTM. Tab. 4 suggests that the images generated by our Ada-R model are of better quality and more similar to the generation target than those of visual analogy making. The decrease in MSE and the increase in PSNR relative to vanilla LSTM and SocialLSTM suggest that our model forecasts future poses better, which in turn benefits the adaptive rendering. Both tables suggest that the 8-5-10 model produces images with better quality and more obvious motion, and achieves the best performance.
Fig. 5 shows that most of our Ada-R architectures can generate more realistic images with both action and appearance consistent with the target images, while visual analogy making can generate images with correct appearance but distorted pose, implying explicitly encoding appearance information with filters learned from extra reference images can better disentangle the appearance and pose representations. Fig. 6 shows how our 8-5-10 model generates images given different pose prediction results. It is clear that our proposed model can better forecast future pose sequences with obvious motion more similar to the generation targets.
4.3 Hallucinating People in a Volleyball Game
We hallucinate people in a volleyball game, as shown in Fig. 4. Realistic images of two people are generated by fine-tuning models trained on the Volleyball dataset for extra iterations on videos of those two people. A background image of a volleyball court is obtained by inpainting the players in a raw frame. We then segment the people out of the generated images and copy them onto the background image. The two real images in the top left and right corners are the reference images used for adaptive rendering.
5 Conclusion
We proposed a novel approach for forecasting complex human activity videos. The proposed approach first forecasts future poses using a hierarchical temporal model and then generates realistic images corresponding to the pose by adaptively rendering the appearance from a reference image. Both quantitative and qualitative results show that our model is superior to state-of-the-art approaches and can generate better predictions involving complex human activities. The success of our model demonstrates that the proposed dynamic group-based interaction mechanism can benefit analysis of complex human activity in videos and provide high quality intermediate representations for later image-to-image translation. The proposed adaptive rendering network can render the desired target appearance while adapting to the predicted pose.
References
[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, F. F. Li, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[2] M. Brand and A. Hertzmann. Style machines. In Conference on Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH), 2000.
[3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. Stylebank: An explicit representation for neural image style transfer. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Z. Deng, A. Vahdat, H. Hu, and G. Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In International Conference on Computer Vision (ICCV), 2003.
[7] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems (NIPS), 2016.
[8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[10] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori. A hierarchical deep temporal model for group activity recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
[15] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), 2016.
[16] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research (JMLR), 10:1755–1758, 2009.
[17] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision (ECCV), 2016.
[18] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot. Global context-aware attention lstm networks for 3d action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In International Conference on Computer Vision (ICCV), 2017.
[20] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. In International Conference on Learning Representations (ICLR), 2016.
[21] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR), 2016.
[22] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems (NIPS), 2015.
[23] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[24] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), 2016.
[25] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems (NIPS), 2015.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[27] F. Sadeghi, C. L. Zitnick, and A. Farhadi. Visalogy: Answering visual analogy questions. In Advances in Neural Information Processing Systems (NIPS), 2015.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[29] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning (ICML), 2015.
[30] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In International Conference on Machine Learning (ICML), 2017.
[31] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision (ECCV), 2016.
[32] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In International Conference on Computer Vision (ICCV), 2017.
[33] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[34] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV), 2017.