Recurrent Flow-Guided Semantic Forecasting
Understanding the world around us and making decisions about the future is a critical component of human intelligence. As autonomous systems continue to develop, their ability to reason about the future will be key to their success. Semantic anticipation is a relatively under-explored area from which autonomous vehicles could benefit (e.g., forecasting pedestrian trajectories). Motivated by the need for real-time prediction in autonomous systems, we propose to decompose the challenging semantic forecasting task into two subtasks: current frame segmentation and future optical flow prediction. Through this decomposition, we built an efficient, effective, low-overhead model with three main components: a flow prediction network, a feature-flow aggregation LSTM, and an end-to-end learnable warp layer. Our proposed method achieves state-of-the-art accuracy on short-term and moving-object semantic forecasting while simultaneously reducing model parameters by up to and increasing efficiency by greater than .
1 Introduction
Reasoning about the future is a crucial element of deploying real-world computer vision systems. Specifically, garnering semantics about the future of an urban scene is essential for the success of autonomous driving. Future prediction tasks are often framed through video prediction (i.e., given past RGB frames of a video, infer future frames). There is extensive prior work in directly modeling future RGB frames [41, 47, 39, 37, 11, 3, 14, 52, 56, 36, 32, 26, 12]. Although RGB reconstruction may provide insights into unsupervised and representation learning, the task itself is exceedingly complex. On the other hand, real-time autonomous systems could make more intelligent decisions through semantic anticipation (e.g., forecasting pedestrian  or vehicle  trajectories). Recent work has shown that decomposing complex systems into semantically meaningful components can decrease the probability of failure and better enable long-term planning [45, 44].
Scene understanding through semantic segmentation is one of the most successful tasks in deep learning [28, 27]. Most approaches to semantic segmentation focus on reasoning about the current frame [2, 6, 57, 31, 58] through convolutional neural networks (CNNs). Recently, however, Luc et al. [34] proposed a new task: prediction of future segmentation frames. Luc et al. [34] note that by moving away from raw RGB predictions and modeling pixel-level semantic labels instead, there is a greater ability to model scene and object dynamics. An auto-regressive, multi-scale CNN was proposed by [34], which provided an initial baseline for the semantic forecasting task on the Cityscapes dataset [10]. Jin et al. [24] developed a multi-stage CNN for jointly predicting segmentation and optical flow, which demonstrated a new state of the art.
There are three limitations of prior approaches to semantic forecasting. First, both [34] and [24] force a CNN through many convolutional and non-linear activation layers to both learn and apply a warping operation that transforms past segmentation features to the future. Jaderberg et al. [21] showed that despite the power of CNNs, they are limited by their inherent inability to spatially transform data. This inability extends to warping (i.e., mapping one pixel to another location in an image), as visualized in Fig. 1. Second, Luc et al. [34] and Jin et al. [24] do not directly account for the inherent temporal dependency of video frames on one another. Rather, both approaches concatenate four past frames together as input to a CNN. Finally, [34] and [24] were not designed with low overhead and efficiency in mind, both of which are essential for real-time autonomous systems.
Our approach addresses each of these limitations with a simple, compact network. By decomposing motion and semantics, we reduce the challenging semantic forecasting task to two subtasks: current frame segmentation and future optical flow prediction. With this simplification, we can remove the CNN that takes as input the concatenated segmentation features of four past frames. This dramatically reduces the number of parameters of our model and redistributes the bulk of the work to a lightweight flow network. Through a learnable warp layer [59, 60] applied directly to the convolutional features, we address the first limitation by encoding the warp into our structure rather than forcing the CNN to unnaturally learn it. Additionally, we can more intuitively utilize the temporal coherency of consecutive video frames through a convolutional long short-term memory (LSTM) module [19, 46], which aggregates optical flow features over time, thus addressing the second limitation. With a recurrent network, we limit the amount of redundant processing of frames, while the warp layer allows us to utilize only a single frame's segmentation features. These two reductions combine to form an efficient, low-overhead model, addressing the final limitation.
Our work can be summarized in three main contributions, novel with respect to the semantic forecasting task:
- A learnable warp layer directly applied to segmentation features,
- A convolutional LSTM to aggregate optical flow features and estimate future optical flow,
- A lightweight, modular baseline that greatly reduces the number of model parameters, improves overall efficiency, and achieves state-of-the-art performance on short-term and moving objects semantic forecasting.
2 Related Work
Our work is largely driven by the decomposition of motion and semantics. Our underlying assumption is that the semantic forecasting task can be fully solved with a perfect current frame segmentation and a perfect optical flow that warps the current to the future frame. Our overarching motivation for this decomposition is also found in , although Villegas et al. focus on the future RGB prediction task. Further, [16, 23, 9] decompose motion and semantics, but focus on improving current frame segmentation, rather than on future frames.
We review prior work with respect to our three main contributions: learnable warp layer, recurrent motion prediction, and lightweight semantic forecasting.
Learnable warp layer. Jaderberg et al. [21] provide the foundation for giving CNNs the power to learn a transformation that would otherwise be incredibly difficult for convolutional and non-linear activation layers to learn (requiring many layers to increase the field of view). Zhu et al. [59, 60] utilize a spatial warping layer through which the gradient can flow, enabling efficient, end-to-end training. Ilg et al. [20] use a similar warping layer to enable the stacking of multiple FlowNet modules for improving optical flow. Numerous other works also utilize a learnable warp layer [33, 25, 29, 30, 48]. Although  warps the RGB and segmentation features for recurrent fine-tuning, they do not include an end-to-end warp layer in their network structure. In contrast, the learnable warp layer in our model facilitates strong performance using a single prediction, rather than relying on multiple recursive steps. This influence can be observed in our state-of-the-art results on short-term forecasting.
Recurrent motion prediction. Numerous works on motion prediction [54, 43, 55, 1, 51, 53, 5, 35, 50] motivate flow prediction and the use of a recurrent structure for future anticipation. Villegas et al.  encode motion as the subtraction between two past frames and utilize a ConvLSTM to better aggregate temporal features. Similar to our work, Patraucean et al. [40] utilize an LSTM to predict optical flow that is used to warp segmentation features. However, [40] take only a single RGB frame as input to their network, whereas, as seen in Fig. 2, our network takes in a pair of frames. Additionally, we utilize the power of the FlowCNN to encode strong optical flow features as input to the LSTM, whereas [40] has only a single convolutional encoding layer. Finally, [40] utilizes the current RGB frame to produce the segmentation mask, so their structure is not suited to semantic forecasting. As such, our method, to the best of our knowledge, is the first to aggregate optical flow features using a recurrent network and predict future optical flow for semantic forecasting.
Semantic forecasting. Fig. 2 compares the network structures of the two main baselines for semantic forecasting on the Cityscapes dataset [10]. Luc et al. [34] formulate a two-stage network which first obtains segmentation features through the Dilation10  current frame network (SegCNN). Next, [34] concatenates the segmentation features of four past frames as input to their SegPredCNN, denoted , to produce the future segmentation frame. For mid-term prediction, [34] uses an auto-regressive approach, passing previous predictions back into the network. The main differences between [34] and our proposed method are the removal of the SegPredCNN and the addition of the FlowCNN. By directly encoding motion as a flow prediction, our method can achieve better overall accuracy on moving objects than [34].
Jin et al. [24] improve upon [34] through the addition of a FlowCNN and joint training on both flow and segmentation. Note that both [34] and [24] utilize a fixed four past frames as input, while our method, through its use of an LSTM, can take in any number of past frames. Additionally, [24] utilize a “transform” layer which serves as a residual connection between their SegPredCNN and FlowCNN. All three of the CNNs in [24] were described as Res101-FCN networks, adding numerous layers to the end of ResNet-101 [18]. Note that the overarching motivation for [24] is quite different from ours. Jin et al. emphasized the benefit of the flow prediction influencing semantic forecasting and vice versa. In contrast, we demonstrate the value of treating these tasks independently. Further, there are three main differentiating factors between [24] and ours: the warp layer, the FlowLSTM, and network size. With the combination of the FlowLSTM and the warp layer, we are able to forgo the SegPredCNN, thus dramatically reducing our network size while maintaining high-fidelity segmentation results.
Semantic forecasting is defined as the task when given input frames denoted , predict the pixel-wise semantic segmentation for some future frame denoted , where is the future step. In this section, we will present models for both short-term () and mid-term () prediction.
3.1 Short-term Prediction
3.1.1 Model Overview
Our approach is motivated by the notion that “perfect” semantic forecasting can be decomposed into two sub-tasks: current frame segmentation and future optical flow prediction. Specifically, we propose
where is the ground truth optical flow from to such that warping using will result in . With much recent attention focused on current frame segmentation [2, 6, 57, 31, 58], our model is designed to be agnostic to the choice of backbone segmentation network. Thus, we focus on estimating future optical flow . Our method for this estimation contains three components: FlowCNN, FlowLSTM, and warp layer. The FlowCNN enables the extraction of high-level optical flow features. Iterating over many pairs of past frames, the FlowLSTM is able to aggregate flow features over time and produce a future optical flow prediction. The warp layer utilizes a learnable warp operation that applies the flow prediction directly to the strong segmentation features produced by the current frame segmentation network. Since the warp layer allows backpropagation, we utilize backpropagation through time (BPTT) for efficient, end-to-end training of all three components simultaneously as one model. In the following subsections, we will describe each component in detail.
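The decomposition described above can be written compactly. The symbols below are assumed for illustration only, since the original notation is not recoverable from the text here: $X_t$ is the current RGB frame, $S_t$ its segmentation features, $\hat{F}_{t \rightarrow t+n}$ the predicted future optical flow, and $\mathcal{W}$ the warp operation.

```latex
% Assumed notation (not the paper's own symbols):
%   S_t  = segmentation features of the current frame X_t
%   \hat{F}_{t \to t+n} = predicted flow from frame t to frame t+n
%   \mathcal{W} = warp layer
S_t = \mathrm{SegCNN}(X_t), \qquad
\hat{S}_{t+n} = \mathcal{W}\big(S_t,\ \hat{F}_{t \rightarrow t+n}\big)
```

Under this reading, the segmentation backbone and the flow predictor only meet inside $\mathcal{W}$, which is what lets either one be swapped independently.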
3.1.2 FlowCNN
Although most state-of-the-art optical flow methods are not deep learning approaches, they run on the CPU rather than the GPU and can often take anywhere between min to an hour for a single frame. Since runtime efficiency is a priority for our work, motivated by applications in autonomous driving, we utilize the FlowNet [15, 20] architectures for optical flow estimation with a CNN. Specifically, we use the FlowNet-c architecture [15] with pre-trained weights from FlowNet2 [20]. We utilize these pre-trained weights because no ground truth optical flow is provided for the Cityscapes dataset [10]. This CNN concatenates a pair of RGB frames as input to a network with ten convolutional layers, each followed by a ReLU [38] non-linear activation function, concluding with a handful of refinement layers. For details of the network architecture, refer to [15, 20]. The FlowCNN can be defined as:
Our reasoning for using FlowNet-c is twofold: it is simple and effective. First, this network has only approximately million parameters, which consumes less than GB of GPU memory at test time and can evaluate in less than second. Second, although Jin et al. [24] train their FlowCNN from scratch using weak optical flow ground truths, our method, using FlowNet-c, demonstrates improved short-term performance with a dramatic reduction in the number of parameters. However, our method is not fixed to this specific FlowCNN architecture. As long as the flow prediction method utilizes a CNN, we can extract the flow features as input for the FlowLSTM, which demonstrates the modularity of our method. Since our method is not fixed to a specific SegCNN or FlowCNN architecture, it can improve as performance on these individual tasks improves over time, demonstrating its potential longevity as a general-purpose baseline for semantic forecasting.
3.1.3 FlowLSTM
One of the main differentiating factors between our proposed method and previous semantic forecasting work [34, 24] is how time is treated. Specifically, neither [34] nor [24] uses a recurrent structure to model the inherent temporal dynamics of the past frames. Rather, [34] and [24] concatenate four past frames at the channel level, which ignores the temporal dependency among frames and encodes spatial redundancy. On the other hand, through the use of a recurrent module, our method can retain a memory as frames are processed over time.
When a pair of input frames is passed through the FlowCNN, the flow features are extracted at a certain level of the refinement layers. We observe similar performance regardless of the refinement layer from which features are extracted (predict_conv6–predict_conv2), with one exception: we find it difficult to aggregate the flow field directly (final prediction layer) relative to aggregating the flow features.
As such, we extract flow features right before the final prediction layer of FlowNet-c which represent channels at resolution. Using these features as input to the ConvLSTM with a kernel and padding, we can produce a future optical flow prediction with a single convolution layer after the ConvLSTM to reduce the channels from to for the flow field. The FlowLSTM can be defined as:
3.1.4 Warp Layer
The FlowCNN extracts motion from the input frames and the FlowLSTM aggregates motion features over time. The next question is how to use these features for the semantic forecasting task. Both Luc et al. [34] and Jin et al. [24] utilize a SegPredCNN, which attempts not only to extract motion from the segmentation features but also to apply the motion. Jin et al. [24] combine the features extracted from the FlowCNN with features extracted from the SegPredCNN through a residual block that learns a weighted summation. By giving a CNN the ability to apply the warp operation directly with a warp layer rather than through convolutional operations, we can forgo the SegPredCNN that both [34] and [24] rely heavily upon.
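To make the warp operation concrete, the following is a minimal sketch of bilinear warping on a single-channel feature map, written in plain Python. It is an illustration under assumptions, not the layer used in this work: the function name `warp_bilinear`, the backward-sampling convention (each output pixel reads from its own location offset by the flow), and zero padding outside the map are all choices made here for clarity.

```python
import math


def warp_bilinear(feature, flow):
    """Warp a 2-D feature map with a per-pixel flow field (illustrative sketch).

    feature: H x W nested list of floats (one channel).
    flow:    H x W nested list of (dx, dy) tuples. Assumed convention: output
             pixel (y, x) samples the input at (y + dy, x + dx).
    Samples that fall outside the map contribute zero.
    """
    h, w = len(feature), len(feature[0])

    def at(y, x):
        # Zero padding; the explicit bounds check also avoids negative-index wraparound.
        return feature[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy                  # sampling location
            x0, y0 = math.floor(sx), math.floor(sy)  # top-left neighbour
            ax, ay = sx - x0, sy - y0                # interpolation weights
            top = (1 - ax) * at(y0, x0) + ax * at(y0, x0 + 1)
            bottom = (1 - ax) * at(y0 + 1, x0) + ax * at(y0 + 1, x0 + 1)
            out[y][x] = (1 - ay) * top + ay * bottom
    return out


# Usage: a flow of (-1, 0) everywhere shifts the content one pixel to the right.
row = [[1.0, 2.0, 3.0]]
shifted = warp_bilinear(row, [[(-1.0, 0.0)] * 3])
```

Because the output is a weighted sum of neighbouring input values, with weights that depend on the flow, the operation is piecewise differentiable with respect to both the features and the flow, which is the property that permits end-to-end training through a warp layer.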
3.1.5 Loss function
For short-term prediction, the main loss function used is cross-entropy loss with respect to the ground truth segmentation for the th frame. We note that through the recurrent formulation of our method, we do not need to rely on the weak segmentation or gradient difference loss used in [34, 24]. Thus our main loss is formulated as:
where is the ground truth class for the pixel at location and is the predicted probability that at location is class .
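For reference, a pixel-wise cross-entropy loss of the kind described here is usually written as below. The symbols are assumed for illustration: $y_{i,j}$ is the ground truth class at pixel $(i,j)$ and $p_{i,j}(c)$ is the predicted probability of class $c$ at that pixel. Whether the sum is normalised by the number of pixels is not stated in the text, so no normalisation is shown.

```latex
% Assumed notation; standard pixel-wise cross-entropy.
\mathcal{L}_{\mathrm{seg}} = -\sum_{(i,j)} \log p_{i,j}\big(y_{i,j}\big)
```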
3.2 Mid-term Prediction
There are two main approaches for mid-term prediction: single-step and auto-regressive. [34] and [24] defined mid-term prediction as being nine and ten frames forward, respectively, which is approximately seconds in the future. The single-step approach uses past frames to predict a single optical flow , where is a mid-term jump , to warp to . The auto-regressive approach takes multiple steps to reach , warping both the segmentation features and the last RGB frame to pass back into the FlowCNN as input for the next step.
For example, consider when . For the single-step approach, our model would estimate the optical flow and warp to using this flow. For the auto-regressive approach, our model would instead estimate the shorter optical flow and warp and to and , respectively. Then and are passed back into the FlowCNN to produce the optical flow that warps to .
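The two strategies differ only in control flow, which the sketch below tries to make explicit. It is a simplified illustration: `predict_flow` and `warp` are hypothetical callables standing in for the flow predictor and the warp layer, the recurrent state and frame pairing are omitted, and the toy helpers model constant motion on a one-dimensional row.

```python
def single_step(seg, rgb, predict_flow, warp, jump):
    """Predict one flow spanning the whole jump and warp the segmentation once."""
    flow = predict_flow(rgb, jump)
    return warp(seg, flow)


def auto_regressive(seg, rgb, predict_flow, warp, step, num_steps):
    """Take several short steps, feeding the warped RGB frame back in each time."""
    for _ in range(num_steps):
        flow = predict_flow(rgb, step)
        seg = warp(seg, flow)  # warped segmentation features
        rgb = warp(rgb, flow)  # warped frame becomes the next input
    return seg


# Toy stand-ins (assumptions for illustration only).
def toy_predict_flow(rgb, step):
    return step  # constant motion of one pixel per frame


def toy_warp(row, shift):
    return [0] * shift + row[:len(row) - shift]  # shift content right, zero-fill


seg = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rgb = list(seg)
one_jump = single_step(seg, rgb, toy_predict_flow, toy_warp, jump=9)
three_hops = auto_regressive(seg, rgb, toy_predict_flow, toy_warp, step=3, num_steps=3)
```

With perfectly constant motion both routes land in the same place; in practice the auto-regressive route accumulates interpolation and prediction error at every hop, while the single-step route asks the flow predictor for a much larger displacement.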
The benefits of the single-step approach include more efficient training and testing and no error propagation from recursively using the model's own predictions. On the other hand, the auto-regressive approach has the benefit of being a single model for any time step, as well as being able to regress indefinitely into the future. Additionally, most optical flow methods are designed for next-frame or short-term flow estimation. Therefore, a single large step may not be as easy to estimate as multiple small steps. Both approaches can learn non-linear flow due to the presence of the FlowLSTM. For the auto-regressive approach, the unrolled recurrent structure can be visualized in Fig. 3.
3.2.1 Loss function
For mid-term prediction, we utilize not only the cross-entropy loss but also an RGB reconstruction loss. Specifically for the auto-regressive approach, with each small step, we take the loss between the warped and the ground truth . For more stable training, we use the smooth loss , by:
where is the distance as:
The full loss function is
where and are weighting factors treated as hyper-parameters during the training process. We find setting early on during training and then iteratively lowering closer to 0 provides optimal results.
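A minimal sketch of the reconstruction term and the weighted total follows. It assumes the standard smooth L1 form (quadratic for residuals below 1, linear beyond), which is taken here to be the loss meant above; the function names and the weight names `w_seg` and `w_rgb` are hypothetical, and frames are flattened to plain lists for brevity.

```python
def smooth_l1(d):
    """Standard smooth L1 penalty on a residual d (assumed form)."""
    a = abs(d)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def rgb_reconstruction_loss(warped, target):
    """Mean smooth L1 distance between a warped frame and the ground-truth frame."""
    return sum(smooth_l1(w - t) for w, t in zip(warped, target)) / len(target)


def total_loss(seg_loss, rgb_loss, w_seg, w_rgb):
    """Weighted sum of the segmentation and RGB reconstruction terms."""
    return w_seg * seg_loss + w_rgb * rgb_loss


# Usage with made-up values.
rgb_term = rgb_reconstruction_loss([0.5, 3.0], [0.0, 1.0])
loss = total_loss(seg_loss=1.0, rgb_loss=rgb_term, w_seg=1.0, w_rgb=0.5)
```

The linear tail of smooth L1 keeps large pixel residuals from dominating the gradient, which fits the stated goal of more stable training.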
4.1 Dataset and evaluation criteria
Our experiments are focused on the latest urban street scene dataset, Cityscapes [10], which contains high-quality images () with pixel-wise annotations for nineteen semantic classes (pedestrian, car, road, etc.). Additionally, [10] provides a video sequence which contains frames that precede the frame containing the ground-truth semantic segmentation. Our method can utilize all previous frames for the training sequences.
We report performance of our models on the validation sequences using the standard mean Intersection over Union (IoU) :
where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. We compute IoU with respect to the ground truth segmentation of the th frame in each sequence. To further demonstrate the effectiveness of our method on moving objects, we compute the mean IoU on the eight foreground classes (person, rider, car, truck, bus, train, motorcycle, and bicycle) denoted IoU-MO.
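The metric can be sketched directly from the TP, FP, and FN counts. This is an illustrative implementation on flattened label lists, with hypothetical function names; benchmark tooling typically accumulates the counts over the whole validation set before dividing, whereas this sketch scores whatever labels it is given.

```python
def class_iou(pred, truth, cls):
    """IoU = TP / (TP + FP + FN) for one class; None if the class never appears."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else None


def mean_iou(pred, truth, classes):
    """Average the per-class IoU over the classes that occur in pred or truth."""
    scores = [s for s in (class_iou(pred, truth, c) for c in classes) if s is not None]
    return sum(scores) / len(scores) if scores else float("nan")


# Usage: restricting `classes` to the foreground ids gives an IoU-MO style score.
pred = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
score = mean_iou(pred, truth, classes=[0, 1])
```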
4.2 Implementation details
In our experiments, the resolution of any input image, , to either the SegCNN or FlowCNN was . The segmentation features, , extracted from the SegCNN were also at , while those extracted from PSP-half were at . The optical flow prediction, , from FlowNet-c was produced at . In most experiments, we upsample both and to . For training hyperparameters, we trained with SGD using the “poly” learning rate policy with learning rate initially set to , power set to , weight decay set to , and momentum set to . Additionally, we used a batch size of and trained for k iterations ( Epochs). All of our experiments were conducted with either NVIDIA Titan X or Ti GPUs using the Caffe library .
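The “poly” learning rate policy decays the rate polynomially to zero over training. A small sketch is given below; the numeric values in the usage lines are placeholders chosen for illustration, not the settings used in these experiments.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Poly policy: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Usage with placeholder hyperparameters (assumptions, not the paper's values).
start = poly_lr(0.01, 0, 1000, 0.9)
middle = poly_lr(0.01, 500, 1000, 0.9)
end = poly_lr(0.01, 1000, 1000, 0.9)
```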
|Model||IoU ()||IoU ()|
|Copy last input |
|Warp last input |
4.3 Baseline comparison
We compare our method to the two main previous works in semantic forecasting on Cityscapes:  and . Our comparison is multifaceted: number of model parameters (in millions), model efficiency (in seconds), and model performance (in mean IoU). Additionally, we compare against  on the more challenging foreground, moving objects (IoU-MO) baselines. To more directly compare with , we utilize the definition from  of short-term as and mid-term as . However,  redefines short-term as and mid-term as . Thus, without reproducing each of their methods, we distinctly compare against each definition of short-term and mid-term prediction.
SP-MD : Jin et al. generate ,,, using their own Res101-FCN network and concatenate as input to their SegPredCNN (Res101-FCN). Simultaneously their Flow CNN (Res101-FCN) takes in , , , and generates . Finally, features from Flow CNN and SegPredCNN are fused via a residual block to make the final prediction.
The computational complexity of our methods relative to  and  is shown in Tab. 1. Since our work is motivated by deployment of real-time autonomous systems, we emphasize both low overhead and efficiency. Our approach demonstrates a significant decrease in the number of SegPred parameters relative to . Specifically, we reduce the number of parameters from to resulting in a dramatic decrease. This large reduction can be directly linked to our removal of the large SegPredCNN, which is replaced with the lightweight FlowNet-c as our FlowCNN combined with the warp layer. With respect to efficiency, our method sees a and speedup overall relative to  and , respectively. We infer this speedup is correlated with our need to process only a single past frame’s segmentation features and our ability to limit the amount of redundant image processing.
Although we increase the number of SegPred parameters from to relative to , we find a solid performance increase on short-term, mid-term, and moving objects prediction (Tab. 2). Specifically, we improve by and on short-term and mid-term, respectively. On the more challenging moving objects benchmark, we improve on short-term by and mid-term by . Because we directly encode motion into our model through our FlowCNN and aggregate optical flow features over time with our FlowLSTM, we can infer more accurate trajectories on moving objects, as seen in Fig. 4.
Finally, in Tab. 3, we compare with the state-of-the-art method on and , . For next frame prediction, we demonstrate a new state-of-the-art pushing the margin by . For prediction, our method performs worse overall and only worse when controlling for recurrent fine-tuning. It should be noted that the model that performs achieves roughly a boost due to in-painting. More specifically this occurred when the flow field was occluded and predicted a 0 for flow of objects coming onto or leaving the frame. Thus, we implemented a naive copy last segmentation feature in-painting method to provide this additional small boost.
In summary, we achieve state-of-the-art performance on short-term (, ) and moving objects (, ) prediction. We also find only performance degradation on mid-term () prediction while reducing the number of model parameters by and increasing the efficiency .
4.4 Ablation studies
4.4.1 Impact of warp layer
The main enabling factor for our strong performance on next-frame and short-term prediction is the warp layer. Tab. 4 shows how significant the performance degradation is when the warp layer is removed from our system. It is such an integral component that our method would not function without it. Substituting a convolutional layer and concatenation/element-wise addition operations proved highly ineffective. To truly ablate the impact of the warp layer, there would likely need to be a SegPredCNN in addition to these concatenation/addition operators. However, that would run contrary to the main motivation of our work: low-overhead, effective systems.
4.4.2 Impact of FlowLSTM
One novel aspect of our work relative to others in semantic forecasting is the use of an LSTM. Specifically, although others utilized recurrent fine-tuning with BPTT to improve their systems, our system is the first, to the best of our knowledge, to contain a memory module. Tab. 5 shows the effects of the FlowLSTM without recurrent fine-tuning, demonstrating the value of the memory module provided by the LSTM for both short- and mid-term prediction.
4.4.3 Impact of time and step size
Tabs. 6-7 show the benefit of including more past frames and using a small step size for short-term prediction. A smaller step size works well with an optical flow network trained on next-frame flow. Note, however, the diminishing returns when going further back in time from to past steps. Through the FlowLSTM aggregation of flow features, our method is uniquely able to process frames at a step size of and predict a jump of into the future. Similarly, our best performing mid-term model processes past frames at step size of to predict a jump of . Another aspect of our approach is the lack of redundant processing, as we only process each frame twice, compared to four times with previous methods. Further, our method can process any number of past frames while maintaining its prediction integrity and run-time efficiency.
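One way to read the "each frame is processed twice" remark is that the flow network sees consecutive, overlapping frame pairs, so every interior frame appears once as the later frame of a pair and once as the earlier frame of the next. The sketch below enumerates such pairs; the function name, the indexing scheme, and the example values are assumptions made for illustration rather than details taken from this work.

```python
def frame_pairs(t, step, num_pairs):
    """Indices of consecutive (earlier, later) frame pairs ending at frame t."""
    first = t - step * num_pairs
    return [(first + k * step, first + (k + 1) * step) for k in range(num_pairs)]


# Usage with illustrative values: three pairs at a step size of 3, ending at frame 19.
pairs = frame_pairs(t=19, step=3, num_pairs=3)
```

Here frames 13 and 16 each appear in two pairs, while a scheme that stacks the last four frames at every time step would touch each frame up to four times.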
4.4.4 Impact of auto-regressive vs. single-step
Tab. 8 shows that the two approaches are roughly equivalent in terms of accuracy, with the auto-regressive approach showing only a slight degradation relative to the single-step approach. We conjecture that error propagation with three recursive steps is more significant than the error from a single large step. The overall pros and cons of single-step vs. auto-regressive prediction depend upon the real-world application. For instance, if efficiency is the highest priority in an autonomous vehicle, limiting the processing of redundant frames with a single-step model would be beneficial. On the other hand, if flexibility to predict any time step in the future is desired, an auto-regressive approach may be preferred.
In this paper, we posed semantic forecasting in a new way–decomposing motion and content. Through this decomposition into current frame segmentation and future optical flow prediction, we enabled a more compact model. This model contained three main components: flow prediction network, feature-flow aggregation LSTM, and an end-to-end warp layer. Together these components worked in unison to achieve state-of-the-art performance on short-term and moving objects segmentation prediction. Additionally, our method reduced the number of parameters by up to and demonstrated a speedup beyond . In turn, our method was designed with low overhead and efficiency in mind, an essential factor for real-world autonomous systems. As such, we proposed a lightweight, modular baseline for recurrent flow-guided semantic forecasting.
-  A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. Pattern Analysis and Machine Intelligence (PAMI), 2017.
-  B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Neural Information Processing Systems (NIPS), 2016.
-  G. Brazil, X. Yin, and X. Liu. Illuminating pedestrians via simultaneous detection & segmentation. In International Conference on Computer Vision (ICCV), 2017.
-  Y.-W. Chao, J. Yang, B. Price, S. Cohen, and J. Deng. Forecasting human dynamics from static images. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
-  Q. Chen and V. Koltun. Full flow: Optical flow estimation by global optimization over regular grids. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: Joint learning for video object segmentation and optical flow. In International Conference on Computer Vision (ICCV), 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  F. Cricri, X. Ni, M. Honkala, E. Aksu, and M. Gabbouj. Video ladder networks. In NIPS workshop on ML for Spatiotemporal Forecasting, 2016.
-  E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Neural Information Processing Systems (NIPS), 2017.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 2015.
-  C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Neural Information Processing Systems (NIPS), 2016.
-  P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. Van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In International Conference on Computer Vision (ICCV), 2015.
-  R. Gadde, V. Jampani, and P. V. Gehler. Semantic video cnns through representation warping. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  R. Girshick. Fast r-cnn. In International Conference on Computer Vision (ICCV), 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
-  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Neural Information Processing Systems (NIPS), 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
-  X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, et al. Video scene parsing with predictive feature learning. In International Conference on Computer Vision (ICCV), 2017.
-  X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan. Predicting scene parsing and motion dynamics in the future. In Neural Information Processing Systems (NIPS), 2017.
-  A. Jourabloo, X. Liu, M. Ye, and L. Ren. Pose-invariant face alignment with a single cnn. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning (ICML), 2017.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
-  X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion gan for future-flow embedded video prediction. In International Conference on Computer Vision (ICCV), 2017.
-  Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In International Conference on Computer Vision (ICCV), 2017.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2015.
-  W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations (ICLR), 2017.
-  C. Lu, M. Hirsch, and B. Schölkopf. Flexible spatio-temporal networks for video prediction. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  P. Luc, N. Neverova, C. Couprie, J. J. Verbeek, and Y. LeCun. Predicting deeper into the future of semantic segmentation. In International Conference on Computer Vision (ICCV), 2017.
-  Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei. Unsupervised learning of long-term motion dynamics for videos. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  R. Mahjourian, M. Wicke, and A. Angelova. Geometry-based next frame prediction from monocular video. In Intelligent Vehicles Symposium (IV), 2017.
-  M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR), 2016.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
-  J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Neural Information Processing Systems (NIPS), 2015.
-  V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. In International Conference on Learning Representations (ICLR) Workshop, 2016.
-  M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
-  J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Computer Vision and Pattern Recognition (CVPR), 2015.
-  S. M. Safdarnejad, X. Liu, and L. Udpa. Robust global motion compensation in presence of predominant foreground. In British Machine Vision Conference (BMVC), 2015.
-  S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua. Long-term planning by short-term prediction. arXiv preprint arXiv:1602.01580, 2016.
-  S. Shalev-Shwartz and A. Shashua. On the sample complexity of end-to-end training vs. semantic abstraction training. arXiv preprint arXiv:1604.06915, 2016.
-  X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Neural Information Processing Systems (NIPS), 2015.
-  N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning (ICML), 2015.
-  L. Tran and X. Liu. Nonlinear 3D face morphable model. In Computer Vision and Pattern Recognition (CVPR), 2018.
-  R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations (ICLR), 2017.
-  R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In International Conference on Machine Learning (ICML), 2017.
-  C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In Computer Vision and Pattern Recognition (CVPR), 2016.
-  C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Neural Information Processing Systems (NIPS), 2016.
-  J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision (ECCV), 2016.
-  J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In Computer Vision and Pattern Recognition (CVPR), 2014.
-  J. Walker, A. Gupta, and M. Hebert. Dense optical flow prediction from a static image. In International Conference on Computer Vision (ICCV), 2015.
-  T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Neural Information Processing Systems (NIPS), 2016.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Computer Vision and Pattern Recognition (CVPR), 2017.
-  X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei. Flow-guided feature aggregation for video object detection. In International Conference on Computer Vision (ICCV), 2017.
-  X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In Computer Vision and Pattern Recognition (CVPR), 2017.