MSnet: Mutual Suppression Network for Disentangled Video Representations
Extracting meaningful features from videos is important because such features can be used in a wide range of applications. Despite this importance, video representation learning has received comparatively little attention, largely because it is difficult to handle content and motion information at the same time. We present a Mutual Suppression network (MSnet) that learns disentangled motion and content features from videos. The MSnet is trained in such a way that content features do not contain motion information and motion features do not contain content information; this is achieved by having the two feature types suppress each other through adversarial training. To demonstrate the strength of the MSnet, we use its disentangled features for several tasks: frame reproduction, pixel-level video frame prediction, and dense optical flow estimation. The proposed model outperforms the state-of-the-art methods in pixel-level video frame prediction. The source code will be publicly available.
Keywords: Adversarial training; Video representation; Unsupervised learning; Video frame prediction
Understanding videos is crucial because it serves not only direct applications such as robot action decision and closed-circuit television (CCTV) surveillance, but also indirect applications such as unsupervised learning of image representations from videos. As deep learning has evolved, research in computer vision has achieved tremendous success. In image classification, for instance, machine performance now surpasses that of humans [9].
Representation learning in images has been studied extensively and has achieved successful results, partly because static images are easy to encode and partly because vast datasets are available [3]. Representation learning in videos, however, remains difficult compared with image representation, because labeled data are scarce and the temporal attribute of videos is difficult to encode. Because it is impractical to annotate every frame in a video, video representation learning is typically approached in an unsupervised manner. As videos contain rich spatial and temporal information, we can use naturally labeled information such as temporal coherence.
Video can naturally be decomposed into spatial and temporal components [33, 38, 39, 41, 42, 5]. Content feature learning has been studied extensively not only in images [10, 12] but also in videos [24, 43]. Motion representation learning in videos, however, has received far less attention. To obtain motion features, optical flow or 3D Convolutional Neural Networks (CNNs) [36, 1] have been exploited, with reasonable results. However, computing optical flow is time consuming, and an accurate flow is difficult to obtain. A 3D CNN involves intensive computation and memory usage; this limits network depth and makes training difficult. In addition, a 3D CNN performs only marginally better than a 2D CNN based method.
We can thus obtain meaningful motion and content features by following the intuitions below.
- Separable feature
We can understand video by decomposing it into motion and content. The motion feature should not contain content information and the content feature should not contain motion information.
- Content from several frames
Most methods that encode video into motion and content obtain content features from only one frame. Content features, like motion features, should instead be obtained from several frames: if two objects occlude each other or cannot be distinguished in a single frame, that frame alone is not sufficient to capture the content information.
When three frames are given, the combination of content feature between and motion feature between should be able to reproduce .
- Time-reversibility of content
The content features extracted from a pair of frames and from the same pair in reversed temporal order should be the same. We extract content features from two frames; however, the two frames carry temporal information, so the extracted content feature may be contaminated by motion information. The content feature should therefore have a time-reversal property, which ensures that it contains no motion information.
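The time-reversibility intuition can be checked numerically. The following is a minimal sketch, not the authors' implementation: `content_encoder` is a hypothetical callable that maps a pair of frames to a content feature, and the mean squared gap between the two frame orderings is an assumed way to quantify how far an encoder is from being time-reversible.

```python
import numpy as np

def time_reversal_gap(content_encoder, frame_a, frame_b):
    """Mean squared difference between content features of (a, b) and (b, a).

    `content_encoder` is a hypothetical stand-in for a content encoder;
    a value of 0.0 means the encoder is exactly time-reversible on this pair.
    """
    forward = np.asarray(content_encoder(frame_a, frame_b), dtype=float)
    backward = np.asarray(content_encoder(frame_b, frame_a), dtype=float)
    return float(np.mean((forward - backward) ** 2))

# Toy encoders: the sum ignores frame order, the difference does not.
symmetric_encoder = lambda a, b: a + b
order_sensitive_encoder = lambda a, b: a - b
```

A gap near zero is necessary but not sufficient for a motion-free content feature; it only rules out dependence on frame order.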
In this paper, we demonstrate the significance of the motion and content features obtained from a Mutual Suppression network (MSnet) through several experiments. The contributions of our paper can be summarized as follows:
- We propose the MSnet for disentangled video representations, built on the intuitions above.
- We propose a novel encoder-decoder architecture, with which we achieve better results using a much simpler network than those of the state-of-the-art methods.
- We show the usefulness of the features extracted from the MSnet through several experiments: frame reproduction, future frame prediction, and optical flow estimation.
- We perform ablation studies to verify the effect of adversarial learning.
2 Related Work
With the remarkable development of deep learning, a myriad of networks and methods have been presented for image representation. These image representation methods have been successfully exploited for image classification [12, 10, 34], object detection [20, 29], and semantic segmentation [8, 27]. Video representation, however, is regarded as a challenging problem, as it has a time-varying attribute that is difficult to encode. While image representation focuses only on static information, video analysis must consider two properties that behave differently, and it is difficult to represent time-invariant and time-variant information at once. Because CNNs have achieved huge success in image processing, 3D CNNs have been used to encode spatio-temporal features in videos. In previous methods, 3D CNNs are used to perform video action recognition and action detection [36, 37, 11].
The recent trend in video representation learning focuses on identifying spatial parts and temporal parts in natural videos [28, 6, 33, 38, 15, 39, 1, 45]. Ranzato et al. [28] attempted to discover both spatial and temporal correlations, drawing inspiration from language modeling. Simonyan et al. [33] proposed the two-stream network for video action recognition, motivated by the human visual cortex, to decouple the complementary information appearing in videos. The spatial stream uses a single frame containing the spatial information, whereas the temporal stream uses stacked optical flow to capture motion information. Feichtenhofer et al. studied how to combine the information from the motion and content streams with various types of interactions. Carreira et al. [1] proposed a two-stream 3D CNN by leveraging ResNet for action recognition.
As the two-stream network has become mainstream and produced outstanding results in action recognition, it is now widely used in various applications. Jain et al. [15] proposed a two-stream fully convolutional neural network for segmentation of objects in video. Two-stream networks have also been applied to other tasks, such as action detection [26, 45] and person re-identification in videos [22, 2]. MoCoGAN [38] addressed video generation from a motion latent space and a content latent space. Jin et al. [16] constructed a two-way stream network that jointly predicts the motion dynamics according to the flow and the scene statics by scene parsing. MCnet [39] and DRnet [4] used decomposed motion and content features for video future frame prediction.
Video future frame prediction involves predicting future frames from some previous frames. To predict future frames precisely, the given frames must be understood correctly. Several studies have interpreted given videos by decomposing them into motion and content information. Vondrick et al. [40] attempted to generate future frames by modeling the foreground separately from the background. This work encoded the given frames into a latent space and delivered this latent space to foreground and background streams. The outputs from both streams are merged to extrapolate the succeeding frames. MCnet [39] proposed an encoder-decoder style architecture that decomposes the content and motion appearing in videos. The content encoder in MCnet is designed to extract spatial features from the last frame of a given video, whereas the motion encoder recurrently captures motion dynamics from the history of the pixel-wise difference between two frames. MCnet uses the image gradient difference loss (gdl) [21] and an adversarial loss [7] to obtain sharper results. DRnet [4] decomposed the content and pose attributes in frames using only a content discriminator. The content discriminator examines whether two pose features are from the same content or not. The decomposed features of the given frames are used to predict future pose features, and these predicted features are used to generate future frames.
In this section, we describe the proposed method in detail. First, we introduce the frame reproduction task and the network proposed to obtain disentangled features (Section 3.1), then present pixel-wise video future frame prediction (Section 3.2), and finally describe how our model can be used to estimate optical flow (Section 3.3). For the following descriptions, let denote the th frame in video , where , and denote width, height, and the number of channels, respectively.
3.1 Frame Reproduction
The proposed model is composed of multiple components, as shown in Figure 2. Two encoders extract features relevant to content and motion, respectively. In addition, we use a generator and three discriminators, one each for content, motion, and frame. The content encoder uses two consecutive frames and to derive the content information . The motion encoder takes two frames and , which should not be adjacent, to extract motion information . We then concatenate the two features, feed them into the generator, , and estimate the last frame of the input, . We should be able to reproduce from and if these features contain meaningful information. Here, the generator uses residual connections from the content encoder, as in the UNet architecture [31]. However, because a skip connection preserves information about the previous frames and , not , an afterimage of and tends to remain in the reproduced . To resolve this problem, we concatenate the feature of each residual connection with bilinearly upscaled motion features , pass the result through a convolutional layer to adjust the number of channels, and add the residual connections . We call this layer the skip conv connection.
where represents the temporal distance between the target frame and the reference frame.
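The skip conv connection described above can be illustrated with a small NumPy sketch. This is one reading of the description under stated assumptions rather than the authors' implementation: features are channel-first arrays, the convolution is taken to be 1x1 (the text does not give a kernel size), bilinear upscaling uses an align-corners convention, and the convolved result is added back onto the skip feature. The names `bilinear_upsample` and `skip_conv_connection` are hypothetical.

```python
import numpy as np

def bilinear_upsample(feat, out_h, out_w):
    """Bilinearly resize a (C, h, w) array to (C, out_h, out_w)."""
    _, h, w = feat.shape
    # Sample positions in the source grid (align-corners convention).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bot = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def skip_conv_connection(skip_feat, motion_feat, weight, bias):
    """Mix an encoder skip feature with upscaled motion features.

    skip_feat: (Cs, H, W) feature from the content encoder.
    motion_feat: (Cm, h, w) motion feature at a coarser resolution.
    weight: (Cs, Cs + Cm) weights of an assumed 1x1 convolution.
    bias: (Cs,) bias of that convolution.
    """
    _, height, width = skip_feat.shape
    upscaled = bilinear_upsample(motion_feat, height, width)
    stacked = np.concatenate([skip_feat, upscaled], axis=0)
    # A 1x1 convolution is a per-pixel linear map over channels.
    mixed = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    # Residual addition restores the channel count of the skip feature.
    return skip_feat + mixed
```

With zero weights the layer reduces to a plain skip connection, so the convolution only has to learn the correction that removes the afterimage of earlier frames.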
For reproduction, Denton et al. [4] used only the mean squared error (MSE). However, the MSE loss inherently produces blurry results in image reconstruction. We thus add the frame adversarial loss in Eq. (4), which generates a frame conditioned on a previous frame, similar to the pix2pix network [14]. The frame discriminator takes two frames and classifies whether the input is a real pair of frames. Using Eq. (3), we can generate a sharper frame conditioned on the previous frame .
where denotes the loss function for training the frame discriminator.
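As an illustration of how a conditional frame discriminator of this kind is commonly trained, the sketch below uses the standard binary cross-entropy GAN formulation. This is an assumption, since the exact form of Eqs. (3) and (4) is not reproduced here; the function names and the `adv_weight` parameter are hypothetical. The inputs are the probabilities that the discriminator assigns to a pair of frames being real.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single probability, clipped for stability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def frame_discriminator_loss(p_real_pair, p_fake_pair):
    """Loss for the frame discriminator.

    p_real_pair: probability given to (previous frame, true frame).
    p_fake_pair: probability given to (previous frame, reproduced frame).
    """
    return bce(p_real_pair, 1.0) + bce(p_fake_pair, 0.0)

def generator_frame_loss(p_fake_pair, mse, adv_weight):
    """MSE reconstruction term plus a weighted adversarial term.

    The generator is rewarded when the discriminator labels its pair as real.
    """
    return mse + adv_weight * bce(p_fake_pair, 1.0)
```

Conditioning on the previous frame means the discriminator judges the pair rather than the reproduced frame alone, which penalizes frames that are sharp but inconsistent with their predecessor.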
Next, we train the content discriminator and the motion discriminator to enforce the separable-feature intuition. Let denote the video which is not video . The content discriminator takes two motion features and classifies whether they come from the same video. If the content discriminator can distinguish motion features from different videos, the motion features must contain content information. Conversely, for the motion encoder to deceive the content discriminator, the motion features must not contain content information. With Eqs. (5) and (6), we train the motion encoder to fool the content discriminator, which tries to find content information in the motion feature, so that the motion encoder obtains a pure motion feature.
where and indicate different time step indexes.
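The training pairs for the content discriminator can be sketched as follows. This is a minimal illustration, assuming that videos and time steps are addressed by integer indices and that both positive and negative pairs use two distinct time steps; the function name and its arguments are hypothetical, and at least two videos and two time steps are required.

```python
import random

def sample_content_disc_pair(num_videos, num_steps, same_video, rng=random):
    """Pick the sources of two motion features and the discriminator label.

    Returns ((video_a, t_a), (video_b, t_b), label), where label is 1 when
    both motion features come from the same video and 0 otherwise.
    """
    video_a = rng.randrange(num_videos)
    if same_video:
        video_b = video_a
    else:
        # Draw uniformly from the other videos by skipping over video_a.
        video_b = rng.randrange(num_videos - 1)
        if video_b >= video_a:
            video_b += 1
    # Two distinct time-step indexes.
    t_a, t_b = rng.sample(range(num_steps), 2)
    return (video_a, t_a), (video_b, t_b), 1 if same_video else 0
```

The discriminator is trained on these labels, while the motion encoder is updated against them so that the same-video cue disappears from the motion features.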
Similar to the content discriminator, the motion discriminator takes two content features and classifies whether they come from adjacent frames or not. If the motion discriminator can distinguish content features extracted at different time gaps, the content features must contain motion information. Conversely, for the content encoder to deceive the motion discriminator, the content features must not contain motion information. With Eqs. (7) and (8), we train the content encoder to deceive the motion discriminator, which tries to find motion information in the content feature, so that the content encoder obtains a pure content feature.
where and denote adjacent frames and and denote nonadjacent frames.
We train our networks using the following objective functions. We optimize and alternately.
where , and are hyperparameters.
3.2 Pixel-wise Video Future Prediction
Section 3.1 deals with the reproduction of a single frame. We extend this to pixel-wise video future prediction to examine how meaningful the features obtained by the feature extractors are. The MSnet is given frames and trained to predict following frames. Let denote the given frames and denote the frames to be predicted. The motion encoder extracts motion features from , and the content encoder extracts content feature from . The convolutional LSTM (convLSTM) [44] takes the concatenated features at each time step and predicts the motion features of the next frame . This predicted motion feature is fed to the convLSTM again to predict the following motion features.
Then, the predicted motion features at each time step are fed into the generator together with the content features, and the next frames are predicted. The overall architecture for future frame prediction is shown in Figure 3, and the convLSTM is trained with the following objective function.
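The autoregressive use of the recurrent predictor can be sketched in a few lines. This is an illustrative skeleton under assumptions, not the authors' code: `predict_next` is a hypothetical callable standing in for the convLSTM, taking a motion feature, the content feature, and a recurrent state, and returning the next motion feature together with the updated state.

```python
def rollout_motion_features(predict_next, observed_motion, content, num_future):
    """Predict `num_future` motion features by feeding predictions back in.

    predict_next(motion, content, state) -> (next_motion, new_state) is a
    hypothetical stand-in for one step of the recurrent predictor.
    Requires a non-empty `observed_motion` and num_future >= 1.
    """
    state = None
    next_motion = None
    # Warm up the recurrent state on the observed motion features.
    for motion in observed_motion:
        next_motion, state = predict_next(motion, content, state)
    predictions = [next_motion]
    # Feed each prediction back in to obtain the following one.
    while len(predictions) < num_future:
        next_motion, state = predict_next(predictions[-1], content, state)
        predictions.append(next_motion)
    return predictions

def toy_step(motion, content, state):
    """Toy predictor: add one to the feature and count the calls made."""
    return motion + 1, (state or 0) + 1
```

Each predicted motion feature would then be passed to the generator with the content feature to render the corresponding frame. Because errors feed back into later steps, the quality of the first predictions matters most.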
3.3 Dense Optical Flow Estimation
To examine whether our motion encoder extracts meaningful motion information from a video, we carried out an experiment to estimate dense optical flow using only the motion encoder. The motion encoder takes two frames and extracts the motion feature . The motion feature is upsampled with transposed convolutions until its size equals the input shape. We do not use brightness constancy or spatial smoothness, which are common assumptions in optical flow estimation methods. There are benchmark datasets for optical flow estimation, such as KITTI 2015 [23] and Flying Chairs. However, these datasets provide only two frames per sample, whereas our network needs three frames to be trained. We thus generated a ground-truth optical flow field using EpicFlow [30]. Let denote the ground-truth optical flow from EpicFlow and denote the predicted optical flow. We train the optical flow network with the end point error (EPE) loss:
where and denote spatial locations.
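For reference, the end point error is conventionally the Euclidean distance between predicted and ground-truth flow vectors, averaged over all spatial locations. The sketch below implements that standard definition; the array layout (height, width, 2) and the function name are assumptions for illustration.

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Average end point error between two flow fields of shape (H, W, 2).

    The last axis holds the horizontal and vertical flow components.
    """
    diff = np.asarray(flow_pred, dtype=float) - np.asarray(flow_gt, dtype=float)
    # Per-pixel Euclidean norm of the flow difference, then the spatial mean.
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())
```

For example, if every predicted vector is off by 3 pixels horizontally and 4 pixels vertically, the EPE is 5 pixels.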
We performed experiments using the Moving MNIST and KTH datasets [35, 32]. First, we conducted frame reproduction experiments on Moving MNIST to demonstrate the advantage of the proposed model. Then, we performed frame reproduction, future frame prediction, and optical flow estimation on the KTH dataset. For the experiments, we design our content encoder with five convolutional layers and use its symmetric architecture for the generator, as shown in Figure 2. The motion encoder has six convolutional layers. Each feature is normalized with . The details of the model architecture are provided in the supplementary material. To evaluate the performance of the proposed model, we used the structural similarity (SSIM), the peak signal-to-noise ratio (PSNR), and the MSE as evaluation metrics, as in previous studies [21, 39]. We used the average end point error (EPE) as the evaluation metric for optical flow estimation.
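Two of the frame-quality metrics are simple enough to state in code. The sketch below follows the standard definitions of MSE and PSNR and assumes pixel values scaled to the range [0, `max_val`]; SSIM is omitted because it requires windowed statistics. The function names and the default `max_val` are illustrative choices.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two frames of equal shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical frames."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))
```

A uniform error of 0.1 on frames scaled to [0, 1] gives an MSE of 0.01 and a PSNR of 20 dB, which is a convenient reference point when reading PSNR curves.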
4.1 Moving MNIST
The Moving MNIST dataset contains 10,000 sequences, each with 20 frames. In each frame, two digits move around in patch. The digits can occlude each other and bounce when they reach the wall of the box. We use 8,000 sequences for training and 2,000 sequences for validation. Because DRnet [4] was originally evaluated on self-generated moving-digit data containing color information, we retrained DRnet on the Moving MNIST data in [35], removing the color information, for a fair comparison. We used motion features of size spatial map with channels and content features of size spatial map with channels. For Moving MNIST, we emphasize the content features with a larger channel size because the motion is not as complex as in natural videos. We set in Section 3.1 to be in range of . We set , and in Eq. (9).
For the reproduction task, the MSnet is given three frames and reproduces the third frame, whereas DRnet is given two frames and reproduces the second frame. The quantitative results are shown in Table 1, and the qualitative results are shown in Figure 4. As shown in the first row of Figure 4, DRnet generates digits in the wrong places because it uses pose information. In contrast, the MSnet encodes motion information rather than pose information, so it is able to move the digits to the correct places. Motion is a more natural attribute of videos than pose. In the third row, two digits in a given frame occlude each other, and DRnet cannot obtain the content feature correctly. As the MSnet obtains content features from two frames, the two occluded digits can be identified correctly.
4.2 KTH Dataset
The KTH dataset contains videos of 25 persons performing six actions (running, jogging, walking, boxing, hand-clapping, and hand-waving). For the following experiments, we used 16 persons for training and 9 persons for validation. We resized frames to pixels, as in a previous study. As DRnet [4] uses 20 persons for training and 5 persons for validation, we reproduced DRnet with our train-val setting for a fair comparison. For training, we used the trimmed data used in MCnet [39], which contain only the frames with real motion. We used motion features of size spatial map with channels and content features of size spatial map with channels. We set in Section 3.1 to be in range of . We set , and in Eq. (9).
4.2.1 Frame reproduction
The quantitative results of the reproduction task are shown in Table 2, and the qualitative results are shown in Figure 5. Because the MSnet adopts the frame discriminator, its reproduced frame in Figure 5 (a) is closer in shape to the given frame than the frame reproduced by DRnet. In Figure 5 (c), the MSnet reproduces a sharper frame than DRnet, which uses only the MSE loss. In Figure 5 (d), DRnet fails to extract meaningful content information because it uses only a content discriminator, whereas the MSnet uses both content and motion discriminators.
4.2.2 Future frame prediction
This section presents the results of future frame prediction. All networks are trained on the KTH dataset to predict the following 10 frames after observing 10 given frames. We also reproduced DRnet with the same train-val setting as ours and MCnet. For validation, the networks predict the following 20 frames from 10 given frames. We validate on 3,559 frame sequences, as in MCnet. We use a two-layer convolutional LSTM with 256 hidden units. Figure 7 shows the results of the proposed MSnet and the state-of-the-art networks, as well as the results of the ablation studies on the MSnet. In this future frame prediction task, better results mean that the network extracts more meaningful motion and content features from the given video, such that the LSTM can easily interpret the given features and predict high-quality subsequent features from which future frames are generated.
Figure 7 (a) and (b) show that the proposed MSnet, which uses both motion and content discriminators, outperforms all other ablation settings. The results of the MSnet without a content discriminator (green and gray lines) show a significant drop in performance over time steps. This is because the content discriminator helps the motion encoder focus on the motion feature. As the MSnet with only the content discriminator (blue line) cannot obtain meaningful content features, it starts with an inferior score. However, its score does not drop significantly over time steps, as this network can capture a meaningful motion feature. These results confirm that bi-directional suppression is useful for obtaining meaningful motion and content features.
As shown in Figure 7 (c) and (d), the MSnet outperforms the existing methods. Note that MCnet uses VGG-based networks, and DRnet uses a VGG-UNet and ResNet-18. These results suggest that our mutual suppression and the proposed skip conv UNet are useful for obtaining meaningful disentangled features with simpler networks. Because MCnet defines two streams without any specific restriction, its motion and content features are not properly disentangled, and its scores drop significantly over time steps. DRnet uses a basic UNet and only one-directional suppression; therefore, it extracts less meaningful features than the MSnet. The qualitative results are shown in Figure 6.
4.2.3 Optical flow estimation
We conducted ablation experiments with the motion discriminator and the content discriminator to demonstrate their influence on the purity of motion information. Table 3 shows the results of the ablation studies on optical flow estimation. The MSnet variants with the content discriminator (MSnet and MSnet w/o MD) yielded better results than the others. These results confirm that the content discriminator helps the motion encoder focus only on motion information. The full MSnet obtained better results than the MSnet with only the content discriminator, which suggests that tight bi-directional suppression is effective in obtaining well-disentangled features.
In this paper, we proposed a method to learn disentangled features from videos. We introduced mutual suppression adversarial training to acquire disentangled representations. In addition, we applied the skip conv connection so that the content information from previous frames is refined to be useful for future frame prediction. The MSnet obtained well-disentangled features; as a result, our method achieved better results in frame reproduction, future frame prediction, and optical flow estimation than state-of-the-art methods with more complex architectures and than the other ablation settings.
- [1] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4724–4733 (2017)
- [2] Chung, D., Tahboub, K., Delp, E.J.: A two stream siamese convolutional neural network for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1983–1991 (2017)
- [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
- [4] Denton, E.L., et al.: Unsupervised learning of disentangled representations from video. In: Advances in Neural Information Processing Systems. pp. 4417–4426 (2017)
- [5] Feichtenhofer, C., Pinz, A., Wildes, R.: Spatiotemporal residual networks for video action recognition. In: Advances in Neural Information Processing Systems. pp. 3468–3476 (2016)
- [6] Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7445–7454 (2017)
- [7] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
- [8] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
- [9] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)
- [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- [11] Hou, R., Chen, C., Shah, M.: An end-to-end 3d convolutional neural network for action detection and segmentation in videos. arXiv preprint arXiv:1712.01111 (2017)
- [12] Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. 3 (2017)
- [13] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. vol. 2 (2017)
- [14] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)
- [15] Jain, S.D., Xiong, B., Grauman, K.: Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384 2(3), 6 (2017)
- [16] Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: Advances in Neural Information Processing Systems. pp. 6918–6927 (2017)
- [17] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1725–1732 (2014)
- [18] Lai, W.S., Huang, J.B., Yang, M.H.: Semi-supervised learning for optical flow with generative adversarial networks. In: Advances in Neural Information Processing Systems. pp. 353–363 (2017)
- [19] Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A.: H-denseunet: Hybrid densely connected unet for liver and liver tumor segmentation from ct volumes. arXiv preprint arXiv:1709.07330 (2017)
- [20] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European Conference on Computer Vision. pp. 21–37 (2016)
- [21] Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015)
- [22] McLaughlin, N., del Rincon, J.M., Miller, P.: Recurrent convolutional network for video-based person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1325–1334 (2016)
- [23] Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
- [24] Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: European Conference on Computer Vision. pp. 527–544 (2016)
- [25] Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4694–4702 (2015)
- [26] Peng, X., Schmid, C.: Multi-region two-stream r-cnn for action detection. In: European Conference on Computer Vision. pp. 744–759 (2016)
- [27] Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P.: Learning to refine object segments. In: European Conference on Computer Vision. pp. 75–91. Springer (2016)
- [28] Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604 (2014)
- [29] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016)
- [30] Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1164–1172 (2015)
- [31] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
- [32] Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local svm approach. In: Proceedings of the IEEE International Conference on Pattern Recognition. vol. 3, pp. 32–36 (2004)
- [33] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)
- [34] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015)
- [35] Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using lstms. In: International Conference on Machine Learning. pp. 843–852 (2015)
- [36] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497 (2015)
- [37] Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)
- [38] Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: Mocogan: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993 (2017)
- [39] Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: Proceedings of the International Conference on Learning Representations (2017)
- [40] Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Advances in Neural Information Processing Systems. pp. 613–621 (2016)
- [41] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision. pp. 20–36. Springer (2016)
- [42] Wang, X., Farhadi, A., Gupta, A.: Actions ~ transformations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2658–2667 (2016)
- [43] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687 (2015)
- [44] Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convolutional lstm network: A machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems. pp. 802–810 (2015)
- [45] Yang, Z., Gao, J., Nevatia, R.: Spatio-temporal action detection with cascade proposal and location anticipation. arXiv preprint arXiv:1708.00042 (2017)