Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm predicts temporal intervals of human actions given video-level class labels with no requirement of temporal localization information of actions. This objective is achieved by proposing a novel deep neural network that recognizes actions and identifies a sparse set of key segments associated with the actions through adaptive temporal pooling of video segments. We design the loss function of the network to comprise two terms—one for classification error and the other for sparsity of the selected segments. After recognizing actions with sparse attention weights for key segments, we extract temporal proposals for actions using temporal class activation mappings to estimate time intervals that localize target actions. The proposed algorithm attains state-of-the-art accuracy on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 even with weak supervision.
1 Introduction
Action recognition in videos is a critical problem for high-level video understanding, which is necessary for tasks such as event detection, video summarization, and visual question answering in videos. Researchers have investigated the problem extensively over the last decades. The main challenge in action recognition is the lack of appropriate video representation methods. In contrast to the almost immediate success of convolutional neural networks (CNNs) in many image-based visual recognition tasks, applying deep neural networks to video data is not straightforward because of the inherently complex structure of the data, the high computational demand, and the limited knowledge of how to model temporal information. These issues mean that techniques based on representations from deep learning [15, 25, 31, 36] were not significantly better than methods relying on hand-crafted visual features [18, 32, 33]. As a result, many existing algorithms attempt to achieve state-of-the-art performance by combining hand-crafted and learned features.
Another issue in this problem is the lack of annotations required for video understanding. Most existing techniques assume trimmed videos for video-level classification or rely on annotations of action intervals for temporal localization. Because untrimmed videos typically contain a large number of frames that are irrelevant to their class labels, both video representation learning and action classification are likely to fail due to the difficulty of extracting the salient information from raw videos. On the other hand, annotating a large-scale dataset for action detection is prohibitively expensive and time-consuming, which makes it more practical to develop competitive algorithms that run without such labels.
Our goal is to localize actions in untrimmed videos temporally. To this end, we propose a novel deep neural network that can select a sparse subset of frames useful for action recognition, where the loss function measures the classification error and the sparsity of the frame selection in each video. For localization, Temporal Class Activation Mappings (T-CAMs) are employed to generate one-dimensional temporal action proposals, from which the target actions are localized in the temporal domain. Note that we do not exploit any temporal information from the actions in the target dataset during training, and learn models based only on video-level class labels of actions. The overview of our algorithm is illustrated in Figure 1.
The contributions of this paper are summarized below.
We introduce a principled deep neural network architecture for weakly supervised action recognition and localization on untrimmed videos, where actions are detected from a sparse subset of frames identified by the network.
We present a technique to compute temporal class activation mappings followed by temporal action proposals using learned attention weights for localizing target actions.
2 Related Work
We need proper video datasets to learn the models for action recognition and detection. There are several existing datasets for action recognition including UCF101 [30], Sports-1M [15], HMDB51 [17], and AVA [11]. However, they include only trimmed videos, where target actions appear in all frames throughout videos. In contrast, the THUMOS14 [14] and ActivityNet [12] datasets contain background frames with annotations about which frames are relevant to target actions. Note that each video in THUMOS14 and ActivityNet may have multiple actions even at the same frame.
Action recognition aims to identify a single or multiple actions per video, and is often formulated as a simple classification problem. In the long history of addressing this problem, the algorithm based on improved dense trajectories [32] presented outstanding performance before deep learning started being used actively. Convolutional neural networks are very successful in many computer vision problems and have been applied to action recognition tasks. There are several algorithms focusing on video representation learning and applying the learned representations to action recognition. Two-stream networks [25] and 3D convolutional neural networks (C3D) [31] are popular solutions for video representation, and those techniques, including their variations, are widely used for action recognition. Recently, a combination of the two-stream network and 3D convolution, referred to as I3D [4], was proposed as a generic video representation method. On the other hand, many algorithms develop technologies to learn actions based on existing representation methods [36, 38, 6, 9, 7, 22].
Action detection and localization is a slightly different problem from action recognition, because it requires detection of temporal or spatio-temporal volumes containing target actions. Most approaches for this task rely on supervised learning and employ annotations for action localization to learn the models. There are various existing methods based on deep learning including structured segment network [45], contextual relation learning [29], multi-stage CNNs [24], temporal association of frame-level action detections [10], and techniques using recurrent neural networks [42, 19]. To facilitate action detection and localization, many algorithms employ action proposals [3, 5, 34], which are a straightforward extension of object proposals for object detection in images.
There are only a few approaches based on weakly supervised learning, which rely on video-level labels only to localize actions in the temporal space. UntrimmedNet [35] first extracts proposals to recognize and detect actions, where softmax functions across class labels and action proposals are used for action recognition and localization results. However, the use of the softmax function across proposals may not be effective for detecting multiple instances. Hide-and-seek [28] applies the same technique (hiding random regions to force attention learning) to weakly supervised object detection and action localization. This method works well in spatial localization but not in the temporal domain. Both algorithms are motivated by recent success in weakly supervised object localization in images. In particular, the formulation of UntrimmedNet for action localization relies heavily on the idea proposed in [2].
3 Proposed Algorithm
We describe our weakly supervised temporal action localization algorithm based only on video-level action labels. This goal is achieved by designing a deep neural network for video classification based on a sparse subset of segments and identifying time intervals relevant to target classes.
3.1 Main Idea
We claim that an action can be recognized from a video by identifying a series of key segments presenting important action components. Our algorithm uses a novel deep neural network to predict class labels per video from a subset of segments that are representative of, and unique to, the target actions; these segments are selected automatically from the input video. Note that the proposed deep neural network is designed for classification but has the capability to measure the importance of each segment in predicting classification labels. After finding the relevant classes in each video, we estimate temporal intervals corresponding to the identified actions by computing temporal attention of individual segments, generating temporal action proposals, and aggregating relevant proposals. Our approach relies only on video-level class labels to perform temporal action localization and presents a principled way to extract key segments and determine appropriate time intervals corresponding to target actions. It is possible to recognize and localize multiple actions in a single video using our framework. The deep neural network architecture for our weakly supervised action recognition component is illustrated in Figure 2. We describe each step of our algorithm as follows.
3.2 Action Classification
To predict class labels in each video, we first sample a set of video segments from an input video and extract a feature representation from each segment using pretrained convolutional neural networks. Each of these representations is then fed to an attention module that consists of two fully connected (FC) layers and a ReLU layer located between the two FC layers. The output of the second FC layer is given to a sigmoid function, forcing the generated attention weights to be normalized between 0 and 1. These attention weights are then used to modulate the temporal average pooling, i.e., a weighted sum of the feature vectors, to create a video-level representation. We pass this representation through an FC layer and a sigmoid layer to obtain the class scores.
Formally, let $\mathbf{x}_t \in \mathbb{R}^m$ be the $m$-dimensional feature representation extracted from a video segment centered at time $t$, and $\lambda_t$ be the corresponding attention weight. The video-level representation, denoted by $\bar{\mathbf{x}}$, corresponds to an attention-weighted temporal average pooling, which is given by
$$\bar{\mathbf{x}} = \sum_{t=1}^{T} \lambda_t \mathbf{x}_t, \tag{1}$$
where $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_T)^\top$ is a vector of the scalar outputs from the sigmoid function to normalize the range of activations, and $T$ is the total number of video segments considered for classification. The attention weight vector $\boldsymbol{\lambda}$ is learned with the sparsity constraint in a class-agnostic way. This is useful to identify temporal segments relevant to any action of interest and estimate the time intervals for action candidates.
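The attention module and the weighted pooling can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the layer sizes, the random weights, and the function names `attention_weights` and `pooled_representation` are hypothetical and are not taken from the implementation used in our experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_weights(x, w1, b1, w2, b2):
    """Two FC layers with a ReLU in between, followed by a sigmoid.

    x has shape (T, m): one m-dimensional feature per video segment.
    Returns one attention weight per segment, shape (T,).
    """
    hidden = np.maximum(x @ w1 + b1, 0.0)        # (T, h) after ReLU
    return sigmoid(hidden @ w2 + b2).ravel()     # (T,), each value in [0, 1]

def pooled_representation(x, lam):
    """Attention-weighted temporal pooling: sum over t of lam[t] * x[t]."""
    return lam @ x                               # (m,)

# Illustrative sizes and random parameters (assumptions, not paper settings).
rng = np.random.default_rng(0)
T, m, h = 8, 16, 4
x = rng.normal(size=(T, m))
w1, b1 = rng.normal(size=(m, h)), np.zeros(h)
w2, b2 = rng.normal(size=(h, 1)), np.zeros(1)

lam = attention_weights(x, w1, b1, w2, b2)
x_bar = pooled_representation(x, lam)
```

Because the attention weights are computed per segment and do not depend on the class, the same `lam` can later be reused for every class when localizing actions.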
The loss function in the proposed network is composed of two terms, the classification loss and the sparsity loss, and is given by
$$\mathcal{L} = \mathcal{L}_{\text{class}} + \beta \cdot \mathcal{L}_{\text{sparsity}}, \tag{2}$$
where $\mathcal{L}_{\text{class}}$ denotes the classification loss computed on the video level, $\mathcal{L}_{\text{sparsity}}$ is the sparsity loss, and $\beta$ is a constant to control the trade-off between the two terms. The classification loss is based on the standard multi-label cross-entropy loss between the ground truth and $\bar{\mathbf{x}}$ (after passing through a few layers as illustrated in Figure 2), while the sparsity loss is given by the $\ell_1$ loss on the attention weights, $\|\boldsymbol{\lambda}\|_1$. Since we apply a sigmoid function to each attention weight $\lambda_t$, all the attention weights are likely to have near 0-1 binary values due to the $\ell_1$ loss. Note that integrating the sparsity loss is aligned with our claim that an action can be recognized with a sparse subset of key segments in a video.
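A minimal sketch of this two-term objective is given below. It assumes that the sparsity term is the l1 norm of the attention weights and that the cross-entropy is averaged over classes; the trade-off value `beta=0.1`, the clipping constant `eps`, and the function names are hypothetical choices made for the example.

```python
import numpy as np

def multilabel_cross_entropy(probs, labels, eps=1e-7):
    """Binary cross-entropy averaged over classes (multi-label setting)."""
    probs = np.clip(probs, eps, 1.0 - eps)       # avoid log(0)
    per_class = labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)
    return float(-np.mean(per_class))

def total_loss(probs, labels, lam, beta):
    """Classification loss plus beta times the l1 norm of the attention weights."""
    sparsity = float(np.sum(np.abs(lam)))
    return multilabel_cross_entropy(probs, labels) + beta * sparsity

# Toy example: two classes, three segments (values are illustrative only).
probs = np.array([0.9, 0.1])      # sigmoid class scores for one video
labels = np.array([1.0, 0.0])     # video-level ground truth
lam = np.array([1.0, 0.0, 0.5])   # attention weights
loss = total_loss(probs, labels, lam, beta=0.1)
```

Since the attention weights are already non-negative after the sigmoid, the l1 norm reduces to their plain sum, so minimizing it pushes most weights toward zero.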
3.3 Temporal Class Activation Mapping
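To identify the time intervals corresponding to target actions, we first extract a number of action interval candidates. Based on the idea in [46], we derive a one-dimensional class activation mapping in the temporal domain, referred to as Temporal Class Activation Mapping (T-CAM). Denote by $w^c(k)$ the $k$-th element in the classification model parameter $\mathbf{w}^c$, corresponding to class $c$. The input to the final sigmoid layer for class $c$ is
$$s^c = \sum_{k=1}^{m} w^c(k)\, \bar{x}(k) = \sum_{k=1}^{m} w^c(k) \sum_{t=1}^{T} \lambda_t x_t(k) = \sum_{t=1}^{T} \lambda_t \sum_{k=1}^{m} w^c(k)\, x_t(k). \tag{3}$$
T-CAM, denoted by $\mathbf{a}_t = (a_t^1, a_t^2, \dots, a_t^C)^\top$, indicates the relevance of the representation to individual classes at time step $t$, where each element $a_t^c$ for class $c$ ($c = 1, \dots, C$) is given by
$$a_t^c = \sum_{k=1}^{m} w^c(k)\, x_t(k). \tag{4}$$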
Figure 3 illustrates an example of attention weights and T-CAM outputs in a video given by the proposed algorithm. We can observe that the discriminative temporal regions are highlighted by the attention weights and T-CAMs effectively. Note that some temporal intervals with large attention weights do not correspond to large T-CAM values, because such intervals may represent other actions of interest. The attention weights measure the generic actionness of temporal video segments, while T-CAMs present class-specific information.
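The relation between the class score and the T-CAM can be checked numerically: classifying the pooled feature gives the same pre-sigmoid score as taking the attention-weighted sum of the per-segment class activations. The sketch below assumes a bias-free linear classifier; the sizes, random values, and the name `temporal_cam` are hypothetical.

```python
import numpy as np

def temporal_cam(x, w):
    """Per-segment class activations.

    x: (T, m) segment features, w: (C, m) classifier weights.
    Returns a with a[t, c] = sum over k of w[c, k] * x[t, k], shape (T, C).
    """
    return x @ w.T

rng = np.random.default_rng(0)
T, m, C = 6, 5, 3                      # illustrative sizes
x = rng.normal(size=(T, m))            # segment features
w = rng.normal(size=(C, m))            # classifier weights (no bias assumed)
lam = rng.uniform(size=T)              # attention weights in [0, 1)

a = temporal_cam(x, w)                 # class-specific signal over time
s_from_pooling = w @ (lam @ x)         # classify the pooled representation
s_from_tcam = lam @ a                  # attention-weighted sum of the T-CAM
```

The two score vectors agree up to floating-point error, which is why the T-CAM can be read off the trained classifier without any additional learning.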
3.4 Two-stream CNN Models
We employ the recently proposed I3D model [4] to compute the representations of video segments. Using multiple streams of information such as RGB and optical flow has become a standard practice in action recognition and detection [4, 8, 25], as it often provides a significant boost in performance. We also learn two action recognition networks with identical settings as illustrated in Figure 2 for the RGB and the flow stream. Note that we use the I3D network as a feature extraction machine without any fine-tuning on the target datasets. The two separately trained networks are then fused to localize actions in an input video. The procedure is discussed in the following subsection.
3.5 Temporal Action Localization
For an input video, we first identify relevant class labels based on video-level classification scores from the deep neural network described in Section 3.2. For each relevant action, we generate temporal proposals, i.e., one-dimensional time intervals that potentially consist of multiple segments, with their class labels and confidence scores. The proposals correspond to video segments that potentially enclose target actions and are detected using T-CAMs in our algorithm.
To generate temporal proposals, we first compute T-CAMs from both the RGB and the flow streams using (4) as $a^c_{t,\text{RGB}}$ and $a^c_{t,\text{FLOW}}$, and use them to derive the weighted T-CAMs, $\psi^c_{t,\text{RGB}}$ and $\psi^c_{t,\text{FLOW}}$, as
$$\psi^c_{t,\text{RGB}} = \lambda_{t,\text{RGB}} \cdot \text{sigmoid}(a^c_{t,\text{RGB}}), \tag{5}$$
$$\psi^c_{t,\text{FLOW}} = \lambda_{t,\text{FLOW}} \cdot \text{sigmoid}(a^c_{t,\text{FLOW}}). \tag{6}$$
Note that $\lambda_t$ is an element of a sparse vector $\boldsymbol{\lambda}$, and multiplying by $\lambda_t$ can be interpreted as a soft selection of the value from the following sigmoid function. Similar to [46], we apply thresholds to the weighted T-CAMs, $\psi^c_{t,\text{RGB}}$ and $\psi^c_{t,\text{FLOW}}$, to segment these signals. The temporal proposals are then the one-dimensional connected components extracted from each stream independently. It is intuitive to generate action proposals using the weighted T-CAMs, instead of directly from attention weights, because each proposal should contain a single kind of action. Optionally, we linearly interpolate the weighted T-CAM signals between sampled segments before thresholding to improve the temporal resolution of the proposals.
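The thresholding and connected-component step for one stream and one class can be sketched as follows. This is an illustration under stated assumptions: the threshold of 0.5, the toy signals, and the function names are hypothetical, and the optional linear interpolation is omitted.

```python
import math

def weighted_tcam(lam, a_c):
    """Attention weight times sigmoid of the T-CAM value, per segment."""
    return [l * (1.0 / (1.0 + math.exp(-a))) for l, a in zip(lam, a_c)]

def extract_proposals(signal, threshold):
    """Return inclusive (start, end) index pairs of maximal runs above threshold."""
    proposals, start = [], None
    for t, value in enumerate(signal):
        if value > threshold and start is None:
            start = t                            # a new run begins
        elif value <= threshold and start is not None:
            proposals.append((start, t - 1))     # the current run ends
            start = None
    if start is not None:                        # run reaches the last segment
        proposals.append((start, len(signal) - 1))
    return proposals

# Toy signals for a single class in a single stream.
lam = [0.0, 0.9, 1.0, 0.1, 0.0, 0.8, 0.9]
a_c = [3.0, 2.0, 4.0, 5.0, -1.0, 1.5, 2.5]
psi = weighted_tcam(lam, a_c)
proposals = extract_proposals(psi, threshold=0.5)
```

In this toy example, segments 0 and 3 have large T-CAM values but small attention weights, so they are excluded, and the two remaining runs become separate proposals.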
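Unlike the original CAM-based bounding box proposals [46] where only the largest bounding box is retained, we keep all the connected components that pass the predefined threshold. Each proposal defined by $[t_0, t_1]$ is assigned a score as the weighted mean T-CAM of all the frames within the proposal:
$$\sum_{t=t_0}^{t_1} \lambda_{t,*} \, \frac{\alpha \cdot a^c_{t,\text{RGB}} + (1 - \alpha) \cdot a^c_{t,\text{FLOW}}}{t_1 - t_0 + 1}, \tag{7}$$
where $* \in \{\text{RGB}, \text{FLOW}\}$ and $\alpha$ is a parameter to control the magnitudes of the two modality signals. This value corresponds to the temporal proposal score in each stream for class $c$. Finally, we perform non-maximum suppression among temporal proposals of each class independently to remove highly overlapped detections.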
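A sketch of proposal scoring followed by greedy one-dimensional non-maximum suppression is shown below. It assumes that the T-CAMs of the two streams are mixed with a weight `alpha`, that intervals are inclusive segment indices, and that a proposal is suppressed when its IoU with a kept proposal exceeds a threshold; `alpha`, the IoU threshold of 0.5, and all function names are hypothetical.

```python
def temporal_iou(a, b):
    """IoU of two inclusive integer intervals given as (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def proposal_score(interval, lam, a_rgb, a_flow, alpha):
    """Attention-weighted mean of the fused T-CAM inside the interval."""
    start, end = interval
    total = sum(
        lam[t] * (alpha * a_rgb[t] + (1.0 - alpha) * a_flow[t])
        for t in range(start, end + 1)
    )
    return total / (end - start + 1)

def nms(proposals, iou_threshold):
    """Greedy NMS over (start, end, score) tuples of a single class."""
    kept = []
    for p in sorted(proposals, key=lambda q: q[2], reverse=True):
        if all(temporal_iou(p[:2], q[:2]) <= iou_threshold for q in kept):
            kept.append(p)
    return kept

score = proposal_score((1, 2), lam=[0.0, 1.0, 0.5],
                       a_rgb=[0.0, 2.0, 4.0], a_flow=[0.0, 4.0, 0.0], alpha=0.5)
kept = nms([(0, 9, 0.9), (1, 10, 0.8), (20, 25, 0.7)], iou_threshold=0.5)
```

In the example, the second proposal overlaps the first with an IoU of 9/11 and is removed, while the third does not overlap any kept proposal and survives.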
Our algorithm attempts to localize actions in untrimmed videos temporally by estimating sparse attention weights and T-CAMs for generic and specific actions, respectively. We believe that the proposed method is principled and novel compared with the existing UntrimmedNet [35] algorithm, because it has a unique deep neural network architecture with classification and sparsity losses, and its action localization procedure is based on a completely different pipeline that leverages class-specific action proposals using T-CAMs. Note that [35] follows a similar framework used in [2], where softmax functions are employed across both action classes and proposals; it has a critical limitation in handling multiple action classes and instances in a single video.
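Table 1. Results on the THUMOS14 testing set for fully and weakly supervised methods (mAP in % at each IoU threshold).

| Supervision | Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Fully supervised | Heilbron et al. [13] | – | – | – | – | 13.5 | – | – | – | – |
| | Richard et al. [21] | 39.7 | 35.7 | 30.0 | 23.2 | 15.2 | – | – | – | – |
| | Shou et al. [24] | 47.7 | 43.5 | 36.3 | 28.7 | 19.0 | 10.3 | 5.3 | – | – |
| | Yeung et al. [42] | 48.9 | 44.0 | 36.0 | 26.4 | 17.1 | – | – | – | – |
| | Yuan et al. [43] | 51.4 | 42.6 | 33.6 | 26.1 | 18.8 | – | – | – | – |
| | Escorcia et al. [5] | – | – | – | – | 13.9 | – | – | – | – |
| | Shou et al. [23] | – | – | 40.1 | 29.4 | 23.3 | 13.1 | 7.9 | – | – |
| | Yuan et al. [44] | 51.0 | 45.2 | 36.5 | 27.8 | 17.8 | – | – | – | – |
| | Xu et al. [41] | 54.5 | 51.5 | 44.8 | 35.6 | 28.9 | – | – | – | – |
| | Zhao et al. [45] | 66.0 | 59.4 | 51.9 | 41.0 | 29.8 | – | – | – | – |
| | Alwassel et al. [1]* | 49.6 | 44.3 | 38.1 | 28.4 | 19.8 | – | – | – | – |
| Weakly supervised | Wang et al. [35] | 44.4 | 37.7 | 28.2 | 21.1 | 13.7 | – | – | – | – |
| | Singh & Lee [28] | 36.4 | 27.8 | 19.5 | 12.7 | 6.8 | – | – | – | – |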
4 Experiments
This section first describes the details of our benchmark datasets and the evaluation setup. Then, our algorithm is compared with state-of-the-art techniques based on fully and weakly supervised learning. Finally, we analyze the contribution of individual components in our algorithm.
4.1 Datasets and Evaluation Method
We evaluate the proposed algorithm on two popular action detection benchmark datasets, THUMOS14 [14] and ActivityNet1.3 [12]. Both datasets are untrimmed, meaning that the videos include frames that contain no target action, and we do not exploit the temporal annotations during training. Note that there may exist multiple actions in a single video and even in a single frame in these datasets.
The THUMOS14 dataset has video-level annotations of 101 action classes in its training, validation, and testing sets, and temporal annotations for a subset of videos in the validation and testing sets for 20 classes. We train our model with the 20-class validation subset, which is composed of 200 untrimmed videos, without using the temporal annotations. We evaluate our algorithm using 212 videos in the 20-class testing subset with temporal annotations. This dataset is challenging as some videos are relatively long (up to 26 minutes) and contain multiple action instances. The length of an action varies significantly, from less than a second to minutes.
The ActivityNet dataset is a recently introduced benchmark for action recognition and detection in untrimmed videos. We use ActivityNet1.3, which originally consists of 10,024 videos for training, 4,926 for validation, and 5,044 for testing, covering 200 activity classes. (In our experiments, 9,740, 4,791, and 4,911 videos were accessible from YouTube in the training, validation, and testing sets, respectively.) This dataset contains a large number of natural videos that involve various human activities under a semantic taxonomy.
We follow the standard evaluation protocol based on mean average precision (mAP) values at several different levels of intersection over union (IoU) thresholds. The evaluation of both datasets is conducted using the benchmarking code for the temporal action localization task provided by ActivityNet (https://github.com/activitynet/ActivityNet/blob/master/Evaluation/). The result on the ActivityNet1.3 testing set is obtained by submitting results to the evaluation server.
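To make the metric concrete, the sketch below computes a simplified, non-interpolated average precision for one class at a single temporal IoU threshold. It is not the official benchmarking code, whose details (for example, how precision is interpolated) may differ; the function names and the toy detections are hypothetical.

```python
def tiou(p, g):
    """Temporal IoU of two intervals given as (start, end) in seconds."""
    inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
    union = (p[1] - p[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(detections, ground_truth, iou_threshold):
    """detections: (start, end, score) tuples; ground_truth: (start, end) tuples.

    Each ground-truth instance can be matched at most once; detections are
    visited in order of decreasing score.
    """
    matched, tp_count, precision_sum = set(), 0, 0.0
    ranked = sorted(detections, key=lambda d: d[2], reverse=True)
    for rank, (start, end, _) in enumerate(ranked, start=1):
        best, best_iou = None, iou_threshold
        for i, g in enumerate(ground_truth):
            iou = tiou((start, end), g)
            if i not in matched and iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:                     # true positive
            matched.add(best)
            tp_count += 1
            precision_sum += tp_count / rank     # precision at this recall step
    return precision_sum / len(ground_truth) if ground_truth else 0.0

gt = [(0.0, 10.0), (20.0, 30.0)]
dets = [(1.0, 10.0, 0.9), (40.0, 50.0, 0.8), (21.0, 29.0, 0.7)]
ap_loose = average_precision(dets, gt, iou_threshold=0.5)
ap_strict = average_precision(dets, gt, iou_threshold=0.85)
```

The same detections yield a lower AP at the stricter threshold because the third detection, with an IoU of 0.8, no longer counts as a true positive. Averaging such per-class AP values over classes gives the mAP at that threshold.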
4.2 Implementation Details
We employ the two-stream I3D networks [4] trained on the Kinetics dataset [16] to extract features from individual video segments. For the RGB stream, we rescale the smallest dimension of a frame to 256 and perform the center crop of size . For the flow stream, we apply the TV-L1 optical flow algorithm [39] and truncate the flow magnitude to be in the range of . We add a third channel of all 0's to the optical flow image. The input for I3D is a stack of 16 (RGB or optical flow) frames. To save space and processing time, we subsample the video at 10 frames per second.
In all experiments, we randomly sample segments at uniform intervals from each video in both training and testing. During training, we apply stratified random perturbation to the sampled segments for data augmentation. The network is trained using the Adam optimizer with learning rate . We stop training when the model localization performance reaches its peak on the training set. At testing time, we first reject classes whose video-level probabilities are below , and then retrieve temporal proposals for the remaining classes. Our algorithm is implemented in TensorFlow.
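One possible reading of the segment sampling is sketched below: the video is split into equally sized strata, one segment position is drawn at random inside each stratum during training, and the stratum center is used otherwise. This interpretation, the deterministic behavior outside training, and the name `sample_segment_centers` are assumptions made for illustration.

```python
import random

def sample_segment_centers(num_frames, num_segments, training, rng=None):
    """Pick one frame index per stratum of the range [0, num_frames)."""
    rng = rng or random.Random()
    stride = num_frames / num_segments           # length of one stratum
    centers = []
    for i in range(num_segments):
        # Random offset inside the stratum when training, midpoint otherwise.
        offset = rng.random() if training else 0.5
        centers.append(min(int((i + offset) * stride), num_frames - 1))
    return centers

test_centers = sample_segment_centers(100, 4, training=False)
train_centers = sample_segment_centers(100, 4, training=True, rng=random.Random(0))
```

Keeping exactly one sample per stratum preserves uniform temporal coverage while still exposing the network to slightly different segments in every epoch.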
4.3 Results
Table 1 summarizes the test results on the THUMOS14 dataset for all published action localization methods in the past two years. We included both fully and weakly supervised approaches in the table. Our algorithm outperforms the other two existing approaches based on weakly supervised learning [35, 28]. Even with significant difference in the level of supervision, our algorithm presents competitive performance to several recent fully supervised approaches.
We also present the performance of our algorithm on the validation and the testing sets of the ActivityNet1.3 dataset in Tables 2 and 3, respectively. We can see that our algorithm outperforms some fully supervised approaches on both the validation and the testing sets. Note that most of the available action localization results on ActivityNet1.3 are from ActivityNet Challenge submissions, and we do not believe they are directly comparable to our algorithm. To our knowledge, this is the first attempt to evaluate weakly supervised action localization performance on this dataset, and we report the results as a baseline for future reference.
The qualitative results on the THUMOS14 dataset are demonstrated in Figure 4. As mentioned in Section 4.1, videos in this dataset are often long and contain many action instances, which may even come from different categories. Figure 4(a) shows an example with many action instances along with our predictions and the corresponding T-CAM signals. Even with video-level labels only, our algorithm effectively pinpoints temporal boundaries of action instances. In Figure 4(b), the appearances of all frames are similar and there is little motion between frames. Despite this challenge, our model still localizes the target action fairly well. Figure 4(c) illustrates an example of a video containing action instances from two different classes. Visually, the two involved action classes, Shotput and ThrowDiscus, look similar in their appearances (green grass, person with blue shirt, on a gray platform) and motion patterns (circular throwing). Our algorithm is still able not only to localize the target action but also to classify the action category of the proposed window successfully, although there are several short-term false positives. Figure 4(d) shows an instructional video for JavelinThrow, where our algorithm detects most of the ground-truth action instances while it also generates many false positives. There are two causes for the false alarms. First, the ground-truth annotations for JavelinThrow are often missing, so true detections are counted as false positives. The second source is related to the segments where the instructors demonstrate javelin throwing but only parts of such actions are visible. These segments resemble a real JavelinThrow action in both appearance and motion.
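Table 2. Results on the ActivityNet1.3 validation set (mAP in % at each IoU threshold).

| Supervision | Method | 0.5 | 0.75 | 0.95 |
|---|---|---|---|---|
| Fully supervised | Singh & Cuzzolin [27]* | 34.5 | – | – |
| | Wang & Tao [37]* | 45.1 | 4.1 | 0.0 |
| | Shou et al. [23]* | 45.3 | 26.0 | 0.2 |
| | Xiong et al. [40]* | 39.1 | 23.5 | 5.5 |
| | Montes et al. [20] | 22.5 | – | – |
| | Xu et al. [41] | 26.8 | – | – |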
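Table 3. Results on the ActivityNet1.3 testing set.

| Supervision | Method | mAP (%) |
|---|---|---|
| Fully supervised | Singh & Cuzzolin [27]* | 17.83 |
| | Wang & Tao [37]* | 14.62 |
| | Xiong et al. [40]* | 26.05 |
| | Singh et al. [26] | 17.68 |
| | Zhao et al. [45] | 28.28 |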
4.4 Ablation Study
We investigate the contribution of several components proposed in our weakly supervised architecture and of several implementation variations. All the experiments for our ablation study are performed on the THUMOS14 dataset.
Choice of architectures
Our premise is that an action can be recognized by a sparse subset of segments in a video. When we learn our action classification network, two loss terms, the classification and sparsity losses, are employed. Our baseline is the architecture without the attention module and the sparsity loss, which shares its motivation with the architecture in [46]. We also test another baseline with the attention module but without the sparsity loss. Figure 5 shows comparisons between our baselines and the full model. We observe that both the sparsity loss and the attention-weighted pooling make substantial contributions to performance improvement.
Choice of features
As mentioned in Section 3.4, the representation of each temporal segment is based on the two-stream I3D network, which employs two sources of information: one from the RGB image and the other from optical flow. Figure 6 illustrates the effectiveness of each modality and their combination. When comparing the individual performance of each modality, the flow stream offers stronger performance than the RGB stream. Similar to action recognition, the combination of these modalities provides significant performance improvement.
5 Conclusion
We presented a novel weakly supervised temporal action localization technique, which is based on deep neural networks with classification and sparsity losses. The classification is performed by evaluating a video-level representation given by a sparsely weighted mean of segment-level features, where the sparse coefficients are learned with the sparsity loss in our deep neural network. For weakly supervised temporal action localization, one-dimensional action proposals are extracted, from which proposals relevant to target classes are selected to present time intervals of actions. The proposed approach achieved state-of-the-art results on the THUMOS14 dataset and, to the best of our knowledge, we are the first to report weakly supervised temporal action localization results on the ActivityNet1.3 dataset.
References
- [1] H. Alwassel, F. C. Heilbron, and B. Ghanem. Action search: Learning to search for human activities in untrimmed videos. arXiv preprint arXiv:1706.04269, 2017.
- [2] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
- [3] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: single-stream temporal action proposals. In CVPR, 2017.
- [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
- [5] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. DAPs: deep action proposals for action understanding. In ECCV, 2016.
- [6] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
- [7] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
- [8] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
- [9] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
- [10] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
- [11] C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421, 2017.
- [12] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: a large-scale video benchmark for human activity understanding. In CVPR, 2015.
- [13] F. C. Heilbron, J. C. Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR, 2016.
- [14] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes, 2014.
- [15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- [16] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [17] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
- [18] I. Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
- [19] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
- [20] A. Montes, A. Salvador, S. Pascual, and X. Giro-i Nieto. Temporal activity detection in untrimmed videos with recurrent neural networks. In 1st NIPS Workshop on Large Scale Computer Vision Systems (LSCVS), 2016.
- [21] A. Richard and J. Gall. Temporal action detection using a statistical language model. In CVPR, 2016.
- [22] Y. Shi, Y. Tian, Y. Wang, W. Zeng, and T. Huang. Learning long-term dependencies for action recognition with a biologically-inspired deep network. In ICCV, 2017.
- [23] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017.
- [24] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, 2016.
- [25] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- [26] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
- [27] G. Singh and F. Cuzzolin. Untrimmed video classification for activity detection: submission to ActivityNet challenge. arXiv preprint arXiv:1607.01979, 2016.
- [28] K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, 2017.
- [29] K. Soomro, H. Idrees, and M. Shah. Action localization in videos through context walk. In ICCV, 2015.
- [30] K. Soomro, A. R. Zamir, and M. Shah. UCF101: a dataset of 101 human action classes from videos in the wild. Technical Report CRCV-TR-12-01, University of Central Florida, 2012.
- [31] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
- [32] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
- [33] L. Wang, Y. Qiao, and X. Tang. Motionlets: Mid-level 3D parts for human motion recognition. In CVPR, 2013.
- [34] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. In CVPR, 2016.
- [35] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. UntrimmedNets for weakly supervised action recognition and detection. In CVPR, 2017.
- [36] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
- [37] R. Wang and D. Tao. UTS at ActivityNet 2016. ActivityNet Large Scale Activity Recognition Challenge, 2016.
- [38] Y. Wang, M. Long, J. Wang, and P. S. Yu. Spatiotemporal pyramid network for video action recognition. In CVPR, 2017.
- [39] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers. An improved algorithm for TV-L1 optical flow. In Statistical and Geometrical Approaches to Visual Motion Analysis. Springer, 2009.
- [40] Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716, 2017.
- [41] H. Xu, A. Das, and K. Saenko. R-C3D: region convolutional 3D network for temporal activity detection. In ICCV, 2017.
- [42] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016.
- [43] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. Temporal action localization with pyramid of score distribution features. In CVPR, 2016.
- [44] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. Temporal action localization by structured maximal sums. In CVPR, 2017.
- [45] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.
- [46] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.