CMSN: Continuous Multi-stage Network and Variable Margin Cosine Loss for Temporal Action Proposal Generation
Accurately locating the start and end time of an action in untrimmed videos is a challenging task. One important reason is that action boundaries are not highly distinguishable, and the features around a boundary are difficult to discriminate. To address this problem, we propose a novel framework for temporal action proposal generation, namely Continuous Multi-stage Network (CMSN), which divides a video that contains a complete action instance into six stages, namely Background, Ready, Start, Confirm, End, Follow. To distinguish more accurately between Ready and Start, and between End and Follow, we propose a novel loss function, Variable Margin Cosine Loss (VMCL), which allows for different margins between different categories. Our experiments on THUMOS14 show that the proposed method for temporal proposal generation performs better than the state-of-the-art methods using the same network architecture and training dataset.
In recent years, with the development of deep learning, the classification accuracy of action recognition on trimmed videos has improved significantly [10, 32, 33]. But real-world videos are usually untrimmed, making localization and action classification in untrimmed videos increasingly important. This task, named Action Detection, aims to identify the action class and accurately locate the start and end time of each action in untrimmed videos.
By definition, this task is similar to object detection. While object detection aims to produce a spatial location in a 2D image, action detection aims to produce a temporal location in a 1D sequence of frames. Because of this similarity, many methods for action localization are inspired by advances in object detection [13, 24]: they first generate temporal proposals and then classify each proposal. However, the performance of these methods remains unsatisfactory. As found in previous work [3, 25, 35, 5], this can mainly be attributed to the different properties of action detection and object detection. For example, the duration of action instances varies a lot, from one second to a few hundred seconds, and video streams always contain a lot of information not relevant to the action, so action detection remains a challenging problem.
Because the generation of high-quality proposals is the basis for improving detection performance, in this paper we mainly focus on temporal proposal generation. We observe that an action is continuous at its start and end: it does not start or end suddenly. We therefore argue that the probabilities of the start and end times should not be predicted in isolation, but should be linked to the preceding and following context to form a continuous sequence of features.
We propose the Continuous Multi-stage Network (CMSN), and the architecture of our network is illustrated in Fig. 1. As shown in the figure, we design six stage categories based on the video process, and directly predict the category of each frame. The main contributions of our work are twofold:
We introduce a new architecture (CMSN), which expands a complete action instance and divides it into six stages, namely Background, Ready, Start, Confirm, End, Follow, corresponding to the background, a short period before the start, a short period after the start, the middle stage, a short period before the end, and a short period after the end, and treats the six stages as six categories. Each action instance can then be considered as a continuous category sequence in which the categories around the start time and end time are separated. This strategy could improve recall performance and localization accuracy, and could handle large variations in action duration to some extent.
In order to locate the start and end of an action instance more accurately, we propose the Variable Margin Cosine Loss. The loss function adds a variable angular margin between different classes, which allows similar samples to keep a certain margin and dissimilar samples to have a large margin.
2 Related work
The temporal action detection task focuses on predicting the action classes and temporal boundaries of action instances in untrimmed videos. Most action detection methods are inspired by the success of image object detection [24, 23, 20]. The mainstream methods can be divided into two types: the two-stage pipeline [26, 36, 5, 39, 6] and the single-shot pipeline [2, 18, 38]. In the two-stage pipeline, the first step is to generate proposals, and the second step is to classify them. According to previous works, under the same conditions, the two-stage method performs better than the single-shot method.
In the proposal generation task, earlier works [16, 22, 31] mainly use sliding windows as candidates. Recently, many methods [12, 26, 36, 5] use preset fixed-time anchors to generate proposals. TAG divides an activity instance into three stages, and generates snippets that each contain several consecutive frames. TAG then uses an actionness classifier to evaluate the binary actionness probabilities of individual snippets, and uses a watershed algorithm to generate proposals. SMS assumes that each temporal window begins with a single start frame, followed by one or more middle frames, and finally a single end frame. SMS then predicts the probability that each frame belongs to start, middle or end.
Different from them, we use six stages to represent a complete action instance. Each stage corresponds to a period of an action rather than a frame. We do not just predict the probability of a frame, but predict an action as a continuous multi-stage sequence.
3 Proposed Approach
We propose a novel network for generating proposals in continuous untrimmed video streams, namely Continuous Multi-stage Network (CMSN). As shown in Fig. 2, CMSN takes a video as input, outputs a sequence of action stage categories after the Action Stage Subnet, and outputs a set of proposals for the input video from the final subnet. Next, we describe the Feature Extractor Subnet (FES) and Action Stage Subnet (ASS) in Section 3.1, the Proposal Generation Subnet (PGS) in Section 3.2, and the Proposal Evaluation Subnet (PES) and Soft NMS in Section 3.3.
3.1 Feature Extractor Subnet and Action Stage Subnet
Suppose the input video frames , have dimension , where denotes the frame at time step , and is the total number of frames in the video. Let denote a proposal, is the starting time and is the ending time, thus the duration of is . To make full use of the context information, we expand the starting time to , and expand the ending time to , as done in , but we also divide into three segments by time : , , . In this way, the expanded proposal is divided into five segments: , , , , , which correspond to Ready, Start, Confirm, End, Follow, respectively. Together with the Background stage, we have six action stage categories: Background, Ready, Start, Confirm, End, Follow. So each frame of the input video corresponds to one of the six action stage categories.
Feature Extractor Subnet To extract features from a given video, we could use various convolutional networks for action recognition. In our framework, we adopt the C3D network as FES, and also experiment with the two-stream network. Taking C3D as an example, the input of FES is RGB frames. The input frames with dimension are fed into FES, which outputs base features. The output features have dimension: , where the length, width, and height are scaled relative to the original input frames by the network. Suppose the scale of FES on the length dimension is , then .
Action Stage Subnet As shown in Fig. 2, ASS consists of a CNN network and an LSTM network. The CNN network scales the base features so that they are suited for the LSTM network; it contains two convolutional layers with kernel size and hidden size , and one max-pooling layer. The outputs of the CNN network have size , where is the input channel of the LSTM network. The LSTM network we use is a two-layer bidirectional LSTM, which can make maximal use of the context information. The outputs of the CNN network are fed into the LSTM network, then pass through the classification layer, which finally outputs the action stage category sequence corresponding to the length . Let be the output action stage category sequence, where . Because the output length is scaled, the frame-level action stage categories also need to be scaled down by the same ratio .
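The frame-level labeling described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `label_stages` and the two ratios `context_ratio` and `edge_ratio` are hypothetical stand-ins for the expansion and split amounts defined in the text.

```python
from typing import List

STAGES = ["Background", "Ready", "Start", "Confirm", "End", "Follow"]


def label_stages(num_frames: int, start: float, end: float,
                 context_ratio: float = 0.2,
                 edge_ratio: float = 0.2) -> List[int]:
    """Assign one of the six stage indices to every frame of a video.

    context_ratio and edge_ratio are hypothetical: they set the length of
    the Ready/Follow context and of the Start/End segments as fractions
    of the action duration.
    """
    duration = end - start
    ctx = duration * context_ratio
    edge = duration * edge_ratio
    # Half-open intervals [lo, hi) for the five non-background stages.
    bounds = [
        (start - ctx, start, 1),         # Ready
        (start, start + edge, 2),        # Start
        (start + edge, end - edge, 3),   # Confirm
        (end - edge, end, 4),            # End
        (end, end + ctx, 5),             # Follow
    ]
    labels = [0] * num_frames            # 0 = Background
    for t in range(num_frames):
        for lo, hi, stage in bounds:
            if lo <= t < hi:
                labels[t] = stage
                break
    return labels
```

For a 20-frame clip with an action spanning frames 5 to 15, `label_stages(20, 5, 15)` yields a sequence that runs Background, Ready, Start, Confirm, End, Follow, Background in order.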
We divide the proposal into six segments based on the following observations. First, an action never happens suddenly in a continuous video stream. For example, in Billiards, before the action really starts, the player must approach the pool table and adjust their position. Second, many actions cannot be confirmed in the short period right after they start, and need to be observed for a while. For example, in Long Jump, during the run-up stage, one cannot tell whether the action instance is long-distance running or a long jump. Similarly, before an action ends, we can often predict that it is coming to an end. That is, the start stage, the middle stage and the end stage of an action are often distinguishable. Thus, we argue that dividing an action into three stages describes the action more effectively than a single stage, and that adding the Ready and Follow stages makes more effective use of the context information.
We regard the stages from Ready to Confirm as the stage of increasing probability, and the stages from Confirm to Follow as the stage of decreasing probability. We do not just predict the probability of a certain moment, such as the starting time, but predict the whole action process and model an action instance as a sequence of action stage categories. In this way, the sequence is continuous like the video itself, so if the start and end times are not detected but an intermediate time is detected, we can still estimate the start and end times. The sequence is also separated around the start time and end time, so we can control the margin around them as described in Section 4.2.
3.2 Proposal Generation Subnet
The output of ASS is an action category sequence , which is arranged in the order from Ready to Follow. This usually corresponds well with the ground truth, but the predicted order is sometimes wrong. In the subsequence of the sequence , let denote the Ready sequence, denote the Start sequence, denote the Confirm sequence, denote the End sequence, denote the Follow sequence. As shown in Fig. 3, we choose starting and ending locations, and combine them as candidate proposals as follows:
We choose a location in the action category sequence as a starting location if belongs to or or the first half of . This gives the candidate starting point sets .
We choose a location in the action category sequence as an ending location if belongs to or or the second half of . This gives the candidate ending point sets .
We combine a starting location belonging to and an ending location belonging to into a candidate proposal if the span from the starting location to the ending location contains at least one of the Start, Confirm and Follow stages (excluding the Background stage). Finally, we obtain the candidate proposal point sets .
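The three selection rules above can be sketched as follows. This is a minimal reading of the text, not the authors' code: the helper names are hypothetical, "excluding the Background stage" is interpreted as rejecting any span that contains a Background frame, and the `required` default simply mirrors the stages named in the text.

```python
from itertools import groupby

BACKGROUND, READY, START, CONFIRM, END, FOLLOW = range(6)


def runs(labels):
    """Yield (stage, first_index, last_index) for each maximal run."""
    pos = 0
    for stage, group in groupby(labels):
        length = len(list(group))
        yield stage, pos, pos + length - 1
        pos += length


def candidate_points(labels):
    """Collect candidate starting and ending locations."""
    starts, ends = [], []
    for stage, lo, hi in runs(labels):
        if stage in (READY, START):
            starts.extend(range(lo, hi + 1))
        elif stage in (END, FOLLOW):
            ends.extend(range(lo, hi + 1))
        elif stage == CONFIRM:
            mid = (lo + hi + 1) // 2       # first index of the second half
            starts.extend(range(lo, mid))  # first half of Confirm
            ends.extend(range(mid, hi + 1))  # second half of Confirm
    return starts, ends


def combine_candidates(labels, starts, ends,
                       required=(START, CONFIRM, FOLLOW)):
    """Pair starts with later ends, skipping spans that cross Background."""
    proposals = []
    for s in starts:
        for e in ends:
            if e <= s:
                continue
            span = labels[s:e + 1]
            if BACKGROUND in span:
                continue
            if any(stage in required for stage in span):
                proposals.append((s, e))
    return proposals
```

Rejecting spans that cross Background keeps two neighbouring action instances from being merged into one candidate, at the cost of missing proposals when a frame inside an action is misclassified as Background.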
3.3 Proposal Evaluation Subnet and Soft NMS
The input features of PES are the output of the CNN in ASS with size . We crop the feature sequence according to the candidate proposal sets , and obtain feature sets . Because the features in have different lengths, we use RoI pooling to extract fixed-size volume features. The output of the RoI pooling is fed into two fully connected layers with hidden size , which output the score corresponding to set .
Because the outputs of ASS correspond well with the ground truth, we also assign a preset score to each proposal. Let a sequence from Ready to Follow be , we take the first Start as the starting point , and take the last End as the ending point . For a proposal generated from the sequence , we calculate the distance between the start of the proposal and , denoted as ; and calculate the distance between the end of the proposal and , denoted as . Then we compute the pre-score as follows:
where i is the decay rate. For each proposal, we use the product of the PES score and the preset score as the final score. Finally, we use non-maximum suppression (NMS) to remove redundant proposals.
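The section title refers to Soft NMS [1]. As a rough illustration, the Gaussian variant of Soft-NMS on 1D temporal proposals might look like the sketch below. It assumes each proposal is a `(start, end)` pair that already carries its final score; `sigma` and `score_threshold` are illustrative values, not settings reported in this paper.

```python
import math


def temporal_iou(a, b):
    """Intersection-over-Union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping proposals.

    Returns (proposal, score) pairs in the order they were selected.
    """
    remaining = list(zip(proposals, scores))
    kept = []
    while remaining:
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best, best_score = remaining.pop(best_idx)
        kept.append((best, best_score))
        rescored = []
        for prop, score in remaining:
            iou = temporal_iou(best, prop)
            score *= math.exp(-(iou ** 2) / sigma)  # Gaussian penalty
            if score >= score_threshold:
                rescored.append((prop, score))
        remaining = rescored
    return kept
```

Unlike greedy NMS, an overlapping proposal is not discarded outright: its score is lowered in proportion to the overlap, so it can still be retrieved at larger proposal budgets.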
3.4 Training and Prediction
Training We first train the Action Stage Subnet. For an input video , we need to assign labels to the action stage category sequence of ASS. We use the ground truth to generate the action stage category labels described in Section 3.1. Because the ground truth is frame level and the scale ratio is , as described in Section 3.1, we first generate a frame-level category sequence, then sample the sequence at fps to generate the label sequence corresponding to the action stage category sequence. The loss function is Variable Margin Cosine Loss, and the categories are assigned to corresponding to the stages from Background to Follow. After the Action Stage Subnet is trained, we generate proposals as described in Section 3.2. Then we select proposals by their Intersection-over-Union (IoU) with some ground-truth activity as follows: first, select all proposals with IoU higher than as Big, and suppose the number of Big is ; second, select proposals with IoU higher than and lower than with the same number as Middle; finally, select proposals with IoU lower than with the number as Smaller. Then we train the IoU Evaluation Subnet with Smooth L1 loss .
Prediction For a video to be predicted, we first sample it into frame sequences with the same length as in training , then generate the action stage category sequence with the architecture described in Section 3.1. Because a complete proposal may be truncated at the beginning or end of a frame sequence, contiguous sampled frame sequences overlap by half. Finally, we generate proposals as described in Section 3.2 and Section 3.3.
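The IoU-balanced selection of training proposals and the Smooth L1 regression loss can be sketched as below. This is only an illustration: the thresholds `high` and `low` are hypothetical placeholders, and drawing the same number of Smaller proposals as Big ones is an assumption made here for simplicity. Smooth L1 follows the Fast R-CNN definition [13].

```python
import random


def smooth_l1(pred, target):
    """Smooth L1 loss for a single predicted/target IoU pair."""
    diff = abs(pred - target)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def select_training_proposals(ious, high=0.7, low=0.3, seed=0):
    """Split proposal indices into Big / Middle / Smaller groups by IoU.

    ious[i] is the best IoU of proposal i with any ground-truth instance.
    high and low are hypothetical thresholds. Middle and Smaller are
    subsampled to the size of Big (the Smaller count is an assumption).
    """
    rng = random.Random(seed)
    big = [i for i, v in enumerate(ious) if v > high]
    middle_pool = [i for i, v in enumerate(ious) if low < v <= high]
    small_pool = [i for i, v in enumerate(ious) if v <= low]
    k = len(big)
    middle = rng.sample(middle_pool, min(k, len(middle_pool)))
    smaller = rng.sample(small_pool, min(k, len(small_pool)))
    return big, middle, smaller
```

Balancing the three groups keeps the evaluation subnet from being dominated by the many low-overlap candidates that proposal generation typically produces.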
4 Variable Margin Cosine Loss
4.1 Softmax Loss and Variations
Softmax loss is widely used in classification task and can be formulated as follows:
where denotes the input feature vector of the -th sample, corresponding to the -th class. denotes the -th column of the weight vector and is the bias term. The size of the mini-batch and the number of classes are and , respectively. The decision boundary of Softmax loss is: . Obviously, the boundary does not have a gap between a positive sample and a negative sample. To solve this problem, some variants were proposed, such as CosFace (LMCL), ArcFace and A-softmax. Taking LMCL as an example, the loss function is formulated as follows:
where is the number of training samples, is the -th feature vector corresponding to the ground-truth class of , is the weight vector of the -th class, and is the angle between and . The decision boundary is: .
All these variants have a margin m between positive and negative samples, and could perform well on some classification tasks such as face recognition. But they do not consider the margin value between different negative samples and the same positive sample, that is, the margin is the same for all negative samples and positive samples. There is a similarity relationship between different sample categories, denoted by . We define this relationship as the distance between two categories in the sample space, and define the features set for classification as feature space. Then these variations could not maintain the distance when mapped from the sample space to the feature space. We believe that it is unreasonable to use the same margin for positive and negative samples and hypothesize that the margin should be varied depending on the distance , i.e., the larger is, the larger the margin should be.
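To make the fixed-margin behaviour concrete, here is a minimal sketch of the LMCL computation for a single sample, following the CosFace formulation [31]: a scale `s` is applied to all cosines and a constant margin `m` is subtracted from the target-class cosine only. The default values of `s` and `m` are illustrative, not settings from this paper.

```python
import math


def lmcl_loss(cosines, target, s=30.0, m=0.35):
    """Large Margin Cosine Loss for one sample.

    cosines[j] is cos(theta_j) between the normalized feature and the
    normalized weight vector of class j; target is the ground-truth class.
    """
    logits = [s * (c - m) if j == target else s * c
              for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

Note that every negative class sees the same margin `m`, regardless of how similar it is to the target class, which is the limitation discussed in this section.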
4.2 Variable Margin Cosine Loss
To achieve the above goal, we propose the Variable Margin Cosine Loss (VMCL). Formally, let denote -th class, denote -th class, we define VMCL as follows:
where is a variable that determines the margin between and . The decision boundary is: . could have various forms as long as it guarantees that is bigger when the distance between i and j is bigger. Without loss of generality, we define as follows:
where is the minimum margin, controls the growth rate of the interval. is the predefined value of class i. The setting for is based on the distance between two classes.
When classifying the action stage category, in order to localize the start and end time more accurately, we increase the margin between the Ready stage and the Start stage, and between the End stage and the Follow stage. Let , , , , , , denote Background, Ready, Start, Confirm, End, Follow, respectively. We preset as in Fig. 4, where each category corresponds to one or more values, then we calculate as follows:
This setting makes a large margin between Ready and Start, End and Follow. And the setting also makes a bigger margin when the distance between the two stages is bigger. For example, let and , the margin between Ready and other stages is , which corresponds to stages from Background to Follow.
Compared with Softmax Loss and Variations, VMCL considers the different margin values between different categories, and maps the distance in the sample space to the variable margin in the feature space. And because is variable, we could set a large margin even if the distance is small. VMCL can also have other forms. For example, based on ArcFace, VMCL can be defined as follows:
The symbols are the same as in Equation 5.
For action stage classification, because we set between the action categories Confirm and Start to be smaller than between Confirm and Ready, the probability that VMCL predicts Confirm as Start is higher than the probability of predicting Confirm as Ready. This property can improve recall in theory.
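The variable margin described in this section can be illustrated with the sketch below. It is an assumption-laden example rather than the paper's definition: the margin is taken to be linear, `m + n * |v_i - v_j|`, which is one form consistent with the description (a minimum margin `m`, a growth rate `n`, and a preset value per class); the `PRESET` values are hypothetical and use a single value per stage, whereas the text allows several; and the margin is applied by raising each negative-class cosine, one possible way to realize a pairwise margin.

```python
import math

BACKGROUND, READY, START, CONFIRM, END, FOLLOW = range(6)

# Hypothetical preset values: wider gaps between Ready/Start and
# End/Follow, so those pairs receive larger margins.
PRESET = [0.0, 1.0, 3.0, 4.0, 5.0, 7.0]


def margin(i, j, m=0.1, n=0.05):
    """Assumed linear margin between classes i and j."""
    return m + n * abs(PRESET[i] - PRESET[j])


def vmcl_loss(cosines, target, s=30.0, m=0.1, n=0.05):
    """Variable-margin cosine loss for one sample (illustrative form)."""
    logits = [s * c if j == target
              else s * (c + margin(target, j, m, n))
              for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]
```

With these hypothetical presets, `margin(READY, START)` exceeds `margin(START, CONFIRM)`, and `margin(CONFIRM, START)` is smaller than `margin(CONFIRM, READY)`, matching the qualitative behaviour described in the text.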
5.1 Dataset and Experimental Settings
Dataset We compare our method with the state-of-the-art methods on the temporal action detection benchmark of THUMOS14 . The THUMOS14 dataset contains 200 and 213 temporally annotated untrimmed videos with 20 action classes in the validation and test sets, respectively. On average, each video contains more than 15 action instances. Because the training set of THUMOS14 contains trimmed videos, we follow the settings of previous works: we use the 200 untrimmed videos in the validation set to train our model and evaluate on the test set.
Evaluation metrics For the temporal action proposal generation task, Average Recall (AR) calculated with multiple IoU thresholds is typically used as the evaluation metric. In our experiments, we use the IoU thresholds set on THUMOS14. We also evaluate AR with Average Number of proposals (AN) on THUMOS14, which is denoted as AR@AN.
Implementation details We experiment with the C3D network and the two-stream network. For the C3D network, our implementation uses RGB frames continuously extracted from a video as input. The length of the input is set to , all video frames are resized into , then cropped into . Since the GPU memory is limited, the mini-batch size is set to . So the input dimensions are . The FES adopts the convolutional layers (conv1a to conv5b) of C3D pre-trained on the ActivityNet-1.3 training set, with the first two convolutional layers frozen. The learning rate is set to and the weight decay parameter is , the dropout is . The pre-score decay rate is , n is , and m is . We first train the Action Stage Subnet for three epochs, then train the IoU Evaluation Subnet for four epochs while freezing the Feature Extractor Subnet and the Action Stage Subnet.
For the 2-Stream network, we use the architecture described in . The network uses BN-Inception as the temporal network, and uses ResNet as the spatial network. We use the network pre-trained on ActivityNet-1.3 . The extracted features are concatenated from the output of the last fully-connected layer, with dimensions . Because the shape of the extracted features has changed, we adjust the structure of the ASS: we change the CNN in the ASS from two convolutional layers and one max-pooling layer to one convolutional layer. The rest of the network structure remains unchanged. We first train the Action Stage Subnet for twelve epochs, then train the IoU Evaluation Subnet for ten epochs. The other settings are the same as when using the C3D network.
Comparison of AR@AN We first evaluate the average recall performance with numbers of proposals (AR@AN) and the area under the AR-AN curve. Table 1 summarizes all comparative results on the test set of THUMOS14 using the C3D network. Our method outperforms other methods when AN ranges from 50 to 200. For example, for AR@100, our method improves the performance from the previous record of 37.46% to 44.59%. In particular, for AR@50, our method significantly improves the performance from the previous record of 29.58% to 38.92%, and surpasses the best performance obtained using the 2-Stream network. Table 2 shows the results using the 2-Stream network. It can also be seen that CMSN has the best performance. These results suggest that our method is effective and does not rely on a specific feature extraction network.
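As a rough illustration of the metric, the sketch below computes average recall over a set of IoU thresholds using the top-N proposals of each video. It is a simplification: AN is an average over videos, whereas here the same `top_n` is applied to every video, and the thresholds are passed in explicitly because the exact threshold set is not restated here. The function names are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection-over-Union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_recall(proposals, ground_truths, top_n, thresholds):
    """Average recall over IoU thresholds with top_n proposals per video.

    proposals[v] lists the (start, end) proposals of video v sorted by
    descending score; ground_truths[v] lists its annotated instances.
    """
    recalls = []
    for thr in thresholds:
        matched = total = 0
        for props, gts in zip(proposals, ground_truths):
            top = props[:top_n]
            for gt in gts:
                total += 1
                if any(temporal_iou(p, gt) >= thr for p in top):
                    matched += 1
        recalls.append(matched / total if total else 0.0)
    return sum(recalls) / len(recalls)
```

A ground-truth instance counts as recalled at a threshold if at least one retained proposal overlaps it with IoU at or above that threshold, so raising `top_n` can only increase the result.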
Comparison of different NMS Because most previous works adopt Greedy-NMS  for redundant proposals suppression, we also experiment with the effect of different NMS methods on the results. Results in Table 1 and Table 2 show that the results of different NMS methods are very close, suggesting that our method does not rely on a specific NMS method.
Comparison of AR@AN curves Fig. 5 and Fig. 6 illustrate the Recall@AN=100 curves of different methods on THUMOS14. When using the C3D network, our method performs significantly better than other methods. When using the 2-Stream network, our method also achieves the best performance. Fig. 7 and Fig. 8 illustrate the average recall against the average number of proposals on THUMOS14. It can be seen that our method achieves the best performance. Especially in the low AN region, our method shows obvious advantages. These results suggest that our method could generate proposals of higher quality.
Comparison of different pre-trained models To more accurately evaluate the effects of our designed model and loss function, we present experimental results of CMSN with different pre-trained C3D models in Table 3. In Table 3, UCF-101 denotes that the model is pre-trained on UCF-101 , and ActivityNet denotes that the model is pre-trained on the ActivityNet-1.3 training set. For the C3D model pre-trained on UCF-101, we freeze all convolutional layers (conv1a to conv5b); the other settings are the same as when using the C3D model pre-trained on ActivityNet-1.3. As shown in Table 3, the results of different pre-trained C3D models are very close, suggesting that our method does not depend on a specific pre-trained model.
Comparison with or without PES Table 4 shows the results of CMSN with and without PES, where without PES means that the Proposal Evaluation Subnet and the NMS Subnet are removed. The result without them is very close to the result with PES and the NMS Subnet, and is still better than those reported in previous work. These results suggest that the performance enhancement can mainly be attributed to ASS and VMCL.
Table 5: Results of ASS with different loss functions.
| ASS with Softmax Loss | 34.73 | 38.89 | 42.58 |
| ASS with LMCL | 34.82 | 38.99 | 42.25 |
| ASS with VMCL | 38.52 | 42.41 | 49.35 |
5.3 Ablative Study for VMCL
To demonstrate the effectiveness of different settings in VMCL, we run several ablations. All experiments in this ablation study are performed on the THUMOS14 dataset with the C3D network. To remove the effect of the Proposal Evaluation Subnet, we only compare the results after removing PES and the Soft-NMS Subnet.
Effect of VMCL Table 5 shows the results of different loss functions. With Softmax Loss, we still get a good result, which suggests that the architecture of ASS is reasonable. With LMCL, the margin is the same between all classes, and its performance is worse than that of VMCL. VMCL performs best and has significant advantages over the other loss functions. These results suggest that ASS is effective and that VMCL could achieve better performance.
Effect of m and n Since the parameters m and n of VMCL control the decision boundary, we conduct experiments with different values. As shown in Table 6, the performance is better when and , and degrades when and . This means that a margin that is too small or too large will cause performance degradation.
In this paper, we proposed a Continuous Multi-stage Network (CMSN) and a Variable Margin Cosine Loss (VMCL) for temporal action proposal generation. The CMSN directly predicts the entire sequence of the action process rather than predicting the probability of a single frame. The VMCL could arbitrarily adjust the distance between different classes in the feature space. The results on the THUMOS14 benchmark suggest that our proposed method could significantly improve the localization accuracy.
[1] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pages 5561–5569, 2017.
[2] Shyamal Buch, Victor Escorcia, Bernard Ghanem, Li Fei-Fei, and Juan Carlos Niebles. End-to-end, single-stream temporal action detection in untrimmed videos. In BMVC, volume 2, page 7, 2017.
[3] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. Sst: Single-stream temporal action proposals. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2911–2920, 2017.
[4] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
[5] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1130–1139, 2018.
[6] Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S Davis, and Yan Qiu Chen. Temporal context network for activity localization in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 5793–5802, 2017.
[7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. 2005.
[8] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
[9] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Daps: Deep action proposals for action understanding. In European Conference on Computer Vision, pages 768–784. Springer, 2016.
[10] Basura Fernando, Efstratios Gavves, Jose M Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5378–5387, 2015.
[11] Jiyang Gao, Kan Chen, and Ram Nevatia. Ctap: Complementary temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–83, 2018.
[12] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. Turn tap: Temporal unit regression network for temporal action proposals. In Proceedings of the IEEE International Conference on Computer Vision, pages 3628–3636, 2017.
[13] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[14] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. Bidirectional lstm networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks, pages 799–804. Springer, 2005.
[15] Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.
[16] Svebor Karaman, Lorenzo Seidenari, and Alberto Del Bimbo. Fast saliency based pooling of fisher encoded dense trajectories. In ECCV THUMOS Workshop, volume 1, page 5, 2014.
[17] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3889–3898, 2019.
[18] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia, pages 988–996. ACM, 2017.
[19] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[21] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
[22] Dan Oneata, Jakob Verbeek, and Cordelia Schmid. The lear submission at thumos 2014. 2014.
[23] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[25] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016.
[26] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016.
[27] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
[28] Khurram Soomro, Amir Roshan Zamir, and M Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2012.
[29] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
[30] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.
[31] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge, 1(2):2, 2014.
[32] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4305–4314, 2015.
[33] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
[34] Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, and Xiaoou Tang. Cuhk & ethz & siat submission to activitynet challenge 2016. arXiv preprint arXiv:1608.00797, 2016.
[35] Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716, 2017.
[36] Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE international conference on computer vision, pages 5783–5792, 2017.
[37] Zehuan Yuan, Jonathan C Stroud, Tong Lu, and Jia Deng. Temporal action localization by structured maximal sums. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3684–3692, 2017.
[38] Da Zhang, Xiyang Dai, Xin Wang, and Yuan-Fang Wang. S3d: Single shot multi-span detector via fully 3d convolutional networks. arXiv preprint arXiv:1807.08069, 2018.
[39] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2914–2923, 2017.