Detecting the Starting Frame of Actions in Video
To understand causal relationships between events in the world, it is useful to pinpoint when actions occur in videos and to examine the state of the world at and around that time point. For example, one must accurately detect the start of an audience response (laughter in a movie, cheering at a sporting event) to understand the cause of the reaction. In this work, we focus on the problem of accurately detecting action starts rather than isolated events or action ends. We introduce a novel structured loss function based on matching predictions to true action starts that is tailored to this problem; it penalizes extra and missed action start detections more heavily than small misalignments. Recurrent neural networks are used to minimize a differentiable approximation of this loss. To evaluate these methods, we introduce the Mouse Reach Dataset, a large, annotated video dataset of mice performing a sequence of actions. The dataset was labeled by experts for the purpose of neuroscience research on causally relating neural activity to behavior. On this dataset, we demonstrate that the structured loss leads to significantly higher accuracy than a baseline of mean-squared error loss.
Iljung S. Kwak (iskwak@cs.ucsd.edu) 1, David Kriegman (kriegman@ucsd.edu) 2, Kristin Branson (bransonk@hhmi.janelia.org) 1
1 Janelia Research Campus; 2 UCSD
BMVC
Video-based action recognition tasks are generally framed in one of two ways. In action classification [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei, Soomro et al.(2012)Soomro, Zamir, and Shah, Idrees et al.(2017)Idrees, Zamir, Jiang, Gorban, Laptev, Sukthankar, and Shah], the goal is to assign a single category to a trimmed video. In fine-grained action detection or segmentation [Idrees et al.(2017)Idrees, Zamir, Jiang, Gorban, Laptev, Sukthankar, and Shah, Caba Heilbron et al.(2015)Caba Heilbron, Escorcia, Ghanem, and Carlos Niebles], the goal is to determine time intervals (start and end frames) of each action category. In this work, we propose the intermediate problem of detecting and classifying the starting frame of each action bout. We believe that this is a practically important problem for the following reasons.
First, to understand causal relationships in the world, one could detect the start of an action, then examine the state of the world at or just prior to the action start. For example, to understand why a basketball player dribbled, one could examine the relative positions of all other players on the court at the start of the dribble. Or, to understand which plays in a game were important, one could detect the start of cheering by the crowd, and examine the play at that time point. In neuroscience, pinpointing the start of actions is extremely important. Researchers detect the starting time of a behavior, then examine neural activity just prior to this to understand how the brain controls behavior [Sauerbrei et al.(2018)Sauerbrei, Guo, Zheng, Guo, Kabra, Verma, Branson, and Hantman] (Figure 1a). Thus, we believe that detecting the start of an action is often more important than detecting its end, and, for many applications, should be the focus. Second, it is sometimes the case that the start of an action is at a well-defined time point, while the end of the action may be ambiguous and more difficult to localize (Figure 1a). At a basketball game, fans may start applauding simultaneously, but the end of the reaction is more difficult to pinpoint and is thus difficult for human annotators to label. Third, many actions may consist of multiple concatenated bouts of the same movement. For example, walking consists of individual strides. However, existing data sets do not contain examples of abutting bouts of the same action, nor temporally overlapping bouts of different actions. Representing the starts of behaviors would facilitate these interpretations.
There are three classes of errors that can be made in detecting action starts. One can miss an action start (false negative), have an extra action start (false positive), or be offset from the true start by some number of frames. Using an unstructured, per-frame error between the true and predicted action starts would incorrectly penalize being offset by a small number of frames more than having a false positive or negative (Figure 1c). Thus, we propose a structured loss that involves finding the best match between action start predictions and labels, and this allows a proper weighting of each of these three types of errors. We propose using a recurrent neural network (RNN) to minimize this structured loss using gradient descent. As this loss is not differentiable, we also propose to minimize a differentiable proxy of this based on the Earth Mover’s Distance (EMD).
We introduce a new video data set, The Mouse Reach Dataset, that has been annotated with the starting frames of a set of behaviors. In particular, the data set consists of videos of mice performing a task that starts with reaching for a food pellet and ends with chewing that pellet; when successful, the task consists of a sequence of six actions (Figure 1a). The sequence has strong temporal structure that can be exploited by an RNN, but can also vary substantially. Actions may be repeated, such as when a mouse fails to grab the food pellet on the first attempt and then tries again. We show that an RNN trained to minimize either of our structured losses outperforms an RNN trained to minimize the per-frame loss. Furthermore, reaching tasks are often used in rodents and primates to study motor control, learning, and adaptation, and tools for automatic quantification of reach behavior would have immediate impact on neuroscience research [Whishaw and Pellis(1990), Guo et al.(2015)Guo, Graves, Guo, Zheng, Lee, Rodriguez-Gonzalez, Li, Macklin, Phillips, Mensh, et al., Sauerbrei et al.(2018)Sauerbrei, Guo, Zheng, Guo, Kabra, Verma, Branson, and Hantman, Krakauer et al.(2019)Krakauer, Hadjiosif, Xu, Wong, and Haith].
In summary, we (a) propose action start detection as an important problem in activity recognition, (b) introduce a novel structured loss function and show how RNNs can be used to minimize it, and (c) contribute a new, real-world dataset for fine-grained action start detection that has been annotated in the course of neuroscience research. We describe our algorithm in Section 3, our dataset in Section 4, and experimental results in Section 5.
2 Related Work
Although we focus on detecting the start of an action, our work has many similarities to fine-grained action detection, in which the goal is to temporally localize the duration of a behavior, or to predict the start and end of a behavior. To incorporate the temporal structure of video data, 3D convolutional networks and recurrent networks have been used to detect actions [Shou et al.(2017)Shou, Chan, Zareian, Miyazawa, and Chang, Xu et al.(2017)Xu, Das, and Saenko, Singh et al.(2016)Singh, Marks, Jones, Tuzel, and Shao, Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras]. Following the success of object proposals [Uijlings et al.(2013)Uijlings, Van De Sande, Gevers, and Smeulders, Ren et al.(2015)Ren, He, Girshick, and Sun] for object detection, algorithms for proposing temporal segments for action classification have been developed [Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia, Escorcia et al.(2016)Escorcia, Heilbron, Niebles, and Ghanem]. Our work follows past recurrent network algorithms that provide a label for each frame.
Action detection often leverages feature representations developed for action recognition [Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras, Escorcia et al.(2016)Escorcia, Heilbron, Niebles, and Ghanem, Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia]. Action recognition focuses on classifying trimmed videos. Large-scale action recognition datasets [Carreira and Zisserman(2017), Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei] have helped produce strong representations of short video snippets, which then can be used by detection algorithms. 3D convolutional networks leverage lessons learned from successful image recognition networks [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] and simultaneously learn appearance and motion information. Recurrent models, such as LSTMs, have been used to model long-range temporal interactions [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei]. More recently, two-stream networks [Simonyan and Zisserman(2014), Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool, Carreira and Zisserman(2017)] have been successful at action recognition. In our work, we use two-stream feature representations as inputs to our detection model.
Online detection of action start (ODAS) [Shou et al.(2018)Shou, Pan, Chan, Miyazawa, Mansour, Vetro, Giro-i Nieto, and Chang] is the most similar past work to ours. In ODAS, the goal is to accurately detect the start of an action for use in real-time systems. In contrast, our work focuses on offline detection of action starts to understand causes of behaviors, for example to formulate hypotheses of brain activity that caused a behavior. Both offline and online start detection face similar difficulties with label sparsity. In this work, we provide a dataset for which the accuracy of the action start labels was the main focus in dataset creation. Our dataset will be useful for both online and offline action start detection research.
3.1 Problem Formulation
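Let $X = [x_1, x_2, \ldots, x_T]$ be a sequence of $T$ video frames, where $x_t \in \mathbb{R}^d$ is the feature representation of each frame. The goal of our work is to predict, for each frame $t$ and behavior $b$, whether the frame corresponds to the start of a behavior bout ($y_{tb} = 1$) or not ($y_{tb} = 0$). Let $Y = [y_1, y_2, \ldots, y_T]$ be the sequence of ground truth labels for $X$, where $y_t \in \{0, 1\}^B$ and $B$ is the number of behaviors.
Let $\hat{Y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T]$ be a predicted sequence of labels. We propose to measure the following structured error. We match behavior starts $\{t : y_{tb} = 1\}$ with predictions $\{t : \hat{y}_{tb} = 1\}$. Each label can be matched with at most one prediction within $\tau$ frames. Labeled starts without a matched predicted start are false negatives (FN) and get a fixed penalty of $C_{fn}$. Predicted starts without a matched true start are false positives (FP) and get a fixed penalty $C_{fp}$.
More formally, let $M$ be a matching from true to predicted starts, where $y_{tb} = 1$ and $M(t, b) = t' > 0$ means that the true start of behavior $b$ at frame $t$ is matched to a predicted start of behavior $b$ at frame $t'$, and $M(t, b) = 0$ indicates that the true start at frame $t$ is not matched. Similarly, let $M'$ denote an inverse matching from predictions to labels consistent with $M$. Then, our error criterion can be written as a minimum over matchings $M$:
$$\mathrm{Err}(Y, \hat{Y}) = \min_{M} \sum_{b=1}^{B} w_b \Big[ \sum_{t : y_{tb} = 1} \big( I(M(t, b) > 0)\, |t - M(t, b)| + I(M(t, b) = 0)\, C_{fn} \big) + \sum_{t : \hat{y}_{tb} = 1} I(M'(t, b) = 0)\, C_{fp} \Big]$$
where $w_b$ is a weight for behavior $b$ (usually set to one over the number of true starts of that behavior) and $I$ is the indicator function. Note that $M(t, b) = 0$ for false negatives and $M'(t, b) = 0$ for false positives.
We can use the Hungarian algorithm [Kuhn(1955)] to efficiently compute the optimal matching in this criterion. For each behavior, our bipartite graph consists of two sets of $N + \hat{N}$ nodes, where $N$ and $\hat{N}$ are the number of true and predicted action starts. In the first set, the first $N$ nodes correspond to true starts and the last $\hat{N}$ nodes correspond to false positives. In the second set, the first $\hat{N}$ nodes correspond to predicted starts and the last $N$ to false negatives. The distance matrix is then
$$D(i, j) = \begin{cases} |t_i - \hat{t}_j| & i \le N,\ j \le \hat{N},\ |t_i - \hat{t}_j| \le \tau \\ C_{fn} & i \le N,\ j > \hat{N} \\ C_{fp} & i > N,\ j \le \hat{N} \\ 0 & i > N,\ j > \hat{N} \\ \infty & \text{otherwise} \end{cases}$$
where $t_i$ is the $i$th true action start and $\hat{t}_j$ is the $j$th predicted action start.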
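As a rough illustration of the matching described above, the sketch below builds the padded square cost matrix for a single behavior and finds the cheapest assignment. It is not the authors' implementation: the names `tau`, `c_fn`, and `c_fp`, their default values, and the choice to forbid matches farther apart than `tau` with an infinite cost are all assumptions. For brevity it searches all permutations instead of running the Hungarian algorithm, which is only practical for a handful of starts; a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at scale.

```python
from itertools import permutations


def matching_error(true_starts, pred_starts, tau=10, c_fn=10.0, c_fp=10.0):
    """Minimum-cost matching error for one behavior (unweighted).

    tau, c_fn and c_fp are hypothetical names and values chosen for
    illustration; they stand for the matching window and the fixed
    false-negative and false-positive penalties.
    """
    n, n_hat = len(true_starts), len(pred_starts)
    size = n + n_hat
    inf = float("inf")

    # Rows: n true starts followed by n_hat "false positive" dummy nodes.
    # Columns: n_hat predictions followed by n "false negative" dummy nodes.
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < n_hat:
                d = abs(true_starts[i] - pred_starts[j])
                # Assumption: pairs farther apart than tau cannot be matched.
                cost[i][j] = d if d <= tau else inf
            elif i < n:
                cost[i][j] = c_fn  # true start i is left unmatched
            elif j < n_hat:
                cost[i][j] = c_fp  # prediction j is left unmatched
            # Otherwise dummy meets dummy, which costs nothing.

    # Brute force over all assignments (a stand-in for the Hungarian algorithm).
    return min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
```

For example, with true starts at frames 100 and 200 and predictions at frames 103 and 250, the first pair is matched at a cost of 3 frames, while the remaining true start and prediction each incur their fixed penalty.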
3.2 Matching Loss
We propose a structured loss based on this error criterion, which we refer to as the Matching Loss. The outputs of our classifiers are continuous values. To compute this loss, we binarize the classifier outputs by thresholding and non-maximal suppression, resulting in a sequence of predicted action starts $\hat{Y}$. We use $\hat{Y}$ to select an optimal matching $M$ using the Hungarian algorithm, as described above. Then, we minimize the following loss, which is a differentiable function of the continuous classifier outputs:
where and are parameters weighing the importance of false negatives and false positives to misalignments of true positives.
Note that our loss can be applied to any matching, but we happen to choose the optimal matching. With this loss, we can directly enforce the importance of predicting near the true behavior start frame while avoiding spurious predictions. A correct prediction is penalized by the distance to the true behavior start frame and the confidence of the network output. Any prediction that is not matched will be penalized by the network's output score.
Given the matching $M$, this loss is differentiable and thus can be minimized using gradient descent. However, selecting the optimal matching is not differentiable. Following [Stewart et al.(2016)Stewart, Andriluka, and Ng, Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras], we iteratively hold the network fixed and select the optimal matching, then fix the matching and apply gradient descent to optimize the network. In our experiments, we found that, using this training procedure, our networks were able to learn to localize behavior start locations. One concern with our loss function is that it is not fully differentiable. For example, suppose we have a predicted start matched to a true start. While the total loss would decrease if the prediction were closer to the true start frame, gradient descent on our loss with a fixed matching will not move it there.
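The binarization step used before matching (thresholding followed by non-maximal suppression) could look roughly like the sketch below for a single behavior. The function name, the `threshold` and `window` parameters, their default values, and the tie-breaking rule are assumptions made for illustration; the text does not specify them.

```python
def detect_starts(scores, threshold=0.5, window=5):
    """Binarize per-frame scores for one behavior.

    A frame is kept if its score reaches the threshold and it is the
    maximum within +/- window frames. The threshold and window values
    are illustrative assumptions.
    """
    starts = []
    for t, s in enumerate(scores):
        if s < threshold:
            continue
        lo = max(0, t - window)
        hi = min(len(scores), t + window + 1)
        # Keep t only if it holds the neighborhood maximum; on ties the
        # earliest frame in the neighborhood wins.
        if s == max(scores[lo:hi]) and scores.index(s, lo, hi) == t:
            starts.append(t)
    return starts
```

The returned frame indices play the role of the predicted action starts that are passed to the matching step.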
3.3 Wasserstein/EMD Loss
This Matching Loss penalty has similarities to the Earth Mover's Distance (EMD). Fortunately, EMD can be computed with the differentiable 1-Wasserstein Loss, and we can use this as an approximation of the Matching Loss of Section 3.2. Similar to [Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras], we use the squared EMD loss, a variant of the 1-Wasserstein Distance, as an alternative structured cost for the sequence. Unlike [Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras], we do not apply a matching before computing the loss. Additionally, the Wasserstein loss is applied to all predictions for a behavior simultaneously, rather than to each prediction separately. This allows the loss to be completely differentiable. The predicted label sequence $\hat{Y}$ and the ground truth label sequence $Y$ are first normalized as follows,
We add a small constant $\epsilon$ in case there are no labels or predictions in a given behavior class for a video. Our Wasserstein structured loss is then defined as the sum of the cumulative differences over all behaviors:
The Wasserstein Loss enforces our goal of reducing false positives by linking multiple predictions. Minimizing Wasserstein Loss tends to transfer probability mass closer to sharp spike labels. This maps all network predictions to true action start locations and when multiple predictions contribute to a true action start, the loss will penalize the extra mass (Figure 1c). In contrast, a per-frame loss will treat spurious predictions separately, and penalize multiple predictions less aggressively.
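To make the behavior of a squared EMD loss concrete, here is a minimal forward-pass sketch for a single behavior in plain Python. It relies on the fact that, for one-dimensional sequences of equal mass, the EMD can be computed from differences of cumulative sums. The exact normalization is an assumption (a small constant `eps` is added to every frame before dividing by the total), as are the function and parameter names; an actual training loss would be written in an automatic differentiation framework.

```python
def squared_emd_loss(labels, preds, eps=1e-6):
    """Squared EMD between two 1-D sequences for one behavior.

    Assumed normalization: eps is added to every frame and each sequence
    is divided by its total, so both sum to one even when a sequence
    contains no labels or predictions.
    """
    n_frames = len(labels)
    label_total = sum(labels) + eps * n_frames
    pred_total = sum(preds) + eps * n_frames

    loss = 0.0
    cdf_diff = 0.0  # running difference between the two cumulative sums
    for y, p in zip(labels, preds):
        cdf_diff += (y + eps) / label_total - (p + eps) / pred_total
        loss += cdf_diff ** 2
    return loss
```

With a single labeled start, a prediction one frame late leaves the cumulative sums different for one frame, while a prediction five frames late leaves them different for five frames, so the loss grows with the misalignment.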
3.4 Per-Frame Loss
We define a per-frame loss, $L_f$, as the mean squared error between $Y$ and $\hat{Y}$. We found it is important to include this in a combined loss, as the structured losses ($L_M$ and $L_W$) sometimes struggle to localize behaviors early in the training process. This is especially true for the Matching Loss, since there may not be any initial predictions that pass the network classifier score thresholding.
3.5 Combined Loss
Similar to [Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras], we use a combined loss function
$$L(Y, \hat{Y}) = \lambda\, L_f(Y, \hat{Y}) + (1 - \lambda)\, L_s(Y, \hat{Y})$$
where $L_f$ is a per-frame loss, the structured loss $L_s$ is either $L_M$ or $L_W$, and $\lambda$ is a hyperparameter between 0 and 1. In our experiments, we found it can be useful to vary the importance of the two loss functions during training.
4 The Mouse Reach Dataset
We introduce a new video dataset for detecting action starts, which we call the Mouse Reach Dataset. This dataset was collected by neuroscientists to research the relationship between brain activity and the start of behaviors. Unlike most action detection datasets, where the duration of the bout is labeled, the neuroscientists are only interested in the start of an action and labeled the data accordingly. This provides an interesting opportunity for computer vision research to develop tools to automatically detect action starts for neuroscience research. The dataset can be downloaded at http://research.janelia.org/bransonlab/MouseReachData/.
The dataset contains recordings of four different mice, named M134, M147, M173, and M174, attempting to grab and eat a food pellet. The mice are in a fixed position and are recorded multiple times a day for several days. Each recording consists of two videos recorded from different view points. The high speed videos are recorded at 500 frames per second in near infrared. The dataset contains a total of 1169 videos.
The biologists labeled the starting frame for six different behaviors, which are described as follows. "Lift" occurs when the mouse begins to lift its paw from the perch. "Hand-open" occurs when the mouse begins to open its paw to grab a pellet. "Grab" is when the mouse begins to close its paw around a food pellet. "Supinate" is the mouse turning the paw towards its mouth. "At-mouth" is when the paw is at the mouth and the pellet is starting to be placed into the mouth. "Chew" occurs when the food pellet is in the mouth, and the mouse starts to eat the pellet. It is important to note that the food pellet may not be in the mouse's paw for the "Grab" and "Supinate" behaviors; however, it is necessary for the "At-mouth" and "Chew" behaviors. The most common behavior is "Hand-open" with 2227 labels, or about 1.91 labeled instances per video. In contrast, the least common behavior is "Chew", with 664 labeled instances. The supplementary material shows a more detailed breakdown of the dataset statistics.
In these videos, the mouse’s behavior leads to an ordered sequence of labels since, for example, the mouse cannot eat a food pellet without grabbing it. The most common sequence of labels is "Lift", "Hand-open", "Grab", "Supinate", "At-mouth", and "Chew". However the mouse may fail to grab and eat the food pellet, and when this happens, it will often attempt to grab it again. In turn, this results in an imbalance in labels.
We believe that this dataset will provide computer vision researchers an opportunity to work with high quality behavior start labels. As mentioned previously, bout boundary detection has gained interest in the vision research community, and this provides an excellent dataset for focusing on that research while providing useful tools for neuroscientists.
5.1 Model Implementation
For all experiments, we used pre-computed or fine-tuned features as inputs to a recurrent neural network. Our complete model consists of a fully connected layer, ReLU, Batch Normalization, two bi-directional LSTM layers, a fully connected layer and a sigmoid activation layer. The LSTMs each have 256 hidden units. We used Adam with a different learning rate depending on the loss. The network was trained for 400 epochs with a batch size of 10.
Two types of input features were used: HOG+HOF [Dalal and Triggs(2005)] and I3D [Carreira and Zisserman(2017)]. The HOG+HOF features are hand-designed features that capture image gradients and motion gradients. For the mouse dataset, we computed these features on overlapping windows on each view point, resulting in an 8000-dimensional feature vector. I3D is a state-of-the-art action recognition network that uses sets of RGB and optical flow frames as input. We used the output of the last average pooling layer before the convolutional classification layer as the I3D feature representation. For each frame in the video sequence, I3D was applied to a 64-frame window centered on the input frame. We refer to the features from the model trained on the Kinetics dataset as Canned I3D. We also fine-tune the I3D network on our dataset by training it to classify frames of the videos as one of the behavior starts or background. This feature set is referred to as Finetuned I3D. HOG+HOF features will be provided with the dataset.
We trained RNNs with each of the three feature types (HOG+HOF, Canned I3D, and Finetuned I3D) and each of three losses: Matching (Section 3.2), Wasserstein (Section 3.3), and MSE (mean-squared error, Section 3.4).
When using the HOG+HOF and Canned I3D features as inputs with the Matching Loss, $\lambda$ in the combined loss (Section 3.5) was reduced from an initial value to with an exponential step size of every five epochs until $L_f$ and $L_s$ were weighted equally. For the Finetuned I3D features, $\lambda$ was reduced to . We set the , , . For the Wasserstein Loss, . In order to help the Wasserstein and MSE losses deal with the scarcity of positive samples, we blur the ground truth label sequence with a Gaussian kernel with a window size of 19 frames and a standard deviation of 2 frames.
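The label blurring mentioned above can be illustrated with a small sketch that spreads each labeled start over a 19-frame Gaussian window with a standard deviation of 2 frames. The function names are hypothetical, and scaling the kernel so that its peak equals 1 (rather than normalizing it to sum to 1) is an assumption; the text only gives the window size and standard deviation.

```python
import math


def gaussian_kernel(window=19, sigma=2.0):
    """Gaussian weights over an odd-length window, with a peak value of 1."""
    half = window // 2
    return [math.exp(-(k * k) / (2.0 * sigma * sigma))
            for k in range(-half, half + 1)]


def blur_labels(labels, window=19, sigma=2.0):
    """Convolve a 0/1 label sequence for one behavior with the kernel."""
    kernel = gaussian_kernel(window, sigma)
    half = window // 2
    blurred = [0.0] * len(labels)
    for t, y in enumerate(labels):
        if y == 0:
            continue
        for k, w in enumerate(kernel):
            s = t + k - half  # frame that receives this kernel weight
            if 0 <= s < len(blurred):
                blurred[s] += y * w
    return blurred
```

Under these assumptions, a single labeled frame keeps the value 1, its neighbors two frames away receive about 0.61, and frames more than nine frames away stay at zero.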
5.2 Mouse Experiments
We test our loss functions on the Mouse Reach Dataset. The goal of this task is to detect the start of a behavior within $\tau$ frames of the true start. For these experiments we set $\tau = 10$ frames, which is equivalent to 0.02 seconds at 500 frames per second. For each of the four mice, we train on all other mice's videos and the first half of that mouse's videos, and test on the second half of that mouse's videos. Test sets consisted of 125, 55, 274, and 192 videos for the four mice.
As mentioned before, a correct prediction is one that is within $\tau$ frames of the ground truth start frame. These represent true positive results. All other network predictions are false positives, and missed ground truth starts are false negatives. We can then calculate accuracy using the F1 score, which is the harmonic mean of precision and recall.
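The evaluation described above can be sketched as follows for a single behavior in a single video. The one-to-one matching here is a simple greedy pass over the sorted start frames, which is an assumption made for brevity and not necessarily the matching used to produce the reported numbers; the function names and the default `tau=10` are likewise illustrative.

```python
def count_matches(true_starts, pred_starts, tau=10):
    """Count true positives with a greedy one-to-one matching of sorted starts."""
    true_sorted, pred_sorted = sorted(true_starts), sorted(pred_starts)
    i = j = tp = 0
    while i < len(true_sorted) and j < len(pred_sorted):
        if abs(true_sorted[i] - pred_sorted[j]) <= tau:
            tp += 1
            i += 1
            j += 1
        elif pred_sorted[j] < true_sorted[i]:
            j += 1  # prediction is too early for every remaining true start
        else:
            i += 1  # true start is too early for every remaining prediction
    return tp


def precision_recall_f1(true_starts, pred_starts, tau=10):
    """Precision, recall and F1, where unmatched predictions are false
    positives and unmatched true starts are false negatives."""
    tp = count_matches(true_starts, pred_starts, tau)
    fp = len(pred_starts) - tp
    fn = len(true_starts) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, with true starts at frames 100 and 300 and predictions at frames 95, 120, 305, and 500, two predictions are correct and two are spurious, giving a precision of 0.5 and a recall of 1.0.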
We consider a network trained with MSE loss as a baseline and compare it to networks trained with our two structured loss functions. Table 2 shows the precision, recall and F1 score for the three loss functions on the Mouse Reach Dataset. We also show the scores of the fine-tuned feed-forward I3D network. The structured losses have a better F1 score, and this is due to higher precision, implying fewer false positives. This matches one of the goals of our losses: to reduce the number of spurious predictions by adding the structured loss. The Matching Loss explicitly penalizes false positive predictions, and the Wasserstein Loss attempts to match the number of predicted behavior starts with the ground truth. Overall, the Wasserstein Loss performs best regardless of the input features. In Table 1, we can see the breakdown of performance with respect to each behavior. The behaviors that the structured losses most improved were "Lift" and "Supinate".
The MSE + Finetuned I3D combination performs far worse than expected, and we believe this is due to our training method. We only use the labeled frames as positive samples, and sample no other frames in a window with radius 10 around the positive sample. Additionally, we do not do any hard negative mining at the border of this window. The 64-frame input to I3D and the unseen frames may be causing the features to be very similar for the MSE loss. Because MSE does not penalize false positives as harshly as the structured losses, it tends to predict many extra action starts, as seen in the poor precision score. We believe that the feature representation and the MSE performance can be improved by implementing hard negative mining.
Figure 3 shows the distribution of predictions for the lift. Due to space constraints we only show one behavior here. More examples are available in the supplementary materials. Within the frames to the true action start, we see that our proposed losses predict far fewer false positives, while predicting a similar number of correct predictions. From this we can infer that our losses are more likely to produce a single prediction for an action start than MSE. However, this graph does not include false positives that are far from the true start of the behavior. Figure 4 shows how predictions far from the action start eventually are matched.
Perhaps surprisingly, we do not see a big difference in performance between the HOG+HOF features and the fine-tuned I3D features learned for this task.
In this work, we show that it is possible to predict the frame where behaviors start with high accuracy. Due to the nature of our task, we focused on developing structured loss functions that would reduce the number of false positive predictions. We suggest using the Wasserstein Loss over the Matching Loss because it has fewer parameters to tune and the loss is differentiable. In the future, we plan to apply our method for action start detection to other video datasets.
- [Caba Heilbron et al.(2015)Caba Heilbron, Escorcia, Ghanem, and Carlos Niebles] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
- [Carreira and Zisserman(2017)] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
- [Dalal and Triggs(2005)] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
- [Escorcia et al.(2016)Escorcia, Heilbron, Niebles, and Ghanem] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Daps: Deep action proposals for action understanding. In ECCV, 2016.
- [Gao et al.(2017)Gao, Yang, Chen, Sun, and Nevatia] Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. Turn tap: Temporal unit regression network for temporal action proposals. In ICCV, 2017.
- [Guo et al.(2015)Guo, Graves, Guo, Zheng, Lee, Rodriguez-Gonzalez, Li, Macklin, Phillips, Mensh, et al.] Jian-Zhong Guo, Austin R Graves, Wendy W Guo, Jihong Zheng, Allen Lee, Juan Rodriguez-Gonzalez, Nuo Li, John J Macklin, James W Phillips, Brett D Mensh, et al. Cortex commands the performance of skilled movement. Elife, 2015.
- [Idrees et al.(2017)Idrees, Zamir, Jiang, Gorban, Laptev, Sukthankar, and Shah] Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 2017.
- [Karpathy et al.(2014)Karpathy, Toderici, Shetty, Leung, Sukthankar, and Fei-Fei] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- [Krakauer et al.(2019)Krakauer, Hadjiosif, Xu, Wong, and Haith] John W Krakauer, Alkis M Hadjiosif, Jing Xu, Aaron L Wong, and Adrian M Haith. Motor learning. Comprehensive Physiology, 9:613–663, 2019.
- [Kuhn(1955)] Harold W. Kuhn. The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly, 1955.
- [Ren et al.(2015)Ren, He, Girshick, and Sun] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, 2015.
- [Sauerbrei et al.(2018)Sauerbrei, Guo, Zheng, Guo, Kabra, Verma, Branson, and Hantman] Britton Sauerbrei, Jian-Zhong Guo, Jihong Zheng, Wendy Guo, Mayank Kabra, Nakul Verma, Kristin Branson, and Adam Hantman. The cortical dynamics orchestrating skilled prehension. bioRxiv, page 266320, 2018.
- [Shou et al.(2017)Shou, Chan, Zareian, Miyazawa, and Chang] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017.
- [Shou et al.(2018)Shou, Pan, Chan, Miyazawa, Mansour, Vetro, Giro-i Nieto, and Chang] Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i Nieto, and Shih-Fu Chang. Online detection of action start in untrimmed, streaming videos. In ECCV, 2018.
- [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
- [Singh et al.(2016)Singh, Marks, Jones, Tuzel, and Shao] Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
- [Soomro et al.(2012)Soomro, Zamir, and Shah] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [Stewart et al.(2016)Stewart, Andriluka, and Ng] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In CVPR, 2016.
- [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
- [Uijlings et al.(2013)Uijlings, Van De Sande, Gevers, and Smeulders] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 2013.
- [Wang et al.(2016)Wang, Xiong, Wang, Qiao, Lin, Tang, and Van Gool] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
- [Wei et al.(2018)Wei, Wang, Nguyen, Zhang, Lin, Shen, Mech, and Samaras] Zijun Wei, Boyu Wang, Minh Hoai Nguyen, Jianming Zhang, Zhe Lin, Xiaohui Shen, Radomír Mech, and Dimitris Samaras. Sequence-to-segment networks for segment detection. In Advances in Neural Information Processing Systems, 2018.
- [Whishaw and Pellis(1990)] Ian Q Whishaw and Sergio M Pellis. The structure of skilled forelimb reaching in the rat: a proximally driven movement with a single distal rotatory component. Behavioural brain research, 41(1):49–59, 1990.
- [Xu et al.(2017)Xu, Das, and Saenko] Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. In ICCV, 2017.