Efficient Activity Detection in Untrimmed Video with Max-Subgraph Search
We propose an efficient approach for activity detection in video that unifies activity categorization with space-time localization. The main idea is to pose activity detection as a maximum-weight connected subgraph problem. Offline, we learn a binary classifier for an activity category using positive video exemplars that are “trimmed” in time to the activity of interest. Then, given a novel untrimmed video sequence, we decompose it into a 3D array of space-time nodes, which are weighted based on the extent to which their component features support the learned activity model. To perform detection, we then directly localize instances of the activity by solving for the maximum-weight connected subgraph in the test video’s space-time graph. We show that this detection strategy permits an efficient branch-and-cut solution for the best-scoring—and possibly non-cubically shaped—portion of the video for a given activity classifier. The upshot is a fast method that can search a broader space of space-time region candidates than was previously practical, which we find often leads to more accurate detection. We demonstrate the proposed algorithm on four datasets, and we show its speed and accuracy advantages over multiple existing search strategies.
Activity detection, action recognition, maximum weighted subgraph search.
The activity detection problem entails both recognizing and localizing categories of activity in an ongoing (meaning “untrimmed”) video sequence. In other words, a system must not only be able to recognize a learned activity in a new clip; it must also be able to isolate the (potentially small) portion of a long input sequence that contains the activity. Reliable activity detection would have major practical value for applications such as video indexing, surveillance and security, and video-based human computer interaction.
While the recognition portion of the problem has received increasing attention in recent years, state-of-the-art methods largely assume that the space-time region of interest to be classified has already been identified. However, for most realistic settings, a system must not only name what it sees, but also partition out the temporal or spatio-temporal extent within which the activity occurs. The distinction is non-trivial; in order to properly recognize an action, the spatio-temporal extent usually must be known simultaneously.
To meet this challenge, existing methods tend to separate activity detection into two distinct stages: the first generates space-time candidate regions of interest from the test video, and the second scores each candidate according to how well it matches a given activity model (often a classifier). Most commonly, candidates are generated either using person-centered tracks [22, 27, 36, 16] or using exhaustive sliding window search through all frames in the video [14, 8, 29]. Both face potential pitfalls. On the one hand, a method reliant on tracks is sensitive to tracking failures, and by focusing on individual humans in the video, it overlooks surrounding objects that may be discriminative for an activity (e.g., the car a person is approaching). On the other hand, sliding window search is clearly a substantial computational burden, and its frame-level candidates may be too coarse, causing clutter features to mislead the subsequent classifier. In both cases, the scope of space-time regions even considered by the classifier is artificially restricted, e.g., to person bounding boxes or a cubic subvolume.
Our goal is to unify the classification and localization components into a single detection procedure. We propose an efficient approach that exploits top-down activity knowledge to quickly identify the portion of video that maximizes a classifier’s score. In short, it works as follows. Given a novel video, we construct a 3D graph in which nodes describe local video subregions, and their connectivity is determined by proximity in space and time. Each node is associated with a learned weight indicating the degree to which its appearance and motion support the action class of interest. Using this graph structure, we show the detection problem is equivalent to solving a maximum-weight connected subgraph problem, meaning to identify the subset of connected nodes whose total weight is maximal. For our setting, this in turn is reducible to a prize-collecting Steiner tree problem, for which practical branch-and-cut optimization strategies are available. This means we can efficiently identify both the spatial and temporal region(s) within the sequence that best fit a learned activity model. See Figure 1.
The proposed approach has several important properties. First, for the specific case where our space-time nodes are individual video frames, the detection solution is equivalent to that of exhaustive sliding window search, yet costs orders of magnitude less search time due to the branch-and-cut solver. Second, we show how to create more general forms of the graph that permit “non-cubic” detection regions, and even allow hops across irrelevant frames in time that otherwise might mislead the classifier (e.g., due to a temporary occluding object). This effectively widens the scope of candidate video regions considered beyond that allowed by any prior methods; the upshot is improved accuracy. Third, we explore a two-stage search extension that increases the speed of the proposed subgraph search for long videos, and show its generality for detecting multiple activity instances in a single input sequence. Finally, the method accommodates a fairly wide family of features and classifiers, making it flexible as a general activity detection tool. To illustrate this flexibility, we devise a novel high-level descriptor amenable to subgraph search that reflects human poses and objects as well as their relative temporal ordering.
We validate the algorithm on four challenging datasets. The results demonstrate its clear speed and accuracy advantages over both standard sliding window search and a state-of-the-art branch-and-bound solution.
2 Related Work
We focus our literature review on methods handling action detection in video. There is also a large body of work in activity recognition (from either a sequence or a static frame) where one must categorize a clip/frame that is already trimmed to the action of interest. Representations developed in that work are complementary and could enhance results attainable with our detection scheme.
One class of methods tackles detection by explicitly tracking people, their body parts, and nearby objects (e.g., [22, 27, 16]). Tracking “movers” is particularly relevant for surveillance data where one can assume a static camera. However, as shown in Figure 2(c), relying on tracks can be limiting; it makes the detector sensitive to tracking errors, which are expected in video with large variations in backgrounds or rapidly changing viewpoints (e.g., movies or YouTube video). Furthermore, while good for activities that are truly person-based—like handwaving or jumping jacks—a representation restricted to person-tracks will suffer when defining elements of the action are external to people in the scene (e.g., the computer screen a person is looking at, or the chair he may sit in).
Conscious of the difficulty of relying on tracks, another class of methods has emerged that instead treats activity classes as learned space-time appearance and motion patterns. The bag of space-time interest point features model is a good example [19, 30]. In this case, at detection time the classifier is applied to features falling within candidate subvolumes within the sequence. Typically the search is done with a sliding window over the entire sequence [14, 8, 29], or in combination with person tracks.
Given the enormous expense of exhaustive sliding window search, some recent work explores branch-and-bound solutions to efficiently identify the subvolume that maximizes an additive classifier’s output [38, 37, 4]. This approach offers fast detection and can localize activities in both space and time, whereas sliding windows localize only in the temporal dimension. However, as shown in Figure 2(b), in contrast to our approach, existing branch-and-bound methods are restricted to searching over cubic subvolumes in the video; that limits detections to cases where the subject of the activity does not change its spatial position much over time. Our results demonstrate the value of the more general detection shapes allowed by our method.
An alternative way to avoid exhaustive search is through voting algorithms. Recent work explores ways to combine person-centric tracks or pre-classified sequences with a Hough voting stage to refine the localization [36, 21, 1], or to use voting to generate candidate frames for merging. Like any voting method, such approaches risk being sensitive to noisy background descriptors that also cast votes, and in particular will have ambiguity for actions with periodicity. Furthermore, in contrast to our algorithm, they cannot guarantee to return the maximum scoring space-time region for a classifier.
Rather than pose a detection task, one prior multi-class recognition approach uses dynamic programming to select the temporal boundaries per action. Like our technique, it jointly considers recognition and segmentation. However, unlike our method, it localizes only in the temporal dimension, assumes a multi-class objective where all parts of the sequence will belong to some pre-trained category (thus requiring one to learn a “background” activity class), and cannot detect multiple activities occurring at the same time.
The branch-and-cut algorithm we use to optimize the subgraph has also been explored for object segmentation in static images. In contrast, our approach addresses activity detection, and we explore novel graph structures relevant for video data.
This article extends our earlier conference paper. The main new additions are (1) an extension to the method to permit efficient spatially localized search even over long sequences, (2) an extension to the method to detect multiple instances of an activity in a single sequence, (3) new results on a fourth dataset specifically designed for the detection task (THUMOS 2014), (4) new qualitative results, and (5) new figures to better present our ideas and approach.
Our approach first trains a detector using a binary classifier and training examples where the action’s temporal extent is known. Then, given test sequences for which we have no knowledge of the start and end of the activity, it returns the subsequence (and optionally, the spatial regions of interest) that maximizes the classifier score. This works by creating a space-time graph over the entire test sequence, where each node is a space-time cube, and the cubes are linked according to their proximity in space and time. Each node is weighted by a positive or negative value indicating its features’ contribution to the classifier’s score. Thus, the subsequence for which the detector would yield the maximal score is equivalent to the maximum weight connected subgraph. This subgraph can be efficiently computed using an existing branch-and-cut algorithm, thereby finding the optimal solution without exhaustive search through all possible sets of connected nodes.
We first define the classifiers accommodated by our method (Sec. 3.1), and the features we use (Sec. 3.2). Then we describe how the graphs are constructed (Sec. 3.3); we introduce variants of the node structure and linking strategy that allow us to capture different granularities at detection time. Next, we briefly explain the maximum subgraph problem and branch-and-cut search (Sec. 3.4). Finally, we devise two extensions of our basic framework that can deal with spatio-temporal detection even in long videos (Sec. 3.5) and detection of multiple instances in a single sequence (Sec. 3.6).
3.1 Detector Training and Objective
We are given labeled training instances of the activity of interest, and train a binary classifier $f$ to distinguish positive instances from all other action categories. This classifier can score any subvolume $S$ of a novel video according to how well it agrees with the learned activity. To perform activity detection, the goal is to determine the subvolume in a new sequence $Q$ that maximizes the score
$$S^* = \arg\max_{S \subseteq Q} f(S). \tag{1}$$
If we were to restrict the subvolume in the spatial dimensions to encompass the entire frame, then $S^*$ would correspond to the output of an exhaustive sliding window detector. More generally, the optimal subvolume $S^*$ is the set of contiguous voxels of arbitrary shape in $Q$ that returns the highest classifier score.
Our approach requires the classifier to satisfy two properties. First, it must be able to score an arbitrarily shaped set of voxels. Second, it must be defined such that features computed within local space-time regions of the video can be combined additively to obtain the classifier response for a larger region. The latter is necessary so that we can decompose the classifier response across the nodes of the space-time graph, and thereby associate a single weight with each node. Suitable additive classifiers include linear support vector machines (SVM), boosted classifiers, or Naive Bayes classifiers computed with localized space-time features, as well as certain non-linear SVMs.
Our results use a linear SVM with histograms (bags) of quantized space-time descriptors. The bag-of-features (BoF) representation has been explored in a number of recent activity recognition methods (e.g., [19, 15, 23]), and, despite its simplicity, offers very competitive results. We consider BoFs computed over two forms of local descriptors. The first consists of low-level histograms of oriented gradients and flow computed at space-time interest points; the second consists of a novel high-level descriptor that encodes the relative layout of detected humans, objects, and poses. Both descriptors are detailed below in Sec. 3.2.
In either case, we compute a vocabulary of $K$ visual words by quantizing a corpus of features from the training images. A video subvolume $S$ with $N$ local features is initially described by the set $S = \{(x_i, v_i)\}_{i=1}^{N}$, where each $x_i$ refers to the 3D feature position in space and time, and $v_i$ is the associated local descriptor. Then the subvolume is converted to a $K$-dimensional BoF histogram $h(S)$ by mapping each $v_i$ to its respective visual word $c_i$, and tallying the word counts over all $N$ features.
We use the training instances to learn a linear SVM, which means the resulting scoring function $f$ has the form:
$$f(S) = \beta + \sum_i \alpha_i \langle h(S), h(S_i) \rangle, \tag{2}$$
where $i$ indexes the training examples $S_i$, and $\alpha$, $\beta$ denote the learned weights and bias. This can be rewritten as a sum over the contributions of each feature. Let $h^j(S)$ denote the $j$-th bin count for histogram $h(S)$. The $j$-th word is associated with a weight
$$w^j = \sum_i \alpha_i h^j(S_i), \tag{3}$$
for $j = 1, \dots, K$. Thus the classifier response for a subvolume $S$ is:
$$f(S) = \beta + \sum_{j=1}^{K} w^j h^j(S) \tag{4}$$
$$\phantom{f(S)} = \beta + \sum_{i=1}^{N} w^{c_i}, \tag{5}$$
where again $c_i$ is the index of the visual word that feature $v_i$ maps to, $c_i \in \{1, \dots, K\}$. By writing the score of a subvolume as the sum of its features’ “word weights”, we now have a way to associate each local descriptor occurrence with a single weight: its contribution to the total classifier score.
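The equivalence between the histogram form of the linear SVM score and the per-feature sum of word weights can be checked numerically. The sketch below is only an illustration under stated assumptions: the coefficients `alphas` are taken to be signed (label already folded in), and the tiny vocabulary, training histograms, and bias are made-up values, not taken from our experiments. Word indices are 0-based in the code.

```python
from collections import Counter

def word_weights(alphas, train_hists):
    """Per-word weight: w[j] = sum_i alpha_i * h_j(S_i)."""
    K = len(train_hists[0])
    return [sum(a * h[j] for a, h in zip(alphas, train_hists)) for j in range(K)]

def bof_histogram(word_ids, K):
    """Tally visual word counts into a K-bin histogram."""
    counts = Counter(word_ids)
    return [counts.get(j, 0) for j in range(K)]

def score_via_histogram(beta, alphas, train_hists, word_ids):
    """Linear SVM score computed from inner products of histograms."""
    h = bof_histogram(word_ids, len(train_hists[0]))
    return beta + sum(
        a * sum(x * y for x, y in zip(h, hi)) for a, hi in zip(alphas, train_hists)
    )

def score_via_word_weights(beta, w, word_ids):
    """Same score, written as a sum of one weight per local feature."""
    return beta + sum(w[c] for c in word_ids)

# Hypothetical toy values (K = 3 words, two training examples).
alphas = [0.5, -0.25]
train_hists = [[2, 0, 1], [0, 4, 1]]
beta = -0.1
word_ids = [0, 0, 2, 1]  # visual word index of each feature in the subvolume

w = word_weights(alphas, train_hists)
s_hist = score_via_histogram(beta, alphas, train_hists, word_ids)
s_feat = score_via_word_weights(beta, w, word_ids)
```

Both routes give the same score, which is what allows each feature occurrence to carry its own additive contribution.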
We stress that our method is not limited to linear SVMs; alternative additive classifiers with the properties described above are also permitted. Our experiments in Sec. 4 focus on linear SVMs due to their efficacy. We have also successfully implemented the framework using others, e.g., Naive Bayes, with the same input features. The results are sound; however, across the board we find that the Naive Bayes classifier is less effective than the SVM for our task.
Furthermore, while the additive requirement does lead to an orderless bag-of-features representation, it is still possible to encode temporal ordering into the approach depending on how the local descriptors are extracted. For example, in Sec. 3.2.2 we provide one way to record the space-time layout of neighboring objects into high-level visual words.
3.2 Localized Space-Time Features
We consider two forms of localized descriptors for the local descriptor vectors above: a conventional low-level gradient-based feature, and a novel high-level feature.
For low-level features, we employ an array of widely used local video descriptors from the literature. In general, they capture the texture and motion within localized space-time volumes, either at interest points or dense positions within the video. In particular, we use histograms of oriented gradients (HoG) and histograms of optical flow (HoF) computed in local space-time cubes [19, 15]. The local cubes are either centered at 3D Harris interest points or densely sampled. These descriptors capture the appearance and motion in the video, and their locality lends robustness to occlusions. We also incorporate dense trajectory and motion boundary histogram (MBH) features in a bag-of-features representation. We refer the reader to the original papers about the descriptors for more details.
As is typical in visual recognition, we can expect accuracy to improve with the variety and complementarity of the features we use, at some cost in computation. Specifically, the main influence the features will have on our method’s complexity is their density in the video; while their density will not at all affect the node structure (cf. Sec. 3.3), it will dictate how many visual word mappings must be computed. In Sec. 4 we provide more discussion about how we select among these descriptors for different datasets; in short, our selection is largely based on empirical findings from previous work about which are best suited.
We introduce a novel descriptor for an alternative high-level representation. While low-level gradient features are effective for activities defined by gestures and movement (e.g., running vs. diving), many interesting actions are likely better defined in terms of the semantic interactions between people and objects [10, 6, 26]. For example, “answering phone” should be compactly describable in terms of a person, a reach, a grasp of the receiver, etc.
To this end, we compose a descriptor that encodes the objects and poses occurring in a space-time neighborhood. First, we run a bank of object detectors and a bank of mid-level “poselet” detectors on all frames. To capture human pose, we categorize each detected person into one of a fixed number of “person types”. These types are discovered from person detection windows in the training data: for each person window we create a histogram of the poselet activations that overlap it, and then quantize the space of all such histograms with $k$-means to provide the discrete types. Each type reflects a coarse pose; for example, a seated person may cause upper body poselets to fire, whereas a hugging person would trigger poselets from the back.
Given the sparse set of bounding box object detections in a test sequence, we form one neighborhood descriptor per box. This descriptor reflects (1) the type of detector (e.g., person type #3, car) that fired at that position, (2) the distribution of object/person types that also fired within a 50-frame temporal window of it, and (3) their relative space-time distances. See Figure 4.
To quantize this complex space into discriminative high-level “words”, we devise a random forest technique. When training the random forest, we choose spatial distance thresholds, temporal distance thresholds, and object types to parameterize semantic questions that split the raw descriptor inputs so as to reduce action label entropy. Each training and testing descriptor is then assigned a visual word according to the indices of the leaf nodes it reaches when traversing each tree in the forest. Essentially, this reduces each rich neighborhood of space-time object relationships to a single quantized descriptor, i.e., a single visual word index in Eqn. 5.
In contrast to the low-level features, this descriptor encodes space-time ordering, demonstrating that our max-subgraph scheme is not limited to pure bag-of-words representations. Furthermore, it leads to faster node weight computations, since the number of detected objects is typically much fewer than the number of space-time interest points.
3.3 Definition of the Space-Time Graph
So far we have defined the training procedure and features we use. Now we describe how we construct a space-time graph $G = (V, E)$ for a novel test video, where $V$ is a set of vertices (nodes) and $E$ is a set of edges. Recall that a test video is “untrimmed”, meaning that we have no prior knowledge about where an action starts or ends in either the spatial or temporal dimensions. Our detector will exploit the graph to efficiently identify the most likely occurrences of a given activity. We present two variants each for the node and link structures, as follows.
Each node in the graph is a set of contiguous voxels within the video. In principle, the smallest possible node would be a pixel, and the largest possible node would be the full test sequence. What, then, should be the scope of an individual node? The factors to consider are (1) the granularity of detection that is desired (i.e., whether the detector should predict only when the action starts and ends, or whether it should also estimate the spatial localization), and (2) the allowable computational cost. Note that nodes larger than individual voxels or frames are favorable not only for computational efficiency, but also to aggregate neighborhood statistics to give better support when the classifier considers that region for inclusion.
With this in mind, we consider two possible node structures. The first breaks the video into frame-level slabs, such that each node is a sequence of $T$ consecutive frames. The second breaks the video into a grid of $H \times W \times T$ space-time cubes. In all our results, $T$ is fixed to one of two values, and $H$ and $W$ are set to a fixed fraction of the frame dimensions.
After building a graph with either node structure for a test video, we compute the weight $\omega(v)$ for each node $v \in V$:
$$\omega(v) = \sum_{x_j \in v} w^{c_j}, \tag{6}$$
where $x_j$ is the 3D coordinate of the $j$-th local descriptor falling within node $v$, and $c_j$ is its quantized feature index, so that $w^{c_j}$ is the corresponding word weight from Sec. 3.1. We assign the features from Sec. 3.2 to their respective graph nodes as follows. For the case of low-level features, $x_j$ is the space-time interest point position. For the case of high-level features, $x_j$ is the center of the originating object detection window. In either case, a feature is claimed by the space-time node containing its central position.
Intuitively, nodes with high positive weights indicate that the activity covers that space-time region, while nodes with negative weights indicate the absence of the activity.
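To make the node weighting concrete, the following sketch accumulates per-feature word weights into either frame-level slabs or space-time cubes. It is a simplified illustration, not our implementation: the function name `node_weights`, the frame size, the slab length, the 2 x 2 spatial grid, and the word weights are all hypothetical choices made for the example.

```python
from collections import defaultdict

def node_weights(features, word_weight, frame_w, frame_h, T, grid=None):
    """Sum word weights of the features whose center falls in each node.

    features: list of ((x, y, t), word_id) with integer frame index t.
    grid=None      -> frame-level slabs, keyed by slab index.
    grid=(gx, gy)  -> space-time cubes, keyed by (col, row, slab).
    Nodes that receive no feature are simply absent (weight 0).
    """
    weights = defaultdict(float)
    for (x, y, t), c in features:
        slab = t // T  # temporal slab containing frame t
        if grid is None:
            key = slab
        else:
            gx, gy = grid
            col = min(int(x * gx / frame_w), gx - 1)
            row = min(int(y * gy / frame_h), gy - 1)
            key = (col, row, slab)
        weights[key] += word_weight[c]
    return dict(weights)

# Hypothetical toy input: 320x240 frames, slabs of T = 5 frames.
word_weight = [1.0, -1.0, 0.25]
features = [((10, 10, 0), 0), ((300, 200, 3), 1),
            ((20, 15, 7), 2), ((310, 230, 8), 0)]

slab_w = node_weights(features, word_weight, 320, 240, T=5)
cube_w = node_weights(features, word_weight, 320, 240, T=5, grid=(2, 2))
```

In the slab variant the positive and negative features in the first slab cancel out, whereas the cube variant keeps them in separate nodes, which is the finer granularity the spatio-temporal graph is meant to provide.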
The connectivity between nodes also affects both the shape of candidate subvolumes and the cost of subgraph search. We explore two strategies. In the first, we link only those neighboring nodes that are temporally (and spatially, for the ST node structure) adjacent (see Figure 6 (a)). In the second, we additionally link nodes that are within the first two temporal neighbors (see Figure 6 (b)); we call this variant T-Jump-Subgraph. Since at test time we will seek a maximum scoring connected subgraph, the former requires detection subvolumes to be strictly contiguous in time (and thus equates to the options available to a sliding window), while the latter allows subvolumes that “jump” over an adjacent neighbor in time.
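For the temporal-only case, the two linking strategies differ only in how far ahead each node is connected. A minimal sketch, assuming nodes are indexed by temporal order and using a hypothetical helper `temporal_edges`:

```python
def temporal_edges(num_nodes, max_jump=1):
    """Undirected edges (i, j), i < j, linking node i to its next max_jump neighbors.

    max_jump=1 links only adjacent nodes; max_jump=2 additionally links
    nodes two steps apart, which permits skipping one node in time.
    """
    return [(i, i + d)
            for i in range(num_nodes)
            for d in range(1, max_jump + 1)
            if i + d < num_nodes]

adjacent_only = temporal_edges(4, max_jump=1)
with_jumps = temporal_edges(4, max_jump=2)
```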
By allowing jumps, we can ignore misleading features that may interrupt an otherwise good instance of an action. For example, Figure 6 depicts some temporal nodes and their associated weights, under either connectivity scheme. The max subgraph without jumps in (a) is the first two nodes only; in contrast, for the same node weights, the max subgraph with jumps in (b) extends to include the fourth node, yielding a higher weight subgraph (4+2+5 vs. 4+2). This can be useful when the skipped node(s) contain noisy features, such as an object temporarily blocking the person performing the activity. Like the space-time nodes presented above, the use of temporal jumps further expands the space of candidate subvolumes our method can search, at some additional computational cost.
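The effect of jumps can be reproduced on a small temporal chain. On such a chain, a node subset is connected exactly when consecutive selected nodes are at most `max_jump` apart, so a simple dynamic program suffices; this is a stand-in for illustration only, not the branch-and-cut solver used in our method. The weights 4, 2, and 5 follow the example in the text, while the third weight (-6) is an assumed value chosen so that the no-jump optimum stops after two nodes.

```python
def max_temporal_subgraph(weights, max_jump=1):
    """Max-weight connected subgraph on a temporal chain.

    Node i is linked to nodes i+1 .. i+max_jump. Returns (score, node indices).
    best[i] is the best score of a connected subgraph whose last node is i.
    """
    if not weights:
        return 0.0, []
    n = len(weights)
    best = [0.0] * n
    prev = [None] * n
    for i, w in enumerate(weights):
        best[i] = w
        for d in range(1, max_jump + 1):
            j = i - d
            if j >= 0 and best[j] + w > best[i]:
                best[i] = best[j] + w
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    nodes = []
    i = end
    while i is not None:  # walk back through the chosen predecessors
        nodes.append(i)
        i = prev[i]
    return best[end], nodes[::-1]

weights = [4, 2, -6, 5]  # third weight is a hypothetical "occluded" node
no_jump = max_temporal_subgraph(weights, max_jump=1)
with_jump = max_temporal_subgraph(weights, max_jump=2)
```

Without jumps the result is the first two nodes (score 6); with jumps the fourth node is added while the negative third node is skipped (score 11).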
3.4 Searching for the Maximum Weight Subgraph
Having defined the graph constructed on an untrimmed test sequence, we are ready to describe the detection procedure to maximize the classifier score in Eqn. 1. Our detection objective is an instance of the maximum-weight connected subgraph problem (MWCS): Given a connected, undirected, vertex-weighted graph $G = (V, E)$ with weights $\omega : V \to \mathbb{R}$, find a connected subgraph $T = (V_T \subseteq V, E_T \subseteq E)$ of $G$ that maximizes the score $W(T) = \sum_{v \in V_T} \omega(v)$. The best-scoring subgraph is the subvolume in the video most likely to encompass the activity of interest. That is the output of our approach. In Sec. 3.6 we explain how we iteratively apply the subgraph search procedure to retrieve multiple detections in the same video.
With both positive and negative weights, the problem is NP-complete; an exhaustive search would enumerate and score all possible subsets of connected nodes. However, MWCS can be transformed into an instance of the prize-collecting Steiner tree problem (PCST), which has the same graph structure as the original MWCS instance and assigns profits to vertices and costs to edges. The resulting problem is solvable by transforming the graph into a directed graph and formulating an integer linear programming (ILP) problem with binary variables for every vertex and edge. Then, by relaxing the integrality requirement, the problem can be solved with linear programming using a branch-and-cut algorithm. This method gives optimal solutions and is very efficient in practice for the space-time graphs in our setting.
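To make the MWCS objective tangible, the sketch below solves a tiny instance by the exhaustive enumeration mentioned above. It is deliberately the naive baseline (exponential in the number of nodes) and not the PCST/branch-and-cut procedure; the graph, node names, and weights are hypothetical.

```python
from itertools import combinations

def is_connected(nodes, adj):
    """Depth-first check that `nodes` induces a connected subgraph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def brute_force_mwcs(weights, edges):
    """Enumerate every connected node subset and keep the best-scoring one."""
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best_score, best_nodes = float("-inf"), frozenset()
    for k in range(1, len(weights) + 1):
        for subset in combinations(sorted(weights), k):
            if is_connected(subset, adj):
                score = sum(weights[v] for v in subset)
                if score > best_score:
                    best_score, best_nodes = score, frozenset(subset)
    return best_score, best_nodes

# Hypothetical toy graph: "b" is negative but needed to join the positive nodes.
weights = {"a": 3, "b": -1, "c": 2, "d": -4, "e": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("b", "e")]
score, nodes = brute_force_mwcs(weights, edges)
```

The optimum includes the negatively weighted node `b` because it connects three positive nodes, while `d` is left out; this is the kind of trade-off the solver resolves at scale.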
3.5 Two Stage Spatio-temporal Detection
Next we describe an extension to the framework that further improves the efficiency of spatio-temporal detection, at some loss in search completeness. In essence, this extension offers a way to further scale up our detection strategy for long input videos. It is relevant to the spatio-temporal detection variant of our method (cf. Fig. 5(b)), not the temporal-only variant (cf. Fig. 5(a)). The fine-grained space-time detection offered by the ST-Subgraph comes at the price of a greater number of nodes and denser connectivity. In particular, consider the number of edges as a function of the number of frames: in the temporal-only graph, each additional temporal node adds a single edge, whereas in the spatio-temporal graph, each additional temporal slab adds a number of edges that grows quadratically with the number of spatial nodes per slab. Thus, to detect the activity efficiently without reducing the granularity of the search scope, we consider how a modest sacrifice in detection accuracy (i.e., giving up the exhaustive search equivalence promised so far) can yield a significantly larger detection speed-up.
To this end, we propose a hierarchical bottom-up two stage strategy for the space-time search setting. The basic idea is to first perform space-time detections in each temporal slab, and then propagate those detection results up to a second level of processing that performs temporal detection across the slabs. See Figure 7.
Given a test video, we divide the video into spatio-temporal nodes (as depicted in Fig. 7, left) and compute their weights as described in Sec. 3.3. Next, we search for the best detection volume in two stages: (1) a spatial detection stage and (2) a temporal detection stage. For the spatial detection stage, we connect nodes in the same temporal slice into a 2D connected weighted graph (see Fig. 7, top right). This yields a series of graphs, each of which has nodes representing the features in different spatial positions in the respective temporal slab. We then apply the subgraph search procedure from Sec. 3.4 to find the maximum weighted connected subgraph in each slab. Next, the detection score for each 2D subgraph is used to represent the weight of each temporal slab, and these slabs are connected into a 1D temporal graph (see Fig. 7, bottom right). Finally, we find the maximum weighted subgraph along the temporal dimension to obtain the detection output. The spatio-temporal detection result is determined by the set of spatio-temporal nodes in the 2D max-subgraphs that are also selected in the 1D max-subgraph.
This hierarchical process reduces the computational cost by dividing the original 3D graph structure into a 2D+1D graph structure. Note, however, that the detection result from two-stage subgraph search may differ from that returned by the original ST-Subgraph. Whereas the ST-Subgraph is guaranteed to return the same result as an exhaustive search over connected subgraphs, in this modified two-stage procedure, the temporal connection between nodes is always reduced to one edge (vs. nine edges for the original ST-Subgraph). However, the two-stage search process still provides broader searching scope than the simpler T-Subgraph structure.
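The two-stage procedure can be sketched end to end on a toy video. The code below is an illustration under several assumptions that are not part of our method: each slab is a tiny 2D grid with 4-connectivity, the per-slab search is done by exhaustive enumeration rather than branch-and-cut, and the temporal stage uses a maximum-sum contiguous run (adjacent-only links). All weights are hypothetical.

```python
from itertools import combinations

def grid_connected(cells, rows, cols):
    """True if `cells` form one 4-connected component on a rows x cols grid."""
    cells = set(cells)
    start = next(iter(cells))
    seen, stack = {start}, [start]
    while stack:
        r, c = stack.pop()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == cells

def best_spatial_subgraph(slab):
    """Stage 1: exhaustive max-weight connected subgraph in one slab (tiny grids only)."""
    rows, cols = len(slab), len(slab[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    best = (float("-inf"), ())
    for k in range(1, len(cells) + 1):
        for subset in combinations(cells, k):
            if grid_connected(subset, rows, cols):
                score = sum(slab[r][c] for r, c in subset)
                if score > best[0]:
                    best = (score, subset)
    return best

def best_temporal_run(scores):
    """Stage 2: max-sum contiguous run of slabs; returns (score, start, end)."""
    best_score, best_range = float("-inf"), (0, 0)
    run, start = 0, 0
    for t, s in enumerate(scores):
        if run <= 0:
            run, start = s, t
        else:
            run += s
        if run > best_score:
            best_score, best_range = run, (start, t)
    return best_score, best_range[0], best_range[1]

def two_stage_detect(video):
    """video: list of slabs, each a 2D list of node weights."""
    per_slab = [best_spatial_subgraph(slab) for slab in video]
    score, t0, t1 = best_temporal_run([s for s, _ in per_slab])
    nodes = [(t, r, c) for t in range(t0, t1 + 1) for (r, c) in per_slab[t][1]]
    return score, nodes

# Hypothetical toy video: three slabs, each a 2x2 grid of node weights.
video = [
    [[-2, -1], [-3, -1]],
    [[3, -1], [-2, 2]],
    [[1, 2], [-5, -1]],
]
score, nodes = two_stage_detect(video)
```

Note that the selected spatial regions in neighboring slabs are never checked for spatial overlap, which reflects the single temporal edge between slabs and is why the result can differ from the full ST-Subgraph search.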
In practice, when the test video clip is longer than 1,000 frames, the two-stage subgraph is preferable to the ST-Subgraph for efficient spatio-temporal localization. Because the two-stage subgraph is an approximation of the ST-Subgraph, it may provide lower accuracy when the features are very noisy, since it ignores many edges when computing the maximum weighted subgraph.
3.6 Detecting Multiple Activity Instances
Thus far, we have described detection in terms of localizing the single space-time region most likely to contain the activity of interest. In particular, the max-subgraph search returns the subvolume for which the trained classifier would score most highly out of all possible subvolumes. To address the scenario where the novel test sequence may contain multiple instances of the activity, and/or to provide multiple confidence-rated hypotheses for the detection output, we extend the max-subgraph search technique as follows.
To detect multiple instances, the main idea is to iteratively run the max-subgraph procedure on adjusted versions of the original input graph, each time adjusting the graph to reflect the most recent detection. The most straightforward approach to modifying the graph would be to take all the nodes selected for the most recent detection and re-weight each one to $-\infty$. Doing so is equivalent to removing those nodes, and it would force the next search iteration to choose other nodes for its next hypothesis. This approach has shortcomings in practice, however. While the max-subgraph output from the first detection is optimal in terms of the classifier and features chosen, it need not be perfect in terms of localizing the actual activity. So, flattening nodes to have weight $-\infty$ leads to fragmented secondary detections.
Therefore, we instead downweight those nodes already involved in a detection, but we do not remove them from the graph entirely. Specifically, each node is re-weighted to 0, as determined empirically on validation data. In this way, the modified graph coming into the next iteration of the max-subgraph computation will favor finding new high-scoring detections, but may still partially re-use portions of the previous detection(s).
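The iterative re-weighting loop can be sketched on a temporal-only graph with adjacent links, where the max-subgraph reduces to a maximum-sum contiguous run. This is a simplified stand-in for the full search; the function names, the stopping rule (stop once the best score is no longer positive), the cap on the number of detections, and the weights are assumptions made for the example.

```python
def best_run(weights):
    """Max-sum contiguous run on a temporal chain; returns (score, start, end)."""
    best_score, best_range = float("-inf"), (0, 0)
    run, start = 0, 0
    for t, w in enumerate(weights):
        if run <= 0:
            run, start = w, t
        else:
            run += w
        if run > best_score:
            best_score, best_range = run, (start, t)
    return best_score, best_range[0], best_range[1]

def detect_multiple(weights, max_detections=3, reweight=0.0, min_score=0.0):
    """Repeat the search, re-weighting the nodes of each detection before the next pass."""
    w = list(weights)
    detections = []
    for _ in range(max_detections):
        score, s, e = best_run(w)
        if score <= min_score:
            break
        detections.append((score, s, e))
        for t in range(s, e + 1):
            w[t] = reweight  # down-weight, but keep the node in the graph
    return detections

weights = [5, 4, -1, 3, -6, 2, 4]  # hypothetical slab weights
detections = detect_multiple(weights)
```

Passing `reweight=float("-inf")` instead reproduces the node-removal variant discussed above.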
The effect of this process is roughly analogous to standard non-maximum suppression (NMS) as applied in object/action detection with sliding windows. With sliding windows, any window with a positive classifier score could be reported as a detection output. However, many windows with positive scores overlap highly with others, and are actually covering the same object/action instance. To reduce redundant detections, NMS is used to select a single representative output window among a group that highly overlaps. A key parameter that determines the behavior of NMS is the threshold for overlap between detections: candidate windows overlapping with the selected window by more than the selected threshold are not added to the detection output. When the threshold is high, one generates more detection outputs at the risk of redundancy. The re-weighting value applied to nodes in our graphs is analogous to that threshold. An NMS threshold of 0 in traditional sliding windows would correspond to a re-weighting value of $-\infty$ in our setup; a higher NMS threshold corresponds to a higher re-weighting value, allowing some overlap in output detections.
4 Experimental Results
We next present experimental results applying our method for activity detection on several public benchmark datasets. We evaluate our approach compared to both sliding window and sliding cuboid baselines as well as an existing state-of-the-art subvolume detection method that exploits branch-and-bound search. Throughout we are interested in both the speed and accuracy attainable. Ideally, we would like to achieve very accurate detection but at a small fraction of the run-time cost incurred by traditional sliding window methods. Furthermore, in some scenarios we hope to improve the accuracy over sliding windows, since our method will permit searching a more complete set of windows than is tractable with a naive search implementation.
In what follows, we first describe the datasets, baselines, and metrics used in our experiments, and we provide implementation details for our approach not already covered above. Then, the next four subsections present results organized around each of the four datasets. This is the most natural organization, since the dataset properties and their respective available ground truth dictate which variants of our approach are relevant for testing (e.g., temporal detection only, fully spatio-temporal, two-stage for spatio-temporal with long sequences, etc.).
Activity Detection Datasets
We validate on four datasets, all of which are publicly available:
UCF Sports: The dataset consists of 10 actions from various sports typically found on TV, such as diving, golf swing, running, and skateboarding. The data originates from stock footage websites like BBC Motion or GettyImages. The provided clips are trimmed to the action of interest, so we expand them into longer test sequences by concatenating clips to form “UCF-Concat” (details below). The ground truth contains the action label and the bounding box annotation of the human.
Hollywood Human Actions: The training set contains 219 clips originating from 12 Hollywood movies, and the test set contains 211 clips from a disjoint set of 20 Hollywood movies. The activities are things like answer phone, get out of car, shake hands, etc. We test with the noisy “uncropped” versions of the test sequences which are only roughly aligned with the action and contain about 40% extraneous frames. In all data there is a variety of camera motion and dynamic scenes. The ground truth consists of the action label for the clip, as well as the correct temporal boundaries of the activity in the case of the uncropped sequences.
MSR Actions: The MSR dataset consists of 16 test clips with three activity classes (hand clapping, hand waving, and boxing) performed in front of cluttered and moving backgrounds. They are performed by 10 subjects, both indoor and outdoor. The ground truth consists of a spatio-temporal bounding box for each action. To our knowledge, this is the only available activity dataset with both spatial and temporal annotations (others are limited to temporal boundaries only). For this dataset, we train the activity classifiers using the disjoint KTH dataset , following .
THUMOS 2014: THUMOS consists of videos collected from YouTube containing 101 different action classes. The emphasis of the THUMOS challenge is to cope with temporally untrimmed videos. Accordingly, the test sequences contain the target actions naturally embedded in other content, and the ground truth includes the temporal boundaries of the true action. Following the localization setting of the winners for the ECCV 2014 workshop’s detection task , we divide the 1010 validation videos into two equal parts for testing and training. The test data contains 20 activity classes: baseball pitch, basketball dunk, billiards, clean and jerk, cliff diving, cricket bowling, cricket shot, diving, frisbee catch, golf swing, hammer throw, high jump, javelin throw, long jump, pole vault, shot put, soccer penalty, tennis swing, throw discus, volleyball spiking.
See Table 1 for a summary of the dataset properties. In particular, we include each dataset’s typical clip lengths and the portion of the sequence occupied by the action to be detected. On average, the action of interest occupies only 28% of the total test sequence, making detection (as opposed to classification) necessary.
|Dataset||Features||Num test videos||Ave length (#frames)||Ave length of action|
|THUMOS||STIP+HoG/HoF, Trajectory, MBH||111||1717||29%|
We compare our approach to three baselines:
ST-Cube-Sliding: a variant of sliding window that searches all cuboid subvolumes having any rectangular combination of the spatial-nodes used by our method. Its search scope is similar to our ST-Subgraph, except that it lacks all possible spatial links, meaning the detected subvolume cannot shift spatial location over time. While most existing methods simply apply a sliding temporal window, with no spatial localization, we include this baseline as the natural straightforward extension of sliding window search if one wants to obtain localization.
ST-Cube-Subvolume: the state-of-the-art branch-and-bound method of . It considers all possible cube-shaped subvolumes, and returns the one maximizing the sum of feature weights inside. Its scope is more flexible than ST-Cube-Sliding. Its objective is identical to ours, except that it is restricted to searching cube-shaped volumes that cannot shift spatial location over time. We use the authors’ code.
We stress that our approach is a new strategy for detection; results in the literature focus largely on classification, and so are not directly comparable. The sliding window and subvolume baselines are state-of-the-art methods for detection, so our comparisons with identical features and classifiers will give clear insight into our method’s performance.
We consider four variants of our approach: T-Subgraph, T-Jump-Subgraph, ST-Subgraph, and two-stage ST-Subgraph, as defined in Sec. 3. Recall that T-Subgraph provides equivalent accuracy to T-Sliding, but is faster.
Figure 8 depicts the scope of the regions searched by each method, both ours and the baselines.
We adopt standard metrics for detection evaluation. Following [36, 16, 38], we use the mean overlap accuracy. Whether performing temporal or full spatio-temporal detection, this metric computes the intersection of the predicted detection region with the ground truth, divided by the union. We use detection time (on our 3.47GHz Intel Xeon CPUs) to evaluate computational cost.
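The overlap metric can be written down compactly. The sketch below is an illustration, not the evaluation code used in the experiments: it assumes temporal detections are half-open frame intervals `[start, end)` and that space-time detections can be represented as sets of node identifiers. Both representations and all function names are hypothetical.

```python
def temporal_overlap(pred, gt):
    """Intersection over union of two half-open frame intervals [start, end)."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def node_set_overlap(pred_nodes, gt_nodes):
    """Intersection over union of two sets of space-time node identifiers."""
    pred_nodes, gt_nodes = set(pred_nodes), set(gt_nodes)
    union = pred_nodes | gt_nodes
    return len(pred_nodes & gt_nodes) / len(union) if union else 0.0


def mean_overlap(pairs):
    """Average temporal overlap over (prediction, ground truth) pairs."""
    pairs = list(pairs)
    return sum(temporal_overlap(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    # A detection covering frames 0-9 against ground truth covering frames 5-14.
    print(temporal_overlap((0, 10), (5, 15)))
```

The same ratio applies to both the temporal and the spatio-temporal setting; only the representation of the detected region changes.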
For all datasets, we train a binary SVM to build a detector for each action. We use the descriptors described in Sec. 3.2, following the guidance of prior work [34, 33] to select which particular sampling strategies and local space-time descriptors to employ per dataset. In particular, recommendations from  lead us to employ HoG/HoF for Hollywood and HoG3D for UCF with dense sampling. For the THUMOS dataset we use the features provided with the dataset, which augments the HoG/HoF set with dense trajectories and MBH. In particular, on THUMOS we train one-versus-all binary SVMs with four types of features: trajectory , HOG, HOF, and MBH , where the features are quantized to a bag of words representation via k-means with a dictionary size . We use the authors’ code for HoG3D/HoG/HoF/trajectory/MBH [19, 15, 33, 24], with default parameter settings. We test the high-level descriptors on Hollywood, since that dataset has substantial person-object interactions, whereas actions in the others are more person-centric (e.g., diving, clapping, skateboarding). We construct our temporal graphs with a node size of 10 frames per slab.
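To make the link between the bag-of-words classifier and the temporal graph concrete, the sketch below accumulates per-feature classifier weights into slab nodes of 10 frames each. It assumes a linear classifier over an unnormalized bag-of-words histogram, so that each quantized feature contributes the weight of its visual word and the bias term is ignored. The function name, the `(frame_index, word_id)` feature format, and the toy weights are hypothetical.

```python
def temporal_node_weights(features, word_weights, num_frames, frames_per_node=10):
    """Sum per-word classifier weights into temporal slab nodes.

    features:     iterable of (frame_index, visual_word_id) pairs
    word_weights: classifier weight of each visual word (assumed linear model)
    """
    num_nodes = (num_frames + frames_per_node - 1) // frames_per_node
    weights = [0.0] * num_nodes
    for frame, word in features:
        # Each feature votes for the slab that contains its frame.
        weights[frame // frames_per_node] += word_weights[word]
    return weights


if __name__ == "__main__":
    toy_features = [(0, 0), (3, 1), (12, 1), (25, 2)]  # hypothetical features
    toy_word_weights = [0.5, -1.0, 2.0]                # hypothetical weights
    print(temporal_node_weights(toy_features, toy_word_weights, num_frames=30))
```

Because the node weights are plain sums, the score of any subgraph is the sum of its node weights, which is what allows detection to be posed as a maximum-weight subgraph search.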
The next four sections describe the results on each dataset in turn.
4.1 Temporal Detection on UCF Sports
|Detection time (ms)||T-Sliding||ST-Cube-Subvol ||Our-T-Subgraph||Our-T-Jump-Subgraph|
Since the UCF clips are already cropped to the action of interest, we modify the dataset to make it suitable for detection. We form 12 test sequences, each by concatenating 8 clips taken from different verbs. All test videos are totally distinct, and are available on our project website. We train the SVM on a disjoint set of cropped instances. We perform temporal detection only, since the activities occupy the entire frame.
Table 2 shows the accuracy results, and Table 3 shows the search times. For almost all verbs, our subgraph approaches outperform the baselines. Further, our T-Jump variant gives top accuracy in most cases, showing the advantage of ignoring noisy features (in this data, often found near the onset or ending of the verb). Figure 9 shows an example where T-Jump performs robust detection in spite of occlusions, whereas the baseline sliding window or basic T-Sliding fails.
On this dataset, the ST-Cube-Subvolume baseline is often weaker than sliding window. Upon inspection, we found it often fires on a small volume with highly weighted features when the activity changes in spatial location over time. However, it is best on “Swing-Bench”, likely because the backgrounds are fairly static, minimizing misleading features. As we see in Table 3, both our subgraph methods are orders of magnitude faster than the baselines. Note that the ST-Cube-Subvolume’s higher cost is reasonable since here it is searching a wider space.
4.2 Temporal Detection on Hollywood
|Detection Time (ms)||T-Sliding||ST-Cube-Subvol ||Our-T-Subgraph||Our-T-Jump-Subgraph|
We next test the Hollywood data, which also permits a study of temporal detection. As noted above, we test with the untrimmed data provided by the dataset creators. Existing work uses this data for classification, and so trains and tests with the cropped versions. To perform temporal detection, we instead train with the cropped clips, and test with the uncropped clips.
Table 4 shows the accuracy results, and Table 5 shows the search times. Our T-Jump-Subgraph achieves the best accuracy for 6 of the 8 verbs, with even more pronounced gains than on UCF. This again shows the value of skipping brief negatively weighted portions; e.g., “AnswerPhone” can transpire across several shot boundaries, which tends to mislead the baselines.
As Table 5 reveals, our method is again significantly faster than the baselines. Our T-Jump-Subgraph is slower than our T-Subgraph search, given the higher graph complexity (which also makes it more accurate). Hence, which variant to apply depends on how an application would like to make this cost-accuracy trade-off.
|Test sequence composition||Accuracy|
|Raw uncropped clips||24.83%|
|Output from T-Subgraph||29.66%|
|Manual ground truth||29.97%|
One might wonder whether a naive detector that simply classifies the entire uncropped clip could do as well. To check, we compare recognition results when we vary the composition of the test sequence to be either (a) the uncropped clip, (b) the output of our detector, or (c) the ground truth cropped clip. Table 6 shows the result. We see indeed that detection is necessary; using our output is much better than the raw untrimmed clips, and only slightly lower than using the manually provided ground truth.
We also test our high-level descriptor (cf. Sec. 3.2.2) on Hollywood, since its actions contain human-object interactions. We apply six object detectors—bus, car, chair, dining table, sofa, and phone—to every fifth frame, and use random forests with 10 trees. Table 7 shows the results, compared to our method using low-level features. For five of the eight actions, the proposed high-level descriptor improves accuracy. It is best for activities based on the interaction between two people (e.g., kiss) or involving an obvious change in pose (e.g., sit up), showing the strength of the proposed person types to capture pose and temporal ordering. For other verbs with varied objects (answer phone, get out of car), it hurts accuracy, likely due to object detector failures in this dataset. It remains future work outside the scope of this project to bolster the component object detectors fed into this higher-level neighborhood descriptor.
|Verbs||T-Subgraph (HoG/HoF)||T-Subgraph (high-level)|
4.3 Temporal Detection with Multiple Instances on THUMOS
Next we evaluate our approach on the THUMOS dataset. THUMOS allows temporal detection (like UCF Sports and Hollywood), plus, unlike the others, it contains test sequences with multiple instances of the activity. This aspect lets us test our iterative max-subgraph strategy to produce multiple detections, as discussed in Sec. 3.6.
In these experiments, the sliding window baseline represents the same search strategy taken by the leading approach on this dataset . As such, we follow the authors’ parameter choices for the window search in order to provide a close comparison. That means for the T-Sliding baseline, we use a step size of 10 frames, and evaluate windows with durations of 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, and 150 frames . We fix the NMS threshold at 0.5 (shifting this threshold within the range (0,1] did not yield better results for the baseline), and we fix the node re-weighting value at 0 for our method (cf. Sec. 3.6). Note that with a step size of 10 frames, the sliding window baseline (T-Sliding) does not exhaustively search all subsequences, whereas our method does. For each test video, we return up to 10 positive detection windows.
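The baseline configuration described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the baseline authors' code: window scores are taken to be sums of per-frame scores, suppression is greedy by descending score, and a candidate is rejected only when its overlap with a kept window exceeds the threshold. All names are hypothetical.

```python
from itertools import accumulate

# Window durations (in frames) listed in the text for the T-Sliding baseline.
DURATIONS = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150)


def sliding_window_candidates(frame_scores, durations=DURATIONS, step=10):
    """Score every window of the given durations, placed every `step` frames."""
    prefix = [0.0] + list(accumulate(frame_scores))  # prefix sums for O(1) scoring
    n = len(frame_scores)
    candidates = []
    for d in durations:
        for start in range(0, n - d + 1, step):
            candidates.append((prefix[start + d] - prefix[start], start, start + d))
    return candidates


def interval_iou(a, b):
    """Intersection over union of two half-open intervals [start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(candidates, iou_threshold=0.5, max_outputs=10):
    """Greedily keep positive windows that do not overlap kept ones too much."""
    kept = []
    for score, start, end in sorted(candidates, reverse=True):
        if score <= 0 or len(kept) == max_outputs:
            break
        if all(interval_iou((start, end), (s, e)) <= iou_threshold for _, s, e in kept):
            kept.append((score, start, end))
    return kept


if __name__ == "__main__":
    toy_scores = [1.0] * 20 + [-1.0] * 20 + [1.0] * 20  # hypothetical frame scores
    windows = sliding_window_candidates(toy_scores, durations=(10, 20))
    print(non_max_suppression(windows, iou_threshold=0.4))
```

Note that the number of candidates grows with both the number of durations and the density of start positions, which is the cost the subgraph search avoids.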
Table 8 shows the accuracy results for T-Sliding and our T-Subgraph method, both in terms of overlap and the mean average precision (mAP) as defined by , which is a useful metric for the case when there are multiple instances per test clip. Our method obtains higher accuracy than the standard sliding window baseline. This is a direct consequence of the efficiency of our approach in considering all possible windows. We also get a noticeable further advantage in overlap accuracy by applying our T-Jump variant, yet it harms average precision. Upon inspection, we find that for this challenging data, the classifier scores per node are noisier, which leads T-Jump to cover too many frames; T-Jump can easily find some small-valued positive nodes that let it skip over highly negative nodes, leading to some poorer detection outputs as seen in the mAP. The high overlap score of T-Jump confirms this observation and illustrates why mAP is a better metric than overlap accuracy for multiple-instance detection. We also tried a variant of our approach that less aggressively reduces the weights on nodes already involved in a prior iteration’s detections: we set the weight of a “used” node to the mean weight of all nodes, with the intent to encourage more overlapping detections. However, this led to slightly worse accuracy for our method (0.2043 overlap accuracy vs. 0.2186 in Table 8).
Table 9 shows the computation time for both methods. Similar to previous results, our T-Subgraph method for detecting multiple instances runs significantly faster than T-Sliding. For the sliding window method, no matter how many output detections we want, all the candidate windows are evaluated. In contrast, for our T-Subgraph, we return only one optimal window in each subgraph search iteration and re-weight the underlying nodes for the next iteration. Therefore, in this experiment, we need to run our T-Subgraph 10 times to find the top 10 detection windows. Yet, in spite of that repetition, it is still about an order of magnitude faster than evaluating all the candidate windows in the T-Sliding method.
|Metric||T-Sliding||Our T-Subgraph||Our T-Jump-Subgraph|
|Time (ms)||T-Sliding||Our T-Subgraph||Our T-Jump-Subgraph|
Finally, we more closely analyze the behavior of the sliding window baseline (T-Sliding) as it compares to our T-Subgraph. The goal is to see in practice what density of windowed search (skip sizes) is necessary for best results. In other words, if we allow T-Sliding more candidate windows and hence longer running time, at what point does it come close to the optimal result from our method? Since running this experiment is rather costly for the baseline, we limit this test to four of the 20 verbs in the THUMOS test set (chosen randomly: basketball dunk, clean and jerk, cliff diving, and hammer throw).
Figure 10 shows the results in terms of the average accuracy over all four actions tested. As expected, increasing the pool of candidate windows searched by T-Sliding increases its accuracy, but at a corresponding linear increase in run-time. At a search time of 200 ms per frame, the baseline is searching 35 different window sizes (out of 300 window sizes for exhaustive search) and achieves accuracy of 0.26, nearing but not as good as the result from T-Subgraph of 0.30 accuracy obtained with just a few ms per frame.
4.4 Space-Time Detection on MSR Actions
As the fourth and final dataset, we experiment with MSR Actions. In contrast to all of the above datasets, MSR Actions contains ground truth for the spatial localization of the action—not just the temporal extent. Furthermore, the actors change their position over time and a test sequence may contain multiple simultaneous instances of different actions. Therefore, this dataset is a good testbed to evaluate our ST-Subgraph with the node structure in Figure 5(b), where we link neighboring nodes both in space and time. In what follows, we present results with both the exact maximum subgraph from ST-Subgraph as well as its approximate counterpart, the two-stage search process described in Sec. 3.5.
First we isolate temporal detection accuracy alone. We run the temporal and spatio-temporal variants of our method, and project the spatio-temporal results to temporal results. Table 10 shows the results. Even under the temporal criterion, our ST-Subgraph and two-stage ST-Subgraph are most accurate, since they can isolate those nodes that participate in the action. Figure 11 illustrates how our space-time node structure succeeds when the location of activity changes over time, whereas ST-Cube-Subvolume may be trapped in cube-shaped maxima. Compared to ST-Subgraph, our two-stage method yields similar accuracy for Boxing and Clapping videos and lower accuracy for Waving videos. This result shows that the two-stage method provides a good approximation to the ST-Subgraph method.
|Detection Time (ms)||T-Sliding||ST-Cube-Sliding||ST-Cube-Subvol ||Our-T-Subgraph||Our-ST-Subgraph||Our-Two-Stage-ST|
Next we examine the complete space-time localization accuracy. Table 12 shows the results, evaluated under the ground truth annotation for the person who performs the action.
Finally, we analyze the run-times for all methods tested in Table 11. Here we see the substantial practical impact of our two-stage spatio-temporal variant, which yields significantly lower computation time. It is even faster than the sliding temporal window search that produces no spatial localization, and orders of magnitude faster than the existing branch-and-bound subvolume method . The two-stage method is slightly slower than the T-Subgraph variant of our method, since it requires additional computation for the spatial detection in the first stage for each slab.
As discussed in Sec. 3.5, we can achieve efficient spatio-temporal localization with our proposed two-stage subgraph search method. Above, our ST-Subgraph provides more accurate space-time localization of the action, but at a higher computational cost. Here, we speed up ST-Subgraph with our two-stage subgraph search for space-time detection on the MSR Actions dataset.
Table 12 and Table 11 also compare the detection accuracy and search time of our Two-Stage-ST-Subgraph and our original ST-Subgraph. By dividing the node structure into temporal slices, the two-stage method reduces computation time by three orders of magnitude compared to the original ST-Subgraph. As expected, the two-stage method is slightly slower than T-Subgraph because it requires additional computation for spatial detection in the first stage for each temporal node. As for detection accuracy, recall that the two-stage method is not guaranteed to return the optimal spatio-temporal volume, since it ignores the temporal links between nodes in the first stage. It is therefore expected to be less accurate than the ST-Subgraph method. As shown in Table 12, the Two-Stage-ST method achieves accuracy similar to ST-Subgraph for hand clapping and hand waving clips, but lower accuracy for boxing clips. This is because the learned activity model for boxing is less accurate than the learned models for the other two actions (it yields lower overlap accuracy for ST-Subgraph), and our two-stage method is more sensitive to noisy node scores due to the pruned connections between nodes.
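The structure of a two-stage search can be illustrated with a simplified sketch. Several assumptions are made here and should be read as such: a brute-force maximum-sum rectangle is swapped in for the per-slab spatial subgraph search purely for brevity, the second stage is taken to be a 1-D maximum-run search over the per-slab scores, and all names and toy weights are hypothetical. The sketch is meant only to show why the first stage ignores temporal links and why the selected region may shift location from slab to slab.

```python
def max_run(values):
    """Return (score, start, end) of the best contiguous run (end inclusive)."""
    best = (float("-inf"), 0, 0)
    run, run_start = 0.0, 0
    for i, v in enumerate(values):
        if run <= 0:
            run, run_start = v, i  # restart: a non-positive prefix never helps
        else:
            run += v
        if run > best[0]:
            best = (run, run_start, i)
    return best


def best_rectangle(grid):
    """Stage 1 stand-in: max-sum rectangle (top, bottom, left, right) in one slab."""
    rows, cols = len(grid), len(grid[0])
    best_score, best_box = float("-inf"), None
    for top in range(rows):
        col_sums = [0.0] * cols
        for bottom in range(top, rows):
            for c in range(cols):
                col_sums[c] += grid[bottom][c]
            score, left, right = max_run(col_sums)
            if score > best_score:
                best_score, best_box = score, (top, bottom, left, right)
    return best_score, best_box


def two_stage_detect(slabs):
    """Stage 1 per slab (no temporal links), then stage 2 along the time axis."""
    per_slab = [best_rectangle(grid) for grid in slabs]
    total, first, last = max_run([score for score, _ in per_slab])
    return total, [(t, per_slab[t][1]) for t in range(first, last + 1)]


if __name__ == "__main__":
    toy_slabs = [  # hypothetical 2x2 grids of node weights, one per time slab
        [[-1, -1], [-1, -1]],
        [[2, -1], [-1, -1]],
        [[-1, 3], [-1, 1]],
        [[-2, -2], [-2, -2]],
    ]
    print(two_stage_detect(toy_slabs))
```

Because each slab is solved independently, a noisy slab can select a region unrelated to its neighbors, which is consistent with the sensitivity to noisy node scores noted above.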
4.5 Summary of Trade-Offs in Results
Having presented all the results, now we step back and attempt to summarize the outcomes succinctly. There are three dimensions of trade-offs between all methods tested: search time, search scope, and detection accuracy.
Figure 12 summarizes all trade-offs for three datasets. Here we show the accuracy versus the detection time for each result, and encode the search scope of the method by the complexity of its polygonal symbol. More complex symbols mean wider search scope. For example, recalling Figure 8, the least complex search scope is T-Sliding/T-Subgraph, which is plotted as a triangle, whereas the most complex search scope is the ST-Subgraph, which is plotted as a 14-sided star.
Importantly, we see that increased search scope generally boosts accuracy. In addition, the flexibility of the graph structure in our subgraph algorithm allows it to perform best per dataset in terms of either speed (see vertical blue dotted lines) or accuracy (see horizontal red dotted lines).
Our method can produce results equivalent to sliding window search, but without the exhaustive search. However, due to the additive restriction our method places on the classifier (cf. Sec. 3.1), it cannot normalize each window’s bag-of-feature histograms. Would such normalization help the accuracy of sliding windows? We find it actually hurts the baseline, letting tiny subvolumes with few positively weighted features dominate the detection outputs. We can improve the normalization by also re-weighting the detection score by the length of the window to encourage longer detections . Table 13 shows the result. The T-Sliding accuracy increases in three of the four datasets, yet remains inferior to our method’s best results. Our accuracy advantage comes from our flexible subgraph node and linking strategies.
|UCF (ave. overlap)||0.5453||0.5417||0.5504|
|Hollywood (ave. overlap)||0.3337||0.3565||0.3715|
|MSR (ave. overlap)||0.1288||0.1513||0.1890|
|THUMOS (ave. mAP)||0.1983||0.2026||0.2143|
We provide our source code and data on our project page.
We presented a novel branch-and-cut subgraph framework for activity detection that efficiently searches a wide space of temporal or space-time subvolumes. Compared to traditional sliding window search, it significantly reduces computation time. Compared to existing branch-and-bound methods, its flexible node structure offers more robust detection in noisy backgrounds. Our novel high-level descriptor also shows promise for complex activities, and makes it possible to preserve the spatio-temporal relationships between humans and objects in the video, while still exploiting the fast subgraph search.
We thank the anonymous reviewers for their feedback, and Sudheendra Vijayanarasimhan for helpful discussions. This research is supported in part by ONR PECASE N00014-15-1-2291.
- The bias term can be ignored for the purpose of maximizing .
- Rather than space-time cubes, one could consider using space-time segments from a bottom-up grouping algorithm. This would have some potential advantages, including finer-grained localization. However, our preliminary attempts indicated that the regular grid nodes are preferable to segments in practice, for both accuracy and speed. That is because (1) the irregularly shaped segment nodes lead to dense adjacency structures, hurting run-time, and (2) the difficulty in producing quality supervoxels makes it easy to over/under-segment.
- http://www.di.ens.fr/~laptev/actions/
- We found its behavior sensitive to its penalty value parameter, which is a negative prior on zero-valued pixels . The default setting was weak for our data, so for fairest comparisons, we tuned for best results on UCF.
- For the special case of temporal search, one can obtain equivalent solutions using 1-D branch-and-bound search to detect the max subvector along the temporal axis . In practice we find this method’s run-time to be similar or slightly faster than T-Subgraph. Note, however, that it is not applicable for any other search scope handled by our approach.
- The original ground truth labels only the hand regions (see Figure 11), whereas this ground truth labels the whole person performing the action.
- S. Bandla and K. Grauman. Active learning of an action detector from untrimmed videos. In ICCV, 2013.
- J. Bentley. Programming pearls: algorithm design techniques. Commun. ACM, 27(9):865–873, Sept. 1984.
- L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
- L. Cao, Z. Liu, and T. S. Huang. Cross-dataset action recognition. In CVPR, 2010.
- C.-Y. Chen and K. Grauman. Efficient activity detection with max-subgraph search. In CVPR, 2012.
- C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for static human-object interactions. In Workshop on Structured Models in Computer Vision, Computer Vision and Pattern Recognition (SMiCV), 2010.
- M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller. Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics, 2008.
- O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV, 2009.
- P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32:1627–1645, 2010.
- A. Gupta, A. Kembhavi, and L. Davis. Observing human-object interactions: using spatial and functional compatibility for recognition. PAMI, 31, 2009.
- M. Hoai, Z.-Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In CVPR, 2011.
- T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 2002.
- Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
- Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In ICCV, 2005.
- A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
- A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman. Human focused action localization in video. In International Workshop on Sign, Gesture, Activity, 2010.
- C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.
- I. Laptev. On space-time interest points. IJCV, 64(2):107–123, 2005.
- I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
- I. Ljubic, R. Weiskircher, U. Pferschy, G. Klau, P. Mutzel, and M. Fischetti. An algorithmic framework for the exact solution of the prize-collecting Steiner tree problem. Math. Prog., 2006.
- K. Mikolajczyk and H. Uemura. Action recognition with motion-appearance vocabulary forest. In CVPR, 2008.
- D. Moore, I. Essa, and M. Hayes. Exploiting human actions and object context for recognition tasks. In CVPR, 1999.
- J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 2008.
- D. Oneata, J. Verbeek, and C. Schmid. Action and Event Recognition with Fisher Vectors on a Compact Feature Set. In ICCV 2013 - IEEE International Conference on Computer Vision, pages 1817–1824, Sydney, Australia, Dec. 2013. IEEE.
- D. Oneata, J. Verbeek, and C. Schmid. The LEAR submission at THUMOS 2014, 2014.
- A. Prest, C. Schmid, and V. Ferrari. Weakly supervised learning of interactions between humans and objects. PAMI, 34:601–614, 2012.
- D. Ramanan and D. Forsyth. Automatic annotation of everyday movements. In NIPS, 2003.
- M. D. Rodriguez, J. Ahmed, and M. Shah. Action mach: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
- S. Satkin and M. Hebert. Modeling the temporal extent of actions. In ECCV, 2010.
- C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local svm approach. In ICPR, 2004.
- A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, 2010.
- S. Vijayanarasimhan and K. Grauman. Efficient region search for object detection. In CVPR, 2011.
- H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3169–3176, Colorado Springs, United States, June 2011.
- H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
- G. Willems, J. Becker, T. Tuytelaars, and L. V. Gool. Exemplar-based action recognition in video. In BMVC, 2009.
- A. Yao, J. Gall, and L. van Gool. A Hough transform-based voting framework for action recognition. In CVPR, 2010.
- G. Yu, J. Yuan, and Z. Liu. Unsupervised random forest indexing for fast action search. In CVPR, 2011.
- J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In CVPR, 2009.