Activity Detection with Latent Sub-event Hierarchy Learning


In this paper, we introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture temporal structure in continuous activity videos. Our layer is designed to allow the model to learn a latent hierarchy of sub-event intervals. Our approach is fully differentiable and relies on significantly fewer parameters than standard temporal convolution, enabling end-to-end training with standard backpropagation. We present convolutional video models with multiple TGM layers for activity detection. Our experiments on multiple datasets, including Charades and MultiTHUMOS, confirm the benefit of our TGM layers, showing that they outperform other models and standard temporal convolutions.

Human Activity Recognition; Activity Detection; Sub-event Learning; Temporal Convolution

1 Introduction

Human activity detection is the problem of identifying time intervals (i.e., starting times and ending times) of occurring activities given a continuous video input. This is essential for many societal applications, including intelligent surveillance, quality-of-life systems such as patient monitoring, smart cities and environments, and online video indexing/retrieval. It can also enable robot perception of human activities in public and private places (i.e., provide activity-level situation awareness), allowing robots to know what humans are doing and what they want to do. Activity detection is more challenging than the standard activity ‘classification’ task, which is the problem of categorizing pre-segmented videos. In activity detection, multiple activities may (or may not) occur in continuous videos of any length.

In the past 4-5 years, activity recognition approaches taking advantage of Convolutional Neural Networks (CNNs) have obtained particularly successful results, advancing the state of the art. They were able to perform feature/representation learning optimized for video training data, benefiting overall recognition. There were approaches that learn 3-D XYT convolutional kernels to be directly applied to video inputs [1, 2], as well as approaches that model temporal structure on top of per-frame representations [3, 4]. A few works also studied CNN models for the problem of detecting multiple activities [5, 6].

In this paper, we introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer, and present how it can be used to efficiently capture temporal structure in activity videos. Our temporal Gaussian mixture layer is a temporal convolutional layer, whose filters/kernels are controlled by a set of (temporal) Gaussian distribution parameters. Each of our temporal Gaussian distributions specifies (temporally) ‘where’ the model should look, and our Gaussian mixture layer combines them as multiple convolutional filters to be applied on top of continuous per-frame (or per-segment) representations. This layer allows the video representation at each time step to be constructed while focusing on different neighboring temporal regions, instead of only focusing on its local segment. It is a convolutional layer governed by a much smaller set of parameters (e.g., location/variance of Gaussians) that are fully differentiable.

The motivation behind our temporal Gaussian mixture layer is to model the relative location of each sub-event of an activity with each temporal Gaussian kernel. Sub-events are shorter/simpler events composing an activity, and many prior works emphasized the importance of capturing sub-events for high-level activity recognition [3, 7, 8, 9]. Our objective is to make our model automatically learn such latent sub-event locations optimal for the recognition, by learning our temporal layer parameters solely based on the training data in an end-to-end fashion. Importantly, since our sub-events are represented using one temporal ‘layer’, we are able to capture multiple levels of (latent) sub-event hierarchy by learning multiple sequential temporal Gaussian mixture layers. This allows our approach to represent long-term temporal information in activity videos by having multiple levels of abstractions. This is in contrast to previous CNN-based sub-event learning works [3] which were limited to the learning of sub-event hierarchy at just one level: activities and their atomic sub-events.

Furthermore, our temporal Gaussian mixture layer is designed to share a smaller set of Gaussian parameters to form many different mixtures using the soft-attention mechanism. Instead of making the model learn a separate sub-event hierarchy for each activity class, our layer more efficiently represents sub-events of an activity as a mixture of shared distributions. The number of parameters we need to learn in our temporal Gaussian mixture layer is significantly smaller than the number of parameters in general temporal convolutional layers or 3-D XYT convolutional layers, enabling its efficient optimization even with relatively small-scale video datasets.

We present video-CNN models using our temporal Gaussian mixture layers for activity detection from continuous videos. Our model stacks the temporal Gaussian mixture layers on top of several state-of-the-art base CNNs such as I3D [2] and two-stream Inception [10, 11]. This enables our model to capture longer-term temporal information than the base CNNs alone, hierarchically modeling temporal structure with multiple Gaussian mixture layers. As a result, our model is able to build activity representations while considering time intervals of tens of seconds. Our model was evaluated on multiple public datasets including MultiTHUMOS [6] and Charades [5], and was able to outperform the best previous activity detection CNNs as well as models with various types of temporal convolutional layers.

2 Related Works

Activity recognition has been a popular research topic in computer vision [11, 12, 13, 14, 15, 16]. Hand-crafted features, such as dense trajectories [15], gave promising results on many datasets. Recently, many works have focused on learning good features/representations for activity recognition [2, 11, 17]. Two-stream CNN approaches take both RGB frames and optical flow frames as input to capture both motion and spatial image features [11, 18]. 3D spatio-temporal (XYT) convolutional filters have been learned and applied to many activity recognition tasks [1, 2, 17, 19]. Large-scale datasets for activity detection, such as THUMOS [20], ActivityNet [21], Kinetics [22], and Charades [5], provided these approaches the necessary training data to learn the models.

Recently, segment-based 3D CNNs have been used to capture spatio-temporal information simultaneously for activity detection [23, 24, 25]. Shou et al. [25] used strided temporal convolution to downsample the temporal length, followed by temporal deconvolution for upsampling, to make dense predictions and better localize activity start and end times. Zhao et al. [26] use a binary ‘actionness’ classifier to propose many action segments per video and then classify each segment individually. However, these approaches all rely on the 3D CNN to capture temporal dynamics, and its input usually contains only 16 frames.

Some works studied capturing longer-term temporal structures [2, 16, 14, 27], but it was generally done with a temporal pooling of local representations or (spatio-)temporal convolutions with larger fixed intervals. Recurrent neural networks (RNNs) have also been used to model activity transitions between frames [6, 28, 29], but they were strictly sequential and had limitations in maintaining temporal information over a longer temporal duration, particularly for higher-level activities having a temporal structure.

Learning the hierarchical structure of human activities has been shown to benefit their recognition in many traditional works [7, 9, 30, 31, 32, 33]. The idea was to decompose each activity into multiple sub-events hierarchically, allowing models to capture long-term temporal structure in activity videos. The major limitation was that these traditional works were not learned in an end-to-end fashion. Recently, CNN models for end-to-end learning of latent activity sub-events and classifiers have been studied [3, 4]. However, these works limited their learning to a sub-event hierarchy having only one level: (1) the activity and (2) its (atomic) sub-events. That is, although a high-level activity (such as a human-human interaction) may be decomposed into many sub-events of multiple levels, prior CNN-based works either ignore such modeling or focus on just one level.

This paper proposes a new temporal convolutional layer designed to (hierarchically) learn temporal sub-event structure, and presents how it can be used for better detection of activities. Our layer is different from the previous standard (spatio-)temporal convolutional layers in that it explicitly tries to capture temporal structure while relying on significantly fewer parameters. Our temporal layer is also different from previous Gaussian Mixture Model layers [34] in that our layer is convolutional while they were not.

3 Approach

In this section, we introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer, and present how it can be used for activity recognition. Our temporal Gaussian mixture layer is a temporal convolutional layer to be applied on top of a sequence of representations (usually from frame-level or segment-level CNNs), whose filters/kernels are controlled by a set of (temporal) Gaussian distribution parameters. The motivation is to make each temporal Gaussian distribution specify (temporally) ‘where’ the sub-event is located with respect to the activity center, and make the activity be represented as a collection/mixture of such temporal Gaussians convolved with video features. Our layer is fully differentiable and trainable using standard backpropagation.

Our TGM layer differs from the standard temporal convolutional layers of learning 1-D (time) or 2-D (time-by-channel) filters in the following aspects:

  1. Instead of learning temporal convolution filters of arbitrary values, our filters are forced to have the form of a temporal Gaussian mixture shared across all frame-level feature channels. This allows the layer to explicitly reflect the concept of temporal sub-events, and rely on significantly fewer (fully differentiable) parameters.

  2. Our temporal Gaussian mixture layer handles multiple 3-D tensors internally to maintain frame-level feature channels in the input while producing multiple event-level channels in the output. It allows learning representations that have multiple output channels, which can be viewed as representations of different (latent) intermediate events. When multiple TGM layers are stacked, such intermediate events serve as candidate sub-events in the next TGM layer. This enables the learning of latent sub-event hierarchy.

The TGM layer allows the video representation at each time step to be constructed while focusing on different neighboring temporal regions, instead of only focusing on its local segment.

3.1 Temporal Gaussian Mixture layer

Figure 1: Example illustrating how our Temporal Gaussian Mixture layer is computed. Multiple ($M$) temporal Gaussian distributions are learned, and they are combined with the learned soft attention weights to form the temporal convolution filters. $L$ is the temporal length of the filter.

Our temporal Gaussian mixture layer takes a 3-D input with the dimensionality of $C_{in} \times D \times T$, where $C_{in}$ is the number of input channels, $D$ is the dimensionality of the representations from frame-level (or segment-level) CNNs, and $T$ is the time. Given such input, the TGM layer convolves it with $C_{out}$ filters/kernels, generating a $C_{out} \times D \times T$-dim representation as an output. $D$ is usually 1K or 4K and $T$ is the number of time steps, such as frames, in each video (i.e., it differs per video).

Our layer is composed of a set of $M$ Gaussians. Each Gaussian has 2 parameters: a center $\hat{\mu}$ and a width $\hat{\sigma}$. Each layer has additional hyper-parameters: the temporal duration, $L$, and the number of Gaussians to learn, $M$. We transform the learned center to lie between $0$ and $L$ and ensure that $\sigma^2$ is positive:

$$\mu = (L - 1) \cdot \frac{\tanh(\hat{\mu}) + 1}{2}, \qquad \sigma^2 = \exp(\hat{\sigma})^2$$

We use the above $\mu$ and $\sigma$ to construct the temporal Gaussian kernels. This acts as a strong sparsity constraint on the convolutional kernel as well as a drastic reduction of learnable parameters. We construct a temporal Gaussian mixture convolutional kernel as:

$$\hat{K}_{m,l} = \frac{1}{Z} \exp\left(-\frac{(l - \mu_m)^2}{2\sigma_m^2}\right)$$

where $Z$ is a normalization constant such that $\sum_{l} \hat{K}_{m,l} = 1$. This results in $\hat{K}$ being an $M \times L$ matrix.
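The kernel construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names `gaussian_kernels`, `mu_hat`, `sigma_hat`, and `length` are hypothetical, and the `tanh` and `exp` squashing functions are assumed here as one way to keep each center inside the filter and each variance positive.

```python
import math

def gaussian_kernels(mu_hat, sigma_hat, length):
    """Build one normalized temporal Gaussian kernel per (center, width) pair."""
    kernels = []
    for m_hat, s_hat in zip(mu_hat, sigma_hat):
        # Assumed squashing: tanh maps the raw center into [0, length - 1].
        mu = (length - 1) * (math.tanh(m_hat) + 1) / 2
        # Assumed squashing: exp keeps the variance strictly positive.
        var = math.exp(s_hat) ** 2
        row = [math.exp(-((l - mu) ** 2) / (2 * var)) for l in range(length)]
        z = sum(row)  # normalization constant so that each kernel sums to 1
        kernels.append([x / z for x in row])
    return kernels  # a list of M rows, each with `length` entries

# Three Gaussians over a filter of temporal length 5 (illustrative values).
k_hat = gaussian_kernels([-1.0, 0.0, 1.0], [0.0, 0.5, 0.0], length=5)
```

Each row of `k_hat` is a smooth, normalized bump, so only two numbers per Gaussian are learned regardless of the filter length.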

Instead of making the model learn a separate set of Gaussian distributions per activity class for its sub-event modeling, we take the approach of maintaining multiple Gaussian distributions shared across classes and obtain a Gaussian ‘mixture’ filter by learning soft attention weights. We learn a set of soft-attention weights per output channel $i$, $W \in \mathbb{R}^{C_{out} \times M}$. We create the soft-attention weights by applying the softmax function over the $M$ Gaussians, enforcing that each channel's weights sum to 1:

$$a_{i,m} = \frac{\exp(W_{i,m})}{\sum_{j=1}^{M} \exp(W_{i,j})}$$

Based on the temporal Gaussian distributions $\hat{K}_m$ and attention weights $a_{i,m}$, the temporal convolution filters of our TGM layer are computed as:

$$K_i = \sum_{m=1}^{M} a_{i,m} \hat{K}_m$$

This provides us convolutional filters having the form of mixtures of temporal Gaussians, controlled by $2 \cdot M + C_{out} \cdot M$ parameters (the centers and widths of the $M$ Gaussians plus the attention weights), instead of learning every filter value without any constraint as in standard temporal convolution. An overview of this process is shown in Fig. 1.
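To make the mixing step concrete, the following sketch combines a small set of shared, already-normalized Gaussian kernels into per-channel filters using softmax attention. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_filters(attn_logits, k_hat):
    """attn_logits: C_out x M raw weights; k_hat: M x L kernels. Returns C_out x L."""
    length = len(k_hat[0])
    filters = []
    for logits in attn_logits:
        a = softmax(logits)  # attention over the M shared Gaussians
        filters.append([
            sum(a[m] * k_hat[m][l] for m in range(len(k_hat)))
            for l in range(length)
        ])
    return filters

# Two shared kernels (M = 2, L = 3) and two output channels (illustrative values).
k_hat = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
filters = mixture_filters([[0.0, 0.0], [5.0, -5.0]], k_hat)
```

Because each filter is a convex combination of normalized kernels, it also sums to 1. The first channel weights both Gaussians equally, while the second channel is dominated by the first Gaussian.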

Single TGM layer - direct per-class activity modeling

Figure 2: Illustration of a TGM layer that directly models per-class sub-events. This layer learns a set of Gaussian mixtures that are convolved with the input channels.

The representation we obtain by applying our base CNNs to each frame (or local segment) has the dimensionality of $D$, and stacking them along the time axis provides us the representation with $1 \times D \times T$-dim. That is, in the case of using only one TGM layer to capture activity representations (i.e., modeling each activity with the sub-event hierarchy of a single level), our $C_{in}$ is fixed to 1 and $C_{out}$ is fixed to be the number of activity classes. This is the simplest case of our model, attaching one TGM layer on top of the $1 \times D \times T$ representation.

Our convolutional kernel, $K$, has a learned Gaussian mixture for each activity class. Given the video features $v$, a $D \times T$ matrix, each $K_c$ is a $1 \times L$ 2-D convolutional filter, and convolving this with $v$ provides us a representation with $C_{out}$ number of $D \times T$ responses since $C_{in}$ is 1 in this case. This per-class representation can then be used as input to a fully-connected layer for activity classification. For $c \in \{1, \ldots, C_{out}\}$:

$$s_c = v * K_c$$

Multiple TGM layers - grouped convolution

We generalize the above formulation to allow the TGM layers to be sequentially applied. The idea is to enable our model to capture a hierarchical temporal structure by having multiple levels of temporal layers. In this case, the input for each layer is $C_{in} \times D \times T$ dimensional, where the number of input channels is the number of output channels from the previous layer. Our kernels at each layer, $K$, are parameterized and learned as before.

By using grouped convolution with the number of groups set to $C_{in}$, we can efficiently separate the input into per-channel values and convolve each of them with the designated kernel, as shown in Fig. 2. For $i \in \{1, \ldots, C_{in}\}$,

$$s_i = f_i * K_i$$

Here, $f_i$ is a $D \times T$ matrix (the $i$-th channel of the layer input), where $D$ is the dimensionality of the feature and $T$ is the number of frames. $s_i$, the result of the per-channel convolution, is a $D \times T$ representation. We concatenate these representations along the channel axis, resulting in $S$, a $C_{out} \times D \times T$ representation (with $C_{out} = C_{in}$ in this grouped case). As this convolution results in the same output shape, we can stack these layers. Each layer is able to capture increasing temporal resolution, allowing the model to capture a latent hierarchy of sub-events.
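The grouped (per-channel) convolution can be illustrated with a small pure-Python sketch. It is written under stated assumptions: zero padding is used so the output keeps the input's temporal length, the operation is a cross-correlation (as in common deep learning frameworks), and the helper names `temporal_conv_same` and `grouped_tgm` are hypothetical.

```python
def temporal_conv_same(signal, kernel):
    """Slide a length-L kernel over a length-T sequence, zero-padded to keep length T."""
    t_len, k_len = len(signal), len(kernel)
    pad = k_len // 2
    out = []
    for t in range(t_len):
        acc = 0.0
        for l in range(k_len):
            idx = t + l - pad
            if 0 <= idx < t_len:  # positions outside the sequence count as zero
                acc += kernel[l] * signal[idx]
        out.append(acc)
    return out

def grouped_tgm(features, kernels):
    """features: C x D x T; kernels: C x L. Channel c is filtered only by kernels[c]."""
    return [
        [temporal_conv_same(row, kernels[c]) for row in features[c]]
        for c in range(len(features))
    ]

# Two channels, D = 1 feature dimension, T = 5 time steps (illustrative values).
features = [[[0, 0, 1, 0, 0]], [[1, 2, 3, 4, 5]]]
kernels = [[0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
out = grouped_tgm(features, kernels)
```

The same temporal kernel is shared across all `D` feature dimensions of a channel, and the output has the same `C x D x T` shape as the input, which is what allows such layers to be stacked.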

Each output channel of our TGM layer can be interpreted as a representation obtained by imposing a particular sub-event structure (i.e., mixture) on the input video representation. Such output channels directly correspond to the activity classes we want to detect in the case of having a single TGM layer. In the case of having multiple TGM layers stacked, the output channels from a lower TGM layer conceptually correspond to (latent) intermediate events that serve as inputs to the next layer. Such intermediate event representations are learned solely based on the training data, optimized for the final detection of the activities. Learning of the representations through multiple layers can be interpreted as the learning of a sub-event hierarchy having multiple levels.

Multiple TGM layers - channel combination

Figure 3: Illustration of a TGM layer with channel combination. The kernels are applied to each of the $C_{in}$ input channels, and a 1x1 convolution is applied to combine the input channels for each of the $C_{out}$ output channels.

In the above subsection, we introduced an approach of stacking multiple TGM layers to model a hierarchical composition of temporal representations. However, in the grouped convolution case, each output channel of the layer is solely dependent on its corresponding input channel. That is, each output channel of the layer only considers information from a single output channel of the previous layer. If we interpret each input channel as a representation of a learned event from the previous layer, this means that our temporal layer is only considering different temporal combinations of the same (sub-event) representation. This could lead our learning to end up obtaining a restricted form of temporal hierarchy structure.

Therefore, we further generalize our TGM layer so that the layer combines representations from multiple input channels for each output channel using the learned temporal kernels. This can be viewed as activity classes sharing multiple latent sub-events at every layer level. We learn a set of $C_{in} \times C_{out}$ convolutional kernels (i.e., we learn $C_{in} \cdot C_{out}$ Gaussian mixtures). Given $f$, the $C_{in} \times D \times T$ representation, for each output channel $j \in \{1, \ldots, C_{out}\}$ and each input channel $i \in \{1, \ldots, C_{in}\}$, we convolve the associated filter with the input:

$$g_{i,j} = f_i * K_{i,j}$$

where each $g_{i,j}$ is a $D \times T$-dim representation.

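We then learn a 1x1 convolution followed by the ReLU activation function for each output channel $j$, which we call $h_j$, that maps from $C_{in}$ channels to 1 channel. The 1x1 convolution learns to combine the previous sub-event structures and adds non-linearity to our layer. Applying $h_j$ to the $C_{in}$ convolved representations $g_{1,j}, \ldots, g_{C_{in},j}$ for output channel $j$ gives:

$$s_j = h_j\left(g_{1,j}, \ldots, g_{C_{in},j}\right)$$

We then stack the representations $s_j$ along the channel axis to produce $S$, the $C_{out} \times D \times T$-dim representation. This process is illustrated in Fig. 3. This method generalizes our approach to allow the layer to take input of $C_{in} \times D \times T$ and produce output of $C_{out} \times D \times T$. These layers can easily be stacked to learn a hierarchical representation.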
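The channel-combination variant can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: a 1x1 convolution is modeled as a per-output-channel weighted sum over input channels (no bias term is assumed), followed by ReLU, and all names (`temporal_conv_same`, `tgm_channel_combination`, `mix_weights`) are hypothetical.

```python
def temporal_conv_same(signal, kernel):
    """Slide a length-L kernel over a length-T sequence, zero-padded to keep length T."""
    t_len, k_len = len(signal), len(kernel)
    pad = k_len // 2
    return [
        sum(kernel[l] * signal[t + l - pad]
            for l in range(k_len) if 0 <= t + l - pad < t_len)
        for t in range(t_len)
    ]

def tgm_channel_combination(features, kernels, mix_weights):
    """features: C_in x D x T; kernels: C_in x C_out x L; mix_weights: C_out x C_in."""
    c_in, c_out = len(features), len(mix_weights)
    d, t_len = len(features[0]), len(features[0][0])
    out = []
    for j in range(c_out):
        channel = [[0.0] * t_len for _ in range(d)]
        for i in range(c_in):
            for r in range(d):
                conv = temporal_conv_same(features[i][r], kernels[i][j])
                for t in range(t_len):
                    # 1x1 convolution: weighted sum over the input channels
                    channel[r][t] += mix_weights[j][i] * conv[t]
        # ReLU non-linearity applied element-wise
        out.append([[max(0.0, x) for x in row] for row in channel])
    return out

# C_in = 2, C_out = 2, D = 1, T = 3, identity kernels (illustrative values).
features = [[[1, 2, 3]], [[3, 2, 1]]]
identity = [0, 1, 0]
kernels = [[identity, identity], [identity, identity]]
out = tgm_channel_combination(features, kernels, [[1.0, 1.0], [1.0, -1.0]])
```

With identity kernels, the first output channel is the sum of the two inputs and the second is their rectified difference, which shows how each output channel can draw on every input channel rather than only its own.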

3.2 Video CNN models with TGM layers

Figure 4: An overview of an example video CNN model with two TGM layers. It is able to handle videos with any length, because of its fully convolutional design.

Our goal is to do activity detection, which we define as making a per-frame (or per-segment) classification. Given a video, at each time step $t$, we want to make the model decide which activity the frame corresponds to (including no-activity). As a baseline, we train a fully-connected layer that classifies each per-frame $D$-dimensional vector, $v_t$. As multiple activities can occur at the same time, or no activities at all, we treat this as a multi-label classification task. We minimize binary cross entropy:

$$L(v) = -\sum_{t,c} \left[ z_{t,c} \log p(c \mid v_t) + (1 - z_{t,c}) \log\left(1 - p(c \mid v_t)\right) \right]$$

where $z_{t,c}$ is the ground truth label, 1 if activity $c$ is occurring at time $t$, and $p(c \mid v_t)$ is the output of our model for class $c$ at time $t$.
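As a small worked example of the multi-label objective, the sketch below sums the binary cross entropy over every (time step, class) pair. Whether the loss is summed or averaged, and the clamping constant `eps`, are assumptions made for this illustration; `per_frame_bce` is a hypothetical name.

```python
import math

def per_frame_bce(probs, labels, eps=1e-7):
    """probs, labels: T x C lists. Returns the summed binary cross entropy."""
    total = 0.0
    for p_t, z_t in zip(probs, labels):
        for p, z in zip(p_t, z_t):
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total -= z * math.log(p) + (1 - z) * math.log(1 - p)
    return total

# One frame, two classes: class 0 is active, class 1 is not (illustrative values).
loss = per_frame_bce([[0.5, 0.5]], [[1, 0]])
```

Each class is scored independently, so a frame can carry several active labels or none at all, which matches the multi-label setting described above.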

We apply our TGM layers to the video features to obtain the sub-event representation. We then classify each representation, so that the model output for class $c$ at time $t$ becomes $p(c \mid G(v)_t)$, where $G$ is the function applying our TGM layers to the feature sequence $v$, making the classification depend on a temporal interval. Our layer is fully differentiable, and we minimize the loss with standard backpropagation.

4 Experiments

4.1 Implementation Details

As our base per-segment CNN, we use the I3D [2] network pretrained on the ImageNet and Kinetics [22] datasets. I3D obtained state-of-the-art results on segmented video tasks, and this allows us to obtain reliable . We also use two-stream version of InceptionV3 [10] pretrained on Imagenet and Kinetics as our base per-frame CNN, and compared them. We chose InceptionV3 as it is deeper than previous two-stream CNNs such as [11, 18]. We extracted frames from the videos at 25 fps, computed TVL1 [35] optical flow, clipped to . For InceptionV3, we computed features for every 3 frames (8 fps). For I3D, every frame was used as the input. I3D has a temporal stride of 8, resulting in 3 features per second (3 fps).

We implemented our TGM layers as well as other baseline layers in PyTorch. We set for frame-based features (i.e., InceptionV3) and for segment-based features (i.e., I3D), as each segment already contains some temporal information. We found these values to work well on a held out portion of the training set of MultiTHUMOS. In all models, we used one fully-connected layer at the end to make the per-frame or per-segment classification.

We trained our models using the Adam [36] optimizer with the learning rate set to 0.01. We decayed the learning rate by a factor of 0.1 after every 10 training epochs. We trained our models for 50 epochs. We plan to make all our source code and trained models publicly available once the paper is published.
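The step decay described above can be written out explicitly. This sketch assumes epochs are counted from zero and that the decay is applied exactly at epochs 10, 20, 30, and 40; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, gamma=0.1, step=10):
    """Step decay: multiply the base rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(e) for e in range(50)]
```

Under these assumptions the rate falls from 0.01 in the first ten epochs to roughly 1e-6 in the last ten.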

4.2 Baselines

In order to confirm the advantages of our TGM layers particularly against previous temporal models, we implemented several baselines. Our first baseline is (i) a standard per-frame classifier in which the prediction at each time-step only depends on a single feature vector with no contextual temporal information. We also used (ii) LSTMs on top of per-frame representations, which were popularly used to capture temporal information [37]. We train a bi-directional LSTM with 512 hidden units to make per-frame predictions. We also tried (iii) the fixed pyramid temporal max-pooling [8]. We split each local video segment into different lengths (1/2, 1/4, 1/8) and pool each sub-interval. This is applied in a sliding-window fashion to the input frames. The window length was set to 5 segment features or 10 frame features, making the model consider multiple frames. We then concatenate the pooled sub-intervals to form a representation used for classification. Finally, we compare our model against (iv) the model with standard temporal convolutional layers (i.e., 1-d convolution with a kernel) on top of per-frame representations. In all our experiments, we follow the standard evaluation setting of computing per-frame mean average precision (mAP) and report those values.
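For readers unfamiliar with the metric, the following is a minimal sketch of per-frame mean average precision. It is a generic formulation, not the official evaluation script of any dataset: classes without positive frames are skipped by assumption, ties are broken by frame order, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision values at each positive frame."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None

def per_frame_map(scores, labels):
    """scores, labels: T x C lists. Averages AP over classes with positives."""
    aps = []
    for c in range(len(scores[0])):
        ap = average_precision([row[c] for row in scores],
                               [row[c] for row in labels])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps) if aps else 0.0

# Three frames, two classes (illustrative values).
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
labels = [[1, 0], [0, 1], [1, 1]]
m_ap = per_frame_map(scores, labels)
```

Here the first class has an AP of 5/6 (one positive frame is ranked last) and the second class has an AP of 1, so the mean is 11/12.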

In addition, we also tried the approach of combining our TGM layers with the recent super-event representations from [4]. This allows our model to consider not only hierarchical sub-event learning but also global video context information to make the per-frame activity decisions. We concatenated the learned super-event representation with our representations from TGM layers.

4.3 MultiTHUMOS


MultiTHUMOS [6] is an extended version of the THUMOS [20] dataset that densely annotates the continuous videos. The dataset consists of 65 different classes, compared to 20 in THUMOS, and contains on average 10.5 activities per video and 1.5 labels per frame, with up to 25 activity instances in each video. This is in contrast to many other activity detection datasets such as ActivityNet [21], which has on average only 1 activity per video. MultiTHUMOS consists of YouTube videos of various sport activities such as basketball games, volleyball games, weight lifting, and track and field.

We followed the standard MultiTHUMOS evaluation setting. There are 1010 validation videos and 1574 test videos. We used these continuous validation videos for the training of our models. Although there is a separate training set with segmented videos, we did not need to take advantage of it: even without using the (segmented) training data, we were able to outperform the state of the art, as we describe below.

                      I3D                               InceptionV3
Model                 Spatial  Temporal  Two-Stream     Spatial  Temporal  Two-Stream
Baseline              29.8     31.2      32.1           13.6     14.1      15.2
Temporal Conv         32.5     35.5      38.4           15.2     15.5      15.8
3 Temporal Conv       20.4     23.4      24.4           5.3      6.1       6.5
TGM layers with grouped convolution
1 TGM                 35.1     37.8      40.5           16.3     17.5      18.0
3 TGM                 36.4     42.3      43.5           17.5     18.3      19.2
TGM layers with channel combination
1 TGM (soft)          35.2     37.9      40.2           17.2     17.6      18.4
1 TGM (1x1)           36.1     38.2      40.8           17.2     17.7      18.4
3 TGM (soft)          36.2     40.1      42.3           17.5     19.1      21.2
3 TGM (1x1)           37.2     42.1      44.3           17.9     19.3      22.2
Table 1: Comparison of various architectures on MultiTHUMOS (per-frame mAP) using both I3D per-segment and InceptionV3 per-frame features. We found that TGM layers with 1x1 convolution channel combination performed the best.


To confirm the effectiveness of our TGMs, we compared baselines as well as multiple different versions of our architectures, shown in Table 1. Importantly, we found that our efficient grouped convolution version of our TGM layer improves the performance over the baseline I3D (or InceptionV3) while using the same per-segment representations. Learning 3 TGM layers further improved the performances. On the other hand, we found that stacking multiple standard convolutional layers does not improve performance, often performing worse than the baseline. While a single standard temporal convolutional layer improves over the baseline, having multiple of them significantly increases the number of parameters to learn (Table 2) and we suspect that this was causing the overfitting with the limited amount of data we have.

Model               # of parameters
LSTM                10.5M
1 Temporal Conv     10.5M
3 Temporal Conv     31.5M
1 TGM Layer         10K
3 TGM Layers        100K
Table 2: Additional number of parameters for models when added to the base architecture (e.g., I3D or InceptionV3).

We were able to confirm that learning multiple TGM layers with channel combination outperforms the grouped convolution version of TGM and all the baselines. We also experimented with a version that uses soft-attention weights to combine the TGM layer channels, in addition to our method (Fig. 3) of using 1x1 convolution followed by a ReLU (to gain non-linearity). We found that the 1x1 convolution performed better. We tested various numbers of Gaussian mixtures (i.e., output channels) and found that using 80 for the first and second layers and 65 (i.e., the number of classes) for the final layer performs best.

Table 3 compares our model with TGM layers against multiple previous state-of-the-art approaches and baselines such as LSTM. Our approach meaningfully outperforms all previous approaches. Importantly, we are comparing our approach with different methods of capturing temporal information, such as LSTMs and fixed temporal pyramid pooling, while making them use exactly the same per-frame representations. We found that while all these methods capture some temporal information, the TGM layers provide the best performance. Further, combining the super-event representation [4] with our TGM feature also benefited detection, confirming that our TGMs and super-events capture different aspects of the activity videos. In Fig. 5, we show an example of the various models' predictions on a basketball video.

Compared to the previous state of the art (i.e., [4]), our approach obtains an mAP that is higher by 10 percentage points (46.4 vs. 36.4).

Figure 5: Temporal regions classified as various basketball activities from a basketball game video in MultiTHUMOS. Our TGM layers greatly improve performance.
Method                                  mAP
Two-stream [6]                          27.6
Two-stream + LSTM [6]                   28.1
Multi-LSTM [6]                          29.6
Predictive-corrective [38]              29.7
I3D baseline                            29.7
I3D + LSTM                              29.9
I3D + temporal pyramid                  31.2
I3D + super-events [4]                  36.4
I3D + our TGMs                          44.3
I3D + super-events [4] + our TGMs       46.4
Table 3: Performances of the state-of-the-art methods and our approach on MultiTHUMOS. Our approach meaningfully outperforms all previous results.

4.4 MLB-YouTube Dataset

Figure 6: Examples of several of the activities in the MLB-YouTube dataset: (a) Pitch, (b) Hit, (c) Bunt, (d) Hit by pitch, (e) No activity. This shows the difficulty of this dataset, as the differences between hit and bunt, or between swing and no swing, are very small.


We created a new large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube. This dataset consists of over 42 hours of video. We extracted 2,126 clips, each 1-2 minutes long, from the video. We densely annotated each clip with the baseball activities that occur. We chose 8 activity classes: pitch, strike, ball, swing, hit, foul, hit by pitch, and bunt. Examples of some of these classes are shown in Fig. 6. Each continuous clip contains on average 7.2 activities, giving a total of over 15,000 activity instances in the dataset. We will release this dataset upon publication.

What makes this dataset challenging is that the variation between classes is very small. In ActivityNet [21], for example, the difference between swimming and brushing hair is drastic: the background, motion, and even the size of the person in the video are different. However, in broadcast baseball videos, the difference between a ball and a strike, or a swing and a bunt, is small. All actions are recorded from the same camera angle, as we can confirm from Fig. 6.


In Table 4, we compare various approaches on this dataset. Our TGM layers improve over the baseline by about 6 percentage points of mAP (40.1 vs. 34.2). Additionally, we compare to methods using the super-event representation [4], which previously achieved state-of-the-art performance on several activity detection datasets. On this dataset, our approach outperforms [4], and the concatenation of our TGM representation with such super-event representation performs best by a significant margin (about 13 percentage points over the baseline). This suggests that TGMs and super-events capture different temporal information and are both useful for the detection task.

Model                                   Spatial  Temporal  Two-stream
Random                                  13.4     13.4      13.4
InceptionV3                             31.2     31.8      31.9
InceptionV3 + LSTM                      32.1     33.5      34.1
InceptionV3 + super-events              31.5     36.2      39.6
InceptionV3 + 1 TGM                     32.4     36.3      37.4
InceptionV3 + 3 TGM                     33.2     38.2      38.2
InceptionV3 + 3 TGM + super-events      34.6     42.4      42.9
I3D                                     33.8     35.1      34.2
I3D + LSTM                              36.2     37.3      39.4
I3D + super-events                      38.7     38.6      39.1
I3D + 1 TGM                             35.5     37.5      38.5
I3D + 3 TGM                             36.5     38.4      40.1
I3D + 3 TGM + super-events              39.4     46.0      47.1
Table 4: Results on the MLB-YouTube dataset using InceptionV3 [10] and I3D [2] to obtain features. Our TGM layers significantly outperform the baseline models.

4.5 Charades


Charades [5] is a large-scale dataset with 9848 videos across 157 activity classes. These videos were recorded in the home environments of the participants based on provided scripts. Each video contains an average of 6.8 activity instances, and there are often complex activities co-occurring. The activities were mainly performed at home, and example activity classes include ‘preparing a meal’, ‘eating’, ‘sitting’, ‘cleaning’, etc.

In our experiments, we follow the original Charades detection setting (i.e., charades_v1_localize evaluation), which is the setting used in many previous approaches [39, 23, 2, 4]. This original setting is more challenging than the Charades Challenge 2017 setting (whose evaluation server is no longer available), in that it uses fewer training videos. Similar to MultiTHUMOS, the mean average precision (mAP) of per-frame annotations is measured for the performance evaluation. We fine-tuned I3D on Charades based on the code that won the 2017 Charades challenge.


We compare our results with the state of the art in Table 5. To our knowledge, our method obtains the best known performance in the original localization setting of the Charades dataset. Notably, it performs better than I3D, which obtained the best competition performance in 2017, while using the same features. Our method also outperforms standard temporal convolution, LSTMs, and fixed pyramid pooling, as well as the use of latent super-events [4].

Random [39] 2.42
RGB [39] 7.89
Predictive-corrective [38] 8.9
Two-stream [39] 8.94
Two-stream+LSTM [39] 9.6
Sigurdsson et al. [39] 12.1
R-C3D [23] 12.7
I3D baseline 17.2
I3D + 3 temporal conv. layers 17.5
I3D + LSTM 18.1
I3D + fixed temporal pyramid 18.2
I3D + super-events [4] 19.4
I3D + 3 TGMs 20.6
I3D + 3 TGMs + super-events 21.8
Table 5: Results (mAP) on the Charades dataset with the original test setting (i.e., the charades_v1_localize setting). Note that this setting differs slightly from the Charades Challenge 2017 competition setting in that it uses less training data.

5 Conclusions

We introduced the temporal Gaussian mixture (TGM) layer and demonstrated its effectiveness for multi-activity detection in continuous videos. Our layer is fully differentiable, trainable using standard backpropagation, and able to capture temporal dynamics. We confirmed that it outperforms state-of-the-art methods on different activity detection datasets such as MultiTHUMOS and Charades. Additionally, we introduced a challenging new activity detection dataset, MLB-YouTube, and confirmed that our TGM layers meaningfully benefit detection.


  1. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3d: generic features for video analysis. CoRR, abs/1412.0767 2(7) (2014)
  2. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  3. Piergiovanni, A., Fan, C., Ryoo, M.S.: Learning latent sub-events in activity videos using temporal attention filters. In: Proceedings of the American Association for Artificial Intelligence (AAAI). (2017)
  4. Piergiovanni, A., Ryoo, M.S.: Learning latent super-events to detect multiple activities in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
  5. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: Crowdsourcing data collection for activity understanding. In: Proceedings of European Conference on Computer Vision (ECCV). (2016)
  6. Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., Fei-Fei, L.: Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision (IJCV) (2015) 1–15
  7. Niebles, J.C., Chen, C.W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: Proceedings of European Conference on Computer Vision (ECCV), Springer (2010) 392–405
  8. Ryoo, M.S., Rothrock, B., Matthies, L.: Pooled motion features for first-person videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 896–904
  9. Gaidon, A., Harchaoui, Z., Schmid, C.: Actom sequence models for efficient action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2011) 3201–3208
  10. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 2818–2826
  11. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (NIPS). (2014) 568–576
  12. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. ACM Computing Surveys 43 (April 2011) 16:1–16:43
  13. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2008) 1–8
  14. Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2015) 4694–4702
  15. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2011) 3169–3176
  16. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014) 1725–1732
  17. Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)
  18. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 1933–1941
  19. Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3d residual networks for action recognition. In: Proceedings of the ICCV Workshop on Action, Gesture, and Emotion Recognition. Volume 2. (2017)
  20. Jiang, Y.G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: Action recognition with a large number of classes. (2014)
  21. Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 961–970
  22. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  23. Xu, H., Das, A., Saenko, K.: R-c3d: Region convolutional 3d network for temporal activity detection. arXiv preprint arXiv:1703.07814 (2017)
  24. Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 1049–1058
  25. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. arXiv preprint arXiv:1703.01515 (2017)
  26. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. arXiv preprint arXiv:1704.06228 (2017)
  27. Varol, G., Laptev, I., Schmid, C.: Long-term Temporal Convolutions for Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017)
  28. Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 2678–2687
  29. Escorcia, V., Heilbron, F.C., Niebles, J.C., Ghanem, B.: Daps: Deep action proposals for action understanding. In: Proceedings of European Conference on Computer Vision (ECCV), Springer (2016) 768–784
  30. Laxton, B., Lim, J., Kriegman, D.: Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2007) 1–8
  31. Gupta, A., Srinivasan, P., Shi, J., Davis, L.S.: Understanding videos, constructing plots learning a visually grounded storyline model from annotated videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2009) 2012–2019
  32. Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE (2011) 778–785
  33. Ryoo, M.S., Matthies, L.: First-person activity recognition: What are they doing to me? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2013)
  34. Variani, E., McDermott, E., Heigold, G.: A gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, IEEE (2015) 4270–4274
  35. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l 1 optical flow. In: Joint Pattern Recognition Symposium, Springer (2007) 214–223
  36. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  37. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 2625–2634
  38. Dave, A., Russakovsky, O., Ramanan, D.: Predictive-corrective networks for action detection. arXiv preprint arXiv:1704.03615 (2017)
  39. Sigurdsson, G.A., Divvala, S., Farhadi, A., Gupta, A.: Asynchronous temporal fields for action recognition. arXiv preprint arXiv:1612.06371 (2016)