Fine-grained Video Classification and Captioning

Farzaneh Mahdisoltani1,2, Guillaume Berger2, Waseem Gharbieh2, David Fleet1, Roland Memisevic2
University of Toronto1, Twenty Billion Neurons2
1{farzaneh, fleet}, {firstname.lastname}

We describe a DNN for fine-grained action classification and video captioning. It gives state-of-the-art performance on the challenging Something-Something dataset, which contains a large number of videos spanning 174 fine-grained actions. Classification and captioning on this dataset are challenging because of the subtle differences between actions, the use of thousands of different objects, and the diversity of captions penned by crowd actors. The model architecture shares features for classification and captioning, and is trained end-to-end. It performs much better than the existing classification benchmark for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task. Our results reveal a strong correlation between the degree of detail in the task and the ability of the learned features to transfer to other tasks.

Keywords: Video Understanding, Fine-grained Video Classification, Video Captioning, Common Sense, Something-Something Dataset.

1 Introduction

Common-sense video understanding entails fine-grained recognition of actions, objects, spatial and temporal relations, as well as physical interactions, arguably well beyond the capabilities of current techniques. A general framework will need to discriminate myriad variations of actions and interactions, not unlike the emergence of fine-grained tasks in visual object recognition. For example, we need to be able to discriminate similar actions that differ in relatively subtle ways, for instance, ‘putting a pen beside the cup’, ‘putting the pen in the cup’, or perhaps ‘pretending to put the pen on the table’. Similarly, one must be able to cope with the diversity of the actors in a scene and the objects involved in the action. Such generality and fine-grained discrimination are key challenges to video understanding.

In contrast to most current approaches to action recognition, and to video captioning techniques applied to small corpora with relatively coarse-grained actions, this paper considers fine-grained action classification and captioning tasks on large-scale video corpora. Training is performed on the Something-Something dataset [1], with 174 fine-grained action categories and several thousand different objects. In particular, the captioning task requires the inference of many actions, different forms of object interaction and spatio-temporal relations, and an extremely broad set of objects, all under significant variations in lighting, viewpoint, background clutter, and occlusion.

This paper describes a deep neural network architecture comprising a two-channel convolutional network and an LSTM recurrent network for video encoding. The same encoding is then shared for action classification and caption generation. The resulting network performs several times better than baseline action classification results in [1]. It also provides impressive results on an extremely challenging fine-grained video captioning task. We also demonstrate the quality of learned features through transfer learning from Something-Something features to a kitchen action dataset.

2 Video Data

Video action classification and captioning have received significant attention for several years. Nevertheless, progress has lagged behind related work on static images, in part because of the lack of large-scale corpora. Using web sources (e.g., YouTube, Vimeo and Flickr) and human annotators, larger datasets have been collected in recent years. Examples include Kinetics [2] and Moments in Time [3]. The limitation of these corpora stems from the lack of control over variations in pose, motion and other scene properties that might be important for learning fine-grained models.

More recently, crowd-sourced data based on crowd-acting have emerged. Crowd workers are asked to generate videos depicting template actions. This allows one to target specific video domains and action classes, with control over the similarity and differences of actions, which is needed for fine-grained corpora. Examples include the Something-Something dataset [1] and the Charades dataset [4].

The first version of Something-Something [1] has videos of human-object interactions, comprising coarse-grained action groups, which are further broken down into closely related action categories. The videos exhibit significant diversity in viewing and lighting, objects and backgrounds, and the ways in which the actors performed the actions. Baseline performance in [1] was a correct action classification rate of , and on action groups. [5] reports classification accuracy on Something-Something action categories.

The second version of Something-Something is larger, with videos for the same action categories. In addition to action labels, each video in this version includes a caption that was authored and uploaded by the crowd actor. These captions incorporate the action class as well as descriptions of the objects involved in the action. That is, the captions mirror the action template, but with the generic placeholder Something replaced by the specific object(s) chosen by the actor. As an example, a video with template action ’Putting [something] on [something]’, might have a caption ’Putting a red cup on a blue plastic box’.

In a nutshell, this dataset provides different levels of granularity: coarse-grained action groups, fine-grained action categories, and even more fine-grained labeling via video captions.

3 Video Classification and Captioning Tasks

Action Classification:

Video-based action classification dates back to seminal work by Laptev et al. [6] with hand-tuned features, while most recent approaches have focused on DNN features. Due to the prevalence of datasets like UCF-101 [7], Sports-1M [8] and more recently Kinetics [2], most research on action classification has been biased towards models that encode rough, coarse-grained properties of the scene. In extreme cases, action classes can be discriminated from a rough glimpse of the scene, often encoded in isolated frames; e.g., inferring “soccer play” from the presence of a green soccer field.

However, even when motion is essential in order to distinguish action classes, many existing approaches do well by encoding rough statistics over velocities, directions, and motion positions. Little work has been devoted to the task of representing details of object interactions or how their configuration changes over time. One notable exception is the work by Wu [9], where the goal is to extract physical object properties from videos of physical experiments.

Video Captioning:

Image and video captioning have received significant attention since the release of large-scale captioning corpora, notably, Microsoft COCO [10] and MSR-VTT [11]. One problem with existing approaches to captioning (for images or videos) is that many datasets implicitly allow models to “cheat”, e.g., by generating phrases that are grammatically and semantically coherent, but only loosely related to the fine-grained visual structure that models purport to represent. It has been shown, for example, that a language model trained on unordered, individual words (e.g., object nouns) predicted by a separate NN can compete with a captioning model trained on the actual task end-to-end (e.g., [12, 13]). Similarly, nearest neighbor methods have been surprisingly effective for existing captioning tasks [14].

In principle, captioning tasks, if designed appropriately, could represent extremely detailed scene properties. One can therefore conclude that the above problems are more a function of the task and datasets than the specific architectures. Labels with more subtle and fine-grained distinctions would directly expose the ability (or inability) of a network to correctly infer the scene properties encoded in the captions.

The captions provided with the Something-Something dataset are designed to be sufficiently fine-grained that attention to detail is needed to yield high prediction accuracy. For example, to generate correct captions, networks must not only infer the correct actions, but also learn to recognize different objects and their properties, as well as object interactions.

4 Approach

Most existing captioning architectures are based on an encoder-decoder framework. For video captioning (e.g., [15, 16, 17]) the encoder is typically a convolutional or recurrent convolutional network.

Inspired by these works, we use a modified encoder-decoder architecture that has an action classifier in addition to the encoder and decoder components (see Fig. 1). The decoder or classifier can be switched off, leaving pure classification or captioning models respectively. It is also possible to jointly train classification and captioning models.

Figure 1: Our model architecture includes a two-channel CNN followed by an LSTM video encoder, an action classifier, and an LSTM decoder for caption generation.

4.1 Modified Video Encoder-Decoder

In our model, the encoder receives the input video $v$ and maps it to an embedded representation $h$. Conditioned on $h$, a caption decoder generates the caption $c$, and a classifier predicts the action category. We describe each component in more detail below.

The video encoder, inspired in part by magno- and parvo-cellular pathways in visual cortex, first processes the video through a two-channel convolutional architecture (Fig. 1). A spatial 2D-CNN and a spatio-temporal 3D-CNN are applied in parallel to the input video. Both CNN channels are VGG-like convolutional networks. The 3D-CNN features are used in lieu of a separate module to compute motion (e.g., optical flow) features. The basic building block of each channel is a ( in 2D-CNN channel) convolution filter with batchnorm [18] and ReLU activation. We interleave the convolutional layers with spatial max-pooling.

In order to aggregate features across time, feature vectors from each channel are combined and fed to a 2-layer bidirectional LSTM. Because there is no temporal pooling in the CNN channels, the number of feature vectors produced by the final LSTM layer is equal to the number of input frames. We average these features to get an encoding of the entire video, $h$. This encoding is used by both the classifier and the captioning decoder.

The action classifier applies a fully-connected layer to the encoder output $h$, followed by a softmax layer. For training we use a cross-entropy loss over the action categories.

The caption decoder is a two-layer LSTM. Much like conventional encoder-decoder methods for video captioning [16, 15] and machine translation [19], our decoder generates captions using a softmax over the vocabulary words, conditioned on previously generated words. The loss used for a caption is the usual negative log-probability of the word sequence:

$$\mathcal{L}_{\text{caption}} = -\sum_{i=1}^{N} \log p\left(w_i \mid w_1, \dots, w_{i-1}, h; \theta\right),$$

where $w_i$ denotes the $i$-th word of the caption, $N$ is the caption length, $h$ is the video encoding, and $\theta$ denotes model parameters. In order to optimize speed and memory usage during training, the length of captions generated by the decoder is fixed at 14 words (only a small fraction of Something-Something captions have more than 14 words). As is common for encoder-decoder networks, we train using teacher forcing [20]. At test time, the input to the decoder at each time-step is the token generated at the previous time-step (i.e., teacher forcing is not used).
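As a rough illustration of the caption loss under teacher forcing, the following sketch sums the negative log-probabilities that a decoder assigns to the ground-truth words. The function name `caption_nll` and the list-of-distributions input format are assumptions made for this example; they are not part of the model described here.

```python
import math


def caption_nll(step_probs, target_ids):
    """Negative log-probability of a caption.

    step_probs[i] is the decoder's probability distribution over the
    vocabulary at step i, assumed to have been computed with the
    ground-truth previous words as input (teacher forcing).
    target_ids[i] is the vocabulary index of the i-th ground-truth word.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution per target word is required")
    return -sum(math.log(probs[word]) for probs, word in zip(step_probs, target_ids))


# Toy example with a hypothetical two-word vocabulary and a two-word caption.
probs = [[0.5, 0.5], [0.25, 0.75]]
print(caption_nll(probs, [0, 1]))
```

In practice the distributions would come from the decoder's softmax outputs, and the loss would be averaged over a mini-batch.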

The model is trained end-to-end for classification and captioning with a weighted sum of the classification and captioning losses, i.e.,

$$\mathcal{L} = \lambda \, \mathcal{L}_{\text{classification}} + (1 - \lambda) \, \mathcal{L}_{\text{caption}}.$$

The hyperparameter $\lambda \in [0, 1]$ balances the two terms; with $\lambda = 1$ or $\lambda = 0$, we end up with pure classification and captioning tasks respectively. For other values of $\lambda$, both tasks are trained jointly. The encoder is shared by the action classifier and the caption decoder. The experiments below also compare this joint training regime with models for which the encoder is trained on the classification loss or the captioning loss alone.
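The weighted combination of the two losses can be sketched as below. The name `joint_loss` and the convention that the weight `lam` multiplies the classification term are assumptions for this illustration; the convention is chosen to agree with Section 5.3.2, where training starts from a pure classification loss and the weight is then decreased.

```python
def joint_loss(classification_loss, caption_loss, lam):
    """Weighted sum of the two task losses.

    lam = 1 gives pure classification and lam = 0 gives pure captioning
    (assumed convention, see the lead-in above).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * classification_loss + (1.0 - lam) * caption_loss


# Hypothetical loss values, for illustration only.
print(joint_loss(2.0, 4.0, 0.5))
```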

Related work:

Existing action classification methods differ in the way they aggregate information through time. Many approaches to video classification rely primarily on spatial features with CNNs applied to individual frames (e.g., see [8]). While such features have been successful for image classification and object recognition, they fail to exploit the available temporal information in video.

Other approaches make use of temporal and spatial information. For instance, [8, 21, 22] use a 3D-CNN without recurrent connections. Recent methods involve ImageNet-pretrained 2D-CNNs which are inflated into 3D [23, 24]. In contrast, our video encoder is most closely related to approaches that perform temporal reasoning via a recurrent convolutional architecture [25, 26, 27]. It is also related to two-stream architectures [28], but our model does not explicitly use optical flow, opting instead for generic 3D-CNN features.

5 Experiments

We train networks for captioning and classification, and further evaluate the learned features on other related tasks.

Training settings

In all our experiments we use frame rate of . During training we randomly pick consecutive frames. For videos with less than frames, we replicate the first and last frames to achieve the intended length. We resize the frames to , and then use random cropping of size . For validation and testing, we use center cropping. We optimize all models using Adam, with an initial learning rate of .
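The clip sampling step described above could look roughly like the following sketch. The function name `sample_clip`, the `num_frames` parameter, and the choice to split the padding evenly between the start and the end of a short video are assumptions; the text only states that the first and last frames are replicated.

```python
import random


def sample_clip(frames, num_frames, rng=random):
    """Return num_frames consecutive frames, padding short videos.

    Long videos: a random window of consecutive frames.
    Short videos: the first and last frames are replicated
    (padding split between both ends is an assumption).
    """
    if not frames:
        raise ValueError("video has no frames")
    n = len(frames)
    if n >= num_frames:
        start = rng.randrange(n - num_frames + 1)
        return list(frames[start:start + num_frames])
    missing = num_frames - n
    pad_front = missing // 2
    pad_back = missing - pad_front
    return [frames[0]] * pad_front + list(frames) + [frames[-1]] * pad_back


# Frames are stood in for by integers here.
print(sample_clip([1, 2, 3], 6))
```

At validation and test time one would presumably replace the random window by a deterministic one, in the same spirit as the center cropping used for the spatial dimensions.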

5.1 Exploring architectures for classification

As a baseline, we use ImageNet-pretrained models [29, 30] on individual frames, to which we then add additional layers. For the first baseline, we use just the middle frame of the video, with a classifier comprising a 2-layer MLP with 1024 hidden units. We also consider a baseline in which we apply this approach to all frames, after which we average the frame-by-frame predictions.

We experiment with aggregating information over time using an LSTM layer with 1024 units. We report results in Table 1. There is a marked improvement with the LSTM, confirming that this task requires some form of temporal analysis. Our best baseline result was achieved with a VGG architecture, and the test accuracy is close to the best architecture reported to date on Something-Something V1 (e.g., [5]).

Test Accuracy
VGG16 + MLP 1024 (single middle frame)
VGG16 + MLP 1024 (averaged over 48 frames)
VGG16 + LSTM 1024
ResNet152 + MLP 1024 (single middle frame)
ResNet152 + MLP 1024 (averaged over 48 frames)
ResNet152 + LSTM 1024 (48 steps)

Table 1: Classification results on 174 action categories using VGG16 and ResNet152 as frame encoders, along with different strategies for temporal aggregation. MLP and LSTM models pass the frame features through 2-layer MLP and LSTM respectively. In both cases we use 1024 hidden units before producing predictions.

The performance of our model is reported in Table 2. For the pure classification task (i.e., with the captioning loss switched off during training), we consider models with different numbers of features produced by the 2D (spatial) CNN and the 3D (spatio-temporal) CNN. The total number of features is capped at 512 in all cases. Interestingly, we find the best performance occurs when the 2D and 3D channels have similar numbers of features.

To the best of our knowledge, these results are state-of-the-art test accuracy on Something-Something, and are several times better than the initial performance reported in [1]. The results also illustrate that the model benefits from combining 2D and 3D features. Although the performance is relatively stable for different numbers of 2D and 3D features, the even split M(256-256), with 256 features from each channel, provides a good trade-off between performance and model complexity. We therefore use this model below, unless otherwise specified.

Model        2D-CNN features   3D-CNN features   Parameters   Validation accuracy   Test accuracy
M(256-0)     256               0                 8.9M         50.06                 48.84
M(512-0)     512               0                 24.1M        51.96                 51.15
M(384-128)   384               128               16.2M        51.11                 49.96
M(256-256)   256               256               11.5M        51.62                 50.44
M(128-384)   128               384               10.0M        50.82                 49.57
M(0-512)     0                 512               5.8M         39.78                 37.80
M(0-256)     0                 256               11.5M        40.2                  38.83

Table 2: Validation and test accuracy for the pure classification task, with different numbers of 2D-CNN and 3D-CNN features used for video encoding. A model M(a-b) uses a 2D-CNN features and b 3D-CNN features.

5.2 Coarse- vs Fine-Grained Classification

Something-Something provides coarse-grained categories called action groups, which comprise disjoint sets of fine-grained actions. To evaluate how well we discriminate fine-grained vs coarse-grained actions, we also trained a model on coarse-grained action groups, using the M(256-256) architecture. As one would expect, classification accuracy is higher, at 57.6% (see Table 3, top-left).

An alternative way to perform coarse-grained classification is to map predictions from the fine-grained classifier onto the action groups. This can be achieved by summing the probabilities of all fine-grained actions belonging to each action group. Interestingly, the resulting accuracy on coarse-grained action groups increases markedly, to 62.5% (Table 3). Such increased performance suggests that fine-grained training provides higher fidelity features.
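The mapping from fine-grained predictions to action groups can be sketched as follows. The action and group names are invented for this example, and the dictionary-based interface is an assumption made for readability.

```python
def group_probabilities(action_probs, action_to_group):
    """Sum fine-grained action probabilities within each action group."""
    group_probs = {}
    for action, prob in action_probs.items():
        group = action_to_group[action]
        group_probs[group] = group_probs.get(group, 0.0) + prob
    return group_probs


def predict_group(action_probs, action_to_group):
    """Return the action group with the largest summed probability."""
    group_probs = group_probabilities(action_probs, action_to_group)
    return max(group_probs, key=group_probs.get)


# Hypothetical classifier output over three fine-grained actions.
action_probs = {
    "putting something on something": 0.3,
    "putting something next to something": 0.3,
    "pushing something": 0.4,
}
action_to_group = {
    "putting something on something": "putting",
    "putting something next to something": "putting",
    "pushing something": "pushing",
}
print(predict_group(action_probs, action_to_group))
```

In this toy case the most likely single action belongs to the "pushing" group, yet the summed probability favours "putting". This shows how the summed prediction can differ from simply taking the group of the top-scoring action.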

Finally, it is interesting to examine to what extent coarse-grained performance accounts for fine-grained accuracy. That is, how much better is fine-grained performance than chance conditioned on coarse-grained performance? For example, conditioned on a predicted action group, if one selected the most frequent action within the action group, one’s fine-grained accuracy on the test set would be . A better approach might fix the coarse-grained model and train a linear classifier on top of its penultimate features. This achieves a test accuracy of 41.7%, still below the 50.44% test accuracy of the corresponding architecture trained on the fine-grained problem (Table 3). This further supports our contention that forcing DNNs to distinguish fine-grained details yields richer features.

                          Coarse-grained Testing   Fine-grained Testing
Coarse-grained Training   57.6                     41.7
Fine-grained Training     62.5                     50.44

Table 3: Comparison of classification accuracy of fine-grained and coarse-grained models, tested on fine-grained actions (using action categories) versus coarse-grained actions (using action groups).

5.3 Fine-grained Captioning

The ground truth object placeholders in Something-Something video captions (i.e., the object descriptions provided by crowd actors) are not highly constrained. Crowd actors have the option to type in the objects they used, along with multiple descriptive or quantitative adjectives, elaborating shape, color, material or the number of objects involved. Accordingly, it is not surprising that the distribution over object placeholders is extremely heavy-tailed, with many words occurring rarely. To facilitate training we therefore replaced all words that occurred times or less by [Something]. After removing rare words, we are left with words comprising around distinct object placeholders (i.e., different combination of nouns, adjectives, etc).

As the major challenge we confront is the striking diversity of object labels, and the variation in their frequency of occurrence, we consider a simplified task in which we modified the ground truth captions so they only contain one word per placeholder. We do this by pre-processing the object placeholders so that we keep the last noun, removing all other words. Afterwards, by substituting the pre-processed object placeholders into the action category, we obtain a simplified caption. Table 4 shows an example of the process. The result is a reduced vocabulary with words. In the spectrum of granularity, captioning with simplified objects can be considered as a middle ground between fine-grained action classification and captioning with full labels.
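The caption simplification described above can be sketched with the example from Table 4. This sketch assumes that the last word of each object description is its head noun, which holds for the Table 4 example but stands in for the actual "keep the last noun" pre-processing (which would need part-of-speech information). The function names and the `[something]` slot string are likewise assumptions.

```python
def simplify_placeholder(description):
    """Keep only the final word, assumed here to be the head noun."""
    return description.split()[-1]


def fill_template(template, objects, slot="[something]"):
    """Substitute object strings into the template slots, left to right."""
    caption = template
    for obj in objects:
        if slot not in caption:
            raise ValueError("more objects than template slots")
        caption = caption.replace(slot, obj, 1)
    return caption


template = "Holding [something] in front of [something]"
objects = ["a blue plastic cap", "a men's short sleeve shirt"]

full_caption = fill_template(template, objects)
simplified_caption = fill_template(template, [simplify_placeholder(o) for o in objects])
print(full_caption)
print(simplified_caption)
```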

5.3.1 Captioning baseline.

To the best of our knowledge there are no baselines for the Something-Something captioning task, as the dataset was only recently released. To quantify the performance of our captioning models, we count the percentage of generated captions that match the ground truth word by word. We refer to this as “Exact-Match Accuracy”. This is a challenging metric, as the model is deemed correct only if it generates the entire caption correctly. Chance performance in terms of Exact-Match Accuracy is $1/|V|^{T}$, where $V$ denotes the vocabulary and $T$ denotes the length of the sequence. In our experiments, $|V|$ is the size of the reduced vocabulary described above and we set $T = 14$. There is therefore a very low chance of getting the entire caption correct by picking a random word from the vocabulary at each time-step.
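A minimal sketch of the Exact-Match metric and of the chance level follows. The function names are assumptions, captions are compared after whitespace tokenization (also an assumption), and the vocabulary size used in the last line is a hypothetical value chosen only to show the order of magnitude.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predicted captions equal to the reference, word by word."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally many, non-empty predictions and references")
    matches = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


def chance_exact_match(vocab_size, seq_len):
    """Probability of guessing every word when sampling uniformly at each step."""
    return 1.0 / vocab_size ** seq_len


predictions = ["Holding cap in front of shirt", "Piling coins up"]
references = ["Holding cap in front of shirt", "Stacking coins"]
print(exact_match_accuracy(predictions, references))

# Hypothetical vocabulary of 1000 words with sequences of 14 tokens.
print(chance_exact_match(1000, 14))
```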

If we use the action category predicted by model M(-), trained for classification, and replace all occurrences of [something] with the most likely object string conditioned on that action class, the Exact-Match accuracy is . The same baseline for simplified object placeholders is .

Video ID: 81955
Template: Holding [something] in front of [something]
Somethings: “a blue plastic cap”, “a men’s short sleeve shirt”
Full Caption: Holding a blue plastic cap in front of a men’s short sleeve shirt
Simplified Somethings: “cap”, “shirt”
Simplified Caption: Holding cap in front of shirt

Table 4: An example of an annotation file for a Something-Something video.

Model   BLEU    ROUGE-L   METEOR   Exact-Match Accuracy   Classification Accuracy
M()     22.75   44.54     22.40    8.46                   50.64
M()     23.28   45.29     22.75    8.47                   50.96
M()     23.02   44.86     22.58    8.53                   50.73
M()     23.04   44.89     22.60    8.63                   51.38
M()     22.76   44.40     22.39    8.33                   50.04

Table 5: Performance of our two-channel models with different sizes of channel features on captioning with simplified objects. For this task we use a loss weight $\lambda = 0.1$ on the classification term. The maximum sequence length is 14.

5.3.2 Joint training of Classification and Captioning.

While the captioning task theoretically entails action classification, we found that our two-channel networks optimized on the pure captioning task do not perform as well as models trained jointly on classification and captioning (see Table 6). By coarsely tuning the loss-weighting hyper-parameter $\lambda$ empirically, we found $\lambda = 0.1$ to work well and fix it at this value for the captioning experiments below. More specifically, we first train with a pure classification loss, by setting $\lambda = 1$, and subsequently introduce the captioning loss by gradually decreasing $\lambda$ to 0.1.
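One way to realize such a schedule is sketched below. The text only states that training starts with a pure classification loss and that the weight is then gradually decreased to 0.1; the linear decay, the function name `lambda_schedule`, and the `warmup_steps` and `decay_steps` parameters are all assumptions made for this illustration.

```python
def lambda_schedule(step, warmup_steps, decay_steps, final_lam=0.1):
    """Classification-loss weight at a given training step.

    The weight is 1.0 (pure classification) during warm-up, then decays
    linearly (an assumed shape) to final_lam and stays there.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    if step < warmup_steps:
        return 1.0
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 1.0 + progress * (final_lam - 1.0)


# Hypothetical schedule: 10 warm-up steps, then 20 steps of decay.
for step in (0, 10, 20, 30, 100):
    print(step, round(lambda_schedule(step, 10, 20), 3))
```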

                                      Classification Accuracy   Exact-Match Accuracy
Pure captioning                       39.78                     5.96
Joint captioning and classification   51.32                     8.63

Table 6: Comparing models trained with pure captioning task vs joint captioning and classification. Results are shown for captioning with simplified object placeholders. The test classification accuracy for the pure captioning model was obtained by freezing the video encoder and training a linear regressor on top of penultimate features.

5.3.3 Captioning with simplified object placeholders.

We train different variations of our two-channel architecture on the simplified version of Something-Something captions with single word object labels. Table 5 summarizes our results. They show that the model with an equal number of 2D- and 3D-channels performs best (albeit by a fairly small margin). They also show that the best captioning model also performs best on the classification task. We also evaluate the models using standard captioning metrics: BLEU [31], ROUGE-L [32] and METEOR [33].

5.3.4 Captioning with full object placeholders.

We also train networks on the full object placeholders. This constitutes the finest level of action granularity. The experimental results are shown in Table 7. They show that, again, the best captioning model also yields the highest corresponding classification accuracy. The Exact-Match accuracy is significantly lower than for the simplified object placeholders, as it has to account for a much wider variety of phrases. The captioning models produce impressive qualitative results with a high degree of approximate action and object accuracy. Some examples are shown in Figure 2. More examples can be found in the appendix.

Model   BLEU    ROUGE-L   METEOR   Exact-Match Accuracy   Classification Accuracy
M()     16.87   40.03     19.13    3.33                   50.48
M()     16.92   40.54     19.26    3.56                   49.81
M()     17.99   41.82     20.03    3.80                   50.92
M()     17.61   41.28     19.69    3.76                   50.56
M()     16.80   39.98     19.11    3.61                   49.24

Table 7: Performance of captioning models with different sizes of channel features on full object placeholders. For this task we use a loss weight $\lambda = 0.1$ on the classification term. The maximum sequence length is 14.

(a) Ground Truth: Stacking 4 coins.
(b) Model output: Piling coins up.
(c) Ground Truth: Lifting up one end of flower pot, then letting it drop down.
(d) Model output: Lifting up one end of bucket, then letting it drop down.
(e) Ground Truth: Removing cup, revealing little cup behind.
(f) Model output: Removing mug, revealing cup behind.
Figure 2: Captioning examples.

6 Transfer Learning

One of the most astonishing properties of neural networks is their ability to learn representations that can be successfully transferred to other tasks. This has been demonstrated clearly with features from the penultimate layer of networks trained on ImageNet (e.g., [34, 35]). A distinguishing feature of the ImageNet task, which likely contributes to its potential for transfer learning, is the dataset size and the wide variety of fine-grained discriminations required. One motivation for studying fine-grained video classification is to understand and improve the potential for transfer learning on video tasks. In what follows we explore transfer learning performance as a function of source task granularity.

6.1 20bn-kitchenware

We introduce 20bn-kitchenware, a few-shot video classification dataset that contains 390 videos equally divided into 13 action categories. As depicted in Figure 3, this dataset contains video clips of a single person manipulating a fork, a spoon, a knife or tongs for roughly 4 seconds. Similar to the Something-Something dataset, 20bn-kitchenware was designed to capture fine-grained actions with subtle differences. For each kitchen utensil $X$, the target label can belong to any one of 3 actions, namely, “Using $X$”, “Pretending to use $X$” or “Trying but failing to use $X$”. In addition to these 12 action categories derived from kitchenware manipulations, we also include a fall-back class labeled “Doing other things” (the appendix lists the action categories).

While these action categories are somewhat ambiguous by design, we further encourage the model to pay attention to visual details by including unused ‘negative’ objects in the scene. The last row of Figure 3 shows one such example; while the target label indicates a manipulation of tongs, the clip also contains a spoon with an egg in it that could fool a model which simply recognizes objects. Given the limited amount of data available for training (130 samples are available for training; the rest is used for validation and testing), the action granularity and the presence of negative objects, we hypothesize that only models that have some understanding of physical world properties will perform well on this dataset. We therefore use 20bn-kitchenware as a benchmark for evaluating the quality of models pre-trained on Something-Something or other datasets. We plan to release 20bn-kitchenware upon publication of this paper.

(a) Using a knife to cut something
(b) Pretending to use a spoon to pick something up
(c) Trying but failing to pick something up with tongs
Figure 3: 20bn-kitchenware samples.

6.2 Proposed benchmark

6.2.1 Model candidates:

To further investigate whether training on fine-grained labels leads to better features, our transfer learning experiments consider four two-channel models that have been respectively pre-trained on coarse-grained labels (classification on action groups), on fine-grained labels (classification on 174 action categories), on simplified captions (captioning on fine-grained action categories expanded with a single object descriptor) and on template captions (captioning on fine-grained action categories expanded with object descriptors). We also include in this benchmark two neural nets that were pre-trained on other datasets, namely, a VGG16 network pre-trained on ImageNet, and an Inflated-ResNet34 pre-trained on Kinetics.

6.2.2 Training procedure:

The overall training procedure remains the same for all models. We freeze each pre-trained model and then train a neural net on top of extracted penultimate features. Independent of the architecture being used, we use the pre-trained model to produce 12 feature vectors per second. To achieve this, where necessary we split the input video into smaller clips and sequentially apply the pre-trained network on each clip individually (VGG16 is applied to each frame individually; for Inflated-ResNet34, we split the video into overlapping clips of 16 frames). Then, in the simplest case, we pass the obtained sequence of features through a logistic regressor and then average predictions over time. We also report results for which we classify pre-trained features using an MLP with 512 hidden units as well as a single bidirectional LSTM layer with 128 hidden states. This allows the network to perform some temporal analysis about the target domain.
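The simplest classifier in this procedure, a logistic regressor applied to each feature vector with predictions averaged over time, can be sketched as follows. The plain-list representation of features and weights, and the function names, are assumptions; a real implementation would use a tensor library and learned parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sequence(features, weights, biases):
    """Average per-time-step class probabilities.

    features: T feature vectors of dimension D (from the frozen encoder).
    weights:  C rows of dimension D; biases: C values.
    """
    averaged = [0.0] * len(biases)
    for x in features:
        logits = [sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        for c, p in enumerate(softmax(logits)):
            averaged[c] += p / len(features)
    return averaged


# Toy example: 2 time steps, 2-dimensional features, 2 classes.
features = [[1.0, 0.0], [1.0, 0.0]]
weights = [[2.0, 0.0], [0.0, 2.0]]
biases = [0.0, 0.0]
print(classify_sequence(features, weights, biases))
```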

Figure 4: 20bn-kitchenware transfer learning results: averaged scores obtained using (from left to right) a VGG16, an Inflated ResNet34, as well as two-channel models trained respectively on coarse-grained (CG) actions, fine-grained (FG) actions, single-object (SO) captions and FG actions jointly, captions and FG actions jointly, and on captions only. We report results using 1 training sample per class, 5 training samples per class or the full training set.

6.2.3 Results:

For each pretrained model and each classifier, we evaluate 1-shot, 5-shot and 10-shot performance, averaging scores obtained over 10 runs. Figure 4 shows the average scores as well as 95% confidence intervals. The most noticeable findings are as follows:

  • Logistic Regression vs MLP vs BiLSTM: Overall, using a recurrent network yields better performance.

  • Something-Something features vs other features: Our models pre-trained on Something-Something outperform other external models. This is not surprising given the nature of the target domain. 20bn-kitchenware samples are, by design, closer to Something-Something samples than ImageNet or Kinetics ones. However, it is surprising that VGG16 features perform better on 20bn-kitchenware than Kinetics features.

  • Effect of the action granularity: Figure 4 supports the contention that training on fine-grained tasks leads to better features. The best model on this benchmark is our model trained jointly on full captions and action categories. As expected, the only exception to this rule is the model that was trained to do pure captioning.
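For reference, averaged scores with 95% confidence intervals over 10 runs, as plotted in Figure 4, could be computed along the following lines. The text does not say how the intervals were obtained; this sketch assumes a Student-t interval on the mean, and the critical value is hard-coded for 10 runs (9 degrees of freedom). The function name and the example scores are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRITICAL_95_DF9 = 2.262


def mean_and_ci95(scores, t_critical=T_CRITICAL_95_DF9):
    """Return the mean score and the half-width of its confidence interval.

    The default critical value is only appropriate for 10 runs.
    """
    if len(scores) < 2:
        raise ValueError("at least two runs are needed")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, t_critical * standard_error


# Hypothetical accuracies from 10 runs.
runs = [0.4, 0.6] * 5
mean, half_width = mean_and_ci95(runs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```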

7 Conclusions

Pre-training neural networks on large labeled datasets has become a driving force in many deep learning applications. Some might argue that it may be considered a serious competitor to unsupervised learning as a means to generate universal features that represent the visual world. Ever since ImageNet was used as a generic visual feature extractor, the hypothesis has emerged that it is the dataset size, the amount of detail and the variety of labels that drive a network’s capability to learn useful generic features. To the degree that this hypothesis is true, generating visual features capable of transfer learning should involve source tasks that (i) are as fine-grained and complex as possible, and (ii) ideally involve video rather than still images, because video is a much more fertile domain for defining complex tasks that represent aspects of the physical world.

In this work, we provide further evidence for that hypothesis, showing that the amount of detail in the task has a strong influence on the quality of the learned features. We also show that captioning, which to the best of our knowledge has hitherto been used only as a target task in transfer learning, can be a powerful source task itself. Our work suggests that one gets substantial leverage by utilizing ever more fine-grained recognition tasks, represented in the form of captions, possibly in combination with question-answering. Unlike the current trend of engineering neural networks to perform bounding box generation, semantic segmentation, or tracking, the appeal of fine-grained textual labels is that they provide a simple homogeneous interface. More importantly, they may provide “just enough” localization and tracking capability to solve a wide variety of important tasks, without allocating valuable compute resources to satisfy intermediate goals at an accuracy that these tasks may not actually require.


  • [1] Goyal, R., Kahou, S.E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision. ICCV17 (2017)
  • [2] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset (2017) preprint arXiv:1705.06950.
  • [3] Monfort, M., Zhou, B., Bargal, S.A., Andonian, A., Yan, T., Ramakrishnan, K., Brown, L., Fan, Q., Gutfruend, D., Vondrick, C., Oliva, A.: Moments in time dataset: one million videos for event understanding (2018) preprint arXiv:1801.03150.
  • [4] Sigurdsson, G.A., Russakovsky, O., Gupta, A.: What actions are needed for understanding human actions in videos? In: IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [5] Zhou, B., Andonian, A., Torralba, A.: Temporal relational reasoning in videos. CoRR (2017)
  • [6] Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR08 (2008)
  • [7] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild (2012) preprint arXiv:1212.0402.
  • [8] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014) 1725–1732
  • [9] Wu, J.: Computational perception of physical object properties. Master’s thesis, Massachusetts Institute of Technology (2016)
  • [10] Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: Data collection and evaluation server. CoRR (2015)
  • [11] Xu, J., Mei, T., Yao, T., Rui, Y.: Msr-vtt: A large video description dataset for bridging video and language. In: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE (2016) 5288–5296
  • [12] Yao, L., Ballas, N., Cho, K., Smith, J.R., Bengio, Y.: Oracle performance for visual captioning. arXiv preprint arXiv:1511.04590 (2015)
  • [13] Heuer, H., Monz, C., Smeulders, A.W.: Generating captions without looking beyond objects. arXiv preprint arXiv:1610.03708 (2016)
  • [14] Devlin, J., Gupta, S., Girshick, R.B., Mitchell, M., Zitnick, C.L.: Exploring nearest neighbor approaches for image captioning. CoRR (2015)
  • [15] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 2625–2634
  • [16] Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014)
  • [17] Kaufman, D., Levi, G., Hassner, T., Wolf, L.: Temporal tessellation for video annotation and summarization. arXiv preprint arXiv:1612.06950 (2016)
  • [18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR (2015)
  • [19] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. CoRR (2014)
  • [20] Williams, R., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (1989) 270–280
  • [21] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. (2015) 4489–4497
  • [22] Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35(1) (2013) 221–231
  • [23] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. arXiv preprint arXiv:1705.07750 (2017)
  • [24] Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? CoRR (2017)
  • [25] Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. (2016) 4207–4215
  • [26] Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. (2011) 29–39
  • [27] Li, Z., Gavrilyuk, K., Gavves, E., Jain, M., Snoek, C.G.: Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding (2017)
  • [28] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. (2014) 568–576
  • [29] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR (2014)
  • [30] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR (2015)
  • [31] Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. (2002)
  • [32] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004)
  • [33] Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the EACL 2014 Workshop on Statistical Machine Translation. (2014)
  • [34] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
  • [35] Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2014) 806–813
  • [36] Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016)


In this supplementary document, we provide:

  • Qualitative examples for our classification and captioning.

  • Visualization of our classification and captioning models using Grad-CAM.

  • The full list of action categories of 20bn-kitchenware.

Qualitative Examples of Classification

Here we provide video examples and their ground truth action categories, along with the model prediction for each. We use our model M(-), which is trained with . Notice that even when the predicted actions are incorrect, e.g. row in Figure 5, they are nevertheless usually quite sensible.

(a) Ground Truth: Stacking [number of] [something].
(b) Model Prediction: Putting [something] and [something] on the table.
(c) Ground Truth: Pouring [something] into [something] until it overflows.
(d) Model Prediction: Pouring [something] into [something].
(e) Ground Truth: Lifting [something] with [something] on it.
(f) Model Prediction: Lifting [something] with [something] on it.
(g) Ground Truth: Showing [something] behind [something].
(h) Model Prediction: Holding [something] behind [something].
(i) Ground Truth: Putting [something] onto [something].
(j) Model Prediction: Covering [something] with [something].
Figure 5: Ground truth and model prediction for classification examples.

Qualitative Examples of Captioning

Below are examples of videos, accompanied by their ground truth captions and the captions generated by the model. We use model M(-) in this section as well, which is also trained jointly for classification and captioning (with ).

(a) Ground Truth: Touching (without moving) the head of a toy.
(b) Model output: Poking a stuffed animal so lightly that it doesnt or almost doesnt move.
(c) Ground Truth: Pushing duster with white coloured pen.
(d) Model output: Pushing phone with pen.
(e) Ground Truth: Plugging a charger into a phone.
(f) Model output: Plugging charger into phone.
(g) Ground Truth: Piling bowl up.
(h) Model output: Stacking bowls.
(i) Ground Truth: Removing cup, revealing little cup behind.
(j) Model output: Removing mug, revealing cup behind.
(k) Ground Truth: Hitting cup with spoon.
(l) Model output: Hitting mug with spoon.
(m) Ground Truth: Stacking 4 coins.
(n) Model output: Piling coins up.
(o) Ground Truth: Taking toffee eclairs from jar.
(p) Model output: Taking battery out of container.
(q) Ground Truth: Rolling paper towels on a flat surface.
(r) Model output: Letting bottle roll along a flat surface.
(s) Ground Truth: Pretending to put nail polish into jar.
(t) Model output: Pretending to put bottle into container.
(u) Ground Truth: Letting toy truck roll along a flat surface.
(v) Model output: Pushing car from right to left.
(w) Ground Truth: Lifting up one end of flower pot, then letting it drop down.
(x) Model output: Lifting up one end of bucket, then letting it drop down.
(y) Ground Truth: Letting roll roll down a slanted surface.
(z) Model output: Letting spray can roll down a slanted surface.
(aa) Ground Truth: Lifting plate with cutlery on it.
(ab) Model output: Lifting plate with spoon on it.
(ac) Ground Truth: Tearing napkin into two pieces.
(ad) Model output: Tearing paper into two pieces.
(ae) Ground Truth: Putting pen on a surface.
(af) Model output: Putting pen that cant roll onto a slanted surface, so it stays.
(ag) Ground Truth: Putting pen into box.
(ah) Model output: Showing that pen is inside the box.
(ai) Ground Truth: Covering selfi stick with shawl.
(aj) Model output: Covering scissors with a blanket.
(ak) Ground Truth: Lifting a book up completely, then letting it drop down.
(al) Model output: Lifting a book up completely, then letting it drop down.
(am) Ground Truth: A deodorant falling like a rock.
(an) Model output: Bottle falling like a rock.
(ao) Ground Truth: Pretending to poke a book.
(ap) Model output: Pretending to poke a book.
Figure 6: Ground truth captions and model outputs for video examples.

Visualization of classification model with Grad-CAM

To visualize regularities learned from data, we extracted temporally-sensitive saliency maps using Grad-CAM [36] for both the classification and captioning tasks. To this end, we extended the Grad-CAM implementation to process video. Figure 7 shows saliency maps of examples from Something-Something obtained with model M(-) trained on fine-grained action categories, with (i.e., the pure classification task).

Figure 7: Grad-CAM for M(-) on video examples predicted correctly during fine-grained action classification. We can see that the model focuses on different parts of different frames in the video in order to make a prediction.
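As a rough illustration of how a temporally-sensitive Grad-CAM map can be computed, the following minimal sketch applies the standard Grad-CAM recipe to a 4D feature volume. It is not the implementation used for Figure 7: the function name grad_cam_video, the array layout (channels, time, height, width), and the choice to pool gradients over time as well as space are assumptions made for this example, and the gradients are assumed to have been computed beforehand by an autodiff framework.

```python
import numpy as np

def grad_cam_video(activations, gradients):
    """Grad-CAM saliency for one video (illustrative sketch).

    activations: (C, T, H, W) feature maps from a convolutional layer.
    gradients:   (C, T, H, W) gradient of the target score w.r.t. those maps.
    Returns a (T, H, W) saliency volume scaled to [0, 1].
    """
    # Channel importance: average each channel's gradient over time and space.
    # Pooling over the time axis too is an assumption of this sketch.
    weights = gradients.mean(axis=(1, 2, 3))                # (C,)
    # Weighted sum of feature maps; the time axis is kept in the result.
    cam = np.tensordot(weights, activations, axes=(0, 0))   # (T, H, W)
    # ReLU keeps only evidence in favour of the target.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Usage with random stand-ins for real activations and gradients.
rng = np.random.default_rng(0)
acts = rng.random((8, 4, 7, 7))
grads = rng.standard_normal((8, 4, 7, 7))
saliency = grad_cam_video(acts, grads)   # one 7x7 map per time step
```

Because the time axis is preserved, each of the T slices can be upsampled and overlaid on the corresponding input frames, which is one way to obtain the per-frame highlighting described above.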

Visualization of captioning model using Grad-CAM

To obtain saliency maps during the captioning process, we compute Grad-CAM once for each token, so that different regions of the video are highlighted for different tokens. Figures 8-10 show saliency maps for the captioning model, jointly trained with . Notice how the attentional focus of the model changes qualitatively as we perform Grad-CAM for different tokens in the target caption.

Figure 8: Grad-CAM on video example with ground truth caption Pretending to pick mouse up. The model focuses on hand motion in the beginning and end of the video for the token “Up”.
Figure 9: Grad-CAM on video example with ground truth caption Moving toy closer to toy. We can see that the model focuses on the gap between toys when using “Moving” token. It also looks at both toy objects when using the token “Closer”.
Figure 10: Grad-CAM on video example with ground truth caption Bottle being deflected from ball during captioning process. The model focuses on the collision between bottle and ball, when using token “Deflected”.
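The per-token procedure can be sketched as a batched version of Grad-CAM, with one target score per caption token. The sketch below is only an illustration under stated assumptions: the function name per_token_grad_cam and the array shapes are hypothetical, and the per-token gradients are assumed to have been obtained by backpropagating each token's score through the decoder to the encoder feature maps.

```python
import numpy as np

def per_token_grad_cam(activations, token_gradients):
    """One Grad-CAM volume per caption token (illustrative sketch).

    activations:     (C, T, H, W) encoder feature maps for one video.
    token_gradients: (N, C, T, H, W) gradient of each of the N token scores
                     w.r.t. the feature maps (assumed precomputed).
    Returns an (N, T, H, W) array of non-negative saliency volumes.
    """
    # One importance weight per (token, channel) pair.
    weights = token_gradients.mean(axis=(2, 3, 4))            # (N, C)
    # Each token re-weights the same feature maps differently.
    cams = np.einsum("nc,cthw->nthw", weights, activations)   # (N, T, H, W)
    return np.maximum(cams, 0.0)

# Usage: locate the most salient position for each token of a caption.
tokens = ["Moving", "toy", "closer", "to", "toy"]
rng = np.random.default_rng(0)
acts = rng.random((8, 4, 7, 7))
grads = rng.standard_normal((len(tokens), 8, 4, 7, 7))
cams = per_token_grad_cam(acts, grads)
peaks = {i: np.unravel_index(cam.argmax(), cam.shape)
         for i, cam in enumerate(cams)}   # token index -> (t, y, x)
```

Since the feature maps are shared and only the gradients change from token to token, differences between the resulting maps reflect which evidence each token's score depends on.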

20bn-kitchenware action categories

Table 8 gives the full list of 20bn-kitchenware action categories.

Action categories
Using a fork to pick something up
Pretending to use a fork to pick something up
Trying but failing to pick something up with a fork
Using a spoon to pick something up
Pretending to use a spoon to pick something up
Trying but failing to pick something up with a spoon
Using a knife to cut something
Pretending to use a knife to cut something
Trying but failing to cut something with a knife
Using tongs to pick something up
Pretending to use tongs to pick something up
Trying but failing to pick something up with tongs
Doing other things
Table 8: The 13 action categories represented in 20bn-kitchenware.
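The categories in Table 8 follow a regular pattern: each of four utensils appears in a genuine, a pretended, and a failed version of its action, plus one catch-all class. The short sketch below reconstructs the list from that pattern, purely to make the structure explicit. The names TOOLS and kitchenware_categories are hypothetical and are not part of any released dataset tooling.

```python
# Hypothetical reconstruction of the Table 8 label set from its pattern.
# Maps each utensil (with its article, if any) to the task it performs.
TOOLS = {
    "a fork": "pick something up",
    "a spoon": "pick something up",
    "a knife": "cut something",
    "tongs": "pick something up",
}

def kitchenware_categories():
    """Return the 13 category names in the order they appear in Table 8."""
    labels = []
    for tool, task in TOOLS.items():
        labels.append(f"Using {tool} to {task}")
        labels.append(f"Pretending to use {tool} to {task}")
        labels.append(f"Trying but failing to {task} with {tool}")
    # Catch-all class for videos that show none of the above.
    labels.append("Doing other things")
    return labels

categories = kitchenware_categories()
```

Seen this way, the three variants of each action share the same utensil and scene, so they differ only in how the action unfolds.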