Fine-grained Video Classification and Captioning
We describe a DNN for fine-grained action classification and video captioning. It gives state-of-the-art performance on the challenging Something-Something dataset, with over videos and fine-grained actions. Classification and captioning on this dataset are challenging because of the subtle differences between actions, the use of thousands of different objects, and the diversity of captions penned by crowd actors. The model architecture shares features for classification and captioning, and is trained end-to-end. It performs much better than the existing classification benchmark for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task. Our results reveal that there is a strong correlation between the degree of detail in the task and the ability of the learned features to transfer to other tasks.
Keywords: Video Understanding, Fine-grained Video Classification, Video Captioning, Common Sense, Something-Something Dataset.
Common-sense video understanding entails fine-grained recognition of actions, objects, spatial and temporal relations, as well as physical interactions, arguably well beyond the capabilities of current techniques. A general framework will need to discriminate myriad variations of actions and interactions, not unlike the emergence of fine-grained tasks in visual object recognition. For example, we need to be able to discriminate similar actions that differ in relatively subtle ways, such as ‘putting a pen beside the cup’, ‘putting the pen in the cup’, or perhaps ‘pretending to put the pen on the table’. Similarly, one must be able to cope with the diversity of the actors in a scene and of the objects involved in the action. Such generality and fine-grained discrimination are key challenges to video understanding.
Compared to most current approaches to action recognition, and video captioning techniques applied to small corpora with relatively coarse-grained actions, this paper considers fine-grained action classification and captioning tasks on large-scale video corpora. Training is performed on the Something-Something dataset , with fine-grained action categories and several thousand different objects. In particular, the captioning task requires the inference of many actions, different forms of object interaction and spatio-temporal relations, and an extremely broad set of objects, all under significant variations in lighting, viewpoint, background clutter, and occlusion.
This paper describes a deep neural network architecture comprising a two-channel convolutional network and an LSTM recurrent network for video encoding. The same encoding is then shared for action classification and caption generation. The resulting network performs several times better than baseline action classification results in . It also provides impressive results on an extremely challenging fine-grained video captioning task. We also demonstrate the quality of learned features through transfer learning from Something-Something features to a kitchen action dataset.
2 Video Data
Video action classification and captioning have received significant attention for several years. Nevertheless, progress has lagged related work on static images, in part because of the lack of large-scale corpora. Using web sources (e.g., YouTube, Vimeo and Flickr) and human annotators, larger datasets have been collected in recent years. Examples include Kinetics  and Moments in Time . The limitation of these corpora stems from the lack of control over variations in pose, motion and other scene properties that might be important for learning fine-grained models.
More recently, crowd-sourced data based on crowd-acting have emerged. Crowd workers are asked to generate videos depicting template actions. This allows one to target specific video domains and action classes, with control over the similarity and differences of actions, which is needed for fine-grained corpora. Examples include the Something-Something dataset  and the Charades dataset .
The first version of Something-Something  has videos of human-object interactions, comprising coarse-grained action groups, which are further broken down into closely related action categories. The videos exhibit significant diversity in viewing and lighting, objects and backgrounds, and the ways in which the actors performed the actions. Baseline performance in  was a correct action classification rate of , and on action groups.  reports classification accuracy on Something-Something action categories.
The second version of Something-Something is larger, with videos for the same action categories. In addition to action labels, each video in this version includes a caption that was authored and uploaded by the crowd actor. These captions incorporate the action class as well as descriptions of the objects involved in the action. That is, the captions mirror the action template, but with the generic placeholder Something replaced by the specific object(s) chosen by the actor. As an example, a video with the template action ‘Putting [something] on [something]’ might have the caption ‘Putting a red cup on a blue plastic box’.
In a nutshell, this dataset provides different levels of granularity: coarse-grained action groups, fine-grained action categories, and even more fine-grained labeling via video captions.
3 Video Classification and Captioning Tasks
Video-based action classification dates back to seminal work by Laptev et al.  with hand-tuned features, while most recent approaches have focused on DNN features. Due to the prevalence of datasets like UCF-101 , Sports-1M  and more recently Kinetics , most research on action classification has been biased towards models that encode rough, coarse-grained properties of the scene. In extreme cases, action classes can be discriminated from a rough glimpse of the scene, often encoded in isolated frames; e.g., inferring “soccer play” from the presence of a green soccer field.
However, even when motion is essential in order to distinguish action classes, many existing approaches do well by encoding rough statistics over velocities, directions, and motion positions. Little work has been devoted to the task of representing details of object interactions or how their configuration changes over time. One notable exception is the work by Wu , where the goal is to extract physical object properties from videos of physical experiments.
Image and video captioning have received significant attention since the release of large-scale captioning corpora, notably, Microsoft COCO  and MSR-VTT . One problem with existing approaches to captioning (for images or videos) is that many datasets implicitly allow models to “cheat”, e.g., by generating phrases that are grammatically and semantically coherent, but only loosely related to the fine-grained visual structure that models purport to represent. It has been shown, for example, that a language model trained on unordered, individual words (e.g., object nouns) predicted by a separate NN can compete with a captioning model trained on the actual task end-to-end (e.g., [12, 13]). Similarly, nearest neighbor methods have been surprisingly effective for existing captioning tasks .
In principle, captioning tasks, if designed appropriately, could represent extremely detailed scene properties. One can therefore conclude that the above problems are more a function of the task and datasets than the specific architectures. Labels with more subtle and fine-grained distinctions would directly expose the ability (or inability) of a network to correctly infer the scene properties encoded in the captions.
The captions provided with the Something-Something dataset are designed to be sufficiently fine-grained that attention to details is needed to yield high prediction accuracy. For example, to generate correct captions, networks need not only infer the correct actions, but must also learn to recognize different objects and their properties, as well as object interactions.
Most existing captioning architectures are based on an encoder-decoder framework. For video captioning (e.g., [15, 16, 17]) the encoder is typically a convolutional or recurrent convolutional network.
Inspired by these works, we use a modified encoder-decoder architecture that has an action classifier in addition to the encoder and decoder components (see Fig. 1). The decoder or classifier can be switched off, leaving pure classification or captioning models respectively. It is also possible to jointly train classification and captioning models.
4.1 Modified Video Encoder-Decoder
In our model, the encoder receives the input video and maps it to an embedded representation. Conditioned on this representation, a caption decoder generates the caption, and a classifier predicts the action category. We describe each component in more detail below.
The video encoder, inspired in part by magno- and parvo-cellular pathways in visual cortex, first processes the video through a two-channel convolutional architecture (Fig. 1). A spatial 2D-CNN and a spatio-temporal 3D-CNN are applied in parallel to the input video. Both CNN channels are VGG-like convolutional networks. The 3D-CNN features are used in lieu of a separate module to compute motion (e.g., optical flow) features. The basic building block of each channel is a ( in 2D-CNN channel) convolution filter with batchnorm  and ReLU activation. We interleave the convolutional layers with spatial max-pooling.
In order to aggregate features across time, feature vectors from each channel are combined and fed to a 2-layer bidirectional LSTM. Because there is no temporal pooling in the CNN channels, the number of feature vectors produced by the final LSTM layer is equal to the number of input frames. We average these features to get an encoding of the entire video. This encoding is used by both the classifier and the captioning decoder.
The action classifier applies a fully-connected layer to the encoder output, followed by a softmax layer. For training we use a cross-entropy loss over the action categories.
The caption decoder is a two-layer LSTM. Much like conventional encoder-decoder methods for video captioning [16, 15] and machine translation , our decoder generates captions using a softmax over the vocabulary words, conditioned on previously generated words. The loss used for a caption is the usual negative log-probability of the word sequence:
$$\mathcal{L}_{\text{captioning}} = -\sum_{i=1}^{N} \log p\left(w_i \mid w_{<i}, h; \theta\right),$$
where $w_i$ denotes the $i$-th word of the caption, $N$ is the caption length, $h$ is the video encoding, and $\theta$ denotes model parameters. In order to optimize speed and memory usage during training, the length of captions generated by the decoder is fixed at 14 words (only a small percentage of Something-Something captions have more than 14 words). As is common for encoder-decoder networks, we train using teacher forcing . At test time, the input to the decoder at each time-step is the token generated at the previous time-step (i.e., teacher forcing is not used).
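The difference between the training-time and test-time decoder inputs can be illustrated with a small, framework-free sketch. The names `teacher_forcing_inputs`, `greedy_decode` and `toy_step`, the `<bos>`/`<eos>` tokens, and the lookup-table "decoder" are all assumptions made for illustration; a real decoder step would be an LSTM conditioned on the video encoding.

```python
def teacher_forcing_inputs(caption_tokens, bos="<bos>"):
    """Training: the input at step t is the ground-truth token from step t-1."""
    return [bos] + list(caption_tokens[:-1])


def greedy_decode(step_fn, max_len=14, bos="<bos>", eos="<eos>"):
    """Test time: feed back the token generated at the previous step.

    step_fn(prev_token, state) -> (next_token, new_state) stands in for one
    decoder step (hypothetical interface).
    """
    tokens, prev, state = [], bos, None
    for _ in range(max_len):
        prev, state = step_fn(prev, state)
        if prev == eos:
            break
        tokens.append(prev)
    return tokens


# Toy stand-in for a trained decoder: a fixed next-token lookup table.
_NEXT = {"<bos>": "putting", "putting": "cup", "cup": "<eos>"}


def toy_step(prev_token, state):
    return _NEXT[prev_token], state
```

With teacher forcing the decoder never sees its own mistakes during training, whereas `greedy_decode` consumes its own outputs, which is the regime used at test time.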
The model is trained end-to-end for classification and captioning with a weighted sum of the classification and captioning losses, i.e.,
$$\mathcal{L} = \lambda \, \mathcal{L}_{\text{classification}} + (1 - \lambda) \, \mathcal{L}_{\text{captioning}}.$$
The hyperparameter $\lambda \in [0, 1]$; with $\lambda = 1$ or $\lambda = 0$, we end up with pure classification and captioning tasks respectively. For other values of $\lambda$, both tasks are trained jointly. The encoder is shared by the action classifier and the caption decoder. The experiments below also compare this joint training regime with models for which the encoder is trained on the classification loss or the captioning loss alone.
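As a minimal sketch of this weighted objective, the snippet below combines a per-example cross-entropy with a caption negative log-probability. It assumes that the weight `lam` multiplies the classification loss and `1 - lam` multiplies the captioning loss, so that the two extreme settings recover the pure tasks; the function names and toy probabilities are illustrative and not taken from the authors' code.

```python
import math


def classification_ce(class_probs, target):
    """Cross-entropy for one example: -log p(target class)."""
    return -math.log(class_probs[target])


def caption_nll(word_probs):
    """Negative log-probability of a caption, given the probability the
    decoder assigned to each ground-truth word."""
    return -sum(math.log(p) for p in word_probs)


def joint_loss(class_probs, target, word_probs, lam):
    """Weighted sum of both losses; lam weights classification (assumed)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (lam * classification_ce(class_probs, target)
            + (1.0 - lam) * caption_nll(word_probs))
```

Setting `lam=1.0` leaves only the classification term and `lam=0.0` only the captioning term; intermediate values train both heads on the shared encoder.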
Existing action classification methods differ in the way they aggregate information through time. Many approaches to video classification rely primarily on spatial features with CNNs applied to individual frames (e.g., see ). While such features have been successful for image classification and object recognition, they fail to exploit the available temporal information in video.
Other approaches make use of temporal and spatial information. For instance, [8, 21, 22] use a 3D-CNN without recurrent connections. Recent methods involve Imagenet-pretrained 2D-CNNs which are inflated into 3D [23, 24]. In contrast, our video encoder is most closely related to approaches that perform temporal reasoning via a recurrent convolutional architecture [25, 26, 27]. It is also related to TwoStream architectures ; but our model does not explicitly use optical flow, opting instead for generic 3D CNN features.
We train networks for captioning and classification, and further evaluate the learned features on other related tasks.
In all our experiments we use frame rate of . During training we randomly pick consecutive frames. For videos with less than frames, we replicate the first and last frames to achieve the intended length. We resize the frames to , and then use random cropping of size . For validation and testing, we use center cropping. We optimize all models using Adam, with an initial learning rate of .
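The clip sampling described above can be sketched as follows. The text does not say how the replicated frames are divided between the start and the end of a short video, so splitting the padding roughly evenly is an assumption, as are the function name `sample_clip` and its parameters.

```python
import random


def sample_clip(frames, clip_len, rng=random):
    """Return clip_len frames: a random run of consecutive frames, or the
    whole video padded by repeating its first and last frames."""
    n = len(frames)
    if n == 0:
        raise ValueError("video has no frames")
    if n >= clip_len:
        start = rng.randrange(n - clip_len + 1)
        return list(frames[start:start + clip_len])
    missing = clip_len - n
    front = missing // 2          # assumed: pad both ends roughly evenly
    back = missing - front
    return [frames[0]] * front + list(frames) + [frames[-1]] * back
```

For validation and testing, a deterministic variant (for example a fixed start index) would presumably replace the random offset, mirroring the use of center cropping in space.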
5.1 Exploring architectures for classification
As a baseline, we use ImageNet-pretrained models [29, 30] on individual frames, to which we then add additional layers. For the first baseline, we use just the middle frame of the video, with a classifier comprising a 2-layer MLP with hidden units. We also consider a baseline in which we apply this approach to all frames, after which we average the frame by frame predictions.
We experiment with aggregating information over time using an LSTM layer with 1024 units. We report results in Table 1. There is a marked improvement with the LSTM, confirming that this task requires some form of temporal analysis. Our best baseline result was achieved with a VGG architecture, and the test accuracy is close to the best architecture reported to date on Something-Something V1 (e.g., ).
The performance of our model is reported in Table 2. For the pure classification task (i.e., training with the classification loss only), we consider models with different numbers of features produced by the 2D (spatial) CNN and the 3D (spatio-temporal) CNN. The total number of features is capped at in all cases. Interestingly, we find the best performance occurs when the 2D and 3D channels have similar numbers of features.
To the best of our knowledge, these results are state-of-the-art test accuracy on Something-Something, and are several times better than the initial performance reported in . The results also illustrate that the model benefits from combining 2D and 3D features. Although the performance is relatively stable for different numbers of 2D and 3D features, the even split M(-) provides a good trade-off between performance and model complexity. We therefore use this model below, unless otherwise specified.
5.2 Coarse- vs Fine-Grained Classification
Something-Something provides coarse-grained categories called action groups, which comprise disjoint sets of fine-grained actions. To evaluate how well we discriminate fine-grained vs coarse-grained actions we also trained a model on coarse-grained action groups, using the M(-) architecture. As one would expect, classification accuracy is higher, at (see Table 3 (top-left)).
An alternative way to perform coarse-grained classification is to map predictions from the fine-grained classifier onto the action groups. This can be achieved by summing the probabilities of all fine-grained actions belonging to each action group. Interestingly, the resulting accuracy on coarse-grained action groups increases markedly, to . Such increased performance suggests that fine-grained training provides higher fidelity features.
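The mapping from fine-grained predictions to action groups amounts to a grouped sum over class probabilities. The sketch below uses hypothetical action and group names and a plain dictionary for the action-to-group assignment; only the summation itself is taken from the text.

```python
def coarse_from_fine(fine_probs, group_of):
    """Sum fine-grained class probabilities within each action group."""
    group_probs = {}
    for action, p in fine_probs.items():
        group = group_of[action]
        group_probs[group] = group_probs.get(group, 0.0) + p
    return group_probs


def predict_group(fine_probs, group_of):
    """Pick the action group with the largest summed probability."""
    group_probs = coarse_from_fine(fine_probs, group_of)
    return max(group_probs, key=group_probs.get)


# Illustrative (made-up) probabilities and group assignment.
fine_probs = {"putting on": 0.3, "putting next to": 0.3, "pushing": 0.4}
group_of = {"putting on": "Putting", "putting next to": "Putting",
            "pushing": "Pushing"}
```

In this toy case the single most likely fine-grained action belongs to the "Pushing" group, yet the summed probability favors "Putting", which shows how the group-level decision can differ from simply taking the group of the top fine-grained class.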
Finally, it is interesting to examine to what extent coarse-grained performance accounts for fine-grained accuracy. That is, how much better is fine-grained performance than chance, conditioned on coarse-grained performance? For example, conditioned on a predicted action group, if one selected the most frequent action within the action group, one’s fine-grained accuracy on the test set would be . A better approach might fix the coarse-grained model and train a linear classifier on top of its penultimate features. This achieves a test accuracy of , still below test performance for the corresponding architecture trained on the fine-grained problem. This further supports our contention that forcing DNNs to distinguish fine-grained details yields richer features.
5.3 Fine-grained Captioning
The ground truth object placeholders in Something-Something video captions (i.e., the object descriptions provided by crowd actors) are not highly constrained. Crowd actors have the option to type in the objects they used, along with multiple descriptive or quantitative adjectives, elaborating shape, color, material or the number of objects involved. Accordingly, it is not surprising that the distribution over object placeholders is extremely heavy-tailed, with many words occurring rarely. To facilitate training we therefore replaced all words that occurred times or less by [Something]. After removing rare words, we are left with words comprising around distinct object placeholders (i.e., different combinations of nouns, adjectives, etc.).
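The rare-word replacement can be sketched with a simple frequency count. The threshold is exposed as the parameter `max_rare_count` because its value is not reproduced here, and whitespace tokenization with case-sensitive matching is an assumption of this sketch rather than a documented detail of the preprocessing.

```python
from collections import Counter


def replace_rare_words(captions, max_rare_count, unk="[Something]"):
    """Replace every word occurring max_rare_count times or fewer with unk."""
    counts = Counter(word for caption in captions for word in caption.split())
    return [
        " ".join(word if counts[word] > max_rare_count else unk
                 for word in caption.split())
        for caption in captions
    ]
```

For instance, with `max_rare_count=1`, words that appear only once in the corpus are mapped to `[Something]`, while repeated words are kept.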
As the major challenge we confront is the striking diversity of object labels, and the variation in their frequency of occurrence, we consider a simplified task in which we modified the ground truth captions so they only contain one word per placeholder. We do this by pre-processing the object placeholders so that we keep the last noun, removing all other words. Afterwards, by substituting the pre-processed object placeholders into the action category, we obtain a simplified caption. Table 4 shows an example of the process. The result is a reduced vocabulary with words. In the spectrum of granularity, captioning with simplified objects can be considered as a middle ground between fine-grained action classification and captioning with full labels.
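The simplification step can be illustrated on the example from Table 4. Identifying the last noun properly would require part-of-speech tagging; the sketch below simply keeps the last token of each object phrase, which is an assumption that happens to coincide with the last noun in this example. The function names are hypothetical.

```python
def simplify_placeholder(phrase):
    """Keep only the final token of an object phrase (proxy for last noun)."""
    return phrase.split()[-1]


def fill_template(template, objects, slot="[something]"):
    """Substitute object strings into the template slots, left to right."""
    caption = template
    for obj in objects:
        caption = caption.replace(slot, obj, 1)
    return caption


template = "Holding [something] in front of [something]"
objects = ["a blue plastic cap", "a men's short sleeve shirt"]
simplified = [simplify_placeholder(obj) for obj in objects]
```

Here `fill_template(template, objects)` gives the full caption, and `fill_template(template, simplified)` gives the simplified one.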
5.3.1 Captioning baseline.
To the best of our knowledge there are no baselines for the Something-Something captioning task, as the dataset was only recently released. To quantify the performance of our captioning models, we count the percentage of generated captions that match the ground truth word by word. We refer to this as “Exact-Match Accuracy”. This is a challenging metric, as the model is deemed correct only if it generates the entire caption correctly. Chance performance in terms of Exact-Match Accuracy is $1/|V|^{N}$, where $V$ denotes the vocabulary and $N$ denotes the length of the sequence. In our experiments, $|V|$ is the vocabulary size after removing rare words, as mentioned before, and we set $N = 14$. There is therefore a very low chance of getting the entire caption correct by picking a random word from the vocabulary at each time-step.
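A minimal sketch of the metric and of the chance level is given below. Comparing whitespace-separated tokens is an assumption about how "word by word" matching is carried out, and the function names are illustrative.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predicted captions identical, word for word, to the
    corresponding reference caption."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no captions to evaluate")
    matches = sum(p.split() == r.split()
                  for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


def chance_exact_match(vocab_size, seq_len):
    """Probability of matching a caption by drawing each word uniformly."""
    return (1.0 / vocab_size) ** seq_len
```

Even for a toy vocabulary of 10 words and captions of length 2, the chance level is only 1 in 100; it shrinks exponentially with the caption length.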
If we use the action category predicted by model M(-), trained for classification, and replace all occurrences of [something] with the most likely object string conditioned on that action class, the Exact-Match accuracy is . The same baseline for simplified object placeholders is .
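One literal reading of this baseline is sketched below: for each action class, count the object strings seen in the training captions, then fill every [something] slot of the predicted template with the single most frequent string. Whether the original baseline used one string for all slots or one per slot is not stated, so this is an assumption, as are the function names and the toy data.

```python
from collections import Counter, defaultdict


def fit_object_baseline(train_examples):
    """train_examples: iterable of (action_template, list_of_object_strings).
    Returns the most frequent object string for each action template."""
    counts = defaultdict(Counter)
    for action, objects in train_examples:
        counts[action].update(objects)
    return {action: c.most_common(1)[0][0] for action, c in counts.items()}


def baseline_caption(action, most_likely_object, slot="[something]"):
    """Fill every slot of the predicted template with that object string."""
    return action.replace(slot, most_likely_object[action])


train = [
    ("Opening [something]", ["door"]),
    ("Opening [something]", ["door"]),
    ("Opening [something]", ["box"]),
]
table = fit_object_baseline(train)
```

At test time, the action class predicted by the classifier would be passed to `baseline_caption` to produce the caption that is scored with Exact-Match accuracy.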
| Template | Holding [something] in front of [something] |
| Somethings | “a blue plastic cap”, “a men’s short sleeve shirt” |
| Full Caption | Holding a blue plastic cap in front of a men’s short sleeve shirt |
| Simplified Somethings | “cap”, “shirt” |
| Simplified Caption | Holding cap in front of shirt |
5.3.2 Joint training of Classification and Captioning.
While the captioning task theoretically entails action classification, we found that our two-channel networks optimized on the pure captioning task do not perform as well as models trained jointly on classification and captioning (see Table 6). By coarsely tuning the loss-weighting hyper-parameter $\lambda$ empirically, we found $\lambda = 0.1$ to work well and fix it at this value for the captioning experiments below. More specifically, we first train with a pure classification loss, by setting $\lambda = 1$, and subsequently introduce the captioning loss by gradually decreasing $\lambda$ to 0.1.
5.3.3 Captioning with simplified object placeholders.
We train different variations of our two-channel architecture on the simplified version of Something-Something captions with single word object labels. Table 5 summarizes our results. They show that the model with an equal number of 2D- and 3D-channels performs best (albeit by a fairly small margin). They also show that the best captioning model also performs best on the classification task. We also evaluate the models using standard captioning metrics: BLEU , ROUGE-L  and METEOR .
5.3.4 Captioning with full object placeholders.
We also train networks on the full object placeholders. This constitutes the finest level of action granularity. The experimental results are shown in Table 7. They show that, again, the best captioning model also yields the highest corresponding classification accuracy. The Exact-Match accuracy is significantly lower than for the simplified object placeholders, as it has to account for a much wider variety of phrases. The captioning models produce impressive qualitative results with a high degree of approximate action and object accuracy. Some examples are shown in Figure 6. More examples can be found in the appendix.
6 Transfer Learning
One of the most astonishing properties of neural networks is their ability to learn representations that can be successfully transferred to other tasks. This has been demonstrated clearly with features from the penultimate layer of networks trained on ImageNet (e.g., [34, 35]). A distinguishing feature of the ImageNet task, which likely contributes to its potential for transfer learning, is the dataset size and the wide variety of fine-grained discriminations required. One motivation for studying fine-grained video classification is to understand and improve the potential for transfer learning on video tasks. In what follows we explore transfer learning performance as a function of source task granularity.
We introduce 20bn-kitchenware, a few-shot video classification dataset that contains 390 videos equally divided into 13 action categories. As depicted in Figure 3, this dataset contains video clips of a single person manipulating a fork, a spoon, a knife or tongs for roughly 4 seconds. Similar to the Something-Something dataset, 20bn-kitchenware was designed to capture fine-grained actions with subtle differences. For each kitchen utensil X, the target label can belong to any one of 3 actions, namely, “Using X”, “Pretending to use X” or “Trying but failing to use X”. In addition to these 12 action categories derived from kitchenware manipulations, we also include a fall-back class labeled “Doing other things” (the appendix lists the action categories).
While these action categories are somewhat ambiguous by design, we further encourage the model to pay attention to visual details by including unused ‘negative’ objects in the scene. The last row of Figure 3 shows one such example; while the target label indicates a manipulation of tongs, the clip also contains a spoon with an egg in it that could fool a model which simply recognizes objects. Given the limited amount of data available for training (130 samples are available for training; the rest is used for validation and testing), the action granularity and the presence of negative objects, we hypothesize that only models that have some understanding of physical world properties will perform well on this dataset. We therefore use 20bn-kitchenware as a benchmark for evaluating the quality of models pre-trained on Something-Something or other datasets. We plan to release 20bn-kitchenware upon publication of this paper.
6.2 Proposed benchmark
6.2.1 Model candidates:
To further investigate whether training on fine-grained labels leads to better features, our transfer learning experiments consider four two-channel models that have been respectively pre-trained on coarse-grained labels (classification on action groups), on fine-grained labels (classification on 174 action categories), on simplified captions (captioning on fine-grained action categories expanded with a single object descriptor) and on template captions (captioning on fine-grained action categories expanded with object descriptors). We also include in this benchmark two neural nets that were pre-trained on other datasets, namely, a VGG16 network pre-trained on ImageNet, and an Inflated-ResNet34 pre-trained on Kinetics (https://github.com/kenshohara/3D-ResNets-PyTorch).
6.2.2 Training procedure:
The overall training procedure remains the same for all models. We freeze each pre-trained model and then train a neural net on top of extracted penultimate features. Independent of the architecture being used, we use the pre-trained model to produce 12 feature vectors per second. To achieve this, where necessary we split the input video into smaller clips and sequentially apply the pre-trained network on each clip individually (VGG16 is applied to each frame individually; for Inflated-ResNet34, we split the video into overlapping clips of 16 frames). Then, in the simplest case, we pass the obtained sequence of features through a logistic regressor and then average predictions over time. We also report results for which we classify pre-trained features using an MLP with 512 hidden units as well as a single bidirectional LSTM layer with 128 hidden states. This allows the network to perform some temporal analysis about the target domain.
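Two steps of this procedure lend themselves to a short sketch: cutting a video into overlapping 16-frame clips, and averaging per-step class predictions over time. The stride between clips is not stated, so the default `stride=8` is purely an assumption, and both function names are illustrative.

```python
def overlapping_clips(num_frames, clip_len=16, stride=8):
    """Return (start, end) frame index pairs of overlapping clips.
    The stride is a hypothetical choice; videos shorter than clip_len
    yield a single clip covering all frames."""
    if num_frames < clip_len:
        return [(0, num_frames)]
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, stride)]


def average_over_time(per_step_probs):
    """Average a sequence of per-step class probability vectors."""
    steps = len(per_step_probs)
    num_classes = len(per_step_probs[0])
    return [sum(step[k] for step in per_step_probs) / steps
            for k in range(num_classes)]
```

In the simplest classifier described above, each feature vector would be mapped to class probabilities by the logistic regressor, and `average_over_time` would then produce the video-level prediction.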
For each pretrained model and each classifier, we evaluate 1-shot, 5-shot and 10-shot performance, averaging scores obtained over 10 runs. Figure 4 shows the average scores as well as 95% confidence intervals. The most noticeable findings are as follows:
Logistic Regression vs MLP vs BiLSTM: Overall, using a recurrent network yields better performance.
Something-Something features vs other features: Our models pre-trained on Something-Something outperform other external models. This is not surprising given the nature of the target domain. 20bn-kitchenware samples are, by design, closer to Something-Something samples than ImageNet or Kinetics ones. However, it is surprising that VGG16 features perform better on 20bn-kitchenware than Kinetics features.
Effect of the action granularity: Figure 4 supports the contention that training on fine-grained tasks leads to better features. The best model on this benchmark is our model trained jointly on full captions and action categories. As expected, the only exception to this rule is the model that was trained to do pure captioning.
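The averaging over 10 runs with 95% confidence intervals shown in Figure 4 could be computed along the following lines. How the intervals were actually obtained is not stated; this sketch assumes a Student-t interval on the mean, using the two-sided 95% critical value for 9 degrees of freedom (about 2.262), and the function name is hypothetical.

```python
import math
import statistics


def mean_and_ci95(scores, critical_value=2.262):
    """Mean and half-width of a 95% confidence interval over runs.
    The default critical value is the t quantile for 10 runs (9 degrees
    of freedom); pass a different value for other run counts."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs")
    mean = statistics.mean(scores)
    half_width = critical_value * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width
```

The reported interval would then be `mean ± half_width` for each combination of pre-trained model, classifier and number of shots.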
Pre-training neural networks on large labeled datasets has become a driving force in many deep learning applications. Some might argue that it may be considered a serious competitor to unsupervised learning as a means to generate universal features that represent the visual world. Ever since ImageNet was used as a generic visual feature extractor, the hypothesis has emerged that it is the dataset size, the amount of detail and the variety of labels that drive a network’s capability to learn useful generic features. To the degree that this hypothesis is true, generating visual features capable of transfer learning should involve source tasks that (i) are as fine-grained and complex as possible, and (ii) ideally involve video rather than still images, because video is a much more fertile domain for defining complex tasks that represent aspects of the physical world.
In this work, we provide further evidence for that hypothesis, showing that the amount of detail in the task has a strong influence on the quality of the learned features. We also show that captioning, which to the best of our knowledge has hitherto been used only as a target task in transfer learning, can be a powerful source task itself. Our work suggests that one gets substantial leverage by utilizing ever more fine-grained recognition tasks, represented in the form of captions, possibly in combination with question-answering. Unlike the current trend of engineering neural networks to perform bounding box generation, semantic segmentation, or tracking, the appeal of fine-grained textual labels is that they provide a simple homogeneous interface. More importantly, they may provide “just enough” localization and tracking capability to solve a wide variety of important tasks, without allocating valuable compute resources to satisfy intermediate goals at an accuracy that these tasks may not actually require.
-  Goyal, R., Kahou, S.E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., Memisevic, R.: The “something something” video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision. ICCV17 (2017)
-  Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset (2017) preprint arXiv:1705.06950.
-  Monfort, M., Zhou, B., Bargal, S.A., Andonian, A., Yan, T., Ramakrishnan, K., Brown, L., Fan, Q., Gutfruend, D., Vondrick, C., Oliva, A.: Moments in time dataset: one million videos for event understanding (2018) preprint arXiv:1801.03150.
-  Sigurdsson, G.A., Russakovsky, O., Gupta, A.: What actions are needed for understanding human actions in videos? In: IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
-  Zhou, B., Andonian, A., Torralba, A.: Temporal relational reasoning in videos. CoRR (2017)
-  Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR08 (2008)
-  Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild (2012) preprint arXiv:1212.0402.
-  Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014) 1725–1732
-  Wu, J.: Computational perception of physical object properties. Master’s thesis, Massachusetts Institute of Technology (2016)
-  Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: Data collection and evaluation server. CoRR (2015)
-  Xu, J., Mei, T., Yao, T., Rui, Y.: Msr-vtt: A large video description dataset for bridging video and language. In: Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, IEEE (2016) 5288–5296
-  Yao, L., Ballas, N., Cho, K., Smith, J.R., Bengio, Y.: Oracle performance for visual captioning. arXiv preprint arXiv:1511.04590 (2015)
-  Heuer, H., Monz, C., Smeulders, A.W.: Generating captions without looking beyond objects. arXiv preprint arXiv:1610.03708 (2016)
-  Devlin, J., Gupta, S., Girshick, R.B., Mitchell, M., Zitnick, C.L.: Exploring nearest neighbor approaches for image captioning. CoRR (2015)
-  Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 2625–2634
-  Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729 (2014)
-  Kaufman, D., Levi, G., Hassner, T., Wolf, L.: Temporal tessellation for video annotation and summarization. arXiv preprint arXiv:1612.06950 (2016)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR (2015)
-  Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. CoRR (2014)
-  Williams, R., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1 (1989) 270–280
-  Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. (2015) 4489–4497
-  Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1) (2013) 221–231
-  Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. arXiv preprint arXiv:1705.07750 (2017)
-  Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? CoRR (2017)
-  Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., Kautz, J.: Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. (2016) 4207–4215
-  Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. (2011) 29–39
-  Li, Z., Gavrilyuk, K., Gavves, E., Jain, M., Snoek, C.G.: VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding (2017)
-  Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. (2014) 568–576
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR (2015)
-  Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. (2002)
-  Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004)
-  Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the EACL 2014 Workshop on Statistical Machine Translation. (2014)
-  Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
-  Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2014) 806–813
-  Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016)
In this supplementary document, we provide:
Qualitative examples of our classification and captioning results.
Visualization of our classification and captioning models using Grad-CAM.
The full list of action categories of 20bn-kitchenware.
Qualitative Examples of Classification
Here we provide video examples and their ground-truth action categories, along with the model's prediction for each. We use our model M(-), which is trained with . Interestingly, even when the predicted actions are incorrect, e.g. row in Figure 5, they are nevertheless usually quite sensible.
Qualitative Examples of Captioning
Below are examples of videos, accompanied by their ground-truth captions and the captions generated by the model. We use model M(-) in this section as well, which is also trained jointly for classification and captioning (with ).
Visualization of classification model with Grad-CAM
To visualize regularities learned from the data, we extracted temporally-sensitive saliency maps using Grad-CAM, for both the classification and captioning tasks. To this end, we extended the Grad-CAM implementation to video processing. Figure 7 shows saliency maps of examples from Something-Something obtained with model M(-) trained on fine-grained action categories, with (i.e., the pure classification task).
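As an illustration of what a temporally-sensitive Grad-CAM computation can look like, the following is a minimal NumPy sketch, not the implementation used for the figures. It assumes that the activations of one convolutional layer and the gradients of the class score with respect to them are already available as arrays of shape (C, T, H, W), and that channel weights are obtained by averaging gradients over time as well as space. The function name `grad_cam_video` and the final rescaling to [0, 1] are choices made for this example only.

```python
import numpy as np


def grad_cam_video(activations, gradients):
    """Grad-CAM over a spatio-temporal feature map (illustrative sketch).

    activations: array of shape (C, T, H, W), feature maps of one layer.
    gradients:   array of shape (C, T, H, W), d(score)/d(activations)
                 for the class of interest.
    Returns a (T, H, W) saliency volume rescaled to [0, 1].
    """
    activations = np.asarray(activations, dtype=np.float64)
    gradients = np.asarray(gradients, dtype=np.float64)
    if activations.ndim != 4 or activations.shape != gradients.shape:
        raise ValueError("expected two arrays of identical shape (C, T, H, W)")

    # Channel weights: gradients averaged over time and space (assumption:
    # the temporal axis is pooled together with the spatial axes).
    weights = gradients.mean(axis=(1, 2, 3))

    # Weighted sum over channels, then ReLU to keep only positive evidence.
    cam = np.maximum(np.tensordot(weights, activations, axes=(0, 0)), 0.0)

    # Rescale for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Because the time axis is kept in the output, each frame position of the feature map receives its own saliency slice, which can then be upsampled and overlaid on the corresponding input frames.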
Visualization of captioning model using Grad-CAM
To obtain saliency maps during the captioning process, we compute Grad-CAM once for each token, so that different regions of the video are highlighted for different tokens. Figures 8-10 show saliency maps for the captioning model, jointly trained with . Notice how the attentional focus of the model changes qualitatively as we perform Grad-CAM for different tokens in the target caption.
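The per-token procedure can be sketched as a loop that runs Grad-CAM once per caption token, each time using the gradient of that token's score with respect to the video features. The snippet below is a toy illustration under strong assumptions: the decoder is replaced by a hypothetical linear readout of globally pooled features (`readout`), so the gradient has a closed form and does not depend on previously generated tokens, unlike in a real recurrent decoder. All names (`per_token_saliency`, `toy_gradient`, `readout`) are invented for this example.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Unnormalized Grad-CAM map of shape (T, H, W)."""
    weights = gradients.mean(axis=(1, 2, 3))
    return np.maximum(np.tensordot(weights, activations, axes=(0, 0)), 0.0)


def per_token_saliency(activations, token_ids, score_gradient):
    """Return one (T, H, W) saliency map per caption token.

    score_gradient(step, token_id) must return d(score)/d(activations)
    with shape (C, T, H, W) for the token emitted at that step.
    """
    return [
        grad_cam(activations, score_gradient(step, token_id))
        for step, token_id in enumerate(token_ids)
    ]


# Toy stand-in for a captioning decoder (hypothetical): token logits are a
# linear readout of the features averaged over time and space.
rng = np.random.default_rng(0)
features = rng.random((8, 4, 7, 7))      # (C, T, H, W)
readout = rng.standard_normal((5, 8))    # 5-token vocabulary, 8 channels


def toy_gradient(step, token_id):
    # For logits = readout @ features.mean(axis=(1, 2, 3)), the gradient of
    # logit[token_id] w.r.t. features[c, t, h, w] is readout[token_id, c] / (T*H*W).
    per_channel = readout[token_id] / features[0].size
    return np.broadcast_to(per_channel[:, None, None, None], features.shape)


maps = per_token_saliency(features, [2, 0, 3], toy_gradient)
```

In a trained captioning model, `score_gradient` would instead be obtained by backpropagating the log-probability of each target token through the decoder, so the resulting maps would also reflect the decoding context at that step.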
20bn-kitchenware action categories
Table 8 gives the full list of 20bn-kitchenware action categories.
|Using a fork to pick something up|
|Pretending to use a fork to pick something up|
|Trying but failing to pick something up with a fork|
|Using a spoon to pick something up|
|Pretending to use a spoon to pick something up|
|Trying but failing to pick something up with a spoon|
|Using a knife to cut something|
|Pretending to use a knife to cut something|
|Trying but failing to cut something with a knife|
|Using tongs to pick something up|
|Pretending to use tongs to pick something up|
|Trying but failing to pick something up with tongs|
|Doing other things|