Video Action Understanding: A Tutorial


Abstract.

Many believe that the successes of deep learning on image understanding problems can be replicated in the realm of video understanding. However, the span of video action problems and the set of proposed deep learning solutions is arguably wider and more diverse than those of their 2D image siblings. Finding, identifying, and predicting actions are a few of the most salient tasks in video action understanding. This tutorial clarifies a taxonomy of video action problems, highlights datasets and metrics used to baseline each problem, describes common data preparation methods, and presents the building blocks of state-of-the-art deep learning model architectures.

video understanding, action understanding, action recognition, action prediction, action proposal, action localization, action detection

1. Introduction

Video understanding is a natural extension of deep learning research efforts in computer vision. The image understanding field has benefited greatly from the application of artificial neural network (ANN) machine learning (ML) methods. Many image understanding problems—object recognition, scene classification, semantic segmentation, etc.—have workable deep learning “solutions.” FixEfficientNet-L2 currently boasts 88.5%/98.7% Top-1/Top-5 accuracy on the ImageNet object classification task (touvron2020fixing; russakovsky2014imagenet). Hikvision Model D scores 90.99% Top-5 accuracy on the Places2 scene classification task (russakovsky2014imagenet; zhou2017places). HRNet-OCR yields a mean IoU of 85.1% on the Cityscapes semantic segmentation test (cordts2016cityscapes; barbier201901). Naturally, many hope that deep learning methods can achieve similar levels of success on video understanding problems.

Drawing from Diba et al. (2019), semantic video understanding is a combination of understanding the scene/environment, objects, actions, events, attributes, and concepts (diba2019large). This article focuses on the action understanding component and is presented as a tutorial by introducing a common set of terms and tools, explaining basic and fundamental concepts, and providing concrete examples. We intend this to be accessible to a general computer science audience and assume readers have a basic understanding of supervised learning—the paradigm of learning from input-output examples.

1.1. Action Understanding

While the literature often uses the terms action and activity synonymously (Ke_2013; CHAQUET2013633; cheng2015advances), we prefer to use action in this article for a few reasons. First, action is the dominant term across the field, and we would need significant reason to deviate from it. Second, the use of activity is generally biased towards human actors rather than non-human actors and phenomena. We prefer action for its broader applicability. Third, activity recognition is a term already used in several non-video domains (6208895; 10.1145/1964897.1964918; WANG20193). Meanwhile, action recognition is primarily a computer vision and video-based term.

But what is an action? Kang and Wildes (2016) (kang2016review) consider an action to be “a motion created by the human body, which may or may not be cyclic.” Zhu et al. (2016) (ZHU201642) define action as an “intentional, purposive, conscious and subjectively meaningful activity.” Several human action surveys create a spectrum of action complexity from gestures to interactions or group activities (ZHU201642; cheng2015advances; GUO20143343). Unlike these surveys, we will use a broader definition of action, one that includes actions of both human and non-human actors because (1) video datasets are being introduced that use this broader definition (monfort2018moments; monfort2019multimoments), (2) most deep learning metrics and methods are equally applicable to both settings, and (3) the colloquial use of action has no distinction between human and non-human actors. Merriam-Webster’s Dictionary and the Oxford English Dictionary define action as “an act done” and “something done or performed”, respectively (merriam-webster-action; oed-action). Therefore, this article defines action as something done or performed intentionally or unintentionally by a human or non-human actor from which a human observer could derive meaning. This includes everything from low-level gestures and motions to high-level group interactions.

Figure 1. Overview of action understanding steps (problem formulation, dataset selection, data preparation, model development, and metric-based evaluation) and underlying principles (computational performance, data diversity, transferability, robustness, and understandability). This serves as the framework for this tutorial.
\Description

Action understanding is placed as an umbrella over five boxes labeled with the five problem steps. The problems box lists recognition, prediction, proposal, and localization. The datasets box lists Moments in Time, HACS, AVA-Kinetics, AViD, etc. and annotation types (action class, temporal markers, and spatiotemporal bounding boxes). The data preparation box lists cleaning, augmentation, and hand-crafted features. The models box lists single/multi-stream, generative/non-generative, top-down/bottom-up, single/multi-stage, and building blocks (CNNs, RNNs, and Fusion). The metrics box lists Top-k accuracy, mAP, Hit@k, AR@AN, Average mAP, Frame-mAP, Video-mAP, etc. Arrows show the flow of steps from left to right. An underlying principles box sits below the steps.

As shown in Figure 1, action understanding encompasses action problems, video action datasets, data preparation techniques, deep learning models, and evaluation metrics. Underlying these steps are computer vision and supervised learning principles of computational performance, data diversity, transferability, model robustness, and understandability.

1.2. Related Work and Our Contribution

Survey  Year  Cited  Actions (H, N)  Topics (Ds, Mc, Md)  Problems (AR, AP, TAP, TAL/D, SAL/D)
Poppe (POPPE2010976) 2010 2,252
Weinland et al. (WEINLAND2011224) 2011 1,050
Ahad et al. (6060230) 2011 28
Chaquet et al. (CHAQUET2013633) 2013 364
Guo and Lai (GUO20143343) 2014 170
Cheng et al. (cheng2015advances) 2015 116
Zhu et al. (ZHU201642) 2016 69
Kang and Wildes (kang2016review) 2016 31
Zhang et al. (ZHANG201686) 2016 168
Herath et al. (HERATH20174) 2017 342
Koohzadi and Charkari (koohzadi) 2017 27
Asadi-Aghbolaghi et al. (7961779) 2017 103
Kong and Fu (kong2018human) 2018 91
Zhang et al. (s19051005) 2019 60
Bhoi (bhoi2019spatiotemporal) 2019 1
Singh and Vishwakarma (video-benchmarks) 2019 18
Xia and Zhan (9062498) 2020 0
Rasouli (rasouli2020deep) 2020 0
Ours 2020
Table 1. Coverage of surveys on action understanding. Tabular information includes year of publication, number of citations on Google Scholar as of August 2020, action coverage: human (H) and non-human (N), topic coverage: datasets (Ds), metrics (Mc), models/methods (Md), and problem coverage: action recognition (AR), action proposal (AP), temporal action proposal (TAP), temporal action localization/detection (TAL/D), spatiotemporal action localization/detection (SAL/D).

Table 1 shows a selection of surveys written in the last decade on action understanding. Of the more recent examples, Kong and Fu (kong2018human), Xia and Zhan (9062498), and Rasouli (rasouli2020deep) are the most thorough in their independent directions. Despite all of these works, few focus on more than one or two action problems or present more than a narrow coverage of video action datasets. The vast majority only consider a narrow (human) definition of actions. Additionally, the few that cover metrics generally do so shallowly. Relative to the literature noted above, we present this article as a tutorial and contribute the following:

  • Clear definitions of recognition, prediction, proposal, and localization/detection problems.

  • An extensive and up-to-date catalog of video action datasets.

  • Descriptions of the oft-neglected yet important methods of data preparation.

  • Explanations of common deep learning model building blocks.

  • Groupings of state-of-the-art model architectures.

  • Formal definitions of evaluation metrics across the span of action problems.

Our paper is organized in the following way. Section 2 defines and organizes action understanding problems. Section 3 catalogs video action datasets by annotation type which directly relates to the problems for which they are applicable. Section 4 provides an introduction to video data and data preparation techniques. Section 5 presents basic model building blocks and organizes state-of-the-art methods. Section 6 defines standard metrics used across these problems, formally shows how they are calculated, and points to examples of their usage in high-profile action understanding competitions. Section 7 summarizes and concludes the tutorial.

2. Problems

Several problems fall under the umbrella of action understanding. In this section, we introduce a taxonomy of these problems, provide definitions, and indicate disagreements in the literature.

2.1. Taxonomy

Figure 2. Action understanding problem taxonomy.
\Description

Two overlapping bubbles forming a venn-diagram. Action recognition and action prediction sit in only the classification bubble. Temporal action proposal and spatiotemporal action proposal sit only in the search bubble. Temporal action localization/detection and spatiotemporal action localization/detection sit in the intersection of classification and search bubbles.

As shown in Figure 2, we organize the main action understanding problems into overlapping classify and search bins. Classification problems involve labeling videos by their action class. Search problems involve temporally or spatiotemporally finding action instances.

Figure 3. Action understanding problems: action recognition (AR), action prediction (AP), temporal action proposal (TAP), temporal action localization/detection (TAL/D), spatiotemporal action proposal (SAP), and spatiotemporal action localization/detection (SAL/D). Video is depicted as a 3D volume where frames are densely stacked along a temporal dimension.
\Description

Videos are shown as long rectangular prisms. Arrows point to colored regions indicating action classes, action proposals, and action detections.

Definitions

The following are the action understanding problems covered in this tutorial:

Action Recognition (AR) is the process of classifying a complete input (either an entire video or a specified segment) by the action occurring in the input. If the action instance spans the entire length of the input, then the problem is known as trimmed action recognition. If the action instance does not span the entire input, then the problem is known as untrimmed action recognition. Untrimmed action recognition is generally more challenging because a model would need to complete the action classification task while disregarding non-action background segments of the input.

Action Prediction (AP) is the process of classifying an incomplete input by the action yet to be observed. One sub-problem is action anticipation (AA), in which no portion of the action has yet been observed and classification is based entirely on observed contextual clues. Another is early action prediction (EAP), in which a portion, but not the entirety, of the action instance has been observed. Both AR and AP are classification problems, but AP often requires a dataset with temporal annotations so that there is a clear delimiter between a ”before-action” segment and a ”during-action” segment for AA or between ”start-action” and ”end-action” for EAP.

Temporal Action Proposal (TAP) is the process of partitioning an input video into segments (consecutive series of frames) of action and inaction by indicating start and end markers of each action instance. Temporal Action Localization/Detection (TAL/D) is the process of creating temporal action proposals and classifying each action.

Spatiotemporal Action Proposal (SAP) is the process of partitioning an input video by both space (bounding boxes) and time (per-frame OR start and end markers of a segment) between regions of action and inaction. If a linking strategy is applied to bounding boxes across several frames, the regions of actions that are constrained in the spatial and temporal dimensions are often referred to as tubes or tubelets. Spatiotemporal Action Localization/Detection (SAL/D) is the process of creating spatiotemporal action proposals and classifying each frame’s bounding boxes (or action tubes when a linking strategy is applied).

Literature Observations

This taxonomy and these definitions are intended to clarify several term discrepancies in the literature. First, recognition and classification are sometimes used interchangeably (e.g. (4270162; 4270157; Girdhar_2017_CVPR)). We believe that should be avoided because both recognition (an identification task) and prediction (an anticipation task) require arranging inputs into categories (i.e. classification). To use recognition and classification synonymously would incorrectly equate recognition and prediction. Second, localization and detection are often used interchangeably (e.g. (Shou_2016_CVPR; Zhao_2017_ICCV; Chao_2018_CVPR)). However, in this case, because the task involves both finding and identifying, we feel the terms are appropriate. While detection appears slightly more prevalent in the temporal action literature and localization appears slightly more prevalent in the spatiotemporal action literature, this article will remain neutral and use localization/detection (L/D) together as a single term. Third, action proposal and action proposal generation are used interchangeably (e.g. (Lin_2018_ECCV; Gao_2018_ECCV; Liu_2019_CVPR)). We chose the former because it is shorter, and because proposal can be defined as the act of generating a proposal, referring to proposal generation is redundant. An important takeaway is that there are many examples in the literature where different terms refer to the same video action problem (e.g. (bhoi2019spatiotemporal) and (ESCORCIA2020102886)). Similarly, there are many examples where the same terms refer to different video action problems (e.g. (zeng2019graph) and (xu2017rc3d)). To compound the issue, many video action datasets can be applied to more than one of these problems. We encourage those entering the field to carefully examine a paper’s purpose before assuming it is related to a particular line of interest.

Another notable observation from the literature is that while TAP and TAL/D are sometimes studied independently, SAP is not studied outside of a SAL/D framework. Therefore, the remainder of this article will not refer to SAP independently of SAL/D.

2.2. Related Problems

Here, we highlight a few video problems related to but not included in our main taxonomy.

Action instance segmentation (AIS) is the labeling of individual instances or examples of an action within the same video input even when these action instances may overlap in both space and time. Therefore, AIS is a constraint that can be placed on top of TAL/D or SAL/D. For example, a model performing SAL/D on a video of a concert may identify the frames and bounding boxes where the audience is shown and label the proposed temporal segment with the action “clapping.” Applying the AIS constraint on top of this would require the model to divide the bounding boxes into each individual clapping member of the audience and track these individual actions across time. Useful action instance segmentation literature includes Weinland et al. (2011) (WEINLAND2011224), Saha et al. (2017) (saha2017spatiotemporal), Ji et al. (2018) (Ji_2018_ECCV), and Saha et al. (2020) (Saha2020).

Dense captioning is the generation of sentence descriptions for videos. This problem spans several of the video understanding semantic components and is worth noting because it is often paired with action understanding problems in public challenges (ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019; activitynetchallenge2020). Similarly, video captioning datasets (such as MSVD (chen2011collecting), MVAD (torabi2015using), MPII-MD (Rohrbach_2015_CVPR) and ActivityNet Captions (krishna2017dense)) will sometimes be included in video action understanding dataset lists. For more on video captioning, Li et al. (2019) (8627985) present a survey on methods, datasets, difficulties, and trends.

Action spotting (AS), proposed by Alwassel et al. (2018) (Alwassel_2018_ECCV), is the process of finding any temporal occurrence of an action in a video while observing as little as possible. This differs from TAL/D in two ways. First, AS requires only finding a single frame within the action instance segment rather than start and end markers. Second, AS is concerned with the efficiency of the search process.

Object tracking is the process of detecting objects and linking detections between frames to track them across time. Object tracking is a relevant related problem because some metrics used for object detection in videos were adopted in video action detection (kpkl2019watch; pascal-voc). We recommend Yao et al. (2019) (yao2019video) for a recent and broad survey on video object segmentation and tracking.

3. Datasets

Data is critical to a successful machine learning model. In this section, we catalog video action datasets, describe the diversity of foundational and emerging benchmarks, and highlight competitions using these datasets that have been principal drivers of model development and progress in the field.

3.1. Video Action Dataset Catalog

The last two decades have seen huge growth in available video action datasets. To the best of our knowledge, we have organized the most comprehensive collection of these datasets in the literature. We catalog 137 video action datasets sorted by release year. Due to the scale of this catalog, 30 of the most historically influential, current state-of-the-art, and emerging benchmark datasets are highlighted in Table 2, while the full catalog can be found in Appendix A.

Video Dataset  Year  Cited  Classes  Instances  Actors (H, N)  Annotations (C, T, S)  Theme/Purpose
KTH (1334462) 2004 3,853 6 2,391 B/W, static background
Weizmann (1544882) 2005 1,890 10 90 human motions
Coffee & Cigarettes (4409105) 2007 491 2 246 movies and TV
Hollywood2 (5206557) 2009 1,312 12 3,669 movies
VIRAT (5995586) 2011 536 23 10,000 surveillance, aerial-view
HMDB51 (6126543) 2011 1,928 51 7,000 human motions
UCF101 (soomro2012ucf101) 2012 2,470 101 13,320 web videos, expand UCF50
ADL (6248010) 2012 619 18 1,200 egocentric, daily activities
THUMOS’13 (THUMOS13; idrees2017thumos; soomro2012ucf101) 2013 146 *101 13,320 web videos, extend UCF101
J-HMDB-21 (Jhuang_2013_ICCV) 2013 458 51 928 re-annotate HMDB51 subset
Sports-1M (Karpathy_2014_CVPR) 2014 4,361 487 1,000,000 multi-label, sports
MEXaction2 (mexaction2) 2015 n/a 2 1,975 culturally relevant actions
ActivityNet200 (v1.3) (Heilbron_2015_CVPR) 2016 797 200 23,064 untrimmed web videos
Kinetics-400 (kay2017kinetics) 2017 810 400 306,245 diverse web videos
AVA (Gu_2018_CVPR) 2017 270 80 392,416 atomic visual actions
Moments in Time (MiT) (monfort2018moments) 2017 137 339 836,144 intra-class variation, web videos
MultiTHUMOS (multithumos) 2017 231 65 16,000 multi-label, extends THUMOS
Kinetics-600 (carreira2018short) 2018 52 600 495,547 extends Kinetics-400
EGTEA Gaze+ (Li_2018_ECCV) 2018 52 106 10,325 egocentric, kitchen
Something-Something-v2 (mahdisoltani2018effectiveness) 2018 5 174 220,847 extends Something-Something
Charades-Ego (sigurdsson2018charadesego) 2018 19 157 68,536 egocentric, daily activities
Jester (Materzynska_2019_ICCV) 2019 12 27 148,092 crowd-sourced, gestures
Kinetics-700 (carreira2019short) 2019 33 700 650,000 extends Kinetics-600
Multi-MiT (monfort2019multimoments) 2019 1 313 1,020,000 multi-label, extends MiT
HACS Clips (zhao2017hacs) 2019 31 200 1,500,000 trimmed web videos
HACS Segments (zhao2017hacs) 2019 31 200 139,000 extends and improves SLAC
NTU RGB-D 120 (8713892) 2019 55 120 114,480 extends NTU RGB-D 60
EPIC-KITCHENS-100 (damen2020rescaling) 2020 6 97 90,000 extends EPIC-KITCHENS-55
AVA-Kinetics (li2020avakinetics) 2020 5 80 238,000 adds annotations, AVA+Kinetics
AViD (piergiovanni2020avid) 2020 0 887 450,000 diverse peoples, anonymized faces
*Only 24 classes have spatiotemporal annotations. This subset is also known as UCF101-24.
Table 2. 30 historically influential, current state-of-the-art, and emerging benchmarks of video action datasets. Tabular information includes dataset name, year of publication, citations on Google Scholar as of August 2020, number of action classes, number of action instances, actors: human (H) and/or non-human (N), annotations: action class (C), temporal markers (T), spatiotemporal bounding boxes/masks (S), and theme/purpose. The full catalog can be found in Appendix A.

Criteria

We include a dataset in our catalog if it meets the following criteria:

  1. The dataset was released between 2004 and 2020.

  2. The dataset contains single-channel (B/W) or three-channel (RGB) videos.

  3. The dataset includes annotations of each video or defined segments of each video.

  4. The dataset contains at least 2 action classes.

  5. The dataset contains at least one of the following types of annotations: (C) action class labels, (T) temporal start/end segment markers or frame-level labels, or (S) spatiotemporal frame-level bounding boxes or masks.

Content

For each dataset, we report the name, release year, citations on Google Scholar as of August 2020, number of action classes, number of action instances, types of actors: human (H) and/or non-human (N), annotations: class (C), temporal (T), or spatiotemporal (S), and theme/purpose. We chose to include the total number of action instances rather than total number of videos because supervised learning (the predominant action understanding deep learning paradigm) is dependent on the number of positively labeled examples in the training set. Including annotation type is critical because those determine the types of action understanding problems for which the datasets are useful. The theme/purpose is intended to provide some insight into the applicability of a particular dataset. While the catalog may not include a dataset for your specific research purpose, we hope that it helps in finding suitable data for pretraining and transfer learning.

Figure 4. Trends in video action dataset sizes from 2004 to mid-2020. Both the number of action classes and the number of action instances in these datasets have increased by several orders of magnitude. Note that the action classes dimension is log-scaled.
\Description

A scatter plot of datasets with the year on the x-axis and number of action classes on the y-axis. Circle marker size represents the number of action instances. Circle marker color indicates which combination of action annotations is present (class, class+spatiotemporal, class+temporal, class+temporal+spatiotemporal). Datasets generally trend upward in size and along the y-axis across time.

Trends

By plotting these datasets by year and size in Figure 4, several trends and observations emerge. First, these datasets have grown considerably over the past two decades in both number of action classes and number of action instances. This trend is present across all of the use cases and has occurred over several orders of magnitude. Larger datasets are essential for training deep learning models with often millions of parameters. Second, datasets only useful for classification (mainly AR) are considerably larger and more prevalent than temporally or spatiotemporally annotated datasets. This is expected because temporal markers or spatiotemporal bounding boxes are more challenging to create. An annotator may require only a few seconds to identify whether a particular video contains a given action but would need much more time to mark the start and end of an action. Additionally, solving AR is often considered a prerequisite for effective TAL/D or SAL/D. Therefore, recognition research has generally preceded localization/detection research.

3.2. Foundational and Emerging Benchmarks

Below, we describe datasets and dataset families in three groups: (1) datasets with only action class annotations primarily for AR, (2) datasets with temporal annotations most useful for TAP, TAL/D, and sometimes AP and (3) datasets with spatiotemporal annotations most useful for SAL/D. Because many of the earlier influential video action datasets such as KTH, Weizmann, etc. are described at length in previous survey papers (6060230; CHAQUET2013633; kong2018human), we focus the following descriptions on the current largest and highest quality datasets.

Action Recognition Datasets

Figure 5 plots AR-focused datasets by number of classes and number of instances. Here we describe some of the largest and highest quality among them.

Figure 5. Datasets with only action class annotations mainly useful for AR. Note that the plot is log-scaled in both action instances and action classes dimensions.
\Description

A scatter plot with class-only annotations datasets plotted along action instances on the y-axis and action classes on the x-axis. Most fall under 100 classes and 100,000 instances, but a few fall above in both regards.

Sports-1M (Karpathy_2014_CVPR) was produced in 2014 as a large-scale video classification benchmark for comparing CNNs. Examples of the 487 sports action classes include ”cycling”, ”snowboarding”, and ”american football”. Note that some inter-class variation is low (e.g. classes include 23 types of billiards, 6 types of bowling, and 7 types of American football). Videos were collected from YouTube and weakly annotated using text metadata. The dataset consists of one million videos with a 70/20/10 training/validation/test split. On average, videos are 5.5 minutes long, and approximately 5% are annotated with more than one class. As one of the first large-scale datasets, Sports-1M was critical for demonstrating the effectiveness of CNN architectures for feature learning.

Something-Something (Goyal_2017_ICCV) (a.k.a. 20BN-SOMETHING-SOMETHING) was produced in 2017 as a human-object interaction benchmark. Examples of the 174 classes include ”holding something”, ”turning something upside down”, and ”folding something”. Video creation was crowd-sourced through Amazon Mechanical Turk (AMT). The dataset consists of 108,499 videos with an 80/10/10 training/validation/test split. Each single-instance video lasts for 2-6 seconds. The dataset was expanded to Something-Something-v2 (mahdisoltani2018effectiveness) in 2018 by increasing the size to 220,847 videos, adding object annotations, reducing label noise, and improving video resolution. These datasets are important benchmarks for human-object interaction due to their scale and quality.

The Kinetics dataset family was produced as ”a large-scale, high quality dataset of URL links” to human action video clips focusing on human-object interactions and human-human interactions. Kinetics-400 (kay2017kinetics) was released in 2017, and examples of the 400 human actions include ”hugging”, ”mowing lawn”, and ”washing dishes”. Video clips were collected from YouTube and annotated by AMT crowd-workers. The dataset consists of 306,245 videos. Within each class, 50 are reserved for validation and 100 are reserved for testing. Each single-instance video lasts for 10 seconds. The dataset was expanded to Kinetics-600 (carreira2018short) in 2018 by increasing the number of classes to 600 and the number of videos to 495,547. The dataset was expanded again to Kinetics-700 (carreira2019short) in 2019 by increasing to 700 classes and 650,317 videos. These are among the most cited human action datasets in the field and continue to serve as a standard benchmark and pretraining source.

NTU RGB-D (shahroudy2016ntu) was produced in 2016 as ”a large-scale dataset for RGB-D human action recognition.” The multi-modal nature provides depth maps, 3D skeletons, and infrared in addition to RGB video. Examples of the 60 human actions include ”put on headphone”, ”toss a coin”, and ”eat meal”. Videos were captured with a Microsoft Kinect v2 in a variety of settings. The dataset consists of 56,880 single-instance video clips from 40 different subjects in 80 different views. Training and validation splits are not specified. The dataset was improved to NTU RGB-D 120 (8713892) in 2019 by increasing the number of classes to 120, videos to 114,480, subjects to 106, and views to 155. This serves as a state-of-the-art benchmark for human AR with non-RGB modalities.

Moments in Time (MiT) (monfort2018moments) was produced in 2018 with a focus on broadening action understanding to include people, objects, animals, and natural phenomena. Examples of the 339 diverse action classes include ”running”, ”opening”, and ”picking”. Video clips were collected from a variety of internet sources and annotated by AMT crowd-workers. The dataset consists of 903,964 videos with a roughly 89/4/7 training/validation/test split. Each single-instance video lasts for 3 seconds. The dataset was improved to Multi-Moments in Time (M-MiT) (monfort2019multimoments) in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video (2.01 million total labels). MiT and M-MiT are interesting benchmarks because of the focus on inter-class and intra-class variation.

Jester (Materzynska_2019_ICCV) (a.k.a. 20BN-JESTER) was produced in 2019 as ”a large collection of densely labeled video clips that show humans performing pre-defined hand gestures in front of laptop camera or webcam.” Examples of the 27 human hand gestures include ”drumming fingers”, ”shaking hand”, and ”swiping down”. Data creation was crowd-sourced through AMT. The dataset consists of 148,092 videos with an 80/10/10 training/validation/test split. Each single-instance video lasts for 3 seconds. The Jester dataset is the first large-scale, semantically low-level human AR dataset.

Anonymized Videos from Diverse countries (AViD) (piergiovanni2020avid) was produced in 2020 with the intent of (1) avoiding the western bias of many datasets by providing human actions (and some non-human actions) from a diverse set of people and cultures, (2) anonymizing all human faces to protect the privacy of the individuals, and (3) ensuring that all videos in the dataset are static with a creative commons license. Most of the 887 classes are drawn from Kinetics (carreira2019short), Charades (10.1007/978-3-319-46448-0_31), and MiT (monfort2018moments) while removing duplicates and any actions that involve the face (e.g. ”smiling”). 159 actions not found in any of those datasets are also added. Web videos in 22 different languages were annotated by AMT crowd-workers. The dataset consists of approximately 450,000 videos with a 90/10 training/validation split. Each single-instance video lasts between 3 and 15 seconds. We believe AViD will quickly become a foundational benchmark because of the emphasis on diversity of actors and privacy standards.

Temporally Annotated Datasets

Figure 6 plots temporally annotated datasets by number of classes and action instances. Here we describe some of the largest and highest quality among them.

Figure 6. Datasets with temporal annotations useful for TAP, TAL/D, and possibly AP. Note that the plot is log-scaled in both action instances and action classes dimensions. The SLAC dataset (slac) is excluded because while it has a very large number of temporally annotated action instances, the dataset was of poor quality. HACS Segments was developed out of SLAC and has significantly fewer temporally annotated action instances.
\Description

A scatter plot with class-only annotations datasets plotted along action instances on the y-axis and action classes on the x-axis. Most fall above 20 classes and 1000 actions. None have more than 200 classes and approximately 100,000 instances.

The ActivityNet dataset family (heilbron2014collecting; Heilbron_2015_CVPR) was produced ”to compare algorithms for human activity understanding: global video classification, trimmed activity recognition and activity detection.” Example human action classes include ”Drinking coffee”, ”Getting a tattoo”, and ”Ironing clothes”. ActivityNet 100 (v1.2) was released in 2015. The 100-class dataset consists of 9,682 videos divided into a 4,819 videos (7,151 instances) training set, a 2,383 videos (3,582 instances) validation set, and a 2,480 videos test set. ActivityNet 200 (v1.3) was released in 2016. The 200-class dataset consists of 19,994 videos divided into a 10,024 videos (15,410 instances) training set, a 4,926 videos (7,654 instances) validation set, and a 5,044 videos test set. On average, action instances are 51.4 seconds long. Web videos were temporally annotated by AMT crowd-workers. ActivityNet has remained a foundational benchmark for TAP and TAL/D because of its scope and size. It is also commonly applied as an untrimmed multi-label AR benchmark.

Charades (10.1007/978-3-319-46448-0_31) was produced in 2016 as a crowd-sourced dataset of daily human activities. Examples of the 157 classes include ”pouring into cup”, ”running”, and ”folding towel”. The dataset consists of 9,848 videos (66,500 temporal action annotations) with a roughly 80/20 training/validation split. Videos were filmed in 267 homes with an average length of 30.1 seconds and an average of 6.8 actions per video. Action instances average 12.8 seconds long. Charades-Ego was released in 2018 using similar methodologies and the same 157 classes. However, in this dataset, an egocentric (first-person) view and a third-person view are available for each video. The dataset consists of 7,860 videos (68.8 hours) capturing 68,536 temporally annotated action instances. Charades has served as a TAL/D benchmark along with ActivityNet, but it has also found use as a multi-label AR benchmark because of the high average number of actions per video. Charades-Ego presents a multi-view quality unique among large-scale daily human action datasets.

MultiTHUMOS (multithumos) was produced in 2017 as an extension of the dataset used in the 2014 THUMOS Challenge (THUMOS14). Examples of the 65 human action classes include ”throw”, ”hug”, and ”talkToCamera”. The dataset consists of 413 videos (30 hours) with 38,690 multi-label, frame-level annotations (an average of 1.5 per frame). The total number of action instances—where an instance is a set of sequential frames with the same action annotation—is not reported. The number of action instances per class is extremely variable ranging from ”VolleyballSet” with 15 to ”Run” with 3,500. Each action instance lasts on average for 3.3 seconds with some lasting only 66 milliseconds (2 frames). Like Charades, the MultiTHUMOS dataset offers a benchmark for multi-label TAP and TAL/D. It stands out due to its dense multi-labeling scheme.

VLOG (Fouhey_2018_CVPR) was produced in 2018 as an implicitly gathered large-scale daily human actions dataset. Unlike previous daily human action datasets (doi:10.1177/0278364913478446; 10.1007/978-3-319-46448-0_31; Goyal_2017_ICCV) in which the videos were created, VLOG was compiled from internet daily lifestyle video blogs (vlogs) and annotated by crowd-workers. The method improves diversity of participants and scenes. The dataset consists of 144,000 videos (14 days, 8 hours) using a 50/25/25 training/validation/test split. The 30 classes are the objects with which the person is interacting (e.g. ”Bag”, ”Laptop”, and ”Toothbrush”). Clips are labeled with these hand/object classes and temporally annotated with the state (positive/negative) of hand-object contact. Because of the collection and annotation methods, VLOG brings actions in daily life datasets closer on par with other temporally annotated large-scale datasets.

HACS Segments (zhao2017hacs) was produced in 2019 as ”a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos”. Both HACS Segments and HACS Clips (the AR portion) are improvements on the SLAC dataset produced in 2017 (slac). HACS uses the same 200 human action classes as ActivityNet 200 (v1.3). Videos were collected from YouTube and temporally annotated by crowd-workers. HACS Segments consists of 50,000 videos with a 76/12/12 training/validation/test split. The dataset contains 139,000 action instances (referred to as segments). Compared to ActivityNet, the number of action instances per video is greater (2.8 versus 1.5), and the average action instance duration is shorter (40.6 versus 51.4 seconds). HACS Segments is an emerging benchmark and provides a more challenging task for human TAP and TAL/D.

Spatiotemporally Annotated Datasets

Figure 7 plots spatiotemporally annotated datasets. Here we describe some of the largest and highest quality among them. We also describe two smaller but still highly relevant datasets: UCF101-24 and J-HMDB-21.

Figure 7. Datasets with spatiotemporal annotations useful for SAP and SAL/D. Note that the plot is log-scaled in both action instances and action classes dimensions.
\Description

A scatter plot with class-only annotations datasets plotted along action instances on the y-axis and action classes on the x-axis. Most fall between 2 and 20 classes and between 10 and 100 instances. A few extend upwards of 120 classes and 100,000 instances.

VIRAT (5995586) was created in 2011 as ”a new large-scale surveillance video dataset designed to assess the performance of event recognition algorithms in realistic scenes.” It includes both ground and aerial surveillance videos. Examples of the 23 classes include ”picking up”, ”getting in a vehicle”, and ”exiting a facility”. The dataset consists of 17 videos (29 hours) with between 10 and 1,500 action instances per class. Due to the camera to action distance across the varying views, the human to video height ratio is between 2% and 20%. Crowd-workers created bounding boxes around moving objects and temporal event annotations. While this is a smaller dataset, VIRAT is the highest quality surveillance-based spatiotemporal dataset and is used in the latest SAL/D competitions (activitynetchallenge2019; activitynetchallenge2020).

UCF101-24, the spatiotemporally labelled data subset of THUMOS’13 (THUMOS13), was produced in 2013 as part of the THUMOS’13 challenge. Examples of the 24 human action classes include ”BasketballDunk”, ”IceDancing”, ”Surfing”, and ”WalkingWithDog”. Note that the majority of the classes are sports. It consists of 3,207 videos from the original UCF101 dataset (soomro2012ucf101). Each video contains one or more spatiotemporally annotated action instances. While multiple instances within a video will have separate spatial and temporal boundaries, they will have the same action class label. Videos average 7 seconds long. The dataset is organized into three train/test splits. While a small dataset, UCF101-24 remains a foundational benchmark for SAL/D.

J-HMDB-21 (Jhuang_2013_ICCV) was produced in 2013 for pose-based action recognition. Examples of the 21 human action classes include ”brush hair”, ”climb stairs”, and ”shoot bow”. The dataset consists of 928 videos from the original HMDB51 dataset (6126543) and is divided into three 70/30 train/test splits similar to UCF101. Each video contains one action instance that lasts for the entire duration of the video. 2D joint masks and human-background segmentation annotations were created by AMT crowd-workers. Because all of the action classes are human actions, bounding boxes could easily be derived from the joint masks or segmentation masks. Along with UCF101-24, J-HMDB-21 serves as an early foundational benchmark for SAL/D.

EPIC-KITCHENS-55 (Damen_2018_ECCV) was produced in 2018 as a large-scale benchmark for egocentric kitchen activities. Examples of the 149 human action classes include ”put”, ”open”, ”pour”, and ”peel”. Videos were captured by head-mounted GoPro cameras on 32 individuals in 4 cities who were instructed to film anytime they entered their kitchen. AMT crowd-workers located relevant actions and objects as well as created final action segment start/end annotations and object bounding boxes. The dataset consists of 432 videos (55 hours) divided into a 272-video train/validation set, a 106-video test set 1 (for previously seen kitchens), and a 54-video test set 2 (for previously unseen kitchens). These sets correspond to 28,561, 8,064, and 2,939 action instances, respectively. The dataset was improved to EPIC-KITCHENS-100 (damen2020rescaling) in 2020 by increasing the number of videos to 700 (100 hours), action instances to 89,879, participants to 37, and environments to 34. Annotation quality was also improved. This dataset serves as a state-of-the-art egocentric kitchen activities benchmark.

Atomic Visual Actions (AVA) (Gu_2018_CVPR) was produced in 2017 as the first large-scale spatiotemporally annotated diverse human action dataset. Examples of the 80 classes include ”swim”, ”write”, and ”drive”. The dataset consists of 437 15-minute videos with an approximately 55/15/30 training/validation/test split. When only using the 60 most prominent classes (i.e. excluding those with fewer than 25 action instances), the dataset contains 214,622 training, 57,472 validation, and 120,322 test action instances. Videos were gathered from YouTube and segments were annotated by crowd-workers. Ground truth ”tracklets” were calculated between manually annotated sections. Because of its scale, AVA serves as a large-scale multi-label benchmark for SAL/D.

The AVA-Kinetics dataset (li2020avakinetics) was produced in 2020 with the purpose of using an existing large-scale human action recognition dataset to create a large-scale spatiotemporally annotated atomic video action dataset. The dataset combines a subset of videos from Kinetics-700 (carreira2019short) and all videos from AVA (Gu_2018_CVPR) for a total of 238,906 videos with a roughly 59/14/27 training/validation/test split. For each 10-second video from Kinetics-700, a combination of an automated person detector and human crowd-workers created bounding boxes for the frame with the highest-confidence person detection. Crowd-workers then labeled the set of action instances performed by the person using the 80 possible action classes from the AVA dataset. This dataset is an emerging benchmark because it improves upon AVA by dramatically expanding the number of annotated frames and increasing visual diversity.

3.3. Competitions

Several competitions have introduced state-of-the-art datasets, galvanized model development, and standardized metrics. The THUMOS Challenges (THUMOS13; THUMOS14; THUMOS15) were held at the International Conference on Computer Vision (ICCV) in 2013, the European Conference on Computer Vision (ECCV) in 2014, and the Conference on Computer Vision and Pattern Recognition (CVPR) in 2015. These primarily focused on AR and TAL/D tasks. The ActivityNet Large Scale Activity Recognition Challenges (Heilbron_2015_CVPR; ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019; activitynetchallenge2020) were held at CVPR from 2016 through 2020 and have steadily expanded in scope to encompass trimmed AR, untrimmed AR, TAP, TAL/D, and SAL/D competitions. Other challenges have been modeled on THUMOS and ActivityNet, such as the Workshop on Multi-modal Video Analysis and the Moments in Time Challenge held at ICCV in 2019. We provide an overview of these competitions in Appendix A.

4. Data Preparation

While some datasets are available in pre-processed forms, others are presented raw—using the original frame rate, frame dimensions, and duration. Data preparation is the process of transforming data prior to learning. This step is essential to extract relevant features, fit model input specifications, and prevent overfitting during training. Key preparation processes include:

  • Data cleaning is the process of detecting and removing incomplete or irrelevant portions of the dataset. For datasets that simply link to YouTube or other web videos (e.g. (kay2017kinetics; carreira2018short; carreira2019short; zhao2017hacs)), determining which videos are still active on the site can be very important and directly affects dataset quality (a minimal example of this filtering follows this list).

  • Data augmentation is the process of transforming data to fit model input specifications and increase data diversity. Data diversity helps prevent overfitting—when a model too closely matches training data and fails to generalize to unseen examples. Overfitting can occur when the model learns undesired, low-level biases rather than desired, high-level semantics.

  • Hand-crafted feature extraction is the process of transforming raw RGB video data into a specified feature space to provide insights that a model may not be able to independently learn. With video data, motion representations are the most common extracted features.
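As an illustration of the cleaning step mentioned above, the sketch below drops annotation entries whose linked video was not successfully downloaded. The file names, labels, and directory are hypothetical placeholders; only the Python standard library is used.

    import os

    # Hypothetical annotation records: (video file name, action class label).
    annotations = [("vid_001.mp4", "washing"), ("vid_002.mp4", "running")]
    video_dir = "videos"

    # Keep only records whose linked video exists locally and is non-empty.
    cleaned = [
        (name, label)
        for name, label in annotations
        if os.path.isfile(os.path.join(video_dir, name))
        and os.path.getsize(os.path.join(video_dir, name)) > 0
    ]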

4.1. Video Data

Video is composed of a series of still-image frames where each frame is made of rows and columns of pixels, the smallest elements of raster images. In standard 3-channel red-green-blue (RGB) video, each pixel is a 3-tuple with an intensity value from 0 to 255 for each of the three color channels. RGB-D video contributes a fourth channel that represents depth, often determined by a depth sensor such as the Microsoft Kinect.

As used throughout this article, a common abstraction to represent video is a 3-dimensional (3D) volume in which frames are densely stacked along a temporal dimension. However, with multi-channel pixels, this volume actually has four dimensions. The desired order of these dimensions can vary between software packages, with (frames, channels, height, width) known as channels first (NCHW) and (frames, height, width, channels) known as channels last (NHWC). This order can lead to performance improvements or degradation depending on the training environment (e.g. Theano and MXNet versus CNTK and TensorFlow).
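To make the layout difference concrete, the following minimal sketch (assuming NumPy and an illustrative 16-frame, 112x112 RGB clip) converts between channels-last and channels-first orderings:

    import numpy as np

    # A 16-frame RGB clip at 112x112 resolution in channels-last order
    # (frames, height, width, channels).
    clip_nhwc = np.zeros((16, 112, 112, 3), dtype=np.uint8)

    # Reorder to channels-first (frames, channels, height, width).
    clip_nchw = clip_nhwc.transpose(0, 3, 1, 2)

    print(clip_nhwc.shape)  # (16, 112, 112, 3)
    print(clip_nchw.shape)  # (16, 3, 112, 112)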

Figure 8. Common video augmentations. (Frames from the Moments in Time dataset (monfort2018moments), class washing)
\Description

6 frames from an original video are shown under 9 different augmentations (3 geometric, 3 photometric, and 3 chronometric). For resizing, the frames are all scaled down by a factor of about 25%. Cropping removes most of the frames except for a square center. Flipping shows the frames reversed along the x-axis. Color jittering shows the frames shifted bluer. Edge enhancement shows more defined edges of objects in the frames. Noise injection shows random pixel alterations on the frames. Trimming reduces it to only frames 3, 4, and 5. Sampling reduces it to only frames 1, 3, and 5. Looping expands the number of frames to 1-6, then 1-3.

4.2. Data Augmentation

Geometric Augmentation Methods

In the context of video, geometric augmentation methods are transformations that alter the geometry of frames (8628742). To be effective, these must be applied equally across all frames. If separate geometric transformations are applied to different frames, a video could quickly lose its semantic meaning. Common geometric augmentations include:

  • Resizing—the process of scaling a video’s frames from a given height and width to a new height and width via spatial up-sampling or down-sampling (imageresizing). Ratio jittering (10.1007/978-3-319-46484-8_2) is resizing that varies the aspect ratio for data diversification.

  • Cropping—the process of transforming a video’s frames from a given height and width to a new, smaller height and width via removing exterior rows or columns. Techniques include random cropping (NIPS2012_4824; chatfield2014return; imageaugmentationsurvey) and corner cropping (8454294).

  • Horizontal (left-right) flipping—the process of mirroring a video’s frames across the vertical axis (i.e. reversing the order of columns in each frame). Random horizontal flipping is a popular and computationally efficient method of introducing data diversity (NIPS2012_4824; NIPS2014_5353; carreira2017quo; 8454294).

Other geometric augmentation methods that are less popular for video include vertical flipping, shearing, piecewise affine transforming, and rotating. Shorten and Khoshgoftaar (2019) (imageaugmentationsurvey) present a survey on image augmentation which describes some of these alternative techniques, which could easily be applied to video, although some are more likely to change the semantic meaning of actions. For example, jumping is an action generally predicated on an actor moving upward; vertical flipping or a 180-degree rotation would change the apparent direction of motion, possibly leading the model to believe the action is falling.
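The following minimal sketch (assuming NumPy, with an illustrative crop size and clip layout) shows how a single randomly sampled crop and flip can be applied consistently to every frame of a clip, preserving its semantic meaning:

    import numpy as np

    def random_crop_and_flip(clip, crop_size=112):
        # clip has shape (frames, height, width, channels); sample the augmentation
        # parameters once so every frame receives the same transformation.
        t, h, w, c = clip.shape
        y = np.random.randint(0, h - crop_size + 1)
        x = np.random.randint(0, w - crop_size + 1)
        flip = np.random.rand() < 0.5

        out = clip[:, y:y + crop_size, x:x + crop_size, :]
        if flip:
            out = out[:, :, ::-1, :]  # mirror every frame across the vertical axis
        return out

    clip = np.random.randint(0, 256, (16, 128, 171, 3), dtype=np.uint8)
    print(random_crop_and_flip(clip).shape)  # (16, 112, 112, 3)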

Photometric Augmentation Methods

In the context of video, photometric augmentation methods are transformations that alter the color-space of the pixels making up each frame (8628742). Unlike geometric augmentation, these transformations can generally be applied on a per-frame basis and are overall less common in the action understanding literature. These include:

  • Color jittering—the process of transforming a video’s hue, saturation, contrast, or brightness. This can be done randomly (Wu_2015_CVPR; Han_2019_ICCV; NIPS2014_5353) or via specific adjustments (Razavian_2014_CVPR_Workshops; NIPS2012_4824).

  • Edge enhancement—the process of increasing the appearance of contours in a video’s frames. In some settings, this can speed up the learning process, since the first few layers of convolutional neural networks have been shown to learn to detect edges and gradients (NIPS2012_4824).

Other photometric augmentation methods that may be useful in future settings are superpixelization, random gray (Han_2019_ICCV), random erasing (imageaugmentationsurvey), and vignetting (Han_2019_ICCV). However, these are currently absent from the action understanding literature and uncommon in the image understanding literature.
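A minimal sketch of color jittering, assuming NumPy and illustrative jitter ranges, applying one random brightness shift and contrast scale to an entire clip (a per-frame variant would simply sample new values inside a loop over frames):

    import numpy as np

    def color_jitter(clip, max_brightness=32.0, max_contrast=0.2):
        # Sample one brightness shift and one contrast scale for the whole clip.
        brightness = np.random.uniform(-max_brightness, max_brightness)
        contrast = 1.0 + np.random.uniform(-max_contrast, max_contrast)
        out = clip.astype(np.float32) * contrast + brightness
        return np.clip(out, 0, 255).astype(np.uint8)

    clip = np.random.randint(0, 256, (16, 112, 112, 3), dtype=np.uint8)
    jittered = color_jitter(clip)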

Chronometric Augmentation Methods

Because the literature does not appear to have a term for transformations that affect the duration of the video input, we refer to these as chronometric augmentations, following the naming pattern of geometric and photometric. These transformations are generally used to fit a model’s input specifications rather than to increase data diversity; a minimal sketch of sampling and looping follows the list below.

  • Trimming—the process of altering the start and end of a video—essentially temporal cropping. This may be useful to remove the portion of the video that does not include the labeled action.

  • Sampling—the process of extracting frames from a video—essentially temporal resizing. This can be done from specific frame indices (Feichtenhofer_2016_CVPR; carreira2017quo) or randomly selected frame indices (NIPS2014_5353; 8454294).

  • Looping—the process of repeating a video’s frames to increase the duration—essentially temporal padding (carreira2017quo). This might be necessary when a video segment has fewer frames than the model’s input specifies.
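A minimal sketch of sampling and looping, assuming NumPy and an illustrative 16-frame input specification:

    import numpy as np

    def sample_or_loop(clip, num_frames=16):
        # clip has shape (frames, height, width, channels).
        t = clip.shape[0]
        if t < num_frames:
            # Looping: repeat the clip until it is long enough (temporal padding).
            reps = int(np.ceil(num_frames / t))
            clip = np.concatenate([clip] * reps, axis=0)
            t = clip.shape[0]
        # Sampling: pick num_frames evenly spaced frame indices (temporal resizing).
        idx = np.linspace(0, t - 1, num_frames).astype(int)
        return clip[idx]

    clip = np.random.randint(0, 256, (40, 112, 112, 3), dtype=np.uint8)
    print(sample_or_loop(clip).shape)  # (16, 112, 112, 3)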

4.3. Hand-Crafted Feature Extraction

Figure 9. (1) Original RGB, (2) dense optical flow (OF) computed using the Farneback method (10.1007/3-540-45103-X_50) and OpenCV packages (opencv_library) (color indicates direction), (3) RGB difference/derivative (dRGB), and (4) phase difference/derivative (dPhase) computed using the approach described in (Hommos_2018_ECCV_Workshops). (Video frames from the Moments in Time dataset (monfort2018moments), class washing)
\Description

6 frames in four different representations. The first row shows the original RGB. The second shows optical flow. The third shows RGB difference. The fourth shows phase difference.

While shallow learning has become less common since the deep learning revolution, several hand-crafted motion features have found their way into state-of-the-art deep learning models (NIPS2014_5353; Feichtenhofer_2016_CVPR; carreira2017quo; 8454294). These motion representations generally fall under two classical field theories: Lagrangian flow (ouellette2006quantitative) and Eulerian flow (10.1145/2185520.2185561).

Lagrangian Motion Representations

Lagrangian flow fields track individual parcel or particle motion. In the video context, this refers to tracking pixels by looking at nearby appearance information in adjacent frames to see if that pixel has moved. The most common Lagrangian motion representation is optical flow (OF) (gibson1950perception). Many methods exist for computing this feature: the Lucas–Kanade method (lucas1981iterative), the Horn–Schunck method (10.1117/12.965761), the TV-L1 approach (10.1007/978-3-540-74936-3_22), the Farneback method (10.1007/3-540-45103-X_50), and others (10.1145/212094.212141). It is also possible to warp OF to attempt to reduce background or camera motion (Wang_2013_ICCV). This technique (WarpFlow) requires computing the homography, a transformation between two planes, between frames. Optical flow has been noted for its usefulness in action understanding because it is invariant to appearance (10.1007/978-3-030-12939-2_20).
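As an example, dense optical flow between two consecutive frames can be computed with the Farneback method in OpenCV, as was done for Figure 9. The random placeholder frames and parameter values below are illustrative, not tuned settings:

    import cv2
    import numpy as np

    # Two consecutive frames; random placeholders stand in for real video frames.
    prev_frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
    next_frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)

    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

    # Dense Farneback optical flow: one (dx, dy) displacement per pixel.
    # Positional arguments: pyr_scale=0.5, levels=3, winsize=15,
    # iterations=3, poly_n=5, poly_sigma=1.2, flags=0.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    print(flow.shape)  # (240, 320, 2); direction can be color-coded for display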

Eulerian Motion Representations

Eulerian flow fields represent motion through a particular spatial location. In the video context, this refers to determining visual information differences at a particular spatial location across frames. Two Eulerian motion representations are RGB difference/derivative (dRGB) (8454294; 10.1007/978-3-319-46484-8_2; Hommos_2018_ECCV_Workshops) and phase difference/derivative (dPhase) (Hommos_2018_ECCV_Workshops). RGB difference is the difference between RGB pixel intensities at equivalent spatial locations in adjacent frames. To compute this, one frame is subtracted from another. Phase difference requires converting each frame into frequency domain before taking the difference and converting back to the time domain.
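RGB difference is particularly cheap to compute; a minimal sketch, assuming a NumPy clip of shape (frames, height, width, channels) filled here with placeholder values:

    import numpy as np

    # Placeholder clip with values in [0, 255].
    clip = np.random.randint(0, 256, (16, 112, 112, 3), dtype=np.uint8)

    # RGB difference: subtract adjacent frames at equivalent pixel locations.
    drgb = clip[1:].astype(np.int16) - clip[:-1].astype(np.int16)
    print(drgb.shape)  # (15, 112, 112, 3)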

5. Models

The past decade of action understanding research has seen a paradigm shift from primarily shallow, hand-crafted approaches to deep learning where multi-layer artificial neural networks are able to learn complex non-linear relations in structured data. In this section, we describe network building blocks that are common across the diversity of action understanding models and organize state-of-the-art models into architecture families.

5.1. Model Building Blocks

Convolutional Neural Networks

No deep learning architecture component has had a greater impact on action understanding (and computer vision at large) than convolutional neural networks (CNNs), also commonly referred to as ConvNets. A CNN is primarily composed of convolutional, pooling, normalization, and fully-connected layers. For further details, a multitude of tutorials exist on utilizing standard CNN layers (e.g. (lecun2013deep; le2015tutorial)). CNNs are useful in video understanding because the sharing of weights dramatically decreases the number of trainable parameters and therefore reduces computational cost compared to fully-connected networks. Generally, deeper models (i.e. those with more layers) outperform shallower models by increasing the receptive field—the portion of the input that contributes to the feature—of individual neurons in the network (NIPS2016_6203). However, deep models can suffer from problems like exploding or vanishing gradients (doi:10.1142/S0218488598000094).

Figure 10. An example of 2D and 3D convolutional layers and a max pooling layer on single-channel image and video inputs. Note that these filter kernels were chosen randomly and do not necessarily lead to good embedded features.
\Description

The left shows an example of 2D convolution with a 9x9 pixel image and two 3x3 kernels. The 7x7 convolutional outputs are then reduced via max pooling to 5x5 outputs. The right shows an example of 3D convolution with a 9x9x5 input and two 3x3x3 kernels. The 7x7x3 convolutional outputs are then reduced via max pooling to 5x5x1 outputs.

1-Dimensional CNNs (C1D), 2-Dimensional CNNs (C2D), and 3-Dimensional CNNs (C3D) are the backbone for many state-of-the-art models and use 1D, 2D, and 3D kernels, respectively. C1D is primarily applicable for convolutions along the time dimension of embedded features while C2D and C3D are primarily applicable for extracting feature vectors from individual frames or stacked frames. Single-channel examples of 2D and 3D convolutions are shown in Figure 10. Note that when using multi-channel inputs, the convolutional kernels must be expanded to include a depth dimension with the same number of channels as the input tensor, and the output is summed across channels. The CNN literature is vast, but we briefly note a few influential developments consistently employed throughout the action understanding literature:

  • Residual networks (ResNets) (He_2016_CVPR)—utilize skip connections to avoid vanishing gradients

  • Inception blocks (Szegedy_2015_CVPR; Szegedy_2016_CVPR)—utilize multi-size filters for computational efficiency

  • Dense connections (DenseNet) (Huang_2017_CVPR)—utilize skip connections between each layer and every subsequent layer for strengthening feature propagation

  • Inflated networks (carreira2017quo)—expand lower dimensional networks into a higher dimension in a way that benefits from lower dimensional pretrained weights (e.g. I3D)

  • Normalization (ioffe2015batch)—methods of suppressing the undesired effects of random initialization and random internal distribution shifts. These include batch normalization (BN) (ioffe2015batch), layer normalization (LN) (ba2016layer), instance normalization (IN) (ulyanov2016instance), and group normalization (GN) (Wu_2018_ECCV)

Recently, many hybrid CNNs have introduced new convolutional blocks, layers, and modules. Some focus on reducing the large computational costs of C3D: P3D (Qiu_2017_ICCV), R(2+1)D (Tran_2018_CVPR; Ghadiyaram_2019_CVPR), ARTNet (Wang_2018_CVPR), MFNet (Chen_2018_ECCV), GST (Luo_2019_ICCV), and CSN (Tran_2019_ICCV). Others focus on recognizing long-range temporal dependencies: LTC-CNN (7940083), NL (Wang_2018_CVPR), Timeception (Hussein_2019_CVPR), and STDA (LI2020107037). Some unique modules include TSM (Lin_2019_ICCV), which shifts individual channels along the temporal dimension for improved 2D CNN performance, and TrajectoryNet (NIPS2018_7489), which introduces a TDD-like (Wang_2015_CVPR) trajectory convolution to replace temporal convolutions.
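As one illustration of these cost-reducing blocks, the sketch below factorizes a full 3D convolution into a spatial convolution followed by a temporal convolution in the spirit of R(2+1)D; the channel widths and layer arrangement here are simplified assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    """Factorize a t x k x k 3D convolution into a 1 x k x k spatial
    convolution followed by a t x 1 x 1 temporal convolution,
    in the spirit of R(2+1)D (simplified; channel widths are arbitrary)."""

    def __init__(self, in_ch: int, out_ch: int, mid_ch: int = 64):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.relu(self.spatial(x)))

clip = torch.randn(2, 3, 16, 112, 112)        # (batch, channels, T, H, W)
print(SpatioTemporalConv(3, 64)(clip).shape)  # torch.Size([2, 64, 16, 112, 112])
```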

Recurrent Neural Networks

The second most common artificial neural network architecture employed in action understanding is the recurrent neural network (RNN). RNNs use a directed graph approach to process sequential inputs such as temporal data. This makes them valuable for action understanding because frames (or frame-based extracted vectors) can be fed as inputs. The most common type of RNN is the long short-term memory (LSTM) (doi:10.1162/neco.1997.9.8.1735). An LSTM cell uses an input/forget/output gate structure to perform long-range learning. The second most common type of RNN is the gated recurrent unit (GRU) (cho2014learning). A GRU cell uses a reset/update gate structure to perform less computationally intensive learning than LSTM cells. Several thorough tutorials cover RNN, LSTM, and GRU use and underlying principles (e.g. (chen2016gentle; grututorial; staudemeyer2019understanding; SHERSTINSKY2020132306)).
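The sketch below shows the common CNN+LSTM pattern in PyTorch: per-frame feature vectors (assumed here to come from a pretrained 2D CNN) are fed through an LSTM whose final hidden state is classified. The feature dimension, hidden size, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameFeatureLSTM(nn.Module):
    """Classify a video from a sequence of per-frame feature vectors
    (e.g. produced by a pretrained 2D CNN). Dimensions are illustrative."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, num_classes: int = 101):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim)
        _, (h_n, _) = self.lstm(feats)   # h_n: (1, batch, hidden)
        return self.fc(h_n[-1])          # class logits: (batch, num_classes)

feats = torch.randn(4, 16, 512)          # 4 videos, 16 frames each, 512-d features
print(FrameFeatureLSTM()(feats).shape)   # torch.Size([4, 101])
```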

Fusion

The processes of combining input features, embedded features, or output features are known as early fusion, middle fusion (or slow fusion), and late fusion (or ensemble), respectively (Karpathy_2014_CVPR; middle-fusion; doi:10.1002/widm.1249). The simplest and most naïve form is averaging. Recently, however, attention mechanisms, which allow a model to focus on the most relevant information and disregard the least relevant, have gained popularity.
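The following sketch contrasts late fusion by averaging with a simple learned softmax weighting over two streams; the weighting is only a toy stand-in for the more elaborate attention mechanisms used in the literature.

```python
import torch
import torch.nn as nn

rgb_logits = torch.randn(4, 101)     # per-class scores from an RGB stream
flow_logits = torch.randn(4, 101)    # per-class scores from a motion stream

# Late fusion by averaging.
avg_fused = (rgb_logits + flow_logits) / 2

# A simple learned weighting over streams: softmax weights sum to 1.
stream_weights = nn.Parameter(torch.zeros(2))   # would be learned during training
w = torch.softmax(stream_weights, dim=0)
attn_fused = w[0] * rgb_logits + w[1] * flow_logits
print(avg_fused.shape, attn_fused.shape)  # torch.Size([4, 101]) torch.Size([4, 101])
```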

5.2. State-of-the-Art Model Architectures

We focus here on grouping these methods into architecture families under each action problem and pointing to useful examples. Because of the rapidly evolving nature of the field, we recommend checking online scoreboards (see footnote 10) for up-to-date performances on benchmark datasets.

Action Recognition Models

Figure 11. Action Recognition Model Examples. RGB and Motion Single-Stream architectures train a 2D, 3D, or Hybrid CNN on one sampled feature. Two-stream architectures fuse RGB and Motion streams. Temporal Segmentation architectures divide a video into segments, process each segment on a single-stream or multi-stream architecture, and fuse outputs. Two-stage architectures use temporal segmentation to extract feature vectors and feed those into a convolutional or recurrent network.
\Description

C2D, C3D, and LSTM blocks are used to build up models. RGB Single-Stream examples include: (1) single-frame sample, (2) multi-frame channel-stacked sample, (3) multi-frame sample, 3D convolution, and (4) multi-frame sample, hybrid convolution. Two-stream examples include (1) original two-stream, (2) two-stream inflated 3D (I3D), and (3) hidden two-stream (MobileNet). Temporal segmentation examples include: (1) single-stream segment network and (2) two-stream segment network (TSN). Two-stage examples include (1) 3D-fused two-stream and (2) long-term recurrent convolutional network (LRCN).

As shown in Figure 11, we broadly group AR architectures into a few families of varying levels of complexity. The first is single-stream architectures which sample or extract one 2D (Karpathy_2014_CVPR; he2019stnet; Jiang_2019_ICCV) or 3D (10.1007/978-3-642-15567-3_11; 6165309; Tran_2015_ICCV; Hara_2018_CVPR) input feature from a video and feed that into a CNN. The output of the CNN is the model’s prediction. While surprisingly effective at some tasks (Karpathy_2014_CVPR; monfort2018moments), single-stream methods often lack the temporal resolution to adequately perform AR without the application of state-of-the-art hybrid modules discussed in Section 5.1.1.

The second family is two-stream architectures with one stream for RGB learning and one stream for motion feature learning (NIPS2014_5353; carreira2017quo). However, computing optical flow or other hand-crafted features is computationally expensive. Therefore, several recent models use a “hidden” motion stream where motion representations are learned rather than manually determined. These include MotionNet (10.1007/978-3-030-20893-6_23), which operates similarly to standard two-stream methods, and MARS (Crasto_2019_CVPR) and D3D (Stroud_2020_WACV), which perform middle fusion between the streams. Feichtenhofer et al. (2017) (Feichtenhofer_2017_CVPR) explore gating techniques between the streams. While these models are generally computationally constrained to two streams, more streams for additional modalities are possible (WANG201733; Wang_2018).

Built out of single-streams, two-streams, or multi-streams, the third family is temporal segmentation architectures which address the long-term dependencies of actions. Temporal Segment Network (TSN) methods (10.1007/978-3-319-46484-8_2; 8454294) divide an input video into segments, sample from those segments, and create a video-level prediction by averaging segment-level outputs. Model weights are shared between each segment stream. T-C3D (liu2018t), TRN (Zhou_2018_ECCV), ECO (Zolfaghari_2018_ECCV), and SlowFast (Feichtenhofer_2019_ICCV) build on temporal segmentation by performing multi-resolution segmentation and/or fusion.
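A minimal TSN-style consensus might look like the sketch below: the clip is split into segments along the time axis, each segment is scored by a shared backbone (a placeholder module here), and the segment scores are averaged into a video-level prediction.

```python
import torch
import torch.nn as nn

class SegmentConsensus(nn.Module):
    """Divide a clip into K segments, score each segment with a shared
    backbone, and average the segment-level predictions (TSN-style sketch).
    The backbone is a stand-in for any 2D/3D CNN."""

    def __init__(self, backbone: nn.Module, num_segments: int = 3):
        super().__init__()
        self.backbone = backbone
        self.num_segments = num_segments

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, T, H, W); split the time axis into K chunks.
        segments = torch.chunk(clip, self.num_segments, dim=2)
        scores = [self.backbone(seg) for seg in segments]  # shared weights
        return torch.stack(scores, dim=0).mean(dim=0)      # video-level prediction

# Stand-in backbone: global average pool + linear head over 3 input channels.
backbone = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 101))
clip = torch.randn(2, 3, 24, 112, 112)
print(SegmentConsensus(backbone)(clip).shape)  # torch.Size([2, 101])
```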

The fourth family, at the highest level of complexity in our AR methods taxonomy, is two-stage learning where the first stage uses temporal segmentation methods to extract segment embedded feature vectors and the second stage trains on those features. These include 3D-fusion (Feichtenhofer_2016_CVPR) and CNN+LSTM approaches (Donahue_2015_CVPR; Ng_2015_CVPR; 10.1145/2733373.2806222; Wang_2019; 8904245). Ma et al. (2019) conducted a side-by-side comparison of C3D and CNN+LSTM performance (MA201976).

Action Prediction Models

Figure 12. Action Prediction Model Examples. Generative models create representations of future timesteps for prediction (typically via an encoder-decoder scheme). Non-generative models are a broad category for those which create predictions directly from observed sections of the input.
\Description

C2D, C3D, LSTM, and Classification blocks are used to build up models. Generative model examples include Reinforced Encoder-Decoder (RED) and Rolling-Unrolling (RU-LSTM). Non-generative examples include Feedback Network (FN) and Multi-modal (MM-LSTM).

Rasouli (2020) (rasouli2020deep) noted that recurrent techniques dominate the approaches. We group these highly diverse action prediction models into generative or non-generative families. Generative architectures produce “future” features and then classify those predictions. This often takes the form of an encoder-decoder scheme. Examples include RED (gao2017red) which uses a reinforcement learning module to improve an encoder-decoder, IRL (Zeng_2017_ICCV) which uses a C2D inverse-reinforcement learning strategy to predict future frames, Conv3D (8794278) which uses a C3D to generate unseen features for prediction, RGN (Zhao_2019_ICCV) which uses a recursive generation and prediction scheme with a Kalman filter during training, and RU-LSTM (Furnari_2019_ICCV; 8803534) which uses a multi-modal rolling-unrolling encoder-decoder with modality attention.

Non-generative architectures are a broad grouping of all other approaches. These create predictions directly from observed features. Examples include F-RNN-EL (7487478) which uses an exponential loss to bias a multi-modal CNN+LSTM fusion strategy towards the most recent predictions, MS-LSTM (Aliakbarian_2017_ICCV) which uses two LSTM stages for action-aware and context-aware learning, MM-LSTM (10.1007/978-3-030-20887-5_28) which extends MS-LSTM to arbitrarily many modalities, FN (8354277) which uses a three-stage LSTM approach, and TP-LSTM (8803820) which uses a temporal pyramid learning structure.

Many of the examples in this section were developed for action anticipation (where no portion of the action has yet been observed), but they are also applicable to early action recognition (where a portion of the action has been observed). Additionally, the action recognition models described in Section 5.2.1 may be applicable to some early action recognition tasks if they can derive enough semantic meaning from the observed portion and the video context.

Temporal Action Proposal Models

Figure 13. Temporal Action Proposal Model Examples. Top-down models use a sliding window approach to create segment-level proposals. Bottom-up models use frame or short-segment level actionness score predictions with grouping strategies to produce proposals. Hybrid models use both top-down and bottom-up strategies in parallel.
\Description

C1D, C3D, LSTM, GRU, Segmentation & Boundary Adjustment, Non-maximal suppression, and a few other specific blocks build up the models. Top-Down (Sliding Window) examples shown are (1) Deep Action Proposals (DAP), (2) Segment-CNN (S-CNN), (3) Single-Stream Temporal Action Proposals (SST), and (4) Temporal Unit Regression Network (TURN). Bottom-Up (Grouping) examples shown are (1) Temporal Actionness Grouping (TAG), (2) Boundary-Sensitive Network (BSN), (3) Boundary-Matching Network (BMN), and (4) RecapNet. A hybrid example is also shown which fuses top-down and bottom-up streams.

As shown in Figure 13, TAP approaches can be grouped into three families. The first family is top-down architectures which consists of models that use sliding windows to derive segment-level proposals. Examples include DAP (10.1007/978-3-319-46487-9_47) and SST (Buch_2017_CVPR) which use CNN feature extractors and recurrent networks, S-CNN (Shou_2016_CVPR) which uses multi-scale sliding windows, and TURN TAP (Gao_2017_ICCV) which uses a multi-scale pooling strategy.

The second family is bottom-up architectures, which use two-stream frame-level or short-segment-level extracted features to derive “actionness” confidence predictions. Various grouping strategies are then applied to these dense predictions to create full proposals. Examples include TAG (Zhao_2017_ICCV), which uses a flooding algorithm to convert these into multi-scale groupings; BSN (Lin_2018_ECCV) and BMN (Lin_2019_ICCV), which use additional “startness” and “endness” features for different proposal generation and proposal evaluation techniques; and RecapNet (8972408), which uses a residual causal network rather than a generic 1D CNN to compute confidence predictions. R-C3D (Xu_2017_ICCV) and TAL-Net (Chao_2018_CVPR) use region-based methods to adapt 2D object proposals in images to 1D action proposals in videos. Many bottom-up architectures require non-maximal suppression (NMS) of outputs to suppress redundant proposals.
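As a toy illustration of bottom-up grouping, the sketch below thresholds a per-frame actionness sequence and merges contiguous above-threshold frames into proposals; published schemes such as TAG's flooding algorithm use multiple thresholds and tolerances, so this is only a simplified stand-in.

```python
import numpy as np

def group_actionness(scores: np.ndarray, threshold: float = 0.5):
    """Group contiguous frames whose actionness exceeds a threshold into
    (start, end, confidence) proposals. Confidence here is simply the mean
    actionness inside the grouped segment."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            proposals.append((start, t, float(scores[start:t].mean())))
            start = None
    if start is not None:
        proposals.append((start, len(scores), float(scores[start:].mean())))
    return proposals

actionness = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9, 0.3])
print(group_actionness(actionness))  # roughly [(2, 5, 0.8), (7, 9, 0.75)]
```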

The third family is hybrid architectures which combine top-down and bottom-up approaches. These generally create segment proposals and actionness scores in parallel and then use actionness to refine the proposals. Examples include CDC (Shou_2017_CVPR), CTAP (Gao_2018_ECCV), MGG (Liu_2019_CVPR), and DPP (10.1007/978-3-030-36718-3_40).

Temporal Action Localization/Detection Models

Figure 14. Temporal Action Localization/Detection Model Examples. One-stage architectures conduct proposal and classification together while two-stage architectures create proposals and then use an action recognition model to classify each proposal.
\Description

C1D, C2D, C3D, GRU, 1D deconvolution, Gaussian kernel, non-maximal suppression (NMS), and a few specific blocks build up the models. One-stage examples shown are (1) Single-Shot Action Detection (SSAD), (2) Decouple-SSAD, (3) Single-Stream Temporal Action Detection (SS-TAD), and (4) Gaussian Temporal Awareness Network (GTAN). A two-stage example shows a proposal stage and a recognition stage in series.

There are two main families of TAL/D methods as shown in Figure 14. This taxonomy was introduced by Xia et al. (2020) (9062498). The first family is two-stage architectures in which the first stage creates proposals and the second stage classifies them. Therefore, to create a two-stage architecture, you can pair any of the TAP models described in Section 5.2.3 with an AR model described in Section 5.2.1. It is worth noting that almost all papers that explore TAP methods also extend their work to TAL/D.

The second family is one-stage architectures in which proposal and classification happen together. Examples include SSAD (10.1145/3123266.3123343) which creates a snippet-level action score sequence from which a 1D CNN extracts multi-scale detections, SS-TAD (BMVC2017_93) in which parallel recurrent memory cells create proposals and classifications, Decouple-SSAD (8784822) which builds on SSAD with a three-stream decoupled-anchor network, GTAN (Long_2019_CVPR) which uses multi-scale Gaussian kernels, Two-stream SSD (9108686) which fuses RGB detections with OF detections, and RBC (9053319) which completes boundary refinement prior to classification.

Spatiotemporal Action Localization/Detection Models

Figure 15. Spatiotemporal Action Localization/Detection Model Examples. Frame-level (region) proposal models link frame-level detections together while segment-level (tube) proposal models create small “tubelets” for short segments and link the tubelets into longer tubes.
\Description

C1D, C2D, C3D, LSTM, Regression, Classification, R-CNN, Non-maximal Suppression (NMS), and a few specific other blocks build up these models. Frame-Level (Region) examples shown are (1) AVA Two-Stream Inflated 3D Detection (I3D) and (2) Recurrent Tubelet Proposal and Recognition (RTPR). Segment-level (Tube) examples shown are (1) Tube Convolutional Neural Network (T-CNN), (2) Action Tubelet Detector (ACT-Detector), and (3) Spatiotemporal Progressive Learning (STEP).

As shown in Figure 15, there are two main families of state-of-the-art SAL/D methods. The first is frame-level (region) proposal architectures, which use various region proposal algorithms (e.g. R-CNN (Girshick_2014_CVPR), Fast R-CNN (Girshick_2015_ICCV), Faster R-CNN (NIPS2015_5638), early+late fusion Faster R-CNN (YE2019515)) to derive bounding boxes from individual frames and then apply a frame linking algorithm. Examples include MR-TS (10.1007/978-3-319-46493-0_45), CPLA (yang2017spatiotemporal), ROAD (Singh_2017_ICCV), AVA I3D (Gu_2018_CVPR), RTPR (Li_2018_ECCV), and PntMatch (YE2019515).
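A toy frame linking strategy is sketched below: per-frame detections are greedily appended to the tube of the same class whose most recent box overlaps them most. Published linkers typically solve this assignment with dynamic programming over detection scores and overlaps, so this is only illustrative, and the (x, y, w, h) box format is an assumption of the sketch.

```python
def link_detections(frames):
    """Greedily link per-frame detections into class-consistent tubes.

    frames: list over time of lists of (box, class_id, score),
            with each box given as (x, y, w, h)."""
    def siou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        return inter / (a[2] * a[3] + b[2] * b[3] - inter)

    tubes = []  # each tube: dict with 'class', 'boxes', 'scores'
    for dets in frames:
        for box, cls, score in dets:
            candidates = [t for t in tubes if t["class"] == cls]
            best = max(candidates, key=lambda t: siou(t["boxes"][-1], box), default=None)
            if best is not None and siou(best["boxes"][-1], box) >= 0.5:
                best["boxes"].append(box)
                best["scores"].append(score)
            else:
                tubes.append({"class": cls, "boxes": [box], "scores": [score]})
    return tubes

frames = [[((10, 10, 40, 40), 3, 0.9)],
          [((12, 11, 40, 40), 3, 0.8)],
          [((15, 12, 40, 40), 3, 0.85)]]
print(len(link_detections(frames)))  # 1 tube spanning three frames
```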

The second family is segment-level (tube) proposal architectures, which use various methods to create temporally short segment-level tubes or “tubelets” and then apply a tube linking algorithm. Examples of these models include T-CNN (Hou_2017_ICCV), ACT-detector (Kalogeiton_2017_ICCV), and STEP (Yang_2019_CVPR).

A few state-of-the-art models do not fit nicely in either of these families but are worth noting. Zhang et al. (2019) (Zhang_2019_CVPR) use a tracking network and graph convolutional network to derive person-object detections. VATX (Girdhar_2019_CVPR) augments I3D approaches with a multi-head, multi-layer transformer. STAGE (tomei2019stage) introduces a temporal graph attention method.

6. Metrics

Choosing the right metric is critical to evaluating a model properly. In this section, we define commonly used metrics and point to examples of their usage. We will not cover binary classification metrics because the action datasets we have cataloged overwhelmingly have more than two classes. Note that any time we refer to an accuracy value, the corresponding error value can easily be computed as $1 - \text{accuracy}$. To clarify terms, we use the following notation across the metrics:

  • $X = \{x_1, x_2, \ldots, x_{|X|}\}$: the set of input videos

  • $Y = \{y_1, y_2, \ldots, y_{|X|}\}$: the set of ground truth annotations for the input videos

  • $f$: a function (a.k.a. model) mapping input videos to prediction annotations

  • $\hat{Y} = f(X) = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{|X|}\}$: the set of model outputs

  • $C$: the set of action classes

  • $\mathbb{1}_{TP}(n)$: a function mapping rank $n$ in a list to 1 if the item at that rank is a true positive, 0 otherwise

Figure 16. Illustration of types of intersection over union: spatial, temporal, and spatiotemporal. IoU is also known as the Jaccard index or the Jaccard similarity coefficient.
\Description

The left figure depicts two partially overlapping red and blue rectangles. The middle figure shows two overlapping red and blue lines. The right figure shows two overlapping red and blue rectangular prisms. In each case, the overlapping section is purple. Below the figures is a description of intersection over union as the size of the purple section divided by the sum of the sizes of the red and blue sections minus the size of the purple section.

Several of these metrics also use forms of intersection over union (IoU), a measure of similarity of two regions. Figure 16 depicts spatial IoU, temporal IoU, and spatiotemporal IoU.
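The helper functions below (assuming temporal segments are given as (start, end) pairs and boxes as (x, y, w, h) with the origin at the top-left corner) compute the temporal and spatial IoU quantities illustrated in Figure 16; spatiotemporal IoU extends the same idea to linked boxes over time.

```python
def temporal_iou(gt, pred):
    """tIoU between two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

def spatial_iou(gt, pred):
    """sIoU between two boxes given as (x, y, w, h), (x, y) at the top-left corner."""
    x1, y1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    x2 = min(gt[0] + gt[2], pred[0] + pred[2])
    y2 = min(gt[1] + gt[3], pred[1] + pred[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = gt[2] * gt[3] + pred[2] * pred[3] - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))          # 2 / 6 = 0.333...
print(spatial_iou((0, 0, 10, 10), (5, 5, 10, 10)))   # 25 / 175 = 0.1428...
```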

6.1. Multi-class Classification Metrics

In action understanding, multi-class classification consists of problems where the model returns per-class confidence scores for each input video. This is done primarily with a softmax output layer, in which the confidence scores across classes for a given input sum to 1. Formally, for each input video $x_i \in X$:

  • $y_i \in C$: the ground truth annotation for input $x_i$ is a single action class label

  • $\hat{y}_i = \{s_{i,c} \mid c \in C\}$ where $s_{i,c}$ is the probability that video $x_i$ depicts action $c$

  • $\sum_{c \in C} s_{i,c} = 1$ if the model uses softmax output (as is common)

We define two common metrics below. Other metrics that we will not cover include F1-score (micro-averaged and macro-averaged), Cohen’s Kappa, PR-AUC, ROC-AUC, partial AUC (pAUC), and two-way pAUC. Sokolova and Lapalme (SOKOLOVA2009427) and Tharwat (THARWAT2018) present thorough evaluations of these and other multi-class classification metrics.

Top-$k$ Categorical Accuracy ($a_k$)

This metric measures the proportion of times when the ground truth label can be found in the top $k$ predicted classes for that input. Top-$1$ accuracy, sometimes simply referred to as accuracy, is the most ubiquitous, while Top-$3$ and Top-$5$ are other standard choices (Heilbron_2015_CVPR; ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019; activitynetchallenge2020). In some cases, several Top-$k$ accuracies or errors are averaged. To calculate Top-$k$ accuracy, let $\hat{y}_i^k \subseteq C$ be the subset of classes with the $k$ highest confidence scores for video $x_i$. The Top-$k$ accuracy over the entire input set, where $\mathbb{1}[\cdot]$ is a 0-1 indicator function, is:

(1)   $a_k = \frac{1}{|X|} \sum_{i=1}^{|X|} \mathbb{1}\left[ y_i \in \hat{y}_i^k \right]$
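A direct implementation of Equation 1 might look like the following, assuming a confidence matrix of shape (videos, classes) and integer ground truth labels.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """scores: (num_videos, num_classes) confidence matrix.
    labels: (num_videos,) integer ground truth class indices."""
    top_k = np.argsort(-scores, axis=1)[:, :k]        # k highest-scoring classes
    hits = (top_k == labels[:, None]).any(axis=1)     # ground truth among top k?
    return float(hits.mean())

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.1, 0.4],
                   [0.2, 0.2, 0.6]])
labels = np.array([1, 2, 2])
print(top_k_accuracy(scores, labels, k=1))  # 2 of 3 correct -> 0.666...
print(top_k_accuracy(scores, labels, k=2))  # 1.0
```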

Mean Average Precision (mAP)

This metric is the arithmetic mean of the interpolated average precision ($AP_c$) of each class, and it has been used in multiple THUMOS and ActivityNet challenges (THUMOS13; THUMOS14; THUMOS15; Heilbron_2015_CVPR; ghanem2017activitynet). To calculate interpolated $AP_c$ for a particular class, the model outputs must be ranked in decreasing confidence of that class. Formally, for each class $c \in C$, $L_c$ is a ranked list of outputs such that $s_{(1),c} \geq s_{(2),c} \geq \cdots \geq s_{(|X|),c}$. The prediction at rank $n$ in list $L_c$ is a true positive if that video’s ground truth label is class $c$ (i.e. if $y_{(n)} = c$). Using these lists $L_{c_1}, \ldots, L_{c_{|C|}}$, precision up to rank $n$ in a given list, $AP_c$ interpolated over all ranks with unique recall values for a given class, and $mAP$ are calculated as:

(2)   $\mathrm{Prec}_c(n) = \frac{1}{n} \sum_{m=1}^{n} \mathbb{1}_{TP}(m)$
(3)   $AP_c = \frac{\sum_{n=1}^{|X|} \left( \max_{m \geq n} \mathrm{Prec}_c(m) \right) \mathbb{1}_{TP}(n)}{\sum_{n=1}^{|X|} \mathbb{1}_{TP}(n)}$
(4)   $mAP = \frac{1}{|C|} \sum_{c \in C} AP_c$
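One possible implementation of Equations 2-4, under the reconstruction above, is sketched below; exact challenge evaluation code may differ in interpolation details.

```python
import numpy as np

def interpolated_ap(scores: np.ndarray, is_positive: np.ndarray) -> float:
    """AP for one class: rank by decreasing score, compute precision at each
    rank, interpolate (max precision at or after each rank), and average the
    interpolated precision over the true-positive ranks (cf. Equations 2-3)."""
    order = np.argsort(-scores)
    tp = is_positive[order].astype(np.float64)
    if tp.sum() == 0:
        return 0.0
    precision = np.cumsum(tp) / np.arange(1, len(tp) + 1)    # Prec(n)
    interp = np.maximum.accumulate(precision[::-1])[::-1]    # max over m >= n
    return float((interp * tp).sum() / tp.sum())

def mean_average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: (num_videos, num_classes); labels: (num_videos,) class indices."""
    aps = [interpolated_ap(scores[:, c], labels == c) for c in range(scores.shape[1])]
    return float(np.mean(aps))

scores = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 1, 0])
print(round(mean_average_precision(scores, labels), 3))  # 0.667
```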

6.2. Multi-label Classification Metrics

In the context of action understanding, multi-label classification consists of AR or AP problems in which the dataset has more than two classes and each video can be annotated with multiple action class labels. As in multi-class classification, the model returns per-class confidence scores for each input. However, in multi-label problems, it is common for the outputs to be calculated through per-class sigmoids rather than a softmax; as a result, the confidence scores do not sum to 1. Formally, for each input video $x_i \in X$:

  • $y_i \subseteq C$: the ground truth annotation for input $x_i$ is a set of action classes

  • $\hat{y}_i = \{s_{i,c} \mid c \in C\}$ where $s_{i,c}$ is the probability that video $x_i$ depicts action $c$

We define two common metrics below. For more information on other metrics such as exact match ratio and Hamming loss, we recommend Tsoumakas and Katakis (2007) (tsoumakas2007multi) and Wu and Zhou (2017) (pmlr-v70-wu17a) which present surveys of multi-label classification metrics.

Mean Average Precision (mAP)

This is the same metric as described in Section 6.1.2, and it is calculated very similarly for multi-label problems. The difference occurs when determining the true positives in each class list. Here, a prediction at rank $n$ in list $L_c$ is a true positive if class $c$ is one of the video’s ground truth labels (i.e. if $c \in y_{(n)}$). From there, precision up to rank $n$, $AP_c$ interpolated for a particular class, and $mAP$ are calculated as shown in Equations 2, 3, and 4. This metric is used for MultiTHUMOS (multithumos), ActivityNet 1.3 (Heilbron_2015_CVPR) when applied as an untrimmed AR problem, and Multi-Moments in Time (monfort2019multimoments). One possible variant of multi-label $mAP$ involves only computing $AP_c$ for each class up to a specified rank $n$. Another variant involves only counting predictions as true positives if the confidence score is above a specific threshold.

Hit@k

This metric indicates the proportion of times when any of the ground truth labels for an input can be found in the top $k$ predicted classes for that input. Once again, $k = 1$ and $k = 5$ are standard choices (Karpathy_2014_CVPR). Formally, let $\hat{y}_i^k \subseteq C$ be the subset of classes with the $k$ highest confidence scores for video $x_i$. A “hit” occurs if the intersection of the ground truth set of labels and the set of top-$k$ predictions is non-empty:

(5)   $\mathrm{Hit}@k = \frac{1}{|X|} \sum_{i=1}^{|X|} \mathbb{1}\left[ y_i \cap \hat{y}_i^k \neq \emptyset \right]$
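A small sketch of Equation 5, assuming each video's ground truth is given as a set of class indices:

```python
import numpy as np

def hit_at_k(scores: np.ndarray, label_sets, k: int = 5) -> float:
    """Multi-label Hit@k: a video counts as a hit if any of its ground truth
    labels appears among its k highest-scoring classes (cf. Equation 5)."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [len(set(row) & labels) > 0 for row, labels in zip(top_k, label_sets)]
    return float(np.mean(hits))

scores = np.array([[0.2, 0.5, 0.3],
                   [0.6, 0.1, 0.3]])
label_sets = [{0, 1}, {1}]
print(hit_at_k(scores, label_sets, k=1))  # first video hits, second misses -> 0.5
```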

6.3. Temporal Proposal Metrics

Metrics for TAP are less varied than those for classification. Below, we define the two main ones found in the literature. Here, the model returns proposed temporal regions (start and end markers for each) and a confidence score for each proposal. Formally, for each input video $x_i \in X$:

  • $y_i = \{g_{i,1}, \ldots, g_{i,M_i}\}$: the ground truth annotation set of temporal segments, where each segment $g_{i,m} = (t_{start}, t_{end})$ consists of start and end markers for input video $x_i$

  • $\hat{y}_i = \{(p_{i,1}, s_{i,1}), \ldots\}$ where $s_{i,j}$ is the probability (confidence) that proposal segment $p_{i,j}$ matches a ground truth segment for input $x_i$

  • $\mathrm{tIoU}(g, p)$: the temporal intersection over union between a ground truth segment $g$ and a proposal $p$

Intuitively, a model that produces more proposals will have a better chance of covering all of the ground truth segments. Therefore, TAP metrics include the average number of proposals ($AN$), a hyperparameter that can be manually tuned. $AN$ is defined as the total number of proposals divided by the total number of input videos. Formally,

(6)   $AN = \frac{1}{|X|} \sum_{i=1}^{|X|} |\hat{y}_i|$

Average Recall at Average Number of Proposals (AR@AN)

Recall is a measure of the sensitivity of the prediction model. In this context, a ground truth temporal segment $g$ is counted as a true positive if there exists a proposal segment $p$ that has a tIoU with it greater than or equal to a given threshold $t$ (i.e. if $\mathrm{tIoU}(g, p) \geq t$). Recall is the proportion of all ground truth temporal segments for which there is a true positive prediction. Average recall is the mean of all recall values over thresholds from $t_{min}$ to $t_{max}$ (inclusive) with a step size of $\Delta t$. In the ActivityNet challenges, $[t_{min}, t_{max}] = [0.5, 0.95]$ and $\Delta t = 0.05$ (ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019). Formally, recall at a particular threshold $t$ and $AN$, and average recall at $AN$, are calculated as:

(7)   $\mathrm{Recall}@AN(t) = \frac{\sum_{i=1}^{|X|} \sum_{g \in y_i} \mathbb{1}\left[ \max_{(p, s) \in \hat{y}_i} \mathrm{tIoU}(g, p) \geq t \right]}{\sum_{i=1}^{|X|} |y_i|}$
(8)   $AR@AN = \frac{1}{|T|} \sum_{t \in T} \mathrm{Recall}@AN(t), \quad T = \{t_{min}, t_{min} + \Delta t, \ldots, t_{max}\}$
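The sketch below computes average recall for a single video's proposals over the standard tIoU thresholds; a full AR@AN evaluation would first truncate each video's proposal list so that the average number of proposals equals the chosen AN.

```python
import numpy as np

def tiou(gt, p):
    inter = max(0.0, min(gt[1], p[1]) - max(gt[0], p[0]))
    return inter / ((gt[1] - gt[0]) + (p[1] - p[0]) - inter)

def average_recall(gt_segments, proposals, thresholds=np.linspace(0.5, 0.95, 10)):
    """AR over tIoU thresholds for one video: a ground truth segment is
    recalled at threshold t if some proposal overlaps it with tIoU >= t."""
    recalls = []
    for t in thresholds:
        hits = sum(any(tiou(g, p) >= t for p in proposals) for g in gt_segments)
        recalls.append(hits / len(gt_segments))
    return float(np.mean(recalls))

gt = [(2.0, 6.0), (10.0, 14.0)]
props = [(2.5, 6.5), (9.0, 12.0), (20.0, 22.0)]
print(round(average_recall(gt, props), 3))  # 0.3
```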

Area Under the AR-AN Curve (AUC)

Another metric for TAP is the area under the curve when $AR@AN$ is plotted for various values of $AN$. Commonly, this is for $AN$ values of 1 to 100 with a step size of 1 (ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019). Note that at an $AN$ of 0, where no proposals are given, $AR@AN$ is trivially 0. Using $AR@AN$ from Equation 8, the AR-AN AUC is calculated as:

(9)   $\mathrm{AUC} = \frac{1}{100} \sum_{AN=1}^{100} AR@AN$

6.4. Temporal Localization/Detection Metrics

Like temporal proposal, there are two main metrics for TAL/D, and both are used across many challenges (THUMOS14; THUMOS15; Heilbron_2015_CVPR; ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019; activitynetchallenge2020). Here, the model returns proposed temporal regions (start and end markers for each), a class prediction for each proposal, and a confidence score for each proposal. Formally, for each input video $x_i \in X$:

  • $y_i = \{(g_{i,1}, c_{i,1}), \ldots\}$: the ground truth annotation set of (temporal segment, class label) pairs for input $x_i$, where each segment $g_{i,m} = (t_{start}, t_{end})$ consists of start and end markers and $c_{i,m} \in C$

  • $\hat{y}_i = \{(p_{i,1}, \hat{c}_{i,1}, s_{i,1}), \ldots\}$ where $s_{i,j}$ is the probability (confidence) that proposal segment $p_{i,j}$ matches a ground truth segment labeled with class $\hat{c}_{i,j}$ for input $x_i$

  • $\mathrm{tIoU}(g, p)$: the temporal IoU between a ground truth segment $g$ and a proposal $p$

Mean Average Precision at tIoU Threshold (mAP tIoU@t)

This metric is the arithmetic mean of the interpolated average precision ($AP_c$) over all classes at a given tIoU threshold. First, all proposals for a given class are ranked in decreasing confidence. The difference from the standard $AP_c$ described in Section 6.1.2 occurs when determining true positives. In this case, a proposal segment $p$ at rank $n$ in list $L_c$ is counted as a true positive if there exists a ground truth segment $g$ that has a tIoU with it greater than or equal to a given threshold $t$, the predicted class label $\hat{c}$ matches the ground truth class label $c$, and that ground truth segment has not already been detected by another proposal higher in the ranked list (i.e. if $\mathrm{tIoU}(g, p) \geq t$ and $\hat{c} = c$). This way, no redundant detections are allowed. Precision up to rank $n$, $AP_c$ interpolated for a particular class, and $mAP$ are calculated using Equations 2, 3, and 4. Note that in this case, $|X|$ in Equation 3 must be replaced with the number of prediction tuples for the class $c$.

Average Mean Average Precision (average mAP)

The most common TAL/D metric is the arithmetic mean of $mAP$ over multiple different tIoU thresholds from $t_{min}$ to $t_{max}$ with a given step size $\Delta t$. Commonly, $[t_{min}, t_{max}] = [0.5, 0.95]$ (inclusive) and $\Delta t = 0.05$ (Heilbron_2015_CVPR; ghanem2017activitynet; ghanem2018activitynet; activitynetchallenge2019; activitynetchallenge2020). Therefore, average $mAP$ is computed from the per-threshold values $mAP@t$ as:

(10)   $\text{average } mAP = \frac{1}{|T|} \sum_{t \in T} mAP@t, \quad T = \{t_{min}, t_{min} + \Delta t, \ldots, t_{max}\}$

6.5. Spatiotemporal Localization/Detection Metrics

SAL/D involves locating actions in both time and space as well as classifying the located actions. Here, the model generally returns frame-level proposed spatial regions (bounding boxes), a class prediction for each box, and a confidence score. Formally, for each input video $x_i \in X$:

  • $y_i = \{(n_f, b, c), \ldots\}$: the ground truth annotation set of tuples for input $x_i$, where $n_f$ is the frame number counting up from 1, $b = (x, y, h, w)$ is a bounding box with $(x, y)$ marking the upper left corner, $h$ the box’s height, and $w$ the box’s width, and $c \in C$

  • $\hat{y}_i = \{(\hat{n}_f, \hat{b}, \hat{c}, s), \ldots\}$ where $s$ is the confidence that bounding box $\hat{b}$ at frame $\hat{n}_f$ matches a ground truth bounding box on the same frame labeled with class $\hat{c}$

  • $\tau$: a spatiotemporal tube in video $x_i$ is a linked set of bounding boxes with the same class label ($c_1 = c_2 = \cdots$) in adjacent frames ($n_f, n_f + 1, n_f + 2, \ldots$)

  • $\mathrm{sIoU}(b, \hat{b})$: the spatial IoU between a ground truth bounding box $b$ and a proposed bounding box $\hat{b}$ (note: this requires $n_f = \hat{n}_f$)

  • $\mathrm{stIoU}(\tau, \hat{\tau})$: the spatiotemporal IoU between a ground truth tube $\tau$ and a proposed tube $\hat{\tau}$

Frame-Level Mean Average Precision (frame-mAP)

This metric is useful because it evaluates the model independent of the linking strategy—the process of developing action instance tubes. It is utilized in several ActivityNet challenges (ghanem2018activitynet; activitynetchallenge2019; activitynetchallenge2020). Like several metrics above, this is the mean of the interpolated $AP_c$ over all classes. For a given class, every prediction tuple is ranked in decreasing confidence. Here, a proposed box $\hat{b}$ at rank $n$ in list $L_c$ is counted as a true positive if there exists a ground truth box $b$ on the same frame with the same class label that has an sIoU with it greater than or equal to a given threshold $t$ and that has not already been detected by another proposed box higher in the ranked list (i.e. if $\mathrm{sIoU}(b, \hat{b}) \geq t$, $\hat{c} = c$, and $\hat{n}_f = n_f$). No redundant detections are allowed. Precision up to rank $n$, $AP_c$ interpolated for a particular class, and $mAP$ are calculated using Equations 2, 3, and 4. Note that $|X|$ in Equation 3 must be replaced with the number of prediction tuples for the class $c$ (i.e. the length of ranked list $L_c$).

Video-Level Mean Average Precision (video-mAP)

This metric is useful for evaluating the linking strategy applied to connect bounding boxes of the same class label in adjacent frames. When using frame-$mAP$, longer actions take up more frames and therefore carry more weight when calculating $AP_c$ and $mAP$. With video-$mAP$, however, each action instance is weighted equally regardless of the temporal duration of the occurrence. This video-$mAP$ metric has been employed for use with both the AVA and J-HMDB-21 datasets (Gu_2018_CVPR; Jhuang_2013_ICCV). Once bounding boxes of the same class label in adjacent frames are linked into tubes, every prediction tube of that class is ranked in decreasing confidence. Here, a proposed tube $\hat{\tau}$ at rank $n$ in list $L_c$ is counted as a true positive if there exists a ground truth tube $\tau$ with the same class label that has an stIoU with it greater than or equal to a given threshold $t$ and that has not already been detected by another proposed tube higher in the ranked list (i.e. if $\mathrm{stIoU}(\tau, \hat{\tau}) \geq t$ and $\hat{c} = c$). No redundant detections are allowed. Precision up to rank $n$, $AP_c$ interpolated for a particular class, and $mAP$ are calculated using Equations 2, 3, and 4. Note that in this case, $|X|$ in Equation 3 must be replaced with the number of prediction tubes for the class $c$.

7. Conclusion

In this tutorial, we presented the suite of problems encapsulated within action understanding, listed datasets useful as benchmarks and pretraining sources, described data preparation steps and strategies, organized deep learning model building blocks and state-of-the-art model families, and defined common metrics for assessing models. We hope that this tutorial has clarified terminology, expanded your understanding of these problems, and inspired you to pursue research in this rapidly evolving field at the intersection of computer vision and deep learning. This article has also demonstrated the similarities and differences between these action understanding problem spaces via common datasets, model building blocks, and metrics. To that end, we also hope that this can facilitate idea cross-pollination between the somewhat stove-piped action problem sub-fields.

Acknowledgements.
We would like to thank the following individuals who have provided feedback on this article: Jeremy Kepner, Andrew Kirby, Alex Knapp, Alison Louthain, and Albert Reuther. Research was sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Appendix A Additional Tables

Action Actors Annotations
Video Dataset Year Cited Classes Instances H N C T S Theme/Purpose
KTH (1334462) 2004 3,853 6 2,391 B/W, static background
CAVIAR (caviar) 2004 49 9 28 surveillance
Weizmann (1544882) 2005 1,890 10 90 human motions
ViSOR (4607676; 4730416) 2005 47 n/a n/a surveillance
ETISEO (4118811; 4425357) 2005 183 15 n/a human motions
IXMAS (WEINLAND2006249) 2006 977 13 390 B/W, partial occlusion
UCF Aerial (ucfaerial) 2007 n/a 9 n/a aerial-view
CASIA Action (4270503; 4270101) 2007 242 15 1,446 multi-view, outdoors
Coffee & Cigarettes (4409105) 2007 491 2 246 movies and TV
UIUC Action (10.1007/978-3-540-88682-2_42) 2008 378 14 532 action repetition
UCF Sports (4587727) 2008 1,269 10 150 sports
UCF ARG (ucfarg) 2008 n/a 10 480 multi-view, aerial-view
Hollywood (HOHA) (4587756) 2008 3,727 8 n/a movies
Cambridge-Gesture (4547427) 2008 298 9 900 gestures
BEHAVE (blunsden2010behave) 2009 134 10 163 human-human interaction
URADL (5459154) 2009 574 10 150 daily activities
UCF11 (5206744) 2009 1,183 11 3,040 web videos
MSR-I (5719621) 2009 181 3 n/a activities
i3DPost MuHAVi (5430066) 2009 179 12 1,000 multi-view, studio
Hollywood2 (5206557) 2009 1,312 12 3,669 movies
Collective Activity (5457461) 2009 201 5 44 group activities
LabelMe (5459289) 2009 203 70 300 actor-object interactions
Keck Gesture (5459184) 2009 349 14 126 gestures
DLSPB (5459279) 2009 307 2 89 movies and TV
Hollywood-Localization (10.1007/978-3-642-35749-7_17) 2010 159 2 408 movies
VideoWeb (Denina2011) 2010 55 51 368 multi-view, tasks
UT-Tower (5399231; UT-Tower-Data) 2010 61 9 648 aerial-view, human motions
UT-Interaction (5459361; UT-Interaction-Data) 2010 593 6 60 human-human interaction
UCF50 (ucf50) 2010 537 50 5,000 web videos, expand UCF11
TV-Human Interaction (patron2010high) 2010 176 4 300 TV, human-human interaction
Olympic Sports (10.1007/978-3-540-88690-7_55) 2010 745 16 800 sports
MSR-II (cao2010cross) 2010 232 3 n/a activities
MSR-Action3D (5543273) 2010 1,285 20 4,020 RGB-D, gestures and motions
CMU MoCap (cmu-mocap) 2010 n/a n/a 2,605 RGB-D, human motions
VIRAT (5995586) 2011 536 23 10,000 surveillance, aerial-view
HMDB51 (6126543) 2011 1,928 51 7,000 human motions
CAD-60 (sung2012unstructured) 2011 549 12 60 RGB-D, daily activities
GTEA (fathi2011learning; Li_2015_CVPR) 2011 492 71 526 egocentric, kitchen
CCV (ccv) 2011 288 *20 9,317 web videos
ChaLearn (chalearn) 2011 n/a 86 50,000 RGB-D, gestures and motions
RGBD-HuDaAct (6130379) 2011 393 12 1,189 RGB-D, daily activities
NATOPS (5771448) 2011 111 24 400 aircraft hand signaling
GTEA Gaze (fathi2012learning) 2012 331 40 331 egocentric, kitchen
GTEA Gaze+ (fathi2012learning; Li_2015_CVPR) 2012 165 44 1,958 egocentric, kitchen
BIT-Interaction (10.1007/978-3-642-33718-5_22) 2012 109 8 400 human-human interaction
LIRIS (WOLF201414) 2012 60 10 n/a RGB-D, office environment
MSR-DailyActivity3D (6247813) 2012 1,339 16 320 RGB-D, gestures
UCF101 (soomro2012ucf101) 2012 2,470 101 13,320 web videos, expand UCF50
UTKinect-A (6239233) 2012 1,216 10 200 RGB-D, indoors
MSR-Gesture3D (6333871) 2012 317 12 n/a RGB-D, gestures
ASLAN (6042884) 2012 106 432 3,631 web videos, action similarity
ADL (6248010) 2012 619 18 1,200 egocentric, daily activities
ACT4 (10.1007/978-3-642-33868-7_6) 2012 122 14 6,844 RGB-D, multi-view
SBU-Kinect-Interaction (6239234) 2012 339 8 170 RGB-D, human-human inter.
MPII-Cooking (6247801) 2012 436 65 5,609 kitchen, fine-grained actions
Osaka Kinect (6377312) 2012 31 10 80 RGB-D, gestures
*Only a portion of classes are actions. Some are objects or visual tags.
(continued on next page)
Table 3. 137 video action datasets are sorted by release year. Tabular information includes dataset name, year of publication, citations on Google Scholar as of August 2020, number of action classes, number of action instances, actors: human (H) and/or non-human (N), annotations: action class (C), temporal markers (T), spatiotemporal bounding boxes/masks (S), and theme/purpose.
(continued from previous page)
Action Actors Annotations
Video Dataset Year Cited Classes Instances H N C T S Theme/Focus
DHA (10.1145/2393347.2396381) 2012 66 23 483 RGB-D, gestures and motions
Falling Event (zhang2012rgb) 2012 138 5 200 RGB-D, daily activities
G3D (6239175; 10.1007/978-3-319-02714-2_6) 2012 207 20 659 RGB-D, gaming gestures
MSR-3DActionPairs (Oreifej_2013_CVPR) 2013 866 12 360 RGB-D, gestures
Multiview 3D Event (Wei_2013_ICCV) 2013 119 8 3,815 RGB-D, multi-view
RGBD-SAR (6571292) 2013 29 12 810 RGB-D, monitoring seniors
CAD-120 (doi:10.1177/0278364913478446) 2013 587 10 120 RGB-D, daily activities
JPL Interaction (Ryoo_2013_CVPR) 2013 253 7 85 egocentric, human-human inter.
MHAD (6474999) 2013 336 11 650 RGB-D, multi-view, gestures
Florence3D (florence3D; Seidenari_2013_CVPR_Workshops) 2013 189 9 213 RGB-D , gestures
THUMOS’13 (THUMOS13; idrees2017thumos; soomro2012ucf101) 2013 146 **101 13,320 web videos, extend UCF101
J-HMDB-21 (Jhuang_2013_ICCV) 2013 458 51 928 re-annotate HMDB51 subset
Mivia (10.1007/978-3-642-41190-8_47) 2013 21 7 490 RGB-D, daily activities
IAS-lab (MUNARO201342; munaro2013evaluation) 2013 31 15 540 RGB-D, human motions
WorkoutSU-10 (10.1007/978-3-642-39094-4_74) 2013 66 10 1,200 RGB-D, group activities
50Salads (10.1145/2493432.2493482) 2013 177 17 966 RGB-D, kitchen
UWA3D-I (10.1007/978-3-319-10605-2_48) 2014 141 30 900 RGB-D, multi-view
MANIAC (AKSOY2015118) 2014 43 8 120 RGB-D, ego., manipulations
Breakfast Action (Kuehne_2014_CVPR) 2014 203 48 11,267 kitchen
Northwestern-UCLA (Wang_2014_CVPR) 2014 222 10 1,475 RGB-D, multi-view
Sports-1M (Karpathy_2014_CVPR) 2014 4,361 487 1,000,000 multi-label, sports
ORGBD (3D Online) (10.1007/978-3-319-16814-2_4) 2014 136 7 336 RGB-D, human-object inter.
THUMOS’14 (THUMOS14; idrees2017thumos) 2014 146 ***101 15,904 extends THUMOS’13
Office Activity (10.1145/2647868.2654912) 2014 94 20 1,180 RGB-D, office environment
Composable (Lillo_2014_CVPR) 2014 81 16 693 RGB-D, gestures and motions
CMU-MAD (10.1007/978-3-319-10578-9_27) 2014 80 20 1,400 RGB-D, gestures and motions
FFPA (Zhou_2015_ICCV) 2015 48 5 591 egocentric, daily activities
TJU (LIU201574) 2015 69 15 1,200 RGB-D, static background
MI (10.1145/2733373.2806315) 2015 23 22 1,760 RGB-D, multi-view
FCVID (7857793) 2015 219 *239 91,223 web videos, diverse categories
ActivityNet100 (v1.2) (Heilbron_2015_CVPR) 2015 797 100 10,733 untrimmed web videos
THUMOS’15 (THUMOS15; idrees2017thumos) 2015 146 ***101 21,037 extends THUMOS’14
MEXaction (7236896; 7350799) 2015 16/3 2 1,108 culturally relevant actions
MEXaction2 (mexaction2) 2015 n/a 2 1,975 extends MEXaction
Watch-n-Patch (Wu_2015_CVPR) 2015 119 21 2,000 RGB-D, daily activities
TVSeries (10.1007/978-3-319-46454-1_17) 2016 109 30 6,231 TV
OAD (10.1007/978-3-319-46478-7_13) 2016 109 10 n/a RGB-D, daily activities
CONVERSE (converse2016) 2016 20 7 n/a RGB-D, human-human inter.
OA (7477586) 2016 11 48 480 action semantic hierarchy
Volleyball (Ibrahim_2016_CVPR) 2016 215 6 1,643 sports (volleyball motions)
UWA3D-II (Rahmani_2016_CVPR) 2016 117 30 1,075 RGB-D, multi-view
ActivityNet200 (v2.3) (Heilbron_2015_CVPR) 2016 797 200 23,064 untrimmed web videos
YouTube-8M (abuelhaija2016youtube8m) 2016 607 *n/a n/a multi-label
Charades (10.1007/978-3-319-46448-0_31) 2016 343 157 66,500 crowd-sourced, daily activities
NTU RGB-D (shahroudy2016ntu) 2016 792 60 56,880 RGB-D, multi-view
Micro-Videos (nguyen2016open) 2016 27 *n/a n/a micro-videos (e.g. Vine, Tik-Tok)
JAAD (rasouli2017they; rasouli2018role) 2017 53 n/a 654 pedestrians
DAHLIA (7961782) 2017 9 7 51 RGB-D, daily activities
PKU-MMD (liu2017pku) 2017 67 51 3,366 RGB-D, multi-view
SYSU 3DHOI (Hu_2015_CVPR) 2017 302 12 480 RGB-D, human-object inter.
DALY (weinzaepfel2016human) 2017 26 10 3,600 daily activities
Okutama Action (Barekatain_2017_CVPR_Workshops) 2017 55 12  4,700 aerial view
Kinetics-400 (kay2017kinetics) 2017 810 400 306,245 diverse web videos
AVA (Gu_2018_CVPR) 2017 270 80 392,416 atomic visual actions
Something-Something (Goyal_2017_ICCV) 2017 182 174 108,499 human-object inter.
SLAC (slac) 2017 19 200 1,750,000 sparse-labelled web videos
Moments in Time (MiT) (monfort2018moments) 2017 137 339 836,144 intra-class variation, web videos
MultiTHUMOS (multithumos) 2017 231 65 16,000 multi-label, extends THUMOS
VIENA (10.1007/978-3-030-20887-5_28) 2018 7 25 15,000 pedestrians and vehicles
PRAXIS Gesture (negin2018praxis) 2018 16 29 4,600 RGB-D, gestures
UAV-GESTURE (Perera_2018_ECCV_Workshops) 2018 10 13 119 aerial-view, gestures
Diving48 (Li_2018_ECCV-diving) 2018 25 48 18,404 diving motions (sports)
EPIC-KITCHENS-55 (Damen_2018_ECCV) 2018 209 125 39,594 egocentric, kitchen
YouCook2 (zhou2018towards) 2018 96 n/a 15,400 web videos, kitchen
*Only a portion of classes are actions. Some are objects or visual tags.
**Only 24 classes have temporal annotations. This subset is known as UCF101-24.
***Only 20 classes have temporal annotations.
(continued on next page)
(continued from previous page)
Action Actors Annotations
Video Dataset Year Cited Classes Instances H N C T S Theme/Focus
Kinetics-600 (carreira2018short) 2018 52 600 495,547 extends Kinetics-400
VLOG (Fouhey_2018_CVPR) 2018 41 30 122,000 web videos, human-object inter.
EGTEA Gaze+ (Li_2018_ECCV) 2018 52 106 10,325 egocentric, kitchen
Something-Something-v2 (mahdisoltani2018effectiveness) 2018 5 174 220,847 extends Something-Something
Charades-Ego (sigurdsson2018charadesego) 2018 19 157 68,536 egocentric, daily activities
Youtube-8M Segments (abuelhaija2016youtube8m) 2019 n/a *n/a n/a multi-label, extends YT-8M
Jester (Materzynska_2019_ICCV) 2019 12 27 148,092 crowd-sourced, gestures
LSVV-HRI (ji2019largescale) 2019 4 83 25,600 RGB-D, human-robot inter.
PIE (Rasouli_2019_ICCV) 2019 10 6 1,800 pedestrians
Kinetics-700 (carreira2019short) 2019 33 700 650,000 extends Kinetics-600
Multi-MiT (monfort2019multimoments) 2019 1 313 1,020,000 multi-label, extends MiT
HACS Clips (zhao2017hacs) 2019 31 200 1,500,000 trimmed web videos
HACS Segments (zhao2017hacs) 2019 31 200 139,000 extends and improves SLAC
NTU RGB-D 120 (8713892) 2019 55 120 114,480 extends NTU RGB-D 60
EPIC-KITCHENS-100 (damen2020rescaling) 2020 6 97 90,000 extends EPIC-KITCHENS-55
AVA-Kinetics (li2020avakinetics) 2020 5 80 238,000 adds annotations, AVA+Kinetics
ARID (xu2020arid) 2020 0 11 3,784 dark (low-lighting) videos
AViD (piergiovanni2020avid) 2020 0 887 450,000 diverse, anonymized faces
*Only a portion of classes are actions. Some are objects or visual tags.
Workshop Year Conf. Problem Dataset(s) Metric(s) #Teams
THUMOS (THUMOS13) 2013 ICCV AR UCF101 average accuracy 17
SAL/D UCF101-24 ROC AUC sIoU@0.2 n/a
THUMOS (THUMOS14) 2014 ECCV AR UCF101+ mAP 11
TAL/D UCF101-20 mAP tIoU@{0.1,0.2,0.3,0.4,0.5} 3
THUMOS (THUMOS15; Idrees_2017) 2015 CVPR AR UCF101+1 mAP 11
TAL/D UCF101-20 mAP tIoU@{0.1,0.2,0.3,0.4,0.5} 1
ActivityNet (Heilbron_2015_CVPR) 2016 CVPR AR ActivityNet 1.3 mAP, Top-1 accuracy, Top-3 accuracy 26
TAL/D ActivityNet 1.3 mAP-50, mAP-75, average-mAP 6
ActivityNet (ghanem2017activitynet) 2017 CVPR AR ActivityNet 1.3 Top-1 error n/a
AR Kinetics-400 average(Top-1 error, Top-5 error) 31
TAP ActivityNet 1.3 AR-AN AUC 17
TAL/D ActivityNet 1.3 mAP tIoU@0.5:0.05:0.95 17
ActivityNet (ghanem2018activitynet) 2018 CVPR TAP ActivityNet 1.3 AR-AN AUC 55
TAL/D ActivityNet 1.3 mAP tIoU@0.5:0.05:0.95 43
AR Kinetics-600 average(Top-1 error, Top-5 error) 13
SAL/D AVA frame-mAP sIoU@0.5 23
AR MiT (full-track) average(Top-1 acc, Top-5 acc) 29
AR MiT (mini-track) average(Top-1 acc, Top-5 acc) 12
ActivityNet (activitynetchallenge2019; Lee_2020_WACV) 2019 CVPR TAP ActivityNet 1.3 AR-AN AUC 72
TAL/D ActivityNet 1.3 mAP tIoU@0.5:0.05:0.95 23
AR Kinetics-700 average(Top-1 error, Top-5 error) 15
SAL/D AVA frame mAP sIoU@0.5 32
AR EPIC-KITCHENS-55 micro-avg Top-1,5 acc, macro-AP,AR 39
AP EPIC-KITCHENS-55 micro-avg Top-1,5 acc, macro-AP,AR 19
TAL/D VIRAT P@miss 42
Multi-modal (multimodalICCV19) 2019 ICCV AR Multi-MiT mAP 10
TAL/D HACS Segments mAP tIoU@0.5:0.05:0.95 5
ActivityNet (activitynetchallenge2020) 2020 CVPR TAL/D ActivityNet 1.3 mAP tIoU@0.5:0.05:0.95 n/a
AR Kinetics-700 average(Top-1 error, Top-5 error) n/a
SAL/D AVA frame mAP sIoU@0.5 n/a
TAL/D VIRAT P@miss 11
TAL/D HACS Segments mAP tIoU@0.5:0.05:0.95 22
TAL/D HACS Clips+Seg. mAP tIoU@0.5:0.05:0.95 13
AR = Action Recognition
TAP = Temporal Action Proposal
TAL/D = Temporal Action Localization/Detection
SAL/D = Spatiotemporal Action Localization/Detection
Table 4. Prominent Video Action Understanding Challenges 2013-2020.

Footnotes

  1. ccs: Computing methodologies Supervised learning
  2. ccs: Computing methodologies Machine learning
  3. ccs: Computing methodologies Ensemble methods
  4. ccs: Computing methodologies Activity recognition and understanding
  5. ccs: Computing methodologies Image representations
  6. https://sites.google.com/view/multimodalvideo/home
  7. https://developer.microsoft.com/en-us/windows/kinect/
  8. http://deeplearning.net/software/theano/ and https://mxnet.apache.org/versions/1.6/
  9. https://docs.microsoft.com/en-us/cognitive-toolkit/ and https://www.tensorflow.org/
  10. https://paperswithcode.com/area/computer-vision