Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation

Thanks to the substantial and rapidly increasing number of instructional videos on the Internet, novices are able to acquire knowledge for completing various tasks. Over the past decade, growing efforts have been devoted to investigating the problem of instructional video analysis. However, most existing datasets in this area have limitations in diversity and scale, which makes them far from many real-world applications where more diverse activities occur. To address this, we present a large-scale dataset named “COIN” for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are efficiently annotated with a series of step labels and the corresponding temporal boundaries. In order to provide a benchmark for instructional video analysis, we evaluate a wide range of approaches on the COIN dataset under five different settings. Furthermore, we exploit two important characteristics (i.e., task-consistency and ordering-dependency) for localizing important steps in instructional videos. Accordingly, we propose two simple yet effective methods, which can be easily plugged into conventional proposal-based action detection models. We believe the introduction of the COIN dataset will promote future in-depth research on instructional video analysis in the community. Our dataset, annotation toolbox and source code are available at

Instructional Video, Activity Understanding, Video Analysis, Deep Learning, Large-Scale Benchmark.

1 Introduction

INSTRUCTION, which refers to “directions about how something should be done or operated” [1], enables novices to acquire knowledge from experts to accomplish different tasks. Over the past decades, learning from instruction has become an important topic in various areas such as cognitive science [33], educational psychology [53], and the intersection of computer vision, natural language processing and robotics [46, 5, 4].

Instruction can be expressed through different mediums such as text, images and videos. Among them, instructional videos provide more intuitive visual examples, and are the focus of this paper. With the explosion of video data on the Internet, people around the world have uploaded and watched substantial numbers of instructional videos [3, 58], covering miscellaneous categories. As suggested by scientists in educational psychology [53], novices often face difficulties in learning from a whole realistic task, and it is necessary to divide the whole task into smaller segments or steps as a form of simplification. Accordingly, a variety of related tasks have been studied by the modern computer vision community in recent years (e.g., temporal action localization [80, 74], video summarization [20, 79, 48] and video captioning [83, 34, 77]). Increasing efforts have also been devoted to exploring different challenges of instructional video analysis [31, 82, 58, 3] because of its great research and application value. As evidence, Fig. 2 shows the growing number of publications in top venues over the recent ten years.

Fig. 1: Visualization of two root-to-leaf branches of COIN. Our dataset has three levels: domain, task and step. Taking the top row as an example, the left box shows a set of frames from 9 different tasks associated with the domain “vehicles”. The middle box presents several images from 9 videos belonging to the task “change the car tire”. Based on this task, the right box displays a sequence of frames sampled from a specific video, where the indices are presented at the top left of each frame. The intervals in red, blue and yellow indicate the steps “unscrew the screws”, “jack up the car” and “put on the tire”, which are described by the text in the corresponding color at the bottom of the right box. All figures are best viewed in color.

In the meantime, a number of datasets for instructional video analysis [55, 12, 64, 35, 3, 69, 82] have been collected by the community. Annotated with the texts and temporal boundaries of a series of steps for completing different tasks, these datasets have provided good benchmarks for preliminary research. However, most existing datasets focus on a specific domain like cooking, which makes them far from many real-world applications where more diverse activities occur. Moreover, the scales of these datasets are insufficient to satisfy the data hunger of recent learning methods.

To tackle these problems, we introduce a new dataset called “COIN” for COmprehensive INstructional video analysis. The COIN dataset contains 11,827 videos of 180 different tasks, covering daily activities related to vehicles, gadgets and many others. Unlike most existing instructional video datasets, COIN is organized in a three-level semantic structure. Take the top row of Fig. 1 as an example: the first level of this root-to-leaf branch is a domain named “vehicles”, under which there are a number of video samples belonging to the second-level tasks. A specific task like “change the car tire” comprises a series of steps such as “unscrew the screws”, “jack up the car” and “put on the tire”. These steps appear in different intervals of a video and constitute the third-level tags of COIN. We also provide the temporal boundaries of all the steps, which are efficiently annotated with a newly developed toolbox.

To set up a benchmark, we implement various approaches on COIN under five different settings, including step localization, action segmentation, procedure localization, task recognition and step recognition. Furthermore, we propose two simple yet effective methods for localizing different steps in instructional videos. Specifically, we first explore task-consistency based on bottom-up aggregation and top-down refinement strategies. Then, we investigate ordering-dependency by considering the transition probabilities between different steps. Extensive experimental results demonstrate the great challenges of COIN and the effectiveness of our methods. Moreover, we study the cross-dataset transfer setting under the conventional “pre-training + fine-tuning” paradigm, and demonstrate that COIN can benefit the step localization task on other instructional video datasets.

Our main contributions are summarized as follows:

  • We have introduced the COIN dataset based on our extensive survey of instructional video analysis. To the best of our knowledge, this is currently the largest manually annotated dataset in this field. Moreover, as a by-product, we have developed an efficient and practical annotation tool, which can be further utilized to label temporal boundaries for other tasks like action detection and video captioning.

  • We have evaluated various methods on the COIN dataset under five different evaluation criteria, constructing a benchmark to facilitate future research. Based on the extensive experimental results, we have analyzed COIN from different aspects, and provided an in-depth discussion comparing it with other related video analysis datasets.

  • We have exploited the task-consistency and ordering-dependency to further enhance the performance of the step localization task. Moreover, we have verified that our COIN dataset can contribute to the step localization task for other instructional video datasets.

Note that a preliminary conference version of this work was initially presented in [66]. As an extension, we have devised a new method to explore the ordering-dependency of different steps in instructional videos, and verified its effectiveness on the COIN and Breakfast [35] datasets. Moreover, we have conducted more experiments and provided more in-depth discussions, including a study of cross-dataset transfer, an analysis of hyper-parameters, and a new experimental setting on step recognition. We have also presented a more detailed literature review of instructional video analysis and a discussion of future work.

Fig. 2: Number of papers related to instructional video analysis published on top computer vision conferences (CVPR/ICCV/ECCV) over the recent 10 years.
| Dataset | Duration | Samples | Segments | Task | Video Source | Manual Anno. | Classes | Year |
| MPII [55] | 9h 48m | 44 | 5,609 | CA | self-collected | SL + TB | - | 2012 |
| YouCook [12] | 2h 20m | 88 | - | CA | YouTube | SL + TB | - | 2013 |
| 50Salads [64] | 5h 20m | 50 | 966 | CA | self-collected | SL + TB | - | 2013 |
| Breakfast [35] | 77h | 1,989 | 8,456 | CA | self-collected | SD + TB | 10 | 2014 |
| JIGSAWS [24] | 2.6h | 206 | 1,703 | SA | self-collected | SD + TB | 3 | 2014 |
| “5 tasks” [3] | 5h | 150 | - | CT | YouTube | SD + TB | 5 | 2016 |
| Ikea-FA [69] | 3h 50m | 101 | 1,911 | AF | self-collected | SL + TB | - | 2017 |
| Recipe1M [56] | - | 432 | - | CA | - | SD | - | 2017 |
| Recipe1M+ [43] | - | 13,735,679 | - | CA | Google | SD | - | 2018 |
| YouCook2 [82] | 176h | 2,000 | 13,829 | CA | YouTube | SD + TB + OBL | 89 | 2018 |
| EPIC-KITCHENS [11] | 55h | 432 | 39,596 | CA | self-collected | SD + TB + OBL | 5 | 2018 |
| EPIC-Skills [18] | 5.2h | 216 | - | - | mixed | WB | 4 | 2018 |
| CrossTask [84] | 376h | 4,700 | - | CT | YouTube | SL + TB | 83 | 2019 |
| BEST [19] | 26h | 500 | - | CT | YouTube | WB | 5 | 2019 |
| HowTo100M [45] | 134,472h | 1.22M | 136M | CT | YouTube | - | 23K | 2019 |
| COIN (Ours) | 476h 38m | 11,827 | 46,354 | CT | YouTube | SL + TB | 180 |
  • Both Recipe1M and Recipe1M+ are image-based datasets.
  • The EPIC-Skills dataset comprises four tasks, two of which were self-recorded and two selected from published datasets [24, 14].
  • Task types — CA: cooking activities, SA: surgical activities, AF: assembling furniture, CT: comprehensive tasks.
  • Annotation types — SL: step label (shared across different videos), SD: step description (not shared across different videos), TB: temporal boundary, OBL: object bounding box and label, WB: which of a pair of videos is better.

TABLE I: Comparisons of existing datasets related to instructional video analysis.

2 Related Work

2.1 Tasks for Instructional Video Analysis

Fig. 3 shows how different tasks for instructional video analysis have been proposed over the past decade. In 2012, Rohrbach et al. [55] collected the MPII dataset, which promoted later works on step localization and action segmentation. As the two fundamental tasks in this field, step localization aims to localize the start and end points of a series of steps and recognize their labels, while action segmentation aims to parse a video into different actions at the frame level. In later sections of the paper we concentrate more on these two tasks, while in this subsection we briefly introduce the other tasks.

In 2013, Das et al. [12] proposed the YouCook dataset and facilitated research on video captioning, which requires generating sentences to describe videos. In 2017, Huang et al. [31] studied the task of reference resolution, which aims to temporally link an entity to the original action that produced it. In 2018, they further investigated the visual grounding problem [30], which explores the visual-linguistic meaning of referring expressions in both spatial and temporal domains. In the same year, Zhou et al. [82] presented a procedure segmentation task, targeting the segmentation of an instructional video into category-independent procedure segments. Farha et al. [22] studied the activity anticipation problem, which predicts future actions and their durations in instructional videos. Doughty et al. [18] addressed the issue of skill determination, which assesses the skill behaviour of a subject. More recently, Chang et al. [9] presented a new task of procedure planning in instructional videos, which aims to discover the intermediate actions according to the start and final observations. With the promotion of these new topics in instructional video analysis, the research community is paying growing attention to this burgeoning field.

Fig. 3: The timeline of different tasks for instructional video analysis being proposed.

2.2 Datasets Related to Instructional Video Analysis

There are mainly three types of related datasets. (1) The action detection datasets comprise untrimmed video samples, and the goal is to recognize and localize action instances in the temporal domain [27, 32, 67] or spatio-temporal domain [25]. (2) The video summarization datasets [13, 26, 62, 48] contain long videos from different domains. The objective is to extract a set of informative frames to briefly summarize the video content. (3) The video captioning datasets are annotated with descriptive sentences or phrases, which can be based on either a trimmed video [75, 77] or different segments of a long video [34]. Our COIN is relevant to the above-mentioned datasets, as it requires localizing the temporal boundaries of the important steps of a task. The main differences lie in the following two aspects. (1) Task-consistency: the steps belonging to different tasks shall not appear in the same video. For example, it is unlikely for an instructional video to contain both the step “pour water to the tree” (belonging to the task “plant tree”) and the step “install the lampshade” (belonging to the task “replace a bulb”). (2) Ordering-dependency: there may be intrinsic ordering constraints among the series of steps for completing a task. For example, for the task “plant tree”, the step “dig a hole” shall be ahead of the step “put the tree into the hole”.

There have been a variety of instructional video datasets proposed in recent years, and we briefly review some representative datasets in the supplementary. Table I summarizes the comparison between publicly available instructional datasets and our proposed COIN. While the existing datasets present various challenges to some extent, they still have limitations in the following two aspects. (1) Diversity: most of these datasets tend to be specific and contain certain types of instructional activities, e.g., cooking. However, according to some widely-used websites [28, 29, 73], people attempt to acquire knowledge from various types of instructional videos across different domains. (2) Scale: compared with recent datasets for image classification (e.g., ImageNet [15] with about 1 million images) and action detection (e.g., ActivityNet v1.3 [27] with about 20k videos), most existing instructional video datasets are relatively small in scale. Though the HowTo100M dataset provides a great amount of data, its automatically generated annotation might be inaccurate, as the authors mention in [45]. The challenge of building such a large-scale dataset mainly stems from the difficulty of organizing an enormous amount of video and the heavy workload of annotation. To address these two issues, we first establish a rich semantic taxonomy covering 12 domains and collect 11,827 instructional videos to construct COIN. With our newly developed toolbox, we also provide efficient and precise annotation of the temporal boundaries of the steps that appear in all the videos.

2.3 Methods for Instructional Video Analysis

In this subsection, we review a series of approaches related to the two core tasks of instructional video analysis: step localization and action segmentation. We roughly divide them into three categories according to the experimental settings: unsupervised, weakly-supervised and fully-supervised learning-based approaches.

Unsupervised Learning Approaches: In the first category, the step localization task usually takes a video and the corresponding narration or subtitles as multi-modal inputs. For example, Sener et al. [58] developed a joint generative model to parse both video frames and subtitles into activity steps. Alayrac et al. [3] leveraged the complementary nature of an instructional video and its narration to discover and locate the main steps of a certain task. Generally speaking, the advantage of employing narrations or subtitles is avoiding human annotation, which would otherwise require a huge workload. However, these narrations or subtitles may be inaccurate [82] or even irrelevant to the video, as mentioned above. For the action segmentation task, Aakur et al. [2] presented a self-supervised and predictive learning framework to explore the spatial-temporal dynamics of videos, while Sener et al. [57] proposed a Generalized Mallows Model (GMM) to model the distribution over sub-activity permutations. More recently, Kukleva et al. [36] first learned a continuous temporal embedding of frame-based features, and then decoded the videos into coherent action segments according to an ordered clustering of these features.

Weakly-supervised Learning Approaches: In the second category, the step localization problem has recently been studied by Zhukov et al. [84], who explored the information shared by steps across different tasks. Liu et al. [40] identified and addressed the action completeness modeling and action-context separation problems for temporal action localization under weak supervision. For the action segmentation task, Kuehne et al. [35] developed a hierarchical model based on HMMs and a context-free grammar to parse the main steps of cooking activities. Richard et al. [51][52] adopted the Viterbi algorithm to solve a probabilistic model of weakly supervised segmentation. Ding et al. [16] proposed a temporal convolutional feature pyramid network to predict frame-wise labels and used soft boundary assignment to iteratively optimize the segmentation results. In this work, we also evaluate these three methods to provide baseline results on COIN. More recently, Chang et al. [8] developed a discriminative differentiable dynamic time warping (DTW) method, which extended the ordering loss to be differentiable.

Fully-supervised Learning Approaches: In the third category, the action segmentation task has been explored by a number of works developing various types of network architectures, e.g., the multi-stream bi-directional recurrent neural network (MSB-RNN) [61], the temporal deformable residual network (TDRN) [37] and the multi-stage temporal convolutional network (MS-TCN) [21]. As a task we pay more attention to, step localization is related to the area of action detection, where promising progress has also been achieved recently [41, 42, 39, 38]. For example, Zhao et al. [80] developed structured segment networks (SSN) to model the temporal structure of each action instance with a structured temporal pyramid. Xu et al. [74] introduced the Region Convolutional 3D Network (R-C3D) architecture, which is built on C3D [70] and Faster R-CNN [50], to explore the region information of video frames. Compared with these methods, we attempt to further explore the dependencies between different steps, which lie in the intrinsic structure of instructional videos. Towards this goal, we propose two methods to leverage the task-consistency and ordering-dependency of different steps. Our methods can be easily plugged into recent proposal-based action detection methods and enhance the performance of the step localization task for instructional video analysis.

3 The COIN Dataset

Fig. 4: Illustration of the COIN lexicon. The left figure shows the hierarchical structure, where the nodes of three different sizes correspond to the domain, task and step respectively. For brevity, we do not draw all the tasks and steps here. The right figure presents detailed steps of the task “replace a bulb”, which belongs to the domain “electrical appliances”.

3.1 Lexicon

The purpose of COIN is to establish a rich semantic taxonomy to organize comprehensive instructional videos. In previous literature, some representative large-scale datasets were built upon existing structures. For example, the ImageNet [15] database was constructed based on the hierarchical structure of WordNet [23], while the ActivityNet dataset [27] adopted the activity taxonomy organized by the American Time Use Survey (ATUS) [47]. In comparison, it remains difficult to define such a semantic lexicon for instructional videos because of their high diversity and complex temporal structure. Hence, most existing instructional video datasets [82] focus on a specific domain like cooking or furniture assembling, and the dataset of [3] consists of only five tasks. Towards the goal of constructing a large-scale benchmark with high diversity, we utilize a hierarchical structure to organize our dataset. Fig. 1 and Fig. 4 illustrate our lexicon, which contains three levels from root to leaves: domain, task and step.

(1) Domain: For the first level, we borrow ideas from the organization of several websites [28][73][29] that are commonly used to watch or upload instructional videos. We choose 12 domains: nursing & caring, vehicles, leisure & performance, gadgets, electrical appliances, household items, science & craft, plants & fruits, snacks & drinks, dishes, sports, and housework.

(2) Task: At the second level, each task is linked to a domain. For example, the tasks “replace a bulb” and “install a ceiling fan” are associated with the domain “electrical appliances”. As most tasks on [28][73][29] may be too specific, we further search for tasks of the 12 domains on YouTube. In order to ensure that the tasks of COIN are commonly performed, we finally select 180 tasks whose retrieved videos are frequently viewed.

(3) Step: The third level of the lexicon contains the various series of steps for completing different tasks. For example, the steps “remove the lampshade”, “take out the old bulb”, “install the new bulb” and “install the lampshade” are associated with the task “replace a bulb”. We employed 6 experts (e.g., a driver, an athlete, etc.) with prior knowledge of the 12 domains to define these steps. They were asked to browse the corresponding videos in preparation so as to provide high-quality definitions, and each step phrase was double-checked by another expert. In total, there are 778 defined steps. Note that we do not directly adopt narrated information, which might have large variance for a specific task, because we aim to obtain a simplified set of core steps that are common across different videos of accomplishing a certain task.

3.2 Annotation Tool

Given an instructional video, the goal of annotation is to label the step categories and the corresponding segments. As the segments vary in length and content, labeling COIN with a conventional annotation tool would require a huge workload. In order to improve annotation efficiency, we develop a new toolbox with two modes: frame mode and video mode. Fig. 5 shows an example interface of the frame mode, which presents the frames extracted from a video at an adjustable frame rate (the default is 2 fps). Under the frame mode, the annotator can directly select the start and end frames of a segment as well as its label. However, due to the time gap between two adjacent frames, some quick and consecutive actions might be missed. To address this problem, we adopt another mode, the video mode. The video mode of the annotation tool presents the online video and a timeline, as frequently used in previous video annotation systems [34]. Though the video mode brings more continuous information on the time scale, it is much more time-consuming than the frame mode because of the process of locating a certain frame and adjusting the timeline.

During the annotation process, each video was labelled by three different paid workers. To begin with, the first worker generated primary annotations under the frame mode. Next, the second worker adjusted the annotations based on the results of the first worker. Ultimately, the third worker switched to the video mode to check and refine the annotations. Under this pipeline, the total time of the annotation process was about 600 hours.

Fig. 5: The interface of our new developed annotation tool under the frame mode.
Fig. 6: The duration statistics of the videos (left) and segments (right) in the COIN dataset.

3.3 Statistics

The COIN dataset consists of 11,827 videos related to 180 different tasks, all collected from YouTube. We split COIN into 9,030 and 2,797 video samples for training and testing respectively, and show the sample distributions among all the task categories in the supplementary. Fig. 6 displays the duration distributions of videos and segments. The average length of a video is 2.36 minutes. Each video is labelled with 3.91 step segments on average, where each segment lasts 14.91 seconds. In total, the dataset contains 476 hours of video, with 46,354 annotated segments.

In order to further illustrate the characteristics of the COIN dataset, we calculate two scores similar to those used in [4]. For a given task, suppose there are $N$ video samples and $K$ steps defined in the ground truth. For the $i$-th video, let $n_i$ denote the number of unique annotated steps, $l_i$ denote the length of the longest common subsequence between the annotated sequence of steps and the ground-truth sequence, and $m_i$ denote the number of annotated steps with duplicates removed. Then the missing steps score (MSS) and order consistency error (OCE) are defined as below:

$$\mathrm{MSS} = \frac{1}{N}\sum_{i=1}^{N}\Big(1 - \frac{n_i}{K}\Big), \qquad \mathrm{OCE} = \frac{1}{N}\sum_{i=1}^{N}\Big(1 - \frac{l_i}{m_i}\Big).$$

The average values over the 180 tasks in COIN are 0.2924 (MSS) and 0.2076 (OCE) respectively, and the concrete values for each task are presented in the supplementary. These statistics illustrate that videos of the same task do not strictly share the same series of ordered steps, due to abbreviated or reversed step sequences. For example, a task with ground-truth steps $(s_1, s_2, s_3)$ might contain step sequences such as $(s_1, s_2, s_3)$, $(s_1, s_3)$ and $(s_2, s_1, s_3)$.
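Under our reading of these definitions (with $n_i$ the unique-step count, $l_i$ the LCS length and $m_i$ the de-duplicated sequence length), the two scores can be sketched in a few lines of Python; the function names are ours:

```python
def lcs_len(a, b):
    # Length of the longest common subsequence between two step lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def mss_oce(annotations, gt_steps):
    """annotations: one step sequence per video; gt_steps: ordered ground-truth steps."""
    K, N = len(gt_steps), len(annotations)
    # MSS: fraction of ground-truth steps missing from each video, averaged.
    mss = sum(1 - len(set(seq)) / K for seq in annotations) / N
    # OCE: fraction of annotated (de-duplicated) steps that break the ground-truth order.
    oce = sum(1 - lcs_len(seq, gt_steps) / max(len(set(seq)), 1) for seq in annotations) / N
    return mss, oce
```

For instance, a video annotated as (s2, s1, s3) against ground truth (s1, s2, s3) misses no step (MSS term 0) but has LCS length 2 out of 3 unique steps, contributing 1/3 to the OCE.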

4 Approach

4.1 Task-consistency Analysis

Given an instructional video, one important real-world application is to localize the series of steps required to complete the corresponding task. In this section, we introduce our newly proposed task-consistency method for step localization in instructional videos. Our method is motivated by the intrinsic dependencies between the different steps associated with a certain task. For example, it is unlikely for the steps “dig a pit of proper size” and “soak the strips into water” to occur in the same video, because they belong to the different tasks “plant tree” and “make french fries” respectively. In other words, the steps in the same video should be task-consistent, i.e., they should belong to the same task. Fig. 7 presents the flowchart of our task-consistency method, which contains two stages: (1) bottom-up aggregation and (2) top-down refinement.

Fig. 7: Flowchart of our proposed task-consistency method. During the first stage of bottom-up aggregation, the inputs are a series of proposal scores of an instructional video, which denote the probabilities of each step appearing in the corresponding proposal. We first aggregate them into a video-based score, and map it into a task-based score to predict the task label. At the top-down refinement stage, we generate a refined mask vector based on the task label, and attenuate the weights of the steps that do not belong to the predicted task to ensure task-consistency. The refined scores are finally utilized in an NMS process to output the final results.

Bottom-up Aggregation: As our method is built upon proposal-based action detection methods, we start by training an existing action detector, e.g., SSN [80], on our COIN dataset. During the inference phase, given an input video, we feed it into the action detector to produce a series of proposals with their corresponding locations and predicted scores. These scores indicate the probabilities of each step occurring in the corresponding proposal. We denote them as $\{s_i\}_{i=1}^{P}$, where $s_i \in \mathbb{R}^N$ represents the score of the $i$-th proposal, $P$ is the number of proposals and $N$ is the number of steps. The goal of the bottom-up aggregation stage is to predict the task label based on these proposal scores. To this end, we first aggregate the scores along all the proposals as $\bar{s} = \sum_{i=1}^{P} s_i$, where $\bar{s}$ indicates the probability of each step appearing in the video. Then we construct a binary matrix $M \in \{0,1\}^{T \times N}$, where $T$ is the number of tasks, to model the relationship between the steps and tasks:

$$M_{jk} = \begin{cases} 1, & \text{if step } k \text{ belongs to task } j, \\ 0, & \text{otherwise.} \end{cases}$$

Having obtained the step-based score $\bar{s}$ and the binary matrix $M$, we calculate a task-based score as $t = M\bar{s}$. This operation essentially combines the scores of the steps belonging to the same task. We choose the index with the maximum value in $t$ as the task label $\hat{c}$ of the entire video.

Note that since we sum the step scores to calculate the task score, one might worry that the task score will be dominated by tasks with more steps. Besides summing the scores, an alternative is to average them over the number of steps. In fact, these two schemes equally weigh all the steps or all the tasks respectively. We conduct experiments in Section 5 and find the summing scheme more effective on COIN.
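The bottom-up aggregation stage can be sketched as follows, assuming a (P × N) array of proposal step scores and a per-step task mapping; the names `predict_task` and `step_to_task` are our own, and the averaging variant discussed above is included for comparison:

```python
import numpy as np

def predict_task(proposal_scores, step_to_task, num_tasks, reduce="sum"):
    """proposal_scores: (P, N) array of per-proposal step probabilities.
    step_to_task: length-N array mapping each step index to its task index."""
    N = proposal_scores.shape[1]
    # Binary task-step matrix M: M[j, k] = 1 iff step k belongs to task j.
    M = np.zeros((num_tasks, N))
    M[np.asarray(step_to_task), np.arange(N)] = 1.0
    s_bar = proposal_scores.sum(axis=0)   # video-level step score
    t = M @ s_bar                         # task-level score (summing scheme)
    if reduce == "mean":                  # alternative: average by step count
        t = t / np.maximum(M.sum(axis=1), 1)
    return int(np.argmax(t)), M           # predicted task label and the matrix M
```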

Top-down Refinement: The target of the top-down refinement stage is to refine the original proposal scores under the guidance of the task label. We first select the $\hat{c}$-th row of $M$ as a mask vector $v$, based on which we define a refined vector $\hat{v}$ as:

$$\hat{v} = v + \gamma(\mathbf{1} - v).$$

Here $\mathbf{1}$ is an $N$-dimensional vector whose elements all equal 1, and $\gamma \in [0, 1)$ is an attenuation coefficient that alleviates the weights of the steps which do not belong to the task $\hat{c}$. We set $\gamma$ empirically, and further exploration of this parameter can be found in our supplementary material. Then, we employ $\hat{v}$ to mask the original scores as:

$$\hat{s}_i = \hat{v} \odot s_i,$$

where $\odot$ is the element-wise Hadamard product. We compute the refined scores for all proposals as $\{\hat{s}_i\}_{i=1}^{P}$. Based on these refined scores and their locations, we employ a Non-Maximum Suppression (NMS) strategy to obtain the step localization results. In summary, we first predict the task label through the bottom-up scheme, and then refine the proposal scores with the top-down strategy, hence task-consistency is guaranteed.
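A minimal sketch of the top-down refinement, together with a plain temporal NMS, might look like this; the attenuation value passed as `gamma` and the IoU threshold are illustrative, not the values used in our experiments:

```python
import numpy as np

def refine_scores(proposal_scores, M, task_label, gamma=0.5):
    """Attenuate steps outside the predicted task (gamma value is illustrative)."""
    v = M[task_label]                            # binary mask of steps in the task
    v_hat = v + gamma * (np.ones_like(v) - v)    # keep in-task steps, damp the rest
    return proposal_scores * v_hat               # Hadamard masking per proposal

def temporal_nms(intervals, scores, thresh=0.5):
    """Greedy 1D NMS: keep highest-scoring intervals, drop heavy overlaps."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        inter = np.maximum(0.0, np.minimum(intervals[rest, 1], intervals[i, 1])
                                - np.maximum(intervals[rest, 0], intervals[i, 0]))
        union = (intervals[rest, 1] - intervals[rest, 0]) \
                + (intervals[i, 1] - intervals[i, 0]) - inter
        order = rest[inter / union <= thresh]
    return keep
```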

Fig. 8: Pipeline of our ordering-dependency method. The input of our method is a set of proposals, which we refine through the following three stages. (1) In order to deal with overlapping proposals, we first group them into a series of segments. (2) We refine the segments based on the ordering information in the training set. (3) We map the score variation of the segments back to the proposals, and finally obtain the refined proposal scores.

4.2 Ordering-dependency Analysis

Another important characteristic of instructional videos is the ordering constraint on the different steps belonging to a task. Some previous works [4, 3] assumed that videos of the same task share the same series of ordered steps. However, this assumption is too strict, as several steps might be abbreviated or reversed, based on the statistics shown in Section 3.3. To address this issue, we propose a new method that performs ordering refinement by leveraging the transition probabilities between different steps. Fig. 8 displays the pipeline of our approach, which consists of three stages: (1) grouping proposals, (2) ordering refinement and (3) mapping variation. We elaborate each stage in detail as follows.

Grouping Proposals into Segments. The “grouping proposals” stage can be considered as a transformation from a proposal space to a segment space. Similar to the task-consistency method, the input of our ordering-dependency approach is a set of proposals $\{p_i = (b_i, e_i, s_i)\}$ generated by an existing proposal-based action detector, where $(b_i, e_i)$ denotes the temporal interval of the $i$-th proposal and $s_i \in \mathbb{R}^N$ is the corresponding step probability score to be refined. However, these detectors usually produce a large number of proposals with many overlaps, which makes the ordering refinement hard to perform. This is because (1) the overlaps bring ambiguity when deciding which proposal occurs earlier, and (2) several overlapping proposals might actually correspond to the same step. To address this, we group the proposals into a sequence of segments in which the overlaps are eliminated. Specifically, for an input proposal $p_i$, we first generate a Gaussian-based function as below:

$$g_i(t) = s_i \cdot \exp\Big(-\frac{(t - \mu_i)^2}{2\sigma_i^2}\Big).$$
Fig. 9: Visualization of the auxiliary matrix and the transition matrix of the task “Replace a Bulb”, which contains 4 steps: “remove the light shell”, “take out the old bulb”, “install the new bulb” and “install the light shell”.

This function denotes the probabilities of the steps occurring at each time-stamp, with its mean and standard deviation set according to the proposal interval. For a whole video, we obtain the overall step score function by summing over all proposals. Then, at each time-stamp, we sum the scores of all the steps to calculate the corresponding binary actionness score. After that, we follow [80] to apply the watershed algorithm on this 1D signal to obtain a sequence of segments, where its validness is theoretically guaranteed by [54]. For the temporal interval of each segment, we calculate the step score as:


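The grouping stage described above can be sketched as follows. This is a simplified illustration under our own assumptions: each proposal contributes a Gaussian bump over its interval, the per-step functions are summed into an actionness signal, and a plain threshold stands in for the watershed algorithm of [80]; all names and parameter values here are hypothetical.

```python
import numpy as np

def group_proposals(proposals, num_steps, num_slots=100, threshold=0.1):
    """Group overlapping proposals into non-overlapping segments.

    proposals: list of (start, end, scores), with start/end in [0, 1] and
    scores a length-num_steps array of step probabilities.
    Returns a list of ((seg_start, seg_end), segment_scores) tuples.
    """
    t = np.linspace(0.0, 1.0, num_slots)            # discretized time slots
    f = np.zeros((num_steps, num_slots))            # per-step score functions
    for start, end, scores in proposals:
        mu = (start + end) / 2.0                    # Gaussian centered on the proposal
        sigma = (end - start) / 2.0 + 1e-8
        bump = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
        f += np.outer(scores, bump)
    actionness = f.sum(axis=0)                      # 1D actionness signal
    active = actionness > threshold                 # simplified stand-in for watershed
    segments, i = [], 0
    while i < num_slots:
        if active[i]:
            j = i
            while j < num_slots and active[j]:
                j += 1
            seg_scores = f[:, i:j].mean(axis=1)     # average step score in the segment
            segments.append(((t[i], t[j - 1]), seg_scores))
            i = j
        else:
            i += 1
    return segments
```

Two heavily overlapping proposals of the same step are thus merged into a single segment whose score reflects both.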
Refining Segments by Ordering Constraint. In order to refine the segments according to the intrinsic ordering-dependency of instructional videos, we leverage a transition matrix whose elements denote the probability that one step directly follows another:


To construct the transition matrix, we first introduce an auxiliary matrix by counting how many times each step occurs after another, based on all the ordered step lists appearing in the training set:


We normalize each row of the auxiliary matrix to obtain the transition matrix. The auxiliary and transition matrices of one task are presented in Fig. 9, and more examples can be found in the supplementary material. Then, for each list of segments, we refine their scores as:


where two hyper-parameters (summing to 1) balance the effects of the original score and the ordering-regularized score, and the initial distribution is the probability of different steps occurring at the first segment:


In Eqn. (9), we borrow the idea from the hidden Markov model (HMM) [6], whose state transition equation propagates a state distribution through a transition matrix; the state distribution and transition matrix correspond to the segment scores and transition matrix in our paper. Besides, in order to combine the original score and the ordering-regularized score, we utilize a weighted score fusion, which is a simple yet effective scheme similarly adopted in previous works [60, 72]. We also evaluate other fusion strategies experimentally in Section 5.1.
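The construction of the transition matrix and the HMM-style refinement of Eqn. (9) can be sketched as follows. This is a minimal illustration under our own naming: `lam1`/`lam2` stand for the two balancing hyper-parameters, and the values shown are hypothetical, not the paper's tuned ones.

```python
import numpy as np

def build_transition_matrix(step_lists, num_steps):
    """Count consecutive step pairs (auxiliary matrix A), then row-normalize to T."""
    A = np.zeros((num_steps, num_steps))
    for steps in step_lists:
        for i, j in zip(steps[:-1], steps[1:]):
            A[i, j] += 1.0                          # step j observed right after step i
    row_sums = A.sum(axis=1, keepdims=True)
    T = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    return A, T

def refine_segment_scores(seg_scores, T, lam1=0.2, lam2=0.8):
    """Blend each segment's score with the score propagated from its predecessor."""
    refined = [np.asarray(seg_scores[0], dtype=float)]
    for s in seg_scores[1:]:
        propagated = refined[-1] @ T                # HMM-style one-step transition
        refined.append(lam1 * np.asarray(s, dtype=float) + lam2 * propagated)
    return refined
```

A segment with an ambiguous score is thus pulled toward the step that most often follows its predecessor in the training lists.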

Mapping Variation back to Proposals. The "mapping variation" stage can be regarded as the inverse of "grouping proposals": it maps the score variation in the segment space back into the original proposal space for the later evaluation process. Specifically, for each segment region, we obtain the variation of its score before and after refinement. Then we distribute this variation to the proposals as:


Here the operators denote element-wise division and multiplication, and we assume that the ratio between the variations is equal to the ratio between their original values. Similarly, we calculate the variation and refine the proposal score as follows:


In practice, we multiply a Gaussian function as a regularization factor before the integration. This operation concentrates more energy on the middle of the proposal and proves more effective in the experiments. As several of the quantities involved are time-wise functions, we discretize time into M slots and use accumulation to approximate the continuous integration. We set M to 100 in our experiments empirically. Having obtained the output proposals, which are regularized by the prior ordering-dependency knowledge, we perform non-maximum suppression (NMS) to obtain the final step localization results.
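The final NMS step mentioned above can be sketched as standard greedy 1D non-maximum suppression over temporal intervals (a minimal sketch; the function name is ours, and the default threshold mirrors the 0.6 used in the experiments of Section 5.1):

```python
def temporal_nms(proposals, iou_threshold=0.6):
    """Greedy 1D NMS: keep the highest-scoring proposal, drop overlapping ones.

    proposals: list of (start, end, score) tuples.
    """
    def tiou(a, b):
        # temporal intersection-over-union of two intervals
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for p in sorted(proposals, key=lambda x: x[2], reverse=True):
        if all(tiou(p, q) <= iou_threshold for q in kept):
            kept.append(p)
    return kept
```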

Method             mAP@0.1  0.2    0.3    0.4    0.5    mAR@0.1  0.2    0.3    0.4    0.5
Random              0.03    0.03   0.02   0.01   0.01    2.57    1.79   1.36   0.90   0.50
SSN (RGB) [80]     19.39   15.61  12.68   9.97   7.79   50.33   43.42  37.12  31.53  26.29
SSN (Flow) [80]    11.23    9.57   7.84   6.31   4.94   33.78   29.47  25.62  21.98  18.20
SSN (Fusion) [80]  20.00   16.09  13.12  10.35   8.12   51.04   43.91  37.74  32.06  26.79
SSN+TC (RGB)       20.15   16.79  14.24  11.74   9.33   54.05   47.31  40.99  35.11  29.17
SSN+TC (Flow)      12.11   10.29   8.63   7.03   5.52   37.24   32.52  28.50  24.46  20.58
SSN+TC (Fusion)    20.01   16.44  13.83  11.29   9.05   54.64   47.69  41.46  35.59  29.79
TABLE II: Comparisons of the step localization accuracy (%) of the baselines and our task-consistency (TC) method on the COIN dataset.

5 Experiments

In this section, we study five different tasks for instructional video analysis: (1) step localization, (2) action segmentation, (3) proposal localization, (4) task recognition and (5) step recognition. For the first two important tasks in this field, we evaluate various existing approaches to provide a benchmark on our COIN dataset. For the other three tasks, our goal is to compare the difficulty of COIN with that of other related datasets using the same method. We also test our task-consistency and ordering-dependency methods on the step localization task. In addition, we conduct experiments to evaluate whether COIN can help train models on other instructional datasets. The following describes the details of our experiments and results.

5.1 Evaluation on Step Localization

Implementation Details: In this task, we aim to localize a series of steps and recognize their corresponding labels given an instructional video. We mainly evaluate the following approaches: (1) Random. We uniformly segmented the video into three intervals and randomly assigned a label to each interval. (2) SSN [80]. This is an effective model for action detection, which outputs the same type of results (an interval and a label for each action instance) as step localization. We utilized the PyTorch implementation with BN-Inception as the backbone. We followed the default setting to sample 9 snippets from each segment. We used the SGD optimizer to train the model with an initial learning rate of 0.001. The training process lasted for 24 epochs, and the learning rate was scaled down by 0.1 at the 10th and 20th epochs. The NMS threshold was set to 0.6. The reported results are based on inputs of different modalities: SSN (RGB), SSN (Flow) and SSN (Fusion). Here SSN (Flow) adopted the optical flow calculated by [78], and SSN (Fusion) combined the predicted scores of SSN (RGB) and SSN (Flow). (3) SSN+TC, SSN+OD, SSN+ODTC, SSN+TCOD. We tested these methods to demonstrate the advantages of the proposed methods that exploit the task-consistency (TC) and ordering-dependency (OD) of instructional videos. For clarification, +ODTC denotes first performing the ordering-dependency regularization and then executing the task-consistency method, while +TCOD is the other way round. (4) R-C3D [74], BSN [39] and BMN [38] with TC and OD. We further plugged our TC and OD methods into these action detection models to verify their generalization ability. Since BMN and BSN were originally designed for temporal action proposal generation, we processed the proposals generated by them with the classifier of SSN to produce the final results.

Fig. 10: Visualized results of step localization. The videos belong to the tasks "paste screen protector on Pad" and "install the bicycle rack".

Evaluation Metrics: As the results of step localization contain time intervals, labels and confidence scores, we employed Intersection over Union (IoU) as a basic metric to determine whether a detected interval is a true positive. The IoU is defined as |G ∩ D| / |G ∪ D|, where G denotes the ground-truth action interval and D denotes the detected action interval. We followed [10] to calculate Mean Average Precision (mAP) and Mean Average Recall (mAR). The results are reported under IoU thresholds ranging from 0.1 to 0.5.
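The temporal IoU criterion above reduces to a few lines of interval arithmetic (a straightforward sketch; the function name is ours):

```python
def temporal_iou(gt, det):
    """IoU between a ground-truth interval G and a detected interval D.

    Each interval is a (start, end) pair on the time axis.
    """
    inter = max(0.0, min(gt[1], det[1]) - max(gt[0], det[0]))   # |G ∩ D|
    union = (gt[1] - gt[0]) + (det[1] - det[0]) - inter          # |G ∪ D|
    return inter / union if union > 0 else 0.0
```

A detection is counted as positive when this value exceeds the chosen threshold (0.1 to 0.5 in the experiments).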

Baseline Results: Table II presents the experimental results, which reveal the great challenge of performing step localization on the COIN dataset. Even the state-of-the-art method SSN only attains 8.12% and 26.79% on mAP@0.5 and mAR@0.5 respectively.

Analysis on Task-consistency Method: For the task-consistency method, we observe in Table II that SSN+TC improves the performance over the baseline models, which illustrates the effectiveness of our proposed method in capturing the dependencies among different steps. We show several visualization results of different methods and the ground truth in Fig. 16. As an example, we analyze an instructional video of the task "paste screen protector on Pad" as follows. When applying our task-consistency method, we can discard those steps which do not belong to this task, e.g., "line up a screen protector with cellphone" and "open the slot of SIM card"; hence more accurate step labels can be obtained.

As discussed in Section 4.1, during the bottom-up aggregation period, there are two approaches to calculate the task score: (1) summing the scores of the steps belonging to the task, or (2) averaging them according to the number of steps. We compare the results in Table III: the summing strategy achieves better performance than the averaging strategy under both the mAP and mAR metrics. The two strategies effectively weigh all steps or all tasks equally, respectively, and the summing method proves more effective on COIN.
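The two bottom-up aggregation strategies can be sketched as follows (a hypothetical illustration; the step scores, task-to-step mapping, and function name are all ours, not the paper's actual data):

```python
import numpy as np

def task_scores(step_scores, task_to_steps, mode="sum"):
    """Aggregate step scores into task scores bottom-up.

    step_scores: per-step probability scores for a video.
    task_to_steps: dict mapping a task id to the indices of its steps.
    mode: "sum" to add the step scores, anything else to average them.
    """
    scores = {}
    for task, steps in task_to_steps.items():
        vals = np.asarray([step_scores[s] for s in steps])
        scores[task] = vals.sum() if mode == "sum" else vals.mean()
    return scores
```

Note that the two strategies can disagree: a task with one strong step may win under averaging while a task with several moderate steps wins under summing, which is exactly the trade-off compared in Table III.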

Method     SSN     +TC (sum)  +TC (ave)
mAP@0.5     7.79    9.33       9.16
mAR@0.5    26.29   29.17      29.02
TABLE III: Study of different approaches for calculating the task score in the task-consistency method on the COIN dataset.
Method        RGB    Flow   Fusion
SSN [80]      19.39  11.23  20.00
SSN+TC        20.15  12.11  20.01
SSN+OD        20.39  11.82  20.95
SSN+TCOD      21.82  13.09  21.77
SSN+ODTC      23.23  13.25  22.89
TABLE IV: Analysis on our proposed TC and OD methods on the COIN dataset. The results are based on mAP under the IoU threshold of 0.1.

Analysis on Ordering-dependency Method: As shown in Table IV, our proposed ordering-dependency (OD) method surpasses the baseline model on both the mAP and mAR metrics, which verifies its effectiveness for the step localization task. Moreover, when combined with the task-consistency (TC) method, it achieves further improvements. Specifically, ODTC (first performing OD, then executing TC) achieves the best result of 23.23% accuracy on the RGB modality, slightly outperforming TCOD (first performing TC, then executing OD). Some visualization results are also shown in Fig. 16. In most cases, our proposed methods obtain more accurate results. However, in the first task, the results of SSN+ODTC and SSN+TCOD are a little worse than that of SSN+OD due to the nebulous boundaries of some steps.

Weight of original score    0      0.2    0.4    0.6    0.8    1
Weight of ordering score    1      0.8    0.6    0.4    0.2    0
COIN                        20.95  20.76  20.58  20.44  20.27  20.01
Breakfast                   26.71  27.61  28.17  28.71  29.05  28.24
TABLE V: Study of the two balancing hyper-parameters for step localization on the COIN and Breakfast datasets. The results are all based on SSN+OD under mAP@0.1.

Besides, the ordering-dependency method involves two hyper-parameters that trade off the original score against the ordering-regularized score. We explore different weight combinations on both the COIN and Breakfast datasets and present the results in Table V. We observe that on the COIN dataset, the peak is reached when the ordering-regularized score receives full weight, suggesting that the ordering-dependency information is more important. For the Breakfast dataset, however, we achieve the best result when the original score is weighted at 0.8, which indicates that the original score obtained from the appearance information contributes more.

In Eqn. (5), we use a Gaussian-based function for an input proposal. Here we study Gaussian distributions with different standard deviation factors, where the factor scales the standard deviation derived from the proposal length. Besides, we explore a Triangle distribution as an alternative:


We show the experimental results in Table VI, which indicate that the Gaussian-based distribution with a standard deviation factor of 1 is a proper choice.

In Eqn. (9), we adopt the weighted score fusion method to refine the segment scores, while here we explore other methods. Specifically, we denote the original score and the ordering-regularized score as follows:

Study of different generated distributions for calculating scores
Distribution  (G, 0.5)  (G, 1)  (G, 2)  (G, 5)  T
mAP@0.1       20.07     20.39   20.13   19.81   19.77

Study of different methods to refine the segment scores
Method        max-pool
mAP@0.1       20.30     20.09   20.36   19.84

Study of the number of time slots M
M             50     100    150    200
mAP@0.1       19.87  20.39  20.55  20.58

TABLE VI: Analysis of the OD method. Experiments are conducted on the RGB modality of the COIN dataset. G and T denote the Gaussian-based distribution and the Triangle distribution respectively, while the second element in parentheses denotes the standard deviation factor. See the text for the definitions of other variables and more details.

We explore different approaches to fuse these two scores. As Table VI shows, two of the fusion strategies achieve better results for refining the segment scores.

Moreover, we further present the evaluation results for the number of time slots M in Table VI. It can be seen that a finer-grained division with larger M leads to better performance. In this paper, we use M = 100 to make a good trade-off between effectiveness and efficiency, as a larger M would also bring more computational cost.

Domain mAP Domain mAP
nursing & caring 22.92 vehicles 19.07
science & craft 16.59 electric appliances 19.86
leisure & performance 24.32 gadgets 17.99
snacks & drinks 19.79 dishes 23.76
plants & fruits 22.71 sports 30.20
household items 19.07 housework 20.70
TABLE VII: Comparisons of the step localization accuracy (%) over the 12 domains of the COIN dataset. We report the results obtained by SSN+TC under the IoU threshold of 0.1.

Discussion: We provide some further discussions as below.

(1) What are the hardest and easiest domains for instructional video analysis? In order to provide a more in-depth analysis of the COIN dataset, we report the performance of SSN+TC across the 12 domains. Table VII presents the comparison results, where the domain "sports" achieves the highest mAP of 30.20%. This is because the differences between the "sports" steps are clearer, making them easier to identify. In contrast, the results of "gadgets" and "science & craft" are relatively low. The reason is that the steps in these two domains usually have high similarity with each other. For example, the step "remove the tape of the old battery" is similar to the step "take down the old battery". Hence it is harder to localize the steps in these two domains. For a comparison of performance across different tasks, please refer to the supplementary material.

(2) Can the proposed task-consistency and ordering-dependency methods be applied to other action detection models? Since our proposed TC and OD are two plug-and-play methods, we further validate them on the R-C3D [74], BSN [39] and BMN [38] models. From Table VIII we can see that both TC and OD improve the performance of these various base models, which further demonstrates the effectiveness of our proposed methods.

Basic Model R-C3D [74] BSN [39] BMN [38]
Baseline 9.85 18.91 18.60
Baseline+TC 10.32 19.96 19.27
Baseline+OD 10.08 20.46 19.68
TABLE VIII: Study of the TC and OD approaches on different basic models. The results are reported based on the RGB modality in the COIN dataset under mAP@0.1 (%).

(3) Can the proposed task-consistency and ordering-dependency methods be applied to other instructional video datasets? In order to demonstrate the effectiveness of our proposed methods, we further conduct experiments on the "Breakfast" dataset [35], which is also widely used for instructional video analysis. The Breakfast dataset contains over 1.9k videos, spanning 77 hours and about 4 million frames. Each video is labelled with a subset of 48 cooking-related action categories. Following the default setting, we use split 1 as the testing set and the other splits as the training set. Similar to COIN, we employ SSN [80], a state-of-the-art method for action detection, as the baseline under the step localization setting. As shown in Table IX, our proposed task-consistency and ordering-dependency methods improve the performance of the baseline model, which further shows their advantage in modeling the dependencies of different steps in instructional videos. Besides, ODTC achieves better results than TCOD, which illustrates that performing OD first is more effective.

Metrics mAP@ mAR@
Threshold 0.1 0.3 0.5 0.1 0.3 0.5
SSN[80] 28.24 22.55 15.84 54.86 45.84 35.51
SSN+TC 28.25 22.73 16.39 55.51 47.37 36.20
SSN+OD 29.05 22.61 15.90 55.81 45.85 35.79
SSN+TCOD 30.89 24.35 16.87 57.80 48.84 36.51
SSN+ODTC 30.91 24.45 16.94 57.91 49.33 36.78
TABLE IX: Comparisons of the step localization accuracy (%) on the Breakfast dataset. The results are all based on the combination scores of RGB frames and optical flows.

5.2 Evaluation on Action Segmentation

Implementation Details: The goal of this task is to assign each video frame a step label. We present the results of three types of approaches as follows. (1) Random. We randomly assigned a step label to each frame. (2) Fully-supervised method. We used a VGG16 network pre-trained on ImageNet, and fine-tuned it on the training set of COIN to predict frame-level labels. (3) Weakly-supervised approaches. In this setting, we evaluated the recently proposed Action-Sets [51], NN-Viterbi [52] and TCFPN-ISBA [16] without temporal supervision. For Action-Sets, only the set of steps within a video is given, while the occurring order of the steps is also provided for NN-Viterbi and TCFPN-ISBA. We used frames or their representations sampled at 10 fps as input. We followed the default training and inference pipelines of Action-Sets [51], NN-Viterbi [52] and TCFPN-ISBA [16]. However, these methods use frame-wise Fisher vectors as the video representation, which incurs huge computation and storage costs on the COIN dataset. To address this, we employed a bidirectional LSTM on top of a VGG16 network to extract dynamic features of a video sequence [17].

Evaluation Metrics: We adopted frame-wise accuracy (FA), which is a common benchmarking metric for action segmentation. It is computed by first counting the number of correctly predicted frames, and dividing it by the number of total video frames.
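The frame-wise accuracy metric amounts to a single ratio (a trivial sketch; the function name is ours):

```python
def frame_accuracy(pred, gt):
    """Frame-wise accuracy (FA): fraction of frames whose predicted label
    matches the ground-truth label."""
    assert len(pred) == len(gt), "prediction and ground truth must align per frame"
    correct = sum(p == g for p, g in zip(pred, gt))  # correctly predicted frames
    return correct / len(gt)                          # divided by total frames
```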

Method Frame Acc. Setting
Random 0.13 -
CNN [59] 25.79 fully-supervised
Action-Sets[51] 4.94 weakly-supervised
NN-Viterbi[52] 21.17 weakly-supervised
TCFPN-ISBA[16] 34.30 weakly-supervised
TABLE X: Comparisons of the action segmentation accuracy (%) on the COIN.

Results: Table X shows the results of action segmentation on the COIN. Given the weakest supervision of video transcripts without ordering constraints, Action-Sets [51] achieves 4.94% frame accuracy. When taking the ordering information into account, NN-Viterbi [52] and TCFPN-ISBA [16] outperform Action-Sets by large margins of 16.23% and 29.66% respectively. As a fully-supervised method, CNN [59] reaches an accuracy of 25.79%, which is much higher than Action-Sets. This is because the CNN utilizes the label of each frame to perform classification, a much stronger supervision than that of Action-Sets. However, as the temporal information and ordering constraints are ignored, the result of the CNN is inferior to TCFPN-ISBA.

5.3 Comparison with Other Video Analysis Datasets

YouCook2 COIN YouCook2 COIN
mAP 38.05 39.67 mAR 50.04 56.16
TABLE XI: Comparisons of the proposal localization accuracy (%) with the YouCook2 dataset [82]. The results are obtained by the temporal actionness grouping (TAG) method [80] under the IoU threshold of 0.5.

In order to assess the difficulty of COIN, we report the performance on different tasks compared with other datasets.

Proposal Localization: As defined in [82], proposal localization aims to segment an instructional video into category-independent procedure segments. For this task, we evaluated COIN and YouCook2 [82] based on the temporal actionness grouping (TAG) approach [80]. We followed the experimental setting in [82] to generate 10 proposals for each video, and report the results of mAR@0.5 and mAP@0.5. From the results in Table XI, we observe that the mAP and mAR of the same method are lower on the YouCook2 dataset, which indicates that it is more challenging than COIN for the proposal localization task.

Video Classification: For video classification on COIN, we conducted experiments on task recognition and step recognition. Task recognition takes a whole video as input and predicts the task label, referring to the second level of the lexicon. Step recognition takes trimmed segments as input and outputs the step label, corresponding to the third level of the lexicon. We employed the temporal segment network (TSN) model [72], which is a state-of-the-art method for video classification. As shown in Table XII, the task recognition accuracy on COIN is 73.36%, suggesting its general difficulty in comparison with other datasets. However, the step recognition accuracy is only 36.46%, as this task requires discriminating different steps at a finer level. Besides, since the step recognition task is performed on clips trimmed by the ground-truth temporal intervals, its accuracy can be considered as a reference for step localization, a highly related and more complex task.

Step Localization: For action detection or step localization, we compare the performance of the structured segment networks (SSN) approach [80] on COIN and three other datasets in Table XIII. THUMOS14 [32] and ActivityNet [27] are conventional datasets for action detection, on which the detection accuracies are relatively high. The Breakfast [35] and COIN datasets contain instructional videos of greater difficulty; hence the performance on these two datasets is lower. Especially on our COIN, the mAP@0.5 is only 8.12%. We attribute the low performance to two aspects: (1) the step intervals are usually shorter than action instances, which brings more challenges for temporal localization; (2) some steps within the same task are similar, which carries ambiguous information for the recognition process. These two phenomena are also common in real-world scenarios, and future works are encouraged to address these two issues.

Dataset Accuracy
UCF101 [63] 97.00
ActivityNet v1.3 [27] 88.30
Kinetics [7] 73.90
COIN (task recognition) 73.36
COIN (step recognition) 36.46
TABLE XII: Comparisons of the video classification performance (%) on different datasets. The reported results are based on temporal segment networks (TSN) model [72].
Dataset mAP
THUMOS14 [32] 29.10
ActivityNet v1.3 [27] 28.30
Breakfast [35] 15.84
COIN 8.12
TABLE XIII: Comparisons of the action detection/step localization performance (%) on different datasets. The reported results are based on the structured segment networks (SSN) method [80] under the IoU threshold of 0.5.

5.4 Cross Dataset Transfer

In order to see whether COIN could benefit other instructional video datasets from different domains, we further study the cross-dataset transfer setting. Under the "pre-training + fine-tuning" paradigm, we conducted experiments on the step localization task on the Breakfast [35], JIGSAWS [24] and UNLV-Diving [49] datasets. The Breakfast dataset is based on cooking activities, while the JIGSAWS dataset consists of three surgical tasks. In comparison, the UNLV-Diving dataset [49] contains a completely different task (diving) in a very different environment (a natatorium) compared with the tasks in COIN. This dataset was originally collected for action quality assessment, where we employ the annotations of three steps, "jumping", "dropping" and "entering into water", to perform step localization.

For the Breakfast and UNLV-Diving datasets, we split the training and testing sets following [35, 49]. For the JIGSAWS dataset, we evaluate the performance on split 1 of the leave-one-user-out setting suggested in [24]. In the cross-dataset transfer experiments, we used two pre-trained models, "Kinetics" and "Kinetics + COIN", and fine-tuned each model on the target datasets. Since the Breakfast and JIGSAWS datasets are both self-collected and the UNLV-Diving dataset was created from professional sports videos, there should not be any duplicate videos with COIN and Kinetics, which are sourced from YouTube. Specifically, "Kinetics + COIN" denotes that we first trained a TSN model [72] for the step recognition task on COIN, where the backbone of the TSN model was pre-trained on the Kinetics dataset. Then we transferred the backbone to the SSN model [80] for the Breakfast, JIGSAWS or UNLV-Diving dataset. Besides, we include results based on two state-of-the-art methods (i.e., BMN [38] and BSN [39]) to see whether the improvements are subtle or significant.

Breakfast Dataset [35] (32 similar tasks in COIN), mAP @
Threshold 0.1 0.2 0.3 0.4 0.5 Average
BSN [39] 25.01 23.96 21.30 19.16 16.46 21.17
BMN [38] 24.40 23.19 21.38 19.49 16.88 21.07
SSN* 25.78 23.38 19.85 16.77 13.43 19.84
SSN** 27.47 24.98 21.14 18.29 14.94 21.36
JIGSAWS Dataset [24] (13 similar tasks in COIN), mAP @
Threshold 0.1 0.2 0.3 0.4 0.5 Average
BSN [39] 40.58 32.25 28.08 22.84 18.35 28.42
BMN [38] 36.41 36.34 32.66 27.24 23.23 31.18
SSN* 30.21 22.51 14.00 9.88 7.18 16.76
SSN** 31.01 24.08 15.79 12.38 6.30 17.91
UNLV-Diving Dataset [49] (0 similar task in COIN), mAP @
Threshold 0.1 0.2 0.3 0.4 0.5 Average
BSN [39] 67.56 63.44 58.62 56.28 52.61 59.70
BMN [38] 75.87 73.55 67.15 62.00 54.23 66.56
SSN* 73.00 54.07 33.26 32.44 32.44 45.04
SSN** 73.73 54.80 34.01 33.25 33.25 45.81
TABLE XIV: Comparisons of the cross dataset transfer accuracy (%) on three datasets based on the RGB modality. * and ** denote the model used “Kinetics” or “Kinetics+COIN” strategy as we introduce in the text.

In Table XIV, we present the experimental results and the number of similar tasks in COIN for the three datasets. "Similar tasks" here refers to tasks with similar human-object interaction behaviors. We observe that the model pre-trained on "Kinetics + COIN" achieves consistent improvements over the one pre-trained only on Kinetics. However, the gains on the UNLV-Diving dataset are very slight (e.g., only a 0.77% improvement in average mAP) because of the large difference from the tasks in the COIN dataset. In comparison, the improvements on the Breakfast dataset are much more promising; the reason is that there are 32 food-related tasks in the COIN dataset, which are similar to the tasks in the Breakfast dataset. The JIGSAWS dataset contains three surgical activities, "suturing", "knot-tying" and "needle-passing", and there are about 13 similar tasks in the COIN dataset; accordingly, the improvements on the JIGSAWS dataset fall in between. From the results on these three datasets, we further conclude that the annotations and data from COIN lead to better results on datasets that contain more similar tasks. This makes sense and has also been verified on image-based datasets [76]. Moreover, COIN does not hinder the performance when applied to other, unrelated tasks.

Besides, we have also attempted to evaluate a model pre-trained only on COIN. However, when training the TSN model from scratch on COIN, we found the performance was inferior, because the samples of each step in the COIN dataset are still limited. The reason that "Kinetics + COIN" works better than Kinetics alone is that COIN contains step annotations at a finer level, which are more helpful for instructional videos where fine-grained actions occur.

6 Future work

Finally, we discuss several potential directions for future work based on our COIN dataset.

(1) Mining shared information across different steps. In our COIN dataset, we define steps at the task level, so no steps are shared between tasks. Though some steps in different tasks might be similar, we still assign them different step labels because of the differences in the interacted objects and the manners of interaction. As mentioned in [84], the components shared by different steps across tasks are useful cues for analyzing instructional videos. It is interesting to explore the shared knowledge across different tasks from two aspects: (i) leveraging the similarity information between tasks for step localization; (ii) redefining the steps at the dataset level to merge similar steps across different tasks.

(2) End-to-end training. Since our methods refine the proposal scores during the inference stage, one might ask whether end-to-end training would help. Theoretically, end-to-end training with our proposed methods would bring further improvements. However, the bottleneck is the computational cost during implementation, since the frames of hundreds of proposals in a video would need to be processed at the same time. It is desirable to explore effective ways of end-to-end training, or other methods (e.g., considering task and step labels simultaneously via multi-task learning), in the future.

(3) Semi-supervised, unsupervised or self-supervised learning for step localization. Since temporal annotations are expensive, settings beyond fully-supervised learning are worth studying for step localization on the COIN dataset, for example: (i) a semi-supervised setting based only on the step labels, with [52] or without [51] ordering information; (ii) an unsupervised setting where some auxiliary information is further provided (e.g., the narration associated with the original video, which can be obtained via an Automatic Speech Recognition (ASR) system [4, 44]); and (iii) a self-supervised setting which leverages the intrinsic long-term dependencies of instructional videos (e.g., the recently proposed VideoBERT model [65]). Note that though our temporal annotations can be absent during the training phase under these settings, they remain essential for evaluation.

(4) Other tasks for instructional video analysis. As mentioned in Section 2.1, there are various tasks for instructional video analysis. Based on the existing annotations, our COIN dataset can be used to evaluate other tasks such as activity anticipation [22] or procedure planning [9]. It can also be utilized for skill assessment [18, 68] and visual object grounding [30] if the corresponding annotations are made available.

7 Conclusions

In this paper, we have introduced COIN, a new large-scale dataset for comprehensive instructional video analysis. Organized in a rich semantic taxonomy, the COIN dataset covers broader domains and contains more tasks than most existing instructional video datasets. In order to establish a benchmark, we have evaluated various approaches under different scenarios on COIN. In addition, we have explored the relationships among the different steps of a specific task based on the task-consistency and ordering-dependency characteristics of instructional videos. The experimental results have revealed the great challenges of COIN and demonstrated the effectiveness of our methods. It has also been shown that COIN can contribute to the step localization task on other instructional video datasets.


Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, and Grant 61672306, in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564, and in part by the Tsinghua University Initiative Scientific Research Program.

The authors would like to thank Dajun Ding and Lili Zhao from Meitu Inc. for their help with computing resources and annotation, Danyang Zhang, Yu Zheng and Xumin Yu for conducting part of the experiments, and Yongming Rao for valuable discussions.

Appendix A A brief review of some instructional video datasets

We briefly review some representative datasets as follows:

(1) The MPII [55] and Breakfast [35] datasets are two early self-collected instructional video datasets. As a pioneering work, Rohrbach et al. proposed the MPII dataset, which contained 44 long videos, providing annotations of 5,609 clips over 65 fine-grained cooking activities. Later, Kuehne et al. introduced the Breakfast dataset, which consisted of 1,989 videos covering 10 cooking activities (tasks) performed by 52 participants; its annotation also contained the temporal intervals of 48 action units (steps). The main purpose of these two datasets is to detect the main steps and recognize their labels.

(2) The YouCook [12] and YouCook2 [82] datasets were collected by downloading cooking videos from YouTube. In 2013, Das et al. introduced the YouCook dataset, which consisted of 88 videos, each annotated with at least three sentences by MTurk participants. More recently, in 2018, Zhou et al. collected the YouCook2 dataset of 2,000 videos, which were labelled with the temporal intervals of different steps and their recipe captions. As an extension, they further provided object-level annotations [81]. Both the YouCook and YouCook2 datasets can be used for the video captioning task, and YouCook2 further facilitates research on procedure segmentation [82] and video object grounding [81, 30].

(3) The “5 tasks”[3] and HowTo100M[45] datasets were collected for unsupervised learning from narrations. The “5 tasks” dataset [3] consisted of 150 videos on 5 tasks (changing a tire, performing CPR, repotting a plant, making coffee and jump-starting a car). Each video was associated with a sequence of natural description text, generated from the corresponding audio modality. The goal of this dataset was to automatically discover the main steps of a certain task and locate them in the videos in an unsupervised setting. More recently, Miech et al.[45] introduced another large-scale dataset, HowTo100M, which contained 136 million clips sourced from 1.238 million instructional videos. These video clips were paired with a list of text chunks based on their corresponding time intervals. With its large-scale data, HowTo100M greatly promoted the development of pre-trained text-video joint embedding models for various vision-language tasks. However, as the authors mentioned, since the annotations were automatically generated from the narrations, there may be many incoherent samples: for example, the narration may be unrelated to the video, or may describe something before or after it happens in the video. Hence, our manually labelled COIN dataset is complementary to HowTo100M in this respect.

(4) The EPIC-Skills[18] and BEST[19] datasets were constructed for skill determination, which refers to assessing how skillfully a subject accomplishes a task. For both datasets, AMT workers were asked to watch the videos in a pair-wise manner and select the video that displayed more skill. The BEST dataset also provided the annotators' initial opinion of each video as “Beginner”, “Intermediate” or “Expert”.

(5) The CrossTask[84] dataset contained 4.7k videos of 83 tasks, and the goal was to localize the steps in the videos with weak supervision (i.e., instructional narrations and an ordered list of steps). Specifically, this dataset was proposed to exploit components shared by steps across different tasks. For example, the task “pour egg” would benefit from other tasks involving “pour” and “egg”.

Task Samples FM (mm:ss) VM (mm:ss)
Assemble Bed 6 6:55 23:30
Boil Noodles 5 3:50 18:15
Lubricate A Lock 2 1:23 5:29
Make French Fries 6 5:57 20:24
Change Mobile Phone Battery 2 2:23 7:35
Replace A Bulb 2 1:30 6:40
Plant A Tree 2 1:45 6:37
Total 25 23:43 88:30
TABLE XV: Comparison of the annotation time cost under two modes. FM denotes the newly developed frame mode, and VM denotes the conventional video mode.

Appendix B Annotation Time Cost Analysis

In section 3.2, we have introduced a toolbox for annotating the COIN dataset. The toolbox has two modes: frame mode and video mode. The frame mode is newly developed for efficient annotation, while the video mode is frequently used in previous works [34]. We evaluated the annotation time on a small subset of COIN, which contains 25 videos of 7 tasks. Table XV compares the annotation time under the two modes. We observe that the annotation time under the frame mode is only 26.8% of that under the video mode, which shows the advantage of our toolbox.
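As a quick check of the 26.8% figure, the totals in Table XV (23:43 under the frame mode vs. 88:30 under the video mode) can be compared directly; the helper name `to_seconds` is ours:

```python
# Sanity-check of the 26.8% figure in Table XV: total annotation time
# is 23:43 (frame mode) vs. 88:30 (video mode), both given as mm:ss.

def to_seconds(mmss: str) -> int:
    """Convert an mm:ss string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

fm_total = to_seconds("23:43")   # frame mode total
vm_total = to_seconds("88:30")   # video mode total
print(f"{fm_total / vm_total:.1%}")  # -> 26.8%
```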

Fig. 11: The browse time distributions of the selected 180 tasks on YouTube.

Appendix C Browse Times Analysis

In order to verify that the selected tasks meet the needs of website viewers, we display the number of browse times across the 180 tasks in Fig. 11. We searched “How to” + the name of each of the 180 tasks (e.g., “How to Assemble Sofa”) on YouTube, and summed up the browse times of the videos appearing on the first page (about 20 videos) to obtain the final results. “Make French Fries” is the most-viewed task. These results demonstrate that the selected tasks of our COIN dataset satisfy the needs of website viewers, and also reveal the practical value of instructional video analysis.

Fig. 12: Visualization of the auxiliary matrix A and the transition matrix T of three tasks. Recall that the auxiliary matrix A is constructed by counting, over all the ordered step lists appearing in the training set, the number of times step j occurs after step i. We normalize each row of A to obtain the transition matrix T, in which the element T_ij denotes the probability that step j occurs after step i. The step lists of these three tasks are: (1) cut potato into strips, soak them in water, dry strips, put in the oil to fry; (2) cone the leaves, add ingredients into cone, fold the leaves, tie the zongzi tightly; (3) take out the screws, remove the old toilet seat, install the new toilet seat, screw on the screws.

Appendix D Visualization of matrices

In section 4.2, we introduce the transition matrix and the corresponding auxiliary matrix for the OD method. Here we visualize the matrices of three different tasks in Fig. 12; the definitions of the two matrices are detailed in the caption of the figure.
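As a rough sketch (our own illustration, not the authors' code; the variable names and toy step lists are ours), such matrices could be computed from the ordered step lists of the training set as follows:

```python
import numpy as np

# Toy ordered step lists: steps are indexed 0..3, one list per training video.
step_lists = [
    [0, 1, 2, 3],
    [0, 2, 1, 3],
    [0, 1, 2, 3],
]
num_steps = 4

# A[i, j] counts how many times step j occurs immediately after step i.
A = np.zeros((num_steps, num_steps))
for steps in step_lists:
    for i, j in zip(steps[:-1], steps[1:]):
        A[i, j] += 1

# Row-normalize A to obtain T; T[i, j] is the probability that step j
# follows step i (rows with no outgoing transitions are left as zeros).
row_sums = A.sum(axis=1, keepdims=True)
T = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
print(T[0])  # transition probabilities out of step 0
```

In this toy example, step 1 follows step 0 in two of the three lists, so T[0, 1] = 2/3.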

Appendix E Sample distribution of COIN

We present the sample distributions of all the tasks in COIN in Fig. 13. To alleviate the effect of long tails, we make sure that there are more than 39 videos for each task.

Fig. 13: The sample distributions of all the tasks in COIN. The blue bars and the grey bars indicate the number of training and testing videos in each class respectively.
Task Goal Evaluation Metrics Evaluated Methods
step localization localize the step boundary and predict the step label mAP/mAR (interval level) R-C3D[74], SSN[80], TC(ours), OD(ours)
action segmentation assign each frame a step label accuracy (frame level) Action-Sets[51], NN-Viterbi[52], TCFPN-ISBA[16]
proposal localization localize the step boundary mAP/mAR (interval level) TAG[80]
task recognition recognize the task label accuracy (video level) TSN[72]
step recognition recognize the step label given the step boundary accuracy (interval level) TSN[72]
TABLE XVI: Clarification of the different tasks evaluated on the COIN dataset.

Appendix F Wordles

We show the wordles of the annotations of COIN in Fig. 14. On average, each step is described by a phrase of 4.84 words.

Appendix G Clarification of different tasks evaluated on the COIN

In the experiment section, we evaluate five tasks (i.e., step localization, action segmentation, procedure localization, task recognition and step recognition) on the COIN dataset. For clarification, we present the goal, metric, and evaluated methods for each task in Table XVI.
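The interval-level metrics in Table XVI (mAP/mAR at thresholds 0.1–0.5) match predicted step intervals to ground truth by temporal intersection-over-union. A minimal sketch of that overlap computation, assuming intervals are (start, end) pairs in seconds (the function name is ours):

```python
def temporal_iou(pred, gt):
    """IoU of two temporal intervals, each given as a (start, end) pair."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as correct at threshold t only if its IoU with a
# ground-truth interval is at least t (and, for mAP, the label matches).
print(temporal_iou((10.0, 20.0), (15.0, 30.0)))  # -> 0.25
```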

Fig. 14: Wordles of the annotations of COIN. The figure of tasks (the second-level tags) is shown on the left, while the figure of steps (the third-level tags) is presented on the right.
Metrics mAP @ mAR @
0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5
0 19.95 16.64 14.12 11.69 9.30 52.00 45.60 39.74 34.26 28.55
20.12 16.77 14.23 11.79 9.38 53.38 46.79 40.69 34.98 29.14
20.15 16.79 14.24 11.74 9.33 54.05 47.31 40.99 35.11 29.17
20.10 16.76 14.19 11.71 9.30 54.48 47.51 41.09 35.36 29.36
TABLE XVII: Study of the attenuation coefficient on the COIN dataset.
Metrics mAP @ mAR @
0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5
50 19.87 15.97 12.95 10.22 7.98 51.19 43.73 37.33 31.84 26.58
100 20.39 16.35 13.20 10.38 8.13 51.51 43.87 37.37 31.81 26.64
150 20.55 16.37 13.20 10.36 8.10 51.67 43.92 37.45 31.86 26.65
200 20.58 16.42 13.20 10.41 8.15 51.84 44.04 37.38 31.74 26.59
TABLE XVIII: Study of the number of time slots M on the COIN dataset.

Appendix H Experiments on the hyper-parameters

Table XVII presents the results for the attenuation coefficient (introduced in section 4.1). On the one hand, the coefficient cannot be too large, since it serves to attenuate scores in the TC method. On the other hand, it cannot be too small: if the selected task is wrong (sometimes the scores of two different tasks are close), the steps of other tasks should still be taken into consideration. Experiments show that a moderate value is the most effective in our case.
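For intuition, a toy sketch of the attenuation idea (our own illustration, not the exact formulation of section 4.1; the names `attenuate` and `alpha` are ours): scores of steps outside the predicted task are scaled down by the coefficient rather than zeroed, so a wrong task prediction is not catastrophic.

```python
import numpy as np

def attenuate(step_scores, task_step_ids, alpha=0.1):
    """Scale down the scores of steps that do not belong to the predicted task."""
    scores = np.asarray(step_scores, dtype=float).copy()
    keep = np.zeros_like(scores, dtype=bool)
    keep[list(task_step_ids)] = True   # steps inside the predicted task
    scores[~keep] *= alpha             # attenuate steps of other tasks
    return scores

# Steps 0 and 1 belong to the predicted task; step 2 does not.
print(attenuate([0.9, 0.8, 0.3], task_step_ids=[0, 1], alpha=0.1))
```

With alpha = 0 this degenerates into a hard assignment to the predicted task, which is why, as noted above, too small a value hurts when the task prediction is wrong.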

Table XVIII presents the evaluation results for the number of time slots M (introduced in section 4.2). It can be seen that a finer-grained division with a larger M leads to better performance. In this paper, we choose M to strike a good trade-off between effectiveness and efficiency, as a larger M also brings more computational cost.

Fig. 15: Visualization of different step scores in f(t) and the summed score a(t). The video belongs to the task “Use Toaster”. In the above plots, “put a slice of bread in” and “run the toaster and adjust” are two steps of this task, while “stick with the tap and wrap” is not.

Appendix I Visualization of f(t)

In section 4.2, f(t) denotes the scores of different steps, which indicate how likely each step occurs at time-stamp t. We visualize f(t) for several steps when evaluating the OD method in Fig. 15. The summed score, denoted as a(t) in our paper, is also presented.
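As a toy illustration (the values and step labels are ours, not taken from the paper), the summed score a(t) is obtained by summing f(t) over steps at each time-stamp:

```python
import numpy as np

# f: per-step scores over 4 time-stamps (rows = steps, columns = time).
f = np.array([
    [0.1, 0.7, 0.8, 0.2],   # e.g. "put a slice of bread in" (illustrative)
    [0.0, 0.1, 0.6, 0.9],   # e.g. "run the toaster and adjust"
    [0.2, 0.1, 0.1, 0.0],   # an unrelated step
])
a = f.sum(axis=0)           # summed score a(t) at each time-stamp t
print(a)
```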

Appendix J Analysis on the Watershed Algorithm

In section 4.2, we employ the watershed algorithm [80, 54] to obtain a sequence of segments when grouping proposals. In particular, we iterate the threshold on the actionness score a(t) from a high value to a low value until the termination criterion is met, and regard the intervals where a(t) is larger than the threshold as action segments.

Here we study two types of termination criteria. The first stops the iteration once the average temporal gap between different segments becomes smaller than a specific value; the second checks whether the average temporal length of the segments is greater than a specific value. Table XIX shows the experimental results of these two strategies. In our main paper, we use an average-gap threshold of 6 time slots as the termination criterion of the watershed algorithm.
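The thresholding loop with the average-gap termination criterion can be sketched as follows (our own simplified illustration, not the authors' implementation; `segments_above`, `watershed` and the toy scores are ours):

```python
# Lower the threshold on the actionness score a(t) from high to low, group
# the time slots above the threshold into segments, and stop once the
# average gap between consecutive segments drops below `max_avg_gap` slots.

def segments_above(a, thr):
    """Group consecutive time slots where a(t) > thr into (start, end) pairs."""
    segs, start = [], None
    for t, v in enumerate(a):
        if v > thr and start is None:
            start = t
        elif v <= thr and start is not None:
            segs.append((start, t - 1))
            start = None
    if start is not None:
        segs.append((start, len(a) - 1))
    return segs

def watershed(a, max_avg_gap=6,
              thresholds=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)):
    for thr in thresholds:                           # iterate from high to low
        segs = segments_above(a, thr)
        if len(segs) >= 2:
            gaps = [s2[0] - s1[1] - 1 for s1, s2 in zip(segs, segs[1:])]
            if sum(gaps) / len(gaps) < max_avg_gap:  # termination criterion
                return segs
    return segments_above(a, thresholds[-1])

a = [0.05, 0.85, 0.95, 0.15, 0.05, 0.75, 0.85, 0.05]
print(watershed(a, max_avg_gap=3))                   # -> [(1, 2), (5, 6)]
```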

mAP @ mAR @
Criteria Parameters 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5
avg. gap 2 19.75 15.80 12.72 9.95 7.79 51.06 43.48 36.96 31.44 26.25
avg. gap 4 20.22 16.20 13.07 10.26 8.04 51.25 43.66 37.16 31.64 26.45
avg. gap 6 20.39 16.35 13.20 10.38 8.13 51.51 43.87 37.37 31.81 26.64
avg. gap 8 20.41 16.36 13.21 10.39 8.13 51.60 43.97 37.40 31.81 26.66
avg. gap 10 20.39 16.35 13.23 10.34 8.11 51.62 43.99 37.37 31.71 26.60
avg. length 10 20.07 16.06 12.92 10.13 7.93 51.70 44.06 37.42 31.87 26.60
avg. length 15 19.98 15.99 12.87 10.07 7.89 51.58 43.95 37.33 31.77 26.54
avg. length 20 19.96 15.99 12.88 10.09 7.91 51.41 43.83 37.30 31.76 26.49
TABLE XIX: Study of the two termination criteria in the watershed algorithm: the average temporal gap between different segments, and the average temporal length of the segments. Time slots are used as the unit for both hyper-parameters in this table.
Metrics mAP @ mAR @
0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5
BMN [38] 18.60 16.76 14.78 12.40 10.38 48.70 45.71 42.45 38.70 34.57
BMN+TC 19.27 17.17 15.09 12.60 10.59 47.87 44.64 41.61 37.92 34.07
BMN+OD 19.68 17.44 15.27 12.90 10.75 49.70 46.00 42.43 38.78 34.64
BSN [39] 18.91 16.84 14.49 12.26 10.00 46.97 43.87 39.78 35.77 31.61
BSN+TC 19.96 17.54 15.00 12.68 10.35 47.16 44.16 40.12 36.40 32.45
BSN+OD 20.46 17.82 15.20 12.85 10.34 48.76 44.88 40.86 36.95 32.54
R-C3D[74] 9.85 7.78 5.80 4.23 2.82 36.82 31.55 26.56 21.42 17.07
R-C3D+TC 10.32 8.25 6.20 4.54 3.08 39.25 34.22 29.09 23.71 19.24
R-C3D+OD 10.08 8.01 5.98 4.36 2.93 37.37 31.84 26.81 21.74 14.43
TABLE XX: Study of the TC and OD approaches on different basic models. The results are reported based on the RGB modality of the COIN dataset. Since BMN [38] and BSN [39] are originally designed for temporal action proposal generation, we process the proposals generated by these two methods with the classifier of SSN [80] to produce the final results.
Metrics mAP @ mAR @
Distributions 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5
Gaussian, 20.07 15.90 12.78 10.00 7.81 51.05 43.30 36.81 31.28 26.18
Gaussian, 20.39 16.35 13.20 10.38 8.13 51.51 43.87 37.37 31.81 26.64
Gaussian, 20.13 16.20 13.13 10.34 8.08 51.40 43.91 37.46 31.93 26.65
Gaussian, 19.81 15.97 12.92 10.18 7.98 51.19 43.85 37.34 31.78 26.58
Triangle 19.77 15.66 12.55 9.78 7.66 50.99 43.23 36.64 31.08 26.03
TABLE XXI: Study of different generated distributions for calculating scores in the OD method on the COIN dataset.

Appendix K Detailed Results on the OD method and other action detectors

In our main paper, Table 6 presents the results on different generated distributions and fusion methods for the OD method, and Table 8 shows the results on more basic action detection models. Here we present more detailed results on these issues in Table XX, Table XXI and Table XXII.

mAP @ mAR @
Fusion methods 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5
20.30 16.28 13.20 10.37 08.12 51.51 43.93 37.47 31.88 26.69
20.09 16.18 13.12 10.32 08.06 51.42 43.98 37.54 31.94 26.64
20.36 16.32 13.19 10.36 08.11 51.57 43.89 37.37 31.81 26.64
max-pool 19.84 16.01 12.97 10.21 07.98 51.41 44.01 37.51 31.97 26.70
TABLE XXII: Study of different methods to refine the segment scores on the COIN dataset. All the fusion methods are performed element-wise on the two scores.
Fig. 16: Visualization of step localization results. The video is associated with the task “make paper windmill”.
Fig. 17: Comparison of the step localization accuracy (%) of different tasks. We report the results obtained by SSN+TC with the attenuation coefficient set to 0.1.

Appendix L Visualization Results of Different Tasks

In section 5.2, we have compared the performance across different domains. Fig. 17 further shows examples from 4 different tasks: “blow sugar”, “play curling”, “make soap” and “resize watch band”. They belong to the domains “sports”, “leisure & performance”, “gadgets” and “science & craft”, which include two of the easiest and two of the hardest domains. For “blow sugar” and “play curling”, different steps vary greatly in appearance, so they are easier to localize in videos. For “make soap” and “resize watch band”, various steps tend to occur in similar scenes, hence the mAP accuracy of these tasks is lower.

Besides Fig. 10 in our main paper, we show one more visualized step localization result in Fig. 16. The video is associated with the task “make paper windmill” and the results further demonstrate the effectiveness of our proposed TC and OD methods.

0 0.3937 0.2948 45 0.1167 0.1908 90 0.1429 0.1605 135 0.4078 0.1038
1 0.1443 0.2294 46 0.3643 0.2049 91 0.2000 0.3961 136 0.4800 0.0918
2 0.6020 0.3415 47 0.1327 0.1343 92 0.4580 0.2611 137 0.3502 0.0423
3 0.5493 0.4122 48 0.2875 0.2297 93 0.4225 0.2286 138 0.2592 0.0216
4 0.2889 0.1982 49 0.1902 0.4522 94 0.3494 0.1235 139 0.3590 0.1095
5 0.4000 0.2258 50 0.3152 0.5097 95 0.2629 0.1646 140 0.2211 0.0345
6 0.2333 0.1415 51 0.1957 0.3180 96 0.1667 0.1038 141 0.0465 0.1151
7 0.5884 0.1018 52 0.2787 0.0979 97 0.4679 0.1235 142 0.0690 0.0122
8 0.2907 0.0244 53 0.4495 0.1640 98 0.3967 0.1566 143 0.2500 0.0441
9 0.1667 0.2029 54 0.4079 0.0592 99 0.4875 0.2517 144 0.3333 0.0000
10 0.3840 0.0837 55 0.2174 0.0847 100 0.4291 0.3103 145 0.1181 0.0000
11 0.3618 0.3733 56 0.4695 0.3254 101 0.4676 0.0726 146 0.0974 0.0499
12 0.3833 0.1097 57 0.5659 0.0220 102 0.0885 0.4320 147 0.3578 0.2795
13 0.2135 0.2342 58 0.0272 0.0110 103 0.2146 0.4725 148 0.1429 0.0539
14 0.4857 0.1212 59 0.2195 0.0448 104 0.4200 0.3225 149 0.3063 0.1628
15 0.1860 0.0000 60 0.2522 0.2617 105 0.0850 0.0244 150 0.3699 0.1120
16 0.5403 0.2246 61 0.1282 0.0685 106 0.2903 0.1545 151 0.3313 0.1860
17 0.3267 0.2540 62 0.4062 0.1143 107 0.0000 0.0189 152 0.1389 0.0828
18 0.0788 0.0531 63 0.1571 0.3838 108 0.0252 0.2010 153 0.4362 0.1868
19 0.4939 0.0909 64 0.0638 0.1706 109 0.1557 0.0844 154 0.2625 0.6657
20 0.4408 0.3092 65 0.3562 0.1468 110 0.1591 0.2353 155 0.3000 0.3750
21 0.3162 0.1345 66 0.2449 0.0449 111 0.2694 0.0821 156 0.4674 0.1917
22 0.4640 0.3878 67 0.6343 0.0746 112 0.5632 0.5025 157 0.0353 0.1667
23 0.3949 0.2627 68 0.5443 0.3366 113 0.4079 0.2310 158 0.3264 0.2707
24 0.4964 0.2928 69 0.3583 0.1928 114 0.1833 0.0932 159 0.3489 0.1853
25 0.4861 0.2605 70 0.1939 0.2414 115 0.2417 0.6190 160 0.0750 0.6606
26 0.5544 0.3462 71 0.1288 0.0417 116 0.0583 0.6424 161 0.4250 0.0622
27 0.6398 0.2928 72 0.5083 0.0484 117 0.0000 0.1413 162 0.2692 0.1337
28 0.4247 0.1257 73 0.4907 0.4682 118 0.2843 0.1579 163 0.1100 0.2462
29 0.0000 0.5493 74 0.2562 0.1250 119 0.0000 0.3103 164 0.1806 0.0787
30 0.4816 0.2778 75 0.5228 0.5254 120 0.2561 0.1393 165 0.0238 0.4991
31 0.2183 0.3175 76 0.0990 0.0455 121 0.2150 0.0542 166 0.1633 0.1123
32 0.4000 0.1850 77 0.2500 0.3972 122 0.0716 0.3968 167 0.3580 0.2754
33 0.3132 0.2252 78 0.4550 0.3864 123 0.1462 0.0818 168 0.1980 0.0123
34 0.5450 0.2705 79 0.1894 0.0655 124 0.2311 0.1841 169 0.2627 0.0990
35 0.1905 0.1324 80 0.4269 0.3316 125 0.0230 0.0116 170 0.3800 0.1061
36 0.4167 0.2753 81 0.5634 0.5474 126 0.1455 0.0621 171 0.1905 0.1373
37 0.1250 0.1788 82 0.3976 0.1103 127 0.1750 0.2632 172 0.0244 0.4203
38 0.0250 0.6638 83 0.1801 0.0288 128 0.1900 0.0638 173 0.2448 0.3026
39 0.6998 0.4741 84 0.2804 0.4672 129 0.5907 0.1401 174 0.0000 0.1531
40 0.2553 0.1732 85 0.3563 0.2337 130 0.2079 0.0422 175 0.5068 0.5137
41 0.3810 0.2424 86 0.4485 0.0302 131 0.4647 0.0671 176 0.1917 0.1880
42 0.0922 0.3520 87 0.0926 0.0581 132 0.3095 0.1240 177 0.2833 0.3628
43 0.5540 0.1977 88 0.0863 0.0769 133 0.3388 0.1801 178 0.2875 0.6123
44 0.3095 0.0444 89 0.2699 0.1824 134 0.2442 0.1447 179 0.4139 0.0719
TABLE XXIII: Statistical analysis on the ordering characteristics of the 180 tasks in COIN dataset.
Fig. 18: An overview of the 180 tasks of the COIN dataset, which are associated to 12 domains to our daily life: (1) nursing & caring, (2) vehicles, (3) leisure & performance, (4) gadgets, (5) electric appliances, (6) household items, (7) science & craft, (8) plants & fruits, (9) snacks & drinks, (10) dishes, (11) sports, and (12) housework.

Yansong Tang received the B.S. degree from the Department of Automation, Tsinghua University, China, in 2015, where he is currently a Ph.D. candidate. His current research interest lies in human behaviour understanding for computer vision. He has authored 10 scientific papers in this area, 4 of which are published in top journals and conferences including IEEE TIP, CVPR and ACM MM. He serves as a regular reviewer for a number of journals and conferences, e.g., TPAMI, TIP, TCSVT, CVPR, ICCV and AAAI. He received the National Scholarship of Tsinghua University in 2018.

Jiwen Lu (M’11-SM’15) received the B.Eng. degree in mechanical engineering and the M.Eng. degree in electrical engineering from the Xi’an University of Technology, Xi’an, China, in 2003 and 2006, respectively, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 2012. He is currently an Associate Professor with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include computer vision, pattern recognition, and machine learning. He has authored/co-authored over 200 scientific papers in these areas, where 70+ of them are IEEE Transactions papers and 50+ of them are CVPR/ICCV/ECCV papers. He serves as a Co-Editor-in-Chief of Pattern Recognition Letters, and as an Associate Editor of the IEEE Transactions on Image Processing, the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Biometrics, Behavior, and Identity Science, and Pattern Recognition. He is a member of the Multimedia Signal Processing Technical Committee and the Information Forensics and Security Technical Committee of the IEEE Signal Processing Society, and a member of the Multimedia Systems and Applications Technical Committee and the Visual Signal Processing and Communications Technical Committee of the IEEE Circuits and Systems Society. He was a recipient of the National 1000 Young Talents Program of China in 2015 and the National Science Fund of China for Excellent Young Scholars in 2018. He is a senior member of the IEEE.

Jie Zhou (M’01-SM’04) received the BS and MS degrees both from the Department of Mathematics, Nankai University, Tianjin, China, in 1990 and 1992, respectively, and the PhD degree from the Institute of Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology (HUST), Wuhan, China, in 1995. From then to 1997, he served as a postdoctoral fellow in the Department of Automation, Tsinghua University, Beijing, China. Since 2003, he has been a full professor in the Department of Automation, Tsinghua University. His research interests include computer vision, pattern recognition, and image processing. In recent years, he has authored more than 300 papers in peer-reviewed journals and conferences. Among them, more than 60 papers have been published in top journals and conferences such as the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Image Processing, and CVPR. He is an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence and two other journals. He received the National Outstanding Youth Foundation of China Award. He is a senior member of the IEEE.


  1. The language signal should not be treated as supervision since the steps are not directly given, but need to be further explored in an unsupervised manner.
  2. The details of the weak supervisions are described in section 5.2.
  3. We present the statistics of browse times in supplementary material.
  4. For a set of videos, the annotation time under the frame mode is only 26.8% of that under the video mode. Please see supplementary material for details.
  5. We present a table to clarify the goal, metric, and evaluated methods for each task in supplementary material.
  6. The calculation of the Fisher vector is based on the improved Dense Trajectory (iDT) representation [71], which requires huge computational cost and storage space.


  1. (2019) “Instruction, n.”. OED Online, Oxford University Press. Cited by: §1.
  2. S. N. Aakur and S. Sarkar (2019) A perceptual prediction framework for self supervised event segmentation. In CVPR, pp. 1197–1206. Cited by: §2.3.
  3. J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev and S. Lacoste-Julien (2016) Unsupervised learning from narrated instruction videos. In CVPR, pp. 4575–4583. Cited by: Appendix A, TABLE I, §1, §1, §2.3, §3.1, §4.2.
  4. J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev and S. Lacoste-Julien (2018) Learning from narrated instruction videos. TPAMI 40 (9), pp. 2194–2208. Cited by: §1, §3.3, §4.2, §6.
  5. P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. D. Reid, S. Gould and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, pp. 3674–3683. Cited by: §1.
  6. L. E. Baum and T. Petrie (1966) Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics 37 (6), pp. 1554–1563. Cited by: §4.2.
  7. J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pp. 4724–4733. Cited by: TABLE XII.
  8. C. Chang, D. Huang, Y. Sui, L. Fei-Fei and J. C. Niebles (2019) D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR, pp. 3546–3555. Cited by: §2.3.
  9. C. Chang, D. Huang, D. Xu, E. Adeli, L. Fei-Fei and J. C. Niebles (2019) Procedure planning in instructional videos. CoRR abs/1907.01172. Cited by: §2.1, §6.
  10. L. Chunhui, H. Yueyu, L. Yanghao, S. Sijie and L. Jiaying (2017) PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding. In ACM MM workshop, Cited by: §5.1.
  11. D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In ECCV, pp. 753–771. Cited by: TABLE I.
  12. P. Das, C. Xu, R. F. Doell and J. J. Corso (2013) A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. In CVPR, pp. 2634–2641. Cited by: Appendix A, TABLE I, §1.
  13. S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr. and A. de Albuquerque Araújo (2011) VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32 (1), pp. 56–68. Cited by: §2.2.
  14. F. De la Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada and J. Macey (2009) Guide to the carnegie mellon university multimodal activity (cmu-mmac) database. Robotics Institute, Carnegie Mellon University 5. Cited by: item .
  15. J. Deng, W. Dong, R. Socher, L. Li, K. Li and F. Li (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.2, §3.1.
  16. L. Ding and C. Xu (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR, pp. 6508–6516. Cited by: TABLE XVI, §2.3, §5.2, §5.2, TABLE X.
  17. J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko and T. Darrell (2017) Long-term recurrent convolutional networks for visual recognition and description. TPAMI 39 (4), pp. 677–691. Cited by: §5.2.
  18. H. Doughty, D. Damen and W. W. Mayol-Cuevas (2018) Who’s better? who’s best? pairwise deep ranking for skill determination. In CVPR, pp. 6057–6066. Cited by: Appendix A, TABLE I, §2.1, §6.
  19. H. Doughty, W. W. Mayol-Cuevas and D. Damen (2019) The pros and cons: rank-aware temporal attention for skill determination in long videos. In CVPR, pp. 7862–7871. Cited by: Appendix A, TABLE I.
  20. E. Elhamifar, G. Sapiro and S. S. Sastry (2016) Dissimilarity-based sparse subset selection. TPAMI 38 (11), pp. 2182–2197. Cited by: §1.
  21. Y. A. Farha and J. Gall (2019) MS-tcn: multi-stage temporal convolutional network for action segmentation. In CVPR, Cited by: §2.3.
  22. Y. A. Farha, A. Richard and J. Gall (2018) When will you do what? - anticipating temporal occurrences of activities. In CVPR, pp. 5343–5352. Cited by: §2.1, §6.
  23. C. Fellbaum (1998) WordNet: an electronic lexical database. Bradford Books. Cited by: §3.1.
  24. Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar and D. D. Yuh (2014) Jhu-isi gesture and skill assessment working set (jigsaws): a surgical activity dataset for human motion modeling. In MICCAI Workshop, Vol. 3, pp. 3. Cited by: item , TABLE I, §5.4, §5.4, TABLE XIV.
  25. C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid and J. Malik (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In CVPR, pp. 6047–6056. Cited by: §2.2.
  26. M. Gygli, H. Grabner, H. Riemenschneider and L. J. V. Gool (2014) Creating summaries from user videos. In ECCV, pp. 505–520. Cited by: §2.2.
  27. F. C. Heilbron, V. Escorcia, B. Ghanem and J. C. Niebles (2015) ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, pp. 961–970. Cited by: §2.2, §2.2, §3.1, §5.3, TABLE XII, TABLE XIII.
  28. Howcast. Note: \url Cited by: §2.2, §3.1, §3.1.
  29. Howdini. Note: \url Cited by: §2.2, §3.1, §3.1.
  30. D. Huang, S. Buch, L. Dery, A. Garg, L. Fei-Fei and J. Carlos Niebles (2018) Finding ”it”: weakly-supervised reference-aware visual grounding in instructional videos. In CVPR, pp. 5948–5957. Cited by: Appendix A, §2.1, §6.
  31. D. Huang, J. J. Lim, L. Fei-Fei and J. C. Niebles (2017) Unsupervised visual-linguistic reference resolution in instructional videos. In CVPR, pp. 1032–1041. Cited by: §1, §2.1.
  32. Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah and R. Sukthankar (2014) THUMOS challenge: action recognition with a large number of classes. Note: \url Cited by: §2.2, §5.3, TABLE XIII.
  33. K. R. Koedinger, A. T. Corbett and C. Perfetti (2012) The knowledge-learning-instruction framework: bridging the science-practice chasm to enhance robust student learning. Cognitive science 36 (5), pp. 757–798. Cited by: §1.
  34. R. Krishna, K. Hata, F. Ren, L. Fei-Fei and J. C. Niebles (2017) Dense-captioning events in videos. In ICCV, pp. 706–715. Cited by: Appendix B, §1, §2.2, §3.2.
  35. H. Kuehne, A. B. Arslan and T. Serre (2014) The language of actions: recovering the syntax and semantics of goal-directed human activities. In CVPR, pp. 780–787. Cited by: Appendix A, TABLE I, §1, §1, §2.1, §2.3, §5.1, §5.3, §5.4, §5.4, TABLE XIII, TABLE XIV.
  36. A. Kukleva, H. Kuehne, F. Sener and J. Gall (2019) Unsupervised learning of action classes with continuous temporal embedding. In CVPR, pp. 12066–12074. Cited by: §2.3.
  37. P. Lei and S. Todorovic (2018) Temporal deformable residual networks for action segmentation in videos. In CVPR, pp. 6742–6751. Cited by: §2.3.
  38. T. Lin, X. Liu, X. Li, E. Ding and S. Wen (2019) BMN: boundary-matching network for temporal action proposal generation. In ICCV, pp. 3888–3897. Cited by: TABLE XX, §2.3, §5.1, §5.1, §5.4, TABLE XIV, TABLE VIII.
  39. T. Lin, X. Zhao, H. Su, C. Wang and M. Yang (2018) BSN: boundary sensitive network for temporal action proposal generation. In ECCV, pp. 3–21. Cited by: TABLE XX, §2.3, §5.1, §5.1, §5.4, TABLE XIV, TABLE VIII.
  40. D. Liu, T. Jiang and Y. Wang (2019) Completeness modeling and context separation for weakly supervised temporal action localization. In CVPR, pp. 1298–1307. Cited by: §2.3.
  41. Y. Liu, L. Ma, Y. Zhang, W. Liu and S. Chang (2019) Multi-granularity generator for temporal action proposal. In CVPR, pp. 3604–3613. Cited by: §2.3.
  42. F. Long, T. Yao, Z. Qiu, X. Tian, J. Luo and T. Mei (2019) Gaussian temporal awareness networks for action localization. In CVPR, pp. 344–353. Cited by: §2.3.
  43. J. Marín, A. Biswas, F. Ofli, N. Hynes, A. Salvador, Y. Aytar, I. Weber and A. Torralba (2018) Recipe1M: A dataset for learning cross-modal embeddings for cooking recipes and food images. TPAMI. Cited by: TABLE I.
  44. A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In CVPR, Cited by: §6.
  45. A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev and J. Sivic (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, pp. 2630–2640. Cited by: Appendix A, TABLE I, §2.2.
  46. D. K. Misra, J. Sung, K. Lee and A. Saxena (2016) Tell me dave: context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research 35 (1-3), pp. 281–300. Cited by: §1.
  47. U.S. D. of Labor (2013) American time use survey. Note: \url Cited by: §3.1.
  48. R. Panda, N. C. Mithun and A. K. Roy-Chowdhury (2017) Diversity-aware multi-video summarization. TIP 26 (10), pp. 4712–4724. Cited by: §1, §2.2.
  49. P. Parmar and B. T. Morris (2017) Learning to score olympic events. In CVPRW, pp. 76–84. Cited by: §5.4, §5.4, TABLE XIV.
  50. S. Ren, K. He, R. B. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §2.3.
  51. A. Richard, H. Kuehne and J. Gall (2018) Action sets: weakly supervised action segmentation without ordering constraints. In CVPR, pp. 5987–5996. Cited by: TABLE XVI, §2.3, §5.2, §5.2, TABLE X, §6.
  52. A. Richard, H. Kuehne, A. Iqbal and J. Gall (2018) NeuralNetwork-viterbi: A framework for weakly supervised video learning. In CVPR, pp. 7386–7395. Cited by: TABLE XVI, §2.3, §5.2, §5.2, TABLE X, §6.
  53. N. RJ, K. PA and van Merriënboer JJ (2005) Optimizing the number of steps in learning tasks for complex skills. British Journal of Educational Psychology 75 (2), pp. 223– 237. Cited by: §1, §1.
  54. J. B. T. M. Roerdink and A. Meijster (2000) The watershed transform: definitions, algorithms and parallelization strategies. Fundam. Inform. 41 (1-2), pp. 187–228. Cited by: Appendix J, §4.2.
  55. M. Rohrbach, S. Amin, M. Andriluka and B. Schiele (2012) A database for fine grained activity detection of cooking activities. In CVPR, pp. 1194–1201. Cited by: Appendix A, TABLE I, §1, §2.1.
  56. A. Salvador, N. Hynes, Y. Aytar, J. Marín, F. Ofli, I. Weber and A. Torralba (2017) Learning cross-modal embeddings for cooking recipes and food images. In CVPR, pp. 3068–3076. Cited by: TABLE I.
  57. F. Sener and A. Yao (2018) Unsupervised learning and segmentation of complex activities from video. In CVPR, pp. 8368–8376. Cited by: §2.3.
  58. O. Sener, A. R. Zamir, S. Savarese and A. Saxena (2015) Unsupervised semantic parsing of video collections. In ICCV, pp. 4480–4488. Cited by: §1, §2.3.
  59. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, pp. 1–14. Cited by: §5.2, TABLE X.
  60. K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger (Eds.), pp. 568–576. Cited by: §4.2.
  61. B. Singh, T. K. Marks, M. J. Jones, O. Tuzel and M. Shao (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, pp. 1961–1970. Cited by: §2.3.
  62. Y. Song, J. Vallmitjana, A. Stent and A. Jaimes (2015) TVSum: summarizing web videos using titles. In CVPR, pp. 5179–5187. Cited by: §2.2.
  63. K. Soomro, A. R. Zamir and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, University of Central Florida. Cited by: TABLE XII.
  64. S. Stein and S. J. McKenna (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In UbiComp, pp. 729–738. Cited by: TABLE I, §1.
  65. C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In ICCV, pp. 7463–7472. Cited by: §6.
  66. Y. Tang, D. Ding, Y. Rao, Y. Zheng, D. Zhang, L. Zhao, J. Lu and J. Zhou (2019) COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR. Cited by: §1.
  67. Y. Tang, J. Lu, Z. Wang, M. Yang and J. Zhou (2019) Learning semantics-preserving attention and contextual interaction for group activity recognition. TIP 28 (10), pp. 4997–5012. Cited by: §2.2.
  68. Y. Tang, Z. Ni, J. Zhou, D. Zhang, J. Lu, Y. Wu and J. Zhou (2020) Uncertainty-aware score distribution learning for action quality assessment. In CVPR. Cited by: §6.
  69. S. Toyer, A. Cherian, T. Han and S. Gould (2017) Human pose forecasting via deep markov models. In DICTA, pp. 1–8. Cited by: TABLE I, §1.
  70. D. Tran, L. D. Bourdev, R. Fergus, L. Torresani and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, pp. 4489–4497. Cited by: §2.3.
  71. H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In ICCV, pp. 3551–3558. Cited by: footnote 6.
  72. L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV. Cited by: TABLE XVI, §4.2, §5.3, §5.4, TABLE XII.
  73. Wikihow. Cited by: §2.2, §3.1, §3.1.
  74. H. Xu, A. Das and K. Saenko (2017) R-C3D: region convolutional 3d network for temporal activity detection. In ICCV, pp. 5794–5803. Cited by: TABLE XX, TABLE XVI, §1, §2.3, §5.1, §5.1, TABLE VIII.
  75. J. Xu, T. Mei, T. Yao and Y. Rui (2016) MSR-VTT: A large video description dataset for bridging video and language. In CVPR, pp. 5288–5296. Cited by: §2.2.
  76. J. Yosinski, J. Clune, Y. Bengio and H. Lipson (2014) How transferable are features in deep neural networks?. In NeurIPS, pp. 3320–3328. Cited by: §5.4.
  77. H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang and X. Yang (2018-06) Fine-grained video captioning for sports narrative. In CVPR, pp. 6006–6015. Cited by: §1, §2.2.
  78. C. Zach, T. Pock and H. Bischof (2007) A duality based approach for realtime tv-l1 optical flow. In DAGM, pp. 214–223. Cited by: §5.1.
  79. K. Zhang, W. Chao, F. Sha and K. Grauman (2016) Video summarization with long short-term memory. In ECCV, pp. 766–782. Cited by: §1.
  80. Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang and D. Lin (2017) Temporal action detection with structured segment networks. In ICCV, pp. 2933–2942. Cited by: TABLE XX, Appendix J, TABLE XVI, §1, §2.3, §4.1, §4.2, TABLE II, §5.1, §5.1, §5.3, §5.3, §5.4, TABLE XI, TABLE XIII, TABLE IV, TABLE IX.
  81. L. Zhou, N. Louis and J. J. Corso (2018) Weakly-supervised video object grounding from text by loss weighting and object interaction. In BMVC, pp. 50. Cited by: Appendix A.
  82. L. Zhou, C. Xu and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In AAAI, pp. 7590–7598. Cited by: Appendix A, TABLE I, §1, §1, §2.1, §2.3, §3.1, §5.3, TABLE XI.
  83. L. Zhou, Y. Zhou, J. J. Corso, R. Socher and C. Xiong (2018) End-to-end dense video captioning with masked transformer. In CVPR, pp. 8739–8748. Cited by: §1.
  84. D. Zhukov, J. Alayrac, R. G. Cinbis, D. F. Fouhey, I. Laptev and J. Sivic (2019) Cross-task weakly supervised learning from instructional videos. In CVPR, pp. 3537–3545. Cited by: Appendix A, TABLE I, §2.3, §6.