# Temporal Convolutional Networks for Action Segmentation and Detection

## Abstract

The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We introduce a new class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.

## 1 Introduction

Action segmentation is crucial for applications ranging from collaborative robotics to analysis of activities of daily living. Given a video, the goal is to simultaneously segment every action in time and classify each constituent segment. In the literature, this task goes by either action segmentation or detection. We focus on modeling situated activities – such as in a kitchen or surveillance setup – which are composed of dozens of actions over a period of many minutes. These actions, such as cutting a tomato versus peeling a cucumber, are often only subtly different from one another.

Current approaches decouple this task into extracting low-level spatiotemporal features and applying high-level temporal classifiers. While there has been extensive work on the former, recent temporal models have been limited to sliding window action detectors [?], which typically do not capture long-range temporal patterns; segmental models [?], which capture within-segment properties but assume conditional independence, thereby ignoring long-range latent dependencies; and recurrent models [?], which can capture latent temporal patterns but are difficult to interpret, have been found empirically to have a limited span of attention [?], and are hard to correctly train [?].

In this paper, we discuss a class of time-series models, which we call Temporal Convolutional Networks (TCNs), that overcome the previous shortcomings by capturing long-range patterns using a hierarchy of temporal convolutional filters. We present two types of TCNs. First, our Encoder-Decoder TCN (ED-TCN) only uses a hierarchy of temporal convolutions, pooling, and upsampling but can efficiently capture long-range temporal patterns. The ED-TCN has a relatively small number of layers (e.g., 3 in the encoder) but each layer contains a set of long convolutional filters. Second, a Dilated TCN uses dilated convolutions instead of pooling and upsampling and adds skip connections between layers. This is an adaptation of the recent WaveNet [?] model, which shares similarities with our ED-TCN but was developed for speech processing tasks. The Dilated TCN has more layers, but each uses dilated filters that only operate on a small number of time steps. Empirically, we find both TCNs are capable of capturing features of segmental models, such as action durations and pairwise transitions between segments, as well as long-range temporal patterns similar to recurrent models. These models tend to outperform our Bidirectional LSTM (Bi-LSTM) [?] baseline and are over an order of magnitude faster to train. The ED-TCN in particular produces many fewer over-segmentation errors than other models.

In the literature, our task goes by the name action segmentation [?] or action detection [?]. Despite effectively being the same problem, the temporal methods in segmentation papers tend to differ from those in detection papers, as do the metrics by which they are evaluated. In this paper, we evaluate on datasets targeted at both tasks and propose a segmental F1 score, which we qualitatively find is more applicable to real-world concerns for both segmentation and detection than common metrics. We use MERL Shopping, which was designed for action detection; Georgia Tech Egocentric Activities, which was designed for action segmentation; and 50 Salads, which has been used for both.

Code for our models and metrics, as well as dataset features and predictions (see footnote 1), will be released upon acceptance.

## 2 Related Work

Action segmentation methods predict what action is occurring at every frame in a video, whereas detection methods output a sparse set of action segments, where a segment is defined by a start time, end time, and class label. It is possible to convert between a given segmentation and a set of detections by simply adding or removing null/background segments.
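The conversion between the two representations can be illustrated with a minimal sketch. The function names, the `(start, end, label)` tuple layout with an exclusive `end`, and the `"bg"` background label are assumptions made for this example and are not taken from the paper's code.

```python
def labels_to_segments(labels, background=None):
    """Collapse per-frame labels into (start, end, label) segments.

    `end` is exclusive. Segments whose label equals `background` are dropped,
    which turns a dense segmentation into a sparse set of detections.
    """
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != background:
                segments.append((start, t, labels[start]))
            start = t
    return segments


def segments_to_labels(segments, num_frames, background="bg"):
    """Expand detections into per-frame labels, filling gaps with background."""
    labels = [background] * num_frames
    for start, end, label in segments:
        for t in range(start, end):
            labels[t] = label
    return labels


frames = ["bg", "bg", "cut", "cut", "cut", "bg", "peel", "peel"]
detections = labels_to_segments(frames, background="bg")
print(detections)
print(segments_to_labels(detections, len(frames)) == frames)
```

Dropping the background runs gives the detection view, and re-inserting them recovers the original per-frame segmentation.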

Action Detection: Many fine-grained detection papers use sliding window-based detection methods on spatial or spatiotemporal features. Rohrbach [?] used Dense Trajectories [?] and human pose features on the MPII Cooking dataset. At each frame they evaluated a sliding SVM for many candidate segment lengths and performed non-maximal suppression to find a small set of action predictions. Ni [?] used an object-centric feature representation, which iteratively parses object locations and spatial configurations, and applied it to the MPII Cooking and ICPR 2012 Kitchen datasets. Their approach used Dense Trajectory features as input into a sliding-window detection method with segment intervals of 30, 60, and 90 frames. Singh [?] improved upon this by feeding per-frame CNN features into an LSTM model and applying a method analogous to non-maximal suppression to the LSTM output. We use Singh’s proposed dataset, MERL Shopping, and show our approach outperforms their LSTM-based detection model. Recently, Richard [?] introduced a segmental approach that incorporates a language model, which captures pairwise transitions between segments, and a duration model, which ensures that segments are of an appropriate length. In the experiments we show that our model is capable of capturing both of these components.

Some of these datasets, including MPII Cooking, have been used for classification (e.g., [?]); however, this task assumes the boundaries of each segment are known.

Action Segmentation: Segmentation papers tend to use temporal models that capture high-level patterns, for example RNNs or Conditional Random Fields (CRFs). The line of work by Fathi [?] used a segmental model that captured object states at the start and end of each action (e.g., the appearance of bread before and after spreading jam). They applied their work to the Georgia Tech Egocentric Activities (GTEA) dataset, which we use in our experiments. Singh [?] used an ensemble of CNNs to extract egocentric-specific features on the GTEA dataset but did not use a high-level temporal model. Lea [?] introduced a spatiotemporal CNN with a constrained segmental model which they applied to 50 Salads. Their model reduced the number of action over-segmentation errors by constraining the maximum number of candidate segments. We show our TCNs produce even fewer over-segmentation errors. Kuehne [?] modeled actions using Hidden Markov Models on Dense Trajectory features, which they applied with a high-level grammar to 50 Salads. Other work has looked at semi-supervised methods for action segmentation, such as Huang [?], which reduces the number of required annotations and improves performance when used with RNN-based models. It is possible that their approach could be used with TCNs for improved performance.

Large-scale Recognition: There has been substantial work on spatiotemporal models for large scale video classification and detection [?]. These focus on capturing object- and scene-level information from short sequences of images and thus are considered orthogonal to our work, which focuses on capturing longer-range temporal information. The input into our model could be the output of a spatiotemporal CNN.

Other related models: There are parallels between TCNs and recent work on semantic segmentation, which uses Fully Convolutional CNNs to compute a per-pixel object labeling of a given image. The Encoder-Decoder TCN is most similar to SegNet [?] whereas the Dilated TCN is most similar to the Multi-Scale Context model [?]. TCNs are also related to Time-Delay Neural Networks (TDNNs), which were introduced by Waibel [?] in the early 1990s. TDNNs apply a hierarchy of temporal convolutions across the input but do not use pooling, skip connections, newer activations (e.g., Rectified Linear Units), or other features of our TCNs.

## 3 Temporal Convolutional Networks

In this section we define two TCNs, each of which has the following properties: (1) computations are performed layer-wise, meaning every time step is updated simultaneously instead of sequentially per frame, (2) convolutions are computed across time, and (3) predictions at each frame are a function of a fixed-length period of time, which is referred to as the receptive field. Our ED-TCN uses an encoder-decoder architecture with temporal convolutions, and the Dilated TCN, which is adapted from the WaveNet model, uses a deep series of dilated convolutions.

The input to a TCN will be a set of video features, such as those output from a spatial or spatiotemporal CNN, for each frame of a given video. Let $X_t \in \mathbb{R}^{F_0}$ be the input feature vector of length $F_0$ for time step $t$ for $1 \le t \le T$. Note that the number of time steps $T$ may vary for each video sequence. The action label for each frame is given by vector $Y_t \in \{0, 1\}^C$, where $C$ is the number of classes, such that the true class is $1$ and all others are $0$.

### 3.1 Encoder-Decoder TCN

Our encoder-decoder framework is depicted in Figure 1. The encoder consists of $L$ layers denoted by $E^{(l)} \in \mathbb{R}^{F_l \times T_l}$, where $F_l$ is the number of convolutional filters in the $l$-th layer and $T_l$ is the number of corresponding time steps. Each layer consists of temporal convolutions, a non-linear activation function, and max pooling across time.

We define the collection of filters in each layer as $W = \{W^{(i)}\}_{i=1}^{F_l}$ for $W^{(i)} \in \mathbb{R}^{d \times F_{l-1}}$ with a corresponding bias vector $b \in \mathbb{R}^{F_l}$. Given the signal from the previous layer, $E^{(l-1)}$, we compute activations $E^{(l)}$ with

$$E^{(l)} = f\left(W * E^{(l-1)} + b\right)$$

where $f(\cdot)$ is the activation function and $*$ is the convolution operator. We compare activation functions in Section 4.4 and find normalized Rectified Linear Units perform best. After each activation function, we max pool with width 2 across time such that $T_l = \frac{1}{2} T_{l-1}$. Pooling enables us to efficiently compute activations over long temporal windows.

Our decoder is similar to the encoder, except that upsampling is used instead of pooling and the order of the operations is now upsample, convolve, and apply the activation function. Upsampling is performed by simply repeating each entry twice. The convolutional filters in the decoder distribute the activations from the condensed layers in the middle to the action predictions at the top. Experimentally, these convolutions provide a large improvement in performance and appear to capture pairwise transitions between actions. Each decoder layer is denoted by $D^{(l)} \in \mathbb{R}^{F_l \times T_l}$ for $l \in \{L, \dots, 1\}$. Note that these are indexed in reverse order compared to the encoder, so the filter count in the first encoder layer is the same as in the last decoder layer.
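The order of operations in one encoder layer (convolve, activate, pool) and one decoder layer (upsample, convolve, activate) can be sketched in plain Python. This is only an illustration under several assumptions: signals are stored as `[T][F]` nested lists, the convolution is the cross-correlation used by most deep learning libraries with zero padding at the sequence ends, a plain ReLU stands in for the activation, and all function names are hypothetical.

```python
def relu(frame):
    """Element-wise rectification of one frame (a list of channel values)."""
    return [max(0.0, v) for v in frame]


def temporal_conv(signal, filters, bias):
    """'Same'-length temporal convolution (cross-correlation form).

    signal: [T][F_in], filters: [F_out][d][F_in], bias: [F_out].
    Frames outside the sequence are treated as zeros (an assumption).
    """
    num_frames, d = len(signal), len(filters[0])
    half = d // 2
    out = []
    for t in range(num_frames):
        frame = []
        for w, b in zip(filters, bias):
            acc = b
            for k in range(d):
                src = t + k - half  # input frame read by filter tap k
                if 0 <= src < num_frames:
                    acc += sum(wi * xi for wi, xi in zip(w[k], signal[src]))
            frame.append(acc)
        out.append(frame)
    return out


def max_pool_time(signal, width=2):
    """Max pool across time; a trailing partial window is dropped."""
    return [[max(ch) for ch in zip(*signal[s:s + width])]
            for s in range(0, len(signal) - width + 1, width)]


def upsample_time(signal, factor=2):
    """Upsample by repeating each time step `factor` times."""
    return [list(frame) for frame in signal for _ in range(factor)]


def encoder_layer(signal, filters, bias):
    # Encoder order: convolve, activate, pool.
    activated = [relu(f) for f in temporal_conv(signal, filters, bias)]
    return max_pool_time(activated)


def decoder_layer(signal, filters, bias):
    # Decoder order: upsample, convolve, activate.
    return [relu(f) for f in temporal_conv(upsample_time(signal), filters, bias)]


identity = [[[1.0]]]  # one filter, duration d = 1, one input channel
x = [[1.0], [-2.0], [3.0], [4.0]]
encoded = encoder_layer(x, identity, [0.0])
decoded = decoder_layer(encoded, identity, [0.0])
print(encoded)
print(decoded)
```

With the identity filter the example isolates the effect of pooling and upsampling: four frames are reduced to two in the encoder and restored to four in the decoder.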

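The probability that frame $t$ corresponds to each of the $C$ action classes is given by vector $\hat{Y}_t \in [0, 1]^C$ using weight matrix $U \in \mathbb{R}^{C \times F_1}$ and bias $c \in \mathbb{R}^C$, such that

$$\hat{Y}_t = \text{softmax}\left(U D_t^{(1)} + c\right)$$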

We explored other mechanisms, such as skip connections between layers, different patterns of convolutions, and other normalization schemes; however, the proposed model outperformed these alternatives and is arguably simpler. Implementation details are described in Section 3.3.

Receptive Field: The prediction at each frame is a function of a fixed-length period of time, which is given by $r(d, L) = d(2^L - 1) + 1$ for $L$ layers and duration $d$.

### 3.2 Dilated TCN

We adapt the WaveNet [?] model, which was designed for speech synthesis, to the task of action segmentation. In their work, the predicted output, $Y_t$, denoted which audio sample should come next given the audio from frames $1$ to $t$. In our case $Y_t$ is the current action given the video features up to $t$.

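As shown in Figure 2, we define a series of blocks, each of which contains a sequence of $L$ convolutional layers. The activations in the $l$-th layer and $j$-th block are given by $S^{(j,l)} \in \mathbb{R}^{F_w \times T}$. Note that each layer has the same number of filters $F_w$. This enables us to combine activations from different layers later. Each layer consists of a set of dilated convolutions with rate parameter $s$, a non-linear activation $f(\cdot)$, and a residual connection that combines the layer’s input and the convolution signal. Convolutions are only applied over two time steps, $t$ and $t-s$, so we write out the full equations to be specific. The filters are parameterized by $W = \{W^{(1)}, W^{(2)}\}$ with $W^{(i)} \in \mathbb{R}^{F_w \times F_w}$ and bias vector $b \in \mathbb{R}^{F_w}$. Let $\hat{S}_t^{(j,l)}$ be the result of the dilated convolution at time $t$ and $S_t^{(j,l)}$ be the result after adding the residual connection such that

$$\hat{S}_t^{(j,l)} = f\left(W^{(1)} S_{t-s}^{(j,l-1)} + W^{(2)} S_t^{(j,l-1)} + b\right) \tag{2}$$

$$S_t^{(j,l)} = S_t^{(j,l-1)} + V \hat{S}_t^{(j,l)} + e$$

Let $V \in \mathbb{R}^{F_w \times F_w}$ and $e \in \mathbb{R}^{F_w}$ be a set of weights and biases for the residual. Note that parameters $\{W, b, V, e\}$ are separate for each layer.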
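A single dilated layer with its residual connection can be sketched as follows. This is a plain-Python illustration rather than the authors' implementation: the names `W1`, `W2`, `b`, `V`, and `e` stand for the two filter taps, the convolution bias, and the residual weights and bias; the dilation rate is `s`; ReLU is used as a stand-in activation; and treating frames before the start of the sequence as zeros is an assumption.

```python
def matvec(matrix, vector):
    """Multiply a [rows][cols] matrix by a length-cols vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def dilated_residual_layer(S_prev, W1, W2, b, V, e, s,
                           f=lambda z: max(0.0, z)):
    """Apply one causal dilated layer with a residual connection.

    S_prev: [T][F_w] activations from the previous layer.
    W1 acts on frame t - s, W2 acts on frame t; V and e form the residual map.
    Frames with t - s < 0 are replaced by zeros (an assumption).
    """
    zeros = [0.0] * len(b)
    out = []
    for t in range(len(S_prev)):
        past = S_prev[t - s] if t - s >= 0 else zeros
        # Dilated convolution over the two taps, followed by the activation.
        pre = [p + c + bias for p, c, bias in
               zip(matvec(W1, past), matvec(W2, S_prev[t]), b)]
        S_hat = [f(z) for z in pre]
        # Residual connection: layer input plus a linear map of the conv output.
        out.append([x + r + bias for x, r, bias in
                    zip(S_prev[t], matvec(V, S_hat), e)])
    return out


# One channel, identity weights, dilation rate s = 2.
S0 = [[1.0], [2.0], [3.0], [4.0]]
S1 = dilated_residual_layer(S0, W1=[[1.0]], W2=[[1.0]], b=[0.0],
                            V=[[1.0]], e=[0.0], s=2)
print(S1)
```

Every output frame depends only on frames `t` and `t - s` of the previous layer, so all time steps in a layer can be computed independently of one another.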

The dilation rate increases for consecutive layers within a block such that $s_l = 2^l$. This enables us to increase the receptive field by a substantial amount without drastically increasing the number of parameters.

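The output of each block is summed using a set of skip connections with $Z^{(0)} \in \mathbb{R}^{T \times F_w}$ such that

$$Z_t^{(0)} = \text{ReLU}\left(\sum_{j=1}^{B} S_t^{(j,L)}\right)$$

There is a set of latent states $Z_t^{(1)} = \text{ReLU}\left(V_r Z_t^{(0)} + e_r\right)$ for weight matrix $V_r \in \mathbb{R}^{F_w \times F_w}$ and bias $e_r \in \mathbb{R}^{F_w}$. The predictions for each time $t$ are given by

$$\hat{Y}_t = \text{softmax}\left(U Z_t^{(1)} + c\right)$$

for weight matrix $U \in \mathbb{R}^{C \times F_w}$ and bias $c \in \mathbb{R}^C$.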

Receptive Field: The filters in each Dilated TCN layer are smaller than in ED-TCN, so in order to get an equal-sized receptive field it needs more layers or blocks. The expression for its receptive field is $r(B, L) = B \cdot 2^L$ for number of blocks $B$ and number of layers per block $L$.

### 3.3 Implementation details

Parameters of both TCNs are learned using the categorical cross entropy loss with Stochastic Gradient Descent and ADAM [?] step updates. Using dropout on full convolutional filters [?], as opposed to individual weights, improves performance and produces smoother looking weights. For ED-TCN, each of the $L$ layers has $F_l$ filters. For the Dilated TCN we find that performance does not depend heavily on the number of filters for each convolutional layer, so we use the same number of filters $F_w$ in every layer. We perform ablative analysis with various numbers of layers and filter durations in the experiments. All models were implemented using Keras [?] and TensorFlow [?].

### 3.4 Causal versus Acausal

We perform causal and acausal experiments. Causal means that the prediction at time $t$ is only a function of data from times $1$ to $t$, which is important for applications in robotics. Acausal means that the predictions may be a function of data at any time step in the sequence. For the causal case in ED-TCN, at each time step $t$ and for filter length $d$, we convolve from $X_{t-d}$ to $X_t$. In the acausal case we convolve from $X_{t-\frac{d}{2}}$ to $X_{t+\frac{d}{2}}$.
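The difference between the two modes comes down to which input frames a filter is allowed to read. The sketch below assumes a filter of length `d` that reads frames `t - d` through `t` in the causal case and `t - d/2` through `t + d/2` in the acausal case, with indices clipped to the valid range; the function name and the clipping behaviour are assumptions for illustration.

```python
def conv_window(t, d, num_frames, causal):
    """Return the input frame indices read when predicting frame t.

    Causal: frames t - d .. t (nothing from the future).
    Acausal: frames t - d // 2 .. t + d // 2 (centred on t).
    Indices outside [0, num_frames) are dropped.
    """
    if causal:
        first, last = t - d, t
    else:
        first, last = t - d // 2, t + d // 2
    return [i for i in range(first, last + 1) if 0 <= i < num_frames]


print(conv_window(10, 4, 100, causal=True))
print(conv_window(10, 4, 100, causal=False))
print(conv_window(1, 4, 100, causal=True))
```

A causal window never contains an index greater than `t`, which is the property that matters for online use such as robotics.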

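For the acausal Dilated TCN, we modify Equation 2 such that each layer now operates over one previous step, the current step, and one future step:

$$\hat{S}_t^{(j,l)} = f\left(W^{(1)} S_{t-s}^{(j,l-1)} + W^{(2)} S_t^{(j,l-1)} + W^{(3)} S_{t+s}^{(j,l-1)} + b\right)$$

where now $W = \{W^{(1)}, W^{(2)}, W^{(3)}\}$.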

## 4 Evaluation & Discussion

We start by performing synthetic experiments that highlight the ability of our TCNs to capture high-level temporal patterns. We then perform quantitative experiments on three challenging datasets and ablative analysis to measure the impact of hyper-parameters such as filter duration.

### 4.1 Metrics

Papers addressing action segmentation tend to use different metrics than those on action detection. We evaluate using metrics from both communities and introduce a segmental F1 score, which is applicable to both tasks.

Segmentation metrics: Action segmentation papers tend to use frame-wise metrics, such as accuracy, precision, and recall [?]. Some work on 50 Salads also uses segmental metrics [?], in the form of a segmental edit distance, which is useful because it penalizes out-of-order predictions and over-segmentation errors. We evaluate all methods using frame-wise accuracy.

One drawback of frame-wise metrics is that models achieving similar accuracy may have large qualitative differences, as visualized later. For example, a model may achieve high accuracy but produce numerous over-segmentation errors. It is important to avoid these errors for human-robot interaction and video summarization.

Detection metrics: Action detection papers tend to use segment-wise metrics such as mean Average Precision with midpoint hit criterion (mAP@mid) [?] or mAP with an intersection over union (IoU) overlap criterion (mAP@k) [?]. mAP@k is computed by comparing the overlap score for each segment with respect to the ground truth action of the same class. If an IoU score is above a threshold it is considered a true positive; otherwise it is a false positive. Average precision is computed for each class and the results are averaged. mAP@mid is similar except the criterion for a true positive is whether or not the midpoint (mean time) is within the start and stop time of the corresponding correct action.

mAP is a useful metric for information retrieval tasks like video search; however, for many fine-grained action detection applications, such as robotics or video surveillance, we find that results are not indicative of real-world performance. The key issue is that mAP is very sensitive to the confidence score assigned to each segment prediction. These confidences are often simply the mean or maximum class score within the frames corresponding to a predicted segment. Computing these confidences in subtly different ways can produce wildly different results. On MERL Shopping, the mAP@mid scores for Singh [?] jump from 50.9 using the mean prediction score over an interval to 69.8 using the maximum score over that same interval.
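The sensitivity to the confidence definition can be seen in a toy example. The per-frame scores below are made up for illustration, and `segment_confidence` is a hypothetical helper; the point is only that the mean and the maximum can rank the same two segments in opposite orders, and average precision depends on that ranking.

```python
def segment_confidence(frame_scores, how="mean"):
    """Reduce the per-frame class scores inside one predicted segment."""
    if how == "mean":
        return sum(frame_scores) / len(frame_scores)
    if how == "max":
        return max(frame_scores)
    raise ValueError("how must be 'mean' or 'max'")


# Hypothetical per-frame scores for two predicted segments of the same class.
steady = [0.60, 0.60, 0.60]  # moderately confident throughout
spiky = [0.10, 0.95, 0.10]   # one very confident frame

for how in ("mean", "max"):
    ranked = sorted(
        [("steady", segment_confidence(steady, how)),
         ("spiky", segment_confidence(spiky, how))],
        key=lambda item: item[1], reverse=True)
    print(how, [name for name, _ in ranked])
```

Because precision is accumulated down the ranked list, swapping the order of a correct and an incorrect detection changes the resulting average precision even though the predicted segments themselves are identical.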

F1@k: We propose a segmental F1 score which is applicable to both segmentation and detection tasks and has the following properties: (1) it penalizes over-segmentation errors, (2) it does not penalize minor temporal shifts between the predictions and ground truth, which may have been caused by annotator variability, and (3) scores depend on the number of actions and not on the duration of each action instance. This metric is similar to mAP with IoU thresholds except that it does not require a confidence for each prediction. Qualitatively, we find these numbers are better at indicating the caliber of a given segmentation.

We compute whether each predicted action segment is a true or false positive by comparing its IoU with respect to the corresponding ground truth using threshold $\tau$. As with mAP detection scores, if there is more than one correct detection within the span of a single true action then only one is marked as a true positive and all others are false positives. We compute precision and recall for true positives, false positives, and false negatives summed over all classes and compute $F1 = 2\,\frac{\text{prec} \cdot \text{recall}}{\text{prec} + \text{recall}}$.
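The procedure can be sketched as below. This is a minimal illustration and not the released evaluation code: the helper names are hypothetical, the IoU threshold is passed as a fraction (so F1@25 would use `0.25`, an assumed convention), each prediction is matched to the same-class ground truth segment with the highest IoU, and ties or zero-overlap cases count as false positives.

```python
def labels_to_segments(labels, background=None):
    """Collapse per-frame labels into (start, end, label); end is exclusive."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != background:
                segments.append((start, t, labels[start]))
            start = t
    return segments


def iou(a, b):
    """Temporal intersection over union of two (start, end, label) segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union


def segmental_f1(pred_frames, true_frames, threshold, background=None):
    """Segmental F1 at an IoU threshold given as a fraction in (0, 1]."""
    pred = labels_to_segments(pred_frames, background)
    true = labels_to_segments(true_frames, background)
    matched = [False] * len(true)
    tp = fp = 0
    for p in pred:
        best_iou, best_idx = 0.0, -1
        for idx, g in enumerate(true):
            if g[2] == p[2] and iou(p, g) > best_iou:
                best_iou, best_idx = iou(p, g), idx
        # Only the first sufficiently overlapping detection of a true segment
        # is a true positive; later ones count as false positives.
        if best_idx >= 0 and best_iou >= threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = ["a"] * 5 + ["b"] * 5
pred = ["a", "a", "b", "a", "a"] + ["b"] * 5  # one wrong frame, 90% accuracy
print(segmental_f1(pred, truth, 0.25), segmental_f1(pred, truth, 0.50))
```

The single mislabeled frame splits the first action into three predicted segments, so the score falls well below the 90% frame-wise accuracy and falls further as the threshold is raised.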

We attempted to obtain action predictions from the original authors on all datasets to compare across multiple metrics. We received them for 50 Salads and MERL Shopping.

### 4.2 Synthetic Experiments

We claim TCNs are capable of capturing complex temporal patterns, such as action compositions, action durations, and long-range temporal dependencies. We show these abilities with two synthetic experiments. For each, we generate toy features and corresponding labels for 50 training sequences and 10 test sequences of length $T$. The duration of each action of a given class is fixed and action segments are sampled randomly. An example for the composition experiment is shown in Figure 3. Both TCNs are acausal and have a receptive field of length 16.

Action Compositions: By definition, an activity is composed of a sequence of actions. Typically there is a dependency between consecutive actions (e.g., action $B$ likely comes after $A$). CRFs capture this using a pairwise transition model over class labels and RNNs capture it using LSTM across latent states. We show that TCNs can capture action compositions, despite not explicitly conditioning the activations at time $t$ on previous time steps within that layer.

In this experiment, we generated sequences using a Markov model with three high-level actions $A$, $B$, and $C$ with subactions $A_1$, $A_2$, and $A_3$, as shown in Figure 3. $A$ always consists of subactions $A_1$ (dark blue), $A_2$ (light blue), then $A_3$ (green), after which it transitions to $B$ or $C$. For simplicity, the feature vector $X_t$ encodes only the high-level action $Y_t$, taking one value in the entry for the true class and another value in all other entries.

The feature vectors corresponding to subactions are all the same, thus a simple frame-based classifier would not be able to differentiate them. The TCNs, given their long receptive fields, segment the actions perfectly. This suggests that our TCNs are capable of capturing action compositions. Recall each action class had a different segment duration, and we correctly labeled all frames, which suggests TCNs can capture duration properties for each class.

Long-range temporal dependencies: For many actions it is important to consider information from seconds or even minutes in the past. For example, in the cooking scenario, when a user cuts a tomato, they tend to occlude the tomato with their hands, which makes it difficult to identify which object is being cut. It would be advantageous to recognize that the tomato is on the cutting board before the user starts the cutting action. In this experiment, we show TCNs are capable of learning these long-range temporal patterns by adding a phase delay to the features. Specifically, for both training and test features we define $\hat{X}$ as $\hat{X}_t = X_{t-s}$ for all $t$. Thus, there is a delay of $s$ frames between the labels and the corresponding features.
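The phase delay amounts to shifting the feature sequence by a fixed number of frames while leaving the labels in place. In the sketch below the delay is called `s`, the function name is hypothetical, and filling the first `s` frames with zeros is an assumption, since the text does not say how the start of the sequence is handled.

```python
def delay_features(features, s):
    """Return delayed features where output[t] equals features[t - s].

    features: [T][F] nested list. The first s output frames have no source
    frame and are filled with zeros (an assumption).
    """
    if not features:
        return []
    num_frames = len(features)
    s = min(s, num_frames)
    pad = [[0.0] * len(features[0]) for _ in range(s)]
    return pad + [list(frame) for frame in features[:num_frames - s]]


X = [[1.0], [2.0], [3.0], [4.0]]
print(delay_features(X, 2))
```

With this shift, the evidence for the label at frame `t` sits `s` frames later in the delayed features, so a model can only predict it correctly if its receptive field spans the delay.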

Results using F1@10 are shown in ?. For comparison we show the TCNs as well as Bi-LSTM. As expected, with no delay ($s = 0$) all models achieve perfect prediction. For short delays, TCNs correctly detect all actions except the first and last of a sequence. As the delay increases, ED-TCN and Dilated TCN perform very well up to about half the length of the receptive field. The results for Bi-LSTM degrade at a much faster rate.

### 4.3 Datasets

University of Dundee 50 Salads [?]: contains 50 sequences of users making a salad and has been used for both action segmentation and detection. While this is a multimodal dataset, we only evaluate using the video data. Each video is 5-10 minutes in duration and contains around 30 instances of actions such as cut tomato or peel cucumber. We performed cross validation with 5 splits on a higher-level action granularity which includes 9 action classes such as add dressing, add oil, cut, mix ingredients, peel, and place, plus a background class. In [?] this was referred to as “eval.” We also evaluate on a mid-level action granularity with 17 action classes. The mid-level labels differentiate actions like cut tomato from cut cucumber, whereas the higher-level labels combine these into a single class, cut.

We use the spatial CNN features of Lea [?] as input into our models. This is a simplified VGG-style model trained solely on 50 Salads. Data was downsampled to approximately 1 frame/second.

MERL Shopping [?]: is an action detection dataset consisting of 106 surveillance-style videos in which users interact with items on store shelves. The camera viewpoint is fixed and only one user is present in each video. There are five actions plus a background class: reach to shelf, retract hand from shelf, hand in shelf, inspect product, inspect shelf. Actions are typically a few seconds long.

We use the features from Singh [?] as input. Singh’s model consists of four VGG-style spatial CNNs: one for RGB, one for optical flow, and ones for cropped versions of RGB and optical flow. We stack the four feature-types for each frame and use Principal Components Analysis with 50 components to reduce the dimensionality. The train, validation, and test splits are the same as described in [?]. Data is sampled at 2.5 frames/second.

Georgia Tech Egocentric Activities (GTEA) [?]: contains 28 videos of 7 kitchen activities such as making a sandwich and making coffee. The four subjects performed each activity once. The camera is mounted on the user’s head and is pointing towards their hands. On average there are about 19 (non-background) actions per video and videos are around a minute long. We use the 11 action classes defined in [?] and evaluate using leave one user out. Cross-validation is performed over users 1-3 as done by [?]. We use a frame rate of 3 frames per second.

We were unable to obtain state of the art features from [?], so we trained spatial CNNs from scratch using code from [?], which was originally applied to 50 Salads. This is a simplified VGG-style network where the input for each frame is a pair of RGB and motion images. Optical flow is very noisy due to large amounts of video compression in this dataset, so we simply use difference images, such that for image $I_t$ at frame $t$ the motion image is a concatenation of differences between $I_t$ and temporally offset frames. These difference images can be viewed as a simple attention mechanism. We compare results from this spatial CNN, the spatiotemporal CNN from [?], EgoNet [?], and our TCNs.

### 4.4 Experimental Results

To make the baselines more competitive, we apply Bidirectional LSTM (Bi-LSTM) [?] to 50 Salads and GTEA. We use 64 latent states per LSTM direction with the same loss and learning methods as previously described. The input to this model is the same as for the TCNs. Note that MERL Shopping already has this baseline.

50 Salads: Results on both action granularities are included in Table ?. All methods are evaluated in acausal mode. ED-TCN outperforms all other models on both granularities and on all metrics. We also compare against Richard [?] who evaluated on the mid-level and reported using IoU mAP detection metrics. Their approach achieved 37.9 mAP@10 and 22.9 mAP@50. The ED-TCN achieves 64.9 mAP@10 and 42.3 mAP@50 and Dilated TCN achieves 53.3 mAP@10 and 29.2 mAP@50.

Notice that ED-TCN, Dilated TCN, and ST-CNN all achieve similar frame-wise accuracy but very different F1@k and edit scores. ED-TCN tends to produce many fewer over-segmentations than competing methods. Figure 5 shows mid-level predictions for these models. Accuracy and F1 for each prediction are included for comparison.

Many errors on this dataset are due to the extreme similarity between actions and subtle differences in object appearance. For example, our models confuse actions using the vinegar and olive oil bottles, which have a similar appearance. Similarly, we confuse some cutting actions (e.g., cut cucumber versus cut tomato) and placing actions (e.g., place cheese versus place lettuce).

MERL Shopping: We compare against two sets of predictions from Singh [?], as shown in Table ?. The first, as reported in their paper, is a sparse set of action detections, which are referred to as MSN Det. The second, obtained from the authors, is a set of dense (per-frame) action segmentations. The detections use activations from the dense segmentations with a non-maximal suppression detection algorithm to output a sparse set of segments. Their causal version uses LSTM on the dense activations and their acausal version uses Bidirectional LSTM.

While Singh’s detections achieve very high midpoint mAP, the same predictions perform very poorly on the other metrics. As visualized in Figure 5 (right), the actions are very short and sparse. This is advantageous when optimizing for midpoint mAP, because performance only depends on the midpoint of a given action; however, it is not effective if you require the start and stop time of an activity. Interestingly, this method does worst in F1 even for low overlaps.

As expected, the acausal TCNs perform much better than the causal variants. This verifies that using future information is important for achieving the best performance. In both the causal and acausal results the Dilated TCN outperforms ED-TCN in midpoint mAP; however, the F1 scores are better for ED-TCN. This suggests the confidences for the Dilated TCN are more reliable than those for ED-TCN.

Georgia Tech Egocentric Activities: Performance of the ED-TCN is on par with the ensemble approach of Singh [?], which combines EgoNet features with TDD [?]. Recall that Singh’s approach does not incorporate a temporal model, so we expect that combining their features with our ED-TCN would improve performance. Unlike EgoNet and TDD, our approach uses simpler spatial CNN features which can be computed in real-time.

Overall, in our experiments the Encoder-Decoder TCN outperformed all other models, including state of the art approaches for most datasets and our adaptation of the recent WaveNet model. The most important difference between these models is that ED-TCN uses fewer layers but has longer convolutional filters, whereas the Dilated TCN has more layers but with shorter filters. The long filters in ED-TCN have a strong positive effect on F1 performance, in particular because they prevent over-segmentation issues. The Dilated TCN performs well on metrics like accuracy, but is less robust to over-segmentation. This is likely due to the short filter lengths in each layer.

#### Ablative Experiments

Ablative experiments were performed on 50 Salads (mid-level). Note that these were done with different hyper-parameters and thus may not match previous results.

Activation functions: We assess performance using the activation functions shown in Table ?. The Gated PixelCNN (GPC) activation [?], $f(x) = \tanh(x) \odot \text{sigmoid}(x)$, was used for WaveNet and also achieves high performance on our tasks. We define the Normalized ReLU

$$\text{Norm. ReLU}(x) = \frac{\text{ReLU}(x)}{\max(\text{ReLU}(x)) + \epsilon}$$

for vector $x$ and a small constant $\epsilon$, where the max is computed per-frame. Normalized ReLU outperforms all others with ED-TCN, whereas for Dilated TCN all functions are similar.
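A per-frame Normalized ReLU can be sketched in a few lines, assuming it divides the rectified activations of a frame by their maximum plus a small constant. The function name and the default value of `eps` are assumptions for this example; the constant only guards against division by zero when every activation in a frame is non-positive.

```python
def normalized_relu(frame, eps=1e-5):
    """Rectify one frame of activations and scale by its maximum.

    frame: list of channel activations at a single time step.
    eps: small constant (assumed value) that avoids division by zero.
    """
    rectified = [max(0.0, v) for v in frame]
    peak = max(rectified)
    return [v / (peak + eps) for v in rectified]


print(normalized_relu([2.0, -1.0, 4.0]))
print(normalized_relu([-3.0, -0.5]))
```

After normalization the largest activation in every frame is just under one, so frames with very different activation magnitudes are placed on a comparable scale.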

Receptive fields: We compare performance with varying receptive field hyperparameters. Lines in Figure 4 (left) show F1@25 on ED-TCN as the number of layers $L$ and the filter size $d$ vary. Lines in Figure 4 (right) correspond to block count $B$ with varying layer count $L$ for a Dilated TCN. Note that our GPU ran out of memory for the largest ED-TCN and Dilated TCN configurations. The ED-TCN performs best with a receptive field of 44 frames, which corresponds to 52 seconds. The Dilated TCN performs best at 128 frames and achieves similar performance at 96 frames.

Training time: It takes much less time to train a TCN than a Bi-LSTM. While the exact timings vary with the number of TCN layers and filter lengths, for one split of 50 Salads – using an Nvidia Titan X for 200 epochs – it takes about a minute to train the ED-TCN, whereas it takes about 30 minutes to train the Bi-LSTM. This speedup comes from the fact that activations within each TCN layer are all independent, and thus they can be computed in batch on a GPU. Activations in intermediate RNN layers depend on previous activations within that layer, so operations must be applied sequentially.

## 5 Conclusion

We introduced Temporal Convolutional Networks, which use a hierarchy of convolutions to capture long-range temporal patterns. We showed on synthetic data that TCNs are capable of capturing complex patterns such as compositions and action durations, and that they are robust to time delays. Our models outperformed strong baselines, including Bidirectional LSTM, and achieved state of the art performance on challenging datasets. We believe TCNs are a formidable alternative to RNNs and are worth further exploration.

Acknowledgments: Thanks to Bharat Singh and the group at MERL for discussions on their dataset and for letting us use the spatiotemporal features as input into our model. We also thank Alexander Richard for his 50 Salads predictions.

### Footnotes

1. The release of the MERL Shopping features depends on the permission of the original authors. All other features will be available.