Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks

Yanghao Li¹, Cuiling Lan²*, Junliang Xing³, Wenjun Zeng², Chunfeng Yuan³, Jiaying Liu¹*

¹ Institute of Computer Science and Technology, Peking University
² Microsoft Research Asia
³ Institute of Automation, Chinese Academy of Sciences

{lyttonhao, liujiaying}@pku.edu.cn, {culan, wezeng}@microsoft.com, {jlxing, cfyuan}@nlpr.ia.ac.cn

* Corresponding author. This work was done at Microsoft Research Asia.
Abstract

Human action recognition from well-segmented 3D skeleton data has been intensively studied and has attracted increasing attention. Online action detection goes one step further and is more challenging: it identifies the action type and localizes the action positions on the fly from untrimmed stream data. In this paper, we study the problem of online action detection from streaming skeleton data. We propose a multi-task end-to-end Joint Classification-Regression Recurrent Neural Network to better explore the action type and temporal localization information. By employing a joint classification and regression optimization objective, this network is capable of automatically localizing the start and end points of actions more accurately. Specifically, by leveraging the merits of the deep Long Short-Term Memory (LSTM) subnetwork, the proposed model automatically captures complex long-range temporal dynamics, which naturally avoids the typical sliding window design and thus ensures high computational efficiency. Furthermore, the regression subtask provides the ability to forecast the action prior to its occurrence. To evaluate our proposed model, we build a large streaming video dataset with annotations. Experimental results on our dataset and the public G3D dataset both demonstrate very promising performance of our scheme.

Keywords:
Action Detection, Recurrent Neural Network, Joint Classification-Regression

1 Introduction

Human action detection is an important problem in computer vision with broad practical applications such as visual surveillance, human-computer interaction and intelligent robot navigation. Unlike action recognition and offline action detection, which determine the action after it is fully observed, online action detection aims to detect the action on the fly, as early as possible. It is highly desirable to accurately and promptly localize the start and end points of an action in time and determine the action type, as illustrated in Fig. 1. Besides, it is also desirable to forecast the start and end of actions prior to their occurrence. For example, for an intelligent robot system, in addition to accurate detection of actions, it would also be appreciated if the system could predict the start of an impending action or the end of an ongoing action, and then get something ready for the person it serves, e.g., passing towels when he/she finishes washing hands. Such a detection and forecast system could respond to impending or ongoing events accurately and as soon as possible, providing a better user experience.

For human action recognition and detection, many approaches have been designed for RGB videos recorded by 2D cameras over the past couple of decades [1]. In recent years, with the prevalence of affordable color-depth sensing cameras such as the Microsoft Kinect [2], it has become much easier and cheaper to obtain depth data and thus the 3D skeleton of the human body (see the skeleton examples in Fig. 1). Biological observations suggest that the skeleton, as an intrinsic high-level representation, is very valuable information for human action recognition [3]. In comparison to RGB video, such a high-level skeleton representation is robust to illumination and cluttered backgrounds [4], but may not be appropriate for recognizing fine-grained actions with marginal differences. Taking advantage of the skeleton representation, in this paper we investigate skeleton-based human action detection. The addition of RGB information may result in better performance and will be addressed in future work.

Figure 1: Illustration of online action detection. It aims to determine the action type and localization on the fly. It is also desirable to forecast the start and end points (e.g., a number of frames ahead).

Although online action detection is of great importance, there are very few works specifically designed for it [5, 6]. Moreover, efficient exploitation of advanced recurrent neural networks (RNNs) for the temporal localization of actions has not been well studied. Most published methods are designed for offline detection [7], which performs detection after fully observing the sequence. To localize an action, most previous works employ a sliding window design [8, 5, 9, 10], which divides the sequence into overlapping clips before action recognition/classification is performed on each clip. Such a sliding window design has low computational efficiency. A method that divides the continuous sequence into short clips, with shot boundaries detected by computing color and motion histograms, was proposed in [11]. However, such indirect, unsupervised modeling of action localization does not provide satisfactory performance. An algorithm that can intelligently localize actions on the fly, suitable for streaming sequences with actions of uncertain length, is much needed. For action recognition on a segmented clip, deep learning methods, such as convolutional neural networks and recurrent neural networks, have shown superior performance in feature representation and temporal dynamics modeling [12, 13, 14, 15]. However, how to design an efficient online action detection system that leverages neural networks for untrimmed streaming data has not been well studied.

Figure 2: Architecture of the proposed joint classification-regression RNN framework for online action detection and forecasting.

In this paper, we propose a Joint Classification-Regression Recurrent Neural Network to accurately detect the actions and localize the start and end positions of the actions on the fly from the streaming data. Fig. 2 shows the architecture of the proposed framework. Specifically, we use LSTM [16] as the recurrent layers to perform automatic feature learning and long-range temporal dynamics modeling. Our network is end-to-end trainable by optimizing a joint objective function of frame-wise action classification and temporal localization regression. On one hand, we perform frame-wise action classification, which aims to detect the actions timely. On the other hand, to better localize the start and end of actions, we incorporate the regression of the start and end points of actions into the network. We can forecast their occurrences in advance based on the regressed curve. We train this classification and regression network jointly to obtain high detection accuracy. Note that the detection is performed frame-by-frame and the temporal information is automatically learnt by the deep LSTM network without requiring a sliding window design, which is time efficient.

The main contributions of this paper are summarized as follows:

  • We investigate the new problem of online action detection for streaming skeleton data by leveraging recurrent neural network.

  • We propose an end-to-end Joint Classification-Regression RNN to address our target problem. Our method leverages the advantages of RNNs for frame-wise action detection and forecasting without requiring a sliding window design or explicit looking forward or backward.

  • We build a large action dataset for the task of online action detection from streaming sequence.

2 Related Work

2.1 Action Recognition and Action Detection

Action recognition and detection have attracted considerable research interest in recent years. Most methods are designed for action recognition [13, 14, 17], i.e., recognizing the action type from a well-segmented sequence, or for offline action detection [18, 10, 8, 19]. However, in many applications it is desirable to recognize the action on the fly, without waiting for the completion of the action, e.g., to reduce the response delay in human-computer interaction. In [5], a learning formulation based on a structural SVM is proposed to recognize partial events, enabling early detection. To reduce the observational latency of human action recognition, a non-parametric moving pose framework [6] and a dynamic integral bag-of-words approach [20] were proposed to detect actions earlier. Our model goes beyond early detection: besides providing frame-wise class information, it forecasts the occurrence of the start and end of actions.

To localize actions in a streaming video sequence, existing detection methods utilize either a sliding-window scheme [8, 5, 9, 10] or action proposal approaches [21, 22, 11]. These methods usually suffer from low computational efficiency or unsatisfactory localization accuracy due to the overlapping window design and unsupervised localization. Besides, it is not easy to determine the sliding-window size.

Our framework addresses online action detection in such a way that it can predict the action at each time slot efficiently without requiring a sliding window design. We use a regression design, learned in a supervised manner during training, to determine the start/end points, making the localization more accurate. Furthermore, the framework forecasts the start of impending actions and the end of ongoing ones.

2.2 Deep Learning

Recently, deep learning has been exploited for action recognition [17]. Instead of using hand-crafted features, deep learning can automatically learn robust feature representations from raw data. To model temporal dynamics, RNNs have also been used for action recognition. A Long-term Recurrent Convolutional Network (LRCN) [13] is proposed for activity recognition, where the LRCN model contains several Convolutional Neural Network (CNN) layers to extract visual features followed by LSTM layers to handle temporal dynamics. A hybrid deep learning framework is proposed for video classification [12], where LSTM networks are applied on top of the two types of CNN-based features related to the spatial and the short-term motion information. For skeleton data, hierarchical RNN [14] and fully connected LSTM [15] are investigated to model the temporal dynamics for skeleton based action recognition.

Despite the many efforts in action recognition, which uses pre-segmented sequences, there are few works applying RNNs to action detection and forecasting tasks. Motivated by the advantages of RNNs in sequence learning and in other online detection tasks (e.g., audio onset [23] and driver distraction [24] detection), we propose a Joint Classification-Regression RNN to automatically localize actions and determine the action type on the fly. In our framework, the designed LSTM network simultaneously performs feature extraction and temporal dynamics modeling. Thanks to the long- and short-term memorization of LSTM, we do not need to assign an observation window, as sliding window based approaches do for action type determination, and we avoid repeated computation. This enables our design to achieve superior detection performance with low computational complexity.

3 Problem Formulation

In this section, we formulate the online action detection problem. To help clarify the differences, offline action detection is first discussed.

3.1 Offline Action Detection

Given a video observation $V = \{v_0, v_1, \ldots, v_{T-1}\}$ composed of frames from time $0$ to $T-1$, the goal of action detection is to determine whether a frame $v_t$ at time $t$ belongs to an action among the $M$ predefined action classes.

Without loss of generality, the target classes for the frame $v_t$ are denoted by a label vector $\mathbf{y}_t = [y_{t,0}, y_{t,1}, \ldots, y_{t,M}]$, where $y_{t,j} = 1$ means the presence of an action of class $j$ at this frame and $y_{t,j} = 0$ means the absence of this action. Besides the $M$ classes of actions, a blank class is added to represent the situation in which the current frame does not belong to any predefined action. Since the entire sequence is known, the determination of the classes at each time slot is to maximize the posterior probability

$\mathbf{y}_t^{*} = \arg\max_{\mathbf{y}_t} P(\mathbf{y}_t \mid v_0, \ldots, v_{T-1})$   (1)

where $\mathbf{y}_t$ is the possible action label vector for frame $v_t$. Therefore, conditioned on the entire sequence $v_0, \ldots, v_{T-1}$, the action label $\mathbf{y}_t^{*}$ with the maximum probability is chosen as the status of frame $v_t$ in the sequence.

According to the action label of each frame, an occurring action can be represented in the form $(g, t_s, t_e)$, where $g$ denotes the class type of the action, and $t_s$ and $t_e$ correspond to the starting and ending time of the action, respectively.

3.2 Online Action Detection

In contrast to offline action detection, which makes use of the whole video to make decisions, online detection is required to determine which actions the current frame belongs to without using future information. Thus, the method must automatically estimate the start time and status of the current action. The problem can be formulated as

$\mathbf{y}_t^{*} = \arg\max_{\mathbf{y}_t} P(\mathbf{y}_t \mid v_0, \ldots, v_t)$   (2)

Besides determining the action label, an online action detection system for streaming data is also expected to predict the starting and ending time of an action. We should be aware of the occurrence of an action as early as possible and be able to predict its end. For example, for an action $(g, t_s, t_e)$, the system is expected to forecast the start and end of the action during $[t_s - T_f, t_s]$ and $[t_e - T_f, t_e]$, respectively, ahead of its occurrence. $T_f$ can be regarded as the statistically expected forecasting time. We define the optimization problem as

$(\mathbf{y}_t^{*}, \mathbf{s}_t^{*}, \mathbf{e}_t^{*}) = \arg\max_{\mathbf{y}_t, \mathbf{s}_t, \mathbf{e}_t} P(\mathbf{y}_t, \mathbf{s}_t, \mathbf{e}_t \mid v_0, \ldots, v_t)$   (3)

where $\mathbf{s}_t$ and $\mathbf{e}_t$ are two vectors denoting whether actions are to start or to stop within the following $T_f$ frames, respectively. For example, $s_{t,j} = 1$ means the action of class $j$ will start within $T_f$ frames.
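To make the online setting concrete, the frame-by-frame decision process can be sketched as a loop that carries recurrent state forward, so the prediction at time $t$ uses only the frames up to $v_t$. The `model_step` function here is a hypothetical stand-in for the per-frame network, not the paper's actual model:

```python
def online_detect(model_step, frames):
    """Frame-by-frame online detection: the decision at time t uses only
    v_0..v_t (Eq. 2), carried implicitly through the recurrent state."""
    state = None
    labels = []
    for v_t in frames:
        y_t, state = model_step(v_t, state)  # hypothetical per-frame model
        labels.append(y_t)
    return labels
```

The key property is that `frames` can be an unbounded stream: no future frame is ever inspected, and no sliding window over past frames is kept explicitly.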

4 Joint Classification-Regression RNN for Online Action Detection

We propose an end-to-end Joint Classification-Regression framework based on RNN to address the online action detection problem. Fig. 2 shows the architecture of the proposed network, which has a shared deep LSTM network for feature extraction and temporal dynamic modeling, a classification subnetwork and a regression subnetwork. Note that we construct the deep LSTM network by stacking three LSTM layers and three non-linear fully-connected (FC) layers to have powerful learning capability. We first train the classification network for the frame-wise action classification. Then under the guidance of the classification results through the Soft Selector, we train the regressor to obtain more accurate localization of the start and end time points.

In the following, we first briefly review the RNNs and LSTM to make the paper self-contained. Then we introduce our proposed joint classification-regression network for online action detection.

4.1 Overview of RNN and LSTM

In contrast to traditional feedforward neural networks, RNNs have self-connected recurrent connections which model the temporal evolution. The output response $\mathbf{h}_t$ of a recurrent hidden layer can be formulated as follows [25]

$\mathbf{h}_t = \theta\left(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}_h\right)$   (4)

where $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ are the mapping matrices from the current inputs to the hidden layer and from the hidden layer to itself, respectively. $\mathbf{b}_h$ denotes the bias vector and $\theta$ is the activation function of the hidden layer.
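As a concrete illustration of the recurrence in (4), here is a minimal NumPy sketch, assuming tanh as the activation $\theta$ (the dimensions are arbitrary toy values, not the paper's):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of the recurrence h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy example: 3-dim input, 4-dim hidden state, a 5-frame sequence.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3)) * 0.1
W_hh = rng.standard_normal((4, 4)) * 0.1
b_h = np.zeros(4)

h = np.zeros(4)
for x_t in rng.standard_normal((5, 3)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h now summarizes all frames so far
```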

Figure 3: The structure of an LSTM neuron, which contains an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. Information is saved in the cell $c_t$.
Figure 4: Illustration of the confidence values around the start point and end point, which follow Gaussian-like curves with the confidence value 1 at the start and end point.

The above RNNs have difficulty in learning long-range dependencies [26] due to the vanishing gradient effect. To overcome this limitation, recurrent networks using LSTM [16, 25, 14] have been designed to mitigate the vanishing gradient problem and learn the long-range contextual information of a temporal sequence. Fig. 3 illustrates a typical LSTM neuron. In addition to a hidden output $h_t$, an LSTM neuron contains an input gate $i_t$, a forget gate $f_t$, a memory cell $c_t$, and an output gate $o_t$. At each timestep, the neuron can choose to read, write or reset the memory cell through the three gates. This strategy allows LSTM to memorize and access information many timesteps in the past.
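The gating mechanism described above can be sketched as follows. This is the standard LSTM cell update with the gate parameters stacked into single matrices, given for illustration rather than as the paper's exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters for the input
    gate i, forget gate f, cell candidate g and output gate o."""
    z = W @ x_t + U @ h_prev + b      # shape (4*H,), H = hidden size
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])               # input gate: what to write
    f = sigmoid(z[H:2*H])             # forget gate: what to keep
    g = np.tanh(z[2*H:3*H])           # candidate cell content
    o = sigmoid(z[3*H:4*H])           # output gate: what to expose
    c = f * c_prev + i * g            # memory cell update
    h = o * np.tanh(c)                # hidden output
    return h, c
```

Because the cell state $c_t$ is updated additively, gradients can flow over many timesteps, which is what lets the detector remember context without an explicit observation window.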

4.2 Subnetwork for Classification Task

We first train an end-to-end classification subnetwork for frame-wise action recognition. The structure of this classification subnetwork is shown in the upper part of Fig. 2. Each frame first goes through the deep LSTM network, which is responsible for modeling the spatial structure and temporal dynamics. A fully-connected layer FC1 and a SoftMax layer are then added for the classification of the current frame. The output of the SoftMax layer is the probability distribution over the action classes. Following the problem formulation described in Sec. 3, the objective of this classification task is to minimize the cross-entropy loss

$\mathcal{L}_c = -\sum_{t=0}^{T-1} \sum_{j=0}^{M} z_{t,j} \ln P(y_{t,j} \mid v_0, \ldots, v_t)$   (5)

where $z_{t,j}$ corresponds to the groundtruth label of frame $v_t$ for class $j$: $z_{t,j} = 1$ means the groundtruth class of frame $v_t$ is the $j$-th class, and $P(y_{t,j} \mid v_0, \ldots, v_t)$ denotes the estimated probability of frame $v_t$ belonging to action class $j$.

We train this network with Back Propagation Through Time (BPTT) [27] and use stochastic gradient descent with momentum to compute the derivatives of the objective function with respect to all parameters. To prevent over-fitting, we utilize dropout at the three fully-connected layers.
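A minimal sketch of the frame-wise cross-entropy in (5), assuming the groundtruth is given as one integer class per frame (with the blank class among the $M+1$ labels):

```python
import numpy as np

def frame_cross_entropy(probs, labels):
    """Cross-entropy summed over frames: -sum_t sum_j z_{t,j} ln p_{t,j}.

    probs:  (T, M+1) per-frame softmax outputs
    labels: (T,) integer groundtruth class per frame
    Because z_{t,j} is one-hot, only the probability of the true class
    contributes at each frame.
    """
    T = probs.shape[0]
    return -np.sum(np.log(probs[np.arange(T), labels] + 1e-12))
```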

4.3 Joint Classification and Regression

We fine-tune the network from the initialized classification model by jointly optimizing classification and regression. Inspired by the joint classification-regression models used in Random Forests [28, 29] for other tasks (e.g., segmentation [28] and object detection [29]), we propose joint learning to simultaneously perform frame-wise classification, localize the start and end points of actions, and forecast them.

We define a confidence factor for each frame to measure the possibility of the current frame being the start or end point of some action. To better localize the start or end point, we use a Gaussian-like curve to describe the confidence, centered at the actual start (or end) point as illustrated in Fig. 4. Taking the start point as an example, the confidence of frame $v_t$ with respect to the start point of an action is defined as

$c_t^{s} = e^{-(t - t_s^{*})^{2} / (2\sigma^{2})}$   (6)

where $t_s^{*}$ is the start point of the action nearest (in time) to frame $v_t$, and $\sigma$ is the parameter which controls the shape of the confidence curve. Note that at the start point, i.e., $t = t_s^{*}$, the confidence value is 1. Similarly, we denote the confidence of frame $v_t$ being the end point of an action as $c_t^{e}$. On the Gaussian-like curve, a lower confidence value indicates that the current frame is farther from the start point, and the peak indicates the start point itself.

Such a design has two benefits. First, it is easy to localize the start/end point by checking the regressed peak points. Second, it gives the system the ability to forecast: we can forecast the start (or end) of actions according to the current confidence response. We set a confidence threshold $th_s$ (or $th_e$) according to the sensitivity requirement of the system for predicting the start (or end) point. When the current confidence value is larger than $th_s$ (or $th_e$), we consider that an action may start (or end) soon. Usually, a larger threshold corresponds to a later response but a more accurate forecast.
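The Gaussian-like confidence target of (6) and the threshold-based forecasting rule can be sketched as follows; the threshold value and the toy frame range are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def start_confidence(t, t_start, sigma=5.0):
    """Gaussian-like confidence of frame t being the start point (Eq. 6):
    exp(-(t - t_start)^2 / (2 sigma^2)), peaking at 1 when t == t_start."""
    return np.exp(-((t - t_start) ** 2) / (2.0 * sigma ** 2))

# Forecasting: flag "start soon" once the confidence exceeds a threshold.
th_start = 0.4                                  # illustrative threshold
frames = np.arange(0, 40)
conf = start_confidence(frames, t_start=25.0)
forecast_frames = frames[conf > th_start]       # frames predicting an imminent start
```

With a higher `th_start`, the first flagged frame moves closer to the true start: a later but more confident forecast, exactly the trade-off described above.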

Using the confidence as the target values, we include this regression problem as another task in our RNN model, as shown in the lower part of Fig. 2. This regression subnetwork consists of a non-linear fully-connected layer FC2, a Soft Selector layer, and a non-linear fully-connected layer FC3. Since we regress one type of confidence values for all the start points of different actions, we need to use the output of the action classification to guide the regression task. Therefore, we design a Soft Selector module to generate more specific features by fusing the output of SoftMax layer which describes the probabilities of classification together with the output of the FC2 layer.

We achieve this by class-specific element-wise multiplication of the outputs of the SoftMax and FC2 layers. The information from the SoftMax layer of the classification task plays the role of class-based feature selection over the output features of FC2 for the regression task. A simplified illustration of the Soft Selector is shown in Fig. 5. Assume we have 5 action classes and the output of the FC2 layer is reshaped into a matrix with 5 columns (7×5 in the figure). The vector (marked by circles) of dimension 5 from the SoftMax output denotes the probabilities of the current frame belonging to the 5 classes, respectively. Element-wise multiplication is performed on each row of the features, so that integrating the SoftMax output acts as feature selection for the different classes.
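A sketch of the Soft Selector's class-specific element-wise multiplication, using the toy sizes of Fig. 5 (5 classes, FC2 features reshaped into rows of 5; the concrete dimensions are illustrative assumptions):

```python
import numpy as np

def soft_selector(fc2_out, softmax_out):
    """Class-specific element-wise multiplication: every row of the
    reshaped FC2 features is weighted by the per-class probabilities,
    so features of unlikely classes are suppressed."""
    n_classes = softmax_out.shape[0]
    feats = fc2_out.reshape(-1, n_classes)  # rows of class-aligned features
    return feats * softmax_out              # broadcast over rows

# Toy example: 5 classes, a 35-dim FC2 output reshaped to 7 rows x 5 classes.
probs = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # SoftMax output
fc2 = np.ones(35)
selected = soft_selector(fc2, probs)
```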

The final objective function of the Joint Classification-Regression is formulated as

$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_r = -\sum_{t=0}^{T-1} \sum_{j=0}^{M} z_{t,j} \ln P(y_{t,j} \mid v_0, \ldots, v_t) + \lambda \sum_{t=0}^{T-1} \left[ \ell(p_t^{s}, c_t^{s}) + \ell(p_t^{e}, c_t^{e}) \right]$   (7)

where $p_t^{s}$ and $p_t^{e}$ are the predicted confidence values for the start and end points, $\lambda$ is the weight of the regression task, and $\ell$ is the regression loss function on the predicted and target confidences. In training, the overall loss is a summation of the losses over all frames $v_t$, $t \in [0, T-1]$. For a frame $v_t$, the loss consists of the classification loss, represented by the cross-entropy over the classes, and the regression loss for identifying the start and end of the nearest action.
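Putting the two terms together, a sketch of the joint objective (7), assuming a squared-error regression loss $\ell(p, c) = (p - c)^2$ (one common choice; the source does not fix the form here):

```python
import numpy as np

def joint_loss(probs, labels, pred_start, pred_end, conf_start, conf_end, lam=10.0):
    """Joint objective (Eq. 7): classification cross-entropy plus a
    lambda-weighted regression loss on the start/end confidences,
    summed over all frames. Squared error is assumed for the regression."""
    T = probs.shape[0]
    cls = -np.sum(np.log(probs[np.arange(T), labels] + 1e-12))
    reg = np.sum((pred_start - conf_start) ** 2 + (pred_end - conf_end) ** 2)
    return cls + lam * reg
```

Gradually raising `lam` from 0, as done during fine-tuning, lets the network first settle the classification task before the regression targets start shaping the shared LSTM features.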

Figure 5: Soft Selector for the fusion of classification output and features from FC2. Element-wise multiplication is performed for each row of features (we only show the first two rows here).

We fine-tune the entire network over the initialized classification model by minimizing the objective function of the joint classification-regression optimization. Note that, to enable the classification result to indicate which action will begin soon, during training we set the groundtruth label of each action's class not only within the action's interval but also for the frames ahead of its start, according to the expected forecast-forward value $T_f$ defined in Sec. 3. Then, for each frame, the classification output indicates the impending or ongoing action class while the two confidence outputs give the probability of being a start or end point. We take the peak positions of the confidences as the predicted action start (or end) times. Note that since the classification and regression results of the current frame depend only on the current input and the previously memorized information in the LSTM network, the system does not need to explicitly look back, avoiding a sliding window design.
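Localizing the start/end points from the regressed confidence curves then reduces to picking local maxima above the response threshold; a minimal sketch:

```python
import math

def peak_frames(conf, th):
    """Indices of local maxima of a confidence curve that exceed the
    response threshold; these are taken as predicted start/end frames."""
    peaks = []
    for t in range(1, len(conf) - 1):
        if conf[t] > th and conf[t] >= conf[t - 1] and conf[t] > conf[t + 1]:
            peaks.append(t)
    return peaks

# Toy confidence curve peaking at frame 10 (Gaussian-like, sigma = 5).
curve = [math.exp(-((t - 10.0) ** 2) / 50.0) for t in range(30)]
```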

5 Experiments

In this section, we evaluate the detection and forecast performance of the proposed method on two different skeleton-based datasets. The reason why we choose skeleton-based datasets is three-fold. First, skeleton joints are captured in full 3D space, which provides more comprehensive information for action detection than 2D images. Second, skeleton joints well represent the posture of a human, providing accurate positions and capturing human motion. Finally, the dimension of the skeleton representation is low, i.e., 25×3 = 75 values per frame from Kinect v2. This makes skeleton-based online action detection attractive for real applications.

5.1 Datasets and Settings

Most published RGB-D datasets were generated for the classification task, where actions are already pre-segmented [30, 31]; they are only suitable for action recognition. Thus, besides using an existing skeleton-based detection dataset, the Gaming Action Dataset (G3D) [32], we collect a new online streaming dataset, following rules similar to those of previous action recognition datasets, which is much more appropriate for the online action detection problem. In this work, similar to [15], normalization is performed on each skeleton frame to achieve invariance to position.

Gaming Action Dataset (G3D). The G3D dataset contains 20 gaming actions captured by Kinect, grouped into seven categories, such as fighting, tennis and golf. Some limitations of this dataset are that the number and occurrence order of actions in the videos are fixed and the actors are motionless between performing different actions, which makes the dataset a little unrealistic.

Online Action Detection Dataset (OAD). This is our newly collected action dataset with long sequences for the online action detection problem. The dataset was captured using the Kinect v2 sensor, which collects color images, depth images and human skeleton joints synchronously, in a daily-life indoor environment. Different actors freely performed 10 actions, including drinking, eating, writing, opening a cupboard, washing hands, opening a microwave, sweeping, gargling, throwing trash, and wiping. We collected 59 long video sequences at 8 fps (103,347 frames in total, 216 minutes). Note that the low recording frame rate is due to the speed limitation of writing a large amount of data (i.e., skeleton and high-resolution RGB-D) to the disk of our laptop.

Since the Kinect v2 sensor provides more accurate depth, our dataset has more accurately tracked skeleton positions than previous skeleton datasets. In addition, the acting order and duration of the actions are arbitrary, approaching real-life scenarios. Each sequence is very long and there are variable idle periods between actions, which meets the requirements of realistic online action detection from streaming videos.

Network and Parameter Settings. We show the architecture of our network in Fig. 2. The numbers of neurons in the deep LSTM network are 100, 100, 110, 110, 100, 100 for the first six layers (three LSTM layers and three FC layers), respectively. This design choice (i.e., the LSTM architecture) is motivated by previous works [14, 15]. The number of neurons in the FC1 layer corresponds to the number of action classes, and the number of neurons in the FC2 layer is set to 10 times the number of classes. The FC3 layer has two neurons, corresponding to the start and end confidences, respectively. The forecast response thresholds can be set based on the requirements of the application. In this paper, we set the forecast-forward value $T_f$ to around one second for the following experiments. The parameter $\sigma$ in (6) is set to 5. The weight $\lambda$ in the final loss function (7) is increased gradually from 0 to 10 during the fine-tuning of the entire network. Note that we use the same parameter settings for both the OAD and G3D datasets.

For our OAD dataset, we randomly select 30 sequences for training and 20 sequences for testing. The remaining 9 long videos are used for the evaluation of the running speed. For the G3D dataset, we use the same setting as used in [32].

5.2 Action Detection Performance Evaluation

Evaluation criteria. We use three different evaluation protocols to measure the detection results.

  1. F1-Score. Similar to the protocols used in object detection from images [33], we define a criterion to determine a correct detection. A detection is correct when the overlapping ratio $\alpha$ between the predicted action interval $I$ and the groundtruth interval $I^{*}$ exceeds a given threshold. $\alpha$ is defined as

     $\alpha = \frac{|I \cap I^{*}|}{|I \cup I^{*}|}$     (8)

     where $I \cap I^{*}$ denotes the intersection of the predicted and groundtruth intervals and $I \cup I^{*}$ denotes their union. With the above criterion determining a correct detection, the F1-Score is defined as

     $F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$     (9)
  2. SL-Score. To evaluate the accuracy of the localization of the start point of an action, we define a Start Localization Score (SL-Score) based on the relative distance between the predicted and groundtruth start times. Suppose the detector predicts that an action will start at time $t'_s$ and the corresponding groundtruth action interval is $[t_s, t_e]$; the score is calculated as $e^{-|t'_s - t_s| / (t_e - t_s)}$. For false positive or false negative samples, the score is set to 0.

  3. EL-Score. Similarly, the End Localization Score (EL-Score) is defined based on the relative distance between the predicted and groundtruth end times.
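The overlap criterion and the start-localization score can be sketched as follows, assuming the exponential form of the score described above:

```python
import math

def overlap_ratio(pred, gt):
    """alpha = |intersection| / |union| for two [start, end] intervals (Eq. 8)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def sl_score(pred_start, gt_start, gt_end):
    """Start-localization score from the relative start-time distance
    (assumed exponential form; false positives/negatives score 0)."""
    return math.exp(-abs(pred_start - gt_start) / (gt_end - gt_start))
```

Note that the score normalizes the start-time error by the action's duration, so the same absolute error is penalized more for short actions than for long ones.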

Baselines. We implemented several baselines for comparison with our proposed Joint Classification-Regression RNN model (JCR-RNN): (i) SVM-SW. We train an SVM detector that detects actions with a sliding window (SW) design. (ii) RNN-SW. This is based on the baseline method Deep LSTM in [15], which employs a deep LSTM network that achieves good results on many skeleton-based action recognition datasets. We train the classifiers and perform detection with a sliding window design. We set the window size to 10 with a step of 5 for both RNN-SW and SVM-SW; we experimented with different window sizes and found that 10 gives relatively good average performance. (iii) CA-RNN. This is a degenerate version of our model that consists only of the LSTM and classification networks, without the regression network. We denote it as the Classification Alone RNN model (CA-RNN).

Actions  SVM-SW  RNN-SW [15]  CA-RNN  JCR-RNN
drinking 0.146 0.441 0.584 0.574
eating 0.465 0.550 0.558 0.523
writing 0.645 0.859 0.749 0.822
opening cupboard 0.308 0.321 0.490 0.495
washing hands 0.562 0.668 0.672 0.718
opening microwave 0.607 0.665 0.468 0.703
sweeping 0.461 0.590 0.597 0.643
gargling 0.437 0.550 0.579 0.623
throwing trash 0.554 0.674 0.430 0.459
wiping 0.857 0.747 0.761 0.780
average 0.540 0.600 0.596 0.653
Table 1: F1-Score on the OAD dataset.

Detection Performance. Table 1 shows the F1-Score of each action class and the average F1-Score over all actions on our OAD dataset. From Table 1, we make the following observations. (i) The RNN-SW method achieves a 6% higher F1-Score than the SVM-SW method, demonstrating that RNNs can better model the temporal dynamics. (ii) Our JCR-RNN outperforms the RNN-SW method by 5.3%. Although the RNN-SW, CA-RNN and JCR-RNN methods all use RNNs for feature learning, one difference is that our schemes are end-to-end trainable without a sliding window. The improvements thus clearly demonstrate that our end-to-end schemes are more efficient than the classical sliding window scheme. (iii) Our JCR-RNN further improves over the CA-RNN and achieves the best performance. Incorporating the regression task into the network and jointly optimizing classification and regression make the localization more accurate and enhance the detection accuracy.

To further evaluate the localization accuracy, we calculate the SL- and EL-Scores on the OAD dataset. The average scores over all actions are shown in Table 2. The proposed scheme achieves the best localization accuracy.

Scores  SVM-SW  RNN-SW [15]  CA-RNN  JCR-RNN
SL      0.316   0.366        0.378   0.418
EL      0.325   0.376        0.382   0.443
Table 2: SL and EL scores on the OAD dataset.

For the G3D dataset, we evaluate the performance in terms of the three types of scores on the seven categories of sequences. To save space, we only show the results for the first two categories, Fighting and Golf, in Tables 3 and 4; further results, which show similar trends, can be found in the supplementary material. The results are consistent with the experiments on our own dataset. We also compare these methods using the action-based F1 metric defined in [32], which treats the detection of an action as correct when the predicted start point is within 4 frames of the ground-truth start point for that action. Note that the action-based F1 only considers the accuracy of the start point. The results are shown in Table 5. The method in [32] uses a traditional boosting algorithm [34], and its scores are significantly lower than those of the other methods.
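The action-based protocol of [32] can be sketched as follows; this is an illustrative reimplementation, not the evaluation code of [32]:

```python
def action_based_f1(pred_starts, gt_starts, tol=4):
    """pred_starts / gt_starts: lists of (label, start_frame).
    Following the G3D protocol, a prediction counts as correct when
    its start frame lies within `tol` frames of an unmatched
    ground-truth start of the same action class."""
    matched = set()
    tp = 0
    for label, s in pred_starts:
        for i, (gl, gs) in enumerate(gt_starts):
            if i not in matched and gl == label and abs(s - gs) <= tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred_starts) if pred_starts else 0.0
    recall = tp / len(gt_starts) if gt_starts else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```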

Action    Scores  SVM-SW  RNN-SW [15]  CA-RNN  JCR-RNN
Fighting  SL      0.318   0.412        0.512   0.528
          EL      0.328   0.419        0.525   0.557
Golf      SL      0.553   0.635        0.789   0.793
          EL      0.524   0.656        0.791   0.836

Table 3: SL and EL scores on the G3D dataset.

Action    SVM-SW  RNN-SW [15]  CA-RNN  JCR-RNN
Fighting  0.486   0.613        0.700   0.735
Golf      0.680   0.745        0.900   0.967

Table 4: F1 score on G3D.

Action    G3D [32]  SVM-SW  RNN-SW [15]  CA-RNN  JCR-RNN
Fighting  58.54     76.72   83.28        94.00   96.18
Golf      11.88     45.00   55.00        50.00   70.00

Table 5: Action-based F1 [32] on G3D.

Method    SVM-SW  RNN-SW [15]  JCR-RNN
Time (s)  1.05    3.14         2.60

Table 6: Average running time (seconds per sequence).

5.3 Action Forecast Performance Evaluation

(a) Forecast of start.
(b) Forecast of end.
Figure 6: The Precision-Recall curves of the start and end time forecast with different methods on the OAD dataset. Overall JCR-RNN outperforms other baselines by a large margin. This figure is best seen in color.

Evaluation criterion. As explained in Section 3, the system is expected to forecast whether an action will start or end within a specified number of frames prior to its occurrence. To be counted as a true positive start forecast, the forecast must not only predict the impending action class, but also do so within this interval immediately preceding the ground-truth start time. The same rule applies to end forecasts. We use the precision-recall curve to evaluate the performance of the action forecast methods. Note that both precision and recall are calculated at the frame level over all frames.

Baselines. Since no previous method has been proposed for the action forecast problem, we use a simple strategy to forecast based on the above detection baselines. SVM-SW, RNN-SW and CA-RNN each output a probability for every action class at each time step. When the probability of an action class exceeds a predefined threshold, we consider that this action will start soon. Similarly, during an ongoing period of an action, when its probability drops below another threshold, we consider the action to end soon.
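The baseline forecast rule described above can be sketched as follows; the threshold values and the `labels` array (the per-frame detected class) are illustrative placeholders, not the settings used in the experiments:

```python
import numpy as np

def forecast_from_probs(probs, labels, start_thr=0.4, end_thr=0.2):
    """probs: (T, C) per-frame class probabilities from a detector.
    labels: (T,) currently detected class per frame (-1 = background).
    Outside an action, flag class c as 'starting soon' once
    probs[t, c] > start_thr; inside an ongoing action, flag it as
    'ending soon' once its probability drops below end_thr."""
    T, C = probs.shape
    start_soon = np.zeros((T, C), dtype=bool)
    end_soon = np.zeros((T, C), dtype=bool)
    for t in range(T):
        for c in range(C):
            if labels[t] != c and probs[t, c] > start_thr:
                start_soon[t, c] = True
            if labels[t] == c and probs[t, c] < end_thr:
                end_soon[t, c] = True
    return start_soon, end_soon
```

Sweeping `start_thr` and `end_thr` over [0, 1] yields the baselines' precision-recall curves in Fig. 6.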

Forecast Performance. In testing, the peak of the regressed confidence curve is taken as the start/end point. When the current confidence value exceeds a threshold but precedes the peak, the frame forecasts that the action will start soon. By adjusting the start and end confidence thresholds of our method, we draw its precision-recall curves; similarly, we draw the curves for the baselines by adjusting their two thresholds. We show them in Fig. 6. The performance of the baselines is significantly inferior to our JCR-RNN. This suggests that the conventional detection probability alone is not suitable for forecasting. One important reason is that the baselines simply treat the frames before the start time as background samples, even though these frames actually contain evidence. In contrast, our regression task assigns these frames graded confidence values, guiding the network to explore the hidden starting and ending patterns of actions. We also note that the forecast precision of all methods is not very high, even though our method is much better. This is because the forecast problem itself is difficult. For example, when a person is writing on a board, it is hard to forecast whether he will finish writing soon.

Fig. 7 shows the confusion matrix of the start forecast by our proposed method, which relates the predicted start action class to the ground-truth class. The matrix shown is obtained when the recall rate equals 40%. Although there are some missed or wrong forecasts, most forecasts are correct. There are also a few interesting observations. For example, the actions eating and drinking may have similar poses before they start. The actions gargling and washing hands are likewise easily confused when forecasting, since both require turning on the tap before starting. Taking human-object interaction into account should help reduce this ambiguity, and we leave it for future work.

Figure 7: Confusion Matrix of start forecast on the OAD dataset. Vertical axis: groundtruth class; Horizontal axis: predicted class.

5.4 Comparison of Running Speeds

In this section, we compare the running speeds of the different methods. Table 6 shows the average running time on 9 long sequences, which contain 3,200 frames on average. SVM-SW is the fastest because of its small model compared with the deep learning methods. RNN-SW runs slower than our method due to its sliding window design. Note that action detection from skeleton input is rather fast, about 1230 fps for the JCR-RNN approach. This is because the dimension of the skeleton input is low (25 joints × 3 coordinates = 75 values per frame) in comparison with RGB input, which makes skeleton-based online action detection very attractive for real applications.
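As a sanity check on the reported throughput, the frame rate follows directly from the per-sequence timing in Table 6:

```python
avg_frames = 3200      # average frames per test sequence (Section 5.4)
secs_per_seq = 2.60    # JCR-RNN time per sequence from Table 6
fps = avg_frames / secs_per_seq
print(round(fps))      # on the order of 1230 frames per second
```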

6 Conclusion and Future Work

In this paper, we propose an end-to-end Joint Classification-Regression RNN to determine the action type and better localize the start and end points on the fly. We leverage the merits of the deep LSTM network to capture complex long-range temporal dynamics and avoid the typical sliding window design. We first pretrain the classification network for frame-wise action classification. Then, with the incorporation of the regression network, our joint model not only localizes the start and end times of actions more accurately but also forecasts their occurrence in advance. Experiments on two datasets demonstrate the effectiveness of our method. In future work, we will introduce more features, such as appearance and human-object interaction information, into our model to further improve the detection and forecast performance.

Acknowledgement. This work was supported by the National High-tech Technology R&D Program (863 Program) of China under Grant 2014AA015205, the National Natural Science Foundation of China under Contracts No. 61472011 and No. 61303178, and the Beijing Natural Science Foundation under Contract No. 4142021.

References

  • [1] Weinland, D., Ronfard, R., Boyerc, E.: A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding 115(2) (2011) 224–241
  • [2] Microsoft Kinect. https://dev.windows.com/en-us/kinect
  • [3] Johansson, G.: Visual perception of biological motion and a model for its analysis. Perception and Psychophysics 14(2) (1973) 201–211
  • [4] Han, F., Reily, B., Hoff, W., Zhang, H.: Space-time representation of people based on 3D skeletal data: a review. arXiv:1601.01006 (2016) 1–20
  • [5] Hoai, M., De la Torre, F.: Max-margin early event detectors. Int’l Journal of Computer Vision 107(2) (2014) 191–202
  • [6] Zanfir, M., Leordeanu, M., Sminchisescu, C.: The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In: Proc. IEEE Int’l Conf. Computer Vision. (2013) 2752–2759
  • [7] Oneata, D., Verbeek, J., Schmid, C.: The LEAR submission at THUMOS 2014. (2014)
  • [8] Siva, P., Xiang, T.: Weakly supervised action detection. In: British Machine Vision Conference. Volume 2., Citeseer (2011)  6
  • [9] Wang, L., Qiao, Y., Tang, X.: Action recognition and detection by combining motion and appearance features. (2014)
  • [10] Sharaf, A., Torki, M., Hussein, M.E., El-Saban, M.: Real-time multi-scale action detection from 3d skeleton data. In: Proc. IEEE Winter Conf. Applications of Computer Vision. (2015) 998–1005
  • [11] Wang, L., Wang, Z., Xiong, Y., Qiao, Y.: CUHK&SIAT submission for THUMOS15 action recognition challenge. (2015)
  • [12] Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. Proc. ACM Int’l Conf. Multimedia (2015)
  • [13] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2015) 2625–2634
  • [14] Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2015) 1110–1118
  • [15] Zhu, W., Lan, C., Xing, J., Zeng, W., Li, Y., Shen, L., Xie, X.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI Conference on Artificial Intelligence. (2016)
  • [16] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8) (1997) 1735–1780
  • [17] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. (2014) 568–576
  • [18] Wei, P., Zheng, N., Zhao, Y., Zhu, S.C.: Concurrent action detection with structural prediction. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2013) 3136–3143
  • [19] Tian, Y., Sukthankar, R., Shah, M.: Spatiotemporal deformable part models for action detection. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2013) 2642–2649
  • [20] Ryoo, M.S.: Human activity prediction: Early recognition of ongoing activities from streaming videos. In: Proc. IEEE Int’l Conf. Computer Vision. (2011) 1036–1043
  • [21] Jain, M., Van Gemert, J., Jégou, H., Bouthemy, P., Snoek, C.G.: Action localization with tubelets from motion. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2014) 740–747
  • [22] Yu, G., Yuan, J.: Fast action proposals for human action detection and search. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2015) 1302–1311
  • [23] Böck, S., Arzt, A., Krebs, F., Schedl, M.: Online real-time onset detection with recurrent neural networks. In: Proc. IEEE Int’l Conf. Digital Audio Effects. (2012)
  • [24] Wollmer, M., Blaschke, C., Schindl, T., Schuller, B., Farber, B., Mayer, S., Trefflich, B.: Online driver distraction detection using long short-term memory. IEEE Transactions on Intelligent Transportation Systems 12(2) (2011) 574–582
  • [25] Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Volume 385. Springer Science & Business Media (2012)
  • [26] Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kremer, Kolen, eds.: A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
  • [27] Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78(10) (1990) 1550–1560
  • [28] Glocker, B., Pauly, O., Konukoglu, E., Criminisi, A.: Joint classification-regression forests for spatially structured multi-object segmentation. In: Proc. European Conference on Computer Vision. Springer (2012) 870–881
  • [29] Schulter, S., Leistner, C., Wohlhart, P., Roth, P.M., Bischof, H.: Accurate object detection with joint classification-regression random forests. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition. (2014) 923–930
  • [30] Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3d points. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition Workshops. (2010) 9–14
  • [31] Yun, K., Honorio, J., Chattopadhyay, D., Berg, T.L., Samaras, D.: Two-person interaction detection using body pose features and multiple instance learning. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition Workshops. (2012) 28–35
  • [32] Bloom, V., Makris, D., Argyriou, V.: G3d: A gaming action dataset and real time action recognition evaluation framework. In: Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition Workshops. (2012) 7–12
  • [33] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. Int’l Journal of Computer Vision 88(2) (2010) 303–338
  • [34] Freund, Y., Schapire, R.E., et al.: Experiments with a new boosting algorithm. In: Proc. Int’l Conf. Machine Learning. Volume 96. (1996) 148–156

Supplementary

Appendix A Experimental Results on the OAD Dataset

A.1 Evaluation of the Soft Selector Module

In this section, we evaluate the influence of the Soft Selector module, which is described in Section 4.3. We implement a variant of our method by removing the Soft Selector module and directly connecting the FC2 and FC3 layers. Table 7 and Table 8 compare the F1, SL and EL scores of our method with and without the Soft Selector on the OAD dataset. Note that we use the same parameters for both settings. We can see that incorporating the Soft Selector brings a significant improvement.
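As a rough illustration of what a Soft Selector-style fusion can look like (a hedged sketch only; the exact feature layout is defined in Section 4.3 of the paper, which is outside this excerpt, and may differ), one can weight class-specific groups of FC2 features by the classification probabilities:

```python
import numpy as np

def soft_selector(fc2_feat, class_probs):
    """Sketch of a Soft Selector-style fusion: per-frame class
    probabilities weight groups of FC2 features element-wise, so that
    features of the currently likely class dominate the regression
    input. Assumes fc2_feat holds one feature group per action class."""
    C = class_probs.shape[-1]
    groups = fc2_feat.reshape(C, -1)           # one feature group per class
    weighted = groups * class_probs[:, None]   # broadcast prob over group
    return weighted.reshape(-1)
```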

Actions w/o Soft Selector with Soft Selector
drinking 0.445 0.574
eating 0.542 0.523
writing 0.714 0.822
opening cupboard 0.526 0.495
washing hands 0.688 0.718
opening microwave 0.697 0.703
sweeping 0.547 0.643
gargling 0.478 0.623
throwing trash 0.608 0.459
wiping 0.691 0.780
overall 0.616 0.653
Table 7: F1 score on the OAD dataset.
Scores  w/o Soft Selector  with Soft Selector
SL      0.389              0.418
EL      0.413              0.443
Table 8: SL and EL scores on the OAD dataset.

A.2 Sliding Window Size of the Baselines

Since the baseline methods SVM-SW and RNN-SW [15] are based on sliding window schemes, we report the F1 score of these methods using different window sizes (in frames) in Tables 9 and 10. The corresponding SL and EL scores are shown in Tables 11 and 12. In the experiments, the stride of the window is set to half of the window size.
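For reference, the window layout used by the sliding-window baselines can be generated as follows (`sliding_windows` is an illustrative helper, with the stride defaulting to half the window size as in these experiments):

```python
def sliding_windows(num_frames, window_size, stride=None):
    """Generate (start, end) frame windows for a sequence, as used by
    the SVM-SW / RNN-SW baselines. The stride defaults to half the
    window size; the last windows are clipped to the sequence length."""
    if stride is None:
        stride = max(1, window_size // 2)
    windows = []
    s = 0
    while s < num_frames:
        windows.append((s, min(s + window_size, num_frames)))
        s += stride
    return windows
```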

Actions SVM-SW JCR-RNN
ws=5 ws=10 ws=20 ws=40
drinking 0.291 0.146 0.000 0.000 0.574
eating 0.507 0.465 0.548 0.050 0.523
writing 0.671 0.645 0.792 0.542 0.822
opening cupboard 0.284 0.308 0.352 0.033 0.495
washing hands 0.501 0.562 0.650 0.308 0.718
opening microwave 0.521 0.607 0.590 0.492 0.703
sweeping 0.434 0.461 0.515 0.498 0.643
gargling 0.466 0.437 0.433 0.083 0.623
throwing trash 0.475 0.554 0.315 0.000 0.459
wiping 0.867 0.857 0.748 0.433 0.780
overall 0.525 0.540 0.565 0.334 0.653
Table 9: F1 score using different window sizes (ws) for the SVM-SW method on the OAD dataset.
Actions RNN-SW [15] JCR-RNN
ws=5 ws=10 ws=20 ws=40
drinking 0.495 0.441 0.142 0.033 0.574
eating 0.525 0.550 0.532 0.183 0.523
writing 0.665 0.859 0.837 0.725 0.822
opening cupboard 0.331 0.321 0.413 0.217 0.495
washing hands 0.588 0.668 0.867 0.517 0.718
opening microwave 0.628 0.665 0.680 0.647 0.703
sweeping 0.475 0.590 0.811 0.783 0.643
gargling 0.426 0.550 0.534 0.108 0.623
throwing trash 0.434 0.674 0.325 0.050 0.459
wiping 0.734 0.747 0.797 0.550 0.780
overall 0.512 0.600 0.627 0.476 0.653
Table 10: F1 score using different window sizes (ws) for the RNN-SW [15] method on the OAD dataset.
Scores  SVM-SW                        JCR-RNN
        ws=5   ws=10  ws=20  ws=40
SL      0.288  0.316  0.339  0.182    0.418
EL      0.300  0.325  0.343  0.184    0.443
Table 11: SL and EL scores using different window sizes (ws) for the SVM-SW method on the OAD dataset.
Scores  RNN-SW [15]                   JCR-RNN
        ws=5   ws=10  ws=20  ws=40
SL      0.287  0.366  0.393  0.276    0.418
EL      0.291  0.376  0.401  0.274    0.443
Table 12: SL and EL scores using different window sizes (ws) for the RNN-SW [15] method on the OAD dataset.

Appendix B Experimental Results on the G3D Dataset

In this section, we show experimental results on all seven categories of videos in the G3D dataset [32]. Note that only the results on the first two categories were shown in the paper due to the space limitation. We evaluate the performance in terms of the F1 score and the SL and EL scores as described in the paper, and summarize them in Table 13 and Table 14. Table 15 compares these methods using the action-based F1 evaluation metric defined in [32].

Action Category SVM-SW RNN-SW [15] CA-RNN JCR-RNN
Fighting 0.486 0.613 0.700 0.735
Golf 0.680 0.745 0.900 0.967
Tennis 0.598 0.480 0.774 0.788
Bowling 0.667 0.889 1.000 1.000
FPS 0.571 0.581 0.378 0.523
Driving 1.000 1.000 1.000 1.000
Misc 0.712 0.742 0.813 0.862
Table 13: F1 score on the G3D dataset.
Action Category  Scores  SVM-SW  RNN-SW [15]  CA-RNN  JCR-RNN
Fighting         SL      0.318   0.412        0.512   0.528
                 EL      0.328   0.419        0.525   0.557
Golf             SL      0.553   0.635        0.789   0.793
                 EL      0.524   0.656        0.791   0.836
Tennis           SL      0.444   0.338        0.605   0.665
                 EL      0.460   0.333        0.617   0.667
Bowling          SL      0.612   0.777        0.933   0.959
                 EL      0.550   0.713        0.816   0.861
FPS              SL      0.351   0.388        0.183   0.311
                 EL      0.353   0.393        0.199   0.327
Driving          SL      0.991   0.983        0.957   0.955
                 EL      0.975   0.975        0.964   0.975
Misc             SL      0.487   0.593        0.609   0.614
                 EL      0.515   0.612        0.690   0.766
Table 14: SL and EL scores on the G3D dataset.
Action Category G3D [32] SVM-SW RNN-SW [15] CA-RNN JCR-RNN
Fighting 58.54 76.72 83.28 94.00 96.18
Golf 11.88 45.00 55.00 50.00 70.00
Tennis 14.85 37.57 36.68 59.04 62.38
Bowling 31.58 22.22 44.44 44.44 66.67
FPS 13.65 35.35 39.89 23.69 33.85
Driving 2.5 39.99 50.00 19.99 50.00
Misc 18.13 53.32 65.24 73.81 86.19
Table 15: Action-based F1 [32] on the G3D dataset.