Am I Done? Predicting Action Progress in Videos


Federico Becattini, Tiberio Uricchio, Lorenzo Seidenari, Alberto Del Bimbo
Media Integration and Communication Center, University of Florence, Italy
{federico.becattini,tiberio.uricchio,lorenzo.seidenari,alberto.delbimbo}

Lamberto Ballan
Department of Mathematics “Tullio Levi-Civita”, University of Padova, Italy
{lamberto.ballan}

In this paper we introduce the problem of predicting action progress in videos. We argue that this is an extremely important task because, on the one hand, it can be valuable for a wide range of applications and, on the other hand, it facilitates better action detection results. To solve this problem we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. Motivated by the recent success of combining Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make framewise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets. Additionally, we show that, by exploiting action progress, it is also possible to improve spatio-temporal localization.

1 Introduction

Many human activities and behaviors rely on understanding what actions are taking place in the surrounding environment, to what point they have advanced, and even when they might be completed. From simple choices, like crossing the street once cars have passed, to more complex activities like intercepting the ball in a basketball game, an agent has to recognize and understand how far an action has advanced at an early stage, based only on what it has seen so far (Fig. 1). If an agent has to act to assist humans, it needs to understand the progress of their intended actions and plan its own actions in real time. It cannot wait for the end of the action to perform the visual processing. Therefore, the ultimate goal of action understanding should be the development of an agent equipped with a fully functional perception-action loop, from predicting an action before it happens, to following its progress until it ends. This is also supported by experiments showing that humans continuously understand the actions of others in order to plan goals accordingly [6]. Consequently, a model that is able to forecast action progress would enable new applications in robotics (e.g., human-robot interaction, real-time goal definition) and autonomous driving (e.g., avoiding road accidents).

Figure 1: Why is action progress prediction important? This example sequence from the movie “Young Frankenstein” shows that in most interactive scenarios it is necessary to accurately predict action progress and react accordingly.

Fully solving the task of action understanding is very hard. The most related literature is that of predictive vision, an emerging area which has gained much interest in recent years. Approaches have been proposed to predict the near future, be that a learned representation [37, 23], a video frame [38, 26] or directly the action that is going to happen [5]. We believe that fully solving action understanding requires not only predicting the future outcome of an action, but also fully understanding what has been observed so far in the progress of an action. As a result, in this paper we introduce the novel task of predicting action progress, i.e., the prediction of how far an action has advanced during its execution. In other words, considering a partial observation of some human action, in addition to understanding which action is happening and where, we want to infer how long this action has been executed for with respect to its expected duration. As a simple example of an application, let us consider the use case of a robot trained to interact socially with humans. The correct behaviour to respond to a handshake would be to anticipate, with the right timing, the arm motion, so as to avoid a socially awkward moment in which the person is left hanging. This kind of task cannot be solved unless the progress of the action is known.

Similar tasks have been addressed in the literature. First of all, predicting action progress is conceptually different from action recognition and detection [9, 42, 45], where the focus is on finding where the action occurred in time and space. Action completion [25, 13, 41, 44, 46] is a closely related task, where the goal is to predict when an action can be classified as complete, in order to improve temporal boundaries and the classifier accuracy on incomplete sequences. However, this is easier than predicting action progress because it does not require predicting the partial progress of an action.

Action progress prediction is an extremely challenging task since, to be of maximum utility, the prediction should be made while observing the video. While a large body of literature addresses action detection and spatio-temporal localization [7, 40, 42, 34, 16, 19, 8, 47], predicting action progress is more related to the variant of online action detection [29, 3, 27]. Here the goal is to accurately detect, as soon as possible, when an action has started and when it has finished, but these methods do not model progress estimation. The additional information provided by action progress may help in a better definition of temporal boundaries. While advancements in deep learning [22, 12] have largely improved performance for action classification and localization, these are still unsolved problems. We hypothesize that a model for action progress would be forced to embed further knowledge of the rich dynamics of videos, thus improving the temporal understanding of actions.

In this paper we propose the first method able to predict action progress using a supervised recurrent neural network fed with convolutional features, and make three primary contributions:

  • We define the new task of action progress prediction, which we believe is useful for developing intelligent planning agents, and introduce an experimental protocol to assess performance.

  • Our approach is holistic: it predicts action progress while performing spatio-temporal action detection.

  • Interestingly, there is a direct benefit in predicting action progress for correctly defining action boundaries. In fact, we show that this also leads to improvements in spatio-temporal action detection on untrimmed video benchmarks such as the UCF-101 dataset.

2 Related Work

Human action understanding has been traditionally framed as a classification task. It has been addressed with a plethora of methods using global features, local features and representation learning [28, 1]. Recently, several tasks have emerged aiming at a more precise semantic annotation of videos, namely action localization, action completion and action prediction.

Figure 2: Proposed Architecture. On the bottom (highlighted in orange), we show the classification and localization data flows for tube generation. On the top (in yellow), our ProgressNet. Region (ROI FC6) and Contextual features (SPP FC6) from the last convolutional map are concatenated and then fed to a Fully Connected layer (FC7). Two cascaded LSTMs perform action progress prediction.

Action detection (localization).

Frame-level action localization has been tackled by extending state-of-the-art object detection approaches [29] to the spatio-temporal domain. A common strategy is to start from object proposals and then perform object detection over RGB and optical flow features using convolutional neural networks [9, 27, 30]. Gkioxari et al. generate action proposals by filtering Selective Search boxes with motion saliency, and fuse motion and temporal decisions using an SVM [9]. More recent approaches devised end-to-end trainable architectures integrating region proposal networks into their models [27, 30, 31, 2, 19, 16, 34, 46, 8]. As discussed in [30], most action detection works do not deal with untrimmed sequences and do not generate action tubes. To overcome this limitation, Saha et al. [30] propose an energy maximization algorithm to link detections obtained with their framewise detection pipeline. Another way of exploiting the temporal constraint is to address action detection in videos as a tracking problem, learning action trackers from data [40]. For certain applications, such as video search and retrieval, a temporal segmentation of actions is considered sufficient. To allow online action detection, Singh et al. [34] adapted the Single Shot Multibox Detector [24] to regress and classify action detection boxes in each frame. Tubes are then constructed in real time via an incremental greedy matching algorithm.

Approaches concentrating on providing the starting and ending timestamps of actions have also been proposed [7, 42, 14, 31]. Heilbron et al. [14] recently proposed a very fast approach to generate temporal action proposals based on sparse dictionary learning. Yeung et al. [42] cast temporal action detection as joint action prediction and iterative boundary refinement, training an RNN agent with reinforcement learning. In Shou et al. [31], a 3D convolutional neural network is stacked with Convolutional-De-Convolutional filters in order to abstract semantics and predict actions at frame-level granularity. They report improved frame-by-frame action detection performance, allowing a more precise localization of temporal boundaries. A distinct line of research explicitly considers the temporal dimension [16, 19], either using 3D ConvNets on tubes [15] or exploiting multiple frames to generate a tube [19]. Zhao et al. [46] introduce an explicit modeling of starting, intermediate and ending phases via structured temporal pyramid pooling for action localization. They show that this assumption helps to infer the completeness of the proposals.

None of these methods understands or predicts the progress of actions; they only identify starting and ending points. Differently from them, we explicitly model and predict action progress, which allows us to significantly improve temporal action detection.

Action completion and prediction.

A closely related line of work addresses the specific case of online action detection. In [15], a method based on Structured Output SVM is proposed to perform early detection of video events. To this end, the authors introduce a score function of class confidences that takes higher values on partially observed actions. In [21], the same task is addressed with a very fast deep network. Interestingly, the authors noted that the predictability of an action can vary widely, from instantly predictable to predictable only near completion.

A model to continuously predict the class of a hand manipulation from the first few movements is proposed in [5]. The authors also show that it is possible to predict the forces of the fingers in real time, and how dexterously an action is performed. The structure of sequential actions is used in Soral et al. [36], in an ego-vision scenario, to predict future actions given the current status. Their purpose is to remind the user in a timely manner of actions missing from the sequence.

The very recent direction of predictive vision [37, 39] is also related to action progress prediction. Given an observed video, the goal is to obtain some kind of prediction of its near future. Vondrick et al. [37] predict a learned representation and a semantic interpretation, while subsequent works predict the entire video frame [38, 26]. All these tasks are complementary to predicting action progress since, instead of analyzing the progress of an action, they focus on predicting the aftermath of an action based on some preliminary observations.

3 Predicting Action Progress

In addition to categorizing actions (action classification), identifying their temporal boundaries within the video (action detection) and localizing the area where they take place within the frames (action localization), our idea is to learn to predict the progress of an ongoing action. We refer to this task as action progress prediction. Additionally, being able to predict how far the action has advanced can help improve detected spatio-temporal tubes by refining their temporal boundaries.

3.1 Problem definition

Given a video composed of frames $\{f_1, \dots, f_N\}$, an action can be represented as a sequence of bounding boxes spanning from a starting frame $f_s$ to an ending frame $f_e$, and enclosing the subject performing the action. This forms what in the literature is usually referred to as an action tube [9, 30]. For each box in a tube at time $t$, we define the action progress as:

$$p_t = \frac{t - s}{e - s} \qquad (1)$$

Therefore, given a frame within a tube in $[f_s, f_e]$, the action progress $p_t$ can be interpreted as the fraction of the action that has already passed. This definition models the framewise action progress as a linearly and monotonically increasing value in the range $[0, 1]$.

Although the definition of progress in Eq. 1 may seem simplistic, a major issue has to be considered in the definition of our task. Formalizing progress as a more structured prediction task, i.e., localizing all action phases, would require a correct and unambiguous definition and annotation of such short-term action boundaries, which is an extremely hard annotation task [32]. On the positive side, a simple linear and monotonic definition of action progress allows learning predictive models from any dataset containing spatio-temporal boundary annotations. This means that our task does not require collecting any additional annotations for existing datasets, since the action progress values can be directly inferred from the existing temporal annotations.
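Under this definition, framewise progress targets can be generated directly from the temporal annotations of a tube. A minimal sketch (the function name is ours):

```python
def progress_targets(t_start, t_end):
    """Linear progress values (Eq. 1) in [0, 1] for every frame of an
    action tube annotated with starting frame t_start and ending frame
    t_end (inclusive)."""
    length = t_end - t_start
    return [(t - t_start) / length for t in range(t_start, t_end + 1)]
```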

3.2 Model Architecture

The whole architecture of our method is shown in Fig. 2, highlighting the first branch dedicated to action classification and localization, and the second branch which predicts action progress. We believe that sequence modelling can have a huge impact on solving the task at hand, since time is a signal that carries highly informative content. Therefore, we treat videos as ordered sequences and propose a temporal model that encodes the action progress with a Recurrent Neural Network. In particular, we use a model with two stacked Long Short-Term Memory (LSTM) layers, with 64 and 32 hidden units respectively, plus a final fully connected layer with a sigmoid activation to predict action progress. We name this model ProgressNet.

Our model emits a prediction at each time step in an online fashion, with the LSTMs' internal state keeping track of the whole past history, i.e., from the beginning of the tube of interest up to the current time step. Since actions can also be seen as transformations of the environment [39], we feed the LSTMs with a feature representing regions and their context. We concatenate a contextual feature, computed by spatial pyramid pooling (SPP) of the whole frame [11], with a region feature extracted with ROI pooling [29]. The two representations are blended with a fully connected layer (FC7). The usage of an SPP layer allows us to encode context information for arbitrarily sized images. To extract region features and perform detection we build upon the framework recently proposed by Saha et al. [30], which is an extension of the popular object detection model Faster R-CNN [29], fine-tuned for action recognition. This model adopts a trained Region Proposal Network (RPN) to generate candidate regions (ROIs) that are likely to contain known classes within an image. Using a ROI pooling layer, these regions are projected onto an intermediate feature map generated by the model, and separate predictions are made for each ROI. We use ReLU non-linearities after every fully connected layer, and dropout to moderate overfitting. In order to feed action tubes to ProgressNet we use the tube generation module from [30], which links bounding boxes across adjacent frames using dynamic programming. Compared to [30], ProgressNet adds a negligible computational footprint of about 1 ms per frame, while the whole processing of a single frame takes approximately 150 ms on a TITAN X Maxwell GPU.
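The progress branch described above can be sketched in PyTorch as follows. Feature dimensions, class and layer names are illustrative assumptions, not the authors' exact configuration; only the LSTM sizes (64 and 32 units), the FC7 fusion with ReLU and dropout, and the sigmoid output follow the text.

```python
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    """Sketch of the progress-prediction branch: FC7 fusion of region
    (ROI FC6) and context (SPP FC6) features, two stacked LSTMs with
    64 and 32 hidden units, and a final sigmoid FC layer."""

    def __init__(self, roi_dim=4096, spp_dim=4096, fc7_dim=1024):
        super().__init__()
        self.fc7 = nn.Sequential(
            nn.Linear(roi_dim + spp_dim, fc7_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
        )
        self.lstm1 = nn.LSTM(fc7_dim, 64, batch_first=True)
        self.lstm2 = nn.LSTM(64, 32, batch_first=True)
        self.fc8 = nn.Linear(32, 1)

    def forward(self, roi_feat, spp_feat):
        # roi_feat, spp_feat: (batch, time, dim) sequences along a tube
        x = self.fc7(torch.cat([roi_feat, spp_feat], dim=-1))
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        # one progress value per time step, squashed into (0, 1)
        return torch.sigmoid(self.fc8(x)).squeeze(-1)
```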

Tube Refinement.

Given an action tube composed of boxes $\{b_s, \dots, b_e\}$, we predict the sequence of progress values $p_t$ for each box and compute its first-order temporal derivative $\dot{p}_t$. We trim the head of the tube where $|\dot{p}_t| < \delta$ and $p_t < \tau_s$, and its tail where $|\dot{p}_t| < \delta$ and $p_t > \tau_e$. Here $\delta$ is used to find when the derivative comes close to zero (i.e., the action is not progressing), while $\tau_s$ and $\tau_e$ ensure that the action is about to begin or has reached a sufficiently advanced point in its execution. The values of $\delta$, $\tau_s$ and $\tau_e$ were selected by cross-validation.
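A possible implementation of this trimming rule is sketched below; the threshold values are placeholders of our own, not the cross-validated values used in the experiments.

```python
import numpy as np

def trim_tube(progress, delta=0.05, lo=0.1, hi=0.9):
    """Drop leading frames where the predicted progress is flat and
    still near the start, and trailing frames where it is flat and
    already near completion. Thresholds are illustrative placeholders."""
    p = np.asarray(progress, dtype=float)
    dp = np.abs(np.gradient(p))          # first-order temporal derivative
    keep = np.ones(len(p), dtype=bool)
    # trim the head: action has not started progressing yet
    for t in range(len(p)):
        if dp[t] < delta and p[t] < lo:
            keep[t] = False
        else:
            break
    # trim the tail: action has already reached completion
    for t in range(len(p) - 1, -1, -1):
        if dp[t] < delta and p[t] > hi:
            keep[t] = False
        else:
            break
    return keep
```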

3.3 Learning

Our spatio-temporal localization network is initialized with the pretrained network from [30], which is based on the VGG-16 architecture [33]. To train ProgressNet we use positive ground truth tubes as training samples. To avoid overfitting, we apply the two following augmentation strategies. First, for every tube we randomly pick a starting point and a duration, so as to generate a shorter or equally long tube (keeping the same ground truth progress values in the chosen interval). For instance, if the picked starting point is at the middle of the video and the duration is half the video length, the corresponding ground truth progress targets would be the values in the interval [0.5, 1]. This also forces the model not to assume 0 as the starting progress value. Second, for every tube we generate a subsampled version, reducing the frame rate by a uniformly sampled random factor. This second strategy helps in generalizing with respect to the speed of execution of different instances of the same action class.
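The two augmentation strategies can be sketched as follows; parameter values and function names are our own assumptions:

```python
import random

def augment_tube(boxes, progress, min_len=8, max_skip=3):
    """Sketch of the two tube augmentations: (1) a random sub-tube that
    keeps the original progress targets, and (2) temporal subsampling
    by a random stride. min_len and max_skip are placeholder values."""
    # (1) random starting point and duration
    start = random.randrange(0, len(boxes) - min_len + 1)
    length = random.randrange(min_len, len(boxes) - start + 1)
    boxes = boxes[start:start + length]
    progress = progress[start:start + length]
    # (2) random frame-rate reduction
    stride = random.randrange(1, max_skip + 1)
    return boxes[::stride], progress[::stride]
```

Note that the progress targets are sliced together with the boxes, so a sub-tube starting mid-action keeps non-zero initial progress, as described above.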

Action progress is defined using Eq. 1. Since one of our goals is to improve action boundaries, we encourage the network to be more precise on the temporal boundaries of an action by using our Boundary Observant loss:


where $\hat{p}_t$ and $p_t$ are the predicted progress and the corresponding ground truth value, respectively.

Figure 3: Comparison between the L2 (left) and the Boundary Observant (right) loss functions. Predicted values and ground truth targets are on the two axes. It can be seen that the Boundary Observant loss is stricter against errors on the action boundaries.

Compared to a standard loss for regression, the Boundary Observant loss penalizes errors on the action boundaries more than in intermediate parts, since we want to precisely identify when the action starts and ends. At the same time, it avoids the trivial solution of always predicting the intermediate value 0.5. Fig. 3 shows the difference between the two loss functions, with predicted values and desired values on the two axes. Note from Eq. 1 that in action progress prediction only values in $[0, 1]$ can be expected.
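As an illustration of the stated behavior (heavier penalties for errors near the boundaries), here is a simple reweighted squared error. This is our own stand-in for exposition only, not the paper's Boundary Observant loss:

```python
import numpy as np

def boundary_observant_like(pred, target):
    """Illustrative stand-in (NOT the paper's exact formula): a squared
    error reweighted so that mistakes near the action boundaries
    (target close to 0 or 1) cost more than mistakes mid-action."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    weight = 1.0 + np.abs(2.0 * target - 1.0)   # 2 at the boundaries, 1 at 0.5
    return float(np.mean(weight * (pred - target) ** 2))
```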

We initialize all layers of ProgressNet with the Xavier method [10] and employ the Adam optimizer [20]. We use dropout with a probability of 0.5 on the fully connected layers.

4 Experiments

In this section we show results on action progress prediction for spatio-temporal tubes and propose two evaluation protocols. Since this is a novel task we introduce some simple baselines and show the benefits of our approach. To further underline the importance of predicting action progress, we exploit our predictions to refine precomputed action tubes and improve their temporal localization.

We experiment on the J-HMDB [18] and UCF-101 [35] datasets. J-HMDB consists of 21 action classes and 928 videos, annotated with body joints from which spatial boxes can be inferred. All the videos are temporally trimmed and contain only one action. We use this dataset to benchmark action progress prediction. UCF-101 contains 24 classes annotated for spatio-temporal action localization. It is a more challenging dataset because actions are temporally untrimmed and there can be more than one action of the same class per video. Moreover, it contains video sequences with large variations in appearance, scale and illumination. We use this dataset to predict action progress but also to measure the improvement in action localization. In order to be comparable with previous spatio-temporal action localization works, we adopt the same split of UCF-101 used in [40, 27, 30, 43]. Note that larger datasets such as THUMOS [17] and ActivityNet [4] do not provide bounding box annotations, and therefore cannot be used in our setting.

4.1 Evaluation protocol and Metrics

In order to evaluate the task of action progress prediction, we introduce two evaluation protocols along with standard Video-AP for spatio-temporal localization.

Framewise Mean Squared Error.

This metric measures how well the model predicts action progress when the spatio-temporal coordinates of the actions are known. Test data is evaluated frame by frame, taking the predictions on the ground truth boxes and comparing them with the action progress targets. We compute the mean squared error for each class. Being computed on ground truth boxes, this metric assumes perfect detections and thus disregards the action detection task, only evaluating how well progress prediction works.
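This protocol can be sketched as follows (names are ours):

```python
import numpy as np

def classwise_mse(preds, targets, labels):
    """Framewise MSE on ground-truth boxes, averaged per class.
    preds/targets: progress values per frame; labels: class id per frame."""
    preds, targets, labels = map(np.asarray, (preds, targets, labels))
    return {c: float(np.mean((preds[labels == c] - targets[labels == c]) ** 2))
            for c in np.unique(labels)}
```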

Average Progress Precision.

Average Progress Precision (APP) is identical to framewise Average Precision (Frame-AP) [9], with the difference that true positives must have a predicted progress that lies within a margin of the ground truth target. Frame-AP measures the area under the precision-recall curve for the detections in each frame. A detection is considered a hit if its Intersection over Union (IoU) with the ground truth is greater than a threshold and the class label is correct. In our case, we fix the IoU threshold and evaluate the results at different progress margins. A predicted bounding box $b$ is matched with a ground truth box $g$ and considered a true positive when $g$ has not already been matched and the following conditions are met:

$$\mathrm{IoU}(b, g) \ge \sigma, \qquad |\hat{p} - p| \le m$$

where $\hat{p}$ is the predicted progress, $p$ is the ground truth progress and $m$ is the progress margin. We compute Average Progress Precision for each class and report a mean (mAPP) value over a set of margin values.
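The matching rule can be expressed as a simple predicate; the default threshold values here are placeholders:

```python
def is_progress_true_positive(iou, pred_prog, gt_prog, class_ok,
                              iou_thresh=0.5, margin=0.1):
    """APP matching rule sketch: a detection counts as a true positive
    when the (not yet matched) ground truth overlaps enough, the class
    is correct, and the predicted progress lies within the margin."""
    return (class_ok
            and iou >= iou_thresh
            and abs(pred_prog - gt_prog) <= margin)
```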


Video Average Precision.

Video Average Precision (Video-AP) [9] is used in order to compare with previous methods and to show how predicting action progress can impact spatio-temporal action localization. We report results at varying IoU thresholds, ranging from 0.05 to 0.6. Note that the IoU is computed over spatio-temporal tubes, so detected tubes must be precise both at locating the action within the frame and at finding the correct temporal boundaries. Therefore, differently from detection in static images, IoUs higher than 0.4 are very strict.

Figure 4: Mean Squared Error obtained by three formulations of our model (ProgressNet Static, ProgressNet L2 and ProgressNet) on the J-HMDB dataset. Random and constant 0.5 predictions are reported as reference.
Figure 5: t-SNE visualization of ProgressNet’s hidden states of the second LSTM layer on the J-HMDB test set. Each point corresponds to a frame and its color represents the ground truth progress quantized with a 0.1 granularity.
Figure 6: t-SNE visualization of ProgressNet’s hidden states from the second LSTM layer of 4 classes from J-HMDB test set. Each point corresponds to a frame and its color represents the ground truth progress quantized with a 0.1 granularity (as in Fig. 5).
Method              UCF-101  J-HMDB
Random              0.166    0.166
Constant 0.5        0.084    0.083
ProgressNet Static  0.079    0.104
ProgressNet L2      0.032    0.052
ProgressNet         0.026    0.050
Table 1: Mean Squared Error values for action progress prediction on the UCF-101 and J-HMDB datasets. Results are averaged among all classes.

4.2 Implementation Details

In practice, we observed that on some classes of the UCF-101 dataset it is hard to learn accurate progress prediction models. These are action classes like Biking or WalkingWithDog that exhibit a cyclic behavior, where even for a human observer it is hard to guess how far the action has progressed. Therefore we extended our framework by adopting a curriculum learning strategy. First, the model is trained as described in Sect. 3.3 on classes that have a clear non-cyclic behavior (this subset consists of the following classes: Basketball, BasketballDunk, CliffDiving, CricketBowling, Diving, FloorGymnastics, GolfSwing, LongJump, PoleVault, TennisSwing, VolleyballSpiking). Then, we freeze all convolutional, FC and LSTM layers and fine-tune the FC8 layer, which performs progress prediction from the last LSTM output, on the whole UCF-101 dataset. This strategy improves the convergence of the model.
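The second stage of this curriculum, freezing everything except the final progress-prediction layer, can be sketched in PyTorch as follows (the function and layer naming are ours):

```python
import torch.nn as nn

def freeze_all_but_head(model, head_name="fc8"):
    """Freeze every parameter except those of the final
    progress-prediction layer, which is then fine-tuned on the full
    dataset. Returns the list of trainable parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name)
    return [p for p in model.parameters() if p.requires_grad]
```

The returned list can be passed directly to the optimizer so that only the head receives updates.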

4.3 Action Progress Prediction

In this section we report the experimental results on the task of predicting action progress. Since ProgressNet is a multitask approach, we first measure action progress performance on perfectly localized actions. We then test the method on real detected tubes with the full model and finally report a qualitative analysis which shows some success and failure cases.

Action progress on correctly localized actions.

In this first experiment we evaluate the ability of our method to predict action progress on correctly localized actions in both time and space. We take the ground truth tubes of actions on the test set and compare the MSE of three variants of our method: the full architecture trained with our Boundary Observant loss (ProgressNet), the same model trained with L2 loss (ProgressNet L2) and a reduced memoryless variant (ProgressNet Static).

The comparison of our full model against ProgressNet L2 is useful to understand the contribution of the Boundary Observant loss with respect to a simpler L2 loss. To underline the importance of using recurrent networks in action progress prediction, in the ProgressNet Static variant we substitute the two LSTMs with two fully connected layers that predict progress framewise. In addition, we provide two baselines: random prediction and a constant prediction of the progress expectation. The random prediction provides an upper bound on the MSE values. The latter, with a prediction of 0.5 for every frame, is a trivial solution that obtains good MSE results. Both are clearly far from being informative for the task.

We report the results of this experiment in Table 1. We first observe that the MSE values are consistent across the two datasets, with the ProgressNet models ahead of the baselines. ProgressNet and ProgressNet L2 obtain a much lower error than ProgressNet Static and the baselines. This confirms the ability of our model to understand action progress. In particular, the best result is obtained with ProgressNet, proving that our Boundary Observant loss plays an important role in training the network effectively. ProgressNet Static yields a higher MSE than the variants with memory, suggesting that single frames can be ambiguous for some classes and that temporal context helps to accurately predict action progress. Observing the class breakdown for J-HMDB in Fig. 4, we note that the static model gives better MSE values for some actions such as Swing Baseball and Stand. This is due to the fact that such actions have clearly identifiable states, which help to recognize the development of the action. On the other hand, classes such as Clap and Shoot Gun are hardly addressed by models without memory, because they exhibit only a few key poses that can reliably establish the progress.

We also report in Fig. 5 and 6 embeddings of the hidden states of the second LSTM layer. The former contains the whole test set while the latter focuses on four classes. Each point is a frame of the test set of J-HMDB and is colored according to its true action progress. In all figures we note that progress increases radially along trajectories from points labeled with 0.1. This suggests that ProgressNet has learned directions in the hidden state space that follow action progress.

Action progress with the full pipeline.

In this second experiment, we evaluate action progress performance while also performing spatio-temporal action detection with the entire pipeline. Differently from the previous experiment, we test the full approach where action tubes are generated by the detector. In Fig. 7, we report the mAPP of ProgressNet (trained with the BO loss), ProgressNet Static and the two baselines (random and constant 0.5) on both the UCF-101 and J-HMDB benchmarks. Note that the mAPP upper bound is given by the standard mean Frame-AP [9], which is equal to mAPP with a margin of 1.

It can be seen that ProgressNet has a higher mAPP than the baselines at stricter progress margins. This confirms that our approach is able to predict action progress correctly even on noisy tubes, such as those generated by a detector. ProgressNet Static exhibits a lower performance than ProgressNet, confirming again that memory is helpful to model action progress.

Figure 7: mAPP on the UCF-101 and J-HMDB datasets.

Qualitative analysis.

Qualitative results from the UCF-101 dataset, obtained with our ProgressNet, are shown in Fig. 8. It is interesting to notice how in some of the examples the predicted progress does not follow a strictly linear trend, but rather the visual appearance of the action. In particular, in the second row (LongJump), while the athlete is running the output values tend to fluctuate, but as soon as he jumps the predicted progress grows firmly towards one. Insights on the model behavior are also given by the failure cases: in the GolfSwing clip, the actor hesitates before hitting the golf ball and the predicted progress therefore lags behind the ground truth. In the FloorGymnastics case instead, the first and the last pose are almost identical, and the predicted progress grows straight towards completion after a few frames.

Figure 8: Qualitative results (success cases in the green frame, failure cases in the red one). Each row represents the progression of an action. Progress values are plotted inside the detection box with time on the x axis and progress values on the y axis. Progress targets are shown in green and predicted progresses in blue.

4.4 Spatio-temporal action localization

Method                   IoU: 0.05  0.1   0.2   0.3   0.4   0.5   0.6
Ours                          79.9  77.0  67.0  55.6  46.4  35.7  26.9
Saha et al. [30]              78.9  76.1  66.4  54.9  45.2  34.8  25.9
Yu et al. [43]                42.8  -     -     -     -     -     -
Weinzaepfel et al. [40]       54.3  51.7  46.8  37.8  -     -     -
Peng et al. [27]              54.5  50.4  42.3  32.7  -     -     -
Singh et al. [34]             -     -     73.5  -     -     46.3  -
Kalogeiton et al. [19]        -     -     77.2  -     -     51.4  -
Table 2: Mean Video-AP on UCF-101 (split1) for spatio-temporal action localization.

Besides having an intrinsic value in the understanding of human behavior in videos, action progress is also a useful tool for obtaining more precise action tubes. The reason why action tubes generated by machine learning methods are often imprecise can be traced back to two main causes: inaccurate frame level detections and difficulties in properly identifying the temporal action boundaries, i.e. the first and the last frame in which the action is present. Since most methods rely on powerful and precise detectors [30, 27] to generate candidate boxes, we argue that the primary cause of defects concerns temporal boundaries.

In this experiment we perform action localization on tubes trimmed with the strategy described in Sect. 3.2. We follow previous work [30, 34, 19] and use UCF-101 (split1), reporting performance in terms of mean Video-AP. We apply our trimming strategy to tubes generated with [30] (note that these numbers differ slightly from the results reported in the original paper [30], which contained a bug in the code; Table 2 reports the updated results from the authors), but the proposed approach is applicable to tubes generated by any method (state-of-the-art methods such as Singh et al. [34] and Kalogeiton et al. [19], which appeared only recently, are likely to generate better tubes). We report the results for this experiment in Table 2. Comparing our approach to the baseline tubes generated with [30], we obtain a higher mean Video-AP at every IoU threshold. This confirms that action progress can be useful to better localize actions in time.

5 Conclusion

In this paper we defined the novel task of action progress prediction. Our method is the first that can predict the spatio-temporal localization of actions and, at the same time, understand their evolution by predicting their progress stage online. This approach opens new scenarios for any goal-planning intelligent agent in interactive applications. In addition to our model, we propose a boundary-observant loss which helps to avoid trivial solutions. Moreover, we show that action progress can be used to refine temporal action boundaries.


  • [1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: a review. ACM Computing Surveys, 43(3):16:1–16:43, 2011.
  • [2] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: Single-stream temporal action proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [3] R. De Geest, E. Gavves, A. Ghodrati, Z. Li, C. G. Snoek, and T. Tuytelaars. Online action detection. In European Conference on Computer Vision, pages 269–284, 2016.
  • [4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015.
  • [5] C. Fermüller, F. Wang, Y. Yang, K. Zampogiannis, Y. Zhang, F. Barranco, and M. Pfeiffer. Prediction of manipulation actions. International Journal of Computer Vision, pages 1–17, 2016.
  • [6] J. R. Flanagan and R. S. Johansson. Action plans used in action observation. Nature, 424(6950):769, 2003.
  • [7] A. Gaidon, Z. Harchaoui, and C. Schmid. Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2782–2795, 2013.
  • [8] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: Temporal unit regression network for temporal action proposals. In IEEE International Conference on Computer Vision, 2017.
  • [9] G. Gkioxari and J. Malik. Finding action tubes. In IEEE International Conference on Computer Vision, pages 759–768, 2015.
  • [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics Conference, pages 249–256, 2010.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361, 2014.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [13] F. Heidarivincheh, M. Mirmehdi, and D. Damen. Beyond action recognition: Action completion in rgb-d data. In British Machine Vision Conference, 2016.
  • [14] F. C. Heilbron, J. C. Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1914–1923, 2016.
  • [15] M. Hoai and F. De la Torre. Max-margin early event detectors. International Journal of Computer Vision, 107(2):191–202, 2014.
  • [16] R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In IEEE International Conference on Computer Vision, 2017.
  • [17] H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.
  • [18] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
  • [19] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In IEEE International Conference on Computer Vision, 2017.
  • [20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [21] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1481, 2017.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [23] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang. Unsupervised representation learning by sorting sequences. In IEEE International Conference on Computer Vision, 2017.
  • [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37, 2016.
  • [25] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in lstms for activity detection and early detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1942–1950, 2016.
  • [26] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • [27] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision, pages 744–759, 2016.
  • [28] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.
  • [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [30] S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference, 2016.
  • [31] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [32] G. A. Sigurdsson, O. Russakovsky, and A. Gupta. What actions are needed for understanding human actions in videos? In IEEE International Conference on Computer Vision, pages 2137–2146, 2017.
  • [33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [34] G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In IEEE International Conference on Computer Vision, 2017.
  • [35] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv preprint arXiv:1212.0402, 2012.
  • [36] B. Soran, A. Farhadi, and L. Shapiro. Generating notifications for missing actions: Don’t forget to turn the lights off! In IEEE International Conference on Computer Vision, pages 4669–4677, 2015.
  • [37] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations with unlabeled video. In IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
  • [38] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
  • [39] X. Wang, A. Farhadi, and A. Gupta. Actions transformations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2658–2667, 2016.
  • [40] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE International Conference on Computer Vision, pages 3164–3172, 2015.
  • [41] Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716, 2017.
  • [42] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2678–2687, 2016.
  • [43] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.
  • [44] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. Temporal action localization by structured maximal sums. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3684–3692, 2017.
  • [45] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In IEEE International Conference on Computer Vision, pages 2914–2923, 2017.
  • [46] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In IEEE International Conference on Computer Vision, 2017.
  • [47] H. Zhu, R. Vial, and S. Lu. Tornado: A spatio-temporal convolutional regression network for video action proposal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5813–5821, 2017.