Video Summarization via Actionness Ranking
To automatically produce a brief yet expressive summary of a long video, an automatic algorithm should start by resembling the human process of summary generation. Prior work proposed supervised and unsupervised algorithms that learn the underlying behavior of humans, either by increasing modeling complexity or by hand-crafting better heuristics to simulate the human summary generation process. In this work, we take a different approach by analyzing a major cue that humans exploit for summary generation: the nature and intensity of actions.
We empirically observed that a frame is more likely to be included in human-generated summaries if it contains a substantial amount of deliberate motion performed by an agent, which is referred to as actionness. Therefore, we hypothesize that learning to automatically generate summaries involves an implicit knowledge of actionness estimation and ranking. We validate our hypothesis by running a user study that explores the correlation between human-generated summaries and actionness ranks. We also run a consensus and behavioral analysis between human subjects to ensure reliable and consistent results. The analysis exhibits a considerable degree of agreement among subjects within the obtained data and verifies our initial hypothesis.
Based on the study findings, we develop a method to incorporate actionness data to explicitly regulate a learning algorithm that is trained for summary generation. We assess the performance of our approach on 4 summarization benchmark datasets, and demonstrate an evident advantage compared to state-of-the-art summarization methods. (Accepted as an oral presentation in WACV-19.)
With the immense growth in the use of smart-phones and cameras, the amount of recorded visual data now far exceeds what can be attentively viewed. Each day 144,000 hours of video are uploaded to YouTube, which is almost 17 years' worth of video [16, 36, 12]. Moreover, recent statistics report that 245 million CCTV cameras are professionally installed around the world, actively surveying day-to-day activities . Records in 2017 show that there are at least 2.32 billion active camera phones . Estimates show that about 2.4 million GoPro body cameras were sold world-wide in 2015 . This calls for efficient and automatic methods that quickly examine visual data and provide an informative briefing about the original videos. Video summarization addresses the problem of selecting a subset of video frames such that the summary captures the most important and representative events of the original video.
Several prior works made substantial efforts to better understand the video summarization problem and have proposed heuristic solutions (e.g., [26, 29, 2, 50, 31, 39]). The remarkable success of deep neural networks [23, 46, 34, 13] has motivated researchers to design even more complex black-box models instead of developing a profound understanding of the problem (e.g., [32, 55, 3, 20]). While increasing model complexity often helps in better modeling the latent patterns of data, it risks overfitting to standard benchmark training video datasets and being sensitive to noise and irrelevant features, unless a proper learning objective is used. To address this challenge, we investigate a new learning objective that takes into account the role of deliberate actions performed by generic agents within human-generated summaries, and we utilize this correlation to perform robust automatic summarization. The premise of our work stems from our observation that humans tend to include frames with deliberate actions more frequently in the summary, since such frames tend to represent more “unexpected and important” events and tell more about the story of the video.
Actions and motion patterns in videos present an intricate visual stimulation to the eyes of the viewer and thus become major cues when generating summaries for long videos. In the philosophy of actions , there are three aspects that define a generic action instance: i) it is carried out by an agent, ii) it requires an intention, and iii) it leads to side-effects. Spatial Actionness was introduced to quantify the likelihood of an image region to contain a generic action instance [4, 49]. Along the same lines, video summarization aims to localize temporal instances where important events occur. We propose to extend this definition to the temporal domain to better serve the summarization problem. That is, Temporal Actionness is the likelihood of a generic action to appear within a temporal video segment.
Temporal actionness ranking can assist an automatic summarization algorithm in localizing and quantifying the intensity of generic action instances. Consequently, it can also estimate the likelihood of including each event in the summary. Fig. 2 shows an example of a first-person video of a person performing base jumping. There are four distinct types of motion in this video: running water, camera relative motion, a jumping partner, and first-person own-hand manipulation. Only the last two qualify as strong temporal actionness, and they tend to constitute the vast majority of the summary.
Our main contributions in this paper are three-fold. First, we establish the concept of temporal actionness and study how it relates to video summarization. Second, we introduce a new set of actionness labels over four existing summarization benchmarks, and run a consensus and behavioral analysis on them to verify their consistency. Finally, we propose a method that utilizes temporal actionness to improve the summary generation through a multi-task learning formulation.
2 Related Work
In this section, we start by reviewing the concept of spatial actionness in the literature. Then, we briefly review Recurrent Neural Networks (RNN) and mention some of their applications in video processing. Finally, we conclude by discussing some prior approaches that have applied RNN models to the video summarization problem.
Actionness: The concept of spatial actionness was first introduced in  as the deliberate bodily movement performed by an agent, which is distinct from general instances of motion since it requires intention. The authors used Lattice Conditional Ordinal Random Fields to rank the regions of an image based on their likelihood of containing an action (i.e., ranking actionness).
Accurate and efficient ranking of spatial actionness was shown to benefit other related tasks [49, 53, 30, 27]. For example, Wang et al.  used a fully convolutional network to estimate spatial actionness. Then, they embedded the predicted actionness heat-map within a hybrid approach that performs action detection. Also, Ting et al. [53, 10] suggested a framework that performs action proposals by generating actionness curves via a snippet-level actionness classifier, then grouping them over time to produce the proposal candidates. Finally, Zhao et al.  proposed a temporal action proposal scheme called Temporal Actionness Tagging (TAG). This method uses an actionness classifier to evaluate the binary actionness probabilities for individual snippets. Our definition of temporal actionness is consistent with theirs, but also generalizes to agents other than humans as discussed in Section 3.1.
Recurrent Neural Networks (RNNs): Since their introduction in [40, 51], RNNs have been commonly used to model sequential data. Unlike feed-forward networks (e.g., CNNs), whose output depends only on the input at the current time-step, the output of an RNN also relies on previous time-steps. The basic formulation of the RNN has the drawback of missing long-term dependencies due to the vanishing gradient problem . Several extensions of RNNs have been introduced to resolve this problem. Popular approaches include Long Short-Term Memory (LSTM)  and the Gated Recurrent Unit (GRU) . Both of these models have been successfully employed in applications such as video captioning using LSTM [47, 38, 52, 28], and action recognition and action proposals using GRU [1, 22, 48].
Video Summarization using RNNs: Because of their ability to process temporal data, RNNs have been widely used to train supervised and unsupervised video summarization models (e.g., [20, 32, 55, 3, 43, 56]). Zhang et al.  were the first to use a supervised LSTM and a Multi-Layer Perceptron (MLP) while optimizing the Determinantal Point Process (DPP) maximum likelihood [25, 33, 24, 12]. DPP quantifies the diversity of the selected subset of frames, so maximizing the DPP likelihood amounts to selecting a representative summary in which redundancy is minimized. Recently, Mahasseni et al.  presented an unsupervised video summarization framework by training an LSTM network in an adversarial manner to better model the complexity of the data. Further, Chen et al.  used a hybrid framework that utilizes GRU, MLP, and a temporal segmentation algorithm to perform the tasks of video summarization and video captioning simultaneously.
3 Relating Actionness to Summarization
In this work we hypothesize that human-generated summaries favor frames that contain deliberate motions over stationary or monotonous motions that are deemed boring. To test this hypothesis, we start by defining the type of motion that we expect to be a substantial component in human-generated summaries, which we refer to as temporal actionness. Then, we conduct a user study on human subjects investigating the relationship between temporal actionness and generated summaries. Finally, we conduct a consensus analysis on the obtained data to measure the agreement among subjects and a behavioral analysis to ensure the reliability of our findings.
3.1 Temporal Actionness
As discussed in Section 2, spatial actionness is defined as the likelihood of a certain region in an image to contain an action . An image region is considered to contain an action based on the definition of actions in  as “what an agent can do with a deliberate bodily movement that leads to side-effects”.
Our definition of actionness is consistent with the aforementioned definitions, but we extend it in two ways. First, we also consider non-human agents that perform deliberate motions, because human agents do not necessarily exist in the videos that need to be summarized. For example, a swimming dolphin represents an action while a running river does not: even though both contain similar magnitudes of motion, there is no intention in the latter.
Second, we adapt the actionness concept to the temporal domain, where we estimate the likelihood of a given video segment to contain an action. For biological agents, it is possible to predict the likelihood of the action from the agent’s pose. However, since we are generalizing our definition to non-biological agents, their motion often is not distinguishable within a single frame. Thus, a video segment is essential to determine the nature of motion. For instance, detecting a moving vehicle requires monitoring several frames to track the vehicle’s location changes and to distinguish it from a stopped one.
We target a rank ordering of actionness rather than a binary classification of whether a segment contains an action (i.e., action proposal ) for two reasons. First, the fundamental notion of temporal actionness as “localizing when there is an action” immediately presents a difficulty: temporal segmentation remains a challenging and open problem. Some efficient methods exist for this purpose, such as KTS , but the average f-score remains too low for robust use (about 0.41). Ranking makes it more plausible to provide a stratified quantification of the likelihood that a segment contains an action, based on the prevalence of that action. Second, in any given video, background actions (e.g., monotonous actions) are often overlooked by viewers, as opposed to abrupt foreground actions. For instance, in a surveillance video, it is only natural to dismiss the monotonous background traffic and monitor the abrupt motions around a building’s entrance.
3.2 User Study
To estimate actionness, we first used the KTS algorithm  to produce semantically consistent variable-size segments that contain atomic semantic meanings. Then, for each segment, we asked five users to label it by selecting the appropriate rank from the following scale:
0: No action (No deliberate motion by an agent)
1: Background action (Weak indication of an action)
2: Partial foreground action (Strong action indication covering a minor part of the segment)
3: Active foreground action (Strong action indication covering a major part of the segment)
For a tractable annotation process, we subsampled the videos to 1 fps. Then, we constructed the displayed segment to contain all the frames in a grid display allowing the users to see all the frames of one segment simultaneously. Before starting the process, users underwent a training stage to understand the task and the procedure. They were asked to rank actionness on four videos. After training, the users were asked to perform the same task on four benchmark summarization datasets: SumMe , TVSum , Youtube , and OVP . Videos used during the user training stage were discarded in model development.
3.3 Data Analysis
Consensus analysis. To ensure the validity of the annotations, we measured the consensus among users using two metrics. The first metric is the f1-score: we computed the average pairwise f1-score to estimate the agreement among the annotators for each scale. We obtained 0.55, 0.40, 0.48, and 0.51 for the SumMe, TVSum, OVP, and Youtube datasets, respectively. The second metric is the rank frequency over the original videos for each user, that is, how often each user chose a given scale across all the annotated videos. Fig. 3 shows the rank frequencies for all users. We observe that the ratios are close to each other across users for all the scales, which, along with the f1-scores, demonstrates evident consensus among users.
Do summaries contain high actionness? To answer this question, we computed the average frequency of each actionness scale in both the ground-truth summary and the original video. Fig. 4 demonstrates that scale-three actionness frames are the dominant rank in the summary despite being a minority in the original video. Hence, frames containing high actionness are more likely to be included in the summary.
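The pairwise agreement measure described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, each scale is treated as a one-vs-rest binary labeling, and scales that neither annotator of a pair used are skipped (an assumption, since the text does not say how such cases are handled).

```python
from itertools import combinations


def f1_binary(pred, ref):
    """F1 between two equal-length binary sequences (1 = segment has the scale)."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(ref)
    return 2 * precision * recall / (precision + recall)


def average_pairwise_f1(annotations, scales=(0, 1, 2, 3)):
    """annotations: one list of per-segment actionness ranks per annotator."""
    scores = []
    for a, b in combinations(annotations, 2):
        for s in scales:
            a_bin = [int(x == s) for x in a]
            b_bin = [int(x == s) for x in b]
            if sum(a_bin) == 0 and sum(b_bin) == 0:
                continue  # assumption: skip scales unused by both annotators
            scores.append(f1_binary(a_bin, b_bin))
    return sum(scores) / len(scores) if scores else 0.0
```

Two annotators who assign identical ranks score 1.0, and two who never agree score 0.0; real annotations fall in between.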
Were the annotators just looking for abrupt motions? For a more extensive verification, we examine if the users tended to choose segments containing abrupt motion (i.e., high magnitudes of motion) as representation for the high-actionness segments. To answer this question, we first need to provide an evaluation for abrupt motion. We calculated the mean magnitude of optical flow for each of the segments, and normalized it across each video. Then, we computed the histogram plot of the segments scored by the users as level-three actionness sorted by their normalized mean magnitude of optical flow. As shown in Fig. 5, the selected segments are distributed among a wide variation of optical-flow intensities. This shows that users were not merely selecting the most abrupt motion segments as representatives for the deliberate actions required in high actionness.
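The motion measure used in this check can be sketched as below, assuming the optical flow has already been computed by some external estimator and is given per frame as a list of (dx, dy) vectors. The function names are hypothetical, and min-max scaling is an assumption, since the text only states that the magnitudes are normalized across each video.

```python
import math


def segment_flow_magnitudes(flows, segments):
    """Mean optical-flow magnitude of each segment.

    flows: per-frame lists of (dx, dy) flow vectors.
    segments: list of (start, end) frame indices, end exclusive.
    """
    means = []
    for start, end in segments:
        mags = [math.hypot(dx, dy) for frame in flows[start:end] for dx, dy in frame]
        means.append(sum(mags) / len(mags))
    return means


def normalize_per_video(values):
    """Min-max scale the per-segment values of one video to [0, 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

The normalized values of the segments ranked as level three can then be binned into a histogram to see whether they concentrate at the high-motion end.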
Having established our hypothesis, we seek to utilize the data obtained from the study to further improve automatic video summarization algorithms. In order to train a supervised learning model, we need to produce a single set of labels out of multiple annotations for each video, often referred to as the oracle label set. We follow the algorithm proposed in [12, 24], which greedily selects the segment that results in the largest marginal gain on the f1-score computed against the users’ annotations. To produce frame-level labels, we assign every frame within a segment the ranking label of that segment.
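A greedy selection of this kind, followed by the segment-to-frame expansion, might look like the sketch below. It covers only the binary case in which each annotator selects a set of segments; how the procedure carries over to four-level ranks is not specified here, and all names are hypothetical.

```python
def f1(selected, reference):
    """F1 overlap between two sets of segment indices."""
    tp = len(selected & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(user_selections, num_segments):
    """Greedily add the segment with the largest gain in mean F1 over annotators."""
    oracle, best = set(), 0.0
    while True:
        candidates = []
        for s in range(num_segments):
            if s in oracle:
                continue
            trial = oracle | {s}
            score = sum(f1(trial, u) for u in user_selections) / len(user_selections)
            candidates.append((score, s))
        if not candidates:
            break
        score, s = max(candidates)
        if score <= best:  # no positive marginal gain left
            break
        oracle.add(s)
        best = score
    return oracle


def expand_to_frames(segment_labels, segments):
    """Give every frame in a (start, end) segment that segment's label."""
    frame_labels = []
    for label, (start, end) in zip(segment_labels, segments):
        frame_labels.extend([label] * (end - start))
    return frame_labels
```

For three annotators selecting {0, 1}, {1, 2}, and {1}, the procedure keeps only segment 1, because adding any further segment lowers the mean F1.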
In this section we propose a model that incorporates the actionness ranking task to regularize video summarization.
Figure 6 shows an overview of our framework. The input is a video, given as a sequence of frames. First, a visual encoder (i.e., a pretrained CNN) is used to extract spatial features for each frame. Next, the extracted features are sent to a sequential encoder (i.e., a Bi-directional GRU) to extract their corresponding temporal features. GRU is used as a sequential encoder because it has fewer parameters than LSTM, which results in faster training and a lower risk of overfitting, and it has been shown to perform on par with the LSTM . Next, we aggregate both types of features, spatial and temporal, to generate a comprehensive spatio-temporal feature vector for each frame. These features represent the visual information of the current frame and also encode the temporal information from the other frames in the video. Finally, the aggregate features are mapped to the actionness and importance scores using two independent MLPs.
The framework is trained to learn two tasks: 1) summarization by minimizing importance estimation loss, and 2) actionness ranking by minimizing actionness classification loss. The framework is optimized by applying a regularized multi-task learning paradigm . Imposing a regularization term in the joint loss is intended to penalize unnecessary complexity in the original learning problem that might cause overfitting to the training data, while encouraging the model to learn the relationship between the two tasks.
By combining the two losses into a single joint loss, the network is trained to learn a set of trainable parameters $\theta^{*}$ such that:
$$\theta^{*} = \arg\min_{\theta} \; \mathcal{L}_{sum}(\theta) + \lambda \, \mathcal{L}_{act}(\theta) \qquad (1)$$
where $\mathcal{L}_{sum}$ is the summarization loss (Section 4.2), $\mathcal{L}_{act}$ is the actionness classification loss (Section 4.3), which acts as a regularizer, and $\lambda$ is the regularization weight used to force both losses to operate on comparable ranges, preventing the learning from being biased towards one of the losses.
4.2 Importance Estimation
Importance scores (i.e., summarization labels) are binary labels that indicate the frames selected to be a part of the summary: 1 for selected frames, and 0 otherwise. The problem with this type of labeling is that frames within the same segment tend to have similar semantic features, therefore the annotators could have chosen any other frame within a selected frame’s segment (i.e., key segment). To reduce the effect of the inherent noise in the labels, we apply Gaussian smoothing as a preprocessing step. Particularly, binary labels are converted to real-values where the mean is the selected frame within the summary, and the Gaussian distribution is sampled across its key segment (see Fig. 6). Thus, the framework would not be penalized for choosing a frame within a key segment as much as it would be penalized for choosing a frame outside a key segment.
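The label smoothing step can be sketched as follows. This is an illustration under stated assumptions: the standard deviation is not given in the text, so it is set here to a fraction of the key segment length through the hypothetical parameter `sigma_frac`, and when a segment holds several selected frames the largest Gaussian value is kept.

```python
import math


def smooth_labels(binary, segments, sigma_frac=0.25):
    """Replace binary keyframe labels with Gaussian bumps confined to key segments.

    binary: per-frame 0/1 labels.
    segments: list of (start, end) frame indices, end exclusive.
    sigma_frac: assumed ratio of the standard deviation to the segment length.
    """
    smoothed = [0.0] * len(binary)
    for start, end in segments:
        sigma = max(sigma_frac * (end - start), 1e-6)
        for mu in range(start, end):
            if binary[mu] != 1:
                continue  # only selected frames act as Gaussian means
            for t in range(start, end):
                g = math.exp(-((t - mu) ** 2) / (2 * sigma ** 2))
                smoothed[t] = max(smoothed[t], g)
    return smoothed
```

A selected frame keeps the value 1.0, its neighbours inside the same key segment receive smaller positive values, and frames in segments without a selected frame stay at 0.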
Increasing the diversity within the selected subset is equivalent to choosing a representative subset since the redundancy is minimal. Following [12, 33], we use the decomposition in  to express the marginal kernel $L$ as a Gram matrix in the following manner:
$$L_{ij} = q_i \, \phi_i^{\top} \phi_j \, q_j \qquad (2)$$
where $\phi_i$ can be seen as a representative feature vector, and $q_i$ is the quality score of frame $i$ in the selected subset $Y$. Similar to , we construct the features $\phi_i$ with a dimensionality of 256 for each frame, and the quality score $q_i$ as a single scalar for every frame. In our framework, we apply two independent MLPs with the aforementioned dimensions to obtain $\phi_i$ and $q_i$ and compute the marginal DPP kernel as in Eq. 2.
Finally, we optimize the Maximum Likelihood Estimation (MLE) of the normalized marginal DPP kernel that quantifies the diversity in the ground-truth summaries as follows:
$$\mathcal{L}_{sum} = -\log \frac{\det(L_Y)}{\det(L + I)} \qquad (3)$$
where $L_Y$ is the submatrix of $L$ indexed by the frames in $Y$, $L$ is the marginal kernel of the ground-set of all the frames in the video, and $I$ is the identity matrix.
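To make the DPP objective concrete, the sketch below evaluates a negative log-likelihood for a toy video. It assumes the standard form of this likelihood, in which the kernel entry for frames i and j is the product of their quality scores and the inner product of their feature vectors, and the probability of a subset is the determinant of its kernel submatrix divided by the determinant of the kernel plus the identity. The helper names are hypothetical, and the pure-Python determinant is only meant for small examples.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def dpp_kernel(q, phi):
    """Kernel entries q_i * <phi_i, phi_j> * q_j (assumed standard form)."""
    n = len(q)
    return [[q[i] * q[j] * sum(x * y for x, y in zip(phi[i], phi[j]))
             for j in range(n)] for i in range(n)]


def dpp_nll(q, phi, selected):
    """Negative log-likelihood of the frame subset `selected` under the DPP."""
    n = len(q)
    L = dpp_kernel(q, phi)
    L_sub = [[L[i][j] for j in selected] for i in selected]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return -(math.log(det(L_sub)) - math.log(det(L_plus_I)))
```

Selecting two frames with nearly parallel feature vectors yields a higher loss than selecting two orthogonal ones, which is the diversity preference the objective is meant to encode.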
4.3 Actionness Ranking
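This task provides a regularization term for the joint loss (Eq. 1), which is determined by classifying the actionness scale of each frame, $a_t \in \{0, 1, 2, 3\}$. We train an independent MLP to map the spatio-temporal features of each frame to an actionness rank using the categorical cross-entropy loss as follows:
$$\mathcal{L}_{act} = -\sum_{t=1}^{T} \sum_{c=0}^{3} a_{t,c} \, \log \hat{a}_{t,c} \qquad (4)$$
where $\hat{a}_{t}$ and $a_{t}$ are the predicted and target (one-hot) values of the actionness rank for the $t$-th frame, $c$ indexes the four actionness scales, and $T$ is the number of frames.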
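A categorical cross-entropy over the four actionness scales can be computed as in the sketch below. The function names are hypothetical; the loss is summed over frames here, and whether the original formulation sums or averages is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actionness_cross_entropy(logits_per_frame, target_ranks):
    """Sum over frames of -log p(target rank); ranks are integers in 0..3."""
    loss = 0.0
    for logits, target in zip(logits_per_frame, target_ranks):
        probs = softmax(logits)
        loss -= math.log(probs[target])
    return loss
```

With uniform scores every frame contributes log 4, the chance-level value for four classes; confident correct predictions drive the loss towards zero.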
5 Experimental Results
In the sections above, we proposed that deliberate motion provides a significant cue when humans summarize a given video. We then supported this hypothesis with a user study in which multiple human subjects were asked to rank the magnitude of deliberate motion. The study results show that a significant portion of the summary contains a high intensity of deliberate motion, in contrast to the original video contents. Therefore, we introduced an approach that ranks the intensity of deliberate motion and uses this knowledge to improve video summarization. In this section, we run an extensive set of experiments that show the effect of learning actionness on learning summarization.
We evaluated our approach on four summarization benchmark datasets: SumMe , TVSum , Open Video Project (OVP) , and Youtube . The first dataset consists of 25 user videos covering multiple events such as bears climbing a tree and cooking. It contains both first-person and third-person videos with lengths varying from 1.5 to 6.5 minutes. The second dataset consists of 50 Youtube videos from 10 categories of the TRECVid Multimedia Event Detection (MED), 5 videos per category. They vary in length from 1 to 5 minutes and include both first and third person videos.
The third and fourth datasets are quite large. We use the same subset of videos used in [7, 32, 55], 50 videos from OVP, and 39 videos from Youtube. OVP videos contain mostly news reports and documentary clips that vary in length from 1 to 4 minutes. All of them are third-person videos. The last dataset contains news and sports videos (third-person videos) with lengths varying from 1 to 10 minutes.
5.2 Experimental Setup
For a fair comparison with the related approaches, we evaluate our method using the keyshot-based metric similar to [55, 32]. We first convert frame-level scores to shot scores by applying the KTS algorithm  that generates semantic shots. The resulting shots are ranked based on their importance score, which is the average score of the frames in that shot. By applying the Knapsack algorithm, a subset of the highest ranked keyshots are selected such that the total duration of the generated summary is less than 15% of the original video. We report the average f1-scores to evaluate the predicted summary as compared to the ground-truth summary.
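The keyshot selection step can be sketched as a 0/1 knapsack over shots. The sketch assumes the shot boundaries are already available (for example from KTS), uses the average frame score of a shot as its value and the shot length as its weight, and allows the summary to fill the budget exactly; these choices and all names are assumptions rather than the authors' implementation.

```python
def shot_scores(frame_scores, shots):
    """Average frame-level importance inside each (start, end) shot."""
    return [sum(frame_scores[s:e]) / (e - s) for s, e in shots]


def knapsack(values, weights, capacity):
    """0/1 knapsack by dynamic programming; returns indices of chosen items."""
    n = len(values)
    table = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            table[i][c] = table[i - 1][c]
            if weights[i - 1] <= c:
                cand = table[i - 1][c - weights[i - 1]] + values[i - 1]
                if cand > table[i][c]:
                    table[i][c] = cand
    chosen, c = [], capacity
    for i in range(n, 0, -1):  # backtrack through the table
        if table[i][c] != table[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return sorted(chosen)


def select_keyshots(frame_scores, shots, budget_percent=15):
    """Pick shots maximizing total score within the duration budget."""
    capacity = len(frame_scores) * budget_percent // 100
    values = shot_scores(frame_scores, shots)
    weights = [e - s for s, e in shots]
    return knapsack(values, weights, capacity)
```

For a 20-frame video the budget is 3 frames, so a high-scoring 2-frame shot and a high-scoring 1-frame shot are preferred over any longer shot.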
Implementation Details: Similar to [32, 55], we use the output of the pool5 layer of the GoogLeNet  architecture trained on ImageNet  as the visual encoder for our framework to extract a 1024-dimensional spatial feature vector for each frame. Then, we use a single-layer GRU with 256 hidden units as the sequential encoder and MLPs with 256 hidden units for both of the optimization tasks. Similar to the training setup of , we run our model for 100 iterations in the training stage and stop the training if the validation f1-score does not improve for more than 5 consecutive iterations. The validation split is set to be a random 20% subset of the training data. We use the Adam optimizer to train our framework with a learning rate of 0.001. To learn the task of actionness ranking, we set the regularization weight $\lambda$ to 0.003. The value of $\lambda$ was selected so that both losses operate on close ranges and neither of them biases the optimization while training the network.
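The stopping rule described above can be sketched independently of the model. Here `step_fn` is a hypothetical callback that runs one training iteration and returns the validation f1-score; the reading that training stops once the score has failed to improve for more than `patience` consecutive iterations is an interpretation of the text.

```python
def train_with_early_stopping(step_fn, max_iters=100, patience=5):
    """Run up to max_iters iterations; stop after more than `patience` stale ones.

    step_fn(iteration) is assumed to train for one iteration and return the
    validation f1-score. Returns (last iteration run, best validation score).
    """
    best, stale = float("-inf"), 0
    last_iter = 0
    for it in range(1, max_iters + 1):
        last_iter = it
        val_f1 = step_fn(it)
        if val_f1 > best:
            best, stale = val_f1, 0
        else:
            stale += 1
            if stale > patience:
                break
    return last_iter, best
```

If the validation score peaks at iteration 3 and then stays flat, training stops at iteration 9, the sixth consecutive iteration without improvement.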
5.3 System Performance
Test Configurations: We follow [55, 32] to evaluate our method in three test configurations. In the first configuration (Canonical), we use 80% of one dataset to train the method, and test the method on the remaining 20% of the same dataset. In the second configuration (Augmented), the TVSum and SumMe datasets are used together: one dataset is used to train the method, which is then tested on the entire other dataset. In the last configuration (Transfer), we adopt the same paradigm as the second configuration but augment the training set with the OVP and Youtube datasets, which improves the results on SumMe and TVSum.
Baselines: We conduct an extensive comparison with the state of the art methods [14, 15, 54], two models from : LSTM+MLP (vsLSTM) and LSTM+MLP+DPP (dppLSTM), and two models from : Unsupervised DPP (DPP) and supervised model (SUP).
Also, to perform an ablation study on our model, we introduce three variants of our approach. First, Ours-Basic is our model without the actionness regularization. It reduces the model’s complexity to be close to ; however, our model uses GRU instead of LSTM and performs Gaussian smoothing preprocessing on the labels. Second, Ours-FT is the same as the basic model, but the sequential encoder is first trained for human-based action localization, then the entire framework is fine-tuned for video summarization. To train the GRU for action localization, we follow  to train the sequential encoder on GoogLeNet features for the action recognition task on UCF-101  for 100 epochs, then fine-tune it for action localization on THUMOS-14  for another 100 epochs. The last model is Ours-Reg, which is trained for simultaneous video summarization and actionness estimation as discussed in Section 4.
Summarization Evaluation: Table 1 shows the f1-scores of our models compared to the state-of-the-art methods. As shown, Ours-Basic performs similarly to vsLSTM and dppLSTM. Training our model on the action recognition labels prior to summarization (Ours-FT) performs on par with the state-of-the-art methods. However, the model trained for actionness estimation (Ours-Reg), which considers deliberate motions performed by generic agents (not just humans, unlike Ours-FT), significantly outperforms all other methods in most of the settings.
Actionness Evaluation: To investigate whether actionness helps summarization, we ran two analyses. First, we verify that our model effectively learns the actionness ranking task by computing the actionness classification accuracy in all test configurations. As shown in Table 2, Ours-Reg performs significantly better than chance, indicating that the model actually learns actionness estimation and does not dismiss it from the learning procedure. Second, we compute the distribution of actionness scales in the ground-truth summary, Ours-Reg, and  over the SumMe dataset for test configuration 1 (see Fig. 5). As shown in Fig. 7, our model resembles the ground-truth summary better than . The two results suggest that learning actionness ranking is indeed useful for better video summarization.
6 Conclusion and Future work
In this work, we present a further step in analyzing and understanding the video summarization problem. We hypothesize that humans actively rely on deliberate motion and action cues (among other cues) to generate a brief summary that best expresses long visual sequences. We examine this hypothesis by running a user study, investigating the correlation between human-generated summaries and actionness ranking. We then conduct a consensus and behavioral analysis on the data obtained from users to ensure the reliability of the data and the agreement among the users. The findings of the study show a substantial likelihood of including frames containing high actionness ranks within the summaries.
Thus, we propose a new method that utilizes actionness cues to better learn the task of video summarization. We use a recurrent neural network that is trained for video summarization while being explicitly regularized to learn the actionness ranking task in a multi-task learning formulation. The evaluation on four benchmark summarization datasets shows a significant improvement by our approach over several state-of-the-art summarization methods.
Future Work: The main objective of this work was to examine the relationship between the tasks of actionness estimation and video summarization, and to use the former to improve the performance of the latter. As an initial step, we used an extra set of annotations, called actionness, to train a summarization model in a supervised manner. For future work, we plan to utilize the actionness information to train simpler, more efficient video summarization methods in an unsupervised manner.
Acknowledgment. We would like to thank Professor Abhijit Mahalanobis for helpful discussions and feedback, and NVIDIA for donating the GPU used in the experiments.
-  S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. Sst: Single-stream temporal action proposals. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6373–6382. IEEE, 2017.
-  Z. Cernekova, I. Pitas, and C. Nikou. Information theory-based shot cut/fade detection and video summarization. IEEE Transactions on circuits and systems for video technology, 16(1):82–91, 2006.
-  B.-C. Chen, Y.-Y. Chen, and F. Chen. Video to text summary: Joint video summarization and captioning with recurrent neural networks. 2017.
-  W. Chen, C. Xiong, R. Xu, and J. J. Corso. Actionness ranking with lattice conditional ordinal random fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 748–755, 2014.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
-  D. Davidson. Actions, reasons, and causes. The journal of philosophy, 60(23):685–700, 1963.
-  S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr, and A. de Albuquerque Araújo. Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56–68, 2011.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  T. Evgeniou and M. Pontil. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
-  J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. Turn tap: Temporal unit regression network for temporal action proposals, 2017.
-  G. Geisler, G. Marchionini, B. M. Wildemuth, A. Hughes, M. Yang, T. Wilkens, and R. Spinks. Video browsing interfaces for the open video project. In CHI’02 Extended Abstracts on Human Factors in Computing Systems, pages 514–515. ACM, 2002.
-  B. Gong, W.-L. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems, pages 2069–2077, 2014.
-  A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pages 6645–6649. IEEE, 2013.
-  M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In European conference on computer vision, pages 505–520. Springer, 2014.
-  M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In Proceedings CVPR 2015, pages 3090–3098, 2015.
-  R. Hirsch. Seizing the Light: A Social & Aesthetic History of Photography. Taylor & Francis, 2017.
-  R. Hirsch. Seizing the Light: A Social & Aesthetic History of Photography. Taylor & Francis, 2017.
-  S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Z. Ji, K. Xiong, Y. Pang, and X. Li. Video summarization with attention-based encoder-decoder networks. arXiv preprint arXiv:1708.09545, 2017.
-  Y. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. Thumos challenge: Action recognition with a large number of classes, 2014.
-  K. KAUST. End-to-end, single-stream temporal action detection in untrimmed videos.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  A. Kulesza and B. Taskar. Learning determinantal point processes. 2011.
-  A. Kulesza, B. Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012.
-  Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1346–1353. IEEE, 2012.
-  N. Li, D. Xu, Z. Ying, Z. Li, and G. Li. Searching action proposals via spatial actionness estimation and temporal path inference and tracking. In Asian Conference on Computer Vision, pages 384–399. Springer, 2016.
-  X. Li, B. Zhao, and X. Lu. MAM-RNN: Multi-level attention model based RNN for video captioning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2208–2214. AAAI Press, 2017.
-  Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2714–2721. IEEE, 2013.
-  Y. Luo, L.-F. Cheong, and A. Tran. Actionness-assisted recognition of actions. In Proceedings of the IEEE International Conference on Computer Vision, pages 3244–3252, 2015.
-  Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In Proceedings of the tenth ACM international conference on Multimedia, pages 533–542. ACM, 2002.
-  B. Mahasseni, M. Lam, and S. Todorovic. Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  V. A. Malyshev and A. M. Vershik. Asymptotic combinatorics with application to mathematical physics, volume 77. Springer Science & Business Media, 2012.
-  T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
-  A. Montes, A. Salvador, S. Pascual, and X. Giro-i Nieto. Temporal activity detection in untrimmed videos with recurrent neural networks. arXiv preprint arXiv:1608.08128, 2016.
-  W. OBILE. Ericsson mobility report, 2016.
-  P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1029–1038, 2016.
-  D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In European conference on computer vision, pages 540–555. Springer, 2014.
-  T. Robinson and F. Fallside. A recurrent error propagation network speech recognition system. Computer Speech & Language, 5(3):259–274, 1991.
-  Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5179–5187, 2015.
-  K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In International conference on machine learning, pages 843–852, 2015.
-  A. Swartz. GoPro posts record fourth-quarter sales but stock falls 15 percent on poor outlook, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542, 2015.
-  T. H. Vu, A. Dang, L. Dung, and J.-C. Wang. Self-gated recurrent neural networks for human activity recognition on wearable devices. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, pages 179–185. ACM, 2017.
-  L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2708–2717, 2016.
-  M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia, 14(4):975–985, 2012.
-  P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
-  L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Video description generation incorporating spatio-temporal features and a soft-attention mechanism. arXiv preprint arXiv:1502.08029, 2015.
-  T. Yao, Y. Li, Z. Qiu, F. Long, Y. Pan, D. Li, and T. Mei. MSR Asia MSM at ActivityNet Challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos.
-  K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 1059–1067. IEEE, 2016.
-  K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In European conference on computer vision, pages 766–782. Springer, 2016.
-  B. Zhao, X. Li, and X. Lu. Hierarchical recurrent neural network for video summarization. In Proceedings of the 2017 ACM on Multimedia Conference, pages 863–871. ACM, 2017.
-  Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.