Learning to score and summarize figure skating sport videos
This paper focuses on fully understanding figure skating sport videos. In particular, we present a large-scale figure skating sport video dataset, which includes 500 figure skating videos. On average, the length of each video is 2 minutes and 50 seconds. Each video is annotated with three scores from nine different referees, i.e., Total Element Score (TES), Total Program Component Score (PCS), and Total Deductions (DED). The skaters in this dataset come from more than 20 different countries. We compare different features and models for predicting the scores of each video. We also derive a video summarization dataset of 476 videos, with the ground-truth video summary produced from the highlight shots. A reinforcement learning based video summarization algorithm is proposed here, and the experiments show better performance than other baseline video summarization algorithms.
1. Introduction
With the rapid development of digital cameras and the proliferation of social media sharing, there has been explosive growth in available figure skating sport videos, in both quantity and granularity. Every year, over 20 international figure skating competitions are held by the International Skating Union (ISU), with hundreds of skaters participating in them. Most high-level international competitions, such as the ISU Championships and the ISU Grand Prix of Figure Skating, are broadcast by worldwide broadcasters such as CBC, NHK, Eurosport, and CCTV. During the season, over 100 figure skating videos are uploaded to YouTube and Dailymotion every day.
Analysis of figure skating sport videos also has many real-world applications, such as automatically scoring the skaters, highlight shot generation, and video summarization. By virtue of state-of-the-art deep architectures and action recognition approaches, the techniques of figure skating video analysis will also facilitate statistically comparing players and teams, analyzing a player's fitness, and assessing weaknesses and strengths. From these sport statistics, professional advice can be drawn to help the training of players.
Sport video analytics and action recognition in general have been extensively studied in previous works. There exist many video datasets, such as Sports-1M (deepvideo, ), UCF 101 (Soomro2012, ), HMDB51 (HMDB51, ), FCVID (fcvid_2017, ) and ActivityNet (activitynet, ). The videos in these datasets are mostly Internet videos (e.g., from YouTube, Flickr, etc.), annotated via crowdsourcing. In contrast, as one type of fine-grained sport videos, the existing video datasets on figure skating are usually not large enough to facilitate predicting the scores of skaters, or learning to summarize figure skating videos.
On these datasets, previous efforts have mainly focused on video classification (jiang2011consumervideo, ; deepvideo, ), video event detection (over2011trecvid, ), and so on. In contrast, we target predicting the scores of each skater and abstracting a high-quality video summary of figure skating videos. The figure skating videos should be of high quality and captured by professional devices; thus we use international figure skating competition videos as the data source to construct our dataset. Furthermore, figure skating videos are often long and captured from multiple views for live viewing, and the actions and movements of skaters are usually very fast. These unique characteristics make analyzing figure skating videos extremely difficult.
To this end, this paper aims at addressing the aforementioned problems in figure skating videos. In particular, we present a relatively large-scale figure skating video dataset. Example video frames of this dataset are shown in Fig. 1. The videos in this dataset all come from the match videos of high-standard international skating competitions. In each video, we only keep the whole performance of each skater; the parts irrelevant to the skater (such as warming up, or bowing to the audience after the performance) are deleted. Thus the length of each video is about 2 minutes and 50 seconds. In total, we collect 500 videos of 149 professional figure skaters from more than 20 different countries. We also gather the scores given by nine different referees in the competitions, as well as the ground-truth video summary of each video contributed by professional video editors. These video summaries replay the highlight shots of each skater's performance, such as the wonderful movement sequences and the technical elements important for judging.
On this dataset, we evaluate two tasks, namely predicting the scores of skaters, and learning to produce video summaries. Specifically, we extract state-of-the-art video features – SENet (SEnet, ) and C3D (Tran_ICCV2015, ) – to predict the scores of each video. The supervised video summarization task is formulated as an MDP-based reinforcement learning problem. The experiments evaluate our performance on these two tasks.
The rest of this paper is organized as follows. Sec. 2 reviews related work. We describe the details of constructing the dataset in Sec. 3. The methodology for solving the two tasks is discussed in Sec. 4. We give the experimental results in Sec. 5, and conclude the whole paper in Sec. 6.
2. Related Work
Video Understanding in General. The sheer volume of video data nevertheless makes automatic video content understanding intrinsically difficult. Very recently, deep architectures have been successfully applied to effectively extract feature representations in the video domain. While the development of image representation techniques has matured quickly in recent years (gupta_pami2009, ; graph_matching, ; human_still_latent, ; distributed_repre_cvpr2011, ; person_interaction, ), more advanced architectures have been proposed for video understanding (retrieving_movie, ; action_bank, ; efros2003action, ), including Convolutional Networks (ConvNets) with Long Short-Term Memory (LSTMs) (donahue_cvpr2015, ; snippet_cvpr2015, ) and 3D Convolutional Networks (two_strems_nips, ) for visual recognition and action classification, two-stream network fusion for video action recognition (snippet_cvpr2015, ; two_strems_nips, ), and Convolutional Networks learning spatiotemporal features (taylor_eccv2010, ; Tran_ICCV2015, ).
Video Representation. The success of deep learning in video analysis tasks is rooted in its ability to derive discriminative spatial-temporal feature representations directly from raw data, tailored for a specific task (Ji2010, ; C3D, ). While stacking frames directly as inputs to CNN models is straightforward, learning spatial-temporal features directly with limited data is difficult, as suggested by the results (the performance of 3D convolutions is worse than that of state-of-the-art hand-crafted features (Wang2013a, )). Furthermore, 3D convolutions are computationally expensive, requiring more iterations to reach convergence. To mitigate these issues, Sun et al. proposed to factorize spatial-temporal convolutions (sun2015human, ). It is worth noting that videos can be naturally considered as an ensemble of spatial and temporal components. Motivated by this observation, Simonyan and Zisserman introduced a two-stream framework, which learns spatial and temporal feature representations concurrently with two convolutional networks (Simonyan2014, ). This two-stream approach achieved state-of-the-art performance on many benchmarks. Furthermore, several important variants of fusing the two streams have been proposed. For instance, Wang et al. proposed Trajectory-Pooled Descriptors (TDD) (cvpr15:wang, ) by leveraging trajectories (Wang2013a, ) to encode two-stream convolutional feature maps. Feichtenhofer et al. investigated a better fusion approach to integrate the spatial and temporal streams (Feichtenhofer16, ). Wang et al. proposed to divide a video clip into segments and learn a consensus function over the segments (WangXWQLTV16, ). Zhang et al. utilized motion vectors as a replacement for optical flow images to speed up inference (ZhangWWQW16, ). Wang et al. modeled the transitions between actions as a proxy to learn feature representations (Wang_Transformation, ). Zhu et al. introduced a key volume mining method that attempts to discover key volumes for better classification (zhu2016key, ).
Recently, Bilen et al. encoded motions with rank pooling into dynamic images as inputs to a CNN model for recognition (bilen2016dynamic, ). Ye et al. further evaluated different implementation choices, including the dropout ratio, network architecture, etc., and reported their results in (icmr15:eval2stream, ).
Video Fusion. In video categorization systems, two types of feature fusion strategies are widely used, i.e., early fusion and late fusion. Multiple kernel learning (mkl_2004, ) was utilized to estimate the fusion weights (heterog_iccv_2009, ; natarajan_2012, ) needed in both early and late fusion. Since both methods cannot exploit hidden feature relationships, several more advanced feature fusion techniques have been proposed. For instance, an optimization framework proposed by Ye et al. (late_fusion_2012, ) applied a shared low-rank matrix to reduce noise, an audio-visual joint codebook proposed by Jiang et al. (jiang_acmmm2009, ) discovered the correlations between audio and visual features for video classification, and the dynamic fusion adopted by Liu et al. (liu_2013_cvpr, ) identified the best feature combination strategy. With the rapid growth of deep neural networks, the combination of multiple features in neural networks has gradually come into sight. In multimodal deep learning, a deep denoising autoencoder (multimodal_icml2011, ) and Boltzmann machines (Srivastava_NIPS2012, ) have been utilized, respectively.
Sports Video Analysis. Recently, the analysis of sports videos has been rapidly growing in both the quantity and granularity of data sources (camera_light, ). A common and important unit of information is an action, or a short sequence of actions. Various works assess how well people perform actions in different sports, including an automated video assessment system that analyzes video recordings of gymnasts performing the vault (gordon_1995, ), a probabilistic model of basketball team play based on the trajectories of all players (jug_2003, ), trajectory-based evaluation of multi-player basketball activity using a Bayesian network (perse_2007, ), and a machine learning classifier on top of a rule-based algorithm to recognize on-ball screens (McQueen2014, ). Pirsiavash et al. introduced a learning-based framework evaluating two distinct types of actions (diving and figure skating) by training a regression model from spatiotemporal pose features to scores obtained from expert judges (quality_action, ).
Video Summarization. Video summarization has been extensively studied over the past two decades in the multimedia community (Truong:2007:VAS:1198302.1198305, ). Most works in video summarization are unsupervised. The video summary can be organized into either key-frames (hanjalic_1999, ; liu_pami_2009, ) or video skims (fu2010summarize, ; scene_graph2005, ; event_driven_summary, ; DBLP:conf/mm/WangJCGDW14, ). Many different kinds of information help extract the video summary, including low-level information (motion cues (event_driven_summary, ) and visual saliency (scene_graph2005, )) and middle-level information (object trajectories (liu_pami_2009, ), tag localization (event_driven_summary, ) and semantic recognition (DBLP:conf/mm/WangJCGDW14, )). Video summarization has been explored for various types of content, such as movies (scene_graph2005, ), news reports (event_driven_summary, ), and surveillance videos (fu2010summarize, ). However, supervised video summarization has been relatively less studied, especially on figure skating videos. The key difference between summarizing figure skating videos and previous consumer videos is that the wonderful movement sequences and the technical elements important for judging should be included.
3. Figure Skating Video Dataset
Our figure skating video dataset is designed to study the problem of analyzing figure skating videos, including learning to predict the scores of each skater, highlight shot generation, and video summarization. This dataset will be released to the community under the necessary license.
3.1. Dataset construction
Data source. To construct the dataset, we search and download a great quantity of figure skating videos. The figure skating videos come from formal high-standard international skating competitions, including the NHK Trophy (NHK), Trophee Eric Bompard (TEB), Cup of China (COC), Four Continents Figure Skating Championships (4CC), and so on. Please refer to the wiki (https://en.wikipedia.org/wiki/ISU_Grand_Prix_of_Figure_Skating) for the full names of the competitions. The videos in our figure skating video dataset only cover the performance process in the competitions. In contrast, previous datasets (e.g., UCF 101 (Soomro2012, ), HMDB51 (HMDB51, ), Sports-1M (deepvideo, ) and ActivityNet (activitynet, )) are usually searched and downloaded from various search engines (e.g., Google, Flickr, Bing, etc.) or social media sharing platforms (e.g., YouTube, DailyMotion, etc.). Such online figure skating videos include many parts irrelevant to the performance, such as warming up, bowing to the audience after the performance, and waiting for scores at the Kiss & Cry. We also collect the ground-truth scores given by at least nine different referees in each competition.
Selection Criteria. We carefully select the figure skating videos. To obtain standard and authoritative score predictions, we select videos only from the highest level of international competitions, with fair and reasonable judgement. In particular, we use videos from the ISU Championships, the ISU Grand Prix of Figure Skating and the Winter Olympic Games. In total, we have videos of 149 skaters from more than 20 different countries. Furthermore, in figure skating competitions, the mark scheme changes slightly every season, and is very different for men and women. To make the scores comparable, only the competition videos of the ladies' singles short program over the past five years are utilized in our figure skating video dataset.
3.2. Pre-processing and Scoring
Pre-processing. We initially downloaded about 100 hours of videos; a pruning procedure is thus needed to remove low-quality videos. In particular, we manually select and remove the videos that are neither fluent nor coherent. To make sure the figure skating videos exactly correspond to the ground-truth scores, we manually process each video by further cutting the redundant clips (e.g., replay shots or the skater's warming-up shots). We only keep the video from the exact beginning of each performance to the moment of the ending pose, with a duration of about 2 minutes and 50 seconds. This time slot also meets the music standard stipulated by the International Skating Union.
Scoring of figure skating. We carefully annotate each video with the skater and competition, and label it with three scores, namely, Total Element Score (TES), Total Program Component Score (PCS), and Total Deductions (DED). These scores are given according to the mark scheme of figure skating competitions, and collectively measure the performance of a skater over the whole competition. The TES judges the difficulty and execution of all technical movements; the PCS evaluates the performance and the interpretation of the music by the skater; and the DED records the number of critical misses or extra illegal movements, most commonly perceived as falls. Among the three scores, the TES and PCS are given by nine different referees who are experts on figure skating, while the DED is an objective measure. All three scores collectively determine the final score of the performance, which is computed as TES + PCS - DED for each skater. Note that the same skater may receive very different scores at different competitions, depending on her performance. Finally, we gather 500 videos of the ladies' singles short program, and each video comes with the ground-truth scores.
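The score aggregation described here is simple enough to state as a one-line function; the snippet below is our own illustration (the function name and example values are not from the dataset's release):

```python
# Hypothetical helper illustrating how the final score of a performance
# is assembled from the three annotated scores: TES + PCS - DED.
def final_score(tes: float, pcs: float, ded: float) -> float:
    """Final score of a figure skating performance."""
    return tes + pcs - ded

# Example (made-up numbers): a skater with TES 35.10, PCS 30.25 and
# one fall counted as DED 1.0.
print(final_score(35.10, 30.25, 1.0))  # 64.35
```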
3.3. Data Analysis
The Spearman correlation and Kendall tau correlation between TES and PCS are analyzed over different matches (in Fig. 2) and different skaters (in Fig. 3). In particular, we take the TES and PCS values of all skaters participating in each match, and compute their correlations as shown in Fig. 2. We also take the TES and PCS values of the same skater across all the matches she took part in, and calculate their correlations in Fig. 3. Essentially, the correlations between TES and PCS reflect the relationship between the technical and program content.
As shown in Fig. 2, we find that almost all competitions are scored linearly and positively in terms of technical content (TES) and program content (PCS). Thus, when a skater executes the technical content well, her score on the program content increases accordingly. Interestingly, in terms of the mark scheme, TES and PCS are designed to measure two quite different aspects of the skater's performance; in other words, TES and PCS should be relatively independently distributed. In practice, however, the referees appear to take the technical difficulty into account and adjust the subjective program component scores accordingly, or vice versa. Such a strong correlation seems intrinsically inevitable in current expert scoring systems. Similarly, strong correlations between TES and PCS are also observed in Fig. 3.
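The match-level analysis boils down to computing rank correlations between two score lists. As an illustration, here are minimal pure-Python versions of the two measures (ignoring tied ranks; in practice one would use `scipy.stats.spearmanr` and `scipy.stats.kendalltau`), run on made-up toy scores:

```python
# Minimal Spearman and Kendall tau correlations (no tie handling),
# illustrating the TES-vs-PCS analysis on toy numbers of our own.
def _ranks(xs):
    """1-based rank of each element (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(x, y):
    """Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def kendall(x, y):
    """(concordant - discordant) pairs over total pairs."""
    n, c, d = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            c += s > 0
            d += s < 0
    return (c - d) / (n * (n - 1) / 2)

# Toy example: TES and PCS of five skaters in one match (made-up).
tes = [38.2, 35.1, 30.4, 28.9, 25.0]
pcs = [33.0, 31.5, 29.8, 27.1, 26.4]
print(round(spearman(tes, pcs), 3), round(kendall(tes, pcs), 3))  # 1.0 1.0
```

Since the toy lists are perfectly monotone, both correlations equal 1.0, mirroring the strongly positive correlations reported in Figs. 2 and 3.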
4. Methodology
In this section, we explore the two important tasks we use the figure skating video dataset to solve. Sec. 4.1 discusses our efforts in learning a model to predict the skating scores. We explore the way of learning to generate video summaries of figure skating videos in Sec. 4.2.
4.1. Predicting the skating scores
Prediction Tasks. In figure skating matches, the scores of TES and PCS are given as floating-point numbers, while the Total Deductions (DED), which denotes the number of failures during the performance, takes discrete values. Thus the tasks of predicting the TES and PCS scores can be formulated as regression tasks; in contrast, predicting DED can be organized as a classification task. To perform prediction on the TES, PCS and DED tasks, we further consider the deep features used and the prediction models.
Deep Video Features. Video features are very important in representing the video content. We extract two types of deep features – static frame-based and clip-based features. In particular, we utilize SENet (SEnet, ), the winner of the ILSVRC 2017 image classification challenge, for the static frame features, extracted from the pool5 layer. The C3D (Tran_ICCV2015, ) network is employed to represent the features of each clip; we use the 4096-dimensional output features of the fc6 layer in C3D. We use a sliding window of 16 frames over the temporal dimension to cut the video clips, with a stride of 8. We then use a max or average operator to fuse the extracted frame-based SENet and clip-based C3D features into a video-level representation.
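The windowing and fusion steps can be sketched as follows; `clip_windows` and `fuse` are hypothetical helper names, and the real pipeline would feed each 16-frame window through C3D before fusing:

```python
# Sketch of the clip cutting and video-level fusion described above.
def clip_windows(num_frames, size=16, stride=8):
    """Start indices of sliding windows fully inside the video."""
    return list(range(0, num_frames - size + 1, stride))

def fuse(features, op="avg"):
    """Channel-wise average or max over a list of equal-length vectors."""
    cols = list(zip(*features))
    if op == "avg":
        return [sum(c) / len(c) for c in cols]
    return [max(c) for c in cols]

print(clip_windows(48))               # [0, 8, 16, 24, 32]
print(fuse([[1, 4], [3, 2]], "avg"))  # [2.0, 3.0]
print(fuse([[1, 4], [3, 2]], "max"))  # [3, 4]
```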
Prediction models. Various classification and regression models can be used for the proposed tasks. We consider and compare several of the most basic models, including the Support Vector Machine/Regressor (SVM/SVR) and a Neural Network with three fully-connected layers. The pre-extracted video-level SENet and C3D features are concatenated as the input features. We compare these models in the experimental section.
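To make the regression setup concrete without pulling in an SVR or neural-network library, the sketch below fits ordinary least squares on concatenated toy features. This is only a dependency-free stand-in for the SVR/NN models actually compared in the paper; the feature values and target scores are made up:

```python
# Toy regression setup: concatenate video-level features, then fit a
# linear model by solving the normal equations with Gauss-Jordan
# elimination (naive; assumes nonzero pivots, fine for a demo).
def concat(senet_feat, c3d_feat):
    return senet_feat + c3d_feat  # simple feature concatenation

def fit_ols(X, y):
    """Solve (X^T X) w = X^T y for the weight vector w."""
    d = len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(d)]
         for i in range(d)]
    b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(d)]
    for i in range(d):
        p = A[i][i]
        for j in range(d):
            A[i][j] /= p
        b[i] /= p
        for r in range(d):
            if r != i:
                f = A[r][i]
                for j in range(d):
                    A[r][j] -= f * A[i][j]
                b[r] -= f * b[i]
    return b

# Made-up 1-D "SENet" and "C3D" features with targets generated by w=(2,3).
X = [concat([1.0], [0.0]), concat([0.0], [1.0]), concat([1.0], [1.0])]
y = [2.0, 3.0, 5.0]
print([round(v, 6) for v in fit_ols(X, y)])  # [2.0, 3.0]
```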
4.2. Supervised Video summarization
Ground-truth summary. Our videos are collected and cut from the videos of high-standard international skating competitions. Such match videos are edited by professional video editors. Usually, once a skater finishes her performance, a video summary of this skater is shown next, replaying the highlight shots, including the wonderful movement sequences, the technical elements important for judging, and also critical misses or extra illegal movements. This professional video summary can be taken as the ground truth for our supervised video summarization algorithms. In our dataset, we collect the professionally edited video summaries for 464 videos, and we randomly split the summary dataset into 371 training videos/summaries and 93 testing videos/summaries. Two types of video summary are provided: a 10-second short summary and a 20-second long summary.
Our algorithm. In order to learn to produce video summaries, we employ a reinforcement learning algorithm for video summarization. Specifically, our model aims at learning an MDP-based agent (paletta2000activeobject, ) that can interact with a video over the temporal sequence. The agent is implemented based on the Long Short-Term Memory (Hochreiter1997, ) model. At each time-step $t$, the LSTM-based agent obtains an observation by aggregating video frames around its current location $l_t$. The agent then produces a candidate time interval $d_t$, a prediction indicator $p_t$ which decides whether to emit the currently selected segment, and an observation location output $l_{t+1}$ determining where to look next.
When training the model, we treat the outputs differently. As for the candidate time interval $d_t$: although the final prediction is gated by the prediction indicator $p_t$, the model becomes stronger if $d_t$ accurately predicts the nearby ground truth at each time-step; so we conduct supervised back-propagation on $d_t$ with the mean square error

$L_{reg} = \frac{1}{T} \sum_{t=1}^{T} \left\| d_t - g_{j(t)} \right\|^2,$

where $g_{j(t)}$ indicates the ground-truth summary clip that is the nearest to the location $l_t$, i.e., $j(t) = \arg\min_{j} |g_j - l_t|$, $g_j$ is the $j$-th ground-truth summary clip, and $T$ is the total number of time-steps.
Here we use REINFORCE (williams1992simple, ) to learn the agent's decision policy, since back-propagation is not adequate in this non-differentiable setting; this covers the prediction indicator $p_t$ and the observation location output $l_{t+1}$. The reward of each episode is defined in terms of the number of emitted segments whose overlap ratio with the ground truth is over a threshold, with a positive reward for each such segment and a negative reward otherwise. To discourage a non-emitting policy, we give the model a penalty larger than the negative reward if no candidate is emitted. To reduce the variance of the gradient estimation, we adopt the self-critical training algorithm in (rennie2016self, ), and use the reward obtained by the current model under the inference mode as the baseline. The code and models of this algorithm will be released.
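As a minimal illustration of the score-function update behind REINFORCE (not the paper's actual LSTM agent), the toy below trains a single-logit emit/skip policy on a bandit where emitting is always rewarded, with a running-average baseline playing a role loosely analogous to the self-critical baseline:

```python
# Toy REINFORCE on a two-action bandit: the gradient of log-probability
# is scaled by (reward - baseline), exactly the score-function update
# used when back-propagation cannot reach a discrete decision.
import math
import random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    random.seed(seed)
    theta = 0.0        # logit of P(emit)
    baseline = 0.0     # running reward baseline (variance reduction)
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        a = 1 if random.random() < p else 0     # sample emit/skip
        r = 1.0 if a == 1 else 0.0              # emitting is rewarded here
        grad_logp = (1 - p) if a == 1 else -p   # d/d(theta) log pi(a)
        theta += lr * (r - baseline) * grad_logp
        baseline = 0.9 * baseline + 0.1 * r
    return 1.0 / (1.0 + math.exp(-theta))

p = reinforce_bandit()
print(round(p, 3))  # the learned emit probability approaches 1.0
```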
5. Experiments
5.1. Settings and Evaluation
We consider several different tasks in the experiments. (1) Regression. We take the prediction of the TES and PCS scores as regression tasks, and use the mean square error (MSE) as the metric. (2) Classification. We take DED prediction as a classification task; the mean Average Precision (mAP) is employed to evaluate the classification results. (3) Supervised video summarization. Our model is learned to summarize the figure skating videos. We use the training split to train the summarization model, and the objective evaluation results are reported on the test split. In particular, the Recall and F-score are used as the quantitative measures of summarization methods, as in previous works (gygli2015video, ; zhang2016video, ). For a pair of a generated summary $A$ and a reference summary $B$, the precision, recall and F-score are computed by

$\mathrm{Precision} = \frac{|A \cap B|}{|A|}, \quad \mathrm{Recall} = \frac{|A \cap B|}{|B|}, \quad F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$

where $A \cap B$ denotes the temporal intersection of $A$ and $B$, and $|\cdot|$ denotes the total temporal length of a segment set.
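Assuming the summaries are represented as lists of non-overlapping (start, end) segments in seconds, the overlap-based metrics can be sketched as:

```python
# Temporal recall and F-score between a generated summary A and a
# reference summary B, each a list of non-overlapping (start, end)
# segments in seconds (overlapping segments would be double-counted).
def _length(segments):
    return sum(e - s for s, e in segments)

def _intersection(a, b):
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out

def recall_fscore(A, B):
    inter = _length(_intersection(A, B))
    prec = inter / _length(A)
    rec = inter / _length(B)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return rec, f

A = [(0, 5), (10, 15)]      # generated summary: 10 s in total
B = [(3, 8), (10, 12)]      # reference summary: 7 s in total
print(recall_fscore(A, B))  # intersection is 4 s -> recall 4/7
```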
Parameter Settings. For the score prediction model, we use a neural network with two hidden layers. For the summarization model, we use a 3-layer LSTM; the reward values, the overlap threshold, and the standard deviation of the Normal distribution used to sample the location output are fixed hyper-parameters. We adopt Adam (kingma2014adam, ) as the optimization algorithm. Both models are implemented in PyTorch.
5.2. Results on Regression.
Settings. As shown in Tab. 1, we compare three basic regression models, namely, SVR with a linear kernel, SVR with an RBF kernel, and a Neural Network (NN) with three fully-connected layers. As discussed in Sec. 4.1, we also compare two types of fusion methods, i.e., average and maximum fusion. In particular, the average fusion operator averages all the frame-based static SENet features and/or clip-based C3D features. The maximum operator is similar to maxout: on each feature channel, we only keep the highest activation value. We compare the results of using SENet, C3D, as well as the concatenation of SENet and C3D features.
Results. The regression results are given in Tab. 1. We can draw several conclusions from the results. (1) Max vs. Avg. Comparing the results across different types of features and supervised models, we find that average fusion in general performs better than maximum fusion. This indicates that the video-level feature fused by the average operator can better represent figure skating videos. (2) RBF/Linear SVR vs. NN. The results predicted by the neural networks are much better than those from RBF/Linear SVR, due to the high nonlinearity of neural networks. (3) SENet vs. C3D. The SENet and C3D features model different aspects of the videos: SENet features are static and frame-based, while C3D features are clip-based. In general, the models using C3D features produce better predictions than those using SENet, since figure skating videos are mostly about the movement of each skater, and the clip-based C3D features can better abstract this motion information from the videos. Furthermore, we also note that the SENet and C3D features are very complementary to each other, and their concatenation generates the best prediction results. (4) TES vs. PCS. With comparable models and features, the MSE results on PCS are generally better than those on TES. This suggests that PCS is relatively easier to predict than TES.
5.3. Results on Classification
Settings. Table 2 reports our experiments on the classification task. Again we compare three classification models: SVM with a linear kernel, SVM with an RBF kernel, and a Neural Network (NN) with three fully-connected layers. The average and maximum fusion operators are also compared here. The SENet and C3D features are used as input features; we also evaluate their concatenation.
Results. We compare the results in Tab. 2. Interestingly, on the classification task, some conclusions from the regression tasks no longer hold. (1) Maximum fusion performs better on the DED prediction task than average fusion. (2) The neural network models achieve higher performance than the RBF/Linear SVM models, again thanks to the high nonlinearity of neural networks. (3) The models using SENet features beat the corresponding models using C3D features. This also makes sense, since one can easily judge whether the skater makes a critical miss or an extra illegal movement even from a single image; thus the frames alone are good enough to judge the DED.
5.4. Supervised Video Summarization
We compare the results on supervised video summarization. In particular, we compare against four different baseline models for video summarization, as follows.
Uniform sampling. We parse the videos into clips of 2 seconds in length. We then uniformly sample clips from the video in order to generate the required length of video summary. We use this method only on the testing split.
K-means. We again segment each video into 2-second clips, each represented by its C3D features. For each video, we cluster the clips by the K-means algorithm, and select the segment nearest to each cluster center as the most representative clip. The final video summary is constructed from the representative clips of the several largest clusters. We summarize the testing videos by K-means.
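A toy version of this baseline, with scalar stand-ins for the C3D features and a naive first-k initialization, might look like:

```python
# Toy K-means baseline: cluster clip features, then pick the clip
# nearest to each centroid as that cluster's representative.
def kmeans_representatives(feats, k, iters=20):
    cents = feats[:k]  # naive init: first k clips
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in feats:
            j = min(range(k), key=lambda c: abs(f - cents[c]))
            groups[j].append(f)
        cents = [sum(g) / len(g) if g else cents[j]
                 for j, g in enumerate(groups)]
    # index of the clip nearest to each final centroid
    return [min(range(len(feats)), key=lambda i: abs(feats[i] - c))
            for c in cents]

# Seven clips forming three obvious groups (made-up scalar features).
clips = [0.1, 5.0, 9.9, 0.2, 5.4, 10.1, 5.1]
reps = kmeans_representatives(clips, 3)
print(sorted(clips[i] for i in reps))  # one representative per group
```

The real baseline works in the 4096-dimensional C3D feature space and keeps the representatives of the largest clusters until the summary length budget is met.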
Table 1 (excerpt):
Feature        Fusion   Method           MSE (TES)     MSE (PCS)
[C3D, SENet]   Max      RBF/Linear SVR   35.17/29.12   18.84/16.08
Submodular. This is a supervised video summarization baseline. As in (gygli2015video, ), a submodular function is introduced to measure representativeness and uniformity. This method learns approximately optimal submodular mixtures for summarization. The weights of the two metrics are trained on the training set, and the produced results are evaluated on the testing set.
Table 2 (excerpt):
Feature        Fusion   Method           mAP
SENet          Max      RBF/Linear SVM   35.66/43.28
SENet          Avg      RBF/Linear SVM   40.22/41.33
C3D            Max      RBF/Linear SVM   35.28/40.42
C3D            Avg      RBF/Linear SVM   34.49/38.31
[SENet, C3D]   Max      RBF/Linear SVM   36.04/42.05
LSTM. Following the method in (zhang2016video, ), a Long Short-Term Memory (LSTM) network is employed to model the temporal dependency among video frames; the LSTM can derive both representative and compact video summaries. The selected key-frames or segments are encoded as binary indicator vectors. The whole model is learned on the training set, and the parameters of the LSTM model are optimized with stochastic gradient descent. We use the parameters suggested in (zhang2016video, ), and evaluate the results on the testing split.
Subjective Evaluation. A user study is conducted to subjectively evaluate the results of the video summaries. In particular, we randomly select 10 generated video summaries from the testing split, and make sure each competing method generates a 10-second video summary. We enroll 5 volunteers who are not involved in our project and are relatively familiar with figure skating. We first show them the original figure skating videos, and then show them the video summary produced by each competing method for the corresponding video. We ask them to rate the results on a five-point scale by the following metrics: (1) Coverage: does the summary cover all the wonderful movement sequences and important technical elements of the skater? (2) Quality: how good is the overall subjective quality of the produced summary? (3) Accuracy: if you were the referee, could the summary accurately reflect the final scores the skater obtained? The whole evaluation process is repeated 5 times; each time, 10 new videos are randomly selected. The final averaged results are reported. We compare the subjective evaluation results in Tab. 4. From the table, we can highlight two conclusions: (1) the supervised summarization algorithms in general achieve better results than the unsupervised algorithms on all three metrics; (2) our algorithm produces better results than the other algorithms.
Objective Evaluation. Since we have the ground-truth video summaries, we can also conduct an objective evaluation of all the methods. The results are compared in Tab. 3, for both the 10-second and 20-second summaries. We find that (1) the supervised summarization algorithms in general have better Recall and F-score than the unsupervised ones, which shows that the supervised paradigm can generate summaries closer to the required results; and (2) our summarization algorithm in general has the best Recall and F-score, which validates the effectiveness of the proposed method on video summarization.
6. Conclusion
In this paper, we present a new dataset for figure skating sport video analysis. We target two tasks: learning to score each skater's performance, and supervised video summarization for each skater. We compare the state-of-the-art deep features for video understanding, and propose a new reinforcement learning based video summarization algorithm. We conduct both subjective and objective evaluations of the summaries on the testing dataset. The experimental results validate the effectiveness of our proposed algorithm.
-  F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the smo algorithm. In ICML, 2004.
-  H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
-  L. Cao, J. Luo, F. Liang, and T. S. Huang. Heterogeneous feature machines for visual recognition. In ICCV, 2009.
-  V. Delaitre, J. Sivic, and I. Laptev. Learning person-object interactions for action recognition in still images. In NIPS, 2011.
-  D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE TPAMI, 2009.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
-  A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, pages 726–733, 2003.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
-  Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. Multi-view video summarization. IEEE Transactions on Multimedia, 12(7):717–729, 2010.
-  A. Gordon. Automated video assessment of human performance. In AI-ED, 1995.
-  A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE TPAMI, 2009.
-  G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010.
-  M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3090–3098, 2015.
-  A. Hanjalic and H. Zhang. An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis. In IEEE TCSVT, 1999.
-  F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint, 2017.
-  S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. In ICML, 2010.
-  W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. Loui. Short-term audio-visual atoms for generic video concept classification. In ACM MM, 2009.
-  Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. In IEEE TPAMI, 2017.
-  Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ACM International Conference on Multimedia Retrieval, 2011.
-  M. Jug, J. Pers, B. Dezman, and S. Kovacic. Trajectory based assessment of coordinated human activity, 2003.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
-  I. Laptev and P. Perez. Retrieving actions in movies. In ICCV, 2007.
-  D. Liu, K.-T. Lai, G. Ye, M.-S. Chen, and S.-F. Chang. Sample-specific late fusion for visual category recognition. In CVPR, 2013.
-  Z. Lowe. Lights, cameras, revolution. 2013.
-  S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.
-  A. McQueen, J. Wiens, and J. Guttag. Automatically recognizing on-ball screens. In MIT Sloan Sports Analytics Conference (SSAC), 2014.
-  P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, and R. Prasad. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.
-  J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In ICML, 2011.
-  C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang. Video summarization and scene detection by graph modeling. In IEEE TCSVT, 2005.
-  P. Over, G. Awad, M. Michel, J. Fiscus, W. Kraaij, and A. F. Smeaton. Trecvid 2011 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011, 2011.
-  L. Paletta and A. Pinz. Active object recognition by view integration and reinforcement learning. Robotics and Autonomous Systems, 31:71–86, 2000.
-  M. Perse, M. Kristan, J. Pers, and S. Kovacic. Automatic evaluation of organized basketball activity using bayesian networks. In Citeseer, 2007.
-  H. Pirsiavash, C. Vondrick, and A. Torralba. Assessing the quality of actions. arXiv preprint, 2017.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
-  S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
-  K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
-  N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, 2012.
-  L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In CVPR, 2015.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
-  D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: Generic features for video analysis. arXiv preprint, 2014.
-  B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM TOMM, 3(1):79–82, 2007.
-  H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
-  L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
-  M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia, 14(4):975–985, 2012.
-  X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In CVPR, 2016.
-  X. Wang, Y. Jiang, Z. Chai, Z. Gu, X. Du, and D. Wang. Real-time summarization of user-generated videos based on semantic recognition. In ACM MM, 2014.
-  R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
-  W. Yang, Y. Wang, and G. Mori. Recognizing human actions from still images with latent poses. In CVPR, 2010.
-  B. Yao and L. Fei-Fei. Action recognition with exemplar based 2.5d graph matching. In ECCV, 2012.
-  G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang. Robust late fusion with rank minimization. In CVPR, 2012.
-  H. Ye, Z. Wu, R.-W. Zhao, X. Wang, Y.-G. Jiang, and X. Xue. Evaluating two-stream cnn for video classification. In ACM ICMR, 2015.
-  B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. In CVPR, 2016.
-  K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In European Conference on Computer Vision, pages 766–782. Springer, 2016.
-  W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao. A key volume mining deep framework for action recognition. In CVPR, 2016.