Meta Learning for Task-Driven Video Summarization

Xuelong Li, Fellow, IEEE, Hongli Li, and Yongsheng Dong, Senior Member, IEEE

This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1107400, in part by the National Natural Science Foundation of China under Grant 61871470 and Grant U1604153, in part by the Key Specialized Research and Development Breakthrough of Henan Province under Grant 192102210121, and in part by the Program for Science and Technology Innovation Talents in Universities of Henan Province under Grant 19HASTIT026.

X. Li and H. Li are with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, P.R. China. Y. Dong is with the School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China.

Existing video summarization approaches mainly concentrate on the sequential or structural characteristics of video data. However, they do not pay enough attention to the video summarization task itself. In this paper, we propose a meta learning method for performing task-driven video summarization, denoted by MetaL-TDVS, to explicitly explore the video summarization mechanism among summarizing processes on different videos. In particular, MetaL-TDVS aims to excavate the latent mechanism for summarizing video by reformulating video summarization as a meta learning problem, and to promote the generalization ability of the trained model. MetaL-TDVS regards summarizing each video as a single task, so that the experience and knowledge learned from summarizing other videos can be better used to summarize new ones. Furthermore, MetaL-TDVS updates models via a two-fold back propagation, which forces the model optimized on one video to obtain high accuracy on another video in every training step. Extensive experiments on benchmark datasets demonstrate the superiority and better generalization ability of MetaL-TDVS against several state-of-the-art methods.

video summarization, key frame extraction, meta learning

I Introduction

Humans get information from videos easily, but the time needed to browse them is considerable. Thus, an efficient way to manage large amounts of video is urgently needed [1, 2, 3]. Because it automatically captures the essence of a video and discards redundant frames, video summarization meets this requirement well [4], and is widely used in video storage, display, compression, etc. For instance, given limited battery energy and unstable internet speed, it is better for real-time communication apps on mobile phones to transmit a representative summary instead of the entire video.

Unsupervised video summarization methods [5, 6] usually use manually defined criteria to extract key frames or key shots, while supervised ones [7, 8] learn models with the help of human-annotated data to determine which frames or shots are more important. In this paper, we mainly focus on supervised methods.

The majority of existing supervised methods pay most attention to the sequential or structural nature of video data. Zhang et al. stated that video summarization is inherently a structured prediction problem and proposed a supervised subset selection technique [9] to transfer summary structures from labeled training videos to unseen ones. Though this approach gets attractive results, its assumption that similar videos have similar summary structures requires the scenes of the training videos to be sufficiently diverse. When training data is not sufficiently rich, the generalization ability of the learned model will be limited.

To model temporal dependency among video frames over variable ranges, two long short-term memory (LSTM) based models were proposed by casting video summarization as a structured prediction problem [10]. To compute the probability of each frame being selected into the summary, a deep summarization network (DSN) was constructed by viewing video summarization as a sequential decision making problem [11]. To capture long-range temporal dependency among video frames, Zhao et al. proposed a hierarchical RNN (H-RNN) [12]. Moreover, LSTM was also utilized in other methods [13, 14] to model sequential or structural characteristics of video.

Though these existing supervised methods achieve state-of-the-art performance, they mainly concentrate on the sequential or structural nature of video data rather than directly on the video summarization task itself. As a result, the learned models also focus more on the sequential or structural characteristics of video data and do not explicitly explore how to summarize video. Undoubtedly, the sequential structure of video is critical to video summarization, but the mechanism for summarizing video is more crucial and is essential to the video summarization task itself.

Therefore, this work emphasizes exploration of the latent mechanism for video summarization, and lays more stress on the video summarization problem itself rather than only on video data. Specifically, we reformulate summarizing video as a meta learning problem and propose a general framework, MetaL-TDVS, to explicitly explore the latent mechanism for video summarization.

In MetaL-TDVS, summarizing each video is seen as a single task and learning proceeds among tasks. Specifically, the parameter of the learner (the specific model that summarizes video) is learned. Each update of the parameter consists of two stages with two training tasks, and the parameter of the learner is obtained via a two-fold back propagation. As demonstrated by extensive experiments, MetaL-TDVS obtains better generalization ability and is superior to many state-of-the-art supervised methods.

Contributions of MetaL-TDVS are summarized as follows:

  1. We propose MetaL-TDVS, a method that performs video summarization by employing meta learning. To the best of our knowledge, it is the first to use meta learning in the video summarization domain.

  2. We use MetaL-TDVS to explicitly explore the latent mechanism of summarizing video, and focus more on the video summarization problem itself instead of only on sequential or structural video data.

  3. We use deep features and traditional features separately to demonstrate the effectiveness of MetaL-TDVS. Experimental results reveal that MetaL-TDVS outperforms current representative video summarization methods, whether deep features or traditional features are used.

The rest of the paper is organized as follows. Section II discusses related work. Section III describes the proposed MetaL-TDVS in detail. Experimental details and results are presented and discussed in Section IV. Section V concludes the paper.

II Related Work

Supervised video summarization methods rely on annotated data to train models and attract increasing attention due to their outstanding performance. Among them, subset selection [14, 15], structured prediction [10], sequential decision making [11], and sequence to sequence learning [13] are four of the classical formulations.

A probabilistic model, the sequential determinantal point process (seqDPP) [15], is designed to select a diverse and informative subset of items (video frames or shots) from the ground set (all frames or shots of the video). The method in [14] includes an sLSTM (selector LSTM), which is devised and trained to be a subset selector, and summaries are generated from the selected frames or shots. Two bidirectional LSTM based network models are proposed in [10], which output either binary labels or importance scores of video frames given frame features as input. The method in [11] produces a video summary in three steps: frame features are first generated by an encoder (a convolutional neural network); a decoder (a bidirectional LSTM network) then computes probabilities from the features; finally, the approaches proposed in [16, 17] are employed to make a summary from the probabilities. An encoder-decoder structure is also adopted in AVS [13], where the encoder employs a bidirectional LSTM to extract features of video frames and the decoder uses two attention-based LSTMs to compute importance scores of frames.

Though existing supervised methods get promising results, almost none of them explicitly explores the mechanism for summarizing video among summarizing processes on different videos. Meta learning, in contrast, is designed for exactly this kind of learning across tasks and is therefore worth considering.

First proposed by Maudsley, meta learning is described as a process by which learners become aware of and increasingly take control of their habits, such as perception and learning [18]. After the conceptual basis of meta learning was set by Maudsley, Biggs interpreted meta learning as a state of being aware of and in control of one’s own learning [19]. Basically, meta learning can be interpreted as the process of learning to learn. Typically, in meta learning, the meta-learner is a trainable algorithm [20, 21] or a trainable model [22, 23] which can guide the learning of the learner. The learner is a specific model which handles problems directly.

Imitating the human ability of learning to learn at a high level, meta learning aims to make better use of previously learned experience and knowledge to help learning on new tasks. It thus differs from most supervised methods, which learn each task in isolation and from scratch. Through meta learning, a model trained among tasks obtains the ability of learning to learn and is able to learn new tasks quickly. By lifting the level of learning from data to tasks, meta learning is a more intelligent way of learning and attracts growing attention. Many excellent studies follow this idea. To quickly adapt to a new task using only a few data points and a few training iterations, a model-agnostic meta-learning (MAML) algorithm was proposed in [24]. A meta-learner called Meta-SGD was developed in [20] to make better use of experience and knowledge learned from related tasks. Motivated by the successful move from hand-designed features to learned features, a meta learning approach to learning suitable optimization algorithms for specific problems was proposed in [21].

The idea of meta learning, that is, not only learning but also learning to learn, seems tailored to the video summarization problem, because learning to summarize video (the mechanism for summarizing video) is more crucial than only learning to distinguish which frames are more important. Thus, the video summarization problem is itself a problem of learning to learn (learning to summarize video), and meta learning is a reasonable way to address it.

III Proposed MetaL-TDVS

This section first gives the definition of MetaL-TDVS and then presents its outline. Finally, a compact description of the specific video summarization model is given.

III-A Definition of MetaL-TDVS

Video summarization is to summarize a given video by using prior knowledge. The prior knowledge can be seen as meta knowledge, which can be obtained by meta learning from known videos. Suppose $v$ is a video in video space $\mathcal{V}$; summarizing $v$ is a single task $\tau$ in task space $\mathcal{T}$, and summarizing different videos are seen as different tasks. As an example, summarizing $v_a$ is a single task $\tau_a$ while summarizing $v_b$ is another task $\tau_b$ if $v_a$ and $v_b$ are different. Based on this definition, summarizing all videos in any of the video datasets (in this paper we use Youtube, OVP, TVSum, and SumMe) forms a task set $T$. We follow the rules in [10] to split $T$ into three disjoint subsets: a training task set $T_{train}$ used for learning the parameter of the learner, a validation task set $T_{val}$ used for deciding when learning can be stopped, and a testing task set $T_{test}$ used for computing the performance of MetaL-TDVS.

Upon the above settings, MetaL-TDVS can be defined as:

$$f_{\theta^{*}} = \mathcal{M}\left(f_{\theta};\, T_{train}\right) \tag{1}$$

where $\mathcal{M}$ denotes the meta learner in meta learning. $f_{\theta}$ is the model of the learner in meta learning and is randomly initialized in our implementation. As the specific model to summarize video, $f_{\theta}$ is implemented by vsLSTM [10]. In fact, it can be implemented by any differentiable model. $f_{\theta^{*}}$ denotes the optimized learner after learning.

III-B Details of MetaL-TDVS

Fig. 1: Overview of the $i$-th iteration for the update from $\theta_i$ to $\theta_{i+1}$ ($\theta$ is randomly initialized to $\theta_0$ at the beginning): The update consists of two stages. The first stage updates $\theta_i$ to $\theta_i^{n}$ according to $\tau_{i,1}$ and the second stage updates $\theta_i$ to $\theta_{i+1}$ on $\tau_{i,2}$. Every “update” is done by one gradient descent step. In this example $n=2$. $\alpha$ denotes the learning rate and $\beta$ is the meta learning rate. Green arrows represent inputs of the learner, which are frame-level features. Purple arrows denote outputs, which are frame-level probabilities. Red arrows stress the update processes of $\theta$. More details are given in Section III-B.

Mathematically, the learner is represented by a parametrized function $f_{\theta}$, where $\theta$ is the parameter to be learned and is randomly initialized to $\theta_0$. To learn (update) $\theta$, each iteration is completed in two stages and two training tasks in $T_{train}$ are utilized. In the $i$th iteration, $\theta$ is updated from $\theta_i$ to $\theta_{i+1}$, and two training tasks, $\tau_{i,1}$ (the first used training task in the $i$th iteration) and $\tau_{i,2}$ (the second used training task in the $i$th iteration), update $\theta$ as follows:

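In the first stage, for training task $\tau_{i,1}$, $\theta$ can be updated from $\theta_i$ to $\theta_i^{1}$ by one gradient descent step, and $\theta_i^{1}$ can be adjusted to $\theta_i^{2}$ on training task $\tau_{i,1}$ by one gradient descent step as well. Theoretically, several adjustments on $\tau_{i,1}$ can be made and $\theta_i^{n}$ can be obtained after $n$ adjustments. In the $i$th iteration, one gradient descent update on $\tau_{i,1}$ in the $j$th adjustment takes the form:

$$\theta_i^{j} = \begin{cases} \theta_i - \alpha \nabla_{\theta_i} L_{\tau_{i,1}}(f_{\theta_i}), & j = 1 \\ \theta_i^{j-1} - \alpha \nabla_{\theta_i^{j-1}} L_{\tau_{i,1}}(f_{\theta_i^{j-1}}), & j > 1 \end{cases} \tag{2}$$

where $\alpha$ denotes the learning rate and is fixed as a hyperparameter. $L_{\tau_{i,1}}(f_{\theta_i})$ and $L_{\tau_{i,1}}(f_{\theta_i^{j-1}})$ are the losses on $\tau_{i,1}$ when the states of the learner are $\theta_i$ and $\theta_i^{j-1}$ respectively. Specifically, $L_{\tau_{i,1}}(f_{\theta_i})$ is described as:

$$L_{\tau_{i,1}}(f_{\theta_i}) = \frac{1}{N} \left\| y_{i,1} - f_{\theta_i}(v_{i,1}) \right\|_2^2 \tag{3}$$

where $N$ is the number of frames of $v_{i,1}$, the video of task $\tau_{i,1}$. $y_{i,1}$ is the annotation of $v_{i,1}$, which is a score vector of length $N$. Here, $f_{\theta_i}(v_{i,1})$ denotes the output of the learner when its state is $\theta_i$. The output is a score vector with the same length as the video, and its $t$th element represents the probability of the $t$th frame being selected into the summary. The loss $L_{\tau_{i,1}}(f_{\theta_i^{j-1}})$ takes the same form as $L_{\tau_{i,1}}(f_{\theta_i})$, except that the state of the learner changes from $\theta_i$ to $\theta_i^{j-1}$.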

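The $i$th iteration ends with the second stage, where $\theta_i$ is updated on $\tau_{i,2}$ by one gradient descent step:

$$\theta_{i+1} = \theta_i - \beta \nabla_{\theta_i} L_{\tau_{i,2}}(f_{\theta_i^{n}}) \tag{4}$$

where $\theta_i^{n}$ denotes the state of the learner after $n$ adjustments on $\tau_{i,1}$ from $\theta_i$. $L_{\tau_{i,2}}(f_{\theta_i^{n}})$ is the loss on $\tau_{i,2}$ and $\beta$ represents the meta learning rate, which is fixed as a hyperparameter. $\theta_{i+1}$ is the updated state of the learner after the $i$th iteration.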

For simplicity of description, only the $i$th iteration for updating parameter $\theta$ is presented, but extending to multiple iterations in MetaL-TDVS is straightforward, as shown in Algorithm 1. By minimizing the expected generalization loss of $f_{\theta_i^{n}}$ with respect to $\theta$ on $T_{train}$, as shown in (5), the parameter of the learner can be obtained. In experiments, we use an early stopping strategy: training is stopped when the expected generalization loss does not decrease in 800 iterations or the maximum number of iterations (30000) is reached.

$$\min_{\theta} \; \mathbb{E}_{\tau_{i,1},\, \tau_{i,2} \in T_{train}} \left[ L_{\tau_{i,2}}(f_{\theta_i^{n}}) \right] \tag{5}$$

Fig. 1 illustrates the process of the $i$th iteration, where $n=2$, in detail. As shown, each update of the parameter consists of two stages and two training tasks are employed. First, the parameter is tuned on the first training task by several gradient descent steps. Then, the update ends with an adjustment on the second training task based on the tuned parameter. Note that MetaL-TDVS is not merely the special case of training with a batch size of 1, because each update of MetaL-TDVS contains two stages on two training tasks, and the value of $n$ can be any positive integer in theory. As a result, the learner is updated through higher order derivatives. To simplify the description, all hyperparameters of MetaL-TDVS are represented by $\alpha$, $\beta$, and $n$, where $\alpha$ and $\beta$ denote the learning rate and the meta learning rate respectively, and $\theta$ is updated on the first used training task $n$ times in the first stage of each iteration.
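
The two-stage update can be made concrete on a toy problem. The sketch below is a minimal illustration, not the authors' implementation: it assumes a scalar parameter, a linear learner `theta * x` in place of vsLSTM, and a mean squared error loss, and all function names are hypothetical. Because the toy loss is quadratic, the derivative of the adapted parameter with respect to the original one has a closed form, which exposes the higher order term that the second stage differentiates through.

```python
# Toy illustration of one two-stage update (not the authors' vsLSTM learner).
# Assumptions: scalar parameter theta, learner f_theta(x) = theta * x,
# mean squared error loss; every name below is hypothetical.

def loss(theta, xs, ys):
    """Mean squared error of the toy learner on one task (one video)."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(theta, xs, ys):
    """First derivative of the loss with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def hess(xs):
    """Second derivative of the loss (constant for this quadratic loss)."""
    return sum(2.0 * x * x for x in xs) / len(xs)

def adapt(theta, task, alpha, n):
    """First stage: n gradient steps on the first task, starting from theta."""
    xs, ys = task
    for _ in range(n):
        theta = theta - alpha * grad(theta, xs, ys)
    return theta

def meta_gradient(theta, task1, task2, alpha, n):
    """Gradient of the second-task loss at the adapted state w.r.t. theta."""
    adapted = adapt(theta, task1, alpha, n)
    # Chain rule through the n inner steps: d(adapted) / d(theta).
    jac = (1.0 - alpha * hess(task1[0])) ** n
    return grad(adapted, *task2) * jac

def meta_step(theta, task1, task2, alpha, beta, n):
    """Second stage: one gradient step applied to the original parameter."""
    return theta - beta * meta_gradient(theta, task1, task2, alpha, n)

task_a = ([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])  # fitted exactly by theta = 0.5
task_b = ([1.0, 2.0], [0.6, 1.2])            # fitted exactly by theta = 0.6
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, task_a, task_b, alpha=0.01, beta=0.05, n=2)
```

After training, `theta` itself does not fit `task_b`; it is the state reached from `theta` after two adaptation steps on `task_a` that does, which is the behavior the second stage optimizes for.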

On the other hand, it can be seen from Fig. 1 and (4) that the two-stage learning in each training step forces the preceding task (the first used training task) to provide experience for the subsequent learning (learning on the second used training task, which starts from the state learned on the first used training task). Associating learning among different tasks in each update is beneficial, because these different tasks essentially share the same nature, that is, summarizing video. Moreover, no task is considered in isolation during the whole learning process, which helps the learner reuse previous experience from different tasks, learn faster, and perform better.

By formulating video summarization as a meta learning problem, MetaL-TDVS treats summarizing each video as a single task and forces the learner to learn task-level information. Learning that proceeds among tasks (video summarization tasks) makes the learner focus more on the video summarization task itself rather than only on data. Furthermore, the strategy worked out by the learner among tasks is exactly the latent mechanism for video summarization that the framework intends to excavate. Learning from tasks instead of from data facilitates exploration of this latent mechanism.

Note that the differences between MetaL-TDVS and existing supervised video summarization methods (denoted by ESVSs for simplicity) can be summarized as follows:

  1. MetaL-TDVS formulates video summarization as a meta learning problem, but ESVSs mainly formulate it as a subset selection, a structured prediction, or a sequential decision making problem.

  2. In addition to the sequential nature of video data, MetaL-TDVS pays more attention to the video summarization problem itself, whereas ESVSs have not stated this clearly and the majority of them mainly focus on the structural or sequential characteristics of video data.

  3. MetaL-TDVS aims to force the specific model (which summarizes video directly) to explicitly explore the mechanism for summarizing video, but ESVSs have not unequivocally claimed to explore this mechanism.

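Algorithm 1 MetaL-TDVS to Learn Parameter of Learner
Input:
    $T_{train}$, $T_{val}$, $T_{test}$: training, validation, test task set;
    $\alpha$: learning rate;
    $\beta$: meta learning rate;
    $n$: number of times adjusted on the first used training task;
Output:
    $\theta$: parameter of learner;
Initialize $\theta_0$ and $i = 0$;
while not done do
    Sample two training tasks $\tau_{i,1}$ and $\tau_{i,2}$ from $T_{train}$;
    for $j = 1$ to $n$ do
        if $j = 1$ then
            $\theta_i^{1} = \theta_i - \alpha \nabla_{\theta_i} L_{\tau_{i,1}}(f_{\theta_i})$;
        end if
        if $j > 1$ then
            $\theta_i^{j} = \theta_i^{j-1} - \alpha \nabla_{\theta_i^{j-1}} L_{\tau_{i,1}}(f_{\theta_i^{j-1}})$;
        end if
    end for
    $\theta_{i+1} = \theta_i - \beta \nabla_{\theta_i} L_{\tau_{i,2}}(f_{\theta_i^{n}})$;
    $i = i + 1$;
end while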
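
The outer loop of Algorithm 1, combined with the early stopping rule described in Section III-B, could be organized as follows. This is a sketch under assumptions: `meta_step` and `val_loss` are hypothetical callables supplied by the caller, the two sampled tasks are assumed to be distinct, the validation loss is checked after every iteration, and the best parameter seen so far is returned. The source does not state these details.

```python
import random

def train_with_early_stopping(theta, train_tasks, meta_step, val_loss,
                              patience=800, max_iters=30000, seed=0):
    """Run two-task updates until the validation loss stops improving.

    meta_step(theta, task1, task2) -> new theta   (hypothetical callable)
    val_loss(theta) -> float                      (hypothetical callable)
    """
    rng = random.Random(seed)
    best_theta, best_loss = theta, val_loss(theta)
    stale, steps = 0, 0
    while steps < max_iters and stale < patience:
        # Assumption: the two training tasks of an iteration are distinct.
        task1, task2 = rng.sample(train_tasks, 2)
        theta = meta_step(theta, task1, task2)
        steps += 1
        current = val_loss(theta)
        if current < best_loss:
            best_theta, best_loss, stale = theta, current, 0
        else:
            stale += 1
    return best_theta, best_loss, steps

# Toy stand-ins: each "task" is just a target value for a scalar parameter.
def toy_meta_step(theta, task1, task2, alpha=0.1, beta=0.1):
    adapted = theta - alpha * 2.0 * (theta - task1)   # first stage on task1
    # Second stage: derivative of (adapted - task2)^2 w.r.t. the original theta.
    return theta - beta * 2.0 * (adapted - task2) * (1.0 - 2.0 * alpha)

def toy_val_loss(theta, val_tasks=(0.55,)):
    return sum((theta - t) ** 2 for t in val_tasks) / len(val_tasks)

theta, loss_value, steps = train_with_early_stopping(
    0.0, [0.4, 0.5, 0.6, 0.7], toy_meta_step, toy_val_loss,
    patience=50, max_iters=1000)
```

The defaults for `patience` and `max_iters` mirror the 800 and 30000 iterations quoted in the text; the toy call uses smaller values only to keep the example quick.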

III-C Specific Model for the Learner

To show the effectiveness of MetaL-TDVS, we consider two types of features. The first is the deep feature extracted from the output of the penultimate layer of GoogLeNet [25]. With this feature extraction method, each frame of the input video is encoded into a 1024-dimensional feature descriptor. The second is the traditional feature consisting of four image descriptors: color histograms, GIST, HOG, and dense SIFT. Color histograms are computed from RGB images and all the other features are extracted from grayscale images.

On the other hand, because video summarization is made based on a storyline which progresses through the entire video, the sequential or structural nature of video data is also of great importance for effectively addressing the video summarization problem. Thus, approaches that only rely on visual cues and do not take the temporal relation across frames into consideration are not well suited to summarizing video. To get an ideal video summary, high-level semantic understanding of the video over a long-range temporal span needs to be taken into account. We therefore employ vsLSTM [10] to implement the learner in MetaL-TDVS.

Fig. 2: Structures of the employed learner (vsLSTM) and MLP. Left side shows network structure of vsLSTM where green circles represent input features of video frames and brown ones are outputs (probabilities to be selected). Blue rectangles indicate LSTM units and gray ones are MLPs. Right side shows structure of one MLP which has an input layer, one hidden layer and an output layer.

The vsLSTM consists of bidirectional LSTM layers [26] and one multi-layer perceptron (MLP) layer. For clarity, we plot the structure of the employed learner (vsLSTM), as well as the structure of the MLP, in Fig. 2. There is no direct interaction between the forward and backward LSTM layers. The MLPs, each implemented with one hidden layer, combine the hidden states of these two LSTM layers with the features of video frames to compute the probabilities of frames. The hidden units and output layers of the MLPs are all activated by the sigmoid function. The size of the hidden layers of the MLPs, the number of hidden units of each unidirectional LSTM, and the output dimension of the MLPs are all 256.
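
To make the role of the MLP concrete, the following sketch scores a single frame from a forward hidden state, a backward hidden state, and the frame feature. It is only an illustration under assumptions: the LSTM states are passed in as given vectors rather than computed, the dimensions are tiny instead of 256 and 1024, the output is reduced to a single probability, the weights are random, and all names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def frame_probability(h_fwd, h_bwd, x, params):
    """Score one frame from its forward state, backward state and feature."""
    inputs = list(h_fwd) + list(h_bwd) + list(x)   # concatenate the three parts
    hidden = dense_sigmoid(inputs, params["w1"], params["b1"])
    (prob,) = dense_sigmoid(hidden, params["w2"], params["b2"])
    return prob

def init_params(n_in, n_hidden, rng):
    """Small random weights for a one-hidden-layer MLP with a scalar output."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {"w1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": matrix(1, n_hidden), "b2": [0.0]}

rng = random.Random(0)
h_fwd, h_bwd, x = [0.1, -0.2], [0.3, 0.0], [0.5, 0.5, -0.5]
params = init_params(n_in=7, n_hidden=4, rng=rng)
p = frame_probability(h_fwd, h_bwd, x, params)
```

Both layers use a sigmoid, matching the description above, so the returned value always lies strictly between 0 and 1.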

IV Experiments

This section first describes the experimental setups in detail, and then reports various experiments that demonstrate the effectiveness and superiority of MetaL-TDVS.

IV-A Experimental Setups

IV-A1 Datasets

Performance of MetaL-TDVS is evaluated on SumMe [27] and TVSum [17].

There are 25 user videos in SumMe, and the events recorded by these videos are varied, such as sports and holidays. Both ego-centric and third-person cameras are included, and the contents are diverse. Video lengths range from 1.5 to 6.5 minutes and the provided labels are frame-level importance scores. TVSum consists of 50 videos downloaded from YouTube. Selected from 10 categories in TRECVid Multimedia Event Detection (MED), the 50 videos are organized into 10 topics (5 videos per topic), with a keyword as the topic of each group, and their lengths range from 1 to 5 minutes. Videos in TVSum include first-person and third-person cameras and their contents are extremely diverse. Labels are frame-level importance scores.

To investigate the generalization ability of the learned model and to reduce the need for a huge amount of annotated data, two other datasets, Youtube [28] and Open Video Project (OVP) [29], are also utilized. Youtube includes 50 videos collected from websites, and the contents include news and sports. Video lengths vary from 1 to 10 minutes and the annotations provided are multiple user-annotated subsets of keyframes for each video. For OVP, we utilize the same 50 videos as [28]. The videos are from various genres, such as documentary and educational, and their lengths are from 1 to 4 minutes.

IV-A2 Evaluation Metrics

To make a fair comparison, we use the keyshot-based metrics proposed in [10] for evaluation, which also follow the protocols in [27, 30].

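Suppose $A$ is the generated keyshot-based summary and $B$ is the human-annotated keyshots. Precision ($P$) and recall ($R$) against the human-annotated summary are computed according to the temporal overlap between them:

$$P = \frac{\text{overlapped duration of } A \text{ and } B}{\text{duration of } A}, \quad R = \frac{\text{overlapped duration of } A \text{ and } B}{\text{duration of } B} \tag{6}$$

The harmonic mean F-score ($F$), which is used as the final metric, is computed as:

$$F = \frac{2 \times P \times R}{P + R} \times 100\% \tag{7}$$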
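
As a worked illustration of the overlap-based metric, the sketch below computes precision, recall, and F-score from two per-frame 0/1 selection vectors. Representing summaries as frame-level indicator lists and the function name `overlap_scores` are assumptions made for this example; durations are measured in frames.

```python
def overlap_scores(pred, gt):
    """Precision, recall and F-score (%) from frame-level 0/1 selections.

    pred: generated summary, 1 where a frame belongs to a selected keyshot.
    gt:   human-annotated summary in the same representation.
    """
    if len(pred) != len(gt):
        raise ValueError("both summaries must cover the same frames")
    overlap = sum(1 for p, g in zip(pred, gt) if p and g)
    n_pred = sum(1 for p in pred if p)
    n_gt = sum(1 for g in gt if g)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gt if n_gt else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall) * 100.0
    return precision, recall, f_score

pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # frames 0-2 selected
gt = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]     # frames 1-4 annotated
precision, recall, f_score = overlap_scores(pred, gt)
```

Here two frames overlap, so the precision is 2/3, the recall is 2/4, and the F-score is 400/7, about 57.1%.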


IV-A3 Implementation Details

To generate key frames or key shots, we follow the methods described in [10]. Videos are temporally segmented into disjoint intervals by kernel temporal segmentation (KTS) according to frame scores. The resulting intervals are ranked by the importance score of each interval (the average importance score of the frames in the interval). The summary consists of keyshots selected from the ranked intervals, and the total duration of the summary is less than 15% of the input video. To obtain a single ground-truth set when there are multiple human annotations, we use the algorithm proposed in [15]. For each video with multiple annotations, the single ground-truth set $y^{*}$ is initialized to be empty and one frame is added to $y^{*}$ by maximizing (8):

$$y^{*} \leftarrow y^{*} \cup \arg\max_{i} \sum_{k=1}^{K} F\left(y^{*} \cup \{i\},\, y_k\right) \tag{8}$$

where $K$ is the number of annotations and $y_k$ denotes the $k$th annotation. $F(y^{*} \cup \{i\}, y_k)$ represents the F-score of $y^{*} \cup \{i\}$ and $y_k$. Frames not in $y^{*}$ are iteratively added to $y^{*}$ until no frame increases the F-score.
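
The two procedures of this subsection can be sketched in a few lines. The code below is a simplified illustration, not the implementation of [10] or [15]: keyshots are chosen by a greedy pass over the ranked intervals (a knapsack-style selection is another common choice), annotations are represented as sets of frame indices, ties are broken toward the lowest frame index, and all names are hypothetical.

```python
def select_keyshots(frame_scores, intervals, budget=0.15):
    """Rank intervals by mean frame score and keep them while under budget.

    intervals: list of (start, end) frame ranges, end exclusive.
    """
    limit = budget * len(frame_scores)
    ranked = sorted(
        intervals,
        key=lambda iv: sum(frame_scores[iv[0]:iv[1]]) / (iv[1] - iv[0]),
        reverse=True)
    chosen, used = [], 0
    for start, end in ranked:
        if used + (end - start) < limit:   # total stays below 15% of the video
            chosen.append((start, end))
            used += end - start
    return sorted(chosen)

def f_score(selected, annotation):
    """F-score between two sets of frame indices."""
    overlap = len(selected & annotation)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(annotation)
    return 2.0 * precision * recall / (precision + recall)

def merge_annotations(annotations, n_frames):
    """Greedily build one ground-truth set from several annotations."""
    merged, best = set(), 0.0
    while True:
        scored = [(sum(f_score(merged | {i}, a) for a in annotations), i)
                  for i in range(n_frames) if i not in merged]
        if not scored:
            break
        top, frame = max(scored, key=lambda s: (s[0], -s[1]))
        if top <= best:        # no remaining frame increases the summed F-score
            break
        merged.add(frame)
        best = top
    return merged

scores = [0.8] * 8 + [0.5] * 12 + [0.9] * 5 + [0.1] * 75   # 100 frames
shots = select_keyshots(scores, [(0, 8), (8, 20), (20, 25), (25, 100)])
ground_truth = merge_annotations([{1, 2, 3}, {2, 3, 4}, {2, 3}], n_frames=6)
```

In the toy call, the two highest-ranked intervals fit within the 15-frame budget, and the merged ground truth keeps the frames on which all three annotators agree.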

The datasets are split into training, validation, and testing sets in the way described in [10]. We follow the “Transfer” setting in learning: for a given dataset (SumMe or TVSum), the other three datasets are utilized for training and validation, and the learned model is then tested on that dataset. This setting allows us to verify the generalization ability of the learned model on an unseen dataset. We run each testing fold 5 times and report the average as the final result.
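
The “Transfer” protocol amounts to a simple partition of the four datasets plus an average over repeated runs. The sketch below assumes the datasets are referred to by name only, and the function names are hypothetical; how the three source datasets are further divided into training and validation parts follows [10] and is not reproduced here.

```python
DATASETS = ("SumMe", "TVSum", "OVP", "Youtube")

def transfer_split(target, datasets=DATASETS):
    """Hold out the target dataset; use the others for training and validation."""
    if target not in datasets:
        raise ValueError("unknown dataset: " + target)
    return {"train_val": [d for d in datasets if d != target], "test": target}

def mean_f_score(run_scores):
    """Average F-score over repeated runs of the same testing fold."""
    if not run_scores:
        raise ValueError("at least one run is required")
    return sum(run_scores) / len(run_scores)

split = transfer_split("TVSum")
```

With `"TVSum"` as the target, training and validation draw on SumMe, OVP, and Youtube only, so the test videos are never seen during learning.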

IV-B Results

In this subsection, we investigate the sensitivity of hyperparameters and structures, followed by comparisons with representative methods.

IV-B1 Sensitivity Evaluation of Hyperparameters

Performances of MetaL-TDVS with different hyperparameters ($\alpha$, $\beta$, $n$) are evaluated and shown in Fig. 3, where $\alpha$ and $\beta$ can be 0.1, 0.01, 0.001, 0.0001, and 0.00001. Due to the limitation of video memory, we test performances when $n$ is equal to 1 or 2.

Fig. 3: Results of MetaL-TDVS with different hyperparameters.
Fig. 4: Performance comparison with and without the second stage. “one” denotes the framework without the second stage (only the first stage, with hyperparameters $\alpha$ and $n$). All the others in (a) and (b) represent MetaL-TDVS with different $\beta$ and have two stages with hyperparameters $\alpha$, $\beta$, and $n$. Both “twoAvg” and “twoMax” are statistics for models with the second stage. Specifically, “twoAvg” and “twoMax” represent the average and maximum with respect to $\beta$ when $\alpha$ and $n$ are specified.
Fig. 5: Performance comparison of simultaneous learning and successive learning. “simu” denotes training on two tasks simultaneously in each iteration. All the others represent MetaL-TDVS with different $\beta$ and train on two tasks successively in each iteration. Both “twoAvg” and “twoMax” are statistics for models trained on two tasks successively in each iteration, as shown in Fig. 1. “twoAvg” and “twoMax” have the same meaning as in Fig. 4.
Fig. 6: Exemplar summaries (red intervals) from four sample videos with ground-truth importance scores (blue background).

It can be seen that for each $n$, MetaL-TDVS with different $\alpha$s and different $\beta$s gives different results, and the behavior differs between the two datasets. For instance, the performance of MetaL-TDVS represented by the pink polyline (with $\beta=0.1$, $n=2$) on TVSum fluctuates drastically with the change of $\alpha$, but its counterpart (the pink polyline on SumMe) has a completely different trend. This trend discrepancy between the two datasets also occurs when $\beta=0.1$, $n=1$ (purple polyline), $\beta=0.00001$, $n=1$ (black polyline), $\beta=0.01$, $n=1$ (orange polyline), etc. Based on the results for different hyperparameters, the configuration with $\alpha=0.001$, $\beta=0.0001$, $n=1$ (the third point of the red polyline) has better generalization ability (it performs well on both datasets). Although the configuration with $\alpha=0.001$, $\beta=0.00001$, $n=2$ (the third point of the green polyline) performs best on SumMe, it gets poor results on TVSum (the same holds for the first point of the blue polyline, with $\alpha=0.00001$, $\beta=0.0001$, $n=2$). Although the configuration with $\alpha=0.00001$, $\beta=0.1$, $n=1$ (the first point of the purple polyline) gets promising results on TVSum, it performs poorly on SumMe (the same holds for the second and the last points of the orange polyline, with $\alpha=0.0001$, $\beta=0.01$, $n=1$ and $\alpha=0.1$, $\beta=0.01$, $n=1$). In contrast, MetaL-TDVS has a similar performance trend on both datasets with $\beta=0.0001$ and $n=1$ (red polyline). Thus, to get better generalization ability, $\beta$ is set to 0.0001 and $n$ is set to 1 in the reported results. With these two hyperparameters fixed, $\alpha$ is set to 0.001, since it gives better results on both datasets.

| Method | TVSum | SumMe |
| --- | --- | --- |
| Gygli et al. [30] | - | 39.7 |
| vsLSTM [10] | 56.9 | 40.7 |
| Zhang et al. [9] | - | 40.9 |
| SUM-GAN$_{sup}$ [14] | 56.3 | 41.7 |
| DSSE [31] | 57.0 | - |
| DR-DSN$_{sup}$ [11] | 58.1 | 42.1 |
| Li et al. [32] | 52.7 | 43.1 |
| MetaL-TDVS (ours) | **58.2** | **44.1** |

TABLE I: Performance comparison (F-score %) with seven state-of-the-art methods. Best results are denoted in bold.

IV-B2 Performance Evaluation on Different Structures

Because $\alpha=0.001$, $\beta=0.0001$, and $n=1$ are selected as the final hyperparameters in IV-B1, experiments for different structures are done with $n=1$ (only $\alpha$ and $\beta$ are variable hyperparameters).

To confirm whether the second stage improves performance, frameworks with and without the second stage are tested on SumMe and TVSum. Results are shown in Fig. 4(a) and (b), where pink polylines denote the framework without the second stage. All the others have the second stage, and different colors represent frameworks with different $\beta$.

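It can be seen that for each $\alpha$, there is at least one $\beta$ that makes the framework with two stages outperform “one”, mostly by a large margin. The superiority of two stages is especially obvious on TVSum. Moreover, Fig. 4(c) and (d) compare “one” with the average and maximum F-scores of frameworks with two stages. “twoAvg” represents frameworks with the second stage, and the average is computed as (when $\alpha=\alpha_a$):

$$F_{avg}(\alpha_a) = \frac{1}{5} \sum_{b=1}^{5} F(\alpha_a, \beta_b) \tag{9}$$

where $F(\alpha_a, \beta_b)$ is the F-score of MetaL-TDVS with $\alpha=\alpha_a$, $\beta=\beta_b$. $\alpha_a$ and $\beta_b$ are elements of the set $\{0.1, 0.01, 0.001, 0.0001, 0.00001\}$, and 5 is the number of elements in the set. $F_{avg}(\alpha_a)$ is the point with $\alpha=\alpha_a$ on the corresponding blue polyline. “twoMax” indicates frameworks with the second stage, and the maximum is computed as (when $\alpha=\alpha_a$):

$$F_{max}(\alpha_a) = \max_{b \in \{1, \dots, 5\}} F(\alpha_a, \beta_b) \tag{10}$$

where $F(\alpha_a, \beta_b)$ and $\beta_b$ are the same as in (9), and $F_{max}(\alpha_a)$ denotes the point with $\alpha=\alpha_a$ on the corresponding red polyline.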
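
The “twoAvg” and “twoMax” statistics are a plain average and maximum over the meta learning rate for a fixed learning rate. The sketch below shows the computation on placeholder numbers: the F-scores in `f_scores` are made up for illustration and are not results from the paper, and the function name is hypothetical.

```python
def two_stage_summaries(f_scores, alpha):
    """Average and maximum F-score over beta for one fixed alpha.

    f_scores maps (alpha, beta) pairs to F-scores.
    """
    values = [f for (a, _), f in f_scores.items() if a == alpha]
    if not values:
        raise ValueError("no results recorded for this alpha")
    return sum(values) / len(values), max(values)

# Placeholder F-scores for a single alpha and the five beta values.
f_scores = {
    (0.001, 0.1): 40.0,
    (0.001, 0.01): 42.0,
    (0.001, 0.001): 41.0,
    (0.001, 0.0001): 43.0,
    (0.001, 0.00001): 39.0,
}
two_avg, two_max = two_stage_summaries(f_scores, alpha=0.001)
```

Because the maximum can be dominated by a single well-tuned meta learning rate while the average is pulled down by poorly tuned ones, the two curves can fall on different sides of a baseline.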

Obviously, “twoMax” is better than “one” on both datasets. Although only two of the five points on “twoAvg” are better than “one” on SumMe (one is slightly worse and two are visibly worse), all points of “twoAvg” are better than “one” on TVSum. A few points on “twoAvg” are still worse than “one” because the average is computed over many values of $\beta$ (0.1, 0.01, 0.001, 0.0001, 0.00001) for a specified $\alpha$, and performance may be poor in some cases due to an inappropriate $\beta$. Overall, the second stage improves performance, mostly by a large margin.

To validate whether learning on two tasks successively performs better, frameworks trained on two tasks simultaneously and successively (in each iteration) are tested. Results are shown in Fig. 5(a) and (b), where pink polylines denote training on two tasks simultaneously. All the others indicate successive training (as shown in Fig. 1) with different $\beta$.

Evidently, for each $\alpha$ there exists at least one $\beta$ that makes successive learning outperform “simu”, except for $\alpha=0.1$ on SumMe, and the margins are visibly large in most cases. Furthermore, Fig. 5(c) and (d) compare “simu” with the average and maximum F-scores of frameworks trained on two tasks successively (“twoAvg” and “twoMax” have the same meaning as in Fig. 4(c) and (d), except that they are computed from Fig. 5(a) and (b) rather than Fig. 4(a) and (b)). Although “twoAvg” performs worse than “simu” on SumMe, almost all points on “twoMax” distinctly outperform “simu”, and “twoAvg” on TVSum outperforms “simu” by a large margin. Thus, it is reasonable to say that training on two tasks successively performs better than training on them simultaneously.

IV-B3 Comparisons With Representative Methods

Table I summarizes the performance of MetaL-TDVS and compares it with seven state-of-the-art supervised methods. For the competitors, published results are used directly. We only compare with supervised methods, since supervised approaches perform better than unsupervised ones to a certain extent with the help of annotations. Specifically, several models are proposed in [14], but we only compare with its supervised one, which performs best among all its proposed models (both unsupervised and supervised). The same applies to [11].

As shown in Table I, MetaL-TDVS performs better than the competitors on SumMe and TVSum. Although MetaL-TDVS performs only slightly better than DR-DSN on TVSum, it achieves an increase of 2.0 percentage points on SumMe. In addition, MetaL-TDVS outperforms the approach proposed by Li et al., with increases of 1.0 and 5.5 percentage points on SumMe and TVSum respectively. Compared with vsLSTM (under the same “Transfer” learning settings as MetaL-TDVS), which is the model implementing the learner in MetaL-TDVS, MetaL-TDVS is 1.3 and 3.4 percentage points better on TVSum and SumMe respectively.
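
The margins over individual competitors can be recomputed directly from Table I. The short check below copies the relevant rows of the table into a dictionary; the helper name `margin_over` is hypothetical.

```python
# (TVSum, SumMe) F-scores copied from Table I.
TABLE_I = {
    "vsLSTM": (56.9, 40.7),
    "DR-DSN": (58.1, 42.1),
    "Li et al.": (52.7, 43.1),
    "MetaL-TDVS": (58.2, 44.1),
}

def margin_over(method, table=TABLE_I, ours="MetaL-TDVS"):
    """Percentage-point gains of MetaL-TDVS over a competitor (TVSum, SumMe)."""
    ours_tvsum, ours_summe = table[ours]
    other_tvsum, other_summe = table[method]
    return (round(ours_tvsum - other_tvsum, 1),
            round(ours_summe - other_summe, 1))

margins = {name: margin_over(name) for name in TABLE_I if name != "MetaL-TDVS"}
```

Rounding to one decimal place matches the precision of the table and removes floating-point noise from the subtraction.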

As expected, the experimental results demonstrate the superiority and effectiveness of MetaL-TDVS, and also indicate that meta learning is suitable for summarizing video.

To promote generalization ability and produce ideal video summaries, a model is supposed to learn how to summarize video. Thus, what the model learns from the processes of summarizing other videos, that is, how to summarize video, is essentially what it needs in order to summarize new ones. This is in complete accord with the idea of meta learning, which is to make better use of the experience and knowledge learned from other tasks (summarizing other videos) to handle new ones (summarizing new videos). Therefore, meta learning is a reasonable way to summarize video, and this is verified by the experimental results. Besides, by laying more stress on the video summarization problem itself rather than only on sequential or structural video data, MetaL-TDVS forces the model to explicitly explore the mechanism for summarizing video among tasks and is superior in terms of generalization ability.

| Method | TVSum | SumMe |
|---|---|---|
| SUM-GA [14] | **59.5** | 39.5 |
| dppLSTM [10] | 57.9 | 40.7 |
| MetaL-TDVS (ours) | 57.9 | **43.5** |

TABLE II: Performance comparison (F-score %) with state-of-the-art supervised methods when using traditional features. Best results are denoted in bold.

| Method | infer | final | total |
|---|---|---|---|
| vsLSTM [10] | 2040 | 3470 | 1285 |
| dppLSTM [10] | 1956 | 1306 | 783 |
| MetaL-TDVS (ours) | **2050** | **49751** | **1969** |

TABLE III: Time comparison (frames per second, fps) with two models. Best results are denoted in bold.

IV-B4 Generalization on Non-Deep Features

The generalization ability of MetaL-TDVS to non-deep features is demonstrated by evaluating its performance with the shallow features utilized in [17]. Table II summarizes the performance of MetaL-TDVS and several state-of-the-art supervised methods when only shallow features are used.

It can be seen that MetaL-TDVS performs comparably to the competitors. On TVSum, SUM-GA performs best and is 1.6 percentage points better than MetaL-TDVS. On SumMe, however, MetaL-TDVS outperforms the competitors, with gains of 4.0 and 2.8 percentage points over SUM-GA and dppLSTM (“Transfer” learning settings) respectively. These promising results demonstrate the robustness of MetaL-TDVS on non-deep features.
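The gaps quoted above follow directly from the F-scores in Table II. As a small worked check (the dictionary layout and the helper `gap` are illustrative and not part of the original work):

```python
# F-scores (%) copied from Table II (shallow features).
table_ii = {
    "SUM-GA": {"TVSum": 59.5, "SumMe": 39.5},
    "dppLSTM": {"TVSum": 57.9, "SumMe": 40.7},
    "MetaL-TDVS": {"TVSum": 57.9, "SumMe": 43.5},
}

def gap(method_a, method_b, dataset):
    """F-score of method_a minus method_b, in percentage points (1 decimal)."""
    return round(table_ii[method_a][dataset] - table_ii[method_b][dataset], 1)

print(gap("SUM-GA", "MetaL-TDVS", "TVSum"))    # 1.6
print(gap("MetaL-TDVS", "SUM-GA", "SumMe"))    # 4.0
print(gap("MetaL-TDVS", "dppLSTM", "SumMe"))   # 2.8
```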

IV-B5 Qualitative Results

To show the performance of MetaL-TDVS intuitively, selected frames from four videos (Air_Force_One and car_over_camera of SumMe, AwmHb44_ouw and qqR6AEXwxoQ of TVSum) are shown in Fig. 6. Blue blocks represent ground-truth frame-level importance scores, and the frames selected by MetaL-TDVS are marked in red. The colored regions correspond to several frames selected for the video summaries. Despite some variations, it can be seen that MetaL-TDVS is able to extract frames with high importance and discard those that do not contain enough valuable information.

IV-B6 Performance on Specific Types of Videos

The proposed MetaL-TDVS is a generic video summarization method and is not specific to certain types of video. To see how it behaves on specific types of videos, we test it on fast-moving football matches, slow videos, and long videos (around three hours or longer).

For fast-moving football matches, we selected 17 videos (v71 to v87) from Youtube, all of which show football matches and range in length from 1 min to 10 min. Trained on OVP, SumMe and TVSum, the learner is tested on these videos, and the average precision, recall and F-score are 48.53%, 42.16%, and 41.1% respectively. For slow videos, we selected four videos from SumMe (Air_Force_One, Bus_in_Rock_Tunnel, Cockpit_Landing, and St Maarten Landing): Air_Force_One records the landing of a plane from a fixed perspective; Bus_in_Rock_Tunnel shows a bus passing through a tunnel very slowly; Cockpit_Landing records a bird's-eye view of the ground from an airplane, followed by the landing; and St Maarten Landing shows a plane landing near a beach. Trained on OVP, Youtube, and TVSum, the learner is tested on these videos, and the average precision, recall and F-score are 55.15%, 54.08%, and 57.28% respectively. It can be seen that MetaL-TDVS performs better on slow videos than on fast-moving ones.
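For reference, precision, recall and F-score for a single video are commonly computed from the overlap between the frames selected by the model and those in the ground-truth summary. The sketch below assumes binary per-frame labels; the function name `prf` and the toy labels are hypothetical and are not taken from the experiments.

```python
def prf(predicted, ground_truth):
    """Precision, recall and F-score from binary per-frame selection labels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have the same length")
    # Frames selected by both the model and the ground truth.
    overlap = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    n_pred = sum(1 for p in predicted if p)
    n_true = sum(1 for g in ground_truth if g)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_true if n_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Toy example: 8 frames, 3 selected by each side, 2 in common.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
true = [1, 0, 0, 0, 1, 1, 0, 0]
p, r, f = prf(pred, true)
print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```

The dataset-level figures reported above are averages of such per-video values over the test videos.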

For long videos, we use four videos (P01-P04) from [33]. Their durations are 3 hours 51 minutes 51 seconds, 5 hours 7 minutes 37 seconds, 2 hours 59 minutes 16 seconds, and 4 hours 59 minutes respectively. All of these videos record the daily life of the equipment wearers: P01-P03 mainly include shopping, eating, driving, cooking and interacting with others, while P04 shows working indoors and outdoors with a computer. Because the method proposed in [33] is based on important people and objects (the summary is made according to the detection response in each frame and each video segment), the ground truth provided is pixel-wise and cannot be used by MetaL-TDVS, since the underlying formulation of video summarization is different. Thus, we shorten these videos with a model trained on SumMe and TVSum, and provide the summarized result videos (which are available by th=%2F) rather than quantitative measures such as F-score, precision and recall. In the experiments, the videos (P01-P04) are each sampled to around 1500 frames, because the classical sampling scheme (2 frames per second) produces too many sampled frames to be loaded into our video memory at once. The frames in each summary are assembled into a result video at a frame rate of 2 fps. From the result videos, it can be seen that the main contents of each long video are captured, which demonstrates the effectiveness of MetaL-TDVS to a certain extent.
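To give a sense of scale, the sketch below counts how many frames 2 fps sampling would yield for the stated durations and shows one way to keep a fixed budget of about 1500 frames instead. Uniform spacing is an assumption here, since the text does not specify how the 1500 frames are chosen; `frames_at_2fps`, `uniform_indices` and `BUDGET` are hypothetical names.

```python
# Durations of P01-P04 as stated in the text, as (hours, minutes, seconds).
durations = {
    "P01": (3, 51, 51),
    "P02": (5, 7, 37),
    "P03": (2, 59, 16),
    "P04": (4, 59, 0),
}
SAMPLE_FPS = 2   # classical sampling rate mentioned in the text
BUDGET = 1500    # approximate number of frames kept per video

def frames_at_2fps(hours, minutes, seconds):
    """Number of frames produced by sampling 2 frames per second."""
    return (hours * 3600 + minutes * 60 + seconds) * SAMPLE_FPS

def uniform_indices(n_frames, budget):
    """Evenly spaced frame indices in [0, n_frames), at most `budget` of them."""
    if n_frames <= budget:
        return list(range(n_frames))
    return [i * n_frames // budget for i in range(budget)]

for name, hms in durations.items():
    n = frames_at_2fps(*hms)
    idx = uniform_indices(n, BUDGET)
    print(name, n, len(idx))
```

Under these assumptions, 2 fps sampling gives roughly 21,000 to 37,000 frames per video, about 14 to 25 times the 1500-frame budget.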

IV-B7 Time Comparison

Table III compares the running time of MetaL-TDVS with that of vsLSTM and dppLSTM. The “infer” stage takes frame features as input and outputs importance scores; the “final” stage takes importance scores as input and outputs binary labels indicating whether each frame is selected; “total” covers the whole pipeline, with frame features as input and binary frame labels as output. MetaL-TDVS is faster than both competitors at every stage.
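The three columns of Table III are mutually consistent if the two stages are assumed to run one after the other, so that their per-frame times add. This is an assumption about how “total” was measured, not something the text states; the helper `combined_fps` is hypothetical.

```python
# Throughput (fps) copied from Table III: (infer, final, total).
table_iii = {
    "vsLSTM": (2040, 3470, 1285),
    "dppLSTM": (1956, 1306, 783),
    "MetaL-TDVS": (2050, 49751, 1969),
}

def combined_fps(infer_fps, final_fps):
    """Throughput of two sequential stages: per-frame times add up."""
    return 1.0 / (1.0 / infer_fps + 1.0 / final_fps)

for name, (infer, final, total) in table_iii.items():
    # Compare the derived end-to-end throughput with the reported "total".
    print(name, round(combined_fps(infer, final)), total)
```

With the values in Table III, the rounded result matches the reported “total” for all three models. For MetaL-TDVS the “final” stage is so fast that the end-to-end throughput is determined almost entirely by the “infer” stage.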

V Conclusions

In this paper, we reformulate video summarization as a meta learning problem and propose a novel and effective method, MetaL-TDVS. MetaL-TDVS views summarizing each video as a single task, and learning proceeds across tasks. This way of learning makes the learner focus more on the video summarization problem itself and facilitates exploring the latent mechanism of summarizing video. Experimental results reveal that MetaL-TDVS is effective and outperforms recent state-of-the-art methods, including GAN-based and deep reinforcement learning based methods, which indicates that meta learning is suitable for video summarization. In the future, we will further explore video summarization with meta learning. On the one hand, we intend to design more suitable network models for the learner to better capture the intrinsic sequential and structural characteristics of video data (the video data perspective); on the other hand, we plan to devise better meta learners (training frameworks or real models) that force the learner to better explore the mechanism for summarizing video (the video summarization task perspective). In addition, foreground extraction [34] or manifold learning [35] methods may also be used to improve the performance of video summarization.

References
  • [1] L. Zhang, Y. Xia, K. Mao, H. Ma, and Z. Shan, “An effective video summarization framework toward handheld devices,” IEEE Trans. Ind. Electron., vol. 62, no. 2, pp. 1309–1316, 2015.
  • [2] Z. Ji, Y. Zhang, Y. Pang, and X. Li, “Hypergraph dominant set based multi-video summarization,” Signal Process., vol. 148, pp. 114–123, 2018.
  • [3] Z. Ji, Y. Ma, Y. Pang, and X. Li, “Query-aware sparse coding for web multi-video summarization,” Inf. Sci., vol. 478, pp. 152–166, 2019.
  • [4] Z. Ji, Y. Zhang, Y. Pang, X. Li, and J. Pan, “Multi-video summarization with query-dependent weighted archetypal analysis,” Neurocomputing, vol. 332, pp. 406–416, 2019.
  • [5] R. Panda and A. K. Roy-Chowdhury, “Collaborative summarization of topic-related videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, no. 4, 2017, p. 5.
  • [6] B. Zhao and E. P. Xing, “Quasi real-time summarization for consumer videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2513–2520.
  • [7] B. A. Plummer, M. Brown, and S. Lazebnik, “Enhancing video summarization via vision-language embedding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
  • [8] B. Zhao, X. Li, and X. Lu, “Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7405–7414.
  • [9] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video summarization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1059–1067.
  • [10] ——, “Video summarization with long short-term memory,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766–782.
  • [11] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” arXiv:1801.00054, 2017.
  • [12] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. ACM Conf. Multimedia., 2017, pp. 863–871.
  • [13] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” arXiv:1708.09545, 2017.
  • [14] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial lstm networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
  • [15] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2069–2077.
  • [16] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 540–555.
  • [17] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “Tvsum: Summarizing web videos using titles,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5179–5187.
  • [18] D. B. Maudsley, “A theory of meta-learning and principles of facilitation: an organismic perspective,” 1980.
  • [19] J. B. Biggs, “The role of metalearning in study processes,” Br. J. Educ. Psychol., vol. 55, no. 3, pp. 185–212, 1985.
  • [20] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-sgd: Learning to learn quickly for few shot learning,” arXiv:1707.09835, 2017.
  • [21] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3981–3989.
  • [22] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1842–1850.
  • [23] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “Meta-learning with temporal convolutions,” arXiv:1707.03141, 2017.
  • [24] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv:1703.03400, 2017.
  • [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
  • [26] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm networks,” in Proc. IEEE Int. Joint Conf. Neural Networks, vol. 4. IEEE, 2005, pp. 2047–2052.
  • [27] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 505–520.
  • [28] S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr, and A. de Albuquerque Araújo, “Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method,” Pattern Recognit. Lett., vol. 32, no. 1, pp. 56–68, 2011.
  • [29] “Open video project,”
  • [30] M. Gygli, H. Grabner, and L. Van Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3090–3098.
  • [31] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE Trans. Circuits Syst. Video Technol., 2017.
  • [32] X. Li, B. Zhao, and X. Lu, “A general framework for edited video and raw video summarization,” IEEE Trans. Image Process., vol. 26, no. 8, pp. 3652–3664, 2017.
  • [33] Y. J. Lee, J. Ghosh, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1346–1353.
  • [34] X. Li, K. Liu, and Y. Dong, “Superpixel-based foreground extraction with fast adaptive trimaps,” IEEE Trans. Cybern., vol. 48, no. 9, pp. 2609–2619, 2018.
  • [35] X. Li, K. Liu, Y. Dong, and D. Tao, “Patch alignment manifold matting,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 7, pp. 3214–3226, 2018.

Xuelong Li (M’02-SM’07-F’12) is a full professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, P. R. China.

Hongli Li is working toward the Ph. D. degree in the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, P. R. China.

Yongsheng Dong (SM’19) received his Ph. D. degree in applied mathematics from Peking University in 2012. He was a postdoctoral research fellow with the Center for Optical Imagery Analysis and Learning, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, P. R. China from 2013 to 2016. From 2016 to 2017, he was a visiting research fellow at the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He is currently an associate professor with the School of Information Engineering, Henan University of Science and Technology, P. R. China. His current research interests include pattern recognition, machine learning, and computer vision.

He has authored and co-authored over 30 papers at famous journals and conferences, including IEEE TIP, IEEE TNNLS, IEEE TCYB, IEEE TCSVT, IEEE SPL and ACM TIST. He is an associate editor of Neurocomputing.