A Framework towards Domain Specific Video Summarization
Abstract
In the light of exponentially increasing video content, video summarization has attracted a lot of attention recently due to its ability to optimize time and storage. Characteristics of a good summary of a video depend on the particular domain under question. We propose a novel framework for domain specific video summarization. Given a video of a particular domain, our system can produce a summary based on what is important for that domain in addition to possessing other desired characteristics like representativeness, coverage, diversity etc. as suitable to that domain. Past related work has focused either on using supervised approaches for ranking the snippets to produce summary or on using unsupervised approaches of generating the summary as a subset of snippets with the above characteristics. We look at the joint problem of learning domain specific importance of segments as well as the desired summary characteristic for that domain. Our studies show that the more efficient way of incorporating domain specific relevances into a summary is by obtaining ratings of shots as opposed to binary inclusion/exclusion information. We also argue that ratings can be seen as unified representation of all possible ground truth summaries of a video, taking us one step closer in dealing with challenges associated with multiple ground truth summaries of a video. We also propose a novel evaluation measure which is more naturally suited in assessing the quality of video summary for the task at hand than F1 like measures. It leverages the ratings information and is richer in appropriately modeling desirable and undesirable characteristics of a summary. Lastly, we release a gold standard dataset for furthering research in domain specific video summarization, which to our knowledge is the first dataset with long videos across several domains with rating annotations. We conduct extensive experiments to demonstrate the benefits of our proposed solution.
1 Introduction
With the explosion of video data, automatic analysis of videos is becoming increasingly important. Examples of such videos include user videos, sports videos, TV videos, CCTV footage, or for that matter videos coming from any other source. One of the popular requirements in this context is the ability to automatically summarize videos. Video summarization finds its uses in a wide variety of applications ranging from security and surveillance to compliance and quality monitoring to user applications aimed at saving time and storage. Broadly speaking, in terms of the kind of video summaries produced, video summarization can be of two types: compositional video summarization, which aims at producing a spatio-temporal synopsis [pritch2007webcam, rav2006making, pritch2009clustered, pritch2008nonchronological], and extractive video summarization, which aims at selecting key frames (also called key frame extraction, static story boards or static video summarization, e.g. [de2011vsumm]) or shots (also called dynamic video summarization or dynamic video skimming, e.g. [GygliECCV14]). Past work has tried to address the problem of video summarization using unsupervised as well as supervised techniques. Early unsupervised techniques used attention models [ma2005generic], MMR [li2010multi], clustering [de2011vsumm, mahmoud2013unsupervised] etc., and more recently auto encoder based techniques [yang2015unsupervised] and LSTM based techniques [mahasseni2017unsupervised] have been used. Use of supervised techniques began with forms of indirect supervision from external sources of information like related images [khosla2013large], similar videos [chu2015video] or titles of videos [song2015tvsum]. Gong et al.'s work [gong2014diverse] was the first work on video summarization to use direct high level supervision in the form of user annotations. This was followed by Gygli et al.'s work [gygli2015video].
Sharghi, Gong et al. took this forward by incorporating a notion of user query in the produced video summaries, thereby involving the user in the summary generation process [sharghi2016query, sharghi2017query]. Theirs became the first work on query-focused video summarization. Then came several deep learning based video summarization techniques. For example, Zhang et al. [zhang2016video] used LSTMs [hochreiter1997long] to model long range dependencies among video key frames to produce summaries. Emphasis was given to sequential structures in videos and their modeling. Mahasseni et al. [mahasseni2017unsupervised] used LSTM networks in an adversarial setting to generate summaries. Both these methods were domain agnostic. Some researchers viewed video summarization as a subset selection problem [zhang2016summary, elhamifar2017online, sharghi2016query, sharghi2017query]. Another key approach was to combine the elements of both deep learning and subset selection in summary generation. Gong et al. achieved this using their seqDPP architecture [gong2014diverse] where an LSTM network was coupled with a DPP (Determinantal Point Process) [kulesza2012determinantal].
In this work we address the challenge of domain specific dynamic video summarization. A good video summary should be a meaningful short abstract of a long video: it must contain important events, it should exhibit some continuity and it should be free from redundancy. However, the notion of importance varies with domain. For example, a “six” or a “wicket” could be considered important in cricket videos while “entry of birthday girl” and “cutting of cake” would be considered important in birthday videos. Further, the characteristics of a good summary like representativeness, coverage or diversity also depend on the domain. For example, while for surveillance videos a good summary should contain outliers, for a user video coverage and representativeness become more important. We hereby propose a novel domain specific video summarization framework which automatically learns to produce summaries that possess desired characteristics suitable for that domain. Past work on video summarization has focused either on using supervised approaches for ranking the snippets/segments thereby producing a video summary (e.g. [potapov2014category, ma2005generic]) or on unsupervised approaches of generating the video summary as a subset of snippets/segments with desired characteristics of representativeness, diversity, coverage, etc. (e.g. [li2010multi, de2011vsumm]). Some work has also focused on learning the relative importance of uniformity, interestingness and representativeness for different domains (e.g. [gygli2015video, sahoo2017unified]). None of these works, however, has looked at the joint problem of generating domain specific summaries by automatically learning the concepts that are deemed important for a domain, together with having the desired summary characteristics like diversity, coverage etc. for that domain. Building upon the max margin structured learning framework in [gygli2015video] we learn a mixture of modular and submodular terms.
Modular terms help to capture shots more important to the domain under consideration and submodular terms help to capture characteristics of summary important to that domain. The different weights learnt for different components indicate the varying notion of importance from domain to domain.
Further, having many possible ground truth summaries has been posed as one of the challenges in video summarization [truong2007video]. Multiple ground truths are due to the difference in perspectives and the fact that different visual content can have the same semantic meaning. However, one reason is also a lack of any other information about the video. Dealing with domain specific videos, however, allows us to define a notion of importance ratings that are unique and unambiguous for that domain. Such ratings can be seen as a unified representation of all possible ground truth summaries of a video. We establish the importance of ratings in the training dataset in producing good summaries as against binary inclusion/exclusion information. Instead of getting ground truth user summaries from annotators we rather ask them to provide ratings for segments in the entire video. The framework learns to generate more accurate summaries when it is provided more supervision this way. We thus establish that the more efficient way of incorporating domain specific relevances into a summary is to provide a supervision in form of ratings as against multiple ground truths.
We also define a new evaluation measure which is more naturally suited for this task than other standard measures used in literature like F1 score. Our measure evaluates video summaries considering the ratings and not just binary inclusion/exclusion information. It also evaluates summaries not only with respect to what they should contain but also with respect to what they should not contain and the degree of diversity.
As a part of this work we will also release a gold standard dataset for furthering research in domain specific video summarization. To the best of our knowledge, ours is the first dataset with long videos across several domains with rating annotations.
2 Related Work and Our Contributions
2.1 Domain specific video summarization
One of the earliest works on domain specific video summarization was by Potapov et al. [potapov2014category] in 2014. They were one of the first to realize the importance of building separate models for summarization for distinct categories of videos. They used an SVM [hearst1998support] classifier conditioned on video category to produce summaries. The SVM learns to score the segments according to their importance to the domain. The segments having higher scores are then selected greedily and put in temporal order to create the final summary. Sun et al. [sun2014ranking] analyzed edited videos of a particular domain as an indicator of highlights of that domain. After finding pairs of raw and corresponding edited videos, they obtain pairwise ranking constraints to train their model. Zhang et al. [zhang2016summary] use supervision in the form of human-created summaries to perform automatic keyframe-based video summarization. Their main idea is to non-parametrically transfer summary structures of a particular domain of videos from annotated videos to unseen test videos. By learning a joint model to understand what snippets and what characteristics are important for a domain, we use a more principled approach with a form of supervision which is more efficient.
2.2 Submodular functions for video summarization
Video summarization can be viewed as a subset selection problem subject to certain constraints. Given a set $V = \{1, 2, \ldots, n\}$ of items, which we also call the Ground Set, define a utility function (set function) $f: 2^V \rightarrow \mathbb{R}$, which measures how good a subset $X \subseteq V$ is. Let $c: 2^V \rightarrow \mathbb{R}$ be a cost function, which describes the cost of the set (for example, the size of the subset). Often the cost $c$ is budget constrained (for example, a fixed size summary) and a natural formulation of this is the following problem:
$$\max \{ f(X) \;:\; X \subseteq V,\ c(X) \le b \} \qquad (1)$$
The goal is then to have a subset $X$ which maximizes $f$ while keeping the cost $c(X)$ within the budget $b$. It is easy to see that maximizing a generic set function becomes computationally infeasible as $V$ grows.
A special class of set functions, called submodular functions [nemhauser1978analysis], however, makes this optimization easy. A function $f$ is submodular if for every $A \subseteq B \subseteq V$ and $e \in V$ with $e \notin B$ it holds that
$$f(A \cup \{e\}) - f(A) \ge f(B \cup \{e\}) - f(B) \qquad (2)$$
Likewise, a function $f$ is supermodular if for every $A \subseteq B \subseteq V$ and $e \in V$ with $e \notin B$ it holds that
$$f(A \cup \{e\}) - f(A) \le f(B \cup \{e\}) - f(B) \qquad (3)$$
Submodular functions exhibit a property that intuitively formalizes the idea of “diminishing returns”. That is, adding some instance $e$ to the set $A$ provides at least as much gain in terms of the target function as adding $e$ to a larger set $B$, where $A \subseteq B$. Informally, since $B$ is a superset of $A$ and already contains more information, adding $e$ will not help as much. Using a greedy algorithm to optimize a monotone submodular function (for selecting a subset under a cardinality budget) gives a lower bound performance guarantee of $1 - 1/e$, i.e. around 63% of optimal [nemhauser1978analysis], for the above problem, and in practice these greedy solutions are often within 98% of optimal [krause2008optimizing]. This makes it advantageous to formulate (or approximate) the objective function for data selection as a submodular function.
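The greedy procedure referred to above can be illustrated with a minimal sketch. It assumes a cardinality budget and a toy coverage function; the names `greedy_maximize`, `coverage` and the `COVERS` table are hypothetical and exist only for this illustration, not as part of the proposed system.

```python
def greedy_maximize(ground_set, f, budget):
    """Repeatedly add the item with the largest marginal gain f(X + e) - f(X)."""
    chosen = set()
    order = []
    for _ in range(budget):
        base = f(chosen)
        best_item, best_gain = None, 0.0
        for e in ground_set:
            if e in chosen:
                continue
            gain = f(chosen | {e}) - base
            if gain > best_gain:
                best_item, best_gain = e, gain
        if best_item is None:  # no remaining item adds positive gain
            break
        chosen.add(best_item)
        order.append(best_item)
    return order


# Hypothetical toy data: each snippet covers a set of concept ids.
COVERS = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 7},
}


def coverage(subset):
    """Number of distinct concepts covered; a monotone submodular function."""
    covered = set()
    for item in subset:
        covered |= COVERS[item]
    return float(len(covered))


summary = greedy_maximize(sorted(COVERS), coverage, budget=2)

# Diminishing returns: adding "b" helps less once "c" is already selected.
gain_small = coverage({"b"}) - coverage(set())
gain_large = coverage({"c", "b"}) - coverage({"c"})
```

In this toy instance the gain of adding "b" drops from 2 to 1 once "c" is present, which is the diminishing returns behaviour of inequality (2).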
This concept has been used in document summarization [lin2012learning] and in image collection summarization [tschiatschek2014learning]. More recently, Elhamifar et al. [elhamifar2017online] used submodular optimization for online video summarization by performing incremental subset selection. One of the first attempts to summarize videos using a submodular mixture of objectives was by Gygli et al. [gygli2015video]. They, however, did not distinguish between various domains of videos and had a specially crafted video frame interestingness model which played a significant role in the summaries produced. Building upon the approach in [gygli2015video] we learn a mixture for a domain to produce summary specific to that domain. However, in our work, domain specific importance of snippets/shots is not predicted separately using another model. Rather weighted features are directly used in the mixture as modular terms. These modular components capture the shot level domain importance while the other submodular and supermodular components in the mixture correspond to different desired characteristics of the summary like diversity and coverage.
2.3 Evaluation measures
Different measures have been reported in literature for the purpose of accurately evaluating the quality of the video summary produced. VIPER [doermann2000tools] addresses the problem by defining a specific ground truth format which makes it easy to evaluate a candidate summary. SUPERSEIV [huang2004automatic] is an unsupervised technique to evaluate video summarization algorithms that perform frame ranking. VERT [li2010vert] was inspired by BLEU in machine translation and ROUGE in text summarization. More recently approaches by Yeung et al. [yeung2014videoset] and Plummer et al. [plummer2017enhancing] also used text based evaluation methods. De Avila et al. [de2011vsumm] propose a method of evaluation which considers several ground truth summaries. Others, like [khosla2013large, GygliECCV14, elhamifar2017online] and [xu2015gaze] more directly use precision and recall type measures. Kannappan et al. [kannappan2016pertinent] propose an approach which is only for static video summaries. The evaluation measure proposed by Potapov et al. [potapov2014category] is capable of evaluating a summary only against one ground truth. As annotations are done by several users producing several ground truth summaries, they evaluate those annotations against each other to form some kind of an upper bound on performance. Approaches like [zhao2014quasi, conf/cvpr/SongVSJ15, gygli2015video] and [zhang2016video] combine several ground truths into one before using them for evaluation. This comes at the cost of losing individual opinion. In search for a measure which would work directly on ratings (which is a potential generator of multiple ground truths) and having certain other desired characteristics (as enumerated in the corresponding section below) we developed our own evaluation measure.
2.4 Datasets for video summarization
Different researchers in the past have released different datasets for the purpose of video summarization. Examples include the Video Summarization (SumMe) dataset [GygliECCV14], the MED Summaries dataset [potapov2014category] and the Title-based Video Summarization (TVSum) dataset [conf/cvpr/SongVSJ15]. However, none of these existing datasets were found suitable for our work for the following reasons. Firstly, we aim to summarize videos across a large number of domains like surveillance, sports, user etc. in a single framework. For that purpose we need a wide variety of videos summarized uniformly. The various datasets only provide certain subsets of types of videos and they use vastly different methods to annotate those videos. So, it was essential that we had a uniformly annotated, diverse set of videos from diverse domains. Secondly, we wanted to test our method on long videos, as the true benefit of a video summary in real world applications is seen only with respect to long videos. Thirdly, we wanted the annotations to capture not only what is important, but also what is not important and what is repetitive. Identifying segments which are relatively long and contain repetitive information (for example, a scene of spectators clapping for 5 minutes in a cricket video) and retaining only a fraction of them in the summary is essential to having a good quality summary.
2.5 Our Contributions
In the following, we summarize the main contributions of this paper.

We address the problem of Domain specific video summarization, by jointly ranking the most important portions of the video for that domain (for example, a goal in Soccer), while simultaneously capturing diverse and representative shots. We do this by training a joint mixture model with features which capture domain importance along with diversity models.

We argue how different models capture aspects of the summarization task. For example, diversity is more important in surveillance videos when we want to capture outliers, while representation is more important in personal videos. Similarly importance or relevance plays a critical role in domains like Sports (like a goal in soccer).

We introduce a novel evaluation criterion which captures these aspects of a summary, and also introduce a large dataset for domain specific summarization. Our dataset comprises several long videos for different domains (surveillance, personal videos, sports etc.) and to our knowledge is the first domain specific video summarization dataset with long videos.

We then empirically demonstrate various interesting insights. a) We first show that by jointly modeling diversity, relevance and importance, we can learn substantially superior summaries on all domains compared to just learning any one of these aspects. b) We next show that by learning on the same domain, we can obtain superior results than using learnt mixtures from other domains, thus proving the benefit for domain specific video summarization. c) We then look at the top components learnt for different domains and show how those individual components perform best for that domain if considered in isolation. Moreover, we argue how intuitively it makes sense that these components are important. For example, in surveillance, we see that diversity functions tend to have high ranking compared to other models, while in personal videos (like birthday), we see that representation is important. We moreover also look at the highest ranking snippets based on these components and show how they capture the most important aspects of that domain.
The major contribution of this work is that this is the first systematic study of domain specific video summarization on large videos and we provide several insights into the role of different summarization models for this problem.
3 Methodology
We begin by creating a training dataset comprising videos from several categories annotated with ratings information. Our method works on ratings and hence deals better with the issue of multiple ground truths. Building upon the approach in [gygli2015video], we create a mixture, but our mixture contains modular terms (to capture the domain specific importance of snippets) and submodular terms (for imparting certain desired characteristics to the summary). For each training video of a domain, the components of the mixture are instantiated and the weights of the complete mixture for that domain are learnt using a max margin learning framework. After the training phase, for any given test video of that domain, the weighted mixture is then maximized to produce the desired summary video. Below we describe details of every step in the above methodology.
3.1 Training Data
In this work we focus on videos from five different domains: birthday, cricket, soccer, entry exit and office. The latter two are surveillance videos taken from CCTV cameras installed at various entry/exit locations and offices respectively. We have collected birthday, cricket and soccer videos from the internet (existing published datasets / YouTube). Due to privacy reasons and to be able to experiment with presence/absence of abnormal events in the surveillance videos, we have collected surveillance videos from our own setup of surveillance cameras. Table 1 shows the distribution of number and duration of videos across these categories.
Table 1: Number and total duration of videos in each category.

Category     Number of Videos   Duration (mins)
Cricket      7                  276
Birthday     9                  136
Soccer       11                 609
Entry Exit   21                 306
Office       33                 687
Next, for each domain, we go over every video and first prepare a table of scenes that occur across different videos in that category and using domain knowledge we assign ratings to those scenes. Negative ratings are assigned to segments which must not be included in the summary. Since the ratings are relative (for example, a 2 rated scene is supposed to be more important than a 1 rated scene but less important than a 3 rated scene) it was necessary to gather information about all scenes before starting to rate the scenes in the specific videos. Also going through the extra step of creating scenes document for each category enabled us to come up with a consistent philosophy of ratings and consequently a very high inter annotator correlation. Using this scenes table as annotation guidelines for each category, the annotators were then asked to annotate the videos in each category. Segments that are long and contain repetitive content are explicitly marked repetitive in addition to their rating. For the purpose of annotating, we customized a tool called oTranscribe [oTranscribe] to make the annotation task easy and to produce the desired annotation JSON. The oTranscribe interface was cleaned up and a lot more keyboard shortcuts were added in for ease of annotation. Shortcuts were added in to mark the beginning and ends of segments, to rate segments, to give a short description of the segment, to mark a segment repetitive and to skip to the end and beginning of the previous segment. Finally, hooks were added in to output the annotation as a JSON file. As a sanity check, we visually verified the annotations thus produced by looking at the annotated videos. The annotated videos were produced by overlaying the labels on top of the original videos.
3.2 Learning framework
The task of video summarization is posed as a discrete optimization problem for finding the best subset representing the summary. Given a video $V$, we split it into snippets of fixed length. Now we have a set $Y_V$ of all snippets in the video. Our problem reduces to picking $y \subseteq Y_V$ such that $|y| \le k$ that maximizes our objective.
$$y^* = \operatorname*{argmax}_{y \subseteq Y_V,\ |y| \le k} o(x_V, y) \qquad (4)$$
$y^*$ is the predicted summary, $x_V$ the feature representation of the video snippets and $o(x_V, y)$ is the weighted mixture of components each capturing some aspect of the domain. Different weights are learnt for different domains.
$$o(x_V, y) = w^T f(x_V, y) \qquad (5)$$
where $f(x_V, y) = [f_1(x_V, y), \ldots, f_n(x_V, y)]^T$ are the various modular, submodular and supermodular components. Given $T$ pairs of a video and a reference summary $(V^t, y_{gt}^t)$, we learn the weight vector $w$ by optimizing the following large-margin [taskar2005learning] formulation:
$$\min_{w \ge 0} \ \frac{1}{T} \sum_{t=1}^{T} L_t(w) + \frac{\lambda_1}{2} \|w_1\|^2 + \frac{\lambda_2}{2} \|w_2\|^2 \qquad (6)$$
where $L_t(w)$ is the generalized hinge loss of training example $t$ and $w_1$ and $w_2$ are the weight vectors for the modular terms and the submodular terms respectively.
$$L_t(w) = \max_{y \subseteq Y_V^t} \left( w^T f(x_V^t, y) + l_t(y) \right) - w^T f(x_V^t, y_{gt}^t) \qquad (7)$$
This objective is chosen so that each human reference annotation scores higher than any other summary by some margin. For training example $t$, the margin we chose is denoted by $l_t(y)$. We use $l_t(y) = 1 - \mathrm{normalizedScore}(y)$ as margin, where the normalized score is computed using min-max normalization of the score generated by our evaluation measure, given the ratings, as described below.
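To make the role of the margin concrete, the following is a small sketch of a generalized hinge loss computed by brute force on a toy ground set. It is only an illustration under stated assumptions: the margin is taken to be one minus a min-max normalized score, the score `toy_score` is a hypothetical stand-in (sum of ratings) and not the evaluation measure of this paper, and the two mixture components, the ratings and the scene labels are invented for the example. In practice the inner maximization is carried out approximately by discrete optimization, not by enumeration.

```python
from itertools import combinations

import numpy as np

RATINGS = {0: 3.0, 1: 1.0, 2: 0.0, 3: 2.0}  # hypothetical snippet ratings
SCENE = {0: "A", 1: "A", 2: "B", 3: "B"}     # hypothetical scene labels
K = 2                                        # summary budget in snippets


def components(y):
    """Toy mixture components: a modular importance term and a coverage term."""
    importance = sum(RATINGS[i] for i in y)
    scenes_covered = len({SCENE[i] for i in y})
    return np.array([float(importance), float(scenes_covered)])


def candidates():
    """All summaries with at most K snippets (feasible only for tiny examples)."""
    for size in range(K + 1):
        for y in combinations(sorted(RATINGS), size):
            yield frozenset(y)


def toy_score(y):
    """Stand-in for an evaluation score; NOT the measure defined in the paper."""
    return sum(RATINGS[i] for i in y)


_scores = [toy_score(y) for y in candidates()]
LO, HI = min(_scores), max(_scores)


def margin(y):
    """Assumed margin: 1 minus the min-max normalized score."""
    return 1.0 - (toy_score(y) - LO) / (HI - LO)


def hinge_loss(w, y_gt):
    """Loss-augmented maximum over candidates minus the ground truth score."""
    augmented = max(float(w @ components(y)) + margin(y) for y in candidates())
    return augmented - float(w @ components(y_gt))


y_gt = frozenset({0, 3})
loss_zero = hinge_loss(np.array([0.0, 0.0]), y_gt)  # untrained weights
loss_good = hinge_loss(np.array([1.0, 0.5]), y_gt)  # weights that favour y_gt
```

With zero weights every summary scores the same, so the loss equals the largest margin; with weights under which the reference summary beats every other candidate by at least its margin, the loss falls to zero.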
3.3 Components of the mixture
Our mixture contains several hand picked components. Every component serves to impart certain characteristics to the optimal subset (the predicted summary).
Set Cover: For a subset $X$ being scored, the set cover is defined as $f_{sc}(X) = \sum_{u \in U} \min\{m_u(X), 1\}$, where $u$ is a concept belonging to a set of all concepts $U$, $m_u(X) = \sum_{x \in X} w_{xu}$, and $w_{xu}$ is the weight of coverage of concept $u$ by element $x$. This component governs the coverage aspect of the candidate summary and is monotone submodular.
Probabilistic Set Cover: This variant of the set cover function is defined as $f_{psc}(X) = \sum_{u \in U} \left( 1 - \prod_{x \in X} (1 - p_{xu}) \right)$, where $p_{xu}$ is the probability with which concept $u$ is covered by element $x$. Similar to the set cover function, this function governs the coverage aspect of the candidate summary, viewed stochastically, and is also monotone submodular.
Facility Location: The facility location function is defined as $f_{fl}(X) = \sum_{v \in V} \max_{x \in X} sim(v, x)$, where $v$ is an element from the ground set $V$ and $sim(v, x)$ measures the similarity between element $v$ and element $x$. Facility Location governs the representativeness aspect of the candidate summaries and is monotone submodular.
Saturated Coverage: Saturated Coverage is defined as $f_{satc}(X) = \sum_{v \in V} \min\{m_v(X), \alpha\, m_v(V)\}$, where $m_v(X) = \sum_{x \in X} sim(v, x)$ measures the relevance of set $X$ to item $v \in V$ and $\alpha$ is a saturation hyperparameter that controls the level of coverage for each item $v$ by the set $X$. Saturated Coverage is similar to Facility Location except for the fact that for every category, instead of taking a single representative, it allows for taking potentially multiple representatives. Saturated Coverage is also monotone submodular.
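Generalized Graph Cut: Generalized Graph Cut is defined as $f_{gc}(X) = \sum_{v \in V} \sum_{x \in X} sim(v, x) - \lambda \sum_{x, x' \in X} sim(x, x')$. Similar to the above two functions, Generalized Graph Cut also models representation. When $\lambda$ becomes large, it also tries to model diversity in the subset. $\lambda$ governs the tradeoff between representation and diversity. For $\lambda < 0.5$ it is monotone submodular. For $\lambda > 0.5$ it is non-monotone submodular.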
Disparity-min: Denoting the distance measure between snippets/shots $i$ and $j$ by $d_{ij}$, disparity-min is defined as the set function $f_{disp}(X) = \min_{i, j \in X,\ i \neq j} d_{ij}$. It is easy to see that maximizing this function involves obtaining a subset with maximal minimum pairwise distance, thereby ensuring a diverse subset of snippets or shots. In principle this is similar to determinantal point processes (DPP), but DPP becomes computationally expensive at inference time. This function, though not submodular, can be efficiently optimized via a greedy algorithm.
Continuity: We work on a set of 2 second snippets as the ground set. A summary (subset) may thus not look continuous enough to give a good viewing experience. We therefore add this continuity term to the mixture, which gives a higher score when nearby snippets are chosen; this ensures a visually more coherent and appealing summary. Essentially it is modeled as a redundancy function (this function is supermodular) within a shot as follows: $f_{cont}(X) = \sum_{s \in S} \sum_{x, x' \in X \cap s} w_{xx'}$, where $S$ is the set of shots obtained from a shot boundary detection algorithm and $w_{xx'}$ is the similarity between two snippets, which can be defined as how close they are to each other based on their index. That is, the features used here are the indices of the snippets.
Modular components:
We use weighted features of snippets (described in the next section) as modular terms in the mixture.
3.4 Features used for instantiating the components
Let the video be a set of frames. Let us define the ground set $V$ as the set of snippets, where each snippet is 2 seconds long. A snippet for a video with frame rate $r$ frames per second would thus contain $2r$ consecutive frames. Feature vectors are calculated for each snippet independently by aggregating the feature vectors of the frames/images in that snippet. Different components of the mixture (as above) are instantiated for each video using the features mentioned below:

vgg_features: fc6 layer of VGG19 [simonyan2014very] trained on ImageNet dataset[deng2009imagenet] is used as the feature vector for each image in the snippet. The final vector is an average of all the image features. The feature size is 4096.

googlenet_features: pool5_7x7_s1 layer of GoogLeNet [szegedy2015going] trained on MIT Places dataset [zhou2014learning] is used as the feature vector for each image in the snippet. The final vector is 1024d and is an average of all the image feature vectors.

vgg_p_concepts: The output probability layer of VGG19 [simonyan2014very] trained on ImageNet dataset [deng2009imagenet] is used as the feature vector for each image in the snippet. The final vector is an average of all the image features. The feature size is 1000.

googlenet_p_concepts: The output probability layer of GoogLeNet [szegedy2015going] trained on MIT Places dataset [zhou2014learning] is used as the feature vector for each image in the snippet. The final vector size is 365.

vgg_concepts: The output probability layer of VGG19 [simonyan2014very] trained on ImageNet dataset [deng2009imagenet] is used as the feature vector for each image in the snippet. To create the final vector, an average of all the image features is taken and this vector is then one-hot encoded based on a 0.5 threshold. The feature size is 1000.

googlenet_concepts: The output probability layer of GoogLeNet [szegedy2015going] trained on MIT Places dataset [zhou2014learning] is used as the feature vector for each image in the snippet. To create the final vector, an average of all the image features is taken and this vector is then one-hot encoded based on a 0.5 threshold. The feature size is 365.

yolo_coco_concepts: A vector of size 80 (corresponding to 80 classes in COCO dataset) where each component represents the average number of objects of the respective COCO [lin2014microsoft] class, found by the YOLO [redmon2016you] network trained on COCO dataset, in all images in the snippet.

yolo_voc_concepts: A vector of size 20 (corresponding to 20 classes in PASCAL VOC dataset) where each component represents the average number of objects of the respective PASCAL VOC[Everingham10] class, found by the YOLO [redmon2016you] network trained on PASCAL VOC dataset, in all images in the snippet.

yolo_coco_p_concepts: A vector of size 81 (corresponding to 80 classes in COCO dataset and 1 dummy class) where each component represents the fraction of objects of the respective COCO [lin2014microsoft] class relative to the total number of objects detected, found by the YOLO [redmon2016you] network trained on COCO dataset, in all images in the snippet. The last component is 1 if no objects were detected.

yolo_voc_p_concepts: A vector of size 21 (corresponding to 20 classes in PASCAL VOC dataset and 1 dummy class) where each component represents the average fraction of objects of the respective PASCAL VOC [Everingham10] class relative to the total number of objects detected, found by the YOLO [redmon2016you] network trained on PASCAL VOC dataset, in all images in the snippet. The last component is 1 if no objects were detected.

color_hist_r_features: A vector of size 256 representing the average red color histogram of the images in the snippet.

color_hist_g_features: A vector of size 256 representing the average green color histogram of the images in the snippet.

color_hist_b_features: A vector of size 256 representing the average blue color histogram of the images in the snippet.

color_hist_h_features: A vector of size 180 representing the average hue histogram of the images in the snippet.

color_hist_s_features: A vector of size 256 representing the average saturation histogram of the images in the snippet.
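The aggregation shared by most of the features above (average the per-frame vectors within a 2 second snippet, and for the thresholded concept features binarize the averaged probabilities at 0.5) can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiments: the function names are hypothetical, per-frame features are assumed to be already extracted into an array, a shorter final snippet is simply averaged over the frames it has, and treating values equal to the threshold as active is an assumption.

```python
import numpy as np


def snippet_features(frame_features, fps, snippet_seconds=2):
    """Average per-frame feature vectors over consecutive fixed-length snippets.

    frame_features has shape (num_frames, feature_dim). The last snippet may
    contain fewer frames than the others (an assumption of this sketch).
    """
    per_snippet = int(round(fps * snippet_seconds))
    num_frames = frame_features.shape[0]
    snippets = [
        frame_features[start:start + per_snippet].mean(axis=0)
        for start in range(0, num_frames, per_snippet)
    ]
    return np.stack(snippets)


def binarize_concepts(avg_probs, threshold=0.5):
    """Turn averaged concept probabilities into 0/1 indicators."""
    return (avg_probs >= threshold).astype(np.float32)


# Toy input: 10 frames at 2 frames per second with a 1-dimensional feature.
frames = np.arange(10, dtype=float).reshape(10, 1)
feats = snippet_features(frames, fps=2)

concepts = binarize_concepts(np.array([[0.2, 0.7]]))
```

At 2 frames per second a snippet holds 4 frames, so the 10 toy frames give three snippets, the last one averaged over the remaining 2 frames.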
3.5 Evaluation measure
To serve the desired purpose and to be suitable to be used in our framework, we wanted an evaluation measure which would satisfy the following characteristics:

The reward for including a higher rated snippet must be greater than the reward for including a lower rated snippet

A negative rated snippet must be penalized

A lower rated segment, no matter how long, should not displace a higher rated segment from a budget constrained gold summary

No number of lower rated segments should displace a higher rated segment from a budget constrained gold summary

In the gold summary, segments marked nonrepetitive should not be broken unless it is absolutely necessary (possibly the last one to fit within the boundary)

No reward should be given for picking more than a fixed cutoff number of seconds of a segment marked repetitive
After careful design, we came up with the following formulation. It is not very difficult to see that this formulation satisfies the above characteristics.
The score function for video V, , is defined as:
where, is the set of segments in V marked nonrepetitive and rated positive, is the set of segments in V marked repetitive and rated positive, is the set of segments in V rated negative, is the reward scaling hyperparameter, is the repetitiveness cutoff factor and is the penalty factor.
This function is neither submodular nor supermodular. However, this can be written as a sum of a submodular and a supermodular function and hence the bounds discussed below hold true when this appears as the margin in the discrete optimization of the loss augmented objective (Equation 7).
Lemma 1.
The evaluation function can be written as a sum of a submodular and a supermodular function where,
and
It is easy to see that the function and above are submodular and supermodular respectively, and moreover, expanding the terms we can see that by adding these two functions, we get back .
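For readers who want to check such properties numerically, the following is a minimal brute-force sketch of the diminishing-returns definition of submodularity on a small ground set. It does not implement the evaluation function itself; the toy functions `coverage` and `squared_size` and all helper names are illustrative assumptions.

```python
from itertools import combinations


def powerset(ground):
    """Yield every subset of the ground set as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-9):
    """Check f(A + e) - f(A) >= f(B + e) - f(B) for all A subset of B, e not in B."""
    ground = frozenset(ground)
    subsets = list(powerset(ground))
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for e in ground - B:
                gain_a = f(A | {e}) - f(A)
                gain_b = f(B | {e}) - f(B)
                if gain_a < gain_b - tol:
                    return False
    return True


def is_supermodular(f, ground, tol=1e-9):
    """A function is supermodular exactly when its negation is submodular."""
    return is_submodular(lambda S: -f(S), ground, tol)


# Toy examples (assumptions for illustration only).
SETS = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}}


def coverage(S):
    """Number of items covered by the chosen sets: a submodular function."""
    covered = set()
    for i in S:
        covered |= SETS[i]
    return len(covered)


def squared_size(S):
    """|S|^2: a supermodular function."""
    return len(S) ** 2
```

The check enumerates all pairs of subsets, so it is only practical for ground sets of a handful of elements.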
3.6 Discrete Optimization
Our framework entails two different discrete optimization problems: maximization of the weighted mixture during loss augmented inference (Equation 7), and maximization during inference to obtain the summary once the mixture is learnt (Equation 4). For efficient optimization with guaranteed bounds, it is important to understand certain characteristics of these components. Note that our mixture of set functions can be written as follows:
where is a monotone submodular function, is a nonmonotone submodular function, is a monotone supermodular function, and is a dispersion function (also called disparitymin) (), and . Moreover, we assume that each of the functions above are nonnegative (without loss of generality). Note that in the above, we have grouped all monotone submodular, nonmonotone submodular, supermodular functions together. The only function which is neither submodular nor supermodular is Disparity Min.
We would like to understand the theoretical guarantees for the following optimization problem:
(8) 
for various values of .
Theorem 2.
The following theoretical results hold for solving the optimization problem of Equation 8:

We obtain an approximation factor of if and .

We obtain an approximation factor of if and .

We obtain an approximation factor of if and .

We obtain an approximation factor of if and

We obtain an approximation factor of if and .

We obtain an approximation factor of where and , if and .

We obtain an approximation factor of where and , if and .

The optimization problem of Equation 8 is inapproximable unless P = NP, if and .
Proof.
The following paragraphs provide the proofs for each of the cases enumerated above:

The first result follows directly from [nemhauser1978analysis], since in this case the objective is a monotone submodular function. Hence the greedy algorithm admits an approximation guarantee of 1 - 1/e.

For the second result, notice that when , we have , the dispersion function. Following Lemma 1 from [dasgupta2013summarization], the Dispersion function admits a approximation factor for the cardinality constrained optimization problem from Equation 8.

The third result is a consequence of [buchbinder2014submodular], since we have a nonmonotone submodular maximization problem subject to a cardinality constraint. Note that the Randomized Greedy algorithm achieves an approximation guarantee of 1/e.

The fourth result follows the proof technique from Theorem 1 in [dasgupta2013summarization], since when , is a sum of a monotone submodular function and a dispersion function. Denote as the monotone submodular function, and as the dispersion function. Note that . For the algorithm, we first optimize alone, say using a greedy algorithm and then we optimize alone, again using the greedy algorithm. We then take the better amongst the solutions achieved. Let denote the solution obtained by optimizing the monotone submodular function, and denote the solution obtained by optimizing the Dispersion function. Denote and as the optimal solutions respectively. Note that and from the respective approximation guarantees of the individual algorithms. Denote as the better amongst in terms of the final objective . Therefore, . Note that and and therefore where is the optimal solution (note that and by definition).

For the fifth result, we use a similar proof technique as above. Note that we can achieve an approximation factor of by maximizing a nonmonotone submodular function subject to a cardinality constraint, and a factor of for maximizing the Dispersion function. Denote as the nonmonotone submodular function, and as the dispersion function. We know that (w.l.o.g, take the monotone function term into the nonmonotone function since the sum of a monotone and nonmonotone function is in general nonmonotone). For the algorithm, we first optimize alone, say using a randomized greedy algorithm and then we optimize alone, using the greedy algorithm. We then take the better amongst the solutions achieved. Let denote the solution obtained by optimizing the nonmonotone submodular function, and denote the solution obtained by optimizing the Dispersion function. Denote and as the optimal solutions respectively. Note that and from the respective approximation guarantees of the individual algorithms. Denote as the better amongst in terms of the final objective . Therefore, . Note that and and therefore where is the optimal solution (note that and by definition).

The sixth result is a direct consequence of [bai2018greed], since in this case is a sum of monotone submodular and a monotone supermodular function. Note that the approximation bounds here depend on the curvature of the two functions.

The seventh result follows by combining the proof technique from the fourth and fifth result, with the bound above. In the interest of space, we skip the details of this.

Finally, when , we have the most general case of the sum of non monotone submodular and supermodular functions. This is then equivalent to a difference of submodular functions, which is known to be inapproximable, following Theorems 5.1 and 5.2 from [iyer2012algorithms].
∎
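The "optimize each term alone, then keep the better solution" strategy used in the fourth case of the proof can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the coverage function stands in for the monotone submodular term, absolute differences between 1-D positions stand in for the distance used by the dispersion function, and all names (`greedy`, `make_coverage`, `make_disparity_min`, `best_of_two`) are hypothetical. The approximation factors discussed above belong to the cited algorithms and are not claimed for this toy code.

```python
from itertools import combinations


def greedy(f, ground, k):
    """Add, k times, the element with the largest marginal gain under f."""
    selected = []
    remaining = list(ground)
    for _ in range(min(k, len(remaining))):
        base = f(selected)
        best = max(remaining, key=lambda x: f(selected + [x]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def make_coverage(sets):
    """Monotone submodular stand-in: number of items covered by the selection."""
    def f(S):
        covered = set()
        for i in S:
            covered |= sets[i]
        return float(len(covered))
    return f


def make_disparity_min(dist):
    """Dispersion (disparity-min): smallest pairwise distance in the selection."""
    def f(S):
        if len(S) < 2:
            return 0.0
        return min(dist(a, b) for a, b in combinations(S, 2))
    return f


def best_of_two(f_sub, f_disp, ground, k, w_sub=1.0, w_disp=1.0):
    """Optimize each term alone and keep the better set under the joint objective."""
    def objective(S):
        return w_sub * f_sub(S) + w_disp * f_disp(S)
    candidates = [greedy(f_sub, ground, k), greedy(f_disp, ground, k)]
    return max(candidates, key=objective)


# Toy data (assumption): four snippets with covered concepts and 1-D positions.
SETS = {0: {1, 2, 3}, 1: {3, 4}, 2: {5}, 3: {1, 2}}
POSITIONS = {0: 0.0, 1: 1.0, 2: 5.0, 3: 10.0}
coverage = make_coverage(SETS)
disparity = make_disparity_min(lambda a, b: abs(POSITIONS[a] - POSITIONS[b]))

by_coverage = greedy(coverage, SETS.keys(), k=2)
by_disparity = greedy(disparity, SETS.keys(), k=2)
combined = best_of_two(coverage, disparity, SETS.keys(), k=2)
```

In this toy instance the dispersion-only solution wins under the joint objective, because the large distance between the two chosen snippets outweighs the small loss in coverage.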
3.7 Generating Ground Truth Summaries
The use of ratings allows us to generate multiple ground truth summaries for a video. The total number of possible summaries could be exponential in the video duration (a variant of knapsack on the duration of segments), so for our experiments we randomly generate up to 500 ground truth summaries for each video for each budget percentage (5, 15 and 30). Starting from the highest rating, if all segments of that rating fit in the budget, they are included in the summary. If all segments of a rating cannot be included in the remaining budget, then, using a variant of the standard coin change problem, maximal combinations of segments are enumerated such that possibly only the last segment gets broken. Algorithm 1 demonstrates how we generate the ground truth summaries given a set of ratings and a budget.
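The procedure described above can be approximated by the following minimal sketch. It is not Algorithm 1: instead of enumerating all maximal combinations, it samples one of them by shuffling the first rating tier that does not fit. It also assumes that only positively rated segments are eligible and ignores the repetitive/nonrepetitive distinction. The function name `sample_ground_truth` and the tuple layout are hypothetical.

```python
import random


def sample_ground_truth(segments, budget, rng):
    """Sample one budget-constrained ground truth summary from ratings.

    segments: list of (segment_id, duration_seconds, rating).
    Returns a list of (segment_id, seconds_taken); only the last entry
    may be a truncated ("broken") segment.
    """
    summary = []
    remaining = budget
    # Assumption: only positively rated segments may enter a ground truth.
    ratings = sorted({r for _, _, r in segments if r > 0}, reverse=True)
    for rating in ratings:
        tier = [(sid, dur) for sid, dur, r in segments if r == rating]
        tier_total = sum(dur for _, dur in tier)
        if tier_total <= remaining:
            # The whole rating tier fits: include every segment in it.
            summary.extend(tier)
            remaining -= tier_total
            continue
        # The tier does not fit: fill the budget in random order, so that
        # only the last chosen segment is truncated.
        rng.shuffle(tier)
        for sid, dur in tier:
            if remaining <= 0:
                break
            take = min(dur, remaining)
            summary.append((sid, take))
            remaining -= take
        break
    return summary


# Toy video (assumption): durations in seconds, ratings from -1 to 3.
SEGMENTS = [("a", 10, 3), ("b", 10, 3), ("c", 20, 2), ("d", 20, 2), ("e", 5, -1)]
summary = sample_ground_truth(SEGMENTS, budget=30, rng=random.Random(0))
```

Calling the function repeatedly with different seeds yields different ground truths of the same budget, which is how a pool of summaries per video and budget could be drawn.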
4 Experiments and Results
What follows is a description of the experiments performed and the results observed using the above framework for domain specific video summarization. For surveillance videos (EntryExit and Office), since night videos are black and white, we do not use the color histogram features based on hue and saturation. For videos in the Soccer, Cricket and Birthday domains we additionally perform shot detection to identify distinct shots in the video and create a feature which keeps track of the snippets present in each shot. This is consumed by the continuity component in the mixture. However, we do not use this continuity component for the surveillance videos, where the notion of a shot is not well defined. Also, unless explicitly stated, during each training epoch a random ground truth summary, out of the many possible summaries, was chosen for each video so that over a large number of epochs all ground truths get covered. We do a train-test split of 70-30 with respect to the number of videos in each domain in the dataset. The hyperparameters used for the evaluation measure while training were , and . We arrive at best values for and by testing the models on the heldout validation set.
Sanity of ground truths and behavior of evaluation measure
We perform the following sanity checks on the ground truth summaries produced and the behavior of our evaluation measure:

Scores of all ground truth summaries of a particular budget for a video should be the same, asserting that the synthesis of ground truths as above is consistent with the evaluation measure

Scores of ground truth summaries should always be greater than those of randomly produced summaries, for all lengths, all videos and all categories
In this experiment we compare the normalized scores of the ground truth summaries against the normalized scores of 1000 random summaries picked exclusively from segments rated highly positive. We use standard min-max normalization. We plot the minimum, maximum and average scores for both the random summaries and the ground truth summaries. The random summaries used in this experiment are not truly random: they exclude the negatively rated segments, so that they do not receive very low scores that would unfairly favor the ground truth summaries in the comparison.
The results in Figure 1 show that our evaluation measure and our ground truth generation algorithm behave as expected. All ground truth summaries have the same normalized score, and hence the minimum, maximum and average scores of the ground truth summaries coincide as one line in the plots. Also, all ground truth summaries score higher than any random summary across all domains and all videos. We also verify the ground truth summaries visually by rendering them as videos and assessing their quality.
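The two sanity checks and the min-max normalization can be expressed compactly as below. This is a minimal sketch, assuming that normalization is done over the pooled scores of one video and budget; the helper names and the toy scores are hypothetical and are not results from our experiments.

```python
def min_max_normalize(scores):
    """Rescale scores linearly to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ground_truths_consistent(gt_scores, tol=1e-9):
    """Check 1: all ground truth summaries of a budget get the same score."""
    return max(gt_scores) - min(gt_scores) <= tol


def ground_truths_dominate(gt_scores, random_scores):
    """Check 2: every ground truth scores higher than every random summary."""
    return min(gt_scores) > max(random_scores)


# Toy scores (assumption) for one video and one budget.
gt_scores = [6.0, 6.0, 6.0]
random_scores = [2.0, 4.0, 5.0]
normalized = min_max_normalize(random_scores + gt_scores)
```

Normalizing the pooled list keeps the ordering between random and ground truth summaries intact, so both checks can be applied either before or after normalization.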
Learning experiments
For learning weights in the above formulation, we constrain the weights of all submodular components to be always positive. There is no such constraint for modular components in the mixture. We compare AdaGrad and stochastic gradient descent and find AdaGrad to work better for all experiments and hence all results reported use AdaGrad. The reported numbers are losses (i.e.  ). We compare the following results: a) All Modular: Train with only modular terms in the mixture, b) All Submodular: Train with only submodular terms in the mixture, c) Full: Complete Mixture (all modular terms and all submodular terms).
We compare all of these to random summaries, uniform summaries and the best individual component baselines, in which each component, instantiated with different features, is used on its own to produce summaries. Results are reported in Table 2. We observe that combining both the submodular and modular terms in the mixture (Full) provides the best results, compared to using only submodular or only modular terms and compared to any of the baselines. Moreover, the learnt mixtures with only modular or only submodular terms also outperform Random, Uniform and the average individual submodular functions. We see that the best individual submodular functions also perform better than the random and uniform baselines, but generally not as well as the learnt mixtures, which demonstrates the benefit of learning for this problem. We also verify the goodness of a summary both quantitatively (using the scores from our evaluation measure) and qualitatively (by visualizing the summary produced). This supports our hypothesis that joint training can significantly help.
Domain  Method  Score Loss
Birthday  All Modular  0.7234
Birthday  All Submodular  0.7107
Birthday  Full  0.6625
Birthday  Random  0.7579
Birthday  Uniform  0.7569
Birthday  Submodular  0.7232
EntryExit  All Modular  0.5967
EntryExit  All Submodular  0.6306
EntryExit  Full  0.5884
EntryExit  Random  0.7706
EntryExit  Uniform  0.7785
EntryExit  Submodular  0.6666
Cricket  All Modular  0.8140
Cricket  All Submodular  0.8275
Cricket  Full  0.7733
Cricket  Random  0.8911
Cricket  Uniform  0.8979
Cricket  Submodular  0.8432
Office  All Modular  0.5140
Office  All Submodular  0.4783
Office  Full  0.3696
Office  Random  0.5743
Office  Uniform  0.5599
Office  Submodular  0.5190
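The weight update used in the learning experiments, AdaGrad with the submodular weights kept from going negative, can be sketched as follows. This is a minimal illustration and not our training code: the gradient is assumed to be supplied by the loss augmented objective, enforcing the constraint by clipping at zero is an assumption, and the name `adagrad_step` and all numeric values are hypothetical.

```python
import math


def adagrad_step(weights, grads, accum, lr, nonneg_mask, eps=1e-8):
    """One AdaGrad update, then clip the constrained weights at zero.

    weights, grads, accum: lists of equal length (accum holds the running
    sum of squared gradients). nonneg_mask[i] is True for weights of
    submodular components, which are not allowed to become negative.
    """
    new_weights, new_accum = [], []
    for w, g, a, constrained in zip(weights, grads, accum, nonneg_mask):
        a = a + g * g                          # accumulate squared gradient
        w = w - lr * g / (math.sqrt(a) + eps)  # per-coordinate step size
        if constrained:
            w = max(w, 0.0)                    # projection (assumption)
        new_weights.append(w)
        new_accum.append(a)
    return new_weights, new_accum


# Toy step: the first weight belongs to a submodular component, the second
# to a modular component that is left unconstrained.
weights, accum = adagrad_step(
    weights=[0.1, -0.2],
    grads=[1.0, -1.0],
    accum=[0.0, 0.0],
    lr=0.5,
    nonneg_mask=[True, False],
)
```

In the toy step the constrained weight would move to a negative value and is clipped to zero, while the unconstrained modular weight is free to change sign.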

Verification of learning domain specific characteristics
To demonstrate that the learnt summaries are domain specific, we test the model learnt on one domain in producing summaries of another domain. The results in Table 3 show that models learnt on one domain perform poorly on other domains, establishing that the model has indeed learnt domain specific characteristics.
Model Trained On  Model Tested On  Score Loss
Birthday  Birthday  0.6625
Birthday  Soccer  0.9753
Birthday  Cricket  0.9177
EntryExit  EntryExit  0.5884
EntryExit  Soccer  0.9900
EntryExit  Cricket  0.9710
EntryExit  Birthday  0.8009
Cricket  Cricket  0.7733
Cricket  Soccer  0.8284
Cricket  Birthday  0.8103
Next, we look at the weights learnt for the different submodular components. Figure 2 (top) shows the magnitude of the learnt weights for the different domains. We see that different domains prefer different submodular components and features to produce good summaries. For example, scene features (googlenet_features) are important for Cricket, while object detection features (yolo_voc_p_concepts and yolo_coco_concepts) are more important for surveillance videos. For Cricket, whether the scene shows the ground, the pitch or the crowd has a strong bearing on importance, whereas for surveillance, given the static scene, detection of entities assumes more importance. Next, we look at the correlation between the components which receive the highest weights in the learnt mixture and the components which achieve the best scores when run in isolation. We see in the bottom table of Figure 2 that there is a strong correlation between the two: about 6 to 7 of the top ten components are the same in both buckets. It is also informative to look at the components themselves. For Birthday and Cricket, we see that Saturated Coverage and Facility Location (i.e. the representative models) are the winners, while in Office (which consists of surveillance videos), we see a lot of Disparity Min (diversity) functions as winners.
Finally, we look at the top ranked frames based on the learnt mixture for each domain. We see that the frames intuitively capture the most important aspects of that domain. For example, in Office footage, the top frames are those where people are either entering, leaving or meeting. In Birthday videos, a shot where people are taking a selfie and a shot where the birthday girl is posing are ranked the best. Similarly, in Cricket, shots where the player is about to hit and where a four is being hit are selected. Hence we see that the joint training has learnt the domain specific importance of events along with the right weights for diversity and representation.
Importance of ratings in generating multiple ground truths
To show that the ability to generate multiple ground truths from our ratings based annotation is beneficial, we compare the scores obtained by training with the same single ground truth summary in every epoch against the scores obtained by training with a random ground truth summary in every epoch.
Domain  Ground Truths Used  Score Loss
Birthday  Random GTs  0.6625
Birthday  Same GT  0.6818
EntryExit  Random GTs  0.5883
EntryExit  Same GT  0.6188
The results in Table 4 suggest that the model indeed learns better when multiple ground truth summaries are used, establishing the importance of the rating system, which allows us to generate many ground truths.
5 Conclusion
Motivated by the fact that what makes a good summary differs across domains, we set out to develop a framework which would automatically learn what is considered important for a domain, both in the kind of snippets to be selected and in the desired characteristics of the summary produced, such as representativeness, coverage and diversity. We also establish that ratings provide a more efficient form of supervision for imparting the domain knowledge necessary to create such summaries. Further, we propose a novel evaluation measure well suited for this task. In the absence of any existing dataset which would lend itself well to this particular problem, we created a gold standard dataset and will be making it public as a part of this work. Through several experiments we demonstrated the effectiveness of our solution in producing domain specific summaries, which can be seen as a first significant breakthrough in this direction.