SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition
While many action recognition datasets consist of collections of brief, trimmed videos each containing a relevant action, videos in the real world (e.g., on YouTube) exhibit very different properties: they are often several minutes long, and brief relevant clips are often interleaved with segments of extended duration containing little change. Densely applying an action recognition system to every temporal clip within such videos is prohibitively expensive. Furthermore, as we show in our experiments, this results in suboptimal recognition accuracy, as informative predictions from relevant clips are outnumbered by meaningless classification outputs over long uninformative sections of the video. In this paper we introduce a lightweight “clip-sampling” model that can efficiently identify the most salient temporal clips within a long video. We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips. Furthermore, we show that this yields significant gains in recognition accuracy compared to analysis of all clips or randomly/uniformly selected clips. On Sports1M, our clip sampling scheme elevates the accuracy of an already state-of-the-art action classifier by 7% and reduces its computational cost by more than 15 times.
Most modern action recognition models operate by applying a deep CNN over clips of fixed temporal length [TwoStreamAZ:NIPS14, carreira2017quo, tran2017closer, NonLocal, SlowFast]. Video-level classification is obtained by aggregating the clip-level predictions over the entire video, either in the form of simple averaging or by means of more sophisticated schemes modeling temporal structure [NgEtAl:CVPR15, VarolEtAl:TPAMI17, Ghirdar:CVPR17]. Scoring a clip classifier densely over the entire sequence is a reasonable approach for short videos. However, it becomes computationally impractical for real-world videos that may be up to an hour long, such as some of the sequences in the Sports1M dataset [Sports1M]. In addition to the issue of computational cost, long videos often include segments of extended duration that provide irrelevant information for the recognition of the action class. Pooling information from all clips without consideration of their relevance may cause poor video-level classification, as informative clip predictions are outnumbered by uninformative predictions over long unimportant segments.
In this work we propose a simple scheme to address these problems (see Fig. 1 for a high-level illustration of the approach). It consists in training an extremely lightweight network to determine the saliency of a candidate clip. Because the computational cost of this network is more than one order of magnitude lower than the cost of existing 3D CNNs for action recognition [carreira2017quo, tran2017closer], it can be evaluated efficiently over all clips of even long videos. We refer to our network as SCSampler (Salient Clip Sampler), as it samples a reduced set of salient clips from the video for analysis by the action classifier. We demonstrate that restricting the costly action classifier to run only on the clips identified as the most salient by SCSampler yields not only significant savings in runtime but also large improvements in video classification accuracy: on Sports1M our scheme yields a speedup of more than 15× and an accuracy gain of 7% over an already state-of-the-art classifier.
Efficiency is a critical requirement in the design of SCSampler. We present two main variants of our sampler. The first operates directly on compressed video [R1:MR1, wu2018coviar, zhang2016cvpr], thus eliminating the need for costly decoding. The second looks only at the audio channel, which is low-dimensional and can therefore be processed very efficiently. As in recent multimedia work [arandjelovic2017look, aytar2016soundnet, Gao18ECCV, owens2018audio], our audio-based sampler exploits the inherent semantic correlation between the audio and the visual elements of a video. We also show that combining our video-based sampler with the audio-based sampler leads to further gains in recognition accuracy.
We propose and evaluate two distinct learning objectives for salient clip sampling. One of them trains the sampler to operate optimally with the given clip classifier, while the second formulation is classifier-independent. We show that, in some settings, the former leads to improved accuracy, while the benefit of the latter is that it can be used without retraining with any clip classifier, making this model a general and powerful off-the-shelf tool to improve both the runtime and the accuracy of clip-based action classification. Finally, we show that although our sampler is trained over specific action classes in the training set, its benefits extend even to recognition of novel action classes.
2 Related work
The problem of selecting relevant frames, clips or segments within a video has been investigated for various applications. For example, video summarization [GongEtAl:NIPS16, GygliEtAl:CVPR15, MahasseniEtAl:CVPR17, PotapovEtAl:ECCV14, zhang2016cvpr, ZhangEtAl:CVPR16, ZhangEtAl:ECCV16] and the automatic production of sport highlights [MerlerEtAl:CVPRW18, MerlerEtAl:TMM18] entail creating a much shorter version of the original video by concatenating a small set of snippets corresponding to the most informative or exciting moments. The aim of these systems is to generate a video composite that is pleasing and compelling for the user. Instead the objective of our model is to select a set of segments of fixed duration (i.e., clips) so as to make video-level classification as accurate and as unambiguous as possible.
More closely related to our task is the problem of action localization [Mihir:CVPR14, ShouEtAL:CVPR16, ShouEtAl:CVPR17, xu2017rc3d, ZhaoEtAl:ICCV17], where the objective is to localize the temporal start and end of each action within a given untrimmed video and to recognize the action class. Action localization is often approached through a two-step mechanism [BuchEtAl:CVPR17, EscorciaEtal:ECCV16, GaoEtAl:ICCV17, Gao18ECCV, HeilbronEtAl:CVPR16, LinEtAl:ECCV18, ActionSearch], where first an action proposal method identifies candidate action segments, and then a more sophisticated approach validates the class of each candidate and refines its temporal boundaries. Our framework is reminiscent of this two-step solution, as our sampler can be viewed as selecting candidate clips for accurate evaluation by the action classifier. However, several key differences exist between our objective and that of action localization. Our system is aimed at video classification, where the assumption is that each video contains a single action class. Action proposal methods solve the harder problem of finding segments of different lengths and potentially belonging to different classes within the input video. While in action localization the validation model is typically trained using the candidate segments produced by the proposal method, the opposite is true in our scenario: the sampler is learned for a given pretrained clip classifier, which is left unmodified by our approach. Finally, the most fundamental difference is that high efficiency is a critical requirement in the design of our clip sampler. Our sampler must be orders of magnitude faster than the clip classifier to make our approach worthwhile.
Conversely, most action proposal or localization methods are based on optical flow [LinEtAl:CVPRW17, LinEtAl:ECCV18] or deep action-classifier features [BuchEtAl:CVPR17, Gao18ECCV, xu2017rc3d] that are typically at least as expensive to compute as the output of a clip classifier. For example, the TURN TAP system [GaoEtAl:ICCV17] is one of the fastest existing action proposal methods and yet its computational cost exceeds that of our scheme by more than one order of magnitude. For 60 seconds of untrimmed video, TURN TAP has a cost of 4128 GFLOPs; running our clip classifier (MC3-18 [tran2017closer]) densely over the 60 seconds would actually cost less, at 1097 GFLOPs; our sampling scheme lowers the cost dramatically, to only 168 GFLOPs.
Closer to our intent are methods that remove from consideration uninformative sections of the video. This is typically achieved by means of temporal models that “skip” segments by leveraging past observations to predict which future frames to consider next [yeung2016end, fan18ijcai, Wu_2019_CVPR]. Instead of learning to skip, our approach relies on a fast sampling procedure that evaluates all segments in a video and then performs further analysis on the most salient ones.
Our approach belongs to the genre of work that performs video classification by aggregating temporal information from long videos [GaidonEtAl:TPAMI2013, R1:MR3, NgEtAl:CVPR15, PirsiavashEtAl:CVPR14, VarolEtAl:TPAMI17, WangCherian:ECCV18, WangEtaAl:IJCV16, WangEtAl_TSN:ECCV16, WangEtAl:CVPR16, WuEtAl:CVPR19, ZhouEtAl:ECCV18]. Our aggregation scheme is very simple, as it merely averages the scores of action classifiers over the selected clips. Yet, we note that the most recent state-of-the-art action classifiers operate precisely under this simple scheme. Examples include Two-Stream Networks [TwoStreamAZ:NIPS14], I3D [carreira2017quo], R(2+1)D [tran2017closer], Non-Local Networks [NonLocal], and SlowFast [SlowFast]. While in these prior studies clips are sampled densely or at random, our experiments suggest that our sampling strategy yields significant gains in accuracy over dense, random, and uniform sampling, while being as fast as random sampling.
3 Technical approach
Our approach consists in extracting a small set of relevant clips from a video by densely scoring each clip with a lightweight saliency model. We refer to this model as the “sampler”, since it is used to sample clips from the video. We formally define the task in subsection 3.1, present two different learning objectives for the sampler in subsection 3.2, and finally discuss sampler architecture choices and features in subsection 3.3.
3.1 Problem Formulation
Video classification from clip-level predictions. We assume we are given a pretrained action classifier f operating on short, fixed-length clips of F RGB frames with spatial resolution H × W, and producing output classification probabilities over a set of C action classes. We note that most modern action recognition systems [carreira2017quo, feichtenhofer2016spatiotemporal, tran2015learning, tran2017closer] fall under this model and, typically, they constrain the number of frames F to span just a handful of seconds in order to keep memory consumption manageable during training and testing. Given a test video of arbitrary length, video-level classification through the clip classifier f is achieved by first splitting the video into a set of clips V = {v_1, ..., v_N}, with each clip consisting of F adjacent frames and where N denotes the total number of clips in the video. The splitting is usually done by taking clips every F frames in order to have a set of non-overlapping clips that spans the entirety of the video. A final video-level prediction is then computed by aggregating the individual clip-level predictions. In other words, if we denote with aggr the aggregation operator, the video-level classifier is obtained as aggr(f(v_1), ..., f(v_N)).
Most often, the aggregator is a simple pooling operator which averages the individual clip scores (i.e., aggr(f(v_1), ..., f(v_N)) = (1/N) Σ_i f(v_i)) [carreira2017quo, SlowFast, TwoStreamAZ:NIPS14, tran2017closer, NonLocal], but more sophisticated schemes based on RNNs [yue2015beyond] have also been employed.
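As an illustration, the average-pooling aggregation described above can be sketched in a few lines of Python (function names are ours, not taken from any released implementation):

```python
import math

def softmax(logits):
    # numerically stable softmax over one clip's class logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def video_prediction(clip_logits):
    # average clip-level class probabilities (the "aggr" operator),
    # then return the index of the highest-scoring class
    probs = [softmax(l) for l in clip_logits]
    n, c = len(probs), len(probs[0])
    avg = [sum(p[j] for p in probs) / n for j in range(c)]
    return max(range(c), key=lambda j: avg[j])
```

Averaging probabilities (rather than raw logits) is one common convention; either choice fits the abstract aggr operator above.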
Video classification from selected clips. In this paper we are interested in scenarios where the videos are untrimmed and may be quite long. In such cases, applying the clip classifier to every clip will result in a very large inference cost. Furthermore, aggregating predictions from the entire video may produce poor action recognition accuracy since in long videos the target action is unlikely to be exhibited in every clip. Thus, our objective is to design a method that can efficiently identify a subset S of salient clips in the video (i.e., S ⊂ V with |S| = K ≪ N) and to reduce video-level prediction to be computed from this smaller set of clip-level predictions as aggr({f(v) : v ∈ S}) (K is a hyper-parameter studied in our experiments). By constraining the application of the costly classifier f to only K clips, inference will be efficient even on long videos. Furthermore, by making sure that S includes a sample of the most salient clips in V, recognition accuracy may improve as irrelevant or ambiguous clips will be discarded from consideration and will be prevented from polluting the video-level prediction. We note that in this work we address the problem of clip selection for a given pretrained clip classifier f, which is left unmodified by our method. This renders our approach useful as a post-training procedure to further improve performance of existing classifiers both in terms of inference speed as well as recognition accuracy.
Our clip sampler. In order to achieve our goal we propose a simple solution that consists in learning a highly efficient clip-level saliency model s that provides for each clip v_i in the video a “saliency score” s(φ(v_i)) in [0, 1]. Specifically, our saliency model takes as input clip features φ(v_i) that are fast to compute from the raw clip and that have low dimensionality, so that each clip can be analyzed very efficiently. The saliency model is designed to be orders of magnitude faster than f, thus enabling the possibility to score s on every single clip of the video to find the most salient clips without adding any significant overhead. The set S is then obtained as S = {v_i : i ∈ topK(s(φ(v_1)), ..., s(φ(v_N)))}, where topK returns the indices of the top-K values in the set. We show that evaluating f on this selected set, i.e., computing aggr({f(v) : v ∈ S}), results in significantly higher accuracy compared to aggregating clip-level predictions over all clips.
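The top-K selection step can be sketched as follows (an illustrative fragment: the saliency scores would come from the lightweight sampler, and the expensive classifier f is then run only on the returned indices):

```python
def select_salient_clips(saliency_scores, k):
    # saliency_scores: one sampler score s(phi(v_i)) per clip.
    # Returns the indices of the K highest-scoring clips,
    # sorted back into temporal order.
    order = sorted(range(len(saliency_scores)),
                   key=lambda i: saliency_scores[i], reverse=True)
    return sorted(order[:k])
```

For example, with scores [0.1, 0.9, 0.3, 0.8] and K = 2 the function returns clip indices [1, 3].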
In order to learn the sampler s, we use a training set of untrimmed video examples, each annotated with a label indicating the action performed in the video: D = {(V_1, y_1), ..., (V_M, y_M)}, with V_m denoting the m-th video and y_m ∈ {1, ..., C} indicating its action label. In our experiments, we use as training set the same set of examples that was used to train the clip classifier f. This setup allows us to demonstrate that the gains in recognition accuracy are not due to leveraging additional data but instead are the result of learning to detect the most salient clips for f within each video.
Oracle sampler. In this work we compare our sampler against an “oracle” that makes use of the action label y to select the best K clips in the video for classification with f. The oracle set is formally defined as S*(V, y) = {v_i : i ∈ topK(f_y(v_1), ..., f_y(v_N))}, where f_y(v) denotes the classification score that f assigns to clip v for the ground-truth label y. Note that S* is obtained by looking for the clips that yield the highest action classification scores for the ground-truth label under the costly action classifier f. In real scenarios the oracle cannot be constructed, as it requires knowing the true label and it involves dense application of f over the entire video, which defeats the purpose of the sampler. Nevertheless, in this work we use the oracle to obtain an upper bound on the accuracy of the sampler. Furthermore, we apply the oracle to the training set to form pseudo ground-truth data to train our sampler, as discussed in the next subsection.
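The oracle selection can be sketched as below (illustrative only: it presumes access to the true label and to dense clip-level probabilities from f, which is exactly why it is impractical at test time):

```python
def oracle_clips(clip_probs, true_label, k):
    # clip_probs: per-clip class-probability vectors from the costly
    # classifier f; true_label: index of the ground-truth class y.
    # Returns the K clips with the highest probability for y.
    scores = [p[true_label] for p in clip_probs]
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```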
3.2 Learning Objectives for SCSampler
We consider two choices of learning objective for the sampler and experimentally compare them in 4.2.1.
3.2.1 Training the sampler as an action classifier
A naïve way to approach the learning of the sampler is to first train a lightweight action classifier on the training set by forming clip examples using the low-dimensional clip features φ(v). Note that this is equivalent to assuming that every clip in the training video contains a manifestation of the target action. Then, given a new untrimmed test video, we can compute the saliency score of a clip v_i in the video as the maximum classification score over the C classes, i.e., s(φ(v_i)) = max_c s_c(φ(v_i)). The rationale behind this choice is that a salient clip is expected to elicit a strong response by the classifier, while irrelevant or ambiguous clips are likely to cause weak predictions for all classes. We refer to this variant of our loss as AC (Action Classification).
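The AC saliency rule (peak class confidence as saliency) can be sketched as follows; the helper names are illustrative:

```python
def ac_saliency(clip_class_probs):
    # AC saliency: the maximum class probability that the lightweight
    # classifier assigns to the clip features
    return max(clip_class_probs)

def rank_by_confidence(all_clip_probs):
    # order clip indices from most to least salient under the AC rule
    s = [ac_saliency(p) for p in all_clip_probs]
    return sorted(range(len(s)), key=lambda i: s[i], reverse=True)
```

A clip on which the classifier is confident about some class (e.g., [0.95, 0.05]) ranks above one with a flat, ambiguous distribution (e.g., [0.5, 0.5]).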
3.2.2 Training the sampler as a saliency ranker
One drawback of AC is that the sampler is trained as an action classifier independently from the model f and by assuming that all clips are equally relevant. Instead, ideally we would like the sampler to select clips that are most useful to our given f. To achieve this goal we propose to train the sampler to recognize the relative importance of the clips within a video with respect to the classification output of f for the correct action label. Concretely, we define pseudo ground-truth binary labels z_ij for pairs of clips (v_i, v_j) from the same video, setting z_ij = 1 if f_y(v_i) > f_y(v_j) and z_ij = 0 otherwise.
We train s by minimizing a hinge ranking loss over these pairs:
L_ij = max(0, m − (s(φ(v_i)) − s(φ(v_j)))) if z_ij = 1, and L_ij = max(0, m − (s(φ(v_j)) − s(φ(v_i)))) otherwise,
where m is a margin hyper-parameter. This loss encourages the sampler to rank higher clips that produce a higher classification score under f for the correct label. We refer to this sampler loss as SAL-RANK (Saliency Ranking).
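A minimal sketch of this pairwise hinge loss is given below (the margin default and the dictionary-of-pairs interface are our illustrative choices, not prescribed by the paper):

```python
def sal_rank_loss(scores, pseudo_labels, margin=0.1):
    # scores: per-clip sampler outputs s(phi(v_i)).
    # pseudo_labels: maps a clip-index pair (i, j) to z_ij in {0, 1},
    # where z_ij = 1 means f scores clip i higher than clip j for the
    # ground-truth class. Returns the mean hinge loss over all pairs.
    loss, count = 0.0, 0
    for (i, j), z in pseudo_labels.items():
        hi, lo = (i, j) if z == 1 else (j, i)
        loss += max(0.0, margin - (scores[hi] - scores[lo]))
        count += 1
    return loss / max(count, 1)
```

When the sampler already ranks the pair correctly by more than the margin the pair contributes zero loss; a reversed ranking is penalized in proportion to the score gap.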
3.3 Sampler Architecture
Due to the tight runtime requirements, we restrict our sampler to operate on two types of features that can be computed efficiently from video and that yield a very compact representation to process. The first type of features are obtained directly from the compressed video without the need for decoding. Prior work has shown that features computed from compressed video can even be used for action recognition [wu2018coviar]. We describe in detail these features in subsection 3.3.1. The second type of features are audio features, which are even more compact and faster to compute than the compressed video features. Recent work [arandjelovic2017look, arandjelovic2017objects, aytar2016soundnet, Gao18ECCV, korbar2018cooperative, owens2018audio, zhao2018sound] has shown that the audio channel provides strong cues about the content of the video and this semantic correlation can be leveraged for various applications.
In subsection 3.3.2 we discuss how we can exploit the low-dimensional audio modality to efficiently find salient clips in a video.
3.3.1 Visual sampler
Wu et al. [wu2018coviar] recently introduced an accurate action recognition model directly trained on compressed video. Modern codecs such as MPEG-4 and H.264 represent video in highly compressed form by storing the information in a set of sparse I-frames, each followed by a sequence of P-frames. An I-frame (IF) represents the RGB frame in a video just as an image. Each I-frame is followed by 11 P-frames, which encode the 11 subsequent frames in terms of motion displacement (MD) and RGB residual (RGB-R). MDs capture the frame-to-frame 2D motion while RGB-Rs store the remaining difference in RGB values between adjacent frames after having applied the MD field to rewarp the frame. In [wu2018coviar] it was shown that each of these three modalities (IFs, MDs, RGB-Rs) provides useful information for efficient and accurate action recognition in video. Inspired by this prior work, here we train three separate ResNet-18 networks [he2016residual] on these three inputs as samplers using the learning objectives outlined in the previous subsection. The first ResNet-18 takes as input an individual IF (a full RGB frame). The second is trained on MD frames, whose 2 channels encode the horizontal and vertical motion displacements at a resolution that is 16 times smaller than the original video. The third ResNet-18 is fed individual RGB-Rs. At test time we average the predictions of these 3 models over all the I-frames and P-frames (MDs and RGB-Rs) within the clip to obtain a final global saliency score for the clip. As an alternative to ResNet-18, we experimented also with a lightweight ShuffleNet architecture [zhang2018shufflenet] of 26 layers. We compare these models in 4.2.2. We do not present results for the large ResNet-152 model that was used in [wu2018coviar], since it adds a cost of 3 GFLOPs per clip which far exceeds the computational budget of our application.
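The test-time fusion of the three per-modality samplers can be sketched as a simple average of averages (an illustrative reading of the scheme above; the per-frame scores would come from the three networks):

```python
def compressed_video_saliency(if_scores, md_scores, rgbr_scores):
    # Each argument is the list of per-frame saliency scores produced
    # within one clip by the I-frame, motion-displacement, and
    # RGB-residual networks respectively. The clip score is the mean
    # of the three per-modality means.
    def mean(xs):
        return sum(xs) / len(xs)
    return (mean(if_scores) + mean(md_scores) + mean(rgbr_scores)) / 3.0
```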
3.3.2 Audio sampler
We model our audio sampler after the VGG-like audio networks used in [chung2016out, arandjelovic2017look, korbar2018cooperative]. Specifically, we first extract MEL-spectrograms from audio segments twice as long as the video clips, but with stride equal to the video-clip length. This stride is chosen to obtain an audio-based saliency score for every video clip used by the action recognizer f. However, for the audio sampler we use an observation window twice as long as the video clip since we found this to yield better results. A series of 200 time samples is taken within each audio segment and processed using a bank of MEL filters, yielding a compact 2D spectrogram descriptor. This representation is compact and can be analyzed efficiently by the sampler. We treat this descriptor as an image and process it using a VGG network [vgg2014] of 18 layers. The details of the architecture are given in the supplementary material.
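The windowing described above (one 2×clip-length audio segment per video clip, advanced with a stride of one clip length) can be sketched as follows; the choice to clamp segments near the end of the video so they stay in bounds is our assumption, as the paper does not specify boundary handling:

```python
def audio_segment_bounds(num_clips, clip_len, video_duration):
    # For clip i starting at i * clip_len, take an audio window of
    # length 2 * clip_len starting at the same point, shifted back
    # if needed so it fits inside the video (assumed behavior).
    bounds = []
    for i in range(num_clips):
        start = i * clip_len
        start = min(start, max(0.0, video_duration - 2 * clip_len))
        bounds.append((start, start + 2 * clip_len))
    return bounds
```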
3.3.3 Combining video and audio saliency
Since audio and video provide correlated but distinct cues, we investigated several schemes for combining the saliency predictions from these two modalities. With AV-convex-score we denote a model that simply combines the audio-based score and the video-based score by means of a convex combination α · s_audio + (1 − α) · s_video, where α ∈ [0, 1] is a scalar hyperparameter. The scheme AV-convex-list instead first produces two separate ranked lists by sorting the clips within each video according to the audio sampler and the visual sampler independently. Then the method computes for each clip the weighted average of its ranked position in the two lists according to a convex combination of the two positions. The top-K clips according to this measure are finally retrieved. The method AV-intersect-list computes an intersection between the top-K′ clips of the audio sampler and the top-K′ clips of the video sampler. For each video, K′ is progressively increased until the intersection yields exactly K clips. In AV-union-list we form the set of K clips by selecting the top K_v clips according to the visual sampler (with K_v ≤ K a hyperparameter) and by adding to it a set of K − K_v different clips from the ranked list of the audio sampler. Finally, we also present results for AV-joint-training, where we simply average the audio-based score and the video-based score and then finetune the two networks with respect to this average.
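Two of these schemes are easy to sketch concretely (illustrative implementations under the definitions above; names are ours):

```python
def av_convex_score(audio_s, video_s, alpha=0.5):
    # AV-convex-score: per-clip convex combination of the two samplers;
    # alpha is a hyperparameter (0.5 here is only a placeholder)
    return [alpha * a + (1 - alpha) * v for a, v in zip(audio_s, video_s)]

def av_union_list(audio_s, video_s, k, k_video):
    # AV-union-list: take the top k_video clips from the visual ranking,
    # then fill the remaining k - k_video slots with the highest-ranked
    # audio clips not already chosen
    vid_rank = sorted(range(len(video_s)),
                      key=lambda i: video_s[i], reverse=True)
    aud_rank = sorted(range(len(audio_s)),
                      key=lambda i: audio_s[i], reverse=True)
    chosen = vid_rank[:k_video]
    for i in aud_rank:
        if len(chosen) == k:
            break
        if i not in chosen:
            chosen.append(i)
    return sorted(chosen)
```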
4 Experiments
In this section we evaluate the proposed sampling procedure on the large-scale Sports1M and Kinetics datasets.
4.1 Large-scale action recognition with SCSampler
4.1.1 Experimental Setup
Action Recognition Networks. Our sampler can be used with any clip-based action classifier f. We demonstrate the general applicability of our approach by evaluating it with six popular 3D CNNs for action recognition. Four of these models are publicly available pretrained networks [VMZ] described in detail in [tran2017closer]: they are 18-layer instantiations of ResNet3D (R3D), Mixed Convolutional Network (MC3), and R(2+1)D, with this last network also in a 34-layer configuration. The other two models are our own implementation of I3D-RGB [carreira2017quo] and a ResNet3D of 152 layers leveraging depthwise convolutions (ir-CSN-152) [Tran19]. These networks are among the state-of-the-art on Kinetics and Sports1M. For training details, please refer to the supplementary material.
Sampler configuration. In this subsection we present results achieved with the best configuration of our sampler architecture, based on the experimental study that we present in section 4.2. The best configuration is a model that combines the saliency scores of an audio sampler and of a video sampler, using the strategy of AV-union-list. The video sampler is based on two ResNet-18 models trained on MD and RGB-R features, respectively, using the action classification loss (AC). The audio sampler is trained with the saliency ranking loss (SAL-RANK). Our sampler is optimized with respect to the given clip classifier f. Thus, we train a separate clip sampler for each of the 6 architectures in this evaluation. All results are based on the value of K (the number of sampled clips) that performed best according to our experiments (see analysis in supplementary material).
Baselines. We compare the action recognition accuracy achieved with our sampler against three baseline strategies to select K clips from the video: Random chooses K clips at random, Uniform selects K clips uniformly spaced out, while Empirical samples K clips from the discrete empirical distribution (i.e., a histogram) of the top-K Oracle clip locations over the entire training set (the histogram is computed by linearly remapping the temporal extent of each video to the interval [0, 1]). Finally, we also include video classification accuracy obtained with Dense, which performs “dense” evaluation by averaging the clip-level predictions over all non-overlapping clips in the video.
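The histogram underlying the Empirical baseline can be sketched as below (an illustrative fragment; the number of bins is our assumption, and sampling clip locations from this distribution would follow as a second step):

```python
def empirical_histogram(oracle_positions, num_bins=10):
    # oracle_positions: temporal locations of top-K Oracle clips over the
    # training set, each remapped to [0, 1]. Returns a normalized
    # histogram (discrete empirical distribution over bins).
    counts = [0] * num_bins
    for p in oracle_positions:
        b = min(int(p * num_bins), num_bins - 1)  # clamp p == 1.0
        counts[b] += 1
    total = sum(counts)
    return [c / total for c in counts]
```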
4.1.2 Evaluation on Sports1M
Our approach is designed to operate on long, real-world videos where it is neither feasible nor beneficial to evaluate every single clip. For these reasons, we choose the Sports1M dataset [Sports1M] as a suitable benchmark since its average video length is 5 minutes and 36 seconds, and some of its videos exceed 1 hour. We use the official training/test split. We do not trim the test videos and instead seek the top K clips according to our sampler in each video. We stress that our sampling strategy is applied to test videos only. The training videos in Sports1M are also untrimmed. As training on all training clips would be infeasible, we use the training procedure described in [tran2017closer], which consists in selecting from each training video 10 random 2-second segments from which training clips are formed. We leave to future work the investigation of whether our sampling can be extended to sample training clips from the full videos.
We present the results in Table 3, which includes for each method the video-level classification accuracy as well as the cumulative runtime (in days) to run the inference on the complete test set using 32 NVIDIA P100 GPUs (this includes the time needed for sampling as well as clip-level action classification). The most direct baselines for our evaluation are Random, Uniform and Empirical, which use the same number of clips (K) in each video as SCSampler. It can be seen that compared to these baselines, SCSampler delivers a substantial accuracy gain for all action models, with improvements ranging from 6.0% for R(2+1)D-34 to 9.9% for R(2+1)D-18 with respect to Empirical, which does only marginally better than Random and Uniform.
Our approach also does better than “Dense” prediction, which averages the action classification predictions over all non-overlapping clips. To the best of our knowledge the accuracy of 77.0% achieved by ir-CSN-152 using Dense evaluation is currently the best published result on this benchmark. SCSampler provides an additional gain of 7.0% over this state-of-the-art model, pushing the accuracy to 84.0%. We note that when using ir-CSN-152, Dense requires 14 days whereas SCSampler achieves better accuracy and requires only 0.65 days to run inference on the Sports1M test set. Finally, we also report the performance of the “Oracle”, which selects the K clips that yield the highest classification score for the true class of the test video. This is an impractical model but it gives us an informative upper bound on the accuracy achievable with an ideal sampler.
Fig. 2 (left) shows the histogram of the clip temporal locations using K samples per video for the test set of Sports1M (after remapping the temporal extent of each video to [0, 1]). Oracle and SCSampler produce similar distributions of clip locations, with the first section and especially the last section of videos receiving many more samples. It can be noted that Empirical shows a different sample distribution compared to Oracle. This is due to the fact that it computes the histogram from the training set, which in this case appears to have different statistics from the test set.
Thumbnails of top-ranked and bottom-ranked clips for two test videos are shown in Fig. 3.
4.1.3 Evaluation on Kinetics
We further evaluate SCSampler on the Kinetics [Kinetics] dataset. Kinetics is a large-scale benchmark for action recognition containing 400 classes and 280K videos (240K for training and 40K for testing), each about 10 seconds long. The results are reported in Table 3. Kinetics videos are short and thus in principle the recognition model should not benefit from a clip-sampling scheme such as ours. Nevertheless, we see that for all architectures SCSampler provides accuracy gains over Random/Uniform/Empirical selection and Dense evaluation, although the improvements are understandably less substantial than in the case of Sports1M. To the best of our knowledge, the accuracy of 80.2% achieved by ir-CSN-152 with our SCSampler is the best reported result so far on this benchmark.
Note that [Tran19] reports an accuracy of 79.0% using Uniform (instead of the 78.5% we list in Table 3, row 6) but this accuracy is achieved by applying the clip classifier spatially in a fully-convolutional fashion on frames of size 256x256, whereas here we use a single center spatial crop of size 224x224 for all our experiments. Sliding the clip classifier spatially in a fully-convolutional fashion (as in [Tran19]) raises the accuracy of SCSampler to 81.1%.
Fig. 2 (right) shows the histogram of clip temporal locations on the validation set of Kinetics. Compared to Sports1M, the Oracle and SCSampler distributions here are much more uniform.
4.1.4 Unseen Action Classifiers and Novel Classes
While our SCSampler has low computational cost, it adds the procedural overhead of having to train a specialized clip selector for each classifier and each dataset. Here we evaluate the possibility of reusing a sampler that was optimized for a classifier f on a dataset D, for a new classifier f′ on a dataset D′ that contains action classes different from those seen in D. In Table 3, we present cross-dataset performance of an SCSampler trained on Kinetics but then used to select clips on Sports1M (and vice versa). We also report cross-classifier performance obtained by optimizing SCSampler with pseudo-ground-truth labels (see section 3.2.2) generated by R(2+1)D-18 but then used for video-level prediction with action classifier MC3-18. On the Kinetics validation set, using an SCSampler that was trained using the same action classifier (MC3) but a different dataset (Sports1M) causes a drop of about 2% (65.0% vs 67.0%), while training using a different action classifier (R(2+1)D) to generate pseudo-ground-truth labels on the same dataset (Kinetics) causes a degradation of 1.1% (65.9% vs 67.0%). The evaluation on Sports1M shows a similar trend, where cross-dataset accuracy (69.2%) is lower than cross-classifier accuracy (72.1%). Even in the extreme setting of cross-dataset and cross-classifier, the accuracies achieved with SCSampler are still better than those obtained with Random or Uniform selection. Finally, we note that samplers trained using the AC loss (section 3.2.1) do not require pseudo-labels and thus are independent of the action classifier by design.
4.2 Evaluating Design Choices for SCSampler
In this subsection we evaluate the different choices in the design of SCSampler. Given the many configurations to assess, we make this study more computationally feasible by restricting the evaluation to a subset of Sports1M, which we name miniSports. The dataset is formed by randomly choosing for each class 280 videos from the training set and 69 videos from the test set. This gives us a class-balanced set of 136,360 training videos and 33,603 test videos. All videos are shortened to the same length of 2.75 minutes. For our assessment, we restrict our choice of action classifier to MC3-18, which we retrain on our training set of miniSports. We assess the SCSampler design choices in terms of how they affect the video-level accuracy of MC3-18 on the test set of miniSports, since our aim is to find the best configuration for video classification.
4.2.1 Learning objective
We begin by studying the effect of the loss function used for training SCSampler, by considering the loss variants described in section 3.2. For this evaluation, we assess the visual sampler and the audio sampler separately. The visual sampler is based on two ResNet-18 networks operating on MD and RGB-R features, respectively. Both networks are pretrained on ImageNet and then finetuned on the training set of miniSports under each of the SCSampler loss functions. The audio sampler is our VGG network pretrained for classification on AudioSet [audioset] and then finetuned on the training set of miniSports. The MC3-18 video classification accuracy is 73.1% when the visual sampler is trained with the Action Classification (AC) loss, whereas it is 64.8% when it is trained with the Saliency Ranking (SAL-RANK) loss. Conversely, we found that the audio sampler is slightly more effective when trained with the SAL-RANK loss than with the AC loss (video-level accuracy is 67.8% with SAL-RANK and 66.4% with AC). A possible explanation for this difference is that the AC loss defines a more challenging problem (action classification vs binary ranking) but provides richer supervision (multiclass vs binary labels). The visual model, which uses compressed-video features, is strong enough to benefit from the AC supervision (as already shown in [wu2018coviar]), whereas the weaker audio model does better when trained on the simpler SAL-RANK problem.
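The two objectives can be written out in a few lines. These are the standard formulations (cross-entropy and a pairwise margin loss) sketched for clarity, not the authors' exact implementation:

```python
import numpy as np

def ac_loss(logits, label):
    """Action-Classification (AC) loss: the sampler is trained as a
    lightweight action classifier on the video-level label; at test
    time its top confidence serves as the clip-saliency score."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                        # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def sal_rank_loss(salient_score, nonsalient_score, margin=1.0):
    """Saliency-Ranking (SAL-RANK) loss: a pairwise margin loss pushing
    the sampler to score a salient clip above a non-salient one."""
    return max(0.0, margin - (salient_score - nonsalient_score))
```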
4.2.2 Sampler architecture and features
In this subsection we assess different architectures and features for the sampler. For the visual sampler, we use the AC loss and consider two different lightweight architectures: ResNet-18 and ShuffleNet-26. Each architecture is trained on each of the three types of video-compression features described in section 3.3.1: IF, MD and RGB-R. We also assess combinations of these features, obtained by averaging the scores of the classifiers trained on the individual features. The results are reported in Table 4. We observe that, given the same type of input features, ResNet-18 provides much higher accuracy than ShuffleNet-26 at a runtime that is only marginally higher. We also note that MD and RGB-R features appear to be quite complementary: for ResNet-18, MD+RGB-R yields an accuracy of 73.1%, whereas these features alone achieve accuracies of only 68.0% and 63.5%, respectively. Adding IF features to MD+RGB-R, however, provides only a modest gain in accuracy (74.9% vs 73.1%) while noticeably increasing the runtime. Considering these tradeoffs, we adopt ResNet-18 trained on MD+RGB-R as our visual sampler in all subsequent experiments.
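The feature combinations in Table 4 are obtained by simple score averaging, which amounts to a one-liner. This is an illustrative sketch with our own names:

```python
import numpy as np

def fuse_saliency(*stream_scores):
    """Average per-clip saliency scores from several feature-specific
    samplers (e.g. the MD and RGB-R ResNet-18 networks)."""
    stacked = np.stack([np.asarray(s, dtype=float) for s in stream_scores])
    return stacked.mean(axis=0)
```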
We perform a similar ablation study for the audio sampler. Given our VGG audio network pretrained for classification on AudioSet, we train it on miniSports using one of two options: finetuning the entire VGG model vs training a single FC layer on several VGG activations. Finetuning the entire audio sampler yields the best classification accuracy (see detailed results in the supplementary material).
| Features | Architecture | accuracy (%) | runtime (min) |
|---|---|---|---|
| MD + RGB-R | ResNet-18 | 73.1 | 20.9 |
| MD + RGB-R | ShuffleNet-26 | 67.9 | 19.1 |
4.2.3 Combining audio and visual saliency
In this subsection we assess the impact of our different schemes for combining audio-based and video-based saliency scores (see section 3.3.3). For this we use the best configurations of our visual and audio samplers (see section 4.2). Table 5 shows the video-level action recognition accuracy achieved by the different combination strategies.
Perhaps surprisingly, the best results are achieved with AV-union-list, the rather naïve solution of taking some clips based on the visual sampler's ranked list and the remaining clips based on the audio sampler's ranked list. The more sophisticated approach of joint training (AV-joint-training) performs nearly on par with it. Overall, it is clear that the visual sampler is a better clip selector than the audio sampler. But considering the small cost of audio-based sampling, the accuracy gain provided by AV-union-list over the visual sampler alone (76.0% vs 73.1%) warrants the use of this combination.
| Sampler | accuracy (%) | runtime (min) |
|---|---|---|
| Visual SCSampler only | 73.1 | 20.9 |
| Audio SCSampler only | 67.8 | 22.0 |
| Dense | 61.6 | 2293.5 (38.5 hrs) |
We presented a simple scheme to boost both the accuracy and the speed of clip-based action classifiers. It leverages a lightweight clip-sampling model to select a small subset of clips for analysis. Experiments show that, despite its simplicity, our clip-sampler yields large accuracy gains and substantial speedups for 6 different strong action recognizers, and it retains strong performance even when used on novel classes. Future work will investigate strategies for optimal sample-set selection that take clip redundancy into account. It would also be interesting to extend our sampling scheme to models that employ more sophisticated aggregations than simple averaging, e.g., those that use a set of contiguous clips to capture long-range temporal structure. SCSampler scores for the test videos of Kinetics and Sports1M are available for download at http://scsampler.ai.
We would like to thank Zheng Shou and Chao-Yuan Wu for providing help with reading and processing of compressed video.
Appendix A Action classification networks
In the main paper, we provide an overview of the gains in accuracy and speedup enabled by SCSampler for several video-classification models. In this section, we provide the details of the action classifier architectures used in our experiments and discuss the training procedure used to train these models.
A.1 Architecture details
3D-ResNets (R3D) are residual networks where every convolution is 3D. Mixed-convolution models (MC) are 3D CNNs built from residual blocks, where the first convolutional groups use 3D convolutions and the subsequent ones use 2D convolutions. In our experiments we use an MC3 model. R(2+1)D models decompose each 3D convolution into a 2D (spatial) convolution followed by a 1D (temporal) convolution. For further details, please refer to the paper that introduced and compared these models [tran2017closer] or the repository [VMZ] where pretrained models can be found.
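The (2+1)D factorization is usually configured so that its parameter count matches the full 3D convolution it replaces, by an appropriate choice of the number of intermediate channels m. The sketch below follows the matching formula used in [tran2017closer]:

```python
def params_3d(c_in, c_out, t=3, k=3):
    # parameters of a full t x k x k 3D convolution
    return c_in * c_out * t * k * k

def params_2plus1d(c_in, c_out, t=3, k=3):
    # (2+1)D factorization: a 1 x k x k spatial convolution into m
    # intermediate channels followed by a t x 1 x 1 temporal one; m is
    # chosen so the total parameter budget matches the 3D convolution
    m = (t * k * k * c_in * c_out) // (k * k * c_in + t * c_out)
    return c_in * m * k * k + m * c_out * t
```

For a 64-to-64-channel block with t=3, k=3 the two counts coincide exactly (m = 144), while the factorized form inserts an extra nonlinearity between the two convolutions.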
A.2 Training procedure
For the Sports1M dataset, we use the training procedure described in [tran2017closer] for all models except ip-CSN-152. Frames are first re-scaled to a fixed resolution, and each clip is then generated by randomly cropping a window at the same spatial location from 16 adjacent frames. We use batch normalization after all convolutional layers, with a batch size of 8 clips per GPU. The models are trained for 100 epochs, with the first 15 epochs used for warm-up during distributed training. The learning rate is set to 0.005 and divided by 10 every 20 epochs. The ip-CSN-152 model is trained according to the procedure described in [Tran19].
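The resulting schedule can be sketched as follows. The linear shape of the warm-up is our assumption, as the text only states that the first 15 epochs are used for warm-up:

```python
def sports1m_lr(epoch, base_lr=0.005, warmup_epochs=15, step=20, decay=10.0):
    """Learning rate at a given (0-indexed) epoch: linear warm-up to
    base_lr over the first 15 epochs, then division by 10 every 20
    epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr / (decay ** ((epoch - warmup_epochs) // step))
```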
Kinetics. On Kinetics, the clip classifiers are trained with mini-batches formed by sampling five 16-frame clips per video with temporal jittering. Frames are first resized to a fixed resolution, and each clip is then generated by randomly cropping a window at the same spatial location from 16 adjacent frames. The models are trained for 45 epochs, with 10 warm-up epochs. The learning rate is set to 0.01 and divided by 10 every 10 epochs, as in [tran2017closer]. ip-CSN-152 [Tran19] and R(2+1)D [tran2017closer] are finetuned from Sports1M for 14 epochs with the procedure described in [Tran19].
Appendix B Implementation details for SCSampler
In this section, we give the implementation details of the architectures and describe the training/finetuning procedures of our sampler networks.
B.1 Visual-based sampler
Following Wu et al. [wu2018coviar], all of our visual samplers are pre-trained on the ILSVRC dataset [ILSVRC15]. The learning rate is set to 0.001 for both Sports1M and Kinetics. As in [wu2018coviar], the learning rate is reduced when accuracy plateaus, and pre-trained layers use smaller learning rates. The 26-layer ShuffleNet-0.5× model [zhang2018shufflenet] is likewise pretrained on ImageNet. We use three groups in its group convolutions, as this choice is shown to give the best accuracy in [zhang2018shufflenet]. The initial learning rate and the learning rate schedule are the same as those used for ResNet-18.
B.2 Audio-based sampler
We use a VGG model [vgg2014] pretrained on AudioSet [audioset] as our backbone network, with MEL spectrograms as input. When finetuning the network with the SAL-RANK loss, we use an initial learning rate of 0.01 for Sports1M and 0.03 for Kinetics for the first 5 epochs, and then reduce the learning rate every 5 epochs; the pretrained layers use a smaller learning rate. When finetuning with the SAL-CL loss, we set the learning rate to 0.001 for 10 epochs, then divide it by 10 and train for 6 additional epochs. When finetuning with the AC loss, we start with a learning rate of 0.001 and divide it by 10 every 5 epochs.
Appendix C Additional evaluations of design choices for SCSampler
Here we present additional analyses of the design choices and hyperparameter values of SCSampler.
C.1 Varying the audio sampler architecture.
Table 6 shows video classification accuracy using different variants of our audio sampler. Given our VGG audio network pretrained for classification on AudioSet, we train it on miniSports using one of two options: finetuning the entire VGG model vs training a single FC layer on the VGG activations from one layer (conv4_2, pool4, or fc1). All audio samplers are trained with the SAL-RANK loss. We can see that finetuning the entire audio sampler gives the best classification accuracy.
| Audio SCSampler | accuracy (%) | runtime (min) |
|---|---|---|
| FC trained on VGG-conv4_2 | 67.03 | 21.6 |
| FC trained on VGG-pool4 | 67.01 | 21.4 |
| FC trained on VGG-fc1 | 59.84 | 21.4 |
C.2 Varying the number of sampled clips (K)
Figure 4 shows how video-level classification accuracy changes as we vary the number of sampled clips (K). The sampler here is AV-union-list. K = 10 provides the best accuracy for our sampler. For the Oracle, K = 1 gives the top result, as this method can conveniently select the single clip that elicits the highest score for the correct label on each test video.
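The aggregation being varied here is simply "average the classifier's predictions over the k most salient clips", which can be sketched as follows (an illustrative implementation with our own names):

```python
import numpy as np

def video_level_prediction(saliency, clip_probs, k=10):
    """Average the action classifier's class probabilities over the k
    clips with the highest sampler-saliency scores."""
    saliency = np.asarray(saliency, dtype=float)
    clip_probs = np.asarray(clip_probs, dtype=float)
    k = min(k, len(saliency))
    top = np.argsort(saliency)[-k:]      # indices of the k most salient clips
    return clip_probs[top].mean(axis=0)  # average their class probabilities
```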
C.3 Selecting the hyperparameter m for AV-union-list
The AV-union-list method (described in section 3.3.3 of our paper) combines the audio-based and video-based samplers by selecting the top m clips according to the visual sampler (with hyperparameter m such that 0 ≤ m ≤ K) and adding K − m different clips from the ranked list of the audio sampler, to form a sample set of size K (K = 10 is used in this experiment). In Figure 5 we analyze the impact of m on action classification. The fact that the best accuracy is achieved at an intermediate value of m suggests that the signals from the two samplers are somewhat complementary, but that the visual sampler provides a more accurate measure of clip saliency.
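The selection rule can be sketched as follows (the function name and the illustrative default for the mixing parameter are ours, not the paper's tuned values):

```python
def av_union_list(visual_rank, audio_rank, k=10, m=7):
    """Take the top m clips from the visual sampler's ranked list, then
    fill the remaining k - m slots with the highest-ranked audio clips
    not already chosen, yielding a sample set of size k."""
    chosen = list(visual_rank[:m])
    for clip in audio_rank:
        if len(chosen) == k:
            break
        if clip not in chosen:
            chosen.append(clip)
    return chosen
```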
Appendix D Comparison to Random/Uniform under the same runtime.
Fig. 6 shows runtime (per video) vs video-level classification accuracy on miniSports, obtained by varying the number of sampled clips per video (K). For this test we use MC3-18, which is the fastest clip-classifier in our comparison. The overhead of running SCSampler on each video is roughly equivalent to 3 clip-evaluations of MC3-18. Even after granting Random/Uniform these 3 extra clip evaluations to equalize runtime, SCSampler significantly outperforms both baselines. Note that for costlier clip-classifiers the SCSampler overhead would amount to less than one clip evaluation (e.g., 0.972 clip evaluations for R(2+1)D-50), making Random/Uniform even less appealing under the same runtime.
Appendix E Applying SCSampler with a clip stride
While our sampler is quite efficient, further reductions in computational cost can be obtained by running SCSampler only on every j-th clip in the video. This implies that the final top-K clips used by the action classifier are selected from the subset of clips scored by applying SCSampler with a stride of j clips. As usual, we fix the value of K to 10 for SCSampler. Figure 7 shows the results obtained with the best configuration of our SCSampler (see section 4.2) and the ip-CSN-152 [Tran19] action classifier on the full Sports1M dataset. We see that SCSampler can be applied with a clip stride of several clips before the action recognition accuracy degrades to the level of the costly dense predictions. This yields a further reduction in computational complexity and runtime, as we only need to apply the sampler to a fraction 1/j of the clips.
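The strided variant can be sketched as follows (a minimal illustration; the names are ours):

```python
import numpy as np

def strided_top_k(saliency_fn, clips, k=10, stride=1):
    """Score only every `stride`-th clip with the sampler and return
    the indices of the k highest-scoring clips among that subset,
    cutting the sampler cost by a factor of `stride`."""
    idx = np.arange(0, len(clips), stride)
    scores = np.array([saliency_fn(clips[i]) for i in idx])
    order = np.argsort(scores)[::-1][:k]   # scored subset, best first
    return [int(idx[i]) for i in order]
```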