AdaFrame: Adaptive Frame Selection for Fast Video Recognition

Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, Larry S. Davis
University of Maryland, Salesforce Research, Georgia Institute of Technology
Most of the work was done while the author was an intern at Salesforce. Corresponding author.

We present AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame contains a Long Short-Term Memory network augmented with a global memory that provides context information for searching which frames to use over time. Trained with policy gradient methods, AdaFrame generates a prediction, determines which frame to observe next, and computes the utility, i.e., expected future rewards, of seeing more frames at each time step. At testing time, AdaFrame exploits predicted utilities to achieve adaptive lookahead inference such that the overall computational costs are reduced without incurring a decrease in accuracy. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet. AdaFrame matches the performance of using all frames with only 8.21 and 8.65 frames on FCVID and ActivityNet, respectively. We further qualitatively demonstrate that learned frame usage can indicate the difficulty of making classification decisions; easier samples need fewer frames while harder ones require more, both at instance-level within the same class and at class-level among different categories.

1 Introduction

The explosive increase of Internet videos, driven by the ubiquity of mobile devices and sharing activities on social networks, is phenomenal: around 300 hours of video are uploaded to YouTube every minute of every day! Such growth demands effective and scalable approaches that can recognize actions and events in videos automatically for tasks like indexing, summarization, recommendation, etc. Most existing work focuses on learning robust video representations to boost accuracy [18, 23, 14, 22], while limited effort has been devoted to improving efficiency [25, 31].

State-of-the-art video recognition frameworks rely on the aggregation of prediction scores from uniformly sampled frames, if not every single frame [11], during inference. (Here, we use “frame” as a general term; it can take the form of a single RGB image, stacked RGB images (snippets), or stacked optical flow images.) While uniform sampling has been shown to be effective [14, 22, 23], the analysis of even a single frame is still computationally expensive due to the use of high-capacity backbone networks such as ResNet [5], ResNeXt [27], InceptionNet [16], etc. On the other hand, uniform sampling assumes information is evenly distributed over time, which could therefore incorporate noisy background frames that are not relevant to the class of interest.

It is also worth noting that the difficulty of making recognition decisions relates to the category to be classified: one frame might be sufficient to recognize most static objects (e.g., “dogs” and “cats”) or scenes (e.g., “forests” or “sea”), while more frames are required to differentiate subtle actions like “drinking coffee” and “drinking beer”. This also holds for samples within the same category due to large intra-class variations. For example, a “playing basketball” event can be captured from multiple viewpoints (e.g., different locations of a gymnasium), occur at different locations (e.g., indoor or outdoor), and involve different players (e.g., professionals or amateurs). As a result, the number of frames required to recognize the same event differs across videos.

Figure 1: A conceptual overview of our approach. AdaFrame aims to select a small number of frames to make correct predictions conditioned on different input videos so as to reduce the overall computational cost.

With this in mind, to achieve efficient video recognition, we explore how to automatically adjust computation within a network on a per-video basis such that, conditioned on different input videos, a small number of informative frames are selected to produce correct predictions (see Figure 1). However, this is a particularly challenging problem, since videos are generally weakly labeled for classification tasks (one annotation for a whole sequence), and there is no supervision informing which frames are important. Therefore, it is unclear how to effectively explore temporal information over time to choose which frames to use, and how to encode temporal dynamics in these selected frames.

In this paper, we propose AdaFrame, a Long Short-Term Memory (LSTM) network augmented with a global memory, to learn how to adaptively select frames conditioned on inputs for fast video recognition. In particular, a global memory derived from representations computed with spatially and temporally downsampled video frames is introduced to guide the exploration over time for learning frame usage policies. The memory-augmented LSTM serves as an agent interacting with video sequences; at a time step, it examines the current frame, and with the assistance of global context information derived by querying the global memory, generates a prediction, decides which frame to look at next and calculates the utility of seeing more frames in the future. During training, AdaFrame is optimized using policy gradient methods with a fixed number of steps to maximize a reward function that encourages predictions to be more confident when observing one more frame. At testing time, AdaFrame is able to achieve adaptive inference conditioned on input videos by exploiting the predicted future utilities that indicate the advantages of going forward.

We conduct extensive experiments on two large-scale and challenging video benchmarks for generic video categorization (FCVID [8]) and activity recognition (ActivityNet [6]). AdaFrame offers similar or better accuracies measured in mean average precision over the widely adopted uniform sampling strategy, a simple yet strong baseline, on FCVID and ActivityNet respectively, while requiring about 59% and 63% fewer computations on average (80.2 vs. 195 GFLOPs and 71.5 vs. 195 GFLOPs; see Section 4.2), going as high as savings of . AdaFrame also outperforms by clear margins alternative methods [29, 2] that learn to select frames. We further show that, among other things, frame usage is correlated with the difficulty of making predictions: different categories produce different frame usage patterns, and instance-level frame usage within the same class also differs. These results corroborate that AdaFrame can effectively learn to generate frame usage policies that adaptively select a small number of relevant frames for classification for each input video.

2 Related Work

Video Analysis. Extensive studies have been conducted on video classification and action recognition. Most existing work focuses on extending 2D convolution to the video domain and modeling motion information in videos [14, 23, 22, 28, 18]. Only a few methods have been proposed to achieve efficient video classification [31, 14] by speeding up the extraction of motion information. However, these CNN-based approaches all follow the same practice for testing—averaging scores from 25 uniformly sampled frames as the prediction of a video clip. In contrast, we focus on selecting a small number of relevant frames on a per-video basis with an aim to achieve efficient video recognition. Note that our framework is also applicable to 3D CNNs; the inputs to our framework can be easily replaced with features from stacked frames. Recently, Zhu et al. introduce a deep feature flow framework that propagates feature maps of key frames to other frames with a flow field [32]. Pan et al. present Recurrent Residual Modules that explore the similarity of feature maps between neighboring frames to speed up inference [12]. These approaches process videos frame by frame and attempt to reduce computation cost by exploring frame similarities, while our goal is to selectively choose relevant frames directly based on inputs.

Our work is more related to [29] and [2] that select frames with reinforcement learning. Yeung et al. introduce an agent to predict whether to stop and where to look next through sampling from the whole video for action detection [29]. For detection, ground-truth temporal boundaries are available, providing strong feedback about whether viewed frames are relevant. In the context of classification, there is no such supervision, and thus directly sampling from the entire sequence is difficult. To overcome this issue, Fan et al. propose to sample from a predefined action set deciding how many steps to jump [2], which reduces the search space but sacrifices flexibility. In contrast, we introduce a global memory module that provides context information to guide the frame selection process. We also decouple the learning of frame selection and when to stop, exploiting predicted future returns as stop signals.

Figure 2: An overview of the proposed framework. A memory-augmented LSTM serves as an agent, interacting with a video sequence. At each time step, it takes features from the current frame, previous states, and a global context vector derived from a global memory to generate the current hidden states. The hidden states are used to produce a prediction, decide where to look next, and calculate the utility of seeing more frames in the future. See text for more details.

Adaptive Computation. Our work also relates to adaptive computation, which achieves efficiency by deciding whether to stop inference based on the confidence of classifiers. The idea dates back to cascaded classifiers [21] that quickly reject easy negative sub-windows for fast face detection. Several recent approaches propose to add decision branches to different layers of CNNs to learn whether to exit the model [17, 7, 9, 3]. Graves introduces a halting unit to RNNs to decide whether computation should continue [4]. Also related are [24, 20, 26], which learn to drop convolutional layers in residual networks conditioned on input images. In this paper, we focus on adaptive computation for videos, adaptively selecting frames rather than layers/units in neural networks for fast inference.

3 Approach

Our goal is, given a testing video, to derive an effective frame selection strategy that produces a correct prediction while using as few frames as possible. To this end, we introduce AdaFrame, a memory-augmented LSTM (Section 3.1), to explore the temporal space of videos effectively with the guidance of context information from a global memory. AdaFrame is optimized to choose which frames to use on a per-video basis, and to capture the temporal dynamics of these selected frames. Given the learned model, we perform adaptive lookahead inference (Section 3.2) to accommodate different computational needs through exploring the utility of seeing more frames in the future.

3.1 Memory-augmented LSTM

The memory-augmented LSTM can be seen as an agent that recurrently interacts with a video sequence of $T$ frames, whose representations are denoted as $\{v_1, v_2, \ldots, v_T\}$. More formally, the LSTM, at the $t$-th time step, takes features of the current frame $v_t$, previous hidden states $h_{t-1}$ and cell outputs $c_{t-1}$, as well as a global context vector $u_t$ derived from a global memory $M$ as its inputs, and produces the current hidden states $h_t$ and cell contents $c_t$:

$$h_t, c_t = \mathrm{LSTM}([v_t, u_t], h_{t-1}, c_{t-1}), \tag{1}$$

where $[v_t, u_t]$ denotes the concatenation of $v_t$ and $u_t$. The hidden states $h_t$ are further input into a prediction network for classification, and the probabilities are used to generate a reward measuring whether the transition from the last time step brings information gain. Furthermore, conditioned on the hidden states, a selection network decides where to look next, and a utility network calculates the advantage of seeing more frames in the future. Figure 2 gives an overview of the framework. In the following, we elaborate on the components of the memory-augmented LSTM.

Global memory. The LSTM is expected to make reliable predictions and explore the temporal space to select frames guided by rewards received. However, learning where to look next is difficult due to the huge search space and the limited capacity of hidden states [1, 30] to remember input history. Therefore, for each video, we introduce a global memory to provide context information, which consists of representations of spatially and temporally downsampled frames, $M = [v^s_1, v^s_2, \ldots, v^s_{T_d}]$. Here, $T_d$ denotes the number of frames ($T_d < T$), and the representations are computed with a lightweight network using spatially downsampled inputs (more details in Sec. 4.1). This is to ensure the computational overhead of the global memory is small. As these representations are computed frame by frame without explicit order information, we further utilize positional encoding [19] to encode positions in the downsampled representations. To obtain global context information, we query the global memory with the hidden states of the LSTM to get an attention weight for each element in the memory:

$$z_{t,j} = (W_h h_{t-1})^\top \mathrm{PE}(v^s_j), \qquad \beta_t = \mathrm{Softmax}(z_t),$$

where $W_h$ maps hidden states to the same dimension as the $j$-th downsampled feature $v^s_j$ in the memory, $\mathrm{PE}$ denotes the operation of adding positional encoding to features, and $\beta_t$ is the normalized attention vector over the memory. We can further derive the global context vector as the weighted average of the global memory: $u_t = \beta_t^\top M$. The intuition of computing a global context vector with soft-attention as input to the LSTM is to derive a rough estimate of the current progress based on features in the memory block, serving as global context to assist the learning of which frame in the future to examine.
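As a rough illustration of the soft-attention query described above, the following pure-Python sketch scores each memory entry against an already-projected hidden state, normalizes the scores with a softmax, and returns the weighted average of the memory. The function names, the sinusoidal form of the positional encoding (in the style of [19]), and the choice to average the un-encoded features are assumptions made for illustration, not the authors' implementation.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding (assumed form): sine on even indices, cosine on odd ones."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_context(query, memory):
    """Return (attention weights, context vector).

    query  -- hidden state already projected to the memory feature dimension
    memory -- list of downsampled frame features, one list of floats per frame
    """
    dim = len(query)
    # Add positional information to every memory entry before scoring it.
    encoded = [
        [f + p for f, p in zip(feat, positional_encoding(j, dim))]
        for j, feat in enumerate(memory)
    ]
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encoded]
    beta = softmax(scores)
    # Context vector: attention-weighted average of the memory entries.
    context = [sum(b * feat[k] for b, feat in zip(beta, memory)) for k in range(dim)]
    return beta, context
```

With an all-zero query every entry receives the same weight, so the context vector reduces to the plain mean of the memory; a query aligned with one entry shifts the weight toward it.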

Prediction network. The prediction network $f_p(h_t; W_p)$, parameterized by weights $W_p$, maps the hidden states $h_t$ to outputs $s_t \in \mathbb{R}^C$ with one fully-connected layer, where $C$ is the number of classes. In addition, $s_t$ is further normalized with Softmax to produce probability scores for each class. The network is trained with cross-entropy loss using predictions from the last time step $T_e$:

$$L_{cls}(W_p) = -\sum_{c=1}^{C} y^c \log(s^c_{T_e}), \tag{2}$$

where $y$ is a one-hot vector encoding the label of the corresponding sample. In addition, we constrain $T_e \ll T$, since we wish to use as few frames as possible.

Reward function. Given the classification scores $s_t$ of the $t$-th time step, a reward is given to evaluate whether the transition from the previous time step is useful: observing one more frame is expected to produce more accurate predictions. To this end, we introduce a reward function that forces the classifier to be more confident when seeing additional frames. Formally, the reward function for time step $t$ is defined as:

$$r_t = \max\{0,\; m_t - \max_{t' \in [0, t-1]} m_{t'}\}, \tag{3}$$

where $m_t = s_t^{gt} - \max\{s_t^{c'} \mid c' \neq gt\}$ is the margin between the probability of the ground-truth class (indexed by $gt$) and the largest probability from other classes, pushing the score of the ground-truth class to be higher than other classes by a margin. The reward function in Equation 3 encourages the current margin to be larger than historical ones to receive a positive reward, which demands that the confidence of the classifier increases when seeing more frames. Such a constraint acts as a proxy to measure if the transition from the last time step brings additional information for recognizing target classes, as there is no supervision directly providing feedback about whether a single frame is informative.
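The margin-based reward can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `margin` and `step_rewards` are hypothetical helper names, the inputs are per-step softmax probabilities, and the first time step is given a reward of 0 because it has no history to compare against (the text does not say how that step is handled).

```python
def margin(probs, gt):
    """Ground-truth probability minus the largest probability among the other classes."""
    return probs[gt] - max(p for c, p in enumerate(probs) if c != gt)


def step_rewards(prob_seq, gt):
    """Reward per time step: positive only if the margin beats every earlier margin."""
    rewards, best = [], None
    for probs in prob_seq:
        m = margin(probs, gt)
        # Assumption: the first step has no previous margin, so its reward is 0.
        rewards.append(0.0 if best is None else max(0.0, m - best))
        best = m if best is None else max(best, m)
    return rewards
```

For example, with ground-truth class 0 and probabilities `[0.5, 0.3, 0.2]`, `[0.6, 0.3, 0.1]`, `[0.55, 0.35, 0.1]`, `[0.9, 0.05, 0.05]`, the margins are 0.2, 0.3, 0.2 and 0.85, so only the second and fourth steps earn a positive reward.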

Selection network. The selection network $f_s$ defines a policy with a Gaussian distribution using fixed variance, to decide which frame to observe next, using hidden states that contain information of current inputs and historical context. In particular, the network, parameterized by $W_s$, transforms the hidden states to a 1-dimensional output $a_t = f_s(h_t; W_s)$, as the mean of the location policy. Following [10], during training, we sample from the policy $\ell_{t+1} \sim \pi(\cdot \mid h_t) = \mathcal{N}(a_t, \sigma^2)$ with fixed $\sigma$, and at testing time, we directly use the output $a_t$ as the location. We also clamp $\ell_{t+1}$ to be in the interval $[0, 1]$, so that it can be further transferred to a frame index by multiplying by the total number of frames. It is worth noting that at the current time step, the policy searches through the entire time horizon and there is no constraint; it can not only jump forward to seek future informative frames but also go back to re-examine past information. We train the selection network to maximize the expected future reward:

$$J_{sel}(W_s) = \mathbb{E}_{\ell_t \sim \pi(\cdot \mid h_t; W_s)}\left[\sum_{t=0}^{T_e} r_t\right]. \tag{4}$$
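The location policy described above can be illustrated as follows. This is a sketch only: `next_frame` is a hypothetical helper, the standard deviation of 0.1 is an illustrative value (the text only says the variance is fixed), and capping the index at the last frame is an assumption needed to keep a location of exactly 1 in range.

```python
import random


def next_frame(mean, num_frames, std=0.1, training=True, rng=random):
    """Map the policy mean to a normalized location and a frame index.

    During training the location is sampled from a Gaussian around the mean;
    at test time the mean itself is used.
    """
    loc = rng.gauss(mean, std) if training else mean
    loc = min(max(loc, 0.0), 1.0)  # clamp to [0, 1]
    # Assumption: cap at the last valid index so that loc == 1.0 stays in range.
    index = min(int(loc * num_frames), num_frames - 1)
    return loc, index
```

Because the location is normalized to the whole video, nothing forces the next index to be larger than the current one, which is what lets the policy jump backward as well as forward.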


Utility network. The utility network, parameterized by $W_u$, produces an output $\hat{V}_t = f_u(h_t; W_u)$ using one fully-connected layer. It serves as a critic to provide an approximation of expected future rewards from the current state, which is also known as the value function [15]:

$$V_t = \mathbb{E}\left[\sum_{i=0}^{T_e - t} \gamma^i r_{t+i}\right], \tag{5}$$

where $\gamma$ is the discount factor fixed to 0.9. The intuition is to estimate the value function derived from empirical rollouts with the network output to update policy parameters in the direction of performance improvement. More importantly, by estimating future returns, it provides the agent with the ability to look ahead, measuring the utility of subsequently observing more frames. The utility network is trained with the following regression loss, where $R_t$ denotes the expected future reward estimated from empirical rollouts:

$$L_{utl}(W_u) = \frac{1}{2}\left\| \hat{V}_t - R_t \right\|^2. \tag{6}$$
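To make the regression target concrete, the sketch below computes discounted returns from a rollout of per-step rewards and a squared-error loss against predicted utilities. The helper names are hypothetical, and the exact form of the regression loss (halved squared error, averaged over steps) is an assumption; the text only states that a regression loss is used and that the discount factor is 0.9.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return, for every step t, the sum of gamma**i * rewards[t + i] over the remaining steps."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backward so each return reuses the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def utility_loss(predicted, returns):
    """Assumed loss form: half squared error between predictions and returns, averaged over steps."""
    return sum(0.5 * (v - r) ** 2 for v, r in zip(predicted, returns)) / len(returns)
```

For instance, `discounted_returns([1, 1, 1], gamma=0.5)` gives `[1.75, 1.5, 1.0]`: the last step only sees its own reward, and earlier steps add progressively discounted future rewards.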


Optimization. Combining Eqn. 2, Eqn. 4 and Eqn. 6, the final objective function can be written as:

$$\operatorname*{minimize}_{\Theta} \quad L_{cls} + L_{utl} - \lambda J_{sel},$$

where $\lambda$ controls the trade-off between classification and temporal exploration and $\Theta$ denotes all trainable parameters. Note that the first two terms are differentiable, and we can directly use back-propagation with stochastic gradient descent to learn the optimal weights. Thus, we only discuss how to maximize the expected reward $J_{sel}$ in Eqn. 4. Following [15], we derive the expected gradient of $J_{sel}$ as:

$$\nabla_\Theta J_{sel} = \mathbb{E}\left[\sum_{t=0}^{T_e} (R_t - \hat{V}_t)\, \nabla_\Theta \log \pi(\cdot \mid h_t)\right], \tag{7}$$

where $R_t$ denotes the expected future reward, and $\hat{V}_t$, the output of the utility network, serves as a baseline function to reduce variance during training [15]. Eqn. 7 can be approximated with Monte-Carlo sampling using samples in a mini-batch, and further back-propagated downstream for training.
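For a Gaussian location policy with fixed standard deviation, the score function with respect to the policy mean has a closed form, so the baseline-corrected policy gradient for one rollout can be sketched directly. The function below is a hypothetical helper, not the authors' code; it returns the gradient with respect to each step's policy mean only (in a real model this would be back-propagated into the network weights), and the standard deviation of 0.1 is an illustrative value.

```python
def policy_gradient_wrt_means(samples, means, returns, baselines, std=0.1):
    """Per-step REINFORCE term: (return - baseline) * d/d(mean) log N(sample; mean, std^2).

    samples   -- locations actually sampled during the rollout (before clamping)
    means     -- policy means produced by the selection network
    returns   -- empirical future returns for each step
    baselines -- utility-network predictions used to reduce variance
    """
    grads = []
    for loc, mean, ret, base in zip(samples, means, returns, baselines):
        score = (loc - mean) / (std ** 2)  # derivative of the Gaussian log-density w.r.t. its mean
        grads.append((ret - base) * score)
    return grads
```

When the return exceeds the baseline, the gradient pushes the mean toward the sampled location; when the return falls short, it pushes the mean away.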

3.2 Adaptive Lookahead Inference

While we optimize the memory-augmented LSTM for a fixed number of steps during training, we aim to achieve adaptive inference at testing time such that a small number of informative frames are selected conditioned on input videos without incurring any degradation in classification performance. Recall that the utility network is trained to predict expected future rewards, indicating the utility/advantage of seeing more frames in the future. Therefore, we explore the outputs of the utility network to determine whether to stop inference through looking ahead. A straightforward way is to calculate the utility $\hat{V}_t$ (the output of the utility network) at each time step, and exit the model once it is less than a threshold. However, it is difficult to find an optimal value that works well for all samples. Instead, we maintain a running max of utility $\hat{V}_{max}$ over time for each sample, and at each time step, we compare the current utility $\hat{V}_t$ with the max value $\hat{V}_{max}$; if $\hat{V}_{max}$ is larger than $\hat{V}_t$ by a margin of $\mu$ more than $p$ times, predictions from the current time step will be used as the final score and inference will be stopped. Here, $\mu$ controls the trade-off between computational cost and accuracy; a small $\mu$ constrains the model to make early predictions once the predicted utility begins to decrease, while a large $\mu$ tolerates a drop in utility, allowing more considerations before classification. Further, we also introduce $p$ as a patience metric, which permits the current utility to deviate from the max value for a few iterations. This is similar in spirit to reducing learning rates on plateaus, which, instead of immediately decaying the learning rate, waits for a few more epochs when the loss does not further decrease.

Note that although the same threshold $\mu$ is used for all samples, the comparison made to decide whether to stop is based on the utility distribution of each sample independently, which is softer than comparing the predicted utility $\hat{V}_t$ with $\mu$ directly. One can add another network to predict whether to stop inference using the hidden states as in [29, 2]; however, coupling the training of frame selection with learning a binary policy to stop makes optimization challenging, particularly with reinforcement learning, as will be shown in experiments. In contrast, we leverage the utility network to achieve adaptive lookahead inference.
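The stopping rule of this section can be sketched as a small function over a sequence of predicted utilities. This is one literal reading of the description, with assumptions labeled: `stop_step` is a hypothetical name, the violation counter is never reset when a new maximum is reached, and if the rule never fires the last trained step is used.

```python
def stop_step(utilities, mu, patience):
    """Return the index of the time step whose prediction is used as the final score.

    utilities -- predicted utility at each time step, in order
    mu        -- margin by which the running max may exceed the current utility
    patience  -- number of tolerated violations before stopping
    """
    v_max = float("-inf")
    violations = 0
    for t, v in enumerate(utilities):
        v_max = max(v_max, v)
        if v_max - v > mu:
            violations += 1
            # Assumption: stop once the margin has been exceeded more than `patience` times.
            if violations > patience:
                return t
    # Assumption: fall back to the last step if the rule never fires.
    return len(utilities) - 1
```

With utilities `[1.0, 0.9, 0.5, 0.4, 0.3]` and a margin of 0.3, a patience of 0 stops at step 2 and a patience of 1 stops at step 3, while a margin of 1.0 never triggers and all five steps are used.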

4 Experiments

4.1 Experimental Setup

Datasets and evaluation metrics. We experiment with two challenging large-scale video datasets, Fudan-Columbia Video Dataset (FCVID) [8] and ActivityNet [6], to evaluate the proposed approach. FCVID consists of videos from YouTube with an average duration of seconds, manually annotated into 239 classes. These categories cover a wide range of topics, including scenes (e.g., “river”), objects (e.g., “dog”), activities (e.g., “fencing”), and complicated events (e.g., “making pizza”). The dataset is split evenly for training ( videos) and testing ( videos). ActivityNet is an activity-focused large-scale video dataset, containing YouTube videos with an average duration of seconds. Here we adopt the latest release (version 1.3), which consists of around videos belonging to classes. We use the official split with a training set of videos, a validation set of videos and a testing set of videos. Since the testing labels are not publicly available, we report performance on the validation set. We compute average precision (AP) for each class and use mean average precision (mAP) to measure the overall performance on both datasets. It is also worth noting that videos in both datasets are untrimmed, for which efficient recognition is extremely critical given the redundant nature of video frames.
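For readers unfamiliar with the metric, the sketch below computes average precision for one class from ranked scores and then averages over classes. It uses the common non-interpolated definition (mean of the precision values at the ranks of the positive samples); whether the benchmarks use exactly this variant is not stated here, so treat it as an assumption. The function names are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of every positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """per_class: list of (scores, labels) pairs, one pair per class."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)
```

For a class with scores `[0.9, 0.8, 0.7, 0.6]` and labels `[1, 0, 1, 0]`, the positives sit at ranks 1 and 3, giving precisions 1 and 2/3 and an AP of 5/6.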

Implementation details. We use a one-layer LSTM with and hidden units for FCVID and ActivityNet respectively. To extract inputs for the LSTM, we decode videos at 1fps and compute features from the penultimate layer of a ResNet-101 model [5]. To improve performance, the ResNet model is pretrained on ImageNet with a top-1 accuracy of and further finetuned on target datasets. To generate the global memory that provides context information, we compute features using spatially and temporally downsampled video frames with a lightweight CNN to reduce overhead. In particular, we lower the resolution of video frames to , and sample 16 frames uniformly. We use a pretrained MobileNetV2 [13] as the lightweight CNN, which achieves a top-1 accuracy of on ImageNet with downsampled inputs. We adopt PyTorch for implementation and leverage SGD for optimization with a momentum of , a weight decay of and a $\lambda$ of 1. We train the network for 100 epochs with a batch size of 128 and 64 for FCVID and ActivityNet, respectively. The initial learning rate is set to and decayed by a factor of 10 every 40 epochs. For the patience $p$ during inference, it is set to 2 when , and when , where $T_e$ is the number of time steps the model is trained for.

4.2 Main Results

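Baselines (mAP):

| Dataset | Method | R8 | U8 | R10 | U10 | R25 | U25 | All |
|---|---|---|---|---|---|---|---|---|
| FCVID | AvgPooling | 78.3 | 78.4 | 79.0 | 78.9 | 79.7 | 80.0 | 80.2 |
| FCVID | LSTM | 77.8 | 77.9 | 78.7 | 78.1 | 78.0 | 79.8 | 80.0 |
| ActivityNet | AvgPooling | 67.5 | 67.8 | 68.9 | 68.6 | 69.8 | 70.0 | 70.2 |
| ActivityNet | LSTM | 68.7 | 68.8 | 69.8 | 70.4 | 69.9 | 70.8 | 71.0 |

AdaFrame (mAP):

| Dataset | Frame usage | mAP |
|---|---|---|
| FCVID | 5 → 4.92 | 78.6 |
| FCVID | 8 → 6.15 | 79.2 |
| FCVID | 10 → 8.21 | 80.2 |
| ActivityNet | 5 → 3.8 | 69.5 |
| ActivityNet | 8 → 5.82 | 70.4 |
| ActivityNet | 10 → 8.65 | 71.5 |

Table 1: Performance of different frame selection strategies on FCVID and ActivityNet. R and U denote random and uniform sampling, respectively, and the number after R or U is the number of sampled frames. We use $a \to b$ to denote the frame usage for AdaFrame, which uses $a$ frames during training and $b$ frames on average when performing adaptive inference. See text for more details.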

Effectiveness of learned frame usage. We first optimize AdaFrame with $T_e$ steps during training, and then at testing time we perform adaptive lookahead inference with threshold $\mu$, allowing each video to use fewer frames on average while maintaining the same accuracy as viewing all frames. We compare AdaFrame with the following alternative methods to produce final predictions during testing: (1) AvgPooling, which simply computes a prediction for each sampled frame and then performs mean pooling over frames as the video-level classification score; (2) LSTM, which generates predictions using hidden states from the last time step of an LSTM.

We also experiment with different numbers of frames (8, 10, and 25) used as inputs for AvgPooling and LSTM, which are sampled either uniformly (U) or randomly (R). Here, we use 5, 8, and 10 training steps for AdaFrame versus 8, 10, and 25 frames for the other methods as a way to offset the additional computation cost incurred, which will be discussed later. The results are summarized in Table 1. We observe that AdaFrame achieves better results than AvgPooling and LSTM while using fewer frames under all settings on both datasets. In particular, AdaFrame achieves a mAP of 78.6 and 69.5 using an average of 4.92 and 3.8 frames on FCVID and ActivityNet, respectively. These results, requiring 3.08 and 4.2 fewer frames, are better than AvgPooling and LSTM with 8 frames and comparable with their results with 10 frames. It is also promising to see that AdaFrame can match the performance of using all frames with only 8.21 and 8.65 frames on FCVID and ActivityNet. This verifies that AdaFrame can indeed learn to derive frame selection policies while maintaining the same accuracies.

In addition, the performance of random sampling and uniform sampling for AvgPooling and LSTM is similar, and LSTM is worse than AvgPooling on FCVID, possibly because the diverse set of categories incurs significant intra-class variations. Note that although AvgPooling is simple and straightforward, it is a very strong baseline and has been widely adopted during testing for almost all CNN-based approaches due to its strong performance.
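The AvgPooling baseline with uniform sampling is simple enough to sketch in full. The helper names are hypothetical, and picking the center of each equal-length segment is one common way to sample uniformly; the exact sampling offsets used in the experiments are not specified, so this is an assumption.

```python
def uniform_indices(num_frames, k):
    """Pick up to k frame indices spread evenly over the video (segment centers)."""
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def avg_pool(frame_scores):
    """Average per-frame class scores into one video-level score vector."""
    n = len(frame_scores)
    return [sum(column) / n for column in zip(*frame_scores)]
```

For a 100-frame video and 4 samples this selects frames 12, 37, 62 and 87; averaging the per-frame scores `[0.2, 0.8]` and `[0.6, 0.4]` gives a video-level score of about `[0.4, 0.6]`.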

Figure 3: Mean average precision vs. computational cost. Comparisons of AdaFrame with FrameGlimpse [29], FastForward [2], and alternative frame selection methods based on heuristics.

Computational savings with adaptive inference. We now discuss the computational savings of AdaFrame with adaptive inference and compare with state-of-the-art methods. We use average GFLOPs, a hardware-independent metric, to measure the computation needed to classify all the videos in the testing set. We train AdaFrame with a fixed number of time steps $T_e$ to obtain different models, denoted as AdaFrame-$T_e$ (e.g., AdaFrame-3, AdaFrame-5, and AdaFrame-10), to accommodate different computational requirements during testing; for each model, we vary the threshold $\mu$ such that adaptive inference can be achieved within the same model.

In addition to selecting frames based on heuristics, we also compare AdaFrame with FrameGlimpse [29] and FastForward [2]. FrameGlimpse is developed for action detection with a location network to select frames and a stop network to decide whether to stop; ground-truth boundaries of actions are used as feedback to estimate the quality of selected frames. For classification, there is no such ground-truth and thus we preserve the architecture of FrameGlimpse but use our reward function. FastForward [2] samples from a predefined action set, determining how many steps to go forward. It also consists of a stop branch to decide whether to stop. In addition, we also attach the global memory to these frameworks for fair comparisons, denoted as FrameGlimpse-G and FastForward-G, respectively. Figure 3 presents the results. For AvgPooling and LSTM, accuracies gradually increase when more computation (frames) is used and then become saturated. Note that the computational cost for video classification grows linearly with the number of frames used, as the most expensive operation is extracting features with CNNs. For ResNet-101 it needs 7.82 GFLOPs to compute features and for AdaFrame, it takes an extra 1.32 GFLOPs due to the computation in global memory. Therefore, we expect more savings from AdaFrame when more frames are used.

Compared with AvgPooling and LSTM using 25 frames, AdaFrame-10 achieves better results while requiring about 59% and 63% less computation on average on FCVID (80.2 vs. 195 GFLOPs; more precisely, 195.5 GFLOPs for AvgPooling and 195.8 GFLOPs for LSTM) and ActivityNet (71.5 vs. 195 GFLOPs), respectively. Similar trends can also be found for AdaFrame-5 and AdaFrame-3 on both datasets. While the computational saving of AdaFrame over AvgPooling and LSTM reduces when fewer frames are used, accuracies of AdaFrame are still clearly better, i.e., vs. on FCVID, and vs. on ActivityNet. Further, AdaFrame also outperforms FrameGlimpse [29] and FastForward [2], which aim to learn frame usage, by clear margins, demonstrating that coupling the training of frame selection and learning to stop with reinforcement learning on large-scale datasets without sufficient background videos is difficult. In addition, the use of a global memory, which provides context information, improves the accuracies of the original models in both frameworks.
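The arithmetic behind these comparisons is easy to check. The sketch below uses only figures quoted in the text (7.82 GFLOPs per frame for ResNet-101, and the reported 80.2, 71.5 and 195 GFLOPs totals); the helper names are hypothetical, and the AdaFrame totals are taken as reported rather than recomputed.

```python
RESNET101_GFLOPS_PER_FRAME = 7.82  # per-frame feature extraction cost quoted in the text


def baseline_cost(num_frames):
    """Cost of a fixed-sampling baseline: it grows linearly with the number of frames."""
    return num_frames * RESNET101_GFLOPS_PER_FRAME


def relative_saving(cost, reference):
    """Fraction of computation saved relative to a reference cost."""
    return 1.0 - cost / reference
```

`baseline_cost(25)` gives 195.5 GFLOPs, matching the AvgPooling figure, while `relative_saving(80.2, 195)` and `relative_saving(71.5, 195)` come out at roughly 0.589 and 0.633.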

Figure 4: Dataflow through AdaFrame over time. Each circle represents, by size, the percentage of samples that are classified at the corresponding time step.

We can also see that changing the threshold within the same model also adjusts the computation needed; the performance and average frame usage decline simultaneously as the threshold becomes smaller, forcing the model to make predictions as early as possible. However, the resulting policies with different thresholds still outperform alternative counterparts in both accuracy and computation required.

Comparing across different models of AdaFrame, it is interesting to see that the best model of AdaFrame trained with a smaller $T_e$ achieves better or comparable results over AdaFrame optimized with a large $T_e$ using a smaller threshold. For example, AdaFrame-3 with achieves a mAP of using 25.1 GFLOPs on FCVID, which is better than AdaFrame-5 with that produces a mAP of with 31.6 GFLOPs on average. This possibly results from the discrepancies between training and testing: during training, a large $T_e$ allows the model to “ponder” before emitting predictions. While computation can be adjusted with varying thresholds at test time, AdaFrame-10 is not fully optimized for classification with extremely limited information as is AdaFrame-3. This highlights the need to use different models based on computational requirements.

Figure 5: Learned inference policies for different classes over time. Each square, by density, indicates the fraction of samples that are classified at the corresponding time step from a certain class in FCVID.
Figure 6: Validation videos from FCVID using different number of frames for inference. Frame usage differs not only among different categories but also within the same class (e.g., “making cookies” and “hiking”).

Analyses of learned policies. To gain a better understanding of what is learned in AdaFrame, we take the trained AdaFrame-10 model and vary the threshold to accommodate different computational needs. In Figure 4, we visualize, at each time step, how many samples are classified and the prediction accuracies of these samples. We can see that high prediction accuracies tend to appear in early time steps, pushing difficult decisions that require more scrutiny downstream. More samples emit predictions at later time steps when the computational budget increases (larger $\mu$).

We further investigate whether computations differ based on the categories to be classified. To this end, we show the fraction of samples from a subset of classes in FCVID that are classified at each time step in Figure 5. We observe that, for simple classes like objects (e.g., “gorilla” and “elephants”) and scenes (“Eiffel tower” and “cathedral exterior”), AdaFrame makes predictions for most of the samples in the first three steps; while for some complicated DIY categories (e.g., “making ice cream” and “making egg tarts”), it tends to classify in the middle of the entire time horizon. In addition, AdaFrame takes additional time steps to differentiate very confusing classes like “dining at restaurant” and “dining at home”. Figure 6 further illustrates samples using different numbers of frames for inference. We can see that frame usage varies not only across different classes but also within the same category (see the top two rows of Figure 6) due to large intra-class variations. For example, for the “making cookies” category, it takes AdaFrame four steps to make correct predictions when the video contains severe camera motions and cluttered backgrounds.

In addition, we also examine where the model jumps at each step; for AdaFrame-10 with , we found that it goes backward at least once for of videos on FCVID to re-examine past information instead of always going forward, confirming the flexibility AdaFrame enjoys when searching over time.

4.3 Discussions

In this section, we conduct a set of experiments to justify our design choices of AdaFrame.

Global Memory            Inference
# Frames    Overhead     mAP     # Frames
0           0            77.9    8.40
12          0.98         79.2    8.53
32          2.61         80.2    8.24
16          1.32         80.2    8.21
Table 2: Results of using different global memories on FCVID. Different numbers of frames are used to generate different global memories. The overhead is measured per frame relative to a standard ResNet-101.

Global memory. We perform an ablation study to see how many frames are needed in the global memory. Table 2 presents the results. Using a global memory module improves over the non-memory model by a clear margin. In addition, we observe that using 16 frames offers the best trade-off between computational overhead and accuracy.
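The comparison behind this choice can be checked with a few lines of arithmetic over the rows of Table 2. The selection rule used here (highest mAP, ties broken by the lowest overhead) is only one plausible way to formalize "best trade-off" and is an assumption of this sketch; `best_tradeoff` is a hypothetical helper.

```python
# Rows copied from Table 2:
# (global-memory frames, overhead, mAP, average frames used at inference).
table2 = [
    (0, 0.0, 77.9, 8.40),
    (12, 0.98, 79.2, 8.53),
    (32, 2.61, 80.2, 8.24),
    (16, 1.32, 80.2, 8.21),
]


def best_tradeoff(rows):
    """Pick the highest mAP, breaking ties by the lowest overhead (assumed rule)."""
    return max(rows, key=lambda r: (r[2], -r[1]))


best = best_tradeoff(table2)

# mAP gain of each memory size over the non-memory model (first row).
baseline_map = table2[0][2]
gains = {r[0]: round(r[2] - baseline_map, 1) for r in table2[1:]}
```

Under this rule the 16-frame memory is selected: it ties the 32-frame memory on mAP at roughly half the overhead.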

Reward function. Our reward function forces the model to increase its confidence when seeing more frames, measuring the transition from the previous time step. We further compare with two alternative reward functions: (1) Prediction Reward, which uses the prediction confidence of the ground-truth class as reward; and (2) Prediction Transition Reward, which uses the change in the ground-truth class's prediction confidence between consecutive time steps as reward. The results are summarized in Table 3. Our reward function and Prediction Transition Reward, both of which model prediction differences over time, outperform Prediction Reward, which is based only on predictions from the current step. This verifies that forcing the model to increase its confidence when viewing more frames provides feedback about the quality of the selected frames. Our reward also outperforms Prediction Transition Reward because it further introduces a margin between the predictions for the ground-truth class and those for other classes.

Reward function                 mAP     # Frames
Prediction Reward               78.7    8.34
Prediction Transition Reward    78.9    8.31
Ours                            80.2    8.21
Table 3: Comparison of different reward functions on FCVID. We report the average number of frames used and the resulting mAP.
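To make the difference between the three rewards concrete, the sketch below evaluates them on a toy sequence of per-step class probabilities. The function names, the handling of the first time step, and the exact form of the margin-based reward (increase of the ground-truth margin over the best margin seen so far, clipped at zero) are assumptions for illustration and may differ from the formulation used in training; the probabilities are invented.

```python
def prediction_reward(probs_over_time, gt):
    """Reward at each step = confidence of the ground-truth class."""
    return [p[gt] for p in probs_over_time]


def prediction_transition_reward(probs_over_time, gt):
    """Reward = change in ground-truth confidence from the previous step.

    The first step has no predecessor; a reward of 0 is assumed there.
    """
    conf = [p[gt] for p in probs_over_time]
    return [0.0] + [b - a for a, b in zip(conf, conf[1:])]


def margin_transition_reward(probs_over_time, gt):
    """Assumed margin-based reward.

    margin = ground-truth confidence minus the best competing class;
    reward = increase of the margin over the best margin seen so far,
    clipped at 0. The first step is assumed to receive its own margin.
    """
    margins = [
        p[gt] - max(q for k, q in enumerate(p) if k != gt)
        for p in probs_over_time
    ]
    rewards = [margins[0]]
    best = margins[0]
    for m in margins[1:]:
        rewards.append(max(0.0, m - best))
        best = max(best, m)
    return rewards


# Toy probabilities over three classes at three steps (made up).
probs = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.7, 0.2, 0.1]]
r_pred = prediction_reward(probs, gt=0)
r_trans = prediction_transition_reward(probs, gt=0)
r_margin = margin_transition_reward(probs, gt=0)
```

In this toy sequence the second frame makes the prediction worse: Prediction Reward still pays out 0.4 for it, whereas the two transition-style rewards give it no positive reward.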

Stop criterion. In our framework, we use the predicted utility, which measures the future rewards of seeing more frames, to decide whether to continue inference. An alternative is to rely on the entropy of predictions as a proxy for classifier confidence. We also experimented with using entropy to stop inference; however, we found that it cannot enable adaptive inference based on different thresholds. We observed that prediction entropies over time are not as smooth as predicted utilities, i.e., entropies are high in early steps and extremely low in the last few steps. In contrast, utilities are computed to measure future rewards, explicitly considering future information from the very first step, which leads to smooth transitions over time.
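The following sketch illustrates why an abrupt entropy curve makes thresholding ineffective while a smooth utility curve does not. The stopping rule (stop at the first step whose score falls below the threshold), the function names, and all numbers are assumptions chosen for illustration; they are not the inference procedure or measurements from our experiments.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def first_stop_step(scores, threshold):
    """First 0-based step whose score drops below threshold (last step if none)."""
    for t, s in enumerate(scores):
        if s < threshold:
            return t
    return len(scores) - 1


# Toy predictions: near-uniform early, then suddenly very confident.
toy_probs = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.4, 0.3, 0.3],
    [0.98, 0.01, 0.01],
    [0.99, 0.005, 0.005],
]
entropies = [prediction_entropy(p) for p in toy_probs]

# Toy utilities: decaying smoothly over time.
utilities = [0.8, 0.5, 0.3, 0.1]

entropy_stops = [first_stop_step(entropies, th) for th in (0.9, 0.2)]
utility_stops = [first_stop_step(utilities, th) for th in (0.6, 0.2)]
```

With these toy values, both entropy thresholds stop at the same step because the entropy collapses in a single jump, whereas the two utility thresholds stop at different steps.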

5 Conclusion

In this paper, we presented AdaFrame, an approach that derives an effective frame usage policy so as to use a small number of frames on a per-video basis with an aim to reduce the overall computational cost. It contains an LSTM network augmented with a global memory to inject global context information. AdaFrame is trained with policy gradient methods to predict which frame to use and calculate future utilities. During testing, we leverage the predicted utility for adaptive inference. Extensive results provide strong qualitative and quantitative evidence that AdaFrame can derive strong frame usage policies based on inputs.


ZW and LSD are supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00345. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.


  • [1] J. Collins, J. Sohl-Dickstein, and D. Sussillo. Capacity and trainability in recurrent neural networks. In ICLR, 2017.
  • [2] H. Fan, Z. Xu, L. Zhu, C. Yan, J. Ge, and Y. Yang. Watching a small portion could be as good as watching all: Towards efficient video classification. In IJCAI, 2018.
  • [3] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, 2017.
  • [4] A. Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [6] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
  • [7] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.
  • [8] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE TPAMI, 2018.
  • [9] M. McGill and P. Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In ICML, 2017.
  • [10] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
  • [11] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [12] B. Pan, W. Lin, X. Fang, C. Huang, B. Zhou, and C. Lu. Recurrent residual module for fast inference in videos. In CVPR, 2018.
  • [13] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [14] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [15] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
  • [16] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
  • [17] S. Teerapittayanon, B. McDanel, and H. Kung. Branchynet: Fast inference via early exiting from deep neural networks. In ICPR, 2016.
  • [18] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: Generic features for video analysis. In ICCV, 2015.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
  • [20] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.
  • [21] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.
  • [22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [23] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • [24] X. Wang, F. Yu, Z.-Y. Dou, and J. E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In ECCV, 2018.
  • [25] C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Compressed video action recognition. In CVPR, 2018.
  • [26] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris. Blockdrop: Dynamic inference paths in residual networks. In CVPR, 2018.
  • [27] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [28] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
  • [29] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016.
  • [30] D. Yogatama, Y. Miao, G. Melis, W. Ling, A. Kuncoro, C. Dyer, and P. Blunsom. Memory architectures in recurrent neural network language models. In ICLR, 2018.
  • [31] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. In CVPR, 2016.
  • [32] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In CVPR, 2017.