Learning Motion Priors for Efficient Video Object Detection
Convolutional neural networks have achieved great progress on the image object detection task. However, it is not trivial to transfer existing image object detection methods to the video domain, since most of them are designed specifically for images. Directly applying an image detector cannot achieve optimal results because it ignores temporal information, which is vital in the video domain. Recently, image-level flow warping has been proposed to propagate features across frames, aiming at a better trade-off between accuracy and efficiency. However, the gap between image-level optical flow and high-level features can hinder the accuracy of spatial propagation. To achieve a better trade-off between accuracy and efficiency, in this paper we propose to learn motion priors for efficient video object detection. We first initialize a set of motion priors for each location and then use them to propagate features across frames. At the same time, the motion priors are updated adaptively to find better spatial correspondences. Without bells and whistles, the proposed framework achieves state-of-the-art performance on the ImageNet VID dataset at real-time speed.
Recently, deep convolutional neural networks have achieved significant progress on various tasks, such as classification [5, 25, 12] and object detection [8, 21, 19, 17, 10]. Compared with still images, videos contain extra temporal cues, which can boost not only the efficiency but also the accuracy of detection. However, video frames are often degraded by partial occlusion, rare poses, and motion blur, which are difficult for single-frame object detectors to handle. Thus, a promising direction is to leverage temporal cues effectively.
A few works have paid attention to using such cues in videos. Existing methods [16, 15, 9] first apply image detectors to single frames to obtain bounding boxes and then post-process them across the temporal dimension. More specifically, they rely on off-the-shelf motion estimation such as optical flow, and on hand-crafted bounding box association rules such as object tracking. As a result, they cannot be trained in an end-to-end manner, and they also suffer from poor performance and low speed.
Flow-based feature warping across video frames has attracted a lot of attention due to its superior performance on video object detection benchmarks. With optical flow extracted in advance, these methods compute correspondences in feature space between different frames, which is termed flow warping and is usually implemented via bilinear interpolation. DFF first proposes to propagate feature maps from sparse keyframes to other frames via a flow field network, which achieves a significant speedup because flow computation is relatively fast. Aiming to improve accuracy, FGFA improves per-frame features by aggregating nearby features according to flow motion. Impression Networks proposes an efficient propagation mechanism that achieves a better trade-off between accuracy and speed: an impression feature is built by iteratively absorbing sparsely extracted frame features, and is then propagated down the video to enhance the features of low-quality frames. Towards High Performance is the state-of-the-art video detector in terms of the trade-off between accuracy and speed. It proposes recursive feature aggregation according to feature quality and reduces the computational cost by operating only on sparse keyframes. Meanwhile, it also adopts spatially-adaptive partial feature updating to recompute features on non-keyframes. However, all of the methods above share a drawback: they require costly optical flow extraction. What’s more, precise flow extraction requires a large amount of training data. At the same time, the gap between image-level optical flow and high-level features can hinder the accuracy of spatial propagation.
Some other works attempt to propagate features between frames without flow extraction. Spatial-Temporal Memory Networks model long-range temporal appearance and motion dynamics with a MatchTrans module, which aligns the spatial-temporal memory from one frame to another. MANet proposes to align features using both pixel-level and instance-level motion cues, which achieves leading accuracy. However, the locations used to propagate features are manually designed, which can be sub-optimal for propagation accuracy and thus limits performance.
Driven by the shortcomings above, we propose to estimate feature correspondences among different frames from a different perspective. For simplicity, we term the correspondences between video frames Motion Priors. Motion Priors can be used to propagate features accurately, which boosts the performance of video object detection. We first initialize the motion priors and then use them to propagate features. At the same time, the locations of the motion priors are updated adaptively according to the final detection loss. An elaborate ablation study shows the advantage of our design over manually designed alternatives. Further, without bells and whistles, the proposed framework achieves state-of-the-art performance on the ImageNet VID dataset at real-time speed.
We summarize the major contributions as follows:
We propose an end-to-end framework for efficient video detection without heavy optical-flow network extraction.
Our Motion Priors Learning Module can precisely align features between consecutive frames, which is vital for temporal feature propagation.
Experiments on the VID dataset demonstrate that our proposed method achieves a state-of-the-art trade-off, with remarkable speed and fewer model parameters than other methods.
Image Object Detection
Recently, state-of-the-art methods for static image object detection are mainly based on deep convolutional neural networks [21, 19, 17]. Generally, image object detectors can be divided into two paradigms, single-stage and two-stage. Two-stage detectors usually first generate region proposals, which are then refined by classification and regression in the R-CNN stage. ROI pooling was proposed to speed up R-CNN. Faster R-CNN proposes the anchor mechanism to replace the Selective Search proposal generation process, achieving a large improvement in accuracy as well as faster speed. FPN introduced an inherent multi-scale, pyramidal hierarchy of deep convolutional networks to build feature pyramids at marginal extra cost, with significant improvements. Single-stage detectors are often more efficient but less accurate. SSD directly generates results from anchor boxes on a pyramid of feature maps. RetinaNet handles the extreme foreground-background imbalance with a novel loss named focal loss. Usually, image object detection provides the base single-frame detector for video object detection.
Video Object Detection
Compared with static image object detection, fewer researchers focused on the video object detection task until the VID challenge was introduced in the ImageNet Challenge. DFF was proposed as an efficient framework that runs expensive CNN feature extraction only on sparse, regularly selected keyframes, achieving more than a 10x speedup over per-frame detection with an image detector. To improve feature quality, FGFA proposed to aggregate nearby features for each frame. It achieves better accuracy at the cost of speed, running at only 3 fps due to dense prediction and the heavy flow extraction process. Towards High Performance proposed sparse recursive feature aggregation and spatially-adaptive feature updating strategies, which help it run at real-time speed with strong accuracy. On the one hand, the slow flow extraction process is still the bottleneck for higher speed. On the other hand, the image-level flow used to propagate high-level features may hinder propagation accuracy, resulting in inferior detection accuracy.
Motion Estimation by Optical Flow
The correspondence in raw images or features can be used to build motion relationships between neighboring frames. Optical flow has been widely used in many video-based applications, such as video action recognition. DFF is the first work to propagate deep keyframe features to non-keyframes using a warp operation based on optical flow extraction, resulting in a 10x speedup, which also indicates that video frames are redundant. However, optical flow extraction is time-consuming, so a lightweight optical flow network is needed for higher speed, which may come at the price of precision. What’s more, feature-level warping with image-level optical flow is less robust.
Usage of Temporal Information in Video
Optimal detection results cannot be achieved by directly applying an image detector to videos because of the low-quality frames in videos, as Figure 1 shows. However, temporal information provides an effective way to tackle such issues. Impression Network proposed a strategy to propagate and aggregate impression features between keyframes. Towards High Performance proposed a strategy to partially update non-keyframe features for better aggregation accuracy. We also adopt such strategies for better performance, which will be discussed in detail later.
Our method is built on the standard image detector R-FCN, which consists of the feature extraction network or backbone network , the region proposal network and the region-based detector . For simplicity, both and are referred to as . Video frames are first divided into keyframes and non-keyframes following . For keyframe detection, the high-level feature is extracted through the whole backbone network. For non-keyframe detection, by contrast, only a part of the backbone network is needed to extract the low-level feature, which keeps the speed high. The high-level feature of the non-keyframe is usually estimated from the keyframe high-level feature and the corresponding pair-wise frame motion cues, such as image-level optical flow. So the key to efficient video object detection is how to obtain and utilize motion cues for feature propagation.
In this section, we first introduce our Adaptive Motion Priors Learning Module (AMPL), which mainly consists of Motion Priors Initialization, Similarity Calculator, Feature Estimation and Motion Priors Update. AMPL is the key component of our method. Next, we present the overall framework of our method, together with the training and inference processes.
Adaptive Motion Priors Learning Module
Given the grid location on frame , the corresponding location sets along the motion path on frame are defined as correspondences. For the feature maps and of two adjacent frames, we want to learn feature-level pixel-wise correspondences. Flow warping-based propagation usually uses a flow network to extract image-level optical flow and estimate the feature motion at each location. Non-local can also be used to propagate features using dense correspondences. Different from the above methods, we use the neighboring sets called to aggregate more accurate corresponding information for each location. Besides, our motion prior sets can be adjusted adaptively.
To be more specific, we first initialize the motion priors, which coarsely provide the correspondences between two consecutive video frames. Next, we use these sets to calculate the appearance similarity between each motion prior and the current grid location. Finally, we aggregate the motion priors using the normalized similarity probabilities, which yields the estimated feature value. During training, gradients also flow back to the initialized motion priors, so they are adjusted under the supervision of the final loss, which helps find more accurate motion prior sets.
Motion Priors Initialization
As Figure 2 shows, initialization provides coarse correspondences. Good initialization can accelerate the learning process by reducing the number of steps needed to find optimal locations, so the initialization of is very important. Motivated by the statistical motion ranges , we propose two initialization methods, Uniform Initialization and Gaussian Initialization, which are described in detail in the experimental section.
The motion priors on the feature map are denoted as , where is the total number of motion priors. The specific grid location on feature map is denoted as . Let and denote the features at location on feature map and at location on feature map , respectively. The Similarity Calculator gives the appearance similarity probability between and .
Generally, must first be obtained by bilinear interpolation, because may lie at an arbitrary (non-integer) location:
where enumerates all integral spatial locations on the feature map , and is the bilinear interpolation kernel function.
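As a minimal sketch of this interpolation step, the snippet below samples a feature map at a fractional location using the separable kernel max(0, 1 - |a - b|) that is standard for bilinear interpolation. The function name `bilinear_sample`, the (C, H, W) layout, and the zero treatment of out-of-range neighbours are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat of shape (C, H, W) at the fractional location (y, x)."""
    C, H, W = feat.shape
    out = np.zeros(C, dtype=np.float64)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    # Only the four integer neighbours have a non-zero kernel weight.
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < H and 0 <= qx < W:  # assumption: outside the map counts as zero
                w = max(0.0, 1.0 - abs(y - qy)) * max(0.0, 1.0 - abs(x - qx))
                out += w * feat[:, qy, qx]
    return out

feat = np.arange(8, dtype=np.float64).reshape(1, 2, 4)
center = bilinear_sample(feat, 0.5, 1.5)  # average of the four neighbours 1, 2, 5, 6
```

At integer coordinates the kernel reduces to a plain lookup, so the same routine covers both grid-aligned and fractional motion priors.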
After obtaining the value of , we use a similarity function to measure the distance between the vector and the vector . Suppose that both and are dimensional vectors; then the similarity value is:
A very common choice for is the dot product. After obtaining the similarity values at each location , the normalized similarity probability can be calculated by:
Let be the estimated feature map. After obtaining the corresponding probability for location on feature map , the estimated value at location can be calculated as:
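The three steps (similarity, normalization, aggregation) can be sketched for a single grid location as follows. This is an illustration under stated assumptions: the features at the motion priors are taken as already interpolated, the similarity is the dot product mentioned in the text, and the normalization is assumed to be a softmax over the priors. The name `propagate_location` is hypothetical.

```python
import numpy as np

def propagate_location(prior_feats, cur_feat):
    """Estimate the feature at one grid location from its motion priors.

    prior_feats: (N, C) features sampled at the N motion priors on the source frame.
    cur_feat:    (C,)   feature at the current grid location on the target frame.
    """
    sims = prior_feats @ cur_feat          # dot-product similarity, shape (N,)
    sims = sims - sims.max()               # shift for numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()  # assumption: softmax normalization
    estimated = probs @ prior_feats        # probability-weighted sum, shape (C,)
    return estimated, probs

prior_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
cur_feat = np.array([1.0, 0.0])
estimated, probs = propagate_location(prior_feats, cur_feat)
```

In this toy case the first prior matches the current feature better, so it receives the larger weight in the estimate.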
Motion Priors Update
During training, we back-propagate gradients not only to the feature map but also to the motion prior locations, which helps find the optimal . For simplicity, we use the dot product as the similarity function. Then we have:
So the gradients for location can be calculated by:
can be easily calculated because the function is the bilinear interpolation kernel, so Motion Priors Update can be easily implemented in modern deep learning frameworks. We implemented this by initializing motion priors as and then back-propagating through them in MXNet.
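To illustrate why the location gradient is cheap to obtain, the sketch below computes a bilinearly interpolated value together with its derivative with respect to the sampling location, using d/db max(0, 1 - |a - b|) = sign(a - b) inside the kernel support. It is a single-channel toy version with a hypothetical function name, not the authors' MXNet implementation, and it returns a zero subgradient at exactly integer coordinates.

```python
import numpy as np

def bilinear_value_and_grad(fmap, y, x):
    """fmap: (H, W). Returns the sampled value and its gradient (d/dy, d/dx)."""
    H, W = fmap.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val, dy, dx = 0.0, 0.0, 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < H and 0 <= qx < W:
                gy = max(0.0, 1.0 - abs(y - qy))
                gx = max(0.0, 1.0 - abs(x - qx))
                # Derivative of each 1-D kernel with respect to the sampling coordinate.
                dgy = float(np.sign(qy - y)) if abs(y - qy) < 1.0 else 0.0
                dgx = float(np.sign(qx - x)) if abs(x - qx) < 1.0 else 0.0
                val += gy * gx * fmap[qy, qx]
                dy += dgy * gx * fmap[qy, qx]
                dx += gy * dgx * fmap[qy, qx]
    return val, (dy, dx)

fmap = np.array([[0.0, 1.0], [2.0, 4.0]])
val, (dy, dx) = bilinear_value_and_grad(fmap, 0.25, 0.5)
```

Comparing the analytic gradient with a finite difference is a simple way to check such an implementation before plugging it into a framework's autograd.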
In this section, we first introduce the overall inference framework. Then, we introduce the training process. Finally, some other important units are presented in detail.
The inference of our method is illustrated in Figure 3. There are two kinds of inference processes: keyframe to keyframe and keyframe to non-keyframe. In the keyframe to keyframe process, we aim to enhance the task feature of the keyframe. The memory feature of the last keyframe is first aligned with the current keyframe and then aggregated with the high-level feature of the current keyframe, which generates the task feature and updates the memory feature for the current keyframe. The memory feature is propagated to the next keyframe recursively. In the keyframe to non-keyframe process, we aim to propagate the task feature of the keyframe to the current non-keyframe, avoiding expensive feature extraction on the non-keyframe. We therefore pass the current non-keyframe only through the low-level part of the backbone network. A transformer module is then applied to extract a higher-level feature for the current non-keyframe. Finally, the keyframe feature is aggregated with this high-level feature to generate the final task feature for the current non-keyframe.
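The control flow of the two inference paths can be sketched as a simple schedule. The snippet assumes that keyframes are taken at a fixed interval starting from the first frame; the function name `plan_inference` and the tuple layout are hypothetical, and the network computation itself is only indicated in comments.

```python
def plan_inference(num_frames, interval=10):
    """Return (frame index, path, reference keyframe) for every frame."""
    plan = []
    last_key = None
    for t in range(num_frames):
        if t % interval == 0:
            # Keyframe path: full backbone, then align and aggregate with the
            # memory feature carried over from the previous keyframe (if any).
            plan.append((t, "keyframe", last_key))
            last_key = t
        else:
            # Non-keyframe path: low-level backbone part plus transformer module,
            # then aggregate with the aligned task feature of the last keyframe.
            plan.append((t, "non-keyframe", last_key))
    return plan

plan = plan_inference(12, interval=10)
```

With an interval of 10, only frames 0 and 10 of a 12-frame clip would run the full backbone under this assumption.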
For the training process, each batch comprises three images , , from the same video sequence. and are random offsets whose maximum ranges are the segment length . To be more specific, lies in , and lies in . This setting is consistent with the inference phase, as represents any non-keyframe, represents the new keyframe, and represents the old keyframe. For simplicity, the three images are briefly denoted as respectively. The ground-truth annotation at is used for final supervision.
|Methods|Base Network|mAP (%)|Speed (fps)|Params (M)|
|Winner ILSVRC’15|VID Winner|73.8|-|-|
|Winner ILSVRC’16|VID Winner|76.2|-|-|
|D (&T loss)|ResNet-101|75.8|7.8|-|
|MANet + seq-nms|ResNet-101|80.3|-|-|
|STMN + seq-nms|ResNet-101|80.5|1.2|-|
|Towards High Performance*|ResNet-101 + DCN|78.6|13.0|-|
|MPNet* (ours)|ResNet-101 + DCN|79.7|21.2|65.5|
|MPNet* (ours) + seq-nms|ResNet-101 + DCN|81.7|4.6|65.5|
In each iteration, first, is applied on to obtain the high-level features . Then the high-level features are fed into the AMPL module to generate the aligned feature for the new keyframe. is generated by the aggregation unit using . For the non-keyframe, the low-level feature is extracted through a part of the backbone . After that, the transformer module is used to obtain the higher-level semantic features . The fused feature together with is also fed into AMPL to generate the aligned feature for the current non-keyframe. Finally, the aggregation unit is applied to to generate the task feature, which is responsible for detection on the current non-keyframe. We use the detection loss on the current non-keyframe to train the network.
To reduce redundancy in video object detection, we extract only low-level features for the non-keyframe, which carry limited semantic information. A lightweight transformer unit is therefore used to generate a more semantic feature. Its architecture is shown in Figure 4.
The weights for aggregating the input features are generated by a quality estimation network, which has three randomly initialized layers: a convolution, a convolution and a convolution. The output of the quality estimation network is a position-wise raw score map. The raw score maps of the different features are first normalized and then used as weights to sum the features into the final aggregated feature.
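The aggregation step can be sketched as a position-wise weighted sum. The snippet assumes that the raw score maps are normalized with a softmax across the input features at each position, which the text does not state explicitly; the quality estimation network itself is left out and its outputs are passed in as plain arrays. The function name is hypothetical.

```python
import numpy as np

def quality_aware_aggregate(features, raw_scores):
    """features: list of K arrays (C, H, W); raw_scores: list of K arrays (H, W)."""
    feats = np.stack(features)                    # (K, C, H, W)
    scores = np.stack(raw_scores)                 # (K, H, W)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)  # assumption: softmax over the K inputs
    return (weights[:, None] * feats).sum(axis=0)  # (C, H, W)

memory = np.ones((2, 3, 3))
current = 3.0 * np.ones((2, 3, 3))
equal = quality_aware_aggregate([memory, current], [np.zeros((3, 3)), np.zeros((3, 3))])
```

Equal scores reduce to the naive 0.5/0.5 averaging, while a much larger score for one input makes the output follow that input at the corresponding positions.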
Datasets and Evaluation Metrics
We evaluate our method on the ImageNet VID dataset, which is the benchmark for video object detection and consists of 3862 training videos and 555 validation videos covering 30 object categories. All videos are fully annotated with bounding boxes and tracking IDs. We train on the training set only and report results on the validation set. Mean average precision (mAP) is used as the evaluation metric, following previous methods.
The VID dataset has extreme redundancy among video frames, which prevents efficient training. At the same time, the video frames of the VID dataset have poorer quality than the images in the ImageNet DET dataset. So, we follow the previous training method and use both the ImageNet VID and DET datasets. For the ImageNet DET set, only the 30 categories shared with ImageNet VID are used.
[Ablation table rows: Sparse Deep Feature Extraction; Keyframe Memory Update; Quality-Aware Memory Update]
During training, each mini-batch contains three images. If an image is sampled from DET, it is copied three times. In both the training and testing stages, the shorter side of the images is resized to 600 pixels. conv4_3 is the split point between the lower and higher parts of the network. Following the setting of most previous methods, the R-FCN detector pretrained on ImageNet with ResNet-101 serves as the single-frame detector. The OHEM strategy and horizontal flip data augmentation are also adopted during training. In our experiments, the batch size is one per GPU. We train our network on an 8-GPU machine for 4 epochs with SGD, starting with a learning rate of 2.5e-4 and reducing it by a factor of 10 every 2.33 epochs. The keyframe interval is 10 frames by default, as in [32, 31].
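The learning rate schedule described above can be written as a small step-decay function. This is only a sketch of the stated numbers (base rate 2.5e-4, factor 10, step 2.33 epochs); the function name and the choice to express progress in fractional epochs are assumptions.

```python
def learning_rate(epoch, base_lr=2.5e-4, step=2.33, factor=0.1):
    """Step decay: multiply by `factor` once for every completed `step` epochs."""
    return base_lr * factor ** int(epoch // step)

schedule = [learning_rate(e) for e in (0.0, 2.0, 2.5, 3.9)]
```

Over the 4 training epochs this yields a single drop, from 2.5e-4 to 2.5e-5, after epoch 2.33.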
Comparison with State-of-the-Art Methods
We compare our method with existing state-of-the-art image and video object detectors. The results are shown in Table 1. Most video object detectors can hardly balance accuracy and speed. First, our method outperforms most previous methods when both accuracy and speed are considered. Second, our method has fewer parameters than flow-warp based methods. Lastly, the results indicate that our AMPL module can learn feature correspondences between consecutive video frames more precisely than optical flow-warp based methods. To conclude, our detector surpasses the static image-based R-FCN detector by a large margin (+3.2%) while maintaining high speed (23.0 fps). Furthermore, the parameter count (64.5M) is lower than that of other video object detectors that use an optical flow network (e.g., around 100M), which indicates that our method is more friendly to mobile devices.
In this section, we conduct ablation studies to validate the effectiveness of each component for our method.
Sparse Deep Feature Extraction: The entire backbone network is used to extract features only on keyframes.
Keyframe Memory Update: The keyframe feature is aggregated with the last keyframe memory to generate the task feature and the updated memory feature (see Figure 3). The weights are naively fixed to 0.5.
Quality-Aware Memory Update: The keyframe feature is aggregated with the last keyframe memory to generate the task feature and the updated memory feature using the quality-aware aggregation unit.
Non-Keyframe Aggregation: The task feature for the non-keyframe is naively aggregated with the aligned feature from the keyframe, and the current low-level feature is obtained by a part of the backbone network. The weights are set to 0.5.
Non-Keyframe Transformer: We apply a transformer unit to the low-level feature to obtain a higher-level semantic feature on the non-keyframe (see Figure 4).
Quality-Aware Non-Keyframe: The task feature for the non-keyframe is aggregated with the aligned feature from the keyframe using the quality-aware aggregation unit, and the current high-level feature is obtained through the transformer unit applied to the low-level feature.
As shown in Table 3, directly applying R-FCN (with ResNet-101) frame by frame on videos achieves 74.1% mAP. For the optical flow-warp based method, DFF, our baseline, achieves 73.0% mAP with a runtime of 34.0 ms. Sparse deep feature extraction computes deep features only at keyframes and propagates information from keyframes to non-keyframes by the optical flow-warp operation. With keyframe memory propagation, features between keyframes are latently aligned and naively combined to capture long-range video information. With naive memory propagation, performance is boosted to 75.2% mAP. Following Impression Network, we train a quality network that adaptively combines the current keyframe feature with the aligned memory feature. Sharing the same aim as GRU, the quality network strengthens or suppresses memory features accordingly. By adding quality-aware memory aggregation, we further boost the performance to 75.4% mAP. Quality-aware non-keyframe aggregation updates the non-keyframe feature according to feature quality, which also improves performance. Last, a transformer unit helps obtain more semantic features. Finally, we achieve 76.1% mAP at 53 ms with 100.5 Mb parameters.
For the motion-prior based method, our per-frame baseline achieves 74.1% mAP at 99.0 ms. With sparse deep feature extraction, we obtain 73.5% mAP at 42.0 ms. With quality-aware keyframe memory propagation, we obtain 75.9% mAP at 42.5 ms with 64.0 Mb parameters. Non-keyframe quality-aware aggregation further improves performance to 76.4% mAP at 43.0 ms. Finally, we achieve 76.8% mAP at 43.5 ms with 64.5 Mb parameters after using quality-aware memory aggregation, the non-keyframe transformer unit, and quality-aware non-keyframe aggregation.
Compared with the optical flow-warp based method, our method achieves more accurate results with fewer parameters (about 64% of the parameters) and faster speed.
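The parameter ratio and the speeds implied by the reported runtimes can be checked with a few lines of arithmetic. The snippet only uses the numbers quoted in the two paragraphs above; the helper name `fps_from_ms` is hypothetical.

```python
def fps_from_ms(ms_per_frame):
    """Convert a per-frame runtime in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

flow_params_mb, prior_params_mb = 100.5, 64.5   # final flow-warp vs. motion-prior models
param_ratio = prior_params_mb / flow_params_mb  # roughly 0.64

flow_fps = fps_from_ms(53.0)    # final flow-warp based model
prior_fps = fps_from_ms(43.5)   # final motion-prior based model
```

The ratio of about 0.64 corresponds to the "about 64%" figure, and 43.5 ms per frame corresponds to roughly 23 fps.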
Different Priors Analysis
In this part, we present how different initializations of the motion priors affect our results. The initialization settings are as follows:
Gaussian Initialization: The motion priors are initialized from a two-dimensional Gaussian distribution with zero mean and unit variance. The number of motion priors is the same as in Uniform Initialization for a fair comparison.
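A minimal sketch of the Gaussian setting: each motion prior is drawn as a two-dimensional offset from a zero-mean, unit-variance Gaussian. Interpreting the samples as (dy, dx) offsets relative to the current grid location, in feature-map cells, is an assumption, as are the function name, the seed argument, and the example count of 9 priors.

```python
import numpy as np

def gaussian_init(num_priors, seed=0):
    """Draw num_priors initial offsets (dy, dx) from N(0, I)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=(num_priors, 2))

priors = gaussian_init(9)  # hypothetical number of priors per location
```

In the learned variant these offsets would serve only as starting values that are then updated by back-propagation, whereas the fixed variant keeps them unchanged.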
Learning or Not? The results of the different initialization settings are shown in Table 4. Regardless of the initialization method, there is a consistent trend that performance is significantly boosted by learning. To be more specific, Gaussian initialization with learning achieves 76.8% mAP, compared with 75.5% for fixed (non-learned) motion priors, an improvement of 1.3% mAP.
In this section, we conduct an ablation experiment to study the influence of the testing keyframe interval. The results are shown in Table 5. Compared with the flow-warp based method, our method is more robust to the interval. To a certain degree, our method can capture long-range motion information, which further demonstrates its advantage over flow-warp based methods.
In this work, we have proposed a spatially continuous motion prior learning mechanism to align features of neighboring video frames. Elaborate ablation studies have shown the advantages of this design. Without bells and whistles, the proposed framework achieves state-of-the-art performance on the ImageNet VID dataset with real-time speed and fewer parameters.
[1] (2018) Optimizing video object detection via a scale-time lattice. In CVPR, pp. 7814–7823.
[2] (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
[3] (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387.
[4] (2017) Deformable convolutional networks. In CVPR, pp. 764–773.
[5] (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–25.
[6] (2015) Flownet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766.
[7] (2017) Detect to track and track to detect. In ICCV, pp. 3038–3046.
[8] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587.
[9] (2016) Seq-nms for video object detection. arXiv preprint arXiv:1602.08465.
[10] (2017) Mask r-cnn. In ICCV, pp. 2961–2969.
[11] (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916.
[12] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[13] (2017) Impression network for video object detection. arXiv preprint arXiv:1712.05896.
[14] (2019) Video object detection with locally-weighted deformable neighbors. In AAAI.
[15] (2017) Object detection in videos with tubelet proposal networks. In CVPR, pp. 727–735.
[16] (2016) Object detection from video tubelets with convolutional neural networks. In CVPR, pp. 817–825.
[17] (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125.
[18] (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988.
[19] (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37.
[20] (2018) Semantic video segmentation by gated recurrent flow propagation. In CVPR, pp. 6819–6828.
[21] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
[22] (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
[23] (2016) Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769.
[24] (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576.
[25] (2015) Going deeper with convolutions. In CVPR, pp. 1–9.
[26] (2013) Selective search for object recognition. International journal of computer vision 104 (2), pp. 154–171.
[27] (2018) Fully motion-aware network for video object detection. In ECCV, pp. 542–557.
[28] (2018) Non-local neural networks. In CVPR, pp. 7794–7803.
[29] (2018) Video object detection with an aligned spatial-temporal memory. In ECCV, pp. 485–501.
[30] (2018) Towards high performance video object detection. In CVPR, pp. 7210–7218.
[31] (2017) Flow-guided feature aggregation for video object detection. In ICCV, pp. 408–417.
[32] (2017) Deep feature flow for video recognition. In CVPR, pp. 2349–2358.