Object-aware Feature Aggregation for Video Object Detection


Abstract

We present an Object-aware Feature Aggregation (OFA) module for video object detection (VID). Our approach is motivated by the intriguing property that video-level object-aware knowledge can be employed as a powerful semantic prior to assist object recognition. Consequently, augmenting features with such prior knowledge can effectively improve classification and localization performance. To give features access to more content from the whole video, we first capture the object-aware knowledge of proposals and incorporate it with the well-established pair-wise contexts. With extensive experiments on the ImageNet VID dataset, our approach demonstrates the effectiveness of object-aware knowledge, achieving 83.93% and 86.09% mAP with ResNet-101 and ResNeXt-101, respectively. When further equipped with Sequence DIoU NMS, we obtain the best-reported mAP of 85.07% and 86.88% at the time of submission. The code to reproduce our results will be released after acceptance.

1 Introduction

Convolutional neural networks (CNNs) Krizhevsky et al. (2012); Simonyan and Zisserman (2014b); Szegedy et al. (2015); He et al. (2016); Huang et al. (2017); Cao et al. (2019); Li et al. (2018a, b); Pan et al. (2016); Qiu et al. (2017); Simonyan and Zisserman (2014a) have achieved remarkable success in image object detection Girshick et al. (2014); Ren et al. (2015); Law and Deng (2018); Zhou et al. (2019); Liu et al. (2016); Fu et al. (2017). Video object detection (VID) extends this idea to localizing and recognizing objects in videos. Beyond the single-image setting, however, video object detection is a more challenging task that suffers from difficulties such as motion blur, defocus and occlusion.

Videos contain various spatial-temporal cues which can be exploited to overcome the aforementioned problems. Several studies have been carried out to improve localization and recognition with spatial-temporal context. Box-level association methods Feichtenhofer et al. (2017); Han et al. (2016); Kang et al. (2016, 2017a, 2017b) link bounding boxes of each frame according to different constraints. Spatio-temporal feature aggregations Wang et al. (2018); Xiao and Jae Lee (2018); Zhu et al. (2018, 2017a); Wu et al. (2019); Deng et al. (2019); Chen et al. (2020) augment features by capturing contexts across adjacent frames. Recent works Wu et al. (2019); Deng et al. (2019); Chen et al. (2020) have shown that attention-based operators can generate powerful features and thus more accurate and stable predictions, as the attention operators are capable of reducing the variance within features by extracting highly semantically similar pair-wise contexts from the whole sequence.

Figure 1: Example outputs of the previous state-of-the-art attention-based approach Wu et al. (2019). The first row is the result without temporal aggregation, produced by the single-frame version, and the second row shows the result after aggregating pair-wise context from support frames.

Existing works modeling the spatial-temporal contexts mainly focus on short time windows or pair-wise feature relations. In comparison, less effort has been made to incorporate global video content. Fig. 1 visualizes several frames containing the bus category. Obviously, after passing context information from other proposals, the recognition of the bus improves. However, as shown in Fig. 1, the improvement is limited when the appearance of the object varies substantially. Moreover, the prediction of the salient object can be interfered with by other proposals (the bus is wrongly categorized as car after the propagation in Fig. 1).

To mitigate the situation above, we revisit how humans recognize objects in videos. There is an intriguing observation that when people are uncertain about the identity of objects with deteriorated appearances in the reference frames, it is natural to seek the existence of objects elsewhere in the video and assign the most similar identities. As this prior provides video-level constraints on object appearance and category, we refer to it as object-aware knowledge. It narrows down the scope of categories to be assigned thanks to its capability of capturing the global content of the video. Meanwhile, the object-aware knowledge about appearances serves as a reference to facilitate bounding box predictions.

In this paper, we propose an Object-aware Feature Aggregation (OFA) module to distill this insight into video object detection. As the object-aware knowledge is usually extracted from isolated regions containing objects, it is hard to obtain such information before detection. Therefore, our OFA module selectively aggregates features corresponding to proposals. By incorporating object-aware knowledge with the well-established pair-wise features, we can effectively improve the performance of video object detection. Concretely, the input features are split into two parts and fed into separate paths, i.e., a semantic path and a localization path. a) Semantic path. We first collect the object-aware knowledge from proposals across the whole video. Meanwhile, the pair-wise semantic context is obtained by calculating the similarity between proposals. By aggregating the above pair-wise semantic context and object-aware knowledge of proposals, the features encode more knowledge about other regions and the whole video. b) Localization path. Localization features are also augmented with the pair-wise localization context. In contrast to the semantic path, we locally enhance the features to ensure they remain sensitive to relative positions.

To further improve the performance, we propose an efficient post-processing strategy named Sequence DIoU NMS. Instead of linking the bounding boxes according to the intersection over union (IoU), we perform Distance-IoU (DIoU) Zheng et al. (2020) to associate and suppress boxes in videos.

Though simple, the OFA module achieves surprisingly good performance, and it easily complements existing attention-based and proposal-based video object detection methods. Extensive empirical evaluations demonstrate that the proposed method is competitive in performance. Furthermore, we conduct ablation experiments to analyze the effects of various design choices in the OFA module.

2 Related Work

Figure 2: Overview of the Object-aware Feature Aggregation module for video object detection. Given the input reference frame and support frames, the objectness scores and the features corresponding to object proposals are produced by the Region Proposal Network (RPN) and the succeeding RoI operator. The proposal features are split into semantic features and localization features, which are augmented in two parallel paths of the OFA module. The semantic and localization features of the reference frame are augmented by their relation features over all reference and support proposals. To aggregate the proposals corresponding to the main body or salient parts of the objects, we propose the effective and efficient Object-aware Knowledge Extraction to highlight class-dependent channels of the semantic features. With the help of object-aware knowledge, the aggregated semantic features tend to be consistent with the global context of the whole video.

Image Object Detection.

Thanks to the advances in deep neural networks He et al. (2016); Krizhevsky et al. (2012); Simonyan and Zisserman (2014b); Szegedy et al. (2015) and large-scale annotated datasets Russakovsky et al. (2015); Lin et al. (2014), several state-of-the-art image object detection approaches Girshick et al. (2014); Girshick (2015); Ren et al. (2015); He et al. (2015); Dai et al. (2016); Law and Deng (2018); Lin et al. (2017a, b); Liu et al. (2016); Redmon et al. (2016); Redmon and Farhadi (2017, 2018); Fu et al. (2017) have been proposed. The proposal-based approaches Girshick et al. (2014); Girshick (2015); Ren et al. (2015); Dai et al. (2016); Lin et al. (2017a) include the stages of feature extraction, proposal generation and bounding box prediction. The Region Proposal Network (RPN) Ren et al. (2015) was proposed to generate proposals with the assistance of predefined anchors. Beyond spatial pyramid pooling He et al. (2015), RoIPooling Girshick (2015), RoIAlign He et al. (2017), position-sensitive RoIPooling Dai et al. (2016) and other similar operations were introduced to extract features from proposals and improve efficiency. Inspired by the attention mechanism Vaswani et al. (2017), Relation Network Hu et al. (2018) proposed an object relation module to model object relations in a single image. Detailed experiments Wu et al. (2020) showed that different head structures have opposite preferences for classification and localization.

Although anchors and proposals have boosted the accuracy of detectors, they also hurt the efficiency of the whole network. Hence, several recent works are dedicated to achieving real-time detection speed under the premise of high precision. One-stage proposal-free object detectors Lin et al. (2017b); Liu et al. (2016); Redmon et al. (2016); Redmon and Farhadi (2017, 2018); Fu et al. (2017) and anchor-free methods Law and Deng (2018); Zhou et al. (2019) recognize and localize objects directly on the feature map extracted by the CNN backbone. These approaches deeply analyzed the mechanism of CNN-based detectors and pointed out new directions for improving detectors. However, to study the effect of video-level object-aware knowledge of proposals on video object detection, we follow the well-studied two-stage detection pipeline.

Video Object Detection.

To extend image object detectors to the video domain, two main directions have been proposed to leverage spatio-temporal information. The first direction is to associate related boxes in time and space. To incorporate motion information, D&T Feichtenhofer et al. (2017) integrated tracking and detection in a unified framework so that they promote each other. Tubelet methods Kang et al. (2016, 2017a, 2017b) were built on anchor cuboids in consecutive frames to ensure temporal continuity. As a dense motion representation between frames, optical flow Wang et al. (2018); Zhu et al. (2017b, a) can be used to align and warp features extracted from adjacent frames to boost video object detection, although this heavily relies on accurate motion estimation. Seq-NMS Han et al. (2016) made temporal links according to the Jaccard overlap between bounding boxes of adjacent frames and rescored predictions along the optimal paths.

The second direction is to enhance features by aggregating related contexts from frames. STSN Bertasius et al. (2018) directly predicted sampling locations in different frames without relying on optical flow. STMN Xiao and Jae Lee (2018) proposed a Spatial-Temporal Memory Module to model long-term appearance and movement changes. Beyond previous methods utilizing features from a short temporal window, SELSA Wu et al. (2019) considered the semantic impact between related proposals across all frames. RDN Deng et al. (2019) distilled relation by repeatedly refining supportive object proposals with high confidence, which were used to upgrade features. MEGA Chen et al. (2020) considered both local and global aggregation to enhance the feature representation, alleviating the ineffective and insufficient aggregation problems. The above methods verified the merit of pair-wise object relations in enhancing feature discrimination.

Consistent with human cognitive behavior, incorporating a prior about the objects present in the entire video can directly reduce the difficulty of object recognition and localization. Different from the above methods, we directly extract the object-aware knowledge of proposals in the video and provide the network with knowledge about which objects appear in the video frames.

3 Method

In this section, we elaborate on how we devise the Object-aware Feature Aggregation (OFA) module to enable the whole architecture to fully utilize object-aware knowledge for detection in videos. By incorporating object-aware knowledge with pair-wise contexts, the OFA module improves feature discrimination. Guided by the video-level object-aware knowledge, the aggregated features tend to be consistent with the global context of the whole video.

Roughly, given the reference frame and support frames, proposals are produced in a Faster R-CNN-style manner. Then, the proposal features are enhanced with two stacked OFA modules. Two independent paths, i.e., a semantic path and a localization path, are designed to aggregate the object-aware features with the semantic and localization contexts for classification and localization. Sequence DIoU NMS is introduced to further improve the performance in the post-processing stage. Fig. 2 gives an overview of the proposed approach.

3.1 Pair-wise Feature Aggregation

The basic idea of pair-wise feature aggregation Wu et al. (2019); Deng et al. (2019); Chen et al. (2020) is to compute the relation features of one object as the weighted sum of features from other proposals, where the weights reflect object dependency in terms of semantic and/or localization information. Previous works have demonstrated that such spatial-temporal feature aggregation helps to enhance the performance of detectors. Formally, suppose that we are aggregating the features $f_i$; the learnable association between proposals serves as the weights to aggregate features from the others as follows:

$\hat{f}_i = f_i + \sum_{j} \omega_{ij} f_j, \qquad \omega_{ij} = \operatorname{softmax}_j\!\left(\frac{\phi(f_i)^{\top}\psi(f_j)}{\sqrt{d}}\right)$    (1)

where both $\phi(\cdot)$ and $\psi(\cdot)$ are learnable linear transformation functions and $d$ is the dimension of $f_i$, with $i$ the index of proposals.
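For illustration, the following is a minimal PyTorch sketch of this pair-wise aggregation. The module and parameter names (`PairwiseAggregation`, `phi`, `psi`) are ours, and the scaled dot-product form of the learnable association is an assumption consistent with Eqn. (1), not a copy of the released implementation.

```python
# Minimal sketch of the pair-wise aggregation in Eqn. (1).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseAggregation(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.phi = nn.Linear(dim, dim)   # query transformation phi(.)
        self.psi = nn.Linear(dim, dim)   # key transformation psi(.)
        self.scale = dim ** -0.5

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, dim) features of all proposals pooled from the whole sequence.
        q, k = self.phi(f), self.psi(f)
        # omega_ij: learnable association between proposal i and proposal j.
        omega = F.softmax(q @ k.t() * self.scale, dim=-1)   # (N, N)
        # Each feature is augmented by the weighted sum over all proposals.
        return f + omega @ f


# Usage (illustrative sizes): e.g. 300 proposals per frame, 3 sampled frames.
feats = torch.randn(900, 512)
out = PairwiseAggregation(512)(feats)   # (900, 512) augmented features
```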

3.2 Object-aware Feature Aggregation Module

As aforementioned, pair-wise context aggregation of proposals helps alleviate the problem of appearance degradation in video object detection. However, such knowledge is insufficient for proposals that look quite different from the others. Intuitively, the proposal feature space carries meaningful prior information about the existing objects; thus we obtain rich object-aware knowledge by aggregating the global context of proposals, which serves as a semantic prior of the video for recognizing objects. Nevertheless, the proposals contain false positives that cover only parts of objects, and such samples cannot provide an accurate global context.

Object-aware Knowledge Extraction.

The OFA module is designed to emphasize high-quality proposals. Obviously, the proposals with high objectness scores are most likely to contain the main area or salient parts of the objects. Moreover, each frame contains clusters of object proposals, which helps eliminate the uncertainty of the objectness scores produced by the RPN. For a specific sequence of proposal features $\{f_i\}$ with objectness scores $\{s_i\}$, we design the object-aware feature extractor $\mathcal{E}(\cdot)$, which lets the proposals vote with their corresponding objectness scores. Additionally, to maintain the magnitude of the features, we normalize the weights across all proposal scores. The object-aware feature is obtained as:

$f^{oa} = \mathcal{E}(\{f_i\}, \{s_i\}) = \frac{\sum_{i} s_i\, f_i}{\sum_{i} s_i}$    (2)
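A minimal sketch of this objectness-weighted pooling, with illustrative names, is given below; it assumes the features and class-agnostic objectness scores of all proposals in the sequence have already been gathered.

```python
# Sketch of the object-aware knowledge extraction in Eqn. (2): proposal
# features "vote" with their RPN objectness scores, normalized so that the
# feature magnitude is preserved.
import torch


def extract_object_aware_feature(feats: torch.Tensor,
                                 objectness: torch.Tensor) -> torch.Tensor:
    """feats: (N, C) proposal features from the whole sequence.
    objectness: (N,) class-agnostic objectness scores from the RPN."""
    weights = objectness / objectness.sum().clamp(min=1e-6)   # normalize over all proposals
    return (weights.unsqueeze(1) * feats).sum(dim=0)          # (C,) video-level prior


feats = torch.randn(900, 512)
scores = torch.rand(900)
f_oa = extract_object_aware_feature(feats, scores)   # one vector per video
```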

Considering the two paths in our framework, the proposal features $f_i$ are split into $f^{s}_i$ and $f^{l}_i$, which are fed into the semantic path and the localization path respectively. We obtain the aggregated features $\hat{f}^{s}_i$ and $\hat{f}^{l}_i$ as follows:

Semantic Path: Aggregating the object-aware features with the semantic context.

To exploit the object-aware feature $f^{oa}$, we first estimate a set of factors from Eqn. (2) to selectively highlight the class-dependent channels of $f^{s}_i$. Then we apply the pair-wise semantic feature aggregation of Eqn. (1). Combining it with the object-aware feature, we have

$\hat{f}^{s}_i = \Big(f^{s}_i + \sum_{j} \omega_{ij}\, f^{s}_j\Big) \odot h(f^{oa})$    (3)

where $h(\cdot)$ is a non-linear transformation applied to the object-aware feature and $\odot$ denotes channel-wise multiplication.

Localization Path: Aggregating the localization context.

As localization contexts among proposals usually focus on short temporal windows, and the object-aware knowledge may harm the overall performance, we design a parallel localization branch that urges the relation module to focus only on localization features. To keep the features sensitive to relative positions, $f^{l}_i$ is aggregated as follows:

$\hat{f}^{l}_i = f^{l}_i + \sum_{j} \omega^{l}_{ij}\, f^{l}_j, \qquad \omega^{l}_{ij} = \operatorname{softmax}_j\!\left(\frac{\phi^{l}(f^{l}_i)^{\top}\psi^{l}(f^{l}_j)}{\sqrt{d}}\right)$    (4)

To avoid interference from irrelevant semantically-similar contexts, we obtain the augmented feature by integrating localization contexts with individual learnable $\phi^{l}(\cdot)$ and $\psi^{l}(\cdot)$ functions.
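To make the two-path design concrete, the sketch below assembles one OFA module from the `PairwiseAggregation` and `extract_object_aware_feature` sketches above. The sigmoid channel gating used to fuse the aggregated semantic features with $h(f^{oa})$ is our assumption of how the class-dependent channels are highlighted; the paper's exact fusion may differ.

```python
# Hedged sketch of one OFA module: the 1024-D proposal features are split into
# a 512-D semantic part and a 512-D localization part, each augmented by its
# own pair-wise aggregation (Eqn. (1)).  The semantic features are additionally
# gated channel-wise by the object-aware feature (Eqns. (2)-(3), assumed form).
import torch
import torch.nn as nn


class OFAModule(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        half = dim // 2
        self.sem_agg = PairwiseAggregation(half)   # semantic path
        self.loc_agg = PairwiseAggregation(half)   # localization path, separate weights
        # h(.): FC-ReLU-FC applied to the object-aware feature (Sec. 4.2).
        self.h = nn.Sequential(nn.Linear(half, half), nn.ReLU(), nn.Linear(half, half))

    def forward(self, feats: torch.Tensor, objectness: torch.Tensor):
        f_sem, f_loc = feats.chunk(2, dim=1)                 # (N, 512) each
        # Detach so gradients from the object-aware knowledge extraction do not
        # flow back into the backbone/RPN (Sec. 4.3).
        f_oa = extract_object_aware_feature(f_sem.detach(), objectness.detach())
        gate = torch.sigmoid(self.h(f_oa))                   # class-dependent channel factors
        f_sem = self.sem_agg(f_sem) * gate                   # Eqn. (3), assumed gating
        f_loc = self.loc_agg(f_loc)                          # Eqn. (4)
        return f_sem, f_loc


feats = torch.randn(900, 1024)
scores = torch.rand(900)
f_sem, f_loc = OFAModule(1024)(feats, scores)   # (900, 512) each
```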

3.3 Sequence DIoU NMS

The predictions from Faster R-CNN-style detectors consist of a series of redundant results that need to be filtered. Considering temporal consistency, sequence post-processing methods link detection boxes across consecutive frames and select optimal tubelets to suppress redundant predictions. Seq-NMS Han et al. (2016) links the bounding boxes according to intersection over union (IoU). With modified criteria Deng et al. (2019); Gkioxari and Malik (2015), smoother tubelets can be selected. As objects are usually dense and occlude each other in consecutive frames, the criterion that decides whether different bounding boxes correspond to the same object is a key and difficult factor.

Without extra inputs such as pair-wise relation weights, we employ the Distance-IoU (DIoU) Zheng et al. (2020) to link and suppress boxes, which is an extended overlap metric from image object detection that better distinguishes crowded objects. With the center distance $\rho(b_i, b_j)$ between two bounding boxes and the diagonal length $c$ of the union box enclosing both inputs, we have

$\mathrm{DIoU}(b_i, b_j) = \mathrm{IoU}(b_i, b_j) - \frac{\rho^{2}(b_i, b_j)}{c^{2}}$    (5)
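For reference, a plain-Python implementation of Eqn. (5) for boxes in (x1, y1, x2, y2) format looks as follows; the helper name `diou` is ours.

```python
# DIoU of two boxes: IoU penalized by the squared center distance over the
# squared diagonal of the smallest box enclosing both inputs.
def diou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection over union.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / max(union, 1e-6)
    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4
    # Squared diagonal of the enclosing ("union") box.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / max(c2, 1e-6)


print(diou((0, 0, 10, 10), (2, 2, 12, 12)))   # IoU minus the center-distance penalty
```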

In our proposed Sequence DIoU NMS, we recursively maximize the sum of object scores subject to the modified linking constraint to find maximum paths as follows:

$\max_{\{i_t\}} \sum_{t=t_s}^{t_e} c_{t, i_t} \quad \text{s.t.} \quad \mathrm{DIoU}\big(b_{t, i_t}, b_{t+1, i_{t+1}}\big) > \tau_{link}, \;\; t_s \le t < t_e$    (6)

where $b_{t,i}$ and $c_{t,i}$ denote the $i$-th bounding box and its detection confidence in frame $t$. The optimization returns a set of indices $\{i_t\}$, and the detection confidences along the path are rescored with their average value.

Despite its simplicity, Sequence DIoU NMS further improves the performance in our experiments. We present the pseudo-code of Sequence DIoU NMS in Algorithm 1.

Input: $B$, $C$, $T$, $\tau_{link}$, $\tau_{nms}$
  $t$ is the time stamp, $0 \le t < T$,
  $B_t = \{b_{t,i}\}$ is the list of initial bounding boxes in frame $t$, where $i$ is the index,
  $C_t = \{c_{t,i}\}$ contains the corresponding detection confidences,
  $\tau_{link}$ and $\tau_{nms}$ are the linking and suppression thresholds.
Ensure: optimal $B^{*}$, $C^{*}$.
Create links $L$:
for $t = 0$ to $T-2$ do
     $L_t \leftarrow \varnothing$;
     for $b_{t,i} \in B_t$, $b_{t+1,j} \in B_{t+1}$ do
         if $\mathrm{DIoU}(b_{t,i}, b_{t+1,j}) > \tau_{link}$ then
              $L_t \leftarrow L_t \cup \{(i, j)\}$;
         end if
     end for
     $L \leftarrow L \cup \{L_t\}$;
end for
Find maximum paths:
while $B \neq \varnothing$ do
     find the maximum path $P = \{b_{t_s,i_{t_s}}, \dots, b_{t_e,i_{t_e}}\}$ from $t_s$ to $t_e$;
     if $|P| \le 1$ then
         stop loop;
     end if
     for $t = t_s$ to $t_e$ do
         for $b_{t,j} \in B_t$ do
              if $\mathrm{DIoU}(b_{t,i_t}, b_{t,j}) > \tau_{nms}$ then
                  delete $b_{t,j}$ and corresponding links;
              end if
         end for
     end for
     update $B$, $L$;
     rescore $C$ along $P$ according to Eqn. 6;
end while
Output: the rescored result of the sequences: $B^{*}$, $C^{*}$.
Algorithm 1: Sequence DIoU NMS.
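For concreteness, the following is a simplified Python rendering of Algorithm 1. It reuses the `diou` helper sketched above; the dynamic program for finding the maximum-score path and the names `tau_link` / `tau_nms` are our illustrative choices rather than the released implementation.

```python
# Simplified sketch of Sequence DIoU NMS: link boxes across adjacent frames by
# DIoU, repeatedly extract the maximum-score path, rescore it with its average
# confidence (Eqn. (6)), and suppress overlapping boxes in the covered frames.
import numpy as np


def sequence_diou_nms(boxes, scores, tau_link=0.5, tau_nms=0.3):
    """boxes[t]: (N_t, 4) array; scores[t]: (N_t,) array, one pair per frame."""
    scores = [np.asarray(s, dtype=float).copy() for s in scores]
    alive = [np.ones(len(s), dtype=bool) for s in scores]        # not yet linked or suppressed
    suppressed = [np.zeros(len(s), dtype=bool) for s in scores]  # dropped from the output

    while True:
        # Dynamic program: best accumulated score of a path ending at each alive box.
        acc = [np.where(a, s, -np.inf) for s, a in zip(scores, alive)]
        prev = [np.full(len(s), -1, dtype=int) for s in scores]
        for t in range(1, len(boxes)):
            for j in np.flatnonzero(alive[t]):
                for i in np.flatnonzero(alive[t - 1]):
                    if diou(boxes[t - 1][i], boxes[t][j]) > tau_link:
                        cand = acc[t - 1][i] + scores[t][j]
                        if cand > acc[t][j]:
                            acc[t][j], prev[t][j] = cand, i
        best = [a.max() if a.size else -np.inf for a in acc]
        if not np.isfinite(max(best)):
            break                                   # no alive boxes left
        t_end = int(np.argmax(best))
        j = int(np.argmax(acc[t_end]))
        # Trace the maximum-score path back through the predecessor links.
        path, t = [], t_end
        while j >= 0:
            path.append((t, j))
            j, t = prev[t][j], t - 1
        if len(path) <= 1:
            break                                   # Algorithm 1: stop the loop
        # Rescore the path and suppress overlapping boxes inside covered frames.
        avg = float(np.mean([scores[t][j] for t, j in path]))
        for t, j in path:
            scores[t][j] = avg
            alive[t][j] = False
            for k in np.flatnonzero(alive[t]):
                if diou(boxes[t][j], boxes[t][k]) > tau_nms:
                    alive[t][k] = False
                    suppressed[t][k] = True
    kept = [~s for s in suppressed]
    return scores, kept
```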

3.4 Relation to Other Approaches

Global features aggregation.

The global context has achieved success on a variety of tasks such as semantic segmentation Liu et al. (2015); Zhao et al. (2017), action recognition Wang and Gupta (2018) and similar tasks. However, most of the previous works Liu et al. (2015); Zhao et al. (2017); Zhang et al. (2018) capture global context from both foreground and background, without concentrating on foreground objects. In a two-stage video object detection pipeline, proposals usually contain rich foreground knowledge for object classification. This motivates us to adopt a global context operator in detection. Although the global context of proposals is employed in action recognition Wang and Gupta (2018), it does not emphasize the proposals containing salient views of objects. Our approach is the first to extract object-aware knowledge from object proposals to improve performance in VID. Remarkably, with this prior knowledge, we design a novel feature aggregation strategy to avoid the negative effect of redundant or low-quality proposals.

Multi-head attention-based approaches.

Existing multi-head attention works Vaswani et al. (2017); Deng et al. (2019) employ multiple parallel attention operators to augment features. They capture enriched relations by collecting the features from different heads. In fact, video object detection requires both classifying and regressing objects, hinting that separate branches for classification and regression may boost detection performance. Therefore, our approach divides features into two independent branches and optimizes them with different supervision. Importantly, this is a necessary step to augment the semantic features while keeping the localization features spatially sensitive.

4 Experiments

 Methods  Backbone  mAP
FGFA Zhu et al. (2017a)  ResNet-101  76.30
D&T Feichtenhofer et al. (2017)  ResNet-101  75.80
MANet Wang et al. (2018)  ResNet-101  78.10
SELSA Wu et al. (2019)  ResNet-101  82.69
RDN Deng et al. (2019)  ResNet-101  81.80
MEGA Chen et al. (2020)  ResNet-101  82.90
FEVOD Jiang et al. (2019)  ResNet-101  80.10
Ours  ResNet-101  83.93
FGFA* Zhu et al. (2017a)  ResNet-101  78.40
MANet* Wang et al. (2018)  ResNet-101  80.30
ST-Lattice* Chen et al. (2018)  ResNet-101  79.60
D&T* Feichtenhofer et al. (2017)  ResNet-101  79.80
STMN*+ Xiao and Jae Lee (2018)  ResNet-101  80.50
RDN* Deng et al. (2019)  ResNet-101  83.80
MEGA* Chen et al. (2020)  ResNet-101  84.50
FEVOD* Jiang et al. (2019)  ResNet-101  82.10
Ours*  ResNet-101  85.07
D&T* Feichtenhofer et al. (2017)  ResNeXt-101  81.60
SELSA Wu et al. (2019)  ResNeXt-101  84.30
RDN Deng et al. (2019)  ResNeXt-101  83.20
RDN* Deng et al. (2019)  ResNeXt-101  84.70
MEGA Chen et al. (2020)  ResNeXt-101  84.10
MEGA* Chen et al. (2020)  ResNeXt-101  85.40
Ours  ResNeXt-101  86.09
Ours*  ResNeXt-101  86.88
Table 1: Performance comparison with state-of-the-art systems on the ImageNet VID validation set. + indicates the use of model ensembling. * indicates the use of sequence post-processing methods (e.g., Seq-NMS, tubelet rescoring, and our Sequence DIoU NMS).

4.1 Dataset and Evaluation

We evaluate our method on the ImageNet VID dataset Russakovsky et al. (2015), which is a large-scale benchmark for the video object detection task. The ImageNet VID dataset consists of 3862 training videos and 555 validation videos. There are 30 object categories annotated in this dataset. Meanwhile, we follow the common protocols Wu et al. (2019); Deng et al. (2019); Chen et al. (2020) to mix the ImageNet VID dataset with the ImageNet DET dataset for training. We evaluate our method on the validation set and use mean average precision (mAP) as the main evaluation metric. Furthermore, we also report motion-specific mAP on the validation set to illustrate the effectiveness of our approach.

4.2 Network Architecture

Backbone.

We use ResNet-101 He et al. (2016) and ResNeXt-101 Xie et al. (2017) as backbone networks. For those two backbone networks, we enlarge the resolution of feature maps in the last stage by halving the strides and doubling the dilation rates of convolutions. We make ablation experiments mainly with ResNet-101. For the more powerful ResNeXt-101, we report the final results.

Region Feature Extraction.

We apply the RPN on top of the conv4 stage. In the RPN, anchors of 3 aspect ratios and 3 scales are predefined at each spatial location for proposal generation. With the generated proposals, we apply RoIAlign He et al. (2017) on the conv5 stage followed by a 1024-D fully-connected (FC) layer to extract features for the proposals.
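As an illustration of this step, the sketch below pools conv5 features with torchvision's `roi_align` and maps them to 1024-D with a single FC layer. The 7x7 pooling resolution and the 1/16 spatial scale (after halving the conv5 stride as described above) are our assumptions rather than reported settings.

```python
# Hedged sketch of the region feature extraction: RoIAlign on the conv5
# feature map followed by one 1024-D fully-connected layer.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class RegionFeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 2048, out_dim: int = 1024, pool: int = 7):
        super().__init__()
        self.pool = pool
        self.fc = nn.Linear(in_channels * pool * pool, out_dim)

    def forward(self, conv5: torch.Tensor, rois: torch.Tensor) -> torch.Tensor:
        # conv5: (B, 2048, H/16, W/16); rois: (K, 5) as (batch_idx, x1, y1, x2, y2).
        pooled = roi_align(conv5, rois, output_size=self.pool,
                           spatial_scale=1.0 / 16, sampling_ratio=2)
        return self.fc(pooled.flatten(1))            # (K, 1024) proposal features


feat = torch.randn(1, 2048, 38, 50)
rois = torch.tensor([[0.0, 10.0, 10.0, 200.0, 150.0]])
f = RegionFeatureExtractor()(feat, rois)             # torch.Size([1, 1024])
```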

OFA Module.

We split the 1024-D proposal features into two 512-D parts, one for the semantic path and one for the localization path. In both paths, the internal channels of the pair-wise feature aggregation are 512 after inserting an FC layer with 512 channels. In the stage of extracting the object-aware knowledge of proposals, the non-linear transformation $h(\cdot)$ is implemented as FC-ReLU-FC with 512 channels.

Sequence DIoU NMS.

We implement our Sequence DIoU NMS by slightly modifying the original Seq-NMS. In contrast to other methods, our Sequence DIoU NMS does not need extra inputs. Without bells and whistles, the network predictions can be improved effectively with suitable settings of the linking and suppression thresholds $\tau_{link}$ and $\tau_{nms}$.

4.3 Implementation Details

Our approach is mainly built on SELSA Wu et al. (2019). The input frames are resized to a shorter side of 600 pixels. The network is trained on 8 Nvidia P40 GPUs. A total of 6 epochs of SGD training is performed with a batch size of 8, with one learning rate for the first 4 epochs and another for the last 2 epochs. In the training phase, three frames are sampled from the same given video. Except for photometric distortion, we apply the same data augmentation, including random expansion, cropping and flipping, to these frames to keep them aligned. In the test phase, 21 frames are sampled from the given video.

Training our OFA module.

Different strategies are applied to generate proposals in the training and test phases. To ensure consistency between the two phases, we use the class-agnostic objectness scores provided by the RPN in both phases. Furthermore, we block the gradients propagated from the object-aware knowledge extraction so that they do not update the parameters of the backbone and the RPN.

4.4 Main Performance

We compare our method against motion-guided methods Zhu et al. (2017a); Feichtenhofer et al. (2017); Wang et al. (2018), SELSA Wu et al. (2019), RDN Deng et al. (2019) and MEGA Chen et al. (2020). Expanding on motion-guided methods, FEVOD Jiang et al. (2019) directly learns and predicts sampling positions to improve performance. SELSA Wu et al. (2019), RDN Deng et al. (2019) and MEGA Chen et al. (2020) all treat attention operators as the core components for improving their performance. In particular, the previous state-of-the-art method, MEGA, explores the idea of empowering predictions with context from longer content, both globally and locally. Table 1 shows the performance comparison with these state-of-the-art approaches. With different backbones and post-processing algorithms, we comprehensively compare our approach with the others. Whether as end-to-end models or with performance enhanced by post-processing, our approach consistently achieves the best performance on different backbones.

End-to-End models.

For fair comparisons, we use naive NMS as the post-processing operation when evaluating different end-to-end models. With the ResNet-101 backbone, our approach achieves 83.93% mAP, a 1.24% absolute improvement over the baseline. Among all competitors, our approach gains at least a 1.03% improvement. As one of the most similar methods, MEGA benefits from global context through its Long Range Memory (LRM) module. Although it can access more temporal context in larger temporal windows, the recurrently updated LRM is not able to obtain an overview of the input video. Hence, our approach is much more flexible in processing variable-length sequences and brings more improvement. Even with the stronger ResNeXt-101, our model still gains a 1.79% improvement over the baseline and achieves the new state-of-the-art performance of 86.09% mAP.

Add post-processing.

As most state-of-the-art video object detection approaches benefit from sequence post-processing, we also compare our approach with their best performances achieved with different post-processing strategies. Instead of BLR, we adopt our Sequence DIoU NMS, which requires no extra inputs. Table 1 summarizes the results of state-of-the-art methods with different post-processing. Our approach still performs the best, with 85.07% and 86.88% mAP with the ResNet-101 and ResNeXt-101 backbones, respectively.

4.5 Ablation Study

To study the impact of the components in our approach, extensive ablation experiments are conducted. All these experiments start from the same baseline with the ResNet-101 backbone.

Effect of the object-aware knowledge of proposals.

To explore the effect of the object-aware knowledge of proposals in our approach, we report the performance in Table 2. In addition, we introduce two alternative strategies to extract object-aware knowledge. Inspired by global context extraction in semantic segmentation Liu et al. (2015); Zhao et al. (2017), we compute the global context of frames by averaging the values of the feature maps. We also calculate the mean value of all proposal features to validate the necessity of weighting proposals.
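For clarity, the three extraction strategies compared in Table 2 can be summarized by the NumPy sketch below; the function names are illustrative and only the last variant corresponds to the OFA module.

```python
# The three object-aware knowledge variants compared in the ablation.
import numpy as np


def frame_global_context(feature_maps: np.ndarray) -> np.ndarray:
    # feature_maps: (T, C, H, W) -> average over frames and spatial locations,
    # so background regions contribute as much as objects.
    return feature_maps.mean(axis=(0, 2, 3))


def proposal_mean_context(proposal_feats: np.ndarray) -> np.ndarray:
    # proposal_feats: (N, C) -> every proposal is weighted equally.
    return proposal_feats.mean(axis=0)


def objectness_weighted_context(proposal_feats: np.ndarray,
                                objectness: np.ndarray) -> np.ndarray:
    # OFA choice: proposals vote with their RPN objectness scores (Eqn. (2)).
    w = objectness / max(objectness.sum(), 1e-6)
    return (w[:, None] * proposal_feats).sum(axis=0)
```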

Comparing the different object-aware knowledge extraction strategies, the global context obtained by frame-level averaging even harms performance, suggesting that too much background information in the global context hurts the recognition of objects. On the positive side, the object-aware knowledge obtained by averaging proposal features and by our OFA module can improve performance, because proposals mainly focus on the parts of the scene where objects may exist and most of the background is filtered out. In particular, the strategy of highlighting proposals with high objectness scores significantly surpasses the baseline, by 1.24% mAP when combined with feature splitting (83.93% vs. 82.69%). From the motion-specific mAP, we find that our approach improves the recognition of objects with different motion speeds. Especially for objects with slow or medium motion, larger benefits are obtained, since clearer object-aware knowledge can be produced. Meanwhile, as shown in the ablation experiments on feature splitting (SP), parallel feature augmentation alone does not make a significant impact on the performance. Nevertheless, the individual semantic path is indeed a necessary key to combining object-aware knowledge to boost recognition ability.

Additionally, with the stronger ResNeXt-101 backbone, the absolute improvement is further increased. We infer that more powerful object-aware knowledge is extracted from the stronger encoded features and the more accurate proposals.

 OA  SP  mAP  mAP(slow)  mAP(medium)  mAP(fast)
-  -  82.69  88.00  81.35  67.10
-  ✓  82.79  88.46  81.50  66.31
 -  82.24  89.22  80.59  65.17
 ✓  83.14  87.54  82.23  67.13
 -  83.29  88.30  82.05  67.74
 ✓  83.93  89.44  82.67  67.36
Table 2: Ablation study on the components of the OFA module. Here, we mainly verify the impact of feature splitting (SP) and of different object-aware knowledge (OA) extraction strategies. mAP(slow), mAP(medium) and mAP(fast) denote the motion-specific mAP for objects with slow, medium and fast motion, respectively.
 Network  Post-Processing  mAP
Baseline  NMS  82.69
Baseline  Seq DIoU NMS  84.20
RDN  NMS  81.80
RDN  BLR  83.80
MEGA  NMS  82.90
MEGA  BLR  84.50
Ours  NMS  83.93
Ours  DIoU NMS  84.66
Ours  Seq-NMS  84.00
Ours  Seq DIoU NMS  85.07
Table 3: Performance comparison with state-of-the-art video object detection models with post-processing methods (e.g. Seq-NMS, BLR, and our Sequence DIoU NMS (Seq DIoU NMS)).

Effect of post-processing.

To explore different post-processing methods, we report their performance in Table 3. Compared with the common NMS, all sequence post-processing methods yield clear improvements. The most effective post-processing, BLR, improves the two state-of-the-art networks RDN Deng et al. (2019) and MEGA Chen et al. (2020) by 2.00% and 1.60% mAP, respectively; the gains of post-processing methods shrink on more powerful networks. In comparison, our Sequence DIoU NMS improves the baseline by 1.51%. Even on our network, whose performance is much higher than that of the other networks, Sequence DIoU NMS still brings a 1.14% improvement. To analyze this improvement, we also test DIoU NMS and Seq-NMS alone on our approach. Sequence DIoU NMS significantly improves the performance of our approach with different backbones, exceeding what the straightforward DIoU NMS (84.66%) and Seq-NMS (84.00%) achieve.

5 Conclusion

In this paper, we have presented the Object-aware Feature Aggregation (OFA) module to extract video-level object-aware knowledge of proposals for video object detection. Our OFA module contains two separate parallel paths, i.e., a semantic path and a localization path for classification and regression, respectively. The OFA module improves performance by incorporating the prior knowledge with well-established pair-wise contexts, and it is compatible with attention-based video object detection methods. Sequence DIoU NMS further boosts the performance at the post-processing stage. Extensive experiments on the ImageNet VID dataset have demonstrated the effectiveness of the proposed method. Future research may focus on introducing video-level object-aware knowledge in other proposal-based vision tasks such as object tracking and action recognition.

References

  1. Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 331–346. Cited by: §2.
  2. Gcnet: non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  3. Optimizing video object detection via a scale-time lattice. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7814–7823. Cited by: Table 1.
  4. Memory enhanced global-local aggregation for video object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346. Cited by: §1, §2, §3.1, §4.1, §4.4, §4.5, Table 1.
  5. R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §2.
  6. Relation distillation networks for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7023–7032. Cited by: §1, §2, §3.1, §3.3, §3.4, §4.1, §4.4, §4.5, Table 1.
  7. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046. Cited by: §1, §2, §4.4, Table 1.
  8. Dssd: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: §1, §2, §2.
  9. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1, §2.
  10. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2.
  11. Finding action tubes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 759–768. Cited by: §3.3.
  12. Seq-nms for video object detection. arXiv preprint arXiv:1602.08465. Cited by: §1, §2, §3.3.
  13. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2, §4.2.
  14. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916. Cited by: §2.
  15. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2, §4.2.
  16. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597. Cited by: §2.
  17. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
  18. Learning where to focus for efficient video object detection. External Links: 1911.05253 Cited by: §4.4, Table 1.
  19. Object detection in videos with tubelet proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 727–735. Cited by: §1, §2.
  20. T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology 28 (10), pp. 2896–2907. Cited by: §1, §2.
  21. Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 817–825. Cited by: §1, §2.
  22. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §2.
  23. Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §1, §2, §2.
  24. Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European conference on computer vision (ECCV), pp. 303–318. Cited by: §1.
  25. Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia 21 (2), pp. 416–428. Cited by: §1.
  26. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §2.
  27. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §2, §2.
  28. Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.
  29. Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §2, §2.
  30. Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §3.4, §4.5.
  31. Learning deep intrinsic video representation by exploring temporal coherence and graph structure.. In IJCAI, pp. 3832–3838. Cited by: §1.
  32. Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §1.
  33. You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2, §2.
  34. YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §2, §2.
  35. Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2, §2.
  36. Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.
  37. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §2, §4.1.
  38. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §1.
  39. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §2.
  40. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1, §2.
  41. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.4.
  42. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 542–557. Cited by: §1, §2, §4.4, Table 1.
  43. Videos as space-time region graphs. In Proceedings of the European conference on computer vision (ECCV), pp. 399–417. Cited by: §3.4.
  44. Sequence level semantics aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9217–9225. Cited by: Figure 1, §1, §2, §3.1, §4.1, §4.3, §4.4, Table 1.
  45. Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10186–10195. Cited by: §2.
  46. Video object detection with an aligned spatial-temporal memory. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 485–501. Cited by: §1, §2, Table 1.
  47. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §4.2.
  48. Context encoding for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7151–7160. Cited by: §3.4.
  49. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §3.4, §4.5.
  50. Distance-iou loss: faster and better learning for bounding box regression.. In AAAI, pp. 12993–13000. Cited by: §1, §3.3.
  51. Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §1, §2.
  52. Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218. Cited by: §1.
  53. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417. Cited by: §1, §2, §4.4, Table 1.
  54. Deep feature flow for video recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2349–2358. Cited by: §2.