RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder



Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine these representations in a single framework to make good use of each strength, due to the heterogeneous or non-grid feature extraction of different representations. This paper presents an attention-based decoder module, similar to that in the Transformer Vaswani et al. (2017), to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key instances to strengthen the main query representation features in the vanilla detectors. Novel techniques are proposed for efficient computation of the decoder module, including a key sampling approach and a shared location embedding approach. The proposed module is named bridging visual representations (BVR). It can work in-place, and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where about 1.5 to 3.0 AP improvements are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about 2.0 AP, reaching 52.7 AP on COCO test-dev. The resulting network is named RelationNet++. The code will be made publicly available.

1 Introduction

Object detection is a vital problem in computer vision that many visual applications build on. While there have been numerous approaches towards solving this problem, they usually leverage a single visual representation format. For example, most object detection frameworks Girshick et al. (2014); Girshick (2015); Ren et al. (2015); Lin et al. (2017b) utilize the rectangle box to represent object hypotheses in all intermediate stages. Recently, there have also been some frameworks adopting points to represent an object hypothesis, e.g., center point in CenterNet Zhou et al. (2019a) and FCOS Tian et al. (2019), point set in RepPoints Yang et al. (2019, 2020); Chen et al. (2020) and PSN Wei et al. (2020). In contrast to representing whole objects, some keypoint-based methods, e.g., CornerNet Law and Deng (2018), leverage part representations of corner points to compose an object. In general, different representation methods usually steer the detectors to perform well in different aspects. For example, the bounding box representation is better aligned with annotation formats for object detection. The center representation avoids the need for an anchoring design and is usually friendly to small objects. The corner representation is usually more accurate for finer localization.

It is natural to raise a question: could we combine these representations into a single framework to make good use of each strength? Noticing that different representations and their feature extractions are usually heterogeneous, such a combination is difficult. To address this issue, we present an attention-based decoder module similar to that in the Transformer Vaswani et al. (2017), which can effectively model dependencies between heterogeneous features. The main representations in an object detector are set as the query input, and other visual representations act as auxiliary keys to enhance the query features through interactions in which both appearance and geometry relationships are considered.

In general, all feature map points can act as corner/center key instances, which are usually too many for practical attention computation. In addition, the pairwise geometry term is computation and memory consuming. To address these issues, two novel techniques are proposed, including a key sampling approach and a shared location embedding approach for efficient computation of the geometry term. The proposed module is named bridging visual representations (BVR).

Figure 1(a) illustrates the application of this module to bridge center and corner representations into an anchor-based object detector. The center and corner representations act as key instances to enhance the anchor box features, and the enhanced features are then used for category classification and bounding box regression to produce the detection results. The module can work in-place. Compared with the original object detector, the main change is that the input features for classification and regression are replaced by the enhanced features; thus the strengthened detector largely maintains its convenience in use.

The proposed BVR module is general. It is applied to various prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS. Extensive experiments on the COCO dataset Lin et al. (2014) show that the BVR module substantially improves these various detectors by 1.5 to 3.0 AP. In particular, we improve a strong ATSS detector by about 2.0 AP with small overhead, reaching 52.7 AP on COCO test-dev. The resulting network is named RelationNet++, which extends the relation modeling in Hu et al. (2018) from bbox-to-bbox to across heterogeneous object/part representations.

The main contributions of this work are summarized as:

  • A general module, named BVR, to bridge various heterogeneous visual representations and combine the strengths of each. The proposed module can be applied in-place and does not break the overall inference process driven by the main representations.

  • Novel techniques to make the proposed bridging module efficient, including a key sampling approach and a shared location embedding approach.

  • Broad effectiveness of the proposed module for four prevalent object detectors: RetinaNet, Faster R-CNN, FCOS and ATSS.

(a) Bridge representations.
(b) Typical object/part representations.
Figure 1: (a) An illustration of bridging various representations, specifically leveraging corner/center representations to enhance the anchor box features. (b) Object/part representations used in object detection (geometric description and feature extraction). The red dashed box denotes ground-truth.

2 A Representation View for Object Detection

2.1 Object / Part Representations

Object detection aims to find all objects in a scene with their locations described by rectangle bounding boxes. To discriminate object bounding boxes from background and to categorize objects, intermediate geometric object/part candidates with associated features are required. We refer to the joint geometric description and feature extraction as the representation, where typical representations used in object detection are illustrated in Figure 1(b) and summarized below.

Object bounding box representation   Object detection uses bounding boxes as the final output. Probably because of this, the bounding box is now the most prevalent representation. Geometrically, a bounding box can be described by a 4-d vector, either as center-size (x, y, w, h) or as opposing corners (x1, y1, x2, y2). Besides the final output, this representation is also commonly used as initial and intermediate object representations, such as anchors Ren et al. (2015); Liu et al. (2016); Redmon and Farhadi (2017, 2018); Lin et al. (2017b) and proposals Girshick et al. (2014); Dai et al. (2016); Lin et al. (2017a); He et al. (2017). For bounding box representations, features are usually extracted by pooling operators within the bounding box area on an image feature map. Common pooling operators include RoIPool Girshick (2015), RoIAlign He et al. (2017), and Deformable RoIPool Dai et al. (2017); Zhu et al. (2019). There are also simplified feature extraction methods, e.g., the box center features are usually employed in the anchor box representation Ren et al. (2015); Lin et al. (2017b).
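For illustration, converting between the center-size and opposing-corner descriptions of a box takes only a few lines; a minimal sketch (the function names are ours, not from the paper):

```python
def ccwh_to_xyxy(cx, cy, w, h):
    # center-size (cx, cy, w, h) -> opposing corners (x1, y1, x2, y2)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def xyxy_to_ccwh(x1, y1, x2, y2):
    # opposing corners (x1, y1, x2, y2) -> center-size (cx, cy, w, h)
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```

Both forms describe the same 4-d hypothesis; detectors pick whichever is more convenient for anchoring or regression targets.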

Object center representation  The 4-d vector space of a bounding box representation is at a scale of O((HW)^2) for an image of resolution H × W, which is too large to fully process. To reduce the representation space, some recent frameworks Tian et al. (2019); Yang et al. (2019); Zhou et al. (2019a); Kong et al. (2019); Wang et al. (2019) use the center point as a simplified representation. Geometrically, a center point is described by a 2-d vector (x, y), so the hypothesis space is of scale O(HW), which is much more tractable. For a center point representation, the image feature at the center point is usually employed as the object feature.

Corner representation  A bounding box can be determined by two points, e.g., a top-left corner and a bottom-right corner. Some approaches Tychsen-Smith and Petersson (2017); Law and Deng (2018); Law et al. (2019); Duan et al. (2019); Lu et al. (2019); Zhou et al. (2019b); Samet et al. (2020) first detect these individual points and then compose bounding boxes from them. We refer to these representation methods as corner representation. The image feature at the corner location can be employed as the part feature.

Summary and comparison  Different representation approaches usually have strengths in different aspects. For example, object based representations (bounding box and center) are better in category classification while worse in object localization than part based representations (corners). Object based representations are also more friendly for end-to-end learning because they do not require a post-processing step to compose objects from corners as in part-based representation methods. Comparing different object-based representations, while the bounding box representation enables more sophisticated feature extraction and multiple-stage processing, the center representation is attractive due to the simplified system design.

Figure 2: Representation flows for several typical detection frameworks.

2.2 Object Detection Frameworks in a Representation View

Object detection methods can be seen as evolving intermediate object/part representations until the final bounding box outputs. The representation flows largely shape different object detectors. Several major categorizations of object detectors are based on such representation flows, such as top-down (object-based representation) vs bottom-up (part-based representation), anchor-based (bounding box based) vs anchor-free (center point based), and single-stage (one-time representation flow) vs multiple-stage (multiple-time representation flow). Figure 2 shows the representation flows of several typical object detection frameworks, as detailed below.

Faster R-CNN Ren et al. (2015) employs bounding boxes as its intermediate object representations in all stages. At the beginning, multiple anchor boxes at each feature map position are hypothesized to coarsely cover the 4-d bounding box space in an image, i.e., anchor boxes with different scales and aspect ratios. The image feature vector at the center point is extracted to represent each anchor box, which is then used for foreground/background classification and localization refinement. After anchor box selection and localization refinement, the object representation evolves to a set of proposal boxes, where the object features are usually extracted by an RoIAlign operator within each box area. The final bounding box outputs are obtained by localization refinement, through a small network on the proposal features.

RetinaNet Lin et al. (2017b) is a one-stage object detector, which also employs bounding boxes as its intermediate representation. Due to its one-stage nature, it usually requires denser anchor hypotheses, i.e., 9 anchor boxes at each feature map position. The final bounding box outputs are also obtained by applying a localization refinement head network.

FCOS Tian et al. (2019) is also a one-stage object detector but uses object center points as its intermediate object representation. It directly regresses the four sides from the center points to form the final bounding box outputs. There are concurrent works, such as Zhou et al. (2019a); Kong et al. (2019); Yang et al. (2019). Although center points can be seen as a degenerate geometric representation of bounding boxes, these center point based methods show competitive or even better performance on benchmarks.
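This center-based decoding, regressing distances to the four box sides from a center point, can be written as a one-liner; a sketch (the function name is ours):

```python
def decode_center_box(cx, cy, l, t, r, b):
    # (x1, y1, x2, y2) from a center point (cx, cy) and the regressed
    # distances to the left, top, right, and bottom sides
    return (cx - l, cy - t, cx + r, cy + b)
```

Every pixel inside a ground-truth box can serve as a positive center sample under this scheme, which is one reason center-based detectors are friendly to small objects.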

CornerNet Law and Deng (2018) is built on the intermediate part representation of corners, in contrast to the above frameworks where object representations are employed. The predicted corners (top-left and bottom-right) are grouped according to their embedding similarity to compose the final bounding box outputs. Detectors based on the corner representation usually achieve better object localization than those based on an object-level representation.

3 Bridging Visual Representations

For the typical frameworks in Section 2.2, mainly one kind of representation approach is employed. While they have strengths in some aspects, they may also fall short in other ways. However, it is in general difficult to combine them in a single framework, due to the heterogeneous or non-grid feature extraction by different representations. In this section, we will first present a general method to bridge different representations. Then we demonstrate its applications to various frameworks, including RetinaNet Lin et al. (2017b), Faster R-CNN Ren et al. (2015), FCOS Tian et al. (2019) and ATSS Zhang et al. (2020).

3.1 A General Attention Module to Bridge Visual Representations

Without loss of generality, for an object detector, the representation it leverages is referred to as the master representation, and the general module aims to bridge other representations to enhance this master representation. Such other representations are referred to as auxiliary ones.

Inspired by the success of the decoder module for neural machine translation where an attention block is employed to bridge information between different languages, e.g., Transformer Vaswani et al. (2017), we adapt this mechanism to bridge different visual representations. Specifically, the master representation acts as the query input, and the auxiliary representations act as the key input. The attention module outputs strengthened features for the master representation (queries), which have bridged the information from auxiliary representations (keys). We use a general attention formulation as:


f'_q = f_q + Σ_i S(f_q, f_{k_i}, g_q, g_{k_i}) · T_v(f_{k_i}),

where f_q, f'_q, and g_q are the input feature, output feature, and geometric vector for a query instance q; f_{k_i} and g_{k_i} are the input feature and geometric vector for a key instance k_i; T_v(·) is a linear value transformation function; and S(·) is a similarity function between q and k_i, instantiated as in Hu et al. (2018):


S(f_q, f_{k_i}, g_q, g_{k_i}) = softmax_i ( S^A(f_q, f_{k_i}) + S^G(g_q, g_{k_i}) ),

where S^A denotes the appearance similarity, computed by a scaled dot product between query and key features Vaswani et al. (2017); Hu et al. (2018), and S^G denotes a geometric term, computed by applying a small network on the relative locations between q and k_i, i.e., a cosine/sine location embedding Vaswani et al. (2017); Hu et al. (2018) followed by a 2-layer MLP. In the case of different dimensionality between the query geometric vector and the key geometric vector (a 4-d bounding box vs. a 2-d point), we first extract a 2-d point from the bounding box, i.e., its center or a corner, such that the two representations are homogeneous in geometric description for the subtraction operation. The same as in Vaswani et al. (2017); Hu et al. (2018), multi-head attention is employed, which performs substantially better than single-head attention; we use 8 attention heads by default.
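As a concrete illustration, this attention can be sketched in NumPy. The following is a single-head sketch under assumed shapes, with the geometric MLP supplied by the caller; all names (`bvr_attention`, `location_embedding`, `embed_dim`) are ours, not from the paper:

```python
import numpy as np

def location_embedding(rel_pos, embed_dim=64, temperature=1000.0):
    # Cosine/sine embedding of relative (dy, dx) locations, in the style of
    # Transformer position encodings. rel_pos: (N, K, 2) -> (N, K, 2 * embed_dim)
    freqs = temperature ** (np.arange(embed_dim // 2) / (embed_dim // 2))
    angles = rel_pos[..., None] / freqs              # (N, K, 2, embed_dim // 2)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*rel_pos.shape[:-1], -1)

def bvr_attention(f_q, g_q, f_k, g_k, W_v, geo_mlp):
    # f_q: (N, C) query features, g_q: (N, 2) query points (master representation)
    # f_k: (K, C) key features,  g_k: (K, 2) key points (auxiliary representation)
    C = f_q.shape[1]
    s_app = f_q @ f_k.T / np.sqrt(C)                 # appearance term: scaled dot product
    rel = g_q[:, None, :] - g_k[None, :, :]          # (N, K, 2) relative locations
    s_geo = geo_mlp(location_embedding(rel))         # geometric term, (N, K)
    w = np.exp(s_app + s_geo)
    w /= w.sum(axis=1, keepdims=True)                # softmax over the keys
    return f_q + w @ (f_k @ W_v)                     # strengthened query features
```

A multi-head version would split the channels into groups and run this computation once per head, with per-head value and geometric projections.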

The above module is named bridging visual representations (BVR), which takes query and key representations of any dimension as input and generates strengthened features for the query considering both their appearance and geometric relationships. The module can be easily plugged into prevalent detectors, as described in Sections 3.2 and 3.3.

3.2 BVR for RetinaNet

We take RetinaNet as an example to showcase how we apply the BVR module to an existing object detector. As mentioned in Section 2.2, RetinaNet adopts anchor bounding boxes as its master representation, where 9 bounding boxes are anchored at each feature map location. In total, there are 9HW bounding box instances for a feature map of resolution H × W. BVR takes the C × H × W feature map (C is the number of feature map channels) as the query input, and generates strengthened query features of the same dimension.

We use two kinds of key (auxiliary) representations to strengthen the query (master) features: one is the object centers, and the other is the corners. As shown in Figure 3(a), the center/corner points are predicted by applying a small point head network on the output feature map of the backbone. Then a small set of key points is selected from all predictions and fed into the attention modules to strengthen the classification and regression features, respectively. In the following, we provide details of these modules and the crucial designs.

Auxiliary (key) representation learning  The point head network consists of two shared conv layers, followed by two independent sub-networks (a conv layer and a sigmoid layer) that predict the scores and sub-pixel offsets for center and corner prediction, respectively Law and Deng (2018). The score indicates the probability of a center/corner point being located at a feature map bin. The sub-pixel offset denotes the displacement between the precise location and the top-left (integer) coordinate of each feature bin, which accounts for the resolution loss caused by the down-sampling of feature maps.
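Decoding the predicted maps back into image-space points amounts to adding the per-bin offset to the bin's integer coordinate and scaling by the stride. A minimal sketch (array shapes and the function name `decode_points` are our assumptions):

```python
import numpy as np

def decode_points(scores, offsets, stride):
    # scores: (H, W) sigmoid scores; offsets: (2, H, W) sub-pixel offsets in [0, 1).
    # A point's precise location is (integer bin coordinate + offset) * stride.
    H, W = scores.shape
    ys, xs = np.mgrid[0:H, 0:W]
    x = (xs + offsets[0]) * stride
    y = (ys + offsets[1]) * stride
    return np.stack([x.ravel(), y.ravel()], axis=1)   # (H*W, 2) (x, y) points
```

The score map is carried along unchanged here; it is what the subsequent key-selection step ranks to pick a small set of candidate points.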

In learning, for object detection frameworks with an FPN structure, we assign all ground-truth center/corner points to all feature levels. We find this performs better than the common practice where objects are assigned to a particular level Lin et al. (2017a, b); Tian et al. (2019); Law and Deng (2018); Yang et al. (2019), probably because it speeds up the learning of center/corner representations due to more positive samples being employed in each level. The focal loss Lin et al. (2017b) and the smooth L1 loss are employed for center/corner score and sub-pixel offset learning, respectively.

(a) Apply BVR to RetinaNet.
(b) Attention-based feature enhancement.
Figure 3: Applying BVR into an object detector and an illustration of the attention computation.

Key selection  We use corner points to demonstrate auxiliary representation selection, since the principle is the same for the center point representation. We treat each feature map position as an object corner candidate. If all candidates were employed in the key set, the computation cost of the BVR module would be unaffordable. In addition, too many background candidates may suppress real corners. To address these issues, we propose a top-k (k = 50 by default) key selection strategy. Concretely, a MaxPool operator with stride 1 is performed on the corner score map, and the top-k corner candidates are selected according to their corner-ness scores. For an FPN backbone, we select the top-k keys from all pyramidal levels, and the key set is shared by all levels. This shared key set outperforms independent key sets for different levels, as shown in Table 1.
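A minimal NumPy sketch of this selection step, keeping local maxima of the score map and then taking the top-k (the 3×3 pooling window is our assumption; the text only specifies stride 1):

```python
import numpy as np

def select_topk_keys(score_map, k=50, pool=3):
    # Keep local maxima of a corner/center score map (pool x pool max-pooling
    # with stride 1), then pick the top-k candidates by score.
    H, W = score_map.shape
    pad = pool // 2
    padded = np.pad(score_map, pad, constant_values=-np.inf)
    pooled = np.max(
        [padded[dy:dy + H, dx:dx + W] for dy in range(pool) for dx in range(pool)],
        axis=0)                                       # stride-1 max pool via shifted views
    peaks = score_map * (score_map == pooled)         # suppress non-peak positions
    flat = peaks.ravel()
    top = np.argsort(flat)[::-1][:k]                  # indices of the top-k scores
    return np.stack(np.unravel_index(top, (H, W)), axis=1), flat[top]
```

The stride-1 max pool acts as a cheap non-maximum suppression, so near-duplicate candidates around the same peak do not crowd out other corners in the key set.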

Shared relative location embedding  Direct computation of the geometry term evaluates the cosine/sine embedding and the 2-layer MLP for every query/key pair, so its computation and memory cost grow with the product of the number of queries, the number of selected keys, the embedding dimension, the inner dimension of the MLP network, and the head number of the multi-head attention module. As shown in Table 3, this direct computation is both time-consuming and space-consuming.

Noting that the range of relative locations is limited, we instead apply the cosine/sine embedding and the 2-layer MLP network on a fixed 2-d relative location map to produce a multi-channel geometric map, and then compute the geometric term for each key/query pair by bilinear sampling on this geometric map. To further reduce the computation, we use a relative location map whose unit length is larger than one pixel, so that each location bin covers several pixels of the original image. In our implementation, the unit length is proportional to the stride S of the pyramid level, and the location map has a fixed resolution, so it covers a bounded relative-location range on the original image for a pyramid level of stride S. Figure 3(b) gives an illustration. The resulting computation and memory costs are significantly smaller than those of direct computation, as shown in Table 3.
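The shared-map idea can be sketched as: precompute the geometric term once on a quantized grid of relative locations, then bilinearly sample it per query/key pair. A toy NumPy version (function names, and the linear `geo_mlp` used in the test, are ours):

```python
import numpy as np

def build_geometric_map(size, unit, geo_mlp, embed_fn):
    # Precompute the geometric term on a fixed grid of quantized relative
    # locations; each bin covers `unit` pixels of the original image.
    coords = (np.mgrid[0:size, 0:size] - size // 2).transpose(1, 2, 0) * unit
    return geo_mlp(embed_fn(coords))                  # (size, size) map per head

def sample_geometric(geo_map, rel_pos, unit):
    # Bilinearly sample the precomputed map at continuous relative locations
    # rel_pos (..., 2), instead of running the MLP once per query/key pair.
    size = geo_map.shape[0]
    u = np.clip(rel_pos / unit + size // 2, 0, size - 1 - 1e-6)
    i0 = np.floor(u).astype(int)
    fy, fx = (u - i0)[..., 0], (u - i0)[..., 1]
    y0, x0 = i0[..., 0], i0[..., 1]
    g = geo_map
    return ((1 - fy) * (1 - fx) * g[y0, x0] + (1 - fy) * fx * g[y0, x0 + 1]
            + fy * (1 - fx) * g[y0 + 1, x0] + fy * fx * g[y0 + 1, x0 + 1])
```

In the real module the map has one channel per attention head; the sketch shows a single channel to keep the indexing readable.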

Separate BVR modules for classification and regression  Object center representations can provide rich context for object categorization, while corner representations can facilitate localization. Therefore, we apply separate BVR modules to enhance the classification and regression features respectively, as shown in Figure 3(a). This separate design is beneficial, as demonstrated in Table 5.

3.3 BVR for Other Frameworks

The BVR module is general, and can be applied to other object detection frameworks.

ATSS Zhang et al. (2020) applies several techniques from anchor-free detectors to improve anchor-based detectors such as RetinaNet. The BVR module used for RetinaNet can be directly applied.

FCOS Tian et al. (2019) is an anchor-free detector which utilizes the center point as its object representation. Since there is no corner information in this representation, we always use the center point location and the corresponding feature to represent the query instance in our BVR module. Other settings are kept the same as those for RetinaNet.

Faster R-CNN Ren et al. (2015) is a two-stage detector which employs bounding boxes as the intermediate object representations in all stages. We adopt BVR to enhance the features of bounding box proposals; the diagram is shown in Figure 4(a). For each proposal, the RoIAlign feature is used to predict center and corner representations. Figure 4(b) shows the network structure of the point (center/corner) head, which is similar to the mask head in Mask R-CNN He et al. (2017). The selection of keys is the same as the process in RetinaNet, as stated in Section 3.2. We use the features interpolated from the point head as the key features; center and corner features are again employed to enhance classification and regression, respectively. Since the number of queries is much smaller than that in RetinaNet, we directly compute the geometry term rather than using the shared geometric map.

(a) Apply BVR to Faster R-CNN.
(b) Point head for center (corner) prediction in Faster R-CNN.
Figure 4: Design of applying BVR to Faster R-CNN.

3.4 Relation to Other Attention Mechanisms in Object Detection

Non-Local Networks (NL) Wang et al. (2018) and RelationNet Hu et al. (2018) are two pioneering works utilizing attention modules to enhance detection performance. However, they are both designed to enhance instances of a single representation format: non-local networks Wang et al. (2018) use self-attention to enhance a pixel feature by fusing in other pixels' features; RelationNet Hu et al. (2018) enhances a bounding box feature by fusing in other bounding box features.

In contrast, BVR aims to bridge representations in different forms to combine the strengths of each. In addition to this conceptual difference, there are also new techniques in the modeling aspect. For example, techniques are proposed to enable homogeneous difference/similarity computation between heterogeneous representations, i.e., a 4-d bounding box vs 2-d corner/center points. There are also new techniques proposed to effectively model relationships between different kinds of representations and to speed up computation, such as key representation selection and the shared relative location embedding approach. The proposed BVR is in fact complementary to these pioneering works, as shown in Tables 7 and 8.

Learning Region Features (LRF) Gu et al. (2018) and DETR Carion et al. (2020) use an attention module to compute the features of object proposals Gu et al. (2018) or queries Carion et al. (2020) from image features. BVR shares a similar formulation with them, but has a different aim: to bridge different forms of object representations.

4 Experiments

We first ablate each component of the proposed BVR module using a RetinaNet base detector in Section 4.1. Then we show the benefits of BVR applied to four representative detectors, including two-stage (i.e., Faster R-CNN), one-stage (i.e., RetinaNet and ATSS) and anchor-free (i.e., FCOS) detectors. Finally, we compare our approach with state-of-the-art methods.

Our experiments are all implemented on the MMDetection v1.1.0 codebase Chen et al. (2019). All experiments are performed on the MS COCO dataset Lin et al. (2014). A union of train images and a subset of val images are used for training. Most ablation experiments are studied on a subset of unused val images (denoted as minival). Unless otherwise stated, all the training and inference details keep the same as the default settings in MMDetection, i.e., initializing the backbone using the ImageNet Russakovsky et al. (2015) pretrained model, resizing the input images so that the shorter side is 800 pixels and the longer side is no more than 1333 pixels, and optimizing the whole network via SGD with momentum and weight decay, with the learning rate decreased at epochs 8 and 11 of the 12-epoch (1×) schedule. In the large model experiments in Tables 10 and 12, we train for 20 epochs and decrease the learning rate at epochs 16 and 19. Multi-scale training is also adopted in the large model experiments: for each mini-batch, the shorter side is randomly selected from a predefined range.

4.1 Method Analysis using RetinaNet

Our ablation study is built on a RetinaNet detector using ResNet-50, which achieves 35.6 AP on COCO minival (1× settings). Components in the BVR module are ablated using this base detector.

Key selection As shown in Table 1, compared with independent keys across feature levels, sharing keys brings 1.6 and 1.5 AP gains for 20 and 50 keys, respectively. Using 50 keys achieves the best accuracy, probably because too few keys cannot sufficiently cover the representative keypoints, while too many keys include many low-quality candidates.

On the whole, the BVR-enhanced RetinaNet significantly outperforms the original RetinaNet by 2.9 AP, demonstrating the great benefit of bridging other representations.

Sub-pixel corner/center Table 2 shows the benefits of using sub-pixel representations for centers and corners. While sub-pixel representation benefits both classification and regression, it is more critical for the localization task.

#keys   share   AP     AP50   AP75
-       -       35.6   55.5   39.0
20      no      36.1   54.9   39.6
50      no      37.0   55.8   40.6
20      yes     37.7   56.5   41.4
50      yes     38.5   57.0   42.3
100     yes     38.3   56.9   42.0
200     yes     38.2   56.7   41.9

CLS (ct.)   REG (cn.)   AP     AP50   AP75   AP90
-           -           35.6   55.5   39.0   9.3
integer     integer     37.0   55.6   40.8   11.0
integer     sub-pixel   38.0   56.1   41.7   12.5
sub-pixel   integer     37.2   56.7   41.2   10.4
sub-pixel   sub-pixel   38.5   57.0   42.3   12.6

Table 1: Ablation on key selection approaches
Table 2: Ablation of sub-pixel corner/centers

geometry          memory    FLOPs          AP     AP50   AP75
baseline          -         239G           35.6   55.5   39.0
appearance only   -         264G           37.4   56.7   40.4
non-shared        +5690M    468G (+204G)   38.3   57.2   41.7
shared            +134M     266G (+2G)     38.5   57.0   42.3

unit length   size   AP     AP50   AP75
-             -      38.2   56.7   41.8
-             -      38.5   57.0   42.3
-             -      38.4   56.8   42.2
-             -      38.3   56.9   42.1
-             -      38.1   56.7   41.8

Table 3: Effect of shared relative location embedding
Table 4: Comparison of different unit length and size of the shared location map

Shared relative location embedding As shown in Table 3, compared with direct computation of the position embedding Hu et al. (2018), the proposed shared location embedding approach saves memory cost (+134M vs +5690M) and FLOPs (+2G vs +204G) in the geometry term computation, while achieving slightly better performance (38.5 AP vs 38.3 AP).

The ablation of the unit length and the size of the shared location map in Table 4 indicates stable performance across settings; we adopt the best-performing setting by default.

Separate BVR modules for classification and regression Table 5 ablates the effect of using separate BVR modules for classification and regression, indicating that the center representation is a more suitable auxiliary for classification and the corner representation is a more suitable auxiliary for regression.

Effect of appearance and geometry terms Table 6 ablates the effect of the appearance and geometry terms. Using the two terms together outperforms using the appearance term alone by 1.1 AP and using the geometry term alone by 0.9 AP. In general, the geometry term benefits more at larger IoU criteria, and less at lower IoU criteria.

Comparison with multi-task learning Including an auxiliary point head without using its outputs boosts the RetinaNet baseline only slightly. Noting that BVR brings a 2.9 AP improvement (from 35.6 to 38.5) under the same settings, the major improvements are not due to multi-task learning.

CLS    REG    AP     AP50   AP75   AP90
none   none   35.6   55.5   39.0   9.3
none   ct.    36.4   54.7   38.9   10.1
none   cn.    37.5   54.6   40.3   12.2
ct.    none   37.3   56.6   39.9   10.5
cn.    none   36.2   55.1   38.4   9.8
ct.    cn.    38.5   57.0   42.3   12.6

appearance   geometry   AP     AP50   AP75   AP90
-            -          35.6   55.5   39.0   9.3
yes          -          37.4   56.7   41.3   10.7
-            yes        37.6   55.8   41.5   12.0
yes          yes        38.5   57.0   42.3   12.6

Table 5: Effect of different representations (‘ct.’: center, ‘cn.’: corner) for classification and regression
Table 6: Ablation of appearance and geometry terms

method                 AP     AP50   AP75
RetinaNet              35.6   55.5   39.0
RetinaNet + NL         37.0   57.0   39.3
RetinaNet + BVR        38.5   57.0   42.3
RetinaNet + NL + BVR   39.4   58.2   42.5

method                     AP     AP50   AP75
Faster R-CNN               37.4   58.1   40.4
Faster R-CNN + ORM         38.4   59.0   41.3
Faster R-CNN + BVR         39.3   59.5   43.1
Faster R-CNN + ORM + BVR   40.4   60.6   44.0

Table 7: Compatibility with the non-local module (NL) Wang et al. (2018)
Table 8: Compatibility with the object relation module (ORM) Hu et al. (2018). ResNet-50-FPN is used

method                  #conv   #ch.   FLOPs   params   AP
RetinaNet               4       256    239G    38M      35.6
RetinaNet (deep)        5       256    265G    39M      35.2
RetinaNet (wide)        4       288    267G    39M      35.6
RetinaNet + BVR         4       256    266G    39M      38.5
RetinaNet + GN          4       256    239G    38M      36.5
RetinaNet (deep) + GN   5       256    265G    39M      36.8
RetinaNet (wide) + GN   4       288    267G    39M      36.5
RetinaNet + GN + BVR    4       256    266G    39M      39.2

method               AP            AP50   AP75
RetinaNet            42.9          63.4   46.9
RetinaNet + BVR      44.7 (+1.8)   64.9   49.0
Faster R-CNN         45.0          66.2   48.8
Faster R-CNN + BVR   46.5 (+1.5)   67.4   50.5
FCOS                 46.1          65.0   49.6
FCOS + BVR           47.6 (+1.5)   66.2   51.4
ATSS                 48.3          67.1   52.6
ATSS + BVR           50.3 (+2.0)   69.0   55.0

Table 9: Complexity analysis
Table 10: BVR for four representative detectors using a ResNeXt-64x4d-101-DCN backbone

Complexity analysis Table 9 shows the FLOPs analysis, measured at the default test resolution. The proposed BVR module introduces about 1M more parameters (39M vs 38M) and about 11% more computation (266G vs 239G) than the original RetinaNet. We also run RetinaNet with a heavier head network so that it has parameters and computations similar to our approach. By adding one more conv layer, the accuracy slightly drops to 35.2 AP, probably due to the increased difficulty in optimization. We introduce a GN layer after every head conv layer to alleviate this, after which one additional conv layer improves the accuracy by 0.3 AP (36.5 to 36.8). These results indicate that the improvements by BVR are mostly not due to more parameters and computation.

The real inference speeds of different models on a V100 GPU (fp32 mode) are shown in Table 11. With a ResNet-50 backbone, the BVR module usually adds less than 10% overhead; with a larger ResNeXt-101-DCN backbone, it usually adds about 3% or less.

4.2 BVR is Complementary to Other Attention Mechanisms

The BVR module acts differently from the pioneering non-local module Wang et al. (2018) and relation module Hu et al. (2018), which also model dependencies between representations. While the BVR module models relationships between different kinds of representations, the latter modules model relationships within the same kind of representation (pixels Wang et al. (2018) and proposal boxes Hu et al. (2018)). To compare with the object relation module (ORM) Hu et al. (2018), we first apply BVR to enhance RoIAlign features with corner/center representations, following the same process as in Figure 4(a). The enhanced features are then used to perform object relation between proposals. Different from Hu et al. (2018), only a small set of keys is sampled to make the module more efficient. Table 8 shows that the BVR module and the relation module are largely complementary: on top of the Faster R-CNN baseline, ORM obtains a 1.0 AP improvement, while our BVR improves AP by 1.9; applying BVR on top of ORM further improves AP by 2.0 (38.4 to 40.4). Tables 7 and 8 show that the BVR module is mostly complementary to the non-local and object relation modules.

4.3 Generally Applicable to Representative Detectors

We apply the proposed BVR to four representative frameworks, i.e., RetinaNet Lin et al. (2017b), Faster R-CNN Ren et al. (2015); Lin et al. (2017a), FCOS Tian et al. (2019) and ATSS Zhang et al. (2020), as shown in Table 10. The ResNeXt-64x4d-101-DCN backbone, multi-scale training and a longer schedule (20 epochs) are adopted to test whether our approach remains effective on strong baselines. The BVR module consistently improves these strong detectors.


method backbone FPS FPS (+BVR)


Faster R-CNN ResNet-50/ResNeXt-101-DCN 21.3/7.5 19.5/7.3
RetinaNet ResNet-50/ResNeXt-101-DCN 18.9/7.0 17.4/6.8
FCOS ResNet-50/ResNeXt-101-DCN 22.7/7.4 20.7/7.2
ATSS ResNet-50/ResNeXt-101-DCN 19.6/7.1 17.9/6.9


Table 11: Inference speed (FPS) with and without the BVR module.
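The relative overhead of BVR follows directly from the FPS numbers in Table 11; a small Python check:

```python
# FPS (baseline, +BVR) taken from Table 11, for ResNet-50 and ResNeXt-101-DCN
fps = {
    "Faster R-CNN": ((21.3, 19.5), (7.5, 7.3)),
    "RetinaNet":    ((18.9, 17.4), (7.0, 6.8)),
    "FCOS":         ((22.7, 20.7), (7.4, 7.2)),
    "ATSS":         ((19.6, 17.9), (7.1, 6.9)),
}

def overhead(base, with_bvr):
    """Relative slowdown introduced by BVR, as a fraction of baseline FPS."""
    return (base - with_bvr) / base

for method, (r50, x101) in fps.items():
    print(f"{method}: {overhead(*r50):.1%} (R-50), {overhead(*x101):.1%} (X-101-DCN)")
```

For every framework in the table, the overhead stays below 10% with ResNet-50 and below 3% with ResNeXt-101-DCN, consistent with the discussion above.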


method backbone AP AP AP AP AP AP
DCN v2* Zhu et al. (2019) ResNet-101-DCN 46.0 67.9 50.8 27.8 49.1 59.5
SNIPER* Singh et al. (2018) ResNet-101 46.5 67.5 52.2 30.0 49.4 58.4
RepPoints* Yang et al. (2019) ResNet-101-DCN 46.5 67.4 50.9 30.3 49.7 57.1
MAL* Ke et al. (2020) ResNeXt-101 47.0 66.1 51.2 30.2 50.1 58.9
CentripetalNet* Dong et al. (2020) Hourglass-104 48.0 65.1 51.8 29.0 50.4 59.9
ATSS* Zhang et al. (2020) ResNeXt-64x4d-101-DCN 50.7 68.9 56.3 33.2 52.9 62.4
TSD* Song et al. (2020) SENet154-DCN 51.2 71.9 56.0 33.8 54.8 64.2
RelationNet++ (ours) ResNeXt-64x4d-101-DCN 50.3 69.0 55.0 32.8 55.0 65.8
RelationNet++ (ours)* ResNeXt-64x4d-101-DCN 52.7 70.4 58.3 35.8 55.3 64.7


Table 12: Results on the MS COCO test-dev set; ‘*’ denotes multi-scale testing.

4.4 Comparison with State-of-the-Arts

We build our detector by applying the BVR module to the strong ATSS detector, which achieves 50.7 AP on COCO test-dev using multi-scale testing with the ResNeXt-64x4d-101-DCN backbone. Our approach improves it by 2.0 AP, reaching 52.7 AP. Table 12 shows the comparison with state-of-the-art methods.

5 Conclusion

In this paper, we present a new module, BVR, which bridges various other visual representations into the main representation of a detector through an attention mechanism like that in the Transformer Vaswani et al. (2017). The BVR module can be applied as a plug-in to an existing detector, and proves broadly effective for prevalent object detection frameworks, i.e., RetinaNet, Faster R-CNN, FCOS and ATSS, where consistent AP improvements are achieved. We reach 52.7 AP on COCO test-dev by improving a strong ATSS detector. The resulting network is named RelationNet++, which advances the relation modeling in Hu et al. (2018) from bbox-to-bbox to across heterogeneous object/part representations.

Broader Impact

This work aims for better object detection algorithms. Any object-oriented visual application may benefit from this work, as object detection is usually an indispensable component of such systems. As with most other detectors, there may be unpredictable failures. The consequences of such failures depend on the downstream application, so please do not use the method in scenarios where failures would lead to serious consequences. The method is data-driven, and its performance may be affected by biases in the data, so please also be careful about the data collection process when using it.


  1. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. In ECCV.
  2. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. (2019). MMDetection: Open MMLab detection toolbox and benchmark. arXiv:1906.07155.
  3. Chen, Y., Zhang, Z., Cao, Y., Wang, L., Lin, S., and Hu, H. (2020). RepPoints v2: Verification meets regression for object detection. arXiv:2007.08508.
  4. Dai, J., Li, Y., He, K., and Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In NeurIPS.
  5. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y. (2017). Deformable convolutional networks. In ICCV.
  6. Dong, Z., Li, G., Liao, Y., Wang, F., Ren, P., and Qian, C. (2020). CentripetalNet: Pursuing high-quality keypoint pairs for object detection. In CVPR.
  7. Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., and Tian, Q. (2019). CenterNet: Object detection with keypoint triplets. In ICCV.
  8. Girshick, R. (2015). Fast R-CNN. In ICCV.
  9. Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
  10. Gu, J., Hu, H., Wang, L., Wei, Y., and Dai, J. (2018). Learning region features for object detection. In ECCV.
  11. He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In ICCV.
  12. Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. (2018). Relation networks for object detection. In CVPR.
  13. Ke, W., Zhang, T., Huang, Z., Ye, Q., Liu, J., and Huang, D. (2020). Multiple anchor learning for visual object detection. In CVPR.
  14. Kong, T., Sun, F., Liu, H., Jiang, Y., and Shi, J. (2019). FoveaBox: Beyond anchor-based object detector. arXiv:1904.03797.
  15. Law, H. and Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In ECCV.
  16. Law, H., Teng, Y., Russakovsky, O., and Deng, J. (2019). CornerNet-Lite: Efficient keypoint based object detection. arXiv:1904.08900.
  17. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017a). Feature pyramid networks for object detection. In ICCV.
  18. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object detection. In ICCV.
  19. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
  20. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In ECCV.
  21. Lu, X., Li, B., Yue, Y., Li, Q., and Yan, J. (2019). Grid R-CNN. In CVPR.
  22. Redmon, J. and Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In CVPR.
  23. Redmon, J. and Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv:1804.02767.
  24. Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
  25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. (2015). ImageNet large scale visual recognition challenge. IJCV.
  26. Samet, N., Hicsonmez, S., and Akbas, E. (2020). HoughNet: Integrating near and long-range evidence for bottom-up object detection. In ECCV.
  27. Singh, B., Najibi, M., and Davis, L. S. (2018). SNIPER: Efficient multi-scale training. In NeurIPS.
  28. Song, G., Liu, Y., and Wang, X. (2020). Revisiting the sibling head in object detector. In CVPR.
  29. Tian, Z., Shen, C., Chen, H., and He, T. (2019). FCOS: Fully convolutional one-stage object detection. In ICCV.
  30. Tychsen-Smith, L. and Petersson, L. (2017). DeNet: Scalable real-time object detection with directed sparse sampling. In ICCV.
  31. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
  32. Wang, J., Chen, K., Yang, S., Loy, C. C., and Lin, D. (2019). Region proposal by guided anchoring. In CVPR.
  33. Wang, X., Girshick, R. B., Gupta, A., and He, K. (2018). Non-local neural networks. In CVPR.
  34. Wei, F., Sun, X., Li, H., Wang, J., and Lin, S. (2020). Point-set anchors for object detection, instance segmentation and pose estimation. In ECCV.
  35. Yang, Z., Liu, S., Hu, H., Wang, L., and Lin, S. (2019). RepPoints: Point set representation for object detection. In ICCV.
  36. Yang, Z., Xu, Y., Xue, H., Zhang, Z., Urtasun, R., Wang, L., Lin, S., and Hu, H. (2020). Dense RepPoints: Representing visual objects with dense point sets. In ECCV.
  37. Zhang, S., Chi, C., Yao, Y., Lei, Z., and Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR.
  38. Zhou, X., Wang, D., and Krähenbühl, P. (2019a). Objects as points. arXiv:1904.07850.
  39. Zhou, X., Zhuo, J., and Krähenbühl, P. (2019b). Bottom-up object detection by grouping extreme and center points. In CVPR.
  40. Zhu, X., Hu, H., Lin, S., and Dai, J. (2019). Deformable convnets v2: More deformable, better results. In CVPR.