Empirical Upper Bound in Object Detection and More


Object detection remains one of the most notorious open problems in computer vision. Despite large strides in accuracy in recent years, modern object detectors have started to saturate on popular benchmarks, raising the question of how far we can reach with deep learning tools and tricks. Here, by employing 2 state-of-the-art object detection benchmarks, and analyzing more than 15 models over 4 large-scale datasets, we I) carefully determine the upper bound in AP, which is 91.6% on VOC (test2007), 78.2% on COCO (val2017), and 58.9% on OpenImages V4 (validation), regardless of the IOU threshold. These numbers are much better than the mAP of the best model (47.9% on VOC, and 46.9% on COCO; IOUs=.5:.95), II) characterize the sources of errors in object detectors, in a novel and intuitive way, and find that classification error (confusion with other classes and misses) explains the largest fraction of errors and weighs more than localization and duplicate errors, and III) analyze the invariance properties of models when the surrounding context of an object is removed, when an object is placed in an incongruent background, and when images are blurred or flipped vertically. We find that models generate boxes on empty regions and that context is more important for detecting small objects than larger ones. Our work taps into the tight relationship between recognition and detection and offers insights for building better models.


1 Introduction and Motivation

Several years of extensive research on object detection have resulted in the accumulation of an overwhelming amount of knowledge regarding model backbones, tricks for model training and optimization, data collection and annotation, and model evaluation and comparison [68], to the point that separating the wheat from the chaff is very difficult. As an example, truly understanding and implementing Average Precision (AP) is frustratingly difficult. A quick Google search returns numerous blogs and code repositories with discrepant explanations of AP. To make matters worse, it is not clear whether AP has started to saturate, whether progress is significant, and, more importantly, how far we can improve by following the current path, making one wonder whether we have reached the peak of performance using deep learning. Further, we do not know what is holding us back from making progress in object detection, compared to human-level (although debatable) accuracy on object recognition.

To shed light on the above matters, first we systematically and carefully approximate the empirical upper bound in AP. We hypothesize that the upper bound AP (UAP) is the score of the best recognition model that is trained on the training target bounding boxes and is then used to label the testing target boxes. We also investigate whether visual context surrounding a target object or its overlapping boxes can improve the upper bound AP. Second, we identify bottlenecks by characterising the type of errors that object detectors make and measure the impact of each one on performance. Third, we study the invariance properties of various object detectors on different types of transformations including incongruent context, scale, blur, vertical flip, etc.

Figure 1: Upper bound AP (in red) and scores of the best model (in blue; FCOS [59] on VOC and FASHION, and Hybrid Task Cascade [19] on COCO). It shows that scale remains the major problem in object detection.

In a nutshell, we find that there is a large gap between the performance of the best detection models and the empirical upper bound, as shown in Fig. 1. This implies that there is hope of reaching this peak with the current tools, if we can find smarter ways to adopt object recognition models for object detection. We also find that classification remains the major bottleneck in object detection and is more critical for small objects. Specifically, object detection models inherit the main limitation of CNNs, which is the lack of invariance. Example failure cases include generating many bounding boxes on a white background containing a single object, and failing to detect objects in incongruent contexts, vertically flipped images, or blurred ones. It seems that humans can still manage to solve these tasks, although with more effort and lower performance than on intact images.

2 Related Work

We discuss three lines of related work. The first includes works that strive to understand detection approaches, identify their shortcomings, and pinpoint where more research is needed. Parikh et al. [49] aimed to find the weakest links in person detectors by replacing different components in a pipeline (e.g. part detection, non-maxima-suppression) with human annotations. Mottaghi et al. [46] proposed human-machine CRFs for identifying bottlenecks in scene understanding. Hoiem et al. [33] inspected detection models in terms of their localization errors, confusion with other classes, and confusion with the background on the PASCAL dataset. They also conducted a meta-analysis to measure the impact of object properties such as color, texture, and real-world size on detection performance. We replicate, simplify and extend this work on the larger COCO dataset and on image transformations. Russakovsky et al. [55] analyzed the ImageNet localization task and emphasized fine-grained recognition. Zhang et al. [64] measured how far we are from solving pedestrian detection. Vondrick et al. [61] proposed a method for visualizing object detection features to gain insights into their functioning. Other related works in this line include [38, 67, 63].

The second line concerns research comparing object detection models. Some works have analyzed and reported statistics and performance over benchmark datasets such as PASCAL VOC [26, 25], MSCOCO [40], CityScapes [22], and OpenImages [37]. Recently, Huang et al. [35] performed a speed/accuracy trade-off analysis of modern object detectors. Dollar et al. [24] and Borji et al. [15, 17, 16] compared models for person detection and salient object detection, respectively. In [44], Michaelis et al. assessed detection models on degraded images and observed a performance drop of about 30–60%, which could be mitigated by data augmentation. To address the shortcomings of the AP score, some works have introduced alternative [29] or complementary evaluation measures [47, 53]. A large number of works have also assessed object recognition models and their robustness (e.g. [56, 12, 51, 45]).

Works in the third line study the role of context in object detection and recognition (e.g. [13, 62, 43, 32, 60, 50, 54, 27]). Heitz et al. [32] proposed a probabilistic framework to capture contextual information between “stuff” and “things” to improve detection. Barnea et al. [14] utilized co-occurrence relations among objects to improve the detection scores. Divvala et al. [23] explored different types of context in recognition. See also [32, 21, 57, 34, 43, 11].

3 Experimental Setup

3.1 Benchmarks

We base our analysis on two recent large-scale object detection benchmarks: MMDetection [6, 20] and Detectron2 [4]. The former evaluates more than 25 models. The latter includes several variants of FastRCNN [28]. In both benchmarks, all COCO models have been trained on train2017 and evaluated on val2017. Here, we use MMDetection to train and test additional models on a new dataset.

3.2 Models

We consider the latest models published in the major vision conferences and the ones included in the above benchmarks. Several variants of the RCNN model including FasterRCNN [52], MaskRCNN [30], RetinaNet [39], GridRCNN [42], LibraRCNN [48], CascadeRCNN [18], MaskScoringRCNN [36], GAFasterRCNN [66], and Hybrid Task Cascade [19] are considered. We also include SSD [41], FCOS [59], and CenterNet [65]. Different backbones for each model are also taken into account.

3.3 Datasets

We employ 4 datasets: PASCAL VOC [25], our home-brewed FASHION dataset, MSCOCO [40], and OpenImages [37]. Over VOC, we use trainval0712 for training (16,551 images, 47,223 boxes) and test2007 (4,952 images, 14,976 boxes) for testing. This dataset has 20 categories. Our FASHION dataset covers 40 categories of clothing items (39 + humans). The trainval and test sets for this dataset contain 206,530 images (776,172 boxes) and 51,650 images (193,689 boxes), respectively. Fig. 5 displays samples from this dataset (see Supplement for stats). This is a challenging dataset since clothing items are non-rigid, as opposed to COCO or VOC objects. MSCOCO has 80 categories. It has carried the torch for benchmarking advances in object detection for the past 6 years. We use train2017 for training (118,287 images, 860,001 boxes) and val2017 (5,000 images, 36,781 boxes) for testing. Finally, we use the OpenImages V4 dataset, used in the Kaggle competition [10]. It has 500 classes and contains 1,743,042 images (12,195,144 boxes) for training and 41,620 images (226,811 boxes) for validation (used here for testing).

3.4 Metrics

We use the COCO API [2] to measure AP at IOU thresholds of 0.5 and 0.75, as well as the average AP over IOUs in the range 0.5:.05:0.95. APs are calculated per class and are then averaged. We also report breakdown APs over small (area < 32²), medium (32² < area < 96²), and large (area > 96²) objects. Please see [2, 8, 5] for details.
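As a minimal sketch of the summary metrics above, the following shows how the ten IOU thresholds in 0.5:.05:0.95 can be enumerated and averaged, and how a box area maps to the small/medium/large buckets of the COCO convention (32² and 96² pixels). The function names and the `ap_at_threshold` dictionary are hypothetical and used for illustration only; this is not the COCO API, and the handling of boxes exactly on a bucket boundary is simplified.

```python
# Sketch only: names are illustrative assumptions, not COCO API calls.

def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Return 0.50, 0.55, ..., 0.95, rounded to avoid float drift."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


def mean_ap(ap_at_threshold):
    """Average per-threshold AP values over IOU = 0.5:0.05:0.95.

    ap_at_threshold: dict mapping each threshold to an AP value (hypothetical input).
    """
    ts = iou_thresholds()
    return sum(ap_at_threshold[t] for t in ts) / len(ts)


def size_bucket(box_area):
    """Map a box area in pixels to a size bucket (boundaries simplified)."""
    if box_area < 32 ** 2:
        return "small"
    if box_area < 96 ** 2:
        return "medium"
    return "large"
```

Note that a detector whose AP is identical at every threshold, as is the case for the upper bound discussed later, has a mean AP equal to that single value.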

4 Characterizing the Empirical Upper Bound

We hypothesize that the empirical upper bound in AP is the score of a detector with ground truth bounding boxes labeled by the best object classifier. The classification score is considered as the detection score. This way we essentially assume that the localization problem is solved and what remains is only object recognition. However, it might be possible to improve upon this detector in at least two ways: a) by exploiting local context around an object to improve classification accuracy and hence better UAP, and b) by searching over the scene and finding boxes that are easier to classify (compared to the target box) and have enough overlap with the target box. This does not matter for the perfect IOU but may affect IOUs lower than one. We carefully investigate these possibilities in the following.

Figure 2: Illustration of visual context surrounding an object.
Dataset    object only                      object + context                 context only
           0.2   0.4   0.6   0.8   1        1.2   1.4   1.6   1.8   2        1.2   2     all img
VOC        39.3  68.0  82.6  92.5  94.8     93.0  91.6  90.6  88.6  87.0     63.6  64.9  35.3
FASHION    -     52.9  66.4  71.7  88.8     82.3  77.2  71.8  67.9  64.8     29.0  32.2  12.0
COCO       -     -     67.1  79.8  86.7     82.9  78.3  72.5  67.4  63.0     43.7  48.9  11.0
OpenImg.   -     -     -     -     69.0     65.1  62.7  -     -     -        -     -     -

Table 1: Recognition accuracy using object and/or its context.

4.1 Utility of the surrounding context

We trained ResNet152 [31] (see supp.) on target boxes in three settings as shown in Fig. 2: I) object only, II) object + context, and III) context only. Standard data augmentation techniques including mean pixel subtraction, color jittering, random horizontal flip and random rotation (10 degrees) were applied. Boxes were resized to 224 × 224 pixels and models were trained for 15 epochs. Trained models were tested on the original object box. Results (top-1 accuracy) shown in Table 1 reveal that the canonical object size contains the most information regarding the category of an object over all four datasets. Enlarging or shrinking the object box lowers the performance. The context-only scenario leads to a high classification score but still falls below the other cases. Stretching the context to the whole scene drops the performance significantly. Training and testing models under the same condition (i.e. both on object+context) results in higher accuracy on that specific condition but does not lead to better overall recognition accuracy.
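To make the object-only and object + context crops concrete, here is a minimal sketch of how a target box might be shrunk or enlarged by the scale factors listed in Table 1. It assumes the factor multiplies the box width and height about the box center and that the result is clipped to the image; the text does not state these details, so both are assumptions, as is the helper name `scale_box`.

```python
# Sketch under assumptions: scaling about the box center, clipping to the image.

OBJECT_ONLY_SCALES = [0.2, 0.4, 0.6, 0.8, 1.0]
OBJECT_PLUS_CONTEXT_SCALES = [1.2, 1.4, 1.6, 1.8, 2.0]


def scale_box(box, scale, img_w, img_h):
    """Scale an (x1, y1, x2, y2) box about its center and clip it to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(img_w), cx + half_w),
        min(float(img_h), cy + half_h),
    )
```

For example, `scale_box((10, 10, 30, 30), 2.0, 100, 100)` doubles the box side while keeping its center, and the clipping keeps the crop inside a 100 × 100 image.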

4.2 Searching for the best label

Essentially, the problem here is how to obtain the best classification accuracy for objects in a scene by utilizing all the information in the scene. This is different from recognition approaches that treat objects in isolation. Note that recognition accuracy is not the same as AP, since detection scores also matter in AP calculation.

Having the best classifier at hand, we are ready to approximate the empirical upper bound in AP. Before delving into details, let's recap how AP is calculated.

AP calculation. For each category, detections over all images are sorted according to their confidences. Starting from the top of this list, the target with the highest IOU with each detection is considered. We have a true positive (TP; hit) if their IOU is at or above the IOU threshold and that target has not been assigned yet. We have a false positive (FP) if the IOU is below the threshold (i.e. localization error) or if the target has already been assigned (i.e. duplicate; two predictions on the same target). A target box can be matched with only one detection (the one with the highest confidence score and IOU). If a detection has sufficient IOU with two targets, it is assigned to the one with the highest IOU which is not assigned already. Scanning the sorted detection list again, a precision for each recall is obtained and is used to draw the Recall-Precision (RP) curve and to compute the AP. See [8, 5] for details.
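The matching procedure above can be sketched for a single category as follows. This is an illustrative implementation under stated assumptions, not the COCO or VOC evaluation code: the function names and input layout are hypothetical, and the area under the curve is computed with all-point interpolation rather than COCO's sampling at 101 recall levels.

```python
# Sketch of single-category AP with greedy matching; not the official evaluation code.

def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_precision(detections, targets, thr=0.5):
    """detections: list of (image_id, score, box); targets: {image_id: [box, ...]}."""
    n_targets = sum(len(boxes) for boxes in targets.values())
    if n_targets == 0:
        return 0.0
    assigned = {img: [False] * len(boxes) for img, boxes in targets.items()}
    flags = []  # True for a TP, False for an FP, in order of decreasing confidence
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best, best_iou = -1, thr
        for j, target in enumerate(targets.get(img, [])):
            overlap = iou(box, target)
            # Highest-IOU target that is still free and passes the threshold.
            if not assigned[img][j] and overlap >= best_iou:
                best, best_iou = j, overlap
        if best >= 0:
            assigned[img][best] = True
        flags.append(best >= 0)

    # Precision and recall after each detection in the sorted list.
    precisions, recalls, tp = [], [], 0
    for k, is_tp in enumerate(flags, start=1):
        tp += is_tp
        precisions.append(tp / k)
        recalls.append(tp / n_targets)
    # Make precision monotonically non-increasing, then integrate over recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

When the detections are the target boxes themselves, every matching IOU equals 1, so the value of `thr` no longer affects the result; only the labels and confidence scores do.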

We explore two strategies in pursuit of the upper bound AP. In the first strategy, we apply the best classifier from the previous section to the target boxes. The detector built in this fashion gives the same AP regardless of the IOU threshold, since our detections are target boxes. As we argued above, it is not possible to improve this detector at IOU=1. However, if we are interested in the upper bound for a lower IOU threshold, then it might be possible to do better by searching among the candidate boxes near a target box and choosing the one that can be classified better than the target box, or by aggregating information from nearby boxes. Thus, in our second strategy, we sample boxes around an object and either apply the original classifier (trained on canonical object size) or train and test new classifiers on the surrounding boxes. In any case, we always keep the target box but change its label and/or its confidence. First, let's take a look at our box sampling approach, which is illustrated in Fig. 3.

Figure 3: A) Illustration of our setup for finding boxes with IOU with the target box (corresponding to ; for ), B) The solutions are 4 curves represented by Eqs. 4 to 8. Four sample rectangles are shown with dashed lines.

Sampling boxes with IOU above a threshold. Here, we are interested in finding the coordinates of the top-left corner of all rectangles with IOU with the ground-truth bounding box. We use the coordinate system centered at the top-left corner of the target box ; which can be easily converted to the image level coordinate frame. The bottom-right coordinates of the desired rectangles that intersect with the target box from the top-left follow the equation , where and are width and height of rectangles, respectively (we assume all boxes have the same width and height as the target box). According to the illustration in Fig. 3(A), we have:


From these equations and assuming , and , it is easy to derive the following equations:


and also:


The same equation governs the coordinates of the bottom-left, top-left, and top-right corners of the rectangles intersecting with the target box at points , , and , respectively (in the coordinate frames centered as these points, in order). Calculating the top-left corner of these rectangles (in their corresponding coordinate frames) and representing them in the coordinate frame of point , we arrive at the following four equations (note that these are not lines):
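The geometry behind this sampling can be sketched numerically. Under the stated assumption that every candidate box has the same width and height as the target, a box shifted by (dx, dy) overlaps the target in a (w - |dx|) by (h - |dy|) rectangle, and the union is 2wh minus that intersection. Requiring the IOU to be at least t is then equivalent to requiring the intersection to be at least 2t·w·h / (1 + t), which gives a hyperbolic boundary in each of the four shift directions. The symbols w, h, t, dx, dy and the function names below are hypothetical, chosen for illustration; the sketch restates the geometry and does not claim to reproduce the numbered equations.

```python
# Sketch under the same-size assumption; symbols and names are illustrative.

def shifted_iou(w, h, dx, dy):
    """IOU between a w-by-h box and a copy of it shifted by (dx, dy)."""
    ow, oh = w - abs(dx), h - abs(dy)
    if ow <= 0 or oh <= 0:
        return 0.0
    inter = ow * oh
    return inter / (2.0 * w * h - inter)


def max_abs_dy(w, h, dx, t):
    """Largest |dy| that keeps IOU >= t for a horizontal shift dx (negative if none)."""
    ow = w - abs(dx)
    if ow <= 0:
        return -1.0
    return h - (2.0 * t * w * h) / ((1.0 + t) * ow)


def boundary_offsets(w, h, t, frac=0.5):
    """Four shifts, one per direction, that lie exactly on the IOU = t boundary.

    frac picks the horizontal shift as a fraction of the largest admissible one,
    which is w * (1 - t) / (1 + t) when dy = 0.
    """
    dx = frac * w * (1.0 - t) / (1.0 + t)
    dy = max_abs_dy(w, h, dx, t)
    return [(sx * dx, sy * dy) for sx in (-1, 1) for sy in (-1, 1)]
```

For a 100 × 50 target and t = 0.5, the four returned shifts each produce a box whose IOU with the target is 0.5; any smaller shift gives a higher IOU.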


Dataset    Acc.    Most Confident Box          Most Frequent Label
           93.7    88.7  91.7  81.4  63.8      89.1  92.0  82.9  60
FASHION    87.4    68.1  68.6  61.9  49.5      67.7  68.2  60.7  47.8
COCO       84.8    76.9  81.8  80.6  62.8      76.4  82.0  80.4  60.7

Table 2: Results of our second strategy for estimating the upper bound AP (i.e. searching for the best bounding box or object label near a target box; among boxes with IOU above the threshold). Notice that the upper bound AP is the same across IOU thresholds. Underlined numbers show where we could improve over the 1st strategy.

Using the above equations, we then sample some rectangles (here 4) with IOU above the threshold (Fig. 3(B)) and label them with the label of the target box. We then train a new classifier (same ResNet152 as above) on these boxes. This is effectively a new data augmentation technique. Notice that AP is a direct consequence of the classification accuracy, so if we can better classify objects we can obtain a better AP. To estimate UAP, we sample a number of rectangles (=4) near a target box (all with IOU above the threshold), and then label the target box with: a) the label (and confidence) of the box with the highest classification score (i.e. most confident box), or b) the most frequent label among the nearby boxes (with the maximum confidence score among them).
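The two labeling rules can be sketched as follows, given the classifier outputs for the boxes sampled near one target. The function names and the `(label, confidence)` input layout are hypothetical. Two details are assumptions of this sketch rather than statements from the text: the confidence attached to the most frequent label is taken as the maximum over boxes carrying that label, and ties in frequency are broken by confidence.

```python
from collections import Counter

# Sketch only: preds is a list of (label, confidence) pairs for boxes near one target.


def most_confident(preds):
    """Rule a): label and confidence of the box with the highest classification score."""
    return max(preds, key=lambda p: p[1])


def most_frequent(preds):
    """Rule b): most frequent label, with the highest confidence among its boxes."""
    counts = Counter(label for label, _ in preds)
    top = max(counts.values())
    tied = {label for label, count in counts.items() if count == top}
    # Assumption: frequency ties are resolved by the highest confidence.
    return max((p for p in preds if p[0] in tied), key=lambda p: p[1])
```

In both cases the target box itself is kept as the detection; only its label and confidence are replaced by the returned pair.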

4.3 Upper bound results

Here, we report classification scores, upper bound APs, score of the models (mean AP over all IOUs; unless specified otherwise), and the breakdown AP over categories.

Comparison of strategies. Summary results of the first strategy are shown in Fig. 1. As expected, UAPs over all IOUs are the same and are much better than the model scores. To our surprise, our second strategy did not lead to better UAP values, except for a few cases including UAPs over medium and small objects on the FASHION dataset and small objects on COCO (using most confident boxes), as shown in Table 2. Applying the original classifier, instead of training new ones on surrounding boxes, or only sampling boxes with higher IOU (e.g. 0.9) did not improve the results. Also, setting the confidence of detections to 1 lowers the UAP. We attribute the failure of the 2nd strategy to the fact that the surrounding boxes may contain additional visual content which may introduce noise in the labels. This leads to a lower classification accuracy and hence a lower AP. Therefore, in what follows we only discuss the results from the first strategy.

Figure 4: Model scores and upper bound AP over PASCAL VOC dataset using VOC (left) and COCO APIs (right). Categories are sorted based on the average model AP. Bar charts show classification scores. Solid red and dashed black lines represent upper bound AP, and the best model AP, respectively.
Figure 5: Upper bound and model APs over the FASHION dataset.
Figure 6: APs over COCO dataset borrowed from the MMDetection benchmark [6]. We add CenterNet results to MMDetection.

PASCAL VOC. Fig. 4 shows results using both VOC and COCO evaluation APIs. The VOC evaluation code is based on IOU=0.5 and calculates the area under the PR curve slightly differently from COCO. For VOC, we adopt the code from the CenterNet repository [1]. We have trained and tested 5 models on this dataset including FasterRCNN, FCOS, SSD512, and two variants of CenterNet. The classification accuracy on VOC is very high (94.7%). Consequently, the UAP is very high (91.6 using the COCO API). The FCOS model does the best here with an AP of 47.9 (right panel in Fig. 4; dashed lines). As can be seen, there is a large gap between the AP of the best model and the UAP on this dataset (45). Models are consistent in their performance across different categories.

FASHION. Results are shown in Fig. 5. The best classification accuracy on this dataset is 88.8% (Table 1, and supplement). The UAP is 71.2 and the AP of the best model is 59.7 (FCOS). Interestingly, FCOS performs quite close to the upper bound at IOU=0.5 (Fig. 1). Models perform better here than over VOC. The FASHION UAP is lower than the VOC UAP, perhaps because classification is more challenging on the former dataset. The gap between UAP and model AP here, however, is much smaller than on VOC. This could be partly due to the fact that FASHION scenes have less clutter and larger objects than the VOC scenes. While per-class UAP is above the AP of the best model over all VOC classes, UAPs of 5 FASHION categories fall below the best model AP (messenger bags, tunics, long sleeve shirts, blouses, and rompers). Looking at the classification scores, we find that these categories have low classification accuracy.

COCO. Existing benchmarks have provided an efficient ecosystem for developing, evaluating and comparing detection models, especially on the COCO dataset. They provide trained models over a variety of settings. Borrowing the MMDetection benchmark and adding the results from CenterNet to it, we end up comparing 15 models (71 in total; combination of models and backbones). Model scores are shown in Fig. 6. The best models here are the Hybrid Task Cascade model [19] and Cascade MaskRCNN [18], with APs of 46.9 and 45.7, respectively. See supplement for Recall results. The upper bound AP on COCO is about 78.2. Recall that UAP does not depend on the IOU threshold since detected boxes are classified ground truth targets. The gap between the best model AP and UAP is above 30. The gap is much smaller for AP at IOU=0.5, where it is about 10. The UAP is much lower over small objects than over large objects. This also holds for models. The gap between UAP and model AP over small objects is about 35, which is much higher than the gap over medium or large objects.

Breakdown APs over object categories are shown in Fig. 7. For this analysis, we use the Detectron2 benchmark, which reports per-category results mainly over the RCNN model family. We noticed that the aggregate scores on MMDetection and Detectron2 are quite consistent. Among the 18 variants of Faster-RCNN and MASK-RCNN, the best model has an AP of 44.3 (shown by the dashed line), which is lower than both the best available model on COCO (46.9; Fig. 1) and the upper bound AP. Among 80 classes, only three (snowboard, toothbrush, and toaster) have UAPs below the best model APs.

Figure 7: Detection APs over the MSCOCO dataset borrowed from the Detectron2 benchmark [4]. The horizontal dashed line corresponds to the best model among the shown models. “*”: The best AP here is 44.3, which is smaller than the best so far on COCO (46.9). See also Fig. 1.

OpenImages. This dataset [37] is the latest endeavor in object detection and is much more challenging than its predecessors. Our classifier achieves 69.0% top-1 accuracy on the validation set of OpenImages V4, which is lower than on the other three datasets. We achieve 58.9 UAP on this dataset using the TensorFlow evaluation API for computing AP [9], which differs from the COCO AP calculation (here we discarded grouping and super-category). We are not aware of any model scores on this set of OpenImages V4.

AP vs. classification accuracy. We found a positive linear correlation (0.81 on COCO) between the UAP and the classification accuracy (Fig. 8). The higher the classification accuracy, the higher the UAP. We did not find a correlation between the accuracy and model APs, nor between the object size and accuracy (or UAP). The dependency of UAP on accuracy highlights the importance of recognition in object detection and constitutes the core of our analyses in the next two sections.

Figure 8: Correlation between classification accuracy and upper bound AP. The higher the Acc., the better the UAP.

5 Error Diagnosis

To pinpoint the shortcomings of object detectors, we follow the analysis by Hoiem et al. [33], but revise it in two major ways. First, instead of inspecting errors across categories, here we perform a per-category error analysis (i.e. in a binary manner). This simplifies the process and makes it easier to understand. See Fig. 9. We combine all types of class confusions (e.g. similar classes, other classes, and background in Hoiem et al. [33]) into two types of classification errors: a) confusion with the background (Type I), and b) misses (Type II). Notice that this implicitly contains the above misclassification types but is much easier to investigate. In fact, recent object detectors such as FCOS [59] and CenterNet [65] also adopt this strategy to classify objects (i.e. an object is of a particular class or is not). Second, Hoiem et al. successively remove errors to reach the AP of 1. We argue that this approach conflates different error types and does not correctly reflect the true contribution of errors (i.e. it understates or exaggerates error types). For example, according to the COCO analysis tool [2], any matches to objects with a different class label but in the same supercategory do not count as either a FP or a TP. Also, the COCO tool removes mislocalized predictions. In this case, we argue that correcting the mislocalized predictions is more effective than removing them because it can reveal other sources of weakness in a model. For example, it may lead to generating duplicates which would have been overlooked by removing the detections. In contrast, here we explicitly handle the errors by removing, correcting or adding detections when appropriate. Similar to Hoiem et al., our analysis is also based on IOU=0.5.

We repeat the following procedure for each category-image pair (shown in Fig. 9; from left to right). First, we remove the detections with the maximum IOU with any target (i.e. classification error Type I; confusion with the background). Second, we correct the miss-localized predictions with . In this step, coordinates of these boxes are replaced with their matching target box coordinates (which is the target with the max IOU) while their confidence scores and labels are preserved. Third, duplicates (i.e. redundant detections) are removed. An unmatched detection is considered duplicate if it falls (i.e. has IOU) over a target with an already assigned detection (with higher score). See supplement for details. Fourth, eventually, misses are treated. A miss is a target with with any unmatched detection, and is added to the list of detections (with score of 1). Before performing this step, we set the coordinates of detections as the coordinates of their matching targets, since we now know which prediction is paired with which target (i.e. one to one mapping; no duplicates).
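The four steps can be sketched for one category-image pair as below. This is a simplified illustration under assumptions, not the exact procedure: the function names are hypothetical, the matching threshold of 0.5 follows the IOU used for this analysis, and the lower threshold `bg_thr=0.1` that separates background confusion from mislocalization is an assumed value that is not given in the text here. Each detection is paired with its highest-IOU target, which simplifies the duplicate test.

```python
# Sketch only: bg_thr is an assumed value; names are illustrative.

def iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def diagnose(dets, targets, bg_thr=0.1, match_thr=0.5):
    """dets: list of (score, box); targets: list of boxes. Returns (counts, fixed dets)."""
    counts = {"background": 0, "localization": 0, "duplicate": 0, "miss": 0}
    kept = []  # (score, index of the paired target)
    for score, box in dets:
        overlaps = [iou(box, t) for t in targets]
        best = max(range(len(targets)), key=lambda j: overlaps[j]) if targets else -1
        best_iou = overlaps[best] if targets else 0.0
        if best_iou < bg_thr:
            counts["background"] += 1  # step 1: remove confusion with background
            continue
        if best_iou < match_thr:
            counts["localization"] += 1  # step 2: snap the box to its target
        kept.append((score, best))

    claimed = {}  # target index -> score of the highest-scoring detection on it
    for score, j in sorted(kept, key=lambda d: -d[0]):
        if j in claimed:
            counts["duplicate"] += 1  # step 3: drop redundant detections
        else:
            claimed[j] = score

    counts["miss"] = len(targets) - len(claimed)
    # Step 4: add each missed target as a detection with score 1.
    fixed = [(claimed.get(j, 1.0), t) for j, t in enumerate(targets)]
    return counts, fixed
```

After all four steps the returned detections coincide with the targets one to one, which corresponds to the final column of 100 in Table 3.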

Results of error diagnosis are shown in Table 3 for 3 models over 3 datasets. We start from the original detection set and progressively measure the impact of fixing each error type in the order explained above and shown in Fig. 9. Confusion with the background (and other classes; see above) has the highest contribution to the overall error, across all models. This indicates that models often falsely confuse background clutter or other classes as a particular object category. The second most important error type is misses. Interestingly, localization error weighs more than duplicates and has higher impact on COCO and VOC datasets than the FASHION dataset, possibly because the former two contain a larger number of small objects. Conversely, over the FASHION dataset, duplicates matter more, perhaps because class confusion is higher (e.g. confusion in slippers vs. sandals; different types of hats, etc.). Models behave almost consistently across the three datasets.

We also cross-checked our results against those obtained using the COCO analysis tool (implementing Hoiem et al. [33]). Notice that numbers from the COCO analysis tool are not directly comparable to ours since our strategy is different and, unlike ours, it does not explicitly address duplicate errors. Nevertheless, based on APs and PR curves in Fig. 10, we arrive at similar conclusions. Here, again we observe that classification error Type I (Sim, Oth, and BG in Fig. 10) accounts for the largest fraction of errors, followed by misses (FN) and localization (Loc) errors.

Figure 9: Illustration of different error types in object detection.
Dataset    Model       mAP    - Cls. (Type I)   + Local.   - Duplicates   + Misses
FASHION    MaskRCNN    54.1   85.9              87.7       88.7           100
           CenterNet   54.0   88.8              91.7       96.2           100
           FCOS        59.7   90.1              91.9       95.9           100

COCO       MaskRCNN    42.1   70.1              79.0       82.7           100
           CenterNet   39.2   66.1              78.0       81.7           100
           FCOS        42.8   69.6              80.8       85.4           100

VOC        MaskRCNN    47.3   73.7              78.8       79.7           100
           CenterNet   47.8   79.0              88.5       92.6           100
           FCOS        47.9   76.3              85.0       90.3           100
Table 3: Quantifying the contribution of errors in object detection. “Local.” and “Dup.” stand for localization error and duplicate removal, respectively. mAP is the model AP over all IOUs.

6 Invariance Analysis

Complementary to our error diagnosis, here we conduct a series of experiments to reduce the impact of localization or recognition in detection pipelines (one at a time). Our principal emphasis is on the recognition component. These experiments are performed over the COCOval2017 set and are illustrated in Fig. 11. Models trained over the COCOtrainval0712 set are employed.

Analysis of context. In the first experiment, we generated stimuli in which a single object was placed in a white background or in a white noise background (one object per image, hence number of images equal to the number of objects). Contrary to our expectation, we found that models either underestimate or overestimate the distribution of target bounding boxes. Fig. 12 shows the difference in distribution of predicted boxes and distribution of ground-truth boxes. Interestingly, models search all over the place. FasterRCNN and RetinaNet oversample boxes around targets, while FCOS generates a fair amount. This hints towards the shortcomings in objectness prediction in models. Quantitative results, presented in Table 4, show that models perform poorly on these images (about the same in both conditions but lower than the original images). They are hindered much more on small objects than medium or large ones, which shows how critical context is for recognition and detection of small objects. Interestingly, in white/noise BG and object-only cases, the AP-large increases but the AP-small decreases (compared to orig. images). FCOS, ranking higher on original images, does better here as well.

In the second experiment, the object-only case, we removed the image background and preserved all the objects (hence the same number of images as in COCOval2017). To our surprise, FCOS and SSD performed better on these images than on the original ones (Column 1 vs. 10 in Table 4). Compared to the original images, they did better on large objects and worse on small objects in the object-only case.
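A minimal sketch of how such stimuli might be produced is given below: pixels inside the kept object regions are copied and everything else is filled with white. Passing a single box corresponds to the one-object-per-image stimuli of the first experiment, and passing all boxes of an image to the object-only case. Using rectangular boxes rather than segmentation masks, the list-of-rows image layout, and the name `keep_objects` are all assumptions of this sketch.

```python
# Sketch under assumptions: grayscale image as a list of rows, box-shaped objects.

def keep_objects(image, boxes, fill=255):
    """Return a copy of image where pixels outside every box are set to fill.

    boxes: (x1, y1, x2, y2) with integer coordinates, end-exclusive.
    """
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                out[y][x] = image[y][x]
    return out
```

A noise background could be obtained in the same way by initializing `out` with random values instead of a constant.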

Figure 10: Quantifying the contribution of errors in object detection using the COCO analysis code [2].

In the third experiment, we paste objects in incongruent backgrounds (e.g. a boat in the street), similar to Rosenfeld et al. [54] but over a larger dataset and including more models (they did not report AP). We paste 9 objects including bear, keyboard, refrigerator, surfboard, train, tv, cake, horse, and oven on 100 images taken from the FASHION dataset; 900 images in total. Results are given in Table 5. Interestingly, models performed well on this dataset. They failed drastically on surfboard and oven, which seem to be a little hard for humans as well. Cake, bear, and horse were the easiest ones. FCOS did the best among models.

Figure 11: Analysis of the impact of context and invariance in object detection. The bottom-left panel shows the distribution of target object boxes (COCOval2017) in log scale (See supplement).
Model        white BG              noise BG              objects_only
FasterRCNN   31.1  42.0  36.1      31.8  39.8  36.8      35.9  55.8  39.5
RetinaNet    33.1  41.0  37.3      32.7  39.1  36.6      39.8  58.4  43.4
FCOS         34.5  42.0  37.1      34.2  39.8  37.4      43.6  60.6  46.9
SSD512       27.4  36.7  32.3      26.0  33.4  34        30.5  48.6  32.9

FasterRCNN   7.5   35.9  49.9      7.0   36.6  52.1      17.5  40.6  48.6
RetinaNet    8.3   37.5  53.2      6.4   38.3  54.2      18.9  44.5  56.4
FCOS         8.5   39.8  55.2      9.4   39.5  54.8      22.1  48.8  58.7
SSD512       7.0   31.4  45.1      4.6   29.3  45.2      9.8   35.7  48.4

Table 4: Results of invariance analysis over COCOval2017.
Model        train  horse  bear  surfboard  cake  tv    keyboard  oven  fridge  Avg.
             64.0   58.4   84.7  2.4        77.9  74.3  54.7      15.5  20.3    50.2
RetinaNet    54.2   89.2   90.6  2          85.7  86.6  10.1      24.8  69.3    57.0
FCOS         73.4   91.5   94.0  17.1       87.6  92.1  9.8       44.2  76.2    65.1
SSD512       84.3   58.9   78.5  3.8        76.9  69.8  42.6      8.4   47.6    52.3

Avg.         69.0   74.5   87.0  6.3        82.0  80.7  29.3      23.2  53.4    56.2
Table 5: Model APs(IOU=.5) over objects in incongruent contexts.

Robustness to image transformations. In the fourth experiment, we evaluated models on objects that were a) cropped directly out of the image, or b) cropped and resized such that their smallest dimension became 300 pixels (while preserving the aspect ratio). Models performed terribly in both cases, although RetinaNet did better than the others (Table 6). Poor performance here demonstrates how sensitive models are to object scale and that they lack robustness to object appearance. Visually inspecting the images, we found it very difficult to recognize the cropped objects, especially the small ones.
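The crop-and-resize step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the COCO-style `(x, y, w, h)` box format, and the nearest-neighbour resampling are assumptions made here to keep the example dependency-free, and the box is assumed to be non-degenerate and inside the image.

```python
import numpy as np

def resized_shape(height, width, target_min=300):
    """Return (new_h, new_w) with the smaller side equal to target_min."""
    scale = target_min / min(height, width)
    return max(round(height * scale), 1), max(round(width * scale), 1)

def crop_and_resize(image, box, target_min=300):
    """Crop an (x, y, w, h) box and rescale it, preserving aspect ratio.

    Nearest-neighbour sampling is used here only for simplicity
    (an assumption; the interpolation method is not stated in the text).
    """
    x, y, w, h = (int(round(v)) for v in box)
    crop = image[y:y + h, x:x + w]
    src_h, src_w = crop.shape[:2]
    new_h, new_w = resized_shape(src_h, src_w, target_min)
    # Map every output row/column back to its nearest source index.
    rows = np.minimum((np.arange(new_h) * src_h / new_h).astype(int), src_h - 1)
    cols = np.minimum((np.arange(new_w) * src_w / new_w).astype(int), src_w - 1)
    return crop[rows][:, cols]
```

For a 40 × 20 pixel box, the scale factor is 15, which shows how strongly small objects are magnified and helps explain why such crops are hard to recognize.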

The fifth and sixth experiments test models on Gaussian-blurred images (with an 11 × 11 kernel) and vertically flipped images, respectively. Results in Table 6 show that both transformations dramatically hinder performance, with a higher impact for vertical flip. We do not have a baseline for human performance on these cases, but a quick inspection shows that it is still possible to detect objects, albeit with more effort. RetinaNet and FCOS outperform the other models here.
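Both transformations are easy to reproduce. The sketch below uses only NumPy; the function names, the standard deviation `sigma=2.0`, the edge-replicated border handling, and the COCO-style `(x, y, w, h)` boxes are assumptions, since the text specifies only the 11 × 11 kernel size. Note that a vertical flip also requires remapping the ground-truth boxes.

```python
import numpy as np

def gaussian_kernel_1d(size=11, sigma=2.0):
    """Normalized 1-D Gaussian kernel (sigma is an assumed value)."""
    offsets = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(image, size=11, sigma=2.0):
    """Separable Gaussian blur of an H x W x C image; returns float64."""
    kernel = gaussian_kernel_1d(size, sigma)
    pad = size // 2
    padded = np.pad(image.astype(np.float64),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    smooth = lambda v: np.convolve(v, kernel, mode="valid")
    blurred = np.apply_along_axis(smooth, 0, padded)   # filter along height
    blurred = np.apply_along_axis(smooth, 1, blurred)  # filter along width
    return blurred

def vertical_flip(image, boxes):
    """Flip the image top-to-bottom and remap (x, y, w, h) boxes."""
    height = image.shape[0]
    flipped = image[::-1].copy()
    new_boxes = [(x, height - y - h, w, h) for x, y, w, h in boxes]
    return flipped, new_boxes
```

Blurring leaves the box coordinates unchanged, whereas the flip moves each box to `height - y - h` along the vertical axis.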

Analysis of errors. Here we measure the impact of each error type in three detection tasks: object-only, Gaussian blur, and vertical flip. See Table 7 for results. Across the three tasks, the error types in order of importance are misses, localization, misclassification (Type I), and duplicates. Models miss more objects in the vertical flip and Gaussian blur cases than in the objects-only case. There is less confusion with background in the objects-only case than in the original images (classification Type I), since there is no background clutter.

Figure 12: Distribution of predicted boxes on COCOval2017 (log scale).
Figure 13: Samples of our dataset of objects in incongruent background.
Model     | crop                | Gaussian blur       | vertical flip       | orig img.
Fa.RCNN   |  8.4   15.0    8.2  | 17.1   29.6   17.4  | 15.5   27.3   15.7  | 36.4   58.4
RetinaNet | 16.9   22.7   18.8  | 21.5   34.7   22.5  | 18.7   30.7   19.3  | 40.0   60.9
FCOS      | 14.3   18.5   15.3  | 21.0   33.7   21.6  | 19.1   30.2   19.6  | 42.8   62.6
SSD512    | 13.4   18.9   14.9  | 15.1   26.6   15.2  | 12.1   22.2   11.9  | 29.3   49.2

Fa.RCNN   |  0      1.3   18.7  |  3.8   18.3   31.5  |  6.2   16.6   24.7  | 21.5   46.6
RetinaNet |  1      5.2   34.1  |  5.1   22.8   39.0  |  7.5   20.5   29.5  | 23.5   52.6
FCOS      |  1      4.5   32.2  |  5.3   22.5   37.4  |  8.0   20.8   30.0  | 26.5   54.5
SSD512    |  1      2.9   25.7  |  2.0   15.2   30.9  |  4.0   12.6   22.5  | 11.8   44.7

Table 6: Additional results of invariance analysis over COCOval2017 dataset. Fa.RCNN = FasterRCNN.
Dataset       | Model      | mAP  | - Cls. (Type I) | + Local. | - Duplicates | + Misses
objects only  | FasterRCNN | 55.8 | 61.5            | 69.3     | 75.2         | 100
              | RetinaNet  | 58.4 | 64.6            | 72.6     | 79.9         | 100
              | FCOS       | 60.6 | 67.8            | 77.0     | 82.3         | 100

Gaussian blur | FasterRCNN | 29.6 | 37.2            | 47.4     | 55.2         | 100
              | RetinaNet  | 34.7 | 42.3            | 53.5     | 64.3         | 100
              | FCOS       | 33.7 | 43.1            | 56.8     | 65.3         | 100

vertical flip | FasterRCNN | 27.3 | 37.0            | 49.6     | 57.3         | 100
              | RetinaNet  | 30.7 | 41.1            | 54.1     | 64.6         | 100
              | FCOS       | 30.2 | 41.3            | 57.1     | 65.6         | 100

Table 7: Error analysis of models over transformed images.
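
One way to read Table 7 is to take the difference between consecutive columns as the mAP recovered by fixing each error type. The sketch below does this for the FasterRCNN objects-only row. It assumes the columns are cumulative, i.e., each reports mAP after additionally correcting that error type; this is an interpretation made here, and the helper name `error_contributions` is hypothetical.

```python
def error_contributions(cumulative):
    """Per-error-type mAP gain from an ordered list of (label, mAP) pairs.

    The first pair is the baseline mAP; each later pair is assumed to be
    the mAP after also fixing that error type (a cumulative reading of
    the table, which is an assumption).
    """
    gains = {}
    for (_, previous), (label, current) in zip(cumulative, cumulative[1:]):
        gains[label] = round(current - previous, 1)
    return gains

# FasterRCNN on the objects-only images, values copied from Table 7.
faster_rcnn_objects_only = [
    ("mAP", 55.8),
    ("Cls. (Type I)", 61.5),
    ("Local.", 69.3),
    ("Duplicates", 75.2),
    ("Misses", 100.0),
]

gains = error_contributions(faster_rcnn_objects_only)
for label, gain in gains.items():
    print(f"{label}: +{gain} mAP")
```

Under this reading, misses account for by far the largest gap in this row (24.8 points), followed by localization (7.8 points).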

7 Conclusion and Outlook

Through exhaustive analyses, we found that a) models perform significantly below what is empirically possible, b) the performance gap is larger over small objects, indicating that scale is one of the major problems in object detection, c) the bottleneck in object detection is object recognition, and d) detection models lack generalization in terms of searching the right places, utilizing context, recognition of small objects, and robustness to image transformation. We did not find a significant contribution from the surrounding context of a target or its nearby overlapping boxes to better classify it. A further investigation of this with extensive data augmentation and optimization may increase the accuracy but is unlikely to drastically improve the UAP. To evaluate the recognition component of a model, one can feed the target boxes to a model and collect its decisions on them. This is, however, cumbersome and needs to be coded for each model separately, whereas our diagnosis tool is general.

We invite researchers to periodically update the upper bound in detection scores, including AP and other recently proposed ones such as LRP [47] and probability-based detection quality (PDQ) [29], as new object recognition models surface. The same can also be repeated for other tasks such as semantic and instance segmentation. Further, our new diagnosis tool can be employed to pinpoint weaknesses in other object detection models.

Please download the supplementary material from here:


  1. The best published mAP on COCO test-dev is 51.0 by EfficientDet [58]. See [3] for the latest results on COCO dataset.
  2. Our code is publicly available at [7].


  1. CenterNet GitHub repository. https://github.com/xingyizhou/CenterNet.
  2. COCO evaluation. http://cocodataset.org/#detection-eval.
  3. COCO evaluation server. https://competitions.codalab.org/competitions/20794#results.
  4. Detectron2 benchmarks. https://github.com/facebookresearch/detectron2.
  5. Evaluating object detectors: Average precision (AP), and localization-recall-precision (LRP). https://medium.com/@kemal.oksz/which-one-to-measure-the-performance-of-object-detectors-ap-or-olrp-936d072a6eb0.
  6. MMDetection benchmarks. https://github.com/open-mmlab/mmdetection.
  7. Object Detection Upperbound. https://github.com/aliborji/DeetctionUpperbound.git.
  8. Object detection metrics. https://github.com/rafaelpadilla/Object-Detection-Metrics.
  9. Open images challenge evaluation. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/challenge_evaluation.md.
  10. Openimages kaggle challenge. https://www.kaggle.com/c/open-images-2019-object-detection.
  11. Faisal Alamri and Nicolas Pugeault. Contextual relabelling of detected objects. arXiv preprint arXiv:1906.02534, 2019.
  12. Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
  13. Moshe Bar. Visual objects in context. Nature Reviews Neuroscience, 5(8):617, 2004.
  14. Ehud Barnea and Ohad Ben-Shahar. Exploring the bounds of the utility of context for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7412–7420, 2019.
  15. Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE transactions on image processing, 24(12):5706–5722, 2015.
  16. Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE transactions on pattern analysis and machine intelligence, 35(1):185–207, 2012.
  17. Ali Borji, Dicky N Sihite, and Laurent Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1):55–69, 2012.
  18. Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
  19. Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
  20. Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  21. Zhe Chen, Shaoli Huang, and Dacheng Tao. Context refinement for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 71–86, 2018.
  22. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  23. Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert. An empirical study of context in object detection. In 2009 IEEE Conference on computer vision and Pattern Recognition, pages 1271–1278. IEEE, 2009.
  24. Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence, 34(4):743–761, 2011.
  25. Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
  26. Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  27. Carolina Galleguillos and Serge Belongie. Context based object categorization: A critical survey. Computer vision and image understanding, 114(6):712–722, 2010.
  28. Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  29. David Hall, Feras Dayoub, John Skinner, Peter Corke, Gustavo Carneiro, and Niko Sünderhauf. Probability-based detection quality (pdq): A probabilistic approach to detection evaluation. arXiv preprint arXiv:1811.10800, 2018.
  30. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  31. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  32. Geremy Heitz and Daphne Koller. Learning spatial context: Using stuff to find things. In European conference on computer vision, pages 30–43. Springer, 2008.
  33. Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In European conference on computer vision, pages 340–353. Springer, 2012.
  34. Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9401–9411, 2018.
  35. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7310–7311, 2017.
  36. Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
  37. Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
  38. Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, and Larry S Davis. An analysis of pre-training on object detection. arXiv preprint arXiv:1904.05871, 2019.
  39. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  40. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  41. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  42. Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and Junjie Yan. Grid r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7363–7372, 2019.
  43. Sophie Marat and Laurent Itti. Influence of the amount of context learned for improving object classification when simultaneously learning object and contextual cues. Visual Cognition, 20(4-5):580–602, 2012.
  44. Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.
  45. Dmytro Mishkin, Nikolay Sergievskiy, and Jiri Matas. Systematic evaluation of convolution neural network advances on the imagenet. Computer Vision and Image Understanding, 161:11–19, 2017.
  46. Roozbeh Mottaghi, Sanja Fidler, Alan Yuille, Raquel Urtasun, and Devi Parikh. Human-machine crfs for identifying bottlenecks in scene understanding. IEEE transactions on pattern analysis and machine intelligence, 38(1):74–87, 2015.
  47. Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Localization recall precision (lrp): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 504–519, 2018.
  48. Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 821–830, 2019.
  49. Devi Parikh and C Lawrence Zitnick. Finding the weakest link in person detectors. In CVPR 2011, pages 1425–1432. Citeseer, 2011.
  50. Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge J Belongie. Objects in context. In ICCV, volume 1, page 5. Citeseer, 2007.
  51. Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? arXiv preprint arXiv:1902.10811, 2019.
  52. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  53. Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
  54. Amir Rosenfeld, Richard Zemel, and John K Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
  55. Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C Berg, and Li Fei-Fei. Detecting avocados to zucchinis: what have we done, and where are we going? In Proceedings of the IEEE International Conference on Computer Vision, pages 2064–2071, 2013.
  56. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  57. Zheng Song, Qiang Chen, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing object detection and classification. In CVPR 2011, pages 1585–1592. IEEE, 2011.
  58. Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.
  59. Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
  60. Antonio Torralba and Pawan Sinha. Statistical context priming for object detection. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 763–770. IEEE, 2001.
  61. Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. Hoggles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
  62. Lior Wolf and Stanley Bileschi. A critical view of context. International Journal of Computer Vision, 69(2):251–261, 2006.
  63. Peng Zhang, Jiuling Wang, Ali Farhadi, Martial Hebert, and Devi Parikh. Predicting failures of vision systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3566–3573, 2014.
  64. Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. How far are we from solving pedestrian detection? In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1259–1267, 2016.
  65. Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  66. Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. arXiv preprint arXiv:1904.05873, 2019.
  67. Xiangxin Zhu, Carl Vondrick, Deva Ramanan, and Charless C Fowlkes. Do we need more training data or better models for object detection? In BMVC, volume 3, page 5. Citeseer, 2012.
  68. Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055, 2019.