Learning Instance-Aware Object Detection Using Determinantal Point Processes
Abstract
Recent object detectors find instances while categorizing candidate regions in an input image. As each region is evaluated independently, the number of candidate regions from a detector is usually larger than the number of objects. Since the final goal of detection is to assign a single detection to each object, an additional algorithm, such as non-maximum suppression (NMS), is used to select a single bounding box for an object. While simple heuristic algorithms, such as NMS, are effective for standalone objects, they can fail to detect overlapped objects. In this paper, we address this issue by training a network to distinguish different objects while localizing and categorizing them. We propose an instance-aware detection network (IDNet), which learns to extract features from candidate regions and measure their similarities. Based on pairwise similarities and detection qualities, IDNet selects an optimal subset of candidate bounding boxes using determinantal point processes (DPPs). Extensive experiments demonstrate that the proposed algorithm performs favorably against existing state-of-the-art detection methods, particularly for overlapped objects, on the PASCAL VOC and MS COCO datasets.
Nuri Kim Seoul National University nuri.kim@cpslab.snu.ac.kr Donghoon Lee Seoul National University donghoon.lee@cpslab.snu.ac.kr Songhwai Oh Seoul National University songhwai.oh@cpslab.snu.ac.kr
Preprint. Work in progress.
1 Introduction
Object detection is one of the fundamental problems in computer vision. Its goal is to locate objects that belong to a set of target categories in an image [girshick2014rich, girshick2015fast, ren2015faster, redmon2016you, redmon2016yolo9000, liu2016ssd]. It has received a lot of attention because of its wide range of applications, such as object tracking [andriluka2008people], surveillance [tian2005robust], and face detection [ranjan2017hyperface]. Most state-of-the-art detectors achieve significant performance improvements based on deep convolutional neural networks.
Despite the advances in object detection, it is still difficult to assign correct detections to all objects in an image, since detectors do not distinguish different object instances of the same class: they focus only on an instance-agnostic task, i.e., object category classification. This issue becomes critical when objects overlap. As shown in Figure 1, the bounding box of the person in the striped shirt is not detected due to the overlapped bounding boxes in proximity.
In order to address this issue, we develop a method which can compare appearances of bounding boxes while considering their spatial arrangements. It is in line with how a human perceives proximity and similarity to distinguish object instances [koffka2013principles]. The goal of this paper is to find the most representative set of bounding boxes by extracting features of object instances, which combine visual differences and spatial positions, in addition to performing object classification. We propose an instance-aware detection network (IDNet), which learns to differentiate different instances of objects. IDNet uses an existing detector, such as Faster R-CNN, as a component to obtain candidate bounding boxes. Given candidate boxes, IDNet extracts features for all candidates using a CNN branch, named a region identification network (RIN). To this end, IDNet is trained not only with the classical losses of existing detectors, such as a classification loss and a bounding box regression loss, but also with novel losses based on determinantal point processes (DPPs) [kulesza2012determinantal]. Using the property that DPPs can describe the repulsiveness of fermion systems in quantum physics [kulesza2012determinantal], we design an instance-aware detection loss (ID loss), which learns to increase the probability of selecting an optimal subset. Additionally, we address the problem of multiple bounding boxes on a single object. For example, as shown in Figure 1, there are two bounding boxes categorized as a sheep and a cow for the same object. Since the objective of a detector is to find a single bounding box for a single object instance, we propose a sparse-score loss (SS loss) to make IDNet assign a single bounding box to each object, considering all categories.
In particular, we formulate a loss that suppresses falsely categorized bounding boxes by optimizing the weights of IDNet to produce low confidence scores for bounding boxes with incorrect class labels.
Since DPPs involve computing determinants, using DPPs as a loss function to train deep neural networks introduces numerical challenges. We address this problem by scaling detection quality scores. Then, we formulate an optimization problem to select a subset of detections composed of representative bounding boxes. After training, our algorithm efficiently finds an optimal set of detections using the log-submodular property of DPPs [kulesza2012determinantal]. Experimental results show that IDNet performs favorably against state-of-the-art detectors, such as Faster R-CNN [ren2015faster] and LDDP [azadi2017learning], on the PASCAL VOC [everingham2010pascal] and MS COCO [lin2014microsoft] datasets. In an ablation study, we demonstrate that our method is more robust for detecting overlapped objects, achieving a 22.3% improvement over Faster R-CNN on PASCAL VOC.
2 Related Work
Class-aware detection algorithms.
The goal of class-aware or multi-class object detection methods is to localize objects in an image while predicting the category of each object. These systems are usually composed of region proposal networks and region classification networks [girshick2015fast, ren2015faster, liu2016ssd]. To improve detection accuracy, a number of different optimization formulations and network architectures have been proposed [ren2015faster, kong2016hypernet, azadi2017learning, redmon2016you, liu2016ssd, redmon2016yolo9000, dai2016r]. Ren et al. [ren2015faster] use convolutional networks, called region proposal networks, to obtain region proposals and combine them with Fast R-CNN. Kong et al. [kong2016hypernet] concatenate each layer's features to construct the final feature for detecting small objects in an image. A real-time multi-class object detector is proposed in [redmon2016you] by combining region proposal networks and classification networks. Liu et al. [liu2016ssd] improve the performance of [redmon2016you] using multiple detectors for each convolutional layer. To increase network efficiency, fully connected layers are replaced by convolution layers in [dai2016r]. Redmon et al. [redmon2016yolo9000] extend [redmon2016you] to classify thousands of categories using the hierarchical structure of categories in the dataset. DPPs have been used to improve detection quality before. Azadi et al. [azadi2017learning] propose to suppress background bounding boxes using DPPs. However, this method focuses on adjusting background detection scores and uses a fixed visual similarity matrix from WordNet, while our algorithm learns the similarity matrix from data.
Instance-aware algorithms.
Instance-aware methods have been developed to provide finer solutions in different problem domains. Instance-aware segmentation aims to label instances at the pixel level [dai2016instance, ren2017end]. Dai et al. [dai2016instance] propose a cascade network which finds each instance stage by stage. Similar to RIN, the network in [dai2016instance] computes features of each instance. Ren et al. [ren2017end] use a recurrent neural network to sequentially find each instance. A face detector which takes key points of faces as an input is suggested in [li2016face]. The dataset for this application contains face labels for identifying each face, while standard object detection datasets only have a small number of categories. In object detection, Lee et al. [lee2016individualness] provide an inference method to find an optimal subset for binary-class detection considering the individualness of each candidate box. However, their approach is limited to a single-class detection problem. Besides, instead of training networks, they use features computed from a network pretrained on the ImageNet dataset [deng2009imagenet]. The proposed method tackles a challenging multi-class detection task by learning distinctive features of object instances.
3 Proposed Method
As shown in Figure 2, IDNet is composed of VGG-16 for image feature extraction, a region proposal network (RPN), a region classification network (RCN), and a region identification network (RIN) (see the detailed structure of RIN in Appendix D). Based on image feature maps from VGG-16, RPN determines whether objects exist in the regions of interest (RoIs). Then, RCN proposes candidate boxes while locating and classifying them. RIN computes instance features of the candidates, which are used by DPPs.
3.1 Determinantal Point Processes for Detection
Suppose that there are $N$ candidate bounding boxes $\mathcal{B} = \{b_1, \ldots, b_N\}$, where $b_i$ is the $i$th bounding box. A determinantal point process (DPP) defines a probability distribution over subsets of $\mathcal{B}$ as follows [kulesza2012determinantal]. If $\mathbf{Y}$ is a DPP, then

$\mathcal{P}(\mathbf{Y} = Y) = \dfrac{\det(L_Y)}{\det(L + I)}, \qquad (1)$

where $Y \subseteq \mathcal{B}$, the kernel matrix $L$ is a real symmetric positive semidefinite $N \times N$ matrix, the indexed kernel matrix $L_Y$ is the submatrix of $L$ indexed by the elements of $Y$, and $I$ is the identity matrix. The kernel matrix can be decomposed as $L = \Phi \Phi^T$, where $\Phi$ is a feature matrix for candidate bounding boxes with each row extracted from RIN. Similarly, the indexed kernel matrix can be decomposed as $L_Y = \Phi_Y \Phi_Y^T$.

Let $s_i$ be the detection score for the $i$th bounding box $b_i$. We first scale the detection score between 0 and 1 using $\bar{s}_i = (s_i - s_{\min}) / (s_{\max} - s_{\min})$, where $s_{\min}$ and $s_{\max}$ are the minimum and maximum possible values of the detection scores, respectively. Let $q_i$ be the detection quality of $b_i$, a monotonically rescaled version of $\bar{s}_i$ chosen to avoid numerical issues during training.¹ Let $\mathbf{q} = [q_1, \ldots, q_N]^T$ be the detection quality vector for all detection candidates. The feature $\phi_i$ for $b_i$ is extracted from the last layer of RIN and normalized to unit length. The intersection over union between $b_i$ and $b_j$ can be calculated from the box coordinates, giving a matrix $M$ with $M_{ij} = \mathrm{IoU}(b_i, b_j)$. A similarity matrix $S$ is constructed by blending the spatial similarity $M_{ij}$ and the visual similarity $\phi_i^T \phi_j$ with a ratio $\rho$. Using the detection quality vector $\mathbf{q}$ and the similarity matrix $S$, the kernel matrix for a DPP can be formed as $L = S \circ \mathbf{q}\mathbf{q}^T$, where $\circ$ is the Hadamard product.²

¹ Naive logit scores or normalized scores might cause numerical overflow or underflow while calculating determinants, particularly when there are many detection candidates.
² The notation used in this paper is summarized in Appendix A.
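To make the construction above concrete, the following NumPy sketch builds the kernel $L = S \circ \mathbf{q}\mathbf{q}^T$ from detection scores, RIN features, and box coordinates, and evaluates the DPP log-probability (1) of a subset. The blend weight `rho` and the exponential rescaling `base ** (1 - s)` of the quality term are illustrative assumptions, not the paper's exact rescaling.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU between all pairs of [x1, y1, x2, y2] boxes."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def dpp_kernel(scores, feats, boxes, rho=0.6, base=0.25):
    """Sketch of the kernel L = S ∘ (q q^T).

    rho  : assumed blend weight between spatial (IoU) and visual similarity
    base : assumed base of the quality rescaling (the paper's exact
           rescaling is not reproduced here)
    """
    # Scale raw scores to [0, 1], then damp them to keep determinants
    # away from overflow/underflow (cf. footnote 1).
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    q = base ** (1.0 - s)          # monotone in s, bounded in (0, 1]

    # Unit-normalize RIN features; visual similarity is their cosine.
    phi = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    vis = phi @ phi.T

    S = rho * pairwise_iou(boxes) + (1.0 - rho) * vis
    return S * np.outer(q, q)      # Hadamard product with q q^T

def dpp_log_prob(L, subset):
    """log P(Y = subset) = log det(L_Y) - log det(L + I), as in (1)."""
    _, logdet_Y = np.linalg.slogdet(L[np.ix_(subset, subset)])
    _, logdet_Z = np.linalg.slogdet(L + np.eye(len(L)))
    return logdet_Y - logdet_Z
```

Because $S$ is positive semidefinite and $\mathbf{q}\mathbf{q}^T$ is rank-one PSD, the Hadamard product is a valid (PSD) DPP kernel by the Schur product theorem.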
If the similarity matrix $S$ and detection qualities $\mathbf{q}$ are correctly assigned, a subset which maximizes (1) is a collection of the most distinctive detections, due to the property of the determinant in a DPP [kulesza2012determinantal]. Since IDNet is trained to maximize the probability (1) of the ground-truth detections, IDNet learns the most distinctive features and correctly scaled detection scores to separate different object instances, which are needed to correctly compute $S$ and $\mathbf{q}$.
3.2 Learning Detection Quality
As RCN classifies each RoI into all categories, the number of candidate boxes is equal to the number of RoIs multiplied by the number of categories. As there are multiple bounding boxes with different categories for each RoI, multiple classes often have detection scores higher than a certain threshold. For example, a detector may report a horse bounding box near a cow, as the two classes are visually similar. Conventional methods, such as NMS, typically suppress bounding boxes within each class; in this case, even if there is a true bounding box for the cow, the horse bounding box cannot be suppressed. To alleviate this issue, we refine the scores of the top bounding boxes, i.e., the bounding boxes with the highest detection scores. We assume that the categories of the top bounding boxes are visually similar to the correct category. By suppressing the scores of the visually similar categories, we can obtain a single bounding box with the correct category for each object.
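As a small illustration of forming the set of top-scoring candidates per RoI, the sketch below picks the $k$ highest-scoring non-background classes for each region. The score-matrix layout and the background-in-column-0 convention are assumptions for illustration.

```python
import numpy as np

def topk_class_boxes(cls_scores, k=4):
    """Indices of the k highest-scoring non-background classes per RoI.

    cls_scores : (R, C) class scores per RoI; column 0 is assumed to be
                 the background class.
    Returns a list of (roi_index, class_index) pairs.
    """
    fg = cls_scores[:, 1:]                    # drop the background column
    top = np.argsort(-fg, axis=1)[:, :k] + 1  # top-k class ids per RoI
    return [(r, c) for r in range(fg.shape[0]) for c in top[r]]
```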
Let $B$ be the union of all top bounding boxes from all RoIs and $Y_p \subseteq B$ be the set of positive boxes, i.e., detected bounding boxes which are closest to the ground-truth bounding boxes and have correct class labels. Then, we define the SS loss as the negative log-likelihood of (1):

$\mathcal{L}_{SS} = -\log \dfrac{\det(L_{Y_p})}{\det(L_B + I)}. \qquad (2)$

This loss function increases the detection scores of bounding boxes in the positive set $Y_p$. In other words, it suppresses the scores of all subsets which contain at least one non-positive bounding box. We note that the normalization term of the DPP is included for numerical stability during learning.
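The SS loss is simply the negative log-likelihood of a DPP restricted to the top-box ground set. Assuming a precomputed kernel `L` and index sets for the ground set and the positive boxes, it can be sketched with `slogdet` for numerical stability:

```python
import numpy as np

def dpp_nll(L, ground_set, positive_set):
    """Negative log-likelihood of the positive subset under the DPP
    restricted to the ground set, i.e. the form of the loss in (2).
    positive_set is assumed to be a subset of ground_set's indices."""
    L_P = L[np.ix_(positive_set, positive_set)]
    L_B = L[np.ix_(ground_set, ground_set)]
    _, logdet_P = np.linalg.slogdet(L_P)
    _, logdet_Z = np.linalg.slogdet(L_B + np.eye(len(ground_set)))
    # The normalizer det(L_B + I) is kept for numerical stability.
    return logdet_Z - logdet_P
```

For a PSD kernel the loss is nonnegative, since $\det(L_{Y_p}) \le \det(L_B + I)$.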
We also use classification and regression losses for training RPN and RCN, similarly to Faster R-CNN [ren2015faster]. Suppose RPN and RCN each output class probabilities $p = (p_0, \ldots, p_C)$, where $C$ is the number of foreground categories and $p_0$ corresponds to the background. The classification loss ($\mathcal{L}_{cls}$) and the regression loss ($\mathcal{L}_{reg}$) are calculated as follows:

$\mathcal{L}_{cls}(p, c^*) = -\log p_{c^*}, \qquad \mathcal{L}_{reg}(t, t^*) = \mathrm{smooth}_{L_1}(t^{c^*} - t^*), \qquad (3)$

where $c^*$ is the true class, $t^{c^*}$ is the predicted location shift for the $c^*$th class, $t^*$ is the target location shift, and $\mathrm{smooth}_{L_1}$ is a combination of L1 and L2 losses as defined in [girshick2015fast]. The regression loss is not applied to the background category. Since the only difference between the RPN loss and the RCN loss is the number of categories (RPN distinguishes only object vs. background), both can be expressed in the form of (3). See [ren2015faster] for more details about $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$.
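The smooth-L1 term referenced above, from Fast R-CNN [girshick2015fast], behaves like L2 near zero and like L1 elsewhere. A minimal sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 (Huber-style) regression loss from Fast R-CNN:
    0.5 * x^2 for |x| < 1, and |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of outliers compared to a pure L2 loss.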
The weights of VGG-16, RPN, and RCN, denoted by $\theta$ in Figure 2, can be learned by optimizing:

$\min_{\theta}\; \mathcal{L}_{cls}^{RPN} + \mathcal{L}_{reg}^{RPN} + \mathcal{L}_{cls}^{RCN} + \mathcal{L}_{reg}^{RCN} + \mathcal{L}_{SS}. \qquad (4)$
3.3 Learning Instance Differences
An instance-agnostic detector based solely on object category information often fails to detect objects in proximity. For accurate detection in real-world images, where objects frequently overlap, it is crucial to distinguish different object instances. To address this problem, we propose an instance-aware detection loss (ID loss). The objective of this loss function is to obtain similar features for the same instance and different features for different instances. This is done by maximizing the probability of the subset of the most distinctive bounding boxes.
Let $O$ be the set of all candidate bounding boxes which intersect the ground-truth bounding boxes. Let $Y_g \subseteq O$ be the set of the most representative boxes, i.e., the candidate boxes which are closest to the ground-truth boxes. Then, the ID loss for all objects is defined as follows:

$\mathcal{L}_{ID} = -\log \dfrac{\det(L_{Y_g})}{\det(L_O + I)}. \qquad (5)$

Due to the determinant, minimizing this loss increases the cosine distance between $\phi_i$ and $\phi_j$ if $b_i$ and $b_j$ belong to different instances. As we select boxes near the ground-truth bounding boxes to construct $Y_g$, the network can learn which bounding boxes are similar and which are different.
In addition to (5), we set an additional objective which focuses on differentiating instances of the same category. Given $O^c$, the candidate boxes in the $c$th category, and $Y_g^c$, the representative boxes for the ground-truth boxes in the $c$th category, the intra-class loss is defined as follows:

$\mathcal{L}_{intra}^{c} = -\log \dfrac{\det(L_{Y_g^c})}{\det(L_{O^c} + I)}. \qquad (6)$

It provides an additional guidance signal for training the network, since it is more difficult to distinguish similar instances of the same category than instances of different categories. Bounding boxes for a particular category are illustrated in Figure 5. We then construct the final loss by adding the two losses over every category,

$\mathcal{L}_{ID}^{total} = \mathcal{L}_{ID} + \sum_{c=1}^{C} \mathcal{L}_{intra}^{c}. \qquad (7)$
The goal of the ID loss is to find all instances while discriminating different instances, as shown in Figure 1. Given a set of candidate bounding boxes and subsets of them, the weights of RIN ($\omega$ in Figure 2) can be learned by optimizing:³

$\min_{\omega}\; \mathcal{L}_{ID}^{total}. \qquad (8)$

³ The gradients of the SS loss and the ID loss are derived in Appendix B.
3.4 Inference
Given a set of candidate bounding boxes, the similarity matrix $S$, and the detection quality vector $\mathbf{q}$, Algorithm 1 (IDPP) finds the most representative subset of bounding boxes; two thresholds control candidate pruning and termination. The problem of finding an optimal subset is NP-hard, because normalizing the probabilities of a finite point process over $N$ candidate bounding boxes requires summing over all $2^N$ subsets. Fortunately, due to the log-submodular property of DPPs [kulesza2012determinantal], we can approximately solve the problem with a greedy algorithm, such as Algorithm 1, which iteratively adds the index of a detection candidate until no candidate can make the determinant of the new subset higher than that of the current subset [azadi2017learning].
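A hedged sketch of this greedy selection follows; the exact thresholds and stopping rule of Algorithm 1 are in the paper, and `score_thresh` and `max_dets` here are placeholder parameters.

```python
import numpy as np

def greedy_dpp_map(L, quality, score_thresh=0.01, max_dets=100):
    """Greedy sketch of DPP MAP inference in the style of IDPP:
    repeatedly add the candidate that yields the largest log det(L_Y),
    stopping when no candidate improves on the current subset."""
    candidates = [i for i in range(len(L)) if quality[i] > score_thresh]
    selected = []
    best = -np.inf
    while candidates and len(selected) < max_dets:
        gains = []
        for i in candidates:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gains.append(logdet if sign > 0 else -np.inf)
        j = int(np.argmax(gains))
        if gains[j] <= best:
            break                   # no candidate raises the determinant
        best = gains[j]
        selected.append(candidates.pop(j))
    return selected
```

With a kernel in which two near-duplicate boxes are highly similar, the duplicate is rejected because adding it collapses the determinant, while a distinct box is kept.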
4 Experiments
We evaluated IDNet on standard datasets: PASCAL VOC [everingham2010pascal] and MS COCO [lin2014microsoft]. Since IDNet is, to our knowledge, the first identity-aware detection network, we compare our algorithm with the baseline methods Faster R-CNN [ren2015faster] and LDDP [azadi2017learning]. Since the goal of our algorithm is to discriminate instances among given candidate bounding boxes, we adopt Faster R-CNN as a proposal network to obtain candidate detections. Additionally, we do not use the SS loss during the early stage of training, since the accuracy of detection scores is then very poor and the top categories do not yet contain similar categories. The number of iterations for the early stage is found by a grid search. For fair comparisons, the number of training iterations for adjusting scores is the same as the number of iterations required to train the other detectors.
For the inference method, we report results from three algorithms. First, NMS can be applied to all detectors described earlier. Second, LDPP, the inference method used in LDDP [azadi2017learning], is applied to Faster R-CNN and LDDP. Third, IDPP (Algorithm 1) is applied to the proposed algorithm. Note that IDPP cannot be applied to the other detectors, as they have no module to extract features of instances. The detailed parameter settings for the implementation are in Appendix C.
4.1 Results
PASCAL VOC
We train the network on the VOC2007 and VOC0712 sets and test on the VOC2007 test set. The VOC2007 dataset has 5,011 images for training and 4,952 images for testing, with 20 object categories. The VOC0712 train set is the union of the VOC2007 trainval set and the VOC2012 trainval set, totaling 16,551 images. Performance is evaluated with the mean average precision (mAP), the average of the AP over all categories. Each AP is calculated by averaging the interpolated precision at 11 uniformly spaced recall levels.
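The 11-point interpolated AP used by VOC2007 can be sketched as follows; `recall` and `precision` are assumed to be arrays over detections sorted by descending score.

```python
import numpy as np

def voc_ap_11pt(recall, precision):
    """VOC-2007-style 11-point interpolated AP: the average, over recall
    levels 0.0, 0.1, ..., 1.0, of the maximum precision achieved at any
    recall >= that level."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```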



Network  Inference  mAP  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  person  plant  sheep  sofa  train  tv 


Faster R-CNN [ren2015faster]  NMS  71.4  70.4  78.2  69.7  58.9  56.9  79.5  83.0  84.3  53.3  78.6  64.5  81.7  83.7  76.1  77.9  45.4  70.5  66.7  74.3  73.3 
Faster R-CNN [ren2015faster]  LDPP  71.1  72.1  77.6  67.8  58.5  54.9  79.0  80.1  85.5  53.8  79.9  64.0  81.7  83.7  76.7  78.0  45.0  70.9  66.7  74.0  73.0 
LDDP [azadi2017learning]  NMS  70.5  69.7  78.6  69.2  55.0  54.4  77.0  82.7  82.6  52.0  78.7  66.0  81.7  83.3  75.3  77.9  44.5  69.7  66.0  73.2  72.2 
LDDP [azadi2017learning]  LDPP  70.5  71.6  78.4  67.2  55.9  52.9  76.8  79.9  83.5  51.4  79.5  65.1  82.1  83.6  75.6  77.9  44.9  71.0  66.3  73.7  72.6 
IDNet  NMS  71.5  70.1  78.1  67.8  56.9  56.2  82.5  82.1  83.2  56.1  81.2  66.0  81.9  84.3  76.7  78.5  42.3  70.3  65.7  76.2  73.9 
IDNet  IDPP  72.2  70.2  79.5  70.1  58.0  55.6  81.1  83.5  84.2  56.2  81.3  64.8  83.0  84.1  77.3  80.4  43.6  72.9  66.9  76.9  73.7 




Network  Inference  mAP  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  person  plant  sheep  sofa  train  tv 


Faster R-CNN [ren2015faster]  NMS  75.8  77.2  84.1  74.8  67.3  65.5  82.0  87.4  87.9  58.7  81.5  69.8  85.0  85.1  77.7  79.2  47.2  75.4  71.8  82.3  75.8 
Faster R-CNN [ren2015faster]  LDPP  76.1  77.7  82.5  75.1  66.1  65.2  82.9  88.1  87.3  59.6  82.2  70.6  85.4  86.1  80.7  79.1  48.3  76.5  71.1  83.2  75.1 
LDDP [azadi2017learning]  NMS  75.9  77.3  81.5  74.4  65.9  64.9  84.8  87.2  86.7  60.4  80.9  70.8  85.3  84.9  77.1  79.0  47.9  76.0  72.6  83.4  77.5 
LDDP [azadi2017learning]  LDPP  76.4  76.9  83.0  75.0  66.5  64.3  83.4  87.5  87.7  61.2  81.5  70.0  86.0  84.9  81.9  83.3  48.6  75.7  72.3  82.6  76.5 
IDNet  NMS  76.0  78.4  79.6  74.2  63.1  66.7  84.5  87.7  85.9  60.8  84.8  70.2  85.2  85.4  79.2  79.2  46.4  77.0  74.1  81.6  76.4 
IDNet  IDPP  76.8  78.8  83.4  74.4  64.0  66.9  83.5  87.8  87.1  61.1  84.6  70.5  85.6  85.2  80.7  83.1  47.0  79.0  73.1  83.2  76.2 

For the VOC2007 train set, we set the number of iterations for the early stage to 40k, and to 70k for VOC0712. Then, we train RIN to learn the differences between instances with the ID loss for 30k and 20k iterations, respectively. As Faster R-CNN and LDDP do not have a module to extract a feature for each bounding box, we use LDPP, proposed in [azadi2017learning], as an inference method for them. LDPP uses a class-wise similarity matrix, while IDPP uses the features extracted from RIN.
As shown in Table 1, the NMS results of IDNet show that the SS loss effectively suppresses a number of candidate boxes while keeping the correct boxes. As the number of categories is small, the number of similar categories is even smaller, which explains the marginal performance improvement. When we test the networks with several post-processing methods, such as NMS and LDPP, we observe the following results. For the VOC2007 train set, Faster R-CNN with NMS has an mAP of 71.4%, LDDP with LDPP has an mAP of 70.5%, and IDNet with IDPP has an mAP of 72.2%. The proposed algorithm thus outperforms Faster R-CNN with NMS by 0.8% mAP. For the VOC0712 train set, Faster R-CNN with NMS has an mAP of 75.8%, LDDP with LDPP has an mAP of 76.4%, and IDNet with IDPP has an mAP of 76.8%, as shown in Table 2. The overall trends for the VOC0712 train set are similar to those for VOC2007, showing a 1.0% mAP improvement over Faster R-CNN with NMS. Due to space constraints, we visualize result images in Figure 9. Additionally, to measure the impact of the ID loss with respect to overlap ratios, we evaluate the performance of IDNet on test images with overlapped objects. The experimental results show that the performance gap in recall between Faster R-CNN with NMS and IDNet with IDPP increases as the overlap ratio increases. For VOC, the recall on overlapped objects with IoU greater than 0.6 is 71.3% for the proposed method, while Faster R-CNN reports 58.3% (Table 8).



Network  Inference  mean AP @ IoU:  mean AP @ Area:  mean AR, # Dets:  mean AR @ Area:  
0.5:0.95  0.5  0.75  S  M  L  1  10  100  S  M  L  


Faster R-CNN [ren2015faster]  NMS  26.2  46.6  26.9  10.3  29.3  36.4  25.5  38.1  39.0  17.9  44.0  55.7 
Faster R-CNN [ren2015faster]  LDPP  26.2  46.5  26.9  10.2  29.3  36.6  24.8  37.0  37.9  15.7  42.5  54.9 
LDDP [azadi2017learning]  NMS  26.4  46.8  26.9  10.5  29.4  36.7  25.7  38.5  39.4  18.2  44.6  56.4 
LDDP [azadi2017learning]  LDPP  26.4  46.7  26.8  10.5  29.4  36.8  25.0  37.4  38.4  16.0  43.1  55.3 
IDNet  NMS  27.0  47.3  27.9  10.7  29.7  37.7  25.9  38.4  39.3  18.2  44.0  56.6 
IDNet  IDPP  27.3  47.6  28.2  10.9  30.1  38.0  25.9  39.4  40.6  18.6  45.1  58.9 

Microsoft COCO
We carry out experiments with 82,783 images in the train set and 40,504 images in the validation set, which is used for testing, with 80 object categories. The number of iterations for the early stage is set to 360k. After adjusting scores, we train RIN for 20k iterations. As shown in Table 3, we evaluate the algorithms with twelve different performance metrics. Average precision at IoU [.5, .95] evaluates with multiple thresholds obtained by uniformly sampling 10 values from 0.5 to 0.95; this is the primary challenge metric in the COCO detection evaluation. The proposed algorithm achieves 27.3% mAP@IoU [.5, .95] on the validation set, higher than the other methods. mAP at IoU=0.5 is the same metric as in VOC. AP at a given IoU threshold counts a predicted box as correct when its overlap with the ground-truth box is greater than the threshold. Metrics with an area condition measure AP for different object scales. As recall is higher when there is a large number of predicted boxes, the mAP metrics constrain the number of detections per image. The mean average recall (mAR) is the maximum recall for each category given a fixed number of detections. Our algorithm shows comparable results on all performance metrics. Additionally, as the COCO dataset has a larger number of categories, the SS loss improves performance from 26.2% mAP to 27.0% mAP, a bigger improvement than on VOC. This result indicates that the SS loss has the potential to yield larger performance improvements when applied to large-scale detection datasets with many categories. We visualize detection results in Figures 10 and 11. The performance with respect to different overlap ratios is shown in Table 9.
4.2 Ablation Study
To analyze the influence of each loss, we conduct several ablation studies, with results summarized in Table 4. We test the proposed method with two post-processing methods. Since IDPP uses the features trained with the ID loss, we substitute LDPP for IDPP in the ablation experiments that do not use the ID loss. As shown in Table 4, the performance with NMS slightly increases to 71.5% mAP for the VOC2007 train set and 76.0% for VOC0712 when we add the SS loss. The SS loss is effective not only for DPP inference methods but also for NMS, because it maintains precision while reducing redundant detections. We note that, with the parameters from [azadi2017learning], most results with LDPP inference are lower than those with NMS. The performance of IDNet trained with the ID loss is 71.9% mAP for VOC2007 and 76.7% mAP for VOC0712, indicating that the ID loss, which learns the differences between bounding boxes, is critical for the performance improvement. With both the SS loss and the ID loss, IDNet achieves 72.2% mAP for VOC2007 and 76.8% mAP for VOC0712. Detailed analyses are given below.


SS loss  ID loss  Inference  VOC2007  VOC0712 


x  x  NMS  71.4  75.8 
LDPP  71.1  76.1  
o  x  NMS  71.5  76.0 
LDPP  70.4  75.8  
x  o  NMS  71.3  75.8 
IDPP  71.9  76.7  
o  o  NMS  71.5  76.0 
IDPP  72.2  76.8  

Effect of sparse-score loss.
As stated in Section 3.2, a detector often produces falsely categorized bounding boxes, and the SS loss is introduced to alleviate this problem. Specifically, in our experimental setting, the SS loss suppresses all bounding boxes except the top-1 bounding box. To validate the loss, we extract the top-5 bounding boxes with detection scores over a fixed threshold (set to 0.01) for each RoI. A predicted box is considered correct when it overlaps a ground-truth box with an IoU of 0.5 or more. We then compute, for each category, the ratio of correct boxes among the top-5 bounding boxes. Figure 3 shows that the proposed IDNet achieves superior performance in terms of correctly detected bounding boxes among the top-5 compared to the other methods. On average, IDNet achieves 43.7%, while Faster R-CNN has 32.4% and LDDP has 32.9% on COCO. (For VOC2007, IDNet achieves 68.9%, while Faster R-CNN has 61.0% and LDDP has 60.5%, as shown in Figure 7.) Images with scores are visualized in Figure 8, showing that the SS loss successfully suppresses bounding boxes with wrong classes.
Effect of instance-aware detection loss.
Table 5 gives the total number of objects in the datasets and the number of objects overlapping another object of the same category, broken down by the degree of overlap. There are only 719 such objects (6.0% of all objects) in the VOC2007 test set and 16,512 objects (5.7% of all objects) in the COCO validation set. Since IDNet is more effective for overlapped objects, the small number of overlapped bounding boxes in the datasets explains the marginal improvement over the other methods. To further evaluate our method, we experiment with only the overlapped objects. We report the probability of finding objects among the overlapped objects in Table 8. We count overlapped objects using the ground-truth object boxes when they have the same class label, and then check whether there are detected bounding boxes for those overlapped objects. After calculating the probability for each category, the results are averaged over categories. Since the datasets contain only a small number of highly overlapped objects, the overlap bucket at 0.6 includes all objects with IoU greater than or equal to 0.6. For the overlapped objects at all overlap ratios, the probability of detecting objects is higher than for Faster R-CNN with LDPP and LDDP with LDPP. Figure 4 demonstrates that IDNet with IDPP successfully detects overlapped objects compared to existing instance-agnostic detectors. Compared with Faster R-CNN, the detection probability increases from 58.2% to 62.7% on COCO. (For VOC, the detection probability increases from 72.2% to 78.9%, as shown in Figure 6.) This result shows that the ID loss is critical for detecting objects in proximity.
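A sketch of how the same-category overlap buckets of Table 5 might be computed from ground-truth annotations; the pair-based counting (an object counts if any same-class partner falls into the IoU bucket) is an assumption made for illustration.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def count_overlapped(boxes, labels, lo, hi):
    """Number of ground-truth objects overlapping another object of the
    same class with pairwise IoU in the bucket (lo, hi]."""
    hit = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if labels[i] != labels[j]:
                continue
            v = iou(boxes[i], boxes[j])
            if lo < v <= hi:
                hit.update((i, j))   # both objects count as overlapped
    return len(hit)
```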
More ablation results are in Appendix E.2, and failure case studies are in Appendix F.2.



Overlap  [0.0, 1.0]  (0.0, 0.1]  (0.1, 0.2]  (0.2, 0.3]  (0.3, 0.4]  (0.4, 0.5]  (0.5, 0.6]  (0.6, 1.0] 


VOC2007  12032  3035  1587  720  360  192  83  84 
COCO  291874  105908  42116  19121  8922  4342  1957  1291 

5 Conclusion
We have introduced IDNet, which tackles two challenges in object detection: detecting overlapped objects and suppressing falsely categorized bounding boxes. By introducing two novel losses based on determinantal point processes, we have demonstrated that the proposed method is effective on both challenges while maintaining correctly detected bounding boxes.
Appendix A Notations
We summarize the notation for DPPs used in this paper in Table 6.
Notation  Definition  Description 
RoIs  -  Region-of-interest boxes proposed by RPN. 
$b_i$  -  Candidate bounding boxes proposed by RCN. 
$M_{ij}$  $\mathrm{IoU}(b_i, b_j)$  Intersection over union (IoU) of two bounding boxes. 
$q_i$  -  Rescaled detection quality score of $b_i$. 
$\phi_i$  -  Normalized feature of bounding box $b_i$. 
$S_{ij}$  -  Similarity between boxes $b_i$ and $b_j$. 
$L$  $S \circ \mathbf{q}\mathbf{q}^T$  Kernel matrix of the DPP. 
Appendix B Gradients of the Losses
For notational convenience, for any matrix $A$ and index set $Y$, we write $[A]_Y$ for the matrix of the same dimension as $A$ whose entries indexed by $Y$ are copied from $A$ and whose remaining entries are zero.
B.1 Gradient of the Instance-Aware Detection Loss
Here, we derive the gradient with respect to the normalized features. Since the derivative of the log-determinant satisfies $\frac{\partial}{\partial X} \log\det X = X^{-1}$ for a symmetric positive definite matrix $X$, the derivative of the intra-class ID loss is as follows:
(9)  
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product, $\circ$ is the Hadamard product, $c$ indexes the category, and $C$ is the number of categories. Since we only calculate the gradient of the ID loss with respect to the similarity features, the derivative is as follows:
(10) 
Using the property that $\operatorname{tr}(A^{T}(B \circ C)) = \operatorname{tr}((A \circ B)^{T} C)$ for arbitrary matrices $A$, $B$, and $C$ of the same dimensions, we can derive:
(11)  
Writing the matrix elementwise,
(12)  
Since the gradient of the inter-class term is analogous, we omit its derivation. The gradient of the ID loss is then constructed by summing (12) over all batches and categories:
(13) 
B.2 Gradient of the Sparse-Score Loss
The derivation of the gradient of the sparse-score loss is similar to that of the instance-aware detection loss, except that it is taken with respect to the quality vector $\mathbf{q}$. The derivative of the sparse-score loss is as follows:
(14)  
Since
(15)  
the final derivative is:
(16)  
Appendix C Implementation Details
The detailed settings of IDNet are as follows. Our model has three hyperparameters that need to be tuned: the ratio between spatial and visual similarity used to construct the DPP kernel matrix, the dimensionality of the extracted feature, and the iteration at which training with the SS loss begins. These hyperparameters are found through a grid search on the validation dataset. The parameters are searched in the following ranges: [0.2, 0.7] for the similarity ratio, [128, 1024] for the feature dimensionality, and [30k, 50k] and [300k, 400k] for the SS loss starting point on the PASCAL VOC and COCO datasets, respectively. We set the ratio to 0.6 and the feature dimension to 256 for all experiments. Once the hyperparameters are tuned, we train on the whole training set and evaluate on the test set. We use 0.25, 4, 5, and 0.001 for all experiments, as these were the empirically best values. The learning rate is set to 0.001, and the SS loss and the ID loss are weighted by 0.01 to balance them against the classification loss (negative log probability loss) and the regression loss. Other details are the same as in chen2017implementation (). As in the original Faster R-CNN, we horizontally flip input images for data augmentation. For all experiments, we use the VGG network as the region proposal part of the detector. IDNet is implemented in TensorFlow, and the optimization is done with stochastic gradient descent. The parameters of IDNet are initialized from an ImageNet-pretrained model deng2009imagenet (), except for the RIN module. We run the experiments on an NVIDIA TITAN X graphics card for the PASCAL VOC 2007 and 2012 datasets and an NVIDIA TITAN Xp graphics card for the COCO dataset.
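The grid search described above can be sketched as follows; `train_fn` and `eval_fn` are placeholders for the actual training run and validation-mAP evaluation, and the grid values simply mirror the reported search ranges.

```python
from itertools import product

def grid_search(train_fn, eval_fn, grid):
    """Pick the hyperparameter combination with the best validation score.

    `train_fn(params)` returns a trained model; `eval_fn(model)` returns a
    validation metric (e.g. mAP). Both are stand-ins for the real pipeline.
    """
    best_score, best_params = float('-inf'), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = eval_fn(train_fn(params))
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Illustrative grid reflecting the ranges reported above:
search_grid = {
    'gamma': [0.2, 0.4, 0.6, 0.7],           # spatial/visual similarity ratio
    'feature_dim': [128, 256, 512, 1024],    # feature dimensionality
    'ss_loss_start': [30000, 40000, 50000],  # iteration to enable the SS loss
}
```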
For training IDNet with determinantal point processes (DPPs), it is important to carefully select the most representative subset of candidate bounding boxes. To aid understanding, we show examples of such subsets in Figure 5.
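As context for how a representative subset can be extracted from a DPP kernel, here is a sketch of the standard greedy MAP heuristic for DPPs (a common approximation, not necessarily the paper's exact inference procedure): boxes are added one at a time as long as they increase the log-determinant of the selected submatrix.

```python
import numpy as np

def greedy_dpp_select(L, k=None, eps=1e-10):
    """Greedy MAP inference for a DPP with kernel L: repeatedly add the
    index that most increases log det(L_Y); stop when no index gives a
    positive gain (or when k indices are selected)."""
    n = L.shape[0]
    selected, remaining = [], list(range(n))
    while remaining and (k is None or len(selected) < k):
        if selected:
            base = np.linalg.slogdet(L[np.ix_(selected, selected)])[1]
        else:
            base = 0.0
        best_gain, best_i = -np.inf, None
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = (logdet if sign > 0 else -np.inf) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None or best_gain <= eps:
            break  # adding any remaining box would lower the determinant
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

With a quality–similarity kernel, two highly similar boxes make the 2x2 subdeterminant nearly zero, so the greedy step keeps only one of them, which is the behavior that replaces NMS.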
Appendix D Network Architecture
RIN consists of three fully connected layers, three max-pooling layers, one RoI-pooling layer, and nine convolutional layers, of which the first two are shared with the VGG16 network (Table 7). Each convolutional and fully connected layer except the last is followed by batch normalization ioffe2015batch () and a rectified linear unit (ReLU), in that order. All convolutional layers use filters of 3x3 pixels with a stride of one.
Layer  Type  Parameter  Filter size  Remark
0  Convolution  3x3x3x64  3x3  Shared w/ VGG16
1  Convolution  64x3x3x64  3x3  Shared w/ VGG16
2  Max-pooling    2x2
3  Convolution  64x3x3x128  3x3
4  Convolution  64x3x3x128  3x3
5  Convolution  128x3x3x256  3x3
6  Convolution  256x3x3x256  3x3
7  Max-pooling    2x2
8  Convolution  256x3x3x256  3x3
9  Convolution  128x3x3x256  3x3
10  Convolution  128x3x3x256  3x3
11  Max-pooling    2x2
12  RoI-pooling    15x15
13  Fully connected  57600x1000
15  Fully connected  (1000+5)x1000    Concat w/ box locations & category
16  Fully connected  1000x256
Appendix E More Experimental Results
E.1 Experiments with Overlapped Objects
The experimental results are evaluated on the images in which overlapped objects exist. We measure recall and mAP, where recall is calculated as the ratio of detected objects among the overlapped objects. Recall is the more informative measure of robustness to overlap, since it is computed only over objects with overlap, whereas mAP is computed over all objects in the images. As shown in Table 8 and Table 9, the performance gap between Faster R-CNN and IDNet widens as the overlap ratio increases. On the PASCAL VOC 2007 dataset, the recall gaps grow as 5.5%, 7.8%, 10.5%, 12.2%, and 13% (Table 8); on the COCO dataset, they are 8%, 11%, 14.3%, 16.3%, and 16.0% (Table 9). The mAP gaps are smaller but follow the same increasing trend. Since some categories contain no object with an overlap of 0.5 or more, performance is measured only up to the (0.4, 1.0] overlap range.
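The recall metric described above can be sketched as follows; `overlap_thr` and `match_thr` are illustrative names for the lower bound of the overlap range and the IoU threshold for matching a detection to a ground-truth box.

```python
def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def overlapped_recall(gt_boxes, det_boxes, overlap_thr=0.1, match_thr=0.5):
    """Recall restricted to overlapped ground-truth objects: a ground-truth
    box counts as 'overlapped' if its IoU with some other ground-truth box
    exceeds `overlap_thr`, and as detected if some detection matches it
    with IoU >= `match_thr`."""
    overlapped = [g for i, g in enumerate(gt_boxes)
                  if any(iou(g, h) > overlap_thr
                         for j, h in enumerate(gt_boxes) if j != i)]
    detected = sum(1 for g in overlapped
                   if any(iou(g, d) >= match_thr for d in det_boxes))
    return detected / len(overlapped) if overlapped else 0.0
```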


Network  Inference  Overlap  # Obj  # Ovl. obj  # Det. obj  Recall  mAP 


Faster RCNNren2015faster ()  NMS  (0.0, 1.0]  5505  4714  3792  80.4  61.4 
LDDP  LDPP  3758  79.7  60.8  
IDNet  IDPP  4048  85.9  63.1  
Faster RCNNren2015faster ()  NMS  (0.1, 1.0]  3802  2675  2045  76.5  60.2 
LDDP  LDPP  2084  77.9  60.2  
IDNet  IDPP  2254  84.3  62.4  
Faster RCNNren2015faster ()  NMS  (0.2, 1.0]  2458  1352  941  69.6  58.3 
LDDP  LDPP  999  73.9  59.7  
IDNet  IDPP  1095  80.1  60.3  
Faster RCNNren2015faster ()  NMS  (0.3, 1.0]  1310  695  437  62.9  56.8 
LDDP  LDPP  477  68.6  59.5  
IDNet  IDPP  522  75.1  59.3  
Faster RCNNren2015faster ()  NMS  (0.4, 1.0]  734  355  207  58.3  53.8 
LDDP  LDPP  217  61.1  54.9  
IDNet  IDPP  253  71.3  58.8  



Network  Inference  Overlap  # Obj  # Ovl. obj  # Det. obj  Recall  mean AP @ IoU:  mean AP @ Area:  mean AR, # Dets:  mean AR @ Area:  
0.50.95  0.5  0.75  S  M  L  1  10  100  S  M  L  


Faster RCNNren2015faster ()  NMS  (0.0, 1.0]  168687  135912  87767  64.6  22.4  41.7  22.0  9.6  26.8  32.8  19.4  33.2  34.2  16.0  40.8  52.2 
LDDPazadi2017learning ()  LDPP  87982  64.7  22.7  42.1  22.1  9.8  27.1  33.5  19.0  32.7  33.8  14.6  40.3  51.8  
IDNet  IDPP  98618  72.6  23.2  42.7  23.1  10.0  27.5  34.3  19.7  34.2  35.5  16.4  41.8  55.3  
Faster RCNNren2015faster ()  NMS  (0.1, 1.0]  123532  65618  42055  64.1  21.2  40.0  20.5  9.1  26.1  31.2  18.2  31.4  32.4  15.0  39.5  50.3 
LDDPazadi2017learning ()  LDPP  43519  66.3  21.5  40.6  20.6  9.4  26.4  32.0  17.9  31.0  32.2  14.0  39.1  50.2  
IDNet  IDPP  49266  75.1  22.0  40.9  21.5  9.5  26.8  32.9  18.4  32.7  34.2  15.5  40.7  54.5  
Faster RCNNren2015faster ()  NMS  (0.2, 1.0]  79632  31963  18856  59.0  19.9  38.2  19.0  8.7  24.8  30.7  17.4  29.8  30.8  14.2  37.8  48.7 
LDDPazadi2017learning ()  LDPP  20272  63.4  20.3  39.0  19.2  9.0  25.2  31.3  17.0  29.4  30.6  13.2  37.5  48.6  
IDNet  IDPP  23423  73.3  20.9  38.8  20.4  9.4  26.2  32.4  17.5  31.9  34.2  15.1  40.6  56.0  
Faster RCNNren2015faster ()  NMS  (0.3, 1.0]  44429  15268  8070  52.9  19.2  36.9  18.4  8.5  24.3  31.0  17.0  28.6  29.6  13.4  36.4  47.8 
LDDPazadi2017learning ()  LDPP  8944  58.6  19.6  37.9  18.6  8.9  24.6  31.6  16.6  28.4  29.6  12.9  36.4  47.7  
IDNet  IDPP  10558  69.2  20.5  38.2  20.0  9.1  25.7  33.0  17.0  30.9  33.2  14.4  39.2  56.0  
Faster RCNNren2015faster ()  NMS  (0.4, 1.0]  22369  7196  3381  47.0  18.9  35.9  18.2  8.4  23.6  31.5  17.1  28.3  29.1  13.0  35.0  46.8 
LDDPazadi2017learning ()  LDPP  3765  52.3  19.3  37.2  18.4  8.6  24.0  32.6  16.5  27.6  28.7  12.3  34.6  47.3  
IDNet  IDPP  4563  63.4  20.3  38.0  19.8  9.0  24.9  34.3  17.1  30.8  33.0  14.1  38.1  56.1  

E.2 Results of Ablation Study
In addition to the results showing the impact of the ID loss and the sparse-score loss on COCO, we performed the same experiment on PASCAL VOC. The results for the ID loss are shown in Figure 6 and those for the sparse-score loss in Figure 7 and Figure 8. In Figure 8, candidate boxes above a fixed threshold (0.1 for both Faster R-CNN and IDNet) are visualized; the highest score in each category is shown, and all scores are normalized. For the images in the left column of Figure 8, the highest score of the horse category is 0.546 for Faster R-CNN but 0.154 for IDNet. The results clearly show that the sparse-score loss suppresses the scores of horse-category bounding boxes around the cow. Additionally, for the images in the right column of Figure 8, the score of the "tennis racket" category is 0.226 in Faster R-CNN, while in IDNet it falls below the threshold (0.1). Therefore, the SS loss successfully suppresses the scores of falsely categorized bounding boxes around a correct bounding box.




Appendix F Example Visualization
We visualize results on PASCAL VOC in Figure 9 and on COCO in Figure 10. Bounding boxes are selected with a score threshold of 0.6 for Faster R-CNN with NMS and for LDDP with LDPP; this threshold is the one designated in their paper azadi2017learning (). For IDNet with IDPP, we use a score threshold of 0.2. The results show that the instance-aware DPP inference method (IDPP) can detect overlapped objects by leveraging object features.
F.1 Successful Cases
We visualize successful cases of IDNet (Figure 9 for VOC; Figures 10 and 11 for COCO). In Figure 9, the first row shows that wrong-class bounding boxes are suppressed while the correct class is selected. The remaining rows show that objects in close proximity are detected where other methods fail. In Figures 10 and 11, IDNet successfully detects overlapped objects.
F.2 Failure Case Analysis
Figure 12 shows that the detector selects bounding boxes of the wrong category for avocados. Because the dataset contains no avocado category, the detector falls back on similar classes such as banana and apple. This case suggests that scores should be suppressed further when no matching detection class exists, i.e., assigned to the background category. Also in Figure 12, a giraffe is hidden behind two trees. Under such occlusion, detectors often fail to recognize that the visible fragments belong to a single object and instead select several bounding boxes for it. Since DPP inference tries to find the most representative bounding boxes, it may select all of these boxes, increasing the number of false detections.


References
 (1) Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014)
 (2) Girshick, R.: Fast R-CNN. In: IEEE International Conference on Computer Vision (ICCV). (2015)
 (3) Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Neural Information Processing Systems (NIPS). (2015)
 (4) Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
 (5) Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
 (6) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European Conference on Computer Vision (ECCV). (2016)
 (7) Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2008)
 (8) Tian, Y.L., Lu, M., Hampapur, A.: Robust and efficient foreground analysis for real-time video surveillance. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2005)
 (9) Ranjan, R., Patel, V.M., Chellappa, R.: HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2017)
 (10) Koffka, K.: Principles of Gestalt Psychology. Volume 44. Routledge (2013)
 (11) Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012)
 (12) Azadi, S., Feng, J., Darrell, T.: Learning detection with diverse proposals. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
 (13) Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision (IJCV) 88(2) (2010) 303–338
 (14) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV). (2014)
 (15) Kong, T., Yao, A., Chen, Y., Sun, F.: HyperNet: Towards accurate region proposal generation and joint object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
 (16) Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. In: Neural Information Processing Systems (NIPS). (2016)
 (17) Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
 (18) Ren, M., Zemel, R.S.: End-to-end instance segmentation with recurrent attention. arXiv preprint arXiv:1605.09410 (2017)
 (19) Li, Y., Sun, B., Wu, T., Wang, Y.: Face detection with end-to-end integration of a ConvNet and a 3D model. In: European Conference on Computer Vision (ECCV). (2016)
 (20) Lee, D., Cha, G., Yang, M.H., Oh, S.: Individualness and determinantal point processes for pedestrian detection. In: European Conference on Computer Vision (ECCV). (2016)
 (21) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2009)
 (22) Chen, X., Gupta, A.: An implementation of Faster R-CNN with study for region sampling. arXiv preprint arXiv:1702.02138 (2017)
 (23) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML). (2015)