Learning Instance-Aware Object Detection Using Determinantal Point Processes
Recent object detectors find instances while categorizing candidate regions. As each region is evaluated independently, the number of candidate regions from a detector is usually larger than the number of objects. Since the final goal of detection is to assign a single detection to each object, a heuristic algorithm, such as non-maximum suppression (NMS), is used to select a single bounding box for an object. While simple heuristic algorithms are effective for stand-alone objects, they can fail to detect overlapped objects. In this paper, we address this issue by training a network to distinguish different objects using the relationship between candidate boxes. We propose an instance-aware detection network (IDNet), which can learn to extract features from candidate regions and measure their similarities. Based on pairwise similarities and detection qualities, the IDNet selects a subset of candidate bounding boxes using instance-aware determinantal point process inference (IDPP). Extensive experiments demonstrate that the proposed algorithm achieves significant improvements for detecting overlapped objects compared to existing state-of-the-art detection methods on the PASCAL VOC and MS COCO datasets. (This paper is under consideration at Computer Vision and Image Understanding.)
Object detection is one of the fundamental problems in computer vision. Its goal is to detect objects by classifying and regressing bounding boxes in an image [10, 11, 31, 28, 29, 24]. It has received much attention because of its wide range of applications, such as object tracking, surveillance, and face detection. Most state-of-the-art detectors show significant performance improvements based on deep convolutional neural networks. Despite these advances, it is still difficult to assign correct detections to all objects in an image, since detectors do not distinguish different object instances of the same class; they focus only on an instance-agnostic task, i.e., object category classification. This issue becomes critical when objects overlap. As shown in the left image of Figure 1, the person on the right is not detected due to the overlapped bounding boxes in proximity.
In order to address this issue, we propose an instance-aware detection network (IDNet), which learns to differentiate representations of different objects. IDNet learns the similarity among bounding boxes during training and selects a subset of boxes based on the learned similarity during inference. Specifically, IDNet learns to compare appearances of bounding boxes while considering their spatial arrangements.
IDNet uses an existing detector, such as Faster R-CNN, as a component to obtain candidate bounding boxes. Given candidate boxes, IDNet extracts features of all candidates using a CNN branch, named a region identification network (RIN), which aims to increase the probability of selecting an optimal subset of detections. To this end, IDNet is trained not only with the softmax loss and smooth L1 loss, but also with novel losses based on determinantal point processes (DPPs). A DPP is used in various machine learning fields, such as document and video summarization [4, 37, 22], sensor placement, recommendation systems, and multi-label classification, to select a desirable subset from a set of candidates. Using the repulsiveness property of DPPs, we design an instance-aware detection loss (ID loss), which learns to increase the probability of selecting an instance-aware subset from detection candidates.
Another source of the detection error is multiple detections of different classes for a single object. This error has been known to be one of the persistent problems for instance-agnostic detectors, such as Faster R-CNN . For example, as shown in the right image of Figure 1, there are two bounding boxes categorized as a dog and a horse for the same object. Since the objective of a detector is to find a single bounding box for a single object instance, we propose the sparse-score loss (SS loss) using DPPs to make IDNet assign a single bounding box for a single object, considering all categories. In particular, we formulate the SS loss to remove duplicated bounding boxes by training IDNet to have low confidence scores for bounding boxes with incorrect class labels. After training, our algorithm efficiently finds a subset of candidate detections using the log-submodular property of DPPs .
Experimental results show that IDNet is more robust for detecting overlapped objects against the baseline detectors, such as Faster R-CNN  and learning detection with diverse proposals (LDDP) , on PASCAL VOC  and MS COCO . Our IDNet achieves 5.8% mAP improvement on PASCAL VOC 2007 and 2.5% mAP improvement on PASCAL VOC 0712 over Faster R-CNN when tested on the VOC crowd set, which consists of images with overlapped objects. For COCO, the performance is improved by 1.3% AP when tested on the COCO crowd set.
The main contributions of this paper are summarized as follows: (1) Two novel losses, the sparse-score loss and the instance-aware diversity loss, are proposed for instance-aware detection; (2) To the best of our knowledge, this work is the first approach that trains a neural network to learn quality and diversity terms of a DPP for object detection; (3) The proposed algorithm outperforms baseline detectors for detecting overlapped objects.
2 Related Work
Class-aware detection algorithms.
The goal of class-aware or multi-class object detection is to localize objects in an image while predicting the category of each object. These systems are usually composed of region proposal networks and region classification networks [11, 31, 24]. To improve detection accuracy, a number of different optimization formulations and network architectures have been proposed [31, 16, 3, 28, 24, 29, 6]. Ren \etal use convolutional networks, called region proposal networks, to get region proposals and combine it with Fast R-CNN. Kong \etal utilizes each layer’s feature for detecting small objects in an image. A real-time multi-class object detector is proposed by combining region proposal networks and classification networks in . Liu \etal improve the performance of  using multiple detectors for each convolutional layer. To increase network efficiency, fully connected layers are replaced by convolution layers in . Redmon \etal extend  by classifying thousands of categories using the hierarchical structure of categories in the dataset.
DPPs have been used to improve detection quality before. Azadi \etal propose to suppress background bounding boxes while trying to select correct detections. However, their method focuses on adjusting detection scores and uses a fixed visual similarity matrix based on WordNet, while our algorithm learns the similarity matrix from data.
Instance-aware algorithms have been developed to provide finer solutions in different problem domains. Instance-aware segmentation aims to label instances at the pixel level [5, 30]. Li \etal propose a cascade network which finds each instance stage by stage. Similar to RIN, a network in  finds features of each instance. Ren \etal use a recurrent neural network to sequentially find each instance. A face detector which takes keypoints of faces as an input is suggested in . The dataset for this application contains face labels for identifying different faces, while the standard object detection datasets only have a small number of categories.
In object detection, Wang \etal introduce a repulsion loss to improve localization of instances. However, their approach is limited to a single-class detection problem and uses NMS  as a post-processing method. Lee \etal provide an inference method to find an optimal subset of detection candidates for pedestrian detection considering the individualness of each detection candidate. However, this approach tackles a single-class detection problem and uses features computed from a network pre-trained on the ImageNet dataset , instead of training the network for the desired purpose. Our method tackles a challenging multi-class detection task by learning distinctive features of object instances from data.
Recently, a detector which learns the structural relationship between objects is proposed in , where the detection score of an object is scaled by considering scene context and relationship between objects. Liu \etal show that training with a structural relationship can implicitly reduce redundant detection boxes, while our method explicitly suppresses the scores of duplicated detection boxes. Hu \etal utilize a modified attention module  for learning a relationship between bounding boxes. The module scales the scores using the instance relationship similar to ours. However, this method uses the standard softmax loss and smooth L1 loss, while our IDNet tackles this problem by training a detector with novel losses.
3 Proposed Method
An overview of the proposed IDNet is shown in Figure 2. IDNet is composed of a region proposal network (RPN), a region classification network (RCN) and a region identification network (RIN). Based on image feature maps from the backbone network, RPN predicts region proposals, i.e., the regions of interest (RoIs). Then, a RoI pooling layer pools regional features from feature maps for each RoI. Using the regional features, RCN classifies the regions into multiple categories while localizing the regions. RIN computes instance features of candidates, which are used by DPPs. (RIN consists of seven convolutional layers and three fully connected layers; the detailed structure of RIN is described in the appendix.)
3.1 Determinantal Point Processes for Detection
Suppose that there are N candidate bounding boxes, B = {b_1, …, b_N}, where b_i is the i-th bounding box. A determinantal point process (DPP) defines a probability distribution over subsets of B. If Y is a DPP, then

P(Y = Y) = det(L_Y) / det(L + I),   (1)

where Y ⊆ B, the kernel matrix L is a real symmetric positive semi-definite N×N matrix, the indexed kernel matrix L_Y is the submatrix of L indexed by the elements of Y, and I is the N×N identity matrix. The kernel matrix can be decomposed as L = ΦΦᵀ, where Φ is a feature matrix for the candidate bounding boxes. Each row of Φ is extracted from RIN and normalized to construct the matrix. Similar to the kernel matrix, the indexed kernel matrix can be decomposed as L_Y = Φ_Y Φ_Yᵀ.

Let q_i be the detection score for the i-th bounding box. Then, q = (q_1, …, q_N)ᵀ is the detection quality for all detection candidates. The feature for b_i is extracted from the RIN; let φ_i denote its normalized version, with ‖φ_i‖ = 1. The intersection over union between b_i and b_j can be calculated as IoU(b_i, b_j) = |b_i ∩ b_j| / |b_i ∪ b_j|, where |A| is the number of pixels in A, and we construct a matrix M by setting M_ij = IoU(b_i, b_j). A similarity matrix S is constructed by blending the spatial term M and the visual term whose (i, j) entry is φ_i · φ_j, with a mixing ratio set to 0.6 in our experiments. Using the similarity matrix S and the detection quality q, the kernel matrix for a DPP can be formed as L = (qqᵀ) ⊙ S, where ⊙ is the element-wise multiplication. (Notations in this paper are summarized in the appendix.)
If the similarity S and detection quality q are correctly assigned, a subset which maximizes (1) is a collection of the most distinctive detections due to the property of the determinant in a DPP. Since IDNet is trained to maximize the probability (1) of the ground-truth detections, IDNet learns the most distinctive features and correctly scaled detection scores, separating different object instances when computing S and q.
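The kernel construction above can be sketched in a few lines of NumPy. The mixing weight `rho` between spatial and visual similarity and all function names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def dpp_kernel(features, scores, ious, rho=0.6):
    """Build the DPP kernel L = (q q^T) * S from per-box features and scores.

    features: (N, D) per-box feature vectors (stand-ins for RIN outputs)
    scores:   (N,)   detection qualities q
    ious:     (N, N) pairwise IoU matrix (spatial similarity)
    rho:      assumed weight mixing spatial and visual similarity
    """
    # Normalize features so the visual similarity is cosine similarity.
    phi = features / np.linalg.norm(features, axis=1, keepdims=True)
    visual = phi @ phi.T                   # visual[i, j] = phi_i . phi_j
    S = rho * ious + (1.0 - rho) * visual  # blended similarity matrix
    q = scores.reshape(-1, 1)
    return (q @ q.T) * S                   # element-wise product

def dpp_log_prob(L, subset):
    """log P(Y = subset) = log det(L_Y) - log det(L + I)."""
    L_Y = L[np.ix_(subset, subset)]
    _, logdet_Y = np.linalg.slogdet(L_Y)
    _, logdet_Z = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_Y - logdet_Z
```

Because L is positive semi-definite, det(L + I) sums det(L_Y) over all subsets, so every log-probability is non-positive.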
3.2 Learning Detection Quality
As RCN classifies each RoI independently, multiple detections with different categories often have high detection scores. For example, a detector may report a horse near a dog, as they are visually similar. Conventional post-processing methods, such as NMS, typically suppress bounding boxes within each class. While heuristic post-processing algorithms are effective for removing duplicated bounding boxes within a category, they cannot remove duplicated boxes across different categories. In this case, even if there is a true bounding box for the dog, the horse bounding box cannot be removed. To alleviate this issue, we propose the sparse-score loss (SS loss) to detect an object with the correct class label by removing the other candidate boxes with incorrect categories.
We first select the k categories with the top detection scores among all categories for each RoI, assuming that the selected categories are visually similar to the correct one. By suppressing the scores of these visually similar categories, except for the bounding boxes of the top-1 category, we can obtain a single bounding box with a correct category for an object. Let B_k be the set of all bounding boxes of the top-k categories from all RoIs and P be the set of positive boxes, i.e., bounding boxes with a top-1 category in each RoI. Then, we define the SS loss as the negative log-likelihood of (1):

L_SS = −log P(Y = P) = log det(L_{B_k} + I) − log det(L_P),

where L_{B_k} and L_P are the kernel matrices restricted to B_k and P, respectively. This loss function increases the detection scores of bounding boxes in the positive set P; in other words, it suppresses the scores of all subsets that contain at least one non-positive bounding box. We note that the normalization term of the DPP is included for numerical stability during training.
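A minimal NumPy sketch of the SS loss as a negative log-likelihood, assuming the DPP kernel over the top-k boxes has already been built; the function name and index convention are ours:

```python
import numpy as np

def ss_loss(L, positive_idx):
    """Sparse-score loss: negative log-likelihood of the positive subset P
    under the DPP with kernel L over the top-k candidate boxes.

    L:            (N, N) DPP kernel over all top-k boxes
    positive_idx: indices of the top-1-category (positive) boxes
    """
    L_P = L[np.ix_(positive_idx, positive_idx)]
    _, logdet_P = np.linalg.slogdet(L_P)
    _, logdet_Z = np.linalg.slogdet(L + np.eye(L.shape[0]))
    # Minimizing this raises scores of positive boxes and suppresses
    # every subset that contains at least one non-positive box.
    return logdet_Z - logdet_P
```

Since P(Y = P) ≤ 1, the loss is always non-negative, and the normalization term keeps it finite.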
We also use two softmax losses for classification (one for binary classification and one for multi-class classification) and two smooth L1 losses for the RoI regression and the candidate-box regression. Note that these losses are the same as in Faster R-CNN, since we adopt Faster R-CNN as a baseline. We call the summation of all the above losses the multi-task loss.
Suppose RPN predicts the objectness probability p_i and location shifts t_i, where i indexes the RoIs in a mini-batch, and RCN predicts the category probabilities p_j and location shifts t_j, where j indexes the candidate boxes. The target location shifts for the i-th RoI and the j-th candidate box are t*_i and t*_j, and c*_i and c*_j are the ground-truth category labels for a RoI and a candidate box, respectively. Then, the multi-task loss is expressed as follows:

L_multi = Σ_i [L_bcls(p_i, c*_i) + L_reg(t_i, t*_i)] + Σ_j [L_cls(p_j, c*_j) + 1[c*_j ≠ background] L_reg(t_j, t*_j)],

where 1[·] is an indicator function, which outputs 1 when the j-th candidate box has a non-background label.
With all losses defined as above, the weights for the backbone, RPN, and RCN, which are denoted in Figure 2, can be learned by optimizing the sum of the multi-task loss and the weighted SS loss, where the weight balances the SS loss with the multi-task loss. The similarity matrix S is fixed while calculating the gradient of the SS loss, since RIN is frozen while optimizing these weights.
3.3 Learning Instance Differences
An instance-agnostic detector solely based on object category information often fails to detect objects in proximity. For accurate detections from real-world images with frequent overlapping objects, it is crucial to distinguish different object instances. To address this problem, we propose the instance-aware detection loss (ID loss). The objective of this loss function is to obtain similar features from the same instance and different features from different instances. This is done by maximizing the probability of a subset of the most distinctive bounding boxes.
Let D be the set of all candidate bounding boxes which intersect the ground-truth bounding boxes, and let R be the set of the most representative boxes, i.e., the candidate boxes closest to the ground-truth boxes, obtained by the Hungarian algorithm. Then, the ID loss for all objects is defined as follows:
Due to the determinant, it increases the cosine distance between the features of two boxes when they come from different instances. As we select boxes near the ground-truth bounding boxes to construct the representative set, the network can learn which bounding boxes are similar and which are different.
In addition to (5), we set an objective which focuses on differentiating instances of the same category. For each category c, let D_c be the candidate boxes of the c-th category and R_c be the subset of candidate boxes closest to the ground-truth boxes, again obtained by the Hungarian algorithm. The category-specific ID loss is defined as follows:
It provides an additional guidance signal for training the network, since it is more difficult to distinguish similar instances of the same category than instances of different categories. We find an improvement when the all-category and category-specific ID losses are used together, compared to using only one of them. Finally, the ID loss is defined as the sum of the two terms.
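The effect of the ID loss on the similarity matrix can be illustrated with a small NumPy sketch. The detection quality is held fixed (as in the paper), the normalizer here runs over the candidate set only, and the function name is ours:

```python
import numpy as np

def id_loss(similarity, representative_idx):
    """Instance-aware detection loss on the similarity matrix S alone.

    Minimizing -log P(Y = R) pushes features of different instances apart:
    det(S_R) grows as the representative features become more orthogonal.
    """
    S_R = similarity[np.ix_(representative_idx, representative_idx)]
    _, logdet_R = np.linalg.slogdet(S_R)
    n = similarity.shape[0]
    _, logdet_Z = np.linalg.slogdet(similarity + np.eye(n))
    return logdet_Z - logdet_R
```

Two near-duplicate representative features give a near-singular S_R and hence a much larger loss than two orthogonal ones, which is exactly the repulsive behavior the loss is designed to exploit.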
The goal of the ID loss is to find all instances while discriminating different instances, as shown in Figure 1. While the ID loss aims to distinguish instances, the multi-task loss tries to classify categories. The difference between their goals makes the network perform worse when both losses are used simultaneously. To alleviate this problem, we train the weights of RIN (shown in Figure 2) separately from the other weights. Given a set of candidate bounding boxes and subsets of them, the weights of RIN can be learned by optimizing the ID loss. (The gradients of the SS and ID losses are derived in the appendix.)
Note that while calculating the gradient of the ID loss, the detection quality q is fixed, as the rest of the network is frozen while optimizing RIN.
Given a set of candidate bounding boxes, the similarity matrix S, and the detection quality q, Algorithm 1 (IDPP) finds the most representative subset of bounding boxes. Finding the subset that maximizes the probability (1) is NP-hard. Fortunately, due to the log-submodular property of DPPs, we can approximately solve the problem using a greedy algorithm, such as Algorithm 1, which iteratively adds the index of a detection candidate until doing so can no longer make the cost of the new subset higher than that of the current subset, where the cost of a set Y is log det(L_Y).
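Algorithm 1 itself is not reproduced here; the following NumPy sketch shows one plausible greedy DPP-MAP loop of the kind the text describes, with log det(L_Y) as the set cost. The stopping rule and names are assumptions:

```python
import numpy as np

def idpp_greedy(L, max_boxes=None):
    """Greedy MAP inference for a DPP with kernel L.

    Iteratively adds the candidate that most increases log det(L_Y),
    stopping when no candidate yields a positive gain.
    """
    n = L.shape[0]
    if max_boxes is None:
        max_boxes = n
    selected = []

    def cost(idx):
        if not idx:
            return 0.0
        sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
        return logdet if sign > 0 else -np.inf

    current = 0.0
    while len(selected) < max_boxes:
        gains = [(cost(selected + [i]) - current, i)
                 for i in range(n) if i not in selected]
        best_gain, best_i = max(gains)
        if best_gain <= 0:
            break  # no candidate improves the set cost
        selected.append(best_i)
        current += best_gain
    return selected
```

With two near-duplicate boxes and one distinct box, the loop keeps the higher-quality duplicate and the distinct box, mimicking the duplicate suppression the paper attributes to IDPP.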
4 Experiments
Datasets and baseline methods.
To demonstrate that our IDNet is effective for detecting overlapped objects, we construct the VOC crowd set from the VOC 2007 test set and the COCO crowd set from the COCO val set. The crowd sets contain at least one overlapped object per image. Unless otherwise specified, we define overlapped objects as those that overlap with another object by more than 0.3 IoU in all experiments. The VOC crowd set contains 283 images and the COCO crowd set consists of 5,471 images. The indices of the crowd sets will be made publicly available.
Since the goal of our algorithm is to discriminate instances among given candidate bounding boxes, we adopt Faster R-CNN as a proposal network to obtain candidate detections, but other proposal networks can be used in our framework. We implement two baseline methods, Faster R-CNN and LDDP, to compare with our algorithm. Since few methods have been tested on the crowd sets, we choose these two baselines for a fair comparison. Note that our baseline implementation achieves a reasonable performance of 71.4% mAP when trained on VOC 2007 with VGG-16 as a backbone, compared with the 69.9% mAP reported in the original paper.
We use different inference algorithms for each method. Unless otherwise stated, Faster R-CNN uses NMS, LDDP uses LDPP, and IDNet uses IDPP as an inference algorithm. LDPP is an inference algorithm proposed in LDDP , which uses a fixed class-wise similarity matrix while our IDPP uses the instance-aware features extracted from RIN.
All baseline methods and our IDNet are implemented based on the Faster R-CNN in TensorFlow, where most parameters, such as the learning rate, optimizer, data augmentation strategy, and batch size, are the same as in the original paper. In our method, we use backbone networks, e.g., VGG-16 and ResNet-101, pre-trained on ImageNet, and the RIN module is initialized with Xavier initialization. RIN shares the parameters of the backbone, such as the layers up to conv2 of VGG-16 and conv1 of ResNet-101, to conserve memory. We set the number of selected categories k to five for VOC and ten for COCO, since VOC has around five categories per super-category and COCO has around ten on average. We set the ratio between the spatial similarity and the visual similarity to 0.6, similar to the values used in [37, 20]. Since the performance of a detector is poor during the early stage of training, the top-k bounding boxes do not yet contain visually similar categories; thus, we set the SS loss weight to zero during the early stage and increase it to 0.01 afterwards. The early stage is chosen as around 60% of the total training iterations. We use 40k iterations for VOC 2007, 70k for VOC 0712, and 360k for COCO. Additionally, we set the size of the candidate box set to 256, as it performs best. More implementation details can be found in the appendix.
For evaluation, we use the mean average precision (mAP). For VOC, we report mAP that counts detection candidates with IoU over 0.5 as correct. For COCO, we evaluate performance with three metrics from the standard MS COCO protocol: AP, AP50, and AP75. AP averages mAP over ten IoU thresholds from 0.5 to 0.95, AP50 reports mAP at IoU 0.5, and AP75 reports mAP at IoU 0.75. A high AP75 score requires better localization of detection boxes.
4.1 PASCAL VOC
For VOC 2007, we train a network with VOC 2007 trainval, which contains 5k images. For VOC 0712, we train a network with VOC 0712 trainval set, which includes 16k images. All methods are tested on VOC 2007 test set, which has 5k images. After training IDNet with the SS loss and the multi-task loss, we train RIN to learn differences of instances with the ID loss for 30k iterations for VOC 2007, and 20k iterations for VOC 0712. While training RIN, the parameters in other modules except RIN are frozen. A VGG-16 backbone is used for all tested methods for PASCAL VOC.
Since IDNet is effective for overlapped objects, we report recall, calculated as the ratio of detected objects among the overlapped objects (Figure 3). For calculating recall, we check whether each object that overlaps another object above a fixed IoU threshold is detected. After calculating the probability of detecting overlapped objects in each category, the results are averaged over categories. Recall is a better performance measure than mAP for showing robustness to overlap, because recall is calculated only over overlapped objects, while mAP is calculated over all objects in any image containing at least a single overlapped object.
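The recall measure can be sketched as follows, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the thresholds and names are illustrative, not the paper's exact evaluation code:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def overlapped_recall(gt_boxes, det_boxes, overlap_thr=0.4, match_thr=0.5):
    """Recall over ground-truth objects that overlap another object
    by more than overlap_thr IoU (a sketch of the Figure 3 measure)."""
    crowded = [i for i, g in enumerate(gt_boxes)
               if any(iou(g, h) > overlap_thr
                      for j, h in enumerate(gt_boxes) if j != i)]
    if not crowded:
        return None  # no overlapped objects in this image
    hit = sum(any(iou(gt_boxes[i], d) >= match_thr for d in det_boxes)
              for i in crowded)
    return hit / len(crowded)
```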
In Figure 3, recall for the objects with overlap over 0.4 is increased from 0.58 (Faster R-CNN) to 0.71 (IDNet), which is an impressive improvement. For all overlap regions, recall is higher than baseline methods and as the overlap ratio gets higher, the performance gap between Faster R-CNN and IDNet gets bigger. The results show that IDNet is effective for detecting objects in proximity.
|Method||Inference||Train||mAP||mAP (crowd)|
|Fast R-CNN||NMS||07||66.9||-|
|Faster R-CNN||NMS||07||71.4||56.0|
|Fast R-CNN||NMS||07+12||70.0||-|
|Faster R-CNN||NMS||07+12||75.8||62.0|
|Method||Inference||Backbone||AP||AP (crowd)||AP50||AP50 (crowd)||AP75||AP75 (crowd)|
|Faster R-CNN||NMS||VGG-16||26.2||19.2||46.6||36.9||26.9||18.4|
|Faster R-CNN||NMS||ResNet-101||31.5||23.5||52.0||42.5||33.5||23.0|
To demonstrate that our IDNet is effective for detecting overlapped objects under the standard mAP, we test Faster R-CNN, LDDP, and our IDNet (including a variant that uses only the ID loss) on the VOC crowd set (Table 1). IDNet shows impressive improvements over Faster R-CNN: 5.8% mAP for VOC 2007 and 2.5% for VOC 0712. We also observe improvements over LDDP: 4.1% mAP for VOC 2007 and 1.4% for VOC 0712. When we evaluate mAP on the full test set, the mAP is also increased over the baseline methods for both VOC 2007 and VOC 0712 (Table 1).
4.2 MS COCO
MS COCO is composed of 80k images in the train set and 40k images in the val set. After training a network with the SS loss and the multi-task loss, we train the RIN module with the ID loss for 20k additional iterations.
In Table 2, we report the results using multiple APs for COCO. On the COCO crowd test set, Table 2 shows that the performance is improved from 19.2% to 20.5% AP for VGG-16. Since the larger number of categories in COCO makes distinguishing instances harder, the improvement is smaller than on the VOC crowd set. To demonstrate the general effectiveness of our IDNet, we also provide results with the backbone replaced by ResNet-101: the performance of IDNet improves from 23.5% AP to 24.4% AP over Faster R-CNN, which shows the effectiveness of our IDNet on a stronger backbone. We also observe that the improvement on AP75 is bigger than the improvement on AP50, which means that IDNet with the IDPP inference algorithm is effective for localization accuracy.
For all COCO val images, the performance is improved by 1.1% AP for the VGG-16 backbone and 1.2% AP for the ResNet-101 backbone (Table 2). We attribute these improvements to the fact that there are many similar categories in COCO, which has eight categories for each of its 11 super-categories on average. Since a number of duplicated candidate boxes can be generated, our SS loss can remove duplicated bounding boxes to increase the final detection performance.
To verify that SS loss affected the improvements, we extract candidate boxes having detection scores over a fixed threshold (0.01) in Figure 4. When a predicted box overlaps with the ground truth box by 0.5 of IoU or more, we consider it as a correct box. We divide the number of correct boxes by the number of bounding boxes to check how many boxes are correctly classified. Figure 4 shows that IDNet achieves superior performance on this measure for all categories compared to other methods. On average, IDNet achieves 43.7% while Faster R-CNN has 32.4% and LDDP has 32.9% for COCO. The results indicate that the SS loss can successfully remove incorrectly classified bounding boxes.
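The measure in this paragraph, the fraction of above-threshold boxes that are correctly classified and localized, can be sketched as follows; the data layout and names are illustrative:

```python
def correct_box_ratio(dets, gts, score_thr=0.01, iou_thr=0.5):
    """Fraction of above-threshold boxes matching a ground-truth box of
    the same class by IoU >= iou_thr (a sketch of the Figure 4 measure).

    dets: list of (box, label, score); gts: list of (box, label).
    Boxes are (x1, y1, x2, y2) tuples.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    kept = [(b, l) for b, l, s in dets if s >= score_thr]
    if not kept:
        return 0.0
    correct = sum(any(l == gl and iou(b, gb) >= iou_thr for gb, gl in gts)
                  for b, l in kept)
    return correct / len(kept)
```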
We measure the average inference time per image using VGG-16 as a backbone network on minival set of COCO, which is a subset of 5k samples from the val set. All running times are measured on a machine with Intel Core 3.7GHz CPU and Titan X GPU.
Our algorithm takes 2.14 seconds to find candidate boxes and extract their features, and 0.33 seconds to select bounding boxes using IDPP. Since Faster R-CNN takes 1.61 seconds and LDDP takes 1.55 seconds, our method needs an extra 0.86 seconds per image compared with Faster R-CNN and 0.92 seconds compared with LDDP. Although our algorithm takes more time for inference, it can be used in problems which require exact detections in a crowd.
4.3 Ablation Study
We analyze the influence of the ID loss and SS loss in Table 3, where the IDNet is trained with COCO train set using VGG-16 as a backbone. In ablation studies, we check our IDNet with two post-processing methods: NMS and IDPP. In the first two rows in Table 3, we use NMS for the experiments that do not use the ID loss, since IDPP uses the trained features with the ID loss. In the last two rows of Table 3, we use IDPP with a trained RIN module.
Instance-aware detection loss.
The ID loss is designed to be effective for detecting objects in a crowded scene. In the third row of Table 3, the performance improves from 19.2% to 19.7% AP on the COCO crowd set. Comparing the second and last rows, the performance improves by 0.9% AP. In Figure 5(a), a person is not detected by Faster R-CNN, while our IDNet detects the person in Figure 5(b), since IDNet learns to discriminate different objects. This result indicates that the ID loss is effective for detecting objects in proximity.
Sparse-score loss.
Since the SS loss is designed to remove incorrectly classified bounding boxes, it is effective for all testing images. Thus, we focus on the results over all val images in Table 3. The results show that when the SS loss is used, the performance is improved by 0.8% AP.
In Figure 5(c), a sheep is erroneously detected on a cow, while our IDNet removes this erroneous sheep detection in Figure 5(d), as IDNet learns to remove incorrectly classified bounding boxes. This shows that the SS loss can alleviate the duplicated-bounding-box problem in a detector.
Since Figure 5 only shows the final detections, we visualize images with candidate boxes in Figure 6 to show the changes in detection scores. The score threshold is fixed to 0.1 and the highest score in each category is written in each image.
We first compare the result with Faster R-CNN. Since Faster R-CNN does not have any loss to decrease the scores of incorrect categories, the highest score of a horse in Faster R-CNN is 0.546 while the score in IDNet is 0.158 (see the first row of Figure 6). For images in the second row of Figure 6, the maximum score of an incorrect category, remote, is 0.476 in Faster R-CNN, while the maximum score of a remote is under the threshold (0.1) in IDNet.
We also compare the result with LDDP . The LDDP loss  is defined to increase the score of a single subset using a category-level relationship, while our SS loss is defined to decrease scores of all possible subsets containing incorrect candidate boxes using an instance-level relationship between candidate boxes. Thus, after softmax is applied to scores, the SS loss can better suppress the detection scores of bounding boxes with incorrect categories. For example, as shown in the third and last columns of Figure 6, given a cow image, the detection score for a horse is decreased from 0.673 (LDDP) to 0.158 (IDNet). It shows that the SS loss can successfully suppress scores of duplicated bounding boxes around a correct bounding box as expected.
5 Conclusion
We propose IDNet, which tackles two challenges in object detection by introducing two novel losses. First, we propose the ID loss for detecting overlapped objects. Second, the SS loss is introduced to suppress erroneous detections of wrong categories. By introducing these two losses using DPPs, we demonstrate that learning instance-level relationships is useful for accurate detection. IDNet performs favorably on the overall test sets and achieves significant improvements on the crowd sets. Additionally, the ablation studies show that IDNet learns to suppress erroneous detections of wrong categories. While the inference time is moderately slower than that of other detection methods, our algorithm is useful for real-world situations that require separating objects in proximity.
|RoIs||Region-of-interest boxes proposed by the RPN.|
|b||Candidate bounding boxes proposed by the RCN.|
|IoU||Intersection over union of two bounding boxes.|
|q||Detection scores corresponding to the candidate bounding boxes.|
|φ||Normalized feature of a bounding box.|
|S||Similarity matrix between bounding boxes.|
|L||Kernel matrix of DPPs.|
Appendix A Notations
In Table 4, notations used in this paper are described.
Appendix B Gradient of losses
In this section, we derive the gradients of the proposed instance-aware detection loss (ID loss) and sparse-score loss (SS loss). For notational convenience, for any matrix M and index set Y, we write the zero-padded matrix that has the same dimension as M, whose entries indexed by Y are copied from M and whose remaining entries are filled with zero.
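Both derivations repeatedly use the standard log-determinant identity; as a reference point (our rendering, not the paper's original equation):

```latex
% Gradient of the log-determinant for symmetric positive-definite X:
\frac{\partial}{\partial X}\,\log\det X \;=\; X^{-\top} \;=\; X^{-1}.
% Applied to the DPP log-likelihood of a subset Y with kernel L,
% with [\,\cdot\,] denoting zero-padding to the full matrix size:
\frac{\partial}{\partial L}\Bigl[\log\det L_Y - \log\det(L + I)\Bigr]
  \;=\; \bigl[L_Y^{-1}\bigr] \;-\; (L + I)^{-1}.
```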
B.1 Gradient of Instance-Aware Detection Loss
Here, we show the gradient with respect to the normalized features. As the derivative of the log-determinant, ∂ log det X / ∂X, equals X⁻¹ for a symmetric positive-definite X, the derivative of the intra-class ID loss is as follows:
where ⟨·, ·⟩ is the Frobenius inner product, ⊙ is the element-wise multiplication, and c indexes the categories, of which there are C in total. We only calculate the gradient of the ID loss with respect to the similarity features extracted from RIN. Since the quality term is a constant, the derivative is as follows:

Note that the detection quality is fixed while deriving the gradient of the ID loss. Using standard properties of the Frobenius inner product for arbitrary matrices, we can derive this:
Examining the matrix element-wise,
Since the gradient of the all-object ID loss is derived similarly to that of the category-specific ID loss, we omit its derivation. Then, we can construct the gradient of the ID loss by summing up (12) over all batches and categories:
B.2 Gradient of Sparse-Score Loss
The derivation of the gradient of the SS loss is similar to that of the instance-aware detection loss, except that the gradient of the SS loss is taken with respect to the quality q. Note that the similarity matrix is fixed while deriving the gradient of the SS loss. The derivative of the SS loss is as follows:
Similar to the derivation of ID loss, by using the following properties,
we can derive this:
Thus, the final derivative of SS loss is as follows:
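As a sanity check on gradients taken over the quality term, note that for a kernel of the form $L = \operatorname{diag}(q)\, S\, \operatorname{diag}(q)$ (cf. the quantities in Table 4), $\log\det L = \log\det S + 2\sum_i \log q_i$, so $\partial \log\det L / \partial q_i = 2/q_i$. A minimal numerical confirmation, with toy values of our choosing:

```python
import numpy as np

q = np.array([0.9, 0.7, 0.5])
S = np.array([[1.0, 0.3, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.4, 1.0]])   # positive-definite similarity matrix

# Closed form: d log det(L) / d q_i = 2 / q_i
grad_closed = 2.0 / q

# Central finite differences over the quality vector
eps = 1e-6
grad_fd = np.zeros_like(q)
for i in range(len(q)):
    qp, qm = q.copy(), q.copy()
    qp[i] += eps; qm[i] -= eps
    Lp = np.diag(qp) @ S @ np.diag(qp)
    Lm = np.diag(qm) @ S @ np.diag(qm)
    grad_fd[i] = (np.log(np.linalg.det(Lp))
                  - np.log(np.linalg.det(Lm))) / (2 * eps)

print(np.max(np.abs(grad_fd - grad_closed)))  # finite-difference error only
```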
Appendix C Network Architecture
As shown in Table 5, the RIN consists of seven convolutional layers, three fully connected layers, three max-pooling layers, and one crop-and-resize layer. Since the RIN utilizes parameters of a backbone network, the size of the input channel is chosen according to the backbone network, e.g., 64 for VGG-16 and ResNet-101. The parameters are used for training on VOC; different values are used for COCO. Each convolutional and fully connected layer, except the last, is followed by batch normalization  and a rectified linear unit (ReLU), in that order. All convolutional layers have filters of size 3×3 pixels with a stride of one.
|4||Max pooling||-||size 2×2, stride 2|
|8||Crop and resize||-||size 15×15|
|9||Fully connected||( )1000||-|
Appendix D More Experimental Results
Appendix E Example Visualization
We visualize qualitative results of IDNet on VOC 2007 and MS COCO. For comparison, we also visualize the ground-truth bounding boxes in each image and the results of Faster R-CNN and LDDP. For Faster R-CNN and LDDP, only bounding boxes with scores above a threshold of 0.6 are visualized, as designated in their papers. For IDNet, we use a score threshold of 0.2.
|Method||Inference||Train||mAP||aero||bike||bird||boat||bottle||bus||car||cat||chair||cow||table||dog||horse||mbike||person||plant||sheep||sofa||train||tv|
|Faster R-CNN ||NMS||07||71.4||70.4||78.2||69.7||58.9||56.9||79.5||83.0||84.3||53.3||78.6||64.5||81.7||83.7||76.1||77.9||45.4||70.5||66.7||74.3||73.3|
|Faster R-CNN ||NMS||07+12||75.8||77.2||84.1||74.8||67.3||65.5||82.0||87.4||87.9||58.7||81.5||69.8||85.0||85.1||77.7||79.2||47.2||75.4||71.8||82.3||75.8|
|Faster R-CNN ||NMS||07||56.0||45.5||56.0||44.2||42.0||57.4||54.5||70.3||37.4||47.2||67.8||65.4||56.4||63.0||61.4||67.8||30.0||66.4||53.4||63.6||70.6|
|Faster R-CNN ||NMS||07+12||62.0||100.0||59.4||60.1||28.5||61.3||53.2||72.0||51.4||51.9||67.0||67.0||55.1||76.9||71.4||69.4||32.6||67.5||61.1||63.6||70.2|
|Method||Inference||Backbone||AP||AP50||AP75||APS||APM||APL||AR1||AR10||AR100||ARS||ARM||ARL|
|Faster R-CNN ||NMS||VGG-16||26.2||46.6||26.9||10.3||29.3||36.4||25.5||38.1||39.0||17.9||44.0||55.7|
|Faster R-CNN ||NMS||ResNet-101||31.5||52.0||33.5||12.5||35.2||45.9||29.2||43.2||44.2||20.6||49.9||63.8|
|Faster R-CNN ||NMS||VGG-16||19.2||36.9||18.4||8.5||24.3||31.0||17.0||28.6||29.6||13.4||36.4||47.8|
|Faster R-CNN ||NMS||ResNet-101||23.5||42.5||23.0||10.4||29.6||38.5||19.3||32.8||34.0||16.1||41.7||54.6|
Failure cases analysis.
The top image of Figure 9 shows that the detector selects a bounding box of the wrong category for avocados. The detector has found classes similar to avocado, such as banana and apple, because the avocado category is not in the dataset. This case suggests that scores should be further suppressed when no matching detection class exists, i.e., when the object belongs to the background category. In the bottom image of Figure 9, a giraffe is hidden behind two trees. When an object is occluded, detectors tend not to recognize it as a single object and instead choose several bounding boxes for it. Since IDPP tries to find the most representative bounding boxes, it selects all of these created bounding boxes, which increases the number of false detections.
Successful results of IDNet are visualized in Figure 10 for VOC 2007 and in Figure 11 for MS COCO. In Figure 10, the first and last rows show that bounding boxes of incorrect classes are suppressed while the correct class is selected. The remaining rows show that objects in proximity are detected where other methods fail, i.e., overlapped objects are successfully detected by IDNet. In Figure 11, all results show that IDNet can detect overlapped objects.
Overall, the results show that the proposed IDNet detects overlapped objects better than the other algorithms while suppressing bounding boxes with incorrect categories.
-  (2016) TensorFlow: A System for Large-Scale Machine Learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), Cited by: §4.
-  (2008) People-Tracking-by-Detection and People-Detection-by-Tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2017) Learning Detection with Diverse Proposals. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 6, Table 7, Table 8, Table 9, Appendix E, §1, §2, §2, §3.4, §4, §4, §4.2, §4.3, Table 1, Table 2.
-  (2015) Large-Margin Determinantal Point Processes. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §1.
-  (2016) Instance-aware Semantic Segmentation via Multi-task Network Cascades. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2016) R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Neural Information Processing Systems (NIPS), Cited by: §2.
-  (2009) ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4.
-  (2010) The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (IJCV) 88 (2), pp. 303–338. Cited by: §1, §4.
-  (2010) Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32 (9), pp. 1627–1645. Cited by: §2.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2015) Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, Table 1.
-  (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §4.
-  (2016) Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.
-  (2018) Relation Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), Cited by: Appendix C.
-  (2016) HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2008) Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research (JMLR) 9, pp. 235–284. Cited by: §1.
-  (1955) The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97. Cited by: §3.3, §3.3.
-  (2012) Determinantal Point Processes for Machine Learning. arXiv preprint arXiv:1207.6083. Cited by: §1, §1, §3.1, §3.1, §3.1, §3.4.
-  (2016) Individualness and Determinantal Point Processes for Pedestrian Detection. In European Conference on Computer Vision (ECCV), Cited by: §2, §4.
-  (2016) Face Detection with End-to-End Integration of a ConvNet and a 3D Model. In European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2012) Learning Mixtures of Submodular Shells with Application to Document Summarization. Conference on Uncertainty in Artificial Intelligence (UAI). Cited by: §1.
-  (2014) Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), Cited by: §1, §4, §4.
-  (2016) SSD: Single Shot Multibox Detector. In European Conference on Computer Vision (ECCV), Cited by: §1, §2, Table 1.
-  (2018) Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (1995) WordNet: A Lexical Database for English. Communications of the ACM 38 (11), pp. 39–41. Cited by: §2.
-  (2017) HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §1.
-  (2016) You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (2017) YOLO9000: Better, Faster, Stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (2017) End-to-End Instance Segmentation with Recurrent Attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Neural Information Processing Systems (NIPS), Cited by: Table 6, Table 7, Table 8, Table 9, §1, §1, §1, §1, §2, Figure 2, §3.2, §4, §4, Table 1, Table 2.
-  (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.
-  (2005) Robust and Efficient Foreground Analysis for Real-time Video Surveillance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
-  (2017) Attention Is All You Need. In Neural Information Processing Systems (NIPS), Cited by: §2.
-  (2018) Repulsion Loss: Detecting Pedestrians in a Crowd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2017) Deep Determinantal Point Process for Large-Scale Multi-Label Classification. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
-  (2016) Video Summarization with Long Short-term Memory. In European Conference on Computer Vision (ECCV), Cited by: §1, §4.
-  (2010) Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences (PNAS) 107 (10), pp. 4511–4515. Cited by: §1.