Learning Instance-Aware Object Detection Using Determinantal Point Processes


Nuri Kim
Seoul National University
nuri.kim@rllab.snu.ac.kr
   Donghoon Lee
Seoul National University
donghoon.lee@rllab.snu.ac.kr
   Songhwai Oh
Seoul National University
songhwai@snu.ac.kr
Abstract

Recent object detectors find instances while categorizing candidate regions. As each region is evaluated independently, the number of candidate regions from a detector is usually larger than the number of objects. Since the final goal of detection is to assign a single detection to each object, a heuristic algorithm, such as non-maximum suppression (NMS), is used to select a single bounding box for an object. While simple heuristic algorithms are effective for stand-alone objects, they can fail to detect overlapped objects. In this paper, we address this issue by training a network to distinguish different objects using the relationship between candidate boxes. We propose an instance-aware detection network (IDNet), which can learn to extract features from candidate regions and measure their similarities. Based on pairwise similarities and detection qualities, the IDNet selects a subset of candidate bounding boxes using instance-aware determinantal point process inference (IDPP). Extensive experiments demonstrate that the proposed algorithm achieves significant improvements for detecting overlapped objects compared to existing state-of-the-art detection methods on the PASCAL VOC and MS COCO datasets. (This paper is under consideration at Computer Vision and Image Understanding.)

1 Introduction

Object detection is one of the fundamental problems in computer vision. Its goal is to detect objects by classifying and regressing bounding boxes in an image [10, 11, 31, 28, 29, 24]. It has received much attention because of its wide range of applications, such as object tracking [2], surveillance [33], and face detection [27]. Most of the state-of-the-art detectors show significant performance improvements based on deep convolutional neural networks. Despite the advances in object detection, it is still difficult to assign correct detections to all objects in an image since detectors do not distinguish different object instances in the same class, as they only focus on an instance-agnostic task, i.e., object category classification. This issue becomes critical when objects are overlapped. As shown in the left image of Figure 1, the person on the right is not detected due to the overlapped bounding boxes in proximity.

Figure 1: Detection errors. The errors are indicated by the dashed boxes. All examples are from the baseline Faster R-CNN detector on PASCAL VOC. Left: A crowded scene with people, where a person is not detected. Right: An image with duplicated detections from a single object, where a dog is mistaken as a horse.

In order to address this issue, we propose an instance-aware detection network (IDNet), which learns to differentiate representations of different objects. IDNet learns the similarity among bounding boxes during training and selects a subset of boxes based on the learned similarity during inference. Specifically, IDNet learns to compare appearances of bounding boxes while considering their spatial arrangements.

IDNet uses an existing detector, such as Faster R-CNN, as a component to obtain candidate bounding boxes. Given candidate boxes, IDNet extracts features of all candidates using a CNN branch, named a region identification network (RIN), which aims to increase the probability of selecting an optimal subset of detections. To this end, IDNet is trained not only with the softmax loss and smooth L1 loss [31], but also with novel losses based on determinantal point processes (DPPs) [19]. A DPP is used in various machine learning fields, such as document and video summarization [4, 37, 22], sensor placement [17], recommendation systems [38] and multi-label classification [36], to select a desirable subset from a set of candidates. Using the property of repulsiveness in DPPs, we design an instance-aware detection loss (ID loss), which learns to increase the probability of selecting an instance-aware subset from detection candidates.

Another source of the detection error is multiple detections of different classes for a single object. This error has been known to be one of the persistent problems for instance-agnostic detectors, such as Faster R-CNN [31]. For example, as shown in the right image of Figure 1, there are two bounding boxes categorized as a dog and a horse for the same object. Since the objective of a detector is to find a single bounding box for a single object instance, we propose the sparse-score loss (SS loss) using DPPs to make IDNet assign a single bounding box for a single object, considering all categories. In particular, we formulate the SS loss to remove duplicated bounding boxes by training IDNet to have low confidence scores for bounding boxes with incorrect class labels. After training, our algorithm efficiently finds a subset of candidate detections using the log-submodular property of DPPs [19].

Experimental results show that IDNet is more robust than baseline detectors, such as Faster R-CNN [31] and learning detection with diverse proposals (LDDP) [3], for detecting overlapped objects on PASCAL VOC [8] and MS COCO [23]. Our IDNet achieves a 5.8% mAP improvement on PASCAL VOC 2007 and a 2.5% mAP improvement on PASCAL VOC 0712 over Faster R-CNN when tested on the VOC crowd set, which consists of images with overlapped objects. For COCO, the performance is improved by 1.3% AP when tested on the COCO crowd set.

The main contributions of this paper are summarized as follows: (1) Two novel losses, the sparse-score loss and the instance-aware detection loss, are proposed for instance-aware detection; (2) To the best of our knowledge, this work is the first approach that trains a neural network to learn quality and diversity terms of a DPP for object detection; (3) The proposed algorithm outperforms baseline detectors for detecting overlapped objects.

2 Related Work

Class-aware detection algorithms.

The goal of class-aware or multi-class object detection is to localize objects in an image while predicting the category of each object. These systems are usually composed of region proposal networks and region classification networks [11, 31, 24]. To improve detection accuracy, a number of different optimization formulations and network architectures have been proposed [31, 16, 3, 28, 24, 29, 6]. Ren et al. [31] use convolutional networks, called region proposal networks, to generate region proposals and combine them with Fast R-CNN. Kong et al. [16] utilize each layer's features for detecting small objects in an image. A real-time multi-class object detector is proposed by combining region proposal networks and classification networks in [28]. Liu et al. [24] improve the performance of [28] using multiple detectors for each convolutional layer. To increase network efficiency, fully connected layers are replaced by convolution layers in [6]. Redmon et al. [29] extend [28] by classifying thousands of categories using the hierarchical structure of categories in the dataset.

DPPs have been used to improve detection quality before. Azadi et al. [3] propose to suppress background bounding boxes while trying to select correct detections. However, this method focuses on rescaling detection scores and uses a fixed visual similarity matrix based on WordNet [26], while our algorithm learns the similarity matrix from data.

Instance-aware algorithms.

Instance-aware algorithms have been developed to provide finer solutions in different problem domains. Instance-aware segmentation aims to label instances at the pixel level [5, 30]. Dai et al. [5] propose a cascade network which finds each instance stage by stage. Similar to RIN, a network in [5] finds features of each instance. Ren et al. [30] use a recurrent neural network to sequentially find each instance. A face detector which takes keypoints of faces as an input is suggested in [21]. The dataset for this application contains face labels for identifying different faces, while the standard object detection datasets only have a small number of categories.

In object detection, Wang et al. [35] introduce a repulsion loss to improve localization of instances. However, their approach is limited to a single-class detection problem and uses NMS [9] as a post-processing method. Lee et al. [20] provide an inference method to find an optimal subset of detection candidates for pedestrian detection considering the individualness of each detection candidate. However, this approach tackles a single-class detection problem and uses features computed from a network pre-trained on the ImageNet dataset [7], instead of training the network for the desired purpose. Our method tackles a challenging multi-class detection task by learning distinctive features of object instances from data.

Recently, a detector which learns the structural relationship between objects was proposed in [25], where the detection score of an object is scaled by considering scene context and the relationship between objects. Liu et al. [25] show that training with a structural relationship can implicitly reduce redundant detection boxes, while our method explicitly suppresses the scores of duplicated detection boxes. Hu et al. [14] utilize a modified attention module [34] for learning a relationship between bounding boxes. The module scales the scores using an instance relationship similar to ours. However, this method uses the standard softmax loss and smooth L1 loss, while our IDNet tackles this problem by training a detector with the proposed novel losses.

3 Proposed Method

Figure 2: Pipeline of the instance-aware detection network (IDNet). The dashed box indicates the weights of the backbone, RPN, and RCN (θ_det). The other weights, those of RIN, are denoted by θ_RIN. Using the features extracted from RIN and the detection quality, a probability of each bounding box to be selected can be calculated. IDNet is trained with the proposed SS loss and ID loss, as well as the softmax and smooth L1 losses from Faster R-CNN [31]. The SS loss is used to suppress duplicated candidate boxes, and the ID loss is used to learn the similarity between candidate boxes. The smooth L1 loss is for regressing bounding boxes to exact locations of objects while classifying the objects in the boxes using the softmax loss.

An overview of the proposed IDNet is shown in Figure 2. IDNet is composed of a region proposal network (RPN), a region classification network (RCN), and a region identification network (RIN). Based on image feature maps from the backbone network, RPN predicts region proposals, i.e., regions of interest (RoIs). Then, a RoI pooling layer pools regional features from the feature maps for each RoI. Using the regional features, RCN classifies the regions into multiple categories while localizing the regions. RIN computes instance features of the candidates, which are used by DPPs. (RIN consists of seven convolutional layers and three fully connected layers; the detailed structure of RIN is described in the appendix.)

3.1 Determinantal Point Processes for Detection

Suppose that there are N candidate bounding boxes, b = {b_1, ..., b_N}, where b_i is the ith bounding box. A determinantal point process (DPP) defines a probability distribution over subsets of b as follows [19]. If Y is a DPP, then

\mathcal{P}(\mathbf{Y} = Y) = \frac{\det(L_Y)}{\det(L + I)}    (1)

where Y ⊆ b, the kernel matrix L is a real symmetric positive semi-definite matrix, the indexed kernel matrix L_Y is the submatrix of L indexed by the elements of Y, and I is an identity matrix. The kernel matrix can be decomposed as L = ΦΦ^T, where Φ is a feature matrix for the candidate bounding boxes. Each row of Φ is extracted from RIN and normalized to construct the matrix. Similar to the kernel matrix, the indexed kernel matrix can be decomposed as L_Y = Φ_Y Φ_Y^T.

Let q_i be the detection score for the ith bounding box. Then, q = [q_1, ..., q_N]^T is the detection quality for all detection candidates. The feature x_i for b_i is extracted from the RIN. Let φ_i = x_i / ||x_i||_2 be a normalized feature, so that ||φ_i||_2 = 1. Using the candidate bounding boxes, the intersection over union between b_i and b_j can be calculated by IoU(b_i, b_j) = |b_i ∩ b_j| / |b_i ∪ b_j|, where |A| is the number of pixels in A, and we construct a matrix S^IoU by setting S^IoU_ij = IoU(b_i, b_j). A similarity matrix S is constructed by combining the spatial and visual similarities, S_ij = μ S^IoU_ij + (1 − μ) φ_i^T φ_j, where μ controls the ratio between the two. Using the similarity matrix S and the detection quality q, the kernel matrix for a DPP [19] can be formed as L = qq^T ⊙ S, where ⊙ is the element-wise multiplication. (Notations in this paper are summarized in the appendix.)

If the similarity S and the detection quality q are correctly assigned, a subset which maximizes (1) is a collection of the most distinctive detections due to the property of the determinant in a DPP [19]. Since IDNet is trained to maximize the probability (1) of the ground-truth detections, IDNet learns the most distinctive features and correctly scaled detection scores, which are used to compute S and q, in order to separate different object instances.
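To make the construction concrete, the following numpy sketch builds such a kernel from detection scores, RIN-style features, and box coordinates, and evaluates the log-probability (1) of a chosen subset. The combination weight mu, the feature dimension, and the function interface are illustrative assumptions rather than the exact IDNet implementation.

    import numpy as np

    def dpp_log_prob(scores, feats, boxes, mu=0.6, subset=None):
        # scores: (N,) detection qualities q;  feats: (N, d) instance features;
        # boxes: (N, 4) corners [x1, y1, x2, y2].  Returns log P(Y = subset).
        phi = feats / np.linalg.norm(feats, axis=1, keepdims=True)   # normalized features
        vis = phi @ phi.T                                            # visual similarity

        # pairwise IoU as the spatial similarity
        x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
        y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
        x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
        y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (area[:, None] + area[None, :] - inter)

        S = mu * iou + (1.0 - mu) * vis          # combined similarity (assumed form)
        L = np.outer(scores, scores) * S         # quality/diversity kernel L = qq^T ⊙ S

        if subset is None:
            subset = np.arange(len(scores))
        _, logdet_sub = np.linalg.slogdet(L[np.ix_(subset, subset)])
        _, logdet_norm = np.linalg.slogdet(L + np.eye(len(scores)))
        return logdet_sub - logdet_norm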

3.2 Learning Detection Quality

As RCN classifies each RoI independently, multiple detections with different categories often have high detection scores. For example, a detector may report a horse near a dog as they are visually similar. Conventional post-processing methods, such as NMS, typically suppress duplicated bounding boxes within each class. While such heuristic post-processing algorithms are effective for removing duplicated bounding boxes in each category, they cannot remove duplicated boxes with different categories. In this case, even if there is a true bounding box for the dog, the horse bounding box cannot be removed. To alleviate this issue, we propose the sparse-score loss (SS loss) to detect an object with the correct class label by removing the other candidate boxes with incorrect categories.

We first select the k categories with the top detection scores among all categories for each RoI. We assume that these selected categories are visually similar to the correct category. By suppressing the scores of the visually similar categories except for the bounding boxes of the top-1 category, we can obtain a single bounding box with the correct category for an object. Let B be the set of all bounding boxes of the top-k categories from all RoIs and P ⊆ B be the set of positive boxes, i.e., the bounding boxes with the top-1 category in each RoI. Then, we define the SS loss as the negative log-likelihood of (1) as follows:

\mathcal{L}_{SS} = -\log \frac{\det(L_P)}{\det(L_B + I)}    (2)

where L_B is the kernel matrix constructed from the boxes in B and L_P is its submatrix indexed by P. This loss function increases the detection scores of bounding boxes in the positive set P. In other words, this loss suppresses the scores of all subsets which contain at least one non-positive bounding box. We would like to note that the normalization term for the DPP, det(L_B + I), is included for numerical stability during training.
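With a helper like the dpp_log_prob sketch above, the SS loss reduces to the negative DPP log-likelihood of the positive set; the restriction of the candidates to the top-k categories and the index bookkeeping are assumed to happen outside this function.

    import numpy as np

    def sparse_score_loss(scores, feats, boxes, positive_idx):
        # scores, feats, boxes describe the candidate set B (top-k categories per RoI);
        # positive_idx lists the top-1-category box of each RoI, i.e., the set P.
        return -dpp_log_prob(scores, feats, boxes, subset=np.asarray(positive_idx))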

We also use two softmax losses for classification (one for binary objectness classification in RPN and one for multi-class classification in RCN), and two smooth L1 losses for the RoI regression and candidate box regression [31]. Note that these losses are the same as in [31] since we adopt Faster R-CNN as a baseline. We call the sum of all the above losses the multi-task loss L_MT.

Suppose RPN predicts the objectness probability p_i and location shifts t_i, where i is the index of RoIs in a mini-batch. RCN predicts the category probabilities s_j and location shifts u_j, where j is the index of candidate boxes. The target location shifts for the ith RoI and the jth candidate box are t_i^* and u_j^*, respectively. Additionally, p_i^* and c_j^* are the ground-truth label for a RoI (object or background) and the ground-truth category for a candidate box, respectively. Then, the multi-task loss is expressed as follows:

\mathcal{L}_{MT} = \sum_{i}\Big[\mathcal{L}_{cls}(p_i, p_i^*) + p_i^*\,\mathcal{L}_{reg}(t_i, t_i^*)\Big] + \sum_{j}\Big[\mathcal{L}_{cls}(s_j, c_j^*) + \mathbb{1}[c_j^* \geq 1]\,\mathcal{L}_{reg}(u_j, u_j^*)\Big]    (3)

where \mathbb{1}[c_j^* \geq 1] is an indicator function, which outputs 1 when the jth candidate box has a non-background label.

With all losses defined as above, the weights of the backbone, RPN, and RCN, which are denoted by θ_det in Figure 2, can be learned by optimizing:

\theta_{det}^* = \arg\min_{\theta_{det}} \; \mathcal{L}_{MT} + \lambda\, \mathcal{L}_{SS}    (4)

where λ is used to balance the SS loss with the multi-task loss. The similarity matrix S is fixed while calculating the gradient of the SS loss, since θ_RIN is frozen while optimizing θ_det.

3.3 Learning Instance Differences

An instance-agnostic detector solely based on object category information often fails to detect objects in proximity. For accurate detections from real-world images with frequent overlapping objects, it is crucial to distinguish different object instances. To address this problem, we propose the instance-aware detection loss (ID loss). The objective of this loss function is to obtain similar features from the same instance and different features from different instances. This is done by maximizing the probability of a subset of the most distinctive bounding boxes.

Let D be the set of all candidate bounding boxes which intersect with the ground truth bounding boxes. Let D* ⊆ D be the set of the most representative boxes, i.e., the candidate boxes which are closest to the ground truth boxes, obtained by the Hungarian algorithm [18]. Then, the ID loss for all objects is defined as follows:

\mathcal{L}_{ID}^{all} = -\log \frac{\det(L_{D^*})}{\det(L_D + I)}    (5)

Due to the determinant, it increases the cosine distance between φ_i and φ_j if b_i and b_j are from different instances. As we select boxes near the ground truth bounding boxes to construct D*, the network can learn which bounding boxes are similar or different.

In addition to (5), we set an objective which focuses on differentiating instances from the same category. For category c, D_c is the set of candidate boxes in the cth category and D_c* ⊆ D_c is the set of candidate boxes which are closest to the ground truth boxes of that category; D_c* is also obtained by the Hungarian algorithm [18]. The category-specific ID loss is defined as follows:

\mathcal{L}_{ID}^{cat} = -\sum_{c=1}^{C} \log \frac{\det(L_{D_c^*})}{\det(L_{D_c} + I)}    (6)

It provides an additional guidance signal to train the network since it is more difficult to distinguish similar instances from the same category than instances from different categories. We find an improvement when we use both L_ID^all and L_ID^cat, compared to cases when only one of them is used. Finally, the ID loss is defined as:

\mathcal{L}_{ID} = \mathcal{L}_{ID}^{all} + \mathcal{L}_{ID}^{cat}    (7)
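The sketch below illustrates one way the representative set could be formed and the loss (5) evaluated, again reusing the hypothetical dpp_log_prob helper; matching candidates to ground-truth boxes via SciPy's Hungarian solver is an assumption about a reasonable implementation, not the paper's exact code.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def id_loss_all(scores, feats, cand_boxes, gt_boxes, pairwise_iou):
        # pairwise_iou(a, b) -> (len(a), len(b)) IoU matrix (e.g., reuse the IoU code above)
        iou = pairwise_iou(cand_boxes, gt_boxes)
        keep = np.where(iou.max(axis=1) > 0)[0]      # D: candidates intersecting some GT box
        # Hungarian matching: assign each GT box its closest candidate (maximize total IoU)
        rows, _ = linear_sum_assignment(-iou[keep])
        # rows index into the restricted set D, so they directly identify D* inside D
        return -dpp_log_prob(scores[keep], feats[keep], cand_boxes[keep], subset=rows)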

The goal of the ID loss is to find all instances while discriminating different instances, as shown in Figure 1. While the ID loss aims to distinguish instances, the multi-task loss tries to classify categories. The difference between their goals makes a network perform worse when both losses are used simultaneously. To alleviate this problem, we train the weights of RIN (θ_RIN in Figure 2) separately from θ_det. Given a set of candidate bounding boxes and subsets of them, the weights of RIN can be learned by optimizing (the gradients of the SS and ID losses are derived in the appendix):

\theta_{RIN}^* = \arg\min_{\theta_{RIN}} \; \mathcal{L}_{ID}    (8)

Note that while calculating the gradient of the ID loss, the detection quality (q) is fixed, as θ_det is frozen while optimizing θ_RIN.

3.4 Inference

Given a set of candidate bounding boxes, the similarity matrix S, and the detection quality q, Algorithm 1 (IDPP) finds the most representative subset of bounding boxes. The problem of finding a subset that maximizes the probability (1) is NP-hard [19]. Fortunately, due to the log-submodular property of DPPs [19], we can approximately solve the problem using a greedy algorithm, such as Algorithm 1, which iteratively adds the index of a detection candidate until doing so can no longer make the cost of the new subset higher than that of the current subset [3], where the cost of a set Y is log det(L_Y) (with the cost of the empty set defined as 0).

1:  Y ← ∅, U ← {1, …, N}
2:  while U ≠ ∅ do
3:     i* ← argmax_{i ∈ U} log det(L_{Y ∪ {i}})
4:     Δ ← log det(L_{Y ∪ {i*}}) − log det(L_Y)
5:     if Δ > 0 then
6:        Y ← Y ∪ {i*}
7:        delete i* from U
8:     else
9:        return Y
10:    end if
11: end while
12: return Y
Algorithm 1 Instance-Aware DPP Inference (IDPP).
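A compact numpy version of this greedy selection is sketched below, assuming the kernel L has already been built as in Section 3.1; for example, keep = idpp_greedy(L) returns the indices of the retained detections.

    import numpy as np

    def idpp_greedy(L):
        # Greedy inference for a DPP kernel L (Algorithm 1 sketch): repeatedly add the
        # candidate with the largest log det(L_Y); stop when no positive gain remains.
        n = L.shape[0]
        selected, remaining = [], set(range(n))
        current = 0.0                      # log det of the empty set is taken to be 0

        def logdet(idx):
            sign, val = np.linalg.slogdet(L[np.ix_(idx, idx)])
            return val if sign > 0 else -np.inf

        while remaining:
            gains = {i: logdet(selected + [i]) for i in remaining}
            best = max(gains, key=gains.get)
            if gains[best] - current > 0:
                selected.append(best)
                remaining.remove(best)
                current = gains[best]
            else:
                break
        return selected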

4 Experiments

Datasets and baseline methods.

We comprehensively evaluated IDNet on PASCAL VOC [8] and MS COCO 2014 [23], which include 20 and 80 categories, respectively.

To demonstrate that our IDNet is effective for detecting overlapped objects, we have constructed the VOC crowd set from the VOC 2007 test set and the COCO crowd set from the COCO val set. Every image in the crowd sets contains at least one overlapped object. Unless otherwise specified, we define overlapped objects as those that overlap with another object by more than 0.3 IoU in all experiments. The VOC 2007 crowd set contains 283 images, and the COCO crowd set consists of 5,471 images. The indices of the crowd sets will be made publicly available.

Since the goal of our algorithm is to discriminate instances with given candidate bounding boxes, we adopt Faster R-CNN as a proposal network to get candidate detections, but other proposal networks can be used in our framework. We implement baseline methods, Faster R-CNN [31] and LDDP [3], to compare with our algorithm. Since there are few methods tested on the crowd sets, we choose the two baselines for fair comparison. Note that our baseline implementation achieves a reasonable performance of 71.4% mAP when trained with VOC 2007 using VGG-16 as a backbone, considering that the performance in the original paper [31] is 69.9% mAP.

We use different inference algorithms for each method. Unless otherwise stated, Faster R-CNN uses NMS, LDDP uses LDPP, and IDNet uses IDPP as an inference algorithm. LDPP is an inference algorithm proposed in LDDP [3], which uses a fixed class-wise similarity matrix while our IDPP uses the instance-aware features extracted from RIN.

Implementation details.

All baseline methods and our IDNet are implemented based on the Faster R-CNN in TensorFlow [1], where most hyperparameters, such as the learning rate, optimizer, data augmentation strategy, and batch size, are the same as in the original paper [31]. In our method, we use backbone networks, e.g., VGG-16 and ResNet-101, pre-trained on ImageNet [7], and the RIN module is initialized with Xavier initialization [12]. The RIN shares parameters with the backbone, namely the layers up to conv2 of VGG-16 [32] and conv1 of ResNet-101 [13], to conserve memory. We set k to five for VOC and ten for COCO, since VOC has around five categories per super-category and COCO has around ten categories per super-category on average. We set the ratio between the spatial similarity and the visual similarity (μ) to 0.6, which is similar to the values used in [37, 20]. Since the performance of the detector is poor during the early stage of training, the top-k bounding boxes do not contain similar categories. Thus, we set λ to zero during the early stage of training and increase it to 0.01 afterwards. The early stage is chosen to be around 60% of the total training iterations. We use 40k iterations for VOC 2007, 70k for VOC 0712, and 360k for COCO. Additionally, we set the size of the RIN feature to 256 as it performs the best. More implementation details can be found in the appendix.

Evaluation metrics.

For evaluation, we use the mean average precision (mAP). For VOC, we report mAP, which considers detection candidates with IoU over 0.5 as correct objects. For COCO, we evaluate performance with three types of mAPs in the standard MS COCO [23] protocol: AP, AP50, and AP75. AP reports the average value of mAP at ten IoU thresholds from .5 to .95, AP50 reports mAP at IoU 0.5, and AP75 reports mAP at IoU 0.75. A high score in AP75 requires better localization of detection boxes.

Figure 3: Recall curves of Faster R-CNN, LDDP, and IDNet on VOC 2007. The results are evaluated at different overlap IoU thresholds, from .0 to .4. Our proposed IDNet has a higher crowd recall and effectively detects objects with high overlaps.

4.1 Pascal Voc

For VOC 2007, we train a network with VOC 2007 trainval, which contains 5k images. For VOC 0712, we train a network with VOC 0712 trainval set, which includes 16k images. All methods are tested on VOC 2007 test set, which has 5k images. After training IDNet with the SS loss and the multi-task loss, we train RIN to learn differences of instances with the ID loss for 30k iterations for VOC 2007, and 20k iterations for VOC 0712. While training RIN, the parameters in other modules except RIN are frozen. A VGG-16 backbone is used for all tested methods for PASCAL VOC.

Since IDNet is effective for overlapped objects, we report recall, which is calculated as the ratio of detected objects among the overlapped objects (Figure 3). For calculating recall, we check whether each object that overlaps with another object above a fixed IoU threshold is detected. After calculating the probability of detecting overlapped objects in each category, the results are averaged over the categories. This recall is a better performance measure than mAP for showing the robustness to overlap, because the recall is calculated only for overlapped objects, while the mAP is calculated for all objects in an image containing at least a single overlapped object.
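A rough sketch of this crowd-recall computation is given below; the per-image data layout, the matching rule (IoU of at least 0.5 with a detection of the same category), and the pairwise_iou helper are assumptions for illustration.

    import numpy as np

    def crowd_recall(images, overlap_thr, pairwise_iou, match_iou=0.5):
        # images: list of dicts with numpy arrays 'gt_boxes', 'gt_labels',
        #         'det_boxes', 'det_labels' for each image
        hits, totals = {}, {}
        for im in images:
            gts, glab = im['gt_boxes'], im['gt_labels']
            dets, dlab = im['det_boxes'], im['det_labels']
            gt_iou = pairwise_iou(gts, gts)
            np.fill_diagonal(gt_iou, 0.0)
            crowded = np.where(gt_iou.max(axis=1) > overlap_thr)[0]   # overlapped GT objects
            for g in crowded:
                c = int(glab[g])
                totals[c] = totals.get(c, 0) + 1
                same = dets[dlab == c]
                if len(same) and pairwise_iou(gts[g:g + 1], same).max() >= match_iou:
                    hits[c] = hits.get(c, 0) + 1
        # per-category recall over crowded objects, averaged over categories
        return np.mean([hits.get(c, 0) / totals[c] for c in totals])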

In Figure 3, recall for the objects with overlap over 0.4 is increased from 0.58 (Faster R-CNN) to 0.71 (IDNet), which is an impressive improvement. For all overlap regions, recall is higher than baseline methods and as the overlap ratio gets higher, the performance gap between Faster R-CNN and IDNet gets bigger. The results show that IDNet is effective for detecting objects in proximity.

 

Method Inference Train mAP mAP (crowd)

Fast R-CNN [11] NMS 07 66.9 -
SSD300 [24] NMS 07 68.0 -
Faster R-CNN [31] NMS 07 71.4 56.0
LDDP [3] LDPP 07 70.9 57.7
IDNet IDPP 07 71.9 61.8
Fast R-CNN [11] NMS 07+12 70.0 -
SSD300 [24] NMS 07+12 74.3 -
Faster R-CNN [31] NMS 07+12 75.8 62.0
LDDP [3] LDPP 07+12 76.4 63.1
IDNet IDPP 07+12 76.6 64.5

Table 1: Detection results on VOC 2007 test set (mAP) and VOC crowd set (mAP (crowd)). Legend: 07: VOC 2007 trainval set, 07+12: VOC 0712 trainval set. All methods are trained using a VGG-16 backbone network.

 

Method Inference Backbone AP AP (crowd) AP50 AP50 (crowd) AP75 AP75 (crowd)

Faster R-CNN [31] NMS VGG-16 26.2 19.2 46.6 36.9 26.9 18.4
LDDP [3] LDPP VGG-16 26.4 19.6 46.7 37.9 26.8 18.6
IDNet IDPP VGG-16 27.3 20.5 47.6 38.2 28.2 20.0
Faster R-CNN [31] NMS ResNet-101 31.5 23.5 52.0 42.5 33.5 23.0
LDDP [3] LDPP ResNet-101 31.4 23.8 51.7 43.0 33.4 23.4
IDNet IDPP ResNet-101 32.7 24.4 53.1 43.4 34.8 24.4

Table 2: Detection results on COCO val set and COCO crowd set; the (crowd) columns report results on the crowd set. All methods are trained with COCO train set.

To demonstrate that our IDNet is effective for detecting overlapped objects under the standard mAP metric, we tested Faster R-CNN, LDDP, and our IDNet (here, a version of IDNet using only the ID loss) on the VOC crowd set (the mAP (crowd) column in Table 1). IDNet shows impressive improvements compared to Faster R-CNN, with an improvement of 5.8% mAP for VOC 2007 and 2.5% for VOC 0712. We also observe improvements over LDDP: 4.1% improvement in mAP for VOC 2007 and 1.4% improvement for VOC 0712. Next, when we evaluated mAP on the full VOC 2007 test set, the mAP is also increased compared with the baseline methods for both VOC 2007 and VOC 0712 (Table 1).

4.2 Ms Coco

MS COCO is composed of 80k images in the train set and 40k images in the val set. After training a network with the SS loss and the multi-task loss, we train the RIN module with the ID loss for 20k additional iterations.

In Table 2, we report the results using multiple APs for COCO. With respect to the crowd test set, Table 2 shows that the performance is improved from 19.2% to 20.5% AP for VGG-16. Since the larger number of categories in COCO makes distinguishing instances harder, the improvement is smaller than that on the VOC crowd set. To demonstrate the general effectiveness of our IDNet, we also provide the results when the backbone network is replaced by ResNet-101. The performance of IDNet is improved from 23.5% AP to 24.4% AP on the ResNet-101 backbone, compared with Faster R-CNN, which shows the effectiveness of our IDNet on a stronger backbone. We also observe that the improvement in AP75 is bigger than the improvement in AP50, which means that IDNet with the IDPP inference algorithm improves localization accuracy.

For all COCO val images, the performance is improved by 1.1% AP for the VGG-16 backbone and 1.2% AP for the ResNet-101 backbone (Table 2). We attribute these improvements to the fact that there are many similar categories in COCO, which has eight categories for each of 11 super-categories on average. Since a number of duplicated candidate boxes can be generated, our SS loss can remove duplicated bounding boxes to increase the final detection performance.

To verify that the SS loss contributes to these improvements, we extract candidate boxes having detection scores over a fixed threshold (0.01) in Figure 4. When a predicted box overlaps with the ground truth box by 0.5 IoU or more, we consider it a correct box. We divide the number of correct boxes by the number of bounding boxes to check how many boxes are correctly classified. Figure 4 shows that IDNet achieves superior performance on this measure for all categories compared to the other methods. On average, IDNet achieves 43.7% while Faster R-CNN has 32.4% and LDDP has 32.9% for COCO. The results indicate that the SS loss can successfully remove incorrectly classified bounding boxes.

Figure 4: Probability of finding correct bounding boxes after training IDNet with SS loss. For the evaluation, the IDNet is trained with COCO train set, and tested with COCO val set. The categories are sampled for the best view.

 

Inference  SS  ID  AP    AP (crowd)

NMS        -   -   26.2  19.2
NMS        ✓   -   27.0  19.6
IDPP       -   ✓   26.5  19.7
IDPP       ✓   ✓   27.3  20.5

Table 3: Ablation study on COCO. All results are from IDNet using VGG-16 as a backbone.
(a) A person is not detected.
(b) A person is detected.
(c) A sheep is mistakenly detected.
(d) A sheep is removed.
Figure 5: Qualitative detection results of Faster R-CNN vs. IDNet. (a), (c) are results of Faster R-CNN and (b), (d) are results of IDNet. In (b), IDNet detects a person, which is not detected by Faster R-CNN in (a). In (d), IDNet successfully suppresses an incorrect label, sheep, while Faster R-CNN reports a sheep in (c).
Figure 6: Scores of candidate boxes after training with each method. The leftmost column shows the ground truth boxes, and the other columns show the results of Faster R-CNN, LDDP, and IDNet from left to right. For each method, candidate boxes with scores over 0.1 and the maximum score of each category are visualized on each image. All methods are trained on COCO train set using VGG-16 as a backbone. Here, the IDNet only utilizes the SS loss.

Inference time.

We measure the average inference time per image using VGG-16 as a backbone network on minival set of COCO, which is a subset of 5k samples from the val set. All running times are measured on a machine with Intel Core 3.7GHz CPU and Titan X GPU.

Our algorithm takes 2.14 seconds to find candidate boxes and extract their features, and 0.33 seconds to select bounding boxes using IDPP. Since Faster R-CNN takes 1.61 seconds and LDDP [3] takes 1.55 seconds, an extra 0.86 seconds is needed for detecting objects in an image compared with Faster R-CNN, and 0.92 seconds compared with LDDP. Although our algorithm takes more time for inference, it can be used in problems which require exact detections in a crowd.

4.3 Ablation Study

We analyze the influence of the ID loss and SS loss in Table 3, where the IDNet is trained with COCO train set using VGG-16 as a backbone. In ablation studies, we check our IDNet with two post-processing methods: NMS and IDPP. In the first two rows in Table 3, we use NMS for the experiments that do not use the ID loss, since IDPP uses the trained features with the ID loss. In the last two rows of Table 3, we use IDPP with a trained RIN module.

Instance-aware detection loss.

The ID loss is designed to be effective for detecting objects in a crowded scene. In the third row of Table 3, the performance is improved from 19.2% to 19.7% AP on the crowd set. Comparing the second row and the last row, the performance is improved by 0.9% AP. In Figure 5(a), a person is not detected by Faster R-CNN, while our IDNet detects the person in Figure 5(b) since IDNet learns to discriminate different objects. This result indicates that the ID loss is effective for detecting objects in proximity.

Sparse-score loss.

Since the SS loss is designed to remove incorrectly classified bounding boxes, the SS loss is effective for all testing images. Thus, we focus on the AP column over all val images in Table 3. The results show that as the SS loss is used, the performance is improved by 0.8% AP.

In Figure 5(c), a sheep is erroneously detected for a cow, while our IDNet removes this erroneous sheep detection in Figure 5(d), as IDNet learns to remove incorrectly classified bounding boxes. This shows that the SS loss can alleviate the duplicated bounding box problem in a detector.

Since Figure 5 only shows the final detections, we visualize images with candidate boxes in Figure 6 to show the changes in detection scores. The score threshold is fixed to 0.1 and the highest score in each category is written in each image.

We first compare the result with Faster R-CNN. Since Faster R-CNN does not have any loss to decrease the scores of incorrect categories, the highest score of a horse in Faster R-CNN is 0.546 while the score in IDNet is 0.158 (see the first row of Figure 6). For images in the second row of Figure 6, the maximum score of an incorrect category, remote, is 0.476 in Faster R-CNN, while the maximum score of a remote is under the threshold (0.1) in IDNet.

We also compare the result with LDDP [3]. The LDDP loss [3] is defined to increase the score of a single subset using a category-level relationship, while our SS loss is defined to decrease scores of all possible subsets containing incorrect candidate boxes using an instance-level relationship between candidate boxes. Thus, after softmax is applied to scores, the SS loss can better suppress the detection scores of bounding boxes with incorrect categories. For example, as shown in the third and last columns of Figure 6, given a cow image, the detection score for a horse is decreased from 0.673 (LDDP) to 0.158 (IDNet). It shows that the SS loss can successfully suppress scores of duplicated bounding boxes around a correct bounding box as expected.

5 Conclusion

We propose IDNet, which tackles two challenges in object detection by introducing two novel losses. First, we propose the ID loss for detecting overlapped objects. Second, the SS loss is introduced to suppress erroneous detections of wrong categories. By introducing these two losses using DPPs, we demonstrate that learning instance-level relationships is useful for accurate detection. IDNet performs favorably on the overall test sets and achieves significant improvements on the crowd sets. Additionally, the ablation studies show that IDNet learns to suppress erroneous detections of wrong categories. While the inference time is moderately slower than other detection methods, our algorithm is useful for real-world situations which require separating objects in proximity.

Appendix

 

Notation Definition Description

RoIs - Region of interest boxes which are proposed from RPN.
b - Candidate bounding boxes which are proposed from RCN.
IoU(b_i, b_j) |b_i ∩ b_j| / |b_i ∪ b_j| Intersection over union (IoU) of two bounding boxes.
q - Detection scores corresponding to the candidate bounding boxes.
φ_i x_i / ||x_i||_2 Normalized feature of a bounding box b_i.
S_ij μ IoU(b_i, b_j) + (1 − μ) φ_i^T φ_j Similarity between box b_i and b_j.
L qq^T ⊙ S Kernel matrix of DPPs.

Table 4: Notations in this paper.

Appendix A Notations

In Table 4, notations used in this paper are described.

Appendix B Gradient of losses

In this section, we derive the gradients of the proposed instance-aware detection loss (ID loss) and sparse-score loss (SS loss). For notational convenience, for any matrix A and index set Y, we write [A]_Y for the matrix that has the same dimension as A, whose entries corresponding to Y are copied from A and whose remaining entries are filled with zero.

B.1 Gradient of Instance-Aware Detection Loss

Here, we show the gradient with respect to the normalized feature (φ_i). As the derivative of the log-determinant is ∂ log det(X)/∂X = X^{-1} for a symmetric positive definite matrix X, the derivative of the intra-class ID loss is as follows:

(9)

where ⟨·,·⟩_F is the Frobenius inner product, ⊙ is the element-wise multiplication, and c is the category index. Note that C is the number of categories. We only calculate the gradient of the ID loss with respect to the similarity feature (φ_i), where ||φ_i||_2 = 1. Since the detection quality is a constant here, the derivative of the kernel matrix is as follows:

(10)

Note that q is fixed while deriving the gradient of the ID loss. Using the property that ⟨A, B ⊙ C⟩_F = ⟨A ⊙ B, C⟩_F for arbitrary matrices A, B, and C of the same size, we can derive this:

(11)

Writing the matrices element-wise,

(12)

Since the gradient of L_ID^all is similar to that of L_ID^cat, we omit its derivation. Then, we can construct the gradient of the ID loss by summing up (12) over all batches and categories:

(13)

B.2 Gradient of Sparse-Score Loss

The derivation of the gradient of the SS loss is similar to that of the instance-aware detection loss, except that the gradient of the SS loss is taken with respect to the quality (q). Note that the similarity matrix S is fixed while deriving the gradient of the SS loss. The derivative of the SS loss is as follows:

(14)

Similar to the derivation of ID loss, by using the following properties,

(15)

we can derive this:

(16)

Thus, the final derivative of SS loss is as follows:

(17)
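As a numerical illustration of this derivation, the sketch below evaluates the SS loss under the kernel construction assumed in Section 3.1 and compares an analytic gradient with respect to q against finite differences; the closed form used here is a re-derivation under those assumptions and is only meant as a sanity check.

    import numpy as np

    def ss_loss_and_grad(q, S, pos):
        # f(q) = -log det(L_P) + log det(L_B + I) with L = qq^T ⊙ S (S fixed),
        # and its gradient w.r.t. q obtained from the log-determinant derivative.
        n = len(q)
        L = np.outer(q, q) * S
        Lp = L[np.ix_(pos, pos)]
        loss = -np.linalg.slogdet(Lp)[1] + np.linalg.slogdet(L + np.eye(n))[1]
        G = np.linalg.inv(L + np.eye(n))
        G[np.ix_(pos, pos)] -= np.linalg.inv(Lp)       # zero-padded subtraction of L_P^{-1}
        return loss, 2.0 * (G * S) @ q

    rng = np.random.default_rng(0)
    n, d = 6, 4
    phi = rng.normal(size=(n, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    S = 0.5 * np.eye(n) + 0.5 * (phi @ phi.T)          # a symmetric positive definite similarity
    q = rng.uniform(0.1, 1.0, size=n)
    pos = [0, 2, 4]

    loss, grad = ss_loss_and_grad(q, S, pos)
    eps = 1e-6
    numeric = np.array([(ss_loss_and_grad(q + eps * np.eye(n)[k], S, pos)[0] - loss) / eps
                        for k in range(n)])
    print(np.max(np.abs(numeric - grad)))              # should be close to zero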

Appendix C Network Architecture

As shown in Table 5, the RIN consists of seven convolutional layers, three fully connected layers, three max-pooling layers, and one crop and resize layer. Since RIN utilizes parameters of a backbone network, the size of the input channel is chosen according to the backbone network, e.g., 64 for VGG-16 and ResNet-101. The parameters shown are used for training with VOC, while different values are used for COCO. Each convolutional and fully connected layer except the last one is followed by batch normalization [15] and a rectified linear unit (ReLU), in that order. We set all convolutional layers to have filters with a size of 3 × 3 pixels and a stride of one.

 

Layer Type Parameter Remark

0 Convolution 3 × 3 stride 1
1 Convolution 3 × 3 stride 1
2 Convolution 3 × 3 stride 1
3 Convolution 3 × 3 stride 1
4 Max pooling - size 2 × 2, stride 2
5 Convolution 3 × 3 stride 1
6 Convolution 3 × 3 stride 1
7 Convolution 3 × 3 stride 1
8 Crop and resize - size 15 × 15
9 Fully connected (15 × 15 × #channels) × 1000 -
10 Fully connected 1000 × 1000 -
11 Fully connected 1000 × 256 -

Table 5: RIN architecture.
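A rough TensorFlow sketch of this layout is shown below; the channel width, the input format, and the use of Keras layers are assumptions for illustration, and only the layer ordering follows Table 5.

    import tensorflow as tf

    def rin_features(backbone_feat, boxes, box_idx, width=64, feat_dim=256):
        # backbone_feat: (B, H, W, C) shared feature map (e.g., VGG-16 conv2 output)
        # boxes: (R, 4) normalized RoIs [y1, x1, y2, x2];  box_idx: (R,) image index per RoI
        def conv_bn_relu(t):
            t = tf.keras.layers.Conv2D(width, 3, padding='same')(t)    # 3x3, stride 1
            return tf.nn.relu(tf.keras.layers.BatchNormalization()(t))

        x = backbone_feat
        for _ in range(4):                                  # layers 0-3
            x = conv_bn_relu(x)
        x = tf.keras.layers.MaxPool2D(2, 2)(x)              # layer 4: 2x2 max pooling
        for _ in range(3):                                  # layers 5-7
            x = conv_bn_relu(x)
        rois = tf.image.crop_and_resize(x, boxes, box_idx, crop_size=[15, 15])   # layer 8
        h = tf.keras.layers.Flatten()(rois)
        for _ in range(2):                                  # layers 9-10
            h = tf.nn.relu(tf.keras.layers.BatchNormalization()(tf.keras.layers.Dense(1000)(h)))
        return tf.keras.layers.Dense(feat_dim)(h)           # layer 11: 256-d instance feature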

Appendix D More Experimental Results

In this section, we provide full results on the PASCAL VOC and MS COCO datasets. The results on all test images are in Table 6 and Table 8, while Table 7 and Table 9 show the results on the crowd sets.

Additionally, Figure 7 shows the impact of the SS loss on the VOC dataset and Figure 8 shows the recall curves for the COCO dataset.

Appendix E Example Visualization

We visualize qualitative results of IDNet on VOC 2007 and MS COCO. For comparison, we also visualize the ground truth bounding boxes in each image, and the results of Faster R-CNN and LDDP. For Faster R-CNN and LDDP, only bounding boxes with scores over a threshold of 0.6 are visualized, following the threshold designated in their paper [3]. For IDNet, we use 0.2 as the score threshold.

 

Method Inference Train mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

 

Faster R-CNN [31] NMS 07 71.4 70.4 78.2 69.7 58.9 56.9 79.5 83.0 84.3 53.3 78.6 64.5 81.7 83.7 76.1 77.9 45.4 70.5 66.7 74.3 73.3
LDDP [3] LDPP 07 70.9 67.7 79.2 68.2 57.9 53.9 75.2 7979 84.8 53.7 79.2 67.5 80.9 84.0 75.7 78.0 44.7 73.3 66.7 73.8 73.1
IDNet IDPP 07 71.9 71.3 79.1 70.8 57.9 53.1 77.5 84.2 85.8 53.0 80.4 69.1 80.7 84.3 75.8 79.6 44.0 75.1 66.8 76.5 73.2
Faster R-CNN [31] NMS 07+12 75.8 77.2 84.1 74.8 67.3 65.5 82.0 87.4 87.9 58.7 81.5 69.8 85.0 85.1 77.7 79.2 47.2 75.4 71.8 82.3 75.8
LDDP [3] LDPP 07+12 76.4 76.9 83.0 75.0 66.5 64.3 83.4 87.5 87.7 61.2 81.5 70.0 86.0 84.9 81.9 83.3 48.6 75.7 72.3 82.6 76.5
IDNet IDPP 07+12 76.6 78.8 82.8 75.9 66.3 66.6 82.9 88.1 87.2 59.6 82.4 70.6 85.1 85.7 80.7 82.6 50.0 78.3 70.9 82.8 75.5

 

Table 6: Detection results on VOC 2007 test set. Legend: 07: VOC 2007 trainval set, 07+12: VOC 0712 trainval set. All methods are trained with the multi-task loss, using a VGG-16 backbone network.

 

Method Inference Train mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

 

Faster R-CNN [31] NMS 07 56.0 45.5 56.0 44.2 42.0 57.4 54.5 70.3 37.4 47.2 67.8 65.4 56.4 63.0 61.4 67.8 30.0 66.4 53.4 63.6 70.6
LDDP [3] LDPP 07 57.7 38.2 61.4 47.9 37.7 54.3 54.5 74.6 48.1 49.5 76.1 70.3 60.3 63.3 60.3 73.7 31.4 70.5 52.3 63.6 66.3
IDNet IDPP 07 61.8 65.5 59.6 56.2 49.8 60.2 61.2 76.0 38.7 50.1 67.4 65.5 68.0 67.8 64.2 74.0 35.8 75.6 50.6 81.8 68.6
Faster R-CNN [31] NMS 07+12 62.0 100.0 59.4 60.1 28.5 61.3 53.2 72.0 51.4 51.9 67.0 67.0 55.1 76.9 71.4 69.4 32.6 67.5 61.1 63.6 70.2
LDDP [3] LDPP 07+12 63.1 78.5 64.6 55.6 34.8 60.3 52.1 76.9 55.4 56.7 72.8 69.0 69.0 73.2 69.3 76.3 41.4 73.4 48.2 63.6 70.5
IDNet IDPP 07+12 64.5 88.3 68.8 59.8 31.9 64.1 61.7 79.0 48.7 54.4 72.3 66.5 64.2 77.7 71.7 75.6 37.7 77.0 57.5 63.6 70.0

 

Table 7: Detection results on VOC 2007 crowd set. Legend: 07: VOC 2007 trainval set, 07+12: VOC 0712 trainval set. All methods are trained with the multi-task loss, using a VGG-16 backbone network.

 

Method Inference Backbone AP AP50 AP75 APS APM APL AR1 AR10 AR100 ARS ARM ARL

 

Faster R-CNN [31] NMS VGG-16 26.2 46.6 26.9 10.3 29.3 36.4 25.5 38.1 39.0 17.9 44.0 55.7
LDDP [3] LDPP VGG-16 26.4 46.7 26.8 10.5 29.4 36.8 25.0 37.4 38.4 16.0 43.1 55.3
IDNet IDPP VGG-16 27.3 47.6 28.2 10.9 30.1 38.0 25.9 39.4 40.6 18.6 45.1 58.9
Faster R-CNN [31] NMS ResNet-101 31.5 52.0 33.5 12.5 35.2 45.9 29.2 43.2 44.2 20.6 49.9 63.8
LDDP [3] LDPP ResNet-101 31.4 51.7 33.4 12.3 35.3 46.0 28.5 41.9 42.9 18.2 48.2 63.4
IDNet IDPP ResNet-101 32.7 53.1 34.8 13.1 36.4 47.6 29.5 44.3 45.6 21.2 51.2 65.8

 

Table 8: Detection results on MS COCO val set. All methods are trained on MS COCO train set with the multi-task loss.

 

Method Inference Backbone AP AP50 AP75 APS APM APL AR1 AR10 AR100 ARS ARM ARL

 

Faster R-CNN [31] NMS VGG-16 19.2 36.9 18.4 8.5 24.3 31.0 17.0 28.6 29.6 13.4 36.4 47.8
LDDP [3] LDPP VGG-16 19.6 37.9 18.6 8.9 24.6 31.6 16.6 28.4 29.6 12.9 36.4 47.7
IDNet IDPP VGG-16 20.5 38.2 20.0 9.1 25.7 33.0 17.0 30.9 33.2 14.4 39.2 56.0
Faster R-CNN [31] NMS ResNet-101 23.5 42.5 23.0 10.4 29.6 38.5 19.3 32.8 34.0 16.1 41.7 54.6
LDDP [3] LDPP ResNet-101 23.8 43.0 23.4 10.5 30.0 39.4 19.2 32.4 33.7 15.0 41.4 55.2
IDNet IDPP ResNet-101 24.4 43.4 24.4 10.9 30.6 40.0 19.6 33.7 34.8 16.5 42.4 56.4

 

Table 9: Detection results on MS COCO crowd set. All methods are trained on MS COCO train set with the multi-task loss.

Failure cases analysis.

The top image of Figure 9 shows that the detector detected a bounding box of the wrong category for avocados. This means that the detector has found classes similar to avocado, such as banana and apple, because there is no avocado category in the dataset. This case suggests that scores need to be suppressed further when an object does not belong to any detection class, i.e., when it should be treated as background. In the bottom image of Figure 9, a giraffe is hidden behind two trees. When an object is occluded in this way, detectors tend not to notice that it is a single object and choose several bounding boxes for it. Since IDPP tries to find the most representative bounding boxes, it would select all of the created bounding boxes, which increases the number of false detections.

Figure 7: Probability of finding correct bounding boxes after training IDNet with SS loss. For the evaluation, the IDNet is trained with VOC. The categories are sampled for the best view.
Figure 8: Recall curves of Faster R-CNN, LDDP, and IDNet on COCO. The results are evaluated at different overlap IoU thresholds, from .0 to .4. Our proposed IDNet has a higher crowd recall and effectively detects objects with high overlaps.
Figure 9: Failure cases of IDNet. Top: A detector finds an incorrect category; Bottom: A detector cannot distinguish a completely occluded object. The class labels are arranged for the best view.
Figure 10: Visualization results on PASCAL VOC 2007 test set. The leftmost column shows the ground truth boxes, and the other columns show the results of Faster R-CNN, LDDP, and IDNet from left to right. For each method, final boxes with scores over 0.6 are visualized on each image. All methods are trained on VOC 2007 trainval set using VGG-16 as a backbone.
Figure 11: Visualization results on COCO val set. The leftmost column shows the ground truth boxes, and the other columns show the results of Faster R-CNN, LDDP, and IDNet from left to right. For each method, final boxes with scores over 0.6 are visualized on each image. All methods are trained on COCO train set using VGG-16 as a backbone.

Successful cases.

Successful cases of IDNet are visualized in Figure 10 for VOC 2007 and in Figure 11 for MS COCO. In Figure 10, the first and last rows show that bounding boxes with incorrect categories are suppressed while the correct class is selected. The other rows show that objects in proximity are detected by IDNet while the other methods fail, i.e., overlapped objects are successfully detected. In Figure 11, all results show that IDNet can detect overlapped objects.

Overall, the results show that the proposed IDNet can detect overlapped objects better than the other algorithms while suppressing bounding boxes with incorrect categories.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI).
  • [2] M. Andriluka, S. Roth, and B. Schiele (2008) People-Tracking-by-Detection and People-Detection-by-Tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [3] S. Azadi, J. Feng, and T. Darrell (2017) Learning Detection with Diverse Proposals. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [4] W. Chao, B. Gong, K. Grauman, and F. Sha (2015) Large-Margin Determinantal Point Processes. In Conference on Uncertainty in Artificial Intelligence (UAI).
  • [5] J. Dai, K. He, and J. Sun (2016) Instance-aware Semantic Segmentation via Multi-task Network Cascades. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [6] J. Dai, Y. Li, K. He, and J. Sun (2016) R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Neural Information Processing Systems (NIPS).
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision (IJCV) 88 (2), pp. 303–338.
  • [9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32 (9), pp. 1627–1645.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [11] R. Girshick (2015) Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV).
  • [12] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [14] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei (2018) Relation Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [15] S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML).
  • [16] T. Kong, A. Yao, Y. Chen, and F. Sun (2016) HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [17] A. Krause, A. Singh, and C. Guestrin (2008) Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research (JMLR) 9, pp. 235–284.
  • [18] H. W. Kuhn (1955) The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97.
  • [19] A. Kulesza and B. Taskar (2012) Determinantal Point Processes for Machine Learning. arXiv preprint arXiv:1207.6083.
  • [20] D. Lee, G. Cha, M. Yang, and S. Oh (2016) Individualness and Determinantal Point Processes for Pedestrian Detection. In European Conference on Computer Vision (ECCV).
  • [21] Y. Li, B. Sun, T. Wu, and Y. Wang (2016) Face Detection with End-to-End Integration of a ConvNet and a 3D Model. In European Conference on Computer Vision (ECCV).
  • [22] H. Lin and J. A. Bilmes (2012) Learning Mixtures of Submodular Shells with Application to Document Summarization. In Conference on Uncertainty in Artificial Intelligence (UAI).
  • [23] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV).
  • [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: Single Shot Multibox Detector. In European Conference on Computer Vision (ECCV).
  • [25] Y. Liu, R. Wang, S. Shan, and X. Chen (2018) Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [26] G. A. Miller (1995) WordNet: A Lexical Database for English. Communications of the ACM 38 (11), pp. 39–41.
  • [27] R. Ranjan, V. M. Patel, and R. Chellappa (2017) HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • [28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [29] J. Redmon and A. Farhadi (2017) YOLO9000: Better, Faster, Stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [30] M. Ren and R. S. Zemel (2017) End-to-End Instance Segmentation with Recurrent Attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [31] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Neural Information Processing Systems (NIPS).
  • [32] K. Simonyan and A. Zisserman (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
  • [33] Y. Tian, M. Lu, and A. Hampapur (2005) Robust and Efficient Foreground Analysis for Real-time Video Surveillance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. In Neural Information Processing Systems (NIPS).
  • [35] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen (2018) Repulsion Loss: Detecting Pedestrians in a Crowd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [36] P. Xie, R. Salakhutdinov, L. Mou, and E. P. Xing (2017) Deep Determinantal Point Process for Large-Scale Multi-Label Classification. In IEEE International Conference on Computer Vision (ICCV).
  • [37] K. Zhang, W. Chao, F. Sha, and K. Grauman (2016) Video Summarization with Long Short-term Memory. In European Conference on Computer Vision (ECCV).
  • [38] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang (2010) Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences (PNAS) 107 (10), pp. 4511–4515.