Towards Accurate One-Stage Object Detection with AP-Loss

Towards Accurate One-Stage Object Detection with AP-Loss

Kean Chen, Jianguo Li, Weiyao Lin  , John See, Ji Wang, Lingyu Duan, Zhibo Chen, Changwei He, Junni Zou
Shanghai Jiao Tong University, China, Intel Labs, China,
Multimedia University, Malaysia, Tencent YouTu Lab, China, Peking University, China
Corresponding Author, Email: wylin@sjtu.edu.cn
Abstract

One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering much from extreme foreground-background class imbalance issue due to the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme in perceptron learning and backpropagation algorithm in deep networks. We verify good convergence property of the proposed algorithm theoretically and empirically. Experimental results demonstrate notable performance improvement in state-of-the-art one-stage detectors based on AP-loss over different kinds of classification-losses on various benchmarks, without changing the network architectures.

1 Introduction

(a) Acc
(b) AP
Figure 1: Dashed red boxes are the ground truth object boxes. The orange filled boxes and other blank boxes are anchors with positive and negative ground truth labels, repectively. (a) shows that the detection performance is poor but the classification accuracy is still high due to large number of true negatives. (b) shows the ranking metric AP can better reflect the actual condition as it does not suffer from the large number of true negatives.
Figure 2: Overall framework of the proposed approach. We replace the classification-task in one-stage detectors with a ranking task, where the ranking procedure produces the primary terms of AP-loss and the corresponding label vector. The optimization algorithm is based on an error-driven learning scheme combined with backpropagation. The localization-task branch is not shown here due to no modification.

Object detection needs to localize and recognize the objects simultaneously from the large backgrounds, which remains challenging due to the imbalance between foreground and background. Deep learning based detection solutions usually adopt a multi-task architecture, which handles classification task and localization task with different loss functions. The classification task aims to recognize the object in a given box, while the localization task aims to predict the precise bounding box of the object. Two-stage detectors [25, 7, 2, 14] first generates a limited number of object box proposals, so that the detection problem can be solved by adopting classification task on those proposals. However, the circumstance is different for one-stage detectors, which need to predict the object class directly from the densely pre-designed candidate boxes. The large number of boxes yield the imbalance between foreground and background which makes the optimization of classification task easily biased and thus impacts the detection performance. It is observed that the classification metric could be very high for a trivial solution which predicts negative label for almost all candidate boxes, while the detection performance is poor. (a) illustrates one such example.

To tackle this issue in one-stage object detectors, some works introduce new classification losses such as balanced loss [23], Focal Loss [15], as well as tailored training method such as Online Hard Example Mining (OHEM) [18, 31]. These losses model each sample (anchor box) independently, and attempt to re-weight the foreground and background samples in classification losses to cater for the imbalance condition; this is done without considering the relationship among different samples. The designed balance weights are hand-crafted hyper-parameters, which do not generalize well across datasets. We argue that the gap between classification task and detection task hinder the performance of one-stage detectors. In this paper, instead of modifying the classification loss, we propose to replace classification task with ranking task in one-stage detectors, where the associated ranking loss explicitly models sample relationship, and is invariant to the ratio of positive and negative samples. As shown in (b), we adopt Average Precision (AP) as our target loss which is inherently more consistent with the evaluation metric for object detection.

However, it is non-trivial to directly optimize the AP-loss due to the non-differentiability and non-decomposability, so that standard gradient descent methods are not amenable for this case. There are three aspects of studies for this issue. First, AP based loss is studied within structured SVM models [37, 19], which restricts in linear SVM model so that the performance is limited. Second, a structured hinge loss [20] is proposed to optimize the upper bound of AP-loss instead of the loss itself. Third, approximate gradient methods [34, 9] are proposed for optimizing the AP-loss, which are less efficient and easy to fall into local optimum even for the case of linear models due to the non-convexity and non-quasiconvexity of the AP-loss. Therefore, it is still an open problem for the optimization of the AP-loss.

In this paper, we address this challenge by replacing the classification task in one-stage detectors with a ranking task, so that we handle the class imbalance problem with a ranking based loss named AP-loss. Furthermore, we propose a novel error-driven learning algorithm to effectively optimize the non-differentiable AP based objective function. More specifically, some extra transformations are added to the score output of one-stage detector to obtain the AP-loss, which includes a linear transformation that transforms the scores to pairwise differences, and a non-linear and non-differentiable “activation function” that transform the pairwise differences to the primary terms of the AP-loss. Then the AP-loss can be obtained by the dot product between the primary terms and the label vector. It is worth noting that the difficulty for using gradient method on the AP-loss lies in passing gradients through the non-differentiable activation function. Inspired by the perceptron learning algorithm [26], we adopt an error-driven learning scheme to directly pass the update signal through the non-differentiable activation function. Different from gradient method, our learning scheme gives each variable an update signal proportional to the error it makes. Then, we adopt the backpropagation algorithm to transfer the update signal to the weights of neural network. We theoretically and experimentally prove that the proposed optimization algorithm does not suffer from the non-differentiability and non-convexity of the objective function. The main contributions of this paper are summarized as below:

  • We propose a novel framework in one-stage object detectors which adopts the ranking loss to handle the class imbalance issue.

  • We propose an error-driven learning algorithm that can efficiently optimize the non-differentiable and non-convex AP-based objective function with both theoretical and experimental verifications.

  • We show notable performance improvement with the proposed method on state-of-the-art one-stage detectors over different kinds of classification-losses without changing the model architecture.

2 Related Work

One-stage detectors: In object detection, the one-stage approaches have relatively simpler architecture and higher efficiency than two-stage approaches. OverFeat [28] is one of the first CNN-based one-stage detectors. Thereafter, different designs of one-stage detectors are proposed, including SSD [18], YOLO [23], DSSD [6] and DSOD [30, 13]. These methods demonstrate good processing efficiency as one-stage detectors, but generally yield lower accuracy than two-stage detectors. Recently, RetinaNet [15] and RefineDet [38] narrow down the performance gap (especially on the challenging COCO benchmark [16]) between one-stage approaches and two-stage approaches with some innovative designs. As commonly known, the performance of one-stage detectors benefits much from densely designed anchors, which introduce extreme imbalance between foreground and background samples. To address this challenge, methods like OHEM [18, 31] and Focal Loss [15] have been proposed to reduce the loss weight for easy samples. However, there are two hurdles that are still open to discussion. Firstly, hand-crafted hyper-parameters for weight balance do not generalize well across datasets. Secondly, the relationship among sample anchors is far from well modeled.

AP as a loss for Object Detection: Average Precision (AP) is widely used as the evaluation metric in many tasks such as object detection [5] and information retrieval [27]. However, AP is far from a good and common choice as an optimization goal in object detection due to its non-differentiability and non-convexity. Some methods have been proposed to optimize the AP-loss in object detection, such as AP-loss in the linear structured SVM model [37, 19], structured hinge loss as upper bound of the AP-loss [20], approximate gradient methods [34, 9], reinforcement learning to fine-tune a pre-trained object detector with AP based metric [22]. Although these methods give valuable results in optimizing the AP-loss, their performances are still limited due to the intrinsic limitations. In details, the proposed approach differs from them in 4 aspects. (1) Our approach can be used for any differentiable linear or non-linear models such as neural networks, while [37, 19] only work for linear SVM model. (2) Our approach directly optimizes the AP-loss, while [20] introduces notable loss gap after relaxation. (3) Our approach dose not approximate the gradient and dose not suffer from the non-convexity of objective function as in [34, 9]. (4) Our approach can train the detectors in an end-to-end way, while [22] cannot.

Perceptron Learning Algorithm: The core of our optimization algorithm is the “error-driven update” which is generalized from the perceptron learning algorithm [26], and helps overcome the difficulty of the non-differentiable objective functions. The perceptron is a simple artificial neuron using the Heaviside step function as the activation function. The learning algorithm was first invented by Frank Rosenblatt [26]. As the Heaviside step function in perceptron is non-differentiable, it is not amenable for gradient method. Instead of using a surrogate loss like cross-entropy, the perceptron learning algorithm employs an error-driven update scheme directly on the weights of neurons. This algorithm is guaranteed to converge in finite steps if the training data is linearly separable. Further works like [11, 1, 35] have studied and improved the stability and robustness of the perceptron learning algorithm.

3 Method

We aim to replace the classification task with AP-loss based ranking task in one-stage detectors such as RetinaNet [15]. Figure 2 shows the two key components of our approach, i.e., the ranking procedure and the error-driven optimization algorithm. Below, we will first present how AP-loss is derived from traditional score output. Then, we will introduce the error-driven optimization algorithm. Finally, we also present the theoretical analyses of the proposed optimization algorithm and outline the training details. Note that all changes are made on the loss part of the classification branch without changing the backbone model and localization branch.

3.1 Ranking Task and AP-Loss

3.1.1 Ranking Task

(a)
(b)
Figure 3: Comparison of label assignments. The dashed red box is the ground truth box with class . (a) In traditional classification task of one-stage detectors, the anchor is assigned a foreground label . (b) In our ranking task framework, the anchor replicates times, and we assign the -th anchor to label , others 0.

In traditional one-stage detectors, given input image , suppose the pre-defined boxes (also called anchors) set is , each box will be assigned a label based on ground truth and the IoU strategy [7, 25], where label means the object class ID, label “0” means background and label “” means ignored boxes. During training and testing phase, the detector outputs a score-vector for each box .

In our framework, instead of one box with dimensional score predictions, we replicate each box for times to obtain where , and the -th box is responsible for the -th class. Each box will be assigned a label through the same IoU strategy (label for not counted into the ranking loss). Therefore, in the training and testing phase, the detector will predict only one scalar score for each box . Figure 3 illustrates our label formulation and the difference to traditional case.

The ranking task dictates that every positive boxes should be ranked higher than all the negative boxes w.r.t their scores. Note that AP of our ranking result is computed over the scores from all classes. This is slightly different from the evaluation metric meanAP for object detection systems, which computes AP for each class and obtains the average value. We compute AP this way because the score distribution should be unified for all classes while ranking each class separately cannot achieve this goal.

3.1.2 AP-Loss

For simplicity, we still use to denote the anchor box set after replication, and to denote the -th anchor box without the replication subscript. Each box thus corresponds to one scalar score and one binary label . Some transformations are required to formulate a ranking loss as illustrated in Figure 2. First, the difference transformation transfers the score to the difference form

(1)

where is a CNN based score function with weights for box . The ranking label transformation transfers labels to the corresponding pairwise ordering form

(2)

where is a indicator function which equals to 1 only if the subscript condition holds (i.e., ), otherwise . Then, we define an vector-valued activation function to produce the primary terms of the AP-loss as

(3)

where is the Heaviside step function:

(4)

A ranking is denoted as proper ranking when there are no two samples scored equally (i.e., ). Without loss of generality, we will treat all rankings as a proper ranking by breaking ties arbitrarily. Now, we can formulate the AP-loss as

(5)

where and denote the ranking position of score among all valid samples and positive samples respectively, , , is the size of set , and are vector form for all and respectively, means dot-product of two input vectors. Note that , where .

Finally, the optimization problem can be written as:

(6)

where denotes the weights of detector model. As the activation function is non-differentiable, a novel optimization/learning scheme is required instead of the standard gradient descent method.

Besides the AP metric, other ranking based metric can also be used to design the ranking loss for our framework. One example is the AUC-loss [12] which measures the area under ROC curve for ranking purpose, and has a slightly different “activation function” as

(7)

As AP is consistent with the evaluation metric of the object detection task, we argue that AP-loss is intuitively more suitable than AUC-loss for this task, and will provide empirical study in our experiments.

3.2 Optimization Algorithm

3.2.1 Error-Driven Update

Recalling the perceptron learning algorithm, the update for input variable is “error-driven”, which means the update is directly derived from the difference between desired output and current output. We adopt this idea and further generalize it to accommodate the case of activation function with vector-valued input and output. Suppose is the input and is the current output, the update for is thus

(8)

where is the desired output. Note that the AP-loss achieves its minimum possible value when each term . There are two cases. If , we should set the desired output . If , we do not care the update and set it to , since it does not contribute to the AP-loss. Consequently, the update can be simplified as

(9)

3.2.2 Backpropagation

We now have the desired vector-form update , and then will find an update for model weights which will produce most appropriate movement for . We use dot-product to measure the similarity of successive movements, and regularize the change of weights (i.e. ) with -norm based penalty term. The optimization problem can be written as:

(10)

where denotes the model weights at the -th step. With that, the first-order expansion of is given by:

(11)

where is the Jacobian matrix of vector-valued function at . Ignoring the high-order infinitesimal, we obtain the step-wise minimization process:

(12)

The optimal solution can be obtained by finding the stationary point. Then, the form of optimal is consistent with the chain rule of derivative, which means, it can be directly implemented by setting the gradient of to (c.f. Equation 9) and proceeding with backpropagation. Hence the gradient for score can be obtained by backward propagating the gradient through the difference transformation:

(13)

3.3 Analyses

Convergence: To better understand the characteristics of the AP-loss, we first provide a theoretical analysis on the convergence of the optimization algorithm, which is generalized from the convergence property of the original perceptron learning algorithm.

Proposition 1

The AP-loss optimizing algorithm is guaranteed to converge in finite steps if below conditions hold: (1) the learning model is linear;

(2) the training data is linearly separable.

The proof of this proposition is provided in Appendix-1 of supplementary. Although convergence is somewhat weak due to the need of strong conditions, it is non-trivial since the AP-loss function is not convex or quasiconvex even for the case of linear model and linearly separable data, so that gradient descent based algorithm may still fail to converge on a smoothed AP-loss function even under such strong conditions. One such example is presented in Appendix-2 of supplementary. It means that, under such conditions, our algorithm still optimizes better than the approximate gradient descent algorithm for AP-loss. Furthermore, with some mild modifications, even though the training data is not separable, the accumulated AP-loss can also be bounded proportionally by the best performance of the learning model. More details are presented in Appendix-3 of supplementary.

Consistency: Besides convergence, We observed that the proposed optimization algorithm is inherently consistent with widely used classification-loss functions.

Observation 1

When the activation function takes the form of softmax function and loss-augmented step function, our optimization algorithm can be expressed as the gradient descent algorithm on cross-entropy loss and hinge loss respectively.

The detailed analysis of this observation is presented in Appendix-4 of supplementary. We argue that the observed consistency is on the basis of the “error-driven” property. As is known, the gradients of those widely used loss functions are proportional to their prediction errors, where the prediction here refers to the output of activation function. In other words, their activation functions have a nice property: the vector field of prediction errors is conservative, allowing it being the gradient of some surrogate loss function. However, our activation function does not have this property, which makes our optimization not able to express as gradient descent with any surrogate loss function.

3.4 Details of Training Algorithm

1:All scores and corresponding labels in a minibatch
2:Gradient of input
3:
4:
5:
6: Indexes of scores sorted in ascending order
7:for  do
8:      Compute for all and for all According to Equation 3 and Equation 14
9:      
10:      if  then
11:            
12:      else Interpolation
13:            
14:      end if
15:       According to Equation 13
16:       According to Equation 13
17:end for
18: Normalization
Algorithm 1 Minibatch training for Interpolated AP
Batch Size AP AP AP
1 52.4 80.2 56.7
2 53.0 81.7 57.8
4 52.8 82.2 58.0
8 53.1 82.3 58.1
(a) Varying batch size
AP AP AP
0.25 50.2 80.7 53.6
0.5 51.3 81.6 55.4
1 53.1 82.3 58.1
2 52.8 82.6 57.2
(b) Varying for piecewise step function
Interpolated AP AP AP
No 52.6 82.2 57.1
Yes 53.1 82.3 58.1
(c) Interpolated vs. not interpolated
Table 1: Ablation experiments. Models are tested on VOC2007 test set.

Minibatch Training The minibatch training strategy is widely used in deep learning frameworks [8, 18, 15] as it accounts for more stability than the case with batch size equal to 1. The mini-batch training helps our optimization algorithm quite a lot for escaping the so-called “score-shift” scenario. The AP-loss can be computed both from a batch of images and from a single image with multiple anchor boxes. Consider an extreme case: our detector can predict perfect ranking in both image and image , but the lowest score in image is even greater than the highest score in image . There are “score-shift” between two images so that the detection performance is poor when computing AP-loss per-image. Aggregating scores over images in a mini-batch can avoid such problem, so that the minibatch training is crucial for good convergence and good performance.

Piecewise Step function During early stage of training, scores are very close to each other (i.e. almost all inputs to Heaviside step function are near zero), so that a small change of input will cause a big output difference, which destabilizes the updating process. To tackle this issue, we replace with a piecewise step function:

(14)

The piecewise step functions with different are shown in Figure 4. When approaches , the piecewise step function approaches the original step function. Note that is only different from near zero point. We argue that the precise form of the piecewise step function is not crucial. Other monotonic and symmetric smooth functions that only differs from near zero point could be equally effective. The choice of relates closely to the weight decay hyper-parameter in CNN optimization. Intuitively, parameter controls the width of decision boundary between positive and negative samples. Smaller enforces a narrower decision boundary, which causes the weights to shrink correspondingly (similar effect to that caused by the weight decay). Further details are presented in the experiments.

Figure 4: Heaviside step function and piecewise step function. (Best viewed in color)

Interpolated AP The interpolated AP [27] is widely adopted by many object detection benchmarks like PASCAL VOC [5] and MS COCO [16]. The common justification for interpolating the precision-recall curve [5] is “to reduce the impact of ’wiggles’ in the precision-recall curve, caused by small variations in the ranking of examples”. Under the same consideration, we adopt the interpolated AP instead of the original version. Specifically, the interpolation is applied on to make the precision at the -th smallest positive sample monotonically increasing with where the precision is in which is the index of the -th smallest positive sample. It is worth noting that the interpolated AP is a smooth approximation of the actual AP so that it is a practical choice to help to stabilize the gradient and to reduce the impact of ’wiggles’ in the update signals. The details of the interpolated AP based algorithm is summarized in Algorithm 1.

4 Experiments

4.1 Experimental Settings

We evaluate the proposed method on the state-of-the-art one-stage detector RetinaNet [15]. The implementation details are the same as in [15] unless explicitly stated. Our experiments are performed on two benchmark datasets: PASCAL VOC [5] and MS COCO [16]. The PASCAL VOC dataset has 20 classes, with VOC2007 containing 9,963 images for train/val/test and VOC2012 containing 11,530 for train/val. The MS COCO dataset has 80 classes, containing 123,287 images for train/val. We implement our codes with the MXNET framework, and conduct experiments on a workstation with two NVidia TitanX GPUs.

PASCAL VOC: When evaluated on the VOC2007 test set, models are trained on the VOC2007 and VOC2012 trainval sets. When evaluated on the VOC2012 test set, models are trained on the VOC2007 and VOC2012 trainval sets plus the VOC2007 test set. Similar to the evaluation metrics used in the MS COCO benchmark, we also report the AP averaged over multiple IoU thresholds of . We set in Equation 14. We use ResNet [8] as the backbone model which is pre-trained on the ImageNet-1k classification dataset [4]. At each level of FPN [14], the anchors have 2 sub-octave scales (, for ) and 3 aspect ratios [0.5, 1, 2]. We fix the batch normalization layers to be frozen in training phase. We adopt the minibatch training on 2 GPUs with 8 images per GPU. All evaluated models are trained for 160 epochs with an initial learning rate of 0.001 which is then divided by 10 at 110 epochs and again at 140 epochs. Weight decay of 0.0001 and momentum of 0.9 are used. We adopt the same data augmentation strategies as [18], while do not use any data augmentation during testing phase. In training phase, the input image is fixed to 512512, while in testing phase, we maintain the original aspect ratio and resize the image to ensure the shorter side with 600 pixels. We apply the non-maximum suppression with IoU of 0.5 for each class.

MS COCO: All models are trained on the widely used trainval35k set (80k train images and 35k subset of val images), and tested on minival set (5k subset of val images) or test-dev set. We train the networks for 100 epochs with an initial learning rate of 0.001 which is then divided by 10 at 60 epochs and again at 80 epochs. Other details are similar to that for PASCAL VOC.

4.2 Ablation Study

We first investigate the impact of our design settings of the proposed framework. We fix the ResNet-50 as backbone and conduct several controlled experiments on PASCAL VOC2007 test set (and COCO minival if stated) for this ablation study.

4.2.1 Comparison on Different Parameter Settings

Training Loss PASCAL VOC COCO
AP AP AP AP AP AP
CE-Loss + OHEM 49.1 81.5 51.5 30.8 50.9 32.6
Focal Loss 51.3 80.9 55.3 33.9 55.0 35.7
AUC-Loss 49.3 79.7 51.8 25.5 44.9 26.0
AP-Loss 53.1 82.3 58.1 35.0 57.2 36.6
Table 2: Comparison through different training losses. Models are tested on VOC2007 test and COCO minival sets. The metric AP is averaged over multiple IoU thresholds of .

Here we study the impact of the practical modifications introduced in Section 3.4. All results are shown in Table 1.

Minibatch Training: First, we study the mini-batch training, and report detector results at different batch-size in (a). It shows that larger batch-size (i.e. 8) outperforms all the other smaller batch-size. This verifies our previous hypothesis that large minibatch training helps to eliminate the “score-shift” from different images, and thus stabilizes the AP-loss through robust gradient calculation. Hence, is used in our further studies.

Piecewise Step Function: Second, we study the piecewise step function, and report detector performance on the piecewise step function with different in (b). As mentioned before, we argue that the choice of is trivial and is dependent on other network hyper-parameters such as weight decay. Smaller makes the function sharper, which yields unstable training at initial phase. Larger makes the function deviate from the properties of the original AP-loss, which also worsens the performance. is a good choice we used in our further studies.

Interpolated AP: Third, we study the impact of interpolated AP in our optimization algorithm, and list the results in (c). Marginal benefits are observed for interpolated AP over standard AP, so we use interpolated AP in all the following studies.

4.2.2 Comparison on Different Losses

We evaluate with different losses on RetinaNet [15]. Results are shown in Table 2. We compare traditional classification based losses like focal loss [15] and cross entropy loss (CE-loss) with OHEM [18] to the ranking based losses like AUC-loss and AP-loss. Although focal loss is significantly better than CE-loss with OHEM on COCO dataset, it is interesting that focal-loss does not perform better than CE-loss at AP on PASCAL VOC. This is likely because the hyper-parameters of focal loss are designed to suit the imbalance condition on COCO dataset which is not suitable for PASCAL VOC, so that focal loss cannot generalize well to PASCAL VOC without tuning its hyper-parameters. The proposed AP-loss performs much better than all the other losses on both two datasets, which demonstrates its effectiveness and stronger generalization ability on handling the imbalance issue. It is worth noting that AUC-loss performs much worse than AP-loss, which may be due to the fact that AUC has equal penalty for each misordered pair while AP imposes greater penalty for the misordering at higher positions in the predicted ranking. It is obvious that object detection evaluation concerns more on objects with higher confidence, which is why AP provides a better loss measure. Furthermore, an assessment of the detection performance at different training iterations, as shown in (a), outlines the superiority of the AP-loss for snapshot time points.

(a)
(b)
Figure 5: (a) Detection accuracy (mAP) on VOC2007 test set. (b) Convergence curves of different AP-loss optimizations on VOC2007 trainval set. (Best viewed in color)
Figure 6: Some detection examples. Top: Baseline results by RetinaNet with focal loss. Bottom: Our results with AP-loss.
Method Backbone Multi-Scale VOC07 VOC12 COCO
AP AP AP AP AP AP AP AP
YOLOv2 [24] DarkNet-19 78.6 73.4 21.6 44.0 19.2 5.0 22.4 35.5
DSOD300 [30] DS/64-192-48-1 77.7 76.3 29.3 47.3 30.6 9.4 31.5 47.0
SSD512 [18] VGG-16 79.8 78.5 28.8 48.5 30.3 - - -
SSD513 [6] ResNet-101 80.6 79.4 31.2 50.4 33.3 10.2 34.5 49.8
DSSD513 [6] ResNet-101 81.5 80.0 33.2 53.3 35.2 13.0 35.4 51.1
DES512 [39] VGG-16 81.7 80.3 32.8 53.2 34.6 13.9 36.0 47.6
RFBNet512 [17] VGG-16 82.2 - 33.8 54.2 35.9 16.2 37.1 47.4
PFPNet-R512 [10] VGG-16 82.3 80.3 35.2 57.6 37.9 18.7 38.6 45.9
RefineDet512 [38] VGG-16 81.8 80.1 33.0 54.5 35.5 16.3 36.3 44.3
RefineDet512 [38] ResNet-101 - - 36.4 57.5 39.5 16.6 39.9 51.4
RetinaNet500 [15] ResNet-101 - - 34.4 53.1 36.8 14.7 38.5 49.1
RetinaNet500+AP-Loss (ours) ResNet-101 83.9 83.1 37.4 58.6 40.5 17.3 40.8 51.9
PFPNet-R512 [10] VGG-16 84.1 83.7 39.4 61.5 42.6 25.3 42.3 48.8
RefineDet512 [38] VGG-16 83.8 83.5 37.6 58.7 40.8 22.7 40.3 48.3
RefineDet512 [38] ResNet-101 - - 41.8 62.9 45.7 25.6 45.1 54.1
RetinaNet500+AP-Loss (ours) ResNet-101 84.9 84.5 42.1 63.5 46.4 25.6 45.0 53.9
Table 3: Detection results on VOC2007 test, VOC 2012 test and COCO test-dev sets.

4.2.3 Comparison on Different Optimization Methods

We also compare our optimization method with the approximate gradient method [34, 9] and structured hinge loss method [20]. Both [34, 9] approximate the AP-loss with a smooth expectation and envelope function, respectively. Following their guidance, we replace the step function in AP-loss with a sigmoid function to constrain the gradient to neither zero nor undefined, while still keep the shape similar to the original function. Same as [9], we adopt the log space objective function, i.e. , to allow the model to quickly escape from the initial state. We train the detector on VOC2007 trainval set and turn off the bounding box regression task. The convergence curves shown in (b) reveal some essential observations. It can be seen that AP-loss optimized by approximate gradient method does not even converge, likely because its non-convexity and non-quasiconvexity fail on a direct gradient descent method. Meanwhile, AP-loss optimized by the structured hinge loss method [20] converges slowly and stabilizes near 0.8, which is significantly worse than the asymptotic limit of AP-loss optimized by our error-driven update scheme. We believe that this method does not optimize the AP-loss directly but rather an upper bound of it, which is controlled by a discriminant function [20]. In ranking task, this discriminant function is hand-picked and has an AUC-like form, which may cause variability in optimization.

4.3 Benchmark Results

With the settings selected in ablation study, we conduct experiments to compare the proposed detector to state-of-the-art one-stage detectors on three widely used benchmark, i.e. VOC2007 test, VOC2012 test and COCO test-dev sets. We use ResNet-101 as backbone networks instead of ResNet-50 in ablation study. We use an image scale of 500 pixels for testing. Table 3 lists the benchmark results comparing to recent state-of-the-art one-stage detectors such as SSD [18], YOLOv2 [24], DSSD [6], DSOD [30], DES [39], RetinaNet [15], RefineDet [38], PFPNet [10], RFBNet [17]. Compared to the baseline model RetinaNet500 [15], our detector achieves a 3.0% improvement (37.4% vs. 34.4%) on COCO dataset. Figure 6 illustrates some detection results by the RetinaNet with focal loss and our AP-loss. Besides, our detector outperforms all the other methods for both single-scale and multi-scale tests in all the three benchmarks. We should emphasize that this verifies the great effectiveness of our AP-loss since our detector achieves such a great performance gain just by replacing the focal-loss with our AP-loss in RetinaNet without whistle and bells, without using advanced techniques like deformable convolution [3], SNIP [33], group normalization [36], etc. The performance could be further improved with these kinds of techniques and other possible tricks. Our detector has the same detection speed (i.e., on one NVidia TitanX GPU) as RetinaNet500 [15] since it does not change the network architecture for inference.

5 Conclusion

In this paper, we address the class imbalance issue in one-stage object detectors by replacing the classification sub-task with a ranking sub-task, and proposing to solve the ranking task with AP-Loss. Due to non-differentiability and non-convexity of the AP-loss, we propose a novel algorithm to optimize it based on error-driven update scheme from perceptron learning. We provide a grounded theoretical analysis of the proposed optimization algorithm. Experimental results show that our approach can significantly improve the state-of-the-art one-stage detectors.

Acknowledgements. This paper is supported in part by: National Natural Science Foundation of China (61471235), Shanghai ’The Belt and Road’ Young Scholar Exchange Grant (17510740100), CREST Malaysia (No. T03C1-17), and the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation. We gratefully acknowledge the support from Tencent YouTu Lab.

References

  • [1] JK Anlauf and M Biehl. The adatron: an adaptive perceptron algorithm. EPL (Europhysics Letters), 10(7):687, 1989.
  • [2] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
  • [3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, et al. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
  • [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [5] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
  • [6] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • [7] Ross Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [9] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In ACCV, pages 198–213, 2016.
  • [10] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In ECCV, pages 234–250, 2018.
  • [11] Werner Krauth and Marc Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A: Mathematical and General, 20(11):L745, 1987.
  • [12] Jianguo Li and Y Zhang. Learning surf cascade for fast and accurate object detection. In CVPR, 2013.
  • [13] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo Li. Tiny-DSOD: Lightweight object detection for resource-restricted usages. In BMVC, 2018.
  • [14] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 3, 2017.
  • [15] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Trans on PAMI, 2018.
  • [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [17] Songtao Liu, Di Huang, and andYunhong Wang. Receptive field block net for accurate and fast object detection. In The European Conference on Computer Vision (ECCV), September 2018.
  • [18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37, 2016.
  • [19] Pritish Mohapatra, CV Jawahar, and M Pawan Kumar. Efficient optimization for average precision svm. In NIPS, pages 2312–2320, 2014.
  • [20] Pritish Mohapatra, Michal Rolinek, C.V. Jawahar, Vladimir Kolmogorov, and M. Pawan Kumar. Efficient optimization for rank-based loss functions. In CVPR, 2018.
  • [21] A Novikoff. On convergence proofs for perceptrons. Proc.sympos.math.theory of Automata, pages 615–622, 1963.
  • [22] Yongming Rao, Dahua Lin, Jiwen Lu, and Jie Zhou. Learning globally optimized object detector via policy gradient. In CVPR, pages 6190–6198, 2018.
  • [23] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [24] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
  • [25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [26] Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
  • [27] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. 1986.
  • [28] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [29] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
  • [30] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
  • [31] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [33] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection–snip. In CVPR, 2018.
  • [34] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct loss minimization. In ICML, pages 2169–2177, 2016.
  • [35] A Wendemuth. Learning the unlearnable. Journal of Physics A: Mathematical and General, 28(18):5423, 1995.
  • [36] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
  • [37] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271–278. ACM, 2007.
  • [38] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
  • [39] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L. Yuille. Single-shot object detection with enriched semantics. In CVPR, 2018.

A1. Convergence

We provide proof for the proposition mentioned in Section 3.3.1 of the paper. The proof is generalized from the original convergence proof [21] for perceptron learning algorithm.

Proposition 2

The AP-loss optimizing algorithm is guaranteed to converge in finite steps if below conditions hold: (1) the learning model is linear;

(2) the training data is linearly separable.

Proof. Let denote the weights of the linear model. Let denote the feature vector of -th box in -th training sample. Hence the score of -th box is . Define . Note that the training data is separable, which means there are and that satisfy:

(15)

In the -th step, a training sample which makes an error (if there is no such training sample, the model is already optimal and algorithm will stop) is randomly chosen. Then the update of is:

(16)

where

(17)

Here, since the discussion centers on the current training sample, we omit the superscript hereon.

From (16), we have

(18)

Hence we have

(19)

Then

(20)

Here, is a positive constant.

From (16), we also have

(21)

Here, is a positive constant. Hence we arrive at:

(22)

Then, combining (20) and (22), we have

(23)

which means

(24)

It shows that the algorithm will stop at most after steps, which means that the training model will achieve the optimal solution at most after steps.

A2. An Example of Gradient Descent Failing on Smoothed AP-loss

We approximate the step function in AP-loss by sigmoid function to make it amenable to gradient descent. Specifically, the smoothed AP-loss function is given by:

(25)

where

(26)

Consider a linear model and three training samples (the first one is negative sample, others are positive samples). Then we have

(27)

Note that the training data is separable since we have and when .

Under this setting, the smoothed AP-loss become

(28)

If is sufficiently large and , then the partial derivatives satisfy the following condition:

(29)

which means and will keep increasing with the inequality according to the gradient descent algorithm. Hence the objective function will approach here. However, the objective function approaches the global minimum value if and only if and . This shows that the gradient descent fails to converge to global minimum in this case.

A3. Inseparable Case

In this section, we will provide analysis for our algorithm with inseparable training data. We demonstrate that the bound of accumulated AP-loss depends on the best performance of learning model. The analysis is based on online learning bounds [29].

A3.1. Preliminary

To handle the inseparable case, a mild modification on the proposed algorithm is needed, i.e. in the error-driven update scheme, is modified to

(30)

where is defined in Section 3.4.2 (Piecewise Step Function) of the paper. The purpose is to introduce a non-zero decision margin for the pairwise score which makes the algorithm more robust in the inseparable case. In contrast to the case in Section 3.4.2, here we only change to in the numerator for the convenience of theoretical anaysis. However, such algorithm still suffers from the discontinuity of in the denominator. Hence the strategy in Section 3.4.2 is also practical consideration, necessary for good performance. Then, consider the AP-loss:

(31)

and define a surrogate loss function:

(32)

where . Note that the AP-loss is upper bounded by the surrogate loss:

(33)

The learning model can be written as , where denotes the training data for one iteration and is the whole training set. Then, the modified error-driven algorithm is equivalent to gradient descent on surrogate loss at each step . We further suppose below conditions are satisfied:

(1) For all and , is convex w.r.t .

(2) For all , is upper bounded by a constant . Here is the matrix norm induced by the 2-norm for vectors.

Remark 1. Note that these two conditions are satisfied if the learning model is linear.

A3.2. Bound of Accumulated Loss

By the convexity, we have:

(34)

where we use to denote and can be any vector of model weights. Then, let and compute the sum over , we have:

(35)

where is the step size of gradient descent. Note that

(36)

and

(37)

Note that both and depend on . However, we omit the subscript here since the discussion only centers on the current training sample .

Hence we have:

(38)

Let , rearrange and get the expression:

(39)

This entails the bound of surrogate loss :

(40)

which implies the bound of AP-loss :

(41)

As a special case, if there exists a such that for all , then the accumulated AP-loss is bounded by a constant, which implies that convergence can be achieved with finite steps (similar to that of the separable case). Otherwise, with sufficiently large , the average AP-loss mainly depends on . This implies that the bound is meaningful if there still exists a sufficiently good solution in such inseparable case.

A3.3. Offline Setting

With the offline setting ( for all ), a bound with simpler form can be revealed. For simplicity, we will omit the subscript of and define , . Then,

(42)

The second last inequality is based on the fact that are picked from without replacement (assume no ties; if ties exist, this inequality still holds). Combining the results from Equation 42 and Equation 41, we have:

(43)

Next,

(44)

Combining the results from Equation 44 and Equation 41, we have:

(45)

If is small, the bound in Equation 43 is active, otherwise the bound in Equation 45 is active. Consequently, we have:

(46)

where denotes the average AP-loss, as increases.

A4. Consistency

Observation 2

When the activation function takes the form of softmax function and loss-augmented step function, our optimization algorithm can be expressed as the gradient descent algorithm on cross-entropy loss and hinge loss respectively.

Cross Entropy Loss: Consider the multi-class classification task. The outputs of neural network are where is the number of classes, and the ground truth label is . Using softmax as the activation function, we have:

(47)

The cross entropy loss is:

(48)

Hence the gradient of is

(49)

Note that is “error-driven” with the desired output and current output . This form is consistent with our error-driven update scheme (c.f. Section 3.2.1 of the paper).

Hinge Loss: Consider the binary classification task. The output of neural network is , and the ground truth label is . Define . Using loss-augmented step function as the activation function, we have:

(50)

where is the Heaviside step function. The hinge loss is:

(51)

Hence the gradient of is

(52)

There are two cases. If , the gradient is “error-driven” with the desired output and current output . If