Learning to Count Objects with Few Exemplar Annotations

Jianfeng Wang, Rong Xiao, Yandong Guo, Lei Zhang
Microsoft
Email: {jianfw, rxiao, yag, leizhang}@microsoft.com

In this paper, we study the problem of object counting with incomplete annotations. Based on the observation that in many object counting problems the target objects are repeated and highly similar to each other, we are particularly interested in the setting where only a few exemplar annotations are provided. Directly applying object detection with incomplete annotations results in severe accuracy degradation because unlabeled object instances are handled improperly. To address the problem, we propose a positiveness-focused object detector (PFOD) to progressively propagate the incomplete labels before applying a general object detection algorithm. The PFOD focuses on the positive samples and ignores the negative instances for most of the training time. This strategy, though simple, dramatically boosts the object counting accuracy. On the CARPK dataset for parking lot car counting, we improved mAP@0.5 from 4.58% to 72.44% using only 5 training images, each with 5 bounding boxes. On the Drink35 dataset for shelf product counting, the mAP@0.5 is improved from 14.16% to 53.73% using 10 training images, each with 5 bounding boxes.

Incompletely-supervised learning; object counting; object detection

1 Introduction

Object counting aims to count the number of object instances in a single image or video sequence. It has many real-world applications such as traffic flow monitoring, crowdedness estimation, and product counting.

Existing approaches to the counting problem can be roughly categorized into regression-based approaches [1, 2, 3, 4, 5], density-based approaches [6, 7, 8], and detection-based approaches [9, 10, 11]. The regression-based approach directly learns a mapping from the image to the number of instances. In contrast, the density-based approach first estimates a density map and then aggregates the density information to get the number of instances. Both categories provide little information about the exact location of each instance, whereas the detection-based approach detects the location of each individual instance and thus makes the counting result more explainable. Due to this advantage, we mainly focus on the detection-based approach in this work.

(a) Fully-Supervised (b) Weakly-Supervised (c) Incompletely-Supervised
Figure 1: Different training image settings in detection-based object counting. The label names are shown in bottom left

In recent years, the accuracy of object detection has been dramatically improved [12, 13, 14, 15] thanks to advances in deep convolutional neural networks. However, the high accuracy is normally achieved with a large number of fully annotated bounding boxes. In the context of object counting, the number of instances even in a single image could be huge (e.g. from tens to hundreds), which not only presents a great challenge to object detection, but also requires tremendous annotation effort to build a high-accuracy object detector.

To save the annotation cost, weakly supervised approaches are generally adopted to train the model solely based on image-level annotations. However, such approaches still suffer from suboptimal accuracy due to the lack of instance-level (e.g. bounding box) annotations. To address this problem, we resort to a problem setting where each image is only partially annotated with a few bounding boxes and the other instances are left unlabeled. This setting is practically useful especially when the number of instances is large and it is tedious and costly to label all of them. Another motivation is that in certain object counting problems, the large number of instances exhibits little variance, e.g. in scale or color. Potentially, a few annotated bounding boxes might generalize well enough to achieve a high accuracy. We call learning from such training data incompletely-supervised learning.

Fig. 1 shows examples that illustrate how the training images differ among fully-supervised, weakly-supervised, and incompletely-supervised learning. In the fully-supervised setting, all cars and drinks are annotated. The total number of car instances is as many as 58. In the weakly-supervised setting, we only know that cars or drinks exist in the image. Based on the observation that in many object counting problems the target objects are repeated and highly similar to each other, we are particularly interested in the setting where only a few exemplar annotations are provided, which is illustrated in Fig. 1(c).

In incompletely-supervised learning, if every unlabeled region is simply treated as background in object detection training, the trained model could over-fit the training data severely, because the number of labeled instances is usually small, and mistake the unlabeled objects for background. To address the issue, we propose a simple yet effective algorithm, named positiveness-focused object detector (PFOD), to progressively propagate the labels. It treats unlabeled regions as background first and then neglects them to focus mainly on the positive samples during training. The intuition is to first learn a compact classifier which might over-fit the data, and then relax the learning by ignoring the unlabeled regions to pull the unlabeled instances towards being positive. In this way, bounding box supervision can be automatically expanded during training before we apply a standard object detector.

Overall, our contributions can be summarized as follows.

  1. We explicitly formulate the problem of incompletely-supervised learning, which focuses on the incomplete annotations for object counting.

  2. We propose a progressive label propagation algorithm through positiveness-focused object detector to properly handle the incomplete labels.

  3. We conduct extensive experiments that demonstrate the effectiveness of the proposed PFOD in handling such training problems. On the CARPK dataset for parking lot car counting, we improved mAP@0.5 from 4.58% to 72.44% using only 5 training images, each with 5 bounding boxes. On the Drink35 dataset for shelf product counting, the mAP@0.5 is improved from 14.16% to 53.73% using 10 training images, each with 5 bounding boxes.

2 Related Work

2.1 Object Counting

The approaches towards object counting can be roughly categorized into three categories: regression-based approach, density-based approach, and detection-based approach.

2.1.1 Regression-based Approach

The regression-based approach predicts the instance count directly with global regressors on image features [2, 3, 4, 5]. For example, in [2], the video sequence is first segmented into components of homogeneous motion, and a Gaussian process regression is learned for each segmented region to count the number of instances. A cumulative attribute representation [3] was proposed and used to learn the regressor in order to handle imbalanced training data. Inter-dependent features are used in [4] to mine the spatial importance among different regions for learning the count. Feature normalization is taken into account to deal with perspective projection in [5]. Taking advantage of deep learning, [16] estimates the number of cars using CNNs. These regression-based approaches provide no clue about the location of each individual object, which limits their potential applications.

2.1.2 Density-based Approach

The density-based approach first maps the image to a density map, such that the integral over any sub-region gives the count of objects within that region [17, 7, 8, 6]. In [17], the pixel-level density is learned by minimizing a regularized risk quadratic cost function. Based on the density map, an interactive counting system is introduced in [7] to incorporate relevance feedback. Instead of hand-crafted features, [6] focuses on CNN-based density map and instance count estimation in cross-scene scenarios. While the density map provides some clue about crowdedness, it still lacks the exact position of each instance.
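The defining property of a density map, namely that summing it over a region yields the object count for that region, can be illustrated with a toy example. The sketch below uses a hand-made 4x4 map and a hypothetical helper `region_count`; it is not taken from any of the cited methods.

```python
def region_count(density, top, left, bottom, right):
    """Sum a density map over rows [top, bottom) and columns [left, right)."""
    return sum(sum(row[left:right]) for row in density[top:bottom])


# Toy density map (an assumption for illustration): each object contributes
# a total mass of 1.0, spread over the pixels it covers.
density = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]

total = region_count(density, 0, 0, 4, 4)     # count over the whole image
top_half = region_count(density, 0, 0, 2, 4)  # count over the top two rows
```

The sum tells how many objects a region contains but not where each one is, which is the limitation noted above.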

2.1.3 Detection-based Approach

The detection-based approach gives the total count by localizing each object instance. Such approaches can be considered an application of object detection and thus benefit a lot from improvements in object detection [12, 13, 14, 15]. However, directly applying object detection to object counting requires special attention to small-scale objects and extraordinary effort for massive instance labeling. In [18], a hierarchical part-template matching approach is proposed to detect humans, which requires careful feature and template design, while in [9], neural network learning is applied to detect and count car instances by incorporating layout information. Given its large potential in applications that demand location information, we mainly work on the detection-based approach with a special focus on incompletely-supervised learning.

2.2 Object Detection

Nowadays, mainstream object detection algorithms are CNN-based, owing to the powerful representation and high accuracy of CNNs. Such algorithms can be roughly categorized into two-stage object detectors [19, 12, 13] and one-stage object detectors [14, 15]. Two-stage object detectors such as Faster RCNN [13] first extract region proposals and then perform classification and bounding box regression, while one-stage detectors such as YOLO [15] directly output the bounding box locations and the classification results without generating region proposals. All these object detection algorithms follow the fully supervised setting by default, requiring a large number of training images with annotated bounding boxes to achieve a high accuracy.

To reduce the annotation effort, weakly supervised learning trains an object detector based only on image-level labels. Most approaches [20, 21, 22, 23] first generate multiple region proposals for each image and then leverage multi-instance learning algorithms to solve the problem. However, such approaches still suffer from suboptimal accuracy due to the lack of instance-level labels.

In the fully and weakly supervised settings, each image is labeled either with full bounding box annotations or with only image-level category information. In contrast, we are more interested in the incompletely-supervised setting, where only a few exemplar bounding boxes are annotated and all others are unlabeled. This setting is practically more useful when the number of instances is large, for example when counting the products on a retail store shelf or the cars in a large parking lot.

3 Approach

3.1 Problem Definition

Although detection-based object counting resembles the general object detection problem, it still presents unique challenges when the total number of object instances in an image is large, e.g. from tens to hundreds, and the size of each instance is small compared with the image size.

To save the tedious and costly labeling effort, we assume each image in the training set is labeled with only a few instances (e.g. fewer than ten), though there might be tens or hundreds of instances in one image. For image $I_i$, denote by $b_{i,j}$ its $j$-th annotated bounding box ($j = 1, \dots, N_i$). Since not all instances are annotated, the remaining regions could also contain object instances. Without loss of generality, we assume the number of object categories is 1. For notational simplicity, we drop the $i$ from $b_{i,j}$ and $N_i$ when there is no ambiguity. Thus, the problem is how to build an effective object detector to count the object instances based on the incomplete annotations $\{b_j\}_{j=1}^{N}$. We name the problem incompletely-supervised learning for object counting.

3.1.1 Comparison with other supervised learning problems.

The differences from other supervised learning settings in the context of object detection are also shown in Fig. 1. Fully-supervised learning requires that all the bounding boxes are labeled. This represents the upper bound of detection-based object counting performance, but labeling every instance is costly. Weakly-supervised learning assumes that only image-level category information is available. That is, we know there are some object instances in an image, but we have no information about their locations. Another problem setting is low-shot learning, where the number of training images is small, but each training image has full annotations. This is closer to fully-supervised learning, only with a few training images.

(a) Ideal (b) Baseline (c) Stage 1 (d) Stage 2 (d) Stage 3 (e) Stage 4
Figure 2: Intuition of the proposed label propagation through a positiveness-focused object detector to manage the incompletely-supervised labels

3.2 Label Propagation by Positiveness-Focused Object Detection

As we have a few exemplar labels of the target objects, the most intuitive way is to treat the labels as seeds and carefully propagate them to unknown regions. We propose a simple yet effective positiveness-focused object detector (PFOD) to solve the propagation problem.

3.2.1 Intuition.

Fig. 2 illustrates the intuition of the label propagation by PFOD. The plus symbol denotes a positive sample while the minus symbol denotes a negative one. A sample in red is known (labeled) while a sample in blue is unknown. The dotted line is the decision boundary. If all the positive and negative samples are known, we can easily find a decision boundary that separates the true positive and true negative samples, as shown in Fig. 2(a). When only one positive sample is known and all the rest are unknown, as shown in Fig. 2(b), simply treating all the unknowns as negative samples could lead the decision boundary to mis-classify the unlabeled positive samples as negative.

To compensate for the lack of negative samples, we can introduce some images as extra negative data, as shown in green from Fig. 2(c) to Fig. 2(e). For object detection, it is indeed easy to find images without any target objects in the same domain. For example, we can use the PASCAL VOC [24] 2007 dataset as extra background images for the problem of car counting on the CARPK [9] dataset.

In Fig. 2(c) we show how to propagate the labels. We first treat all the unlabeled samples (shown as both blue plus and minus symbols in Fig. 2(c)) as negative samples to train the detector, and the learned decision boundary is pushed close to the only positive sample (red plus symbol), as depicted by the red dotted line. Next, we ignore the unknown samples (note that we still have the extra negative samples shown as green minus symbols) and gradually update the learned classifier to separate the labeled positive sample from the extra negative samples. As the unknown positive samples (shown as blue plus symbols) are no longer taken as negative samples, they no longer push the decision boundary. As a result, the decision boundary (shown as the black dotted line) moves a bit further from the known positive sample and classifies a few more unknown positive samples as positive.

If the propagation is carefully controlled, we can treat the newly classified positive samples as known labeled data for the next stage, as shown in Fig. 2(d). We repeat the process above to iteratively learn the boundary and propagate the label set. Each iteration here is called a stage, and we use $\mathcal{B}^{(s)} = \{b_j\}_{j=1}^{N_s}$ to denote the label set in the $s$-th stage, where $N_s$ is the number of expanded bounding boxes in one image. Finally, we combine all the expanded positive samples and take all the others as background to learn the final decision boundary.

Figure 3: Framework of the proposed label propagation by a positiveness-focused object detector (PFOD)

3.2.2 Solution.

Fig. 3 shows the framework of the proposed strategy to learn the object detector. At the initial state of training, the bounding boxes we have are the labeled set, i.e. $\mathcal{B}^{(0)} = \{b_j\}_{j=1}^{N}$ for each training image. Based on the label set $\mathcal{B}^{(s)}$ of the current stage $s$, we train a positiveness-focused object detector (PFOD), which can be built on any object detector [19, 12, 13, 14, 15]. In this work, we choose YoloV2 [15] for its simplicity. The network first processes each batch of training images with a fully convolutional neural network, and then outputs three components at each spatial position: bounding box coordinates, an objectness score that tells how confident the network is that the bounding box contains an object, and classification scores that tell which category the bounding box contains. Here we assume the number of categories is 1 and remove the classification module. The Euclidean loss is used for both bounding box coordinate regression and objectness confidence regression.

Specifically, for the objectness at spatial position $p$, the loss is defined as

$$\ell_p(\theta) = \left(o_p(\theta) - t_p\right)^2,$$

where $\theta$ denotes the network parameters, learned iteratively through the mini-batch stochastic gradient descent (SGD) algorithm, $o_p(\theta)$ is the objectness score at position $p$, and $t_p$ is 1 if position $p$ is identified as being positive and 0 otherwise, based on the current label set. For the extra background images, the label is consistently set to 0 and the loss is always enabled. For the training images in the target domain, we modify it as follows to implement PFOD,

$$\ell_p(\theta) = \begin{cases} \left(o_p(\theta) - t_p\right)^2, & \text{if } t_p = 1 \text{ or } k \le T, \\ 0, & \text{otherwise}, \end{cases}$$

where $k$ is the number of iterations in SGD and $T$ is a pre-defined parameter (200 in experiments) that determines for how many iterations the unknown regions are treated as background. After $T$ iterations, the detector training focuses only on the positive samples and the extra background data.
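The gating of the objectness loss described above can be sketched in a few lines. This is a minimal illustration, assuming the per-position loss is a plain squared error and that iterations are counted from zero; the function name `pfod_objectness_loss` and its parameters are hypothetical and do not come from the authors' Caffe implementation.

```python
def pfod_objectness_loss(scores, targets, iteration, warmup_iters=200,
                         is_background_image=False):
    """Sum of squared objectness errors with positiveness-focused gating.

    scores, targets: flat lists with one entry per spatial position.
    In a target-domain image, positions with target 0 contribute only during
    the first `warmup_iters` iterations (0-indexed); afterwards they are
    ignored. Extra background images contribute at every position, always.
    """
    loss = 0.0
    for score, target in zip(scores, targets):
        enabled = (is_background_image or target == 1
                   or iteration < warmup_iters)
        if enabled:
            loss += (score - target) ** 2
    return loss
```

With `scores=[0.5, 0.5]` and `targets=[1, 0]`, both positions are penalized early in training, while only the positive position is penalized once the warm-up has passed.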

With the model trained by PFOD, we run prediction over all the training images from the target domain. The predicted bounding boxes with high probability scores (0.9 in experiments) are merged into the label set of the current stage, $\mathcal{B}^{(s)}$, to form the label set of the next stage, $\mathcal{B}^{(s+1)}$. We also discard any predicted bounding box that has a high overlap (Intersection-over-Union above 0.2 in experiments) with any of the original bounding boxes in $\mathcal{B}^{(s)}$.
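The expansion step can be sketched as follows, assuming boxes are `(x1, y1, x2, y2)` tuples and predictions are `(box, score)` pairs. The helper names `iou` and `expand_labels` are hypothetical, and the comparison operators at the two thresholds (0.9 and 0.2) are assumptions, since the text only gives the values.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def expand_labels(labels, predictions, score_thresh=0.9, iou_thresh=0.2):
    """Merge confident predictions into the current label set.

    A prediction is added when its score reaches `score_thresh` and it does
    not overlap any box already in `labels` by more than `iou_thresh`.
    """
    expanded = list(labels)
    for box, score in predictions:
        if score < score_thresh:
            continue  # not confident enough to become a label
        if any(iou(box, known) > iou_thresh for known in labels):
            continue  # duplicates a box we already have
        expanded.append(box)
    return expanded
```

The overlap check keeps the label set from accumulating near-duplicate boxes around instances that are already annotated.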

After $S$ stages, we feed the training images and the expanded bounding box set $\mathcal{B}^{(S)}$ into a normal object detector training pipeline, in which the bounding boxes in $\mathcal{B}^{(S)}$ are positive while all the rest are negative. We still apply the YoloV2 algorithm here for training and testing. Ideally, if all the unlabeled object instances could be propagated, the trained model should achieve an accuracy on par with a model trained on the full annotations.
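Putting the pieces together, the multi-stage procedure amounts to alternating PFOD training with label expansion. The outline below is only a sketch: `train_pfod` and `predict_and_expand` are hypothetical callables standing in for detector training and for prediction plus merging, neither of which is specified here.

```python
def propagate_labels(initial_labels, train_pfod, predict_and_expand,
                     num_stages):
    """Run `num_stages` rounds of PFOD training and label expansion.

    initial_labels: dict mapping image id -> list of boxes.
    train_pfod(labels) -> model                  (hypothetical callable)
    predict_and_expand(model, image_id, boxes) -> expanded list of boxes
    Returns the label sets after the final stage.
    """
    labels = {img: list(boxes) for img, boxes in initial_labels.items()}
    for _ in range(num_stages):
        model = train_pfod(labels)  # positiveness-focused training
        labels = {img: predict_and_expand(model, img, boxes)
                  for img, boxes in labels.items()}
    return labels
```

The returned label sets would then be used as ordinary full annotations to train the final detector.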

4 Experiments

4.1 Settings

4.1.1 Datasets.

We mainly evaluate the approaches on the CARPK [9] dataset, which contains 989 training images and 459 testing images. The task is to detect and count the car instances in the image. Each training image has annotated cars, while each testing image has cars. The images are collected by a drone on top of car parks. An example image is shown in the first row of Fig. 1.

Another interesting application is to count the number of drinks or products on retail store shelves. To demonstrate the effectiveness of the incompletely-supervised learning algorithm on other domains, we collect a small dataset Drink35, which contains 10 images as the training set and 25 images as the testing set. The task is to detect and count all the product instances. One example image is shown in the second row of Fig. 1.

To simulate the incompletely-supervised settings, we randomly select $N_I$ training images, and for each image we randomly select at most $N_B$ annotated instances. All the other unselected images are discarded during training. This training set is denoted as CARPK_$N_I$_$N_B$ or Drink35_$N_I$_$N_B$ for the CARPK and Drink35 datasets, respectively. For example, CARPK_5_5 means the training set with 5 images, each with at most 5 annotated bounding boxes. Similarly, the suffix $N_I$_ALL denotes the training set of $N_I$ images with all the annotated boxes, and ALL_$N_B$ denotes all training images with at most $N_B$ annotated bounding boxes in each image. The test set is not altered, for consistent evaluation.
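The subsampling protocol can be sketched as below, assuming the full annotations are available as a dictionary from image id to a list of boxes. The function name `sample_incomplete`, the `seed` parameter, and the use of `None` for the "ALL" setting are choices made for this illustration.

```python
import random


def sample_incomplete(annotations, num_images, max_boxes, seed=0):
    """Keep `num_images` random images with at most `max_boxes` boxes each.

    annotations: dict mapping image id -> list of boxes.
    Passing None for either limit keeps everything (the "ALL" setting).
    """
    rng = random.Random(seed)
    image_ids = sorted(annotations)
    if num_images is not None:
        image_ids = rng.sample(image_ids, min(num_images, len(image_ids)))
    subset = {}
    for image_id in image_ids:
        boxes = annotations[image_id]
        if max_boxes is not None and len(boxes) > max_boxes:
            boxes = rng.sample(boxes, max_boxes)
        subset[image_id] = list(boxes)
    return subset
```

For instance, `sample_incomplete(full, 5, 5)` would correspond to a CARPK_5_5-style subset, and `sample_incomplete(full, 5, None)` to CARPK_5_ALL.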

We use the PASCAL VOC [24] 2007 trainval set (5011 images) as the extra background images, with all the original bounding box labels removed. Note that some images in PASCAL VOC 2007 contain cars and drinks. We keep these images as negative samples because the cars and drinks in VOC 2007 generally differ in appearance or viewpoint from the object instances in CARPK and Drink35.

4.1.2 Criteria.

Since we focus on the detection-based approaches, we adopt the widely-used [12, 13, 14, 15] mAP@0.5 as one of the metrics, which measures the mean average precision (mAP) using 0.5 as the intersection-over-union (IoU) threshold.

Following [17, 9], we also use the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) to evaluate the accuracy of the counting results. MAE is defined as $\frac{1}{n}\sum_{i=1}^{n} |c_i - \hat{c}_i|$, while RMSE is defined as $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (c_i - \hat{c}_i)^2}$, where $\hat{c}_i$ is the number of objects predicted by the model for the $i$-th testing image, $c_i$ is the ground-truth number of objects, and $n$ is the total number of testing images.
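Both counting metrics are straightforward to compute from per-image counts. The sketch below uses the standard definitions of MAE and RMSE; the function names are chosen for this example, and the three count pairs are the predicted and ground-truth values listed for the right-hand column of Fig. 6, used purely as sample input.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and ground-truth counts."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


predicted_counts = [122, 93, 107]
ground_truth_counts = [127, 138, 114]
example_mae = mae(predicted_counts, ground_truth_counts)    # (5 + 45 + 7) / 3
example_rmse = rmse(predicted_counts, ground_truth_counts)
```

Because RMSE squares each error, the single large miss (93 against 138) dominates it far more than it dominates MAE.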

4.1.3 Implementation details.

Data augmentation is of great importance for network learning because the training set is small and its labels are incomplete. Motivated by the implementations of Yolo [15] and SSD [14], we incorporate random scaling, random aspect ratio distortion, and color jittering. Random rotation is also implemented for the car counting problem through multiple data samplers (motivated by SSD [14]), so that non-rotated images are preferred over rotated images. That is, images rotated by 0, 90, 180, or 270 degrees are preferred over images with arbitrary rotations, and images rotated by less than 10 degrees are preferred over images with other arbitrary rotations. Specifically, for each training image, we have a 25% chance of selecting the non-rotated image; a 25% chance of rotating the image by a multiple of 90 degrees; a 25% chance of rotating the image by less than 10 degrees plus a random multiple of 90 degrees; and a 25% chance of rotating the image by an arbitrary angle.
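A four-way rotation sampler of this kind could look as follows. This is a sketch under assumptions: the second and third modes are taken to mean a random multiple of 90 degrees and a tilt within 10 degrees in either direction, as the preceding description suggests, and the name `sample_rotation` is hypothetical.

```python
import random


def sample_rotation(rng=random):
    """Draw a rotation angle in degrees from four equally likely modes."""
    mode = rng.randrange(4)
    if mode == 0:
        return 0.0                            # keep the image as is
    if mode == 1:
        return 90.0 * rng.randrange(4)        # multiple of 90 degrees
    if mode == 2:
        small = rng.uniform(-10.0, 10.0)      # slight tilt (assumed range)
        return (small + 90.0 * rng.randrange(4)) % 360.0
    return rng.uniform(0.0, 360.0)            # arbitrary angle
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.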

Another important choice is the use of a large input resolution for both training and testing, because each image contains a large number of small objects. During training, we resize the image so that the longer side randomly ranges from to , and then crop a subregion of as the network input. During testing, we resize the input image so that the whole area is close to while its aspect ratio is kept. Non-Maximum Suppression (NMS) is used to filter the bounding boxes, with an IoU threshold of 0.2.
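For reference, a greedy NMS with the IoU threshold of 0.2 can be sketched as below. This is the textbook greedy variant, not necessarily the exact routine used in the experiments; boxes are assumed to be `(x1, y1, x2, y2)` tuples and the function names are chosen for this example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.2):
    """Greedy NMS over (box, score) pairs; returns the kept pairs."""
    kept = []
    # Visit detections from the most to the least confident.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

A threshold as low as 0.2 suppresses aggressively, which suits scenes such as parking lots where neighbouring instances barely overlap.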

The network backbone is Darknet19, which is the same as in YoloV2 [15]. We use 9 stages of PFOD to propagate the labels. In each stage of PFOD, we train the network for 10K iterations. The learning rate is 0.0001 for the first 100 iterations, 0.001 for the next 4900 iterations, 0.0001 for the following 4000 iterations, and 0.00001 for the last 1000 iterations. (The total number of iterations could be greatly reduced as the number of training images is small. However, as the training time is not a concern in this work, we leave the training optimization as future work.) The batch size is set to 64, and the weight decay is 0.0005. The final detector training shares the same parameters. The training takes 1.25 hours on 4 NVIDIA P100 GPUs to finish the 10K iterations. The implementation is based on Caffe [25] with CUDA 8.0 and cuDNN 5.1.
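The piecewise-constant learning rate schedule can be written as a small lookup. The sketch assumes iterations are counted from zero; the function name `learning_rate` is chosen for this example and is independent of how the schedule is configured in Caffe.

```python
def learning_rate(iteration):
    """Piecewise-constant schedule over 10K iterations (0-indexed)."""
    if iteration < 100:       # first 100 iterations: warm-up
        return 1e-4
    if iteration < 5000:      # next 4900 iterations
        return 1e-3
    if iteration < 9000:      # next 4000 iterations
        return 1e-4
    return 1e-5               # last 1000 iterations
```

The four segments add up to 100 + 4900 + 4000 + 1000 = 10000 iterations, matching the length of one stage.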

We also report the accuracy obtained without any label propagation, by directly training the object detector on the provided label set. This straightforward approach, denoted as OD, serves as a naive baseline, and all its data augmentation and learning strategy parameters are the same as in a single stage of PFOD.

In the incomplete annotation settings, we enable the extra background samples by replacing 16 images in each batch of 64 with the extra background images.
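The batch composition can be sketched as follows, assuming images are drawn with replacement (the target set may hold only a handful of images). The function name `compose_batch` and the string tags are hypothetical.

```python
import random


def compose_batch(target_images, background_images, batch_size=64,
                  num_background=16, rng=random):
    """Sample a batch with a fixed number of extra background images."""
    num_target = batch_size - num_background
    batch = [(rng.choice(target_images), "target")
             for _ in range(num_target)]
    batch += [(rng.choice(background_images), "background")
              for _ in range(num_background)]
    rng.shuffle(batch)  # mix the two sources within the batch
    return batch
```

The tag attached to each image lets the loss tell apart the images whose unlabeled regions may be ignored from the background images whose loss is always enabled.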

Method    $N_I$  $N_B$  mAP(%)    MAE             RMSE
LPN [9]   989    ALL    NotAvail  23.80           36.79
OD        989    ALL    96.67     2.94 (0.15)     3.94 (0.10)
OD        5      ALL    91.97     4.38 (0.20)     5.58 (0.20)
OD        5      50     87.90     6.02 (0.05)     8.23 (0.05)
OD        5             61.26     49.28 (0.05)    54.45 (0.05)
OD        5             4.95      99.39 (0.05)    106.51 (0.05)
OD        5      5      4.58      101.93 (0.10)   108.99 (0.10)
PFOD      5      50     83.73     11.54 (0.05)    16.76 (0.05)
PFOD      5             80.27     16.52 (0.05)    21.45 (0.05)
PFOD      5             73.46     22.47 (0.05)    27.63 (0.05)
PFOD      5      5      72.44     23.63 (0.05)    26.36 (0.05)
Table 1: Accuracy on CARPK with different training images. $N_I$: number of training images from the whole training set; $N_B$: number of labeled bounding box annotations for each image; mAP: the higher, the better; MAE/RMSE: the lower, the better. The value in parentheses after MAE/RMSE is the threshold used to decide whether a predicted bounding box is valid

4.2 Results on CARPK

The results are shown in Table 1 for different numbers of training images ($N_I$) and different numbers of labeled bounding boxes ($N_B$) in each image. To count the number of objects, we need a threshold on the detector's confidence score to determine whether a predicted bounding box should be kept. Since different settings might favor different thresholds, we select the one with the lowest MAE or RMSE among a set of candidate thresholds. We select the best threshold separately for MAE and RMSE to examine the best performance under these two criteria. The selected threshold is given in parentheses in Table 1. Note that the mAP criterion does not depend on the threshold. From the table, we have the following observations and discussions.
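Threshold selection of this kind can be sketched as a simple search. The candidate thresholds and the toy scores below are assumptions for illustration (the actual candidate set is not reproduced here), and the helper names are hypothetical.

```python
def count_objects(scores, threshold):
    """Count detections whose confidence reaches the threshold."""
    return sum(1 for s in scores if s >= threshold)


def best_threshold(per_image_scores, gt_counts, candidates):
    """Return (threshold, mae) with the lowest MAE over the candidates."""
    best = None
    for threshold in candidates:
        errors = [abs(count_objects(scores, threshold) - gt)
                  for scores, gt in zip(per_image_scores, gt_counts)]
        mae = sum(errors) / len(errors)
        if best is None or mae < best[1]:
            best = (threshold, mae)
    return best


# Toy input: confidence scores of the detections in two images.
toy_scores = [[0.9, 0.3, 0.07], [0.8, 0.12]]
toy_counts = [2, 2]
chosen = best_threshold(toy_scores, toy_counts, [0.05, 0.1, 0.2])
```

The same loop with squared errors in place of absolute errors would select the threshold for RMSE.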

4.2.1 What is the upper bound performance using all the training data?

In the fully supervised setting (989 training images with all annotations), our detector (OD) achieves 96.67% mAP@0.5, 2.94 MAE and 3.94 RMSE. This significantly outperforms the state-of-the-art method LPN [9], whose MAE is 23.80 and RMSE is 36.79 (mAP is not reported). Other baseline approaches are not shown here because their accuracy is lower than that of LPN. This demonstrates that a strong object detector is well suited to the counting problem.

4.2.2 Do we need hundreds of training images for car counting?

We apply the OD on 5 images with all the labeled annotations (CARPK_5_ALL), and the detector can still achieve 91.97 mAP@0.5, 3.00 MAE, and 5.58 RMSE without extra background images. Compared with the full training set, which uses 989 training images, this result is very encouraging. It clearly indicates that, to develop a car counting algorithm, 5 or a few more images might be enough rather than several hundred. For reproducing the result, the 5 image IDs are 20160524_GF2_00133, 20161030_GF1_00153, 20161030_GF1_00036, 20160331_NTU_00007, and 20161030_GF2_00071.

(a) (b)
Figure 4: Straightforward object detection result on two training images. The model is trained on CARPK_5_5. The ground-truth bounding boxes are almost identical with the predicted boxes and thus are not shown for clarity. Confidence scores shown around each box are all close to 1 (0.8+). The threshold is 0.05, which means the confidence scores for unlabeled/undetected cars are less than 0.05. This is a clear indication of overfitting, which severely degrades the performance

4.2.3 What if the unlabeled regions are treated as negative samples?

When the number of bounding boxes is decreased to 5, the mAP drops significantly to 4.58. To identify the reason, we evaluate the trained model on the training images and show two examples in Fig. 4. The predicted bounding boxes are drawn on the image with confidence scores around each box. Only the boxes with confidence scores higher than 0.05 are displayed. The selected 5 labeled boxes are located at the same positions as the predicted boxes, and the probability is close to 1. Since the threshold here is 0.05, all the other regions, including the unlabeled cars, are classified as background with high confidence. This shows that the model overfits the training data severely, which degrades the accuracy.

(a) expand=6, correct=6 (b) expand=13, correct=13 (c) expand=8, correct=8
(d) expand=8, correct=8 (e) expand=35, correct=35 (f) expand=29, correct=29
(g) expand=10, correct=10 (h) expand=46, correct=46 (i) expand=32, correct=32
(j) expand=22, correct=22 (k) expand=50, correct=49 (l) expand=37, correct=33
Figure 5: Label propagation process on training images of CARPK_5_5. The two numbers below each image are the number of expanded boxes and the number of correctly expanded boxes. From left to right, the total number of true bounding boxes is 61, 58, and 34, respectively. From top to bottom, the images correspond to the training data for the third, fifth, seventh, and last stage, respectively. The 5 originally sampled labeled bounding boxes are shown in blue; the propagated boxes are shown in yellow; the incorrectly propagated boxes (IoU of at most 0.3) are shown in red

4.2.4 How effective is the proposed incompletely-supervised learning approach?

We apply the proposed PFOD on CARPK_5_5, and surprisingly the accuracy is boosted to 72.44 mAP, 23.63 MAE and 26.36 RMSE. In terms of MAE and RMSE, the accuracy surpasses that of LPN [9] trained on the full training set of 989 images.

In Fig. 5, we illustrate the label propagation process by PFOD on CARPK_5_5. Each column corresponds to one training image. The two numbers below each image are the number of expanded boxes (initially provided + propagated) and the number of correct boxes among them. A box is correct if its IoU with at least one bounding box in CARPK_5_ALL is larger than 0.3. The figure shows that the set of correct bounding boxes used for training grows gradually. Taking the leftmost image as an example, the number of correct boxes increases from the initial 5 to 22. The rightmost image reaches 33 correct boxes.

Meanwhile, we observe that the propagation is still not perfect: it introduces several false bounding boxes while missing a few cars. This is why there is still a gap between this setting and CARPK_5_ALL, and it motivates us to continue investigating the problem.

4.2.5 Will introducing more labels help?

By increasing the number of labeled boxes per training image from 5 to 50, the accuracy increases steadily for both OD and PFOD. With fewer than 25 boxes in each image, the accuracy of PFOD is consistently higher than that of OD, while with 50 boxes it is lower. The reason is that with 50 boxes most of the true boxes are already included, while the label propagation introduces some false boxes. That is, under almost full annotations, it is enough to apply the OD instead of propagating the boxes.

(a) pred=9, gt=127 (b) pred=130, gt=127 (c) pred=122, gt=127
(d) pred=13, gt=138 (e) pred=152, gt=138 (f) pred=93, gt=138
(g) pred=0, gt=114 (h) pred=115, gt=114 (i) pred=107, gt=114
Figure 6: Visualization of detection/counting results on 3 CARPK test images. Each row corresponds to one test image. Left: OD on CARPK_5_5; Middle: OD on CARPK_5_ALL; Right: our PFOD on CARPK_5_5
OD on Drink35_ALL_5 OD on full Drink35 PFOD on Drink35_ALL_5
(a) pred=0, gt=58 (b) pred=49, gt=58 (c) pred=36, gt=58
(d) pred=14, gt=51 (e) pred=46, gt=51 (f) pred=45, gt=51
Figure 7: Visualization of the detection/counting results on Drink35 test images. Each row corresponds to one test image. Left: OD on Drink35_ALL_5. Middle: OD on the full Drink35; Right: our PFOD on Drink35_ALL_5

Fig. 6 visualizes the detection and counting results on three testing images for OD trained on CARPK_5_5, OD trained on CARPK_5_ALL, and PFOD trained on CARPK_5_5. The yellow boxes are correct bounding boxes while the red ones are incorrect. A predicted box is correct if it has an IoU larger than 0.3 with one of the ground-truth bounding boxes. The two numbers under each image are the number of instances predicted by the model and the ground-truth number of instances, respectively.

4.3 Results on Drink35

With the full annotations, OD achieves 79.53% mAP@0.5, 7.92 MAE and 11.50 RMSE. This can be treated as the upper bound performance, as we have used all the labels. If each training image is provided with only 5 labeled instances, the OD's accuracy degrades to 14.16% mAP, 29.52 MAE and 38.62 RMSE. In contrast, with PFOD the accuracy jumps to 53.73% mAP, 11.16 MAE and 13.10 RMSE. Fig. 7 shows two example images detected/counted by the three approaches.

5 Conclusion

We have studied the problem of object counting when only a few exemplar annotations are available. This setting is especially practical when the number of object instances is large. We formulate the problem as incompletely-supervised learning in the context of object detection. Since not all the bounding boxes are provided, we cannot simply treat the remaining regions as background, as doing so leads to severe overfitting and performance degradation. To address the problem, we have proposed a positiveness-focused object detector that progressively propagates the incomplete labels to more object instances. Our experimental results on two applications demonstrate that this simple yet effective approach significantly boosts the accuracy with only a few manual annotations.

References


  • [1] An, S., Liu, W., Venkatesh, S.: Face recognition using kernel ridge regression. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA. (2007)
  • [2] Chan, A.B., Liang, Z.J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting people without people models or tracking. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA. (2008)
  • [3] Chen, K., Gong, S., Xiang, T., Loy, C.C.: Cumulative attribute space for age and crowd density estimation. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013. (2013) 2467–2474
  • [4] Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: British Machine Vision Conference, BMVC 2012, Surrey, UK, September 3-7, 2012. (2012) 1–11
  • [5] Kong, D., Gray, D., Tao, H.: A viewpoint invariant approach for crowd counting. In: 18th International Conference on Pattern Recognition (ICPR 2006), 20-24 August 2006, Hong Kong, China. (2006) 1187–1190
  • [6] Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. (2015) 833–841
  • [7] Arteta, C., Lempitsky, V.S., Noble, J.A., Zisserman, A.: Interactive object counting. In: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III. (2014) 504–518
  • [8] Rodriguez, M., Laptev, I., Sivic, J., Audibert, J.: Density-aware person detection and tracking in crowds. In: IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011. (2011) 2423–2430
  • [9] Hsieh, M., Lin, Y., Hsu, W.H.: Drone-based object counting by spatially regularized regional proposal network. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. (2017) 4165–4173
  • [10] Kamenetsky, D., Sherrah, J.: Aerial car detection and urban understanding. In: 2015 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2015, Adelaide, Australia, November 23-25, 2015. (2015) 1–8
  • [11] Moranduzzo, T., Melgani, F.: Automatic car counting method for unmanned aerial vehicle images. IEEE Trans. Geoscience and Remote Sensing 52(3) (2014) 1635–1647
  • [12] Girshick, R.B.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. (2015) 1440–1448
  • [13] Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. (2015) 91–99
  • [14] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., Berg, A.C.: SSD: single shot multibox detector. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I. (2016) 21–37
  • [15] Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. (2017) 6517–6525
  • [16] Mundhenk, T.N., Konjevod, G., Sakla, W.A., Boakye, K.: A large contextual dataset for classification, detection and counting of cars with deep learning. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III. (2016) 785–800
  • [17] Lempitsky, V.S., Zisserman, A.: Learning to count objects in images. In: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada. (2010) 1324–1332
  • [18] Lin, Z., Davis, L.S.: Shape-based human detection and segmentation via hierarchical part-template matching. IEEE Trans. Pattern Anal. Mach. Intell. 32(4) (2010) 604–618
  • [19] Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. (2014) 580–587
  • [20] Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. (2016) 2846–2854
  • [21] Cinbis, R.G., Verbeek, J.J., Schmid, C.: Weakly supervised object localization with multi-fold multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 39(1) (2017) 189–203
  • [22] Kantorov, V., Oquab, M., Cho, M., Laptev, I.: Contextlocnet: Context-aware deep network models for weakly supervised localization. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V. (2016) 350–365
  • [23] Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. (2017) 3059–3067
  • [24] Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88(2) (2010) 303–338
  • [25] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)