Learning to Count Objects
with Few Exemplar Annotations
In this paper, we study the problem of object counting with incomplete annotations. Based on the observation that in many object counting problems the target objects are normally repeated and highly similar to each other, we are particularly interested in the setting when only a few exemplar annotations are provided. Directly applying object detection with incomplete annotations results in severe accuracy degradation because unlabeled object instances are handled improperly. To address the problem, we propose a positiveness-focused object detector (PFOD) to progressively propagate the incomplete labels before applying the general object detection algorithm. The PFOD focuses on the positive samples and ignores the negative instances for most of the training time. This strategy, though simple, dramatically boosts the object counting accuracy. On the CARPK dataset for parking lot car counting, we improve mAP@0.5 from 4.58% to 72.44% using only 5 training images, each with 5 bounding boxes. On the Drink35 dataset for shelf product counting, mAP@0.5 improves from 14.16% to 53.73% using 10 training images, each with 5 bounding boxes.
Keywords: Incompletely-supervised learning; object counting; object detection
Object counting is to count the number of object instances in a single image or video sequence. It has many real-world applications such as traffic flow monitoring, crowdedness estimation, and product counting.
Existing approaches to the counting problem can be roughly categorized into regression-based [1, 2, 3, 4, 5], density-based [6, 7, 8], and detection-based [9, 10, 11] approaches. The regression-based approach directly learns a mapping from the image to the number of instances. In contrast, the density-based approach first estimates a density map and then aggregates the density information to get the number of instances. Both categories provide little information about the exact location of each instance, whereas the detection-based approach detects the location of each individual instance and thus makes the counting result more explainable. Because of this advantage, we mainly focus on the detection-based approach in this work.
Fig. 1 panels: (a) Fully-Supervised, (b) Weakly-Supervised, (c) Incompletely-Supervised.
In recent years, the accuracy of object detection has improved dramatically [12, 13, 14, 15] thanks to advances in deep convolutional neural networks. However, the high accuracy is normally achieved with a large number of fully annotated bounding boxes. In the context of object counting, the number of instances even in a single image can be huge (e.g. from tens to hundreds), which not only presents a great challenge to object detection, but also requires tremendous annotation effort to build a high-accuracy object detector.
To save the annotation cost, weakly supervised approaches are generally adopted to train the model solely based on image-level annotations. However, such approaches still suffer from suboptimal accuracy due to the lack of instance-level (e.g. bounding box) annotations. To address this problem, we resort to a problem setting where each image is only partially annotated with a few bounding boxes and the other instances are left unlabeled. This setting is practically useful, especially when the number of instances is large and it is tedious and costly to label all of them. Another motivation is that in certain object counting problems, the many instances exhibit little variance, e.g. in scale or color. Potentially, a few annotated bounding boxes might generalize well enough to achieve high accuracy. We call learning from such training data incompletely-supervised learning.
Fig. 1 shows examples that illustrate how the training images differ among fully-supervised, weakly-supervised, and incompletely-supervised learning. In the fully-supervised setting, all cars and drinks are annotated. The total number of car instances is as high as 58. In the weakly-supervised setting, we only know that cars or drinks exist in the image. Based on the observation that in many object counting problems the target objects are normally repeated and highly similar to each other, we are particularly interested in the setting when only a few exemplar annotations are provided, which is illustrated in Fig. 1(c).
In incompletely-supervised learning, if every unlabeled region is simply treated as background during object detection training, the trained model can over-fit the training data severely, because the number of labeled instances is usually small, and mistake the unlabeled objects for background. To address the issue, we propose a simple yet effective algorithm, named positiveness-focused object detector (PFOD), to progressively propagate the labels. It treats unlabeled regions as background first and then neglects them to mainly focus on the positive samples during training. The intuition is to first learn a compact classifier, which might over-fit the data, and then relax the learning by ignoring the unlabeled regions so that the unlabeled instances are pulled towards being positive. In this way, bounding box supervision can be automatically expanded during training before we apply a standard object detector.
Overall, our contributions can be summarized as follows.
We explicitly formulate the problem of incompletely-supervised learning, which focuses on the incomplete annotations for object counting.
We propose a progressive label propagation algorithm through positiveness-focused object detector to properly handle the incomplete labels.
We conduct extensive experiments that demonstrate the effectiveness of the proposed PFOD in handling such training problems. On the CARPK dataset for parking lot car counting, we improve mAP@0.5 from 4.58% to 72.44% using only 5 training images, each with 5 bounding boxes. On the Drink35 dataset for shelf product counting, mAP@0.5 improves from 14.16% to 53.73% using 10 training images, each with 5 bounding boxes.
2 Related Work
2.1 Object Counting
The approaches towards object counting can be roughly categorized into three categories: regression-based approach, density-based approach, and detection-based approach.
2.1.1 Regression-based Approach
The regression-based approach predicts the instance count directly based on global regressors with image features [2, 3, 4, 5]. For example, in , the video sequence is first segmented into different components of homogeneous motion, and a Gaussian process regression is learned for each segmented region to count the number of instances. Cumulative attribute representation  was proposed and used to learn the regressor to handle the imbalanced training data. Inter-dependent features are used to mine the spatial importance among different regions to learn the count . Feature normalization is taken into account to deal with perspective projections in . Taking advantage of deep learning,  estimate the number of cars using CNNs. These regression-based approaches provide no clue about the location of each individual object, which limits their potential applications.
2.1.2 Density-based Approach
The density-based approach first maps the image to a density map, such that the integral over any sub-region gives the count of objects within that region [17, 7, 8, 6]. In , the pixel-level density is learned by minimizing a regularized risk quadratic cost function. Based on the density map, an interactive counting system is introduced in  to incorporate relevance feedback. Instead of hand-crafted features,  focuses on CNN-based density map and instance count estimation in cross-scene scenarios. While the density map gives some indication of crowdedness, it still lacks the exact position of each instance.
2.1.3 Detection-based Approach
The detection-based approach gives the total count by localizing each object instance. Such approaches can be considered applications of object detection and thus benefit greatly from improvements in object detection [12, 13, 14, 15]. However, directly applying object detection to object counting requires special attention to small-scale objects and extraordinary effort in labeling massive numbers of instances. In , a hierarchical part-template matching approach is proposed to detect humans, which requires careful feature and template design, while in , neural network learning is applied to detect and count car instances by incorporating layout information. Given its large potential in applications that demand location information, we mainly work on the detection-based approach with a special focus on incompletely-supervised learning.
2.2 Object Detection
Nowadays, mainstream object detection algorithms are CNN-based owing to the powerful representation and high accuracy of CNNs. Such algorithms can be roughly categorized into two-stage object detectors [19, 12, 13] and one-stage object detectors [14, 15]. Two-stage object detectors such as Faster RCNN  first extract region proposals and then perform classification and bounding box regression, while one-stage detectors such as YOLO  directly output the bounding box locations and the classification results without generating region proposals. By default, all these object detection algorithms follow the fully supervised setting, requiring a large number of training images with annotated bounding boxes to achieve high accuracy.
To reduce the annotation effort, weakly supervised learning trains the object detector based only on image-level labels. Most approaches [20, 21, 22, 23] first generate multiple region proposals for each image and then leverage multi-instance learning algorithms to solve the problem. However, such approaches still suffer from suboptimal accuracy due to the lack of instance-level labels.
In the fully and weakly supervised settings, each image is labeled either with full bounding box annotations or with only image-level category information. In contrast, we are more interested in the incompletely-supervised setting, where only a few exemplar bounding boxes are annotated and all others are unlabeled. This setting is practically more useful when the number of instances is large, for example when counting the products on a retail store shelf or the cars in a large parking lot.
3.1 Problem Definition
Although detection-based object counting resembles the general object detection problem, it still presents unique challenges when the total number of object instances in an image is large, e.g. from tens to hundreds, and the size of each instance is relatively small compared with the image size.
To save the tedious and costly labeling effort, we assume each image in the training set is labeled with only a few instances (e.g. fewer than ten), though there might be tens or hundreds of instances in one image. For image , denote by its -th annotated bounding box (). Since not all instances are annotated, the remaining region could also contain object instances. Without loss of generality, we assume the number of object categories is 1. For notational simplicity, we drop the from and when there is no ambiguity. Thus, the problem is how to build an effective object detector to count the object instances based on the incomplete annotations . We name the problem incompletely-supervised learning for object counting.
3.1.1 Comparison with other supervised learning problems.
The differences from other supervised learning settings in the context of object detection are also shown in Fig. 1. Fully-supervised learning requires that all the bounding boxes are labeled. This represents the upper bound of detection-based object counting performance, but labeling every instance is costly. Weakly-supervised learning assumes that only image-level category information is available. That is, we know there are some object instances in an image, but we have no information about their locations. Another problem setting is low-shot learning, where the number of training images is small, but each training image has full annotations. This is closer to fully-supervised learning, but with only a few training images.
Fig. 2 panels: (a) Ideal, (b) Baseline, (c) Stage 1, (d) Stage 2, (d) Stage 3, (e) Stage 4.
3.2 Label Propagation by Positiveness-Focused Object Detection
As we have a few exemplar labels of the target objects, the most intuitive way is to treat the labels as seeds and carefully propagate them to unknown regions. We propose a simple yet effective positiveness-focused object detector (PFOD) to solve the propagation problem.
Fig. 2 illustrates the intuition of the label propagation by PFOD. The plus symbol denotes a positive sample while the minus symbol denotes a negative one. A sample in red is known (labeled) while a sample in blue is unknown. The dotted line is the decision boundary. If all the positive and negative samples are known, we can easily find the decision boundary that separates the true positive and true negative samples, as shown in Fig. 2(a). When only one positive sample is known and all the rest are unknown, as shown in Fig. 2(b), simply treating all the unknowns as negative samples can lead to a decision boundary that misclassifies the unlabeled positive samples as negative.
To compensate for the lack of negative samples, we can introduce some images as extra negative data, as shown in green from Fig. 2(c) to Fig. 2(e). For object detection, it is indeed easy to find images without any target objects in the same domain. For example, we can use the PASCAL VOC  2007 data set as extra background images for the problem of car counting on the CARPK  dataset.
In Fig. 2(c) we show how to propagate the labels. We first treat all the unlabeled samples (shown as both blue plus and minus symbols in Fig. 2(c)) as negative samples to train the detector, and the learned decision boundary is pushed close to the only positive sample (red plus symbol), as depicted by the red dotted line. Next, we ignore the unknown samples (note that we still have the extra negative samples shown as green minus symbols) and gradually update the learned classifier to separate the labeled positive sample from the extra negative samples. As the unknown positive samples (shown as blue plus symbols) are no longer taken as negative samples, they no longer push the decision boundary. As a result, the decision boundary (shown as the black dotted line) moves a bit further from the known positive sample and classifies a few more unknown positive samples as positive.
If the propagation is carefully controlled, we can treat the newly classified positive samples as known labeled data for the next stage, as shown in Fig. 2(d). We repeat the process above to iteratively learn the boundary and propagate the label set. Each iteration here is called a stage, and we use to denote the label set in the -th stage, where is the number of expanded bounding boxes in one image. Finally, we combine all the expanded positive samples and take all the others as background to learn the final decision boundary.
Fig. 3 shows the framework of the proposed strategy to learn the object detector. At the initial state of training, the bounding boxes we have are the labeled set, i.e. for each training image. Based on , we train a positiveness-focused object detector (PFOD), which can be based on any object detector [19, 12, 13, 14, 15]. In this work, we choose YoloV2  for its simplicity. The network first processes each training image in a batch manner by a fully convolutional neural network, and then outputs three components at each spatial position: bounding box coordinates, objectiveness to tell how confident the bounding box contains an object, and classification scores to tell which category the bounding box contains. Here we assume the number of categories is 1 and remove the classification module. The Euclidean loss is used for bounding box coordinate regression and objectiveness confidence regression.
Specifically, for the objectiveness at spatial position , the loss is defined as
where is the network parameter, learned iteratively through the mini-batch stochastic gradient descent (SGD) algorithm, is the objectiveness score at position , and is 1 if it is identified as being positive for position , and 0 otherwise based on the current label set. For the extra background images, the label is consistently set as 0 and the loss will be always enabled. For the training images in the target domain, we modify it as follows to implement PFOD,
where is the number of iterations in SGD, is a pre-defined parameter (200 in experiments) to determine how many iterations are needed to treat the unknown regions as background. After iterations, the detector training will only focus on the positive samples and the extra background data.
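Since the displayed equations are not legible in this copy, the following is a tentative reconstruction of the two objectiveness losses from the surrounding description alone. All symbol names are assumptions made for illustration: $\theta$ for the network parameters, $o_i(\theta)$ for the objectiveness score at position $i$, $y_i$ for the label at position $i$, $t$ for the SGD iteration, and $T$ for the pre-defined number of iterations (200 in experiments). Any constant factor in the original loss, and whether the boundary case $t = T$ belongs to the first branch, cannot be confirmed from the text.

```latex
% Euclidean objectiveness loss at spatial position i (assumed notation)
\ell_i(\theta) = \bigl(o_i(\theta) - y_i\bigr)^2

% PFOD variant for training images in the target domain (assumed notation):
% unlabeled positions contribute to the loss only during the first T iterations
\ell_i^{\mathrm{PFOD}}(\theta) =
\begin{cases}
  \bigl(o_i(\theta) - y_i\bigr)^2, & \text{if } y_i = 1 \text{ or } t \le T, \\
  0, & \text{otherwise.}
\end{cases}
```

Under this reading, positions with $y_i = 0$ in target-domain images stop producing gradients after $T$ iterations, while positive positions and the extra background images keep contributing throughout.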
With the model trained by PFOD, we run the prediction over all the training images from target domain. The predicted bounding boxes with high probability scores (0.9 in experiments) will be merged into the original label set to form . We also discard any predicted bounding box if it has a high overlap (Intersection-over-Union 0.2 in experiments) with any of the original bounding boxes in .
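The merging step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, and the choice to compare predictions only against the current label set are assumptions, and the overlap test is read as "IoU greater than 0.2".

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def expand_labels(
    labels: List[Box],
    predictions: List[Tuple[Box, float]],
    score_thr: float = 0.9,  # probability threshold quoted in the text
    iou_thr: float = 0.2,    # overlap threshold quoted in the text
) -> List[Box]:
    """Return the label set for the next stage (hypothetical helper)."""
    expanded = list(labels)
    for box, score in predictions:
        if score < score_thr:
            continue  # keep only confident predictions
        if any(iou(box, known) > iou_thr for known in labels):
            continue  # drop predictions that overlap an existing label
        expanded.append(box)
    return expanded
```

For example, with one labeled box `(0, 0, 10, 10)`, a confident prediction at `(1, 1, 11, 11)` is dropped as a duplicate, while a confident prediction at `(20, 20, 30, 30)` is added to the label set.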
After stages, we feed the training images and the expanded bounding box set into a normal object detector training pipeline, in which the bounding boxes in are positive while all the rest are negative. We still apply the YoloV2 algorithm here for training and testing. Ideally, if all the unlabeled object instances could be propagated, the trained model should be able to achieve an accuracy on par with that trained from the full annotations.
We mainly evaluate the approaches on the CARPK  dataset, which contains 989 training images and 459 testing images. The task is to detect and count the car instances in the image. Each training image has annotated cars, while each testing image has cars. The images are collected by a drone flying over car parks. An example image is shown in the first row of Fig. 1.
Another interesting application is to count the number of drinks or products on retail store shelves. To demonstrate the effectiveness of the incompletely-supervised learning algorithm on other domains, we collect a small dataset Drink35, which contains 10 images as the training set and 25 images as the testing set. The task is to detect and count all the product instances. One example image is shown in the second row of Fig. 1.
To simulate the incompletely-supervised settings, we randomly select training images, and for each image we randomly select at most annotated cars. All the other unselected images are discarded during training. This training set is denoted as or for CARPK and Drink35 datasets, respectively. For example, CARPK_5_5 means the training set with 5 images and each with at most 5 annotated bounding boxes. Similarly, the suffix of denotes the training set of images with all the annotated boxes, and denotes all training images with at most annotated bounding boxes in each image. The test set is not altered for consistent evaluation.
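The simulation of incomplete annotations can be sketched as below. This is an illustrative sketch only: the function name, the dictionary layout (`image_id -> list of boxes`), and the seeding are assumptions, with `max_boxes=None` standing in for the "ALL" suffix.

```python
import random
from typing import Dict, List, Optional


def make_incomplete_subset(
    annotations: Dict[str, List],
    num_images: int,
    max_boxes: Optional[int],
    seed: int = 0,
) -> Dict[str, List]:
    """Randomly keep `num_images` images and at most `max_boxes` boxes in each.

    Images that are not selected are discarded, as in the described protocol.
    """
    rng = random.Random(seed)
    image_ids = rng.sample(sorted(annotations), num_images)
    subset = {}
    for image_id in image_ids:
        boxes = annotations[image_id]
        if max_boxes is None or len(boxes) <= max_boxes:
            subset[image_id] = list(boxes)  # keep every available box
        else:
            subset[image_id] = rng.sample(boxes, max_boxes)
    return subset
```

Under these assumptions, a setting such as CARPK_5_5 would correspond to `make_incomplete_subset(annotations, num_images=5, max_boxes=5)`.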
We use the PASCAL VOC  2007 trainval set (5011 images) as the extra background images with all the original bounding box labels removed. Note that the images in PASCAL VOC 2007 contain cars and drinks. We keep these images as negative samples because the cars and drinks in VOC 2007 generally differ in appearance or viewpoint from the object instances in CARPK and Drink35.
Since we focus on the detection-based approaches, we adopt the widely-used [12, 13, 14, 15] mAP@0.5 as one of the metrics, which measures the mean average precision (mAP) using 0.5 as the intersection-over-union (IoU) threshold.
Following [17, 9], we also use the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) to evaluate the accuracy of the counting results. MAE is defined as $\frac{1}{N}\sum_{i=1}^{N} |\hat{c}_i - c_i|$, while RMSE is defined as $\sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{c}_i - c_i)^2}$, where $\hat{c}_i$ is the number of objects predicted by the model for the $i$-th testing image, $c_i$ is the ground-truth number of objects, and $N$ is the total number of testing images.
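As a small worked illustration of the two counting metrics, the sketch below computes them from per-image predicted and ground-truth counts using the standard definitions of MAE and RMSE. The function names are chosen here for illustration.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def rmse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root mean squared error between predicted and ground-truth counts."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))
```

For two test images with predicted counts 10 and 12 and ground-truth counts 8 and 16, the absolute errors are 2 and 4, so the MAE is 3 and the RMSE is the square root of 10. RMSE penalizes the larger error more heavily.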
4.1.3 Implementation details.
Data augmentation is of great importance for network learning because the training set is small and its labels are incomplete. Motivated by the implementation of Yolo  and SSD , we incorporate random scaling, random aspect ratio distortion, and color jittering. Random rotation is also implemented for the car counting problem by multiple data samplers (motivated by SSD ), so that non-rotated images are preferred over rotated images. That is, images with 0, 90, 180, and 270 degrees of rotation are preferred over images with arbitrary rotations, and images with less than 10 degrees of rotation are preferred over images with other arbitrary rotations. Specifically, for each training image, we have a 25% chance to select the non-rotated image; a 25% chance to rotate the image by x degrees; a 25% chance to rotate the image by less than degrees plus a random x rotation; and a 25% chance to rotate the image by any angle.
Another important setting is that we use a large input resolution for both training and testing because each image contains a large number of small objects. During training, we resize the image so that the longer side randomly ranges from to , and then crop a subregion of as the network input. During testing, we resize the input image so that the whole area is close to while its aspect ratio is kept. Non-Maximum Suppression (NMS) is used to filter the bounding boxes, with an IoU threshold of 0.2.
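The four-way rotation sampler can be sketched as follows. Some of the angle values are not legible in this copy, so the sketch assumes, based on the preceding sentence, that the second sampler uses multiples of 90 degrees and the third adds a jitter of less than 10 degrees (taken here in either direction) to a multiple of 90 degrees. The function name is hypothetical.

```python
import random


def sample_rotation_angle(rng: random.Random) -> float:
    """Pick a rotation angle in degrees with four equally likely samplers."""
    mode = rng.randrange(4)                 # each sampler has a 25% chance
    right_angle = 90.0 * rng.randrange(4)   # 0, 90, 180, or 270 (assumed)
    if mode == 0:
        return 0.0                          # non-rotated image
    if mode == 1:
        return right_angle                  # multiple of 90 degrees
    if mode == 2:
        # small jitter (assumed below 10 degrees) around a multiple of 90
        return (right_angle + rng.uniform(-10.0, 10.0)) % 360.0
    return rng.uniform(0.0, 360.0)          # arbitrary rotation
```

The effect is that axis-aligned and nearly axis-aligned views dominate the training stream while arbitrary rotations still appear.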
The network backbone is Darknet19, which is the same as in YoloV2 . We use 9 stages of PFOD to propagate the labels. In each stage of PFOD, we train the network for 10K iterations. The learning rate is 0.0001 for the first 100 iterations, 0.001 for the next 4900 iterations, 0.0001 for the following 4000 iterations, and 0.00001 for the last 1000 iterations. (The total number of iterations could be greatly reduced as the number of training images is small. However, as the training time is not a concern in this work, we leave the training optimization as future work.) The batch size is set to 64, and the weight decay is 0.0005. The final detector training shares the same parameters. Training takes 1.25 hours on 4 NVIDIA P100 GPUs to finish the 10K iterations. The implementation is based on Caffe  under CUDA 8.0 and cuDNN 5.1.
We also report the accuracy without any label propagation, where the object detector is trained directly on the provided label set. This straightforward approach, denoted as OD, serves as a naive baseline, and all the data augmentation and learning strategy parameters are the same as in a single stage of PFOD.
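The piecewise-constant learning rate schedule described above can be written as a small function. This is a sketch that assumes zero-based iteration indices; the function name is illustrative.

```python
def learning_rate(iteration: int) -> float:
    """Learning rate for one 10K-iteration training run (zero-based index)."""
    if iteration < 100:      # warm-up: first 100 iterations
        return 1e-4
    if iteration < 5000:     # next 4900 iterations
        return 1e-3
    if iteration < 9000:     # following 4000 iterations
        return 1e-4
    return 1e-5              # last 1000 iterations
```

The short low-rate phase at the start acts as a warm-up before the main rate of 0.001 is applied.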
In the incomplete annotation settings, we enable the extra background samples by replacing 16 images in each batch of 64 with the extra background images.
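A batch with this composition could be assembled as in the sketch below. The function name and the sampling details are assumptions: target-domain images are drawn with replacement because there may be as few as 5 of them, and background images are drawn without replacement.

```python
import random
from typing import List, Tuple


def sample_batch(
    target_images: List[str],
    background_images: List[str],
    rng: random.Random,
    batch_size: int = 64,
    num_background: int = 16,
) -> List[Tuple[str, bool]]:
    """Return (image, is_background) pairs for one training batch."""
    # Target-domain images fill the remaining slots (sampled with replacement).
    targets = rng.choices(target_images, k=batch_size - num_background)
    # Extra background images replace `num_background` slots of the batch.
    backgrounds = rng.sample(background_images, num_background)
    return [(img, False) for img in targets] + [(img, True) for img in backgrounds]
```

The `is_background` flag lets the loss treat the two sources differently: background images always use label 0, whereas unlabeled regions in target-domain images are ignored after the initial iterations.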
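Table 1. Results on CARPK for different numbers of training images and labeled bounding boxes per image. The selected threshold is given in parentheses.
|Method|Training images|Boxes per image|mAP@0.5 (%)|MAE (threshold)|RMSE (threshold)|
|OD|989|ALL|96.67|2.94 (0.15)|3.94 (0.10)|
| | |ALL|91.97|4.38 (0.20)|5.58 (0.20)|
|OD| | |87.90|6.02 (0.05)|8.23 (0.05)|
| | | |61.26|49.28 (0.05)|54.45 (0.05)|
| | | |4.95|99.39 (0.05)|106.51 (0.05)|
| | | |4.58|101.93 (0.10)|108.99 (0.10)|
|PFOD| | |83.73|11.54 (0.05)|16.76 (0.05)|
| | | |80.27|16.52 (0.05)|21.45 (0.05)|
| | | |73.46|22.47 (0.05)|27.63 (0.05)|
| | | |72.44|23.63 (0.05)|26.36 (0.05)|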
4.2 Results on CARPK
The results are shown in Table 1 for different numbers of training images () and different numbers of labeled bounding boxes () in each image. To count the number of objects, we need a confidence threshold to decide which predicted bounding boxes from the detector are kept. Since different settings might favor different thresholds, we select the one with the lowest MAE or RMSE among . We select the best threshold separately for MAE and RMSE to examine the best performance under each criterion. The selected threshold is shown in parentheses in Table 1. Note that mAP does not depend on the threshold. From the table, we have the following observations and discussions.
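The threshold selection can be sketched as follows: a count is obtained by keeping the predicted boxes whose confidence reaches the threshold, and the threshold with the lowest MAE is chosen. The candidate list is passed in by the caller because the set used in the experiments is not legible here; the function names are illustrative.

```python
from typing import List, Sequence


def count_at_threshold(scores: Sequence[float], threshold: float) -> int:
    """Number of predicted boxes kept at the given confidence threshold."""
    return sum(1 for s in scores if s >= threshold)


def best_threshold_for_mae(
    per_image_scores: List[Sequence[float]],
    gt_counts: Sequence[int],
    candidates: Sequence[float],  # hypothetical candidate thresholds
) -> float:
    """Return the candidate threshold that gives the lowest MAE."""

    def mae_at(threshold: float) -> float:
        errors = [
            abs(count_at_threshold(scores, threshold) - gt)
            for scores, gt in zip(per_image_scores, gt_counts)
        ]
        return sum(errors) / len(errors)

    return min(candidates, key=mae_at)
```

The same routine with a squared-error criterion would select the threshold for RMSE, which is why the two columns of Table 1 can report different thresholds for the same model.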
4.2.1 What is the upper bound performance using all the training data?
In the fully supervised setting (, ), our detector (OD) could achieve 96.67% mAP@0.5, 2.94 MAE and 3.94 RMSE. This significantly outperforms the state-of-the-art method of LPN , whose MAE is 23.80 and RMSE is 36.79 (mAP is not reported). Other baseline approaches are not shown here because the accuracy is lower than LPN. This demonstrates a strong object detector towards the counting problem.
4.2.2 Do we need hundreds of training images for car counting?
We apply the OD on  images with all the labeled annotations (), and the detector can still achieve 91.97 mAP@0.5, 3.00 MAE, and 5.58 RMSE without extra background images. Compared with , which uses 989 training images, this result is very encouraging. It clearly indicates that, to develop a car counting algorithm, 5 or a few more images might be enough rather than several hundred.
The 5 image IDs are
20161030_GF2_00071 for reproducing the result.
4.2.3 What if the unlabeled regions are treated as negative samples?
When the number of bounding boxes is decreased to 5, the mAP drops significantly to 4.40. To identify the reason, we evaluate the trained model on the training images and show two examples in Fig. 4. The predicted bounding boxes are drawn on the image with the confidence score next to each box. Only the boxes with confidence scores higher than 0.05 are displayed. The 5 selected labeled boxes coincide with the predicted boxes, and their probability is close to 1. Since the threshold here is 0.05, all the other regions, including the unlabeled cars, are classified as background with high confidence. This shows that the model overfits the training data severely, which degrades the accuracy.
Fig. 5 panels (each column is one training image; expand = number of expanded boxes, correct = number of correct boxes):
|(a) expand=6, correct=6|(b) expand=13, correct=13|(c) expand=8, correct=8|
|(d) expand=8, correct=8|(e) expand=35, correct=35|(f) expand=29, correct=29|
|(g) expand=10, correct=10|(h) expand=46, correct=46|(i) expand=32, correct=32|
|(j) expand=22, correct=22|(k) expand=50, correct=49|(l) expand=37, correct=33|
4.2.4 How effective is the proposed incompletely-supervised learning approach?
We apply the proposed PFOD on CARPK_5_5, and surprisingly the accuracy is boosted to 72.44 mAP, 23.63 MAE and 26.36 RMSE. In terms of MAE and RMSE, the accuracy surpasses that of LPN  trained on the full training set of 989 images.
In Fig. 5, we illustrate the label propagation process by PFOD on CARPK_5_5. Each column corresponds to one training image. The two numbers below each image are the number of expanded boxes (initially provided + propagated) and the number of correct boxes among them. A box is correct if its IoU with at least one bounding box in CARPK_5_ALL is larger than 0.3. The figure shows that the set of correct bounding boxes used for training grows gradually. Taking the leftmost image as an example, the number of correct boxes increases from the initial 5 to 22. The rightmost image reaches 33 correct boxes.
Meanwhile, we observe that the propagation is still not perfect - it introduces several false bounding boxes while missing a few cars. This is the reason why there is still a gap between this setting and CARPK_5_ALL, and will motivate us to continue investigating the problem.
4.2.5 Will introducing more labels help?
By increasing the number of labeled boxes per training image from 5 to 50, the accuracy can be smoothly increased for both OD and PFOD. With less than 25 boxes in each image, the accuracy of PFOD is consistently higher than OD, while with 50 boxes, the accuracy is lower. The reason is that under the setting of 50 boxes, most of the true boxes are included, while the label propagation introduced some false boxes. That is, under the almost full annotations, it is enough to apply the OD instead of propagating the boxes.
Fig. 6 panels (pred = predicted count, gt = ground-truth count):
|OD on CARPK_5_5|OD on CARPK_5_ALL|PFOD on CARPK_5_5|
|(a) pred=9, gt=127|(b) pred=130, gt=127|(c) pred=122, gt=127|
|(d) pred=13, gt=138|(e) pred=152, gt=138|(f) pred=93, gt=138|
|(g) pred=0, gt=114|(h) pred=115, gt=114|(i) pred=107, gt=114|
Fig. 7 panels (pred = predicted count, gt = ground-truth count):
|OD on Drink35_ALL_5|OD on full Drink35|PFOD on Drink35_ALL_5|
|(a) pred=0, gt=58|(b) pred=49, gt=58|(c) pred=36, gt=58|
|(d) pred=14, gt=51|(e) pred=46, gt=51|(f) pred=45, gt=51|
Fig. 6 visualizes the detection and counting results on three testing images for OD trained on CARPK_5_5, OD trained on CARPK_5_ALL, and PFOD trained on CARPK_5_5. Yellow boxes are correct bounding boxes, while red boxes are incorrect. A predicted box is correct if it has an IoU larger than 0.3 with one of the ground-truth bounding boxes. The two numbers under each image are the number of instances predicted by the model and the ground-truth number of instances, respectively.
4.3 Results on Drink35
With the full annotations, OD achieves 79.53% mAP@0.5, 7.92 MAE and 11.50 RMSE. This can be treated as the upper bound performance, as we have used all the labels. If each training image is provided with only 5 labeled instances, the accuracy of OD degrades to 14.16% mAP, 29.52 MAE and 38.62 RMSE. In contrast, with PFOD the accuracy jumps to 53.73% mAP, 11.16 MAE and 13.10 RMSE. Fig. 7 shows two example images detected and counted by the three approaches.
We have studied the problem of object counting when only a few exemplar annotations are available. The problem is of practical interest, especially when the number of object instances is large. We formulate the problem as incompletely-supervised learning in the context of object detection. Since not all the bounding boxes are provided, we cannot simply treat the other regions as background, which would lead to severe overfitting and performance degradation. To address the problem, we have proposed a positiveness-focused object detector to progressively propagate the incomplete labels to more object instances. Our experimental results on two applications demonstrate that this simple yet effective approach significantly boosts the accuracy with only a few manual annotations.
[1] An, S., Liu, W., Venkatesh, S.: Face recognition using kernel ridge regression. In: 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA. (2007)
[2] Chan, A.B., Liang, Z.J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting people without people models or tracking. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA. (2008)
[3] Chen, K., Gong, S., Xiang, T., Loy, C.C.: Cumulative attribute space for age and crowd density estimation. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013. (2013) 2467–2474
[4] Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: British Machine Vision Conference, BMVC 2012, Surrey, UK, September 3-7, 2012. (2012) 1–11
[5] Kong, D., Gray, D., Tao, H.: A viewpoint invariant approach for crowd counting. In: 18th International Conference on Pattern Recognition (ICPR 2006), 20-24 August 2006, Hong Kong, China. (2006) 1187–1190
[6] Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. (2015) 833–841
[7] Arteta, C., Lempitsky, V.S., Noble, J.A., Zisserman, A.: Interactive object counting. In: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III. (2014) 504–518
[8] Rodriguez, M., Laptev, I., Sivic, J., Audibert, J.: Density-aware person detection and tracking in crowds. In: IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011. (2011) 2423–2430
[9] Hsieh, M., Lin, Y., Hsu, W.H.: Drone-based object counting by spatially regularized regional proposal network. In: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. (2017) 4165–4173
[10] Kamenetsky, D., Sherrah, J.: Aerial car detection and urban understanding. In: 2015 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2015, Adelaide, Australia, November 23-25, 2015. (2015) 1–8
[11] Moranduzzo, T., Melgani, F.: Automatic car counting method for unmanned aerial vehicle images. IEEE Trans. Geoscience and Remote Sensing 52(3) (2014) 1635–1647
[12] Girshick, R.B.: Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. (2015) 1440–1448
[13] Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. (2015) 91–99
[14] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., Berg, A.C.: SSD: single shot multibox detector. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I. (2016) 21–37
[15] Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. (2017) 6517–6525
[16] Mundhenk, T.N., Konjevod, G., Sakla, W.A., Boakye, K.: A large contextual dataset for classification, detection and counting of cars with deep learning. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III. (2016) 785–800
[17] Lempitsky, V.S., Zisserman, A.: Learning to count objects in images. In: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada. (2010) 1324–1332
[18] Lin, Z., Davis, L.S.: Shape-based human detection and segmentation via hierarchical part-template matching. IEEE Trans. Pattern Anal. Mach. Intell. 32(4) (2010) 604–618
[19] Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. (2014) 580–587
[20] Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. (2016) 2846–2854
[21] Cinbis, R.G., Verbeek, J.J., Schmid, C.: Weakly supervised object localization with multi-fold multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 39(1) (2017) 189–203
[22] Kantorov, V., Oquab, M., Cho, M., Laptev, I.: Contextlocnet: Context-aware deep network models for weakly supervised localization. In: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V. (2016) 350–365
[23] Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. (2017) 3059–3067
[24] Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88(2) (2010) 303–338
[25] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)