Benchmark for Generic Product Detection: A strong baseline for Dense Object Detection



Object detection in densely packed scenes is a new area where standard object detectors fail to train well [6]. We show that a standard object detector performs better on densely packed scenes when it is trained on normal scenes rather than on dense scenes. We train a standard object detector on a small, normally packed dataset with data augmentation techniques. This achieves significantly better results than state-of-the-art methods that are trained on densely packed scenes. We obtain 68.5% mAP on the SKU110K dataset [6], which is 19.3% higher (absolute) and 1.4x better than the previous state-of-the-art. We also create a varied benchmark for generic SKU product detection by providing full annotations for multiple public datasets. It can be accessed at this URL. We hope that this benchmark helps in building robust detectors that perform reliably across different settings.


Dense Object Detection, Grocery Products, Retail Products, Benchmark, Generic SKU Detection

1 Introduction

Dataset        | #Images | #Objects | #Obj/Img | Object Size (Mean) | Object Size (Std) | Avg Img Size (MP)
SKU110K-Test   | 2941    | 432,312  | 146      | 0.27%              | 0.21%             | 7.96
WebMarket      | 3153    | 118,388  | 37       | 1.20%              | 1.09%             | 4.40
TobaccoShelves | 354     | 13,184   | 37       | 1.10%              | 0.65%             | 6.08
Holoselecta    | 295     | 10,036   | 34       | 0.99%              | 0.80%             | 15.62
GP             | 680     | 9184     | 13       | 3.66%              | 2.59%             | 7.99
CAPG-GP        | 234     | 4756     | 20       | 3.09%              | 3.04%             | 12.19
Table 1: Details of the datasets in the benchmark. # represents the count. Object sizes (mean and standard deviation) are relative to the image size. Average image size is shown in megapixels.

The real-world applications of computer vision span multiple industries such as banking, agriculture, governance, healthcare, automotive, retail, and manufacturing. A few prominent ones include self-driving cars, automated retail stores like Amazon Go, and automated surveillance. Object detectors are a critical component of such real-world products. Research on object detectors has been quite vibrant, with a considerable number of datasets spanning various domains. However, the sub-topic of object detection in dense scenes is rarely explored. A recent study [6] showed that standard object detectors do not train well on dense scenes. This topic is relevant to multiple applications, for example in the surveillance and retail industries: crowd counting, monitoring and auditing of retail shelves, insights into brand presence for sales and marketing teams, and so on.

Exemplar-based object detection refers to the detection and classification of objects from scene images with the supervision of an exemplar image of the object. Most object detection datasets are quite large, with a sufficient number of instances of every object category. Most object detection methods [9] depend on large, balanced datasets to perform well in every category. These guarantees cannot be made for real-world applications, where the object categories vary widely both in variety and in availability. For example, in the retail domain, gathering data to train an end-to-end object detection model is highly time-consuming as well as costly, because collecting data that covers all variants of objects with equal representation of each is much harder. For instance, ensuring that a dataset contains a specific rare Mercedes logo design requires searching across multiple showrooms or marketplaces. We cannot even be sure that availability across all retail classes is enough to create a balanced object detection dataset. A similar case arises when monitoring retail shelves, which carry thousands of SKUs with differing availability and frequency.

Moreover, in a dynamic world where new products, marketing materials, and logos keep getting introduced, incorporating incremental learning into real-world applications becomes essential. Unfortunately, existing methods of incremental learning for object detection lead to a substantial and unacceptable drop in performance [10]. Many of these applications also involve distinguishing between extremely fine-grained classes, e.g., retail shelf monitoring, logo monitoring, and face recognition. Building an end-to-end detector that performs both dense object detection and fine-grained recognition is a very challenging task, and its real-world performance tends to be poor. Hence, to tackle this problem, we propose to decouple detection and classification: a generic object detector predicts bounding boxes for all objects of interest, and the detected objects are then classified by a suitable fine-grained classifier.
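The decoupled pipeline can be sketched as below. Here `detect_generic` and `classify_crop` are hypothetical placeholders, standing in for a real class-agnostic detector (e.g., Faster-RCNN) and a fine-grained classifier; either component can be retrained or swapped without touching the other.

```python
# Sketch of detection/classification decoupling. Both model functions are
# illustrative stubs, not the authors' actual models.

def detect_generic(image):
    """Class-agnostic detector: returns (x1, y1, x2, y2) boxes for any SKU."""
    # Placeholder: a real system would run e.g. Faster-RCNN here.
    return [(10, 10, 50, 80), (60, 10, 100, 80)]

def classify_crop(crop):
    """Fine-grained classifier over crops; can be retrained independently
    when new products are introduced, without retraining the detector."""
    return "sku-placeholder"

def detect_and_classify(image):
    """Run the detector once, then classify each detected crop."""
    results = []
    for box in detect_generic(image):
        x1, y1, x2, y2 = box
        crop = [row[x1:x2] for row in image[y1:y2]]
        results.append((box, classify_crop(crop)))
    return results
```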

This brings us to the current work of generic object detection in densely packed scenes. Previous works have shown that standard detectors trained on dense scenes do not perform well on them. In this work, we show that standard detectors trained on small datasets of normal scenes are sufficient and work much better than ones trained on large datasets of dense scenes. We observe a 19.3% higher mAP, 1.4x better than the previous state-of-the-art, on the SKU110K dataset.

We also create a varied benchmark for generic SKU product detection by annotating every SKU in multiple datasets. The motivation is to encourage detectors that are robust across different settings. It is quite common for deep-learning-based detectors to perform well only on data similar to their training set, whereas the industry needs robust detectors that perform reliably in the wild. This benchmark consists of 6 datasets (Table 1) used solely as test sets. Models trained on any dataset can be tested on this benchmark to measure the progress of robust generic SKU detection. The benchmark datasets, evaluation code, and the leaderboard are available at this URL1

Dataset        | Method                        | AP    | AP@.50 | AP@.75 | AR    | AR@.50
SKU110K-Test   | RetinaNet [6]                 | 0.455 | -      | 0.389  | 0.530 | -
SKU110K-Test   | Full Approach (RetinaNet) [6] | 0.492 | -      | 0.556  | 0.554 | -
SKU110K-Test   | Faster-RCNN [6]               | 0.045 | -      | 0.010  | 0.066 | -
SKU110K-Test   | Faster-RCNN (ours)            | 0.685 | 0.827  | 0.755  | 0.756 | 0.855
WebMarket      | Faster-RCNN (ours)            | 0.597 | 0.754  | 0.661  | 0.677 | 0.785
TobaccoShelves | Faster-RCNN (ours)            | 0.436 | 0.569  | 0.523  | 0.474 | 0.576
Holoselecta    | Faster-RCNN (ours)            | 0.661 | 0.853  | 0.765  | 0.769 | 0.905
GP             | Faster-RCNN (ours)            | 0.537 | 0.792  | 0.604  | 0.656 | 0.879
CAPG-GP        | Faster-RCNN (ours)            | 0.648 | 0.900  | 0.749  | 0.760 | 0.955
Table 2: Performance of Faster-RCNN across different general product datasets. Dashes indicate values not reported in [6].

2 Benchmark Datasets

[6] recently released a huge benchmark dataset for product detection in densely packed scenes. To add diversity to the task of generic product detection, we release a benchmark of datasets. Details of the datasets are shown in Table 1. Please note that all of these datasets are used as test sets on which we benchmark our models. We welcome the community to participate in this benchmark by submitting their results.

2.1 WebMarket

[15] released a database of 3153 supermarket images, along with information on which product is present in each image. We annotate every object in the entire dataset to provide ground truth for the evaluation of generic object detection. The average number of objects per image in this dataset is 37, while the average object area is roughly 0.052 megapixels.

2.2 Grocery products (GP)

A multi-label classification approach was proposed by [5], accompanied by 680 annotated shop images in their GP dataset. Their annotations covered a group of identical products with a single bounding box rather than bounding boxes for individual products. A subset of 70 images was chosen by [12] and annotated with the desired object-level annotations. We provide individual bounding box annotations for every product in all 680 images of this dataset. The average number of objects per image in this dataset is 13, while the average object size is roughly 0.293 megapixels.

2.3 CAPG-GP

A fine-grained grocery product dataset was released by [4]. It consists of 234 test images taken from 2 stores. The authors annotated only the products belonging to certain categories. To create ground truth for generic object detection, we decided to annotate every product in the entire dataset. The average number of objects per image in this dataset is 20, while the average object area is roughly 0.377 megapixels.

Type of Images      | Avg Img Size (MP) | #Images
Type 1 (HoloLens)   | 1.11              | 30
Type 2              | 8.13              | 8
Type 3 (OnePlus-6T) | 16.03             | 208
Type 4              | 24.00             | 49
Total               | 15.62             | 295
Table 3: Details of the Holoselecta dataset. # represents the count. Average image size is shown in megapixels.

2.4 Existing General Product Datasets


2.4.1 Holoselecta

Most recently, [3] released a dataset of 300 real-world images of Selecta vending machines containing roughly 10,000 objects belonging to 90 categories of packaged products. The images in this dataset vary considerably in size, as shown in Table 3.


2.4.2 TobaccoShelves

A retail product dataset was released by [13] containing 354 images of tobacco shelves collected from 40 stores with four cameras. The annotations of every product were also released by the authors2. The average number of objects per image in this dataset is 37, while the average object area is roughly 1.1% of the entire image. The images in this dataset also vary considerably in size, ranging from 1.4 megapixels to 10.5 megapixels.


2.4.3 SKU110K

Recently, [6] released a huge dataset for precise object detection in densely packed scenes. This dataset contains 11,762 images split into train, validation, and test sets. The test set consists of 2941 images, with an average of 146 objects per image. The average object area is roughly 0.27% of the image, the lowest among all the datasets in this benchmark.

Dataset       | #Images | #Obj/Img
Our trainset  | 312     | 14.6
SKU110K-train | 8233    | 147.4
Table 4: Statistics of the training sets. # represents the count.

3 Approach

We collected close to 300 images encompassing the various shapes in which retail products occur from the public domain (e.g., Google Images, OpenImages). The average number of objects per image in this training set is 14.6. This is in contrast with the data on which [6] was trained, where the average number of objects per image is 147.4, as shown in Table 4. We apply standard object detection augmentations from [1]. We train a standard object detector, Faster-RCNN [9], on the training set described above as a baseline for this benchmark.
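The key detail in detection augmentation is that every geometric transform applied to the image must also be applied to the bounding boxes. A minimal NumPy sketch of one such bbox-aware transform follows; it is an illustrative stand-in for the augmentation pipeline of [1], not the authors' exact configuration.

```python
import numpy as np

# Bbox-aware horizontal flip: mirror the image and every (x1, y1, x2, y2)
# box, keeping x1 < x2 after the flip. Illustrative only.

def hflip_with_boxes(image, boxes):
    """Flip an HxW(xC) image left-right and mirror its pixel-coordinate boxes."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()
    # A point at x maps to w - x, so (x1, x2) maps to (w - x2, w - x1).
    new_boxes = [(w - x2, y1, w - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes
```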

Dataset        | #Images | >20  | >40  | >60  | >80  | >100
SKU110K-Test   | 2941    | 2939 | 2934 | 2926 | 2902 | 2822
WebMarket      | 3153    | 2526 | 1119 | 349  | 121  | 57
TobaccoShelves | 354     | 308  | 130  | 18   | 3    | 2
Holoselecta    | 295     | 249  | 113  | 0    | 0    | 0
GP             | 680     | 101  | 9    | 1    | 0    | 0
CAPG-GP        | 234     | 82   | 33   | 4    | 0    | 0
Table 5: Analysis of the denseness of the generic product datasets. Columns ">N" give the number of images with SKU count greater than N. # represents the count.

4 Implementation & Results

We apply standard post-processing steps such as Non-Max Suppression to our detections. We use multi-scale testing for scenes where object sizes are quite large (e.g., GP, CAPG-GP); for others, like SKU110K, only a single scale is used.
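The Non-Max Suppression step above is the standard greedy algorithm: boxes are visited in descending score order, and any remaining box whose IoU with the kept box exceeds a threshold is discarded. A self-contained NumPy version:

```python
import numpy as np

# Greedy Non-Max Suppression over (x1, y1, x2, y2) boxes.
# Returns the indices of the kept boxes, highest score first.

def nms(boxes, scores, iou_thresh=0.5):
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Keep only boxes that do not overlap box i too much
        order = rest[iou <= iou_thresh]
    return keep
```

In dense scenes the IoU threshold matters: too low a threshold suppresses true neighboring products, which is one reason densely packed shelves are hard for standard pipelines.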

We use the same evaluation metric as the recent work [6], which is the standard evaluation metric used by COCO [7]. Average precision (AP) and average recall (AR) are reported at IoU=.50:.05:.95 (averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05), with AR computed over a maximum of 300 detections per image. AP and AR at specific IoUs (0.50 and 0.75) are also reported.
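The IoU-threshold averaging can be illustrated with a toy single-box example (illustrative only; the real evaluation uses the COCO toolkit over full precision-recall curves):

```python
# Toy illustration of the IoU-threshold averaging in the COCO-style metric.
# With one ground-truth box and one detection, the average reduces to the
# fraction of the ten thresholds at which the detection counts as a match.

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

thresholds = [t / 100 for t in range(50, 100, 5)]  # IoU=.50:.05:.95, ten values
gt, det = (0, 0, 10, 10), (2, 0, 10, 10)           # detection overlaps gt at IoU 0.8
matched = [iou(det, gt) >= t for t in thresholds]
ap_toy = sum(matched) / len(matched)               # matched at 7 of 10 thresholds
```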

Our results on the SKU110K dataset (Table 2) clearly show that standard detectors trained on normal scenes work much better than sophisticated methods trained on dense scenes. They also show that large-scale datasets are not needed for generic SKU detection. This method serves as a simple baseline, while methods exploiting the shape of the objects and the structure of densely packed scenes look promising.

5 Discussion

Qualitative outputs of our method on different datasets are shown in Figures 1, 2, and 3. The performance on TobaccoShelves was relatively low (Table 2), which can also be seen qualitatively in Figure 5. This shows that evaluating on a varied benchmark is necessary for a detector to perform reliably in the wild. One can see that our method performs well when homogeneous objects are present throughout the image. It also comfortably detects objects of different aspect ratios. Current limitations of this baseline model include handling multi-scale objects ranging from 0.1% to 20% of the scene. Better scale-invariant object detectors [11] can be tried as future work.

6 Dense Object Detection

The denseness of a dataset depends on two factors: the average number of objects per image and the relative sizes of the objects. SKU110K [6] is by far one of the densest datasets for object detection. An analysis of the denseness of the other datasets in the current benchmark is shown in Table 5.
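The threshold analysis in Table 5 amounts to counting, for each dataset, the images whose per-image SKU count exceeds each cutoff. A sketch of that tally; the counts in the example are made up, not the benchmark's actual annotations:

```python
# Tally how many images exceed each SKU-count threshold, as in Table 5.

def denseness_profile(counts_per_image, thresholds=(20, 40, 60, 80, 100)):
    """Map each threshold to the number of images with more objects than it."""
    return {t: sum(1 for c in counts_per_image if c > t) for t in thresholds}
```

For example, `denseness_profile([10, 25, 45, 150])` yields one row of a Table-5-style summary for a hypothetical four-image dataset.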

7 Related Work

Identifying real-world products (in situ) by training on exemplar template product images (in vitro) was initially proposed by [8]. They released a database of 120 SKUs for product classification. Six in vitro images were collected from the web for each product and used for training. The in situ images were provided from frames of videos captured in a grocery store. There have been a few related retail product checkout datasets by [14] and [2]. Both of them are densely packed product datasets arranged in the fashion of a checkout counter, with many overlapping regions between objects as well.

Figure 1: Outputs of our method spanning different types of objects on the SKU110K [6] dataset.
Figure 2: Sample predictions of our method on the GP [5] dataset. Blue boxes denote ground-truth objects, while red boxes denote predictions.
Figure 3: Outputs of our method on the WebMarket [15] dataset.
Figure 4: Outputs of our method on the Holoselecta [3] dataset.
Figure 5: Outputs of our method on the TobaccoShelves [13] dataset.




  1. A. V. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov and A. A. Kalinin (2018) Albumentations: fast and flexible image augmentations. CoRR abs/1809.06839. External Links: Link, 1809.06839 Cited by: §3.
  2. P. Follmann, T. Böttger, P. Härtinger, R. König and M. Ulrich (2018) MVTec D2S: densely segmented supermarket dataset. CoRR abs/1804.08292. External Links: Link, 1804.08292 Cited by: §7.
  3. K. Fuchs, T. Grundmann and E. Fleisch (2019) Towards identification of packaged products via computer vision: convolutional neural networks for object detection and image classification in retail environments. In Proceedings of the 9th International Conference on the Internet of Things, IoT 2019, New York, NY, USA, pp. 26:1–26:8. External Links: ISBN 978-1-4503-7207-7, Link, Document Cited by: §2.4.1, Figure 4.
  4. W. Geng, F. Han, J. Lin, L. Zhu, J. Bai, S. Wang, L. He, Q. Xiao and Z. Lai (2018) Fine-grained grocery product recognition by one-shot learning. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, New York, NY, USA, pp. 1706–1714. External Links: ISBN 978-1-4503-5665-7, Link, Document Cited by: §2.3.
  5. M. George and C. Floerkemeier (2014-09) Recognizing products: a per-exemplar multi-label image classification approach. pp. 440–455. External Links: Document Cited by: §2.2, Figure 2.
  6. E. Goldman, R. Herzig, A. Eisenschtat, O. Ratzon, I. Levi, J. Goldberger and T. Hassner (2019) Precise detection in densely packed scenes. CoRR abs/1904.00853. External Links: Link, 1904.00853 Cited by: Benchmark for Generic Product Detection: A strong baseline for Dense Object Detection, Table 2, §1, §2.4.3, §2, §3, §4, §6, Figure 1.
  7. T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. External Links: Link, 1405.0312 Cited by: §4.
  8. M. Merler, C. Galleguillos and S. Belongie (2007-06) Recognizing groceries in situ using in vitro training data. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1–8. External Links: Document, ISSN 1063-6919 Cited by: §7.
  9. S. Ren, K. He, R. B. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Link, 1506.01497 Cited by: §1, §3.
  10. K. Shmelkov, C. Schmid and K. Alahari (2017) Incremental learning of object detectors without catastrophic forgetting. CoRR abs/1708.06977. External Links: Link, 1708.06977 Cited by: §1.
  11. B. Singh and L. S. Davis (2017) An analysis of scale invariance in object detection - SNIP. CoRR abs/1711.08189. External Links: Link, 1711.08189 Cited by: §5.
  12. A. Tonioni and L. di Stefano (2017) Product recognition in store shelves as a sub-graph isomorphism problem. CoRR abs/1707.08378. External Links: Link, 1707.08378 Cited by: §2.2.
  13. G. Varol and R. Salih (2015-03) Toward retail product recognition on grocery shelves. pp. 944309. External Links: Document Cited by: §2.4.2, Figure 5.
  14. X. Wei, Q. Cui, L. Yang, P. Wang and L. Liu (2019) RPC: A large-scale retail product checkout dataset. CoRR abs/1901.07249. External Links: Link, 1901.07249 Cited by: §7.
  15. Y. Zhang, L. Wang, R. I. Hartley and H. Li (2007-11-16) Where’s the weet-bix?. In ACCV (1), Y. Yagi, S. B. Kang, I. Kweon and H. Zha (Eds.), Lecture Notes in Computer Science, Vol. 4843, pp. 800–810. Cited by: §2.1, Figure 3.