Data Priming Network for Automatic Check-Out

Congcong Li, Dawei Du, Libo Zhang, Tiejian Luo, Yanjun Wu, Qi Tian, Longyin Wen, Siwei Lyu
University of Chinese Academy of Sciences, Beijing, China
University at Albany, State University of New York, Albany, NY, USA
Institute of Software Chinese Academy of Sciences, Beijing, China
State Key Laboratory of Computer Science, ISCAS, Beijing, China
Huawei Noah’s Ark Lab, China
JD Digits, Mountain View, CA, USA.

Automatic Check-Out (ACO) has received increasing interest in recent years. An important component of an ACO system is visual item counting, which recognizes the categories and counts of the items chosen by the customers. However, the training of such a system is challenged by a domain adaptation problem: the training data are images of isolated items, while the testing images show collections of items. Existing methods address this problem with data augmentation using synthesized images, but the image synthesis produces unrealistic images that affect the training process. In this paper, we propose a new data priming method to solve the domain adaptation problem. Specifically, we first use pre-augmentation data priming, in which we remove distracting background from the training images with a coarse-to-fine strategy and select images with realistic view angles by a pose pruning method. In the post-augmentation step, we train a data priming network using detection and counting collaborative learning, and select more reliable images from the testing data to fine-tune the final visual item tallying network. Experiments on the large-scale Retail Product Checkout (RPC) dataset demonstrate the superiority of the proposed method, i.e., we achieve 80.51% checkout accuracy compared with 56.68% of the best baseline method. The source code is publicly available.

automatic check-out, domain adaptation, data priming network, detection and counting collaborative learning
Both authors contributed equally to this research.
Corresponding author.
journalyear: 2019
conference: Proceedings of the 27th ACM International Conference on Multimedia; October 21–25, 2019; Nice, France
booktitle: Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), October 21–25, 2019, Nice, France
price: 15.00
doi: 10.1145/3343031.3350989
isbn: 978-1-4503-6889-6/19/10

1. Introduction

The recent success of the Amazon Go system has invigorated interest in Automatic Check-Out (ACO) for supermarkets and grocery stores. With ACO, customers do not need to put items on a conveyor belt and wait in line for a store assistant to scan them. Instead, they can simply collect the chosen items, and an AI-based system produces the categories and counts of these items and automatically processes the purchase. A successful ACO system will revolutionize the way we shop and will have a significant impact on our daily life in the coming years.

Figure 1. The illustration of an Automatic Check-Out (ACO) system. It recognizes the categories and counts of the products that the customer puts on the checkout counter, and calculates the corresponding total price. The system can also be extended to other application scenarios such as shopping carts and shelves.

The bedrock of an ACO system is visual item counting, which takes images of shopping items as input and outputs a tally of the different categories. With the recent successes of deep learning, deep neural networks are the tool of choice for this task. The training of deep neural networks is predicated on the availability of large annotated datasets. However, unlike other tasks in computer vision such as object detection and recognition, training a deep neural network for visual item counting faces a special challenge of domain shift. Specifically, the training data are usually images of individual items under different viewing angles, collected with an isolated item sitting on a turntable. As such, the training images may have a distribution different from images of shopping items piled together on a surface, see Figure 1. The visual item counting algorithm needs to be able to adapt to the difference between the source domain (images of isolated objects) and the target domain (images of collections of objects).

Existing work (Wei et al., 2019) attempts to solve this problem with data augmentation. First, images of collections of objects are generated by randomly overlaying individual objects. To improve the realism of the resulting images, the CycleGAN method (Zhu et al., 2017) is used to render realistic shadows and boundaries. However, such a scheme has serious drawbacks: the synthesized checkout images have a low level of realism due to unrealistic poses, and considerable domain shift remains between the training and testing data.

In this work, we propose a new strategy, termed data priming, to solve the challenging domain adaptation problem in visual item counting. Instead of simply increasing the data volume by data augmentation as in the previous method (Wei et al., 2019), we improve the relevancy of the augmented data in two steps. In the pre-augmentation data priming step, we extract the foreground region from the training images of isolated objects using a coarse-to-fine saliency detection method. Then, we develop a pose pruning method that keeps only images whose poses are consistent with the target domain as candidates for synthesizing checkout images with realistic poses. In the post-augmentation data priming step, we construct a data priming network with two heads, one for counting the total number of items and the other for detecting individual objects. Trained on the synthesized images, the data priming network is used to determine the reliability of testing data by detection and counting collaborative learning. The reliable testing data thus selected are used to train the visual item tallying network. Experiments on the large-scale Retail Product Checkout (RPC) dataset (Wei et al., 2019) demonstrate significant performance improvement of the proposed method compared with the baselines: we achieve 80.51% checkout accuracy compared with 56.68% of the best baseline method.

The main contributions of this work are three-fold.

  • First, we develop a simple and effective pose pruning method to select synthesized checkout samples with realistic poses as training data.

  • Second, we propose the data priming network to select reliable testing data by detection and counting collaborative learning to guide the training of visual item tallying network.

  • Third, experiments on the RPC dataset show that our proposed method achieves favorable performance compared to the baselines.

2. Related Work

In this section, we review previous works that are relevant to the proposed method.

2.1. Salient Object Detection

Salient object detection (Li and Yu, 2015; Liu and Han, 2016; Hu et al., 2016; Tang and Wu, 2017; Hou et al., 2019) aims to segment the main object in an image as a pre-processing step. Li et al. (Li and Yu, 2015) obtain the saliency map based on multi-scale features extracted from CNN models. Hu et al. (Hu et al., 2016) propose a saliency detection method based on the compactness hypothesis, which assumes that salient regions are more compact than the background in terms of both color layout and texture layout. Liu et al. (Liu and Han, 2016) develop a two-stage deep network, where a coarse prediction map is produced and then refined hierarchically and progressively by a recurrent CNN. Tang and Wu (Tang and Wu, 2017) develop multiple single-scale fully convolutional networks integrated with chained connections to generate saliency predictions from coarse to fine. Recently, Hou et al. (Hou et al., 2019) take full advantage of multi-level and multi-scale features extracted from fully convolutional networks, and introduce short connections to the skip-layer structures within the holistically-nested edge detector.

2.2. Data Augmentation

Data augmentation is a common technique in deep network training to deal with training data shortage. Recently, generative models including variational auto-encoders (VAEs) (van den Oord et al., 2016; Yan et al., 2016) and generative adversarial networks (GANs) (Goodfellow et al., 2014; Zhu et al., 2017) have been used to synthesize images similar to those in realistic scenes for data augmentation. Oord et al. (van den Oord et al., 2016) propose a new conditional image generation method based on the PixelCNN structure, which can be conditioned on feature vectors obtained from descriptive labels or tags, or on latent embeddings created by other networks. In (Yan et al., 2016), a layered VAE model with disentangled latent variables is proposed to generate images from visual attributes. Differently from VAEs, Goodfellow et al. (Goodfellow et al., 2014) estimate generative models via an adversarial process between two models, where the generative model captures the data distribution and the discriminative model estimates the probability that a sample came from the training data rather than from the generative model. Recently, the CycleGAN model (Zhu et al., 2017) learns the mapping between an input image and an output image in different styles.

2.3. Domain Adaptation

In training deep learning models, many factors can cause a shift between the domains of the training and testing data that degrades performance. Domain adaptation uses labeled data from a source domain to improve performance on testing data in a target domain. Recently there have been several domain adaptation methods for visual data. In (Ganin and Lempitsky, 2015), the authors learn deep features that are not only discriminative for the main learning task on the source domain but also invariant with respect to the shift between the domains. Saito et al. (Saito et al., 2017) propose an asymmetric tri-training method for unsupervised domain adaptation, where unlabeled samples are assigned pseudo-labels that are used to train neural networks as if they were true labels. In (Wang et al., 2018), a novel Manifold Embedded Distribution Alignment method is proposed to learn a domain-invariant classifier under the principle of structural risk minimization while performing dynamic distribution alignment. The work of (Chen et al., 2018) adapts Faster R-CNN (Ren et al., 2015) with both image-level and instance-level domain adaptation components to reduce the domain discrepancy. Qi et al. (Qi et al., 2018) propose a covariant multimodal attention based domain adaptation method that adaptively fuses attended features of different modalities.

2.4. Grocery Product Dataset

To date, there exist only a handful of related datasets for grocery product classification (Rocha et al., 2010), recognition (Koubaroulis et al., 2002; Merler et al., 2007; George and Floerkemeier, 2014; Jund et al., 2016), segmentation (Follmann et al., 2018) and tallying (Wei et al., 2019).

The Supermarket Produce Dataset (Rocha et al., 2010) includes fruit and vegetable categories with images in diverse scenes. However, this dataset is not very challenging and does not reflect the difficult aspects of real-life checkout images. SOIL-47 (Koubaroulis et al., 2002) contains 47 product categories, where each category has images taken from different horizontal views. Grozi-120 (Merler et al., 2007) contains 120 grocery product categories in natural scenes, with images collected both from the web and in stores. Similar to Grozi-120, the Grocery Products Dataset (George and Floerkemeier, 2014) is proposed for grocery product recognition; its training images are downloaded from the web, while its testing images are collected in a natural shelf scenario. The Freiburg Groceries Dataset (Jund et al., 2016) collects images of grocery classes using four different smartphone cameras at various stores, apartments and offices in Freiburg, Germany, rather than collecting them from the web. Specifically, each training image contains one or more instances of a single class, while the testing set contains images of cluttered scenes, each containing objects of multiple classes. Besides, in (Follmann et al., 2018), the MVTec D2S dataset is introduced for instance-aware semantic segmentation in an industrial domain, with pixel-wise labels for the object instances.

Different from the aforementioned datasets, the RPC dataset (Wei et al., 2019) is the largest-scale grocery product dataset to date in terms of both product categories and images. Each image captures a particular instance of a type of product with a distinct appearance and shape, and the products are divided into sub-categories such as puffed food, instant drink, dessert, gum, milk, personal hygiene and stationery. Specifically, single-product images are taken in an isolated environment as training exemplar images. To capture multiple views of each single product, four cameras are used to cover the top, horizontal, and intermediate views of the exemplar placed on a turntable, and each camera takes photos at fixed angular intervals as the turntable rotates. To generate checkout images, several randomly chosen products are placed on a white board and photographed by a camera mounted on top. Based on the number of product categories and instances they contain, the testing images are categorized into three difficulty levels, i.e., easy, medium, and hard, each containing the same number of images. The dataset provides three different types of annotations for the testing checkout images:

  • shopping lists that provide the category and count of each item in the checkout image,

  • point-level annotations that provide the center position and the category of each item in the checkout image,

  • bounding boxes that provide the location and category of each item.

3. Methodology

In this section, we present in detail our data priming scheme for data augmentation in the training of the visual item tallying network for an automatic check-out system. As mentioned in the Introduction, our method has two steps. In the pre-augmentation step, we process training images of isolated items and remove those with irrelevant poses to improve the synthesized images. In the post-augmentation step, we introduce a data priming network that selects reliable testing images to guide the training of the visual item tallying network.

Figure 2. An example of background removal by the mathematical morphology technique to generate coarse masks of items.

3.1. Pre-augmentation Data Priming

3.1.1. Background Removal

Since the training images are obtained from exemplar items captured on a turntable, they contain background that prevents the visual item tallying network from focusing on the object. To remove background noise, we develop a coarse-to-fine saliency-based refinement method. Specifically, we first extract the contour of the object using the method of (Dollár and Zitnick, 2013), remove edges whose confidence score falls below a threshold, and fill the connected regions. Then, other holes inside the contour are filled and small isolated regions are removed using mathematical morphology operations such as dilation and erosion. As a last step, we use a median filter to smooth the edges of the masks. A qualitative example of coarse mask generation is shown in Figure 2. Given the coarse masks, we employ the saliency detection model (Hou et al., 2019) to extract fine masks with detailed object contours. The saliency model is a deep neural network trained on the MSRA-B salient object database (Wang et al., 2017), which is then fine-tuned on the generated coarse masks of the exemplars. We use these masks to extract the foreground objects for the synthesis of checkout images.
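The morphology stage of the coarse-mask pipeline can be sketched as follows. This is a minimal NumPy illustration of only the hole-filling and speck-removal step (a morphological closing followed by an opening); the edge-detection, region-filling, median-filter, and saliency-network stages are not reproduced, and the 3×3 structuring element is an assumption.

```python
import numpy as np

def _binary_op(mask, op, k=3):
    """Apply a k x k sliding-window max (dilation) or min (erosion) to a 0/1 mask."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return op(windows, axis=(2, 3))

def dilate(mask, k=3):
    return _binary_op(mask, np.max, k)

def erode(mask, k=3):
    return _binary_op(mask, np.min, k)

def clean_coarse_mask(mask, k=3):
    """Morphological closing then opening: fill small holes inside the
    object region, then remove small isolated foreground specks."""
    closed = erode(dilate(mask, k), k)    # closing fills small holes
    opened = dilate(erode(closed, k), k)  # opening removes small specks
    return opened
```

Closing before opening matters here: closing repairs the object interior first, so the subsequent opening only strips noise smaller than the structuring element.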

Figure 3. An example of unrealistic and realistic poses from the bag-like product. We classify the item poses according to the ratio of mask area in Eq.(1).
Figure 4. The framework of the proposed network. The black dashed bounding box indicates the base network. The red and blue dashed bounding boxes correspond to the counter and detector heads, respectively. The orange dashed bounding box is to measure the reliability of testing data.

3.1.2. Pose Pruning

Since the testing image contains multiple objects while each training image contains only a single object, we use the segmented isolated items to create synthesized checkout images. However, not all poses of the isolated items are viable in checkout images. For example, it is difficult to place bag-like products on the checkout table viewed from bottom to top, as shown in Figure 3. To remove such inappropriate exemplar poses, we propose a simple metric based on the ratio of mask areas, i.e.,

$$r_{i,j} = \frac{a_{i,j}}{\max_{k} a_{i,k}}, \qquad (1)$$

where $a_{i,j}$ is the area of the item mask captured by the $j$-th view in the $i$-th category, and the maximum is taken over all views of that category. If the ratio $r_{i,j}$ is less than a pre-set threshold, the area of this pose is too small for the item to rest stably on the checkout table, i.e., the pose is unrealistic. Otherwise, we regard the pose as realistic.
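The pose pruning rule of Eq. (1) amounts to a one-line ratio test per view. In this sketch the function name `prune_poses` and the threshold value 0.5 are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def prune_poses(mask_areas, threshold=0.5):
    """Keep only views whose mask area is large relative to the item's
    largest view, following the area-ratio test of Eq. (1).

    mask_areas: sequence of shape (num_views,) holding the foreground
        mask area of each captured view of one item category.
    threshold: hypothetical cut-off (the paper's value is not reproduced).

    Returns a boolean array: True marks a realistic pose.
    """
    areas = np.asarray(mask_areas, dtype=float)
    ratios = areas / areas.max()          # r_ij = a_ij / max_k a_ik
    return ratios >= threshold            # True -> realistic pose

# Example: a bag-like item seen from the bottom has a tiny mask area
# and is pruned, while the other views are kept.
keep = prune_poses([900.0, 850.0, 120.0, 780.0], threshold=0.5)
```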

3.1.3. Checkout Images Synthesis

After obtaining the selected segmented items, we synthesize checkout images using the method in (Wei et al., 2019). Specifically, segmented items are randomly selected and freely placed, with random rotation angles and scales, on a prepared background image such that the occlusion rate of each instance stays below a threshold. Thus the synthesized images are similar to the checkout images in terms of item placement.

The synthesized checkout images produced by random copy-and-paste still lack characteristics of the true testing images. Therefore, following (Wei et al., 2019), we use CycleGAN (Zhu et al., 2017) to render the synthesized checkout images with more realistic lighting conditions and shadows, as shown in Figure 5.
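The copy-and-paste placement with an occlusion constraint might look roughly like the sketch below. For simplicity it operates on binary occupancy masks, restricts rotation to multiples of 90° (the actual synthesis allows free rotation and scaling), and the `max_occlusion` value and retry count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def try_place(canvas_occ, item_mask, max_occlusion=0.5, max_tries=20):
    """Paste one item mask onto an occupancy canvas at a random
    position/orientation, accepting the placement only if the fraction
    of the item already covered by previously placed items stays below
    `max_occlusion`. Returns the top-left corner (y, x), or None if no
    acceptable placement was found."""
    H, W = canvas_occ.shape
    for _ in range(max_tries):
        m = np.rot90(item_mask, k=rng.integers(4))  # simplified rotation
        h, w = m.shape
        y = rng.integers(0, H - h + 1)
        x = rng.integers(0, W - w + 1)
        overlap = (canvas_occ[y:y + h, x:x + w] & m).sum() / m.sum()
        if overlap < max_occlusion:
            canvas_occ[y:y + h, x:x + w] |= m   # mark region as occupied
            return (y, x)
    return None
```

Repeating `try_place` for several randomly chosen items yields an item layout that respects the per-instance occlusion constraint.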

Figure 5. Comparison of synthesized images (first row) and rendered images (second row) by the Cycle-GAN method (Zhu et al., 2017).

3.2. Data Priming Network

We can train a deep neural network for visual item tallying using the rendered synthesized checkout images. However, the rendered images still have different characteristics from the actual checkout images. To solve this problem, we propose the Data Priming Network (DPNet), which selects reliable testing samples using a detection and counting collaborative learning strategy to guide the training of the visual item tallying network.

3.2.1. Network Architecture

The goal of visual item tallying in ACO is to predict the count and category of the items in a checkout image. To this end, we introduce a data priming network to select reliable checkout images to facilitate the training. Specifically, the data priming network consists of three components, i.e., a base network shared by a counter head and a detector head, as shown in Figure 4. The base network outputs features shared by the two heads and is implemented with the ResNet-101 backbone (He et al., 2016) and the Feature Pyramid Network (FPN) architecture (Lin et al., 2017). Based on the shared features, the counter head predicts the total number of instances using a predicted density map, while the detector head recognizes the location and category of each instance. Taking the last feature maps of the base network, the counter head consists of several dilated convolutional layers that extract deeper features without losing resolution, followed by a convolutional output layer, similar to (Li et al., 2018); the feature maps are first down-sampled by a strided dilated convolutional layer to reduce the computational complexity. The detector head includes fully connected layers that compute the regression and classification losses from the multi-scale feature maps.

3.2.2. Loss Function

The loss function of the proposed network consists of terms for the counter and detector heads. For the counter head, we use the Euclidean distance between the ground-truth density map and the estimated density map. For the detector head, we use the standard cross-entropy loss for classification and the smooth L1 loss for regression (Lin et al., 2017). The loss function is given as follows:

$$\mathcal{L} = \frac{1}{2B}\sum_{i=1}^{B}\sum_{p}\big(\hat{D}_i(p) - D_i(p)\big)^2 + \frac{\lambda}{B}\sum_{i=1}^{B}\sum_{j}\Big(\mathcal{L}_{\mathrm{cls}}(\hat{c}_{i,j}, c_{i,j}) + \mathbb{1}[c_{i,j} \neq 0]\,\mathcal{L}_{\mathrm{reg}}(\hat{t}_{i,j}, t_{i,j})\Big), \qquad (2)$$

where $x_i$ represents the $i$-th input image and $B$ is the batch size. $\hat{D}_i(p)$ and $D_i(p)$ are the estimated and ground-truth density at location $p$ in image $x_i$, respectively; both maps are a fraction of the input image size. $\hat{c}_{i,j}$ and $c_{i,j}$ are the predicted and ground-truth class labels of detection $j$ in image $x_i$, where the background class has index $0$. The indicator $\mathbb{1}[\cdot]$ equals $1$ if its argument is true (objects) and $0$ otherwise (background); that is, we only consider the regression loss for objects. $\hat{t}_{i,j}$ and $t_{i,j}$ are the regression vectors representing the parameterized coordinates of the predicted and ground-truth bounding boxes of detection $j$ in image $x_i$, respectively. $\lambda$ is a factor balancing the two terms.
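The two loss terms can be rendered in plain NumPy for a single image (batch size one) as below. The helper names and the default balancing factor are assumptions, and a real implementation would of course use an autograd framework rather than NumPy.

```python
import numpy as np

def smooth_l1(pred, target):
    """Per-detection smooth L1 loss over the 4 box coordinates."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum(axis=-1)

def dpnet_loss(dens_pred, dens_gt, cls_logits, cls_gt, reg_pred, reg_gt, lam=1.0):
    """Counting (Euclidean) + detection (cross-entropy + smooth L1) loss
    for one image (B = 1 for clarity). Shapes:
      dens_pred / dens_gt: (H, W) density maps
      cls_logits: (M, C) detection class scores, class 0 = background
      cls_gt:     (M,)  ground-truth labels
      reg_pred / reg_gt: (M, 4) box regression vectors
    `lam` is the balancing factor; its actual value is not reproduced here."""
    count_loss = 0.5 * ((dens_pred - dens_gt) ** 2).sum()
    # numerically stable softmax cross-entropy over detections
    z = cls_logits - cls_logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_prob[np.arange(len(cls_gt)), cls_gt]
    fg = (cls_gt != 0).astype(float)       # regression only for objects
    det_loss = (ce + fg * smooth_l1(reg_pred, reg_gt)).sum()
    return count_loss + lam * det_loss
```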

3.2.3. Ground-truth generation

To train the DPNet, we need to generate ground-truth density maps. Using the center locations of the extracted item masks, we generate ground-truth density maps for the rendered images using the strategy in (Zhang et al., 2016). First, we blur the center of each instance with a normalized Gaussian kernel. Then, we generate the ground truth considering the spatial distribution of all instances in the rendered image. For the detector, both the locations and labels of instances come directly from the exemplars placed in the synthesized images.
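The density-map generation can be illustrated as follows. Each instance center contributes a Gaussian that is normalized to unit mass, so the whole map integrates to the instance count; the bandwidth `sigma` is an illustrative value (the strategy of Zhang et al. (2016) adapts the kernel to the spatial distribution of instances, which is not reproduced here).

```python
import numpy as np

def density_map(centers, shape, sigma=4.0):
    """Ground-truth density map: one normalized Gaussian per instance
    center, so the map sums to the number of instances.

    centers: iterable of (row, col) instance centers
    shape:   (H, W) of the density map
    sigma:   illustrative Gaussian bandwidth in pixels
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    dmap = np.zeros((H, W))
    for (cy, cx) in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()     # each instance contributes total mass 1
    return dmap
```

Summing the resulting map (and rounding) recovers the instance count, which is exactly the quantity the counter head is trained to regress.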

3.2.4. Detection and Counting Collaborative Learning

We train the network using detection and counting collaborative learning; the whole procedure is presented in Algorithm 1. First, we train the entire network on the source training set, where both the counter and the detector are optimized by Eq. (2). Then, we select reliable testing data as those samples for which the number of items estimated by the counter head equals the number of high-confidence detections produced by the detector head after the NMS operation, i.e.,

$$\mathrm{round}\Big(\sum_{p}\hat{D}(p)\Big) = \sum_{j}\mathbb{1}[s_j > \theta], \qquad (3)$$

where $\hat{D}(p)$ is the estimated density at location $p$ in the sample, $\mathrm{round}(\cdot)$ indicates the rounding operation, $s_j$ is the confidence of detection $j$ in the sample after NMS, and $\theta$ is a high pre-set confidence threshold. $\mathbb{1}[\cdot]$ equals $1$ if its argument is true and $0$ otherwise. Finally, after removing the counter head, the network is fine-tuned on the selected reliable testing data from the target domain to become the visual item tallying network.
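The reliability test of Eq. (3) amounts to a few lines: round the integral of the density map and compare it with the number of confident detections. `score_thr` stands in for the confidence threshold used in the experiments.

```python
import numpy as np

def is_reliable(dens_pred, det_scores, score_thr=0.95):
    """Eq. (3): keep a testing image when the counter's estimate
    (rounded sum of the density map) agrees with the number of
    confident detections remaining after NMS."""
    n_count = int(round(float(np.sum(dens_pred))))
    n_det = int(np.sum(np.asarray(det_scores) > score_thr))
    return n_count == n_det

def select_reliable(samples, score_thr=0.95):
    """Return indices of reliable (density_map, detection_scores) pairs."""
    return [i for i, (d, s) in enumerate(samples)
            if is_reliable(d, s, score_thr)]
```

Requiring the two independent heads to agree is what makes the selected pseudo-labeled target images trustworthy enough for fine-tuning.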

Require: rendered training data $S$ with annotations, unlabelled testing data $T$
Ensure: counts and categories of items in the testing data
1:  for epoch = 1 to $N_1$ do
2:     Train the DPNet using the rendered training data $S$.
3:  end for
4:  Select reliable testing data $T' \subseteq T$ based on Eq. (3).
5:  Remove the counter head of the DPNet to obtain the visual item tallying network.
6:  for epoch = 1 to $N_2$ do
7:     Fine-tune the visual item tallying network using the reliable testing data $T'$.
8:  end for
9:  Evaluate the visual item tallying network on the testing data $T$.
Algorithm 1 Detection and Counting Collaborative Learning
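Algorithm 1 can be summarized as the orchestration sketch below. The `dpnet` object and its methods (`train_step`, `predict`, `drop_counter_head`, `finetune_step`) are hypothetical stand-ins for the real training code, as is the `count_ok` flag implementing the Eq. (3) agreement test.

```python
def collaborative_learning(dpnet, train_data, test_data,
                           n_pretrain=10, n_finetune=5):
    """High-level sketch of Algorithm 1 with hypothetical method names."""
    for _ in range(n_pretrain):                # lines 1-3: train on rendered data
        dpnet.train_step(train_data)
    reliable = [x for x in test_data           # line 4: Eq. (3) selection
                if dpnet.predict(x)["count_ok"]]
    tally_net = dpnet.drop_counter_head()      # line 5: detector-only network
    for _ in range(n_finetune):                # lines 6-8: fine-tune on target data
        tally_net.finetune_step(reliable)
    return tally_net                           # line 9: evaluated externally
```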

4. Experiment

We evaluate our method on the RPC dataset (Wei et al., 2019) and compare it with several baseline methods. Both the source code and the experimental results are publicly released.

4.1. Implementation Details

The proposed method is implemented in PyTorch (Paszke et al., 2017). The settings for the CycleGAN model follow (Zhu et al., 2017). Each mini-batch consists of a fixed number of images per GPU, and we cap the number of detections per image. We use the SGD optimizer to train the DPNet with standard weight decay and momentum settings, and the balancing factor $\lambda$ in Eq. (2) is set empirically. For both the counter head and the detection head, the initial learning rate is held for the first 120k iterations and then decayed for the next 40k iterations. All experiments are conducted on a workstation with Nvidia TITAN Xp GPUs.

Clutter mode Methods cAcc (↑) ACD (↓) mCCD (↓) mCIoU (↑) mAP50 (↑) mmAP (↑)
Easy Single (Baseline) 0.02% 7.83 1.09 4.36% 3.65% 2.04%
Syn (Baseline) 18.49% 2.58 0.37 69.33% 81.51% 56.39%
Render (Baseline) 63.19% 0.72 0.11 90.64% 96.21% 77.65%
Syn+Render (Baseline) 73.17% 0.49 0.07 93.66% 97.34% 79.01%
Render (DPNet(w/o PP)) 79.82% 0.31 0.05 95.84% 98.33% 82.05%
Render (DPNet(w/o DP)) 85.38% 0.23 0.03 96.82% 98.72% 83.10%
Render (DPNet(w/o DPC)) 84.46% 0.23 0.03 96.92% 97.93% 83.22%
Render (DPNet) 89.74% 0.16 0.02 97.83% 98.52% 82.75%
Syn+Render (DPNet(w/o DP)) 86.58% 0.21 0.03 97.12% 98.62% 83.47%
Syn+Render (DPNet) 90.32% 0.15 0.02 97.87% 98.60% 83.07%
Medium Single (Baseline) 0.00% 19.77 1.67 3.96% 2.06% 1.11%
Syn (Baseline) 6.54% 4.33 0.37 68.61% 79.72% 51.75%
Render (Baseline) 43.02% 1.24 0.11 90.64% 95.83% 72.53%
Syn+Render (Baseline) 54.69% 0.90 0.08 92.95% 96.56% 73.24%
Render (DPNet(w/o PP)) 58.76% 0.74 0.06 94.10% 97.55% 76.05%
Render (DPNet(w/o DP)) 70.90% 0.49 0.04 95.90% 98.16% 77.22%
Render (DPNet(w/o DPC)) 69.85% 0.50 0.04 95.95% 97.24% 77.09%
Render (DPNet) 77.75% 0.35 0.03 97.04% 97.92% 76.78%
Syn+Render (DPNet(w/o DP)) 73.20% 0.46 0.04 96.24% 98.19% 77.69%
Syn+Render (DPNet) 80.68% 0.32 0.03 97.38% 98.07% 77.25%
Hard Single (Baseline) 0.00% 22.61 1.33 2.06% 0.97% 0.55%
Syn (Baseline) 2.91% 5.94 0.34 70.25% 80.98% 53.11%
Render (Baseline) 31.01% 1.77 0.10 90.41% 95.18% 71.56%
Syn+Render (Baseline) 42.48% 1.28 0.07 93.06% 96.45% 72.72%
Render (DPNet(w/o PP)) 44.58% 1.20 0.07 93.25% 96.86% 73.62%
Render (DPNet(w/o DP)) 56.25% 0.84 0.05 95.28% 97.67% 74.88%
Render (DPNet(w/o DPC)) 52.80% 0.86 0.05 95.17% 96.51% 74.77%
Render (DPNet) 66.35% 0.60 0.03 96.60% 97.49% 74.67%
Syn+Render (DPNet(w/o DP)) 59.05% 0.77 0.04 95.71% 97.77% 75.45%
Syn+Render (DPNet) 70.76% 0.53 0.03 97.04% 97.76% 74.95%
Averaged Single (Baseline) 0.01% 12.84 1.06 2.14% 1.83% 1.01%
Syn (Baseline) 9.27% 4.27 0.35 69.65% 80.66% 53.08%
Render (Baseline) 45.60% 1.25 0.10 90.58% 95.50% 72.76%
Syn+Render (Baseline) 56.68% 0.89 0.07 93.19% 96.57% 73.83%
Render (DPNet(w/o PP)) 60.98% 0.75 0.06 94.05% 97.29% 75.89%
Render (DPNet(w/o DP)) 70.80% 0.52 0.04 95.86% 97.93% 77.07%
Render (DPNet(w/o DPC)) 69.03% 0.53 0.04 95.82% 96.96% 77.09%
Render (DPNet) 77.91% 0.37 0.03 97.01% 97.74% 76.80%
Syn+Render (DPNet(w/o DP)) 72.83% 0.48 0.04 96.17% 97.94% 77.56%
Syn+Render (DPNet) 80.51% 0.34 0.03 97.33% 97.91% 77.04%
Table 1. Experimental results on the RPC dataset.
Figure 6. Results of our method for the easy, medium, and hard modes (from top to bottom). Bounding boxes of different colors correspond to predictions with the item categories and confidence scores. Best viewed in color.

4.2. Evaluation protocol

To evaluate the performance of the proposed method, we use several metrics following (Wei et al., 2019). First, the counting error for a specific category $k$ in image $i$ is defined as

$$CD_{i,k} = |P_{i,k} - GT_{i,k}|,$$

where $P_{i,k}$ and $GT_{i,k}$ indicate the predicted count and the ground-truth number of items of the $k$-th category in the $i$-th image, respectively. The error over all $K$ categories for the $i$-th image is calculated as

$$CD_i = \sum_{k=1}^{K} CD_{i,k}.$$
4.2.1. Checkout Accuracy

Checkout Accuracy (cAcc) is the primary ranking metric for the ACO task (Wei et al., 2019): it is the fraction of images whose complete product list is predicted correctly. It is calculated as

$$cAcc = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[CD_i = 0\big],$$

where $N$ is the number of testing images and $\mathbb{1}[\cdot]$ equals $1$ if its argument is true and $0$ otherwise. The cAcc score ranges from $0$ to $1$. For example, if $CD_i = 0$ for every image, all items are accurately predicted, i.e., $cAcc = 1$.

4.2.2. Mean Category Intersection of Union

Mean Category Intersection of Union (mCIoU) measures the compatibility between the predicted product list and the ground truth. It is defined as

$$mCIoU = \frac{1}{K}\sum_{k=1}^{K}\frac{\sum_{i=1}^{N}\min(P_{i,k}, GT_{i,k})}{\sum_{i=1}^{N}\max(P_{i,k}, GT_{i,k})}.$$

The mCIoU score ranges from $0$ to $1$.

4.2.3. Average Counting Distance

Different from cAcc, which only checks whether the entire list is correct, the Average Counting Distance (ACD) measures the average counting error per image, i.e.,

$$ACD = \frac{1}{N}\sum_{i=1}^{N} CD_i.$$
4.2.4. Mean Category Counting Distance

Moreover, the Mean Category Counting Distance (mCCD) calculates the average ratio of counting errors for each category, i.e.,

$$mCCD = \frac{1}{K}\sum_{k=1}^{K}\frac{\sum_{i=1}^{N} CD_{i,k}}{\sum_{i=1}^{N} GT_{i,k}}.$$
4.2.5. Mean Average Precision

On the other hand, according to the evaluation protocols of MS COCO (Lin et al., 2014) and the ILSVRC 2015 challenge (Russakovsky et al., 2015), we use the mean Average Precision (mAP) metrics (i.e., mAP50 and mmAP) to evaluate the performance of the detector. Specifically, mAP50 is computed at the single Intersection over Union (IoU) threshold of 0.5 over all item categories, while mmAP is computed by averaging over multiple IoU thresholds (i.e., in the interval [0.5, 0.95] in steps of 0.05) over all item categories.
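The four counting metrics above (cAcc, ACD, mCCD, mCIoU) can be implemented compactly from per-category count matrices, following their definitions in Wei et al. (2019). The `np.maximum(..., 1)` denominators guard against categories absent from the ground truth and are an added safeguard, not part of the original definitions.

```python
import numpy as np

def aco_metrics(pred, gt):
    """Compute cAcc, ACD, mCCD and mCIoU from predicted and ground-truth
    per-category counts, both of shape (N, K): N images, K categories."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    cd = np.abs(pred - gt)                 # CD_{i,k}: per-category error
    cd_img = cd.sum(axis=1)                # CD_i: per-image error
    cacc = float(np.mean(cd_img == 0))     # fraction of perfect lists
    acd = float(np.mean(cd_img))           # average error per image
    mccd = float(np.mean(cd.sum(axis=0) / np.maximum(gt.sum(axis=0), 1)))
    mciou = float(np.mean(np.minimum(pred, gt).sum(axis=0)
                          / np.maximum(np.maximum(pred, gt).sum(axis=0), 1)))
    return {"cAcc": cacc, "ACD": acd, "mCCD": mccd, "mCIoU": mciou}
```

Feeding the same matrix as both arguments yields the perfect scores (cAcc = mCIoU = 1, ACD = mCCD = 0), a convenient sanity check.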

4.3. Baseline Solutions

The authors of (Wei et al., 2019) provide four baselines for comparison. Specifically, a detector is trained to recognize the items based on the following four kinds of training data.

  • Single. We train the FPN detector (Lin et al., 2017) using training images of isolated items based on the bounding box annotations.

  • Syn. We copy and paste the segmented isolated items to create synthesized checkout images for detector training. To segment these items, we employ a saliency-based object segmentation approach (Hu et al., 2016) with Conditional Random Fields (CRF) (Krähenbühl and Koltun, 2011) for refining the item masks and removing background noise.

  • Render. To reduce domain gap, we employ Cycle-GAN (Zhu et al., 2017) to translate the synthesized images into the checkout image domain for detector training, resulting in more realistic render images.

  • Syn+Render. We train the detector based on both synthesized and rendered images.

4.4. Experimental Results and Analysis

The performance compared with the baseline methods is presented in Table 1, and more visual examples for the different difficulty levels are shown in Figure 6. The Single method fails in the ACO task because of the huge gap between the exemplars and the checkout images. By combining segmented items into synthesized checkout images, the averaged checkout accuracy improves from 0.01% to 9.27%. Moreover, a significant boost is achieved by training the detector on rendered images, because the GAN method effectively mimics the lighting conditions and shadow patterns of realistic checkout images. Compared to the Render baseline (45.60% cAcc), our DPNet achieves 77.91% cAcc at the averaged level when trained only on rendered images. Given the Syn+Render data, the checkout accuracy is further improved by 17.15%, 25.99%, and 28.28% for the easy, medium, and hard levels, respectively, compared with the Syn+Render baseline. This indicates the effectiveness of our approach.

4.5. Ablation Study

We further study the effect of the different modules of the proposed method by constructing three variants, i.e., DPNet(w/o DPC), DPNet(w/o DP) and DPNet(w/o PP). DPNet(w/o DPC) removes the counter head when selecting reliable testing data; in this case, the reliability condition in Eq. (3) reduces to requiring at least 3 confident detections, since the smallest number of items in a checkout image is 3 (easy mode). DPNet(w/o DP) does not use the DPNet for domain adaptation, i.e., the detector is trained only on the rendered data. DPNet(w/o PP) further removes the pose pruning module from DPNet(w/o DP). For a fair comparison, we use the same parameter settings and input size in evaluation, and conduct the experiments on all testing checkout images.

4.5.1. Effectiveness of Background Removal

The Render baseline method uses the Saliency (Hu et al., 2016)+CRF (Krähenbühl and Koltun, 2011) model to obtain the masks of the exemplars. As presented in Table 1, our DPNet(w/o PP) method achieves better performance, i.e., 60.98% vs. 45.60% averaged checkout accuracy on the rendered data. This can be attributed to the better segmentation produced by the coarse-to-fine strategy of our DPNet(w/o PP) method.

4.5.2. Effectiveness of Pose Pruning

If we remove the pose pruning module, checkout accuracy decreases by 9.82% (60.98% for DPNet(w/o PP) vs. 70.80% for DPNet(w/o DP) at the averaged level). This noticeable performance drop validates the importance of the pose pruning module in removing synthesized images that contain items with unrealistic poses (see Figure 3).

4.5.3. Effectiveness of Detection and Counting Collaborative Learning

From Table 1, our proposed DPNet achieves better results than its variant DPNet(w/o DP). The increase in checkout accuracy indicates that the data priming method effectively adapts the data from the source domain to the target domain. Besides, DPNet(w/o DPC) performs even slightly worse than DPNet(w/o DP) ( vs. ). It is difficult to determine reliable testing data based only on the detection head, which results in a large amount of unreliable testing data ( of the selected testing data). In contrast, the proposed DPNet selects reliable testing data correctly with both the counter and detection heads.
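The reliability check with both heads can be sketched as follows. This is a minimal illustration, not the paper's exact Eq. (3): the function names, the score threshold, and the equality test between the two heads are assumptions made for illustration.

```python
# Hypothetical sketch of reliable testing-data selection with both heads.
# An image is kept only when the number of confident detections from the
# detection head agrees with the count predicted by the counter head.
# The 0.95 score threshold is an assumed value, not from the paper.

def is_reliable(det_scores, counted_items, score_thresh=0.95):
    """det_scores: per-box confidence scores from the detection head.
    counted_items: item count predicted by the counter head."""
    confident_boxes = sum(1 for s in det_scores if s >= score_thresh)
    return confident_boxes == counted_items

def select_reliable(test_images):
    """test_images: iterable of (image_id, det_scores, counted_items)."""
    return [img_id for img_id, scores, count in test_images
            if is_reliable(scores, count)]
```

For example, `select_reliable([("a", [0.99, 0.97], 2), ("b", [0.99, 0.5], 2)])` keeps only `"a"`, since for `"b"` the two heads disagree on the item count.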

To visualize the distribution of the source and target domains, we first train our network with the ResNet-101 backbone and then compute the features of images randomly selected from the two domains using the last block of the backbone. For Figure 7(a), the network is trained on synthesized images. For Figure 7(b), the network is trained on rendered images. For Figure 7(c), the network is trained on rendered images and then fine-tuned on reliable testing images. Finally, we embed the high-dimensional features of each domain into a two-dimensional space for visualization using the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique (van der Maaten, 2014).
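The visualization step above can be sketched with scikit-learn's t-SNE implementation. Here the backbone feature extraction is stubbed out with random 2048-D vectors (the output dimension of the ResNet-101 last block); the sample counts and distribution shifts are placeholders for illustration only.

```python
# Sketch of the domain-gap visualization: embed backbone features from the
# source (rendered) and target (checkout) domains into 2-D with t-SNE.
# Random vectors stand in for the real 2048-D ResNet-101 features here.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
source_feats = rng.normal(0.0, 1.0, size=(50, 2048))  # stand-in source features
target_feats = rng.normal(0.5, 1.0, size=(50, 2048))  # stand-in target features

feats = np.concatenate([source_feats, target_feats], axis=0)
labels = np.array([0] * 50 + [1] * 50)  # 0 = source domain, 1 = target domain

# t-SNE (van der Maaten, 2014): high-dimensional features -> 2-D embedding,
# one 2-D point per image, which can then be scattered colored by `labels`.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
print(embedding.shape)
```

Plotting `embedding` colored by `labels` then gives scatter plots like Figure 7, where a smaller gap between the two colored clusters indicates better domain alignment.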

Figure 7. Comparison of the distribution of source and target domain based on the detector trained on (a) synthesized data; (b) rendered data; (c) reliable testing data.

4.5.4. Effectiveness of Syn+Render

Similar to the trend in the baseline methods ( cAcc of Render (baseline) vs. cAcc of Syn+Render (baseline)), the performance is consistently improved when training on both synthesized and rendered data. Specifically, Syn+Render (DPNet) achieves cAcc score compared to the cAcc score of the Render (DPNet) configuration.

5. Conclusion

In this paper, we propose a new data priming network for automatic check-out. Different from previous domain adaptation methods, we construct both counter and detector heads to measure the reliability of testing images for the target domain. Then, the detector of the target branch can learn a target-discriminative representation from the reliable testing samples using detection and counting collaborative learning, resulting in robust performance. The experiments on the RPC dataset show that our method surpasses the previous baseline methods significantly, by more than in checkout accuracy at the averaged level. In future work, we would like to further study other potential options for the data priming network, including heads for other types of attributes.

6. Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant 61807033, and partially supported by the US National Science Foundation under Grant IIS1816227.


  • Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. V. Gool (2018) Domain adaptive faster R-CNN for object detection in the wild. In CVPR, pp. 3339–3348. Cited by: §2.3.
  • P. Dollár and C. L. Zitnick (2013) Structured forests for fast edge detection. In ICCV, pp. 1841–1848. Cited by: §3.1.1.
  • P. Follmann, T. Böttger, P. Härtinger, R. König, and M. Ulrich (2018) MVTec D2S: densely segmented supermarket dataset. In ECCV, pp. 581–597. Cited by: §2.4, §2.4.
  • Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189. Cited by: §2.3.
  • M. George and C. Floerkemeier (2014) Recognizing products: A per-exemplar multi-label image classification approach. In ECCV, pp. 440–455. Cited by: §2.4, §2.4.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2.1.
  • Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. S. Torr (2019) Deeply supervised salient object detection with short connections. TPAMI 41 (4), pp. 815–828. Cited by: §2.1, §3.1.1.
  • P. Hu, W. Wang, C. Zhang, and K. Lu (2016) Detecting salient objects via color and texture compactness hypotheses. TIP 25 (10), pp. 4653–4664. Cited by: §2.1, 2nd item, §4.5.1.
  • P. Jund, N. Abdo, A. Eitel, and W. Burgard (2016) The freiburg groceries dataset. CoRR abs/1611.05799. Cited by: §2.4, §2.4.
  • D. Koubaroulis, J. Matas, and J. Kittler (2002) Evaluating colour-based object recognition algorithms using the soil-47 database. In ACCV, Vol. 2. Cited by: §2.4, §2.4.
  • P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In NeurIPS, pp. 109–117. Cited by: 2nd item, §4.5.1.
  • G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In CVPR, pp. 5455–5463. Cited by: §2.1.
  • Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pp. 1091–1100. Cited by: §3.2.1.
  • T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: §3.2.1, §3.2.2, 1st item.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §4.2.5.
  • N. Liu and J. Han (2016) DHSNet: deep hierarchical saliency network for salient object detection. In CVPR, pp. 678–686. Cited by: §2.1.
  • M. Merler, C. Galleguillos, and S. J. Belongie (2007) Recognizing groceries in situ using in vitro training data. In CVPR, Cited by: §2.4, §2.4.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NeurIPS Workshop, Cited by: §4.1.
  • F. Qi, X. Yang, and C. Xu (2018) A unified framework for multimodal domain adaptation. In ACM Multimedia, pp. 429–437. Cited by: §2.3.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §2.3.
  • A. Rocha, D. C. Hauagge, J. Wainer, and S. Goldenstein (2010) Automatic fruit and vegetable classification from images. Computers and Electronics in Agriculture 70 (1), pp. 96–104. Cited by: §2.4, §2.4.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §4.2.5.
  • K. Saito, Y. Ushiku, and T. Harada (2017) Asymmetric tri-training for unsupervised domain adaptation. In ICML, pp. 2988–2997. Cited by: §2.3.
  • Y. Tang and X. Wu (2017) Salient object detection with chained multi-scale fully convolutional network. In ACM Multimedia, pp. 618–626. Cited by: §2.1.
  • A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional image generation with pixelcnn decoders. In NeurIPS, pp. 4790–4798. Cited by: §2.2.
  • L. van der Maaten (2014) Accelerating t-sne using tree-based algorithms. JMLR 15 (1), pp. 3221–3245. Cited by: §4.5.3.
  • J. Wang, W. Feng, Y. Chen, H. Yu, M. Huang, and P. S. Yu (2018) Visual domain adaptation with manifold embedded distribution alignment. In ACM Multimedia, pp. 402–410. Cited by: §2.3.
  • J. Wang, H. Jiang, Z. Yuan, M. Cheng, X. Hu, and N. Zheng (2017) Salient object detection: A discriminative regional feature integration approach. IJCV 123 (2), pp. 251–268. Cited by: §3.1.1.
  • X. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu (2019) RPC: A large-scale retail product checkout dataset. CoRR abs/1901.07249. Cited by: §1, §1, §2.4, §2.4, §3.1.3, §3.1.3, §4.2.1, §4.2, §4.3, §4.
  • X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2Image: conditional image generation from visual attributes. In ECCV, pp. 776–791. Cited by: §2.2.
  • Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, pp. 589–597. Cited by: §3.2.3.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pp. 2242–2251. Cited by: §1, §2.2, Figure 5, §3.1.3, 3rd item, §4.1.