Dataset Culling: Towards Efficient Training of Distillation-based Domain Specific Models

Abstract

Real-time CNN-based object detection models for applications like surveillance can achieve high accuracy but require extensive computation. Recent work has shown that computation costs can be significantly reduced with domain-specific network settings. However, this prior work focused on inference only: if the domain model requires frequent retraining, training and retraining costs can become a significant bottleneck. To address training costs, we propose Dataset Culling: a pipeline that significantly reduces the size of the training dataset for domain-specific models. Dataset Culling shrinks the dataset by filtering out non-essential training data and by reducing the resolution of each image until detection performance degrades. Both operations use a confusion loss metric, which enables us to execute the culling with minimal computation overhead. On a custom long-duration dataset, we show that Dataset Culling can reduce training costs by 47× with no accuracy loss, or even with slight improvement. Code is available at https://github.com/kentaroy47/DatasetCulling

Kentaro Yoshioka
Stanford University / Toshiba
Computer Science
kyoshioka47@gmail.com
                  
Edward Lee, Simon Wong, Mark Horowitz
Stanford University
Electrical Engineering and Computer Science
{edhlee, sswong}@stanford.edu, horowitz@ee.stanford.edu


Index Terms—  Object Detection, Training Efficiency, Distillation, Dataset Culling, Deep Learning

1 Introduction

Convolutional neural network (CNN) object detectors have recently achieved significant improvements in accuracy [1][2] but have also become more computationally expensive. Since CNNs generally obtain better classification performance with larger networks, there is a tradeoff between accuracy and computation cost (or efficiency). One way around this tradeoff is to leverage application and domain knowledge. For example, models for stationary surveillance and traffic cameras must detect pedestrians and cars but not different species of dogs; as the space of images decreases, smaller models can be used.

Recent approaches leverage domain specialization to train compact domain-specific models (DSMs) with distillation [3]. Compact student models can achieve high accuracy when trained with sufficient domain data, and such student models can be 10-100× smaller than the teacher. [4] utilizes this idea in a model cascade, [5] pushes the idea to the extreme by frequently retraining extremely small student models, and [6] uses unlabeled data to augment the student dataset.

Fig. 1: Dataset Culling aims to reduce the size of the unlabelled training data (number of images and image resolution). Therefore, the computational costs of both student training and teacher labeling can be significantly reduced.

The computation cost of conventional teacher-student frameworks has three components: 1) inference cost for the student, 2) inference cost for the teacher (for labeling), and 3) training cost for the student. Importantly, small student models may require frequent retraining to cancel out drift in the data statistics of the environment. For example, in a traffic surveillance setting, the number of pedestrians that must be detected changes seasonally over the course of the year. Periodic retraining may be required every day or even every minute, and as one might expect, the smaller the student model, the shorter the retraining interval. Therefore, with a small model one can achieve computationally efficient inference but at high (re)training overheads. For our surveillance application, a day's worth of surveillance data (86,400 images at 1 fps) required 100 GPU hours (NVIDIA K80 on AWS P2) to train.

Fig. 2: Our Dataset Culling pipeline. First, by culling the data with the confusion loss, the dataset size is reduced 50× (in surveillance). The dataset is further reduced 6× by culling with precision, using teacher predictions. Finally, the dataset image resolution is dynamically scaled to reduce computation by another 1.2-6×.

Prior work has discussed ways to improve the computation cost of the student model during inference. However, there has been little focus on the costs associated with (re)training or with the teacher. Our contributions are:

  • We propose Dataset Culling, which significantly reduces the computation cost of training. We show speedups in training of over 47× with no accuracy penalty. To the best of our knowledge, this is the first successful demonstration of improving the training efficiency of DSMs.

  • Dataset Culling uses the student's predictions to keep only the data essential for training. We show that the dataset size can be culled by a factor of 300 to minimize training costs.

  • In addition, we extend Dataset Culling to not only reduce the number of training samples but also optimize the CNN input image resolution, further improving inference and training costs.

Fig. 3: Object detection results of Dataset Culling on 3 scenes from the surveillance dataset (top to bottom: Coral, Jackson, Kentucky). Accuracy and the computation cost per image (GFLOPs) are shown. The student model is trained with a culled dataset of target size 128, and the image resolution is set automatically (shown as optResolution). While resolution scaling introduces an accuracy penalty (1% mAP on average), it dramatically improves the computation cost. For example, the computation cost for inference is improved by up to 18× for Coral.

2 Efficient training of DSMs

The role of DSMs is to achieve high object detection performance with a small model. However, training DSMs can itself be computationally problematic. To reduce training time, we propose Dataset Culling. This procedure 1) filters out data samples from the training dataset that are believed to be easy to predict, and 2) reduces the image resolution of the dataset until detection performance begins to degrade (shown in Fig. 1). By reducing both the dataset size and the image resolution, we cut 1) the expensive teacher inference for labelling, 2) the number of training steps for the student, and 3) the total computation required for each training and inference step of the DSM. Previous work [4] ran both the student and the teacher on all data samples for training, incurring significant computation costs.

2.1 Dataset Culling

The Dataset Culling pipeline is illustrated in Fig. 2. We first assess the difficulty of a new stream of data by running inference through the DSM. During training, model parameters are only updated when the label and the prediction differ; in other words, "easy" data for which the student already makes good predictions does not contribute to training. The confusion loss we design (shown below) assesses the difficulty of a data sample from the student's output probabilities. For example, if the model's output probability for an object class is high, we assume that the sample is reasonably easy to infer; similarly, if the probability is very low, the region is likely background. Intermediate values, however, mean that the data is hard to infer.

We design a confusion loss metric to measure the number of "confusing" objects in each image, and only data samples with high confusion loss are kept. The loss is designed to also account for the total number of objects in the image, since focusing only on "confusing" objects ends up selecting object-less but noisy images (e.g., images at midnight), which do not contribute to training. The confusion loss is expressed by:

Loss_conf(p) = −|p − 0.5|^Q + C

Here, the input p is the prediction confidence, C = 0.5^Q is a constant that sets the intercept to zero, and Q sets the weighting of low-confidence predictions. In our experiments, Q and C are chosen to roughly weight low-confidence detections more than confident detection results. The precise form of the loss function is not essential; we observed similar results with other functions that emphasize unconfident predictions over confident ones. When the model provides multiple detection results, the loss is computed for each prediction and summed to obtain the overall loss of the image. This first stage of culling yields a 10× to 50× reduction in the size of the training data.
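For concreteness, the sketch below shows one way this first stage could be implemented. It is a minimal sketch, not the paper's released code: the exponent Q = 2 and the 2% keep ratio are illustrative assumptions (the paper's exact constants were lost in extraction), and student_detect is a hypothetical callable returning the student's per-detection confidence scores for an image.

```python
import numpy as np

def confusion_loss(confidences, Q=2):
    """Confusion loss of one image: detections near 0.5 confidence
    (hard/confusing) contribute the most; confident detections and
    near-zero (background) detections contribute almost nothing."""
    p = np.asarray(confidences, dtype=np.float64)
    C = 0.5 ** Q  # sets the intercept to zero at p = 0 and p = 1
    return float(np.sum(-np.abs(p - 0.5) ** Q + C))

def cull_by_confusion(images, student_detect, keep_ratio=0.02):
    """Stage 1: score every frame with the cheap student model and keep
    the most confusing frames (a 10-50x reduction in the paper)."""
    scored = sorted(((confusion_loss(student_detect(im)), i)
                     for i, im in enumerate(images)), reverse=True)
    n_keep = max(1, int(len(images) * keep_ratio))
    return [i for _, i in scored[:n_keep]]
```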

Next, in the second stage of culling, we feed the remaining samples into the computationally expensive teacher model. We compare the predictions of the teacher and the student on each data sample and use the difference to directly determine its difficulty: we compute the student's average precision by treating the teacher predictions as ground truths. This second stage further reduces the number of data samples by 6×. Furthermore, in some cases we can even improve the student's mAP, as we eliminate data that adds little to no feedback for enhancing the student.
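A sketch of this second stage, under the same assumptions as above: average_precision stands in for any standard AP-at-50%-IoU routine, teacher_detect and student_detect are hypothetical detection callables, and the default target size of 128 mirrors one of the settings used in the experiments.

```python
def cull_by_precision(kept_ids, images, student_detect, teacher_detect,
                      average_precision, target_size=128):
    """Stage 2: on frames surviving stage 1, run the expensive teacher,
    treat its detections as ground truth, and keep the frames where the
    student scores worst (i.e., the hardest examples)."""
    scored = []
    for i in kept_ids:
        student_preds = student_detect(images[i])  # boxes, labels, scores
        teacher_preds = teacher_detect(images[i])  # used as pseudo ground truth
        scored.append((average_precision(student_preds, teacher_preds), i))
    scored.sort()  # lowest AP (hardest) first
    return [i for _, i in scored[:target_size]]
```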

2.2 Scaling of image resolution

Dynamic scaling of image resolution is the second technique in Dataset Culling. The idea is that a smaller image size reduces the number of multiply-and-add operations and the total number of activations in the model. For example, a 2× reduction in image resolution yields a 4× improvement in computational efficiency. To perform resolution scaling, we take advantage of the fact that object detection difficulty depends on the scene and the application itself. For example, objects of interest in indoor and sports scenes are usually large and relatively easy to classify at low resolution, but traffic-monitoring cameras demand high resolutions in order to monitor both relatively small pedestrians and large vehicles. Traditionally, human supervision or expensive teacher inference was required to tune the resolution [7].

Dataset Culling integrates dynamic resolution scaling with low computational overhead. We first feed the image at full resolution into the student model and compute the confusion loss. We then downsample the image, run inference again, and recompute the confusion loss. These downsampling operations are applied recursively until the change in confusion loss exceeds a predefined threshold, since a large change indicates that objects are becoming harder to infer; the resolution finally kept for the dataset is the smallest one reached before the threshold is exceeded. In our implementation, we compare each downsampled result against the full-resolution inference result by computing the mean-squared error (MSE) of the confusion loss, as shown in Fig. 4.

One limitation of dynamic scaling is the strong assumption that the overall object size is constant throughout training and runtime. For example, pedestrians in a CCTV camera do not suddenly become larger at test time unless the surveillance camera is moved to a different position; in such a case, however, the model would require retraining anyway.
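A sketch of the scaling loop, reusing confusion_loss from the earlier sketch: the scale grid and threshold are illustrative assumptions rather than values from the paper, and resize is a hypothetical helper that downsamples an image by the given factor.

```python
def pick_resolution(image, student_detect, resize,
                    scales=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5), threshold=0.1):
    """Downsample recursively until the confusion loss drifts too far from
    the full-resolution result; keep the last scale inside the threshold."""
    base = confusion_loss(student_detect(image))  # full-resolution reference
    best = scales[0]
    for s in scales[1:]:
        loss = confusion_loss(student_detect(resize(image, s)))
        if (loss - base) ** 2 > threshold:  # squared error vs. full resolution
            break                           # objects are becoming hard to infer
        best = s
    return best
```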

                                       Target dataset size
                                 64            128           256           Full    No train
Surveillance (86,400 training images)
  Accuracy [mAP]                 85.56 (-3.0%) 88.3 (-0.3%)  89.3 (+0.8%)  88.5    58.6
  Total train time [GPU hrs]     1.9 (54×)     2.0 (50×)     2.2 (47×)     104     -
    Student training             0.07          0.14          0.28          96      -
    Student prediction           1.54          1.54          1.54          0       -
    Teacher prediction           0.33          0.33          0.33          8       -
Sports (3,600 training images)
  Accuracy [mAP]                 93.7 (-0.1%)  93.8 (0%)     93.8 (0%)     93.8    80.7
  Total train time [GPU hrs]     0.16 (16×)    0.23 (11×)    0.40 (6×)     2.5     -
    Student training             0.07          0.14          0.28          2       -
    Student prediction           0.06          0.06          0.06          0       -
    Teacher prediction           0.03          0.03          0.06          0.5     -

Table 1: We evaluate how culling the dataset impacts accuracy. Here, we perform dataset size reduction with the first two stages in Fig. 2 (culling by confusion and by precision), with no resolution scaling of images. Time is reported in GPU hours.

3 Experiments

Filtering strategy       mAP     GPU hours
Intermittent Sampling    0.731   0.15
Confusion only           0.911   1.7
Precision only           0.954   8.0
Confusion + Precision    0.948   2.0
Full dataset             0.958   104

Table 2: Ablation study and comparison of Dataset Culling strategies, conducted on the Jackson dataset. Our approach of filtering by both confusion loss and data difficulty has a good balance of accuracy and computation: confusion-only culling misses a larger share of the samples kept by precision-only culling than confusion + precision does. All strategies use a target dataset size of 128.

Models. For experiments, we use Faster R-CNN object detection models pretrained on MS-COCO [1][8]. We use ResNet-101 (res101) as the teacher and ResNet-18 (res18) as the student backbone for the region proposal network (RPN) [9]. We expect similar outcomes when MobileNet [10] is used for the RPN, since the two models achieve similar ImageNet accuracy. We chose Faster R-CNN for its accuracy, but Dataset Culling can be applied to other object detection frameworks with similar training procedures, such as SSD and YOLO [11][12].
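As an illustration, a teacher/student pair along these lines could be assembled with torchvision as below. This is a sketch only, not the paper's codebase: the FPN backbones, the pretrained flag (which follows older torchvision releases), and the class count are assumptions.

```python
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

def build_models(num_classes=5):  # e.g. up to 4 domain classes + background
    # Teacher: large, accurate backbone used mainly for pseudo-labeling.
    teacher = FasterRCNN(resnet_fpn_backbone('resnet101', pretrained=True),
                         num_classes=num_classes)
    # Student: compact backbone for cheap domain-specific inference.
    student = FasterRCNN(resnet_fpn_backbone('resnet18', pretrained=True),
                         num_classes=num_classes)
    return teacher, student
```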

Custom Long-Duration Dataset. We curate 8 long-duration videos from YouTube Live to evaluate Dataset Culling. As manually labelling all frames of a video is cumbersome, we label the dataset by treating the teacher predictions as ground-truth labels, as in [4]. In this paper, we report accuracy as mean average precision (mAP) at 50% IoU. We refer to 5 of the videos as "surveillance"; each consists of a 30-hour fixed-view stream with 1 to 4 detection classes. The first 24 hours (86,400 images at 1 fps) are used for training and the subsequent 6 hours (21,600 images) for validation. We refer to the other 3 videos as "sports"; they consist of 2 hours (7,200 images) of handheld video, with "person" as the only class. We split the images evenly: the first 3,600 for training and the latter 3,600 for testing.
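Pseudo-labeling the training split with the teacher might look like the sketch below, assuming torchvision-style detection outputs (eval-mode models return dicts of boxes, labels, and scores); the confidence threshold is an illustrative choice, not a value from the paper.

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, frames, score_thresh=0.5, device='cuda'):
    """Treat confident teacher detections as ground-truth labels
    for training the student, as in [4]."""
    teacher.eval().to(device)
    targets = []
    for frame in frames:                      # frame: CxHxW float tensor in [0, 1]
        out = teacher([frame.to(device)])[0]  # dict with boxes/labels/scores
        keep = out['scores'] > score_thresh   # drop low-confidence detections
        targets.append({'boxes': out['boxes'][keep].cpu(),
                        'labels': out['labels'][keep].cpu()})
    return targets
```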

Results. Object detection results on randomly selected images from 3 scenes are shown in Fig. 3, and mAP and computation costs are reported in Table 1. For surveillance, domain-specific training of the small network improved accuracy by 31% over the COCO-pretrained student. Compared to training on the full dataset, Dataset Culling improves the training time 47× when reducing the dataset to a target size of 256. A slight increase in accuracy is observed because Dataset Culling has an effect similar to hard example mining [13], where training on limited but difficult data benefits model accuracy. Since the reported training time includes running inference on the entire training set, culling the dataset further does not improve the training time much. However, for the smaller sports dataset, increasing the culling ratio does contribute to training efficiency, because the student training time occupies a significant portion of the total.

Fig. 4: The image resolution was scaled manually to observe the change in mAP (blue solid line) and the computed MSE of the confusion loss (black dashed line) in two domains. The red star indicates the operating point chosen by Dataset Culling, achieving both high accuracy and low computation cost.

Ablation study. We perform an ablation study as shown in Table 2. We construct a difficult dataset of target size 128 with four filtering techniques and compare the mAPs of the resulting student models. While filtering with only the precision metric (no confusion-loss culling or image rescaling) achieves the highest accuracy, its training time is 4× higher than our final approach (confusion + precision). With both culling procedures, we realize a good balance of accuracy and computation.

Scaling of dataset. Fig. 4 shows the dynamic image-scaling results. Dataset Culling provides a well-tuned image resolution, satisfying both accuracy and computation-cost targets. For sports (Badminton), the objects of interest are large and easy to detect against the uniform green floor, so Dataset Culling selects a resolution scale of 0.5×. For traffic surveillance in Jackson, our procedure selects a scale of 0.8×.

4 Conclusions

While domain-specific models can dramatically decrease the cost of inference, if these models need frequent retraining, training costs can become a problem. We show how simple outputs from the small DSM can be used to filter images into compact training sets for retraining, a process we call Dataset Culling. Since the filtering only requires running the small DSM, its overhead is low, and it can significantly reduce training time with little to no penalty in accuracy. We demonstrated this idea using a custom long-duration surveillance dataset for evaluation.

References

  • [1] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [2] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [3] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [4] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia, “Noscope: optimizing neural network queries over video at scale,” Proceedings of the VLDB Endowment, vol. 10, no. 11, pp. 1586–1597, 2017.
  • [5] Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian, “Online model distillation for efficient video inference,” arXiv preprint arXiv:1812.02699, 2018.
  • [6] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He, “Data distillation: Towards omni-supervised learning,” arXiv preprint arXiv:1712.04440, 2017.
  • [7] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica, “Chameleon: scalable adaptation of video analytics,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. ACM, 2018, pp. 253–266.
  • [8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
  • [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [11] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [12] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
  • [13] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick, “Training region-based object detectors with online hard example mining,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.