Object Detection Using Deep CNNs Trained on Synthetic Images
The need for large annotated image datasets for training Convolutional Neural Networks (CNNs) has been a significant impediment to their adoption in computer vision applications. We show that with transfer learning an effective object detector can be trained almost entirely on synthetically rendered datasets. We apply this strategy to detecting packaged food products clustered in refrigerator scenes. Our CNN trained only with 4000 synthetic images achieves a mean average precision (mAP) of 24 on a test set with 55 distinct products as objects of interest and 17 distractor objects. A further increase of 12 points in the mAP is obtained by adding only 400 real images to these 4000 synthetic images in the training set. A high degree of photorealism in the synthetic images was not essential in achieving this performance. We analyze factors like training data set size and 3D model dictionary size for their influence on detection performance. Additionally, training strategies like fine-tuning with selected layers and early stopping, which affect transfer learning from synthetic scenes to real scenes, are explored. Training CNNs with synthetic datasets is a novel application of high-performance computing and a promising approach for object detection applications in domains where there is a dearth of large annotated image data.
The field of Computer Vision has reached new heights over the last few years. In the past, methods like DPMs, SIFT and HOG were used for feature extraction, and linear classifiers were used for making predictions. Other methods used correspondences between template images and the scene image. Later works focused on class-independent object proposals using segmentation, with classification based on hand-crafted features. Today, methods based on Deep Neural Networks (DNNs) have achieved state-of-the-art performance on image classification, object detection, and segmentation [6, 7]. DNNs have been successfully deployed in numerous domains [6, 7]. Convolutional Neural Networks (CNNs), specifically, have fulfilled the demand for a robust feature extractor that can generalize to new types of scenes. CNNs were initially deployed for image classification and later extended to object detection. The R-CNN approach used object proposals and features from a pre-trained object classifier. Recently published works like Faster R-CNN and SSD learn object proposals and object classification in an end-to-end fashion.
The availability of large sets of training images has been a prerequisite for successfully training CNNs. Manual annotation of images for object detection, however, is a time-consuming and mechanical task; what is more, in some applications the cost of capturing images with sufficient variety is prohibitive. In fact, the largest image datasets are built upon only a few categories for which images can be feasibly curated (20 categories in PASCAL VOC, 80 in COCO, and 200 in ImageNet). In applications where a large set of intra-category objects needs to be detected, supervised learning with CNNs is even tougher, as it is practically impossible to collect sufficient training material.
Solutions have been proposed to reduce annotation effort by employing transfer learning or by simulating scenes to generate large image sets. The research community has proposed multiple approaches to the problem of adapting vision-based models trained in one domain to a different domain [14, 15, 16, 17, 18]. Examples include: re-training a model in the target domain; adapting the weights of a pre-trained model; using pre-trained weights for feature extraction; and learning common features between domains.
Attempts to use synthetic data for training CNNs to adapt to real scenarios have been made in the past. Peng et al. used available 3D CAD models, both with and without texture, and rendered images after varying the projections and orientations of the objects, evaluating on 20 categories in the PASCAL VOC 2007 data set. The CNN employed in their approach used a general object proposal module which operated independently from the fine-tuned classifier network. In contrast, Su and coworkers used 2D images rendered from 3D models on varying backgrounds for pose estimation. Their work also uses an object proposal stage and limits the objects of interest to a few specific categories from the PASCAL VOC data set. Georgakis and coworkers propose to learn object detection with synthetic data generated by superimposing object instances into real scenes at different positions, scales, and illumination. They propose the use of existing object recognition data sets such as BigBird rather than 3D CAD models. They limit their synthesized scenes to low-occlusion scenarios with 11 products in the GMU-Kitchens data set. Gupta et al. generate a synthetic training set by taking advantage of scene segmentation to create synthetic training examples; however, the goal is text localization instead of object detection. Tobin et al. perform domain randomization with low-fidelity rendered images from 3D meshes; however, their objective is to locate simpler polygon-shaped objects restricted to a table top in world coordinates. In [28, 29], the Unity game engine is used to generate RGB-D rendered images and semantic labels for outdoor and indoor scenes. They show that by using photo-realistic rendered images the effort for annotation can be significantly reduced. They combine synthetic and real data to train models for semantic segmentation; however, the network requires depth map information for semantic segmentation.
None of the existing approaches to training with synthetic data consider the use of synthetic image datasets for training a general object detector in a scenario where high intra-class variance is present along with high clutter or occlusion. Additionally, while previous works have compared the performance using benchmark datasets, the study of cues or hyper-parameters involved in transfer learning has not received sufficient attention. We propose to detect object candidates in the scene with large intra-class variance compared to an approach of detecting objects for few specific categories. We are especially interested in synthetic datasets which do not require extensive effort towards achieving photorealism. In this work, we simulate scenes using 3D models and use the rendered RGB images to train a CNN-based object detector. We automate the process of rendering and annotating the 2D images with sufficient diversity to train the CNN end-to-end and use it for object detection in real scenes. Our experiments also explore the effects of different parameters like data set size and 3D model repository size. We also explore the effects of training strategies like fine-tuning selective layers and early stopping  on transfer learning from simulation to reality. The rest of this paper is organized as follows: our methodology is described in section II, followed by the results we obtain reported in section III, finally concluding the paper in section IV.
Given an RGB image captured inside a refrigerator, our goal is to predict a bounding box and the object class category for each object of interest. In addition, there are a few objects in the scene that need to be neglected. Our approach is to train a deep CNN with synthetic images rendered from available 3D models. An overview of the approach is shown in Figure 1. Our work can be divided into two major parts: synthetic image rendering from 3D models, and transfer learning by fine-tuning the deep neural network with synthetic images.
II-A Synthetic Generation of Images from 3D Models
We use Blender, an open-source 3D graphics package. Its Python API makes it possible to load 3D models and automate scene rendering. We use the Cycles render engine available with Blender since it supports ray tracing to render synthetic images. Since all the required annotation data is available from the scene, we use the KITTI format with bounding-box coordinates, truncation state and occlusion state for each object in the image.
Real-world images carry a lot of information about the environment, illumination, surface materials, shapes, etc. Since the trained model must generalize to real-world images at test time, we take the following aspects into consideration when generating each scenario:
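To make the annotation step concrete, the following is a minimal sketch of writing one object in the KITTI object-detection label layout (15 space-separated fields per object). The helper name `kitti_label_line`, the class name `bottle`, and the choice to zero out the unused 3D fields are illustrative assumptions, not details taken from the rendering pipeline described here.

```python
def kitti_label_line(obj_class, bbox, truncated=0.0, occluded=0):
    """Format one object as a 15-field KITTI detection label line.

    bbox is (left, top, right, bottom) in pixels. The 3D fields (alpha,
    dimensions, location, rotation_y) are not needed for 2D detection,
    so this sketch fills them with zeros (an assumption).
    """
    left, top, right, bottom = bbox
    fields = [
        obj_class,                # object type
        f"{truncated:.2f}",       # truncation: 0.0 (inside image) .. 1.0
        str(int(occluded)),       # occlusion state: 0, 1, 2 or 3 (unknown)
        "0",                      # alpha (observation angle), unused here
        f"{left:.2f}", f"{top:.2f}", f"{right:.2f}", f"{bottom:.2f}",
        "0", "0", "0",            # 3D dimensions, unused here
        "0", "0", "0",            # 3D location, unused here
        "0",                      # rotation_y, unused here
    ]
    return " ".join(fields)


# Hypothetical object: a partly occluded bottle.
line = kitti_label_line("bottle", (48.0, 120.5, 130.0, 300.0), occluded=1)
print(line)
```

One such line would be written per object, with one label file per rendered image; the truncation and occlusion fields are what later allow truncated or heavily occluded objects to be skipped during training.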
Number of objects
Shape, Texture, and Materials of the objects
Texture and Materials of the refrigerator
Packing pattern of the objects
Position, Orientation of camera
Illumination via light sources
In order to simulate the scenario, we need 3D models, their texture information and metadata. Thousands of 3D CAD models are available online. We choose the ShapeNet database since it provides a large variety of objects of interest for our application. From ShapeNet categories such as bottles, tins, cans and food items, we selectively add 616 object models to the object repository used for generating scenes. Figure 2a shows a few of the models in the repository. This variety helps randomize the shape, texture and materials of the objects. For the refrigerator, we choose a model from Archive3D suitable for the application. The design of the refrigerator remains the same for all scenarios, though its textures and material properties are dynamically chosen.
To generate a training set of rendered images, the 3D scenes need to be distinct. The refrigerator model and 5-25 randomly selected objects from the repository are imported into each scene. To simulate the clusters of objects packed in a refrigerator in real-world scenarios, we use three placement patterns for the 3D models: grid, random and bin packing. Grid placement puts the objects on a refrigerator tray top at predefined distances. Random placement drops the objects at random locations on the tray top. Bin packing tries to optimize the usage of tray-top area, placing objects very close together to replicate the clustered arrangements common in refrigerators. The light sources are placed such that illumination varies in every scene and the images are not biased towards a well-lit environment, since refrigerators generally tend to have dim lighting. Multiple cameras are placed at random locations and orientations to render images from each scene. The refrigerator texture and material properties are dynamically chosen for every rendered image. Figure 2b shows a few rendered images used in the training set, while Figure 2c shows a subset of the real-world images used in training.
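The scene randomization described above can be sketched in plain Python, independent of Blender. Everything specific in this sketch is an assumption made for illustration: the tray dimensions and grid spacing (in centimetres), the `light_strength` range, the model names, and the function names. Bin packing is left out for brevity, and actual rendering would go through Blender's Python API rather than this code.

```python
import random

PATTERNS = ("grid", "random")  # bin packing is omitted from this sketch


def grid_positions(n, tray_w, tray_d, step):
    """Return n (x, y) tray positions on a regular grid with spacing `step`."""
    cols, rows = tray_w // step, tray_d // step
    cells = [(c * step, r * step) for r in range(rows) for c in range(cols)]
    if n > len(cells):
        raise ValueError("tray too small for the requested number of objects")
    return cells[:n]


def random_positions(n, tray_w, tray_d, rng):
    """Return n (x, y) positions drawn uniformly over the tray top."""
    return [(rng.uniform(0, tray_w), rng.uniform(0, tray_d)) for _ in range(n)]


def sample_scene(repository, rng, tray_w=60, tray_d=40, step=8):
    """Draw one randomized scene description (assumed sizes in cm)."""
    n = rng.randint(5, 25)              # 5-25 objects per scene
    models = rng.sample(repository, n)  # distinct models from the repository
    pattern = rng.choice(PATTERNS)
    if pattern == "grid":
        positions = grid_positions(n, tray_w, tray_d, step)
    else:
        positions = random_positions(n, tray_w, tray_d, rng)
    return {
        "models": models,
        "pattern": pattern,
        "positions": positions,
        "light_strength": rng.uniform(0.2, 1.0),  # hypothetical range, dim to bright
    }


rng = random.Random(0)
repository = [f"model_{i:03d}" for i in range(616)]  # placeholder model names
scene = sample_scene(repository, rng)
print(scene["pattern"], len(scene["models"]))
```

Seeding the generator makes each scene reproducible, which is convenient when a rendered image needs to be regenerated with its annotations.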
II-B Deep Neural Network Architecture, Training and Evaluation
Figure 3 provides a detailed illustration of the network architecture and workflow for the training and validation stages. For neural network training we use NVIDIA-DIGITS-DetectNet with the Caffe library as the back-end. During training, the RGB images, with a resolution of 512 x 512 pixels, are labelled in the standard KITTI format for object detection. We neglect objects that are truncated or highly occluded in the images, using the appropriate flags in the ground-truth labels generated while rendering. The dataset is then fed into a fully convolutional network (FCN) that predicts a coverage map for each detected class. The FCN, represented concisely in Figure 4, has the same structure as GoogLeNet without the data input layers and output layers. For our experiments, we initialize the FCN with weights pre-trained on ImageNet, which has earlier been helpful for transfer learning.
The bounding-box regressor predicts the bounding-box corners per grid square. We train the detector through stochastic gradient descent with the Adam optimizer using a standard learning rate. The total loss is the weighted summation of the following losses:
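L2 loss between the coverage map estimated by the network and the ground truth:
$$\frac{1}{2N}\sum_{i=1}^{N}\left|\mathrm{coverage}_i^{t}-\mathrm{coverage}_i^{p}\right|^{2}$$
where $\mathrm{coverage}^{t}$ is the coverage map extracted from the annotated ground truth and $\mathrm{coverage}^{p}$ is the predicted coverage map, while $N$ denotes the batch size.
L1 loss between the true and predicted corners of the bounding box for the object covered by each grid square:
$$\frac{1}{2N}\sum_{i=1}^{N}\left[\left|x_1^{t}-x_1^{p}\right|+\left|y_1^{t}-y_1^{p}\right|+\left|x_2^{t}-x_2^{p}\right|+\left|y_2^{t}-y_2^{p}\right|\right]$$
where $(x_1^{t}, y_1^{t}, x_2^{t}, y_2^{t})$ are the ground-truth bounding-box coordinates while $(x_1^{p}, y_1^{p}, x_2^{p}, y_2^{p})$ are the predicted bounding-box coordinates. $N$ denotes the batch size.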
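The two loss terms can be illustrated with a small pure-Python sketch. It assumes a 1/(2N) normalization over the batch, flattened coverage maps, and one box per sample; the weights `w_cov` and `w_bbox` are hypothetical placeholders, since the weighting actually used is not stated in the text.

```python
def coverage_l2_loss(cov_true, cov_pred):
    """Squared error between coverage maps, normalized by 2N (assumed).

    cov_true and cov_pred are lists of N flattened coverage maps.
    """
    n = len(cov_true)
    total = sum(
        (t - p) ** 2
        for map_t, map_p in zip(cov_true, cov_pred)
        for t, p in zip(map_t, map_p)
    )
    return total / (2 * n)


def bbox_l1_loss(box_true, box_pred):
    """Absolute corner error (x1, y1, x2, y2), normalized by 2N (assumed)."""
    n = len(box_true)
    total = sum(
        abs(t - p)
        for bt, bp in zip(box_true, box_pred)
        for t, p in zip(bt, bp)
    )
    return total / (2 * n)


def total_loss(cov_true, cov_pred, box_true, box_pred, w_cov=1.0, w_bbox=1.0):
    """Weighted sum of both terms; the weights here are placeholders."""
    return (w_cov * coverage_l2_loss(cov_true, cov_pred)
            + w_bbox * bbox_l1_loss(box_true, box_pred))


# Toy batch of one sample with a two-cell coverage map.
cov_loss = coverage_l2_loss([[1.0, 0.0]], [[0.5, 0.0]])
box_loss = bbox_l1_loss([[10, 10, 20, 20]], [[12, 9, 20, 23]])
print(cov_loss, box_loss)
```

In the real network the box term is evaluated per grid square rather than per image; the sketch collapses that to keep the arithmetic visible.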
For the validation stage, we threshold the coverage map obtained after a forward pass through the FCN, and use the bounding-box regressor to predict the corners. Since multiple bounding boxes are generated, we finally cluster them to refine the predictions. For evaluation, we compute the Intersection over Union (IoU) score. Using a threshold hyper-parameter on this score, predicted bounding boxes are classified as True Positives (TP), False Positives (FP) and False Negatives (FN). Precision (PR) and Recall (RE) are calculated from these counts, and a simplified mAP score is defined as the product of PR and RE.
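The evaluation steps above can be sketched as follows. The greedy one-to-one matching and the default IoU threshold of 0.5 are assumptions made for this illustration; the text specifies only that a threshold on IoU separates TP, FP and FN, and that the simplified score is the product of precision and recall.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Return (precision, recall, simplified score = precision * recall)."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for pred in pred_boxes:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            score = iou(pred, gt)
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            tp += 1
    fp = len(pred_boxes) - tp  # predictions with no matching ground truth
    fn = len(gt_boxes) - tp    # ground-truth boxes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, precision * recall


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(0, 0, 10, 10), (50, 50, 60, 60)]
precision, recall, score = evaluate(gt, pred)
print(precision, recall, score)
```

Because the score is a product, a detector must do well on both precision and recall to score highly; a model with many false positives is penalized even when it finds every object.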
III Results and Discussion
We evaluate our object detector, trained exclusively with synthetically rendered images, using manually annotated crowd-sourced refrigerator images. Figure 8 illustrates the variety in object textures, shapes, scene illumination and environment cues present in the test set. The real scenarios also include other objects like vegetables, fruits, etc. which need to be neglected by the detector. We refer to them as distractor objects.
All the experiments were carried out on a workstation with an Intel Core i7-5960X processor, accelerated by an NVIDIA GeForce GTX 1070 GPU. The NVIDIA-DIGITS (v5.0) tool was used to prepare and manage the databases and trained models. A hyper-parameter search over learning rate, learning rate policy, training epochs and batch size was performed for training all neural network models.
The purpose of our experiments was to evaluate the efficacy of transfer learning from rendered 3D models on real refrigerator scenarios. Hence we divide this section into two parts:
Factors affecting Transfer Learning: Here, we analyze the factors which we experimented with to achieve the best detection performance via transfer learning. We study the following factors affecting overall detection performance:
Training Dataset Size: The variety in training images used determines the performance of neural networks.
Selected Layer Fine-tuning: Features learned at each layer of a CNN are distinct and have been found to be general across domains and modalities. Fine-tuning of the final fully-connected linear classification layers has been used in practice for transfer learning across applications. Hence, we extend this idea to train several convolutional as well as linear layers of the network and evaluate the resulting performance.
Object Dictionary Size: The appearance of an object in an image of a static environment is a function of its shape, texture and surface material properties. Variance in the objects used for rendering has been observed to increase detection performance significantly.
Detection Accuracy: Here, we present the analysis of the performance on the real dataset achieved with the best detector model. (Footnote 1: Trained network weights and synthetic dataset are available at https://github.com/paramrajpura/Syn2Real.)
III-A Factors affecting transfer learning
With the other parameters, such as object dictionary size and fine-tuned network layers, held fixed, we vary the training data size from 500 to 6000 images. We observe an increase in mAP up to 4000 images followed by a slight decline in performance, as shown in Figure 5. Note that each smaller dataset is a subset of the larger ones, i.e. we incrementally added new images to the training dataset. The decline in accuracy beyond this point suggests over-fitting to synthetic data as the dataset size increases.
We use the GoogLeNet FCN architecture with 11 different hierarchical levels, treating a few inception modules as a single level (Figure 4). A chart of mAP vs. number of epochs is presented in Figure 6 for models with different layers selected for fine-tuning. Starting from training just the final coverage and bounding-box regressor layers, we sequentially open deeper layers for fine-tuning. We observe that fine-tuning all the inception modules helps transfer learning from synthetic images to real images in our application. The results show that the selection of layers to fine-tune is important for detection performance.
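One way to picture the layer-selection experiment is as a per-level learning-rate multiplier, where a multiplier of 0 freezes a level (in Caffe this corresponds to setting `lr_mult` to 0 for a layer's parameters). The sketch below is only an illustration: the level names are placeholders, and the mapping of the 11 levels to actual GoogLeNet layers is not taken from the text.

```python
def lr_multipliers(levels, n_trainable):
    """Map each level to a learning-rate multiplier.

    `levels` is ordered from input to output. The last `n_trainable`
    levels are fine-tuned (multiplier 1.0); earlier ones are frozen (0.0).
    """
    if not 0 <= n_trainable <= len(levels):
        raise ValueError("n_trainable must be between 0 and len(levels)")
    cutoff = len(levels) - n_trainable
    return {name: (1.0 if i >= cutoff else 0.0) for i, name in enumerate(levels)}


# Placeholder names for the 11 hierarchical levels.
levels = [f"level_{i}" for i in range(1, 12)]

only_heads = lr_multipliers(levels, 1)        # train just the output level
everything = lr_multipliers(levels, len(levels))  # fine-tune all levels
print(only_heads["level_11"], only_heads["level_1"])
```

Sweeping `n_trainable` from 1 up to the full depth reproduces the idea of opening progressively more of the network for fine-tuning.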
To study the relationship of variance in 3D models with performance, we incrementally add distinct 3D models to the dictionary starting from 10 to 400. We observe an increase in mAP up to 200 models and slight decline later on as represented in Figure 7.
III-B Detection Accuracy
We evaluate our best object detector model on a set of 50 crowd-sourced refrigerator scenes with all cue variances, covering 55 distinct objects of interest considered as positives and 17 distractor objects as negatives. Figure 8 shows the variety in the test set and the predicted bounding boxes for all refrigerator images. The detector achieves an mAP of 24 on this dataset, which is a promising result considering that no distractor objects were used while training with synthetic images.
We observe that the detector handles variance in scale, shape and texture, although packing patterns like vertical stacking and highly oblique camera angles lead to false predictions. A few vegetables among the distractor objects are falsely predicted as objects of interest, which suggests the influence of pre-training on the ImageNet dataset; note also that the training dataset contained no such distractor objects marked as background clutter. In Figure 5, Figure 6 and Figure 7 we report plots of mAP vs. training epochs rather than mAP vs. the varied factor, in order to also show the relevance of early stopping. The networks trained with varying factors reach their peak performance within 25-50 epochs of training, after which performance declines rather than saturating, which suggests over-fitting to synthetic images.
The question arises of how well a network trained with synthetic images fares against one trained with real-world images. Hence we compare the performance of networks trained with three different training image sets, as illustrated in Figure 9. The synthetic training set consisted of 4000 images with 200 3D object models of interest, while the real training set consisted of 400 images collected from the internet with 240 distinct products and 19 distractor objects. The hybrid set consisted of 3600 synthetic and 400 real images. All models were evaluated on a set of 50 refrigerator scenes with less than 5% object overlap between the test set and training set images. The CNN trained only with 4000 synthetic images (24 mAP) underperforms the one trained with 400 real images (28 mAP), but combining synthetic and real images raises the detection performance to 36 mAP, 12 points above the synthetic-only model, which signifies the importance of transferable cues from synthetic to real.
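Early stopping on a validation curve can be sketched as below. The patience rule and the mAP values are illustrative assumptions, not measurements from the experiments; the point is only that training halts once the validation score has stopped improving, and the best checkpoint so far is kept.

```python
def best_epoch(val_map, patience=3):
    """Return (1-based epoch, score) of the checkpoint kept by early stopping.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation score seen so far.
    """
    best_i, best = 0, float("-inf")
    for i, score in enumerate(val_map):
        if score > best:
            best_i, best = i, score
        elif i - best_i >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_i + 1, best


# Made-up validation curve that peaks and then declines.
curve = [5, 12, 18, 24, 22, 21, 19, 25]
epoch, score = best_epoch(curve, patience=3)
print(epoch, score)
```

With a patience of 3 the loop stops at the seventh value and never reaches the last one, which shows the trade-off: a short patience saves training time but can miss a later recovery.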
To improve the observed performance, several tactics can be tried. The presence of distractor objects in the test set was observed to negatively impact performance. We are working on the addition of distractor objects to the 3D model repository for rendering scenes with distractor objects to train the network to become aware of them. Optimizing the model architecture or replacing DetectNet with object proposal networks might be another alternative. Training CNNs for semantic segmentation using synthetic images and the addition of depth information to the training sets is also expected to help in the case of images with high degree of occlusion.
We acknowledge funding support from Innit Inc. consultancy grant CNS/INNIT/EE/P0210/1617/0007 and High Performance Computing Lab support from Mr. Sudeep Banerjee. We thank Aalok Gangopadhyay for the insightful discussions.
[1] D. Forsyth, “Object detection with discriminatively trained part-based models,” Computer, vol. 47, no. 2, pp. 6–7, feb 2014.
[2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, nov 2004.
[3] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings - 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. I. IEEE, 2005, pp. 886–893.
[4] S. Ekvall, F. Hoffmann, and D. Kragic, “Object recognition and pose estimation for robotic manipulation using color cooccurrence histograms,” in Proceedings 2003 IEEERSJ International Conference on Intelligent Robots and Systems IROS 2003 Cat No03CH37453, vol. 2, no. October. IEEE, 2003, pp. 1284–1289.
[5] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, sep 2013.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems. Curran Associates Inc., 2012, pp. 1097–1105.
[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07-12-June. IEEE, jun 2015, pp. 1–9.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-Based Convolutional Networks for Accurate Object Detection and Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, jan 2016.
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, jun 2017.
[10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2016, vol. 9905 LNCS, pp. 21–37.
[11] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2014.
[12] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8693 LNCS, no. PART 5, pp. 740–755, 2014.
[13] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, jun 2009, pp. 248–255.
[14] W. Li, L. Duan, D. Xu, and I. W. Tsang, “Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, jun 2014, pp. 1134–1148.
[15] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, “Efficient Learning of Domain-invariant Image Representations,” in ICLR, jan 2013, pp. 1–9.
[16] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, “LSDA: Large Scale Detection Through Adaptation,” in Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, pp. 3536–3544.
[17] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, jun 2011, pp. 1785–1792.
[18] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning Transferable Features with Deep Adaptation Networks,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, 2015, pp. 97–105.
[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, pp. 3320–3328.
[20] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, “Revisiting Batch Normalization For Practical Domain Adaptation,” Arxiv Preprint, vol. 1603.04779, no. 10.1016/B0-7216-0423-4/50051-2, mar 2016.
[21] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic Data for Text Localisation in Natural Images,” Arxiv Preprint, vol. 1604.06646, no. 10.1109/CVPR.2016.254, apr 2016.
[22] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep Domain Confusion: Maximizing for Domain Invariance,” Arxiv Preprint, vol. 1412.3474, dec 2014.
[23] X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning deep object detectors from 3D models,” Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 1278–1286, dec 2015.
[24] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views,” Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 2686–2694, may 2015.
[25] G. Georgakis, A. Mousavian, A. C. Berg, and J. Kosecka, “Synthesizing Training Data for Object Detection in Indoor Scenes,” Arxiv Preprint, vol. 1702.07836, feb 2017.
[26] A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, “BigBIRD: A large-scale 3D database of object instances,” in Proceedings - IEEE International Conference on Robotics and Automation. IEEE, may 2014, pp. 509–516.
[27] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” Arxiv Preprint, vol. 1703.06907, mar 2017.
[28] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2016, pp. 3234–3243.
[29] A. Handa, V. Patraucean, V. Badrinarayanan, S. Stent, and R. Cipolla, “SceneNet: Understanding Real World Indoor Scenes With Synthetic Data,” Arxiv Preprint, vol. 1511.07041, no. 10.1109/CVPR.2016.442, nov 2015.
[30] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315, aug 2007.
[31] Sharp and Toby, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on : date, 16-21 June 2012. IEEE, 2012.
[32] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An Information-Rich 3D Model Repository,” Arxiv Preprint, vol. 1512.03012, no. 10.1145/3005274.3005291, dec 2015.
[33] A. 3D, “Archive 3D,” 2015. [Online]. Available: http://archive3d.net/
[34] J. Barker, S. Sarathy, and A. T. July, “DetectNet : Deep Neural Network for Object Detection in DIGITS,” pp. 1–8, 2016. [Online]. Available: https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/
[35] M. P. Vlastelica, S. Hayrapetyan, M. Tapaswi, and R. Stiefelhagen, “Kit at MediaEval 2015 - Evaluating visual cues for affective impact of movies task,” in CEUR Workshop Proceedings, vol. 1436. New York, New York, USA: ACM Press, 2015, pp. 675–678.
[36] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Berlin, Heidelberg, 2012, vol. 7574 LNCS, no. PART 3, pp. 340–353.