Improving Annotation for 3D Pose Dataset of Fine-Grained Object Categories
Existing 3D pose datasets of object categories are limited to generic object types and lack fine-grained information. In this work, we introduce a new large-scale dataset that consists of 409 fine-grained categories and 31,881 images with accurate 3D pose annotation. Specifically, we augment three existing fine-grained object recognition datasets (StanfordCars, CompCars and FGVC-Aircraft) by finding a specific 3D model for each sub-category from ShapeNet and manually annotating each 2D image by adjusting a full set of 7 continuous perspective parameters. Since the fine-grained shapes allow 3D models to better fit the images, we further improve the annotation quality by initializing from the human annotation and conducting a local search over the pose parameters, with the objective of maximizing the IoU between the projected mask and the segmentation reference estimated by state-of-the-art deep Convolutional Neural Networks (CNNs). We provide full statistics of the annotations, with qualitative and quantitative comparisons suggesting that our dataset can be a complementary source for studying 3D pose estimation. The dataset can be downloaded at http://users.umiacs.umd.edu/~wym/3dpose.html.
In the past few years, the fast-paced progress of generic image recognition on ImageNet has drawn increasing attention to classifying fine-grained object categories [11, 25], e.g., bird species, car makes and models. However, simply recognizing object labels is still far from solving many industrial problems where we need a deeper understanding of other attributes of the objects. On the other hand, estimating 3D object pose from a single 2D image is an indispensable step in various practical applications, such as vehicle damage detection, novel view synthesis [32, 21], grasp planning and autonomous driving. In this work, we introduce the problem of estimating 3D pose for fine-grained objects from monocular images. We believe this will become an important component in broader tasks, contributing to both fine-grained object recognition and 3D object pose estimation.
To address this task, collecting suitable data is of vital importance. However, due to the expensive annotation cost, most existing 3D pose datasets only provide accurate ground truth annotations for a few object classes, and the number of instances associated with each category is quite small. Although there are two large-scale pose datasets, Pascal3D+ and ObjectNet3D, both of them are collected for generic object types, and there is still no large-scale 3D pose dataset for fine-grained object categories. Moreover, these datasets lack accurate pose information, since different objects in one hyper class (i.e., cars) are only matched with a few generic 3D shapes, leading to a high projection error that makes it harder for human annotators to find the accurate pose, as demonstrated in Figure 1.
In this work, we introduce a new benchmark pose estimation dataset for fine-grained object categories. Specifically, we augment three existing fine-grained recognition datasets, StanfordCars , CompCars  and FGVC-Aircraft , with two types of useful 3D information: (1) for each object in the image, we manually annotate the full perspective projection represented by 7 continuous pose parameters; (2) we provide an accurate match of the computer aided design (CAD) model for each fine-grained object category. The resulting augmented dataset consists of more than 30,000 images for over 400 fine-grained object categories. Table 1 shows the general statistics of our dataset.
To the best of our knowledge, our dataset is the very first one which employs fine-grained category aware 3D models in pose annotation. To fully utilize the valuable fine-grained information, we further develop an automatic pose refinement mechanism to improve over the human annotations. Thanks to the fine-grained shapes, an accurate pose parameter also leads to the optimal segmentation overlap between the projected 2D mask from the 3D model and the target object ground truth segmentation. We hence conduct a local greedy search over the 7 full perspective pose parameters, initialized from the human annotation, to maximize the segmentation overlap objective. To avoid effort on segmentation annotation, we utilize state-of-the-art image segmentation models including both Mask R-CNN  and DeepLab v3+  to obtain the as-accurate-as-possible segmentation reference. This process significantly improves our annotation quality. Figure 2 illustrates this process.
In summary, our contribution is three-fold. (1) We collect a new large-scale 3D pose dataset for fine-grained objects with more accurate annotations, which can be viewed as a complementary source to the existing pose datasets. (2) Our pose annotation contains the full set of perspective model parameters, including the camera focal length, which makes it a more challenging benchmark for developing algorithms beyond only estimating viewpoint angles (azimuth) or recovering rotation matrices. (3) We propose a simple but effective way to automatically refine the pose annotation based on segmentation cues. With the corresponding fine-grained 3D model, this method can automatically refine the object pose while significantly reducing the human labeling effort.
|Dataset||# class||# image||annotation||fine-grained|
|3D Object ||10||6,675||discretized view||✗|
|EPFL Car ||1||2,299||continuous view||✗|
|IKEA ||11||759||2d-3d alignment||✗|
|Pascal3D+ ||12||30,899||2d-3d alignment||✗|
|ObjectNet3D ||100||90,127||2d-3d alignment||✗|
|Total (Ours)||409||31,881||2d-3d alignment||✓|
[Figure: annotation pipeline. Panels: Fine-grained 2D Image, Fine-grained 3D Model, Initial Pose by Human, Initial 2D Segmentation, Segmentation Reference, Final Adjusted Pose. Stages: (1) human pose annotation; (2) segmentation based pose refinement.]
2 Related Work
3D Pose Estimation Dataset. Due to the 3D ambiguity of 2D images and the heavy annotation cost, earlier object pose datasets are limited not only in scale but also in the types of annotation they cover. Table 1 provides a quantitative comparison between our dataset and previous ones. For example, the 3D Object dataset only provides viewpoint annotation for 10 object classes with 10 instances for each class. The EPFL Car dataset consists of 2,299 images of 20 car instances captured at multiple azimuth angles. Moreover, the other parameters, including elevation and distance, are kept almost the same for all the instances in order to simplify the problem. Pascal3D+ is perhaps the first large-scale 3D pose dataset for generic object categories, with 30,899 images from 12 different classes of the Pascal VOC dataset. Recently, ObjectNet3D further extends the dataset scale to 90,127 images of 100 categories. Both Pascal3D+ and ObjectNet3D assume a camera model with 6 parameters to annotate. However, different images in one hyper class (i.e., cars) are usually matched with a few coarse 3D CAD models, so the projection error might be large in some cases due to the lack of accurate CAD models. Being aware of these problems, we project fine-grained CAD models to match objects in the 2D images. In addition, our dataset surpasses most previous ones in both the number of images and the number of classes.
Fine-Grained Recognition Dataset. Fine-grained recognition refers to the task of distinguishing sub-ordinate categories [27, 12, 25]. In earlier work, 3D information was a common source of recognition performance improvement [33, 28, 19, 24]. As deep learning prevails and fine-grained datasets become larger, the effect of 3D information on recognition diminishes [16, 11]. Recently, 3D bounding boxes have been incorporated into a deep framework for images of cars taken from a fixed camera. On the other hand, almost all existing fine-grained datasets lack 3D pose labels or 3D shape information, and pose estimation for fine-grained object categories is not well studied. Our work fills this gap by annotating poses and matching CAD models on three existing popular fine-grained recognition datasets.
3D Model Dataset. Similar to previous pose datasets, we adopt the 2D-3D alignment method to annotate object poses. Annotating in such a way requires a source of accurate 3D models of objects. Luckily, there has been substantial growth in the number of 3D models available online over the last decade [3, 5, 10, 14], with well-known repositories like the Princeton Shape Benchmark, which contains around 1,800 3D models grouped into 90 categories. In this work, we use ShapeNet, the largest 3D CAD model database to date, which has indexed more than 3,000,000 models, 220,000 of which are classified into 3,135 categories including various object types such as cars, airplanes, bicycles, etc. The large number of 3D models allows us to find an exact model for many of the objects in natural images. For example, ShapeNet provides 183,533 models for the car category and 114,045 models for the airplane category. Note that although we only annotate three fine-grained datasets, our annotation framework can be applied to build more 3D pose datasets, thanks to large-scale datasets like ShapeNet and iNaturalist.
3 Dataset Construction
Building our 3D pose dataset involves two main processes: (1) human pose annotation, and (2) segmentation based pose refinement. Figure 3 illustrates the whole process. Our human pose annotation process is similar to that of ObjectNet3D but requires more effort on selecting finer 3D models. We first select the most appropriate 3D model from ShapeNet for each object in the fine-grained image dataset. We then obtain the pose parameters by asking the annotators to align the projection of the 3D model to the corresponding image using our designed interface.
Although humans can initiate the pose annotation with reasonably high efficiency and accuracy, we find it hard for them to adjust the fine details of the pose. Our second-stage segmentation based pose refinement further adjusts the pose parameters by performing a local greedy search initialized from the human annotation. We discuss the details of each process in the next subsections.
3.1 3D Models
We build three fine-grained 3D pose datasets. Each dataset consists of two parts: 2D images and 3D models. The 2D images are collected from StanfordCars, CompCars and FGVC-Aircraft respectively. Unlike Pascal3D+ and ObjectNet3D, the target objects in most images are non-occluded and easy to identify. In order to distinguish between fine-grained categories, we adopt a distinct model for each category. Thanks to ShapeNet, a large number of 3D models for fine-grained objects are available with make/model names in their metadata, which are used to find the corresponding 3D model given an image category name. If there is no exact match between a category name and the metadata, we manually select a visually similar 3D model for that category. For StanfordCars, we annotate images for all 196 categories, of which 148 categories have exactly matched 3D models. For CompCars, we only include the 113 categories with matched 3D models in ShapeNet. For FGVC-Aircraft, we annotate images for all 100 categories, with more than 70 matched models. To the best of our knowledge, our dataset is the very first one which employs fine-grained category aware 3D models in 3D pose estimation.
3.2 Camera Model
The world coordinate system is defined in accordance with the 3D model coordinate system. In this case, a point $\mathbf{X}$ on a 3D model is projected onto a point $\mathbf{x}$ on a 2D image (both expressed in homogeneous coordinates, with equality up to scale):
$$\mathbf{x} = P\,\mathbf{X},$$
via a perspective projection matrix:
$$P = K\,[R \mid T],$$
where $K$ denotes the intrinsic parameter matrix:
$$K = \begin{bmatrix} f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{bmatrix},$$
and $R$ encodes a $3 \times 3$ rotation matrix between the world and camera coordinate systems, parameterized by three angles, i.e., elevation $e$, azimuth $a$ and in-plane rotation $\theta$. We assume that the camera is always facing towards the origin of the 3D model. Hence the translation $T = [0, 0, d]^\top$ is only defined up to the model depth $d$, the distance between the origins of the two coordinate systems, and the principal point $(u, v)$ is the projection of the origin of the world coordinate system on the image. As a result, our model has 7 continuous parameters in total: camera focal length $f$, principal point location $u$, $v$, azimuth $a$, elevation $e$, in-plane rotation $\theta$ and model depth $d$. Note that, since the images are collected online, the annotated intrinsic parameters ($f$, $u$ and $v$) are approximations. Compared to previous datasets [30, 29] with 6 parameters ($f$ fixed), our camera model considers both the camera focal length and the object depth in a full perspective projection for finer 2D-3D alignment, which allows for a more flexible pose adjustment and a better shape matching.
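As a minimal sketch of the camera model above, the following pure-Python snippet projects a 3D model point to pixel coordinates from the 7 pose parameters. The function names (`rotation`, `project`), the argument names, and in particular the rotation order and axis assignment are assumptions made for illustration; the dataset's actual angle convention may differ.

```python
import math


def _matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation(azimuth, elevation, inplane):
    """Compose a rotation matrix from three angles in radians.

    ASSUMPTION: the order Rz(inplane) * Rx(elevation) * Ry(azimuth) and the
    axis choices are illustrative only, not the dataset's documented convention.
    """
    ca, sa = math.cos(azimuth), math.sin(azimuth)
    ce, se = math.cos(elevation), math.sin(elevation)
    ct, st = math.cos(inplane), math.sin(inplane)
    r_y = [[ca, 0.0, sa], [0.0, 1.0, 0.0], [-sa, 0.0, ca]]
    r_x = [[1.0, 0.0, 0.0], [0.0, ce, -se], [0.0, se, ce]]
    r_z = [[ct, -st, 0.0], [st, ct, 0.0], [0.0, 0.0, 1.0]]
    return _matmul(r_z, _matmul(r_x, r_y))


def project(point, azimuth, elevation, inplane, focal, depth, u, v):
    """Project a 3D point (model coordinates) to 2D pixel coordinates.

    Camera coordinates are R * X + T with T = (0, 0, depth); the intrinsics
    then apply the focal length and shift by the principal point (u, v).
    """
    r = rotation(azimuth, elevation, inplane)
    xc, yc, zc = (sum(r[i][j] * point[j] for j in range(3)) for i in range(3))
    zc += depth  # translation along the optical axis only
    if zc <= 0:
        raise ValueError("point lies behind the camera")
    return (focal * xc / zc + u, focal * yc / zc + v)
```

Under these assumptions the model origin always lands on the principal point, whatever the three angles are, which matches the statement that the principal point is the projection of the world origin.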
3.3 2D-3D Alignment
We annotate 3D pose information for all 2D images through crowd-sourcing. To facilitate the annotation process, we develop an annotation tool illustrated in Figure 4. For each image during annotation, we choose the 3D model according to the fine-grained label given beforehand. We then ask the annotators to adjust the 7 parameters so that the projected 3D model is aligned with the target object in the 2D image. This process can be roughly summarized as follows: (1) shift the 3D model such that the center of the model (the origin of the world coordinate system) is roughly aligned with the center of the target object in the 2D image; (2) rotate the model to the same orientation as the target object in the 2D image; (3) adjust the model depth and camera focal length to match the size of the target object in the 2D image. Some finer adjustments might be applied after the three main steps. In this way we annotate all 7 parameters across the whole dataset. On average, each image takes an experienced annotator approximately 1 minute to annotate. To ensure quality, after one round of annotation across the whole dataset, we perform a quality check and let the annotators revise the unqualified examples in a second round.
3.4 Segmentation Based Pose Refinement
[Figure 5: local search progression. Panels: Initial Pose by Human, Iteration 1, Iteration 2, Final Pose.]
Although human annotators already provide reasonably accurate annotations in the first stage, we notice that there is still room to further improve the annotation quality. This is because humans are good at providing a strong initial pose estimate, but fine-tuning the detailed pose parameters is tedious for them. Realizing that ultimately the problem is to estimate the object pose such that the projection of the 3D model aligns with the image as well as possible, we design a simple but effective iterative local greedy search algorithm to automatically adjust the pose parameters by maximizing
$$\mathrm{IoU}\big(\pi(\mathcal{M}; \phi),\, S\big)$$
over $\phi$, where $S$ is the 2D object segmentation reference and $\pi(\cdot\,; \phi)$ maps a 3D model $\mathcal{M}$ to a 2D mask according to the pose parameter $\phi$.
The algorithm aims to fine-tune the 7 pose parameters to maximize the segmentation overlap between the projected 2D mask from the 3D model and the segmentation reference. We use the traditional intersection over union (IoU) as the segmentation overlap criterion. Because the algorithm updates the pose parameters greedily, it is a local search algorithm that is guaranteed to converge to a local optimum. During the local search process, we observe that it converges in 3-10 iterations, taking about 1 minute per image on average. Algorithm 1 shows the overall process. Figure 5 illustrates the local search algorithm.
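The refinement described above can be sketched as a coordinate-wise greedy hill climb. This is an illustrative sketch, not a reproduction of Algorithm 1: the names `iou` and `greedy_refine`, the dictionary representation of the pose, the per-parameter step sizes, and the `score` callback (which would render the 3D model under a candidate pose and compare it with the segmentation reference) are all assumptions.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def greedy_refine(pose, score, steps, max_iters=10):
    """Greedy local search over pose parameters.

    pose  : dict mapping parameter name to its initial (human-annotated) value
    score : callable taking a pose dict and returning the overlap to maximize
    steps : dict mapping parameter name to a step size (hypothetical values)

    Each pass tries +step and -step on every parameter and keeps any change
    that strictly improves the score; the search stops at the first pass
    without improvement or after max_iters passes.
    """
    best = dict(pose)
    best_score = score(best)
    for _ in range(max_iters):
        improved = False
        for name, step in steps.items():
            for delta in (step, -step):
                candidate = dict(best)
                candidate[name] += delta
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best, best_score = candidate, candidate_score
                    improved = True
        if not improved:
            break
    return best, best_score
```

In practice `score` would be something like `lambda p: iou(render(model, p), reference)`, where `render` is a hypothetical mask renderer. Since every accepted move strictly increases the score, the loop can only stop at a pose that no single-step change improves, which is the local-optimum behaviour described in the text.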
[Table: columns are azimuth, elevation, in-plane rotation, focal length, model depth, and the two principal point coordinates.]
3.5 Segmentation Reference
To conduct the local greedy search, ideally we need the ground truth segmentation of the target object. Although we could set up another segmentation annotation interface for all 2D images in the three datasets through crowd-sourcing, we find that existing state-of-the-art image segmentation models such as Mask R-CNN and DeepLab v3+ already provide satisfactory segmentation results. For example, on the Pascal VOC2012 segmentation benchmark, DeepLab v3+ reaches average IoUs of 93.2 on the “car” class and 97.0 on the “aeroplane” class. Mask R-CNN, although it does not provide semantic segmentation of comparable accuracy, is able to obtain instance-level segmentation, which is particularly useful for images with more than one instance of the same class. In the end, we use a combination of both models to find the most appropriate segmentation reference. Figure 6 illustrates the process.
3.6 Dataset Statistics
We plot the distributions of the 7 parameters in Figure 7 for StanfordCars3D, CompCars3D and FGVC-Aircraft3D respectively. Due to the nature of the original fine-grained datasets, none of the parameters is uniformly distributed. Unsurprisingly, the most challenging parameter across the three datasets is azimuth, which varies across its full range, while elevation and in-plane rotation are concentrated in a small range, since the images of cars and airplanes are often taken from the ground view. The distributions of focal length and model depth are also not widespread, because objects in these fine-grained images are generally normalized and cropped to a standard size. Although the parameter distribution issue may raise concerns about learning trivial solutions, we believe that our first attempt still provides reasonable diversity in pose annotation. For example, the distributions of azimuth are quite different across the three datasets and complementary to each other. This could encourage building a more generalized pose estimation model.
3.7 Dataset Split
We split the three datasets as follows. For StanfordCars3D, since we have annotated all the images, we follow the standard train/test split provided by the original dataset, with 8144 training examples and 8041 testing examples. For CompCars3D, we randomly sample about two thirds of our annotated data as the training set and use the rest as the testing set, resulting in 3798 training and 1898 testing examples. We provide the train/test split information in the dataset release. For FGVC-Aircraft3D, we follow the standard train/test split provided by the original dataset, with 6667 training examples and 3333 testing examples.
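As a quick sanity check of the split sizes quoted above, the snippet below sums the per-dataset counts and reports the training fraction of each split. The counts are taken from the text; the dictionary layout and the function name `split_summary` are illustrative assumptions.

```python
# (train, test) example counts as stated in the text
SPLITS = {
    "StanfordCars3D": (8144, 8041),
    "CompCars3D": (3798, 1898),
    "FGVC-Aircraft3D": (6667, 3333),
}


def split_summary(splits):
    """Return {dataset: (total images, training fraction)}."""
    summary = {}
    for name, (train, test) in splits.items():
        total = train + test
        summary[name] = (total, train / total)
    return summary


summary = split_summary(SPLITS)
total_images = sum(total for total, _ in summary.values())

for name, (total, frac) in summary.items():
    print(f"{name}: {total} images, {100 * frac:.1f}% train")
print(f"all datasets: {total_images} images")
```

The three totals add up to the 31,881 images reported for the whole dataset, and the CompCars3D training fraction comes out at roughly two thirds.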
4 Dataset Comparison
4.1 Compare to Existing Dataset
We compare our annotation quality with two existing large-scale 3D pose datasets, PASCAL3D+ and ObjectNet3D. It is worth noting that we are not aiming to show the superiority of our dataset, since both previous datasets consider more general scenarios with multiple objects and challenging occlusion in an image. However, we hope that by comparing to them, we demonstrate that our fine-grained pose dataset can become a complementary resource for studying 3D pose estimation in monocular images.
Figure 8 and Figure 9 show the qualitative comparison on the “car” class and the “aeroplane” class respectively. Overall, we find our annotation more satisfying when visually comparing the overlay images, which map the 3D model onto the 2D image. To further conduct a quantitative comparison, we use the segmentation overlap between the projected 2D mask and the ground truth object mask as the evaluation measure. We randomly select 50 “car” images and 50 “aeroplane” images from each of PASCAL3D+ and ObjectNet3D. We then randomly pick 50 images from each of StanfordCars3D and FGVC-Aircraft3D. In total, we randomly select 300 images and annotate them with ground truth segmentation. Since both PASCAL3D+ and ObjectNet3D consider more complicated scenarios such as multiple objects with cluttered background, we filter out those images containing more than one object of reasonably large size for a fair comparison. Hence the average IoUs can be an optimistic estimate for both baseline datasets. Even so, our annotation shows a clear improvement in average IoU on both “car” and “aeroplane”, as demonstrated in Table 2. In particular, both the mean and the standard deviation of the segmentation IoUs improve significantly, indicating that our annotations are not only more accurate but also more stable.
|car||PASCAL3D+||ObjectNet3D||StanfordCars3D|
|Average IoUs||78.5% ± 8.6%||84.1% ± 6.0%||90.4% ± 3.3%|
|airplane||PASCAL3D+||ObjectNet3D||FGVC-Aircraft3D|
|Average IoUs||62.7% ± 13.1%||65.1% ± 11.0%||78.9% ± 9.4%|
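The mean and spread of per-image IoUs reported in such a comparison can be computed with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the text does not say which was used.

```python
import statistics


def summarize_ious(ious):
    """Return (mean, sample standard deviation) of per-image IoU values."""
    mean = statistics.mean(ious)
    # ASSUMPTION: sample standard deviation; a single value has no spread.
    std = statistics.stdev(ious) if len(ious) > 1 else 0.0
    return mean, std


def format_summary(ious):
    """Format IoUs in [0, 1] as 'mean% ± std%' with one decimal place."""
    mean, std = summarize_ious(ious)
    return f"{100 * mean:.1f}% ± {100 * std:.1f}%"
```

For instance, `format_summary([0.8, 0.9, 1.0])` on these made-up values yields a string in the same "mean ± std" form as the table entries.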
4.2 Compare to Human Annotation
We also analyze how much we gain by conducting segmentation based pose refinement. To understand this, we use the manually annotated ground truth 2D segmentation of the 100 randomly selected images from StanfordCars3D and FGVC-Aircraft3D. We then compare the average IoUs of the human annotated pose and the refined pose. Table 3 shows the improvement in segmentation overlap on these datasets. On StanfordCars3D, for example, our second-stage refinement improves the average IoU from 84.1% to 90.4%, which is significant. On FGVC-Aircraft3D, the improvement is even larger, from 65.3% to 78.9%. Figure 10 and Figure 11 illustrate the pose improvement qualitatively.
Considering segmentation overlap may not be the only appropriate quantitative measure, we further conduct a human study to compare the pose annotation quality. To do this, we hire 5 professional annotators, show them the 2D-3D alignment of the same image with annotation in the two stages simultaneously and let them rate the relative quality for the 50 selected images in each dataset. The relative comparison consists of “Worse”, “Equal” and “Better”, indicating the second-stage pose is either significantly worse, roughly equal or significantly better than the first-stage human annotation from the subjective point of view. Table 4 shows the human study result. Most of the time, the second-stage refined pose is either roughly equal or significantly better than the initial human annotation, suggesting the benefit of utilizing segmentation cues to facilitate the pose search.
|Average IoUs||Human Annotation||Refined Annotation|
|StanfordCars3D||84.1% ± 6.2%||90.4% ± 3.3%|
|FGVC-Aircraft3D||65.3% ± 19.9%||78.9% ± 9.4%|
In summary, we introduce the new problem of 3D pose estimation for fine-grained object categories from a monocular image. We annotate three popular fine-grained recognition datasets with 3D shapes and poses, resulting in a total of 31,881 images with 409 classes. By utilizing image segmentation as an intermediate cue, we further improve the pose annotation quality. It is worth noting that humans may ultimately produce better annotations given unlimited time, but the segmentation based pose refinement offers a better trade-off between cost and accuracy.
There is still a need for future work to continue the improvement. First, the set of super-categories should be enlarged with more fine-grained datasets. Second, the current fine-grained datasets are less challenging in terms of background clutter and object size. Third, while all existing large-scale pose datasets are limited to rigid objects, it is still necessary to develop methods for non-rigid objects. Finally, it is also possible to develop a neural network architecture to replace the segmentation based pose refinement and combine it with the human annotation interface. We leave these as future work.
- A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
- L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
- X. Chen, A. Golovinskiy, and T. Funkhouser. A benchmark for 3D mesh segmentation. In ACM Transactions on Graphics (TOG), volume 28, page 73. ACM, 2009.
- X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In CVPR, 2016.
- X. Chen, A. Saparov, B. Pang, and T. Funkhouser. Schelling points on 3d surface meshes. ACM Transactions on Graphics (TOG), 31(4):29, 2012.
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
- A. Ghodrati, M. Pedersoli, and T. Tuytelaars. Is 2D information enough for viewpoint estimation? In BMVC, 2014.
- K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
- S. Jayawardena et al. Image based automatic vehicle damage detection. PhD thesis, The Australian National University, 2013.
- V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3d shapes. ACM Transactions on Graphics (TOG), 32(4):70, 2013.
- J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pages 301–320. Springer, 2016.
- J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshops on 3D Representation and Recognition, 2013.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- B. Li, Y. Lu, C. Li, A. Godil, T. Schreck, M. Aono, M. Burtscher, H. Fu, T. Furuya, H. Johan, et al. SHREC'14 track: Extended large scale sketch-based 3D shape retrieval. In Eurographics Workshop on 3D Object Retrieval, 2014.
- J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In ICCV, 2013.
- T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.
- S. Mahendran, H. Ali, and R. Vidal. 3d pose regression using convolutional neural networks. In IEEE International Conference on Computer Vision, volume 1, page 4, 2017.
- S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- R. Mottaghi, Y. Xiang, and S. Savarese. A coarse-to-fine model for 3D pose estimation and sub-category recognition. In CVPR, 2015.
- M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object localization. In CVPR, 2009.
- E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg. Transformation-grounded image generation network for novel 3D view synthesis. In CVPR, 2017.
- S. Savarese and L. Fei-Fei. 3D generic object categorization, localization and pose estimation. In ICCV, 2007.
- P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The princeton shape benchmark. In Shape modeling applications, 2004. Proceedings, pages 167–178. IEEE, 2004.
- J. Sochor, A. Herout, and J. Havel. BoxCars: 3D boxes as cnn input for improved fine-grained vehicle recognition. In CVPR, 2016.
- G. Van Horn, O. Mac Aodha, Y. Song, A. Shepard, H. Adam, P. Perona, and S. Belongie. The inaturalist challenge 2017 dataset. arXiv preprint arXiv:1707.06642, 2017.
- J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. In IROS, 2017.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903–1911, 2015.
- Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. ObjectNet3D: A large scale database for 3D object recognition. In ECCV, 2016.
- Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
- L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In CVPR, 2015.
- T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016.
- M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3D representations for object recognition and modeling. PAMI, 35(11):2608–2623, 2013.