# Lifting Object Detection Datasets into 3D

## 1 Introduction

Inferring an object’s category and the region it occupies in an image, purely from pixel-level information, is now closer to becoming a reality. The next step is to infer the object’s 3D surfaces. The availability of large datasets such as PASCAL VOC [1] and ImageNet [2] has paved the way to important advances in object segmentation and recognition over the last few years. Progress has been comparatively slower in the area of object reconstruction from a single image, hindered by the challenge of acquiring the necessary training data: ideally hundreds of thousands of images in uninstrumented settings, aligned with their ground truth 3D shapes. One possible way forward, as computer graphics evolves, could be to render the data and learn to reconstruct in an environment that resembles a 3D computer game setting. Alternatively, depth sensors such as Kinect could be employed, but these are not yet fully practical in general settings, e.g. for objects far from the camera or outdoors. A third option would be to build large datasets with objects in videos and use structure-from-motion techniques [3] to recover their shape.

Here we propose instead to build upon existing, and extremely popular, recognition datasets and to directly reconstruct them aided by available annotations. While our experimental focus will be on PASCAL VOC, our proposed techniques are general and could be applied to any other object detection dataset (e.g. [2]), as long as ground truth class labels, figure-ground segmentations and a small number of per-class keypoints are available, as is the case for VOC [6] and is illustrated in fig. ?. These types of annotations can nowadays be easily crowdsourced over Mechanical Turk, as they require only a few clicks per image — e.g. a recently unveiled object detection dataset comes with around 2 million object segmentations [7].

Fig. ? illustrates what may be the major difficulty in our stated intentions: typically there is drastic intra-class shape variation, which makes previous class-specific reconstruction approaches based on linear shape models impractical. Instead we propose a multiview reconstruction strategy. Unlike settings where multiple calibrated images of the same object are available [8], detection datasets are composed of uncalibrated images of different instances of the same class of objects (most often assembled from images available on the web). We bypass the problem of establishing point correspondences between different objects, which is still unmanageable with current technology, by relying on a very small set of consistent per-class ground truth keypoint matches, from which scaled orthographic camera viewpoints are bootstrapped. We also bypass segmentation, another yet incompletely solved vision problem despite much recent progress [9], and rely on ground truth silhouettes as input to our dense reconstruction engine which is based on a novel visual hull based algorithm.

Visual hull computation has been shown to be a simple but powerful reconstruction technique when many diverse views of the same object are available. We adapt it to operate on category detection imagery proposing a novel formulation that we denote *imprinted visual hull reconstruction*. The basis of our algorithm is to embed visual hull reconstruction within a sampling-based approach. We propose a set of candidate reconstructions for each input image by running the visual hull algorithm multiple times using the current image and two additional images sampled from the dataset. We prioritize viewpoints that are known to best expose the 3D shape of most objects. Finally, we select the most consistent reconstruction amongst the proposed candidates by maximizing intra-category similarity.

Our contributions span different areas of computer vision:

- **The recognition problem:** a first attempt to semi-automatically augment object detection datasets, here instantiated on PASCAL VOC, with dense per-object 3D geometry without requiring annotations beyond those readily available online.
- **The reconstruction problem:** we propose a new data-driven method for class-based 3D reconstruction that relies only on 2D information, such as figure-ground segmentations and a few keypoint annotations.

This paper extends the original conference work [11] with additional visualizations and many new experiments: we present a direct evaluation of viewpoint estimation, a new analysis of the reconstructed shapes on PASCAL VOC and studies on the influence of the main parameters of our algorithm on the results. The full source code of our algorithm, which we call *carvi* as in “carving”, and our synthetic dataset are freely available online.^{1}

## 2 Related Work

Formerly a dominant paradigm, model-based recognition, which reasoned jointly about object identity and 3D geometry [12], was permanently upstaged in the 1990s by a flurry of view-based approaches [15]. The main appeal of view-based approaches was their flexibility: collecting a few example images of the target objects and annotating their bounding boxes or 2D keypoint locations became all the manual labor required to build a recognition system, averting the need for cumbersome manual 3D design or special instrumentation (3D scanners). This 2D data-driven approach made it possible to attack harder problems such as category-level object recognition. Model-based recognition nevertheless held important advantages [17]: modeling the 3D geometry of an object enabled arbitrary viewpoints and occlusion patterns to be rendered and recognized, and it also facilitated higher-level reasoning about interactions between objects and a scene.

While the popularity of model-based recognition was falling, interest in multiview 3D reconstruction was rising, powered by breakthroughs in affine structure-from-motion [18] and the adoption of projective geometry [21]. Multiview reconstruction has since been largely solved in the rigid case [8], as calibration parameters and correspondences can be reliably estimated and the problem reduces to a well-understood geometric optimization problem. In this paper we are interested in the harder problem of class-based reconstruction, where the goal is to reconstruct different objects from the same category, each pictured in a single image. This problem will be the focus of the rest of our literature review in this section.

### 2.1 Class-based reconstruction with 3D data

Most class-based reconstruction methods make use of 3D data in their pipeline, which provides prior information about the 3D shape of the objects in the class.

#### Reconstruction as prototype selection and alignment

In certain cases a precise 3D model of the target object is known [22] and the problem can be seen as a special case of class-based reconstruction that focuses on reconstructing a specific instance of the class. The goal then reduces to locating the object instance in the image and estimating its viewpoint. This problem has traditionally attracted much attention in computer vision and is currently going through a revival due to the increased availability of accurate 3D models for many objects and better feature extraction technology [24].

Other methods rely on a dataset of 3D shapes and automatically choose and align the one that best fits the object in the image [25]. To account for differences between the 3D exemplars and the object depicted, Su and Guibas [26] choose a few exemplars that appear to be similar to the depicted shape and combine them to produce a single depth map.

#### Reconstruction as patch classification

Hassner and Basri [27] performed class-based single-view reconstruction, also using 3D training data but without building a parametric model for the class. Instead, their model searches the training set for patches similar to those in the test image, then transfers the associated depth information. Related methods, but employing parametric classifiers, have also been successfully used for scene reconstruction from a single image [28].

#### Reconstruction using morphable models

When multiple 3D shapes corresponding to different instances of the same class are available, they can be used to build a morphable model for the class, which can then generalize and be fit to unseen instances of the class. Morphable models are low-dimensional parametric models that have been used to represent the shape of many object classes. They can be built from 3D scans of different instances of the class, e.g. the face model in [31] and the human body model in [32], or using 3D meshes obtained from shape repositories, such as Google Sketchup as in [33].

The trained morphable model can then be used in a variety of tasks: (1) to reconstruct from a single image, usually with some user interaction to initialize the viewpoint [31], (2) as a prior for reconstruction from multiple images [34] or from a depth map [35], or (3) for performing object detection and pose estimation [33] in a single image. One factor that limits the applicability of these models is the need for 3D training data. In order to partially overcome this issue, [36] proposed a hybrid method that uses a single 3D shape together with 2D information in order to build a morphable model. The system was demonstrated on classes with limited intra-class shape variability such as dolphins or pigeons.

### 2.2 Data-driven class-based reconstruction

In this paper we focus on a data-driven method for class-based reconstruction that operates directly on an unordered dataset of images and some associated 2D annotations, without using any 3D data. To the best of our knowledge, there have only been two previous attempts [37] at tackling the problem in a purely data-driven fashion. These two approaches build upon traditional non-rigid structure-from-motion methods [39], originally developed for reconstruction from video, and either produce sparse reconstructions [37] or have only been demonstrated on simple classes such as flower petals and clown-fish, while requiring complex manual annotations [38].

Our method differs from the above in two important aspects: (1) we require only a small set of keypoint correspondences across images and these are not the only points we reconstruct; instead we reconstruct dense 3D models of the objects, and (2) we do not build a morphable model for the class. Instead, our aim is to reconstruct every object instance, using “borrowed” shape information from a small number of similar instances seen from different viewpoints. This makes our method applicable to classes with large intra-class variation, such as those in the VOC dataset.

### 2.3 Bottom-up reconstruction

More general object reconstruction approaches that do not require any class information have also been devised, tracing back to classic works by Binford, Marr and others [40]. Methods such as shape from shading (SfS) [43] hold great promise but have so far been applied only in very restricted settings as they make strong assumptions about global illumination conditions and the reflective properties of the object. Recently SIRFS [44] went one step beyond traditional SfS and aimed to recover not only the shape and shading of an object but also reflectance and incident illumination, all from a single image.

Another family of approaches attempts to compute shape from a single silhouette [45]. For example, [45] employed the representation of geometric images to successfully reconstruct simple shapes symmetric with respect to the image plane, but required large amounts of user interaction for more complex objects. A similar principle of symmetry with respect to the image plane is the basis of [47], which focused on including shading information and improving the user experience. The same symmetry principle was also used, albeit more lightly, in [48] where the focus was on coping with deformations.

### 2.4 Dataset augmentation into 3D

The goal of populating detection datasets with 3D annotations has been previously considered for the class *person* [49], using an interactive method to reconstruct a set of body joints. In contrast, we obtain full dense reconstructions for a variety of classes. In a related approach, [50] targeted the problem of automatically bootstrapping 3D scene geometry from 2D annotations on the LabelMe dataset — instead, we focus on objects.

Recently, and perhaps closest to our approach, Karsch *et al.* [51] experimented with reconstructing VOC objects using manual curvature annotations on boundaries, but they computed 2.5D reconstructions whereas we focus on the full 3D problem. Even more recently, and concurrently with our original paper, new 3D annotations were added to 12 rigid classes of the PASCAL dataset [52] in a largely manual effort. All instances of these 12 classes were manually associated with one out of a small set of 3D CAD models posed in the correct viewpoint. The goal of the dataset is to provide ground truth viewpoint information, not shape, and the CAD models provide only a very coarse approximation to the rich set of shapes in PASCAL (e.g. about 50% overlap, or roughly just as much as top automatic semantic segmentation systems [10]).

## 3 Problem formulation

We assume we are given a set of images depicting different instances of the same object class, which may be very diverse in terms of object scale, location, pose, shape and articulation. We make the small simplification in this paper of not addressing the problem of reconstructing occluded objects, which are marked as such in PASCAL. Each object instance $n$ has a corresponding binary mask $B_n$ (a figure-ground segmentation locating the object boundaries in the image) and $K$ specific keypoints for each class, which are on easily identifiable parts of the object, such as *“left mirror”* for cars or *“nose tip”* for aeroplanes. Each object instance is annotated with its visible keypoints, i.e. the set of 2D image coordinates $(x^n_k, y^n_k)$.^{2}

Our goal in this paper is to obtain a dense 3D reconstruction of each of the object instances. It is easy to see that this is a severely underconstrained problem since each image corresponds to a different object instance. Without additional prior knowledge, and if each instance is to be reconstructed independently, an infinite number of reconstructions would be available that could exactly generate the silhouette $B_n$.

### 3.1 Our data-driven approach

Instead of relying on bottom-up reconstruction methods and performing reconstruction completely independently for each instance, we leverage the information contained in images of other objects from the same category, by building upon the assumption that at least some instances of the same class will have a similar 3D shape. We propose a feedforward strategy with two phases: first, camera viewpoints are estimated for all objects using both keypoint and silhouette information; secondly, a sampling-based approach that employs a novel variant of *visual hull reconstruction* is used to produce dense per-object 3D reconstructions. The details of these two steps will be further explained in the following two sections.

## 4 Camera viewpoint estimation and refinement

[Figure: Estimated azimuth for cars.]

The first step of our algorithm is to estimate the camera viewpoint for each of the instances using the factorization based rigid structure-from-motion algorithm of Marques and Costeira [53]. Although rigid modeling may appear to be a suboptimal choice at first sight, several non-rigid structure-from-motion algorithms make use of a similar strategy in viewpoint estimation due to the lack of robustness to noise of specialized non-rigid SfM viewpoint estimates. Simply put, the hope is that the – admittedly flawed – assumption of rigidity acts as a regularizer. The algorithm we adopted models projection using scaled orthographic cameras and requires global point correspondences across the different object instances. In comparison with full perspective cameras, scaled orthographic cameras are considerably easier to model, do not require calibration parameters and are a reasonable approximation for the problem considered.

Using the annotated keypoints we form an observation matrix for each instance:

$$W_n = \begin{bmatrix} x^n_1 & x^n_2 & \cdots & x^n_K \\ y^n_1 & y^n_2 & \cdots & y^n_K \end{bmatrix}$$

where $n$ is the object instance and $K$ is the number of annotated keypoints. Some of the entries in this matrix may be unknown if the keypoint is not visible for this instance.

The SfM algorithm finds the 3D shape $S = [S_1 \cdots S_K]$, a $3 \times K$ matrix that can be seen as a rough “mean shape” for the object instances in the class, the motion matrices $M_n \in \mathbb{R}^{2 \times 3}$ and the translation vectors $t_n \in \mathbb{R}^{2}$, by minimizing the image reprojection error:

$$\min_{S, \{M_n\}, \{t_n\}} \sum_{n} \left\| W_n - M_n S - t_n \mathbf{1}^\top \right\|_F^2$$

under the constraint that $M_n M_n^\top = \alpha_n^2 I_2$. This constraint guarantees that the matrices $M_n$ correspond to the first two rows of a scaled rotation matrix, which can be easily converted into a full rotation matrix $R_n$ and scale parameter $\alpha_n$. The SfM algorithm used does not require that all keypoints are visible in all the instances, i.e. it can deal with missing data. We follow [53] and use an iterative method with power factorization to minimize the reprojection error.
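To make the scaled orthographic camera model concrete, the following minimal Python sketch projects 3D keypoints with a 2×3 motion matrix and a 2D translation, evaluates the squared reprojection error over the visible keypoints only, and splits a motion matrix into a scale and a full rotation. The function names and the list-based data layout are illustrative assumptions and are not taken from the released code. The decomposition assumes that the two rows of the motion matrix are already orthogonal and of equal norm, which is what the constraint above is meant to guarantee.

```python
import math

def project(M, t, X):
    """Scaled orthographic projection of a 3D point X: x = M X + t."""
    return [sum(M[i][j] * X[j] for j in range(3)) + t[i] for i in range(2)]

def reprojection_error(W, visible, M, t, S):
    """Sum of squared residuals for one instance.

    W: 2 x K observations, S: list of K 3D points (hypothetical layout).
    Keypoints flagged as not visible are skipped (missing data).
    """
    err = 0.0
    for k, X in enumerate(S):
        if not visible[k]:
            continue
        px = project(M, t, X)
        err += (W[0][k] - px[0]) ** 2 + (W[1][k] - px[1]) ** 2
    return err

def scaled_rotation_to_pose(M):
    """Split a 2x3 motion matrix into (scale, 3x3 rotation).

    Assumes the rows of M are orthogonal with equal norm; the third
    rotation row is the cross product of the two normalised rows.
    """
    r1, r2 = M
    n1 = math.sqrt(sum(v * v for v in r1))
    n2 = math.sqrt(sum(v * v for v in r2))
    scale = 0.5 * (n1 + n2)  # both norms equal the scale when the constraint holds
    u1 = [v / n1 for v in r1]
    u2 = [v / n2 for v in r2]
    u3 = [u1[1] * u2[2] - u1[2] * u2[1],
          u1[2] * u2[0] - u1[0] * u2[2],
          u1[0] * u2[1] - u1[1] * u2[0]]
    return scale, [u1, u2, u3]
```

For instance, a motion matrix with rows `[0, -2, 0]` and `[2, 0, 0]` corresponds to a scale of 2 and a rotation of 90 degrees about the optical axis.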

For classes with large intra-class variation or articulation, we manually select a subset of the keypoints to perform rigid SfM. There are two types of classes that follow this behavior: the class *boat* and animal classes.^{3} The *boat* class includes both sailing boats and motor boats and, since the sails are not present in the motor boats, we estimate the camera by only considering the keypoints on the hull. Excluding the keypoints corresponding to the sails is crucial for the refinement step detailed in the next section. For animals, which undergo articulation, different instances may have very different poses. For these classes, we assume that the camera viewpoint is defined with respect to the head and torso and exclude the keypoints corresponding to the limbs or wings when performing rigid SfM. For all classes, for robustness, we double the number of instances by adding left-right flipped versions of each image.

### 4.1 Silhouette-based camera refinement

To obtain the camera pose estimate for a particular instance, the SfM algorithm only uses the keypoints visible in that instance. If some keypoints are self-occluded, and since the shape is an average shape of all the objects in the class, this may lead to an inaccurate estimate of the camera viewpoint (see fig. ? (a)). However, the silhouette provides extra constraints that can be used to refine this initial estimate of the camera viewpoint. In particular, if the estimated shape were the correct one, all the keypoints, even the ones which are not visible, should reproject inside the silhouette. This constraint is not satisfied by the initial result of fig. ? (a), but by including a soft constraint that encourages all points to reproject inside the silhouette we obtain a better viewpoint estimate, as can be seen in fig. ? (b). We use a soft constraint to account for imprecisions in the shape estimation and in the keypoint and silhouette annotations.

More formally, we refine the camera estimate (the motion matrix $M_n$ and translation $t_n$ of instance $n$) by fixing the mean shape $S$ and minimizing an energy function of the form:

$$E(M_n, t_n) = E_{\text{rigid}}(M_n, t_n) + E_{\text{silhouette}}(M_n, t_n)$$

under the constraint $M_n M_n^\top = \alpha_n^2 I_2$, where $\alpha_n$ is the scale. The first term of this energy is the reprojection error minimized by rigid SfM and the second term is defined as:

$$E_{\text{silhouette}}(M_n, t_n) = \sum_{k=1}^{K} C_n(M_n S_k + t_n)$$

where $C_n(\cdot)$ is the distance transform map from the figure-ground segmentation $B_n$. A point $S_k$ on the mean shape incurs a penalty if its reprojection, given by $M_n S_k + t_n$, is outside the silhouette and this penalty is proportional to the distance to the silhouette. To minimize this function, we use gradient descent with a projection step into the space of scaled rotation matrices. A similar projection step is used in [53]. Qualitative results of our camera viewpoint estimation can be seen in fig. ?.
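The silhouette term can be illustrated with a small Python sketch: a distance map is zero inside the mask and grows with the distance to the nearest foreground pixel, and the penalty sums the map values at the reprojected keypoints. This is only a sketch under stated assumptions: the brute-force distance map, the nearest-pixel lookup and the clamping of points to the image bounds are our simplifications, and the function names are hypothetical. In practice a library routine such as SciPy's `ndimage.distance_transform_edt` would normally replace the brute-force loop.

```python
import math

def outside_distance_map(mask):
    """Distance from every pixel to the nearest foreground pixel (0 inside).

    Brute force, suitable only for small masks; assumes the mask has at
    least one foreground pixel.
    """
    fg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    return [[min(math.hypot(r - fr, c - fc) for fr, fc in fg)
             for c in range(len(mask[0]))]
            for r in range(len(mask))]

def silhouette_penalty(dist_map, points_2d):
    """Sum of distance-map values at reprojected points given as (x, y)."""
    h, w = len(dist_map), len(dist_map[0])
    total = 0.0
    for x, y in points_2d:
        # Nearest pixel lookup; points outside the image are clamped to the border.
        c = min(max(int(round(x)), 0), w - 1)
        r = min(max(int(round(y)), 0), h - 1)
        total += dist_map[r][c]
    return total
```

Points that reproject inside the silhouette contribute nothing, so the term only pulls the camera towards poses where the whole mean shape lands on the object.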

This camera refinement step can also be used to estimate the camera viewpoint of a new test image, by initializing the rotation to the identity matrix and the translation to the center of the mask. This allows our method to reconstruct a previously unseen image, the only requirement being that the keypoints are marked or have been detected and that the object is segmented.


## 5 Object reconstruction

After jointly estimating the camera viewpoints for all the instances in each class, we reconstruct the 3D shape of all objects using shape information borrowed from other exemplars in the same class.^{4}

### 5.1 Sampling shape surrogates

In datasets as diverse as VOC, it is reasonable to assume that for every instance there are at least a few shape surrogates, i.e. other instances of the same class that, despite not corresponding to the same physical object, have a similar 3D shape. Finding shape surrogates is not straightforward, however. When the surrogates have very different viewpoints it is difficult to establish that their 3D shape is similar to the shape of the reference object (i.e. that they are true surrogates) because their appearance changes vastly. In visual hull approaches, such as the one we propose, a tension also exists between reconstructing from fewer silhouettes, which may result in a solution with many uncarved voxels, and reconstructing from a large number of silhouettes, which may instead lead to an over-carved or even empty solution, because calibration is not exact and “surrogateness” is only approximate. Here we strike a compromise: we sample groups of three views, where the two surrogates of the reference instance are selected among those pictured from far apart viewpoints, so as to maximize the number of background voxels carved away (see fig. ?).

Furthermore, when selecting *far apart* viewpoints we took inspiration from technical illustration practices, where the goal is to communicate 3D shape as concisely as possible, and it is common to represent the shape by drawing 3D orthographic projections on three orthogonal planes. In a similar vein, we restrict surrogate sampling to be over objects pictured from three orthogonal viewpoints, which we will call principal directions.

Our sampling process has three steps:

**(1) Principal direction identification** We found empirically that a good set of principal directions can be obtained by computing the three PCA components of the set of 3D coordinate vectors of the mean shape (estimated in the rigid SfM step). The results typically correspond to the top/bottom, left/right and front/back directions.

**(2) Clustering instances around the principal directions** Instances where the viewpoint difference with respect to a principal direction is smaller than some threshold ( in our implementation) are clustered together ^{5}

**(3) Sampling** We start by selecting two of the three principal directions, with a probability proportional to the number of associated instances. Then, from each of the selected principal directions, we sample one surrogate instance, which together with the reference instance forms a triplet of views.
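The sampling step above can be sketched in a few lines of Python. The sketch makes several assumptions that the text does not specify: the two principal directions are drawn sequentially without replacement, each with probability proportional to its cluster size; the reference instance is never returned as its own surrogate; and the names `sample_triplet` and `clusters` are hypothetical.

```python
import random

def sample_triplet(reference, clusters, rng=random):
    """Sample two surrogates for `reference` from two distinct principal directions.

    clusters: dict mapping a principal-direction name to the list of
    instance ids clustered around it (hypothetical layout).
    Returns (reference, (direction, surrogate), (direction, surrogate)).
    """
    # Candidate surrogates per direction, excluding the reference itself.
    pools = {d: [i for i in members if i != reference]
             for d, members in clusters.items()}
    pools = {d: members for d, members in pools.items() if members}
    if len(pools) < 2:
        raise ValueError("need at least two non-empty principal-direction clusters")

    names = list(pools)
    # First direction: probability proportional to the number of instances.
    first = rng.choices(names, weights=[len(pools[d]) for d in names], k=1)[0]
    # Second direction: same rule over the remaining directions.
    rest = [d for d in names if d != first]
    second = rng.choices(rest, weights=[len(pools[d]) for d in rest], k=1)[0]

    return (reference,
            (first, rng.choice(pools[first])),
            (second, rng.choice(pools[second])))
```

Calling the function repeatedly for the same reference instance yields the different triplets from which the candidate reconstructions are computed.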

Three of the classes in the VOC dataset (*bottle*, *dining table* and *potted plant*) have view-dependent keypoints since it is difficult to define a reference frame for the object [6]. This makes 3D registration ambiguous for all the instances of the class. Instead of sampling surrogate instances, we observed that some of the instances of these classes are approximately rotationally symmetric and synthesize the surrogates from the reference instance by rotating it around the axis of symmetry, every 45 degrees. This is obviously a rough approximation for instances that considerably depart from rotational symmetry.
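As a small illustration of the synthesized surrogates, the Python sketch below generates the rotations applied to the reference camera, one every 45 degrees. It assumes that the axis of symmetry is the vertical (y) axis of the object frame and that the unrotated view is the reference itself, neither of which is stated in the text; the function name is hypothetical.

```python
import math

def surrogate_rotations(step_deg=45):
    """Rotation matrices about the y axis at multiples of step_deg.

    The 0-degree rotation is omitted because it corresponds to the
    reference view itself (an assumption of this sketch).
    """
    rotations = []
    for angle in range(step_deg, 360, step_deg):
        c = math.cos(math.radians(angle))
        s = math.sin(math.radians(angle))
        rotations.append([[c, 0.0, s],
                          [0.0, 1.0, 0.0],
                          [-s, 0.0, c]])
    return rotations
```

With the default step this produces seven synthetic viewpoints, whose silhouettes are all copies of the reference mask.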

### 5.2 Imprinted visual hull reconstruction

Recovering the approximate shape of an object from silhouettes seen from different camera viewpoints can be done by finding the visual hull of the shape [55], the reconstruction with maximum volume among all of those that reproject inside all the different silhouettes. Visual hull reconstruction is a frequent first step in multi-view stereo [56], providing an initial shape that is then refined using photo-consistency. Existing visual hull methods assume that the different silhouettes project from the same physical 3D object [57]. This is in contrast with our scenario where images of different objects are considered. Visual hull reconstruction is known to be sensitive to errors in the segmentation and in the viewpoint estimate and it is clear that such sources of noise are very present in our framework, and can lead to overcarving if handled naively.

A clear shortcoming of using the standard visual hull algorithm in our setting is that there is no guarantee that the visual hull is silhouette-consistent with the reference instance $r$, i.e. that for all the foreground pixels in the mask $B_r$ there will be an active voxel reprojecting on them. This happens because the algorithm trusts all silhouettes equally. Here we propose a variation of the original formulation that does not have this problem, which we denote *imprinted visual hull reconstruction*. We will use a volumetric representation of shape and formulate imprinted visual hull reconstruction as a binary labelling problem. Let $\mathcal{I}$ be the set of instances corresponding to a sampled triplet and $\mathcal{V}$ be a set of voxels. The goal is to find a binary labelling $L = \{l_v : v \in \mathcal{V},\ l_v \in \{0, 1\}\}$ such that $l_v = 1$ if voxel $v$ is inside the shape, and $l_v = 0$ otherwise. Let $d_n(\cdot)$ be a signed distance function such that $d_n(v) < 0$ if voxel $v$ is inside the camera cone of instance $n$, and let $D(v) = \max_{n \in \mathcal{I}} d_n(v)$ be the largest signed distance value over all the cameras, for each voxel $v$. Visual hull reconstruction can be formulated as the minimization of the energy:

$$E(L) = \sum_{v \in \mathcal{V}} l_v \, D(v)$$

To enforce silhouette consistency with the reference mask (imprinting), we need to guarantee that all the rays cast from the foreground pixels of $B_r$ intersect with an interior voxel. Let $\mathcal{R}_p$ be the set of voxels that intersect with the ray corresponding to pixel $p$. Imprinting is then enforced by minimizing the energy $E(L)$ under the following constraints:

$$\sum_{v \in \mathcal{R}_p} l_v \geq 1, \quad \forall p \in \text{foreground}(B_r)$$

Similar constraints have been previously used for multi-view stereo [58], where they were enforced equally for all the images. The energy $E(L)$ can be minimized exactly under these constraints by simply setting $l_v = 1$ if and only if $D(v) < 0$ or if $v = \arg\min_{u \in \mathcal{R}_p} D(u)$ for some foreground pixel $p$ of $B_r$. In effect, this energy carries a prior for structures that are thin in depth. It promotes the construction of a thin layer in depth that fills in the reference mask and is positioned so as to minimize the distance to all considered masks. This is a sensible prior in many cases, because thin surfaces are likely to be carved away using visual hulls, for example sails of boats, bird wings, chair legs, etc. In a few cases, however, it is not ideal, most notably when masks are mismatched due to perspective effects in buses and trains (e.g. see fig. ?).
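The closed-form labelling rule described above is simple enough to sketch in Python: a voxel is marked as interior when its largest signed distance over the sampled cameras is negative (the standard visual hull), and in addition every ray cast from a foreground pixel of the reference mask keeps its best voxel, i.e. the one with the smallest such distance. The dictionary-based voxel representation and the function name are assumptions made for illustration only.

```python
def imprinted_visual_hull(max_signed_dist, rays):
    """Label voxels for an imprinted visual hull.

    max_signed_dist: dict voxel id -> largest signed distance over the
        camera cones of the triplet (negative means inside every cone).
    rays: one list of voxel ids per foreground pixel of the reference mask.
    Returns the set of voxels labelled as interior.
    """
    # Standard visual hull: voxels inside all camera cones.
    inside = {v for v, d in max_signed_dist.items() if d < 0}
    # Imprinting: every foreground ray must hit at least one interior voxel,
    # so keep the voxel closest to being consistent with all silhouettes.
    for ray in rays:
        if ray:
            inside.add(min(ray, key=lambda v: max_signed_dist[v]))
    return inside
```

When a ray already crosses the standard visual hull, its best voxel has a negative distance and nothing new is added, so imprinting only changes the result where the other two silhouettes would have carved the reference object away.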

We chose to formulate our reconstruction algorithm as a labelling problem, to motivate future extensions such as adding pairwise constraints between voxels or connectivity priors [59]. An example case where imprinting is particularly useful is the bird in fig. ?.

### 5.3 Reconstruction selection

Once all reconstruction proposals have been computed based on different sampled triplets, the final step is to choose the best reconstruction for the reference instance. Here we propose a selection criterion that follows a simple observation: reconstructions should be similar to the average shape of their object class. Our selection procedure first computes an average mask for each of the principal directions. This is done by aligning the masks of all the instances in each principal direction cluster and averaging them. Afterwards, each reconstruction proposal is projected onto a plane perpendicular to each principal direction and the difference between this projection and the average mask associated with that direction is measured. The final score is the sum of the three differences, one for each direction. The average masks for each principal direction for two classes are shown in fig. ?.
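A minimal Python sketch of this selection step follows. The text does not say how the difference between a projection and an average mask is measured, so the sketch assumes a sum of absolute per-pixel differences between masks that are already aligned and of equal size; the data layout and function names are likewise hypothetical.

```python
def mask_difference(projection, average_mask):
    """Sum of absolute per-pixel differences (an assumed difference measure)."""
    return sum(abs(p - a)
               for row_p, row_a in zip(projection, average_mask)
               for p, a in zip(row_p, row_a))

def select_reconstruction(proposals, average_masks):
    """Pick the proposal whose projections best match the class averages.

    proposals: list of dicts, direction name -> projected binary mask.
    average_masks: dict, direction name -> average mask with values in [0, 1].
    Returns (index of the best proposal, its score); lower is better.
    """
    def score(proposal):
        # One difference per principal direction, summed.
        return sum(mask_difference(proposal[d], average_masks[d])
                   for d in average_masks)

    best = min(range(len(proposals)), key=lambda i: score(proposals[i]))
    return best, score(proposals[best])
```

Because the score only compares 2D projections, it can be evaluated for every sampled proposal without any 3D ground truth.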

## 6 Experiments

Our main goals in terms of experiments were to evaluate the accuracy of 1) viewpoint estimation and 2) shape reconstruction. We focused on the PASCAL VOC dataset because it is still the most popular object detection dataset and the annotations that our algorithm uses as input are already publicly available. In work published concurrently with our original publication [11], human provided viewpoint annotations have been gathered for PASCAL VOC [52]. This allows us to evaluate the camera viewpoint estimation. Regarding the shape reconstruction, in the absence of accurate ground truth data and considering the simplicity of our inputs (keypoints and figure-ground segmentation) ^{6}

### 6.1 Reconstructing PASCAL VOC

We consider the subset of 9,087 fully visible objects in 5,363 images from the 20,775 objects and 10,803 images available in the PASCAL VOC 2012 training data and use the publicly available keypoints and figure-ground segmentations [61]. VOC has 20 classes, including highly articulated ones (dogs, cats, people), vehicles (cars, trains, bicycles) and indoor objects (dining tables, potted plants) in realistic images drawn from Flickr. Amongst these, fewer than 1% have focal lengths in their EXIF metadata, which we ignored.

We reconstructed all the objects and show two example outputs from each class in fig. ?. We observe that surprisingly accurate reconstructions are obtained for most classes, with some apparent difficulties for “dining table”, “sofa” and “train”. The problems with “dining table” can be explained by there being only 13 exemplars marked as unoccluded, which makes camera viewpoint estimation frail. “Sofa” has a strong concavity which makes visual-hull reconstruction hard and would benefit from stereo-based post-processing, which we leave for future work. “Train” is a very difficult class to reconstruct in general: different trains may have a different number of carriages, there are strong perspective effects and it is articulated. Finally, sometimes our reconstructions of animals have either fewer or more limbs than in the image, and certain reconstructions have disconnected components.

In all experiments, we sampled 20 reconstructions of each reference object instance and found our algorithm to be very efficient: it took just 7 hours to reconstruct VOC on a 12-core computer, with the camera refinement algorithm taking around 5 hours.

**Simple shape analysis.** Reconstruction of image collections, as pursued in this paper, holds the potential to greatly extend the domain of powerful shape analysis techniques developed in the graphics community [62], that have however been mostly applied to collections of CAD models. Here, we made one small step in this direction and experimented with clustering our reconstructions on PASCAL VOC. For each class we first computed a distance matrix between all instances (using the symmetric mesh distance from [63]), then clustered each class into 5 clusters using the K-medoids algorithm. We show the resulting prototypes (medoids) for each cluster in fig. ?, ordered by the number of elements assigned to that cluster, which reflects how frequently each type of shape appears in the dataset.
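The clustering step can be sketched as follows, operating directly on a precomputed distance matrix. This is a plain alternating K-medoids written for illustration, with a naive deterministic initialisation (the first `k` instances) that is our assumption; the source does not specify the initialisation or the exact variant used, and the function name is hypothetical.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix `dist` (list of lists).

    Returns (ranked, labels): `ranked` is a list of (medoid index, cluster
    size) sorted by decreasing size, `labels` gives the cluster of each item.
    """
    n = len(dist)
    medoids = list(range(k))  # naive deterministic initialisation (assumption)

    def assign(meds):
        # Each item joins the cluster of its closest medoid.
        return [min(range(k), key=lambda j: dist[i][meds[j]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                updated.append(medoids[j])  # keep the old medoid for an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if updated == medoids:
            break
        medoids = updated

    labels = assign(medoids)
    sizes = [labels.count(j) for j in range(k)]
    ranked = sorted(zip(medoids, sizes), key=lambda ms: -ms[1])
    return ranked, labels
```

Sorting the medoids by cluster size mirrors the ordering used when displaying the prototypes, with the most frequent shape type first.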

**Viewpoint evaluation using Pascal3D+ ground truth cameras.** Recently, Xiang *et al.* [52] augmented 12 of the object classes in the PASCAL dataset with human-provided 3D information, resulting in the Pascal3D+ dataset. To construct this dataset, for each object, a human first selected the 3D CAD model most similar to it from a small set of options (a total of 70 CAD models for the 12 classes were used), then manually oriented and aligned it with the image. The viewpoint is then refined by optimizing the projection of ground truth keypoints in the 3D model to ground truth keypoints in the image.

We use the Pascal3D+ dataset to evaluate our camera viewpoint estimation algorithm detailed in section ?. In Pascal3D+ the annotations for each object contain the camera’s azimuth, elevation and camera roll angles which we compare to our estimates. We report results for 10 of the annotated classes (we exclude classes “bottle” and “dining table” because the keypoints we experimented with [6] for both classes are view-dependent).

The results are shown in fig. ? and are in most cases lower than , which is the precision of the ground truth annotations of Pascal3D+ [64]. We measure the angle error in degrees and report the median for each class. The results show that our method is effective in estimating the viewpoint for most objects in all classes. They also show that our refinement step detailed in section ? consistently outperforms the initial estimate using rigid SFM. In fig. ? we show typical failure cases of viewpoint estimation: large perspective effects, large intra-class variation and articulations.
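As an illustration of this kind of evaluation, the sketch below computes a per-class median angular error from (azimuth, elevation, roll) triplets. Both the Euler-angle convention and the use of the geodesic angle between rotation matrices are assumptions made for the example; the text above only states that the error is an angle in degrees and that the median is reported. All function names are hypothetical.

```python
import math
from statistics import median


def _rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def _rot_x(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def _matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_angles(azimuth, elevation, roll):
    # Assumed convention (not taken from the paper or from Pascal3D+):
    # R = Rz(roll) * Rx(elevation) * Rz(azimuth), all angles in degrees.
    return _matmul(_rot_z(roll), _matmul(_rot_x(elevation), _rot_z(azimuth)))


def angular_error_deg(r_a, r_b):
    # Geodesic angle of the relative rotation Ra^T Rb; its trace equals
    # the sum of element-wise products of Ra and Rb.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def median_viewpoint_error(predicted, ground_truth):
    """Median angular error over paired (azimuth, elevation, roll) triplets."""
    errors = [
        angular_error_deg(rotation_from_angles(*p), rotation_from_angles(*g))
        for p, g in zip(predicted, ground_truth)
    ]
    return median(errors)
```

Reporting the median rather than the mean keeps the per-class figure robust to the occasional gross failure, such as a front/back confusion.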

### 6.2 Reconstructing a synthetic PASCAL VOC

Class | Full | -CRef | -SImp | [46] | SFMc
---|---|---|---|---|---
aeroplane | 3.58 | 4.94 | 3.95 | 9.64 | 5.79
bicycle | 4.30 | 3.26 | 4.75 | 10.51 | 6.56
bird | 9.98 | 10.92 | 10.34 | 8.76 | 12.01
boat | 5.91 | 6.78 | 6.05 | 8.81 | 6.52
bottle | 8.09 | 10.77 | 8.53 | 6.25 | 12.13
bus | 6.45 | 6.10 | 6.49 | 11.02 | 7.34
car | 3.04 | 6.33 | 3.10 | 11.07 | 3.22
cat | 6.98 | 7.57 | 7.49 | 11.39 | 9.61
chair | 5.36 | 5.73 | 6.06 | 8.13 | 7.37
cow | 5.44 | 5.24 | 5.83 | 9.17 | 7.50
diningtable | 8.97 | 12.57 | 14.30 | 8.67 | 9.52
dog | 7.08 | 8.38 | 7.19 | 11.61 | 9.91
horse | 6.05 | 7.05 | 6.38 | 6.90 | 7.41
motorbike | 4.12 | 4.24 | 4.16 | 9.24 | 5.32
person | 7.35 | 7.95 | 7.55 | 9.14 | 19.46
pottedplant | 7.72 | 8.15 | 7.99 | 7.58 | 17.86
sheep | 7.18 | 7.15 | 7.66 | 8.77 | 7.16
sofa | 6.11 | 6.24 | 6.31 | 8.06 | 5.75
train | 15.73 | 20.55 | 16.19 | 17.01 | 17.47
tv/monitor | 9.73 | 10.45 | 10.28 | 9.67 | 10.08
Mean | 6.96 | 8.01 | 7.53 | 9.57 | 9.40

We also performed a quantitative evaluation on synthetic test images with similar segmentations and keypoints as those in VOC. To make results as representative of performance on real data as possible, we reconstruct using only surrogate shapes from VOC. We downloaded 10 meshes for each category from the web, then manually annotated keypoints consistent with those of [6] in 3D and rendered them using 5 different cameras, sampled from the ones estimated on VOC for that class. This resulted in 50 synthetic images per class, each with associated segmentation and visible keypoints, for a total of 1000 test examples.
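The keypoint part of this synthetic data generation can be sketched as follows, assuming scaled orthographic cameras parameterised by a rotation matrix, a scale and a 2D translation. The helper names are hypothetical, and the sketch projects keypoints only; it does not render segmentations or decide keypoint visibility.

```python
def project_scaled_orthographic(points, rotation, scale, translation):
    """Project 3D points with a scaled orthographic camera.

    Each point X maps to scale * (first two rows of R) * X + t.
    """
    projected = []
    for x, y, z in points:
        u = scale * (rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z)
        v = scale * (rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z)
        projected.append((u + translation[0], v + translation[1]))
    return projected


def render_keypoints(meshes_keypoints, cameras):
    """Pair every annotated mesh with every sampled camera.

    `cameras` is a list of (rotation, scale, translation) tuples.
    """
    return [
        project_scaled_orthographic(keypoints, rotation, scale, translation)
        for keypoints in meshes_keypoints
        for rotation, scale, translation in cameras
    ]
```

With 10 meshes and 5 cameras per class, `render_keypoints` yields the 50 keypoint sets per class described above.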

We measure the distortion between a reconstruction and a ground truth 3D mesh using, as in the clustering experiment, the symmetric root mean squared error between the two meshes [63]. Let the root mean squared error be:

$$e_{rms}(S, S') = \sqrt{\frac{1}{|S|} \iint_{p \in S} d(p, S')^2 \, dS}$$

where $S$ and $S'$ are the two meshes we want to compare, $|S|$ is the surface area of $S$, and $d(p, S')$ is the distance of a point to a mesh, defined by the Hausdorff distance, i.e. the minimum Euclidean distance between point $p$ and any point on the mesh $S'$. Since this distance is not symmetric we use instead:

$$e_{srms}(S, S') = \max\left(e_{rms}(S, S'),\, e_{rms}(S', S)\right)$$

We normalize scale using the diagonal length of the bounding box of the ground truth 3D model, such that the error is a percentage of this length, and report the average error over all the objects in each category. Table ? demonstrates the benefits of the different components of our proposed methodology. Since no other existing class reconstruction technique scales to such a large and diverse dataset using simple 2D annotations, we compare to two simple baselines: an inflation technique originally proposed for silhouette-based single-view reconstruction called *Puffball* [46] and a multiview baseline relying on our rigid SfM. Our method is significantly better for most classes, and a visual comparison of the resulting reconstructions is available in fig. ?, together with some of the CAD models in the dataset and their renderings.
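A rough sketch of this error measure is given below under two simplifying assumptions: each mesh is represented by a set of points sampled on its surface, so the point-to-mesh distance is approximated by the distance to the nearest sample, and the two one-sided errors are combined by taking their maximum. The function names are hypothetical and the brute-force nearest-neighbour search is only suitable for small point sets.

```python
import math


def one_sided_rms(points_a, points_b):
    """RMS over points_a of the distance to the nearest point in points_b."""
    total = 0.0
    for p in points_a:
        total += min(
            (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2
            for q in points_b
        )
    return math.sqrt(total / len(points_a))


def symmetric_rms(points_a, points_b):
    # Assumption: symmetrise by taking the larger of the two directions.
    return max(one_sided_rms(points_a, points_b),
               one_sided_rms(points_b, points_a))


def bbox_diagonal(points):
    """Diagonal length of the axis-aligned bounding box of a point set."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    return math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))


def normalized_error(reconstruction, ground_truth):
    """Symmetric RMS error as a percentage of the ground truth bbox diagonal."""
    return 100.0 * symmetric_rms(reconstruction, ground_truth) / bbox_diagonal(ground_truth)
```

For example, shifting the eight corners of a unit cube by 0.1 along one axis gives an error of 100 × 0.1 / √3 ≈ 5.77, i.e. about 5.8% of the bounding box diagonal.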

Fig. ? suggests large gains of our simple ranking approach over random selection but also that there is much to improve with the addition of more advanced features. Fig. ? also shows the effect of varying the principal direction clustering threshold, which we have set by default to 15°: reconstruction quality degrades slowly with looser thresholds. We have observed for example that cars tend to be more diamond-shaped if a tight frontal view is not available and instead views 30° or 40° away from the frontal view are used.

[Figure: four pairs of panels, each showing an image view and a top view.]

## 7 Discussion

While our results are encouraging for such a hard problem, there are several challenges that our approach does not address, such as modelling parts/articulation, occlusions and perspective effects. An additional limitation of our method is the use of a single “average shape” for the objects of a class. Although our experiments show that the camera viewpoint estimation step generally provides accurate results, this simplification may occasionally lead to incorrect camera pose estimates when the shape of the object instance differs significantly from the “average shape”. Modelling subcategories would be a straightforward avenue for boosting the performance of all components: pose estimation, surrogate sampling and ranking. Two possible ways to obtain such subcategory information are: 1) to use image classifiers trained on a dataset with finer-grained category information and 2) to divide shapes into subcategories during the reconstruction process and iterate.

An additional potentially powerful direction for future work is feature learning, in particular to improve the ranking of reconstructions, perhaps using one of the large collections of CAD models available online [65].

Finally, while our use of imprinting when computing visual hulls helps to mitigate the issues of using different object instances as surrogate shapes, this could be combined with other advanced visual hull techniques that explicitly deform the surrogate silhouettes to reduce inconsistencies between the silhouettes [66] or that enforce connectivity of the reconstruction [67].

## 8 Conclusion

We have proposed a novel data-driven methodology for bootstrapping 3D reconstructions of objects in detection datasets, based on a small set of commonly available annotations, namely figure-ground segmentations and a small set of keypoints. Our approach is the first to target class-based 3D reconstruction on a challenging detection dataset, PASCAL VOC, and is demonstrated to achieve very promising performance. It produces convincing 3D shapes for most categories, handling widely different objects such as animals, vehicles and indoor furniture using the same integrated framework. We believe this paper contributes to the recently renewed interest in 3D modeling in recognition (e.g. [68]) and that it will promote progress in this direction since it provides the first semi-automatic solution to 3D model acquisition from detection data, which has been a difficult obstacle to research in joint object recognition and reconstruction.

## Acknowledgments

This work was supported by FCT grants PTDC/EEA-CRO/122812/2010 and SFRH/BPD/84194/2012, and by the European Research Council under the ERC Starting Grant agreement 204871-HUMANIS. It was also partly supported by the SecondHands project, funded by the European Union’s Horizon 2020 Research and Innovation programme under grant agreement No 643950.

João Carreira

received his doctorate from the University of Bonn, Germany. His thesis focused on sampling class-independent object segmentation proposals using the CPMC algorithm, and on applying them in object recognition and localization. Systems authored by him and colleagues were winners of all four PASCAL VOC Segmentation challenges, 2009-2012. He did post-doctoral work at the Institute of Systems and Robotics in Coimbra, Portugal and is currently with the EECS department, at the University of California in Berkeley, USA. His research interests lie at the intersection of recognition, segmentation, pose estimation and shape reconstruction of objects from a single image.

Sara Vicente

received her PhD from University College London, United Kingdom. She was a postdoctoral researcher at Queen Mary, University of London and later at University College London. She currently works as a research scientist at Anthropics Technology. Her research focuses on image segmentation and 3D reconstruction of deformable objects from images.

Lourdes Agapito

received the BSc degree in physics in 1991 and the PhD degree in 1996 from the Universidad Complutense in Madrid, Spain. She was then a Marie Curie fellow at Oxford’s Robotics Research Group. She is currently a reader in Vision and Imaging Science at University College London. In 2008, she was awarded an ERC Starting Grant. Her research focuses on the area of 3D reconstruction of non-rigid structure from image sequences. She is a member of the IEEE.

Jorge Batista

Prof. Jorge Batista received the M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Coimbra in 1992 and 1999, respectively. He joined the Department of Electrical Engineering and Computers, University of Coimbra, Coimbra, Portugal, in 1987 as a research assistant, where he is currently an Associate Professor with tenure. He was the Head of Department from 2011 to 2013. He is a founding member of the Institute of Systems and Robotics (ISR) in Coimbra, where he is a senior researcher and principal investigator of several research projects. His research interests focus on a wide range of computer vision and pattern analysis related issues, including real-time vision, video surveillance, video analysis, non-rigid modeling and facial analysis.

### Footnotes

- http://www.isr.uc.pt/~joaoluis/carvi
- These annotations are publicly available for all the 20 classes in the VOC dataset [6].
- Note however that most object classes are non-rigid in practice, for example cars can have their doors open, wheels rotate, etc.
- An idea similar in spirit was proposed for segmentation [54].
- The amount of camera roll is typically low in detection datasets and we did not compensate for it but it may be a good idea in future work.
- Synthetic datasets were also successfully used in Kinect [60], where the inputs can also be rendered realistically (e.g. depth maps).

### References

- M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” *International Journal of Computer Vision*, 2010.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2009.
- M. Paladini, A. Del Bue, M. Stosic, M. Dodig, J. Xavier, and L. Agapito, “Factorization for non-rigid and articulated structure using metric projections,” in *Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on*. IEEE, 2009, pp. 2898–2905.
- C. Russell, R. Yu, and L. Agapito, “Video pop-up: Monocular 3d reconstruction of dynamic scenes,” in *Computer Vision – ECCV 2014*, ser. Lecture Notes in Computer Science, 2014, vol. 8695, pp. 583–598.
- A. Fragkiadaki, M. Salas, P. Arbelaez, and J. Malik, “Grouping-based low-rank video completion and 3d reconstruction,” in *Advances in Neural Information Processing Systems*, 2014.
- T. Brox, L. Bourdev, S. Maji, and J. Malik, “Object segmentation by alignment of poselet activations to image contours,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2011.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” *arXiv preprint arXiv:1405.0312*, 2014.
- R. I. Hartley and A. Zisserman, *Multiple View Geometry in Computer Vision*. Cambridge University Press, 2004.
- J. Carreira, F. Li, and C. Sminchisescu, “Object Recognition by Sequential Figure-Ground Ranking,” *International Journal of Computer Vision*, 2012.
- J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, “Semantic segmentation with second-order pooling,” in *European Conference on Computer Vision*, 2012.
- S. Vicente, J. Carreira, L. Agapito, and J. Batista, “Reconstructing pascal voc,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2014.
- W. E. L. Grimson, D. P. Huttenlocher *et al.*, *Object recognition by computer: the role of geometric constraints*. MIT Press, 1990.
- D. P. Huttenlocher and S. Ullman, “Object recognition using alignment,” in *Proceedings of the 1st International Conference on Computer Vision*, 1987, pp. 102–111.
- D. G. Lowe, “Three-dimensional object recognition from single two-dimensional images,” *Artificial intelligence*, vol. 31, no. 3, pp. 355–395, 1987.
- M. Turk and A. Pentland, “Eigenfaces for recognition,” *Journal of cognitive neuroscience*, vol. 3, no. 1, pp. 71–86, 1991.
- H. Murase and S. K. Nayar, “Visual learning and recognition of 3-d objects from appearance,” *International journal of computer vision*, vol. 14, no. 1, pp. 5–24, 1995.
- J. L. Mundy, “Object recognition in the geometric era: A retrospective,” in *Toward Category-Level Object Recognition*, 2006.
- S. Ullman, “The interpretation of structure from motion,” *Proceedings of the Royal Society of London. Series B. Biological Sciences*, vol. 203, no. 1153, pp. 405–426, 1979.
- C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: a factorization method,” *International Journal of Computer Vision*, vol. 9, no. 2, pp. 137–154, 1992.
- J. J. Koenderink, A. J. Van Doorn *et al.*, “Affine structure from motion,” *JOSA A*, vol. 8, no. 2, pp. 377–385, 1991.
- O. Faugeras, *Three-dimensional computer vision: a geometric viewpoint*. MIT Press, 1993.
- L. G. Roberts, “Machine perception of three-dimensional solids,” Ph.D. dissertation, Massachusetts Institute of Technology, 1963.
- D. G. Lowe, “Three-dimensional object recognition from single two-dimensional images,” *Artif. Intell.*, 1987.
- J. J. Lim, H. Pirsiavash, and A. Torralba, “Parsing ikea objects: Fine pose estimation,” in *IEEE International Conference on Computer Vision*. IEEE, 2013, pp. 2992–2999.
- M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, “Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2014.
- H. Su, Q. Huang, N. Mitra, Y. Li, and L. Guibas, “Estimating image depth using shape collection,” *Transaction of Graphics*, no. Special Issue of SIGGRAPH 2014, 2014.
- T. Hassner and R. Basri, “Example based 3d reconstruction from single 2d images,” in *IEEE CVPR Workshop*, 2006.
- A. Saxena, S. H. Chung, and A. Y. Ng, “3-d depth reconstruction from a single still image,” *International Journal of Computer Vision*, 2008.
- D. Hoiem, A. A. Efros, and M. Hebert, “Geometric context from a single image,” *IEEE International Conference on Computer Vision*, 2005.
- L. Ladicky, J. Shi, and M. Pollefeys, “Pulling things out of perspective,” in *Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on*. IEEE, 2014, pp. 89–96.
- V. Blanz and T. Vetter, “A morphable model for the synthesis of 3d faces,” in *Proceedings of the 26th annual conference on Computer graphics and interactive techniques*, 1999.
- D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, “Scape: shape completion and animation of people,” in *ACM Trans. Graph.*, 2005.
- M. Zia, M. Stark, B. Schiele, and K. Schindler, “Detailed 3d representations for object recognition and modeling,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2013.
- S. Y. Bao, M. Chandraker, Y. Lin, and S. Savarese, “Dense object reconstruction with semantic priors,” in *Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on*. IEEE, 2013, pp. 1264–1271.
- A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid, “Dense reconstruction using 3d object shape priors,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2013.
- T. J. Cashman and A. W. Fitzgibbon, “What shape are dolphins? building 3d morphable models from 2d images,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2013.
- S. Zhu, L. Zhang, and B. Smith, “Model evolution: An incremental approach to non-rigid structure from motion,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2010.
- M. Prasad, A. Fitzgibbon, A. Zisserman, and L. Van Gool, “Finding nemo: Deformable object class modelling using curve matching,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2010.
- C. Bregler, A. Hertzmann, and H. Biermann, “Recovering non-rigid 3D shape from image streams,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2000.
- G. J. Agin and T. O. Binford, “Computer description of curved objects,” *Computers, IEEE Transactions on*, vol. 100, no. 4, pp. 439–449, 1976.
- D. Marr and H. K. Nishihara, “Representation and recognition of the spatial organization of three-dimensional shapes,” *Proceedings of the Royal Society of London. Series B. Biological Sciences*, vol. 200, no. 1140, pp. 269–294, 1978.
- R. Mohan and R. Nevatia, “Using perceptual organization to extract 3d structures,” *Pattern Analysis and Machine Intelligence, IEEE Transactions on*, vol. 11, no. 11, pp. 1121–1139, 1989.
- B. Horn, “Shape from shading: A method for obtaining the shape of a smooth opaque object from one view,” PhD thesis, Massachusetts Inst. of Technology, 1970.
- J. T. Barron and J. Malik, “Shape, albedo, and illumination from a single image of an unknown object,” in *IEEE International Conference on Computer Vision and Pattern Recognition*. IEEE, 2012, pp. 334–341.
- M. Prasad, A. Zisserman, and A. W. Fitzgibbon, “Single view reconstruction of curved surfaces,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2006.
- N. R. Twarog, M. F. Tappen, and E. H. Adelson, “Playing with puffball: simple scale-invariant inflation for use in vision and graphics,” in *ACM Symp. on Applied Perception*, 2012.
- E. Toppe, C. Nieuwenhuis, and D. Cremers, “Relative volume constraints for single view 3d reconstruction,” in *Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on*. IEEE, 2013, pp. 177–184.
- S. Vicente and L. Agapito, “Balloon shapes: Reconstructing and deforming objects with volume from images,” in *3DTV-Conference, 2013 International Conference on*. IEEE, 2013, pp. 223–230.
- L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3d human pose annotations,” in *IEEE International Conference on Computer Vision*, 2009.
- B. C. Russell and A. Torralba, “Building a database of 3d scenes from user annotations,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2009.
- K. Karsch, Z. Liao, J. Rock, J. T. Barron, and D. Hoiem, “Boundary cues for 3d object shape recovery,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2013.
- Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark for 3d object detection in the wild,” in *IEEE Winter Conference on Applications of Computer Vision (WACV)*, 2014.
- M. Marques and J. P. Costeira, “Estimating 3D shape from degenerate sequences with missing data,” *Computer Vision and Image Understanding*, 2008.
- J. Kim and K. Grauman, “Shape sharing for object segmentation,” in *European Conference on Computer Vision*, 2012.
- A. Laurentini, “The visual hull concept for silhouette-based image understanding,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 1994.
- S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, “A comparison and evaluation of multi-view stereo reconstruction algorithms,” in *IEEE International Conference on Computer Vision and Pattern Recognition*. IEEE, 2006.
- K. Grauman, G. Shakhnarovich, and T. Darrell, “Inferring 3d structure with a statistical image-based shape model,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, 2003.
- D. Cremers and K. Kolev, “Multiview stereo and silhouette consistency via convex functionals over convex domains,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2011.
- S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based image segmentation with connectivity priors,” *IEEE International Conference on Computer Vision and Pattern Recognition*, 2008.
- J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman *et al.*, “Efficient human pose estimation from single depth images,” *Pattern Analysis and Machine Intelligence, IEEE Transactions on*, vol. 35, no. 12, pp. 2821–2840, 2013.
- B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in *IEEE International Conference on Computer Vision*, 2011.
- V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser, “Learning Part-based Templates from Large Collections of 3D Shapes,” *Transactions on Graphics (Proc. of SIGGRAPH)*, vol. 32, no. 4, 2013.
- N. Aspert, D. Santa-Cruz, and T. Ebrahimi, “Mesh: Measuring errors between surfaces using the hausdorff distance,” in *ICME*, 2002.
- Private communication with Roozbeh Mottaghi, 2014.
- Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao, “3d shapenets for 2.5d object recognition and next-best-view prediction,” *CoRR*, vol. abs/1406.5670, 2014.
- N. J. Mitra and M. Pauly, “Shadow art,” in *ACM Transactions on Graphics*, 2009.
- M. R. Oswald, J. Stühmer, and D. Cremers, “Generalized connectivity constraints for spatio-temporal 3d reconstruction,” in *European Conference on Computer Vision*, 2014.
- D. Hoiem and S. Savarese, *Representations and techniques for 3D object recognition and scene interpretation*. Morgan & Claypool Publishers, 2011, vol. 15.
- M. Sun, H. Su, S. Savarese, and L. Fei-Fei, “A multi-view probabilistic model for 3d object classes,” in *IEEE International Conference on Computer Vision and Pattern Recognition*, June 2009.