Clouds of Oriented Gradients for 3D Detection of Objects, Surfaces, and Indoor Scene Layouts
We develop new representations and algorithms for three-dimensional (3D) object detection and spatial layout prediction in cluttered indoor scenes. We first propose a cloud of oriented gradients (COG) descriptor that links the 2D appearance and 3D pose of object categories, and thus accurately models how perspective projection affects perceived image gradients. To better represent the 3D visual styles of large objects and provide contextual cues to improve the detection of small objects, we introduce latent support surfaces. We then propose a “Manhattan voxel” representation that better captures the 3D room layout geometry of common indoor environments. Effective classification rules are learned via a latent structured prediction framework. Contextual relationships among categories and layout are captured via a cascade of classifiers, leading to holistic scene hypotheses that exceed the state of the art on the SUN RGB-D database.
Semantic understanding of three-dimensional (3D) scenes plays an increasingly important role in modern robotic systems and autonomous vehicles. The last decade has seen major advances in semantic understanding of 2D images [1, 2]. However, images of indoor (home or office) environments remain challenging for existing methods due to the prevalence of clutter and occlusions. Advances in depth sensor technology can reduce ambiguities in standard RGB images, enabling breakthroughs in scene layout prediction [3, 4, 5], support surface prediction [6, 7, 8], semantic parsing , and object detection [10, 11, 12]. A growing number of annotated RGB-D datasets have been constructed to train and evaluate indoor scene understanding methods [13, 14, 6, 15].
Holistic indoor scene understanding  requires integrated detection of objects and the room layouts (walls, floors, and ceilings) that surround them. While object detection is often formalized as the prediction of a 2D bounding box , 2D representations are insufficient for many real-world applications because they do not explicitly represent object orientations or contextual relationships. We instead propose to detect the 3D size, position, and orientation of object instances via bounding cuboids (convex polyhedra). 3D cuboid detection is a standard task in indoor and outdoor scene understanding benchmarks [15, 16].
Descriptors constructed from point cloud representations of RGB-D images are frequently used for 3D object detection. For example, Song et al.  use the truncated signed distance function (TSDF) to define descriptors for candidate 3D bounding cuboids. But given the diverse variation in the appearance of indoor object categories, accurately modeling how appearance varies with object style and 3D viewpoint is very challenging . We thus design a novel, orientation-adaptive gradient descriptor that uses perspective geometry to better detect objects observed from diverse 3D viewpoints.
Basic discriminative scene parsing algorithms detect each category independently, but often have many false positives. Previous work has used manually engineered heuristics to prune false detections  or combined CAD models with layout cues to model scenes . In this paper, we significantly boost detection accuracy via a cascaded classification framework  that learns contextual relationships among object categories, as well as relationships between objects and the overall room layout. This efficient approach allows initial detections of visually distinctive objects to lead to holistic scene interpretations of higher quality.
To estimate the spatial layout used in our cascaded classifiers, we assume an orthogonal “Manhattan” room structure . Many previous methods predict 2D projections of the underlying 3D room structure [22, 23], but small 2D alignment errors may lead to poor 3D layout estimates. We avoid this by using a structured prediction framework to directly estimate 3D layouts from RGB-D images, and propose a Manhattan voxel representation that (like our object descriptors) is adapted to the geometry of indoor scenes. Our learning-based approach is more robust to the noisy depth estimates produced by practical RGB-D cameras, and thus avoids errors made by simpler layout prediction heuristics .
Holistic indoor scene understanding is particularly challenging because smaller objects, like lamps and monitors, only occupy a tiny fraction of the room volume. Bottom-up detectors thus have high computational demands (many candidate bounding cuboids must be considered) and typically produce many false positives. To address this challenge, we note that many small objects are supported by the surfaces of large objects [6, 8], and augment our cuboid representations with latent support surfaces. While surface heights are estimated without explicit training annotations, modeling them nevertheless boosts the accuracy of our furniture detectors. When integrated into our cascaded classification framework, support surfaces constrain the search space for small objects, and thereby improve detection speed as well as accuracy.
In summary, we propose a general framework for learning detectors for multiple object categories using only RGB-D annotations. We first introduce a cloud of oriented gradient (COG) descriptor that robustly links 3D object pose to 2D image boundaries, and discuss extensions that further boost performance (Sec. 3). Because a major cause of feature inconsistency across object instances is variation in the location of the supporting surface, we model this height as a latent variable, and use it to distinguish different visual styles and detect smaller objects (Sec. 4). We also introduce a Manhattan voxel representation to predict room layout directly from RGB-D data (Sec. 5). We use a structured prediction framework to learn an algorithm that aligns 3D cuboid hypotheses to RGB-D data, and a cascaded classifier to incorporate contextual cues from other object instances and categories, as well as the overall 3D layout (Sec. 6). We evaluate our algorithm on the challenging SUN RGB-D dataset  and achieve state-of-the-art accuracy in the 3D detection of 19 object categories (Sec. 7).
2 Related Work
Two-dimensional object detection is a widely studied problem. Dalal and Triggs introduced the histogram of oriented gradients (HOG) descriptor to model 2D object appearance using image gradients. Building on HOG, Felzenszwalb et al. used a discriminatively trained part-based model to represent objects. This method is effective because it explicitly models object parts as latent variables and thus captures some object style and pose variations. More recently, many papers have used convolutional neural networks (CNNs) to extract rich features from images [26, 27, 28, 29, 30]. For domains where large sets of labeled images are available, CNNs lead to state-of-the-art performance with efficient detection speed [31, 32].
Real-world computer vision systems increasingly incorporate depth data as an additional input to increase accuracy and robustness. With depth maps we can reconstruct point cloud representations of scenes, leading to significant recent advances in 3D object classification [33, 34], point cloud segmentation [35, 36], cuboid-based geometric modeling [37, 38, 39], room layout prediction [40, 41], 3D contextual modeling [42, 43], and 3D shape reconstruction [44, 45]. Here, we focus on the related problem of 3D object detection.
In outdoor scenes, localizing objects with 3D cuboids has become a standard in the popular KITTI autonomous driving benchmark . 3D detection systems model car shape and occlusion patterns using LiDAR or stereo inputs [46, 47, 48, 49, 50], and may also incorporate additional overhead imagery . 3D cuboid representations are more powerful than 2D bounding boxes because they contain more information about 3D object locations, physical occupancy, and orientation. However, many outdoor 3D detection systems are specialized to vehicles and pedestrians, and may not generalize to cluttered indoor environments.
Other work has localized indoor objects with 3D cuboids [52, 53], but achieving high accuracy is challenging due to the significant shape variations found in cluttered, real-world environments. Several recent methods have incorporated CAD models to learn object shape [33, 54, 10] or hallucinate alternative viewpoints for appearance-based matching [55, 56, 57]. While CAD models are a potentially powerful information source, there does not exist an abundant supply of models for all categories, and many methods are limited to a small number of object categories . Moreover, example-based methods  may be computationally inefficient due to the need to match each exemplar to each image.
For robotics applications, a 3D convolutional neural network was designed to detect simple objects in real time. In 2015, Song et al. introduced the SUN RGB-D dataset, containing 10,335 RGB-D images with accurate 3D cuboid annotations for indoor objects, room layouts, and scene categories. The size of the dataset matches that of the PASCAL-VOC dataset, and it has motivated several recent research projects. Some methods utilize pre-trained 2D detectors and region proposals as priors, and localize 3D bounding boxes via a separate CNN [17, 59, 60, 49]. These methods are efficient and can achieve decent accuracy, but are sensitive to failures of the 2D object detector, which may not generalize to objects seen from novel 3D viewpoints.
Detecting support surfaces is an essential first step in understanding the geometry of 3D scenes for such tasks as surface normal estimation [61, 7] and shape retrieval . Silberman et al.  use semantic segmentation to model object support relationships; this work was later extended by Guo et al.  for support surface prediction. We instead use 3D support surface representations to improve the accuracy of our models of object style, and the speed of our detectors for small object categories.
To enable more holistic understanding of 3D scenes, we also predict the locations of walls, ceilings, and floors; this structure is sometimes called the room layout . Some related work has predicted 2D projections of the 3D layout [41, 5, 22, 63, 40, 64], or used CNNs to directly predict the 3D layout . In this paper, we use the geometric structure of typical indoor environments to design a Manhattan voxel representation that leads to accurate 3D layout predictions.
More broadly, holistic scene understanding systems integrate forms of semantic object reasoning, spatial context modeling, and scene type identification [66, 15]. Often, models for each sub-task are learned independently, and then integrated via conditional random fields (CRFs) like that proposed by Lin et al. . However, rich scene models lead to complex graph structures and challenging inference problems. Hoiem et al.  jointly estimate the camera viewpoint and detect objects, Zhang et al.  use pre-defined room configurations to adjust object localizations, while Ren et al. utilize scene type to refine detector outputs. We instead adapt the cascaded prediction framework  to learn multi-stage models capturing detector accuracies and contextual relationships among objects and the room layout.
3 Modeling 3D Geometry & Appearance
Feature extraction is one of the most important steps for object detection algorithms. 2D object detectors typically use either hand-crafted features based on image gradients [24, 25] or learned features from deep neural networks [26, 27, 28, 29, 30]. For 3D object detection systems with additional depth inputs, Gupta et al.  use horizontal disparity, height above the ground, and the angle of the local surface normal to encode images as a three channel (HHA) map for learning with CNNs. While convolutional processing of 2D images may be used to extract features from 2D bounding boxes, it does not directly model 3D cuboids. Song et al. propose a deep sliding shape  method that combines TSDF features  with standard 2D CNN features to describe 3D cuboids, but do not explicitly model 3D cuboid orientation.
Our object detectors are learned from 3D oriented cuboid annotations in the SUN-RGBD dataset . We discretize each cuboid into a grid of (large) voxels, and extract features for these cells. Voxel dimensions are scaled to match the size of each instance. We use standard descriptors for the 3D geometry of the observed depth image, and propose a novel cloud of oriented gradient (COG) descriptor of RGB appearance. We also introduce simple extensions that improve its performance.
3.1 Object Geometry: 3D Density and Orientation
3.1.1 Point Cloud Density
Conditioned on a 3D cuboid annotation or detection hypothesis $B_i$, suppose voxel $\ell$ contains $N_{i\ell}$ points. We use perspective projection to find the silhouette of each voxel in the image, and compute the area $A_{i\ell}$ of that convex region. The point cloud density feature for voxel $\ell$ then equals $N_{i\ell} / A_{i\ell}$. Normalization gives robustness to depth variation of the object in the scene. We normalize by the local voxel area, rather than by the total number of points in the cuboid as in some related work, to give greater robustness to partial object occlusions.
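The density computation can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the pinhole camera parameters (`focal`, `cx`, `cy`), and the assumption that voxel corners are given in camera coordinates are all hypothetical.

```python
def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) with z > 0."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def voxel_density(corners_3d, num_points, focal=500.0, cx=320.0, cy=240.0):
    """Number of points in the voxel divided by its projected silhouette area."""
    silhouette = convex_hull([project(c, focal, cx, cy) for c in corners_3d])
    return num_points / polygon_area(silhouette)
```

Because the silhouette area shrinks as a voxel moves away from the camera, roughly in step with the number of depth samples that land in it, the ratio stays comparable across object depths.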
3.1.2 3D Normal Orientations
Various representations, such as spin images, have been proposed for the vectors normal to a 3D surface. Following prior work, we build a 25-bin histogram of normal orientations within each voxel, and estimate the normal orientation for each 3D point via a plane fit to its 15 nearest neighbors. This feature captures the surface shape of each cuboid via patterns of local 3D orientations.
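A plane fit of this kind is commonly implemented with a singular value decomposition of the centered neighborhood. The sketch below assumes brute-force neighbor search, excludes the query point from its own neighborhood, and orients each normal toward a camera at the origin; these choices and the function name are assumptions, not details given in the text.

```python
import numpy as np


def estimate_normals(points, k=15):
    """Estimate a unit normal per point via a plane fit to its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    # Brute-force pairwise squared distances; fine for small clouds only.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    normals = np.empty_like(points)
    for i in range(len(points)):
        # Skip index 0 of the sorted list, which is the point itself.
        neighbors = points[np.argsort(d2[i])[1:k + 1]]
        centered = neighbors - neighbors.mean(axis=0)
        # The right singular vector with the smallest singular value is the
        # direction of least variance, i.e. the normal of the best-fit plane.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[-1]
        # Resolve the sign ambiguity: point the normal toward the camera at the origin.
        if np.dot(normal, points[i]) > 0:
            normal = -normal
        normals[i] = normal
    return normals
```

For real depth images a spatial index (for example a k-d tree) would replace the quadratic distance matrix.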
3.2 Clouds of Oriented Gradients (COG)
The histogram of oriented gradient (HOG) descriptor  forms the basis for many effective object detection methods . Edges are a very natural foundation for indoor scene understanding, due to the strong occluding contours generated by common objects. However, as gradient orientations are determined by 3D object orientation and perspective projection, HOG descriptors that are naively extracted in 2D image coordinates generalize poorly.
To address this issue, some previous work has restrictively assumed that parts of objects are near-planar so that image warping may be used for alignment , or that all objects have a 3D pose aligned with the global “Manhattan world coordinates” of the room . The bag of boundaries (BOB)  descriptor builds separate gradient-based models for each of several distinct 3D viewpoints, rather than using geometry to generalize across 3D viewpoints. Some previous 3D extensions of the HOG descriptor [71, 72] assume that a full 3D model is given. In recent work , 3D cuboid hypotheses were used to aggregate standard 2D features from a deep convolutional neural network, but the deep features are not conditioned on object orientations. Our cloud of oriented gradient (COG) feature accurately describes the 3D appearance of objects with complex 3D geometry, as captured by RGB-D cameras from any viewpoint.
3.2.1 2D Gradient Computation
We compute gradients by applying the filters $[-1, 0, 1]$ and $[-1, 0, 1]^\top$ to the RGB channels of the unsmoothed 2D image. The maximum responses across color channels are the gradients $(d_x, d_y)$ in the $x$ and $y$ directions, with corresponding magnitude $\sqrt{d_x^2 + d_y^2}$. We follow similar implementation details to the gradient computations used in HOG descriptors. The 2D unsigned gradients are then aggregated in each voxel to define our 3D COG descriptor.
3.2.2 3D Orientation Bins
The standard HOG descriptor uses nine evenly spaced gradient histogram bins for each cell of each object. For all object instances, the first bin is aligned with the horizontal image direction. As shown in Fig. 2, HOG descriptors may thus be inconsistent for (even nearly identical) objects in distinct poses.
Because objects from the same category typically have similar local 3D structure, for each oriented 3D cuboid proposal, we instead model local gradient statistics in a canonical 3D coordinate frame. As illustrated in Fig. 1, we define nine evenly spaced 3D orientation bins on the front surface of each voxel within the cuboid. For all instances, the first bin is aligned with the horizontal 3D axis of the cuboid (dark blue lines in Fig. 1). Given the camera’s intrinsic matrix and the extrinsic matrix encoding the relative 3D pose of the cuboid, we use perspective projection to map the 3D orientation bins to 2D image coordinates.
This transform aligns the 2D orientation bins for distinct 3D cuboids. For each pixel that back-projects into a given 3D voxel, we accumulate its unsigned 2D gradient in the corresponding projected orientation bin to define a nine-dimensional COG feature for that voxel.
Some previous work has warped images to align with fixed 2D orientation bins , but such affine transformations may be unstable for objects with non-planar geometry. Our COG descriptor can be seen as accumulating standard gradients with warped histogram bins, rather than warping images to match fixed orientation bins. This innovation enables our later learning algorithms to better generalize to novel 3D views of complex objects.
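The idea of warping histogram bins rather than images can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the bins are assumed to lie in the cuboid's local xy-plane, each bin direction is projected by finite differences through a pinhole camera with intrinsics `K` and cuboid-to-camera pose `(R, t)`, and gradients are hard-assigned to the nearest projected bin (the paper interpolates between neighboring bins). All function names are hypothetical.

```python
import numpy as np


def project(K, R, t, X):
    """Pinhole projection of a cuboid-frame 3D point X to pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]


def projected_bin_angles(K, R, t, voxel_center, num_bins=9, step=1e-3):
    """Unsigned image-plane angles, in [0, pi), of the projected 3D orientation bins."""
    base = project(K, R, t, voxel_center)
    angles = []
    for a in np.arange(num_bins) * np.pi / num_bins:
        # ASSUMPTION: bins lie in the cuboid's local xy-plane.
        direction = np.array([np.cos(a), np.sin(a), 0.0])
        tip = project(K, R, t, voxel_center + step * direction)
        v = tip - base
        angles.append(np.arctan2(v[1], v[0]) % np.pi)
    return np.array(angles)


def cog_histogram(grad_angles, grad_mags, bin_angles):
    """Accumulate gradient magnitudes into the nearest projected bin."""
    hist = np.zeros(len(bin_angles))
    for angle, mag in zip(grad_angles, grad_mags):
        diff = np.abs((angle % np.pi) - bin_angles)
        diff = np.minimum(diff, np.pi - diff)  # unsigned circular distance
        hist[np.argmin(diff)] += mag
    return hist
```

With a fronto-parallel cuboid the projected bins coincide with ordinary HOG bins; when the cuboid rotates, the bins rotate with it, so the same physical edge contributes to the same bin index.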
3.2.3 Normalization and Aliasing
We bilinearly interpolate gradient magnitudes between neighboring orientation bins . To normalize the histogram for voxel in cuboid , we then set for a small . Accounting for all orientations and voxels, the dimension of the COG feature is .
3.3 Extensions of the COG Descriptor
3.3.1 View-to-Camera Features
For single-view RGB-D inputs, objects like nightstands and other furniture may only expose one planar surface to the camera. At test time, the features of a 3D cuboid proposal oriented away from the camera may resemble those of a correct detection (see Fig. 3) because voxel features are computed by first rotating the cuboid to a canonical coordinate frame. However, due to the self-occlusions that occur in real objects, the features modeled by the COG descriptor would in fact not be visible when objects are facing away from the camera. Therefore, we add features to represent objects’ orientation with respect to the camera, and learn to distinguish implausible object hypotheses.
Specifically, we compute the cosine $a$ of the angle between the cuboid orientation and its viewing angle from the camera in the horizontal direction. Then we define a set of radial basis functions of the form
$$f_j(a) = \exp\left(-\frac{(a - \mu_j)^2}{2\sigma^2}\right)$$
and space the basis function centers $\mu_j$ evenly over $[-1, 1]$ with step size 0.2. The bandwidth $\sigma$ was chosen using validation data. Radial basis expansions are a standard non-linear regression method, and can be seen as a layer of a neural network. We expand the camera angle using this basis representation plus a bias feature, producing a 12-dimensional view-to-camera feature (11 basis functions and one bias).
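The basis expansion can be sketched as below, assuming Gaussian basis functions with centers spanning the range of a cosine, $[-1, 1]$. With step size 0.2 this gives 11 centers, and adding the bias yields 12 entries, which matches the 12 camera-view features counted in Sec. 3.4. The default `sigma=0.5` is only a placeholder, since the validated bandwidth is not stated here, and the function name is hypothetical.

```python
import numpy as np


def view_to_camera_feature(cos_angle, sigma=0.5, step=0.2):
    """Gaussian radial basis expansion of a cosine value, plus a bias term."""
    # ASSUMPTION: centers are spaced evenly over [-1, 1], the range of a cosine.
    num_centers = int(round(2.0 / step)) + 1
    centers = np.linspace(-1.0, 1.0, num_centers)
    basis = np.exp(-((cos_angle - centers) ** 2) / (2.0 * sigma ** 2))
    return np.concatenate([basis, [1.0]])  # trailing bias feature
```

A linear classifier on these features can represent a smooth, non-monotonic preference over viewing angles, which a single raw cosine feature cannot.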
3.3.2 Expanded Cuboid Features
Many object detection systems have a pre-processing stage that generates bounding box proposals that contain objects with well-defined boundaries, instead of amorphous background areas . Using a region proposal network to maximize the “objectness” score of predicted bounding boxes  is thus an essential first step for many state-of-the-art object detection systems [26, 17].
Objectness scores are usually determined from the difference between local and surrounding appearances of each object. Instead of designing a separate pre-processing step, we build such contextual cues into our cuboid features. For each cuboid proposal, we expand its size to capture an additional layer of voxels in each direction, so that each cuboid is now described by voxels.
Before discussing the training algorithm, we preview the learned weights of COG descriptors for the chair and toilet categories in Fig. 4. Toilets are typically placed against the wall in cluttered bathrooms, while there is typically free space around chairs, and thus our expanded cuboid features capture differences between these categories that improve detection accuracy.
The structure of our expanded cuboid feature has some similarities to the “zoom-out” features originally proposed for 2D image segmentation , and used by Song et al.  for 3D detection. We provide ablation studies in Table I, and demonstrate that this extension is very effective in modeling the geometric structure surrounding each cuboid, improving object detection accuracy.
3.4 Structured Prediction of Object Cuboids
For each voxel in some cuboid annotated in training image , we have one point cloud density feature , 25 surface normal histogram features , and 9 COG appearance features . For each cuboid , we have 12 camera view features . Using expanded features with voxels, our overall representation of cuboid is then . Cuboids are aligned via annotated orientations as illustrated in Fig. 1, using the gravity direction provided in the SUN-RGBD dataset .
For each object category $c$ independently, using those images which contain visible instances of that category, our goal is to learn a prediction function that maps an RGB-D image $I$ to a 3D bounding box $B = (L, \theta, S, z)$. Here $L$ is the center of the cuboid in 3D, $\theta$ is the cuboid orientation, $S$ is the physical size of the cuboid along the three axes determined by its orientation, and $z$ is a binary variable indicating whether the object is present in that area of the 3D scene. We assume objects have a base upon which they are typically supported, and thus $\theta$ is a scalar rotation with respect to the ground plane.
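Given $n$ training examples of category $c$, we use an $n$-slack formulation of the structural support vector machine (SVM) objective with margin rescaling constraints:
$$\min_{\theta_c,\,\xi \geq 0} \; \frac{1}{2}\theta_c^\top \theta_c + \frac{C}{n}\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad \theta_c^\top\left[\phi(I_i, B_i) - \phi(I_i, \bar{B}_i)\right] \geq \Delta(B_i, \bar{B}_i) - \xi_i, \quad \forall\, \bar{B}_i \in \mathcal{B}_i,\; i = 1, \ldots, n. \tag{3}$$
Here $\phi(I_i, B)$ are the features for oriented cuboid hypothesis $B$ given RGB-D image $I_i$, $B_i$ is the ground-truth cuboid annotation, and $\mathcal{B}_i$ is the set of possible alternative cuboids. For training images with multiple instances, as in previous work on 2D detection, we add multiple copies to the training set, each time removing the subset of 3D points contained in other instances.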
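Given some ground truth cuboid $B$ and estimated cuboid $\bar{B}$, with orientations $\theta$ and $\bar{\theta}$, we define the loss function $\Delta(B, \bar{B})$ as follows. If a scene contains ground truth cuboid $B$ and the indicator variable of the estimate equals 1, we compute
$$\Delta(B, \bar{B}) = 1 - \mathrm{IOU}(B, \bar{B}) \cdot \frac{1 + \cos(\bar{\theta} - \theta)}{2}. \tag{4}$$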
Here, IOU is the volume of the 3D intersection of the cuboids, divided by the volume of their 3D union. The loss is bounded between 0 and 1, and is smallest when the IOU is near 1 and the orientation error is near 0. The loss approaches 1 if either position or orientation is completely wrong. If a scene does not contain any ground truth instances of the object and the indicator variable for the cuboid proposal is 0, the loss equals 0. We penalize all other cases with a loss of 1. We solve the loss-sensitive objective of Eq. (3) using a cutting-plane method.
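One loss with exactly the properties described above multiplies the IOU by an orientation agreement term, $(1 + \cos(\bar{\theta} - \theta))/2$, and subtracts the product from 1. The sketch below assumes that form; the function name and the boolean presence flags are hypothetical, and the IOU is taken as an input because intersecting oriented cuboids is a separate computation.

```python
import math


def cuboid_loss(iou, theta_true, theta_pred, present_true=True, present_pred=True):
    """Loss in [0, 1] combining volumetric overlap and orientation error."""
    if present_true and present_pred:
        # ASSUMED form: perfect overlap and orientation give 0; zero overlap
        # or a 180-degree orientation error gives 1.
        agreement = (1.0 + math.cos(theta_pred - theta_true)) / 2.0
        return 1.0 - iou * agreement
    if not present_true and not present_pred:
        return 0.0  # correctly predicting that no object is present
    return 1.0      # missed object or hallucinated object
```

For example, a proposal with IOU 0.8 and a 90-degree orientation error incurs a loss of 0.6, the same as a perfectly oriented proposal with IOU 0.4.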
3.5 Cuboid Hypotheses
We create cuboid proposals in a sliding-window fashion using discretized 3D world coordinates, with 16 candidate orientations. We discretize cuboid sizes using empirical statistics of the cuboid annotations in the training database: width quantiles, depth quantiles, and height quantiles. Every combination of cuboid size, 3D position on the ground plane (whose height is estimated as described in Sec. 5), and 3D orientation is then evaluated.
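The exhaustive enumeration can be sketched as a Cartesian product over ground-plane positions, candidate sizes, and orientations. The grid values, the size list, and the assumption that the 16 orientations are spaced evenly over a full turn are illustrative choices, not values given in the text; the function name is hypothetical.

```python
import itertools
import math


def cuboid_hypotheses(xs, ys, sizes, num_orientations=16):
    """Yield every combination of ground-plane position, size, and orientation."""
    # ASSUMPTION: orientations are evenly spaced over a full 360-degree turn.
    thetas = [2.0 * math.pi * k / num_orientations for k in range(num_orientations)]
    for x, y, size, theta in itertools.product(xs, ys, sizes, thetas):
        yield {"center_xy": (x, y), "size": size, "theta": theta}
```

The number of hypotheses is the product of the four grid sizes, which is why restricting any one of them (as the support surface constraint in Sec. 4.2 does for position) has a large effect on run time.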
3.6 Relative Importance of 3D Cuboid Features
We explore the relative importance of different features for the detection of 5 large objects in Table I. We first trained our detector with geometric features only (Geom), with COG only (COG), with both geometric and COG features (Geom+COG), adding the camera-view feature (Geom+COG+view), and finally utilizing the expanded cuboid feature (Geom+COG+view+expanded). The COG feature and geometric features have complementary advantages in 3D object detection, and combining them leads to improved performance. The average accuracies of object detectors improve when additional features are added, demonstrating that each step of our feature design is effective.
4 Modeling Latent Support Surfaces
Geometric descriptors and COG descriptors are able to capture local shapes and appearances, but objects have widely varying visual styles. Moreover, 3D cuboids are labeled by different annotators from Mechanical Turk to construct the SUN RGB-D dataset , and thus objects in the same category may have inconsistent 3D annotations. As a result, voxel features are sometimes noisy and inconsistent across different object instances (see Fig. 5).
To explicitly model different visual styles within each object category, a classical approach is to use part-based models [25, 18] where objects are explained by spatially arranged parts. For many object categories, the height of the support surface is the primary cause of style variations (Fig. 5). Therefore, we explicitly model the support surface as a latent part for each object.
By modeling support surfaces we can also constrain the search space for small object detectors. Such detectors are otherwise computationally challenging to learn, and perform poorly due to the large set of 3D pose hypotheses.
4.1 Latent Structural SVM Learning
Some previous work was specifically designed to predict support surface regions  from labeled training data, but the predicted support surfaces are not semantically meaningful. We instead treat the height of the support surface of each object as a latent variable and use latent structural SVMs [79, 25] to learn the detector.
We follow the notation in Sec. 3.4 with an updated objective. For each category, our goal is to learn a prediction function that maps an RGB-D image to a 3D bounding box along with its relative surface height. This latent variable is defined as the height of the support surface relative to the bottom of the cuboid. We discretize the cuboid height into 7 slices, and thus the latent variable localizes the support surface to one of those slices (see Fig. 7).
Given training examples of category , we want to solve the following optimization problem:
Here is the target cuboid, is the set of possible cuboids, and is the set of possible surface heights. are the features associated to cuboid whose relative surface height is indicated by . We first discretize into voxels and compute geometric, COG, view-to-camera, and expanded cuboid features, as denoted by . Then we discretize with finer resolutions at the vertical dimension into voxels and take the -th slice from the bottom to represent cuboid feature, as denoted by . Finally we add an indicator vector for support surface height, so that
We use the same loss function defined in Sec. 3.4.
To train the model with latent support surfaces, we first pre-train cuboid descriptors (geometric features, COG, view-to-camera, and scene layout features) without modeling support surfaces. We then extract the center slice of pre-trained cuboid descriptors and concatenate it to the pre-trained models. Finally, we initialize the support surface height indicator vector randomly in . With this informative initialization, we find that the CCCP algorithm  is effective at solving the (non-convex) latent structural SVM learning problem .
4.2 Small Object Detection via Supporting Surfaces
While indoor scenes typically contain some large furniture like beds and chairs, many other objects with comparatively small physical size are very challenging to detect [17, 12]. Some algorithms are specifically designed to detect small objects in 2D images using multi-scale methods [81, 82], but they cannot be directly applied to 3D object detection.
A severe issue for detecting small objects is that the search space can be enormous, and thus training and testing with sliding-window cuboid proposals can be computationally intractable. But note that small objects, such as pillows, monitors, and lamps, are usually placed on top of other objects with support surfaces. If we only search for small objects on predicted support surfaces, the search space is greatly reduced. As a result, inference is faster and object proposals have fewer false positives. This is another benefit of modeling support surfaces.
In our implementation, we first detect large objects and furniture that rest on the ground. Then using the cascaded detection framework described in Sec. 6, we only search for smaller objects on top of the support surfaces of those large objects with positive confidence scores. We reduce the voxel grid to for lamps and pillows due to their small size, and to for monitors and TVs due to their flat shape.
5 Room Layout Geometry: Manhattan Voxels
Given an RGB-D image, indoor scene parsing requires not only object detection, but also room layout (floor, ceiling, wall) prediction [63, 3, 5, 41]. Such “free space” understanding is crucial for applications like robot navigation. Simple RGB-D layout prediction methods  work by fitting planes to the observed point cloud data, but are sensitive to outliers. We propose a more accurate learning-based approach to predicting Manhattan geometries that utilizes our COG descriptor.
The orthogonal walls of a standard room can be represented via a cuboid , and we could define geometric features via a standard voxel discretization (Fig. 8, bottom left). However, because corner voxels usually contain the intersection of two walls, they then mix 3D normal vectors with very different orientations. This discretization also ignores points outside of the hypothesized cuboid, and may match subsets of rooms with wall-like structure.
We propose a novel Manhattan voxel (Fig. 8, bottom right) discretization for 3D layout prediction. We first discretize the vertical space between floor and ceiling into 6 equal bins. We then use a threshold of to separate points near the walls from those in the interior or exterior of the hypothesized layout. Further using diagonal lines to split bins at the room corners, the overall space is discretized in bins. For each vertical layer, regions model the scene interior whose point cloud distribution varies widely across images. Regions model points near the assumed Manhattan wall structure: and should contain orthogonal planes, while and should contain parallel planes. Regions capture points outside of the predicted layout, as might be produced by depth sensor errors on transparent surfaces.
We again use the S-SVM formulation of Eq. (3) to predict Manhattan layout cuboids . The loss function is as in Eq. (4), except we use the “free-space” IOU defined by , and account for the fact that orientation is only identifiable modulo rotations. Because layout annotations do not necessarily have Manhattan structure, the ground truth layout is defined as the cuboid hypothesis with the largest free-space IOU.
We predict floors and ceilings as the 0.001 and 0.999 quantiles of the 3D points along the gravity direction, and discretize orientation into 18 evenly spaced angles between and . We then propose layout candidates that capture at least of all 3D points, and are bounded by the farthest and closest 3D points. For typical scenes, there are 5,000 to 20,000 layout hypotheses.
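The quantile-based floor and ceiling estimate is straightforward to sketch. The function name and the convention that height increases opposite to the gravity vector are assumptions.

```python
import numpy as np


def floor_ceiling_heights(points, gravity, lower_q=0.001, upper_q=0.999):
    """Estimate floor and ceiling heights as extreme quantiles along gravity."""
    points = np.asarray(points, dtype=float)
    g = np.asarray(gravity, dtype=float)
    g = g / np.linalg.norm(g)
    # ASSUMPTION: height increases in the direction opposite to gravity.
    heights = -(points @ g)
    return np.quantile(heights, lower_q), np.quantile(heights, upper_q)
```

Using quantiles just inside the extremes, rather than the minimum and maximum, discards a small number of outlying depth measurements below the floor or above the ceiling.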
6 Cascaded Learning of Spatial Context
If the learned object detectors are independently applied for each category, there may be many false positives where a “piece” of a large object is detected as a smaller object (see Fig. 9). Song et al.  reduce such errors via a heuristic reduction in confidence scores for small detections on large image segments. To avoid such manual engineering, which must be tuned to each category for peak performance, we propose to directly learn the relationships among detections of different categories. As room geometry is also an important cue for object detection, we integrate Manhattan layout hypotheses for holistic scene understanding [15, 52].
Classically, structured prediction of spatial relationships is often accomplished via undirected Markov random fields (MRFs) . As shown in Fig. 9, this generally leads to a fully connected graph  because there are relationships among every pair of object categories. An extremely challenging MAP estimation (or energy minimization) problem must then be solved at every training iteration, as well as for each test image, so learning and prediction are costly.
We propose to instead adapt cascaded classification  to the modeling of contextual relationships in 3D scenes. In this approach, “first-stage” detections as in Sec. 3.4 become input features to “second-stage” classifiers that estimate confidence in the correctness of cuboid hypotheses. This can be interpreted as a directed graphical model with hidden variables. Marginalizing the first-stage variables recovers a standard, fully-connected undirected graph. Crucially however, the cascaded representation is far more efficient: training decomposes into independent learning problems for each node (object category), and optimal test classification is possible via a rapid sequence of local decisions.
6.1 Contextual Features
For an overlapping pair of detected bounding boxes and , we denote their volumes as and , the volume of their overlap as , and the volume of their union as . We characterize their geometric relationship via three features: , , and the intersection-over-union . To model contextual relations between objects and the scene layout , we compute the distance and angle of cuboid to the closest wall.
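As a rough illustration of these pairwise overlap features, the sketch below computes them for axis-aligned boxes. Two points are assumptions rather than statements of our method: the cuboids in this paper are oriented, so axis alignment is a simplification, and the first two features are taken here to be the overlap volume normalized by each box's own volume. All function names are hypothetical.

```python
def volume(box):
    """Volume of an axis-aligned box given as (min_corner, max_corner)."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned boxes."""
    (a_min, a_max), (b_min, b_max) = a, b
    vol = 1.0
    for k in range(3):
        side = min(a_max[k], b_max[k]) - max(a_min[k], b_min[k])
        if side <= 0.0:
            return 0.0  # boxes are disjoint along this axis
        vol *= side
    return vol


def overlap_features(a, b):
    """Return (overlap / vol(a), overlap / vol(b), intersection-over-union).

    Assumes both boxes have positive volume. The first two ratios are an
    assumed form of the features; only the IoU is named in the text.
    """
    vol_a, vol_b = volume(a), volume(b)
    inter = overlap_volume(a, b)
    union = vol_a + vol_b - inter
    return inter / vol_a, inter / vol_b, inter / union


# Two cubes of side 2 that share a unit cube.
box_a = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
box_b = ((1.0, 1.0, 1.0), (3.0, 3.0, 3.0))
features = overlap_features(box_a, box_b)
```

The asymmetric ratios help to distinguish a small box nested inside a large one (one ratio near 1, the other near 0) from two boxes of similar size that partially overlap, a distinction the symmetric IoU alone cannot make.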
First-stage detectors provide a most-probable layout hypothesis, as well as a set of detections (following non-maximum suppression) for each category. For a bounding box with confidence score , there may be several overlapping bounding boxes of categories . Letting be the instance of category with maximum confidence , features for bounding box are created via a quadratic function of , , , and a radial basis expansion of . Relationships between second-stage layout candidates and object cuboids are modeled similarly.
For small objects that are placed on the support surfaces of large objects, 3D overlap features are noisy. We replace 3D overlap with 2D overlap scores from the top-down view of the scene (Fig. 10). See the Appendix for further details.
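A toy example shows why the top-down view is more informative for supported objects. The sketch assumes axis-aligned boxes with the third coordinate along gravity; the box dimensions and function names are hypothetical and chosen only for illustration.

```python
def overlap_1d(a_lo, a_hi, b_lo, b_hi):
    """Length of the overlap between intervals [a_lo, a_hi] and [b_lo, b_hi]."""
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))


def overlap_3d(a, b):
    """Intersection volume of two axis-aligned boxes (min_corner, max_corner)."""
    (a_min, a_max), (b_min, b_max) = a, b
    vol = 1.0
    for k in range(3):
        vol *= overlap_1d(a_min[k], a_max[k], b_min[k], b_max[k])
    return vol


def top_down_iou(a, b):
    """IoU of the two box footprints after dropping the gravity (third) axis."""
    (a_min, a_max), (b_min, b_max) = a, b
    inter, area_a, area_b = 1.0, 1.0, 1.0
    for k in (0, 1):
        inter *= overlap_1d(a_min[k], a_max[k], b_min[k], b_max[k])
        area_a *= a_max[k] - a_min[k]
        area_b *= b_max[k] - b_min[k]
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


# A lamp standing on a table: the boxes touch but share no volume.
table = ((0.0, 0.0, 0.0), (2.0, 1.0, 0.8))
lamp = ((0.5, 0.2, 0.8), (0.8, 0.5, 1.3))
shared_volume = overlap_3d(table, lamp)
footprint_score = top_down_iou(table, lamp)
```

Here the 3D overlap is zero because the lamp rests exactly on the table top, and small errors in either box's height would make it fluctuate. The footprint overlap is stable and still records that the lamp lies within the table's extent.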
6.2 Contextual Learning
Due to the directed graphical structure of the cascade, each second-stage detector may be learned independently. The objective is a simple binary classification: is the candidate detection a true positive, or a false positive? During training, each detected bounding box for each class is marked as “true” if its intersection-over-union score to a ground truth instance is greater than 0.25, and is the largest among such detections. We train a standard binary SVM with a radial basis function (RBF) kernel
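One reading of this labeling rule is sketched below: for each ground-truth instance, the detection with the largest intersection-over-union above 0.25 is marked as a true positive, and all remaining detections are negatives. The function name and the precomputed IoU matrix are hypothetical conveniences, not part of our released code.

```python
def label_detections(iou, threshold=0.25):
    """Assign binary training labels to first-stage detections of one class.

    iou[d][g] is the intersection-over-union between detection d and
    ground-truth instance g (assumed to be precomputed).
    Returns a list with one boolean per detection.
    """
    n_det = len(iou)
    n_gt = len(iou[0]) if n_det else 0
    labels = [False] * n_det
    for g in range(n_gt):
        best_det, best_iou = None, threshold
        for d in range(n_det):
            if iou[d][g] > best_iou:  # strictly above the threshold
                best_det, best_iou = d, iou[d][g]
        if best_det is not None:
            labels[best_det] = True
    return labels


# Four detections, two ground-truth instances.
toy_iou = [
    [0.6, 0.0],  # best match for instance 0
    [0.4, 0.0],  # above threshold, but not the largest for instance 0
    [0.1, 0.2],  # below threshold for both instances
    [0.0, 0.3],  # best match for instance 1
]
toy_labels = label_detections(toy_iou)
```

Under this reading, duplicate detections of an already matched instance count as false positives, so the second-stage classifier is encouraged to suppress them.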
The bandwidth parameter is chosen using validation data. While we use an RBF kernel for all reported experiments, the performance of a linear SVM is only slightly worse, and cascaded classification still provides useful performance gains for that more scalable training objective.
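For reference, a common Gaussian form of the RBF kernel is sketched below. The parameterization by `gamma` and the function names are assumptions; the exact parameterization used in our experiments may differ.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2) between feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma):
    """Kernel matrix over a list of feature vectors."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Larger gamma makes the kernel more local: distant points look less similar.
toy_points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_gram = gram_matrix(toy_points, gamma=0.5)
```

In practice one would train the SVM for several candidate bandwidths and keep the value with the best validation accuracy.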
6.3 Contextual Prediction
During testing, given the set of cuboids found in the first-stage sliding-window search, we apply the second-stage cascaded classifier to each cuboid to get a new contextual confidence score . The overall confidence score used for precision-recall evaluation is then , to account for both the original belief from the geometric and COG features and the correcting power of contextual cues. The second-stage layout prediction is directly provided by the second-stage S-SVM classifier.
7 Experiments
We train our 3D object detection algorithm solely on the SUN RGB-D dataset, with 5285 training images, and report performance on 5050 test images for all 19 object categories (Table II). The NYU Depth dataset has 3D cuboid labels for 1449 images, but its annotations are noisy and inconsistent. Some previous work has evaluated detection performance only on this small dataset, or defined its own annotations for 3D cuboids. We do not evaluate on the NYU Depth dataset because it is a subset of SUN RGB-D.
We evaluate detection performance via the intersection-over-union (IOU) with ground-truth cuboid annotations, and consider the predicted cuboid to be correct when the IOU is above 0.25. To evaluate the layout prediction performance, we calculate the free space IOU with human annotations. We provide results demonstrating the effectiveness of our 3D scene understanding system, and the importance of both appearance and context features.
7.1 Modeling Latent Support Surfaces
For objects such as beds, tables, and desks, modeling support surface as a latent variable helps capture the intra-class style variations within each cuboid. We visualize examples of inferred support surfaces in Figure 15. For objects that do not have explicit “support surfaces”, such as bathtub, bookshelf, and sink, our model can be viewed as a single part-based model and is also effective for 3D object detection. Note that the goal of this work is to model latent support surfaces to boost 3D detection accuracy, not to predict accurate supporting regions in scenes. We do not use any annotations of support surfaces when training, and also do not evaluate our performance on surface prediction benchmarks .
7.2 Small Object Detection
Detecting small objects is a challenging task, and achieving high accuracy remains an open research problem. Without modeling support surfaces, our baseline detectors completely fail to detect small objects because the search space is large and 3D object proposals contain many false positives. Using simple heuristics to check support relationships in the SUN RGB-D annotations, we find that more than 95% of lamps/pillows/monitors/TVs are placed on the surface of night-stands/tables/beds/desks/dressers. As shown in Table II, searching on predicted surfaces thus enables our algorithm to discover small objects with higher precision.
7.3 The Importance of Context
To show that the cascaded classifier helps to prune false positives, we evaluate detections using the confidence scores from the first-stage classifier (surface), as well as the updated confidence scores from the second-stage classifier using all object-to-object features (+context). As shown in Table II and Fig. 12, adding a contextual cascade clearly boosts performance. Furthermore, when object-to-scene-layout features are included (+layout), performance increases further. This result demonstrates that even if a small number of object categories are of primary interest, building models of the broader scene can be very beneficial.
We show some representative detection results in Fig. 14. In the first image, our chair detector is confused and fires on part of the sofa, but contextual cues from other detected bounding boxes prune away these false positives. For a fixed threshold across all object categories, we retain as many true detections while producing fewer false positives.
7.4 Cubical Voxels versus Manhattan Voxels
We use the free-space IOU to evaluate layout prediction performance. Using standard cubical voxels, our performance (72.33) is similar to the heuristic SUN RGB-D baseline (73.4). Combining Manhattan voxels with structured learning, performance increases to 78.96, demonstrating the effectiveness of this improved discretization. Furthermore, if we also incorporate contextual cues from detected objects, the score improves to 80.03. We provide layout prediction examples in Fig. 11.
[Figure panels: Ground Truth, First-stage, Second Stage, Second Stage]
7.5 Computational Speed
We implemented our algorithm in MATLAB and ran it on a single 2.5 GHz CPU core. Our detector takes 10 to 30 minutes per image. The most time-consuming part is feature computation, which could be accelerated by parallelizing across multi-core CPUs or GPUs. With pre-computed cuboid features for each RGB-D image, inference takes 2 seconds per object category. With pre-computed contextual features among all objects, the cascaded prediction framework takes less than 0.5 seconds on average. Training time ranges from 2 to 12 hours per category, depending on the number of training instances.
Deep learning-based 3D detection systems [17, 49] typically include a region proposal step that tightly constrains the search space for all object categories. Our cuboid proposals are dense and extensive, and thus our detector is usually slower. This limitation of our system could potentially be alleviated by pre-processing the data using a region proposal network.
7.6 Comparison to Other Methods
This paper has several differences from our preliminary work [12, 86]. Our use of expanded cuboid features is new, and contributes to our overall 3D detection performance. Some implementation details also differ, for example  uses scene category features while this paper does not. Also  uses a discretization of cuboids into voxels, and uses only images containing at least one object instance for structural SVM training of detectors.
Compared to other methods that use CNN features [17, 60] pretrained on external datasets, our COG-based 3D object detector has comparable or better performance even without the contextual cues provided by our cascaded classifier. Conventional CNNs for 3D detection [17, 60] are trained to produce weighted confidence scores for each of multiple object categories, while our first-stage detector is instead tuned to discriminatively localize individual categories in 3D. Our subsequent cascaded prediction  of contextual relationships between object detections has structural similarities to a multi-stage neural network, but it is trained using (convex) structural SVM loss functions and designed to have a more interpretable, graphical structure. Interestingly, our overall cascaded approach is more accurate than standard 3D CNNs [17, 60, 49] in the detection of both 10 and 19 object categories.
[Figure panels: Ground-truth Annotations for RGB-D Images, Our Final Stage 3D Detection Output]
We propose a geometric framework for 3D cuboid detection and Manhattan layout prediction from RGB-D images. Using our novel COG descriptor of 3D cuboid appearance, we train accurate 3D object detectors for nineteen categories, as well as a cascaded classifier that learns contextual cues to boost performance. Modeling the height of support surfaces as latent variables further increases detection accuracy for large objects, and constrains the search space to make the detection of small objects feasible.
Our scene representations are learned directly from RGB-D data without external CAD models, and thus may be easily generalized to many other object categories. Gradient-based detectors incorporating clouds of oriented gradients (COG) features achieve state-of-the-art performance on the challenging SUN RGB-D dataset. We hypothesize that our improvement over baseline methods incorporating deep learning is due to the superior ability of COG descriptors to generalize to novel 3D viewpoints. Incorporating similar geometric invariances into convolutional networks is a promising area for future research.
Appendix: Computation of Contextual Features
We give a more detailed specification of the contextual features we use to model object-object and object-layout relationships.
The first-stage detectors provide a most-probable layout hypothesis, as well as a set of detections (following non-maximum suppression) for each category. For each bounding box with confidence score , there may be several bounding boxes of various categories that overlap with it. We let be the instance of category with the maximum confidence score . The features for bounding box are then as follows:
Constant bias feature, and confidence score from the first-stage detector.
For , , we calculate , , and concatenate those numbers.
For , we calculate the difference in confidence score from each first-stage detector, , and concatenate those numbers.
For , we consider radial basis functions of the form in Eq. 2. For a typical indoor scene, the largest object-to-wall distance is usually less than m, therefore we space the basis function centers evenly between 0 and 5 with step size 0.5, and choose . We expand using this radial basis expansion.
The absolute value of cosine : .
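The two object-to-wall features in this list can be sketched as follows. The basis centers (0 to 5 in steps of 0.5) follow the text, while the Gaussian form of each basis function, the bandwidth value `gamma=2.0`, and the function names are placeholders chosen for illustration, since the tuned bandwidth is not reproduced here.

```python
import math

# Basis centers spaced evenly between 0 and 5 with step size 0.5 (11 centers).
CENTERS = [0.5 * k for k in range(11)]


def rbf_expand(distance, centers=CENTERS, gamma=2.0):
    """Expand a scalar distance into one Gaussian response per center.

    gamma is a hypothetical bandwidth, not the value used in our experiments.
    """
    return [math.exp(-gamma * (distance - c) ** 2) for c in centers]


def wall_features(distance, angle, gamma=2.0):
    """Concatenate the radial basis expansion of the object-to-wall distance
    with the absolute cosine of the object-to-wall angle (in radians)."""
    return rbf_expand(distance, gamma=gamma) + [abs(math.cos(angle))]


# An object 1 m from the closest wall, rotated by 180 degrees relative to it.
toy_features = wall_features(distance=1.0, angle=math.pi)
```

The expansion lets a classifier applied to these features represent non-monotonic preferences, such as a category that usually sits either against a wall or well away from it. The absolute cosine treats a cuboid and its 180-degree rotation identically.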
To model the second-stage layout candidates, we select the bounding box with the highest confidence score from the first-stage classifier in each category , and use the following features for layout with confidence score :
All the features used in the first-stage to model using Manhattan Voxels.
For , we calculate the radial basis expansion for , and its product with and .
For , we calculate the absolute value of the cosine of : , and .
For , we calculate the difference in confidence score from each first-stage detector, , and concatenate those numbers.
This research is supported in part by the Office of Naval Research (ONR) under Award Numbers N00014-13-1-0644 and N00014-17-1-2094, and by a pilot grant from the Brown University Center for Vision Research.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision (IJCV), vol. 88, no. 2, pp. 303–338, 2010.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), 2015.
-  D. C. Lee, M. Hebert, and T. Kanade, “Geometric reasoning for single image structure recovery,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 2136–2143.
-  V. Hedau, D. Hoiem, and D. Forsyth, “Thinking inside the box: Using appearance models and context based on room geometry,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2010, pp. 224–237.
-  J. Zhang, C. Kan, A. G. Schwing, and R. Urtasun, “Estimating the 3D layout of indoor scenes and its clutter from depth sensors,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 1273–1280.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2012, pp. 746–760.
-  D. F. Fouhey, A. Gupta, and M. Hebert, “Unfolding an indoor origami world,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 687–702.
-  R. Guo and D. Hoiem, “Support surface prediction in indoor scenes,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 2144–2151.
-  S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 345–360.
-  S. Song and J. Xiao, “Sliding shapes for 3D object detection in depth images,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 634–651.
-  S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Z. Ren and E. B. Sudderth, “Three-dimensional object detection and layout prediction using clouds of oriented gradients,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1525–1533.
-  B. C. Russell and A. Torralba, “Building a database of 3D scenes from user annotations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 2711–2718.
-  K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view RGB-D object dataset,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 1817–1824.
-  S. Song, L. Samuel, and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3354–3361.
-  S. Song and J. Xiao, “Deep sliding shapes for amodal 3D object detection in RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Fidler, S. Dickinson, and R. Urtasun, “3D object detection and viewpoint estimation with a deformable 3D cuboid model,” in Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 611–619.
-  A. Geiger and C. Wang, “Joint 3D object and layout inference from a single RGB-D image,” in German Conference on Pattern Recognition (GCPR), 2015.
-  G. Heitz, S. Gould, A. Saxena, and D. Koller, “Cascaded classification models: Combining models for holistic scene understanding,” in Advances in Neural Information Processing Systems (NeurIPS), 2009, pp. 641–648.
-  J. M. Coughlan and A. L. Yuille, “Manhattan world: Compass direction from a single image by Bayesian inference,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2. IEEE, 1999, pp. 941–947.
-  A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun, “Efficient structured prediction for 3D indoor scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 2815–2822.
-  J. Bai, Q. Song, O. Veksler, and X. Wu, “Fast dynamic programming for labeling problems with ordering constraints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 1728–1735.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2005, pp. 886–893.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 9, pp. 1627–1645, 2010.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580–587.
-  R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
-  J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao, “3D shapenets for 2.5D object recognition and next-best-view prediction,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3D classification and segmentation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems (NeurIPS), 2017.
-  H. Jiang and J. Xiao, “A linear approach to matching cuboids in RGBD images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
-  Z. Jia, A. Gallagher, A. Saxena, and T. Chen, “3D-based reasoning with blocks, support, and stability,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1–8.
-  J. Xiao, B. C. Russell, and A. Torralba, “Localizing 3d cuboids in single-view images,” in Advances in Neural Information Processing Systems (NeurIPS), 2012.
-  C.-Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich, “Roomnet: End-to-end room layout estimation,” 2017.
-  A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun, “Box in the box: Joint 3D layout and object reasoning from single images,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 353–360.
-  T. Shao, A. Monszpart, Y. Zheng, B. Koo, W. Xu, K. Zhou, and N. J. Mitra, “Imagining the unseen: Stability-based cuboid arrangements for scene understanding,” ACM Transactions on Graphics (SIGGRAPH ASIA), vol. 33, no. 6, 2014.
-  Y. Zhang, M. Bai, P. Kohli, S. Izadi, and J. Xiao, “Deepcontext: Context-encoding neural pathways for 3D holistic scene understanding,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  S. Tulsiani, A. Kar, J. Carreira, and J. Malik, “Learning category-specific deformable 3D models for object reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
-  A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3D reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun, “3D object proposals using stereo imagery for accurate object class detection,” in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
-  A. Mousavian, D. Anguelov, J. Flynn, and J. Košecká, “3D bounding box estimation using deep learning and geometry,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 5632–5640.
-  Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Data-driven 3D voxel patterns for object category recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1903–1911.
-  C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” 2018.
-  Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  D. Lin, S. Fidler, and R. Urtasun, “Holistic scene understanding for 3D object detection with RGBD cameras,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 1417–1424.
-  A. Gupta, A. A. Efros, and M. Hebert, “Blocks world revisited: Image understanding using qualitative geometry and mechanics,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2010, pp. 482–496.
-  S. Gupta, P. A. Arbeláez, R. B. Girshick, and J. Malik, “Aligning 3D models to RGB-D images of cluttered scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, “Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  J. J. Lim, H. Pirsiavash, and A. Torralba, “Parsing IKEA objects: Fine pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
-  J. J. Lim, A. Khosla, and A. Torralba, “FPM: Fine pose parts-based model with 3D CAD models,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 478–493.
-  D. Maturana and S. Scherer, “Voxnet: A 3D convolutional neural network for real-time object recognition,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2015, pp. 922–928.
-  Z. Deng and L. J. Latecki, “Amodal detection of 3D objects: Inferring 3D bounding boxes from 2d ones in rgb-depth images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  J. Lahoud and B. Ghanem, “2D-driven 3D object detection in RGB-D images,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  X. Wang, D. Fouhey, and A. Gupta, “Designing deep networks for surface normal estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2D-3D alignment via surface normal prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5965–5974.
-  V. Hedau, D. Hoiem, and D. Forsyth, “Recovering the spatial layout of cluttered rooms,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 1849–1856.
-  A. Mallya and S. Lazebnik, “Learning informative edge maps for indoor scene layout prediction,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 936–944.
-  C. Zou, A. Colburn, Q. Shan, and D. Hoiem, “Layoutnet: Reconstructing the 3d room layout from a single rgb image,” 2018.
-  J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 702–709.
-  D. Hoiem, A. A. Efros, and M. Hebert, “Putting objects in perspective,” International Journal of Computer Vision (IJCV), vol. 80, no. 1, pp. 3–15, 2008.
-  Y. Ren, C. Chen, S. Li, and C.-C. J. Kuo, “Context-assisted 3d (c3d) object detection from rgb-d images,” Journal of Visual Communication and Image Representation, vol. 34, no. 11, pp. 2189–2202, 2012.
-  A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3D scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 21, no. 5, pp. 433–449, 1999.
-  N. Payet and S. Todorovic, “From contours to 3d object detection and pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 983–990.
-  N. Buch, J. Orwell, and S. A. Velastin, “3D extended histogram of oriented gradients (3dhog) for classification of road users in urban scenes,” in Proceedings of the British Machine Vision Conference (BMVC), 2009.
-  M. Scherer, M. Walter, and T. Schreck, “Histograms of oriented gradients for 3D object retrieval,” in Proceedings of the European Conference on Computer Vision (ECCV), 2010.
-  S. Song and J. Xiao, “Deep sliding shapes for amodal 3D object detection in RGB-D images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 34, no. 11, pp. 2189–2202, 2012.
-  W. Kuo, B. Hariharan, and J. Malik, “Deepbox: Learning objectness with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2479–2487.
-  M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3376–3385.
-  T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural svms,” Machine Learning, 2009.
-  A. Vedaldi and A. Zisserman, “Structured output regression for detection with partial occlusion,” in Advances in Neural Information Processing Systems (NeurIPS), 2009.
-  C.-N. J. Yu and T. Joachims, “Learning structural svms with latent variables,” in International Conference on Machine Learning (ICML). ACM, 2009, pp. 1169–1176.
-  A. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural Computation, vol. 15, no. 4, pp. 915–936, 2003.
-  C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, “R-CNN for small object detection,” in ACCV. Springer, 2016, pp. 214–230.
-  P. Hu and D. Ramanan, “Finding tiny faces,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  L. D. Pero, J. Guan, E. Brau, J. Schlecht, and K. Barnard, “Sampling bedrooms,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 2009–2016.
-  S. Nowozin and C. H. Lampert, “Structured learning and prediction in computer vision,” Foundations and Trends in Computer Graphics and Vision, vol. 6, no. 3–4, pp. 185–365, 2011.
-  A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Objects in context,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
-  Z. Ren and E. B. Sudderth, “3D object detection with latent support surfaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.