
QuadricSLAM: Constrained Dual Quadrics from Object Detections
as Landmarks in Semantic SLAM

Lachlan Nicholson, Michael Milford, and Niko Sünderhauf

This research was conducted by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016). Michael Milford is supported by an Australian Research Council Future Fellowship (FT140101229). The authors are with the ARC Centre of Excellence for Robotic Vision, Queensland University of Technology (QUT), Brisbane, Australia. The authors gratefully thank John Skinner for his contributions to the evaluation environment. Contact: lachlan.nicholson@hdr.qut.edu.au

Research in Simultaneous Localization And Mapping (SLAM) is increasingly moving towards richer world representations involving objects and high level features that enable a semantic model of the world for robots, potentially leading to a more meaningful set of robot-world interactions. Many of these advances are grounded in state-of-the-art computer vision techniques primarily developed in the context of image-based benchmark datasets, leaving several challenges to be addressed in adapting them for use in robotics. In this paper, we derive a SLAM formulation that uses dual quadrics as 3D landmark representations, exploiting their ability to compactly represent the size, position and orientation of an object, and show how 2D bounding boxes (such as those typically obtained from visual object detection systems) can directly constrain the quadric parameters via a novel geometric error formulation. We develop a sensor model for deep-learned object detectors that addresses the challenge of partial object detections often encountered in robotics applications, and demonstrate how to jointly estimate the camera pose and constrained dual quadric parameters in factor graph based SLAM with a general perspective camera.

I Introduction

In recent years, impressive vision-based object detection performance improvements have resulted from the “rebirth” of Convolutional Neural Networks (ConvNets). Building on the groundbreaking work by Krizhevsky et al. [1] and earlier work [2, 3], several other groups (e.g. [4, 5, 6, 7, 8, 9]) have increased the quality of ConvNet-based methods for object detection. Recent approaches have even reached human performance on the standardized ImageNet ILSVRC benchmark [10] and continue to push the performance boundaries on other benchmarks such as COCO [11].

Despite these impressive developments, the Simultaneous Localization And Mapping community (SLAM) has not yet fully adopted the newly arisen opportunities to create semantically meaningful maps. SLAM maps typically represent geometric information, but do not carry immediate object-level semantic information. Semantically-enriched SLAM systems are appealing because they increase the richness with which a robot can understand the world around it, and consequently the range and sophistication of interactions that that robot may have with the world, a critical requirement for their eventual widespread deployment at work and in homes.

Semantically meaningful maps should be object-oriented, with objects as the central entities of the map. Quadrics, i.e. 3D surfaces such as ellipsoids, are ideal landmark representations for object-oriented semantic maps. In contrast to more complex object representations such as truncated signed distance fields [12], quadrics have a very compact representation and can be manipulated efficiently within the framework of projective geometry. Quadrics also capture information about the size, position, and orientation of an object, and can serve as anchors for more detailed 3D reconstructions if necessary. They are also appealing from an integration perspective: as we are going to show, in their dual form, quadrics can be constructed directly from object detection bounding boxes and conveniently incorporated into a factor graph based SLAM formulation.

In this paper we make the following contributions. We first show how to parametrize object landmarks in SLAM as constrained dual quadrics. We then demonstrate that visual object detection systems such as Faster R-CNN [7], SSD [8], or Mask R-CNN [9] can be used as sensors in SLAM, and that their observations – the bounding boxes around objects – can directly constrain dual quadric parameters via our novel geometric error formulation. To incorporate quadrics into SLAM, we derive a factor graph-based SLAM formulation that jointly estimates the dual quadric and robot pose parameters. Our large-scale evaluation using 250 indoor trajectories through a high-fidelity simulation environment shows how object detections and the dual quadric parametrization aid the SLAM solution.

Fig. 1: QuadricSLAM uses objects as landmarks and represents them as constrained dual quadrics in 3D space. QuadricSLAM jointly estimates camera poses and quadric parameters from noisy odometry and object detection bounding boxes, and performs loop closures based on the object observations. This figure illustrates how well the estimated quadrics fit the true objects when projected into the camera images from different viewpoints (red ellipses).

Previous work [13] utilized dual quadrics as a parametrization for landmark mapping only, was limited to an orthographic camera [14], or used an algebraic error that proved to be invalid when landmarks are only partially visible [15]. In this new work we perform full SLAM, i.e. Simultaneous Localization And Mapping, with a general perspective camera and a more robust geometric error. Furthermore, previous work [13, 14] required ellipse fitting as a pre-processing step: here we show that dual quadrics can be estimated in SLAM directly from bounding boxes.

II Related Work

In the following section we discuss the use of semantically meaningful landmark representations in state-of-the-art mapping systems and detail existing literature that utilizes quadric surfaces as object representations.

II-A Maps and Landmark Representations in SLAM

Most current SLAM systems represent the environment as a collection of distinct geometric points that are used as landmarks. ORB-SLAM [16, 17] is one of the most prominent recent examples for such a point-based visual SLAM system. Even direct visual SLAM approaches [18, 19] produce point cloud maps, albeit much denser than previous approaches. Other authors explored the utility of higher order geometric features such as line segments [20] or planes [21].

A commonality of all those geometry-based SLAM systems is that their maps carry geometric but no immediate semantic information. An exception is the influential work by Salas-Moreno et al. [22]. This work proposed an object oriented SLAM system by using real-world objects such as chairs and tables as landmarks instead of geometric primitives. [22] detected these objects in RGB-D data by matching 3D models of known object classes. In contrast to [22], the approach presented in this paper does not require a-priori known object CAD models, but instead uses general purpose visual object detection systems, typically based on deep convolutional networks, such as [8, 23, 7].

SemanticFusion [24] recently demonstrated how a dense 3D reconstruction obtained by SLAM can be enriched with semantic information. This work, and other similar papers such as [25], add semantics to the map after it has been created. The maps are not object-centric, but rather dense point clouds, where every point carries a semantic label, or a distribution over labels. In contrast, our approach uses objects as landmarks inside the SLAM system, and the resulting map consists of objects encoded as quadrics.

II-B Dual Quadrics as Landmark Representations

The connection between object detections and dual quadrics was recently investigated by [14] and [13]. Crocco et al. [14] presented an approach for estimating dual quadric parameters from object detections in closed form. Their method however is limited to orthographic cameras, while our approach works with perspective cameras, and is therefore more general and applicable to robotics scenarios. Furthermore, [14] requires an ellipse-fitting step around each detected object. In contrast, our method can estimate camera pose and quadric parameters directly from the bounding boxes typically produced by object detection approaches such as [8, 23, 7].

As an extension of [14], Rubino et al. [13] described a closed-form approach to recover dual quadric parameters from object detections in multiple views. Their method can handle perspective cameras, but does not solve for camera pose parameters. It therefore performs only landmark mapping given known camera poses. In contrast, our approach performs full Simultaneous Localization And Mapping, i.e. solving for camera pose, landmark pose and shape parameters simultaneously. Similar to [14], [13] also requires fitting ellipses to bounding box detections first.

We explored initial ideas of using dual quadrics as landmarks in factor-graph SLAM in [15]. This unpublished preliminary work proposed an algebraic error formulation that proved to be not robust in situations where object landmarks are only partially visible. We overcome this problem by a novel geometric error formulation in this paper. In contrast to [15], we constrain the quadric landmarks to be ellipsoids, initialise them correctly, and present a large-scale evaluation in a high-fidelity simulation environment.

III Dual Quadrics – Fundamental Concepts

This section explains fundamental concepts around dual quadrics that are necessary to follow the remainder of the paper. For a more in-depth coverage we refer the reader to textbooks on projective geometry such as [26].

III-A Dual Quadrics

Quadrics are surfaces in 3D space that are defined by a 4×4 symmetric matrix Q, so that all homogeneous points x on the quadric fulfill x^T Q x = 0. Examples for quadrics are bodies such as spheres, ellipsoids, hyperboloids, cones, or cylinders.

A quadric has 9 degrees of freedom. These correspond to the ten independent elements of the symmetric matrix Q less one for scale. We can represent a general quadric with a 10-vector q̂ where each element corresponds to one of the 10 independent elements of Q.

While the above definition of a quadric concentrates on the points on the quadric’s surface, a quadric can also be defined by a set of tangential planes π such that the planes form an envelope around the quadric. This dual quadric Q* is defined as π^T Q* π = 0. Every quadric Q has a corresponding dual form Q* = adjugate(Q), or Q* = Q⁻¹ if Q is invertible.

When a quadric is projected onto an image plane, it creates a dual conic C*, following the simple rule C* = P Q* P^T. Here, P = K[R|t] is the 3×4 camera projection matrix that contains intrinsic and extrinsic camera parameters. Conics are the 2D counterparts of quadrics and form shapes such as circles, ellipses, parabolas, or hyperbolas. Just like quadrics, they can be defined in a primal form via points (x^T C x = 0), or in dual form using tangent lines: l^T C* l = 0.

III-B Constrained Dual Quadric Parametrization

In its general form, a quadric or dual quadric can represent both closed surfaces such as spheres and ellipsoids and non-closed surfaces such as paraboloids or hyperboloids. As only the former are meaningful representations of object landmarks, we use a constrained dual quadric representation that ensures the represented surface is an ellipsoid or sphere.

Similar to [13], we parametrize dual quadrics as:

Q* = Z Q̆* Z^T,  with  Q̆* = diag(s₁², s₂², s₃², −1)     (1)

where Q̆* is an ellipsoid centred at the origin, and Z is a homogeneous transformation that accounts for an arbitrary rotation and translation. Specifically,

Z = [ R(θ)   t ]
    [ 0^T    1 ]     (2)

where t = (t₁, t₂, t₃)^T is the quadric centroid translation, R(θ) is a rotation matrix defined by the angles θ = (θ₁, θ₂, θ₃), and s = (s₁, s₂, s₃) is the shape of the quadric along the three semi-axes of the ellipsoid. In the following, we compactly represent a constrained dual quadric with the 9-vector q = (θ₁, θ₂, θ₃, t₁, t₂, t₃, s₁, s₂, s₃)^T and reconstruct the full dual quadric Q* as defined in (1).
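The reconstruction of a full dual quadric from the 9-vector can be sketched as follows. This is a minimal NumPy sketch under one stated assumption: the paper does not fix an Euler-angle convention for R(θ), so the ZYX order used here is a choice for illustration:

```python
import numpy as np

def rotation_from_angles(theta):
    """R = Rz(t3) Ry(t2) Rx(t1); the Euler-angle order is an assumption."""
    t1, t2, t3 = theta
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(t1), -np.sin(t1)],
                   [0, np.sin(t1),  np.cos(t1)]])
    Ry = np.array([[ np.cos(t2), 0, np.sin(t2)],
                   [0, 1, 0],
                   [-np.sin(t2), 0, np.cos(t2)]])
    Rz = np.array([[np.cos(t3), -np.sin(t3), 0],
                   [np.sin(t3),  np.cos(t3), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def constrained_dual_quadric(q):
    """Rebuild Q* = Z diag(s1^2, s2^2, s3^2, -1) Z^T from the 9-vector
    q = (theta1..3, t1..3, s1..3); the surface is always an ellipsoid."""
    theta, t, s = q[0:3], q[3:6], q[6:9]
    Z = np.eye(4)
    Z[:3, :3] = rotation_from_angles(theta)
    Z[:3, 3] = t
    Q_centered = np.diag([s[0]**2, s[1]**2, s[2]**2, -1.0])
    return Z @ Q_centered @ Z.T
```

A useful property of this form: the last column of Q* is (−t, −1)^T up to scale, so the ellipsoid centroid can always be read off as Q*[:3, 3] / Q*[3, 3].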

(a) A naive sensor model truncates the full conic bounds (red) resulting in errors when compared against ground truth detections (green).
(b) Our proposed sensor model correctly predicts the object detection (green) by calculating the on-image conic bounding box (blue).
Fig. 4: Sensor models for a deep-learned object detector.

IV A Sensor Model for a Deep-Learned Object Detector

IV-A Motivation

Our goal is to incorporate state-of-the-art deep-learned object detectors such as [7, 8, 9] as a sensor into SLAM. We therefore have to formulate a sensor model that can predict the observations of the object detector given the estimated camera pose x and the estimated map structure, i.e. the quadric parameters q. While such sensor models are often rather simple, e.g. when using point landmarks or laser scanners and occupancy grid maps, the sensor model for an object detector is more complex.

The observations of an object detector comprise an axis-aligned bounding box constrained to the image dimensions and a discrete label distribution for each detected object. In this paper we focus on the bounding box only, which can be represented as a set of four lines or a vector b = (x_min, y_min, x_max, y_max)^T containing the pixel coordinates of its upper-left and lower-right corner. We therefore seek a formulation for the sensor model β(x, q), mapping from camera pose x and quadric q to the predicted bounding box observation b.

This sensor model allows us to formulate a geometric error term between the predicted and observed object detections, which is the crucial component of our overall SLAM system as explained in Section V.

IV-B Deriving the Object Detection Sensor Model

Our derivation of β(x, q) starts with projecting the estimated quadric Q*, parametrized by q, into the image using the camera pose x according to C* = P Q* P^T, with P = K[R|t] comprising the intrinsic (K) and pose parameters of the camera. Given the dual conic C*, we obtain its primal form C by taking the adjugate.

A naive sensor model would simply calculate the enclosing bounding box of the conic and truncate this box to fit the image. However, as illustrated in Figure 4(a), this can introduce significant errors when the conic’s extrema lie outside of the image boundaries.

An accurate sensor model requires knowledge of the intersection points between conic and image borders. The correct prediction of the object detector’s bounding box therefore is the minimal axis-aligned rectangle that envelopes all of the conic contained within the image dimensions. We will explain the correct method of calculating this conic bounding box, denoted BBox(C), below. The overall sensor model is then defined as

β(x, q) = BBox(C)     (3)
IV-C Calculating the On-Image Conic Bounding Box

We can calculate the correct on-image conic bounding box with the following algorithm, which we denote BBox(C):

  1. Find the four extrema points p₁, …, p₄ of the conic C, i.e. the points on the conic that maximise or minimise the x or y component respectively.

  2. Find the up to 8 points p₅, …, p₁₂ where the conic intersects the image boundaries.

  3. Remove all non-real points and all points outside the image boundaries from the set P = {p₁, …, p₁₂}.

  4. Find and return the maximum and minimum x and y coordinate components among the remaining points.

We will explain each of these steps in detail in the following.

Calculating the conic’s extrema points

A conic can be represented both as a symmetric matrix C and in Cartesian form by the following expression:

ax² + bxy + cy² + dx + ey + f = 0     (4)

We obtain the 4 extrema points of the conic by finding the roots of the partial derivatives of (4) with respect to x and y. These derivatives can be interpreted as two lines that intersect the conic at its extrema points, as depicted in Figure 4(b):

2ax + by + d = 0     (5)
bx + 2cy + e = 0     (6)

Solving for the x values by rearranging equations (5) and (6) in terms of y,

y = −(2ax + d) / b     (7)
y = −(bx + e) / (2c)     (8)

and substituting into (4) yields:

a(4ac − b²)x² + 2a(2cd − be)x + (cd² − bde + b²f) = 0     (9)
(4ac − b²)x² + (4cd − 2be)x + (4cf − e²) = 0     (10)

The roots of these quadratics correspond to the x values of the points p₁, …, p₄. In order to obtain the corresponding y values at each of these locations, we substitute the roots of (9) and (10) into equations (7) and (8) respectively. Solving these expressions leads us to the set of points that define the conic’s maximum and minimum bounds.
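The extrema computation can be sketched in plain Python. Note one small deviation from the back-substitution route in the text, made only for numerical convenience: to avoid dividing by b when the conic is axis-aligned (b = 0), this sketch eliminates y and x symmetrically, using the lines (5) and (6) directly; it assumes a non-degenerate elliptical conic (4ac − b² > 0):

```python
import math

def solve_quadratic(A, B, C):
    """Real roots of A z^2 + B z + C = 0 (empty list if none)."""
    disc = B * B - 4 * A * C
    if disc < 0:
        return []
    r = math.sqrt(disc)
    return [(-B - r) / (2 * A), (-B + r) / (2 * A)]

def conic_extrema(a, b, c, d, e, f):
    """Extrema points of the conic a x^2 + b x y + c y^2 + d x + e y + f = 0.
    x-extrema lie on the line (6): b x + 2 c y + e = 0; y-extrema lie on the
    line (5): 2 a x + b y + d = 0. Eliminating the other variable gives
    one quadratic in x and one in y. Assumes an ellipse (4ac - b^2 > 0)."""
    pts = []
    # Leftmost/rightmost points: quadratic in x after eliminating y via (6).
    for x in solve_quadratic(4 * a * c - b * b, 4 * c * d - 2 * b * e, 4 * c * f - e * e):
        pts.append((x, -(b * x + e) / (2 * c)))
    # Topmost/bottommost points: quadratic in y after eliminating x via (5).
    for y in solve_quadratic(4 * a * c - b * b, 4 * a * e - 2 * b * d, 4 * a * f - d * d):
        pts.append((-(b * y + d) / (2 * a), y))
    return pts
```

For a circle of radius 2 centred at (1, 1), i.e. x² + y² − 2x − 2y − 2 = 0, this returns the four points (−1, 1), (3, 1), (1, −1), and (1, 3).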

Calculating the intersections with the image boundaries

Factoring (4) in terms of x or y allows us to solve for the missing values of a point along the conic, as seen below:

y(x) = [−(bx + e) ± √((bx + e)² − 4c(ax² + dx + f))] / (2c)     (11)
x(y) = [−(by + d) ± √((by + d)² − 4a(cy² + ey + f))] / (2a)     (12)

We calculate the intersections of conic and image boundaries by substituting the x and y values that define the image dimensions into equations (11) and (12) respectively (i.e. x = 0, x = W − 1 and y = 0, y = H − 1, for an image of width W and height H). Notice that in most circumstances some of the resulting solutions will be non-real, when the conic intersects the image boundaries in fewer than 8 points.

Final steps

We restrict the set P such that it contains only points that are real, and create a new set P′ of the remaining points that lie within the image dimensions W, H:

P′ = {p ∈ P : 0 ≤ p_x ≤ W − 1, 0 ≤ p_y ≤ H − 1}     (13)

Finally, we find the maximum and minimum x and y values from the points in P′ to define the on-screen bounding box of the conic. The function BBox(C) therefore executes all of the above steps and returns a vector

BBox(C) = (x_min, y_min, x_max, y_max)^T     (14)

that correctly describes a bounding box that envelopes the portion of the conic that would be visible in the image.
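Steps 2–4 of the algorithm can be sketched as follows, taking the extrema points from step 1 as an input. This is an illustrative Python sketch, not the paper's implementation; it assumes a non-degenerate elliptical conic (a ≠ 0, c ≠ 0) so that both quadratic solves are well defined:

```python
import math

def y_on_conic(a, b, c, d, e, f, x):
    """Solve (4) as a quadratic in y for a fixed x: c y^2 + (b x + e) y + (a x^2 + d x + f) = 0."""
    B, C = b * x + e, a * x * x + d * x + f
    disc = B * B - 4 * c * C
    if disc < 0:
        return []            # non-real: this vertical line misses the conic
    r = math.sqrt(disc)
    return [(-B - r) / (2 * c), (-B + r) / (2 * c)]

def x_on_conic(a, b, c, d, e, f, y):
    """Solve (4) as a quadratic in x for a fixed y: a x^2 + (b y + d) x + (c y^2 + e y + f) = 0."""
    B, C = b * y + d, c * y * y + e * y + f
    disc = B * B - 4 * a * C
    if disc < 0:
        return []
    r = math.sqrt(disc)
    return [(-B - r) / (2 * a), (-B + r) / (2 * a)]

def on_image_bbox(conic, extrema, width, height):
    """Steps 2-4 of BBox(C): intersect with the image borders, keep only
    on-image points, and return (x_min, y_min, x_max, y_max)."""
    a, b, c, d, e, f = conic
    pts = list(extrema)
    for x in (0.0, width - 1.0):          # left and right borders
        pts += [(x, y) for y in y_on_conic(a, b, c, d, e, f, x)]
    for y in (0.0, height - 1.0):         # top and bottom borders
        pts += [(x, y) for x in x_on_conic(a, b, c, d, e, f, y)]
    eps = 1e-9
    kept = [(x, y) for x, y in pts
            if -eps <= x <= width - 1 + eps and -eps <= y <= height - 1 + eps]
    xs, ys = [p[0] for p in kept], [p[1] for p in kept]
    return min(xs), min(ys), max(xs), max(ys)
```

For a circle of radius 2 centred at (1, 1) inside a 10×8 image, the circle pokes past the left and bottom borders, and the returned box (0, 0, 3, 3) correctly clips to the visible portion instead of the full (−1, −1, 3, 3) extent.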

V SLAM with Dual Quadric Landmark Representations

V-A General Problem Setup

We will set up a SLAM problem where we have odometry measurements u_i between two successive poses x_i and x_{i+1}, so that x_{i+1} = f(x_i, u_i) ⊕ w_i. Here f is a usually nonlinear function that implements the motion model of the robot, and the x_i and x_{i+1} are the unknown robot poses. The w_i are zero-mean Gaussian error terms with covariances Σ_i. The source of these odometry measurements is not of concern for the following discussion, and various sources such as wheel odometers or visual odometry are possible.

We furthermore observe a set of detections B = {b_ij}. We use this notation to indicate a bounding box b_ij around object j being observed from pose x_i. Notice that we assume the problem of data association is solved, i.e. we can identify which physical object j a detection b_ij originates from.¹

¹ For a discussion of SLAM methods robust to data association errors see the relevant literature such as [27, 28]. The methods discussed for pose graph SLAM can be adopted to the landmark SLAM considered here.

V-B Building and Solving a Factor Graph Representation

The conditional probability distribution over all robot poses X = {x_i} and landmarks Q = {q_j}, given the observations U = {u_i} and B = {b_ij}, can be factored as

P(X, Q | U, B) ∝ ∏_i P(x_{i+1} | x_i, u_i) · ∏_{ij} P(x_i, q_j | b_ij)     (15)
This factored distribution can be conveniently modelled as a factor graph [29].

Given the sets of observations U and B, we seek the optimal, i.e. maximum a posteriori (MAP), configuration of robot poses and dual quadrics, X⋆ and Q⋆, to solve the landmark SLAM problem represented by the factor graph. This MAP variable configuration is equal to the mode of the joint probability distribution P(X, Q | U, B). In simpler words, the MAP solution is the point where that distribution has its maximum.

The odometry factors are typically assumed to be Gaussian, i.e. P(x_{i+1} | x_i, u_i) = N(f(x_i, u_i), Σ_i), where f is the robot’s motion model. To integrate the landmark factors into a Gaussian factor graph, we apply Bayes rule:

P(x_i, q_j | b_ij) = P(b_ij | x_i, q_j) P(x_i, q_j) / P(b_ij)     (16)

Since we are performing MAP estimation, we can ignore the denominator P(b_ij), which essentially serves as a normaliser. Furthermore, assuming a uniform prior P(x_i, q_j), we see that for our purposes we can replace P(x_i, q_j | b_ij) by the likelihood term P(b_ij | x_i, q_j). The latter can be modelled as a Gaussian N(β(x_i, q_j), Λ_ij), where β is the sensor model defined in Section IV, and Λ_ij is the covariance matrix capturing the spatial uncertainty (in image space) of the observed object detections.

The optimal variable configuration can now be determined by maximizing the joint probability (15) from above:

X⋆, Q⋆ = argmax_{X,Q} P(X, Q | U, B)
       = argmin_{X,Q} Σ_i ‖f(x_i, u_i) ⊖ x_{i+1}‖²_{Σ_i} + Σ_{ij} ‖β(x_i, q_j) − b_ij‖²_{Λ_ij}     (17)

This is a nonlinear least squares problem, since we seek the minimum over a sum of squared terms.

Here ‖·‖²_Σ denotes the squared Mahalanobis distance with covariance Σ. We use the ⊖ operator in the odometry factor to denote that the difference operation is carried out in SE(3) rather than in Euclidean space.

Nonlinear least-squares problems such as (17) can be solved iteratively using methods like Levenberg-Marquardt or Gauss-Newton. Solvers that exploit the sparse structure of the factorisation can solve typical problems with thousands of variables very efficiently.
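To make the least-squares structure of (17) concrete, here is a toy illustration rather than the full quadric system: a robot moving on a line with three scalar poses, one scalar point landmark, unit covariances, and a soft prior on the first pose to fix the gauge. With these simplifications every factor is linear, so the problem collapses to a single linear solve (the real system is nonlinear and is solved iteratively as described above):

```python
import numpy as np

# Ground truth: three scalar poses and one landmark position (all hypothetical).
poses_true = np.array([0.0, 1.0, 2.0])
landmark_true = 5.0

rng = np.random.default_rng(0)
odom = np.diff(poses_true) + rng.normal(0, 0.05, 2)           # u_i = x_{i+1} - x_i + noise
ranges = landmark_true - poses_true + rng.normal(0, 0.05, 3)  # b_i = q - x_i + noise

# Unknowns z = (x0, x1, x2, q). One row of A per factor:
# a prior on x0 (fixes the gauge), two odometry factors, three landmark factors.
A = np.zeros((6, 4))
r = np.zeros(6)
A[0, 0] = 1.0;              r[0] = 0.0       # prior: x0 = 0
A[1, [0, 1]] = [-1, 1];     r[1] = odom[0]   # x1 - x0 = u0
A[2, [1, 2]] = [-1, 1];     r[2] = odom[1]   # x2 - x1 = u1
for i in range(3):                           # q - x_i = b_i
    A[3 + i, i] = -1.0
    A[3 + i, 3] = 1.0
    r[3 + i] = ranges[i]

z, *_ = np.linalg.lstsq(A, r, rcond=None)    # minimises ||A z - r||^2
```

The recovered vector z lands close to the ground truth (0, 1, 2, 5); with pose variables in SE(3) and the quadric sensor model β, the same stacked-residual structure is what Levenberg-Marquardt or Gauss-Newton linearises at each iteration.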

V-C The Geometric Error Term

The error term ‖β(x_i, q_j) − b_ij‖ that constitutes the quadric landmark factors in (17) is a geometric error, since β(x_i, q_j) and b_ij are vectors containing pixel coordinates. In contrast to the algebraic error proposed in previous work [14, 13, 15], we found that our geometric error formulation is well-defined even when the observed object is only partially visible in the image (see Figure 4(b)). Such situations result in truncated bounding box observations that invalidate the algebraic error formulation and shrink the estimated quadric.

Using a geometrically meaningful error will also allow us to conveniently propagate the spatial uncertainty of the object detector (e.g. as proposed by [30]) into the semantic SLAM system via the covariance matrices Λ_ij in future work.

V-D Variable Initialization

All variable parameters X and Q must be initialized in order for the incremental solvers to work. While the robot poses X can be initialized to an initial guess obtained from the raw odometry measurements U, initializing the dual quadric landmarks Q requires more consideration.

It is possible to initialize q̂_j with the least squares fit to its defining equation:

π^T Q*_j π = 0     (18)

where q̂_j is the 10-vector form of a general dual quadric as defined in Section III-A, not to be confused with the parametrized quadric vector q_j presented in Section III-B.

We can form the homogeneous vectors π defining the planes from the lines l of the landmark bounding box observations b_ij by projecting them according to π = P^T l. Here the camera matrix P is formed using the initial camera pose estimates obtained from the odometry measurements. Exploiting the fact that Q*_j is symmetric, we can rewrite (18) for a specific π = (π₁, π₂, π₃, π₄)^T as:

(π₁², 2π₁π₂, 2π₁π₃, 2π₁π₄, π₂², 2π₂π₃, 2π₂π₄, π₃², 2π₃π₄, π₄²) · q̂_j = 0     (19)

By collecting all these equations that originate from multiple views and planes π, we obtain a linear system of the form A q̂_j = 0, with A containing the coefficient rows (19) of all planes π associated with observations of landmark j. A least squares solution that minimizes ‖A q̂_j‖ subject to ‖q̂_j‖ = 1 can be obtained as the last column of V, where A = U S V^T is the Singular Value Decomposition (SVD) of A.
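The SVD initialization can be sketched in NumPy. The verification at the end uses a hypothetical sphere rather than back-projected bounding-box lines: a tangent plane π = (n, −d) of a sphere of radius r at the origin satisfies π^T Q* π = r²|n|² − d² = 0, so planes with unit normals and d = r are exact observations:

```python
import numpy as np

# Index pairs (i <= j) of the 10 independent elements of a symmetric 4x4 Q*.
IDX = [(i, j) for i in range(4) for j in range(i, 4)]

def plane_row(pi):
    """Row of coefficients so that row . q_hat equals pi^T Q* pi (one row per plane)."""
    return np.array([(1.0 if i == j else 2.0) * pi[i] * pi[j] for i, j in IDX])

def qhat_to_matrix(q_hat):
    """Rebuild the symmetric 4x4 dual quadric from its 10-vector."""
    Q = np.zeros((4, 4))
    for val, (i, j) in zip(q_hat, IDX):
        Q[i, j] = Q[j, i] = val
    return Q

def initialise_dual_quadric(planes):
    """Least-squares fit of q_hat from the linear system A q_hat = 0 via SVD."""
    A = np.vstack([plane_row(pi) for pi in planes])
    _, _, Vt = np.linalg.svd(A)
    return qhat_to_matrix(Vt[-1])   # right singular vector of the smallest singular value

# Sanity check with a hypothetical sphere of radius 2 at the origin.
rng = np.random.default_rng(1)
normals = rng.normal(size=(12, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
planes = [np.append(n, -2.0) for n in normals]      # tangent planes: d = r = 2
Q_est = initialise_dual_quadric(planes)
Q_est /= -Q_est[3, 3]   # fix the arbitrary scale and sign so Q*[3,3] = -1
```

After normalization the recovered matrix matches diag(r², r², r², −1) = diag(4, 4, 4, −1), the dual quadric of the sphere, up to numerical noise.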

The solution of the SVD represents a generic quadric surface and is not constrained to an ellipsoid; we therefore parametrize each landmark as defined in Section III-B by extracting the quadric’s rotation, translation and shape.

As in [13], we extract the shape of a quadric considering:

s_i = √(1/λ_i),  i = 1, 2, 3     (20)

where λ₁, λ₂, and λ₃ are the eigenvalues of Q_{3×3}, the upper left 3×3 submatrix of the scale-normalized primal quadric Q. The rotation matrix R is equal to the matrix of eigenvectors of Q_{3×3}. Finally, the translation of a dual quadric is defined by the last column of Q* as a homogeneous 4-vector t̃ such that t = (t̃₁, t̃₂, t̃₃)^T / t̃₄. We can then reconstruct the constrained equivalent of the estimated quadric as in Section III-B.

Hence, we initialize all landmarks by calculating the SVD solution of (18) over the complete set of detections for each landmark, and constrain the estimated quadrics to be ellipsoids.
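The constraining step can be sketched as follows. This NumPy sketch works directly on the dual form rather than via the primal quadric described in the text; the two routes are equivalent because for Q* = Z diag(s₁², s₂², s₃², −1) Z^T the upper-left 3×3 block equals R diag(s²) R^T − t t^T, so adding back t t^T exposes the shape and rotation:

```python
import numpy as np

def constrain_dual_quadric(Q):
    """Extract translation t, rotation R, and shape s from a generic dual
    quadric Q* (defined only up to scale), assuming it is close to an ellipsoid."""
    Q = Q / -Q[3, 3]                 # normalise the scale so that Q*[3,3] = -1
    Q = 0.5 * (Q + Q.T)              # enforce exact symmetry
    t = -Q[:3, 3]                    # last column of Q* is (-t, -1)^T
    M = Q[:3, :3] + np.outer(t, t)   # = R diag(s1^2, s2^2, s3^2) R^T
    lam, R = np.linalg.eigh(M)       # eigenvalues ascending, eigenvectors as columns
    s = np.sqrt(np.clip(lam, 1e-12, None))  # clip guards against slight non-ellipsoids
    if np.linalg.det(R) < 0:
        R[:, 0] *= -1.0              # keep R a proper rotation (det = +1)
    return t, R, s
```

The clipping of tiny or negative eigenvalues is the practical guard for SVD solutions that are not quite ellipsoids; the returned (t, R, s) can then be packed into the 9-vector of Section III-B.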

VI Experiments and Evaluation

We evaluate the use of quadric landmarks in a simulation environment in order to compare the estimated landmark parameters with ground truth 3D object information.

VI-A Evaluation Environment

We created a synthetic dataset using the UnrealCV plugin [31] to retrieve the ground truth camera trajectory X, 2D object observations B, and 3D object information over a number of modeled environments. Trajectories were recorded over 10 scenes, resulting in 50 trajectories. These 50 ground truth trajectories were each injected with noise generated from 5 different seeds, for a total of 250 trials.

The camera 6-dof pose was recorded with an average translation and rotation of 0.34 meters and 6.41 degrees respectively. The length of trajectories varies between runs with an average trajectory length of 38.1 meters, and a maximum of 109.2 meters. The relative motion between trajectory positions is corrupted by zero mean Gaussian noise in order to induce an error of roughly 5% for translation and 15% for rotation. This trajectory noise is similar to the noise found when using a real IMU system, where perturbations in the odometry measurements cause the global trajectory to diverge from the ground truth (see Figure 6).

2D observations were recorded at every ground truth camera position. The camera was simulated with a focal length of 320.0 pixels. The captured images were used to extract ground truth bounding boxes of the same form as those generated by ConvNet-based object detectors such as [7, 8]. These boxes are corrupted with zero-mean Gaussian noise to show the applicability of this method to practical robotics. In our experiments, we inject noise into the 2D detections with a variance of 4 pixels.

Ground truth object information is extracted in the form of 3D axis aligned bounding boxes for each object in the scene. The landmarks for each trial included all objects within the scene that could be detected by conventional object detectors.

Fig. 5: Odometry solution (red) versus estimated trajectory (blue) for a number of different bounding box noise estimates when no additional noise is introduced.

VI-B Experiment Description

We implemented the SLAM problem (17), coined QuadricSLAM, as a factor graph in which the robot poses X and dual quadrics Q populate the latent variables of the graph, connected by odometry factors and 2D bounding box factors. The noise models of these factors were initialized as described in Section VI-A.

We record and compare the initial trajectory estimate from odometry measurements and initial quadric estimates from the SVD solution with the SLAM solution in order to show an improvement in trajectory and landmark quality.

VI-C Evaluation Metrics

Trajectory Quality

To evaluate the quality of the estimated robot trajectory, we calculate the root mean squared error of the deviation of every estimated robot position from its ground truth. As is standard practice, we analyze the translational component of the trajectory, as rotational errors are expected to compound and induce translational errors. Trajectory error is then defined as ATE_trans = √((1/N) Σ_i ‖t̂_i − t_i‖²), where t̂_i is the estimated robot position and t_i is the respective ground truth position.

Landmark Position

We assess the quality of landmark positions by comparing the estimated quadric centroid with the ground truth centroid. Landmark position error is defined as Landmark_trans = ‖ĉ_j − c_j‖, where ĉ_j is the centroid of the estimated quadric and c_j is the ground truth centroid.

Landmark Shape

The correctness of a landmark’s shape can be evaluated by calculating the error between the ground truth 3D axis-aligned bounding box and the axis-aligned maximum and minimum bounds of the estimated quadric. Here we use the Jaccard distance, which is equivalent to subtracting the Intersection over Union (IoU) from 1. In order to remove the impact of translational errors from this metric, we first align the centroids of both boxes. Hence Landmark_shape = 1 − IoU(B̂₀, B₀), where B̂₀ is the estimated 3D rectangle centered at the origin, and B₀ is the ground truth 3D rectangle, also centered at the origin.

Landmark Quality

Overall landmark quality is evaluated using the standard Jaccard distance to measure the dissimilarity between the ground truth 3D bounding box and the bounds of the estimated quadric. This metric is affected by the position, shape and orientation of the landmark, where a score of 0.0 implies a perfect match and 1.0 signals a complete lack of overlap. We evaluate landmark quality as Landmark_quality = 1 − IoU(B̂_j, B_j), where B̂_j is the bounding box of the estimated quadric and B_j is the ground truth 3D bounding box.
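The two Jaccard-based metrics above can be sketched in plain Python; boxes are represented here as hypothetical (min corner, max corner) tuples, and the shape metric differs from the quality metric only in the centroid alignment applied first:

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min corner, max corner)."""
    (amin, amax), (bmin, bmax) = box_a, box_b
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(amin, amax, bmin, bmax):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0                      # disjoint along this axis
        inter *= overlap
    vol = lambda lo, hi: (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])
    union = vol(amin, amax) + vol(bmin, bmax) - inter
    return inter / union

def centered(box):
    """Translate a box so its centroid sits at the origin (for the shape metric)."""
    lo, hi = box
    c = [(l + h) / 2.0 for l, h in zip(lo, hi)]
    return ([l - ci for l, ci in zip(lo, c)], [h - ci for h, ci in zip(hi, c)])

def landmark_quality(est_box, gt_box):
    """Jaccard distance of the raw boxes: position, shape and orientation all count."""
    return 1.0 - iou_3d(est_box, gt_box)

def landmark_shape_error(est_box, gt_box):
    """Jaccard distance after centroid alignment: translation errors are removed."""
    return 1.0 - iou_3d(centered(est_box), centered(gt_box))
```

For example, two identical 2×2×2 boxes shifted by one unit along x give a shape error of 0 (same shape) but a quality error of 2/3 (IoU of 1/3).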

            ATE_trans (cm)          Landmark_trans (cm)     Landmark_shape (Jaccard)  Landmark_quality (Jaccard)
            initial   QuadricSLAM   initial   QuadricSLAM   initial   QuadricSLAM     initial   QuadricSLAM
Scene1      51.00     23.49         60.10     42.48         0.67      0.64            0.88      0.82
Scene2      40.01     14.36         44.10     19.47         0.43      0.43            0.78      0.58
Scene3      29.13     15.57         55.88     14.75         0.68      0.48            0.85      0.66
Scene4      46.50     15.32         54.46     19.25         0.65      0.47            0.88      0.70
Scene5      64.47      9.61         59.39      6.98         0.66      0.34            0.91      0.48
Scene6      73.50     13.05         63.02     12.04         0.69      0.39            0.92      0.53
Scene7      71.94     27.80         57.43     12.70         0.53      0.31            0.79      0.47
Scene8      59.28     31.43         52.97     10.83         0.49      0.37            0.75      0.43
Scene9      64.14     28.19         55.11     19.00         0.55      0.46            0.82      0.58
Scene10     89.51     26.05         76.16     13.91         0.73      0.55            0.90      0.65
Overall     58.95     20.49         57.86     17.14         0.61      0.44            0.85      0.59
TABLE I: Comparison of the average errors for trajectory and landmark quality between initial and final solution.
Fig. 6: Example trajectories from a top-down perspective, comparing ground truth trajectory (green) with the initial odometry (red) and the trajectory estimated by our semantic SLAM system (blue). The plots also show the initialised (red), estimated (blue) and true (green) landmark centroids as dots, connected by red and blue lines to emphasize individual objects.

VI-D Results and Discussion

We summarize the results of our experiments in Table I and provide qualitative examples illustrating the improvement in camera trajectory and the accuracy of the estimated quadric surfaces in Figures 6 and 7 respectively.

The results show that quadric landmarks significantly improve the quality of the robot trajectory and the estimated map, providing accurate high-level information about the shape and position of objects within the environment. Explicitly, the geometric error yields a 65.2% improvement in trajectory error, and improvements of 70.4%, 26.7% and 30.6% in landmark position, shape and quality respectively. The correcting effect of the quadric landmarks on the estimated trajectory is a result of re-observing the landmarks between frames, helping to mitigate accumulated odometry errors.

The remaining discrepancies between estimated landmark parameters and ground truth objects are expected to be caused by a combination of occlusion, a form of noise inherent in the environment that encourages the shrinking of landmark surfaces, and limited viewing angles, which result in the overestimation of landmark shapes.

We have also identified a reduction in performance caused by conservative estimates of the bounding box noise model. This is a direct result of the additional noise introduced by object occlusions. Underestimating the detection noise can cause the optimization to fit to these noisy detections, negatively impacting both trajectory and landmark quality. However, as shown in Figure 5, overestimating this parameter has no adverse effects on the quality of the estimation.

We also evaluated the performance of the standard algebraic error function utilized in previous work [14, 13, 15] and found that the estimated solution rarely improves on the initial map and trajectory estimate. The algebraic error improves camera trajectory and landmark quality by 0.6% and 1.5%, but actually negatively impacts the landmark position and shape by 2.6% and 2.0% respectively. This is caused by partial object visibility, exacerbated by the presence of large objects in the majority of scenes.

VII Conclusions and Future Work

QuadricSLAM is a step towards integrating state-of-the-art object detection and SLAM systems in order to expand the range of applications for which we can deploy our robotic systems. The introduction of object based landmarks is essential to the development of semantically meaningful object-oriented robotic maps.

Our paper has demonstrated how to use dual quadrics as landmark representations in SLAM with perspective cameras. We provide a method of parametrizing dual quadrics as closed surfaces and show how they can be directly constrained by bounding boxes as produced by typical object detection systems.

We develop a factor graph-based SLAM formulation that jointly estimates camera trajectory and object parameters in the presence of odometry noise, object detection noise, occlusion and partial object visibility. This has been achieved by devising a sensor model for object detectors and defining the geometric error of a quadric that is robust to partial object observations. We provide an extensive evaluation of trajectory and landmark quality, demonstrating the utility of object-based landmarks for SLAM.

The advantages of using dual quadrics as landmark parametrizations in SLAM will only increase when incorporating higher order geometric constraints into the SLAM formulation, such as prior knowledge on how landmarks of a certain semantic type can be placed in the environment with respect to other landmarks or general structure.

Future work will investigate how this sparse object SLAM formulation can be extended to model dynamic environments by estimating a motion model for each landmark. This could potentially generate time dependent maps with the ability to predict the future state of the map. We are also working towards an efficient on-line implementation of QuadricSLAM allowing for the evaluation of quadric landmarks when using a modern object detector on a real robot.

Fig. 7: Estimated landmark positions and shapes (illustrated by the red ellipses) for four scenes of the evaluation, seen from two viewpoints each. To create these figures we projected the estimated 3D quadrics into the images at each camera pose. The quality of these estimates is demonstrated by the alignment of the major and minor ellipse axes with the object boundaries.


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012.
  • [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [3] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2.   IEEE, 2004, pp. II–104.
  • [4] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
  • [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
  • [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European conference on computer vision.   Springer, 2016, pp. 21–37.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” arXiv preprint arXiv:1703.06870, 2017.
  • [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV).   Springer, 2014, pp. 740–755.
  • [12] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques.   ACM, 1996, pp. 303–312.
  • [13] C. Rubino, M. Crocco, and A. Del Bue, “3d object localisation from multi-view image detections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [14] M. Crocco, C. Rubino, and A. Del Bue, “Structure from motion with objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4141–4149.
  • [15] N. Sünderhauf and M. Milford, “Dual quadrics from object detection bounding boxes as landmark representations in SLAM,” arXiv preprint arXiv:1708.00965, 2017.
  • [16] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [17] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras,” arXiv preprint arXiv:1610.06475, 2016.
  • [18] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” Lecture Notes in Computer Science, pp. 834–849, 2014.
  • [19] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison, “ElasticFusion: Dense SLAM without a pose graph,” Proc. Robotics: Science and Systems, Rome, Italy, 2015.
  • [20] T. Lemaire and S. Lacroix, “Monocular-vision based SLAM using Line Segments,” in Robotics and Automation, 2007 IEEE International Conference on, April 2007, pp. 2791–2796.
  • [21] M. Kaess, “Simultaneous Localization and Mapping with infinite planes,” in IEEE Intl. Conf. on Robotics and Automation (ICRA).   IEEE, 2015.
  • [22] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.   IEEE, 2013, pp. 1352–1359.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • [24] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “SemanticFusion: Dense 3D semantic mapping with convolutional neural networks,” arXiv preprint arXiv:1609.05130, 2016.
  • [25] T. T. Pham, I. Reid, Y. Latif, and S. Gould, “Hierarchical higher-order regression forest fields: An application to 3d indoor scene labelling,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2246–2254.
  • [26] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed.   Cambridge University Press, 2004.
  • [27] N. Sünderhauf and P. Protzel, “Switchable Constraints for Robust Pose Graph SLAM,” in Proc. of IEEE International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, 2012.
  • [28] P. Agarwal, G. D. Tipaldi, L. Spinello, C. Stachniss, and W. Burgard, “Robust map optimization using dynamic covariance scaling,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2013.
  • [29] F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
  • [30] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, “Dropout Sampling for Robust Object Detection in Open-Set Conditions,” in Proc. of IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [31] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y. Wang, and A. Yuille, “UnrealCV: Virtual worlds for computer vision,” ACM Multimedia Open Source Software Competition, 2017.