QuadricSLAM: Constrained Dual Quadrics from Object Detections
as Landmarks in Semantic SLAM
Abstract
Research in Simultaneous Localization And Mapping (SLAM) is increasingly moving towards richer world representations involving objects and high-level features that enable a semantic model of the world for robots, potentially leading to a more meaningful set of robot-world interactions. Many of these advances are grounded in state-of-the-art computer vision techniques primarily developed in the context of image-based benchmark datasets, leaving several challenges to be addressed in adapting them for use in robotics. In this paper, we derive a SLAM formulation that uses dual quadrics as 3D landmark representations, exploiting their ability to compactly represent the size, position and orientation of an object, and show how 2D bounding boxes (such as those typically obtained from visual object detection systems) can directly constrain the quadric parameters via a novel geometric error formulation. We develop a sensor model for deep-learned object detectors that addresses the challenge of partial object detections often encountered in robotics applications, and demonstrate how to jointly estimate the camera pose and constrained dual quadric parameters in factor graph-based SLAM with a general perspective camera.
I Introduction
In recent years, impressive vision-based object detection performance improvements have resulted from the “rebirth” of Convolutional Neural Networks (ConvNets). Building on the groundbreaking work by Krizhevsky et al. [1] and earlier work [2, 3], several other groups (e.g. [4, 5, 6, 7, 8, 9]) have increased the quality of ConvNet-based methods for object detection. Recent approaches have even reached human performance on the standardized ImageNet ILSVRC benchmark [10] and continue to push the performance boundaries on other benchmarks such as COCO [11].
Despite these impressive developments, the Simultaneous Localization And Mapping (SLAM) community has not yet fully adopted the newly arisen opportunities to create semantically meaningful maps. SLAM maps typically represent geometric information, but do not carry immediate object-level semantic information. Semantically-enriched SLAM systems are appealing because they increase the richness with which a robot can understand the world around it, and consequently the range and sophistication of interactions that the robot may have with the world, a critical requirement for the eventual widespread deployment of robots at work and in homes.
Semantically meaningful maps should be object-oriented, with objects as the central entities of the map. Quadrics, i.e. 3D surfaces such as ellipsoids, are ideal landmark representations for object-oriented semantic maps. In contrast to more complex object representations such as truncated signed distance fields [12], quadrics have a very compact representation and can be manipulated efficiently within the framework of projective geometry. Quadrics also capture information about the size, position, and orientation of an object, and can serve as anchors for more detailed 3D reconstructions if necessary. They are also appealing from an integration perspective: as we are going to show, in their dual form, quadrics can be constructed directly from object detection bounding boxes and conveniently incorporated into a factor graph-based SLAM formulation.
In this paper we make the following contributions. We first show how to parametrize object landmarks in SLAM as constrained dual quadrics. We then demonstrate that visual object detection systems such as Faster R-CNN [7], SSD [8], or Mask R-CNN [9] can be used as sensors in SLAM, and that their observations – the bounding boxes around objects – can directly constrain dual quadric parameters via our novel geometric error formulation. To incorporate quadrics into SLAM, we derive a factor graph-based SLAM formulation that jointly estimates the dual quadric and robot pose parameters. Our large-scale evaluation using 250 indoor trajectories through a high-fidelity simulation environment shows how object detections and the dual quadric parametrization aid the SLAM solution.
Previous work either utilized dual quadrics as a parametrization for landmark mapping only [13], was limited to an orthographic camera [14], or used an algebraic error that proved to be invalid when landmarks are only partially visible [15]. In this new work we perform full SLAM, i.e. Simultaneous Localization And Mapping, with a general perspective camera and a more robust geometric error. Furthermore, previous work [13, 14] required ellipse fitting as a preprocessing step: here we show that dual quadrics can be estimated in SLAM directly from bounding boxes.
II Related Work
In the following section we discuss the use of semantically meaningful landmark representations in state-of-the-art mapping systems and detail existing literature that utilizes quadric surfaces as object representations.
II-A Maps and Landmark Representations in SLAM
Most current SLAM systems represent the environment as a collection of distinct geometric points that are used as landmarks. ORB-SLAM [16, 17] is one of the most prominent recent examples of such a point-based visual SLAM system. Even direct visual SLAM approaches [18, 19] produce point cloud maps, albeit much denser than previous approaches. Other authors explored the utility of higher-order geometric features such as line segments [20] or planes [21].
A commonality of all those geometry-based SLAM systems is that their maps carry geometric but no immediate semantic information. An exception is the influential work by Salas-Moreno et al. [22]. This work proposed an object-oriented SLAM system that uses real-world objects such as chairs and tables as landmarks instead of geometric primitives. [22] detected these objects in RGB-D data by matching 3D models of known object classes. In contrast to [22], the approach presented in this paper does not require a priori known object CAD models, but instead uses general-purpose visual object detection systems, typically based on deep convolutional networks, such as [8, 23, 7].
SemanticFusion [24] recently demonstrated how a dense 3D reconstruction obtained by SLAM can be enriched with semantic information. This work, and other similar papers such as [25], add semantics to the map after it has been created. The maps are not object-centric, but rather dense point clouds, where every point carries a semantic label, or a distribution over labels. In contrast, our approach uses objects as landmarks inside the SLAM system, and the resulting map consists of objects encoded as quadrics.
II-B Dual Quadrics as Landmark Representations
The connection between object detections and dual quadrics was recently investigated by [14] and [13]. Crocco et al. [14] presented an approach for estimating dual quadric parameters from object detections in closed form. Their method however is limited to orthographic cameras, while our approach works with perspective cameras, and is therefore more general and applicable to robotics scenarios. Furthermore, [14] requires an ellipse-fitting step around each detected object. In contrast, our method can estimate camera pose and quadric parameters directly from the bounding boxes typically produced by object detection approaches such as [8, 23, 7].
As an extension of [14], Rubino et al. [13] described a closed-form approach to recover dual quadric parameters from object detections in multiple views. Their method can handle perspective cameras, but does not solve for camera pose parameters. It therefore performs only landmark mapping given known camera poses. In contrast, our approach performs full Simultaneous Localization And Mapping, i.e. it solves for camera pose, landmark pose, and shape parameters simultaneously. Similar to [14], [13] also requires fitting ellipses to bounding box detections first.
We explored initial ideas of using dual quadrics as landmarks in factor-graph SLAM in [15]. This unpublished preliminary work proposed an algebraic error formulation that proved to be not robust in situations where object landmarks are only partially visible. We overcome this problem with a novel geometric error formulation in this paper. In contrast to [15], we constrain the quadric landmarks to be ellipsoids, initialize them correctly, and present a large-scale evaluation in a high-fidelity simulation environment.
III Dual Quadrics – Fundamental Concepts
This section explains fundamental concepts around dual quadrics that are necessary to follow the remainder of the paper. For a more in-depth coverage we refer the reader to textbooks on projective geometry such as [26].
III-A Dual Quadrics
Quadrics are surfaces in 3D space that are defined by a symmetric $4 \times 4$ matrix $Q$, so that all points $x$ (in homogeneous coordinates) on the quadric fulfill $x^T Q x = 0$. Examples for quadrics are bodies such as spheres, ellipsoids, hyperboloids, cones, or cylinders.
A quadric has 9 degrees of freedom. These correspond to the ten independent elements of the symmetric matrix $Q$ less one for scale. We can represent a general quadric with a 10-vector $\hat{q}$, where each element corresponds to one of the 10 independent elements of $Q$.
While the above definition of a quadric concentrates on the points on the quadric’s surface, a quadric can also be defined by a set of tangential planes $\pi$, such that the planes form an envelope around the quadric. This dual quadric $Q^*$ is defined by $\pi^T Q^* \pi = 0$. Every quadric $Q$ has a corresponding dual form $Q^* = \operatorname{adj}(Q)$, or $Q^* = Q^{-1}$ if $Q$ is invertible.
When a quadric is projected onto an image plane, it creates a dual conic, following the simple rule $C^* = P Q^* P^T$. Here, $P = K [R \,|\, t]$ is the $3 \times 4$ camera projection matrix that contains intrinsic and extrinsic camera parameters. Conics are the 2D counterparts of quadrics and form shapes such as circles, ellipses, parabolas, or hyperbolas. Just like quadrics, they can be defined in a primal form via points ($x^T C x = 0$), or in dual form using tangent lines: $l^T C^* l = 0$.
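As a small illustration of the projection rule above, the following sketch (function name and the final normalisation step are our own, not from the paper) maps a dual quadric to a dual conic:

```python
import numpy as np

def project_dual_quadric(Q_star, K, R, t):
    """Project a 4x4 dual quadric to a 3x3 dual conic via C* = P Q* P^T,
    where P = K [R | t] is the 3x4 perspective projection matrix."""
    P = K @ np.hstack([R, t.reshape(3, 1)])
    C_star = P @ Q_star @ P.T
    return C_star / C_star[2, 2]  # fix the arbitrary projective scale

# A unit sphere at the origin seen from 5 units away projects to a circle:
# the dual conic of a circle of radius r is proportional to diag(-r^2, -r^2, 1).
Q_sphere = np.diag([1.0, 1.0, 1.0, -1.0])
C = project_dual_quadric(Q_sphere, np.eye(3), np.eye(3), np.array([0.0, 0.0, 5.0]))
```

For this camera the silhouette circle has squared radius $1/24$, consistent with the tangent-line construction of the dual conic.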
III-B Constrained Dual Quadric Parametrization
In its general form, a quadric or dual quadric can represent both closed surfaces such as spheres and ellipsoids and non-closed surfaces such as paraboloids or hyperboloids. As only the former are meaningful representations of object landmarks, we use a constrained dual quadric representation that ensures the represented surface is an ellipsoid or a sphere.
Similar to [13], we parametrize dual quadrics as:
$Q^* = Z \, \breve{Q}^* Z^T$ (1)

where $\breve{Q}^* = \operatorname{diag}(s_1^2, s_2^2, s_3^2, -1)$ is an ellipsoid centred at the origin, and $Z$ is a homogeneous transformation that accounts for an arbitrary rotation and translation. Specifically,

$Z = \begin{pmatrix} R(\theta) & t \\ 0_3^T & 1 \end{pmatrix}$ (2)

where $t = (t_1, t_2, t_3)^T$ is the quadric centroid translation, $R(\theta)$ is a rotation matrix defined by the angles $\theta = (\theta_1, \theta_2, \theta_3)$, and $s = (s_1, s_2, s_3)$ is the shape of the quadric along the three semi-axes of the ellipsoid. In the following, we compactly represent a constrained dual quadric with a 9-vector $q = (\theta_1, \theta_2, \theta_3, t_1, t_2, t_3, s_1, s_2, s_3)^T$ and reconstruct the full dual quadric $Q^*$ as defined in (1).
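To make the parametrization concrete, the following sketch (names are ours; a Z-Y-X Euler angle convention is assumed for $R(\theta)$, which the paper does not specify) rebuilds the full $4 \times 4$ dual quadric from the compact 9-vector:

```python
import numpy as np

def rotation_from_angles(theta):
    """Rotation matrix from Euler angles theta = (yaw, pitch, roll), ZYX order (assumed)."""
    a, b, c = theta
    Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(c), -np.sin(c)], [0, np.sin(c), np.cos(c)]])
    return Rz @ Ry @ Rx

def constrained_dual_quadric(q):
    """Q* = Z diag(s1^2, s2^2, s3^2, -1) Z^T from q = (theta, t, s), eqs. (1)-(2)."""
    theta, t, s = q[0:3], q[3:6], q[6:9]
    Z = np.eye(4)
    Z[:3, :3] = rotation_from_angles(theta)
    Z[:3, 3] = t
    return Z @ np.diag([s[0]**2, s[1]**2, s[2]**2, -1.0]) @ Z.T
```

For an axis-aligned ellipsoid at the origin ($\theta = t = 0$) this reduces to $\operatorname{diag}(s_1^2, s_2^2, s_3^2, -1)$, and the result is symmetric for any $q$.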
IV A Sensor Model for a Deep-Learned Object Detector
IV-A Motivation
Our goal is to incorporate state-of-the-art deep-learned object detectors such as [7, 8, 9] as a sensor into SLAM. We therefore have to formulate a sensor model that can predict the observations of the object detector given the estimated camera pose $x$ and the estimated map structure, i.e. the quadric parameters $q$. While such sensor models are often rather simple, e.g. when using point landmarks or laser scanners and occupancy grid maps, the sensor model for an object detector is more complex.
The observations of an object detector comprise an axis-aligned bounding box constrained to the image dimensions and a discrete label distribution for each detected object. In this paper we focus on the bounding box only, which can be represented as a set of four lines or a vector containing the pixel coordinates of its upper-left and lower-right corners. We therefore seek a formulation for the sensor model $\beta(x, q)$, mapping from camera pose $x$ and quadric $q$ to the predicted bounding box observation $\hat{b}$.
This sensor model allows us to formulate a geometric error term between the predicted and observed object detections, which is the crucial component of our overall SLAM system as explained in Section V.
IV-B Deriving the Object Detection Sensor Model
Our derivation of $\beta(x, q)$ starts with projecting the estimated quadric $Q^*$, parametrized by $q$, into the image using the camera pose $x$ according to $C^* = P Q^* P^T$, with $P = K [R \,|\, t]$ comprising the intrinsic ($K$) and pose parameters of the camera. Given the dual conic $C^*$, we obtain its primal form $C$ by taking the adjugate.
A naive sensor model would simply calculate the enclosing bounding box of the conic and truncate this box to fit the image. However, as illustrated in Figure (a), this can introduce significant errors when the conic’s extrema lie outside of the image boundaries.
An accurate sensor model requires knowledge of the intersection points between the conic and the image borders. The correct prediction of the object detector’s bounding box therefore is the minimal axis-aligned rectangle that envelopes all of the conic contained within the image dimensions. We will explain the correct method of calculating this conic bounding box, denoted BBox($C$), below. The overall sensor model is then defined as

$\beta(x, q) = \text{BBox}(C), \quad \text{with } C = \operatorname{adj}(P Q^* P^T)$ (3)
IV-C Calculating the On-Image Conic Bounding Box
We can calculate the correct on-image conic bounding box by the following algorithm, which we denote BBox($C$):

1. Find the four extrema points of the conic $C$, i.e. the points on the conic that maximise or minimise the $x$ or $y$ component respectively.

2. Find the up to 8 points where the conic intersects the image boundaries.

3. Remove all non-real points and all points outside the image boundaries from the set $P$ of candidate points collected in the first two steps.

4. Find and return the maximum and minimum $x$ and $y$ coordinate components among the remaining points.

We will explain each of these steps in detail in the following.
Calculating the conic’s extrema points
A conic can be represented both as a symmetric matrix $C$ and in Cartesian form by the following expression:

$c_1 x^2 + c_2 x y + c_3 y^2 + c_4 x + c_5 y + c_6 = 0$ (4)

with $C = \begin{pmatrix} c_1 & c_2/2 & c_4/2 \\ c_2/2 & c_3 & c_5/2 \\ c_4/2 & c_5/2 & c_6 \end{pmatrix}$.
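The correspondence between the matrix and Cartesian forms can be written down directly; the small helper below (our naming, used in the later steps) reads the six coefficients off the symmetric matrix:

```python
import numpy as np

def conic_coefficients(C):
    """Cartesian coefficients (c1..c6) of the conic x^T C x = 0, where the
    symmetric matrix stores the mixed terms at half weight:
    C = [[c1, c2/2, c4/2], [c2/2, c3, c5/2], [c4/2, c5/2, c6]]."""
    return np.array([C[0, 0], 2 * C[0, 1], C[1, 1], 2 * C[0, 2], 2 * C[1, 2], C[2, 2]])

# Example: the circle x^2 + y^2 - 4 = 0 has matrix diag(1, 1, -4).
coeffs = conic_coefficients(np.diag([1.0, 1.0, -4.0]))
```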
We obtain the 4 extrema points of the conic by finding the roots of the partial derivatives of (4) with respect to $x$ and $y$. These derivatives can be interpreted as two lines that intersect the conic at its extrema points, as depicted in Figure (b).
$\frac{\partial}{\partial x}: \; 2 c_1 x + c_2 y + c_4 = 0$ (5)

$\frac{\partial}{\partial y}: \; c_2 x + 2 c_3 y + c_5 = 0$ (6)

Rearranging equations (5) and (6) in terms of $x$ and $y$ respectively gives

$x = -\dfrac{c_2 y + c_4}{2 c_1}$ (7)

$y = -\dfrac{c_2 x + c_5}{2 c_3}$ (8)

and substituting these back into (4) yields the quadratics

$\left(c_3 - \dfrac{c_2^2}{4 c_1}\right) y^2 + \left(c_5 - \dfrac{c_2 c_4}{2 c_1}\right) y + c_6 - \dfrac{c_4^2}{4 c_1} = 0$ (9)

$\left(c_1 - \dfrac{c_2^2}{4 c_3}\right) x^2 + \left(c_4 - \dfrac{c_2 c_5}{2 c_3}\right) x + c_6 - \dfrac{c_5^2}{4 c_3} = 0$ (10)
The roots of these quadratics correspond to the $y$ and $x$ values of the extrema points respectively. In order to obtain the corresponding missing coordinate at each of these locations, we substitute the roots of (9) and (10) into equations (7) and (8) respectively. Solving these expressions leads us to the set of points that define the conic’s maximum and minimum bounds.
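The extrema computation can be sketched as follows, using NumPy's polynomial root finder on the quadratics (9) and (10) and back-substituting via (7) and (8) (the function name and the use of `np.roots` are our choices):

```python
import numpy as np

def conic_extrema(c):
    """Up to four extrema of c1 x^2 + c2 xy + c3 y^2 + c4 x + c5 y + c6 = 0.

    Roots may be complex for open or degenerate conics; the caller filters
    non-real solutions later (step 3 of BBox)."""
    c1, c2, c3, c4, c5, c6 = c
    # roots of (9): the y-coordinates of the vertical extrema
    ys = np.roots([c3 - c2**2 / (4 * c1), c5 - c2 * c4 / (2 * c1), c6 - c4**2 / (4 * c1)])
    # roots of (10): the x-coordinates of the horizontal extrema
    xs = np.roots([c1 - c2**2 / (4 * c3), c4 - c2 * c5 / (2 * c3), c6 - c5**2 / (4 * c3)])
    pts = [(-(c2 * y + c4) / (2 * c1), y) for y in ys]   # back-substitute (7)
    pts += [(x, -(c2 * x + c5) / (2 * c3)) for x in xs]  # back-substitute (8)
    return pts

# Ellipse x^2 + 4 y^2 = 4: extrema at (+-2, 0) and (0, +-1).
extrema = conic_extrema([1.0, 0.0, 4.0, 0.0, 0.0, -4.0])
```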
Calculating the intersections with the image boundaries
Factoring (4) in terms of $y$ or $x$ allows us to solve for the missing coordinate of a point along the conic, seen below:

$y = \dfrac{-(c_2 x + c_5) \pm \sqrt{(c_2 x + c_5)^2 - 4 c_3 (c_1 x^2 + c_4 x + c_6)}}{2 c_3}$ (11)

$x = \dfrac{-(c_2 y + c_4) \pm \sqrt{(c_2 y + c_4)^2 - 4 c_1 (c_3 y^2 + c_5 y + c_6)}}{2 c_1}$ (12)
We calculate the intersections of the conic and the image boundaries by substituting the $x$ and $y$ values that define the image borders into equations (11) and (12) respectively (i.e. $x = 0$, $x = W$ and $y = 0$, $y = H$ for an image of width $W$ and height $H$). Notice that in most circumstances some of the resulting solutions will be non-real, namely when the conic intersects the image boundaries in fewer than 8 points.
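In code, the border intersections follow directly from (11) and (12) by fixing one coordinate and solving the remaining quadratic (a sketch; the names are ours and the image origin is assumed at (0, 0)):

```python
import numpy as np

def conic_y_at_x(c, x):
    """Solve (4) as a quadratic in y for a fixed x, i.e. eq. (11)."""
    c1, c2, c3, c4, c5, c6 = c
    return np.roots([c3, c2 * x + c5, c1 * x**2 + c4 * x + c6])

def conic_x_at_y(c, y):
    """Solve (4) as a quadratic in x for a fixed y, i.e. eq. (12)."""
    c1, c2, c3, c4, c5, c6 = c
    return np.roots([c1, c2 * y + c4, c3 * y**2 + c5 * y + c6])

def boundary_intersections(c, width, height):
    """Up to 8 (possibly complex) intersections with the four image borders."""
    out = []
    for x in (0.0, float(width)):
        out += [(x, y) for y in conic_y_at_x(c, x)]
    for y in (0.0, float(height)):
        out += [(x, y) for x in conic_x_at_y(c, y)]
    return out

# Circle of radius 5 around the image origin in a 10x10 image: real
# crossings only on the x = 0 and y = 0 borders, complex roots elsewhere.
pts = boundary_intersections([1.0, 0.0, 1.0, 0.0, 0.0, -25.0], 10, 10)
```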
Final steps
We restrict the set $P$ such that it contains only points that are real, and create a new set $P'$ of the remaining points that lie within the image dimensions $W$, $H$:

$P' = \{(x, y) \in P : 0 \le x \le W, \; 0 \le y \le H\}$ (13)

Finally, we find the maximum and minimum $x$ and $y$ values from the points in $P'$ to define the on-screen bounding box of the conic. The function BBox($C$) therefore executes all of the above steps and returns a vector

$\hat{b} = (x_{\min}, y_{\min}, x_{\max}, y_{\max})^T$ (14)
that correctly describes a bounding box that envelopes the portion of the conic that would be visible in the image.
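The final filtering and min/max step can be sketched independently of how the candidate set was produced; the helper below (our naming) takes any iterable of possibly-complex candidate points and returns the bounding box vector of (14):

```python
def finalize_bbox(candidates, width, height, tol=1e-9):
    """Steps 3 and 4 of BBox(C): drop non-real and off-image candidates,
    then return (x_min, y_min, x_max, y_max) over the remaining points."""
    kept = []
    for x, y in candidates:
        x, y = complex(x), complex(y)
        if abs(x.imag) > tol or abs(y.imag) > tol:
            continue  # non-real root: the conic does not reach this line
        if 0.0 <= x.real <= width and 0.0 <= y.real <= height:
            kept.append((x.real, y.real))
    xs = [p[0] for p in kept]
    ys = [p[1] for p in kept]
    return (min(xs), min(ys), max(xs), max(ys))

# Candidates from a circle around the image origin: only the on-image
# border crossings survive, giving the correctly truncated box.
box = finalize_bbox([(-5, 0), (5, 0), (0, -5), (0, 5), (3 + 2j, 1)], 10, 10)
```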
V SLAM with Dual Quadric Landmark Representations
V-A General Problem Setup
We will set up a SLAM problem where we have odometry measurements $u_i$ between two successive poses $x_i$ and $x_{i+1}$, so that $x_{i+1} = f(x_i, u_i) \oplus w_i$. Here $f$ is a usually nonlinear function that implements the motion model of the robot, and the $x_i$ are the unknown robot poses. The $w_i$ are zero-mean Gaussian error terms with covariances $\Sigma_i$. The source of these odometry measurements is not of concern for the following discussion, and various sources such as wheel odometers or visual odometry are possible.
We furthermore observe a set of object detections $B = \{b_{ij}\}$. We use this notation to indicate a bounding box around object $j$ being observed from pose $x_i$. Notice that we assume the problem of data association is solved, i.e. we can identify which physical object a detection originates from.¹

¹For a discussion of SLAM methods robust to data association errors see the relevant literature such as [27, 28]. The methods discussed for pose graph SLAM can be adopted to the landmark SLAM considered here.
V-B Building and Solving a Factor Graph Representation
The conditional probability distribution over all robot poses $X = \{x_i\}$ and landmarks $Q = \{q_j\}$, given the observations $U = \{u_i\}$ and $B = \{b_{ij}\}$, can be factored as

$P(X, Q \mid U, B) \propto \prod_i P(x_{i+1} \mid x_i, u_i) \prod_{i,j} P(x_i, q_j \mid b_{ij})$ (15)
This factored distribution can be conveniently modelled as a factor graph [29].
Given the sets of observations $U$ and $B$, we seek the optimal, i.e. maximum a posteriori (MAP), configuration of robot poses and dual quadrics $X^\star, Q^\star$ to solve the landmark SLAM problem represented by the factor graph. This MAP variable configuration is equal to the mode of the joint probability distribution $P(X, Q \mid U, B)$. In simpler words, the MAP solution is the point where that distribution has its maximum.
The odometry factors are typically assumed to be Gaussian, i.e. $P(x_{i+1} \mid x_i, u_i) = \mathcal{N}(f(x_i, u_i), \Sigma_i)$, where $f$ is the robot’s motion model. To integrate the landmark factors into a Gaussian factor graph, we apply Bayes rule:

$P(x_i, q_j \mid b_{ij}) = \dfrac{P(b_{ij} \mid x_i, q_j) \, P(x_i, q_j)}{P(b_{ij})}$ (16)
Since we are performing MAP estimation, we can ignore the denominator, which essentially serves as a normaliser. Furthermore, assuming a uniform prior $P(x_i, q_j)$, we see that for our purposes we can replace $P(x_i, q_j \mid b_{ij})$ by the likelihood term $P(b_{ij} \mid x_i, q_j)$. The latter can be modelled as a Gaussian $\mathcal{N}(\beta(x_i, q_j), \Lambda_{ij})$, where $\beta$ is the sensor model defined in Section IV, and $\Lambda_{ij}$ is the covariance matrix capturing the spatial uncertainty (in image space) of the observed object detections.
The optimal variable configuration can now be determined by maximizing the joint probability (15) from above, i.e. by minimizing its negative logarithm:

$X^\star, Q^\star = \operatorname*{argmin}_{X, Q} \; \sum_i \lVert f(x_i, u_i) \ominus x_{i+1} \rVert^2_{\Sigma_i} + \sum_{i,j} \lVert \beta(x_i, q_j) - b_{ij} \rVert^2_{\Lambda_{ij}}$ (17)

This is a nonlinear least squares problem, since we seek the minimum over a sum of squared terms. Here $\lVert \cdot \rVert^2_{\Sigma}$ denotes the squared Mahalanobis distance with covariance $\Sigma$. We use the $\ominus$ operator in the odometry factor to denote that the difference operation is carried out in SE(3) rather than in Euclidean space.
Nonlinear least-squares problems such as (17) can be solved iteratively using methods like Levenberg-Marquardt or Gauss-Newton. Solvers that exploit the sparse structure of the factorisation can solve typical problems with thousands of variables very efficiently.
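To make the structure of (17) explicit, here is a minimal sketch of the objective being minimised (all names are ours; poses are treated as plain vectors, so the SE(3) $\ominus$ is replaced by ordinary subtraction):

```python
import numpy as np

def mahalanobis_sq(r, cov):
    """Squared Mahalanobis distance r^T cov^{-1} r."""
    return float(r @ np.linalg.solve(cov, r))

def total_error(X, Q, odom, boxes, f, beta, Sigma, Lam):
    """Sum of squared odometry and bounding-box residuals as in (17).

    X: list of pose vectors, Q: list of quadric parameter vectors,
    odom: list of odometry measurements u_i, boxes: dict {(i, j): b_ij},
    f: motion model, beta: bounding-box sensor model."""
    err = 0.0
    for i, u in enumerate(odom):
        err += mahalanobis_sq(f(X[i], u) - X[i + 1], Sigma)
    for (i, j), b in boxes.items():
        err += mahalanobis_sq(beta(X[i], Q[j]) - b, Lam)
    return err

# 1D toy problem: one odometry step and one 'detection'. In a consistent
# configuration both residuals vanish, so the total error is zero.
f = lambda x, u: x + u
beta = lambda x, q: q - x
X = [np.array([0.0]), np.array([1.0])]
err = total_error(X, [np.array([2.0])], [np.array([1.0])],
                  {(0, 0): np.array([2.0])}, f, beta, np.eye(1), np.eye(1))
```

A real solver would linearise these residuals and iterate (Levenberg-Marquardt or Gauss-Newton); the sketch only evaluates the cost.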
V-C The Geometric Error Term
The error term $\beta(x_i, q_j) - b_{ij}$ that constitutes the quadric landmark factors in (17) is a geometric error, since $\beta(x_i, q_j)$ and $b_{ij}$ are vectors containing pixel coordinates. In contrast to the algebraic error proposed in previous work [14, 13, 15], we found our geometric error formulation is well-defined even when the observed object is only partially visible in the image (see Figure (b)). Such situations result in truncated bounding box observations that invalidate the algebraic error formulation and shrink the estimated quadric.
Using a geometrically meaningful error will also allow us to conveniently propagate the spatial uncertainty of the object detector (e.g. as proposed by [30]) into the semantic SLAM system via the covariance matrices $\Lambda_{ij}$ in future work.
V-D Variable Initialization
All variable parameters $x_i$ and $q_j$ must be initialized in order for the incremental solvers to work. While the robot poses $x_i$ can be initialized to an initial guess obtained from the raw odometry measurements $u_i$, initializing the dual quadric landmarks requires more consideration.
It is possible to initialize $\hat{q}$ with the least squares fit to its defining equation:

$\pi^T \hat{Q}^* \pi = 0$ (18)

where $\hat{q}$ is the 10-vector form of the general dual quadric $\hat{Q}^*$ defined in Section III-A, not to be confused with the constrained quadric vector $q$ presented in Section III-B.
We can form the homogeneous vectors $\pi$ defining the planes from the four lines $l$ of each landmark bounding box observation $b_{ij}$ by back-projecting them according to $\pi = P^T l$. Here the camera matrix $P$ is formed using the initial camera pose estimates obtained from the odometry measurements. Exploiting the fact that $\hat{Q}^*$ is symmetric, we can rewrite (18) for a specific $\pi = (\pi_1, \pi_2, \pi_3, \pi_4)^T$ as:

$\left(\pi_1^2, \; 2\pi_1\pi_2, \; 2\pi_1\pi_3, \; 2\pi_1\pi_4, \; \pi_2^2, \; 2\pi_2\pi_3, \; 2\pi_2\pi_4, \; \pi_3^2, \; 2\pi_3\pi_4, \; \pi_4^2\right) \hat{q} = 0$ (19)
By collecting all these equations that originate from multiple views and planes, we obtain a linear system of the form $A \hat{q} = 0$, with $A$ containing the coefficient rows (19) of all planes associated with observations of landmark $j$. A least squares solution that minimizes $\lVert A \hat{q} \rVert$ subject to $\lVert \hat{q} \rVert = 1$ can be obtained as the last column of $V$, where $A = U S V^T$ is the Singular Value Decomposition (SVD) of $A$.
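This initialization can be sketched directly from (19); names and the row ordering of the 10-vector (upper triangle of $\hat{Q}^*$, row-major) are our choices. Instead of back-projected bounding box lines, the example feeds in synthetic tangent planes of a unit sphere:

```python
import numpy as np

def plane_row(pi):
    """Row of A for one tangent plane pi = (p1, p2, p3, p4), expanding
    pi^T Q* pi = 0 over the 10 upper-triangular elements of Q*.
    Off-diagonal elements appear twice in the quadratic form, hence the 2s."""
    p1, p2, p3, p4 = pi
    return np.array([p1*p1, 2*p1*p2, 2*p1*p3, 2*p1*p4,
                     p2*p2, 2*p2*p3, 2*p2*p4, p3*p3, 2*p3*p4, p4*p4])

def initialize_dual_quadric(planes):
    """Least-squares dual quadric from tangent planes: the minimiser of
    ||A q|| subject to ||q|| = 1 is the last right singular vector of A."""
    A = np.vstack([plane_row(pi) for pi in planes])
    q = np.linalg.svd(A)[2][-1]
    iu, ju = np.triu_indices(4)
    Q = np.zeros((4, 4))
    Q[iu, ju] = q
    return Q + np.triu(Q, 1).T  # symmetric 4x4 dual quadric

# Planes n . x + 1 = 0 with unit normals n are tangent to the unit sphere,
# whose dual quadric is diag(1, 1, 1, -1) up to scale.
normals = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, 0, 0), (0, -1, 0), (0, 0, -1),
           (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, -1, 0), (1, 1, 1), (-1, 1, 1)]
planes = [np.append(np.asarray(n) / np.linalg.norm(n), 1.0) for n in normals]
Q_init = initialize_dual_quadric(planes)
```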
The solution of the SVD represents a generic quadric surface and is not constrained to an ellipsoid; we therefore parametrize each landmark as defined in Section III-B by extracting the quadric’s rotation, translation, and shape.
As in [13], we extract the shape of a quadric considering:

$s_i = \sqrt{\dfrac{-\det(Q)}{\det(Q_{33}) \, \lambda_i}}, \quad i = 1, 2, 3$ (20)

where $Q_{33}$ is the upper left $3 \times 3$ submatrix of the primal quadric $Q$, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the eigenvalues of $Q_{33}$. The rotation matrix $R$ is equal to the matrix of eigenvectors of $Q_{33}$. Finally, the translation of a dual quadric is defined by the last column of $\hat{Q}^*$ as a homogeneous 4-vector, such that $t = (\hat{Q}^*_{14}, \hat{Q}^*_{24}, \hat{Q}^*_{34}) / \hat{Q}^*_{44}$. We can then reconstruct the constrained equivalent of the estimated quadric as in Section III-B.
Hence, we initialize all landmarks by calculating the SVD solution of (18) over the complete set of detections for each landmark, and constrain the estimated quadrics to be ellipsoids.
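The decomposition into translation, rotation, and shape, i.e. the last-column rule for $t$ together with equation (20), can be sketched as follows (names are ours; the primal quadric is recovered via the inverse, which equals the adjugate up to scale):

```python
import numpy as np

def decompose_dual_quadric(Q_star):
    """Split a dual quadric into centroid t, rotation R and semi-axes s.

    t comes from the dehomogenised last column of Q*; R and s follow (20)
    from the eigendecomposition of the upper-left 3x3 block Q33 of the
    primal quadric Q."""
    t = Q_star[:3, 3] / Q_star[3, 3]
    Q = np.linalg.inv(Q_star)          # primal quadric, up to scale
    Q33 = Q[:3, :3]
    lam, R = np.linalg.eigh(Q33)       # eigenvalues ascending, eigenvectors in R
    s = np.sqrt(-np.linalg.det(Q) / (np.linalg.det(Q33) * lam))
    return t, R, s

# Round trip: an axis-aligned ellipsoid with semi-axes (1, 2, 3) centred at
# (1, 2, 3), built as Q* = Z diag(s^2, -1) Z^T as in Section III-B.
Z = np.eye(4)
Z[:3, 3] = [1.0, 2.0, 3.0]
Q_star = Z @ np.diag([1.0, 4.0, 9.0, -1.0]) @ Z.T
t, R, s = decompose_dual_quadric(Q_star)
```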
VI Experiments and Evaluation
We evaluate the use of quadric landmarks in a simulation environment in order to compare the estimated landmark parameters with ground truth 3D object information.
VI-A Evaluation Environment
We created a synthetic dataset using the UnrealCV plugin [31] to retrieve the ground truth camera trajectory, 2D object observations, and 3D object information over a number of modeled environments. Trajectories were recorded over 10 scenes, resulting in 50 trajectories. These 50 ground truth trajectories were each injected with noise generated from 5 seeds, for a total of 250 trials.
The camera 6-DoF pose was recorded with an average translation and rotation of 0.34 meters and 6.41 degrees respectively. The length of the trajectories varies between runs, with an average trajectory length of 38.1 meters and a maximum of 109.2 meters. The relative motion between trajectory positions is corrupted by zero-mean Gaussian noise in order to induce an error of roughly 5% in translation and 15% in rotation. This trajectory noise is similar to the noise found when using a real IMU system, where perturbations in the odometry measurements cause the global trajectory to diverge from the ground truth (see Figure 6).
2D observations were recorded at every ground truth camera position. The camera was simulated with a focal length of 320.0 and a fixed principal point. Images were captured at a fixed resolution and used to extract ground truth bounding boxes of the same form as those generated by ConvNet-based object detectors such as [7, 8]. These boxes are corrupted with zero-mean Gaussian noise to show the applicability of this method to practical robotics. In our experiments, we inject noise into the 2D detections with a variance of 4 pixels.
Ground truth object information is extracted in the form of 3D axis-aligned bounding boxes for each object in the scene. The landmarks for each trial include all objects within the scene that could be detected by conventional object detectors.
VI-B Experiment Description
We implemented the SLAM problem (17), coined QuadricSLAM, as a factor graph in which the robot poses $x_i$ and dual quadrics $q_j$ populate the latent variables of the graph, connected by odometry factors and 2D bounding box factors. The noise models of these factors were initialized as described in Section VI-A.
We record and compare the initial trajectory estimate from odometry measurements and the initial quadric estimates from the SVD solution with the SLAM solution, in order to show the improvement in trajectory and landmark quality.
VI-C Evaluation Metrics
Trajectory Quality
To evaluate the quality of the estimated robot trajectory, we calculate the root mean squared error of the deviation of every estimated robot position from its ground truth. As is standard practice, we analyze the translational component of the trajectory, as rotational errors are expected to compound and induce translational errors. Trajectory error is then defined as $\text{ATE}_{\text{trans}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \hat{t}_i - t_i \rVert^2}$, where $\hat{t}_i$ is the estimated robot position and $t_i$ is the respective ground truth position.
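This metric amounts to a few lines of code (a sketch, our naming):

```python
import numpy as np

def ate_trans(est_positions, gt_positions):
    """Root mean squared translational error between two aligned trajectories."""
    d = np.asarray(est_positions, float) - np.asarray(gt_positions, float)
    return float(np.sqrt(np.mean(np.sum(d * d, axis=1))))

# Two poses, one metre off at the second: RMSE = sqrt((0 + 1) / 2).
err = ate_trans([[0, 0, 0], [1, 0, 0]], [[0, 0, 0], [0, 0, 0]])
```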
Landmark Position
We assess the quality of landmark positions by comparing the estimated quadric centroid with the ground truth centroid. Landmark position error is $\text{Landmark}_{\text{trans}} = \lVert \hat{t}_q - t_q \rVert$, where $\hat{t}_q$ is the centroid of the estimated quadric and $t_q$ is the ground truth centroid.
Landmark Shape
The correctness of a landmark’s shape can be evaluated by calculating the error between the ground truth 3D axis-aligned bounding box and the axis-aligned maximum and minimum bounds of the estimated quadric. Here we use the Jaccard distance, which is equivalent to subtracting the Intersection over Union (IoU) from 1. In order to remove the impact of translational errors from this metric, we first align the centroids of both boxes. Hence $\text{Landmark}_{\text{shape}} = 1 - \text{IoU}(\hat{B}_0, B_0)$, where $\hat{B}_0$ is the estimated 3D rectangle centered at the origin, and $B_0$ is the ground truth 3D rectangle, also centered at the origin.
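A sketch of the Jaccard distance between two axis-aligned 3D boxes (the corner representation and the names are our choices):

```python
import numpy as np

def jaccard_distance_3d(box_a, box_b):
    """1 - IoU for 3D axis-aligned boxes given as (min_corner, max_corner)."""
    amin, amax = (np.asarray(c, float) for c in box_a)
    bmin, bmax = (np.asarray(c, float) for c in box_b)
    # per-axis overlap, clamped at zero for disjoint boxes
    overlap = np.clip(np.minimum(amax, bmax) - np.maximum(amin, bmin), 0.0, None)
    inter = np.prod(overlap)
    union = np.prod(amax - amin) + np.prod(bmax - bmin) - inter
    return float(1.0 - inter / union)

# Two 2x2x2 boxes shifted by one unit along x: intersection 4, union 12.
d = jaccard_distance_3d(([0, 0, 0], [2, 2, 2]), ([1, 0, 0], [3, 2, 2]))
```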
Landmark Quality
Overall landmark quality is evaluated using the standard Jaccard distance to measure the dissimilarity between the ground truth 3D bounding box and the bounds of the estimated quadric. This metric is affected by the position, shape, and orientation of the landmark, where a score of 0.0 implies a perfect match and 1.0 signals a complete lack of overlap. We evaluate landmark quality as $\text{Landmark}_{\text{quality}} = 1 - \text{IoU}(\hat{B}, B)$, where $\hat{B}$ and $B$ are the estimated and ground truth 3D bounding boxes in world coordinates.
TABLE I: Trajectory and landmark errors per scene, comparing the initial estimates (odometry and SVD initialization) with the QuadricSLAM solution.

           ATE_trans (cm)         Landmark_trans (cm)    Landmark_shape (% error)   Landmark_quality (% error)
           initial   QuadricSLAM  initial   QuadricSLAM  initial   QuadricSLAM      initial   QuadricSLAM
Scene1     51.00     23.49        60.10     42.48        0.67      0.64             0.88      0.82
Scene2     40.01     14.36        44.10     19.47        0.43      0.43             0.78      0.58
Scene3     29.13     15.57        55.88     14.75        0.68      0.48             0.85      0.66
Scene4     46.50     15.32        54.46     19.25        0.65      0.47             0.88      0.70
Scene5     64.47     9.61         59.39     6.98         0.66      0.34             0.91      0.48
Scene6     73.50     13.05        63.02     12.04        0.69      0.39             0.92      0.53
Scene7     71.94     27.80        57.43     12.70        0.53      0.31             0.79      0.47
Scene8     59.28     31.43        52.97     10.83        0.49      0.37             0.75      0.43
Scene9     64.14     28.19        55.11     19.00        0.55      0.46             0.82      0.58
Scene10    89.51     26.05        76.16     13.91        0.73      0.55             0.90      0.65
Overall    58.95     20.49        57.86     17.14        0.61      0.44             0.85      0.59
VI-D Results and Discussion
We summarize the results of our experiments in Table I and provide qualitative examples illustrating the improvement in camera trajectory and the accuracy of the estimated quadric surfaces in Figures 6 and 7 respectively.
The results show that quadric landmarks significantly improve the quality of the robot trajectory and the estimated map, providing accurate high-level information about the shape and position of objects within the environment. Specifically, the geometric error achieves a 65.2% improvement in trajectory error, and a 70.4%, 26.7%, and 30.6% improvement in landmark position, shape, and quality respectively. The correcting effect of the quadric landmarks on the estimated trajectory is a result of re-observing the landmarks between frames, helping to mitigate accumulated odometry errors.
The remaining discrepancies between estimated landmark parameters and ground truth objects are expected to be caused by a combination of occlusion, a form of noise inherent in the environment which encourages the shrinking of landmark surfaces, and limited viewing angles, which result in the overestimation of landmark shapes.
We have also identified a reduction in performance caused by underestimating the bounding box noise model. This is a direct result of the additional noise introduced by object occlusions: underestimating the detection noise can cause the optimization to fit to these noisy detections, negatively impacting both trajectory and landmark quality. However, as shown in Figure 5, overestimating this parameter has no adverse effects on the quality of the estimation.
We also evaluated the performance of the standard algebraic error function utilized in previous work [14, 13, 15] and found that the estimated solution rarely improves on the initial map and trajectory estimate. The algebraic error improves camera trajectory and landmark quality by only 0.6% and 1.5%, and actually negatively impacts landmark position and shape by 2.6% and 2.0% respectively. This is caused by partial object visibility, exaggerated by the presence of large objects in the majority of scenes.
VII Conclusions and Future Work
QuadricSLAM is a step towards integrating state-of-the-art object detection and SLAM systems in order to expand the range of applications for which we can deploy our robotic systems. The introduction of object-based landmarks is essential to the development of semantically meaningful, object-oriented robotic maps.
Our paper has demonstrated how to use dual quadrics as landmark representations in SLAM with perspective cameras. We provide a method for parametrizing dual quadrics as closed surfaces and show how they can be directly constrained by the bounding boxes originating from typical object detection systems.
We develop a factor graph-based SLAM formulation that jointly estimates camera trajectory and object parameters in the presence of odometry noise, object detection noise, occlusion, and partial object visibility. This has been achieved by devising a sensor model for object detectors and defining a geometric error for quadrics that is robust to partial object observations. We provide an extensive evaluation of trajectory and landmark quality, demonstrating the utility of object-based landmarks for SLAM.
The advantages of using dual quadrics as landmark parametrizations in SLAM will only increase when incorporating higher order geometric constraints into the SLAM formulation, such as prior knowledge on how landmarks of a certain semantic type can be placed in the environment with respect to other landmarks or general structure.
Future work will investigate how this sparse object SLAM formulation can be extended to model dynamic environments by estimating a motion model for each landmark. This could potentially generate time-dependent maps with the ability to predict the future state of the map. We are also working towards an efficient online implementation of QuadricSLAM, allowing for the evaluation of quadric landmarks when using a modern object detector on a real robot.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012.
 [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [3] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II–104.
 [4] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.
 [5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
 [7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91–99.
 [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 21–37.
 [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” arXiv preprint arXiv:1703.06870, 2017.
 [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [11] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
 [12] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques. ACM, 1996, pp. 303–312.
 [13] C. Rubino, M. Crocco, and A. Del Bue, “3D object localisation from multi-view image detections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [14] M. Crocco, C. Rubino, and A. Del Bue, “Structure from motion with objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4141–4149.
 [15] N. Sünderhauf and M. Milford, “Dual quadrics from object detection bounding boxes as landmark representations in SLAM,” arXiv preprint arXiv:1708.00965, 2017.
 [16] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [17] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras,” arXiv preprint arXiv:1610.06475, 2016.
 [18] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” Lecture Notes in Computer Science, pp. 834–849, 2014.
 [19] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison, “ElasticFusion: Dense SLAM without a pose graph,” Proc. Robotics: Science and Systems, Rome, Italy, 2015.
 [20] T. Lemaire and S. Lacroix, “Monocular-vision based SLAM using Line Segments,” in Robotics and Automation, 2007 IEEE International Conference on, April 2007, pp. 2791–2796.
 [21] M. Kaess, “Simultaneous Localization and Mapping with infinite planes,” in IEEE Intl. Conf. on Robotics and Automation (ICRA). IEEE, 2015.
 [22] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 1352–1359.
 [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
 [24] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “SemanticFusion: Dense 3D semantic mapping with convolutional neural networks,” arXiv preprint arXiv:1609.05130, 2016.
 [25] T. T. Pham, I. Reid, Y. Latif, and S. Gould, “Hierarchical higher-order regression forest fields: An application to 3D indoor scene labelling,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2246–2254.
 [26] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
 [27] N. Sünderhauf and P. Protzel, “Switchable Constraints for Robust Pose Graph SLAM,” in Proc. of IEEE International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, 2012.
 [28] P. Agarwal, G. D. Tipaldi, L. Spinello, C. Stachniss, and W. Burgard, “Robust map optimization using dynamic covariance scaling,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2013.
 [29] F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
 [30] D. Miller, L. Nicholson, F. Dayoub, and N. Sünderhauf, “Dropout Sampling for Robust Object Detection in Open-Set Conditions,” in Proc. of IEEE International Conference on Robotics and Automation (ICRA), 2018.
 [31] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y. Wang, and A. Yuille, “UnrealCV: Virtual worlds for computer vision,” ACM Multimedia Open Source Software Competition, 2017.