Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving
Abstract
We propose a stereo vision-based approach for tracking the camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box with end-to-end approaches, we propose to use easy-to-label 2D detection and discrete viewpoint classification together with a lightweight semantic inference method to obtain rough 3D object measurements. Based on object-aware-aided camera pose tracking, which is robust in dynamic environments, in combination with our novel dynamic object bundle adjustment (BA) approach to fuse temporal sparse feature correspondences and the semantic 3D measurement model, we obtain 3D object pose, velocity and anchored dynamic point cloud estimation with instance accuracy and temporal consistency. The performance of our proposed method is demonstrated in diverse scenarios. Both the ego-motion estimation and object localization are compared with state-of-the-art solutions.
Keywords:
Semantic SLAM, 3D Object Localization, Visual Odometry
1 Introduction
Localizing dynamic objects and estimating the camera ego-motion in 3D space are crucial tasks for autonomous driving. Currently, these objectives are explored separately by end-to-end 3D object detection methods [1, 2] and traditional visual SLAM approaches [3, 4]. However, it is hard to employ these approaches directly in autonomous driving scenarios. For 3D object detection, there are two main problems: 1. end-to-end 3D regression approaches need lots of training data and require a heavy workload to precisely label all the object boxes in 3D space, and 2. instance 3D detection produces frame-independent results, which are not consistent enough for continuous perception in autonomous driving. To overcome this, we propose a lightweight semantic 3D box inference method depending only on 2D object detection and discrete viewpoint classification (Sect. 4). Compared with direct 3D regression, the 2D detection and classification tasks are easy to train, and the training data can be labeled easily with only 2D images. However, the proposed 3D box inference is also frame-independent and conditioned on the instance 2D detection accuracy.
From another perspective, well-known SLAM approaches can track the camera motion accurately due to precise feature geometry constraints. Inspired by this, we can similarly utilize sparse feature correspondences to constrain the object relative motion and enforce temporal consistency. However, the object instance pose cannot be obtained from pure feature measurements without a semantic prior. To this end, exploiting the complementary nature of semantic and feature information, we integrate our instance semantic inference model and the temporal feature correlation into a tightly-coupled optimization framework which can continuously track the 3D objects and recover the dynamic sparse point cloud with instance accuracy and temporal consistency, as overviewed in Fig. 1. Benefiting from the object-region-aware property, our system is able to estimate the camera pose robustly without being affected by dynamic objects. Thanks to the temporal geometry constraints, we can track the objects continuously even in the extremely truncated case (see Fig. 1), where the object pose is hard to infer instantaneously. Additionally, we employ a kinematics motion model for detected cars to ensure consistent orientation and motion estimation; it also serves as important smoothing for distant cars which have few feature observations. Depending only on a mature 2D detection and classification network [5], our system is capable of performing robust ego-motion estimation and 3D object tracking in diverse scenarios. The main contributions are summarized as follows:

A lightweight 3D box inference method using only 2D object detection and the proposed viewpoint classification, which provides the object reprojection contour and occlusion mask for object feature extraction. It also serves as the semantic measurement model for the follow-up optimization.

A novel dynamic object bundle adjustment approach which tightly couples the semantic and feature measurements to continuously track the object states with instance accuracy and temporal consistency.

Demonstration over diverse scenarios to show the practicability of the proposed system.
2 Related Work
We review the related works in the context of semantic SLAM and learningbased 3D object detection from images.
2.0.1 Semantic SLAM
With the development of 2D object detection, several works jointly performing SLAM and semantic understanding have sprung up, which we discuss in three categories. The first is semantic-aided localization: [6, 7] focus on correcting the global scale of monocular Visual Odometry (VO) by incorporating the object metric size of only one dimension into the estimation framework; indoor experiments with small objects and outdoor experiments are conducted respectively in these two works. [8] proposes an object data association method in a probabilistic formulation and shows its drift-correcting ability when re-observing previous objects. However, it omits the orientation of objects by treating the 2D bounding boxes as points. In [9], the authors address the localization task from only object observations in a prior semantic map by computing a matrix permanent. The second is SLAM-aided object detection [10, 11] and reconstruction [12, 13]: [10] develops a 2D object recognition system which is robust to viewpoint changes with the assistance of camera localization, while [11] performs confidence-growing 3D object detection using visual-inertial measurements. [12, 13] reconstruct the dense surface of 3D objects by fusing the point cloud from monocular and RGB-D SLAM respectively. Finally, the third category is joint estimation of both camera and object poses: with pre-built bags of binary words, [14] localizes the objects in the datasets and corrects the map scale in turn. In [15, 16], the authors propose a semantic structure from motion (SfM) approach to jointly estimate the camera and objects while considering the interaction of scene components. However, none of these methods shows the ability to handle dynamic objects, nor makes full use of the 2D bounding box data (center, width, and height) and the 3-dimensional object size. There are also some existing works [17, 18, 19, 20] which build a dense map and segment it with semantic labels. These works are beyond the scope of this paper, so we will not discuss them in detail.
2.0.2 3D Object Detection
Inferring object pose from images with deep learning approaches provides an alternative way to localize 3D objects. [21, 22] use shape priors to reason about 3D object pose, where dense shape and wireframe models are used respectively. In [23], a voxel pattern is employed to detect the 3D pose of objects with specific visibility patterns. Similarly to object proposal approaches in 2D detection [5], [1] generates 3D proposals by utilizing depth information calculated from stereo images, while [2] exploits the ground plane assumption and additional segmentation features to produce 3D candidates; R-CNN is then used for candidate scoring and object recognition. Such high-dimensional features used for proposal generation or model fitting are computationally complex for both training and deployment. Instead of directly generating 3D boxes, [24] regresses object orientation and dimensions in separate stages; the 2D-3D box geometry constraints are then used to calculate the 3D object pose. However, depending purely on the instance 2D box limits its performance in object-truncated cases.
In this work, we study the pros and cons of existing works and propose an integrated perception solution for autonomous driving that makes full use of the instance semantic prior and precise spatial-temporal feature correspondences to achieve robust and continuous state estimation for both the ego-camera and static or dynamic objects in diverse environments.
3 Overview
Our semantic tracking system has three main modules, as illustrated in Fig. 2. The first module performs 2D object detection and viewpoint classification (Sect. 4), where the object poses are roughly inferred based on the constraints between 2D box edges and 3D box vertices. The second module is feature extraction and matching (Sect. 5). It projects all the inferred 3D boxes onto the 2D image to get the object contours and occlusion masks. Guided feature matching is then applied to get robust feature associations for both stereo and temporal images. In the third module (Sect. 6), we integrate all the semantic information and feature measurements into a tightly-coupled optimization approach. A kinematics model is additionally applied to cars to get consistent motion estimation.
4 Viewpoint Classification and 3D Box Inference
Our semantic measurement includes the 2D object box and the classified viewpoint. Based on this, the object pose can be roughly inferred in closed form instantly.
4.1 Viewpoint Classification
2D object detection can be implemented by state-of-the-art object detectors such as Faster R-CNN [5], YOLO [25], etc. We use Faster R-CNN in our system since it performs well on small objects. Only the left images are used for object detection due to the real-time requirement. The network architecture is illustrated in Fig. 2 (a). Instead of the pure object classification in the original implementation of [5], we add sub-category classification in the final FC layers, which denotes the object's horizontal and vertical discrete viewpoints. As shown in Fig. 3, we divide the continuous object observing angle into eight horizontal and two vertical viewpoints. With the 16 total combinations of horizontal and vertical viewpoint classification, we can generate associations between edges in the 2D box and vertices in the 3D box, based on the assumption that the reprojection of the 3D bounding box tightly fits the 2D bounding box. These associations provide the essential conditions to build the four edge-vertex constraints for 3D box inference (Sect. 4.2) and to formulate our semantic measurement model (Sect. 6.2).
Compared with direct 3D regression, the well-developed 2D detection and classification networks are more robust over diverse scenarios. The proposed viewpoint classification task is easy to train and has high accuracy, even for small and extremely occluded objects.
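The eight-horizontal, two-vertical discretization above can be sketched as a simple binning function. The bin boundaries and ordering below are illustrative assumptions, not the paper's exact convention:

```python
import math

def classify_viewpoint(azimuth, elevation):
    """Map a continuous observing angle to one of 16 discrete viewpoints.

    azimuth:   horizontal observing angle of the object in the camera frame,
               in radians (any value; wrapped internally).
    elevation: vertical observing angle in radians (negative = looking down).
    Returns (horizontal_bin in 0..7, vertical_bin in 0..1).
    """
    two_pi = 2.0 * math.pi
    # Shift by half a bin so that bin centers sit on the 8 principal directions.
    shifted = (azimuth + math.pi / 8.0) % two_pi
    h_bin = int(shifted / (math.pi / 4.0)) % 8
    v_bin = 0 if elevation >= 0.0 else 1  # 0: level/up, 1: looking down
    return h_bin, v_bin
```

Each of the 16 (h_bin, v_bin) combinations would then index a fixed edge-vertex association table for the 3D box inference of Sect. 4.2.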
4.2 3D Box Inference Based on Viewpoint
Given the object 2D bounding box, described by its four edges $[u_{\min}, v_{\min}, u_{\max}, v_{\max}]$ in the normalized image plane, and the classified viewpoint, we aim to infer the object pose from four constraints between the 2D box edges and the 3D box vertices, which is inspired by [24]. A 3D bounding box can be represented by its center position $\mathbf{p} = [p_x, p_y, p_z]^\top$ and horizontal orientation $\theta$ with respect to the camera frame, together with the known dimension prior $\mathbf{d} = [d_x, d_y, d_z]^\top$. For example, for the viewpoint highlighted in red in Fig. 3 (one of the 16 combinations), four vertices are projected onto the four 2D edges, and the corresponding constraints can be formulated as:

$$u_{\min} = \pi\big(\mathbf{R}(\theta)\,\mathbf{C}_1\mathbf{d} + \mathbf{p}\big)_u, \quad v_{\min} = \pi\big(\mathbf{R}(\theta)\,\mathbf{C}_2\mathbf{d} + \mathbf{p}\big)_v,$$
$$u_{\max} = \pi\big(\mathbf{R}(\theta)\,\mathbf{C}_3\mathbf{d} + \mathbf{p}\big)_u, \quad v_{\max} = \pi\big(\mathbf{R}(\theta)\,\mathbf{C}_4\mathbf{d} + \mathbf{p}\big)_v, \qquad (1)$$

where $\pi$ is the 3D projection warp function defined as $\pi(\mathbf{x}) = [x/z,\ y/z]^\top$ for $\mathbf{x} = [x, y, z]^\top$, and $(\cdot)_u$, $(\cdot)_v$ represent the $u$ and $v$ coordinates in the normalized image plane. We use $\mathbf{R}(\theta)$ to denote the rotation parameterized by the horizontal orientation $\theta$ from the object frame to the camera frame. $\mathbf{C}_{1:4}$ are four diagonal selection matrixes which describe the relations between the object center and the four selected vertices; they can be determined without ambiguity once we obtain the classified viewpoint. From the object frame defined in Fig. 3, it is easy to see that each selection matrix takes the form

$$\mathbf{C}_i = \tfrac{1}{2}\,\mathrm{diag}(\pm 1, \pm 1, \pm 1), \quad i = 1, \dots, 4, \qquad (2)$$

where the signs are fixed by the classified viewpoint.
With these four equations, the 4-DoF object pose can be solved intuitively given the dimension prior. This solving process takes trivial time compared with [24], which exhaustively tests all valid edge-vertex configurations.
We convert the complex 3D object detection problem into 2D detection, viewpoint classification, and a straightforward closed-form calculation. Admittedly, the solved pose is an approximate estimate conditioned on the instance "tightness" of the 2D bounding box and the object dimension prior. Also, for some top-view cases, the reprojection of the 3D box does not strictly fit the 2D box, which can be observed from the top edge in Fig. 3. However, for the almost horizontal or slightly looking-down viewpoints in autonomous driving scenarios, this assumption holds reasonably well. Note that our instance pose inference is only used to generate the object projection contour and occlusion mask for feature extraction (Sect. 5) and serves as an initial value for the follow-up maximum-a-posteriori (MAP) estimation, where the 3D object trajectory is further optimized by sliding-window-based feature correlation and object point cloud alignment.
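As a numerical sketch of this inference step, the following recovers the 4-DoF pose from four synthetic edge measurements. It uses a few Gauss-Newton iterations with a numeric Jacobian instead of the paper's closed-form solution, and the selection matrices `C_SEL` encode an illustrative edge-vertex association for one hypothetical viewpoint:

```python
import numpy as np

def rot_y(theta):
    """Rotation about the vertical (y) axis of the camera frame."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Hypothetical edge-vertex association for ONE viewpoint: each selection
# matrix picks a 3D box vertex as +/- half of the dimensions.
C_SEL = [0.5 * np.diag(s) for s in
         ([-1.0, -1.0, 1.0], [1.0, -1.0, 1.0],
          [1.0, 1.0, -1.0], [-1.0, 1.0, -1.0])]
AXIS = [0, 1, 0, 1]  # edge order: u_min, v_min, u_max, v_max

def residuals(x, d, edges):
    """Difference between the 2D edge measurements and the normalized-plane
    projection of the associated 3D box vertices. x = [px, py, pz, theta]."""
    p, theta = x[:3], x[3]
    r = np.zeros(4)
    for i in range(4):
        v = rot_y(theta) @ (C_SEL[i] @ d) + p  # vertex in the camera frame
        r[i] = edges[i] - v[AXIS[i]] / v[2]
    return r

def solve_box(d, edges, x0, iters=30):
    """Recover the 4-DoF pose by Gauss-Newton with a numeric Jacobian
    (the paper solves the same four constraints in closed form)."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        r = residuals(x, d, edges)
        J = np.zeros((4, 4))
        for j in range(4):
            dx = np.zeros(4)
            dx[j] = 1e-6
            J[:, j] = (residuals(x + dx, d, edges) - r) / 1e-6
        x -= np.linalg.solve(J.T @ J, J.T @ r)
    return x
```

For example, generating the four edges from a ground-truth pose and re-solving from a perturbed initial guess recovers the pose, given the dimension prior `d`.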
5 Feature Extraction and Matching
We project the inferred 3D object boxes (Sect. 4.2) onto the stereo images to generate a valid 2D contour. As Fig. 2 (b) illustrates, we use masks of different colors to represent the visible part of each object (gray for the background). For occluded objects, we mask the occluded part as invisible according to the objects' 2D overlap and 3D depth relations. For truncated objects, which have fewer than four valid edge measurements and thus cannot be inferred by the method in Sect. 4.2, we directly project the 2D box detected in the left image onto the right image. We extract ORB features [26] from both the left and right images in the visible area of each object and the background.
Stereo matching is performed by epipolar line searching. The depth range of the object features is known from the inferred object pose, so we limit the search to a small range to achieve robust feature matching. For temporal matching, we first associate objects across successive frames by 2D box similarity score voting. The similarity score is weighted by the center distance and shape similarity of the 2D boxes between successive images after compensating for the camera rotation. An object is treated as lost if its maximum similarity score with all the objects in the previous frame is less than a threshold. We note that there are more sophisticated association schemes such as probabilistic data association [8], but they are more suitable for avoiding hard decisions when revisiting static objects than for the highly dynamic and non-repetitive scenes of autonomous driving. We subsequently match ORB features of the associated objects and background with the previous frame. Outliers are rejected by RANSAC with a local fundamental matrix test for each object and the background independently.
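A minimal sketch of the similarity-score voting for temporal association; the exponential weighting and the threshold value are illustrative assumptions, not the paper's tuned parameters:

```python
import math

def box_similarity(box_a, box_b):
    """Similarity between two 2D boxes (x_center, y_center, width, height),
    combining center distance and shape difference, normalized by box size."""
    dx = box_a[0] - box_b[0]
    dy = box_a[1] - box_b[1]
    center_dist = math.hypot(dx, dy)
    shape_diff = abs(box_a[2] - box_b[2]) + abs(box_a[3] - box_b[3])
    scale = box_a[2] + box_a[3]  # normalize by the reference box size
    return math.exp(-(center_dist + shape_diff) / scale)

def associate(prev_boxes, curr_box, threshold=0.5):
    """Vote for the previous-frame box with the highest similarity;
    return its index, or None if the best score is below threshold (lost)."""
    scores = [box_similarity(curr_box, b) for b in prev_boxes]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None
```

In the actual system, the boxes would first be motion-compensated for the camera rotation before scoring.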
6 Ego-motion and Object Tracking
Beginning with the notation, we use $z^t_{o_n} = \{\mathbf{b}^t_n, l^t_n, \mathbf{C}^t_{1:4}\}$ to denote the semantic measurement of the $n$-th object at time $t$, where $\mathbf{b}^t_n = [u_{\min}, v_{\min}, u_{\max}, v_{\max}]^\top$ collects the observations of the left-top and right-bottom coordinates of the 2D bounding box respectively, $l^t_n$ is the object class label, and $\mathbf{C}^t_{1:4}$ are the four selection matrixes defined in Sect. 4.2. For measurements of a sparse feature anchored to one object or the background, we use $z^t_{f_{mn}} = \{{}^l z^t_{f_{mn}}, {}^r z^t_{f_{mn}}\}$ to denote the stereo observations of the $m$-th feature on the $n$-th object at time $t$ ($n = 0$ for the static background), where ${}^l z$ and ${}^r z$ are the feature coordinates in the normalized left and right image planes respectively. The states of the ego-camera and the $n$-th object are represented as ${}^w\mathbf{x}^t_c = \{{}^w\mathbf{p}^t_c, {}^w\mathbf{R}^t_c\}$ and ${}^w\mathbf{x}^t_{o_n} = \{{}^w\mathbf{p}^t_{o_n}, \theta^t_n, \mathbf{d}_n, v^t_n, \delta^t_n\}$ respectively, where the superscripts $w$, $c$ and $o$ denote the world, camera and object frames. ${}^w\mathbf{p}$ represents a position in the world frame. For the object orientation, we only model the horizontal rotation $\theta^t_n$ instead of the full $SO(3)$ rotation used for the ego-camera. $\mathbf{d}_n$ is the time-invariant dimension of the $n$-th object, and $v^t_n$ and $\delta^t_n$ are the speed and steering angle, which are only estimated for cars. For conciseness, we visualize the measurements and states at time $t$ in Fig. 4.
Considering a general autonomous driving scene, we aim to continuously estimate the ego-motion of the onboard camera from time 0 to $T$,

$${}^w\mathcal{X}_c = \{{}^w\mathbf{x}^0_c, \dots, {}^w\mathbf{x}^T_c\}, \qquad (3)$$

track the states of the $N$ 3D objects,

$${}^w\mathcal{X}_o = \{{}^w\mathbf{x}^t_{o_n}\}, \quad n \in [1, N],\ t \in [0, T], \qquad (4)$$

and recover the 3D positions of the dynamic sparse features,

$${}^{o}\mathcal{F} = \{{}^{o_n}\mathbf{f}_m\}, \quad m \in [1, M], \qquad (5)$$

(note that here we use $o_n$ to denote the object frame, in which the features are relatively static; $o_0$ denotes the background world frame, in which the features are globally static), given the semantic measurements

$$\mathcal{S} = \{z^t_{o_n}\}, \qquad (6)$$

and the sparse feature observations anchored to the objects

$$\mathcal{Z} = \{z^t_{f_{mn}}\}. \qquad (7)$$

We formulate our semantic object and camera ego-motion tracking from a probabilistic model into a nonlinear optimization problem.
6.1 Egomotion Tracking
Given the static background feature observations, the ego-motion can be solved via maximum likelihood estimation (MLE):

$${}^w\mathcal{X}_c,\ {}^w\mathcal{F} = \arg\max_{{}^w\mathcal{X}_c,\, {}^w\mathcal{F}} \prod_{t=0}^{T} \prod_{m} p\big(z^t_{f_{m0}} \,\big|\, {}^w\mathbf{x}^t_c,\ {}^w\mathbf{f}_m\big), \qquad (8)$$

where each stereo observation is assumed to be Gaussian distributed,

$$p\big(z^t_{f_{m0}} \,\big|\, {}^w\mathbf{x}^t_c,\ {}^w\mathbf{f}_m\big) \propto \exp\Big(-\tfrac{1}{2}\,\big\|r_{\mathcal{Z}}\big(z^t_{f_{m0}},\ {}^w\mathbf{x}^t_c,\ {}^w\mathbf{f}_m\big)\big\|^2_{\Sigma_{\mathcal{Z}}}\Big), \qquad (9)$$

so the MLE is equivalent to

$${}^w\mathcal{X}_c,\ {}^w\mathcal{F} = \arg\min_{{}^w\mathcal{X}_c,\, {}^w\mathcal{F}} \sum_{t=0}^{T} \sum_{m} \big\|r_{\mathcal{Z}}\big(z^t_{f_{m0}},\ {}^w\mathbf{x}^t_c,\ {}^w\mathbf{f}_m\big)\big\|^2_{\Sigma_{\mathcal{Z}}}. \qquad (10)$$

This is the typical SLAM or SfM approach. The camera poses and background point cloud are estimated conditionally on the first state. As Eqs. 8-10 show, the negative log-probability of a measurement is proportional to the Mahalanobis norm $\|r\|^2_{\Sigma} = r^\top \Sigma^{-1} r$ of its residual. The MLE is thereby converted into a nonlinear least squares problem; this process is also known as bundle adjustment (BA).
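The equivalence between maximizing the Gaussian likelihood and minimizing the least-squares cost rests on the Mahalanobis norm, which can be computed as:

```python
import numpy as np

def mahalanobis_sq(r, Sigma):
    """Squared Mahalanobis norm ||r||^2_Sigma = r^T Sigma^-1 r.
    Minimizing the sum of these terms over all residuals is equivalent to
    maximizing the product of zero-mean Gaussian measurement likelihoods."""
    return float(r @ np.linalg.solve(Sigma, r))
```

A residual direction with large covariance is therefore down-weighted: the same error contributes less to the cost when the measurement is noisier.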
6.2 Semantic Object Tracking
After we solve the camera pose, the object state at each time can be solved based on the dimension prior and the instance semantic measurements (Sect. 4.2). We assume the object is a rigid body, which means the features anchored to it are fixed with respect to the object frame. Therefore, the temporal states of the object are correlated if we have continuous object feature observations. The states of different objects are conditionally independent given the camera poses, so we can track all the objects in parallel and independently. For the $n$-th object,
$${}^w\mathcal{X}_{o_n},\ {}^{o_n}\mathcal{F},\ \mathbf{d}_n = \arg\max \prod_{t} \Big[\prod_{m} p\big(z^t_{f_{mn}} \,\big|\, {}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ {}^{o_n}\mathbf{f}_m\big)\Big]\, p\big(z^t_{o_n} \,\big|\, {}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ \mathbf{d}_n\big). \qquad (11)$$

For semantic objects, we have the dimension prior distribution $p(\mathbf{d}_n \,|\, l^t_n)$ for each class label. We assume the 2D detection results for each object at each time are independent and Gaussian distributed. According to Bayes' rule, Eq. 11 can be converted to a maximum-a-posteriori (MAP) estimation:

$${}^w\mathcal{X}_{o_n},\ {}^{o_n}\mathcal{F},\ \mathbf{d}_n = \arg\max \prod_{t} \Big[\prod_{m} p\big(z^t_{f_{mn}} \,\big|\, \cdot\big)\Big]\, p\big(z^t_{o_n} \,\big|\, \cdot\big)\, p\big(\mathbf{d}_n \,\big|\, l^t_n\big)\, p\big({}^w\mathbf{x}^t_{o_n} \,\big|\, {}^w\mathbf{x}^{t-1}_{o_n}\big), \qquad (12)$$

where each factor is modeled as a Gaussian over the corresponding residual, e.g.,

$$p\big(\mathbf{d}_n \,\big|\, l^t_n\big) \propto \exp\Big(-\tfrac{1}{2}\,\big\|r_{\mathcal{P}}\big(\mathbf{d}_n,\ l^t_n\big)\big\|^2_{\Sigma_{\mathcal{P}}}\Big), \qquad p\big({}^w\mathbf{x}^t_{o_n} \,\big|\, {}^w\mathbf{x}^{t-1}_{o_n}\big) \propto \exp\Big(-\tfrac{1}{2}\,\big\|r_{\mathcal{M}}\big({}^w\mathbf{x}^{t-1}_{o_n},\ {}^w\mathbf{x}^t_{o_n}\big)\big\|^2_{\Sigma_{\mathcal{M}}}\Big), \qquad (13)$$

$$p\big(z^t_{o_n} \,\big|\, \cdot\big) \propto \exp\Big(-\tfrac{1}{2}\,\big\|r_{\mathcal{S}}\big(z^t_{o_n},\ {}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ \mathbf{d}_n\big)\big\|^2_{\Sigma_{\mathcal{S}}}\Big), \qquad (14)$$

so the MAP estimation becomes

$${}^w\mathcal{X}_{o_n},\ {}^{o_n}\mathcal{F},\ \mathbf{d}_n = \arg\min \sum_{t} \Big\{\sum_{m} \big\|r_{\mathcal{Z}}\big\|^2_{\Sigma_{\mathcal{Z}}} + \big\|r_{\mathcal{P}}\big\|^2_{\Sigma_{\mathcal{P}}} + \big\|r_{\mathcal{M}}\big\|^2_{\Sigma_{\mathcal{M}}} + \big\|r_{\mathcal{S}}\big\|^2_{\Sigma_{\mathcal{S}}}\Big\}, \qquad (15)$$

where we use $r_{\mathcal{Z}}$, $r_{\mathcal{P}}$, $r_{\mathcal{M}}$, and $r_{\mathcal{S}}$ to denote the residuals of the feature reprojection, dimension prior, object motion model, and semantic bounding box reprojection respectively. $\Sigma$ is the corresponding covariance matrix of each measurement. We formulate our 3D object tracking problem as a dynamic object BA approach which fully exploits the object dimension and motion priors and enforces temporal consistency. The maximum-a-posteriori estimate is obtained by minimizing the sum of the Mahalanobis norms of all the residuals.
6.2.1 Sparse Feature Observation
We extend the projective geometry between static features and camera poses to dynamic features and object poses. Since the anchored features are relatively static with respect to the object frame, object poses which share feature observations can be connected in a factor graph. For each feature observation, the residual is the reprojection error between the predicted feature position and the actual feature observations in the left and right images:
$$r_{\mathcal{Z}}\big(z^t_{f_{mn}},\ {}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ {}^{o_n}\mathbf{f}_m\big) = \begin{bmatrix} {}^l z^t_{f_{mn}} - \pi\big(h^{-1}\big({}^w\mathbf{x}^t_c,\ h({}^w\mathbf{x}^t_{o_n},\ {}^{o_n}\mathbf{f}_m)\big)\big) \\[4pt] {}^r z^t_{f_{mn}} - \pi\big(h^{-1}\big(\mathbf{x}_{ex},\ h^{-1}\big({}^w\mathbf{x}^t_c,\ h({}^w\mathbf{x}^t_{o_n},\ {}^{o_n}\mathbf{f}_m)\big)\big)\big) \end{bmatrix}, \qquad (16)$$

$$h(\mathbf{x}, \mathbf{f}) = \mathbf{R}\mathbf{f} + \mathbf{p}, \qquad h^{-1}(\mathbf{x}, \mathbf{f}) = \mathbf{R}^\top(\mathbf{f} - \mathbf{p}), \qquad (17)$$

where we use $h(\mathbf{x}, \mathbf{f})$ to denote applying the 3D rigid-body transform $\mathbf{x} = \{\mathbf{p}, \mathbf{R}\}$ to a point $\mathbf{f}$. For example, $h({}^w\mathbf{x}^t_{o_n},\ {}^{o_n}\mathbf{f}_m)$ transforms the feature point from the object frame to the world frame, and $h^{-1}$ is the corresponding inverse transform. $\mathbf{x}_{ex}$ denotes the extrinsic transform of the stereo camera, which is calibrated offline.
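The stereo reprojection residual of an object-anchored feature can be sketched as follows, assuming normalized image coordinates and a known left-to-right stereo extrinsic:

```python
import numpy as np

def transform(R, p, x):
    """Apply the rigid-body transform (R, p) to point x."""
    return R @ x + p

def inv_transform(R, p, x):
    """Apply the inverse of the rigid-body transform (R, p)."""
    return R.T @ (x - p)

def project(x):
    """Pinhole projection onto the normalized image plane."""
    return x[:2] / x[2]

def stereo_feature_residual(Rwc, pwc, Rwo, pwo, f_obj, z_left, z_right,
                            R_ex, p_ex):
    """Reprojection residual of a feature f_obj (expressed in the object
    frame) against its stereo observations. (R_ex, p_ex) is the left-to-right
    stereo extrinsic transform."""
    f_world = transform(Rwo, pwo, f_obj)          # object frame -> world
    f_left = inv_transform(Rwc, pwc, f_world)     # world -> left camera
    f_right = inv_transform(R_ex, p_ex, f_left)   # left -> right camera
    return np.concatenate([z_left - project(f_left),
                           z_right - project(f_right)])
```

For a static background feature, the object transform simply becomes the identity and `f_obj` is the point in the world frame.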
6.2.2 Semantic 3D Object Measurement
Benefiting from the viewpoint classification, we know the relations between the edges of the 2D bounding box and the vertices of the 3D bounding box. Assuming the 2D bounding box tightly fits the object boundary, each edge intersects a reprojected 3D vertex. These relations are encoded as four selection matrixes, one per 2D edge. The semantic residual is the reprojection error between the predicted 3D box vertices and the detected 2D box edges:
$$r_{\mathcal{S}}\big(z^t_{o_n},\ {}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ \mathbf{d}_n\big) = \begin{bmatrix} u_{\min} - \pi_{\mathbf{C}_1}\big({}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ \mathbf{d}_n\big)_u \\ v_{\min} - \pi_{\mathbf{C}_2}\big(\cdot\big)_v \\ u_{\max} - \pi_{\mathbf{C}_3}\big(\cdot\big)_u \\ v_{\max} - \pi_{\mathbf{C}_4}\big(\cdot\big)_v \end{bmatrix}, \qquad (18)$$

$$\pi_{\mathbf{C}_i}\big({}^w\mathbf{x}^t_c,\ {}^w\mathbf{x}^t_{o_n},\ \mathbf{d}_n\big) = \pi\big(h^{-1}\big({}^w\mathbf{x}^t_c,\ h({}^w\mathbf{x}^t_{o_n},\ \mathbf{C}_i\mathbf{d}_n)\big)\big), \qquad (19)$$

where we use $\pi_{\mathbf{C}_i}$ to project the vertex specified by the corresponding selection matrix $\mathbf{C}_i$ of the 3D bounding box onto the camera plane. This factor instantly builds the connection between the object pose and its dimensions. Note that we only perform 2D detection on the left image due to the real-time requirement.
6.2.3 Vehicle Motion Model
To achieve consistent estimation of motion and orientation for the vehicle class, we employ the kinematics model introduced in [27]. The vehicle state at time $t$ can be predicted from the state at time $t-1$:

$${}^w\mathbf{p}^t_{o_n} = {}^w\mathbf{p}^{t-1}_{o_n} + v^{t-1}_n \Delta t\, \big[\cos\theta^{t-1}_n,\ \sin\theta^{t-1}_n\big]^\top, \qquad (20)$$

$$\theta^t_n = \theta^{t-1}_n + \frac{v^{t-1}_n \tan\delta^{t-1}_n}{L}\,\Delta t, \qquad (21)$$

where the position update is applied on the ground plane and $L$ is the length of the wheelbase, which can be parameterized by the dimensions $\mathbf{d}_n$. The orientation of the car is always parallel to its moving direction. We refer readers to [27] for more derivations. Thanks to this kinematics model, we can track the vehicle velocity and orientation continuously, which provides rich information for behavior and path planning in autonomous driving. For other classes such as pedestrians, we directly use a simple constant-velocity model to enhance smoothness.
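A minimal sketch of one prediction step of such a kinematics (bicycle) model, assuming a planar state (x, y, heading); the state layout is an illustrative simplification of the paper's vehicle state:

```python
import math

def predict_vehicle_state(x, y, theta, v, delta, L, dt):
    """One step of the bicycle kinematics model: the car moves along its
    heading, and the heading changes with the steering angle delta and
    the wheelbase length L."""
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + v * math.tan(delta) / L * dt
    return x_next, y_next, theta_next
```

With zero steering the heading stays constant and the car moves in a straight line; a non-zero steering angle turns the heading in proportion to the traveled distance.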
6.2.4 Point Cloud Alignment
After minimizing all the residuals, we obtain the MAP estimate of the object pose based on the dimension prior. However, the pose estimate might be biased due to the difference between the actual object size and the prior (see Fig. 5). We therefore align the 3D box to the recovered point cloud, which is unbiased thanks to the accurate stereo extrinsic calibration. We minimize the distance of all 3D points to their anchored 3D box surfaces:
$$\arg\min_{{}^w\mathbf{x}^t_{o_n}} \sum_{m} \big\| e\big({}^{o_n}\mathbf{f}_m,\ \mathbf{d}_n\big) \big\|^2, \qquad (22)$$

where $e({}^{o_n}\mathbf{f}_m, \mathbf{d}_n)$ denotes the distance of the feature ${}^{o_n}\mathbf{f}_m$ to its corresponding observed box surface. After all the above information is tightly fused together, we obtain consistent and accurate pose estimation for both static and dynamic objects.
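A simplified sketch of the alignment idea: shift the box center so that an observed face fits the anchored points in the least-squares sense. Restricting the fit to a single known face is an assumption for illustration; the paper minimizes distances to all corresponding surfaces:

```python
import numpy as np

def align_center_to_points(center, dims, points, face_axis=0):
    """Shift the box center along one axis so that the +axis face of an
    axis-aligned box (center, dims) best fits the anchored point cloud
    in the least-squares sense (illustrative single-face case)."""
    face = center[face_axis] + 0.5 * dims[face_axis]
    offset = np.mean(points[:, face_axis]) - face
    new_center = center.copy()
    new_center[face_axis] += offset
    return new_center
```

If the features actually lie slightly outside the prior-sized face, the center is shifted outward by the mean discrepancy, removing the bias introduced by the dimension prior along that axis.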
7 Experimental Results
We evaluate the performance of the proposed system on both the KITTI [28, 29] and Cityscapes [30] datasets over diverse scenarios. The mature 2D detection and classification module generalizes well to unseen scenes, and the follow-up nonlinear optimization is data-independent. Our system therefore produces consistent results on different datasets. The quantitative evaluation shows that our semantic 3D object and ego-motion tracking system performs better than isolated state-of-the-art solutions.
7.1 Qualitative Results Over Diverse Scenarios
Firstly, we test the system on long challenging trajectories in the KITTI dataset, which contains 1240 × 376 stereo color and grayscale images captured at 10 Hz. We perform 2D detection on the left color images and extract 500 (for the background) and 100 (for each object) ORB features [26] on both the left and right grayscale images. Fig. 6 shows a 700 m closed-loop trajectory which includes both static and dynamic cars. We use a red cone and line to represent the camera pose and trajectory, and CAD models and lines of various colors to represent different cars and their trajectories; all the observed cars are visualized in the top view. Currently, our system performs object tracking in a memoryless manner, so a re-observed object is treated as a new one, which can also be seen in the enlarged start and end views in Fig. 6. In Fig. 6, the black car is continuously truncated over a long time, which is an unobservable case for instance 3D box inference (Sect. 4.2). However, we can still track its pose accurately due to the temporal feature constraints and dynamic point cloud alignment.
We also demonstrate the system performance on different datasets over more scenarios which include concentrated cars, crossroads, and dynamic roads. All the reprojected images and the corresponding top views are shown in Fig. 7.
7.2 Quantitative Evaluation
Since there are currently no available integrated academic solutions for both ego-motion and dynamic object tracking, we conduct quantitative evaluations of the camera and object poses by comparing with isolated state-of-the-art works: ORB-SLAM2 [3] and 3DOP [1].
7.2.1 Camera Pose Evaluation
Benefiting from the semantic prior, our system performs robust camera estimation in dynamic environments. We evaluate the accuracy of the camera odometry by comparing the relative pose error (RPE) [28] and the RMSE of the ATE (Absolute Trajectory Error) [31] with ORB-SLAM2 [3] in its stereo setting. Two sequences in the KITTI raw dataset, 0929_0004 and 1003_0047, which include dynamic objects, are used for the RPE comparison. The relative translation and rotation errors are presented in Fig. 8 (a). Ten long sequences of the KITTI raw dataset are additionally used to evaluate the RMSE of the ATE, as detailed in Fig. 8 (b). It can be seen that our estimation shows almost the same accuracy as [3] in less dynamic scenarios due to the similar bundle adjustment approaches (0926_0051, etc.). However, our system still works well in highly dynamic environments, while ORB-SLAM2 shows non-trivial errors due to introducing many outliers (1003_0047, 0929_0004, etc.). This experiment shows that the semantic-aided object-aware property is essential for camera pose estimation, especially in dynamic autonomous driving scenarios.
7.2.2 Object Localization Evaluation
We evaluate the car localization performance on the KITTI tracking dataset since it provides sequential stereo images with labeled object 3D boxes. According to the occlusion level and 2D box height, we divide all the detected objects into three regimes: easy, moderate and hard, then evaluate them separately. To evaluate the localization accuracy of the proposed estimator, we collect statistics of the objects' average position error. By setting a series of Intersection-over-Union (IoU) thresholds from 0 to 1, we calculate the true positive (TP) rate and the average error between the estimated positions of the TPs and the ground truth at each instance frame for each threshold. The average position error (in %) vs. TP rate curves are shown in Fig. 9, where we use blue, red and yellow lines to represent the statistics for easy, moderate and hard objects. It can be seen that the average error for half of the true positive objects is below 5%. Over all the estimated results, the average position errors are 5.9%, 6.1% and 6.3% for easy, moderate and hard objects respectively.
To compare with baselines, we evaluate the Average Precision (AP) of bird's eye view boxes and 3D boxes by comparing with 3DOP [1], the state-of-the-art stereo-based 3D object detection method. We set the IoU thresholds to 0.25 and 0.5 for both bird's eye view and 3D boxes. Note that we use the oriented box overlap, so the object orientation is also implicitly evaluated in these two metrics. We use S, M, F, P to represent the semantic measurement, motion model, feature observation, and point cloud alignment respectively. As listed in Table 1, the semantic measurement serves as the basis of the 3D object estimation. Adding the feature observation notably increases the AP for easy (near) objects due to the large feature extraction area (the same holds for adding point cloud alignment), while adding the motion model helps the hard (far) objects since it "smooths" the non-trivial 2D detection noise for small objects. After integrating all these cues together, we obtain accurate 3D box estimation for both near and far objects. It can be seen that our integrated method shows more accurate results for all the APs in bird's eye view and 3D box at the 0.25 IoU threshold. Due to the unregressed object size, our performance is slightly worse than 3DOP in the 3D box comparison at 0.5 IoU. However, we stress that our method can efficiently track both static and dynamic 3D objects with temporal smoothness and motion consistency, which is essential for continuous perception and planning in autonomous driving.
Table 1: Average Precision (AP, in %) of bird's eye view and 3D boxes on the KITTI tracking dataset, and per-frame runtime.

Method    | Bird's eye view AP   | Bird's eye view AP   | 3D box AP            | 3D box AP            | Time
          | (IoU=0.25)           | (IoU=0.5)            | (IoU=0.25)           | (IoU=0.5)            | (ms)
          | Easy   Mode   Hard   | Easy   Mode   Hard   | Easy   Mode   Hard   | Easy   Mode   Hard   |
S         | 63.12  56.37  53.18  | 33.12  28.91  27.77  | 58.78  52.42  48.82  | 25.68  21.70  21.02  | 120
S+M       | 66.27  63.81  58.84  | 41.08  38.90  34.84  | 62.97  60.70  55.28  | 34.18  30.98  27.32  | 121
S+F       | 76.23  70.18  66.18  | 48.82  43.07  39.80  | 73.35  66.86  62.66  | 38.93  33.43  30.46  | 170
S+F+M     | 77.87  74.48  70.85  | 46.96  44.39  42.23  | 73.32  71.06  67.30  | 40.50  36.28  34.59  | 171
S+F+M+P   | 88.07  77.83  72.73  | 58.52  46.17  43.97  | 86.57  74.13  68.96  | 48.51  37.13  34.54  | 173
3DOP      | 81.34  70.70  66.32  | 54.83  43.36  37.15  | 80.62  70.01  65.76  | 53.73  42.27  35.87  | 1200
8 Conclusions and Future Work
In this paper, we propose a 3D object and ego-motion tracking system for autonomous driving. We integrate the instance semantic prior, sparse feature measurements and a kinematics motion model into a tightly-coupled optimization framework. Our system can robustly estimate the camera pose without being affected by dynamic objects, and can continuously track the states and recover the dynamic sparse features of each observed object. Demonstrations over diverse scenarios and different datasets illustrate the practicability of the proposed system. Quantitative comparisons with state-of-the-art approaches demonstrate our accuracy for both camera estimation and object localization.
In the future, we plan to improve the object temporal correlation by fully exploiting the dense visual information. Currently, the camera and object tracking are implemented successively in our system. We are also going to model them in a fully-integrated optimization framework such that the estimation of both the camera and the dynamic objects can benefit from each other.
References
 [1] Chen, X., Kundu, K., Zhu, Y., Berneshawi, A.G., Ma, H., Fidler, S., Urtasun, R.: 3d object proposals for accurate object class detection. In: Advances in Neural Information Processing Systems. (2015) 424–432
 [2] Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3d object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2147–2156
 [3] Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5) (2017) 1255–1262
 [4] Engel, J., Schöps, T., Cremers, D.: Lsd-slam: Large-scale direct monocular slam. In: European Conference on Computer Vision, Springer (2014) 834–849
 [5] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. (2015) 91–99
 [6] Frost, D.P., Kähler, O., Murray, D.W.: Object-aware bundle adjustment for correcting monocular scale drift. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on, IEEE (2016) 4770–4776
 [7] Sucar, E., Hayet, J.B.: Probabilistic global scale estimation for mono-slam based on generic object detection. In: Computer Vision and Pattern Recognition Workshops (CVPRW). (2017)
 [8] Bowman, S.L., Atanasov, N., Daniilidis, K., Pappas, G.J.: Probabilistic data association for semantic slam. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 1722–1729
 [9] Atanasov, N., Zhu, M., Daniilidis, K., Pappas, G.J.: Semantic localization via the matrix permanent. In: Proceedings of Robotics: Science and Systems. Volume 2. (2014)
 [10] Pillai, S., Leonard, J.J.: Monocular slam supported object recognition. In: Proceedings of Robotics: Science and Systems. Volume 2. (2015)
 [11] Dong, J., Fei, X., Soatto, S.: Visual-inertial-semantic scene representation for 3d object detection. arXiv preprint arXiv:1606.03968 (2016)
 [12] Civera, J., Gálvez-López, D., Riazuelo, L., Tardós, J.D., Montiel, J.: Towards semantic slam using a monocular camera. In: Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, IEEE (2011) 1277–1284
 [13] Salas-Moreno, R.F., Newcombe, R.A., Strasdat, H., Kelly, P.H., Davison, A.J.: Slam++: Simultaneous localisation and mapping at the level of objects. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE (2013) 1352–1359
 [14] Gálvez-López, D., Salas, M., Tardós, J.D., Montiel, J.: Real-time monocular object slam. Robotics and Autonomous Systems 75 (2016) 435–449
 [15] Bao, S.Y., Savarese, S.: Semantic structure from motion. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE (2011) 2025–2032
 [16] Bao, S.Y., Bagra, M., Chao, Y.W., Savarese, S.: Semantic structure from motion with points, regions, and objects. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 2703–2710
 [17] Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint semantic segmentation and 3d reconstruction from monocular video. In: European Conference on Computer Vision, Springer (2014) 703–718
 [18] Vineet, V., Miksik, O., Lidegaard, M., Nießner, M., Golodetz, S., Prisacariu, V.A., Kähler, O., Murray, D.W., Izadi, S., Pérez, P., et al.: Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE (2015) 75–82
 [19] Li, X., Belaroussi, R.: Semi-dense 3d semantic mapping from monocular slam. arXiv preprint arXiv:1611.04144 (2016)
 [20] McCormac, J., Handa, A., Davison, A., Leutenegger, S.: Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 4628–4635
 [21] Bao, S.Y., Chandraker, M., Lin, Y., Savarese, S.: Dense object reconstruction with semantic priors. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE (2013) 1264–1271
 [22] Zia, M.Z., Stark, M., Schiele, B., Schindler, K.: Detailed 3d representations for object recognition and modeling. IEEE transactions on pattern analysis and machine intelligence 35(11) (2013) 2608–2623
 [23] Xiang, Y., Choi, W., Lin, Y., Savarese, S.: Data-driven 3d voxel patterns for object category recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1903–1911
 [24] Mousavian, A., Anguelov, D., Flynn, J., Košecká, J.: 3d bounding box estimation using deep learning and geometry. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 5632–5640
 [25] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 779–788
 [26] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: Computer Vision (ICCV), 2011 IEEE international conference on, IEEE (2011) 2564–2571
 [27] Gu, T.: Improved Trajectory Planning for On-Road Self-Driving Vehicles Via Combined Graph Search, Optimization & Topology Analysis. PhD thesis, Carnegie Mellon University (2017)
 [28] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition (CVPR). (2012)
 [29] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR) (2013)
 [30] Cordts, M., Omran, M., Ramos, S., Scharwächter, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset. In: CVPR Workshop on the Future of Datasets in Vision. Volume 1. (2015) 3
 [31] Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, IEEE (2012) 573–580