Visual SLAM with Network Uncertainty Informed Feature Selection
Abstract
In order to facilitate long-term localization using a visual simultaneous localization and mapping (SLAM) algorithm, careful feature selection is required so that reference points persist over long durations and the runtime and storage complexity of the algorithm remain consistent. We present SIVO (Semantically Informed Visual Odometry and Mapping), a novel information-theoretic feature selection method for visual SLAM which incorporates machine learning and neural network uncertainty into the feature selection pipeline. Our algorithm selects points which provide the highest reduction in Shannon entropy between the entropy of the current state and the joint entropy of the state given the addition of the new feature, together with the classification entropy of the feature from a Bayesian neural network. This feature selection strategy generates a sparse map suitable for long-term localization, as each selected feature significantly reduces the uncertainty of the vehicle state and has repeatedly been detected, with high confidence, as a static object (building, traffic sign, etc.). The KITTI odometry dataset is used to evaluate our method, and we also compare our results against ORB-SLAM2. Overall, SIVO performs comparably to ORB-SLAM2 (an average translation error difference of 0.17%, with comparable rotation error) while reducing the map size by 69%.
Keywords: Localization, Mapping, SLAM, Deep Learning, Information Theory, Semantic Segmentation
I Introduction
Localization is a crucial problem for an autonomous vehicle. Accurate location knowledge facilitates a variety of tasks required for autonomous driving, such as vehicle control, path planning, and object tracking. Accurate positioning information is also a matter of safety, as localization must be known on the order of centimetres in order to prevent collisions and maintain lane positioning. Although sensors such as a Global Positioning System (GPS) can provide localization information to the desired accuracy, there are numerous situations where this is not possible, such as passing through a tunnel or driving in dense urban environments. In recent years, visual odometry (VO) [1] and visual simultaneous localization and mapping (SLAM) have emerged as reliable techniques for vehicle localization through the use of cameras. By observing the apparent motion of distinct reference points, or features, in the scene, we can determine the motion of a camera through the environment. The map generated by the SLAM algorithm can be used for long-term localization, as it provides the vehicle with known references if it returns to a pre-mapped area. However, the runtime performance and storage requirements of the algorithm scale with the number of landmarks detected. Therefore, careful landmark selection is required as the robot navigates through its environment.
In order to accurately track camera motion, selected features should be: viewpoint invariant, scale invariant, rotation invariant, illumination invariant, season invariant, and static. Traditional feature detectors and descriptors, such as SIFT [2], SURF [3], FAST [4], or ORB [5], aim to tackle the first three criteria, while appearance-based methods such as FAB-MAP [6] or SeqSLAM [7] aim to tackle criteria 4 and 5. Typically, visual SLAM algorithms depend on outlier rejection schemes such as RANSAC [8] to characterize an object as dynamic. In this case, the motion of the dynamic reference point would be an outlier compared to the motion of static objects, which should comprise the majority of the scene. This, however, is not always the case for autonomous driving, due to other vehicles or pedestrians in the scene.
Figure 1 illustrates typical features used by a visual SLAM algorithm. The best reference points are most likely on the corners of the building. These are useful long-term references, as they would only be modified in the event of construction or vandalism. In contrast, the reference points on the cars may be gone within the hour, and the features on foliage will no longer be present as the seasons change. The emergence of deep learning has led to rapid advances in scene understanding, which allows us to incorporate context into a SLAM algorithm and address our last criterion. We can now dictate which features are more likely to be stable from our contextual understanding of typical static and dynamic objects.
We present SIVO (Semantically Informed Visual Odometry and Mapping; publicly available at www.github.com/navganti/SIVO), a novel feature selection method for visual SLAM which facilitates long-term localization. This work enhances traditional feature detectors with deep-learning-based scene understanding using a Bayesian neural network (NN), which provides context for visual SLAM while accounting for neural network uncertainty. Our method selects features which provide the highest reduction in Shannon entropy between the entropy of the current state and the joint entropy of the state given the addition of a new feature, together with the classification entropy of the feature from the Bayesian NN. This strategy generates a sparse map suitable for long-term localization, as each selected feature significantly reduces the uncertainty of the vehicle state and has repeatedly been detected, with high confidence, as a static object (building, traffic sign, etc.). To the best of our knowledge, this is the first algorithm which directly considers NN uncertainty in the SLAM pipeline.
II Related Work
Information-theoretic (IT) approaches have been prominent in maintaining the number of variables used by the SLAM optimization pipeline. These methods select features or keyframes which maximize the information gain (see Section III-B), and aim to reduce the number of optimization variables without appreciably compromising the accuracy of the SLAM solution. Dissanayake et al. [11] propose a feature selection strategy which reduces the computational burden of maintaining a large map without affecting the statistical consistency of the estimation process. As the robot travels through its environment, the authors only select one new landmark each time the robot has travelled a predefined distance. The feature selected is the one with the maximum information content; this value is determined by calculating the reciprocal of the trace of the landmark position covariance matrix. Hochdorfer and Schlegel [12] build on the method proposed by Dissanayake, and consider spatial position in conjunction with quantifying the quality of a landmark. Davison [13] proposes to use mutual information as the quality metric of a visual measurement in order to determine the location within an image on which to focus processing resources. The landmark with the highest mutual information between itself and the robot pose will reduce the pose uncertainty the most, and is selected to update the pose estimate. This process is repeated until the mutual information of the best feature drops below a predefined threshold. Kaess and Dellaert [14] build on the method proposed by Davison and further save computing resources by immediately selecting all features which have a mutual information above a predefined threshold.
A number of other methods have also been proposed to maintain the number of features, including an entropy-based approach [15], an incremental approach which optimizes the trade-off between memory requirements and estimation quality [16], as well as an approach which learns the most useful landmarks for a navigation task using Monte Carlo reinforcement learning [17]. Map point maintenance is crucial for long-term localization, and information-theoretic approaches have proven effective in reducing the complexity of the SLAM problem. However, the maps generated by these algorithms do not have guaranteed long-term stability due to the lack of semantic information.
The idea of incorporating semantic information into the visual SLAM formulation is not novel. The emergence of deep-learning-based approaches has resulted in increasingly accurate methods to determine context within a scene. Salas-Moreno et al. [18] propose to incorporate semantic information at the level of objects. In contrast to traditional SLAM algorithms, which use low-level primitive features such as points, lines, or patches for localization, the authors track the motion of objects in the scene from a known 3D object database, such as tables, chairs, or desks. Murali et al. [19] extend a custom map builder and the localization functionality of ORB-SLAM2 [9] to use semantic scene information obtained from a low-rank version of SegNet [20]. A feature is deemed invalid if its class is a temporal object (car, bike, pedestrian), or too far away (sky, road). This feature selection scheme is only used for mapping, while all features are incorporated for visual odometry. Semantic information has also been included in direct methods [21], and An et al. [22] propose a VO pipeline which fuses semantics into a semi-direct method. Some semantic approaches bypass the use of feature detectors entirely. Bowman et al. [23] use expectation maximization to jointly estimate data association and sensor states. Stenborg et al. [24] use only the 3D location of a feature and its semantic label as a descriptor, and use a particle filter to bypass the use of traditional feature detectors. While the rapid advancement of deep learning has allowed for the development of semantic SLAM algorithms, the approaches to date treat network output as deterministic and do not properly account for uncertainty in the NN output.
As we continue to develop machine-learning-based methods, it is imperative to understand how much trust we can place in the network, and to use this uncertainty in our SLAM algorithm to facilitate robot decisions. Gal [25, 26, 27, 28] and Kendall [29, 30, 31] have explored the use of dropout [32] at test time to better estimate neural network uncertainty for both classification and regression tasks. The authors show that multiple forward passes with dropout are equivalent to approximating a Bayesian neural network, which allows for uncertainty to be extracted from the network output. Our approach uses this network uncertainty formulation in conjunction with a traditional IT approach, allowing us to reliably incorporate semantic information while maintaining map size.
III Problem Formulation
III-A Uncertainty Estimation for Semantic Segmentation
III-A1 Bayesian Neural Networks
It has been shown that an NN with one layer, an infinite number of weights, and a Gaussian distribution placed over each of its weights converges to a Gaussian process (GP) [33]. We can intuitively see that this is the case; NNs can be considered "function approximators", and placing a Gaussian distribution (typically a standard Gaussian) over the weights results in a distribution over the function. An infinitely-wide NN is obviously impossible to construct; however, finite NNs with distributions over their weights have been studied as Bayesian neural networks [33]. Gal [27] shows that applying dropout [32] before every weight layer in an NN with arbitrary depth and nonlinearities is a mathematically equivalent approximation to the Bayesian NN, which in turn is an approximation to the deep GP.
III-A2 Output Prediction using a Bayesian Neural Network
Let us define the input to a neural network as $\mathbf{x}$, and the weight matrix for each layer as $\mathbf{W}_i$, with $L$ total layers and varying dimensions. The output of an NN, $\mathbf{y}$, can be expressed by

$$\mathbf{y} = f^{\mathbf{W}_1, \dots, \mathbf{W}_L}(\mathbf{x}) \qquad (1)$$
where $f$ is the underlying function approximated by the Bayesian NN. The classification output is constructed by passing the network output through a softmax function,

$$p(y = c \mid \mathbf{x}, \mathbf{W}_1, \dots, \mathbf{W}_L) = \mathrm{softmax}\left(f^{\mathbf{W}_1, \dots, \mathbf{W}_L}(\mathbf{x})\right)_c = \frac{\exp\left(f_c(\mathbf{x})\right)}{\sum_{c'=1}^{C} \exp\left(f_{c'}(\mathbf{x})\right)} \qquad (2)$$
where $p(y = c \mid \mathbf{x}, \mathbf{W}_1, \dots, \mathbf{W}_L)$ represents the probability of a particular class output, $c$, out of $C$ possibilities, given the input and weights. For any new input $\mathbf{x}^*$, the predicted output can be determined by integrating over all possible functions represented by the Bayesian NN. Let us define $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ as our training data input, and $\mathbf{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_N\}$ as our training data output, indicating that we had $N$ training examples. The probability of a new predicted output, $\mathbf{y}^*$, is defined by

$$p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) = \int p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{X}, \mathbf{Y}) \, d\mathbf{w} \qquad (3)$$

where $\mathbf{w}$ collects all of the network weights.
This integral is typically intractable, but can be approximated using variational inference and Monte Carlo integration [27]. In contrast to averaging the weights at test time as described in [32], the same input is passed through the network repeatedly, and dropout is used at test time to provide a different "thinned" network for each trial. The outputs from each trial are then averaged to provide the final output. This is referred to as MC (Monte Carlo) dropout [26]:

$$p(y = c \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\left(f^{\hat{\mathbf{W}}_t}(\mathbf{x})\right)_c \qquad (4)$$
where $T$ is the number of passes through the network and $\hat{\mathbf{W}}_t$ represents the nonzero subset of the weights remaining after applying dropout on trial $t$. It is important to note the distinction between the softmax output and the result of Monte Carlo sampling; the softmax mapping describes relative probabilities between class detections, but is not an absolute measure of uncertainty.
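The MC dropout procedure of Equations 1 to 4 can be sketched in a few lines of NumPy. This is a toy illustration rather than the Bayesian SegNet used later: the two-layer network, its randomly drawn weights, and the dropout probability are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network with hypothetical, randomly drawn weights.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, p_drop, rng):
    """One stochastic forward pass: dropout is applied before each
    weight layer, yielding a different 'thinned' network each call."""
    mask1 = rng.random(x.shape) >= p_drop
    h = np.maximum(0.0, (x * mask1) @ W1)   # ReLU hidden layer
    mask2 = rng.random(h.shape) >= p_drop
    return softmax((h * mask2) @ W2)

def mc_dropout_predict(x, T=50, p_drop=0.5, rng=rng):
    """Average T stochastic softmax outputs (Eq. 4)."""
    return np.mean([forward(x, p_drop, rng) for _ in range(T)], axis=0)

x = rng.normal(size=8)
p = mc_dropout_predict(x)
print(p)  # averaged class probabilities; sums to 1
```

The spread of the individual softmax samples around this average is what carries the uncertainty information exploited in Section III-C.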
III-B Information Theory
For a stochastic variable $X$ with probability mass function $p(x)$, the entropy, or average uncertainty, is defined by

$$H(X) = -\sum_{x} p(x) \log_2 p(x) \qquad (5)$$
and is measured in bits. The entropy of a multivariate Gaussian variable, $X \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, is defined by

$$H(X) = \frac{1}{2} \log_2\left( (2 \pi e)^{n} \, |\boldsymbol{\Sigma}| \right) \qquad (6)$$
where $\boldsymbol{\Sigma}$ represents the covariance matrix, and $n$ is the dimension of the random variable. Let us define two dependent random variables, $X$ and $Y$, with probability density functions $p(x)$ and $p(y)$ respectively. The mutual information, or information shared between the two variables, is represented by [34]

$$I(X; Y) = \iint p(x, y) \log_2 \frac{p(x, y)}{p(x) \, p(y)} \, dx \, dy \qquad (7)$$
The mutual information between two parts of a multivariate Gaussian, with joint covariance $\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy} \end{bmatrix}$, is defined by [35]

$$I(X; Y) = \frac{1}{2} \log_2 \frac{|\boldsymbol{\Sigma}_{xx}| \, |\boldsymbol{\Sigma}_{yy}|}{|\boldsymbol{\Sigma}|} \qquad (8)$$

$$= \frac{1}{2} \log_2 \frac{|\boldsymbol{\Sigma}_{xx}|}{\left| \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx} \right|} \qquad (9)$$
Mutual information and entropy are tightly coupled. The mutual information between variables $X$ and $Y$ can also be represented by [34]

$$I(X; Y) = H(X) - H(X \mid Y) \qquad (10)$$
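Equations 6, 9, and 10 can be checked numerically. The sketch below computes Gaussian entropy and mutual information from a partitioned covariance matrix (the covariance values are arbitrary illustrative numbers) and verifies that the Schur complement form of Equation 9 agrees with the entropy difference of Equation 10.

```python
import numpy as np

def gaussian_entropy(S):
    """Entropy in bits of a multivariate Gaussian with covariance S (Eq. 6)."""
    n = S.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(S))

def gaussian_mutual_information(S, nx):
    """I(X; Y) in bits for a joint Gaussian with covariance S, where X is
    the first nx variables (Eq. 9, Schur complement form)."""
    Sxx, Syy, Sxy = S[:nx, :nx], S[nx:, nx:], S[:nx, nx:]
    cond = Sxx - Sxy @ np.linalg.inv(Syy) @ Sxy.T   # Sigma_{x|y}
    return 0.5 * np.log2(np.linalg.det(Sxx) / np.linalg.det(cond))

# Small positive-definite joint covariance (hypothetical numbers).
S = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.2],
              [0.3, 0.2, 1.0]])
I = gaussian_mutual_information(S, nx=1)

# Check Eq. 10: I(X; Y) = H(X) - H(X | Y).
cond = S[:1, :1] - S[:1, 1:] @ np.linalg.inv(S[1:, 1:]) @ S[1:, :1]
assert abs(I - (gaussian_entropy(S[:1, :1]) - gaussian_entropy(cond))) < 1e-9
print(I)  # positive, since the variables are correlated
```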
III-C Uncertainty in Classification Results for a Bayesian Neural Network
Entropy can be used as a metric for classification uncertainty from a Bayesian NN [25]. Let us denote $y$ as our network output, $\mathbf{x}$ as our input image data, $\mathbf{X}$ and $\mathbf{Y}$ as our training data, and $c$ as a particular class output with $C$ potential classes. The entropy is defined by [25]

$$H(y \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) = -\sum_{c=1}^{C} p(y = c \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) \log_2 p(y = c \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) \qquad (11)$$
Equation 11 reaches a maximum value of $\log_2 C$ when all of the class outputs are equiprobable, and a minimum value of $0$ when one class is predicted with a probability of $1$. Although the individual confidence values are not measures of uncertainty on their own, the entropy calculation captures the spread of the confidence values across the class outputs of a pixel. We can write an expression for the approximate entropy of the confidence output, in bits, by substituting Equation 4 into Equation 11 [25]:

$$H(y \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) \approx -\sum_{c=1}^{C} \left( \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\left(f^{\hat{\mathbf{W}}_t}(\mathbf{x})\right)_c \right) \log_2 \left( \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\left(f^{\hat{\mathbf{W}}_t}(\mathbf{x})\right)_c \right) \qquad (12)$$
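The behaviour of Equations 11 and 12 at the two extremes is easy to verify. Below is a minimal sketch of the per-pixel classification entropy applied to the MC-dropout-averaged softmax output; the 15-class count matches the segmentation setup described in Section V-A.

```python
import numpy as np

def classification_entropy(p):
    """Shannon entropy in bits of a categorical distribution (Eq. 11/12).
    p is the MC-dropout-averaged softmax output for one pixel."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                       # convention: 0 * log2(0) = 0
    return float(-np.sum(nz * np.log2(nz)))

C = 15                                  # classes used in this work
uniform = np.full(C, 1.0 / C)           # all classes equiprobable
certain = np.eye(C)[3]                  # one class with probability 1
print(classification_entropy(uniform))  # log2(15), about 3.91 bits (maximum)
print(classification_entropy(certain))  # 0.0 bits (minimum)
```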
IV Feature Selection Criteria
We now present our feature selection methodology for long-term visual SLAM. Our method builds upon the work of Davison [13] and the enhancement by Kaess and Dellaert [14]. We first outline Davison's method in detail, and then present SIVO.
IV-A Information-Theoretic Feature Selection Criteria
Let us denote the 6-DOF state parameterization at some time $k$ as $\mathbf{x}_k$. As our main goal is to track camera poses through time, the state represents the pose of the camera frame with respect to the world frame at time $k$, with associated covariance matrix $\boldsymbol{\Sigma}_{x_k x_k}$. Measurements are defined through a nonlinear measurement model, $h$, as

$$\mathbf{z}_i = h(\mathbf{x}_k, \mathbf{p}_i) + \mathbf{n}, \qquad \mathbf{n} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}) \qquad (13)$$
where $\mathbf{z}_i$ represents the feature measurement, and $\mathbf{n}$ is zero-mean Gaussian noise with noise covariance $\mathbf{Q}$. The measurement model is the rectified stereo projection model, where we assume that the transformation between the right and left cameras is a purely horizontal translation equivalent to the baseline. We define $(x_c, y_c, z_c)$ as the $x$, $y$, and $z$ coordinates of the point in the camera frame. The point in the world frame is defined as $\mathbf{p}_i$, the camera intrinsic parameters are $f_u$, $f_v$, $c_u$, and $c_v$, and $b$ is the baseline between the stereo cameras:

$$h(\mathbf{x}_k, \mathbf{p}_i) = \begin{bmatrix} u_l \\ v_l \\ u_r \end{bmatrix} = \begin{bmatrix} f_u \frac{x_c}{z_c} + c_u \\ f_v \frac{y_c}{z_c} + c_v \\ f_u \frac{x_c - b}{z_c} + c_u \end{bmatrix} \qquad (14)$$
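As a sanity check, the rectified stereo projection model can be written directly as code. The intrinsics below are hypothetical KITTI-like values; the only structural assumption is the purely horizontal baseline stated above, under which the disparity reduces to $f_u b / z_c$.

```python
import numpy as np

def stereo_project(p_c, fu, fv, cu, cv, b):
    """Rectified stereo projection h (Eq. 14): camera-frame point
    p_c = (x, y, z) maps to (u_l, v_l, u_r), assuming the right camera
    is offset from the left by a purely horizontal baseline b."""
    x, y, z = p_c
    u_l = fu * x / z + cu
    v_l = fv * y / z + cv
    u_r = fu * (x - b) / z + cu   # disparity u_l - u_r = fu * b / z
    return np.array([u_l, v_l, u_r])

# Hypothetical KITTI-like intrinsics and a point 10 m ahead of the camera.
z = stereo_project((1.0, 0.5, 10.0), fu=718.0, fv=718.0,
                   cu=607.0, cv=185.0, b=0.54)
disparity = z[0] - z[2]
assert abs(disparity - 718.0 * 0.54 / 10.0) < 1e-9
```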
Assume that at some time $k$, we have $N$ available features distributed throughout the scene that we can select for our SLAM algorithm. We can stack the current pose with the candidate measurements into a vector as such:

$$\mathbf{y} = \begin{bmatrix} \mathbf{x}_k \\ \mathbf{z}_1 \\ \vdots \\ \mathbf{z}_N \end{bmatrix} \qquad (15)$$
As each of these random variables is described by a multivariate Gaussian, it follows that the stacked vector, $\mathbf{y}$, is also a multivariate Gaussian. This stacked variable has a covariance matrix consisting of the pose covariance and the measurement covariances, where the latter are calculated by propagating the state covariance through the linearized measurement model, with Jacobians $\mathbf{H}_i = \partial h_i / \partial \mathbf{x}_k$. This is defined by

$$\boldsymbol{\Sigma}_{yy} = \begin{bmatrix} \boldsymbol{\Sigma}_{x_k x_k} & \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_1^{T} & \cdots & \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_N^{T} \\ \mathbf{H}_1 \boldsymbol{\Sigma}_{x_k x_k} & \mathbf{H}_1 \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_1^{T} + \mathbf{Q} & \cdots & \mathbf{H}_1 \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_N^{T} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{H}_N \boldsymbol{\Sigma}_{x_k x_k} & \mathbf{H}_N \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_1^{T} & \cdots & \mathbf{H}_N \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_N^{T} + \mathbf{Q} \end{bmatrix} \qquad (16)$$
The measurement which best reduces the pose uncertainty is then selected to update the state. This measurement has the maximum mutual information between the state and measurement, which can be easily calculated using Equation 9. However, the marginal covariance for each feature $i$ must first be constructed by selecting the relevant blocks from Equation 16:

$$\boldsymbol{\Sigma}_{x_k z_i} = \begin{bmatrix} \boldsymbol{\Sigma}_{x_k x_k} & \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_i^{T} \\ \mathbf{H}_i \boldsymbol{\Sigma}_{x_k x_k} & \mathbf{H}_i \boldsymbol{\Sigma}_{x_k x_k} \mathbf{H}_i^{T} + \mathbf{Q} \end{bmatrix} \qquad (17)$$
Once the state has been updated, this process is repeated until the maximum information provided by a new measurement falls below a user-defined threshold value.
Kaess and Dellaert [14] build on Davison's method. The authors argue that selecting individual features and then updating the state estimate is not practical, as updating the state and extracting the marginal pose covariance values prior to taking each measurement can be quite expensive. Therefore, all measurements which have a mutual information above a predefined threshold are selected, and only then is the pose estimate updated. Although this will not guarantee that the optimal landmark is selected, it is less computationally expensive. This approach forms the foundation for SIVO.
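This batch selection scheme can be sketched as follows. The pose covariance, measurement Jacobians, and noise covariance below are hypothetical toy values; the point is that the mutual information of every candidate is computed from the same current covariance (Equations 16 and 17), and all candidates above the threshold are accepted before any state update.

```python
import numpy as np

def gaussian_mi(Sxx, Szz, Sxz):
    """I(x; z) in bits via the Schur complement (Eq. 9)."""
    cond = Sxx - Sxz @ np.linalg.inv(Szz) @ Sxz.T
    return 0.5 * np.log2(np.linalg.det(Sxx) / np.linalg.det(cond))

def select_features(Sxx, H_list, Q, threshold_bits):
    """Kaess-Dellaert-style selection sketch: evaluate I(x; z_i) for every
    candidate against the *current* pose covariance, then keep all
    candidates above the threshold before updating the state."""
    selected = []
    for i, H in enumerate(H_list):
        Szz = H @ Sxx @ H.T + Q    # predicted measurement covariance
        Sxz = Sxx @ H.T            # pose-measurement cross covariance
        if gaussian_mi(Sxx, Szz, Sxz) > threshold_bits:
            selected.append(i)
    return selected

rng = np.random.default_rng(1)
Sxx = np.diag([0.5, 0.5, 0.2])     # toy 3-DOF pose covariance
Q = 0.1 * np.eye(2)                # toy measurement noise
H_list = [rng.normal(size=(2, 3)) for _ in range(10)]  # hypothetical Jacobians
print(select_features(Sxx, H_list, Q, threshold_bits=1.0))
```

With a threshold of zero, every candidate is kept, since each (nontrivial) measurement carries strictly positive mutual information with the pose.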
IV-B SIVO Feature Selection Criteria
The informationtheoretic approach is now modified to incorporate semantic information. Each measurement from Equation 14 is a stereo projection of a 3D point into the image space. Using semantic segmentation, a discrete class value for each pixel can also be determined, providing context to the measurement.
Using Equation 10, the mutual information-based criterion from Equation 9 can be expressed in terms of entropy:

$$I(\mathbf{x}_k; \mathbf{z}_i) = H(\mathbf{x}_k \mid \mathbf{Z}) - H(\mathbf{x}_k \mid \mathbf{Z}, \mathbf{z}_i) \qquad (18)$$

where $\mathbf{Z}$ represents all previous measurements made in order to obtain our current state estimate. If $I(\mathbf{x}_k; \mathbf{z}_i)$ is greater than a predefined threshold for a measurement, $\mathbf{z}_i$, it is selected as a reference for the SLAM algorithm.
We propose to modify this criterion, and evaluate the entropy difference between the current state and the joint entropy of the state given the new feature measurement and the semantic segmentation classification, using Equation 12. This can be expressed as

$$\Delta H = H(\mathbf{x}_k \mid \mathbf{Z}) - H(\mathbf{x}_k, y \mid \mathbf{Z}, \mathbf{z}_i, \mathbf{I}, \mathbf{X}, \mathbf{Y}) \qquad (19)$$
where $\mathbf{I}$ represents the current image and $\{\mathbf{X}, \mathbf{Y}\}$ represents the dataset used to train the neural network. We assume that the classification entropy and state entropy are conditionally independent, and therefore express the latter term in Equation 19 as

$$H(\mathbf{x}_k, y \mid \mathbf{Z}, \mathbf{z}_i, \mathbf{I}, \mathbf{X}, \mathbf{Y}) = H(\mathbf{x}_k \mid \mathbf{Z}, \mathbf{z}_i, \mathbf{I}, \mathbf{X}, \mathbf{Y}) + H(y \mid \mathbf{Z}, \mathbf{z}_i, \mathbf{I}, \mathbf{X}, \mathbf{Y}) \qquad (20)$$
The state is not dependent on the actual image or training dataset, thus the conditioning on these terms can be removed from the first entropy term. Similarly, the classification is not dependent on any of the feature measurements. Therefore, Equation 19 can be rewritten as

$$\Delta H = H(\mathbf{x}_k \mid \mathbf{Z}) - H(\mathbf{x}_k \mid \mathbf{Z}, \mathbf{z}_i) - H(y \mid \mathbf{I}, \mathbf{X}, \mathbf{Y}) \qquad (21)$$
The first two terms are exactly the mutual information criterion from Equation 18. Substituting Equation 18 into Equation 21 yields the SIVO feature selection criterion:

$$\Delta H = I(\mathbf{x}_k; \mathbf{z}_i) - H(y \mid \mathbf{I}, \mathbf{X}, \mathbf{Y}) \qquad (22)$$
We argue that the best reference points should not only provide the most information to reduce the uncertainty of the state, but should also be static reference points which have been detected as such with very high certainty. This feature selection criterion allows us to balance the value of a feature for the state estimate against the certainty of the feature's classification. Recall from Section III-C that the minimum value of the classification entropy is $0$, reached when the network predicts one class with a probability of $1$. Therefore, in an ideal world where the class of each pixel is perfectly identified, features would be selected according to the original mutual information-based criterion from Equation 9, as long as they have been classified as static.
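The resulting decision rule is simple to state in code. This sketch assumes the mutual information and classification entropy have already been computed (Equations 18 and 12), and includes the outright rejection of dynamic classes described in Section VI; all numeric values are illustrative.

```python
def sivo_select(mi_bits, semantic_entropy_bits, is_static, threshold_bits):
    """SIVO selection sketch (Eq. 22): keep a feature only if the mutual
    information between state and measurement, minus the Bayesian NN
    classification entropy of its pixel, exceeds the threshold, and the
    detected class is static."""
    return is_static and (mi_bits - semantic_entropy_bits) > threshold_bits

# A highly informative feature on a confidently classified building: kept.
assert sivo_select(5.0, 0.1, is_static=True, threshold_bits=4.0)
# The same feature with an uncertain classification: rejected.
assert not sivo_select(5.0, 1.5, is_static=True, threshold_bits=4.0)
# Dynamic classes (e.g. car, pedestrian) are rejected outright.
assert not sivo_select(5.0, 0.1, is_static=False, threshold_bits=4.0)
```

The classification entropy thus acts as a penalty on the information gain: an uncertain segmentation must be compensated for by a correspondingly more informative measurement.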
V Experimental Validation
V-A Implementation and Training
The localization functionality of SIVO is built on top of ORB-SLAM2 [9], and all loop closure and relocalization functionality is enabled. Bayesian SegNet [30] is used to semantically segment the images and provide network uncertainty. Network inference is implemented using Caffe's [36] C++ interface in order to integrate the results from Bayesian SegNet with ORB-SLAM2. SIVO is publicly available at https://www.github.com/navganti/SIVO, and the training setup can be found at https://www.github.com/navganti/SegNet.
Bayesian SegNet is trained using the Cityscapes dataset [37] and then fine-tuned using the KITTI semantic dataset [10]. The network was trained on data with 15 classes: road, sidewalk, building, wall/fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person/rider, car, truck/bus, motorcycle/bicycle, and void. The Bayesian SegNet Basic network is used in order to preserve GPU memory and speed up inference; this architecture contains fewer layers in both the encoder and decoder sections in comparison to the original Bayesian SegNet.
V-B Results
The KITTI [10] odometry dataset is used to validate the performance of SIVO. The tunable parameters are the feature selection entropy threshold and the number of samples used for MC dropout. The experiments are referred to as BS⟨samples⟩E⟨threshold⟩; for example, an experiment with 6 MC dropout samples and an entropy threshold of 4 bits is denoted as BS6E4. The following configurations are evaluated: BS2E4, BS6E2, BS6E3, BS6E4, and BS12E4.
The same metrics used by the KITTI odometry benchmark are used to compare SIVO results to the KITTI ground truth and ORB-SLAM2. First, both rotational and translational errors are evaluated for all subsequences of lengths 100 m to 800 m. These values are then averaged over the subsequence lengths to provide a final translation error (%) and rotation error (deg/m) for each trajectory. Table I contains the compiled results for all trajectories as well as the number of map points used by the algorithms, and is ordered by translation error performance compared to ORB-SLAM2.
Table I: Translation error, map point counts, and map reduction for each KITTI odometry sequence, ordered by translation error performance relative to ORB-SLAM2.

| Sequence | Trans. Err. (%) | ORB Map Points | SIVO Map Points | Map Reduction (%) | SIVO Config. |
|---|---|---|---|---|---|
| 09 | 1.20 | 64,442 | 18,893 | 70.68 | BS6E2 |
| 10 | 0.97 | 33,181 | 9,369 | 71.76 | BS2E4 |
| 08 | 1.29 | 127,810 | 40,574 | 68.25 | BS6E3 |
| 05 | 0.76 | 73,463 | 22,237 | 69.73 | BS6E3 |
| 07 | 0.80 | 29,632 | 9,684 | 67.32 | BS6E3 |
| 00 | 1.44 | 138,153 | 45,875 | 66.79 | BS6E4 |
| 02 | 1.70 | 202,293 | 58,894 | 70.89 | BS12E4 |
| 04 | 1.50 | 21,056 | 6,328 | 69.95 | BS12E4 |
| 03 | 4.65 | 27,209 | 8,449 | 68.95 | BS12E4 |
| 01 | 3.17 | 101,378 | 37,233 | 63.27 | BS2E4 |
| 06 | 7.10 | 47,461 | 11,396 | 75.99 | BS6E3 |
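The subsequence error metric described above can be sketched as follows. This is a simplified, position-only version of the KITTI benchmark metric (the official metric evaluates relative SE(3) poses and also reports rotation error); the trajectories below are synthetic illustrations.

```python
import numpy as np

def subsequence_translation_error(gt, est, lengths=range(100, 900, 100)):
    """Simplified KITTI-style metric sketch: for every start frame and each
    subsequence length (100 m to 800 m of ground-truth path), compare the
    ground-truth and estimated displacement over that subsequence and
    report the mean error as a percentage of distance travelled."""
    # Cumulative ground-truth path length at each frame.
    dists = np.concatenate([[0.0], np.cumsum(
        np.linalg.norm(np.diff(gt, axis=0), axis=1))])
    errors = []
    for length in lengths:
        for i in range(len(gt)):
            j = int(np.searchsorted(dists, dists[i] + length))
            if j >= len(gt):
                break   # remaining path is shorter than this subsequence
            gt_disp = gt[j] - gt[i]
            est_disp = est[j] - est[i]
            errors.append(np.linalg.norm(est_disp - gt_disp) / length * 100.0)
    return float(np.mean(errors)) if errors else float("nan")

# Straight-line ground truth (1 m per frame) vs. an estimate with 1% drift.
t = np.arange(0, 1000.0, 1.0)
gt = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
est = 1.01 * gt
print(subsequence_translation_error(gt, est))  # ~1.0 (% translation error)
```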
In summary, SIVO outperformed ORB-SLAM2 on KITTI sequence 09 (Figure 4), performed comparably (an average translation error difference of 0.17%, with comparable rotation error) on sequences 00, 02, 05, 07, 08, and 10, and performed worse on sequences 01, 03, 04, and 06 (the lower portion of Table I), while removing, on average, 69% of the map points. The comparable results on 7 of the 11 odometry sequences indicate that the points removed by SIVO are redundant, and that the remaining points should be excellent long-term reference points for visual SLAM, although it is not possible to verify feature persistence with the KITTI dataset.
VI Discussion
Overall, our feature selection scheme is a good first approach to incorporating neural network uncertainty into a visual SLAM formulation. In most cases the results of SIVO and ORB-SLAM2 are comparable, with a translation error difference of 0.17% and a comparable rotation error, even when removing, on average, 69% of the map points used by the optimization pipeline. SIVO successfully removed points from the environment which are uninformative and/or dynamic. Figure 2 illustrates features selected on KITTI sequence 00, while Figure 3 shows the variance image for the scene. The variance image shows the spread of classification confidence across the trials of MC dropout, discussed in Section III-A2; black indicates a normalized variance of 0, while white indicates a normalized variance of 1. SIVO has mostly selected features which have a low variance; however, the occasional uncertain point (such as the windowsill on the right side) has been selected, as it sufficiently reduces the pose uncertainty.
In some cases, however, removing these map points did have an adverse effect on localization performance. Some trajectories have significantly worse performance, which can mostly be attributed to the removal of short-range feature points. SIVO immediately removes a point if it has been designated as a dynamic class; however, the KITTI odometry set has been curated to contain mostly static scenes, and most sequences contain numerous parked cars. This difference is illustrated in Figures 1 and 2. These cars make up the majority of close features in the scene, which are required to better estimate translation. To accurately estimate camera pose, an even distribution of points throughout the image and 3D space is required. Far points help with rotation estimation but are poor for translation estimation, while close points can help with both. For all trajectories, ORB-SLAM2 and SIVO have comparable, accurate rotation estimation, but SIVO generally performs worse in estimating translation. The four trajectories which performed significantly worse (01, 03, 04, and 06) are all "straight-line" trajectories, where the apparent motion of the features is quite small, as they are far away and lie directly ahead of the vehicle. This causes significant drift in translation in particular, as illustrated by Table I.
The localization performance is also dependent on segmentation quality. For example, sequence 01 (the "highway" sequence), in addition to being a mostly "straight-line" sequence, is not well represented in the semantic dataset. In Figure 5, part of the highway divider as well as the bridge in the distance are misclassified as a car, and are therefore immediately ignored by the algorithm. These points make up most of the close features for this sequence; SIVO must therefore rely on further features for this trajectory, and the translation performance suffers as a result.
VII Conclusion
We present SIVO (Semantically Informed Visual Odometry and Mapping), a novel feature selection algorithm for visual SLAM which fuses NN uncertainty with an information-theoretic approach to visual SLAM. SIVO outperformed ORB-SLAM2 on KITTI sequence 09, and performed comparably on 6 of the 10 remaining trajectories, while removing 69% of the map points on average. While incorporating semantic information into a SLAM algorithm is not novel, to the best of our knowledge this is the first algorithm which directly considers NN uncertainty to drive decision-making in the SLAM process. Our method selects points which significantly reduce the Shannon entropy between the current state entropy and the joint entropy of the state given the addition of the new feature, together with the classification entropy of the feature from the Bayesian NN. These points should provide the most benefit for long-term localization, as our method maintains the number of landmarks by selecting the most informative points while ensuring they are static long-term references. Our future work aims to refine segmentation performance and verify long-term localization capability on different datasets. We will also look to introduce further context and determine whether an observed vehicle is static or dynamic. This would allow for the use of short-range features detected on static vehicles in a visual odometry solution for local pose estimation, while still ignoring these points in map creation.
References
 [1] D. Nistér, O. Naroditsky, and J. Bergen, “Visual odometry,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2004, pp. I–I.
 [2] D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” International Journal of Computer Vision (IJCV), vol. 60, no. 2, pp. 91–110, 2004.
 [3] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 404–417.
 [4] E. Rosten and T. Drummond, “Machine learning for highspeed corner detection,” in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 430–443.
 [5] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.
 [6] M. Cummins and P. Newman, “FABMAP: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research (IJRR), vol. 27, no. 6, pp. 647–665, 2008.
 [7] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual routebased navigation for sunny summer days and stormy winter nights,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 1643–1649.
 [8] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
 [9] R. MurArtal and J. D. Tardós, “ORBSLAM2: An opensource SLAM system for monocular, stereo, and RGBD cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 [10] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3354–3361.
 [11] G. Dissanayake, H. DurrantWhyte, and T. Bailey, “A computationally efficient solution to the simultaneous localisation and map building (SLAM) problem,” in IEEE International Conference on Robotics and Automation (ICRA), vol. 2. IEEE, 2000, pp. 1009–1014.
 [12] S. Hochdorfer and C. Schlegel, “Landmark rating and selection according to localization coverage: Addressing the challenge of lifelong operation of SLAM in service robots,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2009, pp. 382–387.
 [13] A. J. Davison, “Active search for realtime vision,” in IEEE International Conference on Computer Vision (ICCV), vol. 1. IEEE, 2005, pp. 66–73.
 [14] M. Kaess and F. Dellaert, “Covariance recovery from a square root information matrix for data association,” Robotics and Autonomous Systems, vol. 57, no. 12, pp. 1198–1210, 2009.
 [15] S. Zhang, L. Xie, and M. D. Adams, “Entropy based feature selection scheme for real time simultaneous localization and map building,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2005, pp. 1175–1180.
 [16] S. Choudhary, V. Indelman, H. I. Christensen, and F. Dellaert, “Informationbased reduced landmark SLAM,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4620–4627.
 [17] H. Strasdat, C. Stachniss, and W. Burgard, “Which landmark is useful? Learning selection policies for navigation in unknown environments,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2009, pp. 1410–1415.
 [18] R. F. SalasMoreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
 [19] V. Murali, H. P. Chiu, S. Samarasekera, and R. T. Kumar, “Utilizing semantic visual landmarks for precise vehicle navigation,” arXiv preprint arXiv:1801.00858, 2018.
 [20] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, no. 12, pp. 2481–2495, 2017.
 [21] X. Li and R. Belaroussi, “Semidense 3D semantic mapping from monocular SLAM,” arXiv preprint arXiv:1611.04144, 2016.
 [22] L. An, X. Zhang, H. Gao, and Y. Liu, “Semantic segmentation–aided visual odometry for urban autonomous driving,” International Journal of Advanced Robotic Systems (IJARS), vol. 14, no. 5, p. 1729881417735667, 2017.
 [23] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Probabilistic data association for semantic SLAM,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1722–1729.
 [24] E. Stenborg, C. Toft, and L. Hammarstrand, “Longterm visual localization using semantically segmented images,” arXiv preprint arXiv:1801.05269, 2018.
 [25] Y. Gal, “Uncertainty in deep learning,” Ph.D. dissertation, University of Cambridge, 2016.
 [26] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural networks with Bernoulli approximate variational inference,” arXiv preprint arXiv:1506.02158, 2015.
 [27] ——, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning (ICML), 2016, pp. 1050–1059.
 [28] ——, “Dropout as a Bayesian approximation: Appendix,” arXiv preprint arXiv:1506.02157, 2015.
 [29] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5580–5590.
 [30] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
 [31] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4762–4769.
 [32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research (JMLR), vol. 15, no. 1, pp. 1929–1958, 2014.
 [33] R. M. Neal, “Bayesian learning for neural networks,” Ph.D. dissertation, University of Toronto, 1995.
 [34] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
 [35] M. Chli, “Applying information theory to efficient SLAM,” Ph.D. dissertation, Department of Computing, Imperial College London, 2010.
 [36] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
 [37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223.