Variational EndtoEnd Navigation and Localization
Abstract
Deep learning has revolutionized the ability to learn “endtoend” autonomous vehicle control directly from raw sensory data. While there have been recent advances on extensions to handle forms of navigation instruction, these works are unable to capture the full distribution of possible actions that could be taken and to reason about localization of the robot within the environment. In this paper, we extend endtoend driving networks with the ability to understand maps. We define a novel variational network capable of learning from raw camera data of the environment as well as higher level roadmaps to predict (1) a full probability distribution over the possible control commands; and (2) a deterministic control command capable of navigating on the route specified within the map. Additionally, we formulate how our model can be used to localize the robot according to correspondences between the map and the observed visual road topology, inspired by the rough localization that human drivers can perform. We evaluate our algorithms on realworld driving data, and reason about the robustness of the inferred steering commands under various types of rich driving scenarios. In addition, we evaluate our localization algorithm over a new set of roads and intersections which the model has never driven through and demonstrate rough localization in situations without any GPS prior.
I Introduction
Human drivers have an innate ability to reason about the highlevel structure of their driving environment even under severely limited observation. They use this ability to relate highlevel driving instructions to concrete control commands, as well as to better localize themselves even without concrete localization information. Inspired by these abilities, we develop a learning engine that enables robot vehicle to learn how to use maps within an endtoend autonomous driving system.
Coarse grained maps afford us a higherlevel of understanding of the environment, both because of their expanded scope, but also due to their distilled nature. This allows for reasoning about the lowlevel control within a hierarchical framework with longterm goals [1], as well as localizing, preventing drift, and performing loopclosure when possible. We note that unlike much of the work in place recognition and loop closure, we are looking at a higher level of matching, where the vehicle is matching intersection and road patterns to the coarse scale geometry found in the map, allowing handling of different appearances and small variations in the scene structure, or even unknown finescale geometry, as long as the overall road network structure matches the expected structures.
While endtoend driving [2] holds promise due to its easily scalable and adaptable nature, it has a limited capability to handle longterm plans, relating to the nature of imitation learning [3, 4]. Some recent methods incorporate maps as inputs [5, 6] to capture longer term action structure, yet they ignore the uncertainty maps inherently allow us to address – uncertainty about the location, and uncertainty about the longerterm plan.
In this paper, we address these limitations by developing a novel model for integrating navigational information with raw sensory data into a single endtoend variational network, and do so in a way that preserves reasoning about uncertainty. This allows the system to not only learn to navigate complex environments entirely from human perception and navigation data, but also understand when localization or mapping is incorrect, and thus correct for the pose (cf. Fig. 1).
Our model processes coarse grained, unrouted roadmaps, along with forward facing camera images to produce a probabilistic estimate of the different possible lowlevel steering commands which the robot can execute at that instant. In addition, if a routed version of the same map is also provided as input, our model has the ability to output a deterministic steering control signal to navigate along that given route.
The key contributions of this paper are as follows:

Design of a novel variational endtoend control network which integrates raw sensory data with routed and unrouted maps, enabling navigation and localization in complex driving environments; and

Formulation of a localization algorithm using our trained endtoend network to reason about a given road topology and infer the robot’s pose by drawing correspondences between the map and the visual road appearance; and

Evaluation of our algorithms on a challenging realworld dataset, demonstrating navigation with steering control as well as improved pose localization even in situations with severely limited GPS information.
Ii Related Work
Our work ties in to several related efforts in both control and localization. As opposed to traditional methods for autonomous driving which typically rely on distinct algorithms for localization and mapping [7, 8, 9], planning [10, 11, 12], and control [13, 14], endtoend algorithms attempt to collapse the problem (directly from raw sensory data to output control commands) into a single learned model. The ALVINN system [15] originally proposed the use of multilayer perceptron to learn the direction a vehicle should steer in 1989. Recent advancements in convolutional neural networks (CNNs) have revolutionized the ability to learn driving control (i.e. steering wheel angle or road curvature) from raw imagery [2]. Followup works have incorporated conditioning on additional cues [3], including mapped information [5, 6]. However, these works are still quite limited in that they do not capture the uncertainty of multiple outcomes, nor present the ability to reason about discrepancy between their input modalities.
A recent line of work has tied endtoend driving networks to variational inference [16], allowing us to look at cases where multiple directions are possible, as well as reason about robustness, atypical data, and dataset normalization. Our work extends this line in that it allows us to use the same outlook and reason about maps as an additional conditioning factor.
Our work also relates to several research efforts in reinforcement learning in subfields such as bridging different levels of planning hierarchies [1, 17], and relating to maps as agents plan and act [18, 19]. This work relates to a vast literature in localization and mapping [7, 8], such as visual SLAM [9], [20] and place recognition [21, 22]. However, our notion of visual matching is much more highlevel, more akin to semantic visual localization and SLAM [23, 24], where the semanticlevel features are driving affordances.
Iii Model
In this section, we describe the model used in our approach. We use a variational neural network, which takes raw camera images, , and an image of a noisy, unrouted roadmap, , as input. At the output we attempt to learn a full, parametric probability distribution over inverse curvature to navigate that instant. We use a Gaussian Mixture Model (GMM) with modes to describe the possible steering control command, and penalize the norm of the weights to discourage extra components. Additionally, the model will optionally output a deterministic control command if a routed version of the map is also provided as input. The overall network can be written separately as two functions, representing the stochastic (unrouted) and deterministic (routed) parts respectively:
(1)  
where is the current pose in the map (position and heading), and are network outputs computed by cropping a square region of the relevant map according to , and feeding it, along with the forward facing images, , to the network. denotes the unrouted map, with only the traversible areas marked, while denotes the routed map, containing the desired route highlighted. The deterministic control command is denoted as . In this paper, we refer to steering command interchangeably as the road curvature: the actual steering angle requires reasoning about road slip and control plant parameters that change between vehicles, making it less suitable for our purpose. Finally, the parameters (i.e. weight, mean, and variance) of the GMM’s th component are denoted by , which represents the steering control in the absence of a given route.
The overall network structure is given in Fig. 2. Each camera image is processed by a separate convolutional pipeline similar to the one used in [2]. Similarly, the cropped, nonrouted, map patch is fed to a set of convolutional layers, before concatenation to the image processing outputs. However, here we use fewer layers for two main reasons: a) The map images contain significantly fewer features and thus don’t require a complex feature extraction pipeline and b) We wish to avoid translational invariance effects often associated with convolutional layers and subsampling, as we are interested in the pose on the map. The output of the convolutional layers is flattened and fed to a set of fully connected layers to produce the parameters of a probability distribution of steering commands (forming ). As a second task, we the previous layer output along with a convolutional module processing the routed map, , to output a single deterministic steering command, forming . This network structure allows us to handle both routed and nonrouted maps, and later affords localization and driver intent, as well as driving according to high level navigation (i.e. turnbyturn instruction).
We learn the weights of our model using backpropogation with the cost defined as:
(2) 
where is a percomponent penalty on the standard deviation . We chose a quadratic term in log as the regularization,
(3) 
is the negative loglikelihood of the steering command according to a GMM with parameters and
(4) 
Iiia Localization via EndtoEnd Networks
The conditional structure of the network affords updating a posterior belief about the vehicle’s pose, based on the relation between the map and the road topology seen from the vehicle. For example, if the network is provided visual input, , which appears to be taken at a 4 way intersection we aim to compute over different poses on the map to reason about where this input could have been taken. Note that our network only computes , but due to the conditioning structure of our model, we are able to estimate our pose given the visual input through double marginalization over and . Given a prior belief about the pose, , we can write the posterior belief after seeing an image as:
(5)  
where the equalities are due to full probability theorem and Bayes theorem. The posterior belief can be therefore computed via marginalization over . While marginalization over two random variables is traditionally inconvenient, in two cases of interest, marginalizing over becomes easily tractable: a) when the pose is highly localized due to previous observations, as in the case of online localization and b) where the pose is sampled over a discrete road network, as is done in mapmatching algorithms. The algorithm to update the posterior belief is shown in Algorithm 1. Intuitively, the algorithm computes, over all steering angle samples, the probability that a specific pose and images/map explain that steering angle, with the additional loop required to estimate the partition function and normalize the distribution. We note the same algorithm can be used with small modifications within the mapmatching framework [25, 26].
Iv results
In this section, we demonstrate results obtained using our method on a realworld train and test dataset of rich driving enviornments. We start by describing our system setup and dataset and then proceed to demonstrate driving using an image of the roadmap and steering angle estimation with both routed and unrouted maps. Finally, we demonstrate how our approach allows us to reduce pose uncertainty based on the agreement between the map and the camera feeds.
Iva System Setup
We evaluate our system on a 2015 Toyota Prius V outfitted with autonomous drivebywire capabilities [27]. Additionally, we made several advancements to the sensor and compute platform specifically for this work. Three Leopard Imaging LIAR0231GMSL cameras [28], capable of capturing 1080p RGB images at approximately 30Hz, are used as the vision data source for this study. We mount the three cameras on the front of the vehicle at various yaw angles: one forward facing and the remaining two rotated on the left/right of the vehicle to capture a larger FOV. Coarse grained global localization is captured using the OXTS RT3000 GPS [29] while an Xsense MTi 100series IMU [30] is used to capture acceleration, rotation, and orientation data from the vehicle. Specifically, at any time, , we use the yaw rate (), and the speed of the vehicle, (), to compute the curvature (or inverse radius) of the path which the human executed as . Finally, all of the sensor processing was done onboard an NVIDIA Drive PX2 [31].
In order to build the road network, we gather edge information from Open Street Maps (OSM) throughout the traversed region. We are given a directed topological graph, , where represents an intersection on the road network and represents a single directed road between two intersections. The weight of every edge, , is defined according to its great circle length, but with slight modifications could also capture more complexity by incorporating the type of road or even realtime traffic delays. The problem of offline mapmatching involves going from a noisy set of ordered poses, , to corresponding set of traversed road segments (i.e. the route taken). We implemented our map matching algorithm as an offline preprocessing step as in [26].
One concern during map rendering process is handling ambiguous parts of the map. Ambiguous parts are defined as parts where the the driven route cannot be described as a single simple curve. In order to handle large scale driving sequences efficiently, and avoid map patches that are ambiguous, we break the route into nonambiguous subroutes, and generate the routed map for each of the subroutes, forming a set of charts for the map. For example, in situations where the vehicle travels over the same intersection multiple times but in different directions, we should split the route into distinct segments such that there are no selfcrossings in the rendered route. Finally, we render the unrouted map by drawing all edges on a black canvas. The routed map is rendered by adding a red channel to the canvas and by drawing the traversed edges, .
Using the rendered maps and raw perceptual image data from the three cameras, we trained our model (implemented in TensorFlow [32]) using 25 km of driving data taken in a suburban area with various different types of turns, intersections, roundabouts, as well as other dynamic obstacles (vehicles and pedestrians). We test and evaluate our results on an entirely different set of road segments and intersections which the network was never trained on.
IvB Driving with Navigational Inputs
We demonstrate the ability of the network to compute both the continuous probability distribution over steering control for a given unrouted map as well as the deterministic control to navigate on a routed map. In order to drive with a navigational input, we feed both the unrouted and the routed maps into the network and compute . In Fig. 3 we show the inputs and parametric distributions of steering angles of our system. The roads in both the routed and unrouted maps are shown in white. In the routed map, the desired path is painted in red. In order to generate the trajectories shown in the figure, we project the instantaneous control curvature as an arc, into the image frame. In this sense, we visualize the projected path of the vehicle if it was to execute the given steering command. Green rectangles in camera imagery denote the region of interest (ROI) which is actually actually fed to the network. We crop according to these ROIs so the network is not provided extra nonessential data which should not contribute to its control decision (e.g., the pixels above the horizon line do not impact steering).
We visualize the model output under various different driving scenarios ranging (left to right) from simple lane following to richer situations such as turns and intersections (cf. Fig. 3). In the case of lane following (left), the network is able to identify that the GMM only requires a single Gaussian mode for control, whereas multiple Gaussian modes start to appear for forks, turns, and intersections. In the case where a routed path is also provided, the network is able to disambiguate from the multiple modes and select a correct control command to navigate towards the route. We also demonstrate generalization on a richer type of intersection such as the roundabout (right) which was never included as part of the training data.
To quantitatively evaluate our network we compute the mixture estimation accuracy over our entire test set (cf. Fig 4). Specifically, for a range of zscores over the steering control distribution we compute the number of samples within the test set where the true (human) control output was within the predicted range. To provide a more qualitative understanding of the spatial accuracy of our variational model we also visualized a heatmap of GPS points over the test set (cf. Fig. 4B), where the color represents the probability density of the predicted distribution evaluated at the true control value. We observed that the density decreases before intersections as the modes of the GMM naturally spread out to cover the greater number of possible paths.
IvC Reducing Localization Uncertainty
We demonstrate how our model can be used to localize the vehicle based on the observed driving directions using Algorithm 1. We investigate in our experiments the reduction of pose uncertainty, and visualize areas which offer better types of pose localization.
For this experiment, we began with the pose obtained from the GPS and assumed an initial error in this pose with some uncertainty (Gaussian over the spatial position, heading, or both). We compute the posterior probability of the pose as given by Alg. 1, and look at the individual uncertainty measures or total entropy of the prior and posterior distributions. If the uncertainty in the posterior distribution is lower than that of the prior distribution we can conclude that our learned model was able to increase its localization confidence after seeing the visual inputs provided (i.e. the camera images). In Fig. 5 we show the training (A) and testing (B) datasets. Note that the roads and intersections in both of these datasets were entirely disjoint; the model was never trained on roads/intersections from the test set.
For the test set, we overlaid the individual GPS points on the map and colored each point according to whether our algorithm increased (blue) or decreased (orange) posterior uncertainty. When looking at uncertainty reduction, it is important to note which degrees of freedom (i.e. spatial vs angular heading) localize better at different areas in the road network. For this reason, we visualize the uncertainty reduction heatmaps four times individually across (1) spatial variance, (2) angular variance, (3) overall pose variance, and (4) overall entropy reduction (cf. Fig. 5).
While header angle is corrected easily at both straight driving and more complex areas (turns and intersections), spatial degrees of freedom are corrected best at rich map areas, and poorly at linear road segments. This is expected and is similar to the aperture problem in computer vision [33] – the information in a linear road geometry is not enough to establish 3DOF localization.
If we focus on areas preceding intersections (approx 20 meters before), we typically see that the spatial uncertainty (prior uncertainty of ) is reduced right before the intersection, which makes sense since after we pass through our forward facing visual inputs are not able to capture the intersection behind the vehicle. Looking in the vicinity of intersections, we achieved average reduction of nats. For the angular uncertainty, with initial uncertainty of radians ( degs), we achieved a reduction in the standard deviation of 0.2 radians (11 degs).
We quantify the degree of posterior uncertainty reduction around intersections in Fig. 6. Specifically, for each of the degrees of uncertainty (spatial, angular, etc) in Fig. 5 we present the corresponding numerical uncertainty reduction as a function of the prior uncertainty in Fig. 6. Note that we obtain reduction of both heading and spatial uncertainty for a variety of prior uncertainty values. Additionally, the averaged improvement over intersection regions is always positive for all prior uncertainty values indicating that localization does not worsen (on average) after using our algorithm.
IvD Coarse Grained Localization
We next evaluate our model’s ability to distinguish between significantly different locations without any prior on pose. For example, imagine that you are in a location without GPS but still want to perform rough localization given your visual surroundings (similar to the kidnapped robot problem). We seek to establish correspondences between the map and the visual road area for coarse grained place recognition.
In Fig. 7 we demonstrate how we can identify and disambiguate a small set of locations, based on the the map and the camera images’ interpreted steering direction. Our results show that we can easily distinguish between places of different road topology or road geometry, in a way that should be invariant to the appearance of the region or environmental conditions. Additionally, the cases where the network struggles to disambiguate various poses is understandable. For example, when trying to determine which map the image from environment 4 was taken, the network selects maps A and D where both have upcoming left and right turns. Likewise, when trying to determine the location of environment 5, maps B and E achieve the highest probabilities. Even though the road does not contain any immediate turns, it contains a large driveway on the lefthand side which resembles a possible left turn (thus, justifying the choice of map B). However, the network is able to correctly localize each of these five cases to the correct map location (i.e. noted by the strong diagonal of the confusion matrix).
V Conclusion
In this paper, we developed a novel variational model for incorporating coarse grained roadmaps with raw perceptual data to directly learn steering control of autonomous vehicle. We demonstrate deterministic prediction of control according to a routed map, estimation of the likelihood of different possible control commands, as well as localization correction and place recognition based on the map. We formulate a concrete pose estimation algorithm using our learned network to reason about the localization of the robot within the environment and demonstrate reduced uncertainty (greater confidence) in our resulting pose.
In the future, we intend to integrate our localization algorithm in the online setting of discrete road mapmatching onboard our fullscale autonomous vehicle and additionally provide a more robust evaluation of the localization compared to that of a human driver.
References
 [1] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semimdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, vol. 112, no. 12, pp. 181–211, 1999.
 [2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for selfdriving cars,” arXiv preprint arXiv:1604.07316, 2016.
 [3] F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “Endtoend driving via conditional imitation learning,” arXiv preprint arXiv:1710.02410, 2017.
 [4] S. ShalevShwartz, S. Shammah, and A. Shashua, “Safe, multiagent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.
 [5] G. Wei, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, “Intentionnet: Integrating planning and deep learning for goaldirected autonomous navigation,” arXiv preprint arXiv:1710.05627, 2017.
 [6] S. Hecker, D. Dai, and L. Van Gool, “Endtoend learning of driving models with surroundview cameras and route planners,” in European Conference on Computer Vision (ECCV), 2018.
 [7] J. J. Leonard and H. F. DurrantWhyte, “Simultaneous map building and localization for an autonomous mobile robot,” in Intelligent Robots and Systems’ 91.’Intelligence for Mechanical Systems, Proceedings IROS’91. IEEE/RSJ International Workshop on. Ieee, 1991, pp. 1442–1447.
 [8] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., “FastSLAM: A factored solution to the simultaneous localization and mapping problem,” AAAI, vol. 593598, 2002.
 [9] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Realtime single camera SLAM,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 6, pp. 1052–1067, 2007.
 [10] L. Kavraki, P. Svestka, and M. H. Overmars, Probabilistic roadmaps for path planning in highdimensional configuration spaces. Unknown Publisher, 1994, vol. 1994.
 [11] S. M. Lavalle and J. J. Kuffner Jr, “Rapidlyexploring random trees: Progress and prospects,” in Algorithmic and Computational Robotics: New Directions, 2000.
 [12] S. Karaman, M. R. Walter, A. Perez, E. Frazzoli, and S. Teller, “Anytime motion planning using the rrt,” Shanghai, China, May 2011, pp. 1478–1483.
 [13] W. Schwarting, J. AlonsoMora, L. Paull, S. Karaman, and D. Rus, “Safe nonlinear trajectory generation for parallel autonomy with a dynamic vehicle model,” IEEE Transactions on Intelligent Transportation Systems, no. 99, pp. 1–15, 2017.
 [14] P. Falcone, F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat, “Predictive active steering control for autonomous vehicle systems,” IEEE Transactions on control systems technology, 2007.
 [15] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” in Advances in neural information processing systems, 1989, pp. 305–313.
 [16] A. Amini, W. Schwarting, G. Rosman, B. Araki, S. Karaman, and D. Rus, “Variational autoencoder for endtoend control of autonomous driving with novelty detection and training debiasing,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018.
 [17] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg, “Multilevel discovery of deep options,” arXiv preprint arXiv:1703.08294, 2017.
 [18] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value iteration networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2154–2162.
 [19] J. Oh, S. Singh, and H. Lee, “Value prediction network,” in Advances in Neural Information Processing Systems, 2017, pp. 6118–6128.
 [20] J. Engel, T. Schöps, and D. Cremers, “LSDSLAM: Largescale direct monocular SLAM,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
 [21] R. Paul and P. Newman, “Fabmap 3d: Topological mapping with spatial and visual appearance,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 2649–2656.
 [22] T. Sattler, B. Leibe, and L. Kobbelt, “Fast imagebased localization using direct 2dto3d matching,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 667–674.
 [23] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler, “Semantic visual localization,” ISPRS Journal of Photogrammetry and Remote Sensing (JPRS), 2018.
 [24] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Probabilistic data association for semantic SLAM,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1722–1729.
 [25] D. Bernstein and A. Kornhauser, “An introduction to map matching for personal navigation assistants,” Tech. Rep., 1998.
 [26] P. Newson and J. Krumm, “Hidden Markov map matching through noise and sparseness,” in Proceedings of the 17th ACM SIGSPATIAL international conference on advances in geographic information systems. ACM, 2009, pp. 336–343.
 [27] F. Naser, D. Dorhout, S. Proulx, S. D. Pendleton, H. Andersen, W. Schwarting, L. Paull, J. AlonsoMora, M. H. Ang Jr., S. Karaman, R. Tedrake, J. Leonard, and D. Rus, “A parallel autonomy research platform,” in IEEE Intelligent Vehicles Symposium (IV), June 2017.
 [28] “LIAR0231GMSLxxxH leopard imaging inc data sheet,” 2018.
 [29] “OXTS RT user manual,” 2018.
 [30] A. Vydhyanathan and G. Bellusci, “The next generation xsens motion trackers for industrial applications,” Xsens, white paper, Tech. Rep., 2018.
 [31] “NVIDIA Drive PX2 SDK documentation.” [Online]. Available: –https://docs.nvidia.com/drive/nvvib_docs/index.html˝
 [32] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: a system for largescale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
 [33] D. Marr and S. Ullman, “Directional selectivity and its use in early visual processing,” Proc. R. Soc. Lond. B, vol. 211, no. 1183, pp. 151–180, 1981.