Fast Monte-Carlo Localization on Aerial Vehicles using Approximate Continuous Belief Representations
Abstract
Size, weight, and power constrained platforms impose limits on computational resources that introduce unique challenges in implementing localization algorithms. We present a framework to perform fast coarse localization on such platforms enabled by the compressive capabilities of Gaussian Mixture Model representations of point cloud data. Given raw structural data from a depth sensor and pitch and roll estimates from an onboard attitude reference system, a multi-hypothesis particle filter localizes the vehicle by exploiting the likelihood of the data originating from the mixture model. We demonstrate analysis of this likelihood in the vicinity of the ground truth and detail its utilization in a particle filter-based vehicle localization strategy, and later present results of real-time implementations on a commercial off-the-shelf embedded platform that outperform localization results from running a state-of-the-art algorithm, ORB-SLAM2, on the same environment.
1 Supplementary Material
Representative video: https://youtu.be/RzS2v32850E
2 Introduction
For an agent lost in a known environment, much of the cost of localization can be offset by precomputing measures of what the sensor is expected to see, with localization then cast as the much simpler problem of a search through these preexisting "hallucinated" views. However, exhaustively considering all possible views incurs a prohibitive cost that increases exponentially with both the dimensionality of the state space and the environment volume extents. Further, naïve pre-rendering approaches can be susceptible to errors caused by perceptual aliasing due to slight variations in the environment appearance or by regions that are not feature rich such as blank walls [1].
In this paper we present a framework enabling rapid computation of sensor data likelihood via an environment representation parameterization that is both high-fidelity and memory efficient. The framework models a depth camera observation as the projection of the environment representation into image space with a likelihood measure that varies smoothly about the true camera pose. We thus exploit the dramatically reduced storage complexity of the representation and the local spatial regularity of a fast-to-compute likelihood function to recast the pose estimation problem as that of Monte-Carlo localization [13].
We pursue the problem of 6 Degree-of-Freedom coarse pose estimation for Size, Weight, and Power (SWaP) constrained micro air vehicles operating in a known environment with an onboard monocular depth camera, an Inertial Measurement Unit (IMU), and a dense 3D point cloud of the environment. We assume that the vehicle pitch and roll are obtained from an attitude estimation algorithm or provided by the IMU, constraining the real-time search space to just heading and position.
Prior works exploiting known map appearance for monocular pose estimation [12, 10, 8, 4] employ a textured depth map within an iterative optimization framework to compute the warp that minimizes a photometric cost function between a rendered image and the live image, such as the Normalized Information Distance [10], which is robust to illumination change, or a Sum of Squared Differences cost function with an affine illumination model to tackle illumination change [8]. Both algorithms rely on initialization for tracking, via a GPS prior or an ORB-based bag-of-words approach, respectively, and on textured data for refinement. In the latter, the authors initialize the algorithm via an exhaustive search over all keyframes generated according to a heuristic that scales at best quadratically with increasing data.
We propose a localization strategy that generates expected sensor observations via an analytic projection of a finite number of mixture components into the camera frame, yielding a fast pre-rendering of a large number of potential candidate locations (as compared to approaches that render the expected dense image). These rendered projection views enable the computation of the likelihood of the sensor observations arising from a hypothesized pose. Note that in contrast to the above-mentioned algorithms that use RGB information, we only use the depth observations from our sensors. Contributions of this work include:

A localization formulation based on a memory efficient, high fidelity environment model;

Fast computation of the likelihood of sensor data originating from a pose given this representation;

A particle filter-based localization strategy with adaptive importance sampling based on this fast likelihood calculation; and

A parallel formulation with experimental evaluation on a mobile GPU system.
3 Approach
Contemporary direct tracking algorithms require the projection of high dimensional dense (or semi-dense) points into the image space to align the current sensor data to a model. In contrast, we employ GMMs as succinct parameterized representations that avoid ray-casting high dimensional data via an analytic projection into image space, leading to a substantial reduction in input size and complexity. The consequent orders-of-magnitude computational savings (hundreds of components vs. many thousands of 3D points) enable projection to multiple pose hypotheses concurrently in real-time and motivate this work.
This section details the choice of environment representation and how it enables the proposed realtime sensor data likelihood computation.
3.1 Spatial GMMs as an Environment Representation for Tracking
Conventional means of representing maps, such as voxel grids, discretize space and encode occupancy, leading to resolution-dependent model fidelity and memory efficiency. An alternate approach to representing occupancy (or the existence of matter in space) is to use a Hierarchical Gaussian Mixture Model (HGMM) that attempts to approximate the underlying distribution from which sensor measurements are sampled. This formulation is capable of representing the underlying environment model with as high a fidelity as required and gracefully scales in fidelity to avoid information loss [11]. Additionally, HGMM representations provide a probabilistic uncertainty estimate of the occupancy at any sampled location. Fitting these models to data in real time is possible due to recent advances that enable efficient operation [2]. For the purpose of this paper, we limit the discussion to using only GMM maps of high fidelity, but note that the approach can be readily extended to a hierarchical formulation.
3.2 Projection into Image Space
Given an undistorted depth image $Z_t$ acquired from a sensor at time $t$, we are interested in finding the pose $T$ of the camera with respect to a prior map of the environment, assuming a standard pinhole camera projection.
A spatial GMM represents the probability of matter existing at a specific 3D location $x$ given the model component parameters $\Theta = \{\lambda_j, \mu_j, \Sigma_j\}_{j=1}^{M}$ such that

$p(x \mid \Theta) = \sum_{j=1}^{M} \lambda_j \, \mathcal{N}(x; \mu_j, \Sigma_j)$   (1)

where $\lambda_j$ is the mixture weight, $\mu_j$ the mean 3D position, and $\Sigma_j$ the covariance of the $j^{\text{th}}$ component of the GMM respectively, with $\sum_{j=1}^{M} \lambda_j = 1$ and $\lambda_j \geq 0$. This GMM is projected to image space by a transformation into the camera coordinate frame followed by a pinhole camera projection.
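To make the mixture density concrete, the following is a minimal sketch of evaluating Eq. 1 at a 3D point; the function name `gmm_density` and the explicit normalization constant are illustrative, not the paper's implementation:

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_j lambda_j * N(x; mu_j, Sigma_j) for a spatial GMM."""
    p = 0.0
    d = x.shape[0]
    for lam, mu, cov in zip(weights, means, covs):
        diff = x - mu
        # Normalization constant of a d-dimensional Gaussian.
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
        p += lam * norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))
    return p
```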
A point $x_w$ in the world frame is transformed into the camera frame by $T$ and denoted as $x_c = T \circ x_w$. The operator $\circ$ is expressed as a composition of the input point with the corresponding rotation matrix $R$ and translation $t$: $T \circ x_w = R x_w + t$.
Consider a sample $x \sim \mathcal{N}(\mu, \Sigma)$ and a monotonic continuous nonlinear function $g$ of $x$ (where $x$ and $g(x)$ are in the same space). The first-order Taylor series expansion about a point $x_0$ leads to

$g(x) \approx g(x_0) + \nabla g(x_0)(x - x_0).$

Thus, about the point $x_0 = \mu$,

$g(x) \approx g(\mu) + \nabla g(\mu)(x - \mu),$

and by applying linearity of expectation,

$\mathbb{E}[g(x)] \approx g(\mu), \qquad \mathrm{Cov}[g(x)] \approx \nabla g(\mu) \, \Sigma \, \nabla g(\mu)^{\top}.$
This transformed random variable is then projected to the image space with the pinhole projection operation $\pi$ on a point $p = (p_x, p_y, p_z)^{\top}$:

$\pi(p) = \begin{bmatrix} f \, p_x / p_z + c_x \\ f \, p_y / p_z + c_y \end{bmatrix}$

where $f$ is the focal length of the camera, and $c_x$ and $c_y$ are the principal point offsets. The derivative of the projection operation with respect to the 3D point is

$\frac{\partial \pi}{\partial p} = \begin{bmatrix} f/p_z & 0 & -f p_x / p_z^2 \\ 0 & f/p_z & -f p_y / p_z^2 \end{bmatrix}$
and the projection of a 3D normal distribution into image space is a 2D normal distribution

$\mathcal{N}\big(u;\ \pi(\mu_c),\ J_{\pi} \Sigma_c J_{\pi}^{\top}\big), \quad J_{\pi} = \frac{\partial \pi}{\partial p}\Big|_{p = \mu_c}$   (2)

where $\mu_c = T \circ \mu$ and $\Sigma_c = R \Sigma R^{\top}$.
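This linearized projection of a camera-frame Gaussian can be sketched as follows. The helper name `project_gaussian`, the use of separate focal lengths `fx`, `fy`, and the assumption that the component has already been transformed into the camera frame are ours:

```python
import numpy as np

def project_gaussian(mu, cov, fx, fy, cx, cy):
    """Project a 3D Gaussian (mu, cov), expressed in the camera frame, to a 2D
    image-space Gaussian via the pinhole model and its Jacobian."""
    x, y, z = mu
    # Pinhole projection of the mean.
    u = np.array([fx * x / z + cx, fy * y / z + cy])
    # Jacobian of the projection evaluated at the mean.
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    # First-order covariance propagation: J * Sigma * J^T.
    cov2d = J @ cov @ J.T
    return u, J, cov2d
```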
Thus, for a GMM, which is simply a linear combination of 3D normal distributions, the corresponding belief distribution in image space for camera pose $T$ is

$p(u \mid T, \Theta) = \sum_{j=1}^{M} \lambda_j \, \mathcal{N}\big(u;\ \pi(\mu_{c,j}),\ J_{\pi,j} \Sigma_{c,j} J_{\pi,j}^{\top}\big)$   (3)

where $\mathcal{N}(\cdot)$ estimates the probability of the depth at a particular pixel location having originated from the $j^{\text{th}}$ component per Eq. 2, $u$ is a random variable signifying the depth at a certain pixel, and $M$ is the number of components in the GMM model $\Theta$.
3.3 Estimating the Likelihood of a Camera Pose Hypothesis
As shown in the previous subsection, the projection of the spatial GMM constitutes a corresponding GMM in image space. Thus, each projected GMM component represents a probability support region for pixels in the corresponding spatial extent being that intensity (here depth). Since the contribution of multiple components may be very small, and to avoid numerical underflow, we utilize the negative log likelihood of the sensor data having originated from those components [2]. Given a scan of depth pixels $Z = \{u_i\}_{i=1}^{P}$ from a sensor and a set of GMM parameters $\Theta$, the log likelihood is calculated as

$\log \mathcal{L}(Z \mid T, \Theta) = \sum_{i=1}^{P} \log \sum_{j=1}^{M} \lambda_j \, \mathcal{N}_j(u_i)$   (4)

where $\mathcal{N}_j$ is the normal distribution presented in Eq. 2, and $P$ is the number of pixels in the sensor scan. In a local neighbourhood, this likelihood should peak at the true pose of the camera and fall off as we move further away. This is indeed the case in practice, as shown in Fig. 2.
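A direct, unoptimized sketch of this negative log-likelihood, using the log-sum-exp trick to avoid the numerical underflow noted above; the function name and the simplification of a scan to a list of 2D pixel observations are ours:

```python
import numpy as np

def negative_log_likelihood(pixels, weights, means2d, covs2d):
    """Negative log-likelihood of observed pixels under the projected
    image-space GMM, accumulated stably with log-sum-exp."""
    nll = 0.0
    for u in pixels:
        log_terms = []
        for lam, mu, cov in zip(weights, means2d, covs2d):
            diff = u - mu
            # log of the 2D Gaussian normalization constant.
            log_norm = -np.log(2.0 * np.pi) - 0.5 * np.log(np.linalg.det(cov))
            log_terms.append(np.log(lam) + log_norm
                             - 0.5 * diff @ np.linalg.solve(cov, diff))
        m = max(log_terms)  # shift by the max term before exponentiating
        nll -= m + np.log(sum(np.exp(t - m) for t in log_terms))
    return nll
```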
3.4 Tracking Multiple Hypotheses
The discussion above only considers the nature of the likelihood in the vicinity of the true location; in practice it is not reasonable to assume that a single viewpoint suffices to localize the system as perceptual aliasing may arise (due to a paucity of data that precludes state observability). Hence, we propose a technique that permits tracking of multiple hypotheses and ensures appropriate weighting of equally likely viewpoints given the current observations. In the case of perceptual aliasing, multiple observations are required prior to converging to an estimate.
A standard approach to tracking multiple hypotheses is a Monte Carlo filter (or particle filter). Particle filters operate by continuously sampling candidate particle poses and measuring the likelihood of the current sensor data having originated at each sampled pose. Based on the relative scores of the samples, the particles are resampled and propagated based on a process model (often a noisy source of odometry). Convergence is generally achieved as soon as the sequence of observations made over time renders alternate hypotheses inadmissible. Note that due to their inherent structure, particle filters are extremely parallelizable, a fact that we exploit in our implementation.
State Propagation
We use a general first-order Markov model and observation model in terms of conditional PDFs. We assume the presence of some odometry to drive the process model and inject Gaussian noise into it. This odometry can be obtained, for instance, from a vision-based velocity estimator or IMU preintegration. Note that we assume known pitch and roll, which can be obtained, for instance, from the attitude and heading reference system onboard a robotic system.
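A minimal sketch of such a process model, assuming a 4-state particle [x, y, z, yaw] (pitch and roll known) and an odometry increment expressed in the body frame; all names and the noise parameterization are our assumptions:

```python
import numpy as np

def propagate(particles, delta_xyz, delta_yaw, pos_sigma, yaw_sigma, rng):
    """First-order Markov process model: apply the odometry increment in each
    particle's own yaw frame and inject zero-mean Gaussian noise.
    particles: (N, 4) array of [x, y, z, yaw]."""
    out = particles.copy()
    yaw = out[:, 3]
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate the planar odometry increment into the world frame per particle.
    out[:, 0] += c * delta_xyz[0] - s * delta_xyz[1]
    out[:, 1] += s * delta_xyz[0] + c * delta_xyz[1]
    out[:, 2] += delta_xyz[2]
    out[:, 3] += delta_yaw
    # Inject Gaussian process noise.
    out[:, :3] += rng.normal(0.0, pos_sigma, size=(len(out), 3))
    out[:, 3] += rng.normal(0.0, yaw_sigma, size=len(out))
    return out
```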
Importance Weight
The particle filter we employ uses the negative log likelihood of the current scan being drawn from the model at the particle location as its importance weight. Thus, given the current state estimate $x_t^i$ of a particle $i$ out of $N$ particles at time step $t$,

$\tilde{w}_t^i = \big(-\log \mathcal{L}(Z_t \mid x_t^i, \Theta)\big)^{-1}, \qquad w_t^i = \frac{\tilde{w}_t^i}{\sum_{k=1}^{N} \tilde{w}_t^k}$   (5)

where $w_t^i$ is the corresponding normalized importance weight obtained using the likelihood measurements for all the particles. Note that we use the inverse of the negative log likelihood function as our importance weight since the negative log likelihood is minimal at the correct location.
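The inverse-then-normalize weighting described here is a one-liner; the helper name is ours:

```python
import numpy as np

def importance_weights(nll_values):
    """Importance weights from per-particle negative log-likelihoods:
    invert (lower NLL = better match), then normalize to sum to one."""
    inv = 1.0 / np.asarray(nll_values, dtype=float)
    return inv / inv.sum()
```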
Sampling Strategy
A particle filter should ideally converge to the correct hypothesis after running for a finite number of iterations, with a reduction in the filter variance signifying the confidence of the filter. At the same time, an early reduction in the filter variance may cause the filter to diverge to an incorrect hypothesis and never recover due to low variance. In order to avoid such situations, we implement the stratified sampling strategy [6] in combination with low variance sampling [13]. The particles are divided into random groups of equal weights, and in each group we employ low variance sampling. This approach has low particle variance [13] and works well when the particle filter is tracking multiple hypotheses at once.
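One way to realize the combined strategy, under the assumption that "random groups of equal weights" means randomly partitioning the particle set into equal-sized groups; helper names are ours:

```python
import numpy as np

def low_variance_sample(weights, n, rng):
    """Systematic (low-variance) resampling: one random offset, then n equally
    spaced pointers over the cumulative weight distribution."""
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights / weights.sum())
    return np.searchsorted(cumulative, positions)

def stratified_resample(weights, n_groups, rng):
    """Divide the particle set into random groups and run low-variance
    resampling independently inside each group."""
    n = len(weights)
    idx = rng.permutation(n)
    out = []
    for g in np.array_split(idx, n_groups):
        picked = low_variance_sample(weights[g], len(g), rng)
        out.extend(g[picked])
    return np.array(out)
```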
Handling Particle Deprivation
One of the most common failure modes of a particle filter is particle deprivation [14]. Even with a large number of particles, the random nature of a particle filter might cause it to diverge from the correct state. Since the importance weights are an absolute measure of similarity between the current sensor scan and the model, the average of the importance weights signifies the probability of the filter tracking the true hypothesis. We employ the Augmented MCL strategy as described in [13]:

$w_{\text{avg}} = \frac{1}{N} \sum_{i=1}^{N} \tilde{w}_t^i, \qquad w_{\text{slow}} \leftarrow w_{\text{slow}} + \alpha_{\text{slow}} (w_{\text{avg}} - w_{\text{slow}}), \qquad w_{\text{fast}} \leftarrow w_{\text{fast}} + \alpha_{\text{fast}} (w_{\text{avg}} - w_{\text{fast}})$   (6)

where $N$ is the number of particles, $\tilde{w}_t^i$ is as defined in Eq. 5, and $\alpha_{\text{slow}}$ and $\alpha_{\text{fast}}$ are parameters used to smooth the absolute sum of sensor measurement probabilities over different time horizons.
Due to implementation limitations we cannot increase the number of particles once the filter is initialized, and hence we choose to randomly reinitialize a fraction of the particles, determined by the Augmented MCL criterion, from the original set every iteration.
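The Augmented MCL bookkeeping can be sketched as follows; the class structure and the α values (placeholders, not the paper's tuned parameters) are our assumptions:

```python
import numpy as np

class AugmentedMCL:
    """Track short- and long-term averages of the measurement likelihood and
    derive the fraction of particles to randomly reinitialize."""

    def __init__(self, alpha_slow=0.01, alpha_fast=0.1):
        assert 0.0 < alpha_slow < alpha_fast
        self.alpha_slow, self.alpha_fast = alpha_slow, alpha_fast
        self.w_slow = self.w_fast = None

    def update(self, weights):
        w_avg = float(np.mean(weights))
        if self.w_slow is None:
            self.w_slow = self.w_fast = w_avg  # bootstrap both averages
        self.w_slow += self.alpha_slow * (w_avg - self.w_slow)
        self.w_fast += self.alpha_fast * (w_avg - self.w_fast)
        # Fraction of particles to replace with freshly drawn ones: nonzero
        # only when the fast average drops below the slow one.
        return max(0.0, 1.0 - self.w_fast / self.w_slow)
```

When the likelihood suddenly degrades, `w_fast` falls faster than `w_slow` and the returned fraction becomes positive, triggering reinitialization.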
4 Fast Localization
In order to perform fast localization using the above approach it is essential to compute the likelihood of the data given a proposed pose as quickly as possible. Eq. 4 shows that computing the likelihood of a scan having been sampled from the model requires summing the contributions of all the components within the GMM. However, the key insight here is that not all the components have a significant contribution to the likelihood.
The point clouds that we use in our experiments have roughly uniform coverage of points across the scene. As a consequence, all components fit to these point clouds end up having roughly equivalent weighting factors. This fact, in addition to the diminishing support of the normal distribution, permits the approximation of using only the projected components in spatial proximity to a certain pixel location for likelihood calculation. Thus, the projected components of the model whose bounding ellipsoids do not lie within the sensor field of view are considered to have a negligible contribution to the likelihood computation. This enables the computation of the likelihood with a much smaller subset of the model, therefore facilitating high computation speeds. As an added optimization step we perform this membership computation over subdivided patches of the image. These optimizations have negligible effect on the likelihood value of the sensor data, as demonstrated in Fig. 2.
In summary, we follow the following steps to obtain the relevant components for computing the likelihood:

Divide the image into 32 × 32 pixel patches;

Compute the 2D projection of each Gaussian component on to the image plane of the depth sensor;

Inflate the 2D ellipse of the projection of each component by a conservative radius (greater than or equal to half the diagonal of the patch) along its major and minor axes; and

For each patch, check if the center of the image patch lies within or on each of the inflated ellipses. All the ellipses which contain the center of the image patch are then used for likelihood computation for that patch.
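The patch-membership test in the steps above might look like the following sketch; the n-sigma bound used to define the ellipse axes from the 2D covariance, and all names, are our assumptions:

```python
import numpy as np

def components_for_patch(patch_center, means2d, covs2d, patch_diag, n_sigma=2.0):
    """Indices of projected components whose n-sigma ellipse, inflated by half
    the patch diagonal, contains the patch center."""
    selected = []
    r = 0.5 * patch_diag  # conservative inflation radius
    for j, (mu, cov) in enumerate(zip(means2d, covs2d)):
        # Ellipse axes from the eigendecomposition of the 2D covariance.
        evals, evecs = np.linalg.eigh(cov)
        axes = n_sigma * np.sqrt(evals) + r  # inflated semi-axis lengths
        d = evecs.T @ (patch_center - mu)    # offset in the ellipse frame
        if np.sum((d / axes) ** 2) <= 1.0:   # point-in-ellipse test
            selected.append(j)
    return selected
```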
Given the set of these projected 2D Gaussian components and a scan $Z$, the likelihood of the depth image can be computed with the reduced subset of components as

$\log \mathcal{L}(Z \mid T, \Theta) \approx \sum_{i=1}^{P} \log \sum_{j \in S_i} \lambda_j \, \mathcal{N}_j(u_i)$   (7)

where $S_i$ denotes the subset of components associated with the image patch containing pixel $i$.
5 Results
5.1 Experiment Design
This section presents a performance analysis of the filter on a variety of datasets. First, we conduct a sensitivity analysis to determine the number of particles and the number of components used in our implementation. Second, we analyze the runtime performance of the filter, compare it with a state-of-the-art conventional tracking algorithm, and show that the runtime is competitive both on a desktop-class processor and in situ on an embedded platform, thus enabling SWaP-constrained operation. Third, we analyze the metric accuracy of the proposed filter on publicly available datasets and show that the filter output is close to ground truth. Fourth, we compare the localization performance of the filter with a state-of-the-art RGB-D visual SLAM algorithm on the same data and demonstrate superior localization performance despite only using depth information from the sensor.
We evaluate the approach via

D1: The (a) “lounge” and (b) “copyroom” datasets [15];

D2: The voxblox dataset [9]; and

D3: A representative dataset.
In all cases we utilize a fixed number of components (as informed by [11]) to first fit a GMM to the point cloud using the scipy library.
We employ two processing systems for evaluation: (1) A Desktop with an Intel i7 CPU and a NVIDIA GTX 960 Ti GPU, and (2) An embedded NVIDIA TX2 module.
5.2 Sensitivity Analysis
Particle filters achieve increased performance with a large number of particles at the cost of increased computational complexity. Conversely, too few particles can lead to divergence from the true location due to an inability to represent the true underlying distribution. In order to find the appropriate number of particles that guarantees precision while being computationally feasible, we compare the filter performance with a ground truth particle filter. Assuming the underlying distribution represented by the particle set to be Gaussian, we compute the variance of the KL-Divergence [5] of multiple runs of the filter output with that of the ground truth filter to determine the empirically optimal parameters to be used in our implementation. A low value of the KL-Divergence indicates similar performance to the ground truth.
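Assuming, as above, that each particle set is summarized by a single Gaussian fit, the comparison uses the closed-form KL divergence between two multivariate Gaussians; the helper name is ours:

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians:
    0.5 * (tr(S1^-1 S0) + (m1-m0)^T S1^-1 (m1-m0) - k + ln(det S1 / det S0))."""
    k = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))
```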
We first conduct a Particle Count Sensitivity test by varying the number of particles to assess filter performance and repeatability while ensuring computational feasibility. We then perform similar experiments with a fixed number of particles obtained from the previous study, varying the number of Gaussian components used to represent the ground truth map. These experiments motivate the design choices for the maximum number of particles and components in our implementation, detailed in Table 1.
Table 1

             Variance   Noise
  Desktop
    x        4          0.02
    y        4          0.02
    z        2          0.02
    yaw      0.3        0.01
  TX2
    x        4          0.025
    y        4          0.025
    z        2          0.025
    yaw      0.3        0.1
5.3 Runtime Performance Analysis
As seen in Fig. 6, the likelihood computation is the most computationally expensive operation, running at approximately 120 Hz on the Desktop and at 14 Hz on the TX2. The timing of the likelihood computation varies with the number of Gaussian components used for each image patch and therefore depends on the model and the scene.
The filter runs at an average rate of 80 Hz and 9.5 Hz on the Desktop and embedded class systems, respectively. This is comparable to the ORB-SLAM2 rates of 47 Hz and 20 Hz on the respective platforms.
5.4 Metric Accuracy Analysis
In this subsection we discuss the localization accuracy of the filter. As mentioned in Sec. 3.4, since we do not add new particles when the filter observes particle deprivation and instead randomly reinitialize particles from the original set, the Root Mean Squared Error (RMSE) of the filter estimate increases during these periods. This is highlighted in the plots as vertical shaded regions. For all our evaluations we run the filter 10 times on each dataset and report the average of the mean filter estimate.
Evaluation with Ground Truth Datasets (D1, D2)
The objective of using these datasets is to demonstrate the ability of the filter to converge to correct results given ground truth odometry. We generated a GMM map of the environments using the reconstructed point cloud and used the delta position between two consecutive reported sensor poses as our process model. In all these experiments, we initialized the particles from a uniform distribution in a cube and yaw orientation around the known initial location. D1(a) and D1(b) contain nominal motion of the sensor, while D2 consists of very aggressive motion in all degrees of freedom. Additionally, in Fig. 8 we also analyze the impact of the slower runtime performance on the TX2 compared to the Desktop.
The filter estimate converged to an incorrect hypothesis for some runs in the initial iterations due to the highly symmetric nature of the environments but managed to snap back to the correct pose as soon as more distinguishing sensor data was observed.
The RMSE magnitude is higher for the D2 dataset than for the other two, as seen in Fig. 9.
Evaluation with Representative Dataset (D3)
The objective of using this dataset is to demonstrate results on a real-world application of the filter. We no longer use ground truth odometry. Additionally, since we don't have a baseline algorithm to directly compare against, we compare the localization performance against a state-of-the-art RGB-D visual SLAM algorithm (ORB-SLAM2 [7]) that builds its own succinct map representation. Finally, we also briefly contrast the performance of the filter on the same dataset.
We generate a ground truth point cloud and sensor data using a standalone FARO Focus 3D laser scanner.
As ground truth is not available for this dataset, we report the likelihood values at the mean particle filter location and at the reported ORB-SLAM2 poses. We also show a selection of views of the predicted sensor measurements at the mean filter and ORB-SLAM2 estimated poses, respectively. We performed multiple runs starting at different locations in the same environment and report results for each. The particles in these experiments are initialized from a uniform distribution over position and yaw given the dimensions of the environment.
An advantage of using particle filters over maximum likelihood estimators is that the filter can converge to the correct result even after moving through a region of low observability. To demonstrate this, we also evaluated performance given an ill-conditioned dataset where the sensor transitions from a region where the GMM map is unavailable to a region with adequate information for localization prior to returning to the same impoverished region (Fig. 12).
We observe that the sensor measurements register better at the converged filter location than at the ORB-SLAM2 estimate, as can be seen in Fig. 10.
In the robustness test, as seen in Fig. 11, the filter estimate diverges from the ORB-SLAM2 estimate but eventually converges to the correct estimate, while ORB-SLAM2 drifts to an incorrect location.
6 Summary and Future Work
We present a framework to perform real-time localization of depth sensors given a prior continuous spatial belief representation. By utilizing a likelihood measure of the representation projected into image space, we can perform robust particle filter localization in real-time even on an embedded GPU system. Key to this capability is the ability to render a much more succinct representation into the image frame of the sensor to evaluate the likelihood of the data having originated from the given map at a given location.
The objective of this work is to provide a robust relocalization backend to bootstrap an optimization-based precise localization strategy that incorporates additional data such as image intensity [3].
Part of the reason why it is not straightforward to pursue an optimization-based tracking algorithm with this map representation is the emphasis on it being approximately continuous. For flatter Gaussian components, the gradient profile can vary rapidly in the vicinity of the components, leading to poor conditioning that can challenge traditional iterative Gauss-Newton descent strategies. Further, the absence of strong gradient information in spatial data, as encoded by the necessarily smooth-by-construction representation, hinders the application of these methods but makes the representation more robust to changes in the local environment.
Future steps will pursue further formulation, optimization and incorporation as a backend for a visual SLAM algorithm running onboard a relevant SWaP constrained aerial platform.
Footnotes
 https://www.scipy.org/
 https://www.faro.com/products/constructionbimcim/farofocus/
References
 A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. Fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, 24(5):1027–1037, 2008.
 B. Eckart, K. Kim, A. Troccoli, A. Kelly, and J. Kautz. Accelerated generative models for 3D point cloud data. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
 T. Gonçalves and A. I. Comport. Real-time direct tracking of color images in the presence of illumination variation. In Proc. of IEEE International Conference on Robotics and Automation, 2011.
 J. R. Hershey and P. A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages IV–317. IEEE, 2007.
 G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1), 1996.
 R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
 K. Ok, W. N. Greene, and N. Roy. Simultaneous tracking and rendering: Real-time monocular localization for MAVs. In Proc. of IEEE International Conference on Robotics and Automation, 2016.
 H. Oleynikova, Z. Taylor, M. Fehr, J. Nieto, and R. Siegwart. Voxblox: Building 3D signed distance fields for planning. arXiv:1611.03631, 2016.
 G. Pascoe, W. Maddern, and P. Newman. Robust direct visual localisation using normalised information distance. In Proc. of British Machine Vision Conference, 2015.
 S. Srivastava and N. Michael. Approximate continuous belief distributions for precise autonomous inspection. In Proc. of the IEEE International Symposium on Safety, Security and Rescue Robotics, 2016.
 A. D. Stewart and P. Newman. LAPS - Localisation using appearance of prior structure: 6-DoF monocular camera localisation using prior pointclouds. In Proc. of IEEE International Conference on Robotics and Automation, 2012.
 S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
 R. Van Der Merwe, A. Doucet, N. De Freitas, and E. A. Wan. The unscented particle filter. In Advances in Neural Information Processing Systems, 2001.
 Q.-Y. Zhou and V. Koltun. Dense scene reconstruction with points of interest. ACM Transactions on Graphics, 32(4):112, 2013.