Joint Point Cloud and Image Based Localization For Efficient Inspection in Mixed Reality


Manash Pratim Das, Zhen Dong and Sebastian Scherer

Manash Pratim Das is with the Indian Institute of Technology, Kharagpur, 721302, WB, India. Zhen Dong is with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China. Sebastian Scherer is with The Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA.

This paper introduces a method of structure inspection using mixed-reality headsets to reduce the human effort in reporting accurate inspection information such as fault locations in 3D coordinates. Prior to every inspection, the headset needs to be localized. While external pose estimation and fiducial-marker-based localization would require setup, maintenance, and manual calibration, marker-free self-localization can be achieved using the onboard depth sensor and camera. However, due to the limited depth sensor range of portable mixed-reality headsets like Microsoft HoloLens, localization based on simple point cloud registration (sPCR) would require extensive mapping of the environment. Also, localization based on camera images would face issues such as stereo ambiguities and hence depends on viewpoint. We thus introduce a novel approach to Joint Point Cloud and Image-based Localization (JPIL) for mixed-reality headsets that uses visual cues and headset orientation to register small, partially overlapping point clouds and save significant manual labor and time in environment mapping. Our empirical results, compared to sPCR, show an average 10-fold reduction in the required overlap surface area, which could potentially save on average 20 minutes per inspection. JPIL is not restricted to inspection tasks; it can also be essential in enabling intuitive human-robot interaction for spatial mapping and scene understanding in conjunction with other agents, such as autonomous robotic systems, that are increasingly being deployed in outdoor environments for applications like structural inspection.

This paper has been accepted for publication at the IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), Madrid, 2018. ©IEEE

I Introduction

The onset of portable mixed-reality headsets like Google Glass and Microsoft HoloLens has enabled efficient human interaction with 3D information. These headsets have been found suitable for on-site inspection [1, 2], due to their ability to visualize augmented geo-tagged holograms and measure 3D distances by virtue of gaze and gesture. Notably, before every inspection, the headset needs to be localized, preferably with onboard sensors, to establish a common frame of reference for 3D coordinates. As spatial mapping and understanding is key to mixed reality, we can safely assume that the primary sensors would include at least a depth sensor and a camera. However, due to the small form factor and portability, these sensors have limited range. Given a prior 3D model of the structure as a template, existing methods for simple point cloud registration (sPCR) [3, 4] can be employed on a spatial map generated on-site by the headset. These methods, however, require significant overlap between the point clouds and thus would require the user to map a sufficiently large surface area of the structure. Conversely, camera pose estimation using 3D-2D correspondences [5] lacks the desired precision and depends on viewpoint due to stereo ambiguities.

Fig. 1: An inspector initiates JPIL with a gesture. JPIL uses spatial map, camera image and headset orientation for localization.

In this paper, we introduce an efficient approach to Joint Point Cloud and Image-based Localization (JPIL) for marker-free self-localization of mixed-reality headsets that requires minimal on-site mapping time. JPIL can successfully register a spatial map that has significantly low overlap with a prior 3D model by combining additional information from the camera image and headset orientation, which is readily available from the onboard camera, IMU and magnetometer sensors. Interestingly, it simplifies the problem to an extent that is critical for operation in symmetric environments. A small spatial map might resemble multiple structures on the prior 3D model; thus, a camera pose estimation (CPE) problem is solved as a function of point cloud registration to differentiate between multiple candidates. The resulting localization accuracy is similar, albeit at significantly lower point cloud overlap. The contributions of this paper are:

  1. We contribute a JPIL method (Section V) that uses the spatial map (Section III), camera image (Section IV) and headset orientation to localize a mixed-reality headset with minimal mapping time.

  2. We propose a modified binary shape context based 3D point cloud descriptor with an efficient multiple-candidate descriptor matching algorithm (Section III-B).

  3. We empirically show that the JPIL method results in the successful registration of point clouds with low overlap and an average 10-fold reduction in required surface area.

  4. Finally, we provide a software implementation and hardware verification on a Microsoft HoloLens.

II Digital Model and Setup

A common frame of reference is required to report 3D coordinates that can be persistently used across multiple inspection sessions. While a global positioning system (e.g., GPS) could be used, we instead use a digital 3D model (Fig. 2) of the structure and report all 3D coordinates in its reference frame. The model has to be at real scale and can be a partial or full representation of the structure. Its source can be a computer-aided model or a map generated by other 3D reconstruction methods such as [6, 7]. If no model of a structure exists prior to its inspection, the user can scan the structure to build a map that can serve as the model in future inspections.

Fig. 2: An example CAD model of the Delaware Memorial Bridge.

We choose a 3D triangular mesh as the format for the model, as ray casting on a mesh for measurement is accurate and computationally cheap due to the surface representation. In most cases, a mixed-reality headset would set up a local frame of reference with an arbitrarily defined origin for its operation. The spatial mapping performed by the headset generates a second triangle mesh in this local frame. Orientation is estimated in the Easting, Northing and Elevation (ENU) geographic Cartesian frame from the IMU and magnetometer for both the models. The models are thus aligned with the ENU frame.

Localization of the headset has to be performed only once per inspection session, where a transformation is established between the headset's local frame and the reference frame of the model. The headset pose w.r.t. its local frame is tracked over time by the headset itself. A normal inspection procedure under the proposed method would be to first map a small structural surface and then initiate JPIL with the data available at that instant.

III Point Cloud Registration

JPIL samples both meshes to generate corresponding point clouds for registration. Point cloud registration estimates a transformation matrix between the two point clouds, such that the spatial map's point cloud aligns with the reference point cloud when every one of its points is transformed by this matrix (Fig. 3).


where and are corresponding points (homogeneous coordinates) in and respectively. A common framework consists of essentially the following steps:

  1. Keypoint extraction,

  2. Descriptor computation,

  3. Descriptor matching, and

  4. Transformation estimation.

The models are aligned with the ENU frame; thus, the estimated transformation will have a negligible rotation component. We use the binary shape context descriptor [8] for steps 1 and 2, while we modify step 1 to incorporate the orientation information. The modified descriptor (tBSC) is now orientation specific. Finally, we propose an efficient algorithm for step 3 (Section III-B). Thus, we discuss steps 1 and 3 in detail while briefly reviewing the other steps.

Fig. 3: establishes a common frame of reference at the origin of . During each inspection is generated, while is common. A raycast performed in gaze direction from headset’s position gives the measurement . Point cloud registration estimates , that allows the point of interest to be reported in .

III-A Keypoint extraction

Keypoint extraction complements the feature descriptors by providing points that are expected to have high feature descriptiveness and are robust to minor viewpoint changes. For each point of a point cloud, we perform two separate eigenvalue decompositions on the covariance matrices and .

The covariance matrices and are constructed using and its neighboring points within a supporting radius , which is user defined. We used . The neighboring points and are denoted as , where is the point and is the number of neighboring points. A new set of points is generated from with only the and component of those points. The covariance matrices and are calculated as follows:


The eigenvalues in decreasing order of magnitude and , along with the corresponding eigenvectors and respectively, are computed by performing eigenvalue decompositions of and respectively.

Now, the point is considered a keypoint if:

  • : Note, are the eigenvalues computed from , which in turn is formed from the 2D distribution of in the plane. This condition, thus, promotes points with well defined eigenvectors.

  • Highest curvature among its neighbors. The curvature is defined by .
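The eigenvalue test on the xy-plane distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the covariance here is unweighted and centered on the candidate point, and the function names and the `min_ratio` threshold are hypothetical, since the exact weighting and threshold are not recoverable from the text.

```python
import math

def xy_covariance(p, neighbors):
    """Unweighted 2x2 covariance of the neighbors' x and y components,
    centered on the candidate point p (an assumption of this sketch)."""
    n = len(neighbors)
    dx = [q[0] - p[0] for q in neighbors]
    dy = [q[1] - p[1] for q in neighbors]
    cxx = sum(a * a for a in dx) / n
    cyy = sum(b * b for b in dy) / n
    cxy = sum(a * b for a, b in zip(dx, dy)) / n
    return cxx, cxy, cyy

def eigenvalues_2x2(cxx, cxy, cyy):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    mean = 0.5 * (cxx + cyy)
    half_gap = math.hypot(0.5 * (cxx - cyy), cxy)
    return mean + half_gap, mean - half_gap

def has_well_defined_xy_axes(p, neighbors, min_ratio=1.5):
    """True when the two xy eigenvalues are clearly separated.
    min_ratio is a hypothetical threshold, not a value from the paper."""
    l1, l2 = eigenvalues_2x2(*xy_covariance(p, neighbors))
    return l2 > 0.0 and l1 / l2 > min_ratio
```

A neighborhood that is stretched along one horizontal direction passes the test, while a rotationally symmetric patch does not, because its principal directions in the plane are not well defined.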

For each keypoint, a local reference frame (LRF) is defined. An LRF imparts invariance in feature descriptors to rigid transformations. Descriptors are computed based on the LRFs; thus, it is critical for LRFs to be reproducible. In [8], the authors define an LRF by adopting the keypoint as the origin and two eigenvectors and their cross product as the x-, y- and z-axes, respectively. However, the orientation of each eigenvector calculated by eigenvalue decomposition has a sign ambiguity. Therefore, there are two choices for the orientation of each axis and thus four possibilities for the LRF. They end up computing descriptors for all four possible LRFs, which leads to reduced precision. This ambiguity is generally faced by all sPCR methods that use eigenvectors for the LRF. In contrast, JPIL adopts the vector towards magnetic East, the vector towards magnetic North and their cross product as the x-, y- and z-axes, respectively. Thus, the LRF consists of the keypoint as origin together with these three axes.
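A minimal Python sketch of such an orientation-based LRF follows. It assumes that the East and North directions are available as orthogonal unit vectors expressed in the same frame as the point cloud; the function names and the dictionary layout are hypothetical and only serve the illustration.

```python
def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))

def make_lrf(keypoint, east, north):
    """LRF with the keypoint as origin and East, North and East x North
    as the three axes (assumed orthonormal in this sketch)."""
    return {"origin": tuple(keypoint),
            "x": tuple(east),
            "y": tuple(north),
            "z": cross(east, north)}

def to_local(lrf, point):
    """Coordinates of a point expressed in the LRF."""
    d = tuple(c - o for c, o in zip(point, lrf["origin"]))
    return (dot(d, lrf["x"]), dot(d, lrf["y"]), dot(d, lrf["z"]))
```

Because the axes come from sensor readings rather than from eigenvectors, there is a single LRF per keypoint instead of four sign variants. When the point cloud is already ENU-aligned, East and North are simply the first two coordinate axes and their cross product points up.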


III-B Multiple Candidate Registration

Since the spatial map would generally be very small compared to the reference model, the registration would have translational symmetry (Fig. 4). Let the descriptor for a keypoint be defined as and and be the set of keypoints for and respectively.

Let be the Hamming distance between two binary descriptors and , and denote maximum possible value of . A match is a set of keypoint pairs formed by selecting a keypoint and from and respectively such that , where is a user-definable threshold. contains a set of matches that 1) belong to one of the possible candidates, or 2) are outliers. We find a family of match sets where a set represents a candidate. is found based on geometric consistency with error margin of and clustering on given by Algorithm 1. We used .
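The thresholded Hamming matching described above might look as follows in Python. This is a sketch under assumptions: descriptors are stored as Python integers, the threshold is applied to the distance normalized by the descriptor length, and the names `match_keypoints` and `max_fraction` (with its default value) are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as ints."""
    return bin(a ^ b).count("1")

def match_keypoints(desc_a, desc_b, n_bits, max_fraction=0.25):
    """Return (i, j, distance) for every descriptor pair whose Hamming
    distance is at most max_fraction * n_bits.

    All pairs under the threshold are kept, not only the best one, so
    that several registration candidates can survive this step.
    max_fraction is a hypothetical default, not a value from the paper.
    """
    limit = max_fraction * n_bits
    matches = []
    for i, da in enumerate(desc_a):
        for j, db in enumerate(desc_b):
            d = hamming(da, db)
            if d <= limit:
                matches.append((i, j, d))
    return matches
```

Keeping every pair below the threshold is what allows a keypoint of the small map to match repeated structures at several places in the reference model.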

1:procedure FindRegCandidates()
2:      Family of match sets
3:      Set of transformations
4:      Set of alignment costs
6:      zeros
8:     while  do
9:         if  then
14:              while  do
15:                  if  then
20:                       if  then
23:              if  then
24:                   RigidTransform()
25:                   AlignCost()
30:     return
Algorithm 1 Find multiple candidate registrations
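The grouping of matches into candidates by geometric consistency can be illustrated with the following Python sketch. It is a simplified greedy stand-in, not a reconstruction of Algorithm 1: two matches are treated as consistent when they preserve the distance between their keypoints within a margin `eps`, and the names `group_candidates` and `min_size` are hypothetical.

```python
import math

def consistent(m1, m2, eps):
    """Two matches agree if the distance between their keypoints in the
    small map equals the distance in the reference model within eps."""
    (p1, q1), (p2, q2) = m1, m2
    return abs(math.dist(p1, p2) - math.dist(q1, q2)) <= eps

def group_candidates(matches, eps, min_size=3):
    """Greedily split matches (pairs of 3D points) into groups whose
    members are all pairwise consistent; small groups are discarded
    as outliers."""
    groups = []
    unassigned = list(matches)
    while unassigned:
        group = [unassigned.pop(0)]
        rest = []
        for m in unassigned:
            if all(consistent(m, g, eps) for g in group):
                group.append(m)
            else:
                rest.append(m)
        unassigned = rest
        if len(group) >= min_size:
            groups.append(group)
    return groups
```

Each surviving group corresponds to one place in the reference model where the small map fits, which is the situation depicted in Fig. 4.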

Subroutine RigidTransform, called by Algorithm 1, estimates the transformation matrix from the corresponding 3D points using singular value decomposition. An alignment is evaluated based on how similar the two point clouds look after alignment. is transformed by to get , and is clipped to get using a box filter of dimension equivalent to the bounding box of . The idea is to compute two tBSC descriptors and , one each for and with radius equivalent to the longest side of the bounding box, and keypoints and at the box centers respectively. Subroutine AlignCost returns the alignment cost given by .

Fig. 4: Multiple possible registration candidates for very small () (Blue). A user can easily map such small region which is sufficient for unique localization when coupled with image based candidate selection.

IV Image Based Candidate Selection

The motivation behind JPIL's use of visual cues is the rich amount of information that is present in a single image. While the depth sensor on the headsets has a limited range and field of view (FOV), a spherical camera can provide a full horizontal and vertical field of view. Consider a particular candidate with its transformation matrix and alignment cost. Since the headset pose in its local frame at the time of capture is known (Section II), JPIL generates a synthetic projection image of the reference model for the candidate by setting a virtual camera at the candidate headset position with the headset orientation in the ENU frame.

If the candidate is the correct one, then the camera image and the synthetic image should match better than those of the other candidates, which implies that the candidate position is the correct headset pose w.r.t. the reference model. We evaluate a match based on the distance metric described in Section IV-A.

We demonstrate JPIL with a spherical camera, however it is not a necessity. The user may use any camera with suitable FOV for their case. We however do recommend the use of spherical cameras as the localizability increases with FOV and we discuss the benefits of the same in the experiments.

IV-A 3D-2D Image Match Distance Metric

Given a 3D point cloud structure and a camera image, the estimation of the camera pose is a standard computer vision problem. We follow [5, 9], which solve this problem, and use them as a framework on which to build our method, which supports spherical image projection and an orientation-constrained non-linear optimization for the camera pose.

Since the synthetic image is generated by projecting the point cloud on a sphere, we can backtrack from a pixel coordinate to the corresponding 3D point. The initial goal is to detect 2D-2D image correspondences between the camera image and the synthetic image, and to establish 3D-2D correspondences after backtracking from the synthetic image (Fig. 5).
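The projection onto a sphere and the backtracking from a pixel to a 3D point can be sketched as below. The sketch assumes an equirectangular image layout, a virtual camera whose axes coincide with the ENU axes, and a per-pixel depth stored when the synthetic image is rendered; these conventions and all function names are assumptions made for illustration, not details confirmed by the text.

```python
import math

def direction_to_pixel(d, width, height):
    """Map a 3D direction to equirectangular pixel coordinates (u, v)."""
    x, y, z = d
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(y, x)        # longitude in (-pi, pi]
    lat = math.asin(z / norm)     # latitude in [-pi/2, pi/2]
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    return u, v

def pixel_to_direction(u, v, width, height):
    """Inverse mapping: unit direction seen through pixel (u, v)."""
    lon = u / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - v / height * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def backtrack(u, v, depth, cam_pos, width, height):
    """Recover the 3D point behind a synthetic pixel from its stored depth."""
    d = pixel_to_direction(u, v, width, height)
    return tuple(c + depth * k for c, k in zip(cam_pos, d))
```

With this layout, a 2D-2D feature match between the camera image and the synthetic image becomes a 3D-2D correspondence as soon as the synthetic pixel is passed through `backtrack`.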

Let where is kernel dimension. Standard deviation . JPIL rejects a correspondence if , assuming the 3D point to be uncertain.

Let and be 3D point and image pixel respectively of correspondence. is the position of headset in according to candidate. Using spherical projection, we can project to a point on the projection sphere. Also, we can project every point to a point on projection sphere as a function of orientation and position (viewpoint). Thus, we can solve for by minimizing the cost on the set of 3D-2D correspondences with RANSAC [10] based outlier rejection.


where is Roll, Pitch or Yaw angle and is the allowed flexibility. We use Ceres solver [11] for the optimization. To evaluate the similarity of two images there are many distance metrics in the literature, like Hausdorff distance [12], photometric error [13] and homography consistent percentage inlier matches. Since, we have the 3D information of an image, we rather check for geometric consistency with the error in position given by :

Fig. 5: Generating 3D-2D correspondence from 2D-2D image correspondences and backtracking to point cloud.
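To make the idea of the pose cost in this section concrete, the following Python sketch evaluates a simplified bearing-based cost for a trial camera position. It is not the cost function of the paper: the orientation is held fixed to the ENU axes (zero flexibility), there is no RANSAC step, and no solver is involved. The names `bearing_cost` and `position_error` are hypothetical.

```python
import math

def unit(v):
    """Normalize a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def bearing_cost(position, points_3d, bearings):
    """Sum of squared differences between the predicted unit bearing of
    each 3D point, as seen from `position`, and the observed unit bearing
    of its matched pixel on the projection sphere."""
    cost = 0.0
    for point, observed in zip(points_3d, bearings):
        predicted = unit(tuple(a - b for a, b in zip(point, position)))
        cost += sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return cost

def position_error(estimated, candidate):
    """Distance between the optimized camera position and the position
    predicted by a registration candidate."""
    return math.dist(estimated, candidate)
```

A non-linear solver would search for the position (and, within the allowed flexibility, the orientation) that minimizes such a cost; the resulting position is then compared with the candidate's position through `position_error`.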

IV-B Confident Positive Matches

The cost function is non-linear and, due to noise in feature matching, the optimization might reach a false local minimum, giving an erroneous estimate. Therefore, the metric is only suitable for determining whether a match is a confident positive, given by:


where depends on the noise level and we used .

V Joint Point Cloud and Image Based Localization

To summarize, JPIL has as its inputs, a reference model , a small 3D map of the structure scanned in the particular session , headset position , headset orientation in ENU frame and spherical image at time . The output of JPIL Algorithm 2 is , the transformation of to , such that headset position w.r.t can be estimated.

1:procedure JPIL()
2:      registration error threshold
3:      set of keypoints from
4:      set of keypoints from
5:      Descriptor match of and
9:     while  do
10:          Synthetic spherical image for
11:         if  then
12:              return               
14:     return
Algorithm 2 Localize Headset w.r.t
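The overall selection logic summarized in this section can be sketched in Python as follows. This is an illustrative outline only: the image-matching step is passed in as a callable, the order in which candidates are tried (ascending alignment cost) is an assumption, and the dictionary keys and the names `select_candidate`, `estimate_position` and `max_error` are hypothetical.

```python
import math

def select_candidate(candidates, estimate_position, max_error):
    """Return the transform of the first confident positive candidate.

    candidates: list of dicts with keys 'transform', 'cost' (alignment
        cost) and 'position' (headset position implied by the candidate).
    estimate_position: callable that renders the synthetic image for a
        candidate, matches it against the camera image and returns the
        estimated camera position, or None if matching fails.
    max_error: position error below which a match counts as a confident
        positive.
    """
    for cand in sorted(candidates, key=lambda c: c["cost"]):
        estimated = estimate_position(cand)
        if estimated is None:
            continue
        if math.dist(estimated, cand["position"]) < max_error:
            return cand["transform"]
    return None
```

Returning `None` when no candidate passes the check leaves the decision of how to proceed, for example by mapping a little more of the structure, to the caller.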

VI Experimental Results

We performed a few experiments to evaluate the following:

  1. Performance of CPE with orientation constraints and differentiability between candidates.

  2. Tolerance of tBSC to error in orientation.

  3. Relation of to error in orientation and relative distance between the camera poses of the two images.

  4. Relation of to .

  5. Reduction in minimum required surface area and mapping time.

The experiments were performed on a Microsoft HoloLens headset with a Ricoh Theta S camera (Fig. 6). We used an off-board computer to process the data. The HoloLens uploaded ( MB), ( kB for pixels), and to the server. ( GB) was already present on the server and an instance of JPIL was running to accept communication from the HoloLens and return the transformation . We used SPHORB [14] feature descriptors for the spherical images.

Fig. 6: Microsoft HoloLens used for the experiments and test arena.

We tested JPIL in real-world as well as simulated environments. Real-world data from the HoloLens was collected at the Charles Anderson Bridge, Pittsburgh, PA, USA. We used a high-precision Terrestrial Laser Scanner (TLS) to generate a dense RGB reference model. The ground truth positions for the experiments were measured using Electronic Distance Measuring (EDM) survey equipment that can report the position of an EDM prism with millimeter accuracy. We manually calibrated the EDM measurements w.r.t. the reference model.

VI-A Performance of CPE with orientation constraints and differentiability between candidates.

The cost function (5) might have many local minima due to erroneous feature matches that might get selected by RANSAC as inliers. We performed CPE for 22 image pairs with varying and computed the average error and standard deviation for each (Fig. 7). We observed that errors increase drastically with as expected. We observed a slight decrease at , which can be credited to the flexibility that allowed the optimization to minimize considering errors in estimate. We also evaluate how discriminative CPE is to candidate positions. We took a spherical image and generated multiple spherical images at position with error increments of along the bridge. From Fig. 7 we observe that becomes unstable with increasing error, and thus the concept of confident positive matching works well to discriminate between candidates that are further away from the nominal position.

Fig. 7: CPE error with varying relaxation of orientation constraints (a) and error in nominal position (b).

VI-B Tolerance of tBSC to error in orientation.

We added error in the orientation estimates along X, Y and Z axes individually in an increment of . We then performed tBSC registration and selected the candidate with the transformation estimate closest to the ground truth transformation. The results in Fig. 8 show a minimum tolerance of for error within 0.6m. This indicates the rotation specificity of the LRF as well as robustness to error in sensor values.

Fig. 8: tBSC registration error with error in orientation estimates.

VI-C Relation of to error in orientation.

is generated as a function of . We wanted to evaluate how error in CPE is affected by error in . So, we added errors in and performed CPE for each pair of and . We observe that CPE is tolerant to error in orientation estimates up to  Fig. 9.

Fig. 9: Performance in CPE with error in orientation estimate.

VI-D Relation of to

Fig. 10 shows an example synthetic spherical image and a heatmap visualization of at each of its pixels. gives the uncertainty measure of 3D-2D correspondence. Correspondences with high value might result in more erroneous CPE, while a generous threshold would promote the number of inliers that can constrain the camera pose. We performed CPE with varying on an image pair. From Fig. 11 we can observe that the variance of certainly increases; however, the relative number of inliers increases too.

Fig. 10: Heatmap visualization of standard deviation (m) in
Fig. 11: With increasing , the error in CPE tends to increase, however, more inlier matches are being used which contributes to confident outlier rejection.

VI-E Reduction in minimum required surface area and mapping time.

JPIL is targeted towards enabling accurate localization for very low overlap point clouds that require significantly less user time. In sPCR, a user would need to walk on-site while mapping the structure to build a sufficiently large map. We simulated random walks in the vicinity of point clouds generated by the HoloLens. As the walk distance increased, more parts of the HoloLens point cloud were included for registration. We wanted to evaluate tBSC and BSC as a function of surface area mapped. Thus, we generated a cumulative distribution function of the minimum surface area required by these methods for successful localization on 15 real datasets (Fig. 12). We observe an average 10-fold reduction in surface area relative to that required by sPCR. For one dataset, we achieved a reduction from to and the time difference was minutes. Finally, performing JPIL over 12 datasets with EDM ground truth, we observe an average accuracy of 0.28m for tBSC registration and 1.21m for CPE (Fig. 13). The surface area was calculated by remeshing the point clouds and summing up the area of each triangle. The surface also included parts of the environment other than the structure; thus, the required surface area in practice would be less than the values shown.
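The surface area computation mentioned above, summing the area of each triangle of a mesh, can be written compactly. The sketch below assumes the mesh is given as a list of vertex coordinates and a list of index triples; the function names are hypothetical.

```python
import math

def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two
    edge vectors."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cx = ab[1] * ac[2] - ab[2] * ac[1]
    cy = ab[2] * ac[0] - ab[0] * ac[2]
    cz = ab[0] * ac[1] - ab[1] * ac[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def mesh_area(vertices, triangles):
    """Total surface area of a triangle mesh given as vertex coordinates
    and (i, j, k) vertex-index triples."""
    return sum(triangle_area(vertices[i], vertices[j], vertices[k])
               for i, j, k in triangles)
```

For example, a unit square split into two triangles yields a total area of 1.0 in the units of the vertex coordinates.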

Fig. 12: Left: Overall JPIL performance in X and Y axis. Black is the ground truth, blue is error in tBSC registration and brown is error in CPE. Right: Surface area required for successful registration by JPIL and sPCR in 15 real world datasets.

VII Related Works

While we have covered the related works in the above text, here we highlight a few other point cloud registration methods that use visual cues. Dold and Brenner [15] use planar patches from image data to refine a point cloud registration, whereas Men et al. [16] use hue as a fourth dimension (x, y, z, hue) and search for correspondences in 4D space. Similarly, the authors of [17, 18] use 2D image features in a tightly coupled framework to direct point cloud registration. These methods require accurate calibration between the Lidar scans and camera images and work well for dominantly planar structures without stereo ambiguities. When errors in sensor calibration, stereo ambiguities and complex 3D environments are accounted for, local image features tend to fail and thus decrease the robustness due to their tightly coupled nature.

Fig. 13: Top: Example JPIL runs for three inspection sessions. Blue denotes , the spatial map generated by the HoloLens and yellow denotes the reference model . Synthetic spherical images Bottom of confident positive matches are shown along with real image Middle. The three spatial maps shown here have varying surface area .

VIII Conclusion And Future Work

We have presented a marker-free self-localization method for mixed-reality headsets and emphasized that data from three onboard sensors (a depth sensor, a camera and an IMU) are critical for efficient localization. Our method is robust against errors in orientation estimation up to , which is generous for most sensors. The localization accuracy of 0.28m is comparable to that of sPCR while requiring 10-fold less surface area on average. Our method does not evaluate the user’s selection of the region to map. Practically, the user should generate the spatial map from well-defined structures with minimum symmetry that also exist in the reference model. In future, we would like to explore the 3D information from time series image data to further enhance the efficiency and robustness of this method.

IX Acknowledgment

The authors would like to thank Daniel Maturana of Carnegie Mellon University for his inputs in the initial phase of the framework development and also proofreading.


  • [1] A. Webster, S. Feiner, B. MacIntyre, W. Massie, and T. Krueger, “Augmented reality in architectural construction, inspection and renovation,” in Proc. ASCE Third Congress on Computing in Civil Engineering, pp. 913–919, 1996.
  • [2] F. Moreu, B. Bleck, S. Vemuganti, D. Rogers, and D. Mascarenas, “Augmented reality tools for enhanced structural inspection,” Structural Health Monitoring 2017, 2017.
  • [3] F. Pomerleau, F. Colas, R. Siegwart, et al., “A review of point cloud registration algorithms for mobile robotics,” Foundations and Trends® in Robotics, vol. 4, no. 1, pp. 1–104, 2015.
  • [4] H. Lei, G. Jiang, and L. Quan, “Fast descriptors and correspondence propagation for robust global point cloud registration,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3614–3623, 2017.
  • [5] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization using direct 2d-to-3d matching,” in 2011 International Conference on Computer Vision, pp. 667–674, Nov 2011.
  • [6] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3d reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011 IEEE, pp. 963–968, IEEE, 2011.
  • [7] C. Brenner, “Building reconstruction from images and laser scanning,” International Journal of Applied Earth Observation and Geoinformation, vol. 6, no. 3-4, pp. 187–198, 2005.
  • [8] Z. Dong, B. Yang, Y. Liu, F. Liang, B. Li, and Y. Zang, “A novel binary shape context for 3d local surface description,” vol. 130, pp. 431–452, 08 2017.
  • [9] T. Sattler, B. Leibe, and L. Kobbelt, “Improving image-based localization by active correspondence search,” in Computer Vision – ECCV 2012 (A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds.), (Berlin, Heidelberg), pp. 752–765, Springer Berlin Heidelberg, 2012.
  • [10] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” in Readings in computer vision, pp. 726–740, Elsevier, 1987.
  • [11] S. Agarwal, K. Mierle, et al., “Ceres solver,” 2012.
  • [12] C. Zhao, W. Shi, and Y. Deng, “A new hausdorff distance for image matching,” Pattern Recognition Letters, vol. 26, no. 5, pp. 581–586, 2005.
  • [13] J. R. Torreão and J. L. Fernandes, “Matching photometric-stereo images,” JOSA A, vol. 15, no. 12, pp. 2966–2975, 1998.
  • [14] Q. Zhao, W. Feng, L. Wan, and J. Zhang, “Sphorb: A fast and robust binary feature on the sphere,” International Journal of Computer Vision, vol. 113, no. 2, pp. 143–159, 2015.
  • [15] C. Dold and C. Brenner, “Registration of terrestrial laser scanning data using planar patches and image data,” International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 36, no. 5, pp. 78–83, 2006.
  • [16] H. Men, B. Gebre, and K. Pochiraju, “Color point cloud registration with 4d icp algorithm,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 1511–1516, IEEE, 2011.
  • [17] J.-Y. Han, N.-H. Perng, and H.-J. Chen, “Lidar point cloud registration by image detection technique,” IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 4, pp. 746–750, 2013.
  • [18] K. Al-Manasir and C. S. Fraser, “Registration of terrestrial laser scanner data using imagery,” The Photogrammetric Record, vol. 21, no. 115, pp. 255–268, 2006.