Joint Point Cloud and Image Based Localization
For Efficient Inspection in Mixed Reality
Abstract
This paper introduces a method of structure inspection using mixed-reality headsets to reduce the human effort in reporting accurate inspection information such as fault locations in 3D coordinates. Prior to every inspection, the headset needs to be localized. While external pose estimation and fiducial-marker-based localization would require setup, maintenance, and manual calibration, marker-free self-localization can be achieved using the onboard depth sensor and camera. However, due to the limited depth sensor range of portable mixed-reality headsets like Microsoft HoloLens, localization based on simple point cloud registration (sPCR) would require extensive mapping of the environment. Also, localization based on a camera image would face issues such as stereo ambiguities and hence depends on viewpoint. We thus introduce a novel approach to Joint Point Cloud and Image-based Localization (JPIL) for mixed-reality headsets that uses visual cues and headset orientation to register small, partially overlapped point clouds and save significant manual labor and time in environment mapping. Our empirical results compared to sPCR show an average 10-fold reduction of required overlap surface area that could potentially save on average 20 minutes per inspection. JPIL is not restricted to inspection tasks; it can also be essential in enabling intuitive human-robot interaction for spatial mapping and scene understanding in conjunction with other agents like autonomous robotic systems that are increasingly being deployed in outdoor environments for applications like structural inspection.
This paper has been accepted for publication at the IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), Madrid, 2018. ©IEEE
I Introduction
The onset of portable mixed-reality headsets like Google Glass and Microsoft HoloLens has enabled efficient human interactions with 3D information. These headsets are found to be suitable for on-site inspection [1, 2], due to their ability to visualize augmented geo-tagged holograms and measure 3D distances by virtue of gaze and gesture. Notably, before every inspection, the headset needs to be localized, preferably with onboard sensors, to establish a common frame of reference for 3D coordinates. As spatial mapping and understanding is key to mixed reality, we can safely assume that the primary sensors would include at least a depth sensor and a camera. However, due to the small form factor and portability, these sensors have limited range. Given a prior 3D model of the structure as a template, existing methods for simple point cloud registration (sPCR) [3, 4] can be employed on a spatial map generated on site by the headset. These methods, however, require significant overlap between the point clouds and thus would require the user to sufficiently map the structure on the order of surface area. Conversely, camera pose estimation using 3D-2D correspondences [5] lacks the desired precision and depends on viewpoint due to stereo ambiguities.
In this paper, we introduce an efficient approach to Joint Point Cloud and Image-based Localization (JPIL) for marker-free self-localization of mixed-reality headsets that requires minimum on-site mapping time. JPIL can successfully register a spatial map that has significantly low overlap with a prior 3D model by combining additional information from the camera image and headset orientation, which are readily available from the onboard camera, IMU and magnetometer sensors. Interestingly, it simplifies the problem to an extent that is critical for operation in symmetric environments. A small spatial map might resemble multiple structures on the prior 3D model; thus, a camera pose estimation (CPE) problem is solved as a function of point cloud registration to differentiate between multiple candidates. The resulting localization accuracy is similar to that of sPCR, albeit at significantly lower point cloud overlap. The contributions of this paper are:
We contribute a JPIL method (Section V) that uses the spatial map (Section III), camera image (Section IV) and headset orientation to localize a mixed-reality headset with minimum mapping time.
A modified binary shape context based 3D point cloud descriptor with an efficient multiple-candidate descriptor matching algorithm (Section III-B).
We empirically show that the JPIL method results in the successful registration of point clouds with overlap as low as and an average 10-fold reduction in required surface area.
Finally, we provide a software implementation (source code: https://bitbucket.org/castacks/jpil) and hardware verification on a Microsoft HoloLens.
II Digital Model and Setup
A common frame of reference is required to report 3D coordinates that can be persistently used across multiple inspection sessions. While a global positioning system (e.g., GPS) could be used, we instead use a digital 3D model (Fig. 2) of the structure and report all 3D coordinates in its reference frame. The model has to be real scale and can be a partial or full representation of the structure. The source for the model can be a computer-aided model or a map generated by other 3D reconstruction methods such as [6, 7]. If a model of a structure does not exist prior to its inspection, then the user can scan the structure to build a map that can be used as the model in future inspections.
We choose a 3D triangular mesh as the format for the model, as ray casting on a mesh for measurement is accurate and computationally cheap due to the surface representation. In most cases, a mixed-reality headset would set up a local frame of reference and an arbitrarily defined origin for its operation. Let define the triangle mesh generated by spatial mapping. Orientation is estimated in the Easting, Northing and Elevation (ENU) geographic Cartesian frame from the IMU and magnetometer for both the models. The models are thus aligned with the ENU frame.
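To illustrate why measurement by ray casting on a triangle mesh is cheap, the following is a minimal sketch of a ray-triangle intersection test using the standard Möller-Trumbore algorithm. It is not taken from the JPIL implementation; the function names `ray_triangle` and `cast_ray`, and the representation of the mesh as a vertex list plus index triples, are assumptions made for this example.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: distance along the ray to the triangle, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:           # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv          # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None  # only hits in front of the origin

def cast_ray(origin, direction, vertices, faces):
    """Closest hit distance over all triangles (brute force), or None."""
    best = None
    for i, j, k in faces:
        t = ray_triangle(origin, direction, vertices[i], vertices[j], vertices[k])
        if t is not None and (best is None or t < best):
            best = t
    return best
```

With a unit direction vector, the returned value is the metric distance from the ray origin (for example, the headset) to the gazed surface point. A practical implementation would typically add a spatial acceleration structure rather than looping over every triangle.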
Localization of the headset has to be performed only once per inspection session, where a transformation is established between and . Headset pose w.r.t at time is tracked by the headset itself. A normal inspection procedure under the proposed method would be to first map a small () structural surface and initiate JPIL with , where represents the time in that instance.
III Point Cloud Registration
JPIL samples and meshes to generate corresponding point clouds for registration. Point cloud registration estimates a transformation matrix given and , such that perfectly aligns with when every point of is transformed by Fig. 3.
(1) 
where and are corresponding points (homogeneous coordinates) in and respectively. A common framework consists of essentially the following steps:
Keypoint extraction,
Descriptor computation,
Descriptor matching, and
Transformation estimation.
The models are aligned with the ENU frame; thus, the transformation will have a negligible rotation component. We use the binary shape context descriptor [8] for steps 1 and 2, while we modify step 1 to incorporate the orientation information. The modified descriptor (tBSC) is now orientation specific. Finally, we propose an efficient algorithm for step 3 (Section III-B). Thus, we discuss steps 1 and 3 in detail while briefly reviewing the other steps.
III-A Keypoint extraction
Keypoint extraction complements the feature descriptors by providing points that are expected to have high feature descriptiveness and are robust to minor viewpoint changes. For each point of a point cloud, we perform two separate eigenvalue decompositions on the covariance matrices and .
The covariance matrices and are constructed using and its neighboring points within a supporting radius , which is user defined. We used . The neighboring points and are denoted as , where is the point and is the number of neighboring points. A new set of points is generated from with only the and component of those points. The covariance matrices and are calculated as follows:
(2) 
(3) 
The eigenvalues in decreasing order of magnitude and , along with the corresponding eigenvectors and respectively, are computed by performing eigenvalue decompositions of and respectively.
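The two decompositions described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: it uses the plain, unweighted sample covariance about the neighborhood mean (the paper's Eqs. (2) and (3) may use a different centering or weighting), it assumes the 2D point set keeps the first two coordinates of each neighbor, and the function name `neighborhood_eigen` is hypothetical.

```python
import numpy as np

def neighborhood_eigen(points, index, radius):
    """Eigen-decompose the 3D and 2D covariances of a point's neighborhood.

    points: (N, 3) array; index: row of the query point; radius: support radius.
    Returns (w3, v3, w2, v2) with eigenvalues sorted in decreasing order and
    eigenvectors stored as matching columns.
    """
    p = points[index]
    dist = np.linalg.norm(points - p, axis=1)
    nbrs = points[dist <= radius]            # neighborhood includes p itself

    # Assumption: unweighted covariance about the neighborhood mean.
    c3 = np.cov(nbrs.T, bias=True)           # 3x3, from all three coordinates
    c2 = np.cov(nbrs[:, :2].T, bias=True)    # 2x2, from the first two coordinates

    # eigh returns ascending eigenvalues for symmetric matrices; reverse them.
    w3, v3 = np.linalg.eigh(c3)
    w2, v2 = np.linalg.eigh(c2)
    return w3[::-1], v3[:, ::-1], w2[::-1], v2[:, ::-1]
```

Using `eigh` rather than `eig` exploits the symmetry of covariance matrices and guarantees real eigenvalues.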
Now, the point is considered a keypoint if:

: Note, are the eigenvalues computed from , which in turn is formed from the 2D distribution of in the plane. This condition, thus, promotes points with well defined eigenvectors.

Highest curvature among its neighbors. The curvature is defined by .
For each keypoint, a local reference frame (LRF) is defined. An LRF imparts invariance in feature descriptors to rigid transformations. Descriptors are computed based on the LRFs; thus, it is critical for LRFs to be reproducible. In [8], the authors define an LRF by adopting as the origin and as the , respectively, where represents the cross product of vectors. However, the orientation of each eigenvector calculated by eigenvalue decomposition has a sign ambiguity. Therefore, there are two choices for the orientation of each axis and thus four possibilities for the LRF. They end up computing descriptors for all four possible LRFs, which leads to reduced precision. This ambiguity is generally faced by all sPCR methods that use eigenvectors for the LRF. In contrast, JPIL adopts as the , respectively, where and are vectors towards magnetic East and North respectively. Thus, the LRF is written as
(4) 
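As a rough illustration of an orientation-defined LRF, the sketch below builds a frame from East and North direction vectors and expresses a neighboring point in it. The axis assignment (x along East, y along North, z as their cross product) is an assumption for this example, as is the re-orthogonalization of North against East to absorb sensor noise; neither is confirmed by the text here, and the function names are hypothetical.

```python
import math

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)

def local_reference_frame(east, north):
    """Return unit axes (x, y, z) from East/North vectors in the map frame."""
    x = _normalize(east)
    # Assumption: remove any component of North along East before normalizing.
    k = _dot(north, x)
    y = _normalize((north[0] - k * x[0], north[1] - k * x[1], north[2] - k * x[2]))
    z = _cross(x, y)             # right-handed third axis
    return x, y, z

def to_lrf(point, keypoint, frame):
    """Coordinates of `point` in the LRF centered at `keypoint`."""
    d = (point[0] - keypoint[0], point[1] - keypoint[1], point[2] - keypoint[2])
    return tuple(_dot(d, axis) for axis in frame)
```

Because the axes come from sensed directions rather than eigenvectors, there is a single frame per keypoint and no need to compute descriptors for several sign choices.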
III-B Multiple Candidate Registration
Since the spatial map would generally be very small compared to the model, the registration would have translational symmetry (Fig. 4). Let the descriptor for a keypoint be defined as and and be the set of keypoints for and respectively.
Let be the Hamming distance between two binary descriptors and , and denote the maximum possible value of . A match is a set of keypoint pairs formed by selecting a keypoint and from and respectively such that , where is a user-definable threshold. contains a set of matches that 1) belong to either one of the possible candidates, and 2) are outliers. We find a family of match sets where a set represents a candidate. is found based on geometric consistency with error margin of and clustering on given by Algorithm 1. We used .
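The thresholded descriptor matching step can be sketched as below. This is a brute-force illustration, not the paper's Algorithm 1: it assumes binary descriptors stored as Python integers, assumes the match test compares the Hamming distance normalized by the descriptor length against the threshold, and omits the geometric-consistency clustering. The names `hamming` and `find_matches` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def find_matches(desc_map, desc_model, n_bits, threshold):
    """Return all index pairs (i, j) whose normalized Hamming distance is small.

    desc_map / desc_model: lists of integer-encoded binary descriptors.
    n_bits: descriptor length, i.e. the maximum possible Hamming distance.
    threshold: user-defined fraction in [0, 1] (assumed normalization).
    """
    matches = []
    for i, a in enumerate(desc_map):
        for j, b in enumerate(desc_model):
            if hamming(a, b) / n_bits < threshold:
                matches.append((i, j))
    return matches
```

Note that one keypoint of the small map is allowed to match several keypoints of the model. Keeping all of them, rather than only the nearest neighbor, is what lets later steps separate the matches into multiple candidates.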
Subroutine called by Algorithm 1 estimates the transformation matrix from the corresponding 3D points using singular value decomposition. An alignment is evaluated based on how similar the two point clouds look after alignment. is transformed by to get , and is clipped to get using a box filter of dimension equivalent to the bounding box of . The idea is to compute two tBSC descriptors and , one each for and with radius equivalent to the longest side of the bounding box, and keypoints and at the box centers respectively. Subroutine returns the alignment cost given by .
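Estimating a rigid transformation from corresponding 3D points with singular value decomposition is commonly done with the Kabsch (orthogonal Procrustes) procedure. The sketch below shows that standard procedure with NumPy; whether the JPIL subroutine follows exactly these steps is an assumption, and `estimate_rigid_transform` is a hypothetical name.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform T (4x4) such that dst ~ R @ src + t.

    src, dst: (N, 3) arrays of corresponding points, N >= 3, not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)

    # Cross-covariance of the centered point sets.
    h = (src - c_src).T @ (dst - c_dst)
    u, _, vt = np.linalg.svd(h)

    # Guard against a reflection: force det(R) = +1.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = c_dst - r @ c_src

    transform = np.eye(4)
    transform[:3, :3] = r
    transform[:3, 3] = t
    return transform
```

Since the two models are already aligned with the ENU frame, the recovered rotation block is expected to be close to identity, which can serve as a quick sanity check on a candidate.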
IV Image Based Candidate Selection
The motivation for JPIL's use of visual cues is the rich amount of information that is present in a single image. While the depth sensor on the headsets has a range of about and about field of view (FOV), a spherical camera can provide full horizontal and vertical field of view. Consider a particular candidate : let its transformation matrix and alignment cost be and respectively. Since the headset pose w.r.t at time is known (Section II), JPIL generates synthetic projection images of by setting a virtual camera at position and headset orientation in the ENU frame.
If is the correct candidate, then the camera image and synthetic image should match better than those of the other candidates, thus implying to be the correct headset pose w.r.t . We evaluate a match based on the distance metric described in Section IV-A.
We demonstrate JPIL with a spherical camera; however, it is not a necessity. The user may use any camera with a suitable FOV for their case. We do, however, recommend the use of spherical cameras, as the localizability increases with FOV, and we discuss the benefits of the same in the experiments.
IV-A 3D-2D Image Match Distance Metric
Given a 3D point cloud structure and a camera image, the estimation of the camera pose is a standard computer vision problem. We follow [5, 9], which solve this problem, and use them as a framework on which to build our method, which supports spherical image projection and an orientation-constrained nonlinear optimization for camera pose.
Since is generated by projecting on a sphere, we can backtrack from a pixel coordinate to the 3D points . The initial goal is to detect 2D-2D image correspondences between and , and establish 3D-2D correspondences after backtracking from Fig. 5.
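The relation between a 3D point and a spherical-image pixel can be sketched as follows. The example assumes an equirectangular image layout and a particular axis convention (longitude measured from the x-axis in the x-y plane, latitude towards z); both are assumptions for illustration and may differ from the convention used in JPIL. The function names are hypothetical.

```python
import math

def project_to_pixel(point, center, width, height):
    """Project a 3D point onto an equirectangular image centered at `center`."""
    dx, dy, dz = (point[i] - center[i] for i in range(3))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    lon = math.atan2(dy, dx)                 # in [-pi, pi]
    lat = math.asin(dz / r)                  # in [-pi/2, pi/2]
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    return u, v

def pixel_to_ray(u, v, width, height):
    """Inverse mapping: unit direction on the projection sphere for a pixel."""
    lon = u / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - v / height * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))
```

A pixel alone only fixes a viewing direction. Backtracking to a full 3D point is possible for the synthetic image because it is rendered from known geometry, so a renderer can, for instance, store the originating 3D point (or its depth) alongside each pixel.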
Let where is kernel dimension. Standard deviation . JPIL rejects a correspondence if , assuming the 3D point to be uncertain.
Let and be the 3D point and image pixel respectively of correspondence. is the position of the headset in according to candidate. Using spherical projection, we can project to a point on the projection sphere. Also, we can project every point to a point on the projection sphere as a function of orientation and position (viewpoint). Thus, we can solve for by minimizing the cost on the set of 3D-2D correspondences with RANSAC [10] based outlier rejection.
(5) 
where is Roll, Pitch or Yaw angle and is the allowed flexibility. We use the Ceres solver [11] for the optimization. Many distance metrics for evaluating the similarity of two images exist in the literature, such as Hausdorff distance [12], photometric error [13] and the homography-consistent percentage of inlier matches. Since we have the 3D information of an image, we instead check for geometric consistency with the error in position given by :
(6) 
IV-B Confident Positive Matches
The cost function is nonlinear and, due to noise in feature matching, the optimization might reach a false local minimum, giving an erroneous estimate. Therefore, metric is only good to determine if a match is confident positive given by:
(7) 
where depends on the noise level and we used .
V Joint Point Cloud and Image Based Localization
To summarize, JPIL has as its inputs a reference model , a small 3D map of the structure scanned in the particular session , headset position , headset orientation in the ENU frame and a spherical image at time . The output of JPIL (Algorithm 2) is , the transformation of to , such that the headset position w.r.t can be estimated.
VI Experimental Results
We performed experiments to evaluate the following:

Performance of CPE with orientation constraints and differentiability between candidates.

Tolerance of tBSC to error in orientation.

Relation of to error in orientation and relative distance between the camera poses of the two images.

Relation of to .

Reduction in minimum required surface area and mapping time.
The experiments were performed on a Microsoft HoloLens headset with a Ricoh Theta S camera (Fig. 6). We used an offboard computer to process the data. The HoloLens uploaded ( MB), ( kB for pixels), and to the server. ( GB) was already present on the server, and an instance of JPIL was running to accept communication from the HoloLens and return the transformation . We used SPHORB [14] feature descriptors for the spherical images.


We tested JPIL in real-world as well as simulated environments. Real-world data from the HoloLens was collected at the Charles Anderson bridge, Pittsburgh, PA, USA. We used a high-precision Terrestrial Laser Scanner (TLS) to generate dense RGB . The ground truth positions for the experiments were measured using Electronic Distance Measuring (EDM) survey equipment that can report the position of an EDM prism with millimeter accuracy. We manually calibrated the EDM measurement w.r.t .
VI-A Performance of CPE with orientation constraints and differentiability between candidates.
The cost function (5) might have many local minima due to erroneous feature matches that might get selected by RANSAC as inliers. We performed CPE for 22 image pairs with varying and computed the average error and standard deviation for each (Fig. 7). We observed that errors increase drastically with , as expected. We observed a slight decrease at , which can be credited to the flexibility that allowed the optimization to minimize considering errors in estimate. We also evaluated how discriminative CPE is to candidate positions. We took a spherical image and generated multiple spherical images at position with error increments of along the bridge. From Fig. 7 we observe that becomes unstable with increasing error, and thus the concept of confident positive matching works well to discriminate between candidates that are further away from the nominal position.


VI-B Tolerance of tBSC to error in orientation.
We added error to the orientation estimates along the X, Y and Z axes individually in increments of . We then performed tBSC registration and selected the candidate with the transformation estimate closest to the ground truth transformation. The results in Fig. 8 show a minimum tolerance of for error within 0.6m. This indicates the rotation specificity of the LRF as well as robustness to error in sensor values.
VI-C Relation of to error in orientation.
is generated as a function of . We wanted to evaluate how error in CPE is affected by error in . So, we added errors in and performed CPE for each pair of and . We observe that CPE is tolerant to error in orientation estimates up to Fig. 9.
VI-D Relation of to
Fig. 10 shows an example synthetic spherical image and a heatmap visualization of at each of its pixels. gives the uncertainty measure of a 3D-2D correspondence. Correspondences with a high value might result in more erroneous CPE, while a generous threshold would increase the number of inliers that can constrain the camera pose. We performed CPE with varying on an image pair. From Fig. 11 we can observe that the variance of certainly increases; however, the relative number of inliers increases too.
VI-E Reduction in minimum required surface area and mapping time.
JPIL is targeted towards enabling accurate localization for very low overlap point clouds, which requires significantly less user time. In sPCR, a user would need to walk on site while mapping the structure to build a sufficiently large map. We simulated random walks in the vicinity of point clouds generated by the HoloLens. As the walk distance increased, more parts of the HoloLens point cloud were included for registration. We wanted to evaluate tBSC and BSC as a function of surface area mapped. Thus, we generated a cumulative distribution function of the minimum surface area required by these methods for successful localization on 15 real datasets (Fig. 12). We observe an average 10-fold reduction in the surface area required relative to sPCR. For one dataset, we achieved a reduction from to and the time difference was minutes. Finally, performing JPIL over 12 datasets with EDM ground truth, we observe an average accuracy of 0.28m for tBSC registration and 1.21m for CPE (Fig. 13). The surface area was calculated by remeshing the point clouds and summing up the area of each triangle. The surface also included parts of the environment other than the structure; thus, the required surface area in practice would be less than the values shown.
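Summing triangle areas over a mesh, as used for the surface-area figures above, can be sketched in a few lines. The mesh representation (a vertex list plus index triples) and the function names are assumptions for this example; the remeshing step itself is not shown.

```python
import math

def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the edge cross product."""
    ab = (b[0] - a[0], b[1] - a[1], b[2] - a[2])
    ac = (c[0] - a[0], c[1] - a[1], c[2] - a[2])
    cx = ab[1] * ac[2] - ab[2] * ac[1]
    cy = ab[2] * ac[0] - ab[0] * ac[2]
    cz = ab[0] * ac[1] - ab[1] * ac[0]
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

def mesh_surface_area(vertices, faces):
    """Total surface area of a triangle mesh given as vertices and index triples."""
    return sum(triangle_area(vertices[i], vertices[j], vertices[k])
               for i, j, k in faces)
```

For instance, a unit square split into two triangles has a total area of 1 in the units of the vertex coordinates, so a real-scale mesh in meters yields square meters directly.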


VII Related Works
While we have covered the related works in the above text, here we highlight a few other point cloud registration methods that use visual cues. Dold and Brenner [15] use planar patches from image data to refine a point cloud registration, whereas Men et al. [16] use hue as a fourth dimension (x,y,z,hue) and search for correspondences in 4D space. Similarly, the authors of [17, 18] use 2D image features in a tightly coupled framework to direct point cloud registration. These methods require accurate calibration between the Lidar scans and camera images and work well for dominantly planar structures without stereo ambiguities. When errors in sensor calibration, stereo ambiguities and complex 3D environments are accounted for, local image features tend to fail and thus decrease the robustness due to their tightly coupled nature.






VIII Conclusion and Future Work
We have presented a marker-free self-localization method for mixed-reality headsets and emphasized that data from three onboard sensors (a depth sensor, a camera and an IMU) are critical for efficient localization. Our method is robust against errors in orientation estimation up to , which is generous for most sensors. The localization accuracy of 0.28m is comparable to that of sPCR while requiring 10-fold less surface area on average. Our method does not evaluate the user's selection of . Practically, the user should generate from well-defined structures with minimum symmetry, which also exist in . In the future, we would like to explore the 3D information from time series image data to further enhance the efficiency and robustness of this method.
IX Acknowledgment
The authors would like to thank Daniel Maturana of Carnegie Mellon University for his input in the initial phase of the framework development and for proofreading.
References
 [1] A. Webster, S. Feiner, B. MacIntyre, W. Massie, and T. Krueger, “Augmented reality in architectural construction, inspection and renovation,” in Proc. ASCE Third Congress on Computing in Civil Engineering, pp. 913–919, 1996.
 [2] F. Moreu, B. Bleck, S. Vemuganti, D. Rogers, and D. Mascarenas, “Augmented reality tools for enhanced structural inspection,” Structural Health Monitoring 2017, 2017.
 [3] F. Pomerleau, F. Colas, R. Siegwart, et al., “A review of point cloud registration algorithms for mobile robotics,” Foundations and Trends® in Robotics, vol. 4, no. 1, pp. 1–104, 2015.
 [4] H. Lei, G. Jiang, and L. Quan, “Fast descriptors and correspondence propagation for robust global point cloud registration,” IEEE Transactions on Image Processing, vol. 26, no. 8, pp. 3614–3623, 2017.
 [5] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localization using direct 2D-to-3D matching,” in 2011 International Conference on Computer Vision, pp. 667–674, Nov 2011.
 [6] A. Geiger, J. Ziegler, and C. Stiller, “StereoScan: Dense 3D reconstruction in real-time,” in Intelligent Vehicles Symposium (IV), 2011 IEEE, pp. 963–968, IEEE, 2011.
 [7] C. Brenner, “Building reconstruction from images and laser scanning,” International Journal of Applied Earth Observation and Geoinformation, vol. 6, no. 3-4, pp. 187–198, 2005.
 [8] Z. Dong, B. Yang, Y. Liu, F. Liang, B. Li, and Y. Zang, “A novel binary shape context for 3D local surface description,” vol. 130, pp. 431–452, Aug 2017.
 [9] T. Sattler, B. Leibe, and L. Kobbelt, “Improving image-based localization by active correspondence search,” in Computer Vision – ECCV 2012 (A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, eds.), (Berlin, Heidelberg), pp. 752–765, Springer Berlin Heidelberg, 2012.
 [10] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” in Readings in computer vision, pp. 726–740, Elsevier, 1987.
 [11] S. Agarwal, K. Mierle, et al., “Ceres solver,” 2012.
 [12] C. Zhao, W. Shi, and Y. Deng, “A new Hausdorff distance for image matching,” Pattern Recognition Letters, vol. 26, no. 5, pp. 581–586, 2005.
 [13] J. R. Torreão and J. L. Fernandes, “Matching photometric-stereo images,” JOSA A, vol. 15, no. 12, pp. 2966–2975, 1998.
 [14] Q. Zhao, W. Feng, L. Wan, and J. Zhang, “SPHORB: A fast and robust binary feature on the sphere,” International Journal of Computer Vision, vol. 113, no. 2, pp. 143–159, 2015.
 [15] C. Dold and C. Brenner, “Registration of terrestrial laser scanning data using planar patches and image data,” International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 36, no. 5, pp. 78–83, 2006.
 [16] H. Men, B. Gebre, and K. Pochiraju, “Color point cloud registration with 4D ICP algorithm,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 1511–1516, IEEE, 2011.
 [17] J.-Y. Han, N.-H. Perng, and H.-J. Chen, “Lidar point cloud registration by image detection technique,” IEEE Geoscience and Remote Sensing Letters, vol. 10, no. 4, pp. 746–750, 2013.
 [18] K. Al-Manasir and C. S. Fraser, “Registration of terrestrial laser scanner data using imagery,” The Photogrammetric Record, vol. 21, no. 115, pp. 255–268, 2006.