Feasibility of Video-based Sub-meter Localization on Resource-constrained Platforms
While the satellite-based Global Positioning System (GPS) is adequate for some outdoor applications, many other applications are held back by its multi-meter positioning errors and poor indoor coverage. In this paper, we study the feasibility of real-time video-based localization on resource-constrained platforms. Before commencing a localization task, a video-based localization system downloads an offline model of a restricted target environment, such as a set of city streets, or an indoor shopping mall. The system is then able to localize the user within the model, using only video as input.
To enable such a system to run on resource-constrained embedded systems or smartphones, we (a) propose techniques for efficiently building a 3D model of a surveyed path, through frame selection and efficient feature matching, (b) substantially reduce model size by multiple compression techniques, without sacrificing localization accuracy, (c) propose efficient and concurrent techniques for feature extraction and matching to enable online localization, and (d) propose a method that interleaves feature matching with optical-flow-based tracking to reduce the feature extraction and matching time in online localization.
Based on an extensive set of both indoor and outdoor videos, manually annotated with location ground truth, we demonstrate that sub-meter accuracy, at real-time rates, is achievable on smart-phone type platforms, despite challenging video conditions.
Localization technology is the key enabler of many important mobile and sensing applications today. However, the inherent limitations of current localization technology often limit the performance of current applications, and render many others infeasible.
In an outdoor setting, the Global Positioning System (GPS) is in widespread use today. Consumer grade GPS receivers generally encounter an error of 5–10 meters [gps_accuracy1] under ideal conditions, and over 100 meters [gps_accuracy_urban1] under non-ideal conditions, such as in urban canyons where tall buildings result in an obstructed view of the sky and multi-path effects. Cellular [cell-tower] and Wi-Fi [placelab, ap-fingerprinting-kalman, metropolitan-wifi] localization can complement GPS in outdoor areas with poor or no GPS signal. Cellular localization often incurs errors of 100s of meters in urban areas to a few kilometers in rural areas, whereas outdoor Wi-Fi localization is limited to urban areas due to the short range of Wi-Fi access points, and offers lower accuracy than GPS.
Indoor localization is even more challenging, as the GPS signal is typically unavailable indoors. There has been significant research on indoor localization using other techniques such as ultrasonic sensors [cricket, active_bat], acoustic beacons [acoustic, walrus, sorroundsense], light [loc_light, sorroundsense], and Wi-Fi [radar, loc_wo_pain, loc_fine_grained, zero_cost, loc_limit, spotloc]. Ultrasonic and acoustic sensors can offer good accuracy; however, such methods require adding instrumentation to the space, which can be impractical.
More demanding applications such as autonomous vehicles [kitti-autonomous, apolloscape], pedestrian navigation for the visually impaired [blind-nav], and self-guided museum tours [virtualtourist], require sub-meter localization. Sub-meter accuracy is generally available only using expensive hardware such as survey-grade GPS receivers, high-cost IMU, and LIDAR, and is largely unavailable indoors, or in a pedestrian context.
By contrast, video-based localization [get_out_of_my_lab, wide_area_loc, loc_sfm, global_slam_loc] is applicable both indoors and outdoors and has been shown to achieve sub-meter accuracy. However, video-based localization can be very compute- and storage-intensive. In this work, we investigate the feasibility of real-time, sub-meter localization using a smartphone or other resource-limited platforms, through an end-to-end video localization system optimized for this target. The primary contributions of this paper are as follows.
Efficient and accurate 3D model construction with video data by prioritized and filtered feature matching.
Compression of a 3D model by reducing both 3D points and image features to reduce storage needs.
Fast feature extraction by subdivision of a video frame, and parallelized, incremental feature computation.
Interleaved feature matching and optical flow-based feature tracking for real-time localization.
An end-to-end system combining multiple interdependent components involving efficient model building and online localization.
Evaluation of our system with meticulously annotated ground-truth data.
The rest of the paper is organized as follows. We provide a system overview in §2. In §3, we discuss the 3D reconstruction of the environment, our contributions to achieve high-quality reconstruction, and 3D model compression. We present the localization pipeline in §4, where we discuss our contributions to make the pipeline faster to achieve near real-time operation. In §5 we discuss evaluation challenges and our ground truth datasets. In §6, we present the results of localization accuracy and other performance measures. Finally, we discuss the related work in §7.
2 System Overview
The envisioned localization system consists of four main phases: survey, offline model creation, model retrieval, and online localization. We describe each in some detail below.
In the survey phase, a content creator traverses the target environment, recording a video along the way. It is not necessary to record GPS coordinates during the survey phase, although if localization within an Earth reference frame is required, this can be convenient. For large scale collection outdoors, a survey grade GPS and IMU may be used during the survey phase, to automatically tag each image with an accurate location.
Here, the target environment could be an indoor shopping mall, the walk to the bus stop, or an individual room. How much of the environment to capture during this traversal depends on the application, but generally speaking, a more complete model of the space may result in more robust localization. It is important that any significant features, such as shop doors, elevator buttons, or crosswalks, be well captured by the video.
2.2 Model Creation
In the model creation phase, the video is processed to automatically create a 3D model of the space visited. After processing, the content creator is presented with an interactive 3D model, which they annotate with any features significant to the intended navigation task. Generally speaking, the model is an independent reference frame, and localization is with respect to the model. To provide accurate model scaling, the content creator may annotate a known distance between two points within the model. If the target application needs to locate the model within the Earth reference frame, the content creator may annotate three or more points within the model with earth coordinates collected separately. Finally, if frames are tagged with accurate GPS coordinates, this may also be used.
In more detail, we first extract keypoints and descriptors from the video frames and match descriptors among adjacent frames. For faster matching, we build an Approximate Nearest-Neighbor (ANN) index for all descriptors of a frame and match the descriptors of adjacent frames using the index. Second, we use the descriptor matches to reconstruct the 3D model of the environment using Structure-from-motion (SfM) [bundler, bundler2, sfm-rome]. This stage produces a 3D point cloud representing the environment along with corresponding image descriptors from video frames. Additionally, we construct an ANN index from all image descriptors for efficient descriptor matching during the online localization stage.
The produced model thus consists of a set of named locations, a 3D point cloud with visual features attached to each point, and an associated visual feature index.
2.3 Model Retrieval
Before commencing localization in a given space, a model of the space is downloaded. Model identification can be by coarse location, by name, or by any other identifier. Although incremental model download would be a straightforward extension, we assume that a complete model of the space is downloaded in one step.
2.4 Online Localization
Online localization continuously matches visual features in the live camera view against visual features in the 3D model, and tracks identified points in the space as the user moves. With a sufficient number of correctly identified 3D points, the camera’s pose is accurately recovered.
More specifically, we extract keypoints and descriptors from each video frame, and match these descriptors with the survey descriptors using the nearest-neighbor index. Since we know the correspondence between 3D points and image descriptors for the offline survey, we can find the correspondence between descriptors of localization frames and 3D points. Using these correspondences, we compute 6 degree-of-freedom (DOF) pose/location. Finally, we apply a Kalman filter [kalman] to compute optimal poses in the presence of noise and inaccuracies.
Below, we discuss the model creation and online localization phases in more detail.
3 3D reconstruction of an environment
3D reconstruction of an environment is the most significant component of our offline survey process. Figure 1 shows the stages of the 3D reconstruction pipeline. We describe the stages of this process in more detail below.
It is possible to reconstruct a 3D environment from multiple images from different viewpoints using Structure-from-motion (SfM) [bundler, bundler2, sfm-rome]. SfM reconstructs the original 3D world from a sequence of images. This is similar to stereo-vision, where two cameras are used to infer depth, and subsequently, reconstruct an environment. However, unlike stereo-vision, SfM can reconstruct an environment from a single camera due to the disparity between the images resulting from camera movement.
We reconstruct the 3D model of an environment from video recorded with a smartphone while walking or driving. We use the open source tool Bundler [bundler_url] for the reconstruction. One important aspect of the reconstruction process is that the success and quality of the reconstruction are directly dependent on the underlying feature matching quality. We use three heuristics to provide good matching for the reconstruction process:
We match keypoint descriptors of a frame with up to 300 adjacent frames (10 seconds of video at 30 frames/sec). According to our experiments, a 3D point gets reconstructed mostly from adjacent video frames. Figure 2 shows an example of this. It shows the CDF of the count of frames that contribute to the construction of a 3D point, for all points in a reconstruction. Here, almost all points in the 3D reconstruction come from the nearest 300 image frames. Additionally, limiting matching to a bounded number of adjacent frames reduces the ambiguity that can arise from spurious matches from distant frames.
Within the 300 adjacent frames, we prioritize matches for nearby frames as they are more likely to create consistent 3D points, compared to matches from more distant frames. Accordingly, we gradually reduce the total number of matches we keep for reconstruction from the distant frames. Here, we use the ratio test [sift] (ratio=0.7) to remove ambiguous matches and order the matches by L2 distance for the gradual reduction.
For successful triangulation during SfM reconstruction, it is important to have a disparity between two frames so that the triangulation converges. Hence, we do not use descriptor matching between adjacent frames as there is little motion or disparity between them. Instead, we match descriptors for regularly spaced frames. We experimented with various intervals such as every 5th frame, 10th frame, 20th frame, etc. We finally used the 10 frame interval since it provided the best trade-off in terms of model size, runtime, and localization accuracy.
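The three heuristics above can be illustrated with a minimal, pure-Python sketch. It assumes descriptors are plain lists of floats and uses brute-force search in place of the ANN index. The function names and the linear `match_budget` schedule (including its `max_matches` and `min_matches` values) are hypothetical, since the text does not specify how the number of retained matches decays with frame distance.

```python
import math

def select_keyframes(num_frames, step=10):
    """Indices of regularly spaced frames (every `step`-th frame)."""
    return list(range(0, num_frames, step))

def l2(a, b):
    """Euclidean (L2) distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(query, reference, ratio=0.7):
    """Keep a match only if its best distance is clearly smaller than the
    second-best distance; return matches sorted by ascending L2 distance."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((l2(q, r), ri) for ri, r in enumerate(reference))
        if len(dists) < 2:
            continue  # the ratio test needs two candidates
        (d1, best), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((qi, best, d1))
    matches.sort(key=lambda m: m[2])
    return matches

def match_budget(frame_gap, window=300, max_matches=1000, min_matches=100):
    """Hypothetical schedule: keep fewer matches as the frame gap grows,
    and none beyond the matching window."""
    if frame_gap > window:
        return 0
    frac = 1.0 - frame_gap / window
    return int(min_matches + frac * (max_matches - min_matches))
```

For a pair of frames separated by `gap` frames, one would keep `ratio_test_matches(...)[:match_budget(gap)]`, so that the strongest matches survive and distant frame pairs contribute fewer of them.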
Figure 6 shows an example reconstruction. Here, we use a sparse point cloud reconstruction for localization purposes. However, we also show the dense reconstruction for visualization. The dense reconstruction shows that the real world is accurately represented in the 3D reconstruction. Here, we clearly see environment details such as the stop sign, the road closure barricade, the green street sign, and the Subway logo.
3.2 3D model compression
3D model compression reduces both storage and computational needs by removing redundant data from the model. We use two techniques for 3D model compression: 1) reduction of 3D points of the model, 2) reduction of descriptors by averaging all descriptors corresponding to a 3D point.
The 3D reconstructed model obtained using Structure-from-Motion contains a large number of 3D points and their corresponding descriptors, consuming significant memory. For example, one of our 3D reconstructions from 700 images produced 113,000 3D points, and 630,000 distinct, 128-dimensional descriptors. Many of these 3D points only correspond to a small number of frames. Figure 8 shows the 3D point count for various minimum corresponding frame counts. Here, approximately 46% of the total points appeared in only 2–3 frames. Accordingly, we can substantially compress the 3D model by removing 3D points that correspond to a small number of frames. After all, we only need a small number of correct correspondences from the 3D model during localization, and points that are rarely seen in the survey phase are unlikely to be commonly observed during online localization.
Moreover, the descriptors corresponding to a 3D point are by definition very similar. Figure 8 shows an example of this. Here, we show the component values of the SIFT descriptor vectors for matching descriptors corresponding to a 3D point. Since all these matching descriptors are very similar, we replace them all by the mean descriptor of the point, resulting in additional compression of the 3D model.
Table 1 shows the storage requirement for a 3D model constructed from video of a 300-meter long urban street. Here, the storage requirement after two-stage compression is 7.5 MB compared to the uncompressed storage of 91 MB, a 12× size reduction. As we show in §6.3, this type of 3D model compression has a negligible effect on localization performance.
|3D points kept||Descriptors kept||Storage|
|All||All||91 MB|
|At least 10 frames||All||41 MB|
|At least 10 frames||Mean||7.5 MB|
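The two compression stages can be sketched as follows. This is a minimal illustration, assuming the model is held as a dictionary from a point identifier to its list of (frame id, descriptor) observations; the data layout and the name `compress_model` are assumptions made for the example, not the system's actual storage format.

```python
def compress_model(points, min_frames=10):
    """Two-stage compression sketch.

    points: dict mapping point_id -> list of (frame_id, descriptor),
            where each descriptor is a list of floats of equal length.
    Returns a dict mapping point_id -> mean descriptor, containing only
    points observed in at least `min_frames` distinct frames.
    """
    compressed = {}
    for point_id, observations in points.items():
        frames = {frame_id for frame_id, _ in observations}
        if len(frames) < min_frames:
            continue  # stage 1: drop rarely observed 3D points
        descriptors = [d for _, d in observations]
        dim = len(descriptors[0])
        # Stage 2: replace all descriptors of the point by their mean.
        compressed[point_id] = [
            sum(d[i] for d in descriptors) / len(descriptors)
            for i in range(dim)
        ]
    return compressed
```

With the storage figures reported above, the combined effect is 91 MB / 7.5 MB ≈ 12.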
3.3 3D scaling and alignment
The 3D reconstruction obtained using SfM typically has an arbitrary scale and orientation. All 3D points are consistent relative to each other within the model, but absent scaling and alignment, they do not represent real-world dimensions. For the most accurate alignment with the earth coordinate system, the content creator can specify the precise coordinates of three points within the model.
If accuracy of alignment is less important, an outdoor deployment can leverage GPS coordinates from the survey video. While the instantaneous accuracy of a consumer grade GPS is low, averaging over hundreds of GPS samples spanning a longer recording will eliminate much of the error. If needed, the survey can be performed with a survey grade GPS.
For our evaluation purposes, we use the ground-truth scale and orientation information recorded during the survey. We describe our approach to ground-truth data collection in §5.
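As a small illustration of the scaling step only, the sketch below derives a scale factor from one annotated distance between two model points and applies it to the point cloud. The function names are hypothetical, and alignment of orientation and position with the Earth frame (which needs three or more annotated points) is not covered.

```python
import math

def scale_factor(model_point_a, model_point_b, known_distance_m):
    """Meters per model unit, from one annotated real-world distance."""
    model_distance = math.dist(model_point_a, model_point_b)
    if model_distance == 0:
        raise ValueError("annotated points must be distinct")
    return known_distance_m / model_distance

def apply_scale(points, scale):
    """Scale every 3D point of the model uniformly."""
    return [tuple(scale * c for c in p) for p in points]
```

For example, if two annotated points are 5 model units apart and the measured distance is 10 meters, every coordinate is multiplied by 2.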
3.4 Index for efficient matching
Each 3D point in the reconstructed model has a corresponding set of matched image descriptors. During localization, we match image descriptors of a localization frame with the descriptors of survey frames to find the 3D point correspondence. Matching a localization frame with a subset of reference frames in a one-to-one fashion is prohibitively expensive. Hence, we construct an approximate nearest-neighbor (ANN) index using the FLANN [flann1, flann2, flann3] library with all descriptors of the 3D model. Approximate matching with the ANN index provides adequate accuracy for localization.
4 Online localization
Below, we discuss the design of the online localization pipeline, with the goal of real-time operation.
Figure 9 shows our localization pipeline in detail. Throughout, the focus is on minimizing latency, through pipelining, parallelization, and sampling. To provide a high rate of localization, we separate localization into parallel matching and tracking processes.
Matching identifies key points in the image, and matches these against the 3D model. This is the key operation in the localization pipeline, but it is computationally both highly variable and quite demanding, consuming 1–2 seconds per frame on a server core. To support a typical video frame rate, we parallelize the matching process, and adaptively sample incoming frames to match available computational resources. Matching thus slowly but continuously adds key points to the tracking set.
Tracking quickly tracks key points already in the tracking set, using optical flow. Optical flow is fast, but not very robust. Typically, we are able to successfully track a point for 50–100 frames before tracking fails, and the point is removed from the tracking set.
Given a tracking set of 2D points, we compute their correspondence to the 3D points in the point cloud, then use a RANSAC technique to estimate the final 6-DOF pose. Finally, we filter the stream of poses using a Kalman filter to remove noise and other aberrations. The Kalman filter output is also used to inform the correspondence computation. We describe these stages in more detail below.
4.1 Keypoint and descriptor computation
SIFT keypoint and descriptor computation generally takes 1 to 2 seconds for a 1280x720 video frame on a server CPU, and approximately 10× longer on a smartphone core. We use two techniques to amortize this computation time. First, we subdivide a video frame into smaller segments and compute SIFT keypoints and descriptors for some of these segments in parallel. In our experiments, we use 8 cores for parallel computation since many recent embedded processors and smartphones have 8 cores. Second, even with parallel threads, the frame arrival rate from a camera is still far greater than the SIFT computation rate. However, successful localization does not require computing keypoints and descriptors for all segments as they arrive. Accordingly, we compute the SIFT keypoints and descriptors on a sampled subset of frames, and interleave this with optical flow tracking (described in §4.3).
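The frame subdivision idea can be sketched as below. This is only a structural illustration. `detect_in_tile` is a placeholder that reports bright pixels and stands in for a real SIFT detector, the 4x2 tile layout is an assumption chosen to give 8 segments, and a real implementation would likely add overlap between tiles so that keypoints near tile borders are not lost.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tiles(width, height, cols=4, rows=2):
    """Return (x0, y0, x1, y1) boxes that cover the frame; 4 x 2 = 8 tiles."""
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return tiles

def detect_in_tile(frame, box):
    """Placeholder detector (NOT SIFT): bright pixels as 'keypoints',
    reported in full-frame coordinates."""
    x0, y0, x1, y1 = box
    return [(x, y) for y in range(y0, y1) for x in range(x0, x1)
            if frame[y][x] > 200]

def detect_parallel(frame, workers=8):
    """Run the detector on all tiles concurrently and merge the results."""
    height, width = len(frame), len(frame[0])
    tiles = split_into_tiles(width, height)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_tile = list(pool.map(lambda box: detect_in_tile(frame, box), tiles))
    return sorted(kp for tile_kps in per_tile for kp in tile_kps)
```

Note that Python threads do not speed up pure-Python work such as this placeholder; the sketch shows the partitioning and merging only. The speedup described above presumes a native detector running on separate cores.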
4.2 Descriptor matching
SIFT descriptor matching is one of the most expensive stages of the localization pipeline. To find a descriptor’s most similar feature in the point cloud (i.e., its exact nearest neighbor), the brute-force method is to compute the L1 or L2 norm of the difference between the descriptor and every model descriptor, and pick the descriptor with the minimum distance. While brute-force matching guarantees the best accuracy, it is not practical due to excessive computation time.
For fast matching, we create an Approximate Nearest-Neighbor (ANN) index for all survey video frames within an area (e.g., 100 meters) in the offline modeling phase. Using the index, we match the localization frame with all survey frames in a single step. Although this method can result in some inaccuracy, it is sufficient for localization as we can filter out spurious matches in later stages. However, even this matching scheme requires over 1 second per frame on a server core. Thus, as in the keypoint and descriptor computation, descriptor matching is performed on sampled frames, and in the background.
4.3 Keypoint tracking
Given the coordinates of some points in one video frame, optical flow [optical_flow] computes the new coordinates of those points in a successive frame caused by the movement of the object or camera. Optical flow is fast because it only needs to search the adjacent pixels of a point for the new location.
Tracking with optical flow works well primarily in constant lighting conditions and with small changes to the viewpoint. By contrast, descriptor matching is robust to varying lighting conditions. After localization of a video frame, we are able to use optical flow tracking for subsequent frames as the lighting condition is likely to remain the same for some time. Eventually, however, the optical flow tracking is bound to fail for some points because of the change in viewpoint. To compensate for this, we continuously replenish the working set with new matched points.
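The interplay between tracking and replenishment can be sketched with a small bookkeeping class. This is an illustrative structure under stated assumptions: the class name `TrackingSet` is hypothetical, and `flow` stands in for an optical flow routine that returns the new image position of a point, or `None` when tracking fails.

```python
class TrackingSet:
    """Working set of tracked keypoints, keyed by 3D point identifier."""

    def __init__(self):
        self.points = {}  # point3d_id -> (u, v) image coordinates

    def add_matches(self, matches):
        """Replenish the set with points delivered by descriptor matching."""
        self.points.update(matches)

    def track(self, flow):
        """Advance every point by one frame; drop points whose tracking
        failed. Returns the number of points still tracked."""
        survivors = {}
        for point_id, uv in self.points.items():
            new_uv = flow(uv)
            if new_uv is not None:
                survivors[point_id] = new_uv
        self.points = survivors
        return len(self.points)
```

In a frame loop, `track` would run on every frame, while `add_matches` would be called whenever the slower background matching finishes a sampled frame.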
Figure 10 shows the frame count/duration of successfully tracked keypoints using optical flow. Here, the left y-axis shows error in pose estimation with only optical flow tracking for consecutive frames. The right y-axis shows the count of successfully tracked keypoints and also the count of pose estimation inliers for consecutive frames. Over time, the number of successfully tracked points (and inliers) drops, and as a result, localization accuracy using only optical flow collapses after approximately 150 frames (or 5 seconds, at 30 Hz frame rate).
4.4 2D-3D Correspondence and pose estimation
We find the correspondence between localization frame keypoints and the 3D model points by combining the matching of localization frame descriptors with the surveyed model’s descriptors, and the mapping between survey descriptors and 3D model points. The correspondence count between a single frame and the 3D model can be large, as a 1280x720 video frame can have several thousand SIFT descriptors and a majority of them may match with the model descriptors if we are in a well-surveyed location.
Perspective-n-point algorithms [p3p, epnp] may be used to compute a 6-DOF pose/location. These algorithms can compute pose with just 3 accurate correspondences [p3p] or a few more [epnp]. However, some of our correspondences are likely to be spurious, due to errors in matching and tracking. To filter out such spurious correspondences, we use a RANSAC [ransac, ransac_eval, optimal_random_ransac] pose estimator, which iteratively finds the correct pose.
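To make the hypothesize-and-verify loop of RANSAC concrete, the sketch below applies it to a deliberately simpler toy problem: estimating a 2D translation from point pairs contaminated by outliers. It is not a pose estimator. In the pipeline described above, the minimal sample would instead be a handful of 2D-3D correspondences passed to a perspective-n-point solver, and the inlier test would use reprojection error. The function name, threshold, and iteration count are assumptions made for the example.

```python
import random

def ransac_translation(pairs, threshold=1.0, iterations=50, seed=0):
    """Toy RANSAC for the model (x', y') = (x, y) + t.

    pairs: list of ((x, y), (x', y')). Returns (t, inliers).
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Hypothesize: one pair is a minimal sample for a translation.
        (px, py), (qx, qy) = rng.choice(pairs)
        tx, ty = qx - px, qy - py
        # Verify: collect all pairs consistent with the hypothesis.
        inliers = [
            (p, q) for p, q in pairs
            if abs(q[0] - p[0] - tx) <= threshold
            and abs(q[1] - p[1] - ty) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the model on the largest consensus set.
    n = len(best_inliers)
    tx = sum(q[0] - p[0] for p, q in best_inliers) / n
    ty = sum(q[1] - p[1] for p, q in best_inliers) / n
    return (tx, ty), best_inliers
```

The number of iterations needed grows with the outlier fraction, which is one reason to pre-filter correspondences before handing them to the estimator.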
The RANSAC pose estimator works with hundreds of correspondences and will successfully remove false correspondences (outliers). However, working with a smaller set of correspondences is beneficial since RANSAC pose estimation converges faster with a smaller number of correspondences. Thus, we prioritize the correspondences by quality during descriptor matching in order to take the top correspondences for RANSAC pose estimation. To approximate the quality, we use two parameters: projection error and descriptor distance. At first, we find the projection error (the difference between the projection of 3D points onto the image plane and the corresponding keypoints) using the Kalman filter’s prediction of the next pose. The projection error can be used to filter out erroneous matches. For example, Figure 12 shows the CDF of projection error of both inliers and outliers for an image frame. Here, 95% of the inliers have projection error below 10 pixels and most of the outliers have projection error above 10 pixels. Accordingly, we discard the correspondences that have a projection error above 10 pixels.
Next, we order the remaining correspondences using the matching distance so that we can pick the top for pose estimation. We use in this paper. Figure 11 shows the mean computation time of RANSAC pose estimator with varying correspondence count, which shows that the time required by RANSAC reduces on average as the total number of correspondences is reduced.
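The prioritization described above can be sketched as follows, assuming a simple pinhole camera without lens distortion and a predicted pose given as a rotation matrix `R` and translation `t`. The function names are hypothetical, and the default `top_n=100` is a placeholder for illustration, not the value used in the paper.

```python
import math

def project(point3d, R, t, fx, fy, cx, cy):
    """Pinhole projection of a world point under pose (R, t).
    Returns None if the point lies behind the camera."""
    X = [sum(R[i][j] * point3d[j] for j in range(3)) + t[i] for i in range(3)]
    if X[2] <= 0:
        return None
    return (fx * X[0] / X[2] + cx, fy * X[1] / X[2] + cy)

def prioritize(correspondences, R, t, intrinsics, max_proj_err=10.0, top_n=100):
    """correspondences: list of (keypoint_uv, point3d, descriptor_distance).

    Drops correspondences whose projection error under the predicted pose
    exceeds `max_proj_err` pixels, then keeps the `top_n` remaining ones
    with the smallest descriptor distance.
    """
    kept = []
    for uv, point3d, dist in correspondences:
        proj = project(point3d, R, t, *intrinsics)
        if proj is None:
            continue
        err = math.hypot(proj[0] - uv[0], proj[1] - uv[1])
        if err <= max_proj_err:
            kept.append((uv, point3d, dist))
    kept.sort(key=lambda c: c[2])
    return kept[:top_n]
```

Here `R` and `t` would come from the Kalman filter's prediction of the next pose, and `intrinsics` is the tuple `(fx, fy, cx, cy)` of the calibrated camera.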
4.5 Pose correction using Kalman filter
RANSAC pose estimation can be invalid due to a number of reasons: incorrect descriptor matching, bad optical flow tracking, erroneous RANSAC output, etc. However, the poses of successive video frames are expected to be correlated since we make a small movement between two consecutive frames while walking or driving. To take advantage of this, we apply a linear Kalman filter [kalman, kalman-tutorial] to the estimated poses of the video frames.
We model the state vector [kalman_model] with the 6-DOF pose and its first and second derivatives (velocity and acceleration, respectively). Thus, the 18-dimensional state vector $\mathbf{x}$ is as follows: $\mathbf{x} = (x, y, z, \dot{x}, \dot{y}, \dot{z}, \ddot{x}, \ddot{y}, \ddot{z}, \phi, \theta, \psi, \dot{\phi}, \dot{\theta}, \dot{\psi}, \ddot{\phi}, \ddot{\theta}, \ddot{\psi})^T$. Here, $x, y, z$ are position components and $\phi, \theta, \psi$ are roll, pitch, yaw. We use a conventional physics-based process model relating the state vector between consecutive time steps (omitted for brevity).
Our measurement model is $(x, y, z, \phi, \theta, \psi)^T$, the 6-DOF pose, which we obtain from the pose estimation step as described in §4.4. Finally, we use error gating to improve robustness to large errors in matching, tracking, or pose estimation. With error gating, if any pose component is not within four standard deviations of the current Kalman-estimated pose, we discard the measurement. Note that a two or three standard deviation bound is aggressive for a Kalman filter operating on noisy data. In practical systems, it is more common to use a four or even five standard deviation bound [kalman-tutorial].
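The error-gating rule, together with a constant-acceleration prediction for a single pose component, can be sketched as below. This is a minimal illustration rather than a full Kalman filter: the function names are hypothetical, the per-component standard deviations are assumed to be available from the filter's predicted covariance, and angle wrap-around for roll, pitch, and yaw is ignored.

```python
def predict_axis(pos, vel, acc, dt):
    """Constant-acceleration prediction for one pose component over dt seconds."""
    return (pos + vel * dt + 0.5 * acc * dt * dt, vel + acc * dt, acc)

def passes_gate(measurement, predicted, std_devs, n_sigma=4.0):
    """Accept a measured pose only if every component lies within
    n_sigma standard deviations of the predicted pose."""
    return all(
        abs(m - p) <= n_sigma * s
        for m, p, s in zip(measurement, predicted, std_devs)
    )
```

A measurement that fails the gate is simply skipped, and the filter carries its prediction forward to the next frame.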
Figure 13 shows an example of the Kalman filter correcting the noisy locations. It shows the successive distance from the starting position for 2000 frames after location/pose estimation. The locations obtained after feature matching, optical-flow tracking, and RANSAC pose-estimation can be noisy and spurious for some frames. The Kalman filter smooths out and corrects such noisy poses.
5 Evaluation Challenges and Datasets
In this section, we discuss the evaluation challenges and the two ground-truth datasets that we collected to overcome them.
5.1 Evaluation challenges
Collecting ground-truth data for sub-meter localization at a large scale is challenging. The typical method of collecting location data is to use a GPS device to collect latitude-longitude for geo-tagging. However, consumer-grade GPS devices typically encounter 10s of meters of error, and even 100s of meters in challenging environments (e.g., urban canyons, tree shade, etc.), and these challenging environments are where video-based localization is most appropriate. Therefore, it is not possible to use generic GPS receivers to collect ground-truth data for evaluating a system with sub-meter accuracy. One option to achieve high accuracy is to use a Real-Time Kinematic (RTK) GPS receiver along with an IMU for centimeter-accurate geo-location while moving. However, these systems are very expensive and do not work indoors. Hence, in this work, we use manually annotated ground-truth datasets, collected indoors and in challenging outdoor environments, for evaluation. We used two methods to collect ground truth data. We describe them below.
5.2 Ground truth with a measuring wheel
To collect ground truth data with high accuracy, we walked along a path with a measuring wheel and put chalk marks every 5 feet for a total length of 1000 feet outdoors on an urban street and 430 feet indoors in a UIC campus cafe. We then walked the path and recorded a video for 3D reconstruction. Later, we walked the path two more times and took pictures at every mark, so that the estimated locations could be compared against the marks to measure the accuracy of our methods. This dataset provides intermittent but highly accurate ground truth.
We used a Nexus 5x smartphone for collecting these data sets. Figure 15 shows the measuring wheel that we use to mark a path. Figure 18 shows the outdoor urban street and the indoor cafe at which we evaluated our system.
5.3 Ground truth with landmark tagging
The data collected with the measuring wheel offers ground truth for every 5 feet only. However, we need to evaluate our system for continuous video frames too. Hence, we collected another ground-truth dataset by recording video several times with an iPhone 6 at an outdoor street for measuring localization accuracy of continuous video frames. Here, we annotated every light pole in the street by clicking the headphone button to record a time-stamp in the video as we pass by light-poles. The light poles are approximately 20 meters apart. For all frames between the clicks/light-poles, we interpolate the location based on a best effort attempt of walking at a constant pace. Since the distance between two adjacent light poles is low, this interpolation likely does not introduce a significant error. Figure 15 is an example of this approach where we mount the smartphone using a chest harness.
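The interpolation between light-pole timestamps can be sketched as follows, assuming a constant walking pace between consecutive clicks. The function name and the representation of pole locations as coordinate tuples are assumptions made for the example.

```python
from bisect import bisect_right

def interpolate_ground_truth(click_times, pole_positions, frame_time):
    """Linearly interpolate a frame's location between the two light poles
    whose click timestamps bracket `frame_time`.

    click_times:    ascending timestamps (seconds), one per light pole.
    pole_positions: coordinate tuples, one per light pole.
    """
    if not click_times[0] <= frame_time <= click_times[-1]:
        raise ValueError("frame lies outside the annotated interval")
    # Index of the pole at or before this frame (clamped to the last segment).
    i = min(bisect_right(click_times, frame_time) - 1, len(click_times) - 2)
    t0, t1 = click_times[i], click_times[i + 1]
    alpha = (frame_time - t0) / (t1 - t0)
    return tuple(a + alpha * (b - a)
                 for a, b in zip(pole_positions[i], pole_positions[i + 1]))
```

Any deviation from a constant pace between two poles shows up directly as ground-truth error, which is why short spacing between landmarks matters.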
We present the performance of our localization system for both intermittent and continuous ground truth below. Additionally, we describe the effect of model compression on localization accuracy and smartphone computation time.
6.1 Localization accuracy for intermittent ground truth
We collected ground truth data both outdoors and indoors at every 5 feet, as described in §5.2. In this section, we present the localization accuracy for these ground truth datasets.
Figure 21 shows the CDF of localization error outdoors for two different days with varying lighting conditions. Figure 22 shows the scene variation where the scene at the left is cloudy and the scene at the right is sunny. The time of day is also different, as evident from the shadows. The median localization error on these datasets is 0.35 meters for ground truth collected on the same day as the model, and 0.6 meters for data collected on a different day.
It is worth noting that approximately 15% of the locations in Figure 21 for the different day have an error higher than 5 meters. According to our investigation, the primary reason behind this is repetitive brick textures in the scene. Figure 27 shows some examples of this. The Kalman filter removes many of these spurious cases, which we discuss in the next section.
Figure 21 shows the CDF of localization error in an indoor setting. Since the lighting condition generally remains the same in the indoor setting, we collected one ground truth dataset. Here, the median error is 0.6 meters.
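For reference, the summary statistics used in this section (median error and points on the error CDF) can be computed as in the sketch below. The function names are hypothetical, and any error values passed in for illustration are made up and are not measurements from our datasets.

```python
import statistics

def empirical_cdf(errors, threshold):
    """Fraction of localization errors that are at most `threshold` meters."""
    return sum(1 for e in errors if e <= threshold) / len(errors)

def summarize(errors):
    """Median error plus the share of estimates that are sub-meter."""
    return {
        "median": statistics.median(errors),
        "within_1m": empirical_cdf(errors, 1.0),
    }
```

The median is insensitive to a few very large errors, so reporting a tail fraction alongside it gives a fuller picture of a long-tailed error distribution.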
6.2 Localization accuracy for continuous video
In this section, we report the outdoor localization accuracy against ground truth data for every frame in a video sequence, which we describe in §5.3. Figure 21 shows the CDF of localization error. Here, the median error is 0.3 meters for locations estimated before Kalman filtering. The median error reduces further with Kalman filtering. Note that Kalman filtering is not applicable for the intermittent dataset described above since a Kalman filter needs continuous video frames.
We note that the error has a long tail distribution, with 2% of errors above 3 meters, and occasional much larger errors (up to 40 meters in these experiments). These high errors occur primarily due to repetitive textures as discussed in the section above. Large and abrupt pose changes may be eliminated through inertial sensor fusion [visual-inertial], which falls outside the scope of this paper.
6.3 Effect of compression
Figure 29 and Figure 29 show the localization error CDFs for an uncompressed 3D model and compressed 3D models with various settings, for the same day and a different day. Specifically, the compression settings are (a) point reduction, and (b) mean of descriptors along with point reduction (discussed in §3.2).
Overall, Figure 29 and Figure 29 show that compression of the 3D model does not reduce the localization accuracy significantly. Additionally, compression of a 3D model by removing points that have a low number of frame correspondences can sometimes improve accuracy, as seen from Figure 29 for errors below 1 meter and from Figure 29 for errors below 0.5 meters. This happens because pose estimation can be more accurate if we discard bad 3D points. Surprisingly, model compression using descriptor means also improves the localization accuracy in some cases. We suspect that using the descriptor mean results in lower feature matching ambiguity, but we did not investigate this further.
6.4 Smartphone computation time
Table 2 shows the computation time of primary stages of the localization pipeline for both a server and a smartphone (Nexus 5X). We did not implement the descriptor matching and pose estimation for the Android or iOS platform because of problems with porting the relevant libraries. However, we estimated the approximate computation time on the Nexus 5X smartphone by comparing the relative performance between the smartphone and the server. Note that these numbers are a rough estimation and can vary in a real smartphone implementation. We implemented keypoint tracking on the smartphone and report the computation time obtained directly from the smartphone at Table 2.
The descriptor matching stage runs in the background on additional threads, separate from the main thread. Each location estimate requires keypoint tracking and pose estimation. These two stages take an estimated 0.25 seconds combined on a smartphone. Hence, we can achieve 4 location estimates per second, which is sufficient for effective real-time localization on a smartphone. For comparison, most GPS receivers provide locations at one-second intervals, and some higher-end receivers compute 5 locations every second. Additional performance gains may be available through the use of GPU computing; all results presented here are from pure CPU implementations.
7 Related Work
Localization by matching an image against a database of reference images is similar to image retrieval. Brute-force matching is very expensive. Hence approximate matching is used, where test image descriptors are searched and matched against database descriptors using Approximate Nearest Neighbor (ANN) algorithms. One popular ANN method is the KD-tree [kd-tree-search1, kd-tree-search2]. However, a KD-tree only builds an index over the database descriptors, and all original descriptors are still required during searching. Although a KD-tree makes searching fast, the storage requirement remains high. Hence KD-trees are typically used only for small databases. To alleviate the storage problem, most recent work uses a vocabulary tree of visual words created with hierarchical K-means clustering [vocabulary-tree, video-google], where many descriptors are merged into a single cluster center, thus reducing the storage requirement.
With a vocabulary tree, retrieval performance degrades as the reference database grows. Reducing the number of descriptors in the database is therefore important both for improving retrieval performance and for reducing the storage requirement. In [city-scale-location], information gain is used to keep only important descriptors in the reference database, which enabled the authors to support 10 times more images in the database. A simple but effective technique is used in [confusing-features] to remove confusing features, such as those found on trees and streets. For each query image, they find the top K reference images that are not from the same location using vocabulary tree matching. If some parts of the query image match well with these retrieved images, those parts are marked as confusing. In [predicting-matchability], a simple random decision forest classifier is trained to predict matchable descriptors. They reduced the descriptors to 30% of the original while keeping 60% of matchable descriptors.
Simultaneous Localization And Mapping (SLAM) is another overlapping area of work for image-based localization. SLAM [cd_slam, orb_slam, slam_robot] constructs a map from a sequence of frames while simultaneously localizing all frames with respect to that map. However, SLAM encounters drift and accumulates error because of dead-reckoning. The primary way of detecting and correcting the accumulated error is loop-closing [loop-closing-survey], where the previously visited places are identified. Image matching is used for identifying the same location. Since brute-force matching is computationally prohibitive, approximate matching techniques [dbow, fabmap] are used in practice. One major drawback with SLAM techniques is that the localization is only with respect to the locally constructed map, and global localization is not considered except for the loop closing. Additionally, SLAM uses a sequence of frames or video where scene conditions (lighting, obstacles, etc.) are constant. Localization with previously captured images is more challenging due to varying scene conditions.
Recently, 3D models constructed using Structure-from-Motion (SfM) have been increasingly used for image-based localization. This is a very attractive method of localization, as it provides a 6-DOF pose rather than just the closest matching reference image provided by the image retrieval techniques discussed earlier. Here, both fast matching of an image against the reference 3D model and reducing the size of the 3D model are important for efficiency. Unlike 2D features, the 3D model contains structural information, which can be used for both fast searching and 3D model compression.
For fast searching, [prioritized-matching] used the structural property of the 3D model to prioritize the search. [2d-3d-match] performed 2D-to-3D matching using a vocabulary tree of descriptors along with the corresponding 3D points associated with each visual word. During matching, they proceed in decreasing order of total associated 3D points for the descriptors. In [2d-3d-and-3d-2d], both 2D-to-3D matching and 3D-to-2D matching are used. Since there are many more 2D descriptors than corresponding 3D points, 3D-to-2D matching is more efficient. However, the matching quality tends to be lower in 3D-to-2D matching than in 2D-to-3D matching, where the ratio test can eliminate poor-quality matches. [2d-2d-match] used efficient 2D-to-2D matching that also retrieves the corresponding 3D points for pose estimation.
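For readers unfamiliar with the ratio test mentioned above, the following is a minimal brute-force sketch. It accepts a match only when the nearest candidate descriptor is clearly closer than the second nearest. The function name and the threshold `max_ratio=0.8` are illustrative assumptions, not values taken from the cited works, and a real system would use an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def ratio_test_match(query, candidates, max_ratio=0.8):
    """Return the index of the best-matching candidate descriptor,
    or None if the match is ambiguous.

    max_ratio is a hypothetical threshold chosen for illustration.
    """
    if len(candidates) < 2:
        return None  # the test needs a second-nearest neighbor to compare against
    # Rank candidates by Euclidean distance to the query descriptor.
    ranked = sorted((math.dist(query, c), i) for i, c in enumerate(candidates))
    (best_dist, best_idx), (second_dist, _) = ranked[0], ranked[1]
    # Accept only if the best match is clearly closer than the runner-up.
    if best_dist < max_ratio * second_dist:
        return best_idx
    return None


# Distinctive match: distances 1 and 10, so the nearest candidate is accepted.
print(ratio_test_match([0.0, 0.0], [[1.0, 0.0], [10.0, 0.0]]))  # 0
# Ambiguous match: distances 1 and 1.1, so the match is rejected.
print(ratio_test_match([0.0, 0.0], [[1.0, 0.0], [0.0, 1.1]]))   # None
```

The second call shows the kind of ambiguous correspondence, typical of repetitive textures, that the test is meant to filter out.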
Compressing the 3D model reduces the storage requirement and also makes searching faster. [prioritized-matching] formulated model compression as a set-cover problem, where a reduced point set is selected such that the points are visible in at least N images. While [prioritized-matching] used a simple greedy set-cover formulation to reduce the 3D points, [quadratic-reduction] used mixed-integer quadratic programming for 3D point reduction. This allowed more flexibility and parameterization, such as weights on the 3D points and on the occurrence and co-occurrence of 3D points. [probabilistic-reduction] used both descriptor distinctiveness and probabilistic modeling of the set-cover problem to reduce 3D points by more than 98%. In [outdoor-localization-intel], 3D points are reduced by removing points that do not belong to planes and lines. This reduces the number of 3D points while preserving the 3D structure.
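A greedy set-cover style reduction can be sketched as follows. This is one possible reading of the constraint, assumed here for illustration: every image must see at least `k` of the selected points. The function name, the data layout (point id mapped to the set of image ids observing it), and the tie-breaking rule are all hypothetical and not taken from [prioritized-matching].

```python
def greedy_k_cover(visibility, k):
    """Greedily select points until every image sees at least k selected
    points, or until no remaining point can improve coverage.

    visibility: dict mapping point id -> set of image ids observing it.
    """
    images = set().union(*visibility.values())
    remaining = {img: k for img in images}  # coverage still needed per image
    candidates = dict(visibility)
    selected = []

    def gain(point_id):
        # Number of still-under-covered images this point would help.
        return sum(1 for img in candidates[point_id] if remaining[img] > 0)

    while candidates and any(need > 0 for need in remaining.values()):
        # Sorting the ids makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        if gain(best) == 0:
            break  # no candidate helps any under-covered image
        selected.append(best)
        for img in candidates.pop(best):
            if remaining[img] > 0:
                remaining[img] -= 1
    return selected


visibility = {"a": {1, 2, 3}, "b": {1, 2}, "c": {3}, "d": {2}}
print(greedy_k_cover(visibility, 1))  # ['a']
print(greedy_k_cover(visibility, 2))  # ['a', 'b', 'c']
```

With `k=1`, a single widely visible point covers all three images; raising `k` trades model size for redundancy in each image.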
Large-scale image-based localization has recently been attempted, enabled by the availability of large amounts of imagery. Hence, fast searching and reduction of the 3D model are becoming even more important. In [world-wide-pose], worldwide localization is attempted with two million images and 70 million reconstructed 3D points. The authors could localize successfully even with 1% inliers by introducing a prior in the RANSAC selection step.
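To give a sense of why a 1% inlier ratio is demanding, the sketch below evaluates the standard estimate of how many uniformly drawn samples RANSAC needs to pick at least one all-inlier sample with a given probability. The minimal sample size of 3 correspondences and the 99% success probability are assumptions made for illustration; they are not parameters reported by [world-wide-pose].

```python
import math


def ransac_iterations(inlier_ratio, sample_size, success_prob=0.99):
    """Standard estimate of the number of uniform random samples needed
    so that at least one sample is all-inlier with probability success_prob.
    """
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1  # every sample is all-inlier
    # log1p(-x) computes log(1 - x) accurately for very small x.
    return math.ceil(math.log(1.0 - success_prob) / math.log1p(-p_good_sample))


print(ransac_iterations(0.5, 3))   # 35
print(ransac_iterations(0.01, 3))  # on the order of several million
```

Under these assumptions, uniform sampling at 1% inliers needs millions of iterations, which suggests why biasing the sampling step with a prior matters at this scale.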
Real-time localization is still challenging because of slow descriptor computation and searching. In [real-time-msr], real-time localization is attempted with a combination of tracking and matching. Tracking is fast, and expensive matching with real-valued features is only used when a sufficient number of points cannot be tracked. They used the smartphone's local memory to store the image descriptors, the 3D point cloud map, and various indices. However, for a large reference map, local storage can become infeasible. In contrast, [mobile-server] keeps only a small local map on the mobile device, where it performs local tracking and localization, while the global map is kept on the server. The local and global maps are then aligned to find the 6-DOF pose of the camera with respect to the server-side global map. However, the trade-off is that this method incurs a significant bandwidth cost. [get-out-of-my-lab] focuses on compressing the reference dataset, using both 3D model reduction with a set-cover formulation and descriptor quantization, for real-time localization. During localization, they use local SLAM along with descriptor matching and IMU data. In contrast, our system uses only visual features, with interleaved tracking and matching for fast localization.
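The general idea of falling back to expensive matching only when tracking degrades can be sketched as a small simulation. Everything here is hypothetical: the per-frame survival rates stand in for a real optical-flow tracker, and `seeded_points` and `min_tracked` are illustrative numbers, not parameters from this paper or from [real-time-msr].

```python
def interleave(survival_rates, seeded_points=100, min_tracked=40):
    """Simulate the per-frame choice between cheap tracking and
    expensive descriptor matching.

    survival_rates: simulated fraction of tracked points surviving each frame.
    Returns the stage chosen for each frame.
    """
    tracked = 0  # nothing is tracked before the first matching step
    schedule = []
    for rate in survival_rates:
        tracked = int(tracked * rate)  # tracking loses some points each frame
        if tracked < min_tracked:
            # Too few points survive: run matching and re-seed the tracker.
            tracked = seeded_points
            schedule.append("match")
        else:
            schedule.append("track")
    return schedule


print(interleave([1.0, 0.8, 0.8, 0.5, 0.9]))
# ['match', 'track', 'track', 'match', 'track']
```

In this toy run, matching is needed only on the first frame and after the abrupt loss on the fourth frame, so most frames pay only the tracking cost.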
Machine learning and Convolutional Neural Networks (CNNs) are now successful and popular in computer vision applications. Recently there has been work [posenet, geometric-pose, lstm-pose, geometry-aware-pose] on 6-DOF pose estimation based on learning and CNN techniques. These methods offer end-to-end learning and often learn high-level features in addition to corner features. However, the primary constraint in learning-based methods is the training requirement. Unlike feature-based techniques, these systems often perform poorly if the test data differ significantly from the training data. Additionally, these techniques often require hyper-parameter tuning for each training set separately.
In conclusion, we present a system that achieves sub-meter localization accuracy using video analysis. Consumer grade GPS receivers in smartphones generally encounter 5–10 meters [gps_accuracy1] of error under ideal conditions, and significantly more under non-ideal conditions, such as in urban canyons. We demonstrated that, despite compute-intensive video processing, localization based on video analysis is feasible through careful design and optimization of the various stages of the pipeline. We also demonstrated that sub-meter localization accuracy can be achieved both indoors and outdoors.