ENFT: Efficient Non-Consecutive Feature Tracking for Robust Structure-from-Motion

Guofeng Zhang, Haomin Liu, Zilong Dong, Jiaya Jia, Tien-Tsin Wong, and Hujun Bao

Guofeng Zhang, Haomin Liu, Zilong Dong and Hujun Bao are with the State Key Lab of CAD&CG, Zhejiang University. Guofeng Zhang and Hujun Bao are also affiliated with Innovation Joint Research Center for Cyber-Physical-Society System, Zhejiang University. Email: {zhangguofeng, zldong, bao}@cad.zju.edu.cn, 172753015@qq.com. Corresponding authors: Guofeng Zhang and Hujun Bao.

Jiaya Jia and Tien-Tsin Wong are with The Chinese University of Hong Kong. Tien-Tsin Wong is also affiliated with Shenzhen Research Institute, The Chinese University of Hong Kong. Email: {leojia, ttwong}@cse.cuhk.edu.hk

Structure-from-motion (SfM) largely relies on feature tracking. In image sequences, if disjointed tracks caused by objects moving in and out of the field of view, occasional occlusion, or image noise are not handled well, the corresponding SfM result can be degraded. This problem becomes more severe for large-scale scenes, which typically require capturing multiple sequences to cover the whole scene. In this paper, we propose an efficient non-consecutive feature tracking (ENFT) framework to match interrupted tracks distributed in different subsequences or even in different videos. Our framework consists of two steps: solving the feature ‘dropout’ problem when indistinctive structures, noise or large image distortion exist, and rapidly recognizing and joining common features located in different subsequences. In addition, we contribute an effective segment-based coarse-to-fine SfM algorithm for robustly handling large datasets. Experimental results on challenging video data demonstrate the effectiveness of the proposed system.

Non-Consecutive Feature Tracking, Track Matching, Structure-from-Motion, Bundle Adjustment

I Introduction

Large-scale 3D reconstruction [46, 16, 15] finds many practical applications. It primarily relies on SfM algorithms [20, 53, 67, 2, 8] to first estimate sparse 3D features and camera poses from input video or image collections.

Compared to images, videos contain denser geometrical and structural information, and are the main source of SfM in the movie, video and commercial industry. A common strategy for video SfM estimation is by employing feature point tracking [59, 34], which takes care of the temporal relationship among frames. It is also a basic tool for solving a variety of computer vision problems, such as camera tracking, video matching, and object recognition.

In this paper, we address two critical problems for feature point tracking, which could handicap SfM especially for large-scale scene modeling. The first problem is the vulnerability of feature tracking to object occlusions, illumination change, noise, and large motion, which easily causes occasional feature drop-out and distraction. This problem makes robust feature tracking from long sequences challenging.

The other problem is the inability of sequential feature tracking to cope with feature matching over non-consecutive subsequences. A typical scenario is that the tracked object moves out and then re-enters the field-of-view, which yields two discontinuous subsequences containing the same object. Although the two subsequences share common features, these are difficult to match and include in a single track using conventional tracking methods. Addressing this issue can alleviate the drift problem of SfM, which benefits high-quality 3D reconstruction. A naïve solution is to exhaustively search all features, which is computationally expensive since many temporally distant frames simply share no content.

Fig. 1: A large-scale “Garden” example. (a) Snapshots of the input videos. (b) With the matched feature tracks, we register the recovered 3D points and camera trajectories in a large-scale 3D system. Camera trajectories are differently color-coded.

We propose an efficient non-consecutive feature tracking (ENFT) framework which can effectively address the above problems in two phases – that is, consecutive point tracking and non-consecutive track matching. We demonstrate their significance for SfM using challenging sequence data. Consecutive point tracking detects and matches invariant features in consecutive frames. A matching strategy is proposed to greatly increase the matching rate and extend lifetime of the tracks. Then in non-consecutive track matching, by rapidly computing a matching matrix, a set of disjoint subsequences with overlapping content can be detected. Common feature tracks scattered over these subsequences can also be reliably matched.

Our ENFT method reduces estimation errors for long loopback sequences. Given limited memory, it is generally intractable to use global bundle adjustment to refine camera poses and 3D points for very long sequences. Iteratively applying local bundle adjustment, on the other hand, has difficulty distributing estimation errors effectively to all frames. We address this issue by adopting a segment-based coarse-to-fine SfM algorithm, which globally optimizes structure and motion with limited memory.

Based on our ENFT algorithm and segment-based coarse-to-fine estimation scheme, we present the SfM system ENFT-SFM, which can effectively handle long loopback sequences and even multiple sequences. Fig. 1 shows an example containing 6 sequences with about frames in total in a large-scale scene. Our system splits them into 37 shorter sequences, quickly computes many long and accurate feature tracks, efficiently estimates camera trajectories in different sequences, and finally registers them in a unified 3D system, as shown in Fig. 1(b). The whole process only takes about 90 minutes (excluding I/O) on a desktop PC, i.e., 17.7 FPS on average. Our supplementary video (http://www.cad.zju.edu.cn/home/gfzhang/projects/tracking) contains the complete result.

Compared to our prior work [66], in this paper, we make a number of modifications to improve robustness and efficiency. Particularly, we improve the second-pass matching by formulating it as minimizing an energy function incorporating two geometric constraints, which not only boosts the matching accuracy but also reduces computation. The non-consecutive track matching algorithm is re-designed to perform feature matching and match-matrix update together. It is less sensitive to initialization and reduces the matching time. Finally, we propose a novel segment-based coarse-to-fine SfM method, which performs efficient global optimization for large data with only limited memory.

II Related Work

We review feature tracking and large-scale SfM methods in this section.

II-A Feature Matching and Tracking

For video tracking, sequential matchers are used for establishing correspondences between consecutive frames. Kanade-Lucas-Tomasi (KLT) tracker [35, 59] is widely used for small baseline matching. Other methods detect image features and match them considering local image patches [42, 47] or advanced descriptors [34, 37, 36, 3].

Both the KLT tracker and invariant feature algorithms depend on modeling feature appearance, and can be distracted by occlusion, similar structures, and noise. Generally, sequential matchers have difficulty matching non-consecutive frames under image transformation. Scale-invariant feature detection and matching algorithms [34, 3] are effective in matching images with large transformation. But they generally produce many short tracks in consecutive point tracking, due primarily to the global indistinctiveness and feature dropout problems. In addition, invariant features are relatively sensitive to perspective distortion. Although variations, such as ASIFT [38], can improve matching performance under substantial viewpoint change, computation overhead increases owing to exhaustive viewpoint simulation. Cordes et al. [7] proposed a memory-based tracking method to extend feature trajectories by matching each frame to its neighbors. However, if an object re-enters the field-of-view after a long period of time, the size of neighborhood windows has to be very large. Besides, the multiple-video setting was not discussed. In contrast, our method can not only extend track lifetime but also efficiently match common feature tracks in different subsequences by iteratively matching overlapping frame pairs and refining the match matrix. The computational complexity is linear in the number of overlapping frame pairs.

There are methods using invariant features for object and location recognition in images/videos [52, 49, 22, 50, 23]. These methods typically use bag-of-words techniques to perform global localization and loop-closure detection in an image classification framework. Nistér and Stewénius [43] proposed using a hierarchical k-means algorithm to construct a vocabulary tree with feature descriptors, which can be used for large-scale image retrieval and location recognition. Cummins and Newman [10] proposed a probabilistic approach called FAB-MAP for location recognition and online loop closure detection, which models the world as a set of locations and computes the probability of belonging to previously visited locations for each input image. Later, they proposed using a sparse approximation for large scale location recognition [9]. However, FAB-MAP assumes the neighboring locations are not too close, so it might perform less satisfactorily if we simply input a normal video sequence. In addition, existing methods generally divide location recognition and non-consecutive feature matching into two separate phases [31, 6, 11, 21]. Because the match matrix obtained by bag-of-words only roughly reflects the match confidence, completely trusting it may lose many common features. In this paper, we introduce a novel strategy where the match matrix can be refined and updated along with non-consecutive feature matching. Our method can reliably and efficiently match the common features even with a coarse match matrix.

Engels et al. [13] proposed integrating wide-baseline local features with the tracked ones to improve SfM. The method creates small and independent submaps and links them via feature recognition. This approach also cannot produce many long and accurate point tracks. Short tracks are not enough for drift-free SfM estimation. In comparison, our method is effective in high-quality point track estimation. We also address the ubiquitous nondistinctive feature matching problem in dense frames. Similar to the scheme of [19], we utilize track descriptors, instead of the feature descriptors, to reduce computation redundancy.

Wu et al. [65] proposed using dense 3D geometry information to extend SIFT features. In contrast, our method only uses sparse matches to estimate a set of homographies to represent scene motion, which also handles viewpoint change. It is general since geometry is not required.

II-B Large-Scale Structure-from-Motion

State-of-the-art large-scale SfM methods can handle millions of images on a single PC in one day [15]. To this end, large image data are separated into a number of independent submaps, each of which is optimized independently. Steedly et al. [55] proposed a partitioning approach to decompose a large-scale optimization into multiple better-conditioned subproblems. Clemente et al. [5] proposed building local maps independently and stitching them with a hierarchical approach.

Ni et al. [41] proposed an out-of-core bundle adjustment (BA) for large-scale SfM. This method decomposes the data into multiple submaps, each of which has its own local coordinate system for optimization in parallel. For global optimization, an out-of-core implementation is adopted. Snavely et al. [54] proposed speeding up reconstruction by selecting a skeletal image set for SfM and then adding other images with pose estimation. Similarly, Konolige and Agrawal [27] selected a skeletal frame set and used reduced relative constraints for closing large loops. Each skeleton frame can actually be considered as a submap. A similar scheme is applied to iconic views [29], which are generated by clustering images with similar gist features [44]. In our work, a segment-based scheme is adopted, which first estimates SfM for each sequence independently, and then aligns the recovered submaps. Depending on estimation errors, we split each sequence to multiple segments, and perform segment-based refinement. This strategy can effectively handle large data and quickly reduce estimation errors during optimization.

Another line of research is to improve large-scale BA, which is a core component of SfM. Agarwal et al. [1] pointed out that connectivity graphs of Internet image collections are generally much less structured and accordingly presented an inexact Newton type BA algorithm. To speed up large-scale BA, Wu et al. [64] utilized multi-core CPUs or GPUs, and presented a parallel inexact Newton BA algorithm. Wu et al. [62] also proposed preemptive feature matching that reduces matching image pairs, and an incremental SfM for full BA when the model is large enough. Pose graph optimization [45, 56, 28] was also widely used in realtime SfM and SLAM [12, 40], which uses the relative-pose constraints between cameras and is more efficient than full BA.

Most existing SfM approaches achieve reconstruction in an incremental way, which may risk drifting or local minima when dealing with large-scale image sets. Crandall et al. [8] proposed combining discrete and continuous optimization to yield a better initialization for BA. By formulating SfM estimation as a labeling problem, belief propagation is employed to estimate camera parameters and 3D points. In the continuous step, Levenberg-Marquardt nonlinear optimization with additional constraints is used. This method is restricted to urban scenes, and assumes that the vertical vanishing point can be detected for rotation estimation, similar to the method proposed by Sinha et al. [51]. It also needs to leverage geotags contained in the collected images and requires complex discrete optimization. In contrast, our segment-based scheme can run on a common desktop PC with limited memory, even for large video data.

Real-time monocular SLAM methods [25, 58, 12, 40] typically perform tracking and mapping in parallel threads. The methods of [12, 40] can close loops efficiently for large-scale scenes. However, they could still have difficulty in directly handling multiple sequences, as demonstrated in Figs. 9 and 10.

1. Consecutive point tracking (Section IV):
1.1 Match the extracted SIFT features between consecutive frames with descriptor comparison.
1.2 Perform the second-pass matching to extend track lifetime.
2. Non-consecutive track matching (Section V):
2.1 Use hierarchical k-means to cluster the constructed invariant tracks.
2.2 Estimate the match matrix with the grouped tracks.
2.3 Detect overlapping subsequences and join the matched tracks.
3. Segment-based coarse-to-fine SfM (Section VI):
3.1 Estimate the submap for each sequence.
3.2 Match the common tracks among different sequences, and then use them to estimate the similarity transformations for each submap.
3.3 Use segment-based SfM to refine the aligned submaps.
4. [Optional] Feature propagation with camera estimation:
4.1 Quickly propagate features from sampled frames to others.
4.2 Quickly estimate camera poses for remaining frames.
TABLE I: Framework overview of ENFT-SFM

III Our Approach

Given a video sequence with frames, , our objective is to extract and match features in all frames in order to form a set of feature tracks. A feature track is defined as a series of feature points in images: , where denotes the frame set spanned by track . Each SIFT feature in frame is associated with an appearance descriptor [34] and we denote all descriptors in a feature track as .

With the detected features in all frames, finding matchable ones generally requires a large number of comparisons even when using k-d trees; it also inevitably induces errors, because a large number of features makes the descriptor space hardly distinctive, resulting in ambiguous matches. So it is neither reliable nor practical to only compare the feature descriptors to form tracks. Our ENFT method has two main steps to address this issue. The framework is outlined in Table I.

For reducing computation, we can extract one frame for every frames to constitute a new sequence and then perform feature tracking on it. In the consecutive tracking stage, we employ a two-pass matching strategy to extend the track lifetime. Then in the non-consecutive tracking stage, we match common features in different subsequences. With the obtained feature tracks, a segment-based SfM scheme is employed to robustly recover the 3D structure and camera motion. Finally, if necessary, we propagate feature points from sampled frames to others. Since the 3D positions of these features have been computed, we can quickly estimate the camera poses of remaining frames with the obtained 3D-2D correspondences.

IV Consecutive Tracking

For video sequences, feature tracks are typically obtained by matching features between consecutive frames. However, due to illumination change, repeated texture, noise, and large image distortion, features are easily dropped out or mismatched, resulting in breaking many tracks into shorter ones. In this section, we propose an improved two-pass matching strategy to alleviate this problem. The first-pass matching is the same as our prior method [66], which uses SIFT algorithm [34] with RANSAC [14] to obtain high-confidence matches and remove outliers. In the second pass matching, we firstly use the inlier matches to estimate a set of homographies with multiple RANSAC procedures [66, 24]. To handle illumination change, we estimate global illumination variation between images and by computing the median intensity ratio between matched features. Here, denotes the gray scale image of frame .
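As a rough illustration of the illumination step, the sketch below computes a median intensity ratio from matched feature locations. The function name `median_intensity_ratio`, the nearest-pixel sampling, the `eps` guard, and the direction of the ratio (second image over first) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def median_intensity_ratio(gray_a, gray_b, pts_a, pts_b, eps=1e-6):
    """Median of I_b(x_b) / I_a(x_a) over matched feature locations.

    gray_a, gray_b: 2D arrays holding the grayscale images.
    pts_a, pts_b:   (N, 2) arrays of matched (x, y) pixel coordinates.
    """
    pa = np.rint(np.asarray(pts_a)).astype(int)
    pb = np.rint(np.asarray(pts_b)).astype(int)
    # Images are indexed [row, col], i.e. [y, x].
    ia = gray_a[pa[:, 1], pa[:, 0]].astype(float)
    ib = gray_b[pb[:, 1], pb[:, 0]].astype(float)
    # Assumption: skip near-black pixels so the ratio stays well defined.
    valid = ia > eps
    return float(np.median(ib[valid] / ia[valid]))
```

Multiplying the first image by the returned factor brings its brightness roughly in line with the second, provided the change is well approximated by a single global gain. The median makes the estimate tolerant of a minority of wrong matches.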

Fig. 2: Feature matching comparison. (a) First-pass matching by SIFT descriptor comparison. There are features detected in the first image, but only 53 matches are found. (b) Additional match by directly searching the correspondences along the epipolar lines with SIFT descriptor comparison. Only 11 additional matches are found. (c) Second-pass matching with outlier rejection. 399 (i.e. ) matches are obtained. (d) Matching result of [66]. 314 matches are obtained.
Fig. 3: Constrained spatial search with planar motion segmentation. Given homography , we rectify to such that . Then we select the midpoint between and its projection to for initialization, and search the matched point by minimizing (1). The red dot is the result.

We first linearly scale image with , and then transform it with homography to obtain the rectified image . Correspondingly, in image is rectified to where in . The distance between a 2D point and the epipolar line is denoted by . If largely deviates from the epipolar line (i.e., ), we reject since it does not describe the motion of well. For each remaining , we track to by minimizing the matching cost:


where are the points in the window centered at . Different from our prior method [66], the matching cost incorporates two geometric constraint terms, which encourage to be along the epipolar line and obey homography. is the Euclidean distance and is the absolute value. The corresponding weights are and , where , , and account for the uncertainty of intensity, epipolar geometry and homography transformation respectively. In our experiments, these values are by default  (for intensity values normalized to ), and . Note that is relatively large because we do not require the points to strictly lie on the same plane. As long as the point is near the plane, can alleviate the major distortion and provide a better matching condition.
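The epipolar term relies on the distance between a 2D point and an epipolar line. A minimal sketch of that distance is given below; it assumes the common convention that a point in the first image maps to the line `F @ [x, y, 1]` in the second image, and the function names are hypothetical.

```python
import numpy as np

def epipolar_line(F, x1):
    """Epipolar line (a, b, c) in the second image for point x1 = (x, y)."""
    return F @ np.array([x1[0], x1[1], 1.0])

def point_line_distance(x2, line):
    """Perpendicular distance from point x2 to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * x2[0] + b * x2[1] + c) / np.hypot(a, b)
```

A candidate match whose distance exceeds a chosen threshold violates the epipolar constraint; the threshold itself is left to the caller here.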

Similar to KLT tracking, we solve for iteratively by taking the partial derivative w.r.t. and setting it to zero:


is approximated by a Taylor expansion truncated up to its first order:


where is the image gradient in the frame. With the computed gradients, we propose an iterative solver to optimize (1) by first initializing as the midpoint between and its projection to , as shown in Fig. 3. Then we iteratively update by solving (2). In iteration , is updated as

where denotes the value of in iteration . This procedure continues until is sufficiently small.

The found match is denoted as . With the set of homographies , we can find several matches . Only the best one is kept.

In case the feature motion cannot be described by any homographies or feature correspondence is indeed missing, the found match is actually an outlier. We detect it with the following conditions:

These conditions represent the constraints of color constancy, epipolar geometry and homography respectively. If any of them is satisfied, is treated as an outlier. is set to a small value (generally in our experiments) since the image is rectified. The remaining two parameters are and . Considering points may not strictly undergo planar transformation, is set to a relatively large value.

Fig. 2(c) shows the result after the second-pass matching. Compared to our prior method [66] (Fig. 2(d)), the improved two-pass matching method does not need to perform additional KLT matching and thus runs faster. The computation time is only ms with GPU acceleration on an NVIDIA GTX780 graphics card. The number of credible matches also increases.

The two-pass matching can produce many long tracks. Each track has a group of descriptors, which are similar to each other because of the matching criteria. We compute the average of the descriptors over the track and denote it as the track descriptor . It is used in the following non-consecutive track matching.
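The averaging step can be sketched in a few lines. The function name `track_descriptor` is hypothetical, and whether the averaged vector is re-normalized afterwards is not stated in the text, so this example leaves it as is.

```python
import numpy as np

def track_descriptor(descriptors):
    """Average the per-frame descriptors of one track into a single vector.

    descriptors: (T, D) array-like, one D-dimensional descriptor per frame.
    """
    d = np.asarray(descriptors, dtype=float)
    return d.mean(axis=0)
```

Using one vector per track instead of one per feature reduces the number of items that the later clustering has to process.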

V Non-Consecutive Track Matching

In this stage, we match features distributed in different subsequences, which is vital for drift-free SfM estimation. If we select all image pairs in a brute-force manner, the process can be intolerably costly for a long sequence. A better strategy is to estimate content similarity among different images first. We propose a non-consecutive track matching (NCTM) method to address this problem.

There are two steps. In the first step, similarity of different images is coarsely estimated by constructing a symmetric match matrix , where is the number of frames. stores overlapping confidence between images and . We use the same method as [66] to quickly estimate the initial match matrix , which first uses hierarchical k-means to cluster the track descriptors and then computes the similarity confidence of frame pairs by counting the number of potentially matched tracks that are clustered into the same leaf node.
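The counting idea can be illustrated as follows. This is only a sketch under stated assumptions: the clustering is taken as already done, so each track arrives with a leaf id, and every pair of distinct tracks in the same leaf adds one vote to each frame pair they span. The exact weighting used in [66] is not given here, and the name `initial_match_matrix` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def initial_match_matrix(num_frames, tracks):
    """Build a symmetric vote matrix from tracks grouped by leaf node.

    tracks: list of (leaf_id, frames) pairs, one entry per track, where
            frames lists the frame indices spanned by that track.
    """
    by_leaf = defaultdict(list)
    for leaf_id, frames in tracks:
        by_leaf[leaf_id].append(sorted(set(frames)))

    M = np.zeros((num_frames, num_frames), dtype=int)
    for group in by_leaf.values():
        # Only pairs of different tracks vote; a track never matches itself.
        for frames_a, frames_b in combinations(group, 2):
            for i in frames_a:
                for j in frames_b:
                    if i != j:
                        M[i, j] += 1
                        M[j, i] += 1
    return M
```

Large entries then mark frame pairs that are likely to share many features and are worth matching first.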

For acceleration, we only select long tracks that span 5 or more keyframes to estimate overlap confidence. In our experiments, for the “Desktop” sequence, the initial match matrix estimation only takes 1.08 seconds, with a total of selected feature tracks. Fig. 4(a) shows the initially estimated match matrix for the “Desktop” sequence. Bright pixels are with high overlapping confidence where many common features exist. Because we exclude track self-matching, the diagonal band of estimated match matrix has no value. Our method handles dense image sequences, unlike FAB-MAP [10, 9] that assumes sparsely sampled ones. When applying FAB-MAP to the original “Desktop” sequence, no loop is detected. So we manually sample the original sequence until common points between adjacent sampled frames are no more than 100. This generates 26 sampled frames. As shown in Fig. 4(b), in this case, a few overlapping image pairs are identified by FAB-MAP; but they are not enough to match many common features.

In the second step, with the initially estimated match matrix, we select the frame pairs with maximum overlapping confidence to perform feature matching, and update the match matrix iteratively. Matrix estimation and non-consecutive feature matching benefit from each other, which simplifies computation. Fig. 4(c) shows our finally estimated match matrix.

For speedup, we extract keyframes based on the result of consecutive feature tracking described in Section IV. Frame is selected as the first keyframe. Then we select frame as the second keyframe if it satisfies and , where denotes the number of common features between frames and . Other keyframes are selected as follows. For the two recent keyframes with indices and in the original sequence, we select frame  () as the new keyframe if it is the farthest one from that satisfies , , where denotes the number of common points among the three frames . This step is repeated until all frames are processed. In our experiments, and . Without special notice, the following procedures are only performed on keyframes.

Fig. 4: Match matrix estimation for the “Desktop” sequence containing 941 frames. (a) Our initially estimated match matrix based on the keyframes. The matrix size is scaled for visualization. (b) Estimated match matrix by FAB-MAP [9] on the re-sampled sequence that contains 26 frames. (c) The final match matrix for all frames after our non-consecutive matching based on (a). (d) The final match matrix after our non-consecutive matching based on (b).

V-A Non-Consecutive Track Matching

Since the number of common features between two frames can be coarsely reflected by the initially estimated match matrix , we select a frame pair with the largest value in to start matching. After matching , the set of matched track pairs approximately represent the number of common tracks for neighboring frame pairs. The matched track pairs in frame pair can be expressed as


The number of common features in can be approximated by as long as shares sufficient common tracks with . We maintain an updating match matrix , computed as


to propagate the overlapping confidence from toward the neighboring frame pairs, and determine where the next matching should be performed. Details are given below.

Main Procedure: We first detect the largest element in . The value of is also denoted as . If is larger than a threshold, several common features may exist. After matching , we collect and put the matched track pairs into and update according to Eq. (5). In particular, we set , indicating is matched. Next, we repeatedly select the largest element in the updating matrix , match , and update and accordingly. This procedure continues until . Then we go to another region by re-detecting the brightest point in that has not been processed. The step ends if the brightest value is smaller than .
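The control flow of the main procedure can be sketched as a greedy loop. This is not the authors' implementation: the confidence propagation is delegated to a caller-supplied function `match_pair` (hypothetical), a single `threshold` is used for both the initial and the updating matrix, and the name `matching_order` is made up for this example.

```python
import numpy as np

def matching_order(M, match_pair, threshold):
    """Return the order in which frame pairs would be matched.

    M:          symmetric initial match matrix (2D array-like).
    match_pair: function (i, j) -> {(p, q): confidence}; it stands in for
                matching frames i and j and propagating confidence to
                neighboring frame pairs.
    """
    M = np.array(M, dtype=float)
    done = np.zeros(M.shape, dtype=bool)
    order = []
    while True:
        # Seed: brightest element of M that has not been processed yet.
        seeds = np.where(done, -np.inf, M)
        i, j = np.unravel_index(np.argmax(seeds), M.shape)
        if seeds[i, j] < threshold:
            break
        T = np.zeros_like(M)  # updating matrix for the current region
        T[i, j] = seeds[i, j]
        while True:
            cand = np.where(done, -np.inf, T)
            a, b = (int(v) for v in np.unravel_index(np.argmax(cand), T.shape))
            if cand[a, b] < threshold:
                break
            done[a, b] = done[b, a] = True  # mark the pair as matched
            order.append((a, b))
            for (p, q), conf in match_pair(a, b).items():
                T[p, q] = max(T[p, q], conf)
    return order
```

Each inner iteration marks one new frame pair as processed, so the loop terminates after at most one pass over all pairs. A weak seed is enough to start a region, because later pairs are chosen from the propagated confidences rather than from the initial matrix.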

Method        Desktop: Merged Tracks   Desktop: Time   Circle: Merged Tracks   Circle: Time
[66]          16,279                   81s             101,948                 132s
Our method    16,827                   35s             102,583                 55s
TABLE II: Non-consecutive track matching comparison between the method of [66] and ours for the “Desktop” and “Circle” sequences.

Frame pair matching and outlier rejection: When entering a new bright region, we perform the classical 2NN matching for . Then each matching pair is detected from the updating matrix . Thus there are common features found previously. We use these matches to estimate the fundamental matrix of frame pair , and re-match those outlying features along the epipolar lines. We further search the correspondences for other unmatched features along epipolar lines.

Along with the fundamental matrix estimation between and , these matches are classified into inliers and outliers. Since only part of matches are used to estimate , the estimated could be biased. So we do not reject outliers immediately. Fortunately, each matched track pair undergoes multi-pass epipolar verification during processing the whole bright region. We record all the verification results for each , and determine inliers/outliers after all bright regions are processed. Suppose is classified as an inlier match times and as an outlier match times. We reject if  ( in our experiments). In addition, we use the following strategy to remove the potential matching ambiguity. For example, a track may find two corresponding tracks and , where and have overlapping frames. So the track matches and conflict with each other. In this case, we simply select the best match with the largest , and regard the other as an outlier.

Benefits: The proposed matching method outperforms previous ones in the following aspects. In our prior method [66], a rectangular region in the roughly estimated match matrix is sought each time and local exhaustive track matching is performed for all frame pairs in it. It could involve a lot of unnecessary matching for non-overlapping frames and repeated feature comparison. Our current scheme only selects the frame pairs with sufficient overlapping, and matches each pair of frames and most tracks at most once. As shown in Table II, compared to the method of [66], our new non-consecutive track matching algorithm is more efficient. Both methods are implemented without GPU acceleration.

Standard image matching is to find a set of most similar images given the query one. This scheme has been extensively used in large-scale SfM [2, 15] and realtime SLAM systems for loop closure detection [4, 6, 31]. It, however, also may involve unnecessary matching for unrelated frame pairs and miss those with considerable common features. It is because image similarity based on appearance may not be sufficiently reliable. In contrast, we progressively expand frames with track matching. The expansion is not fully related to the initial match matrix. Therefore a very rough matrix is enough to give a good starting point. Practically, as long as there is one good position, our system can extend it to the whole overlapping region accurately. To verify this, we provided two refined match matrices based on two different rough match matrices, as shown in Figs. 4(a) and (b). Although the two initially estimated match matrices are different and only based on keyframes, the finally estimated match matrices after our non-consecutive track matching are quite similar (except the bottom right area, where the initial match matrix by FAB-MAP does not provide any high confidence elements), which demonstrates the effectiveness of our method.

V-B SfM for Multiple Sequences

Our method can be naturally extended to handle multiple sequences. Given one or multiple sequences, we first split long ones so that each new sequence generally contains only frames. Neighboring split sequences can contain some overlapping frames for reliable matching. The sequence set is denoted as . Then we apply our feature tracking to each , and estimate its 3D structure and camera motion using a keyframe-based incremental SfM scheme similar to that of [67]. The major modification is that we use known intrinsic camera parameters, and simply select an initial frame pair that has sufficient matches and a large baseline to start SfM. For each sequence pair, we use the fast match matrix estimation algorithm [66] to estimate the rough match matrix such that related frames in any two different sequences can be found and common features can be matched by the algorithm introduced in Section V-A. Then we use the segment-based SfM method described in Section VI to efficiently recover and globally register 3D points and camera trajectories, as shown in Fig. 1(b).

VI Segment-based Coarse-to-Fine SfM

With the independently reconstructed sequences and matched common tracks, we align them in a unified 3D coordinate system. For a long loopback sequence, error accumulation could be serious, making traditional bundle adjustment get stuck in a local optimum. This is because the first few iterations of bundle adjustment aggregate the accumulated errors at the loop joints, and these errors are hard to propagate to the whole sequence. To address this problem, we split each sequence into multiple segments, each associated with a similarity transformation. Only these transformations and the overlapping points across different segments are optimized. We call this segment-based bundle adjustment and illustrate it in Fig. 6. Lim et al. [31] performed global adjustment by clustering keyframes into multiple disjoint sets (i.e. segments), which is conceptually similar to our idea. However, their geodesic-distance-based segmentation could place inconsistent structure into a single body, complicating alignment-error reduction. This method also did not adaptively split the segments in a coarse-to-fine way to minimize the accumulation error. Local optimization within each body may not sufficiently minimize the error, which is mainly caused by global misalignment.

In the beginning, we order all sequences and define the one that contains the maximum number of tracks merged with others as the reference. Without loss of generality, we define it as sequence , denoted as . Its local 3D coordinate system is also set as the reference. Then with the common tracks among different sequences, we can estimate the coordinate transformation for each sequence (i.e., ), denoted as , where is the scale factor, is the rotation matrix, and is the translation vector. For the reference sequence, has value 1, is an identity matrix, and .
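One standard way to obtain such a scale, rotation and translation from matched 3D points is the closed-form least-squares solution of Umeyama. The sketch below is given for illustration only; it is not claimed to be the estimator used in our implementation, and the function name `estimate_similarity` and the mapping direction (`dst ≈ s R src + t`) are assumptions.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (N >= 3, not
    collinear). Closed-form Umeyama solution; assumed here as one
    possible estimator, not taken from the paper.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation.
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

In practice such an estimate would typically be wrapped in a robust scheme (e.g. RANSAC [14]) because matched tracks may contain outliers.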

Fig. 5: Split point detection. (a) Original camera trajectory of the “Desktop” sequence. (b) Split camera trajectories. Each segment contains 100 frames. (c) Re-projection errors (green curve) and angles of steepest descent direction (blue curve). Values are all normalized to [0,1] for better comparison. The angle reflects the quality of the split result more accurately than the re-projection error.
Fig. 6: Segment-based coarse-to-fine refinement. (a) Recovered camera trajectories marked with different colors. (b) Each sequence is split into 2 segments where endpoints and split points are highlighted. (c) Refined camera trajectories after the first iteration, where errors are redistributed. (d) “Split points”, which are joints of largely inconsistent camera motion for consecutive frames. (e) Sequence separation by split points. The two dark points denote the two consecutive frames separated at a split point. (f) Refined camera trajectories after 2 iterations.

Each segment is assigned with a similarity transformation, and the relative camera motion between frames in each segment is fixed, so that the number of variables is small enough for efficient optimization. Different from [31], which clusters frames using geodesic distances, we propose clustering neighboring and geometrically consistent frames into segments. The position at which two consecutive frames are inconsistent is defined as a “split point”. We project the common points in each consecutive frame pair into the two images and check the re-projection error.

However, directly detecting the split points according to the re-projection error is not optimal, since this error is generally large at loop closure points, and splitting such frame pairs does not help. We instead look for split points at which the re-projection error is most likely to be reduced. Assume each frame is associated with a small similarity transformation , which is parameterized as a 7-vector  (three Rodrigues components for rotation, 3D translation and scale). If we minimize the re-projection error w.r.t. , the steepest descent direction is


where is the number of points visible in frame , and is the Jacobian matrix . is the projection function. is the re-projection error , which is reduced along the direction of . For two consecutive frames , if their and have similar directions, their re-projection errors can both be reduced with the same similarity transformation. Otherwise, it is better to assign these two frames to different segments. The inconsistency between two consecutive frames is defined as the angle between the two steepest descent directions


For verification, we group every 100 consecutive frames of the “Desktop” sequence (Fig. 5(a)) into one segment, and apply a certain transformation to each segment (Fig. 5(b)). As expected, the re-projection errors are spread over the whole overlapping regions. In contrast, the angle between the steepest descent directions reliably reflects the splitting result.
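The inconsistency measure can be sketched numerically as follows. The notation is an assumption made for this illustration: for each point visible in a frame, `J` is the 2x7 Jacobian of the projection w.r.t. the 7 similarity parameters and `e` is the 2D re-projection error (observed minus projected position), so that the steepest descent direction of the squared error is the sum of `J^T e` over the visible points. The function names are hypothetical.

```python
import numpy as np

def steepest_descent_direction(jacobians, residuals):
    """Sum of J_i^T e_i over the points visible in one frame.

    jacobians: (N, 2, 7) array, residuals: (N, 2) array.
    Returns a 7-vector. Shapes and sign convention are assumptions
    of this sketch (e = observed - projected).
    """
    J = np.asarray(jacobians, dtype=float)
    e = np.asarray(residuals, dtype=float)
    return np.einsum("nij,ni->j", J, e)

def inconsistency_angle(g_a, g_b, eps=1e-12):
    """Angle (radians) between the descent directions of two frames."""
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    if denom < eps:
        return 0.0  # a vanishing gradient carries no directional evidence
    cos = np.clip(np.dot(g_a, g_b) / denom, -1.0, 1.0)
    return float(np.arccos(cos))
```

A small angle indicates that one shared similarity transformation can reduce the error of both frames; an angle close to π indicates that the two frames pull in opposite directions and are candidates for a split.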

We progressively segment the sequences. At the iteration, each sequence is divided into segments. We compute for all and detect the split points with the largest . In order to spread the split points evenly across the whole sequence, we perform non-maximum suppression when selecting them. After selecting the largest one, its neighboring candidates ( is the number of frames in sequence ) are suppressed, and the next largest one is selected from the remaining candidates, again with non-maximum suppression. This procedure is repeated until split points are selected. We put the consecutive frames between two adjacent split points into a segment, and use the method described below to estimate the similarity transformations and submaps jointly for all segments. When the optimization is done, we detect split points for each sequence again, and re-separate the sequence into multiple segments. We repeat this process until the average re-projection error is below a threshold or each segment contains only one frame. In this way, errors are progressively propagated and reduced. The procedure of our segment-based coarse-to-fine refinement scheme is illustrated in Fig. 6.
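The greedy selection with non-maximum suppression described above can be sketched as follows. The suppression radius is passed in as a plain parameter here; the rule that ties it to the sequence length is not reproduced, and the names `select_split_points`, `segments_from_split_points`, `num_splits` and `radius` are hypothetical.

```python
def select_split_points(angles, num_splits, radius):
    """Pick up to num_splits split points by greedy non-maximum suppression.

    angles[i] is the inconsistency between frames i and i+1.
    After a candidate is chosen, candidates within `radius` of it are
    suppressed. Returns the chosen indices in increasing order.
    """
    order = sorted(range(len(angles)), key=lambda i: angles[i], reverse=True)
    suppressed = [False] * len(angles)
    selected = []
    for i in order:
        if len(selected) == num_splits:
            break
        if suppressed[i]:
            continue
        selected.append(i)
        for j in range(max(0, i - radius), min(len(angles), i + radius + 1)):
            suppressed[j] = True
    return sorted(selected)

def segments_from_split_points(num_frames, split_points):
    """Cut after frame p for every split point p; ends are exclusive."""
    bounds = [0] + [p + 1 for p in sorted(split_points)] + [num_frames]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]
```

With `angles = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]`, two split points and a radius of 1, the second-largest candidate (index 2) is suppressed by its neighbor, so the splits fall between frames 1/2 and 4/5.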

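| Datasets | Frames | Step | Sampled Frames | Resolution | Feature Tracking: CPT | Feature Tracking: KNCTM | SfM Estimation: Submap Estimation | SfM Estimation: Align | SfM Estimation: Refine | Propagation: Feature Propagation | Propagation: Camera Estimation | Reprojection Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | | 1 | 941 | | 46.5s | 5.8s | 14.1s | - | - | - | - | 1.26 pixels |
| Circle | | 3 | 710 | | 63.1s | 40.9s | 13.8s | - | - | 13.9s | 4.2s | 1.07 pixels |
| Street | | 5 | 5,537 | | 7.4 min. | 7.5 min. | 176.0s | 3.6s | 32.0s | 3.4 min. | 65.1s | 2.49 pixels |
| Garden | | 3 or 5 | 21,791 | | 27.4 min. | 31.2 min. | 588.1s | 2.5s | 130.4s | 15.6 min. | 3.7 min. | 2.28 pixels |

TABLE III: Running time of ENFT-SFM.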

Algorithm Details: Suppose the number of detected split points among all sequences is . We break the sequences into a total of segments. Each segment is associated with a similarity transformation , where , w.r.t. the world coordinate system. We use BA to refine the reconstructed 3D feature points together with these similarity transformations. Different from traditional BA, the camera parameters inside each segment are fixed, so we only update the similarity transformations. The procedure is to first transform a 3D point from the world coordinate system to a local one with parameters . Then traditional perspective camera projection is employed to compute the re-projection error. Our BA function is written as


where is the number of frames in the -th segment, is the number of the 3D feature points, and is the number of the segments. is the projection function. is the image location of in the -th frame of the -th subsequence. , , and are the intrinsic matrix, rotational matrix, and translation vector, respectively. is defined as

We use the Schur complement and forward substitution [60] to solve the normal equation, which separates the update of the transformations from that of the 3D points in each iteration. It reduces the large linear system to a linear symmetric one with scale for updating transformation. It makes 3D point estimation much cheaper because each point can be updated independently by solving a linear symmetric system. Moreover, since only a few segment pairs share common points, the Schur complement is rather sparse. In SBA [33], the system of the Schur complement was explicitly constructed and solved by Cholesky decomposition. Wu et al. [64] implicitly built the Schur complement for parallel computing, but did not take full advantage of the sparsity. For acceleration, sSBA [26] proposed to utilize the sparse structure of the Schur complement and solve it with sparse Cholesky decomposition. We also utilize the sparsity and solve the system with an efficient preconditioned conjugate gradient solver similar to that of [64], which can significantly reduce the computation.
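The elimination of the 3D points through the Schur complement can be illustrated on a small dense system. In this sketch the normal equation is partitioned into a transformation block `U`, a block-diagonal point block with one 3x3 matrix per point, and a coupling block `W`; all names are assumptions of the example, and a dense `np.linalg.solve` stands in for the preconditioned conjugate gradient solver mentioned above.

```python
import numpy as np

def solve_with_schur(U, W, V_blocks, b_t, b_p):
    """Solve [[U, W], [W^T, V]] [d_t; d_p] = [b_t; b_p] by eliminating d_p.

    U: (m, m) transformation block, W: (m, 3n) coupling block,
    V_blocks: (n, 3, 3) diagonal blocks of V (one per 3D point),
    b_t: (m,), b_p: (3n,). Names and layout are illustrative.
    """
    m, n = U.shape[0], len(V_blocks)
    V_inv = np.linalg.inv(V_blocks)  # cheap: n independent 3x3 inverses
    W_blocks = W.reshape(m, n, 3)
    # W V^{-1}, computed block by block.
    WVinv = np.einsum("mni,nij->mnj", W_blocks, V_inv).reshape(m, 3 * n)
    S = U - WVinv @ W.T              # reduced (Schur complement) system
    rhs = b_t - WVinv @ b_p
    d_t = np.linalg.solve(S, rhs)    # stand-in for a PCG solver
    # Back-substitution: every point is updated independently.
    r = (b_p - W.T @ d_t).reshape(n, 3)
    d_p = np.einsum("nij,nj->ni", V_inv, r).reshape(3 * n)
    return d_t, d_p
```

The reduced system only has as many unknowns as there are transformation parameters, which is why its size, rather than the number of 3D points, dominates the memory footprint.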

Because the size of the linear system is actually determined by , we can estimate based on the available memory. Once the size linear system is reached, SfM refinement is performed in the following two steps. In the first step, we only select the split points to split the sequences, and solve (8) to refine the result. In the second step, we iteratively perform a local BA for each sequence by re-splitting the sequence into multiple segments with detected split points and refining them by solving (8) while fixing the cameras and 3D points in other sequences. This process stops when all sequences are processed. This strategy makes it possible to handle large data efficiently and robustly with limited memory consumption.

Finally, we fix the 3D points and estimate the camera poses respectively for all frames. During the course of iterations, errors are quickly reduced.
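To make the transformation chain of the segment-based BA concrete, the sketch below evaluates one re-projection residual: a world point is first mapped into the local frame of a segment by the segment's similarity transformation, and then projected by a camera whose parameters are fixed inside the segment. The 7-vector layout (rotation vector, translation, scale), the composition order `s * R @ X + t` and all function names are assumptions of this illustration.

```python
import numpy as np

def rodrigues(w):
    """Rotation matrix from a rotation vector (Rodrigues formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def project(K, R, t, X):
    """Perspective projection of a 3D point X with intrinsics K."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def segment_residual(a, K, R_cam, t_cam, X_world, x_obs):
    """Residual of one observation under a segment similarity `a`.

    a = (rotation vector (3), translation (3), scale (1)); only `a`
    and X_world would be optimized, the camera stays fixed.
    """
    a = np.asarray(a, dtype=float)
    R_seg, t_seg, s = rodrigues(a[:3]), a[3:6], a[6]
    X_local = s * (R_seg @ X_world) + t_seg  # world -> segment frame
    return x_obs - project(K, R_cam, t_cam, X_local)
```

Summing the squared norms of such residuals over all segments, frames and visible points gives an objective of the kind minimized by the segment-based BA.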

VII Experimental Results

| Algorithms | Running Time | Average Track Length |
|---|---|---|
| C-SIFT | 38.4s | 1.73 |
| CPT | 63.1s | 2.28 |
| CPT+NCTM | 150.7s | 3.10 |
| CPT+KNCTM | 104.0s | 2.68 |
| BF-SIFT | 1086.4s | 2.71 |

TABLE IV: Performance of different algorithms.

We evaluate our method with several challenging sequences. Running time, excluding I/O, is listed in Table III; it is obtained on a desktop PC with an Intel i7-4770K CPU, 8GB memory, and an NVIDIA GTX780 graphics card. The operating system is 64-bit Windows 7. Only the feature tracking component is accelerated by GPU. We use 64D descriptors for SIFT features. Our SIFT GPU implementation is inspired by [61] but runs faster. For SfM estimation, we optimize the code by applying SSE instructions, but only use a single thread without parallel computing. For the sequences captured by us, since the intrinsic matrix is known, we optimize the SfM code by incorporating this prior to improve the robustness and efficiency. The Garden dataset contains 6 sequences, which are further split into 37 shorter sequences, from which we sample the frames by setting the step to 3 or 5. The source code and datasets can be found on our project website (http://www.zjucvg.net/ls-acts/ls-acts.html).

As our consecutive point tracking can handle wide-baseline images, frame-by-frame tracking is generally unnecessary. For our datasets listed in Table III, we usually extract one frame for every frames to apply feature tracking. We then quickly propagate the feature points to the other frames by KLT with GPU acceleration, which further saves computation. In addition, in order to reduce the influence of image noise and blur, for each input frame , we perform matching with two past frames. One is the last frame , and the other (denoted as ) is the farthest frame that shares over 300 common features with . Note that only a small number of features in need to be matched with , which does not increase computation much.

VII-A Quantitative Evaluation of Feature Tracking

We compare the following feature tracking methods: consecutive SIFT matching (C-SIFT), our consecutive point tracking (CPT), brute-force SIFT matching (BF-SIFT), our consecutive point tracking with non-consecutive track matching (CPT+NCTM), and our consecutive point tracking with keyframe-based non-consecutive track matching (CPT+KNCTM).

C-SIFT extracts and matches SIFT features only in consecutive frames. It is a common strategy for feature tracking. The advantage is that the complexity is linear in the number of frames. However, feature dropout could occur due to global indistinctiveness or image noise, which produces many short tracks. The brute-force SIFT matching exhaustively compares extracted SIFT features, and its complexity is quadratic in the number of processed frames. In comparison, the complexity of our method (CPT+NCTM) is linear in the number of processed frames and the number of overlapping frame pairs, while high-quality results are maintained.

Fig. 7: The recovered 3D points (track length 3) and camera trajectories using feature tracks computed by different matching algorithms: (a) C-SIFT; (b) CPT; (c) CPT+NCTM; (d) CPT+KNCTM; (e) BF-SIFT. (f-i) Superimposing the camera trajectories (highlighted in red) in (a-d) to (e). (j-m) Magnified regions of (f)-(i).

The “Circle” sequence contains frames. To make computation feasible for a few prior methods, we select one frame for every 3 consecutive ones, which forms a new sequence containing 710 frames in total. Table IV lists the running time with GPU acceleration. Our consecutive point tracking (CPT) needs a bit more time than C-SIFT, but it significantly extends the lifetime of most tracks. With our non-consecutive track matching, common feature tracks scattered over disjoint subsequences are connected, further extending track lifetime. Compared with BF-SIFT, the most computationally expensive method, our result (CPT+NCTM) obtains more long feature tracks and the computation is much faster. With keyframe-based acceleration, our non-consecutive track matching time is further reduced significantly (from 87.6s to 40.9s), without much influence on the matching result. Table IV also lists the average length of all tracks with length . The computed average length is short because we also take into account unmatched features with track length 1. The quality of the SfM results computed by BF-SIFT, CPT+NCTM and CPT+KNCTM is quite comparable, as shown in Fig. 7.

VII-B Comparison with Other SfM/SLAM Systems

| Sequence | ENFT-SFM | ENFT-SFM (Keyframes) | ORB-SLAM | VisualSFM (Keyframes) | OpenMVG (Keyframes) |
|---|---|---|---|---|---|
| 00 | 4.58 / 100% | 4.76 / 100% | 5.33 | 2.78 / 3.71% | 5.83 / 0.7% |
| 01 | 57.20 / 100% | 53.96 / 100% | X | 52.34 / 12.46% | 8.79 / 2.08% |
| 02 | 28.13 / 100% | 28.26 / 100% | 21.28 | 1.77 / 4.53% | 50.36 / 3.74% |
| 03 | 2.82 / 100% | 2.94 / 100% | 1.51 | 0.28 / 12.05% | 3.53 / 8.43% |
| 04 | 0.66 / 100% | 0.66 / 100% | 1.62 | 0.76 / 23.44% | 5.14 / 14.06% |
| 05 | 2.88 / 100% | 3.48 / 100% | 4.85 | 9.77 / 7.42% | 22.42 / 9.07% |
| 06 | 14.24 / 100% | 14.43 / 100% | 12.34 | 8.58 / 7.41% | 3.16 / 3.37% |
| 07 | 1.83 / 100% | 2.03 / 100% | 2.26 | 3.85 / 7.78% | 7.75 / 5% |
| 08 | 30.74 / 100% | 28.32 / 100% | 46.68 | 0.81 / 0.90% | 17.82 / 2.58% |
| 09 | 5.63 / 100% | 5.88 / 100% | 6.62 | 0.90 / 4.92% | 14.26 / 3.36% |
| 10 | 19.53 / 100% | 18.49 / 100% | 8.80 | 5.70 / 6.05% | 27.06 / 7.01% |

TABLE V: Localization error (RMSE (m)/Completeness) comparison in KITTI odometry dataset.
| Sequence | ENFT-SFM | ENFT-SFM (Keyframes) | ORB-SLAM | VisualSFM (Keyframes) | OpenMVG (Keyframes) |
|---|---|---|---|---|---|
| fr1_desk | 2.71/99.84% | 2.96/100% | 1.69 | 2.74/100% | X |
| fr1_floor | 4.08/96.70% | 3.93/100% | 2.99 | 53.11/69.23% | 0.52/6.92% |
| fr1_xyz | 1.25/100% | 1.59/100% | 0.90 | 1.43/100% | X |
| fr2_360_kidnap | 13.57/91.47% | 15.31/100% | 3.81 | 10.08/50.91% | 5.21/14.55% |
| fr2_desk | 2.43/100% | 2.27/100% | 0.88 | 1.79/100% | 1.38/13.95% |
| fr2_desk_person | 2.46/100% | 2.55/100% | 0.63 | 1.92/100% | 2.16/97.01% |
| fr2_xyz | 0.81/100% | 0.73/100% | 0.30 | 0.71/100% | 5.74/97.6% |
| fr3_long_office | 1.21/100% | 1.44/100% | 3.45 | 1.15/100% | 2.94/32.74% |
| fr3_nst_tex_far | 3.60/86.58% | 7.76/100% | X | 7.29/100% | 35.64/3.79% |
| fr3_nst_tex_near | 1.87/100% | 1.66/100% | 1.39 | 1.13/100% | 3.4/39.13% |
| fr3_sit_half | 1.50/100% | 1.55/100% | 1.34 | 2.30/100% | 0.68/9.3% |
| fr3_sit_xyz | 0.84/100% | 1.39/100% | 0.79 | 1.28/100% | 1.03/100% |
| fr3_str_tex_far | 0.94/100% | 0.95/100% | 0.77 | 2.15/100% | 1.12/100% |
| fr3_str_tex_near | 1.86/100% | 1.82/100% | 1.58 | 0.95/100% | 0.97/19.74% |
| fr3_walk_half | 2.08/100% | 2.21/100% | 1.74 | 1.88/100% | X |
| fr3_walk_xyz | 1.30/100% | 1.74/100% | 1.24 | 1.62/100% | X |

TABLE VI: Localization error (RMSE (cm)/Completeness) comparison on TUM RGB-D dataset.
Fig. 8: The recovered camera trajectories by ENFT-SFM in KITTI odometry 00-10 sequences.

We compare our ENFT-SFM system with state-of-the-art SfM/SLAM systems (i.e. ORB-SLAM [40], VisualSFM [63, 64, 62] and OpenMVG [39]) using our datasets and other public benchmark datasets (i.e. KITTI odometry dataset [18] and TUM RGB-D dataset [57]). Since VisualSFM and OpenMVG are mainly designed for unordered image datasets, we extract keyframes from the original sequences as input for VisualSFM and OpenMVG. For fair comparison, our method processes both original sequences and extracted keyframes for KITTI odometry dataset and TUM RGB-D dataset.

For KITTI and TUM RGB-D datasets, we align recovered camera trajectories and ground truth by estimating a 7DoF similarity transformation. The RMSE and completeness of camera trajectories for all methods are listed in Tables V and VI. “X” denotes that the map cannot be accurately initialized or processed. The recovered camera trajectories of sequences 00-10 from KITTI odometry dataset by our method are shown in Fig. 8. Because sequences 01 and 08 do not contain loops, the drift cannot be corrected, leading to large RMSE.

For ORB-SLAM, we directly quote the RMSE of the keyframe trajectory reported in their paper. Compared with ORB-SLAM, our method achieves comparable results on the KITTI odometry dataset. We note that only our method is able to process all sequences (the camera poses of some frames in TUM RGB-D sequences are not recovered due to extremely serious motion blur, occlusion or insufficient textured regions). We fix the parameters for both the KITTI and TUM RGB-D datasets except for the maximum frame number for each sequence segment. It is set as for KITTI odometry dataset and for TUM RGB-D dataset respectively. Since the camera moves fast in the KITTI odometry dataset, the maximum frame number for each segment should be smaller to reduce the accumulation error.

For our multi-sequence data, since ORB-SLAM cannot directly handle multiple sequences, we concatenate the multiple sequences into a single sequence by re-ordering the frame index. The input frame rate is set to 10fps for ORB-SLAM (we use ORB-SLAM2: https://github.com/raulmur/ORB_SLAM2). The recovered camera trajectories by ORB-SLAM are shown in Figs. 9 and 10. The camera poses of many frames are not recovered due to unsuccessful relocalization. Although some loops are closed, the optimization is stuck in a local optimum. The reason is twofold. On the one hand, the common features matched among non-consecutive frames by a traditional bag-of-words place recognition method [17] are insufficient for robust SfM/SLAM. On the other hand, using pose graph optimization [56, 28] may not sufficiently minimize accumulation error, and traditional BA is easily stuck in a local optimum if a good starting point is not provided.

Fig. 9: Reconstruction comparison on the “Street” example. (a) SfM result of ENFT-SFM. (b) SfM result of VisualSFM, which is separated to 3 individual models. (c) The recovered camera trajectory by ORB-SLAM.

VisualSFM does not work that well on the KITTI odometry dataset and our long sequences, as shown in Table V and Figs. 9 and 10. Note that the matching time on our data is overly long for VisualSFM, so we have to use our non-consecutive feature tracking algorithm to get the feature matching results. The produced SfM results still have the drifting problem, and the whole camera trajectory is easily separated into multiple segments. We thus select the largest segment for computing RMSE and completeness. One reason for this drifting problem is the incremental SfM, which may not effectively eliminate accumulated errors. Another explanation is that sequence continuity/ordering is not completely utilized. Since the KITTI dataset is captured by an autonomous driving platform, each frame is only matched to its consecutive frames; once camera tracking fails in one frame, the connection between two neighboring subsequences is broken. In our experiments, OpenMVG usually performs worse than VisualSFM.

Fig. 10: Reconstruction comparison on the “Garden” example. (a) Two individual models reconstructed by VisualSFM. The reconstructed SfM result contains 60 individual models. (b) The recovered camera trajectory by ORB-SLAM.
Fig. 11: The reconstruction result of “Colosseum” dataset by our method. Cameras in the same sequence are encoded with the same color.
Fig. 12: Matching result with different . (a) First-pass matching result. (b-d) Results of the second-pass matching only using the homography corresponding to the left green book with , respectively. The matches that do not belong to the green book are outliers. (e) Second-pass matching result using all homographies with . 95 matches are obtained.

VII-C Results on General Image Collections

Although our segment-based SfM method is originally designed for handling sequences, it can be naturally extended to work with general image collections. The basic idea is to separate the unordered image data into a set of sequences according to their common matches.

We first select two images with the maximum number of common features to constitute an initial sequence. Then we select another image, which has the most common features with the head or tail frame, and add it into the sequence as the new head or tail. This process repeats until no image can be added. Then we begin to build another sequence based on remaining images. For some 3D points that have only one or two corresponding features in one sequence, we additionally select related images from other sequences to help estimate the 3D positions.
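The greedy grouping of an unordered collection into sequences can be sketched as follows. The input is assumed to be a symmetric matrix of pairwise common-feature counts; the stopping threshold `min_matches` and the function name `build_sequences` are hypothetical, since the exact criterion for "no image can be added" is not specified here.

```python
def build_sequences(M, min_matches=1):
    """Greedily chain images into sequences from pairwise match counts.

    M[i][j] is the number of common features between images i and j
    (symmetric). A sequence is seeded with the best-matching remaining
    pair and grown at its head or tail. `min_matches` is an assumed
    stopping threshold.
    """
    remaining = set(range(len(M)))
    sequences = []
    while remaining:
        pairs = [(M[i][j], i, j) for i in sorted(remaining)
                 for j in sorted(remaining) if i < j and M[i][j] >= min_matches]
        if not pairs:
            # No matched pair left: leftover images become singletons.
            sequences.extend([[i] for i in sorted(remaining)])
            break
        _, i, j = max(pairs, key=lambda p: p[0])
        seq = [i, j]
        remaining -= {i, j}
        while remaining:
            head, tail = seq[0], seq[-1]
            cand = max(sorted(remaining),
                       key=lambda k: max(M[head][k], M[tail][k]))
            if max(M[head][cand], M[tail][cand]) < min_matches:
                break  # nothing connects to either end any more
            if M[head][cand] >= M[tail][cand]:
                seq.insert(0, cand)   # becomes the new head
            else:
                seq.append(cand)      # becomes the new tail
            remaining.discard(cand)
        sequences.append(seq)
    return sequences
```

On a toy collection where images 0, 1, 2 are strongly connected and images 3, 4 only match each other, this produces two sequences.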

Fig. 11 shows our SfM result on the Colosseum dataset [30, 32], which contains images. We directly use the feature matching result obtained by VisualSFM. Because our current SfM implementation requires that the intrinsic camera parameters and radial distortion are known for each image, we calibrate the matched feature positions according to the parameters calibrated by VisualSFM. Then we use our extended segment-based SfM method to estimate camera poses and 3D points. The processing time of our SfM estimation in a single thread is 125 seconds, which is even shorter than that of VisualSFM with GPU enabled (269 seconds).

VII-D Parameter Configuration and Limitation

The parameters can be easily set in our system because most of them are not sensitive and can be left at their default values. The most important parameter is , which controls the strength to mark outliers during feature tracking. A large could result in many matches, and introduce outliers. In our experiments, we conservatively set to a small value . By removing a small set of matches, the system becomes reliable for high-quality SfM. Fig. 12 shows the matching result with different . After the first-pass matching, 35 matches are obtained. The second-pass matching result with is shown in Fig. 12(b). A few features that do not belong to the green book are included. These outliers are removed by using smaller values, as shown in (c) and (d). By setting , almost all outliers are removed and 95 reliable matches are obtained.

The proposed two-pass matching works best if the scene can be represented by multiple planes. For a video sequence with dense frames, this condition is generally satisfied, because the image transformation between two consecutive frames is small enough to be approximated by one or multiple homographies. We note that even if the scene deviates from piecewise planarity, our second-pass matching still works, as the rectified images are close to the target ones. Our method may not be suitable for wide-baseline sparse images where the number of matches obtained by first-pass matching is too small.

VIII Conclusion and Discussion

We have presented a robust and efficient non-consecutive feature tracking (ENFT) method for robust SfM, which consists of two main steps, i.e., consecutive point tracking and non-consecutive track matching. Different from typical sequential matchers, e.g., KLT, we use invariant features and propose a two-pass matching strategy to significantly extend the track lifetime and reduce the feature sensitivity to noise and image distortion. The obtained tracks are used to estimate a match matrix to detect disjointed subsequences with overlapping views. A new segment-based coarse-to-fine SfM estimation scheme is also introduced to effectively reduce accumulation error for long sequences. The presented ENFT-SFM system can handle tracking and registering large video datasets with limited memory consumption.

Our ENFT method greatly helps SfM, but, following convention, it only considers feature tracking on non-deforming objects. Part of our future work is to handle dynamic objects. In addition, although the proposed method is based on SIFT features, nothing prevents the use of other representations, e.g., SURF [3] and ORB [48], for further acceleration.


We thank Changchang Wu for his kind help in running VisualSFM in our datasets. This work was partially supported by NSF of China (Nos. 61272048, 61232011), the Fundamental Research Funds for the Central Universities (2015XZZX005-05), a Foundation for the Author of National Excellent Doctoral Dissertation of PR China (No. 201245), and two grants from the Research Grants Council of the Hong Kong SAR (Project Nos. 2150760, CUHK417913).


  • [1] Sameer Agarwal, Noah Snavely, Steven M. Seitz, and Richard Szeliski. Bundle adjustment in the large. In ECCV, Part II, pages 29–42, 2010.
  • [2] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. In ICCV, pages 72–79, 2009.
  • [3] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
  • [4] Robert O. Castle, Georg Klein, and David W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In ISWC, pages 15–22, 2008.
  • [5] Laura A. Clemente, Andrew J. Davison, Ian D. Reid, José Neira, and Juan D. Tardós. Mapping large loops with a single hand-held camera. In Robotics: Science and Systems, 2007.
  • [6] Brian Clipp, Jongwoo Lim, Jan-Michael Frahm, and Marc Pollefeys. Parallel, real-time visual SLAM. In IROS, pages 3961–3968, 2010.
  • [7] Kai Cordes, Oliver Muller, Bodo Rosenhahn, and Jorn Ostermann. Feature trajectory retrieval with application to accurate structure and motion recovery. In George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Song Wang, Kim Kyungnam, Bedrich Benes, Kenneth Moreland, Christoph Borst, Stephen DiVerdi, Chiang Yi-Jen, and Jiang Ming, editors, Advances in Visual Computing, volume 6938 of Lecture Notes in Computer Science, pages 156–167. Springer Berlin / Heidelberg, 2011.
  • [8] David J. Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In CVPR, pages 3001–3008, 2011.
  • [9] Mark Cummins and Paul Newman. Appearance-only SLAM at large scale with FAB-MAP 2.0. International Journal of Robotics Research, 30(9):1100–1123, 2011.
  • [10] Mark Joseph Cummins and Paul Newman. FAB-MAP: probabilistic localization and mapping in the space of appearance. International Journal of Robotics Research, 27(6):647–665, 2008.
  • [11] Ethan Eade and Tom Drummond. Unified loop closing and recovery for real time monocular SLAM. In BMVC, 2008.
  • [12] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In 13th European Conference on Computer Vision, Part II, pages 834–849. Springer, 2014.
  • [13] Chris Engels, Friedrich Fraundorfer, and David Nistér. Integration of tracked and recognized features for locally and globally robust structure from motion. In VISAPP (Workshop on Robot Perception), pages 13–22, 2008.
  • [14] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
  • [15] Jan-Michael Frahm, Pierre Fite Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, and Svetlana Lazebnik. Building Rome on a cloudless day. In ECCV, Part IV, pages 368–381, 2010.
  • [16] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In CVPR, pages 1434–1441, 2010.
  • [17] Dorian Gálvez-López and Juan D. Tardós. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, 2012.
  • [18] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pages 3354–3361, 2012.
  • [19] Michael Grabner and Horst Bischof. Extracting object representations from local feature trajectories. In Joint Hungarian-Austrian Conference on Image Processing and Pattern Recognition, volume 192, pages 265–272.
  • [20] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
  • [21] Michal Havlena, Akihiko Torii, and Tomás Pajdla. Efficient structure from motion by graph optimization. In ECCV, Part II, pages 100–113, 2010.
  • [22] Kin Leong Ho and Paul Newman. Detecting loop closure with scene sequences. International Journal of Computer Vision, 74(3):261–286, 2007.
  • [23] Arnold Irschara, Christopher Zach, Jan-Michael Frahm, and Horst Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR, 2009.
  • [24] Yuxin Jin, Linmi Tao, Huijun Di, Naveed I Rao, and Guangyou Xu. Background modeling from a free-moving camera by multi-layer homography algorithm. In 15th IEEE International Conference on Image Processing, pages 1572–1575, 2008.
  • [25] Georg Klein and David W. Murray. Parallel tracking and mapping for small AR workspaces. In ISMAR, pages 225–234, 2007.
  • [26] Kurt Konolige. Sparse sparse bundle adjustment. In Proceedings of British Machine Vision Conference, pages 1–11, 2010.
  • [27] Kurt Konolige and Motilal Agrawal. FrameSLAM: From bundle adjustment to real-time visual mapping. IEEE Transactions on Robotics, 24(5):1066–1077, 2008.
  • [28] Rainer Kümmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. g2o: A general framework for graph optimization. In IEEE International Conference on Robotics and Automation, pages 3607–3613, 2011.
  • [29] Xiaowei Li, Changchang Wu, Christopher Zach, Svetlana Lazebnik, and Jan-Michael Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In Proceedings of 10th European Conference on Computer Vision, Part I, pages 427–440, 2008.
  • [30] Yunpeng Li, Noah Snavely, and Daniel P. Huttenlocher. Location recognition using prioritized feature matching. In Proceedings of 11th European Conference on Computer Vision, Part II, pages 791–804, 2010.
  • [31] Jongwoo Lim, Jan-Michael Frahm, and Marc Pollefeys. Online environment mapping. In CVPR, pages 3489–3496, 2011.
  • [32] Yin Lou, Noah Snavely, and Johannes Gehrke. Matchminer: Efficient spanning structure mining in large image collections. In Proceedings of the 12th European Conference on Computer Vision, Part II, 2012.
  • [33] M.I. A. Lourakis and A.A. Argyros. SBA: A Software Package for Generic Sparse Bundle Adjustment. ACM Trans. Math. Software, 36(1):1–30, 2009.
  • [34] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [35] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674–679, 1981.
  • [36] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image Vision Comput., 22(10):761–767, 2004.
  • [37] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, 2005.
  • [38] Jean-Michel Morel and Guoshen Yu. ASIFT: A new framework for fully affine invariant image comparison. SIAM J. Img. Sci., 2(2):438–469, 2009.
  • [39] Pierre Moulon, Pascal Monasse, Renaud Marlet, and others. OpenMVG: An open multiple view geometry library. https://github.com/openMVG/openMVG.
  • [40] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  • [41] Kai Ni, Drew Steedly, and Frank Dellaert. Out-of-core bundle adjustment for large-scale 3D reconstruction. In ICCV, pages 1–8, 2007.
  • [42] David Nistér, Oleg Naroditsky, and James R. Bergen. Visual odometry. In CVPR, pages 652–659, 2004.
  • [43] David Nistér and Henrik Stewénius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.
  • [44] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
  • [45] Edwin Olson, John J. Leonard, and Seth J. Teller. Fast iterative alignment of pose graphs with poor initial estimates. In Proceedings of IEEE International Conference on Robotics and Automation, pages 2262–2269, 2006.
  • [46] Marc Pollefeys, David Nistér, Jan-Michael Frahm, Amir Akbarzadeh, Philippos Mordohai, Brian Clipp, Christoph Engels, David Gallup, Seon Joo Kim, Paul Merrell, C. Salmi, Sudipta N. Sinha, B. Talton, Liang Wang, Qingxiong Yang, Henrik Stewénius, Ruigang Yang, Greg Welch, and Herman Towles. Detailed real-time urban 3D reconstruction from video. International Journal of Computer Vision, 78(2-3):143–167, 2008.
  • [47] Eric Royer, Maxime Lhuillier, Michel Dhome, and Jean-Marc Lavest. Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision, 74(3):237–260, 2007.
  • [48] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski. ORB: an efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision, pages 2564–2571, 2011.
  • [49] Frederik Schaffalitzky and Andrew Zisserman. Automated location matching in movies. Computer Vision and Image Understanding, 92(2-3):236–264, 2003.
  • [50] Grant Schindler, Matthew Brown, and Richard Szeliski. City-scale location recognition. In CVPR, 2007.
  • [51] Sudipta N. Sinha, Drew Steedly, and Richard Szeliski. A multi-stage linear approach to structure from motion. In Trends and Topics in Computer Vision - ECCV 2010 Workshops, pages 267–281, 2010.
  • [52] Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, 2003.
  • [53] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3D. ACM Trans. Graph., 25(3):835–846, 2006.
  • [54] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Skeletal sets for efficient structure from motion. In CVPR, 2008.
  • [55] Drew Steedly, Irfan A. Essa, and Frank Dellaert. Spectral partitioning for structure from motion. In ICCV, pages 996–1003, 2003.
  • [56] Hauke Strasdat, J. M. M. Montiel, and Andrew J. Davison. Scale drift-aware large scale monocular SLAM. In Robotics: Science and Systems VI, 2010.
  • [57] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
  • [58] Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust monocular SLAM in dynamic environments. In IEEE International Symposium on Mixed and Augmented Reality, pages 209–218, 2013.
  • [59] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, April 1991.
  • [60] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment - a modern synthesis. In Workshop on Vision Algorithms, pages 298–372, 1999.
  • [61] Changchang Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/~ccwu/siftgpu, 2007.
  • [62] Changchang Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
  • [63] Changchang Wu. VisualSFM: A visual structure from motion system. http://homes.cs.washington.edu/~ccwu/vsfm/, 2013.
  • [64] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M. Seitz. Multicore bundle adjustment. In CVPR, pages 3057–3064, 2011.
  • [65] Changchang Wu, Brian Clipp, Xiaowei Li, Jan-Michael Frahm, and Marc Pollefeys. 3D model matching with viewpoint-invariant patches (VIP). In CVPR, 2008.
  • [66] Guofeng Zhang, Zilong Dong, Jiaya Jia, Tien-Tsin Wong, and Hujun Bao. Efficient non-consecutive feature tracking for structure-from-motion. In ECCV, Part V, pages 422–435, 2010.
  • [67] Guofeng Zhang, Xueying Qin, Wei Hua, Tien-Tsin Wong, Pheng-Ann Heng, and Hujun Bao. Robust metric reconstruction from challenging video sequences. In CVPR, 2007.