Global-Local Airborne Mapping (GLAM):
Reconstructing a City from Aerial Videos
Monocular visual SLAM has become an attractive practical approach for robot localization and 3D environment mapping, since cameras are small, lightweight, inexpensive, and produce high-rate, high-resolution data streams. Although numerous robust tools have been developed, most existing systems are designed to operate in terrestrial environments and at relatively small scale (a few thousand frames) due to constraints on computation and storage.
In this paper, we present a feature-based visual SLAM system for aerial video whose simple design permits near real-time operation, and whose scalability permits large-area mapping using tens of thousands of frames, all on a single conventional computer. Our approach consists of two parallel threads: the first incrementally creates small locally consistent submaps and estimates camera poses at video rate; the second aligns these submaps with one another to produce a single globally consistent map via factor graph optimization over both poses and landmarks. Scale drift is minimized through the use of 7-degree-of-freedom similarity transformations during submap alignment.
We quantify our system’s performance on both simulated and real data sets, and demonstrate city-scale map reconstruction accurate to within 2 meters using nearly 90,000 aerial video frames, which is, to our knowledge, the largest and fastest such reconstruction to date.
Vision Systems Inc. (VSI)
Providence, RI, USA \addinstitution Sciopic Inc.,
Winchester, MA, USA \addinstitution Massachusetts Institute of Technology,
Cambridge, MA, USA
(Work performed while at VSI)
1 Introduction
Recent progress in unmanned aerial vehicle (UAV) technology has led to exponential growth in the use of drones with onboard sensing. Originally designed for military applications, camera-equipped UAVs are now commonplace in domains such as commercial surveillance, photography, disaster management, product delivery, and mapping [Colomina and Molina(2014)]. An essential step toward better utilization of aerial videos and autonomous drones is real-time localization of the vehicle and accurate mapping of the observed world. Localization relying solely on onboard inertial and GPS measurements, however, cannot achieve pixel-level accuracy due to error accumulation, latency, and relatively coarse precision; furthermore, onboard sensing can be unreliable in GPS-denied environments. Visual simultaneous localization and mapping (SLAM) attempts to address these challenges using camera images to augment or replace other sensors.
Our proposed method, nicknamed Global-Local Airborne Mapping (GLAM), approaches large-scale visual SLAM by partitioning the camera’s video stream into small local submaps created using point features, epipolar geometry between keyframes, point triangulation, and incremental bundle adjustment (BA). Submaps are aligned globally using a graph-based least squares optimization that minimizes the distance between corresponding 3D points. Small temporal overlap ensures that sequential submaps observe common scene content and hence provides correspondences for alignment, while restricting submap size to a small number of frames mitigates drift accumulation. Associations between non-overlapping submaps, for instance due to loop closures, are detected via fast bag-of-visual-words recognition. This paper demonstrates that such a system can run in near real-time on videos of the scale of a hundred thousand frames.
2 Related Work
Monocular SLAM is a long-researched subject that has evolved from filter-based [Thrun et al.(2005)Thrun, Burgard, and Fox, Davison et al.(2007)Davison, Reid, Molton, and Stasse] to keyframe-based [Strasdat et al.(2012)Strasdat, Montiel, and Davison, Resch et al.(2015)Resch, Lensch, Wang, Pollefeys, and Sorkine-Hornung] approaches. PTAM [Klein and Murray(2007)], originally developed for augmented reality applications, was one of the first widely used and practical real-time SLAM systems, but was limited in terms of scale and robustness. Subsequent research improved upon this point-feature and bundle-adjustment method [Strasdat et al.(2010)Strasdat, Montiel, and Davison, Resch et al.(2015)Resch, Lensch, Wang, Pollefeys, and Sorkine-Hornung]. More sophisticated monocular SLAM approaches like DTAM [Newcombe et al.(2011)Newcombe, Lovegrove, and Davison] and LSD-SLAM [Engel et al.(2012)Engel, Sturm, and Cremers] perform semi-dense reconstruction by optimizing directly on image intensities rather than discrete features. ORB-SLAM [Mur-Artal et al.(2015)Mur-Artal, Montiel, and Tardos] is a recent feature-based approach that has shown good performance on a wide variety of datasets. SVO [Forster et al.(2014)Forster, Pizzoli, and Scaramuzza] is a semi-direct monocular SLAM approach designed specifically for Micro Aerial Vehicles (MAVs); although SVO operates at very high frame rates, it does not perform loop closure and has been evaluated only on small datasets.
City-scale 3D reconstruction has received some attention in the structure-from-motion research community [Musialski et al.(2013)Musialski, Wonka, Aliaga, Wimmer, Gool, and Purgathofer]. Pollefeys et al. [Pollefeys et al.(2008)Pollefeys, Nistér, Frahm, Akbarzadeh, Mordohai, Clipp, Engels, Gallup, Kim, Merrell, et al.] reconstruct parts of a city from a hundred thousand frames using INS and metadata to simplify the computation. Agarwal et al. [Agarwal et al.(2011)Agarwal, Furukawa, Snavely, Simon, Curless, Seitz, and Szeliski] and Heinly et al. [Heinly et al.(2015)Heinly, Schönberger, Dunn, and Frahm] perform city and world scale reconstruction respectively from large collections of photographs utilizing cloud computing. Google Earth provides 3D models of a few selected cities, but the technology behind its reconstructions is largely unknown.
Real-time reconstruction of a city from aerial videos using purely visual data has remained largely unexplored in the literature: traditional visual SLAM approaches do not address the challenges of very long image sequences and large-scale maps, while structure-from-motion methods require substantial offline computation and/or information from additional sensors. Our system incorporates elements of both visual SLAM and large-scale 3D reconstruction, demonstrating fast and accurate city-scale mapping on aerial videos nearly 90,000 frames in length on a single consumer-grade computer.
3 Proposed Approach
Like many existing real-time SLAM systems, we adopt a multi-threaded strategy for computational efficiency, partitioning the problem into tasks that run in parallel. However, rather than defining the tasks as tracking and mapping [Klein and Murray(2007), Forster et al.(2014)Forster, Pizzoli, and Scaramuzza, Mur-Artal et al.(2015)Mur-Artal, Montiel, and Tardos], we define tasks as local submap creation and global submap alignment. Figure 1 shows an overview of GLAM’s building blocks.
The submap creation thread operates similarly to previous point-feature- and keyframe-based approaches [Klein and Murray(2007), Mur-Artal et al.(2015)Mur-Artal, Montiel, and Tardos, Strasdat et al.(2010)Strasdat, Montiel, and Davison] that extract point features from video frames, determine relative pose, triangulate points, and perform incremental bundle adjustment. Building the Hessian matrix for incremental bundle adjustment is the most computationally expensive step [Sibley et al.(2009)Sibley, Mei, Reid, and Newman]: as the number of frames processed increases, the cost of building the Hessian at every BA iteration becomes prohibitive for real-time operation. We therefore limit each submap to a small number of keyframes in order to maintain a relatively consistent and bounded processing rate.
Each submap consists of a set of 3D landmarks. The submap alignment thread creates and optimizes a pose-graph to determine the set of 7-DoF similarity transformations, one per submap, that minimizes the distances between corresponding landmarks. In contrast to existing approaches that close loops by creating a global database of keyframes (e.g. [Mur-Artal et al.(2015)Mur-Artal, Montiel, and Tardos]), our approach closes loops by building a visual vocabulary of submaps.
GLAM has been evaluated on both synthetic and real datasets. The former are created by simulating drone flight trajectories over an aerial LIDAR point cloud [rig(2011)], while the latter consists of a large continuous aerial video (labeled “Downtown”) with nearly 1-megapixel frames captured at fps.
The main contribution of this paper is a novel visual SLAM pipeline that (1) partitions work into parallel threads of fast local submap creation and large-scale global submap alignment; (2) reconstructs accurate city-size maps from aerial videos; and (3) operates in near real-time at the scale of a hundred thousand images.
3.1 Local Submap Creation
The following sections describe keyframe-based submap creation in greater detail. The central idea of keyframe-based SLAM is to use only frames that have sufficiently distinct information for 3D reconstruction. Each video frame is processed in sequence, with keyframes created periodically to reduce redundancy and improve efficiency—only the keyframes are used in intra-submap bundle adjustment.
Feature Extraction. Point features are extracted from each video frame. We use SIFT features [Lowe(2004)] throughout our system due to their proven robustness to viewpoint, scale, and orientation changes. SIFT extraction and matching can be slow, so we use a GPU implementation [Wu(2007)] for greater efficiency.
Tracking. Existing 3D landmarks are projected into the current frame using the pose of the previous frame, since inter-frame motion is assumed to be small. Points that fall outside the frame or that were originally viewed from a substantially different angle are removed. The SIFT descriptors of the remaining points are then matched to those of the current image features to obtain initial 2D-3D correspondences. The 3D pose of the current frame is then estimated using Perspective-n-Point (PnP) localization [Lepetit et al.(2009)Lepetit, Moreno-Noguer, and Fua] embedded in a RANSAC outer loop.
PnP can fail if the video is discontinuous or the viewpoint has changed drastically between frames. If the number of PnP inliers is too low, the submap is terminated at the previous frame and a new submap is initiated at the current frame.
Keyframes. As new parts of the world come into view, the number of PnP inliers continually decreases. When this number falls below a threshold, the system adds a new keyframe and triangulates new 3D points. The choice of this threshold is key to balancing speed and stability: a high value causes frequent keyframe additions and thus slows the overall system, while a low value reduces visual overlap between keyframes and may lead to unstable or failed bundle adjustment.
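As an illustrative sketch only (not the system's actual implementation), the landmark-projection step of tracking and the keyframe trigger described above can be written as follows; `project_landmarks` and `needs_new_keyframe` are hypothetical names, and a simple pinhole model with intrinsics K is assumed:

```python
import numpy as np

def project_landmarks(K, R, t, X):
    """Project 3D landmarks X (N, 3) into a camera with intrinsics K,
    world-to-camera rotation R and translation t; returns pixel
    coordinates and a mask of points in front of the camera."""
    Xc = X @ R.T + t                     # world -> camera coordinates
    in_front = Xc[:, 2] > 0
    uv_h = Xc @ K.T                      # homogeneous pixel coordinates
    uv = uv_h[:, :2] / uv_h[:, 2:3]
    return uv, in_front

def needs_new_keyframe(n_pnp_inliers, threshold):
    # A new keyframe (and triangulation of new landmarks) is triggered
    # once tracked-landmark support drops below the threshold.
    return n_pnp_inliers < threshold
```

In the full system the projected landmarks would additionally be culled by viewing angle and matched to image features by descriptor before PnP.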
To improve stability further, an additional keyframe is chosen halfway between the previous keyframe and the current frame. This middle keyframe’s pose is initialized to its PnP estimate, and its 2D PnP inliers are added as observations of their counterpart 3D landmarks for BA. Geometrically consistent 2D matches are obtained between the middle keyframe and the current frame via RANSAC-based essential matrix estimation, and these matches are triangulated using the PnP pose estimates to form new 3D landmarks and corresponding 2D observations for BA. The SIFT descriptor from the middle keyframe is assigned to each landmark.
Since the triangulation of new landmarks does not take into account the landmarks that have already been constructed, a subset may be duplicated. Therefore, any new landmark whose descriptor matches any existing landmark is removed. The current frame is finally added as a keyframe and its pose and observations are added to BA.
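One standard way to triangulate a new 2D match, given the two PnP pose estimates as 3x4 projection matrices, is linear (DLT) triangulation; the sketch below shows the generic method under that assumption, not necessarily the paper's exact implementation:

```python
import numpy as np

def triangulate_dlt(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one correspondence from two 3x4
    projection matrices P1, P2; returns the 3D point (world coords)."""
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point; stack them and solve by SVD.
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                # null-space vector = homogeneous point
    return Xh[:3] / Xh[3]
```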
This selection strategy depends on neither physical nor temporal distance between keyframes; instead, keyframes are added only when sufficient new visual information is available, allowing the system to process videos acquired at arbitrary speed. In the Downtown dataset, for instance, the system selected keyframes spaced roughly 25 to 100 frames apart.
Incremental Bundle Adjustment. Similar to PTAM [Klein and Murray(2007)] and ORB-SLAM [Mur-Artal et al.(2015)Mur-Artal, Montiel, and Tardos], GLAM runs BA every time a new keyframe is added, optimizing the poses of all keyframes and the positions of all landmarks while holding fixed the intrinsic camera parameters. Different libraries including g2o [Kuemmerle et al.(2011)Kuemmerle, Grisetti, Strasdat, Konolige, and Burgard] and ceres-solver [Agarwal et al.(2013)Agarwal, Mierle, et al.] were evaluated, but PBA [Wu et al.(2011b)Wu, Agarwal, Curless, and Seitz] performed best due to its specialization for BA problems.
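GLAM uses PBA for this step; purely to illustrate the optimization being solved, a minimal reprojection-error refinement can be sketched with a generic solver (SciPy here), using a simplified camera model and hypothetical function names:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(cam, X, f):
    # cam: [rotation vector (3), translation (3)]; X: (N, 3) points
    R = Rotation.from_rotvec(cam[:3]).as_matrix()
    Xc = X @ R.T + cam[3:]
    return f * Xc[:, :2] / Xc[:, 2:3]

def reprojection_residuals(p, n_cams, n_pts, obs, f):
    cams = p[:6 * n_cams].reshape(n_cams, 6)
    pts = p[6 * n_cams:].reshape(n_pts, 3)
    # obs: list of (camera index, point index, observed 2D pixel)
    return np.concatenate([project(cams[ci], pts[pi:pi + 1], f)[0] - uv
                           for ci, pi, uv in obs])

def bundle_adjust(cams0, pts0, obs, f):
    # Jointly refine all keyframe poses and landmark positions while
    # holding the (calibrated) focal length f fixed.
    p0 = np.concatenate([cams0.ravel(), pts0.ravel()])
    sol = least_squares(reprojection_residuals, p0,
                        args=(len(cams0), len(pts0), obs, f))
    n = 6 * len(cams0)
    return sol.x[:n].reshape(-1, 6), sol.x[n:].reshape(-1, 3)
```

Specialized BA solvers such as PBA exploit the sparse block structure of the Hessian rather than treating the problem as a dense least-squares system.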
Bootstrapping. When a new submap is initiated, no keyframes or 3D landmarks yet exist, so a strategy different from the steady-state process described above must be employed to establish initial geometry. The first frame of the submap becomes the first keyframe, with its pose fixed at the origin. Features from subsequent frames are matched against those of the first keyframe to find epipolar geometry via RANSAC-based essential matrix estimation [Hartley and Zisserman(2000)]. The current frame becomes the second keyframe once the number of RANSAC inliers falls below a threshold or the average triangulation angle exceeds a threshold; both values were held fixed for the experiments described in this paper. The inlier correspondences are triangulated to form the initial 3D landmarks.
Completion. A submap is successfully completed when the number of keyframes exceeds a threshold. We found that 20 keyframes were generally sufficient to form a stable reconstruction and minimize internal loop closures. Upon completion, all frames are re-localized to the final landmark set via PnP, which requires only that 3D-2D matches, and not features, be recomputed. The next submap is initiated to have a degree of overlap (in our experiments, 10%) with the current submap so that the two share a set of 3D points for alignment.
Outlier filtering. Incorrectly matched point features lead to outliers that can dramatically affect bundle adjustment accuracy. Our system therefore filters outliers at several stages. First, after every bundle adjustment step, any point whose reprojection error exceeds a threshold is removed. Some incorrect matches may exhibit small reprojection error, such as when triangulation rays are nearly parallel and the reconstructed landmark is very far from the rest of the scene. To remove these points, the k = 30 nearest neighbours of each point are computed using a kd-tree, and a point is removed if its average distance to its neighbours is more than 2 standard deviations above the mean. Finally, upon submap completion, any point with large reprojection error in all frames is removed.
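The neighbour-distance filter can be sketched as follows (brute-force distances are used here for clarity, whereas the system uses a kd-tree; the function name is hypothetical):

```python
import numpy as np

def filter_floating_landmarks(X, k=30, n_std=2.0):
    """Remove landmarks whose mean distance to their k nearest
    neighbours is more than n_std standard deviations above the
    cloud-wide mean of that statistic."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)[:, 1:k + 1]   # skip self-distance 0
    mean_knn = d_sorted.mean(axis=1)
    keep = mean_knn <= mean_knn.mean() + n_std * mean_knn.std()
    return X[keep], keep
```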
Focal length estimation. Use of accurate camera intrinsics is critical for full-scale reconstruction, since no single rectifying transformation can compensate for incorrect intrinsics. The published focal length of 1800 for the lens used to collect the Downtown dataset was imprecise and led to reconstruction failures. We therefore included per-keyframe focal length in the PBA optimization, observing that the median of the per-frame focal length estimates converged to 1751 as the number of frames increased.
3.2 Global Submap Alignment
As local submaps are created, a parallel thread uses them as the unit of processing for full map reconstruction. This thread discovers correspondences between submaps, including loop closures, and optimizes pose and structure globally and efficiently.
Submap matching. Although temporally adjacent submaps are associated with one another by construction, visual links must also be discovered between submaps that view common scene content, regardless of when they were created. This forms a connectivity graph in which submaps constitute nodes and commonly observed points constitute edges.
A naive implementation would identify visual links through brute-force search over every possible pair of submaps. This requires $O(n^2)$ time for $n$ submaps, making the process computationally intractable as $n$ grows. To form a connectivity graph efficiently, we employ a bag-of-words technique. A vocabulary tree is created offline using randomly-sampled SIFT descriptors drawn from aerial images; a series of hierarchically-applied k-means clustering operations partition the descriptors into a tree of visual words [Nister and Stewenius(2006)]. As each submap is completed, its landmarks’ SIFT descriptors are added to a database formed over the tree, and visual words are incrementally associated with the submap via an inverted index. The set of landmark descriptors in this submap is then used to query the database, which returns a ranked set of matching submaps and their weighted match scores. This mechanism reduces the complexity of submap matching from quadratic to linear time, as shown in Figure 6.
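To illustrate the idea, a much-simplified flat-codebook version of the inverted-index database can be sketched as follows (the actual system uses a hierarchical vocabulary tree and weighted match scores; all names here are hypothetical):

```python
import numpy as np
from collections import defaultdict

class SubmapDatabase:
    """Flat bag-of-words index over landmark descriptors; a TF-IDF
    score ranks previously added submaps against a query submap."""
    def __init__(self, words):
        self.words = words                 # (W, D) cluster centres
        self.inverted = defaultdict(list)  # word id -> [(submap, tf)]
        self.n_submaps = 0

    def _quantize(self, descs):
        # Assign each descriptor to its nearest visual word, then
        # return the normalized term-frequency vector.
        d = np.linalg.norm(descs[:, None] - self.words[None], axis=-1)
        tf = np.bincount(d.argmin(axis=1),
                         minlength=len(self.words)).astype(float)
        return tf / max(tf.sum(), 1.0)

    def add(self, submap_id, descs):
        tf = self._quantize(descs)
        for w in np.nonzero(tf)[0]:
            self.inverted[w].append((submap_id, tf[w]))
        self.n_submaps += 1

    def query(self, descs, top_k=5):
        tf = self._quantize(descs)
        scores = defaultdict(float)
        for w in np.nonzero(tf)[0]:
            if not self.inverted[w]:
                continue
            idf = np.log(self.n_submaps / len(self.inverted[w]))
            for sid, tf_db in self.inverted[w]:
                scores[sid] += tf[w] * tf_db * idf ** 2
        return sorted(scores.items(), key=lambda s: -s[1])[:top_k]
```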
The result is a graph whose edges represent potential visual matches based on an unordered bag of indexed visual words with no geometric constraints. Some graph edges may be incorrect, sharing a number of similar-looking features but not actually viewing common areas of the scene (e.g. due to repeated urban structures such as windows). The geometric consistency of correspondences is verified by estimating a closed form similarity transform between 3D landmarks in a RANSAC loop, accounting for both pose and scale differences. Edges with too few inliers are removed, as are all outlier landmark correspondences.
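The closed-form similarity estimate inside the RANSAC loop can be computed with the standard Umeyama alignment; below is a sketch under the assumption that `src` and `dst` are corresponding landmark arrays of shape (N, 3):

```python
import numpy as np

def similarity_transform(src, dst):
    """Closed-form least-squares similarity (Umeyama) mapping src onto
    dst; returns s, R, t such that dst ~ s * (src @ R.T) + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # guard against reflections
    R = U @ S @ Vt
    var_s = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In the RANSAC loop this estimator would be run on minimal 3-point samples, with inliers counted by thresholding the residual distances.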
Submap pose-graph. After establishing temporal and spatial correspondences between submaps, the next step is to create a pose-graph and optimize the poses of the submaps such that the 3D distance between corresponding points is minimized (Figure 4). Scale drift [Strasdat et al.(2010)Strasdat, Montiel, and Davison] within submaps is addressed implicitly by the keyframe addition strategy, while scale drift across submaps is addressed by using a 7 DoF similarity transform to represent each submap’s pose.
The distance between corresponding 3D points of the submaps is minimized using non-linear Gauss-Newton optimization [Hartley and Zisserman(2000)], which requires the Jacobian of the cost function w.r.t. all free parameters; here the free parameters are the 7 DoF pose of each submap and the location of each landmark. Differentiating the cost function w.r.t. the Lie group elements $(R, s, t)$ yields derivatives in the Lie algebra $\mathfrak{sim}(3)$ [ead(2014)], a 7-dimensional vector $(\omega, \upsilon, \sigma)$ representing the Lie algebra components of rotation, translation, and scale respectively.
Given a set of submaps $\{S_i\}$ and landmarks $\{X_j\}$, a 7 DoF similarity transform $T_i$ is associated with every submap $S_i$. An edge between submap $S_i$ and 3D point $X_j$ is denoted by $e_{ij}$. The scale is constrained by multiplying a small constant $\lambda < 1$ with the Lie algebra component of scale, $\sigma_i$; this scale prior prevents the solver from converging to the trivial zero-scale solution. The resulting cost function is
$$E = \sum_{e_{ij}} \left\lVert T_i \, x_{ij} - X_j \right\rVert^2 + \sum_i (\lambda \sigma_i)^2,$$
where $x_{ij}$ denotes the position of landmark $j$ in the local coordinate frame of submap $i$.
If $\xi$ and $T$ are the Lie algebra and Lie group representations of the current state of a Sim(3) vertex, then the vector update $\delta$ is applied multiplicatively: $T \leftarrow \exp(\hat{\delta}) \cdot T$.
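The Sim(3) vertex update can be sketched numerically as follows (using a left-multiplicative convention, as in g2o, and a generic matrix exponential for clarity; real implementations typically use a closed-form Sim(3) exponential):

```python
import numpy as np
from scipy.linalg import expm

def hat_sim3(delta):
    """Map a sim(3) vector (omega, upsilon, sigma) to its 4x4
    Lie-algebra matrix."""
    w, v, s = delta[:3], delta[3:6], delta[6]
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])     # skew-symmetric rotation part
    A = np.zeros((4, 4))
    A[:3, :3] = W + s * np.eye(3)          # rotation + scale generator
    A[:3, 3] = v                           # translation generator
    return A

def apply_update(T, delta):
    # Left-multiplicative update of a 4x4 Sim(3) matrix (sR | t):
    # T <- exp(delta^) * T
    return expm(hat_sim3(delta)) @ T
```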
The g2o [Kuemmerle et al.(2011)Kuemmerle, Grisetti, Strasdat, Konolige, and Burgard] framework is used to form and optimize the factor graph via Levenberg-Marquardt (LM), with robustness added via Huber loss [Hartley and Zisserman(2000)]. Preconditioned conjugate gradient is used to efficiently solve the sparse linear system of equations arising at each step of LM, and analytic Jacobians were derived to further speed up the computation.
4 Experiments
We evaluated GLAM on both synthetic and real datasets. Our experiments demonstrate centimeter accuracy on synthetic datasets, and accuracy comparable to VisualSFM [Wu et al.(2011c)] on real datasets, while providing a much faster run-time.
4.1 Synthetic Data
We created synthetic datasets using Google Earth to simulate camera trajectories and a LIDAR point cloud [rig(2011)] to serve as ground truth for simulated image projections. The results of 3D reconstruction on three such datasets are shown in Figure 5 and summarized in Table 1.
4.2 Real Data
The Downtown dataset (Figure 2) is a video of 88,100 frames shot over Providence, USA. We applied GLAM to the entire video; Figure 6 illustrates runtime performance and scale, and Figure 7 visualizes the reconstruction results. The average run-time for submap creation is 5.8 fps, while the latency of the submap alignment and reconstruction thread is typically a few seconds.
We registered the landmark cloud reconstructed in the first 5,000 frames to a LIDAR point cloud of the same area using ICP [Besl and McKay(1992)] to quantify accuracy with respect to ground truth. We also compared performance with VisualSFM [Wu et al.(2011c)], a widely used 3D reconstruction package, run on every frame of the first 5,000 frames (Table 2).
| | VisualSFM | GLAM |
| RMSE (m) | 2.88 | 2.93 |
| Time | 3.96 hrs | 11.79 min |
5 Conclusion
This paper introduces a new visual SLAM system that can process long videos in near real-time on a single computer. Local submaps are constructed using incremental bundle adjustment over keyframes, and the global scene is reconstructed using factor graph optimization over submaps rather than keyframes, which allows closure of large loops. Submap creation has constant runtime complexity throughout, while submap alignment time increases linearly at a low rate, indicating that the system can be run on even larger datasets while maintaining acceptable latency.
The system’s main performance bottleneck is currently SIFT feature extraction and matching. Future work will focus on improving run time by using binary descriptors. We also plan to investigate reducing the total number of constraints, incorporation of global averaging techniques, testing the approach on even larger datasets, and applying the system to terrestrial videos.
Acknowledgements
The authors would like to thank Vishal Jain for his guidance. This research is based upon work partly supported by the Air Force Research Laboratory (AFRL) under contract number FA8650-14-C-1826. This document is approved for public release via 88ABW-2017-2724.
- [rig(2011)] Rhode island geographic information system, 2011. URL http://www.rigis.org/data/topo/2011.
- [ead(2014)] Lie groups for computer vision, 2014. URL http://ethaneade.com/lie_groups.pdf.
- [Agarwal et al.(2011)Agarwal, Furukawa, Snavely, Simon, Curless, Seitz, and Szeliski] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
- [Agarwal et al.(2013)Agarwal, Mierle, et al.] Sameer Agarwal, Keir Mierle, et al. Ceres solver, 2013. URL https://github.com/ceres-solver/ceres-solver.
- [Besl and McKay(1992)] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Robotics-DL tentative, pages 586–606. International Society for Optics and Photonics, 1992.
- [Colomina and Molina(2014)] Ismael Colomina and Pere Molina. Unmanned aerial systems for photogrammetry and remote sensing: A review. ISPRS Journal of Photogrammetry and Remote Sensing, 92:79–97, 2014.
- [Davison et al.(2007)Davison, Reid, Molton, and Stasse] Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 2007.
- [Eade and Drummond(2006)] Ethan Eade and Tom Drummond. Scalable monocular slam. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Volume 1, pages 469–476. IEEE Computer Society, 2006.
- [Engel et al.(2012)Engel, Sturm, and Cremers] J. Engel, J. Sturm, and D. Cremers. Camera-based navigation of a low-cost quadrocopter. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
- [Engel et al.(2013)Engel, Sturm, and Cremers] Jakob Engel, Jurgen Sturm, and Daniel Cremers. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE international conference on computer vision, pages 1449–1456, 2013.
- [Forster et al.(2014)Forster, Pizzoli, and Scaramuzza] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA), 2014.
- [Frahm et al.(2010)Frahm, Fite-Georgel, Gallup, Johnson, Raguram, Wu, Jen, Dunn, Clipp, Lazebnik, et al.] Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building rome on a cloudless day. In European Conference on Computer Vision, pages 368–381. Springer, 2010.
- [Grisetti et al.(2010)Grisetti, Kummerle, Stachniss, and Burgard] Giorgio Grisetti, Rainer Kummerle, Cyrill Stachniss, and Wolfram Burgard. A tutorial on graph-based slam. IEEE Intelligent Transportation Systems Magazine, 2(4):31–43, 2010.
- [Hartley and Zisserman(2000)] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.
- [Heinly et al.(2015)Heinly, Schönberger, Dunn, and Frahm] Jared Heinly, Johannes Lutz Schönberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the World* in Six Days *(As Captured by the Yahoo 100 Million Image Dataset). In Computer Vision and Pattern Recognition (CVPR), 2015.
- [Kerl et al.(2013)Kerl, Sturm, and Cremers] C. Kerl, J. Sturm, and D. Cremers. Dense visual slam for rgb-d cameras. In Proc. of the Int. Conf. on Intelligent Robot Systems (IROS), 2013.
- [Klein and Murray(2007)] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’07), Nara, Japan, November 2007.
- [Kuemmerle et al.(2011)Kuemmerle, Grisetti, Strasdat, Konolige, and Burgard] R. Kuemmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard. g2o: A general framework for graph optimization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3607–3613, Shanghai, China, May 2011.
- [Lepetit et al.(2009)Lepetit, Moreno-Noguer, and Fua] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o (n) solution to the pnp problem. International journal of computer vision, 81(2):155–166, 2009.
- [Lowe(2004)] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
- [Mur-Artal et al.(2015)Mur-Artal, Montiel, and Tardos] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
- [Musialski et al.(2013)Musialski, Wonka, Aliaga, Wimmer, Gool, and Purgathofer] Przemyslaw Musialski, Peter Wonka, Daniel G Aliaga, Michael Wimmer, L v Gool, and Werner Purgathofer. A survey of urban reconstruction. Computer Graphics Forum, 32(6):146–177, 2013.
- [Newcombe et al.(2011)Newcombe, Lovegrove, and Davison] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320–2327. IEEE, 2011.
- [Nister and Stewenius(2006)] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2161–2168. IEEE, 2006.
- [Pollefeys et al.(2008)Pollefeys, Nistér, Frahm, Akbarzadeh, Mordohai, Clipp, Engels, Gallup, Kim, Merrell, et al.] Marc Pollefeys, David Nistér, J-M Frahm, Amir Akbarzadeh, Philippos Mordohai, Brian Clipp, Chris Engels, David Gallup, S-J Kim, Paul Merrell, et al. Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision, 78(2):143–167, 2008.
- [Resch et al.(2015)Resch, Lensch, Wang, Pollefeys, and Sorkine-Hornung] Benjamin Resch, Hendrik P. A. Lensch, Oliver Wang, Marc Pollefeys, and Alexander Sorkine-Hornung. Scalable structure from motion for densely sampled videos. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3936–3944, 2015.
- [Shewchuk et al.(1994)] Jonathan Richard Shewchuk et al. An introduction to the conjugate gradient method without the agonizing pain, 1994.
- [Sibley et al.(2009)Sibley, Mei, Reid, and Newman] Gabe Sibley, Christopher Mei, Ian D Reid, and Paul Newman. Adaptive relative bundle adjustment. In Robotics: Science and Systems, volume 32, page 33, 2009.
- [Strasdat et al.(2010)Strasdat, Montiel, and Davison] Hauke Strasdat, JMM Montiel, and Andrew J Davison. Scale drift-aware large scale monocular slam. Robotics: Science and Systems VI, 2010.
- [Strasdat et al.(2012)Strasdat, Montiel, and Davison] Hauke Strasdat, José MM Montiel, and Andrew J Davison. Visual slam: why filter? Image and Vision Computing, 30(2):65–77, 2012.
- [Thrun et al.(2005)Thrun, Burgard, and Fox] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT press, 2005.
- [Triggs et al.(1999)Triggs, McLauchlan, Hartley, and Fitzgibbon] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment a modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999.
- [Wright and Nocedal(1999)] Stephen Wright and Jorge Nocedal. Numerical optimization. Springer Science, 35:67–68, 1999.
- [Wu(2007)] Changchang Wu. SiftGPU: A GPU implementation of scale invariant feature transform (SIFT), 2007.
- [Wu et al.(2011a)Wu, Agarwal, Curless, and Seitz] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven Seitz. Multicore bundle adjustment. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 3057–3064, 2011a. URL http://grail.cs.washington.edu/projects/mcba/pba.pdf.
- [Wu et al.(2011b)Wu, Agarwal, Curless, and Seitz] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M Seitz. Multicore bundle adjustment. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3057–3064. IEEE, 2011b.
- [Wu et al.(2011c)] Changchang Wu et al. VisualSFM: A visual structure from motion system, 2011c.