Batch-based Monocular SLAM for Egocentric Videos
Simultaneous Localization and Mapping (SLAM) from a monocular camera has been a well-researched area. However, estimating camera pose and 3D geometry reliably for egocentric videos still remains a challenge [25, 40, 42]. Some of the common causes of failure are dominant 3D rotations and low parallax between successive frames, resulting in unreliable pose and 3D estimates. For forward-moving cameras, with no opportunities for loop closures, the drift leads to eventual failure of traditional feature-based and direct SLAM techniques. We propose a novel batch-mode, structure-from-motion-based technique for robust SLAM in such scenarios. In contrast to most existing techniques, we process frames in short batches, wherein we exploit short loop closures arising from the to-and-fro motion of the wearer's head, and stabilize the egomotion estimates by 2D batch-mode techniques such as motion averaging on pairwise epipolar results. Once pose estimates are obtained reliably over a batch, we refine the 3D estimates by triangulation and batch-mode Bundle Adjustment (BA). Finally, we merge the batches using 3D correspondences and carry out a BA refinement post merging. We present both qualitative and quantitative comparisons of our method on various public first- and third-person video datasets, to establish the robustness and accuracy of our algorithm over the state of the art.
1 Introduction
Simultaneous Localization and Mapping (SLAM) has received a lot of attention from computer vision researchers because of its applicability in robotics, defense, unmanned vehicles, augmented reality, etc. Egocentric or first-person cameras [17, 18, 30] are wearable cameras, typically harnessed on the wearer's head. The first-person perspective and always-on nature have made these cameras popular in extreme sports, law enforcement, life logging, home automation and assistive vision, drawing considerable interest in novel egocentric video applications [1, 2, 5, 13, 14, 15, 23, 29, 37, 39, 43, 46, 49]. Sharp head rotations, causing quick changes in the camera view, as well as dominant forward motion cause most visual SLAM techniques to fail on such videos, as reported in [25, 40, 42]. In this paper we revisit the monocular SLAM problem with a special emphasis on egocentric videos [25, 40, 42]. To show the general applicability of our method, we also demonstrate it on other standard SLAM scenarios such as vehicle-mounted and hand-held videos.
Most current techniques for visual SLAM [4, 11, 33] deal with the problem incrementally, picking one frame at a time from a video stream and estimating the camera pose with respect to the 3D structure obtained so far, especially from the last few frames.
Feature-based techniques [4, 9, 10, 27, 32, 33, 47, 50, 51] use resectioning to first estimate the pose with respect to the existing 3D structure; in a second step the estimated geometry is incrementally updated using Bundle Adjustment (BA) [52, 58], which refines the camera pose and scene geometry simultaneously by minimizing reprojection errors with the new image added. The process is repeated for every new frame. In some cases an additional loop-closure step recognizes places visited earlier and refines the pose estimates over the loop so as to make them consistent at the intersection.
Dense and semi-dense visual odometry techniques use all, or a substantial subset of, image pixels to register a new frame using a Gauss-Newton procedure. The optimization is carried out over both the 3D structure and the camera pose, using a Lie-algebraic formulation for the latter. In a second stage, a batch-mode loop closure over key-frames is used to fix the scale ambiguities.
It has been widely reported that both classes of techniques fail for egocentric videos [25, 40, 41, 42]. We observe that the incremental nature of both styles is unsuitable for the low parallax caused by dominant 3D rotations between successive frames in an egocentric video. Neither a feature-based, BA-like strategy nor a direct technique based on image registration can stabilize the 3D structure well, due to errors in triangulation. Ultimately, relying on such 3D points causes drift in the estimated camera trajectories, leading to failure of the whole pipeline. In fact, Hyperlapse could only address this problem by carrying out bundle adjustment [45, 57] over large batches of 1400 frames, thereby making the problem well conditioned. We analyze these problems and suggest a novel pipeline for robust and accurate structure estimation from a fast-moving, low-parallax video, such as from an egocentric or a vehicle-mounted camera. The specific contributions of this paper are:
We analyze the failure of existing SLAM techniques for egocentric videos. We posit that computing geometry from unreliable pose estimation is the primary cause of such failures.
We propose a batch-mode technique which first stabilizes pose estimation before computing the 3D structure from it. We compute the camera poses in small batches using local loop closures, based on motion averaging of initial estimates obtained from multiple epipolar relationships. The technique does not use unreliable 3D estimates at this stage.
We show that the proposed technique can reliably estimate camera pose and 3D structure from long public egocentric videos, which is not possible from any of the current SLAM methods.
We show that our technique works comparably, qualitatively as well as quantitatively, on other regular SLAM applications like vehicle mounted and scanning videos.
For simplicity and speed we use optical flow for image matching. However, our pipeline does not preclude the use of feature descriptor based matching for relocalization and mapping applications (see Section 5.3 for a discussion).
2 Related Work
Based on the method of feature selection for pose estimation, monocular SLAM algorithms can be classified into feature-based, dense, semi-dense or hybrid methods. Feature-based methods, both filtering-based and key-frame-based [4, 24, 33], use sparse features like SIFT, ORB, or SURF for tracking. The sparse feature correspondences are then used to refine the pose using structure-from-motion techniques like bundle adjustment. Due to the incremental nature of all these approaches, a large number of points are often lost during the resectioning phase.
Dense methods initialize the entire image, or a significant portion of it, for tracking. The camera poses are estimated in an expectation-maximization framework: in one iteration the tracking is improved through pose refinement by minimizing the photometric error, and in alternate iterations the 3D structure is refined using the improved tracking. To increase the accuracy of estimation, semi-dense methods perform photometric error minimization only in regions of sufficient gradient [11, 12]. However, these methods do not fare well in cases of low parallax and wild camera motions, mainly because structure estimation cannot be decoupled from pose refinement.
SLAM techniques also differ on the kinds of scene being tracked: road scenes captured from vehicle-mounted cameras, indoor scans from a hand-held camera, and egocentric videos from head-mounted cameras, usually accompanied by sharp head rotations of the wearer. Visual odometry algorithms have been quite successful for hand-held or vehicle-mounted cameras [11, 12, 22, 24, 33, 34], but do not fare well for egocentric videos, due to unrestrained camera motion, a wide variety of indoor and outdoor scenes, and the presence of moving objects [25, 40, 41, 42].
In recent years, structure-from-motion (SfM) techniques have seen a lot of progress using the concepts of rotation averaging (RA) and translation averaging (TA) [19, 21, 31, 56]. With computational cost linear in the number of cameras, these techniques are fast, robust and well suited for small image sets. They provide good initial estimates of camera pose and structure using pairwise epipolar geometry, which can be refined further using standard SfM techniques.
Loop closures in SLAM are detected using three major approaches: map-to-map, image-to-image and image-to-map. Clemente et al. use a map-to-map approach, finding correspondences between common features in two sub-maps. Cummins et al. use visual features for image-to-image loop closures; matching is performed based on the presence or absence of these features from a visual vocabulary. Williams et al. use an image-to-map approach and find loop closures by re-localizing the camera, estimating its pose relative to map correspondences.
3 Background
The pose of a camera w.r.t. a reference image is denoted by a rotation matrix $R$ and a translation direction vector $t$. The pairwise pose can be estimated from the decomposition of the essential matrix $E$, which binds two views using pairwise epipolar geometry such that $E = [t]_{\times} R$ [20, 36]. Here $[t]_{\times}$ is the skew-symmetric matrix corresponding to the vector $t$. A view graph has the images as nodes and the pairwise epipolar relationships as edges.
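The SVD-based decomposition of $E$ into its four candidate poses can be sketched in a few lines. This is a minimal numpy illustration of the standard Hartley-Zisserman recipe; function names are ours, and the cheirality check that selects the correct candidate is omitted:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Return the four (R, t) candidates from an essential matrix E
    via the standard SVD-based decomposition."""
    U, _, Vt = np.linalg.svd(E)
    # Enforce proper rotations (det = +1) on both factors.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]  # translation direction, recoverable only up to sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

In practice the correct candidate is picked by triangulating a point and keeping the pose that places it in front of both cameras.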
3.1 Motion Averaging
Given such a view graph, embedding of the camera poses into a global frame of reference can be done using motion averaging [6, 21, 31, 56]. The motion between a pair of cameras $i$ and $j$ can be expressed in terms of the pairwise rotation ($R_{ij}$) and translation direction ($t_{ij}$) as $M_{ij} = \begin{bmatrix} R_{ij} & s_{ij} t_{ij} \\ 0 & 1 \end{bmatrix}$, where $s_{ij}$ is the scale of the translation. If $M_i$ and $M_j$ are the motion parameters of cameras $i$ and $j$ respectively in the global frame of reference, then we have the following relationship between pairwise and global camera motions: $M_{ij} = M_j M_i^{-1}$.
Rotation averaging: Using the above expression, the relationship between global rotations and pairwise rotations can be derived as $R_{ij} = R_j R_i^{-1}$, where $R_i$ and $R_j$ are the global rotations of cameras $i$ and $j$. From a given set of pairwise rotation estimates $\{R_{ij}\}$, we can estimate the absolute rotations of all the cameras by minimising a robust sum of discrepancies between the estimated relative rotations $R_{ij}$ and the relative rotations suggested by the terms $R_j R_i^{-1}$:
$\{R_1, \ldots, R_N\} = \operatorname{argmin}_{\{R_k\}} \sum_{(i,j)} d\big(R_{ij},\, R_j R_i^{-1}\big) \qquad (1)$
where $d(R_1, R_2) = \|\log(R_1 R_2^{-1})\|$, the intrinsic bivariate distance measure defined on the manifold of 3D rotations, $SO(3)$.
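As a concrete illustration of averaging on this manifold, the sketch below computes the intrinsic (Karcher) mean of a set of rotations by repeatedly averaging residuals in the tangent space via the $SO(3)$ exp/log maps. Full rotation averaging solves the coupled problem over all pairwise estimates, but the same machinery is at its core; function names and the fixed iteration count are our assumptions:

```python
import numpy as np

def log_so3(R):
    """Matrix logarithm of a rotation: returns the axis-angle vector."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-10:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2 * np.sin(theta)) * w

def exp_so3(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def intrinsic_mean(rotations, iters=20):
    """Karcher mean on SO(3): minimizes the sum of squared geodesic
    distances by averaging residuals in the tangent space at the
    current estimate."""
    mean = rotations[0]
    for _ in range(iters):
        delta = np.mean([log_so3(R @ mean.T) for R in rotations], axis=0)
        mean = exp_so3(delta) @ mean
    return mean
```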
Translation averaging: The global translations $T_i$ and $T_j$ are related to the pairwise translation directions as $t_{ij} = (T_j - T_i)/\|T_j - T_i\|$. Global camera positions ($T_k$) can be obtained as $\{T_1, \ldots, T_N\} = \operatorname{argmin}_{\{T_k\}} \sum d\big(t_{ij},\, (T_j - T_i)/\|T_j - T_i\|\big)$, where the summation is over all the camera-camera and camera-point constraints derived from 3D points. Wilson and Snavely use a gradient descent approach to solve the minimization problem.
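To see how direction observations pin down positions, one simple linear formulation (in the spirit of the global linear method of Jiang et al., but simplified to camera-camera edges and written from scratch) replaces $d(\cdot,\cdot)$ with the cross-product constraint $t_{ij} \times (T_j - T_i) = 0$, which is linear in the unknown positions:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def translations_from_directions(edges, dirs, n):
    """Recover global camera positions T_0..T_{n-1} (up to gauge:
    T_0 = 0 and an arbitrary global scale/sign) from unit pairwise
    direction observations t_ij ~ (T_j - T_i) / ||T_j - T_i||."""
    A = np.zeros((3 * len(edges), 3 * n))
    for r, ((i, j), t) in enumerate(zip(edges, dirs)):
        S = skew(t)
        A[3*r:3*r+3, 3*j:3*j+3] = S    # + t_ij x T_j
        A[3*r:3*r+3, 3*i:3*i+3] = -S   # - t_ij x T_i
    A = A[:, 3:]                        # gauge fix: T_0 = 0
    _, _, Vt = np.linalg.svd(A)
    # The smallest right singular vector spans the 1-D solution space.
    return np.vstack([np.zeros(3), Vt[-1].reshape(-1, 3)])
```

Robust methods such as 1DSfM add outlier rejection and a non-linear refinement on top of this kind of initialization.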
3.2 Bundle Adjustment
An alternative way to estimate camera pose is Structure-from-Motion (SfM), which recovers both camera poses and 3D structure by minimizing the reprojection error (described in (2)) using bundle adjustment.
$\min_{\{P_j\},\{X_i\}} \sum_i \sum_j v_{ij}\, D\big(u(f(P_j, X_i)),\, x_{ij}\big)^2 \qquad (2)$
where $v_{ij}$ is the visibility of the $i$-th 3D point in the $j$-th camera, $f$ is the function which projects a 3D point $X_i$ onto camera $j$, modelled using parameters $P_j$ ($f_j$ for focal length, $R_j$ for rotation, $T_j$ for position), $x_{ij}$ is the actual projection of the $i$-th point on the $j$-th camera, $u$ is the single-parameter ($k$) radial distortion function, and $D$ is the Euclidean distance.
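A direct, unoptimized transcription of this objective might look as follows. Variable names are ours, and real BA implementations minimize this with sparse Levenberg-Marquardt rather than evaluating it naively:

```python
import numpy as np

def project(f, R, T, k, X):
    """Project 3D point X into a camera with focal length f, rotation R,
    position T, and single-parameter radial distortion k."""
    Xc = R @ (X - T)                 # point in camera coordinates
    x = Xc[:2] / Xc[2]               # perspective division
    r2 = x @ x
    return f * (1 + k * r2) * x      # radial distortion, then focal scaling

def reprojection_error(cams, points, obs, vis):
    """Sum of squared reprojection errors, as in (2).
    cams: list of (f, R, T, k); obs[i][j]: observed 2D point of
    point i in camera j; vis[i][j]: 1 if point i is visible in camera j."""
    err = 0.0
    for i, X in enumerate(points):
        for j, (f, R, T, k) in enumerate(cams):
            if vis[i][j]:
                d = project(f, R, T, k, X) - obs[i][j]
                err += d @ d
    return err
```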
Two kinds of bundle adjustment methods are used in the literature to minimize (2):
Incremental BA: optimizes the parameters of each new camera, and the structure it observes, as frames are added; this is fast but can accumulate drift.
Global or Batch-mode BA (BBA): Batch-mode bundle adjustment optimizes all the camera poses at once by minimizing (2) globally. The approach is less susceptible to discontinuities in reconstruction or drift, because of the joint optimization of all cameras at once. However, it requires an initialization of the camera parameters and 3D structure, which can be provided through motion averaging and linear triangulation.
4 Proposed Algorithm
4.1 Key Framing
We start by processing each frame, and whenever there is sufficient parallax we designate a new key-frame. This designation is defined by two criteria: two key-frames must be separated by at most $F$ frames, or the average optical flow accumulated since the last key-frame must cross a threshold of $\tau$ pixels, which is sufficient to define good parallax between key-frames. We take the minimum of these two criteria for defining a new key-frame. In our experiments we typically choose $F$ between 10 and 30, based on the assumption that we use a camera with a frame rate of 30-60 fps; $\tau$ is typically chosen as 20 pixels. The rationale behind using 20 pixels of average optical flow in defining key-frames is to make our method adaptive to wild motions. For example, on the Hyperlapse videos, whenever there are wild turns every frame becomes a key-frame, whereas on a walking video the key-frames are spaced more widely apart.
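The key-framing rule can be sketched as a single greedy pass. This is a sketch under our reading of the two criteria, where `flows[i]` is assumed to hold the mean optical-flow magnitude between consecutive frames:

```python
def select_keyframes(flows, F=15, tau=20.0):
    """Greedy key-frame selection. flows[i] is the mean optical-flow
    magnitude (pixels) between frame i and frame i+1. A new key-frame
    is declared as soon as either F frames have passed since the last
    key-frame, or the flow accumulated since it reaches tau pixels."""
    keyframes = [0]
    last, acc = 0, 0.0
    for i, flow in enumerate(flows, start=1):
        acc += flow
        if (i - last) >= F or acc >= tau:
            keyframes.append(i)
            last, acc = i, 0.0
    return keyframes
```

Taking whichever criterion triggers first implements the "minimum of the two criteria" rule: slow sequences are capped by the frame gap, wild turns by the flow threshold.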
4.2 Batch Generation
We perform pose estimation in batches of key-frames. Batch-mode processing makes the motion and structure estimation problem well constrained when the parallax between successive frames is small due to dominant 3D rotations, as is common in egocentric videos. We allocate a number of key-frames to a batch and then process each batch independently. Typically each batch contains around 10-30 key-frames, with consecutive key-frames separated by about 5-7 frames when the wearer is walking. While lack of parallax justifies creating batches, making a batch too large is also problematic, for the following reasons:
Motion averaging works more stably on smaller batches.
A smaller batch size helps in controlling drifts and breaks in structure estimation.
In Figure 3, we show the effect of batch size on trajectory estimation. Both incremental processing and processing the entire sequence of about 500 key-frames at once result in breaks in the trajectory and structure, whereas structure and pose estimation in smaller batches yields a clean trajectory and structure (Figure 3).
Note that the batch mode of processing does not necessarily delay the computation of output. We can use a sliding window of key-frames with significant overlaps between successive batches; in the extreme case, two successive batches differ only in the two end key-frames. In addition, since the pose and structure estimates of the previous batch provide a good initialization for a new overlapping batch, this does not significantly add to the computational burden either. Most of the results in this paper have been produced with non-overlapping batches.
4.3 Local Loop Closures
In order to handle large rotations, we use the concept of local loop closures, which provides extra constraints for stabilising the camera estimates. For a classical SLAM problem from a hand-held video, global loop closure is an important step for fixing the errors accumulated over individual pose estimations. However, in egocentric videos, where the motion of the wearer is largely forward, the user may never revisit a particular scene point, making global loop closure impossible. In addition, given the wild nature of egocentric videos, the camera pose estimates and trajectories tend to drift quickly unless fixed by loop closures. We note that a wearer's head typically scans the scene left to right and back during the course of natural walking. The camera thus looks at the same scene multiple times, providing opportunities for a series of short, local loop closures. We take advantage of this phenomenon to improve the accuracy of the estimated camera poses. Similarly, during pure forward motion the camera sees the same scene continuously, giving enough constraints through local loop closures.
To use this concept, we maintain a set of past key-frames in a window. When considering a new key-frame, we estimate its pairwise pose with respect to the existing key-frames in this window, establishing redundant paths; this constitutes a local loop closure. Additionally, the loop closures are detected on sliding windows, imposing a smoothness constraint over the trajectory. Figure 4a shows the effect of loop closures on the estimation: in their absence, the structure gets deformed in scale and shifts above the ground, whereas with loop closures it aligns well with the rest of the structure.
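The windowed pairing that generates these redundant view-graph edges is straightforward to enumerate; the window size below is an illustrative assumption, not a value from the paper:

```python
def local_loop_closure_edges(n_keyframes, window=5):
    """View-graph edges for local loop closures: each new key-frame j
    is paired with every key-frame in the trailing window, giving the
    redundant paths that constrain motion averaging."""
    edges = []
    for j in range(1, n_keyframes):
        for i in range(max(0, j - window), j):
            edges.append((i, j))
    return edges
```

With window = 1 this degenerates to a chain (no loop closures at all); larger windows trade extra pairwise estimations for more redundancy.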
4.4 Camera Motion and 3D Structure Initialization
We use the five-point algorithm for estimating the pairwise epipolar geometry to create the initial view graph for a batch. Once the pairwise estimates are obtained with local loop closures, we have enough redundant paths through each camera in the view graph of the current batch, which provides sufficient constraints for motion averaging. We use motion averaging as described in Section 3.1: first rotation averaging for the global rotation estimates, followed by translation averaging. We use a mixture of two different methods for robustly averaging the translations: we initialize the global translations with an initial guess from a global convex optimization technique, and subsequently refine the solution. This provides a very good initial estimate of the camera poses. The 3D structure is initialized using linear triangulation.
4.5 SfM Refinement
The initial structure and camera poses are further refined using a final run of batch-mode bundle adjustment, which converges very fast because of the good initialization described in the last section.
4.6 Merging and BBA Refinement with Resectioning
Finally, the batches are successively merged using a 7-DoF alignment based on SVD. During merging, new points which were previously discarded for not being visible in enough cameras are added back, as these points become stable with more cameras viewing them. A final round of global BBA-based refinement is run whenever the cross-batch reprojection errors get high. This leads to a non-linear refinement of the scale of the estimated structure and poses. We describe the complete algorithm in Figure 2.
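The 7-DoF alignment can be sketched with the closed-form method of Umeyama; the following minimal numpy transcription (our own, not the paper's implementation) recovers the scale, rotation and shift that best align corresponding 3D points of two batches in the least-squares sense:

```python
import numpy as np

def umeyama_similarity(A, B):
    """Least-squares similarity (7 DoF: scale s, rotation R, shift t)
    aligning point set A onto B, after Umeyama (1991): B ~ s * R @ a + t.
    A, B: (N, 3) arrays of corresponding points, N >= 3, non-degenerate."""
    mu_a, mu_b = A.mean(0), B.mean(0)
    Ac, Bc = A - mu_a, B - mu_b
    cov = Bc.T @ Ac / len(A)              # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # avoid returning a reflection
    R = U @ S @ Vt
    var_a = (Ac ** 2).sum() / len(A)
    s = np.trace(np.diag(D) @ S) / var_a
    t = mu_b - s * R @ mu_a
    return s, R, t
```

In the pipeline, A and B would be the 3D points common to two consecutive batches, and the recovered similarity maps one batch's frame into the other's.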
5 Experiments and Results
In this section we present results using our technique on several challenging egocentric video datasets. Since classical SLAM algorithms like LSD-SLAM and ORB-SLAM do not work for egocentric videos, we compare with the state of the art on regular hand-held videos. We also perform careful experimental analysis to justify our choice of parameters such as key-frame separation and batch size.
We have implemented portions of our algorithm in C++ and MATLAB. All the experiments have been carried out on a regular desktop with Core i7 2.3 GHz processor (containing 4 cores) and 32 GB RAM, running Ubuntu 14.04.
Our algorithm requires the intrinsic parameters of the cameras for SfM estimation. For the sequences taken from public sources, we have used the calibration information for the make and version of the cameras, as provided on their websites.
5.1 Qualitative Results
We have tested our algorithm on various Hyperlapse sequences. The bike01 video in the dataset is a very challenging sequence with wild head motions, fast forward movements and sharp turns. Both [4, 33] fail at the very wild motions at frames 1907-1920, whereas the best prior method works for around 2600 frames. Our method works smoothly for 12000 frames and beyond. We show the computed trajectory for up to 12000 frames in Figure 1. In the same figure we show the 3D map obtained by dense reconstruction of some portions using CMVS, based on the camera poses and the sparse structure computed by our algorithm. Note that CMVS can produce high-quality output only if the pose and the initial structure estimates are correct. In Figure 5 we compare the dense 3D structure of a portion computed using our method with the one given in Hyperlapse. Note that Hyperlapse computes pose and 3D structure using SfM over batches of 1400 frames.
5.2 Quantitative Analysis
[Table 1: RMSE (cm) of the key-frame trajectory for each dataset.]
None of the egocentric datasets we encountered have ground-truth trajectories available, making it difficult to carry out a quantitative analysis of the proposed algorithm. However, multiple third-person datasets are available for quantitative analysis of a SLAM algorithm; we have used the TUM visual odometry dataset for this purpose. Figure 7 shows the dense reconstruction and the trajectory estimated by the proposed method. Note that the graph in the figure also contains the ground-truth trajectory, but the estimated trajectory is so closely aligned with the ground truth that it hides it almost completely.
The TUM dataset also allows us to compute the RMS error of the computed trajectory with respect to the ground truth. Table 1 shows the error for our algorithm, as well as the errors reported by state-of-the-art algorithms on the same sequences. Though our algorithm is targeted at egocentric videos, we match, and often improve upon, the state of the art even for regular hand-held videos.
[Table 2: relocalization accuracy per method: rotation (deg.) and position (cm).]
[Table: per-dataset dimensions and key-frame trajectory RMSE (m).]
5.3 Relocalization
A popular metric for the accuracy of an estimated 3D structure is the relocalization error. In our algorithm we use optical flow for image matching, for the sake of simplicity and speed. Since optical flow vectors do not have associated feature descriptors, they cannot be used directly for relocalization and mapping. However, our pipeline does not preclude the use of such feature descriptors. To demonstrate relocalization using our framework, we train a vocabulary tree using the SIFT features computed from the key-frames in the TUM sequence. We then use a set of frames which are not key-frames to calculate the relocalization error: we carry out feature matching with the key-frames using the vocabulary tree, reject outliers using the pre-computed trajectory of the key-frames, and estimate the poses of the unknown frames using 3D-2D correspondences. In Figure 8, we plot the relocalized frames on the computed trajectory; their locations on the trajectory indicate the correctness of the relocalization. In Table 2 we show the accuracy of relocalization with respect to the ground truth, both with and without a final BA refinement.
Relocalization also facilitates global loop closures in our original match graph. For the TUM dataset, global loop closures brought down the absolute RMS trajectory error from 2.21 cm to 1.06 cm.
5.4 Vehicle Mounted Cameras
Though the focus of this paper is on egocentric videos, our algorithm is equally applicable to other capture scenarios with low parallax between consecutive frames. One such popular case arises from a forward-looking camera mounted on a vehicle. We use the popular KITTI dataset for this comparison. Figure 9 shows the trajectories computed using our algorithm, along with the ground-truth trajectories, on various sequences from the dataset.
6 Conclusion
Despite the tremendous progress made in SLAM techniques, running such algorithms on many categories of videos in the wild still remains a challenge. We believe careful case-by-case analysis of such challenging videos can give insights into solving the problem. Egocentric, or first-person, videos are one such category, and the focus of this paper. We observe that the incremental estimation employed in most current SLAM techniques often causes unreliable 3D estimates to be used within trajectory estimation. We suggest first stabilizing the trajectory using 2D techniques and only then estimating the structure. We also exploit domain-specific heuristics such as local loop closures. Interestingly, we observe that the proposed technique improves the state of the art not only for the targeted egocentric videos but also for videos captured from vehicle-mounted cameras. We believe that robust trajectory and structure estimation from the proposed technique will help many current and novel egocentric applications.
-  O. Aghazadeh, J. Sullivan, and S. Carlsson. Novelty detection from an ego-centric perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3297–3304, 2011.
-  L. Baraldi, F. Paci, G. Serra, L. Benini, and R. Cucchiara. Gesture recognition in ego-centric videos using dense trajectories and hand segmentation. In Proceedings of IEEE Computer Vision and Pattern Recognition Workshops (CVPRW), 2014.
-  H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
-  D. Herrera C., K. Kim, J. Kannala, K. Pulli, and J. Heikkilä. DT-SLAM: deferred triangulation for robust SLAM. In Proceedings of International Conference on 3D Vision (3DV), pages 609–616, 2014.
-  D. Castro, S. Hickson, V. Bettadapura, E. Thomaz, G. Abowd, H. Christensen, and I. Essa. Predicting daily activities from egocentric images using deep learning. In Proceedings of the ACM International Symposium on Wearable Computers (ISWC), pages 75–82, 2015.
-  A. Chatterjee and V. M. Govindu. Efficient and robust large-scale rotation averaging. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 521–528, 2013.
-  L. Clemente, A. Davison, I. Reid, J. Neira, and J. D. Tardós. Mapping large loops with a single hand-held camera. In Proceedings of Robotics: Science and Systems Conference, 2007.
-  M. Cummins and P. Newman. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance. The International Journal of Robotics Research, 27(6):647–665, 2008.
-  A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(6):1052–1067, 2007.
-  E. Eade and T. Drummond. Scalable monocular SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 469–476, 2006.
-  J. Engel, T. Schops, and D. Cremers. LSD-SLAM: Large-Scale Direct Monocular SLAM. In Proceedings of the European Conference on Computer Vision (ECCV), pages 834–849, 2014.
-  J. Engel, J. Sturm, and D. Cremers. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1449–1456, 2013.
-  A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 407–414, 2011.
-  A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In Proceedings of the European Conference on Computer Vision (ECCV), pages 314–327, 2012.
-  A. Fathi, X. Ren, and J. Rehg. Learning to recognize objects in egocentric activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3281–3288, 2011.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  Google. Glass. https://www.google.com/glass/start.
-  GoPro. Hero3. https://gopro.com/.
-  V. Govindu. Combining two-view constraints for motion estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 218–225, 2001.
-  R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, 2nd edition, 2004.
-  N. Jiang, Z. Cui, and P. Tan. A global linear method for camera pose registration. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 481–488, 2013.
-  C. Kerl, J. Sturm, and D. Cremers. Robust odometry estimation for RGB-D cameras. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3748–3754, 2013.
-  K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3241–3248, 2011.
-  G. Klein and D. Murray. Parallel tracking and mapping on a camera phone. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 83–86, 2009.
-  J. Kopf, M. F. Cohen, and R. Szeliski. First-person hyper-lapse videos. ACM Transactions on Graphics (TOG), 33(4):78:1–78:10, 2014.
-  M. Li and A. I. Mourikis. High-precision, consistent ekf-based visual-inertial odometry. International Journal of Robotics Research, 32(6):690–711, 2013.
-  H. Lim, J. Lim, and H. J. Kim. Real-time 6-DoF monocular visual SLAM in a large-scale environment. In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 1532–1539, 2014.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
-  W. Lucia and E. Ferrari. Egocentric: Ego networks for knowledge-based short text classification. In Proceedings of ACM International Conference on Conference on Information and Knowledge Management (CIKM), pages 1079–1088, 2014.
-  Microsoft Research. SenseCam. https://research.microsoft.com/en-us/um/cambridge/projects/sensecam/.
-  P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 3248–3255, 2013.
-  E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization and 3D reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 363–370, 2006.
-  R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
-  R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2320–2327, 2011.
-  D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2161–2168, 2006.
-  D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):756–777, 2004.
-  K. Ogaki, K. M. Kitani, Y. Sugano, and Y. Sato. Coupling eye-motion and ego-motion features for first-person activity recognition. In Proceedings of the IEEE Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1–7, 2012.
-  O. Ozyesil and A. Singer. Robust camera location estimation by convex programming. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2674–2683, 2015.
-  H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2847–2854, 2012.
-  Y. Poleg, C. Arora, and S. Peleg. Temporal segmentation of egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2537–2544, 2014.
-  Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact CNN for indexing egocentric videos. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, 2016.
-  Y. Poleg, T. Halperin, C. Arora, and S. Peleg. EgoSampling: Fast-forward and stereo for egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4768–4776, 2015.
-  X. Ren and C. Gu. Figure-ground segmentation improves handled object recognition in egocentric video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3137–3144, 2010.
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2564–2571, 2011.
-  N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3d. In Proceedings of ACM SIGGRAPH, pages 835–846, 2006.
-  T. Starner, J. Weaver, and A. Pentland. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 20(12):1371–1375, 1998.
-  H. Strasdat, A. J. Davison, J. M. M. Montiel, and K. Konolige. Double window optimisation for constant time visual slam. In IEEE International Conference on Computer Vision (ICCV), pages 2352–2359, 2011.
-  J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In Proceedings of the International Conference on Intelligent Robot Systems (IROS), 2012.
-  S. Sundaram and W. Mayol-Cuevas. High level activity recognition using low resolution wearable vision. In Proceedings of the IEEE Computer Vision and Pattern Recognition Workshops (CVPRW), pages 25–32, 2009.
-  W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao. Robust monocular slam in dynamic environments. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 209–218, 2013.
-  P. H. Torr and A. Zisserman. Feature based methods for structure and motion estimation. In Proceedings of Vision Algorithms: Theory and Practice. Springer, pages 278–294, 1999.
-  B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In Vision Algorithms: Theory and Practice, LNCS, pages 298–372, 2000.
-  S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 13(4):376–380, 1991.
-  B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. Tardós. A comparison of loop closing techniques in monocular SLAM. Robotics and Autonomous Systems, 57(12):1188–1197, 2009.
-  B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. D. Tardos. An image-to-map loop closing method for monocular SLAM. In Proceeding of the International Conference on Intelligent Robots and Systems, pages 2053–2059, 2008.
-  K. Wilson and N. Snavely. Robust global translations with 1dsfm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 61–75, 2014.
-  C. Wu. Towards linear-time incremental structure from motion. In Proceedings of the International Conference on 3D Vision (3DV), pages 127–134, 2013.
-  C. Wu, S. Agarwal, B. Curless, and S. Seitz. Multicore bundle adjustment. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3057–3064, 2011.
-  Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1434–1441, 2010.