SE-SLAM: Semi-Dense Structured Edge-Based Monocular SLAM
Abstract
Vision-based Simultaneous Localization And Mapping (VSLAM) is a mature problem in Robotics. Most VSLAM systems are feature-based methods, which are robust and present high accuracy, but yield sparse maps of limited use for further navigation tasks. More recently, direct methods that operate directly on image intensity have been introduced, capable of reconstructing richer maps at the cost of higher processing power. In this work, an edge-based monocular SLAM system (SE-SLAM) is proposed as a middle point: edges present good localization, as point features do, while enabling a structural semi-dense map reconstruction. However, edges are not easy to associate, track and optimize over time, as they lack descriptors and biunivocal correspondence, unlike point features. To tackle these issues, this paper presents a method to match edges between frames in a consistent manner; a feasible strategy to solve the optimization problem, whose size increases rapidly when working with edges; and the use of non-linear optimization techniques. The resulting system achieves accuracy comparable to state-of-the-art feature-based and dense/semi-dense systems, while inherently building a structural semi-dense reconstruction of the environment that provides relevant structure data for further navigation algorithms. To achieve such accuracy, state-of-the-art non-linear optimization is applied to a continuous feed of edge-points per frame, optimizing the full semi-dense output. Despite its heavy processing requirements, the system achieves near real-time operation thanks to a custom-built solver and the parallelization of its key stages. To encourage further development of edge-based SLAM systems, the SE-SLAM source code will be released as open source.
I Introduction
Vision-based Simultaneous Localization And Mapping (VSLAM) is a well-studied problem, key for Computer Vision and Robotics. SLAM lies at the core of most systems that require awareness of self-motion and scene geometry.
Many solutions have been proposed for the VSLAM problem. Feature-based algorithms [1, 2, 3, 4, 5, 6] rely on the recognition and tracking of point landmarks over time. These methods are robust and present high accuracy; however, the reconstructed map is sparse and thus of little practical use by itself. They also present problems in textureless environments, where point features are hard to come by [7], [8].
Direct methods operate directly on image intensity, reconstructing depth at every image pixel [9, 10] or at a selected subset of pixels [11]. These methods avoid the cost of feature extraction, but state-of-the-art methods need to estimate and correct a photometric camera model to deal with illumination changes and occlusions, yielding a more complex optimization problem [11]. Nevertheless, the reconstruction output of these systems usually yields richer maps, more suitable for navigation than those of feature-based methods, at the cost of higher processing power and, in fully dense methods, GPU processing [9].
In this context, an edge-based system appears as an interesting middle point between the two. On one hand, edge extraction is robust, highly parallelizable and widely studied in computer vision, and it is also known to take part in biological vision systems. Moreover, edges are trackable features, meaning that the geometric reprojection error can be employed as in classical feature-based systems. Furthermore, edges can be extracted in textureless scenes where classical feature extraction is difficult. Finally, an edge-map usually contains object boundaries, hence its reconstruction provides rich information about the structure of the observed scene, and may play a part in fully dense reconstruction and semantic interpretation.
Despite these potential benefits, SLAM systems using edges are not common, and most of them rely on straight lines as a complement to classical features, to increase robustness when the base methods fail [12, 8, 7]. Only recently has an edge-based system been proposed in the literature [13]. This is probably due to the fact that edges are not easy to associate, track and optimize over time, as they lack descriptors and do not present biunivocal correspondence, unlike point features.
In this work, a novel edge-based SLAM system (SE-SLAM) is introduced, which relies on edge detection as its sole input, except for the loop closure step. Tracking and data association are tackled by following up on previous developments in edge-based visual odometry for MAV control [14, 15]. Full non-linear keyframe-based optimization is performed on edges, adapting recent methods [3, 11] to the limitations and advantages presented by this type of feature. The resulting system achieves accuracy comparable to state-of-the-art methods and provides a full edge reconstruction output, suitable for further navigation steps (see Figure 1) [16]. Using this type of back-end to optimize a pure edge input is, to the authors' knowledge, the first of its kind.
II Related Work
Feature-based algorithms that rely on the recognition and tracking of point landmarks over time are the most common solution to the Visual SLAM problem [1, 2, 3]. ORB-SLAM2 [2] is probably the state of the art in terms of accuracy. OKVIS [3] is another relevant solution, targeted at stereo visual-inertial odometry, whose optimization and marginalization approach served as inspiration for this work. These methods present high accuracy; however, the reconstructed map is sparse and a separate mapping algorithm is required to obtain denser ones [17].
Addressing this issue, direct SLAM methods minimize an intensity-based functional over all image pixels, without a feature extraction phase. DTAM [9] was the first to show a dense 3D monocular reconstruction of this kind. It relies on heavy GPU-aided processing to incrementally fuse and regularize every image pixel into a dense model, and is constrained to small environments. As an effort to reduce the computational cost of these algorithms, semi-dense approaches were presented [18] that operate over image intensities but only on a meaningful set of pixels. These pixels are picked based on their trackability, using the high intensity gradient parts of the image. Hence, the output contains edges, but no explicit treatment of them is made. DSO [11] applies non-linear optimization to an intensity-based functional over a sparse selection of pixels. Moreover, it incorporates lighting parameters into the camera model to account for illumination and exposure time changes, making the optimization more complex. In addition, dealing with individual pixels makes association even harder compared to edges. In [11] this is tackled by redefining variables in each keyframe, which leads to variable and measurement repetition, consequently reducing performance. In the present work, illumination changes are handled by the edge detection itself, and association is used to reduce the number of repeated variables, enabling whole edge-map reconstruction with non-linear optimization.
Furthermore, SLAM and visual odometry systems explicitly using edges have also been introduced. The ones using edges as sole input actually use fitted straight lines [19, 20, 21, 22], hence their application is restricted to structured scenarios where these features are present. On the other hand, line features have also been used to enhance existing systems, with the aim of boosting their performance in textureless environments, while yielding a structured line map as a secondary goal. A pioneer in this field is [12], where the authors incorporate lines into a monocular Extended Kalman Filter system. More recently, [23] and [24] extended the semi-direct monocular visual odometry system SVO [10] with lines to improve its performance in textureless environments. In PL-SLAM [8] the authors extend ORB-SLAM with lines, presenting a monocular SLAM system that shows improvement in less textured sequences at the cost of nearly doubling the computational cost of plain ORB-SLAM. Almost at the same time, another PL-SLAM [7] that builds upon the ORB-SLAM pipeline was presented, although using a stereo approach. That work presents results showing better performance than ORB-SLAM on specially recorded real-world textureless scenes, without such a high computational burden. The authors also develop a method for enhanced loop closure using line descriptors, which could be an interesting add-on to the system presented in this paper. All of these approaches use the LSD line detector [25] and the LBD line descriptor [26] for matching purposes.
There is not much work explicitly using general edges in SLAM systems, probably because edges are not easy to associate, track and optimize over time, as they lack descriptors and do not present biunivocal correspondence. Eade and Drummond [27] present what may be the first work in this line. They define an edge feature called an edgelet (a very short, locally straight segment of what may be a longer, possibly curved, line) and use it to enhance a particle filter SLAM system, taking advantage of the structural information provided by edges. Klein and Murray [28] also use this edgelet definition, but in their case to enhance the robustness of their PTAM system [29] against rapid camera motions. Closest to the present work is the Edge SLAM system recently presented in [13], although its optical-flow-based approach to edge tracking is significantly different from SE-SLAM's. To compare against this last system, the results section includes the ICL-NUIM sequence results reported by its authors, showing that SE-SLAM outperforms it in all but one sequence. Regrettably, neither public code nor the parametrization of their work is available, preventing comparison on more challenging datasets like EuRoC or TUM-VI.
III Contribution
This paper introduces SE-SLAM, a novel semi-dense structured edge-based monocular SLAM system. SE-SLAM shows precision comparable to state-of-the-art feature-based and dense/semi-dense systems, while using edges as sole input in all stages but loop closure, and achieves real-time operation in several cases. The edge-based nature of the system not only helps in textureless environments, but also inherently builds a structural semi-dense reconstruction of the environment suitable for further navigation algorithms. This is accomplished thanks to the following contributions:

A method to associate edges between frames and relatively distant keyframes in a consistent manner, enabling the application of techniques traditionally reserved for point features.

A parametrization, residual and variable selection, and marginalization strategy that exploits the nature of edges to make the optimization problem tractable, since its size increases rapidly when existing methods are applied to edges.

A non-linear optimization approach that better adapts to an edge-based input.
The rest of this paper is organized as follows: section IV presents the overall problem; sections V, VI and VII describe the front-end of the system, namely how the initial conditions and data association are generated to build the optimization problem. Section VIII presents the back-end and the assumptions and approximations made to solve the optimization efficiently. Section IX presents a loop closure detection and integration strategy. Finally, some relevant implementation details are explained in section X, and results obtained by running the algorithm on public datasets are presented in section XI.
IV Problem Formulation
The input of the algorithm is composed of edge-maps [14], defined as sets of edge-belonging pixels (edge-points), each including its sub-pixel position, a measure of the local edge gradient or normal direction, and connections to its adjacent edge-points. The latter play a key role in the data association phase.
The purpose of the system is to recover the 3D position of each edge-point while at the same time inferring camera poses. What makes this problem challenging is that there is no obvious way of differentiating neighbouring pixels. Therefore, edge-points cannot be univocally matched over time; as a matter of fact, the same edge observed at different scales may have different numbers of edge-points. For this reason, a keyframe-based approach similar to previous semi-dense methods is followed [18]: edge-maps are defined for each keyframe and anchored to them, and the global map results from the union of all edge-maps.
However, redefining and optimizing an edge-map per keyframe inevitably leads to variable repetition and a waste of computational resources. To solve this, two types of keyframes are differentiated throughout this work. Keyframes in which an edge-map is defined are called "Prior KeyFrames" (PKFs). This corresponds with the fact that for each of them there will be a measurement prior, or factor, added to the pose graph optimization step. On the other hand, and for the sake of clarity, keyframes that only provide edge position measurements will be called "Data KeyFrames" (DKFs).
Another advantage of this separation is that it allows different criteria for keyframe selection: while PKFs are selected based on covisibility (number of novel edge-points), DKFs are selected using simple heuristics that maximize inter-keyframe translation. As done in recent SLAM methods [1], every new frame is added as a keyframe and later removed from an optimization window according to the selection criteria.
Notation convention: PKFs will be indexed with the letter k, DKFs with j and edge-points with i.
IV-A Edge Parametrization
In order to perform bundle adjustment, a parametrization must be chosen for the spatial landmarks, edge-points in this case. To account for the fact that edge-points are defined relative to PKFs, a relative inverse depth parametrization is adopted [30]. However, given that edge-points present no localization information along the edge direction [14], only 2 parameters are selected: inverse depth ρ, and distance λ along the edge's normal direction. This parametrization is sketched in Figure 2.
Defining q_i as the image position of edge-point i in PKF k, n_i its normal direction vector and π⁻¹ the back-projection from image to euclidean space, the 3D position P_i of the edge-point in camera coordinates is defined as:
P_i = π⁻¹(q_i + λ_i n_i) / ρ_i   (1)
This parametrization constrains the edge-point to move only along the edge's perpendicular direction during optimization, thus handling the uncertainty in edge-point localization along the edge's tangential direction.
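As a concrete illustration, the back-projection of this 2-parameter state can be sketched as follows. This is a minimal Python example with hypothetical pinhole intrinsics; the variable names are illustrative and not taken from the released code.

```python
import numpy as np

def backproject(q, K_inv):
    """pi^-1: lift an image point to a ray in euclidean space (z = 1)."""
    return K_inv @ np.array([q[0], q[1], 1.0])

def edgepoint_3d(q, n, rho, lam, K_inv):
    """3D position of an edge-point from its 2-parameter state.

    q   : sub-pixel image position in the anchor PKF
    n   : unit normal direction of the edge at q
    rho : inverse depth
    lam : displacement along the edge normal
    """
    return backproject(q + lam * n, K_inv) / rho

# Hypothetical pinhole intrinsics (fx = fy = 300, cx = 320, cy = 240)
K = np.array([[300.0, 0.0, 320.0],
              [0.0, 300.0, 240.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

# An edge-point displaced 0.2 px along its normal, at inverse depth 0.5
P = edgepoint_3d(np.array([350.0, 250.0]), np.array([1.0, 0.0]),
                 rho=0.5, lam=0.2, K_inv=K_inv)
```

Note that the depth of the resulting point is simply 1/ρ, since the back-projected ray has unit z before the division.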
IV-B Error Model
The error model of [14] is kept: a reprojection error function that takes into account the edge's matching uncertainty along its tangential direction. Hence, only the error along its normal direction is taken into account. For a generic edge-point i defined in PKF k and associated in DKF j, the measurement takes the form:
r_ij = n_ij · ( π( T_j T_k⁻¹ P_i ) − q_ij )   (2)
where T_k and T_j are the respective keyframes' poses, q_ij and n_ij are the associated edge-point's position and normal direction in DKF j, and the scalar product yields the projected error along this direction.
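The normal-projected error can be sketched as follows. This is a minimal example assuming a plain pinhole projection π and a hypothetical world-to-camera pose (R, t); it is not the system's actual implementation.

```python
import numpy as np

def project(P, K):
    """pi: pinhole projection of a 3D point to image coordinates."""
    p = K @ P
    return p[:2] / p[2]

def edge_residual(P_world, R, t, q_obs, n_obs, K):
    """Scalar reprojection error of an edge-point, measured only along
    the observed edge normal n_obs (tangential error is unobservable)."""
    P_cam = R @ P_world + t
    return float(n_obs @ (project(P_cam, K) - q_obs))

K = np.eye(3)
# A point straight ahead at depth 2, observed by a camera at the origin
r = edge_residual(np.array([0.2, 0.0, 2.0]), np.eye(3), np.zeros(3),
                  q_obs=np.array([0.1, 0.0]),
                  n_obs=np.array([1.0, 0.0]), K=K)
```

With the normal perpendicular to the displacement, the same projection error contributes nothing, which is exactly the tangential invariance described above.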
IV-C Uncertainty and Robust Estimation
Given a model for edge position uncertainty provided by the detection method, equation 2 can be modified to account for it. Moreover, IRLS is used in every optimization step, weighting the reprojection error to account for outliers. Overall, the residual function takes the form:
r̃_ij = ω_ij r_ij / σ_ij   (3)
A slightly modified version of the Huber weighting is employed.
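For reference, the standard IRLS Huber weight (of which, per the text, the system uses a slightly modified version) looks like:

```python
def huber_weight(r, delta):
    """IRLS weight for the Huber loss: 1 inside |r| <= delta,
    delta/|r| outside, so large residuals are down-weighted to grow
    only linearly instead of quadratically."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a
```

For example, a residual four times the threshold receives a weight of 0.25, so its squared contribution grows linearly with its magnitude.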
To take into account the edge-point position in its defining PKF, an extra measurement must be added as a prior so as to constrain λ; this prior is not robustly weighted:
r⁰_i = λ_i / σ_λ   (4)
IV-D Total Energy Functional
Having defined the error model, the optimization problem can be posed as the minimization of the reprojection errors of the edge-points defined in each PKF with respect to their matches in DKFs:
E = Σ_k ( Σ_j Σ_i r̃_ij² + Σ_i (r⁰_i)² )   (5)
V Edge Tracking
In the context of this work, tracking is the problem of finding the best transformation between the current and the next frame, without prior data association or an initial condition. For this purpose the method presented in [14] is used, with some enhancements to make it more robust against fast motions.
In [14] an auxiliary image is created from the new frame, containing in each pixel the index of the closest edge-point. This image is used to reproject the current frame into the new one and find an initial match for each edge-point. Then, the reprojection error functional in equation 2 is minimized, only with respect to the relative translation between frames. This is done iteratively, rebuilding the matches in each iteration and using the current frame's estimated depth as an input datum.
In order to reject some wrong associations, a threshold is applied to the difference between the gradient vectors of matched edge-points. Optionally, a threshold can also be applied to the intensity difference at both sides of the edge (one comparison is made on each side).
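The auxiliary closest-edge-point index image can be approximated, for illustration, with a multi-source BFS over the pixel grid. This is a sketch only; the actual implementation and its tie-breaking between equidistant edge-points may differ.

```python
from collections import deque

def closest_edgepoint_image(shape, edgepoints):
    """Multi-source BFS: each pixel receives the index of (approximately)
    its nearest edge-point, so a reprojected point can be matched in O(1)
    by a single lookup."""
    h, w = shape
    idx = [[-1] * w for _ in range(h)]
    queue = deque()
    for i, (r, c) in enumerate(edgepoints):
        idx[r][c] = i
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and idx[rr][cc] == -1:
                idx[rr][cc] = idx[r][c]   # inherit nearest edge-point
                queue.append((rr, cc))
    return idx

# Two edge-points on a 5x5 grid: every pixel maps to index 0 or 1
img = closest_edgepoint_image((5, 5), [(0, 0), (4, 4)])
```

A distance transform would give the exact nearest neighbour; the BFS variant is shown because it is short and captures the idea of the lookup image.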
V-A Multi-Scale Tracking
The input image is decimated at different scales and an edge-map is extracted for each of them. The tracking algorithm is applied in sequence, from coarse to fine scales.
No depth estimation is made for the coarsest scale; instead, depth from the main (finest) scale is recursively propagated to the coarser ones using a nearest neighbour approach. Edges that fail to find a close neighbour in the finer scale are rejected. Besides saving running time, this approach selects coarse edges that have a correspondence in finer scales. These edge-points usually belong to better localized object boundaries.
V-B Depth Smoothing for Tracking and Association
For the tracking stage only, each edge-point's depth is smoothed by averaging it with its closest neighbours' depths.
By smoothing depth, a uniform warping of close edges is achieved, thus improving association in image regions containing close, similar edges. After the tracking stage, the smoothed depth is discarded.
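A minimal sketch of this neighbour-averaging step, assuming a simple edge-point adjacency list (the actual neighbourhood size and weighting may differ):

```python
def smooth_depth(inv_depths, neighbours):
    """Average each edge-point's (inverse) depth with its connected
    neighbours' values. Used only for tracking; discarded afterwards."""
    out = []
    for i, d in enumerate(inv_depths):
        vals = [d] + [inv_depths[j] for j in neighbours[i]]
        out.append(sum(vals) / len(vals))
    return out

# A 3-point edge chain 0-1-2
sm = smooth_depth([1.0, 2.0, 3.0], [[1], [0, 2], [1]])
```

The original values are kept aside so the smoothed ones can be dropped once tracking finishes, as the text describes.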
VI Data Association
The main output of the tracking stage is the optimal transformation between the current and the new frame, given the edge-map's structure. However, associations are made from the current edge-map's points to the new ones (Fig. 3a). Due to the nature of edges, these matches are not biunivocal: two different edge-points may be matched with the same edge-point in the new frame, and there is no guarantee that every new edge-point will be matched (Fig. 3b).
Given how the optimization problem is posed in the back-end, it is important to maximize and refine associations from the new frame to the current one. This is done in two steps: augmentation and correction. Both steps exploit the continuous nature of edges, i.e. most edge-points are connected to adjacent ones.
VI-A Matching Augmentation
In this step each matched edge-point shares its association with its neighbours. This is done by "walking" the edge, which means recursively going over its neighbours and assigning the same associated edge-point until a valid association is found (Fig. 3c).
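The augmentation step can be sketched as follows; this is illustrative only, the real system walks edges using the stored pixel connectivity.

```python
def augment_matches(matches, neighbours):
    """Propagate each association along the edge: unmatched edge-points
    inherit the match of a connected, already-matched neighbour, until
    the whole connected edge carries an association."""
    out = list(matches)
    changed = True
    while changed:
        changed = False
        for i, m in enumerate(out):
            if m is None:
                continue
            for j in neighbours[i]:
                if out[j] is None:
                    out[j] = m          # neighbour inherits the match
                    changed = True
    return out

# Edge chain 0-1-2-3 where only point 1 is matched (to target point 7)
aug = augment_matches([None, 7, None, None], [[1], [0, 2], [1, 3], [2]])
```

After augmentation every point on the chain carries a (possibly coarse) association, which the correction step then refines.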
VI-B Matching Correction
For a given edge-point with an association in a target frame, correction means finding the best matching edge-point in terms of the epipolar constraint between the two frames, using the pose estimated by the tracking stage. For this search the edge is "walked" again using the connectivity information. In practice, not only the epipolar line is taken into account: the current edge-point's reprojected position also helps in cases where the epipolar line is close to parallel with the edge or there is no motion.
This process is applied to each new frame after augmentation, and is also used in the optimization back-end when building the Jacobians. Both the augmentation and correction stages are illustrated in Figure 3.
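The correction search can be sketched as a bounded walk along the target edge, scoring candidates by their distance to the epipolar line. This simplification omits the reprojected-position term; the walk radius is a hypothetical parameter.

```python
import numpy as np

def epiline_distance(p, line):
    """Distance from pixel p = (x, y) to epipolar line l = (a, b, c),
    given in the usual ax + by + c = 0 form."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def correct_match(seed, target_pts, target_nbrs, line, radius=3):
    """Walk the target edge around the seeded match and return the
    edge-point index closest to the epipolar line."""
    best, best_d = seed, epiline_distance(target_pts[seed], line)
    frontier, seen = [seed], {seed}
    for _ in range(radius):
        nxt = []
        for i in frontier:
            for j in target_nbrs[i]:
                if j not in seen:
                    seen.add(j)
                    nxt.append(j)
                    d = epiline_distance(target_pts[j], line)
                    if d < best_d:
                        best, best_d = j, d
        frontier = nxt
    return best

pts = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]
nbrs = [[1], [0, 2], [1, 3], [2]]
# Epipolar line y = 1, i.e. 0*x + 1*y - 1 = 0; seed at point 0
best = correct_match(0, pts, nbrs, line=(0.0, 1.0, -1.0))
```

Here the walk slides the match from the seed to the edge-point lying exactly on the epipolar line.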
VII Variable and Keyframe Selection
In the presented system, edge-maps are only defined in certain keyframes. Even though it would be possible to propagate associations for every PKF in both directions (past and future frames) and solve a joint problem as done in [11], this would lead to variable and measurement repetition.
Ideally, an edge-point representing an individual edge fragment should be defined as a variable in only one PKF; however, this is not easy to achieve due to the non-biunivocal association of edge-points. Nevertheless, an approximation may be obtained by using the associations to disable redefined edge-points.
An open question is in which particular PKF an edge-point should be defined. Possible options are to anchor the edge-point to the first or to the last PKF where it was observed. Even though the former has its advantages, the latter is chosen, as it assures that all currently visible edge-points are anchored and optimized relative to the last frame. This particular choice keeps depth information ready for the tracking step and is key for defining the marginalization strategy. In fact, having associations only to the past assures that keyframe marginalization involves only previous point measurements, and is therefore closer to the optimal.
When a new frame arrives, it is added as a PKF to the back-end. Associations are back-propagated to match every edge-point against the active PKFs and DKFs. When the amount of edge-points matched in a particular keyframe k falls below a fraction of all edge-points, the association process is stopped and a local covisibility window is defined for PKF k. At the same time, the new frame is kept as a PKF if the amount of newly observed edge-points with respect to the last PKF exceeds a certain threshold; otherwise it is added as a DKF.
Every association from a new PKF to the previous one disables the matched variables in the latter, thus reducing variable repetition and computational cost. Ideally, currently visible edge-points would be active variables only in the new PKF, while the others would contain only edge-points that are not visible in the newer ones. In practice, variable repetition remains due to the non-biunivocal matching of edges. However, this proves to be beneficial to accuracy, as it maximizes the number of measurements used.
When the amount of active keyframes inside the newest covisibility window exceeds a limit (10-15), the DKF that presents the least translation with respect to its consecutive keyframes is discarded. PKFs are not discarded, as they define the map.
VIII Hierarchical Optimization
The presented system exploits primary and secondary sparseness [31] through a custom-built Gauss-Newton solver designed for the structure of the posed problem. In fact, the local covisibility windows defined for each PKF suggest a natural split of the problem.
Overall, the method goes as follows. First, each individual sub-functional is linearized around the current optimization point and the edge-point variables are marginalized using the Schur complement. The resulting priors, involving poses only, are jointly optimized using Pose Graph Optimization (PGO) over absolute poses. Finally, edge-points are updated by the linear increments obtained from the PGO.
VIII-A Edge-point Marginalization
Instead of posing the problem in absolute coordinates like [18] and [3], a mixed absolute-relative formulation is chosen; the reasoning for this is given in section VIII-D. For a given sub-functional, residuals are expressed in terms of the relative transformation T_jk that takes edge-points from their PKF k to the corresponding keyframe j:
r_ij = n_ij · ( π( T_jk P_i ) − q_ij )   (6)
where T_jk is a transformation belonging to the SE(3) group, defined by the composition function:
T_jk = T_j T_k⁻¹   (7)
Considering that the projection function is invariant to scale, eq. 6 can be rewritten as:
(8)  
(9) 
where the scaled variables are defined as:
(10)  
(11) 
To take into account the edge-point's position measurement in the respective PKF, a prior on λ is added:
(12) 
Throughout the optimization, analytical expressions for the derivatives of residual 9 with respect to the edge-point variables and the transformations are used:
(13)  
(14)  
(15) 
Jacobians over SE(3) elements are taken with respect to an infinitesimal increment in the SE(3) tangent space (refer to [32], [33]). Taking into account equation 9, it can be shown that:
(16)  
(17) 
With these Jacobians, and stacking residuals and variables into vectors, it is possible to obtain a second order approximation of the functional, which takes the known form:
(18) 
The vector of variables stacks the relative pose increments for all keyframes in the covisibility window of PKF k together with the edge-point variables, and its increments are defined accordingly.
Functional 18 has a well-known form in bundle adjustment [31], which enables its components to be defined as:
(19)  
(20)  
(21) 
It is important to recall that, because each residual depends on only one pose and one edge-point, the Hessian has a sparse structure where the edge-point blocks are diagonal and the pose block is block-diagonal.
In addition to this, the edge-point variables appear only in the sub-functional of their defining PKF. Thus, given an incremental update on the relative transformations, it is possible to solve for their optimal values by taking the Schur complement:
(22)  
(23) 
which can be substituted into equation 18 to obtain a form of the functional where the edge-point variables have been marginalized:
(24) 
where the information vector and information matrix are, respectively:
(25)  
(26) 
In these equations, the variable vector stacks all transformation increments from PKF k to the keyframes in its connectivity window.
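The edge-point marginalization can be illustrated on a small synthetic system with the block structure described above (diagonal landmark block): the Schur complement solves for the poses first, and back-substitution recovers exactly the same landmark increments as solving the full system. The matrix sizes and values here are, of course, arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pose, n_lmk = 3, 6

# Build a well-conditioned symmetric system H dx = b with the usual
# BA block structure: landmark block Hll diagonal.
Hpp = np.diag(rng.uniform(2.0, 3.0, n_pose))
Hpl = rng.uniform(-0.2, 0.2, (n_pose, n_lmk))
Hll = np.diag(rng.uniform(2.0, 3.0, n_lmk))
H = np.block([[Hpp, Hpl], [Hpl.T, Hll]])
b = rng.uniform(-1.0, 1.0, n_pose + n_lmk)
bp, bl = b[:n_pose], b[n_pose:]

# Schur complement: marginalize landmarks and solve the reduced
# pose-only system (the "prior" handed to the PGO)...
Hll_inv = np.diag(1.0 / np.diag(Hll))
S = Hpp - Hpl @ Hll_inv @ Hpl.T
dp = np.linalg.solve(S, bp - Hpl @ Hll_inv @ bl)
# ...then back-substitute the optimal landmark increments
dl = Hll_inv @ (bl - Hpl.T @ dp)

# Reference: solving the full system directly gives the same answer
full = np.linalg.solve(H, b)
```

Because Hll is diagonal, its inversion is O(n), which is what makes marginalizing thousands of edge-points per PKF affordable.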
VIII-B Joint Odometry Pose Graph Optimization
Given the set of sub-functionals approximated by equation 24, the overall optimization problem takes the form:
(27)  
(28) 
where constant terms have been dropped and the residual is the tangent-space distance from the current relative pose to the prior's linearization point.
In order to solve this system using Gauss-Newton [34], a linear approximation of equation 28 is used:
(29) 
It should be noted that, while absolute poses are iteratively relinearized in every optimization step, sub-functional energy priors are selectively relinearized only when needed [35], [36] (see sec. VIII-C). Therefore, the current relative pose may differ from the prior's linearization point.
The Jacobians are defined with respect to an infinitesimal increment on SE(3):
(30)  
(31) 
Both Jacobians are similar; in fact, it can be shown that one can be obtained from the other.
Using the SE(3) adjoint operator, it is possible to "shift" the increment to the left of the transformation:
(32) 
In this equation, the error transform appears, whose twist vector is defined in terms of a translational and a rotational component:
(33) 
Analytical expressions for the derivative in equation 32 are complex, but they can be found in [37], [33] and in the source code of this work.
It is interesting to note that, in the case where the sub-functional has been relinearized in the current optimization step, the error transform equals the identity and the Jacobian becomes the adjoint:
(34)  
(35) 
Equation 29 can be plugged into 27 to obtain a linear system of equations on the increments of the absolute poses. This is a sparse and mainly block-diagonal system (except for the connections generated by loop closures), with non-zero entries only for poses that share mutually observed landmarks.
Solutions are obtained using sparse Cholesky decomposition and, once the increments are obtained, the absolute pose linearization points are updated:
(36) 
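The pose update of equation 36 relies on the SE(3) exponential map. A minimal sketch follows, assuming a left-multiplicative update convention, which may differ from the convention used in the released solver.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """SE(3) exponential of a twist xi = (v, w) -> 4x4 homogeneous
    matrix, with Rodrigues' formula for the rotation part."""
    v, w = xi[:3], xi[3:]
    th = np.linalg.norm(w)
    W = skew(w)
    if th < 1e-10:
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        R = (np.eye(3) + np.sin(th) / th * W
             + (1.0 - np.cos(th)) / th**2 * W @ W)
        V = (np.eye(3) + (1.0 - np.cos(th)) / th**2 * W
             + (th - np.sin(th)) / th**3 * W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def update_pose(T, delta_xi):
    """Apply a tangent-space increment: T <- exp(delta_xi) * T."""
    return se3_exp(delta_xi) @ T

# Applying a pure-translation increment to the identity pose
T2 = update_pose(np.eye(4), np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Iterating this update after each PGO solve keeps the state on the manifold while the solver works with minimal 6-vector increments.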
VIII-C Edge-point Update
As already mentioned, not all PKF functionals are relinearized in every optimization step, since computing the residuals, Jacobians and Schur complements in equations 9 to 24 involves calculations over every edge-point. Given that on average a PKF defines 5000-12000 edge-points, these operations comprise most of the optimization algorithm's running time.
On top of this, once a PKF has been relinearized and optimized many times and falls outside the covisibility window, the effective updates in its variables become negligible and equation 24 starts being a good approximation to the full functional [35]. Thus, sub-functionals are relinearized only when they are within a certain temporal window and a significant pose change occurred during the last optimization step. This is measured by comparing the distance from the current relative transformations involving the sub-functional to its last linearization point, relative to the latter:
(37) 
For the PKFs that have been relinearized, the edge-point variables are updated using equations 22 and 23. The relative increments are first obtained with equation 29 and Jacobians 31 and 30, which are cached from the PGO.
VIII-D Relative vs. Absolute Coordinates in Local BA
In a full BA scheme, every PKF sub-functional would be relinearized in each step. In this case the relative formulation serves only as an intermediate step, yielding a solution equivalent to formulating equation 6 in absolute coordinates (due to the Jacobian chain rule). This is to be expected, as the linearization should be independent of the intermediate variables used to obtain it.
The case is substantially different when relinearization is avoided and equation 24 is used as an approximation to the true functional, in contrast to one constructed using absolute coordinates.
Formulating priors in relative coordinates has interesting advantages [38]. To begin with, the classical BA structure, where each factor involves one landmark and only one pose, is maintained, thus simplifying Schur complement and Hessian computation. But more importantly, the relative formulation is a better approximation of the real functional.
A relative parametrization naturally expresses the invariance to a change in absolute pose, and the marginalized priors have a scale-invariant form due to a rank deficiency (gauge freedom). In fact, the functional will not penalize increments in the direction of the translation components of the linearization point:
(38)  
(39) 
This implies that the functional will not penalize translation expansions or shrinkages as long as they lie along the direction pointed to by the linearization point, which after a few iterations should converge to the optimal one. This interesting property allows modeling scale drift without resorting to Sim(3) [39] optimization. A working example of this idea can be seen in Figure 9.
The prior is invariant to such an increment because the marginalized functional is the result of a minimization over the edge-point variables, for which the following holds:
(40)  
(41)  
(42) 
In order for 42 to hold for every increment, the corresponding gradient term must vanish, implying invariance to such a change.
VIII-E Pose Marginalization
Problem 27 involves poses only, and is therefore much faster to solve than full BA. However, it grows linearly with time, becoming computationally expensive. To keep running time bounded, marginalization is applied to poses that are old enough to no longer change [3]. A wider marginalization window is defined, such that relative PKF priors falling outside this window are fused into a single absolute prior.
This procedure is simple: the system creates an absolute position prior that fixes both position and scale to their initial condition. Absolute priors involve a vector of correlated absolute poses and take the following generic form:
(43) 
The absolute prior is jointly optimized with the rest of problem 27 until a PKF gets a link to a keyframe outside the marginalization window. In this case, its corresponding sub-functional is jointly optimized with the absolute position prior, updating the latter. Finally, the absolute prior is constrained using Schur marginalization to have all its measurements inside the marginalization window. Needless to say, this is a sliding window, so old PKFs are incrementally merged into the absolute prior.
VIII-F Overall Algorithm
The optimization algorithm can be summarized in the following steps (performed in each iteration):

Marginalize old poses and rebuild the absolute prior according to section VIII-E.

Solve a PGO problem on the absolute pose variables and update them.

Update landmarks using equations 22 and 23, only for the PKFs relinearized in this iteration.
Before running the optimization, a loop closure check is performed if the current frame is a PKF. If a match is found, a slightly different optimization algorithm is run (see section IX-B).
IX Loop Closures
Loop closure is a key aspect of SLAM systems, especially monocular ones, as it greatly helps to keep the overall error bounded. However, it is a challenging problem on its own and is out of the scope of this paper, which focuses on edge-map association and optimization. Therefore, inspired by similar works [2, 40], well-established feature-based methods are used for loop candidate search and pre-alignment, in order to show SE-SLAM's capabilities apart from the loop closure problem itself. A full edge-based loop-closing mechanism is left for future work.
IX-A Loop Candidate Search
In order to find possible loop candidates, ORB features are extracted [41] for PKFs only and used to compute a binary representation with a Bag of Words (BoW) dictionary implementation [42], thus creating a BoW database. Every time a new PKF is created, it is compared against all previous PKFs outside its covisibility window, obtaining a similarity score. The two PKFs whose scores are highest above the overall average are marked as potential loop candidates.
The candidate's keypoints are matched using their descriptors, and a transformation is computed by solving the PnP problem with RANSAC using OpenCV. To that end, each candidate keypoint's depth is estimated by averaging the inverse depth of neighbouring edgepoints in a window around the point. This transformation is further refined using the edge tracking method described in section V.
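The depth-from-neighbouring-edgepoints step can be sketched as follows; the window shape, radius value and function names are illustrative assumptions, as the paper does not specify them:

```python
import numpy as np

def keypoint_depth(kp, edge_px, edge_inv_depth, radius=5.0):
    """Estimate a keypoint's depth by averaging the inverse depth of
    edgepoints inside a circular window around it.

    kp:             (2,) pixel position of the keypoint.
    edge_px:        (N, 2) pixel positions of edgepoints.
    edge_inv_depth: (N,) inverse depths of those edgepoints.
    """
    dist = np.linalg.norm(edge_px - kp, axis=1)
    near = edge_inv_depth[dist < radius]
    if near.size == 0:
        return None  # no edgepoints nearby: depth cannot be estimated
    # Averaging in inverse-depth space, consistent with the
    # inverse-depth parametrization used throughout the system.
    return 1.0 / near.mean()
```

With a depth per matched keypoint, the keypoints can be back-projected to 3D and fed to a PnP+RANSAC solver (e.g. OpenCV's `solvePnPRansac`) to obtain the candidate transformation.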
To check a candidate's validity, its edgepoints are reprojected onto the current frame. If enough edgepoints present low reprojection error and lie inside the current frame's FOV, the candidate is accepted.
IX-B Optimization
Once a candidate has been aligned and associated with the current PKF, its data association is used to propagate matches from the candidate PKF to every keyframe in the current keyframe’s covisibility window. This is done simply by connecting successive matches.
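Connecting successive matches amounts to composing per-keyframe match maps; a minimal sketch, assuming matches are stored as dictionaries from edgepoint id to edgepoint id (a representation chosen here for illustration):

```python
def propagate_matches(cand_to_kf0, kf_chain):
    """Propagate edgepoint matches from the loop candidate to every
    keyframe in the covisibility window by chaining successive
    frame-to-frame match maps.

    cand_to_kf0: dict mapping candidate edgepoints to the first keyframe.
    kf_chain:    list of dicts, each mapping keyframe i to keyframe i+1.
    Returns one match map per keyframe; unmatched edgepoints drop out.
    """
    matches = dict(cand_to_kf0)
    propagated = [matches]
    for step in kf_chain:
        matches = {src: step[dst] for src, dst in matches.items() if dst in step}
        propagated.append(matches)
    return propagated
```

Note that the propagated set can only shrink along the chain, which is consistent with restricting the propagation to the covisibility window.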
Once these new connections are established, the optimal transformation from the candidate PKF to each of these keyframes is obtained by minimizing the reprojection error 6 with respect to the relative pose only.
Next, a joint relative odometry prior, from the candidate to the keyframes in both the candidate's and the current covisibility windows, is built using the procedure described in section VIII-A. This prior is constructed using the optimized poses obtained in the previous step as linearization point, which provides a good hint for the true optimum.
Once this prior is computed, all relative priors are reactivated and the pose graph is optimized in full (the marginal absolute prior is discarded). Edgepoints in the active covisibility window are also included in the optimization. The pose graph structure in this case is illustrated in Figure 5.
Usually only a few iterations are required for convergence. Upon convergence, all relative priors falling outside the relative marginalization window are fused again into a single absolute marginal prior, and the system resumes the process described in the previous sections.
It is paramount to interpret the prior as a compressed form of the reprojection error functions from the candidate PKF's edgepoints to every keyframe it is connected to, both in the current covisibility window and its own, as shown in Figure 5.
Given that such connections involve more than one mutual pose, not only position but also scale is connected. When PGO is performed with all relative priors, together with the active edgepoints, the scale adapts to the one given by the candidate PKF, which is connected through its covisibility window to older measurements, as shown experimentally in Figure 9. In this setup, scale freedom is not provided by SIM3 optimization [39] but by the rank deficiency of the joint odometry priors (section VIII-D).
X Implementation Details
X-A Edge Detection
Up to this point, little has been said about the edge extraction procedure. This reflects the fact that the presented algorithm is intended to be agnostic to the detection method. However, since edges are the algorithm's input, different detection methods can be expected to have a significant impact on its performance.
The edge detection algorithm is based on finding maximal gradients on a smoothed image [43]. Despite being simple, it yields well-localized edges, which, together with repeatability, is the most desirable property for a SLAM algorithm.
Non-maximal suppression is applied and a subpixel position is estimated for each edgepoint; a third-derivative threshold is also applied [44], as this helps discriminate against continuous gradients.
Finally, the edgemap is optionally decimated by half to reduce the overall number of variables. This improves running speed by eliminating redundant variables, while keeping full-resolution position and connectivity.
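Once non-maximal suppression has selected the pixel where the gradient magnitude peaks along the gradient direction, a subpixel offset is commonly obtained by fitting a parabola through three magnitude samples. The sketch below shows this standard refinement scheme; the paper's exact formulation follows [43, 44] and may differ:

```python
def subpixel_offset(m_prev, m0, m_next):
    """Sub-pixel peak offset along the gradient direction, from three
    gradient-magnitude samples at offsets -1, 0, +1, via a parabola fit.
    Returns a value in (-0.5, 0.5), or None if the center sample is not
    a strict local maximum (parabola not concave)."""
    denom = m_prev - 2.0 * m0 + m_next
    if denom >= 0.0:
        return None
    return 0.5 * (m_prev - m_next) / denom
```

A symmetric neighbourhood yields a zero offset, while an asymmetric one shifts the edgepoint toward the larger neighbour, which is what preserves the well-localized edge positions the tracking stage relies on.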
X-B Optimization Parallelization
Most of the algorithm's computation time is spent in the edgepoint marginalization step (section VIII-A). The marginalizations of the individual PKF priors are independent of each other, as their information is only combined later in the PGO. These calculations are therefore parallelized using a thread pool of 8 threads.
Finally, secondary sparseness is also exploited inside each prior calculation by separating edgepoints into groups according to how far they extend in the covisibility window. The Schur complement can be applied separately to each group, adding the results in a later stage. These groups are also processed in parallel.
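Since each PKF prior is marginalized independently, the parallelization maps directly onto a thread pool. A minimal sketch using Python's standard library (the real system is not Python; `marginalize_one` stands in for the per-PKF Schur-complement step):

```python
from concurrent.futures import ThreadPoolExecutor

def marginalize_all(pkf_priors, marginalize_one, workers=8):
    """Run the per-PKF edgepoint marginalizations in parallel.
    `pool.map` preserves input order, so results line up with the
    PKFs they belong to before being combined in the PGO."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(marginalize_one, pkf_priors))
```

The same pattern applies one level down, distributing the per-group Schur complements inside a single prior.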
XI Evaluation
The presented system was benchmarked on the EuRoC dataset [45], which is widely used for evaluation of SLAM methods, on indoor sequences from the TUM-VI dataset [46], and on the ICL-NUIM dataset [47]. Root Mean Squared Absolute Trajectory Error (ATE) is the chosen metric for evaluation, as it provides an overall benchmark of position accuracy. Due to the monocular nature of the algorithm, SIM3 alignment of the trajectory with respect to the provided groundtruth and the ATE error calculation were performed using a public implementation [48], based on the methods proposed in [49] and [50].
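For reference, the alignment-plus-ATE evaluation can be sketched in a few lines of numpy, using Umeyama's closed-form similarity alignment [50] on the position components (a simplification of Sim(3) alignment that ignores rotational error, as ATE does):

```python
import numpy as np

def align_umeyama(est, gt):
    """Align an estimated trajectory to ground truth with a similarity
    transform (scale s, rotation R, translation t) via Umeyama's
    closed-form method, then compute RMS ATE. Inputs are (N, 3)."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    aligned = (s * (R @ est.T)).T + t
    ate = np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
    return s, R, t, ate
```

The estimated scale `s` absorbs the global scale ambiguity of a monocular system, so ATE then measures only the residual trajectory error.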
The algorithm was run on a desktop computer featuring an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 64GB of RAM; however, the system's memory footprint does not exceed 2GB. The system is initialized using [14], engaging full optimization once a few frames have been accumulated. As in other methods [18], the first and last parts of the datasets, where the camera is still or shaken for IMU initialization, are stripped off, as they compromise monocular initialization.
Two versions of the method are tested, namely SESLAM1 and SESLAM3, which perform one and three optimization iterations per frame, respectively.
XI-A Results on EuRoC Dataset
EuRoC-MAV is a well-established state-of-the-art benchmark for SLAM systems and is the main dataset used for evaluating this algorithm. It provides two sets of realistic indoor sequences taken from a quadrotor MAV, rated by difficulty. The first set features five sequences in a large machine hall with different degrees of dynamic motion and illumination change. The second set was taken inside two rooms equipped with a VICON tracking system; while less space is covered, these sequences are more challenging in terms of highly dynamic motions. The whole dataset targets stereo visual-inertial algorithms, hence running pure monocular systems introduces additional difficulties, especially in cases of pure camera rotation, where the whole edgemap is renewed without translation.
Table I shows ATE for the Machine Hall and Vicon Room sequences, comparing SESLAM against two widely known algorithms: ORB-SLAM2 [2], representing feature-based SLAM systems, and LDSO [40], representing direct ones. Both methods are state of the art in monocular SLAM and were run using the configurations provided by their authors.
Seq  SESLAM  ORB-SLAM2  LDSO  

MH01  ()  ()  () 
MH02  ()  ()  () 
MH03  ()  ()  () 
MH04   ()  ()  () 
MH05   ()  ()  () 
VR1_1  ()  ()  () 
VR2_1  ()  ()  () 
VR1_2  ()  ()  () 
VR2_2   ()  ()  () 
It can be seen that SESLAM presents accuracy similar to both methods; in particular, SESLAM3 only introduces a significant improvement in difficult sequences, where highly dynamic movements introduce large amounts of new data per frame. The cause of failure in the most challenging sequences is loss of tracking due to strong intensity changes and large motions with image blur. The system does not recover from these cases as it lacks a relocalization mechanism, which is proposed as future work.
It is important to highlight that both ORB-SLAM2 and LDSO are the result of long incremental work in their respective fields, in contrast to edge-based SLAM. In light of SESLAM's competitive results on these sequences, the approach is clearly promising and has room for improvements that could be carried out in the future.
Top down views of resulting trajectories for sequences MH03 and V102 are shown in fig. 6(a,b).
XI-B Results on TUM-VI Dataset
The TUM-VI dataset is newer than the former, comprising many sets of indoor and outdoor sequences. It is a challenging dataset aimed at evaluating stereo visual-inertial algorithms over long sequences with dynamic motions and illumination changes. The Room sequences were chosen for evaluation, as they are the only ones with ground truth data over the whole sequence.
SESLAM1  OKVIS [3]  ROVIO[4]  VINS [6]  

Room 1  ()  0.06  0.16  0.07 
Room 2  ()  0.11  0.33  0.07 
Room 3   ()  0.07  0.15  0.11 
Room 4   ()  0.03  0.09  0.04 
Room 5  ()  0.07  0.12  0.20 
Room 6  ()  0.04  0.05  0.08 
Test results are shown in Table II, where SESLAM is compared against the results published together with the dataset [46], ensuring the other systems use the configurations intended by their authors. It can be seen that, even though SESLAM is a monocular system, it produces competitive results even against state-of-the-art visual-inertial SLAM systems.
The fully optimized trajectory for sequence Room 1 is depicted in fig. 6(c).
XI-C Results on ICL-NUIM Dataset
Finally, SESLAM was also run on the Office sequences of the synthetic ICL-NUIM dataset, in order to compare it against the edge-based algorithm presented in [13]. Results are shown in Table III, with ATE values for [13] taken from that paper, as its code has not been released.
It should be noted that ICL-NUIM is a dataset targeted at RGB-D SLAM and reconstruction systems. The office trajectories employed by [13] are short sequences with slow motions. Given that roughly 20% of each sequence is spent on initialization, the authors consider these a suboptimal way to assess this type of SLAM system. Nevertheless, our system outperforms [13], except for one case where a tracking failure occurs.
XI-D Reconstruction Results
The reconstruction output is a key aspect of this method. Examples are shown in Figures 1, 7 and 8. The advantage in terms of structural information, with respect to sparse systems, is easy to see. Noteworthy is how the system keeps consistency in the Room 2 sequence over a 2.24-minute span with several loops occurring.
XI-E Running Times
Given that SESLAM is a semi-dense method that optimizes every significant edgepoint in the images (5K to 12K extracted per frame), its computational burden is higher than that of sparse methods. Running time also depends heavily on the particular scene structure. Despite this, the system shows good performance on most datasets, owing to its PKF selection policy, hierarchical optimization exploiting primary and secondary sparseness, and parallelization. Measured running times are reported in Table IV.
Seq  FPS  EP/F  Seq  FPS  EP/F 

MH01  26  11102  Room 1  21  9085 
MH02  28  10808  Room 2  22  8806 
MH03  26  9422  Room 3  14  947 
MH04  17*  7654  Room 4     
MH05  16*  7751  Room 5  19  8403 
V101  36  7628  Room 6  29  8322 
V201  41  6948  ICL0  26  11597 
V102  32  5651  ICL1  34  8832 
V202  17*  5974  ICL2  27  11841 
V103      ICL3  28  12186 
V203     
XI-F Scale Drift and Loop Closure
Finally, Figure 9 shows the estimated speed error for both the incremental and the fully optimized trajectories on the EuRoC MH01 sequence. In the incremental case the system underestimates speed as a consequence of increasing scale drift. When a loop closure occurs, PGO performs a full trajectory optimization, converging to a consistent scale.
XII Conclusion
In this paper, SESLAM was presented: a novel semi-dense structured edge-based monocular SLAM system that achieves competitive results on challenging datasets compared to other state-of-the-art systems. The edge-based nature of the system inherently builds a semi-dense reconstruction of the environment, providing relevant structure data for further navigation algorithms and leveraging the most relevant image information even in textureless environments. An edge-based system is not trivial to build, mainly because edges are not easy to associate, track and optimize over time: they lack descriptors and do not present biunivocal correspondence, unlike point features. These issues were tackled by developing methods that adapt SLAM theory to the nature of edges, including: an approach to match edges consistently, both between frames and between relatively distant keyframes; a parametrization, residual and variable selection strategy that exploits the nature of edges to make the optimization problem tractable, avoiding its rapid growth in size; and a nonlinear optimization scheme better suited to the edge optimization problem.
Finally, SESLAM was benchmarked on the widely used EuRoC-MAV, the newer TUM-VI and the ICL-NUIM datasets. As shown in the results section, the system achieves precision comparable to state-of-the-art feature-based and dense/semi-dense systems, while reaching real-time operation in several cases and using edges as the sole input in all stages but loop closure.
Proposed future work includes extending the use of edges to the loop closure stage to make the system fully edge-based, exploring edge sparsification and interpolation to reduce processing time, and evaluating different detectors and association strategies. Developing a relocalization stage is also proposed, so as to recover from tracking failures.
In light of SESLAM's competitive results, the edge-based approach to SLAM is clearly promising and has room for improvement on its way to full maturity. To encourage such developments, SESLAM's source code will soon be released as open source.
References
 [1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 [2] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 [3] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframebased visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
 [4] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended kalman filter based visualinertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
 [5] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “Onmanifold preintegration for realtime visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
 [6] T. Qin, P. Li, and S. Shen, “Vinsmono: A robust and versatile monocular visualinertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
 [7] R. Gomez-Ojeda, F. A. Moreno, D. Scaramuzza, and J. G. Jiménez, “PL-SLAM: a stereo SLAM system through the combination of points and line segments,” CoRR, vol. abs/1705.09479, 2017.
 [8] A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, “PL-SLAM: real-time monocular visual SLAM with points and lines,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 4503–4508, IEEE, 2017.
 [9] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: dense tracking and mapping in real-time,” in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2320–2327, IEEE, 2011.
 [10] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: fast semi-direct monocular visual odometry,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 15–22, IEEE, 2014.
 [11] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2018.
 [12] P. Smith, I. D. Reid, and A. J. Davison, “Realtime monocular slam with straight lines,” 2006.
 [13] S. Maity, A. Saha, and B. Bhowmick, “Edge slam: Edge points based monocular visual slam,” in Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops, ICCVW’17, pp. 2408–2417, 2017.
 [14] J. Jose Tarrio and S. Pedre, “Realtime edge-based visual odometry for a monocular camera,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 702–710, 2015.
 [15] J. J. Tarrio and S. Pedre, “Realtime edge-based visual inertial odometry for MAV teleoperation in indoor environments,” Journal of Intelligent & Robotic Systems, vol. 90, no. 1–2, pp. 235–252, 2018.
 [16] L. von Stumberg, V. Usenko, J. Engel, J. Stückler, and D. Cremers, “From monocular slam to autonomous drone exploration,” in 2017 European Conference on Mobile Robots (ECMR), pp. 1–8, IEEE, 2017.
 [17] R. Mur-Artal and J. D. Tardós, “Probabilistic semi-dense mapping from highly accurate feature-based monocular SLAM,” in Robotics: Science and Systems, 2015.
 [18] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: large-scale direct monocular SLAM,” in European Conference on Computer Vision, pp. 834–849, Springer, 2014.
 [19] G. Zhang and I. H. Suh, “Building a partial 3d linebased map using a monocular slam,” 2011 IEEE International Conference on Robotics and Automation, pp. 1497–1502, 2011.
 [20] H. Zhou, D. Zou, L. Pei, R. Ying, P. Liu, and W. Yu, “Structslam: Visual slam with building structure lines,” IEEE Transactions on Vehicular Technology, vol. 64, pp. 1364–1375, April 2015.
 [21] G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, “Building a 3d linebased map using stereo slam,” IEEE Transactions on Robotics, vol. 31, pp. 1364–1377, Dec 2015.
 [22] M. Tomono, “Linebased 3d mapping from edgepoints using a stereo camera,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3728–3734, May 2014.
 [23] S. Yang and S. Scherer, “Direct monocular odometry using points and lines,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3871–3877, May 2017.
 [24] R. Gomez-Ojeda, J. Briales, and J. Gonzalez-Jimenez, “PL-SVO: semi-direct monocular visual odometry by combining points and line segments,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4211–4216, Oct 2016.
 [25] R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “LSD: a fast line segment detector with a false detection control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 722–732, 2010.
 [26] L. Zhang and R. Koch, “An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency,” J. Visual Communication and Image Representation, vol. 24, pp. 794–805, 2013.
 [27] E. Eade and T. Drummond, “Edge landmarks in monocular slam,” in Proceedings of the 2006 British Machine Vision Conference, BMVC’06, (Washington, DC, USA), IEEE Computer Society, 2006.
 [28] G. Klein and D. Murray, “Improving the agility of keyframebased slam,” in European Conference on Computer Vision, pp. 802–815, Springer, 2008.
 [29] G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, ISMAR ’07, (Washington, DC, USA), pp. 1–10, IEEE Computer Society, 2007.
 [30] J. Civera, A. J. Davison, and J. M. Montiel, “Inverse depth parametrization for monocular slam,” IEEE transactions on robotics, vol. 24, no. 5, pp. 932–945, 2008.
 [31] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms, pp. 298–372, Springer, 1999.
 [32] J. L. Blanco, “A tutorial on se(3) transformation parameterizations and onmanifold optimization,” 09 2010.
 [33] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
 [34] P.A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 [35] M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental smoothing and mapping,” IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1365–1378, 2008.
 [36] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “isam2: Incremental smoothing and mapping using the bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
 [37] E. Eade.
 [38] G. Sibley, “Relative bundle adjustment,” Department of Engineering Science, Oxford University, Tech. Rep, vol. 2307, no. 09, 2009.
 [39] H. Strasdat, J. Montiel, and A. J. Davison, “Scale driftaware large scale monocular slam,” Robotics: Science and Systems VI, vol. 2, no. 3, p. 7, 2010.
 [40] X. Gao, R. Wang, N. Demmel, and D. Cremers, “LDSO: direct sparse odometry with loop closure,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2198–2204, IEEE, 2018.
 [41] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: an efficient alternative to SIFT or SURF,” 2011.
 [42] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
 [43] J. Canny, “A computational approach to edge detection,” in Readings in computer vision, pp. 184–203, Elsevier, 1987.
 [44] T. Lindeberg, “Principles for automatic scale selection,” 1999.
 [45] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
 [46] D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stückler, and D. Cremers, “The tum vi benchmark for evaluating visualinertial odometry,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1680–1687, IEEE, 2018.
 [47] A. Handa, T. Whelan, J. McDonald, and A. Davison, “A benchmark for RGBD visual odometry, 3D reconstruction and SLAM,” in IEEE Intl. Conf. on Robotics and Automation, ICRA, (Hong Kong, China), May 2014.
 [48] R. Mur-Artal.
 [49] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, D. Cremers, R. Siegwart, and W. Burgard, “Towards a benchmark for rgbd slam evaluation,” in RGBD Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf.(RSS), 2011.
 [50] S. Umeyama, “Leastsquares estimation of transformation parameters between two point patterns,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 4, pp. 376–380, 1991.