SE-SLAM: Semi-Dense Structured Edge-Based Monocular SLAM
Vision-based Simultaneous Localization And Mapping (VSLAM) is a mature problem in Robotics. Most VSLAM systems are feature-based methods, which are robust and present high accuracy, but yield sparse maps with limited application for further navigation tasks. More recently, direct methods which operate directly on image intensity have been introduced, capable of reconstructing richer maps at the cost of higher processing power. In this work, an edge-based monocular SLAM system (SE-SLAM) is proposed as a middle point: edges present good localization properties, as point features do, while enabling a structural semi-dense map reconstruction. However, edges are not easy to associate, track and optimize over time, as they lack descriptors and biunivocal correspondence, unlike point features. To tackle these issues, this paper presents a method to match edges between frames in a consistent manner; a feasible strategy to solve the optimization problem, since its size rapidly increases when working with edges; and the use of non-linear optimization techniques. The resulting system achieves comparable precision to state of the art feature-based and dense/semi-dense systems, while inherently building a structural semi-dense reconstruction of the environment, providing relevant structure data for further navigation algorithms. To achieve such accuracy, state of the art non-linear optimization is needed, over a continuous feed of edgepoints per frame, to optimize the full semi-dense output. Despite its heavy processing requirements, the system achieves near real-time operation, thanks to a custom-built solver and parallelization of its key stages. In order to encourage further development of edge-based SLAM systems, SE-SLAM source code will be released as open source.
Vision-based Simultaneous Localization And Mapping (VSLAM) is a well-studied problem, key to Computer Vision and Robotics. SLAM lies at the core of most systems that require awareness of self-motion and scene geometry.
Many solutions have been proposed for the VSLAM problem. Feature-based algorithms [1, 2, 3, 4, 5, 6] rely on recognition and tracking of point landmarks over time. These methods are robust and present high accuracy; however, the reconstructed map is sparse and thus of little practical use by itself. They also present problems in textureless environments, where point features are difficult to come by.
Direct methods operate directly on image intensity, reconstructing depth at every image pixel [9, 10] or in a selected subset of pixels. These methods avoid the cost of feature extraction, but state of the art methods need to estimate and correct a photometric camera model to deal with illumination changes and occlusions, yielding a more complex optimization problem. Nevertheless, the reconstruction output of these systems usually yields richer maps, more suitable for navigation than those of feature-based methods, at the cost of higher processing power, and of GPU processing in fully dense methods.
In this context, an edge-based system appears as an interesting middle point between the two. On one hand, edge extraction is robust, highly parallelizable and widely studied in computer vision, and it is also known to take part in biological vision systems. Moreover, edges are trackable features, meaning that the geometric re-projection error can be employed as in classical feature-based systems. Furthermore, edges can be extracted in textureless scenes where classical feature extraction is difficult. Finally, an edgemap usually contains object boundaries; hence its reconstruction provides rich information about the structure of the observed scene, and may play a part in fully dense reconstruction and semantic interpretation.
Despite their potential benefits, SLAM systems using edges are not common, and most of them rely on straight lines as a complement to classical features, to increase robustness where base methods fail [12, 8, 7]. Only recently has an edge-based system been proposed in the literature. This is probably due to the fact that edges are not easy to associate, track and optimize over time, as they lack descriptors and do not present biunivocal correspondence, unlike point features.
In this work, a novel edge-based SLAM system is introduced (SE-SLAM), which relies on edge detection as sole input, except for the loop closure step. Tracking and data association are tackled by following up on previous developments in edge-based visual odometry for MAV control [14, 15]. Full non-linear keyframe-based optimization is performed on edges, adapting recent methods [3, 11] to the limitations and advantages presented by this type of feature. The resulting system achieves comparable accuracy to state of the art methods and provides a full edge reconstruction output, suitable for further navigation steps (see Figure 1). Using this type of back-end to optimize a pure edge input is, to the authors' knowledge, the first of its kind.
II Related Work
Feature-based algorithms that rely on recognition and tracking of point landmarks over time are the most common solution to the Visual SLAM problem [1, 2, 3]. ORB-SLAM2 is probably the state of the art in terms of accuracy. OKVIS is another relevant solution, targeted at stereo visual-inertial odometry, whose optimization and marginalization approach served as inspiration for this work. These methods present high accuracy; however, the reconstructed map is sparse and a separate mapping algorithm is required to obtain denser ones.
Addressing this issue, direct SLAM methods deal with the problem of minimizing an intensity-based functional over all image pixels without a feature extraction phase. DTAM was the first to show a dense 3D monocular reconstruction of this kind. It relies on heavy GPU-aided processing to incrementally fuse and regularize every image pixel into a dense model, and is constrained to small environments. As an effort to reduce the computational cost of these algorithms, semi-dense approaches were presented that operate over image intensities but only on a meaningful set of pixels. These pixels are picked based on their trackability, using the high intensity-gradient parts of the image. Hence, their output contains edges, but no explicit treatment of them is made. DSO applies non-linear optimization to an intensity-based functional over a sparse selection of pixels. Moreover, it incorporates lighting parameters into the camera model to account for illumination and exposure time changes, making the optimization more complex. In addition, dealing with individual pixels makes association even harder compared to edges. This has been tackled by redefining variables on each keyframe, which leads to variable and measurement repetition, consequently reducing performance. In the present work, illumination changes are handled by edge detection, and association is used to reduce the number of repeated variables, enabling whole-edgemap reconstruction with non-linear optimization.
Furthermore, SLAM and visual odometry systems explicitly using edges have also been introduced. The ones using edges as sole input actually use fitted straight lines [19, 20, 21, 22]; hence their application is restricted to structured scenarios where these features are present. On the other hand, line features have also been used to enhance existing systems, with the aim of boosting their performance in textureless environments, while yielding a structured line map as a secondary goal. A pioneering work in this field incorporated lines into a monocular Extended Kalman Filter system. More recently, two works extended the semi-direct monocular visual odometry system SVO using lines to improve its performance in textureless environments. In PL-SLAM, the authors extend ORB-SLAM with lines, presenting a monocular SLAM system that shows improvement in less textured sequences at the cost of nearly doubling the computational cost of plain ORB-SLAM. Almost at the same time, another PL-SLAM system that builds upon the ORB-SLAM pipeline was presented, although using a stereo approach. This work presents results showing better performance than ORB-SLAM on specially recorded real-world textureless scenes, without such a high computational burden. The authors also develop a method for enhanced loop closure using line descriptors, which could be an interesting add-on to the system in this paper. All of these approaches use the LSD line detector and the LBD line descriptor for matching purposes.
There is not much work explicitly using general edges in SLAM systems, probably because edges are not easy to associate, track and optimize over time, as they lack descriptors and do not present biunivocal correspondence. Eade and Drummond present what may be the first work in this line. They define an edge feature called an edgelet - a very short, locally straight segment of what may be a longer, possibly curved, line - and use it to enhance a particle filter SLAM system, to take advantage of the structural information provided by edges. Klein and Murray also use this edgelet definition, but in their case to enhance the robustness of their PTAM system against rapid camera motions. Closest to the present work is the Edge SLAM system presented recently, although its optical-flow-based approach to edge tracking is significantly different from SE-SLAM's. To compare against this last system, the results section includes results for the ICL-NUIM sequences reported by its authors, showing that SE-SLAM outperforms it in all but one sequence. Regrettably, there is neither publicly available code nor parametrization of their work to compare on more challenging datasets such as EuRoC or TUM-VI.
This paper introduces SE-SLAM, a novel semi-dense structured edge-based monocular SLAM system. SE-SLAM shows comparable precision to state of the art feature-based and dense/semi-dense systems, but using edges as sole input in all the stages but loop closure, achieving real-time operation in several cases. The edge-based nature of the system not only helps in textureless environments, but also inherently builds a structural semi-dense reconstruction of the environment suitable for further navigation algorithms. This is accomplished thanks to the following contributions:
A method to associate edges between frames and relatively far keyframes in a consistent manner, enabling the application of techniques traditionally reserved for point features.
A parametrization, residual, and variable selection and marginalization strategy that exploits the nature of edges to make the optimization problem tractable, since its size increases rapidly if existing methods are applied to edges.
A non-linear optimization approach that better adapts to an edge based input.
The rest of this paper is organized as follows: section IV presents the overall problem, sections V, VI and VII describe the front-end of the system, namely how the initial conditions and data association are generated to build the optimization problem. Section VIII presents the back-end and which assumptions and approximations are made to solve the optimization efficiently. Section IX presents a loop closure detection and integration strategy. Finally, some relevant implementation details are explained in section X and results obtained by running the algorithm on public datasets are presented in section XI.
IV Problem Formulation
The input of the algorithm is composed of edgemaps, defined as sets of edge-belonging pixels (edgepoints), each including its sub-pixel position, a measure of local edge gradient or normal direction, and connections to its adjacent edgepoints. The latter play a key role in the data association phase.
The purpose of the system is to recover the 3D position of each edgepoint while simultaneously inferring camera poses. What makes this problem challenging is that there is no obvious way of differentiating neighbouring pixels. Therefore, edgepoints cannot be univocally matched over time; in fact, the same edge observed at different scales may have different numbers of edgepoints. For this reason, a keyframe-based approach similar to previous semi-dense methods is followed: edgemaps are defined for each keyframe and anchored to them, and the global map results from the union of all edgemaps.
However, redefining and optimizing an edgemap per keyframe inevitably leads to variable repetition and a waste of computational resources. To solve this, two types of keyframes are differentiated throughout this work. Keyframes in which an edgemap is defined are called “Prior KeyFrames” (PKFs). This corresponds with the fact that for each of them there will be a measurement prior, or factor, added to the pose graph optimization step. On the other hand, and for the sake of clarity, keyframes that only provide edge position measurements will be called “Data KeyFrames” (DKFs).
Another advantage of this separation is that it allows using different criteria for keyframe selection: while PKFs are selected based on covisibility (number of novel edgepoints), DKFs are selected using simple heuristics that maximize inter-keyframe translation. As done in recent SLAM methods, every new frame is added as a keyframe and later removed from an optimization window according to the selection criteria.
Notation convention: PKFs, DKFs and edgepoints are each indexed with their own letter.
IV-A Edge Parametrization
In order to perform bundle adjustment, a parametrization must be chosen for spatial landmarks, edgepoints in this case. To account for the fact that edgepoints are defined relative to PKFs, a relative inverse depth parametrization is adopted. However, given that edgepoints present no localization information along the edge direction, only 2 parameters are selected: inverse depth and distance along the edge's normal direction. This parametrization is sketched in Figure 2.
Defining the image position of the edgepoint in its PKF, its normal direction vector, and the back-projection from image to Euclidean space, the 3D position of the edgepoint in camera coordinates is defined as:
This parametrization constrains the edgepoint to move only along the edge's perpendicular direction during optimization, thus accounting for the localization uncertainty of edgepoints along the edge's tangential direction.
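This 2-parameter state can be sketched as follows, assuming a pinhole model; all function and variable names here are illustrative, not taken from the released code:

```python
import numpy as np

def back_project(u, K_inv):
    """Back-project a pixel (u, v) to a unit-depth ray in camera coordinates."""
    return K_inv @ np.array([u[0], u[1], 1.0])

def edgepoint_3d(u, n, rho, lam, K_inv):
    """3D position of an edgepoint from its 2-parameter state.

    u   : detected sub-pixel image position in the anchor PKF
    n   : unit normal direction of the edge at u (in the image plane)
    rho : inverse depth
    lam : signed displacement along the edge normal
    """
    # The edgepoint may only move along the edge normal during optimization,
    # so the image position is shifted by lam * n before back-projection.
    shifted = np.asarray(u, dtype=float) + lam * np.asarray(n, dtype=float)
    return back_project(shifted, K_inv) / rho
```

For example, with an identity intrinsic matrix, `rho = 0.5` places the (shifted) ray at depth 2.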
IV-B Error Model
The error model used in previous work is kept: a reprojection error function that takes into account the edge's matching uncertainty along its tangential direction. Hence, only the error along its normal direction is considered. For a generic edgepoint defined in a PKF and associated in a DKF, the measurement takes the form:
where the equation involves the respective keyframes' poses and the associated edgepoint's position and normal direction in the DKF; the scalar product yields the projected error along this direction.
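A minimal sketch of this scalar, normal-projected error (pinhole projection assumed; names are hypothetical):

```python
import numpy as np

def project(p, K):
    """Pinhole projection of a 3D point given in camera coordinates."""
    uvw = K @ p
    return uvw[:2] / uvw[2]

def normal_reprojection_error(p_in_dkf, u_obs, n_obs, K):
    """Scalar reprojection error along the associated edge's normal.

    p_in_dkf     : edgepoint already transformed into the DKF camera frame
    u_obs, n_obs : associated edgepoint position and unit normal in the DKF
    The tangential component of the error is discarded, since the match is
    ambiguous along the edge direction.
    """
    return float(np.dot(n_obs, project(p_in_dkf, K) - np.asarray(u_obs)))
```

Note that a displacement of the observation along the edge tangent leaves the error unchanged, which is exactly the invariance the model is after.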
IV-C Uncertainty and Robust Estimation
Given a model for edge position uncertainty provided by the detection method, equation 2 can be modified to account for it. Moreover, Iteratively Reweighted Least Squares (IRLS) is used in every optimization step, weighting the reprojection error to account for outliers. Overall, the residual function takes the form:
where a slightly modified version of the Huber weighting is employed.
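To illustrate the IRLS mechanism, the following toy example estimates a robust mean with the standard (unmodified) Huber weight; the paper's actual weighting is slightly different and the threshold value here is arbitrary:

```python
import numpy as np

def huber_weight(r, delta):
    """IRLS weight of the Huber loss: 1 near zero, decaying in the tails."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a

def robust_mean(samples, delta=1.0, iters=20):
    """Toy IRLS loop: Huber-weighted mean, largely insensitive to outliers."""
    samples = np.asarray(samples, dtype=float)
    x = float(np.median(samples))            # robust initial guess
    for _ in range(iters):
        w = np.array([huber_weight(s - x, delta) for s in samples])
        x = float(np.sum(w * samples) / np.sum(w))
    return x
```

With samples `[1.0, 1.1, 0.9, 100.0]` the plain mean is 25.75, while the Huber-weighted estimate stays close to the inlier cluster.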
To take into account the edgepoint's position in its defining PKF, an extra measurement must be added as a prior to constrain it; this prior is not robustly weighted:
IV-D Total Energy Functional
Having defined the error model, the optimization problem can be posed as the minimization of the reprojection errors of the edgepoints defined in each PKF with respect to their matches in DKFs:
V Edge Tracking
In the context of this work, tracking is the problem of finding the best transformation between the current and next frame, without prior data association or initial condition. For this purpose, the method presented in the authors' previous work is used, with some enhancements to make it more robust against fast motions.
In that method, an auxiliary image is created from the new frame, containing in each pixel the index of the closest edgepoint. This image is used to re-project the current frame into the new one and to find an initial match for each edgepoint. Then, the re-projection error functional in equation 2 is minimized, only with respect to the relative translation between frames. This is done iteratively, rebuilding the matches in each iteration and using the current frame's estimated depth as an input datum.
To reject some wrong associations, a threshold is applied to the difference between the gradient vectors of each edgepoint pair. Optionally, a threshold can also be applied to the intensity difference at both sides of the edge (one comparison is made on each side).
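The auxiliary closest-edgepoint image can be built in linear time with a multi-source BFS, as sketched below (a 4-connected approximation of the nearest-neighbour field; the original implementation may differ):

```python
import numpy as np
from collections import deque

def closest_edgepoint_index(shape, edgepoints):
    """Auxiliary image: each pixel holds the index of its closest edgepoint.

    shape      : (rows, cols) of the image
    edgepoints : list of integer (row, col) edgepoint positions
    Built with a multi-source BFS seeded at every edgepoint, which
    approximates the nearest-neighbour field in O(pixels).
    """
    h, w = shape
    idx = -np.ones((h, w), dtype=int)
    q = deque()
    for i, (r, c) in enumerate(edgepoints):
        idx[r, c] = i
        q.append((r, c))
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and idx[nr, nc] < 0:
                idx[nr, nc] = idx[r, c]   # inherit the seed's index
                q.append((nr, nc))
    return idx
```

Looking up a re-projected pixel in this image then yields an initial match in O(1).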
V-A Multi-Scale Tracking
The input image is decimated at different scales and an edgemap is extracted for each of them. The tracking algorithm is applied in sequence from coarse to fine scales.
No depth estimation is made for the coarsest scale; instead, depth from the main (finest) scale is recursively propagated to the coarser ones using a nearest neighbour approach. Edges that fail to find a close neighbour in the finer scale are rejected. Besides saving running time, this approach selects coarse edges that have a correspondence at finer scales. These edgepoints usually belong to better-localized object boundaries.
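A brute-force sketch of this propagation step (the real system may use a faster spatial index; the distance threshold here is an assumed parameter):

```python
import numpy as np

def propagate_depth_to_coarse(fine_pts, fine_depths, coarse_pts,
                              scale=2.0, max_dist=2.0):
    """Assign each coarse-scale edgepoint the depth of its nearest fine-scale
    neighbour, comparing positions in the coarse image frame. Coarse points
    with no neighbour within max_dist pixels are rejected.

    Returns (kept_indices, depths) for the surviving coarse edgepoints.
    """
    fine_in_coarse = np.asarray(fine_pts, dtype=float) / scale
    kept, depths = [], []
    for i, p in enumerate(np.asarray(coarse_pts, dtype=float)):
        d2 = np.sum((fine_in_coarse - p) ** 2, axis=1)
        j = int(np.argmin(d2))
        if d2[j] <= max_dist ** 2:
            kept.append(i)
            depths.append(fine_depths[j])
    return kept, depths
```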
V-B Depth Smoothing for Tracking and Association
Only for the tracking stage, each edgepoint's depth is smoothed by averaging it with its closest neighbours' depths.
By smoothing depth, a uniform warping of close edges is achieved, thus improving association in image parts with close similar edges. After the tracking stage, smoothed depth is discarded.
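A minimal sketch of this smoothing over the edge's adjacency structure (names are illustrative; the neighbourhood used by the real system may be larger):

```python
import numpy as np

def smooth_edge_depths(depths, adjacency):
    """Average each edgepoint's depth with its connected neighbours' depths.

    depths       : per-edgepoint depth values
    adjacency[i] : indices of edgepoints adjacent to i along the edge
    Used only during tracking; the smoothed values are discarded afterwards.
    """
    depths = np.asarray(depths, dtype=float)
    out = depths.copy()
    for i, nbrs in enumerate(adjacency):
        if nbrs:
            out[i] = (depths[i] + sum(depths[j] for j in nbrs)) / (1 + len(nbrs))
    return out
```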
VI Data Association
The main output of the tracking stage is the optimal transformation between the current and new frame, given the edgemap’s structure. However, associations are made from the current edgemap points to the new ones (Fig. 3a). Due to the nature of edges, these matches are not biunivocal, as two different edgepoints may be matched with the same edgepoint in the new frame and there is no guarantee that every new edgepoint will be matched (Fig. 3b).
Given how the optimization problem is posed in the back-end, it is important to maximize and refine associations from the new frame to the current one. This is done in two steps: augmentation and correction. Both steps exploit the continuous nature of edges, i.e. most edgepoints are connected to adjacent ones.
VI-A Matching Augmentation
In this step, each matched edgepoint shares its association with its neighbours. This is done by “walking” the edge, i.e. recursively going over its neighbours and assigning the same associated edgepoint until a valid association is found (Fig. 3c).
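The edge-walking propagation can be sketched as an iterative flood-fill over the adjacency lists (an assumed data layout, avoiding deep recursion):

```python
def augment_matches(matches, adjacency):
    """Propagate associations along connected edges.

    matches[i]   : index of the associated edgepoint in the new frame, -1 if none
    adjacency[i] : indices of edgepoints adjacent to i along the edge
    Unmatched edgepoints inherit a neighbour's match, walking the edge
    outward from every already-matched edgepoint.
    """
    out = list(matches)
    frontier = [i for i, m in enumerate(out) if m >= 0]
    while frontier:
        nxt = []
        for i in frontier:
            for j in adjacency[i]:
                if out[j] < 0:          # share the association with the neighbour
                    out[j] = out[i]
                    nxt.append(j)
        frontier = nxt
    return out
```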
VI-B Matching Correction
For a given edgepoint with an association in a target frame, correction means finding the best matching edgepoint in terms of the epipolar constraint between the two frames, given the pose estimated in the tracking stage. For this search, the edge is “walked” again using the connectivity information. In practice, not only the epipolar line is taken into account; the current edgepoint's reprojected position also helps in cases where the epipolar line is close to parallel with the edge or there is no motion.
This process is applied to each new frame after augmentation, and is also used in the optimization back-end when building the Jacobians. Both augmentation and correction stages are illustrated in Figure 3.
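The per-candidate scoring used in correction might look as follows; the blending weight `alpha` and the cost form are illustrative assumptions, not the paper's exact formula:

```python
import numpy as np

def correct_match(candidates, F, u_src, u_reproj, alpha=0.5):
    """Pick, among candidate edgepoints walked along the target edge, the one
    best satisfying the epipolar constraint, blended with the distance to the
    reprojected position to handle near-degenerate geometry.

    candidates : (x, y) positions of candidate edgepoints in the target frame
    F          : fundamental matrix from source to target frame
    u_src      : source edgepoint position (pixels)
    u_reproj   : source edgepoint reprojected into the target frame
    """
    l = F @ np.array([u_src[0], u_src[1], 1.0])     # epipolar line a*x + b*y + c = 0
    norm = np.hypot(l[0], l[1])
    best, best_cost = None, np.inf
    for i, (x, y) in enumerate(candidates):
        d_epi = abs(l[0] * x + l[1] * y + l[2]) / norm
        d_rep = np.hypot(x - u_reproj[0], y - u_reproj[1])
        cost = d_epi + alpha * d_rep
        if cost < best_cost:
            best, best_cost = i, cost
    return best
```

Setting `alpha = 0` recovers a pure epipolar-distance search.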
VII Variable and Keyframe Selection
In the presented system, edgemaps are only defined in certain keyframes. Even though it would be possible to propagate associations for every PKF in both directions (past and future frames) and solve a joint problem, as done in other systems, this would lead to variable and measurement repetition.
Ideally, an edgepoint representing an individual edge fragment should be defined as a variable in only one PKF; however, this is not easy to achieve due to the non-biunivocal association of edgepoints. Nevertheless, an approximation may be obtained by using the associations to disable redefined edgepoints.
An open question is in which particular PKF an edgepoint should be defined. Possible options are to anchor the edgepoint to the first or last PKF where it was observed. Even though the former has its advantages, the latter is chosen, as it assures that all currently visible edgepoints are anchored and optimized relative to the last frame. This particular choice keeps depth information ready for the tracking step and is key for defining the marginalization strategy. In fact, having associations only to the past assures that keyframe marginalization involves only previous point measurements, and is therefore closer to the optimal.
When a new frame arrives, it is added as a PKF to the back-end. Associations are back-propagated to match every edgepoint against active PKFs and DKFs. When the amount of edgepoints matched in a particular keyframe falls below a fraction of all edgepoints, the association process is stopped and a local covisibility window is defined for the PKF. At the same time, the new frame is kept as a PKF if the amount of newly observed edgepoints with respect to the last PKF exceeds a certain threshold; otherwise, it is added as a DKF.
Every association from a new PKF to the previous one disables the matched variables in the latter, thus reducing variable repetition and computational cost. Ideally, currently visible edgepoints would be active variables only in the new PKF, while the others would contain only edgepoints that are not visible in the newer ones. In practice, some variable repetition remains due to the non-biunivocal matching of edges. However, this proves beneficial to accuracy, as it maximizes the number of measurements used.
When the amount of active keyframes inside the newest covisibility window exceeds a limit (10-15), the DKF with the least translation with respect to its consecutive keyframes is discarded. PKFs are not discarded, as they define the map.
VIII Hierarchical Optimization
The presented system exploits primary and secondary sparseness through a custom-built Gauss-Newton solver designed for the posed problem's structure. In fact, the local co-visibility windows defined for each PKF suggest a natural split of the problem.
Overall, the method goes as follows. First, each individual sub-functional is linearized around the current optimization point, and edgepoint variables are marginalized using the Schur complement. The resulting priors, involving poses only, are jointly optimized using Pose-Graph Optimization (PGO) over absolute poses. Finally, edgepoints are updated by the linear increments obtained from the PGO.
VIII-A Edgepoint Marginalization
Instead of posing the problem in absolute coordinates as other systems do, a mixed absolute-relative formulation is chosen; the reasoning for this is given in section VIII-D. For a given sub-functional, residuals are expressed in terms of the relative transformation that takes edgepoints from the PKF to the corresponding keyframe:
where the relative transformation belongs to the SE3 group and is defined by the composition function:
Considering that the projection function is invariant to scale, eq. 6 can be rewritten as:
where the auxiliary quantities are defined as:
To take into account the edgepoint's position measurement in its respective PKF, a prior is added:
Throughout the optimization, analytical expressions for the Jacobians of residual 9 with respect to the involved transformations are used:
With these Jacobians, and stacking residuals and variables into vectors, it is possible to obtain a second-order approximation of the functional, which takes the known form:
whose vector of variables is defined as:
where the vector's size is given by the frame co-visibility window length; its increments are:
It is important to recall that, because each residual depends on only one pose and one edgepoint, the Hessian has a sparse structure in which several blocks are diagonal and the remaining one is block diagonal.
In addition, the edgepoint variables appear only in their own sub-functional. Thus, given an incremental update of the relative transformation, it is possible to solve for their optimal values by taking the Schur complement:
which can be substituted into equation 18 to obtain a form of the functional in which the edgepoint variables have been marginalized:
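The Schur-complement step can be sketched on a dense toy system (in the real problem the landmark block is block diagonal, so its inversion is cheap; names here are illustrative):

```python
import numpy as np

def schur_marginalize(H, b, n_pose):
    """Marginalize landmark (edgepoint) variables from a Gauss-Newton system
    H @ dx = -b, keeping only the first n_pose variables.

    Returns the reduced system (H_red, b_red) and a back-substitution
    function that recovers the optimal landmark increments from the pose
    increment.
    """
    Hpp = H[:n_pose, :n_pose]
    Hpl = H[:n_pose, n_pose:]
    Hll = H[n_pose:, n_pose:]
    bp, bl = b[:n_pose], b[n_pose:]
    Hll_inv = np.linalg.inv(Hll)            # block diagonal in the real system
    H_red = Hpp - Hpl @ Hll_inv @ Hpl.T     # Schur complement
    b_red = bp - Hpl @ Hll_inv @ bl
    def back_substitute(dx_pose):
        # Optimal landmark increments given the pose increment.
        return -Hll_inv @ (bl + Hpl.T @ dx_pose)
    return H_red, b_red, back_substitute
```

Solving the reduced system and back-substituting yields exactly the solution of the full system, which is what makes the marginalization lossless at the linearization point.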
where the information vector and matrix are defined respectively as:
In these equations, a vector stacks all transformation increments from the PKF to all keyframes in its connectivity window:
VIII-B Joint Odometry Pose Graph Optimization
Given a set of sub-functionals approximated by equation 24, the overall optimization problem takes the form:
where constant terms have been dropped, and the remaining term is the tangent-space distance from the current relative pose to the prior's linearization point.
It should be noted that, while absolute poses are iteratively re-linearized in every optimization step, sub-functional energy priors are selectively re-linearized only when needed (see sec. VIII-C). Therefore, the current relative pose may differ from the linearization point.
The Jacobians are defined with respect to an infinitesimal increment on SE3:
Both Jacobians are similar; in fact, it can be shown that one follows from the other.
Using the SE3 adjoint operator, it is possible to “shift” the increment to the left of the expression:
In this equation, the error transform appears; its twist vector is defined in terms of a translational and a rotational component:
It is interesting to note that, in the case where the sub-functional has been marginalized in the current optimization step, the error transform equals the identity and the Jacobian becomes the adjoint:
Equation 29 can be plugged into 27 to obtain a linear system of equations on the increments of the absolute poses. This is a sparse and mainly block-diagonal system (except for the connections generated by loop closures), with non-zero entries only for poses that share mutually observed landmarks.
Solutions are obtained using sparse Cholesky decomposition and, once the increments are obtained, the absolute pose linearization points are updated:
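A dense sketch of the normal-equations solve (the real system would use a sparse Cholesky factorization exploiting the block structure; names are illustrative):

```python
import numpy as np

def solve_pgo_step(H, g):
    """Solve the Gauss-Newton normal equations H @ dx = -g via Cholesky.

    H : symmetric positive-definite system matrix (pose blocks)
    g : gradient vector
    Two triangular solves replace a general matrix inversion.
    """
    L = np.linalg.cholesky(H)           # H = L @ L.T, L lower triangular
    y = np.linalg.solve(L, -g)          # forward substitution
    return np.linalg.solve(L.T, y)      # backward substitution
```

The resulting increments would then be applied to the absolute pose linearization points through the SE3 exponential map.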
VIII-C Edgepoints Update
As already mentioned, not all PKF functionals are re-linearized in every optimization step, since computing the residuals, Jacobians and Schur complements in equations 9 to 24 involves calculations over every edgepoint. Given that on average a PKF defines 5000-12000 edgepoints, these operations comprise most of the optimization algorithm's running time.
On top of this, once a PKF has been re-linearized and optimized many times and falls outside the co-visibility window, the effective updates to its variables become negligible and equation 24 becomes a good approximation to the full functional. Thus, sub-functionals are re-linearized only when they are within a certain temporal window and a significant pose change occurred during the last optimization step. This is measured by comparing the distance from the current relative transformations involving the sub-functional to its last linearization point, relative to the latter:
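This trigger can be sketched as follows; the threshold value and the exact distance measure are illustrative assumptions, and `log_map` stands for an se3 logarithm supplied by the caller:

```python
import numpy as np

def needs_relinearization(T_cur_list, T_lin_list, log_map, rel_thresh=0.1):
    """Decide whether a PKF sub-functional must be re-linearized.

    T_cur_list : current relative transforms involving the sub-functional
    T_lin_list : the corresponding linearization points (4x4 matrices)
    log_map    : maps a transform to its twist vector (se3 logarithm)
    Compares the tangent-space distance from each current transform to its
    linearization point, relative to the magnitude of the latter.
    """
    for T_cur, T_lin in zip(T_cur_list, T_lin_list):
        delta = log_map(np.linalg.inv(T_lin) @ T_cur)
        ref = log_map(T_lin)
        if np.linalg.norm(delta) > rel_thresh * max(np.linalg.norm(ref), 1e-9):
            return True
    return False
```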
For the PKFs that have been re-linearized, the edgepoint variables are updated using equations 22. Relative increments are first obtained with equation 29 and Jacobians 31 and 30, which are cached from the PGO.
VIII-D Relative vs. Absolute Coordinates in Local BA
In a full BA scheme, every PKF sub-functional would be re-linearized in each step. In this case, the relative formulation serves only as an intermediate step, yielding a solution equivalent to formulating equation 6 in absolute coordinates (due to the Jacobian chain rule). This is to be expected, as the linearization should be independent of the intermediate variables used to obtain it.
The case is substantially different when re-linearization is avoided and 24 is used as an approximation to the true functional, in contrast to one constructed using absolute coordinates.
Formulating priors in relative coordinates has interesting advantages. To begin with, the classical BA structure, where each factor involves one landmark and only one pose, is maintained, thus simplifying Schur complement and Hessian computation. More importantly, the relative formulation is a better approximation of the real functional.
A relative parametrization naturally expresses the invariance to a change in absolute pose, and the marginalized priors have a scale-invariant form due to a rank deficiency (gauge freedom). In fact, the functional will not penalize increments in the direction of the translation components of the linearization point:
This implies that the functional will not penalize translation expansions or shrinks as long as they lie along the direction pointed to by the linearization point, which after a few iterations should converge to the optimal one. This interesting property allows modeling scale drift without the use of SIM3 optimization. A working example of this idea can be seen in Figure 9.
The functional is invariant to an increment satisfying these conditions. Because the marginalized functional is the result of a minimization over the edgepoint variables, the following holds:
For equation 42 to hold for every such increment, the corresponding gradient term must vanish, implying invariance to the change.
VIII-E Pose Marginalization
Problem 27 involves poses only and is therefore much faster to solve than full BA. However, it grows linearly with time, becoming computationally expensive. To keep running time bounded, marginalization is applied to poses old enough to no longer change. A wider marginalization window is defined, such that relative PKF priors falling outside this window are fused into a single absolute prior.
This procedure is simple: the system creates an absolute position prior to fix both position and scale to their initial condition. Absolute priors involve a vector of correlated absolute poses and take the following generic form:
The absolute prior is jointly optimized with the rest of problem 27 until a PKF gets a link to a keyframe outside the marginalization window. In this case, its corresponding sub-functional is jointly optimized with the absolute position prior, updating the latter. Finally, the absolute prior is constrained using Schur marginalization to have all its measurements inside the marginalization window. Needless to say, this is a sliding window; therefore, old PKFs are incrementally merged into the absolute prior.
VIII-F Overall Algorithm
The optimization algorithm can be summarized in the following steps (performed for each iteration):
Marginalize old poses and re-build absolute prior according to section VIII-E.
Solve a PGO problem on absolute pose variables and update them.
Update landmarks using equations 22, only for the PKFs re-linearized in this iteration.
Before running the optimization, loop closure check is performed if the current frame is a PKF. If a match is found, a slightly different optimization algorithm is run (see section IX-B).
IX Loop Closures
Loop closure is a key aspect of SLAM systems, especially monocular ones, that greatly helps to keep the overall error constrained. However, it is a challenging problem on its own and is out of the scope of this paper, which focuses on edgemap association and optimization. Therefore, inspired by similar works [2, 40], well-established feature-based methods for loop candidate search and pre-alignment are used, in order to showcase SE-SLAM's capabilities apart from the loop closure problem itself. A full edge-based loop closing mechanism is left for future work.
IX-A Loop Candidates Search
In order to find possible loop candidates, ORB features are extracted only for PKFs and used to compute a binary representation with a Bag of Words (BoW) dictionary implementation, thus creating a BoW database. Every time a new PKF is created, it is compared with all previous PKFs outside its covisibility window, obtaining a similarity score. The 2 PKFs with the highest scores above the overall average are marked as potential loop candidates.
The candidate's keypoints are matched via their descriptors, and a transformation is computed by solving the PnP problem with RANSAC using OpenCV. To that end, the candidate's keypoint depths are estimated by averaging the inverse depths of neighbouring edgepoints in a window around each point. This transformation is further refined using the edge tracking method described in section V.
To check a candidate's validity, its edgepoints are reprojected onto the current frame. If enough edgepoints present low reprojection error and fall inside the current frame's FOV, the candidate is accepted.
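A minimal sketch of such a validity check follows; the error threshold and acceptance ratio are hypothetical, as the paper does not state the exact values.

```python
import numpy as np

def candidate_valid(reproj_err, in_fov, err_thresh=2.0, min_ratio=0.6):
    """Accept a loop candidate when a sufficient fraction of its
    edgepoints reproject inside the current frame with low error.
    `reproj_err`: per-edgepoint reprojection error [px];
    `in_fov`: boolean mask of points landing inside the image."""
    good = np.asarray(in_fov) & (np.asarray(reproj_err) < err_thresh)
    return bool(good.mean() >= min_ratio)

errs = np.array([0.5, 1.2, 0.8, 5.0, 1.9])
fov = np.array([True, True, True, True, False])
ok = candidate_valid(errs, fov)  # 3 of 5 edgepoints pass -> accepted
```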
Once a candidate has been aligned and associated with the current PKF, its data association is used to propagate matches from the candidate PKF to every keyframe in the current keyframe’s covisibility window. This is done simply by connecting successive matches.
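Connecting successive matches amounts to composing per-frame association maps, which can be sketched as below; the edgepoint ids are invented for illustration.

```python
def propagate_matches(cand_to_kf0, chain):
    """Propagate data association from the loop-candidate PKF to every
    keyframe in the covisibility window by composing successive match
    maps. `cand_to_kf0` maps candidate edgepoint id -> edgepoint id in
    the first keyframe; `chain` lists frame-to-frame match maps along
    the window. Matches broken anywhere in the chain are dropped."""
    out = [dict(cand_to_kf0)]
    for step in chain:
        prev = out[-1]
        out.append({c: step[p] for c, p in prev.items() if p in step})
    return out

# Hypothetical ids: candidate -> KF0, then KF0 -> KF1 and KF1 -> KF2.
m = propagate_matches({1: 10, 2: 11}, [{10: 20, 11: 21}, {20: 30}])
# Edgepoint 2 loses its track at the last step and is dropped.
```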
Once these new connections are established, the optimal transformation from the candidate PKF to each of these keyframes is obtained by minimizing the reprojection error (6) with respect to the relative pose only.
Next, a joint relative odometry prior, from the candidate to the keyframes in both the candidate's and the current covisibility windows, is built using the procedure described in section VIII-A. This prior is constructed using the optimized poses obtained in the previous step as linearization point, which gives a good hint of the true optimal point.
Once this prior is computed, all relative priors are re-activated and the pose graph is optimized completely (the marginal absolute prior is discarded). Edgepoints in the active covisibility window are also included in the optimization. The pose graph structure in this case is illustrated in Figure 5.
Usually a few iterations are required for convergence. Upon convergence, all relative priors falling outside the relative marginalization window are fused again into a single absolute marginal prior, and the system resumes the process described in the previous sections.
It is important to interpret the prior as a compressed form of the reprojection error functions from the candidate PKF's edgepoints to every keyframe it is connected to, both in the current covisibility window and in its own, as shown in Figure 5.
Given that such connections involve more than one mutual pose, not only position but also scale is constrained. When PGO is performed with all relative priors, together with the active edgepoints, the scale adapts to the one given by the candidate PKF, which is connected through its covisibility window to older measurements, as shown experimentally in Figure 9. In this setup, scale freedom is not given by SIM3 optimization  but by the rank deficiency of the joint odometry priors (section VIII-D).
X Implementation Details
X-A Edge Detection
Up to this point, little has been said about the edge extraction procedure. This reflects the fact that the presented algorithm is intended to be agnostic to the detection method. However, as edges are the algorithm's input, different methods are expected to have a significant impact on its performance.
The edge detection algorithm is based on maximal gradient on a smoothed image . Despite being simple, it yields well-localized edges, which, together with repeatability, is the most desirable property for a SLAM algorithm.
Non-maximal suppression is applied and a sub-pixel position is estimated for each edgepoint; a third-derivative threshold is also applied , as this helps discriminate continuous gradients.
Finally, the edgemap is optionally decimated by half to reduce the overall number of variables. This helps to increase performance in terms of running speed, by eliminating redundant variables, while keeping full resolution position and connectivity.
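The detection pipeline (gradient magnitude plus non-maximal suppression along the gradient direction) can be sketched as below. This toy version omits the smoothing, sub-pixel refinement, third-derivative threshold and decimation described above, and uses an invented threshold.

```python
import numpy as np

def detect_edges(img, thresh=0.1):
    """Toy gradient-based edge detector: image gradients, gradient
    magnitude, and non-maximal suppression along the dominant
    gradient axis. Border pixels are ignored for simplicity."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    edges = np.zeros_like(mag, dtype=bool)
    for y in range(1, img.shape[0] - 1):
        for x in range(1, img.shape[1] - 1):
            if mag[y, x] < thresh:
                continue
            # Compare against the two neighbours along the gradient.
            if abs(gx[y, x]) >= abs(gy[y, x]):
                n1, n2 = mag[y, x - 1], mag[y, x + 1]
            else:
                n1, n2 = mag[y - 1, x], mag[y + 1, x]
            edges[y, x] = mag[y, x] >= n1 and mag[y, x] >= n2
    return edges

# A vertical step edge should yield a thin band of edgepoints.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
e = detect_edges(img)
```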
X-B Optimization Parallelization
Most of the algorithm's computation time is spent in the edgepoint marginalization step (section VIII-A). The marginalization of each PKF prior is independent of the others, as their information is only combined in the PGO. Therefore, these calculations are parallelized using a thread pool of 8 threads.
Finally, secondary sparseness is also exploited inside each prior calculation by separating edgepoints into groups according to how far they extend into the covisibility window. The Schur complement can be applied separately to each group, adding the results in a later stage. These groups are also processed in parallel.
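The per-PKF parallelization pattern can be illustrated with a thread pool; the workload below is only a placeholder standing in for the Schur marginalization of each prior, and the results are combined after all tasks finish.

```python
from concurrent.futures import ThreadPoolExecutor

def marginalize_prior(pkf):
    """Placeholder for the per-PKF edgepoint marginalization; each
    call is independent, so the calls can run concurrently."""
    return sum(pkf)

# Hypothetical per-PKF data; map() preserves input order.
pkfs = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=8) as pool:
    priors = list(pool.map(marginalize_prior, pkfs))
```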
XI Experimental Results
The presented system was benchmarked with the EuRoC dataset, which is widely used for the evaluation of SLAM methods, with indoor sequences from the TUMVI dataset  and with the ICL-NUIM dataset . Root Mean Square Absolute Trajectory Error (ATE) is the chosen metric for evaluation, as it provides an overall benchmark of position accuracy. Due to the monocular nature of the algorithm, SIM3 alignment of the trajectory with respect to the provided ground truth and ATE calculation were performed using a public algorithm , based on the methods proposed in  and .
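This evaluation protocol can be reproduced with a standard Umeyama-style SIM3 alignment followed by RMS ATE. The sketch below is a generic re-implementation of that procedure, not the exact public script used in the paper, and the toy trajectories are synthetic.

```python
import numpy as np

def align_sim3(est, gt):
    """Umeyama alignment with scale: find s, R, t minimizing
    || gt - (s R est + t) ||^2 over all trajectory points (3xN)."""
    mu_e = est.mean(axis=1, keepdims=True)
    mu_g = gt.mean(axis=1, keepdims=True)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G @ E.T)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E * E).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """RMS ATE after SIM3 alignment of the estimate to ground truth."""
    s, R, t = align_sim3(est, gt)
    err = gt - (s * R @ est + t)
    return float(np.sqrt((err ** 2).sum(axis=0).mean()))

# Ground truth is an exact SIM3 transform of the estimate,
# so the RMS ATE after alignment should be ~0.
rng = np.random.default_rng(0)
gt = rng.standard_normal((3, 20))
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1.0]])
est = Rz @ (gt / 2.0) + np.array([[1.0], [2.0], [3.0]])
```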
The algorithm was run on a desktop computer featuring an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 64GB of RAM; however, the system's memory footprint does not exceed 2GB. The system is initialized using , engaging full optimization once a few frames have been accumulated. As in other methods , the first and last parts of the datasets, where the camera is still or shaken for IMU initialization, are stripped off, as they compromise monocular initialization.
Two versions of the method are tested, namely SESLAM-1 and SESLAM-3 which perform one and three optimization iterations per frame respectively.
XI-A Results on EuRoC Dataset
EuRoC-MAV is a well-established benchmark for state of the art SLAM systems and is the main dataset used for evaluating this algorithm. It provides two sets of realistic indoor sequences taken from a quadrotor MAV, rated by difficulty. The first set features five sequences of a large machine hall with different degrees of dynamic motion and illumination changes. The second set was taken inside two rooms equipped with a VICON tracking system; while less space is covered, these sequences are more challenging in terms of highly dynamic motions. The whole dataset is targeted at stereo visual-inertial algorithms, hence running pure monocular systems introduces additional difficulties, especially in pure camera rotation cases, where the whole edgemap is renewed without translational motion.
Table I shows ATE for the Machine Hall and Vicon Room sequences, respectively, of SE-SLAM compared against two widely known algorithms: ORB-SLAM2 , representing feature-based SLAM systems, and LDSO  for direct ones. Both methods are state of the art in monocular SLAM and were run using the configurations provided by their authors.
It can be seen that SE-SLAM presents accuracy similar to that of both methods; in particular, SESLAM-3 introduces a significant improvement only in difficult sequences, where highly dynamic movements introduce large amounts of new data per frame. The cause of failure in the most challenging sequences is loss of tracking due to strong intensity changes and large motions with image blur. The system does not recover from these cases as it lacks a re-localization mechanism, which is proposed as future work.
It is important to highlight that both ORB-SLAM2 and LDSO are the result of long incremental work in their respective fields, in contrast to edge-based SLAM. In light of SE-SLAM competitive results in these sequences, it is clear that this approach is very promising and has room for improvements that could be carried out in the future.
Top down views of resulting trajectories for sequences MH03 and V102 are shown in fig. 6(a,b).
XI-B Results on TUMVI Dataset
The TUMVI dataset is newer than the former, comprising many sets of indoor and outdoor sequences. It is a challenging dataset aimed at evaluating stereo visual-inertial algorithms over long sequences with dynamic motions and illumination changes. The Room sequences were chosen for evaluation, as they are the only ones with ground truth data for the whole sequence.
|        | SESLAM-1 | OKVIS | ROVIO | VINS |
|--------|----------|-------|-------|------|
| Room 3 | - ()     | 0.07  | 0.15  | 0.11 |
| Room 4 | - ()     | 0.03  | 0.09  | 0.04 |
Test results are shown in Table II, where SE-SLAM is compared with the results published together with the dataset , ensuring the other systems use the configurations intended by their authors. It can be seen again that, even though SE-SLAM is a monocular system, it produces competitive results even against state of the art visual-inertial SLAM systems.
The fully optimized trajectory for sequence Room1 is depicted in fig. 6(c).
XI-C Results on ICL-NUIM Dataset
Finally, SE-SLAM was also run on the Office4 sequences of the ICL-NUIM synthetic dataset, in order to compare it with the edge-based algorithm presented in . Results are shown in Table III, where the corresponding ATE values were taken from , as the authors have not released their code.
It should be noted that ICL-NUIM is a dataset targeted at RGBD SLAM and reconstruction systems. The office trajectories employed by  are short sequences with slow motions. Given that roughly 20% of each sequence is spent on initialization, the authors consider these a sub-optimal way to assess this type of SLAM system. Nevertheless, our system outperforms , except for one case where a tracking failure occurs.
XI-D Reconstruction Results
The reconstruction output is a key aspect of this method. Examples are shown in Figures 1, 7 and 8. The advantage over sparse systems in terms of the structural information provided is easy to see. Noteworthy is how the system keeps consistency in the Room2 sequence over a 2.24 min span with several loops occurring.
XI-E Running Times
Given that SE-SLAM is a semi-dense method optimizing every significant edgepoint in the images (roughly 5K to 12K extracted per frame), its computational burden is higher than that of sparse methods. Moreover, running time heavily depends on each particular scene's structure. Despite this, the system presents good performance on most datasets, thanks to its PKF selection policy, hierarchical optimization exploiting primary and secondary sparseness, and parallelization. Measured running times are reported in Table IV.
XI-F Scale Drift and Loop Closure
Finally, Figure 9 shows the estimated speed error for both the incremental and the fully optimized trajectories on the EuRoC MH01 sequence. In the incremental case the system underestimates speed as a consequence of increasing scale drift. When a loop closure occurs, PGO performs a full trajectory optimization, converging to a consistent scale.
XII Conclusions
In this paper, SE-SLAM, a novel semi-dense structured edge-based monocular SLAM system, was presented; it achieves competitive results on challenging datasets compared to other state of the art systems. The edge-based nature of the system inherently builds a semi-dense reconstruction of the environment, providing relevant structure data for further navigation algorithms and leveraging the most relevant image information even in textureless environments. An edge-based system is not trivial to build, mainly because edges are not easy to associate, track and optimize over time, as they lack descriptors and do not present biunivocal correspondence, unlike point features. These issues were tackled by developing methods that adapt SLAM theory to the nature of edges, including: an approach to match edges between frames, and between relatively distant keyframes, in a consistent manner; a parametrization, residual and variable selection strategy that exploits the nature of edges to make the optimization problem tractable, avoiding its rapid growth in size; and a non-linear optimization scheme that better adapts to the edge optimization problem.
Finally, SE-SLAM was benchmarked on the widely used EuRoC-MAV, the newer TUMVI and the ICL-NUIM datasets. As shown in the results section, the system achieves precision comparable to state of the art feature-based and dense/semi-dense systems, while attaining real-time operation in several cases and using edges as the sole input in all stages but loop closure.
Proposed future work includes extending the use of edges to the loop closure stage to make the system completely edge-based, trying edge sparsification and interpolation to reduce processing time, and exploring different detectors and association strategies. Developing a re-localization stage is also proposed, so as to recover from tracking failures.
In light of SE-SLAM competitive results, it is clear that the edge-based approach to SLAM is very promising and has room for improvements to reach its full maturity. To encourage such developments, SE-SLAM source code will be soon released as open source.
-  R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
-  R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
-  S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
-  M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
-  C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
-  T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
-  R. Gomez-Ojeda, F. A. Moreno, D. Scaramuzza, and J. G. Jiménez, “PL-SLAM: a stereo SLAM system through the combination of points and line segments,” CoRR, vol. abs/1705.09479, 2017.
-  A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, “Pl-slam: Real-time monocular visual slam with points and lines,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 4503–4508, IEEE, 2017.
-  R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and mapping in real-time,” in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2320–2327, IEEE, 2011.
-  C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct monocular visual odometry,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 15–22, IEEE, 2014.
-  J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2018.
-  P. Smith, I. D. Reid, and A. J. Davison, “Real-time monocular slam with straight lines,” 2006.
-  S. Maity, A. Saha, and B. Bhowmick, “Edge slam: Edge points based monocular visual slam,” in Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops, ICCVW’17, pp. 2408–2417, 2017.
-  J. Jose Tarrio and S. Pedre, “Realtime edge-based visual odometry for a monocular camera,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 702–710, 2015.
-  J. J. Tarrio and S. Pedre, “Realtime edge based visual inertial odometry for mav teleoperation in indoor environments,” Journal of Intelligent & Robotic Systems, vol. 90, no. 1-2, pp. 235–252, 2018.
-  L. von Stumberg, V. Usenko, J. Engel, J. Stückler, and D. Cremers, “From monocular slam to autonomous drone exploration,” in 2017 European Conference on Mobile Robots (ECMR), pp. 1–8, IEEE, 2017.
-  R. Mur-Artal and J. D. Tardos, “Probabilistic semi-dense mapping from highly accurate feature-based monocular slam,” in Robotics: Science and Systems, 2015.
-  J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision, pp. 834–849, Springer, 2014.
-  G. Zhang and I. H. Suh, “Building a partial 3d line-based map using a monocular slam,” 2011 IEEE International Conference on Robotics and Automation, pp. 1497–1502, 2011.
-  H. Zhou, D. Zou, L. Pei, R. Ying, P. Liu, and W. Yu, “Structslam: Visual slam with building structure lines,” IEEE Transactions on Vehicular Technology, vol. 64, pp. 1364–1375, April 2015.
-  G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, “Building a 3-d line-based map using stereo slam,” IEEE Transactions on Robotics, vol. 31, pp. 1364–1377, Dec 2015.
-  M. Tomono, “Line-based 3d mapping from edge-points using a stereo camera,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3728–3734, May 2014.
-  S. Yang and S. Scherer, “Direct monocular odometry using points and lines,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3871–3877, May 2017.
-  R. Gomez-Ojeda, J. Briales, and J. Gonzalez-Jimenez, “Pl-svo: Semi-direct monocular visual odometry by combining points and line segments,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4211–4216, Oct 2016.
-  R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: A fast line segment detector with a false detection control,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 722–732, 2010.
-  L. Zhang and R. Koch, “An efficient and robust line segment matching approach based on lbd descriptor and pairwise geometric consistency,” J. Visual Communication and Image Representation, vol. 24, pp. 794–805, 2013.
-  E. Eade and T. Drummond, “Edge landmarks in monocular slam,” in Proceedings of the 2006 British Machine Vision Conference, BMVC’06, (Washington, DC, USA), IEEE Computer Society, 2006.
-  G. Klein and D. Murray, “Improving the agility of keyframe-based slam,” in European Conference on Computer Vision, pp. 802–815, Springer, 2008.
-  G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, ISMAR ’07, (Washington, DC, USA), pp. 1–10, IEEE Computer Society, 2007.
-  J. Civera, A. J. Davison, and J. M. Montiel, “Inverse depth parametrization for monocular slam,” IEEE transactions on robotics, vol. 24, no. 5, pp. 932–945, 2008.
-  B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms, pp. 298–372, Springer, 1999.
-  J. L. Blanco, “A tutorial on se(3) transformation parameterizations and on-manifold optimization,” 09 2010.
-  T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
-  P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
-  M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental smoothing and mapping,” IEEE Transactions on Robotics, vol. 24, no. 6, pp. 1365–1378, 2008.
-  M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “isam2: Incremental smoothing and mapping using the bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
-  E. Eade.
-  G. Sibley, “Relative bundle adjustment,” Department of Engineering Science, Oxford University, Tech. Rep, vol. 2307, no. 09, 2009.
-  H. Strasdat, J. Montiel, and A. J. Davison, “Scale drift-aware large scale monocular slam,” Robotics: Science and Systems VI, vol. 2, no. 3, p. 7, 2010.
-  X. Gao, R. Wang, N. Demmel, and D. Cremers, “Ldso: Direct sparse odometry with loop closure,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2198–2204, IEEE, 2018.
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, 2011.
-  D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
-  J. Canny, “A computational approach to edge detection,” in Readings in computer vision, pp. 184–203, Elsevier, 1987.
-  T. Lindeberg, “Principles for automatic scale selection,” 1999.
-  M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
-  D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stückler, and D. Cremers, “The tum vi benchmark for evaluating visual-inertial odometry,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1680–1687, IEEE, 2018.
-  A. Handa, T. Whelan, J. McDonald, and A. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in IEEE Intl. Conf. on Robotics and Automation, ICRA, (Hong Kong, China), May 2014.
-  R. Mur-Artal.
-  J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, D. Cremers, R. Siegwart, and W. Burgard, “Towards a benchmark for rgb-d slam evaluation,” in RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf.(RSS), 2011.
-  S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 4, pp. 376–380, 1991.