Direct Sparse Odometry
Abstract
We propose a novel direct sparse visual odometry formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including geometry – represented as inverse depth in a reference frame – and camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead sampling pixels evenly throughout the images. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from across all image regions that have intensity gradient, including edges or smooth intensity variations on mostly white walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and nonlinear response functions. We thoroughly evaluate our method on three different datasets comprising several hours of video. The experiments show that the presented approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings, both in terms of tracking accuracy and robustness.
1 Introduction
Simultaneous localization and mapping (SLAM) and visual odometry (VO) are fundamental building blocks for many emerging technologies – from autonomous cars and UAVs to virtual and augmented reality. Real-time methods for SLAM and VO have made significant progress in recent years. While for a long time the field was dominated by feature-based (indirect) methods, in recent years a number of different approaches have gained in popularity, namely direct and dense formulations.
Direct vs. Indirect.
Underlying all formulations is a probabilistic model that takes noisy measurements $Y$ as input and computes an estimator $X$ for the unknown, hidden model parameters (3D world model and camera motion). Typically a Maximum Likelihood approach is used, which finds the model parameters that maximize the probability of obtaining the actual measurements, i.e., $X^* := \operatorname{argmax}_X P(Y \mid X)$.
Indirect methods then proceed in two steps. First, the raw sensor measurements are preprocessed to generate an intermediate representation, solving part of the overall problem, such as establishing correspondences. Second, the computed intermediate values are interpreted as noisy measurements in a probabilistic model to estimate geometry and camera motion. Note that the first step is typically approached by extracting and matching a sparse set of keypoints – however other options exist, like establishing correspondences in the form of dense, regularized optical flow. It can also include methods that extract and match parametric representations of other geometric primitives, such as line or curve segments.
Direct methods skip the preprocessing step and directly use the actual sensor values – light received from a certain direction over a certain time period – as measurements in a probabilistic model.
In the case of passive vision, the direct approach thus optimizes a photometric error, since the sensor provides photometric measurements. Indirect methods on the other hand optimize a geometric error, since the precomputed values – point positions or flow vectors – are geometric quantities. Note that for other sensor modalities like depth cameras or laser scanners (which directly measure geometric quantities) direct formulations may also optimize a geometric error.
Dense vs. Sparse.
Sparse methods use and reconstruct only a selected set of independent points (traditionally corners), whereas dense methods attempt to use and reconstruct all pixels in the 2D image domain. Intermediate approaches (semi-dense) refrain from reconstructing the complete surface, but still aim at using and reconstructing a (largely connected and well-constrained) subset.
Apart from the extent of the used image region, however, a more fundamental – and consequential – difference lies in the addition of a geometry prior. In the sparse formulation, there is no notion of neighborhood, and geometry parameters (keypoint positions) are conditionally independent given the camera poses and intrinsics¹. Dense (or semi-dense) approaches on the other hand exploit the connectedness of the used image region to formulate a geometry prior, typically favouring smoothness. In fact, such a prior is necessarily required to make a dense world model observable from passive vision alone. In general, this prior is formulated directly in the form of an additional log-likelihood energy term [26, 21, 22].

¹Note that even though early filtering-based methods [12, 4] kept track of point-point correlations, these originated from marginalized camera poses, not from the model itself.
Note that the distinction between dense and sparse is not synonymous with that between direct and indirect – in fact, all four combinations exist:

Sparse + Direct: This is the formulation proposed in this paper. It optimizes a photometric error defined directly on the images, without incorporating a geometric prior. While we are not aware of any recent work using this formulation, a sparse and direct formulation was already proposed by Jin et al. in 2003 [13]. In contrast to their work however, which is based on an extended Kalman filter, our method uses a nonlinear optimization framework. The motivation for exploring the combination of sparse and direct is laid out in the following section.
1.1 Motivation
The direct and sparse formulation for monocular visual odometry proposed in this paper is motivated by the following considerations.
(1) Direct: One of the main benefits of keypoints is their ability to provide robustness to photometric and geometric distortions present in images taken with off-the-shelf commodity cameras. Examples are automatic exposure changes, nonlinear response functions (gamma correction / white balancing), lens attenuation (vignetting), debayering artefacts, or even strong geometric distortions caused by a rolling shutter.
At the same time, for all use cases mentioned in the introduction, millions of devices will be (and already are) equipped with cameras solely meant to provide data for computer vision algorithms, instead of capturing images for human consumption. These cameras should and will be designed to provide a complete sensor model, and to capture data in a way that best serves the processing algorithms: Auto-exposure and gamma correction for instance are not unknown noise sources, but features that provide better image data – and that can be incorporated into the model, making the obtained data more informative. Since the direct approach models the full image formation process down to pixel intensities, it greatly benefits from a more precise sensor model.
One of the main benefits of a direct formulation is that it does not require a point to be recognizable by itself, thereby allowing for a more finely grained geometry representation (pixel-wise inverse depth). Furthermore, we can sample from across all available data – including edges and weak intensity variations – generating a more complete model and lending more robustness in sparsely textured environments.
(2) Sparse: The main drawback of adding a geometry prior is the introduction of correlations between geometry parameters, which render a statistically consistent, joint optimization in real time infeasible (see Figure 2). This is why existing dense or semidense approaches (a) neglect or coarsely approximate correlations between geometry parameters (orange), and / or between geometry parameters and camera poses (green), and (b) employ different optimization methods for the dense geometry part, such as a primaldual formulation [26, 21, 22].
In addition, the expressive complexity of today's priors is limited: While they make the 3D reconstruction denser, locally more accurate, and more visually appealing, we found that priors can introduce a bias, and thereby reduce rather than increase long-term, large-scale accuracy. Note that in time this may well change with the introduction of more realistic, unbiased priors learnt from real-world data.
1.2 Contribution and Outline
In this paper we propose a sparse and direct approach to monocular visual odometry. To our knowledge, it is the only fully direct method that jointly optimizes the full likelihood for all involved model parameters, including camera poses, camera intrinsics, and geometry parameters (inverse depth values). This is in contrast to hybrid approaches such as SVO [9], which revert to an indirect formulation for joint model optimization.
Optimization is performed in a sliding window, where old camera poses as well as points that leave the field of view of the camera are marginalized, in a manner inspired by [17]. In contrast to existing approaches, our method further takes full advantage of photometric camera calibration, including lens attenuation, gamma correction, and known exposure times. This integrated photometric calibration further increases accuracy and robustness.
Our CPU-based implementation runs in real time on a laptop computer. We show in extensive evaluations on three different datasets comprising several hours of video that it outperforms other state-of-the-art approaches (direct and indirect), both in terms of robustness and accuracy. With reduced settings (fewer points and active keyframes), it even runs at real-time speed while still outperforming state-of-the-art indirect methods. On high, non-real-time settings in turn (more points and active keyframes), it creates semi-dense models similar in density to those of LSD-SLAM, but much more accurate.
The paper is organized as follows: The proposed direct, sparse model as well as the windowed optimization method are described in Section 2. Specifically, this comprises the geometric and photometric camera calibration in Section 2.1, the model formulation in Section 2.2, and the windowed optimization in Section 2.3. Section 3 describes the front-end: the part of the algorithm that performs data selection and provides sufficiently accurate initializations for the highly non-convex optimization back-end. We provide a thorough experimental comparison to other methods in Section 4.1. We also evaluate the effect of important parameters and new concepts like the use of photometric calibration in Section 4.2. In Section 4.3, we analyse the effect of photometric and geometric noise added to the data. Finally, we provide a summary in Section 5.
2 Direct Sparse Model
Our direct sparse odometry is based on continuous optimization
of the photometric error over a window of recent frames, taking into account a photometrically calibrated
model for image formation. In contrast to existing direct methods,
we jointly optimize for all involved parameters (camera intrinsics, camera extrinsics, and inverse depth values),
effectively performing the photometric equivalent of windowed sparse bundle adjustment.
We keep the geometry representation employed by other direct approaches, i.e.,
3D points are represented as inverse depth in a reference frame (and thus have one degree of freedom).
Notation.
Throughout the paper, bold lowercase letters ($\mathbf{x}$) represent vectors and bold uppercase letters ($\mathbf{H}$) represent matrices. Scalars will be represented by light lowercase letters ($t$), functions (including images) by light uppercase letters ($I$). Camera poses are represented as transformation matrices $\mathbf{T}_i \in \mathrm{SE}(3)$, transforming a point from the world frame into the camera frame. Linearized pose increments will be expressed as Lie-algebra elements $\hat{\mathbf{x}}_i \in \mathfrak{se}(3)$, which – with a slight abuse of notation – we directly write as vectors $\mathbf{x}_i \in \mathbb{R}^6$. We further define the commonly used operator $\boxplus \colon \mathfrak{se}(3) \times \mathrm{SE}(3) \to \mathrm{SE}(3)$ using a left-multiplicative formulation, i.e.,

$\mathbf{x}_i \boxplus \mathbf{T}_i := e^{\hat{\mathbf{x}}_i} \cdot \mathbf{T}_i.$ (1)
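For concreteness, the left-multiplicative increment of Eq. (1) can be sketched in a few lines. This is a minimal sketch using the standard closed-form exponential map on SE(3); all function names are ours, not from a released implementation.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector (the ^ operator on so(3))."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(x):
    """Closed-form exponential map se(3) -> SE(3).
    x = (v, w): translational and rotational parts of the twist."""
    v, w = np.asarray(x[:3], float), np.asarray(x[3:], float)
    th = np.linalg.norm(w)
    W = hat(w)
    if th < 1e-10:
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W       # first-order fallback
    else:
        A = np.sin(th) / th
        B = (1.0 - np.cos(th)) / th**2
        C = (th - np.sin(th)) / th**3
        R = np.eye(3) + A * W + B * (W @ W)             # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def boxplus(x, T):
    """Left-multiplicative increment of Eq. (1): x [+] T := exp(x^) . T."""
    return se3_exp(x) @ T
```

A zero twist leaves the pose unchanged, and any increment keeps the rotation block orthonormal.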
2.1 Calibration
The direct approach comprehensively models the image formation process. In addition to a geometric camera model – which comprises the function that projects a 3D point onto the 2D image – it is hence beneficial to also consider a photometric camera model, which comprises the function that maps realworld energy received by a pixel on the sensor (irradiance) to the respective intensity value. Note that for indirect methods this is of little benefit and hence widely ignored, as common feature extractors and descriptors are invariant (or highly robust) to photometric variations.
2.1.1 Geometric Camera Calibration
For simplicity, we formulate our method for the well-known pinhole camera model – radial distortion is removed in a preprocessing step. While for wide-angle cameras this does reduce the field of view, it allows comparison across methods that only implement a limited choice of camera models. Throughout this paper, we will denote projection by $\Pi_{\mathbf{c}} \colon \mathbb{R}^3 \to \Omega$ and back-projection with $\Pi_{\mathbf{c}}^{-1} \colon \Omega \times \mathbb{R} \to \mathbb{R}^3$, where $\mathbf{c}$ denotes the intrinsic camera parameters (for the pinhole model these are the focal length and the principal point). Note that analogously to [2], our approach can be extended to other (invertible) camera models, although this does increase computational demands.
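A minimal sketch of these two mappings for the pinhole model, assuming intrinsics $\mathbf{c} = (f_x, f_y, c_x, c_y)$ and the inverse-depth convention used throughout the paper (function names are ours):

```python
import numpy as np

def project(P, c):
    """Pinhole projection Pi_c: 3D point in the camera frame -> pixel.
    c = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = c
    X, Y, Z = P
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def backproject(p, d, c):
    """Inverse projection Pi_c^{-1}: pixel p with inverse depth d -> 3D point."""
    fx, fy, cx, cy = c
    u, v = p
    return np.array([(u - cx) / fx, (v - cy) / fy, 1.0]) / d
```

Back-projecting a pixel with inverse depth $d$ and projecting it again is the identity, which is the round-trip used when warping points between frames.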
2.1.2 Photometric Camera Calibration
We use the image formation model used in [8], which accounts for a nonlinear response function $G \colon \mathbb{R} \to [0, 255]$, as well as lens attenuation (vignetting) $V \colon \Omega \to [0, 1]$. Figure 3 shows an example calibration from the TUM monoVO dataset. The combined model is then given by

$I_i(\mathbf{x}) = G\big(t_i\, V(\mathbf{x})\, B_i(\mathbf{x})\big),$ (2)

where $B_i$ and $I_i$ are the irradiance and the observed pixel intensity in frame $i$, and $t_i$ is the exposure time. The model is applied by photometrically correcting each video frame as the very first step, by computing

$I_i'(\mathbf{x}) := t_i B_i(\mathbf{x}) = \frac{G^{-1}(I_i(\mathbf{x}))}{V(\mathbf{x})}.$ (3)

In the remainder of this paper, $I_i$ will always refer to the photometrically corrected image $I_i'$, except where otherwise stated.
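The correction in Eq. (3) amounts to an inverse-response lookup followed by a per-pixel division by the vignette. A minimal sketch, assuming the inverse response $G^{-1}$ is given as a 256-entry lookup table (names are ours):

```python
import numpy as np

def photometric_correct(I, G_inv_lut, V):
    """Apply Eq. (3): I'(x) = G^{-1}(I(x)) / V(x), removing the nonlinear
    response and the vignette. G_inv_lut is a 256-entry lookup table for the
    inverse response; V is the per-pixel attenuation map in (0, 1]."""
    return G_inv_lut[I] / V
```

With an identity response and a uniform 50% vignette, the corrected image is simply the input doubled, which makes the convention easy to sanity-check.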
2.2 Model Formulation
We define the photometric error of a point $\mathbf{p} \in \Omega_i$ in reference frame $I_i$, observed in a target frame $I_j$, as the weighted SSD over a small neighborhood of pixels. Our experiments have shown that 8 pixels, arranged in a slightly spread pattern (see Figure 4), give a good trade-off between computations required for evaluation, robustness to motion blur, and providing sufficient information. Note that in terms of the contained information, evaluating the SSD over such a small neighborhood of pixels is similar to adding first- and second-order irradiance derivative constancy terms (in addition to irradiance constancy) for the central pixel. Let
$E_{\mathbf{p}j} := \sum_{\mathbf{p} \in \mathcal{N}_{\mathbf{p}}} w_{\mathbf{p}} \left\| (I_j[\mathbf{p}'] - b_j) - \frac{t_j e^{a_j}}{t_i e^{a_i}} (I_i[\mathbf{p}] - b_i) \right\|_{\gamma},$ (4)
where $\mathcal{N}_{\mathbf{p}}$ is the set of pixels included in the SSD; $t_i$, $t_j$ the exposure times of the images $I_i$, $I_j$; and $\| \cdot \|_{\gamma}$ the Huber norm. Further, $\mathbf{p}'$ stands for the projected point position of $\mathbf{p}$ with inverse depth $d_{\mathbf{p}}$, given by
$\mathbf{p}' = \Pi_{\mathbf{c}}\big(\mathbf{R}\, \Pi_{\mathbf{c}}^{-1}(\mathbf{p}, d_{\mathbf{p}}) + \mathbf{t}\big),$ (5)

with

$\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} := \mathbf{T}_j \mathbf{T}_i^{-1}.$ (6)
In order to allow our method to operate on sequences without known exposure times, we include an additional affine brightness transfer function given by $e^{-a_i}(I_i - b_i)$. Note that in contrast to most previous formulations [13, 6], the scalar factor $e^{-a_i}$ is parametrized logarithmically. This both prevents it from becoming negative, and avoids numerical issues arising from multiplicative (i.e., exponentially increasing) drift.
In addition to using robust Huber penalties, we apply a gradient-dependent weighting given by

$w_{\mathbf{p}} := \frac{c^2}{c^2 + \|\nabla I_i(\mathbf{p})\|_2^2},$ (7)
which down-weights pixels with high gradient. This weighting function can be probabilistically interpreted as adding small, independent geometric noise on the projected point position $\mathbf{p}'$, and immediately marginalizing it – approximating small geometric error. To summarize, the error depends on the following variables: (1) the point's inverse depth $d_{\mathbf{p}}$, (2) the camera intrinsics $\mathbf{c}$, (3) the poses of the involved frames $\mathbf{T}_i$, $\mathbf{T}_j$, and (4) their brightness transfer function parameters $a_i$, $b_i$, $a_j$, $b_j$.
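The ingredients of Eqs. (4) and (7) can be sketched as follows. This is a simplified single-point version: the projected pattern positions are assumed given, sub-pixel interpolation is omitted, and the numeric constants are illustrative assumptions, not the paper's values.

```python
import numpy as np

def huber(r, k):
    """Huber norm of a scalar residual: quadratic near zero, linear in the tails."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * (a - 0.5 * k)

def grad_weight(gx, gy, c=50.0):
    """Gradient-dependent weight of Eq. (7), down-weighting high-gradient pixels.
    The constant c is a tuning parameter (the value here is an assumption)."""
    return c * c / (c * c + gx * gx + gy * gy)

def point_residual(Ii, Ij, pattern_i, pattern_j, ti, tj, ai, bi, aj, bj, k=9.0):
    """Photometric error of one point (Eq. 4): weighted Huber SSD over the pixel
    pattern, with exposure times t and affine brightness parameters (a, b)."""
    s = (tj * np.exp(aj)) / (ti * np.exp(ai))   # relative brightness scale
    E = 0.0
    for (ui, vi), (uj, vj) in zip(pattern_i, pattern_j):
        r = (Ij[vj, uj] - bj) - s * (Ii[vi, ui] - bi)
        gx = 0.5 * (Ii[vi, ui + 1] - Ii[vi, ui - 1])  # central differences
        gy = 0.5 * (Ii[vi + 1, ui] - Ii[vi - 1, ui])
        E += grad_weight(gx, gy) * huber(r, k)
    return E
```

For identical frames with equal exposure and zero brightness parameters, the error vanishes, which is the expected fixed point of the model.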
The full photometric error over all frames and points is given by
$E_{\text{photo}} := \sum_{i \in \mathcal{F}} \sum_{\mathbf{p} \in \mathcal{P}_i} \sum_{j \in \operatorname{obs}(\mathbf{p})} E_{\mathbf{p}j},$ (8)
where $i$ runs over all frames $\mathcal{F}$, $\mathbf{p}$ over all points $\mathcal{P}_i$ in frame $i$, and $j$ over all frames $\operatorname{obs}(\mathbf{p})$ in which the point $\mathbf{p}$ is visible. Figure 5 shows the resulting factor graph: The only difference to the classical reprojection error is the additional dependency of each residual on the pose of the host frame, i.e., each term depends on two frames instead of only one. While this adds off-diagonal entries to the pose-pose block of the Hessian, it does not affect the sparsity pattern after application of the Schur complement to marginalize point parameters. The resulting system can thus be solved analogously to the indirect formulation. Note that the Jacobians with respect to the two frames' poses are linearly related by the adjoint of their relative pose. In practice, this factor can then be pulled out of the sum when computing the Hessian or its Schur complement, greatly reducing the additional computations caused by more variable dependencies.
If exposure times are known, we further add a prior pulling the affine brightness transfer function to zero:
$E_{\text{prior}} := \sum_{i \in \mathcal{F}} \big( \lambda_a a_i^2 + \lambda_b b_i^2 \big).$ (9)
If no photometric calibration is available, we set $t_i = 1$ and $\lambda_a = \lambda_b = 0$, as in this case the affine parameters need to model the (unknown) changing exposure time of the camera. As a side note it should be mentioned that the ML estimator for a multiplicative factor is biased if both involved quantities contain noisy measurements (see [7]), causing $a_i$ to drift in the unconstrained case $\lambda_a = 0$. While this generally has little effect on the estimated poses, it may introduce a bias if the scene contains only few, weak intensity variations.
Point Dimensionality.
In the proposed direct model, a point is parametrized by only one parameter (the inverse depth in the reference frame), in contrast to three unknowns as in the indirect model. To understand the reason for this difference, we first note that in both cases a 3D point is in fact an arbitrarily located discrete sample on a continuous, real-world 3D surface. The difference then lies in the way this 2D location on the surface is defined. In the indirect approach, it is implicitly defined as the point which, projected into an image, generates a maximum in the used corner response function. This entails that both the surface, as well as the point's location on the surface, are unknowns and need to be estimated. In our direct formulation, a point is simply defined as the point where the source pixel's ray hits the surface; thus only one unknown remains. In addition to a reduced number of parameters, this naturally enables an inverse depth parametrization, which – in a Gaussian framework – is better suited to represent uncertainty from stereo-based depth estimation, in particular for far-away points [3].
Consistency.
Strictly speaking, the proposed direct sparse model allows some observations (pixel values) to be used multiple times, while others are not used at all. This is because – even though our point selection strategy attempts to avoid this by equally distributing points in space (see Section 3.2) – we allow point observations to overlap, and thus to depend on the same pixel value(s). This particularly happens in scenes with little texture, where all points have to be chosen from a small subset of textured image regions. We however argue that this has a negligible effect in practice, and – if desired – it can be avoided by removing (or down-weighting) observations that use the same pixel value.
2.3 Windowed Optimization
We follow the approach by Leutenegger et al. [17] and optimize the total error (8) in a sliding window using the Gauss-Newton algorithm, which gives a good trade-off between speed and flexibility.
For ease of notation, we extend the $\boxplus$ operator as defined in (1) to all optimized parameters – for parameters other than SE(3) poses it denotes conventional addition. We will use $\boldsymbol{\zeta}$ to denote all optimized variables, including camera poses, affine brightness parameters, inverse depth values, and camera intrinsics. As in [17], marginalizing a residual that depends on a parameter in $\boldsymbol{\zeta}$ will fix the tangent space in which any future information (delta updates) on that parameter is accumulated. We will denote the evaluation point for this tangent space with $\boldsymbol{\zeta}_0$, and the accumulated delta updates by $\mathbf{x}$. The current state estimate is hence given by $\boldsymbol{\zeta} = \mathbf{x} \boxplus \boldsymbol{\zeta}_0$. Figure 6 visualizes the relation between the different variables.
GaussNewton Optimization.
We compute the Gauss-Newton system as

$\mathbf{H} = \mathbf{J}^T \mathbf{W} \mathbf{J} \quad \text{and} \quad \mathbf{b} = -\mathbf{J}^T \mathbf{W} \mathbf{r},$ (10)

where $\mathbf{W}$ is the diagonal matrix containing the weights, $\mathbf{r}$ is the stacked residual vector, and $\mathbf{J}$ is the Jacobian of $\mathbf{r}$.
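One step of this scheme can be sketched for a generic weighted least-squares problem; the stacking of per-residual rows into $\mathbf{J}$, $\mathbf{W}$, $\mathbf{r}$ is assumed already done.

```python
import numpy as np

def gauss_newton_step(J, W, r):
    """One Gauss-Newton step for a weighted least-squares problem (Eq. 10):
    H = J^T W J, b = -J^T W r, and the increment delta = H^{-1} b."""
    H = J.T @ W @ J
    b = -J.T @ W @ r
    return np.linalg.solve(H, b)
```

For a problem that is actually linear in its parameters, a single step lands exactly on the least-squares optimum.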
Note that each point contributes $|\mathcal{N}_{\mathbf{p}}| = 8$ residuals to the energy. For notational simplicity, we will in the following consider only a single residual $r_k$, and the associated row of the Jacobian $\mathbf{J}_k$. During optimization – as well as when marginalizing – residuals are always evaluated at the current state estimate, i.e.,
$r_k = r_k(\mathbf{x} \boxplus \boldsymbol{\zeta}_0) = \big(I_j[\mathbf{p}'(\mathbf{T}_i, \mathbf{T}_j, d, \mathbf{c})] - b_j\big) - \frac{t_j e^{a_j}}{t_i e^{a_i}} \big(I_i[\mathbf{p}] - b_i\big),$ (11)

where $(\mathbf{T}_i, \mathbf{T}_j, d, \mathbf{c}, a_i, a_j, b_i, b_j) := \mathbf{x} \boxplus \boldsymbol{\zeta}_0$ are the current state variables the residual depends on. The Jacobian is evaluated with respect to an additive increment to $\mathbf{x}$, i.e.,
$\mathbf{J}_k = \frac{\partial r_k\big((\boldsymbol{\delta} + \mathbf{x}) \boxplus \boldsymbol{\zeta}_0\big)}{\partial \boldsymbol{\delta}}.$ (12)
It can be decomposed as
$\mathbf{J}_k = \left[ \frac{\partial I_j}{\partial \mathbf{p}'} \frac{\partial \mathbf{p}'}{\partial \boldsymbol{\delta}_{\text{geo}}},\; \frac{\partial r_k}{\partial \boldsymbol{\delta}_{\text{photo}}} \right] =: \left[ \mathbf{J}_I \mathbf{J}_{\text{geo}},\; \mathbf{J}_{\text{photo}} \right],$ (13)

where $\boldsymbol{\delta}_{\text{geo}}$ denotes the “geometric” parameters $(\mathbf{T}_i, \mathbf{T}_j, d, \mathbf{c})$, and $\boldsymbol{\delta}_{\text{photo}}$ denotes the “photometric” parameters $(a_i, a_j, b_i, b_j)$. We employ two approximations, described below.
First, both $\mathbf{J}_{\text{geo}}$ and $\mathbf{J}_{\text{photo}}$ are evaluated at $\mathbf{x} = \mathbf{0}$. This technique is called “First Estimate Jacobians” [17, 11], and is required to maintain consistency of the system and prevent the accumulation of spurious information. In particular, in the presence of nonlinear null-spaces in the energy (in our formulation absolute pose and scale), adding linearizations around different evaluation points eliminates these and thus slowly corrupts the system. In practice, this approximation is very good, since $\mathbf{J}_{\text{geo}}$ and $\mathbf{J}_{\text{photo}}$ are smooth compared to the size of the increment $\boldsymbol{\delta}$. In contrast, $\mathbf{J}_I$ is much less smooth, but does not affect the null-spaces. Thus, it is evaluated at the current value for $\mathbf{x}$, i.e., at the same point as the residual $r_k$. We use centred differences to compute the image derivatives at integer positions, which are then bilinearly interpolated.
Second, $\mathbf{J}_{\text{geo}}$ is assumed to be the same for all residuals belonging to the same point, and is evaluated only for the center pixel. Again, this approximation is very good in practice. While it significantly reduces the required computations, we have not observed a notable effect on accuracy for any of the used datasets.
From the resulting linear system, an increment is computed as $\boldsymbol{\delta} = \mathbf{H}^{-1} \mathbf{b}$ and added to the current state:

$\mathbf{x} \leftarrow \boldsymbol{\delta} + \mathbf{x}.$ (14)
Note that due to the First Estimate Jacobian approximation, a multiplicative formulation (replacing the additive increment $\boldsymbol{\delta} + \mathbf{x}$ with $\boldsymbol{\delta} \boxplus \mathbf{x}$ in (12)) results in the exact same Jacobian; thus a multiplicative update step is equally valid.
After each update step, we update $\boldsymbol{\zeta}_0$ for all variables that are not part of the marginalization term, using $\boldsymbol{\zeta}_0 \leftarrow \mathbf{x} \boxplus \boldsymbol{\zeta}_0$ and $\mathbf{x} \leftarrow \mathbf{0}$. In practice, this includes all depth values, as well as the pose of the newest keyframe. Each time a new keyframe is added, we perform up to 6 Gauss-Newton iterations, breaking early if $\boldsymbol{\delta}$ is sufficiently small. We found that – since we never start far away from the minimum – Levenberg-Marquardt dampening (which slows down convergence) is not required.
Marginalization.
When the active set of variables becomes too large, old variables are removed by marginalization using the Schur complement. Similar to [17], we drop any residual terms that would affect the sparsity pattern of $\mathbf{H}$: When marginalizing frame $i$, we first marginalize all points in $\mathcal{P}_i$, as well as points that have not been observed in the last two keyframes. Remaining observations of active points in frame $i$ are dropped from the system.
Marginalization proceeds as follows: Let $E'$ denote the part of the energy containing all residuals that depend on state variables to be marginalized. We first compute a Gauss-Newton approximation of $E'$ around the current state estimate $\boldsymbol{\zeta} = \mathbf{x} \boxplus \boldsymbol{\zeta}_0$. This gives

$E'(\mathbf{x} \boxplus \boldsymbol{\zeta}_0) \approx c - 2 (\mathbf{x} - \mathbf{x}_0)^T \mathbf{b} + (\mathbf{x} - \mathbf{x}_0)^T \mathbf{H} (\mathbf{x} - \mathbf{x}_0),$ (15)

where $\mathbf{x}_0$ denotes the current value (evaluation point for $\mathbf{r}$) of $\mathbf{x}$. The constant $c$ can be dropped, and $\mathbf{H}$, $\mathbf{b}$ are defined as in (10)–(13). This is a quadratic function on $\mathbf{x}$, and we can apply the Schur complement to marginalize a subset of variables. Written as a linear system, it becomes
$\begin{bmatrix} \mathbf{H}_{\alpha\alpha} & \mathbf{H}_{\alpha\beta} \\ \mathbf{H}_{\beta\alpha} & \mathbf{H}_{\beta\beta} \end{bmatrix} \begin{bmatrix} \mathbf{x}_{\alpha} \\ \mathbf{x}_{\beta} \end{bmatrix} = \begin{bmatrix} \mathbf{b}_{\alpha} \\ \mathbf{b}_{\beta} \end{bmatrix},$ (16)

where $\beta$ denotes the block of variables we would like to marginalize, and $\alpha$ the block of variables we would like to keep. Applying the Schur complement yields $\widehat{\mathbf{H}}_{\alpha\alpha} \mathbf{x}_{\alpha} = \widehat{\mathbf{b}}_{\alpha}$, with
$\widehat{\mathbf{H}}_{\alpha\alpha} = \mathbf{H}_{\alpha\alpha} - \mathbf{H}_{\alpha\beta} \mathbf{H}_{\beta\beta}^{-1} \mathbf{H}_{\beta\alpha},$ (17)

$\widehat{\mathbf{b}}_{\alpha} = \mathbf{b}_{\alpha} - \mathbf{H}_{\alpha\beta} \mathbf{H}_{\beta\beta}^{-1} \mathbf{b}_{\beta}.$ (18)
The residual energy on $\mathbf{x}_{\alpha}$ can hence be written as

$E'(\mathbf{x}_{\alpha} \boxplus (\boldsymbol{\zeta}_0)_{\alpha}) = c - 2 (\mathbf{x}_{\alpha} - \mathbf{x}_{0,\alpha})^T \widehat{\mathbf{b}}_{\alpha} + (\mathbf{x}_{\alpha} - \mathbf{x}_{0,\alpha})^T \widehat{\mathbf{H}}_{\alpha\alpha} (\mathbf{x}_{\alpha} - \mathbf{x}_{0,\alpha}).$ (19)
This is a quadratic function on $\mathbf{x}_{\alpha}$ and can be trivially added to the full photometric error $E_{\text{photo}}$ during all subsequent optimization and marginalization operations, replacing the corresponding nonlinear terms. Note that this requires the tangent space for $\boldsymbol{\zeta}_0$ to remain the same for all variables that appear in $E'$ during all subsequent optimization and marginalization steps.
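Eqs. (17) and (18) are a direct matrix computation. A minimal sketch (index bookkeeping is simplified; in the real system $\mathbf{H}_{\beta\beta}$ is block-diagonal over points, which makes its inverse cheap):

```python
import numpy as np

def marginalize(H, b, keep, marg):
    """Schur complement marginalization (Eqs. 16-18): eliminate the beta block
    of variables, returning (H_hat, b_hat) on the kept alpha block.
    Assumes H is symmetric, so H_beta_alpha = H_alpha_beta^T."""
    Haa = H[np.ix_(keep, keep)]
    Hab = H[np.ix_(keep, marg)]
    Hbb = H[np.ix_(marg, marg)]
    Hbb_inv = np.linalg.inv(Hbb)
    H_hat = Haa - Hab @ Hbb_inv @ Hab.T
    b_hat = b[keep] - Hab @ Hbb_inv @ b[marg]
    return H_hat, b_hat
```

Solving the marginalized system reproduces exactly the kept-block entries of the full solution, which is the defining property of the Schur complement.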
3 Visual Odometry FrontEnd
The front end is the part of the algorithm that

determines the sets $\mathcal{F}$, $\mathcal{P}_i$, and $\operatorname{obs}(\mathbf{p})$ that make up the error terms of $E_{\text{photo}}$. It decides which points and frames are used, and in which frames a point is visible – in particular, this includes outlier removal and occlusion detection.

provides initializations for new parameters, required for optimizing the highly non-convex energy function $E_{\text{photo}}$. As a rule of thumb, a linearization of the image is only valid in a 1–2 pixel radius; hence all parameters involved in computing $\mathbf{p}'$ should be initialized sufficiently accurately for $\mathbf{p}'$ to be off by no more than 1–2 pixels.

decides when a point / frame should be marginalized.
As such, the front-end needs to replace many operations that in the indirect setting are accomplished by keypoint detectors (determining visibility, point selection) and initialization procedures such as RANSAC. Note that many procedures described here are specific to the monocular case. For instance, using a stereo camera makes obtaining initial depth values more straightforward, while integration of an IMU can significantly robustify – or even directly provide – a pose initialization for new frames.
3.1 Frame Management
Our method always keeps a window of up to $N_f$ active keyframes. Every new frame is initially tracked with respect to these reference frames (Step 1). It is then either discarded or used to create a new keyframe (Step 2). Once a new keyframe – and respective new points – are created, the total photometric error (8) is optimized. Afterwards, we marginalize one or more frames (Step 3).
Step 1: Initial Frame Tracking.
When a new keyframe is created, all active points are projected into it and slightly dilated, creating a semi-dense depth map. New frames are tracked with respect to only this frame using conventional two-frame direct image alignment, a multi-scale image pyramid, and a constant motion model for initialization. Figure 7 shows some examples – we found that further increasing the density has little to no benefit in terms of accuracy or robustness, while significantly increasing runtime. Note that when downscaling the images, a pixel is assigned a depth value if at least one of the source pixels has a depth value, as in [24], significantly increasing the density on coarser resolutions.
If the final RMSE for a frame is more than twice that of the frame before, we assume that direct image alignment failed and attempt to recover by initializing with up to 27 different small rotations in different directions. This recovery tracking is done on the coarsest pyramid level only, and takes approximately 0.5 ms per try. Note that this RANSAC-like procedure is only rarely invoked, such as when the camera moves very quickly or shakily. Tightly integrating an IMU would likely render this unnecessary.
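The 27 perturbations are simply all combinations of a small negative, zero, or positive rotation about each of the three axes. A sketch (the step size is our assumption, not a value from the paper):

```python
import itertools
import numpy as np

def recovery_rotations(delta=0.02):
    """Generate the 27 small rotation perturbations (all combinations of
    -delta, 0, +delta radians around each axis) used to re-seed coarse
    tracking after an alignment failure."""
    return [np.array(w) for w in itertools.product((-delta, 0.0, delta), repeat=3)]
```

The identity perturbation is included, so the unrotated initialization is always retried as well.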
Step 2: Keyframe Creation.
Similar to ORB-SLAM, our strategy is to initially take many keyframes (around 5–10 keyframes per second), and sparsify them afterwards by early marginalizing redundant keyframes. We combine three criteria to determine if a new keyframe is required:

New keyframes need to be created as the field of view changes. We measure this by the mean square optical flow $f$ (from the last keyframe to the latest frame) during initial coarse tracking.

Camera translation causes occlusions and disocclusions, which requires more keyframes to be taken (even though $f$ may be small). This is measured by the mean flow without rotation $f_t$, computed from the warped point positions obtained with $\mathbf{R} = \mathbf{I}_{3 \times 3}$.

If the camera exposure time changes significantly, a new keyframe should be taken. This is measured by the relative brightness factor $a$ between the two frames.
These three quantities can be obtained easily as a by-product of initial alignment. Finally, a new keyframe is taken if $w_f f + w_{f_t} f_t + w_a a > T_{\text{kf}}$, where $w_f$, $w_{f_t}$, $w_a$ provide a relative weighting of these three indicators, and $T_{\text{kf}} = 1$ by default.
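The resulting keyframe test is a thresholded weighted sum of the three indicators. A sketch (unit weights and a threshold of 1 are illustrative assumptions; the actual weights are tuning parameters):

```python
def need_keyframe(f, ft, a, wf=1.0, wft=1.0, wa=1.0, t_kf=1.0):
    """Keyframe criterion: take a new keyframe when the weighted sum of the
    mean optical flow f, the translation-only flow ft, and the relative
    brightness change a exceeds the threshold t_kf."""
    return wf * f + wft * ft + wa * a > t_kf
```

Large flow alone, large parallax alone, or a strong exposure change alone can each trigger a keyframe, which is the point of combining the three indicators additively.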
Step 3: Keyframe Marginalization.
Our marginalization strategy is as follows (let $I_1, \ldots, I_n$ be the set of active keyframes, with $I_1$ being the newest and $I_n$ being the oldest):

We always keep the latest two keyframes ($I_1$ and $I_2$).

Frames with less than 5% of their points visible in $I_1$ are marginalized.

If more than $N_f$ frames are active, we marginalize the one (excluding $I_1$ and $I_2$) which maximizes a “distance score” $s(I_i)$, computed as

$s(I_i) = \sqrt{d(i, 1)} \sum_{j \in [3, n] \setminus \{i\}} \big(d(i, j) + \epsilon\big)^{-1},$ (20)

where $d(i, j)$ is the Euclidean distance between keyframes $I_i$ and $I_j$, and $\epsilon$ a small constant. This scoring function is heuristically designed to keep active keyframes well-distributed in 3D space, with more keyframes close to the most recent one.
A keyframe is marginalized by first marginalizing all points represented in it, and then marginalizing the frame itself, using the marginalization procedure from Section 2.3. To preserve the sparsity structure of the Hessian, all observations of still-existing points in the frame are dropped from the system. While this is clearly suboptimal (in practice, about half of all residuals are dropped for this reason), it allows the energy function to be optimized efficiently. Figure 8 shows an example of a scene, highlighting the active set of points and frames.
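A sketch of the scoring heuristic of Eq. (20), using 0-based indexing with `positions[0]` the newest keyframe, so indices 0 and 1 correspond to the two keyframes that are always kept:

```python
import math

def distance_score(i, positions, eps=1e-3):
    """Distance score s(I_i) for choosing which keyframe to marginalize:
    sqrt of the distance to the newest keyframe, times the summed inverse
    distances to the other old keyframes (Eq. 20). eps is the small constant
    preventing division by zero for coincident keyframes."""
    def d(a, b):
        return math.dist(positions[a], positions[b])
    n = len(positions)
    inv_sum = sum(1.0 / (d(i, j) + eps) for j in range(2, n) if j != i)
    return math.sqrt(d(i, 0)) * inv_sum
```

Keyframes far from the newest one, or clustered near other old keyframes, receive a high score and are marginalized first, which keeps the active window dense near the current camera position.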
3.2 Point Management
Most existing direct methods focus on utilizing as much image data as possible. To achieve this in real time, they accumulate early, suboptimal estimates (linearizations / depth triangulations), and ignore – or approximate – correlations between different parameters. In this work, we follow a different approach, and instead heavily subsample the data to allow processing it in real time in a joint optimization framework. In fact, our experiments show that image data is highly redundant, and the benefit of simply using more data points quickly flattens off. Note that in contrast to indirect methods, our direct framework still allows sampling from across all available data, including weakly textured or repetitive regions and edges, which does provide a real benefit (see Section 4).
We aim at always keeping a fixed number $N_p$ of active points, equally distributed across space and active frames, in the optimization. In a first step, we identify candidate points in each new keyframe (Step 1). Candidate points are not immediately added into the optimization, but instead are tracked individually in subsequent frames, generating a coarse depth value which will serve as initialization (Step 2). When new points need to be added to the optimization, we choose a number of candidate points (from across all frames in the optimization window) to be activated, i.e., added into the optimization (Step 3). Note that we choose $N_p$ candidates in each frame, however only keep $N_p$ active points across all active frames combined. This assures that we always have sufficient candidates to activate, even though some may become invalid as they leave the field of view or are identified as outliers.
Step 1: Candidate Point Selection.
Our point selection strategy aims at selecting points that are (1) well-distributed in the image and (2) have sufficiently high image gradient magnitude with respect to their immediate surroundings. We obtain a region-adaptive gradient threshold by splitting the image into blocks. For each block, we then compute the threshold as $\bar{g} + g_{\text{th}}$, where $\bar{g}$ is the median absolute gradient over all pixels in that block, and $g_{\text{th}}$ a global constant.
To obtain an equal distribution of points throughout the image, we split it into $d \times d$ blocks, and from each block select the pixel with the largest gradient if it surpasses the region-adaptive threshold. Otherwise, we do not select a pixel from that block. We found that it is often beneficial to also include some points with weaker gradient from regions where no high-gradient points are present, capturing information from weak intensity variations originating, for example, from smoothly changing illumination across white walls. To achieve this, we repeat this procedure twice more, with decreased gradient threshold and block sizes $2d$ and $4d$, respectively. The block size $d$ is continuously adapted such that this procedure generates the desired number of points (if too many points were created, it is increased for the next frame; otherwise, it is decreased). Figure 9 shows the selected point candidates for some example scenes. Note that for candidate point selection, we use the raw images prior to photometric correction.
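The first selection pass can be sketched as follows. The block size and the additive constant are illustrative assumptions; the real system also runs the two coarser passes and continuously adapts the block size.

```python
import numpy as np

def select_candidates(img, block=32, g_add=7.0):
    """First pass of candidate selection: split the image into blocks, compute
    a region-adaptive threshold (median absolute gradient plus a constant),
    and pick each block's strongest-gradient pixel if it passes the threshold.
    Returns a list of (row, col) candidate positions."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along rows, cols
    mag = np.abs(gx) + np.abs(gy)
    h, w = img.shape
    picks = []
    for y0 in range(0, h - block + 1, block):
        for x0 in range(0, w - block + 1, block):
            patch = mag[y0:y0 + block, x0:x0 + block]
            thresh = np.median(patch) + g_add          # region-adaptive threshold
            k = np.unravel_index(np.argmax(patch), patch.shape)
            if patch[k] > thresh:
                picks.append((y0 + k[0], x0 + k[1]))
    return picks
```

A perfectly flat image yields no candidates (nothing exceeds the threshold), while any strong edge produces picks in the blocks it crosses.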
Step 2: Candidate Point Tracking.
Point candidates are tracked in subsequent frames using a discrete search along the epipolar line, minimizing the photometric error (4). From the best match we compute a depth and associated variance, which is used to constrain the search interval for the subsequent frame. This tracking strategy is inspired by LSD-SLAM. Note that the computed depth only serves as initialization once the point is activated.
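A heavily simplified version of such a discrete search (assuming an axis-aligned, rectified epipolar line and a plain sum of squared differences in place of the paper's weighted photometric error (4)):

```python
import numpy as np

def epipolar_search(ref_patch, target_row):
    """Slide a 1D reference patch along a row of the target image (a
    stand-in for a rectified epipolar line) and return the offset with
    the smallest sum of squared photometric differences."""
    n = len(ref_patch)
    best_pos, best_err = -1, np.inf
    for pos in range(len(target_row) - n + 1):
        err = float(np.sum((target_row[pos:pos + n] - ref_patch) ** 2))
        if err < best_err:
            best_pos, best_err = pos, err
    return best_pos, best_err
```

In the full system the best match's disparity gives a depth estimate, and its variance bounds the search interval in the next frame.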
Step 3: Candidate Point Activation.
After a set of old points is marginalized, new point candidates are activated to replace them. Again, we aim at maintaining a uniform spatial distribution across the image. To this end, we first project all active points onto the most recent keyframe. We then activate candidate points which – also projected into this keyframe – maximize the distance to any existing point (requiring larger distance for candidates created during the second or third block-run). Figure 7 shows the resulting distribution of points in a number of scenes.
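This activation can be sketched as a greedy farthest-point selection (illustrative only; a practical implementation would approximate the exact distance computation, e.g., with a coarse distance map, and would apply the larger distance requirement for second/third-pass candidates):

```python
import numpy as np

def activate_candidates(active_xy, candidate_xy, k):
    """Greedy farthest-point activation: repeatedly activate the
    candidate whose (projected) position maximizes the distance to all
    currently active points."""
    active = list(active_xy)
    candidates = list(candidate_xy)
    chosen = []
    for _ in range(min(k, len(candidates))):
        dists = [min(np.hypot(cx - ax, cy - ay) for ax, ay in active)
                 for cx, cy in candidates]
        i = int(np.argmax(dists))
        chosen.append(candidates[i])
        active.append(candidates.pop(i))
    return chosen
```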
Outlier and Occlusion Detection.
Since the available image data generally contains much more information than can be used in real time, we attempt to identify and remove potential outliers as early as possible. First, when searching along the epipolar line during candidate tracking, points for which the minimum is not sufficiently distinct are permanently discarded, greatly reducing the number of false matches in repetitive areas. Second, point observations for which the photometric error (4) surpasses a threshold are removed. The threshold is continuously adapted with respect to the median residual in the respective frame. For “bad” frames (e.g., frames that contain a lot of motion blur), the threshold will be higher, such that not all observations are removed. For good frames, in turn, the threshold will be lower, as we can afford to be more strict.
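The adaptive residual threshold can be written as a one-liner (sketch; the scaling factor is a placeholder of ours, not the value used in DSO):

```python
import numpy as np

def adaptive_outlier_mask(residuals, factor=3.0):
    """Keep observations whose photometric residual lies below a
    threshold that scales with the frame's median residual: blurry
    frames get a lenient threshold, sharp frames a strict one."""
    residuals = np.asarray(residuals, dtype=float)
    return residuals < factor * np.median(residuals)
```

Tying the threshold to the per-frame median is what makes the criterion self-calibrating: it removes relative outliers without discarding whole frames that are merely blurry.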
4 Results
In this section we extensively evaluate our Direct Sparse mono-VO algorithm (DSO). We both compare it to other monocular SLAM / VO methods and evaluate the effect of important design and parameter choices. We use three datasets for evaluation:
(1) The TUM monoVO dataset [8], which provides 50 photometrically calibrated sequences, comprising 105 minutes of video recorded in dozens of different environments, indoors and outdoors (see Figure 11). Since the dataset only provides loop-closure ground truth (allowing to evaluate tracking accuracy via the accumulated drift after a large loop), we evaluate using the alignment error e_align as defined in the respective publication.
(2) The EuRoC MAV dataset [1], which contains 11 stereo-inertial sequences comprising 19 minutes of video, recorded in 3 different indoor environments. For this dataset, no photometric calibration or exposure times are available, hence we omit photometric image correction and set the exposure times to t_i = 1. We evaluate in terms of the absolute trajectory error e_ate, which is the translational RMSE after Sim(3) alignment. For this dataset we crop the beginning of each sequence since they contain very shaky motion meant to initialize the IMU biases – we only use the parts of the sequence where the MAV is in the air.
(3) The ICL-NUIM dataset [10], which contains 8 ray-traced sequences comprising 4.5 minutes of video, from two indoor environments. For this dataset, photometric image correction is not required, and all exposure times can be set to t = 1. Again, we evaluate in terms of the absolute trajectory error e_ate.
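For reference, the absolute trajectory error used for (2) and (3) can be computed as below (a minimal rigid SE(3) alignment via the Kabsch method; the monocular evaluation additionally aligns scale, i.e., uses a Sim(3) alignment, which this sketch omits):

```python
import numpy as np

def ate_rmse(gt, est):
    """Translational RMSE after rigid (SE(3)) alignment of the estimated
    positions to ground truth (Kabsch method, no scale)."""
    gt = np.asarray(gt, dtype=float)
    est = np.asarray(est, dtype=float)
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    # best rotation mapping centred estimates onto centred ground truth
    H = (gt - mu_g).T @ (est - mu_e)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt
    aligned = (est - mu_e) @ R.T + mu_g
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```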
Methodology.
We aim at an evaluation as comprehensive as possible given the available data, and thus run all sequences both forwards and backwards, 5 times each (to account for non-deterministic behaviour); on default settings, each method is thus run 10 times on each sequence. For the EuRoC MAV dataset we further run both the left and the right video separately. In total, this gives 500 runs for the TUM monoVO dataset, 220 runs for the EuRoC MAV dataset, and 80 runs for the ICL-NUIM dataset, which we run on 20 dedicated workstations. We remove the dependency on the host machine's CPU speed by not enforcing real-time execution, except where stated otherwise: for ORB-SLAM we play the video at 20% speed, whereas DSO is run in a sequentialized, single-threaded implementation that runs approximately four times slower than real time. Note that even though we do not enforce real-time execution for most of the experiments, we use the exact same parameter settings as for the real-time comparisons.
The results are summarized in the form of cumulative error plots (see, e.g., Figure 10), which visualize for how many tracked sequences the respective error value (e_align / e_ate) was below a certain threshold; thereby showing both accuracy on sequences where a method works well, as well as robustness, i.e., on how many sequences the method does not fail. The raw tracking results for all runs – as well as scripts to compute the figures – are provided in the supplementary material (http://vision.in.tum.de/dso). Additional interesting analysis using the TUM monoVO dataset – e.g., the influence of the camera's field of view, the image resolution, or the camera's motion direction – can be found in [8].
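Computing such a cumulative error curve reduces to counting, per threshold, the runs whose error stays below it (sketch; failed runs are assigned an infinite error so they never count, which is what caps the curve's plateau at the number of successful runs):

```python
import numpy as np

def cumulative_error_curve(errors, thresholds):
    """For each error threshold, count the runs whose error stays below
    it; failed runs carry an infinite error and never count."""
    errors = np.asarray(errors, dtype=float)
    return [int(np.sum(errors < t)) for t in thresholds]
```

The left part of the resulting curve reflects accuracy (how quickly it rises), the height of its plateau reflects robustness (how many runs succeed at all).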
Evaluated Methods and Parameter Settings.
We compare our method to the open-source implementation of (monocular) ORB-SLAM [20]. We also attempted to evaluate against the open-source implementations of LSD-SLAM [5] and SVO [9]; however, both methods consistently fail on most of the sequences. A major reason for this is that they assume brightness constancy (ignoring exposure changes), while both real-world datasets used contain heavy exposure variations.
To facilitate a fair comparison and allow application of the loop-closure metric from the TUM monoVO dataset, we disable explicit loop-closure detection and relocalization for ORB-SLAM. Note that everything else (including local and global BA) remains unchanged, still allowing ORB-SLAM to detect incremental loop-closures that can be found via the covisibility representation alone. All parameters are set to the same value across all sequences and datasets. The only exception is the ICL-NUIM dataset: for this dataset we lower the gradient threshold g_th for DSO, and lower the FAST threshold for ORB-SLAM to 2, which we found to give best results.
4.1 Quantitative Comparison
Figure 10 shows the absolute trajectory RMSE e_ate on the EuRoC MAV dataset and the ICL-NUIM dataset for both methods (if an algorithm gets lost within a sequence, we set e_ate = ∞). Figure 12 shows the alignment error e_align, as well as the rotation drift and scale drift for the TUM monoVO dataset.
In addition to the non-real-time evaluation (bold lines), we evaluate both algorithms in a hard-enforced real-time setting on an Intel i7-4910MQ CPU (dashed lines). The direct, sparse approach clearly outperforms ORB-SLAM in accuracy and robustness both on the TUM monoVO dataset and on the synthetic ICL-NUIM dataset. On the EuRoC MAV dataset, ORB-SLAM achieves a better accuracy (but lower robustness). This is due to two major reasons: (1) there is no photometric calibration available, and (2) the sequences contain many small loops or segments where the quadrocopter “backtracks” the way it came, allowing ORB-SLAM's local mapping component to implicitly close many small and some large loops, whereas our visual odometry formulation permanently marginalizes all points and frames that leave the field of view. We can validate this by prohibiting ORB-SLAM from matching against any keypoints that have not been observed for more than a fixed number of seconds (lines with circle markers in Figure 10): in this case, ORB-SLAM performs similar to DSO in terms of accuracy, but is less robust. The slight difference in robustness for DSO comes from the fact that for real-time execution, tracking new frames and keyframe creation are parallelized, thus new frames are tracked on the second-latest keyframe, instead of the latest. In some rare cases – in particular during strong exposure changes – this causes initial image alignment to fail.
To show the flexibility of DSO, we include results when running at 5 times real-time speed (all images are loaded, decoded, and pinhole-rectified beforehand), with reduced settings (N_p = 800 points, N_f = 6 active frames, 424×320 image resolution, ≤ 4 Gauss-Newton iterations after a keyframe is created): even with such extreme settings, DSO achieves very good accuracy and robustness on all three datasets.
Note that DSO is designed as a pure visual odometry, while ORB-SLAM constitutes a full SLAM system, including loop-closure detection & correction and relocalization – all these additional abilities are neglected or switched off in this comparison.
4.2 Parameter Studies
This section aims at evaluating a number of different parameter and algorithm design choices, using the TUM monoVO dataset.
Photometric Calibration.
We analyze the influence of photometric calibration, verifying that it in fact increases accuracy and robustness. To this end, we incrementally disable the different components:

1. exposure (blue): set t_i = 1 and λ_a = λ_b = 0.

2. vignette (green): set V(x) = 1 (in addition to 1.).

3. response (yellow): set G to the identity (in addition to 1.–2.).

4. brightness constancy (black): set λ_a = λ_b = ∞, i.e., disable affine brightness correction (in addition to 1.–3.).
Figure 15 shows the result. While known exposure times seem to have little effect on the accuracy, removing vignette and response calibration does slightly decrease the overall accuracy and robustness. Interestingly, only removing vignette calibration performs slightly worse than removing both vignette and response calibration. A naïve brightness constancy assumption (as used in many other direct approaches like LSD-SLAM or SVO) clearly performs worst, since it does not account for automatic exposure changes at all.
Amount of Data.
We analyze the effect of changing the amount of data used, by varying the number of active points N_p, as well as the number of frames in the active window N_f. Note that increasing N_f allows to keep more observations per point: for any point we only ever keep observations in active frames; thus the number of observations when marginalizing a point is limited to N_f (see Section 2.3). Figure 16 summarizes the result. We can observe that the benefit of simply using more data quickly flattens off after N_p = 500 points. At the same time, the number of active frames has little influence after N_f = 7, while increasing the runtime quadratically. We further evaluate a fixed-lag marginalization strategy (i.e., always marginalize the oldest keyframe, instead of using the proposed distance score) as in [17]: this performs significantly worse.
Selection of Data.
In addition to evaluating the effect of the number of residuals used, it is interesting to look at which data is used – in particular since one of the main benefits of a direct approach is the ability to sample from all points, instead of only using corners. To this end, we vary the gradient threshold for point selection, g_th; the result is summarized in Figure 17. While there seems to be a sweet spot around g_th = 7 (if g_th is too large, for some scenes not enough well-distributed points are available to sample from – if it is too low, too much weight will be given to data with a low signal-to-noise ratio), the overall impact is relatively low.
More interestingly, we analyze the effect of only using corners, by restricting point candidates to FAST corners only. We can clearly see that only using corners significantly decreases performance. Note that for lower FAST thresholds, many false “corners” will be detected along edges, which our method can still use, in contrast to indirect methods for which such points will be outliers. In fact, ORB-SLAM achieves best performance using the default threshold of 20.
Number of Keyframes.
We analyze the number of keyframes taken by varying the keyframe-creation threshold (see Section 3.1). For each value we give the resulting average number of keyframes per second; the default setting results in 8 keyframes per second, which is easily achieved in real time. The result is summarized in Figure 18. Taking too few keyframes (less than 4 per second) reduces the robustness, mainly in situations with strong occlusions / disocclusions, e.g., when walking through doors. Taking too many keyframes, on the other hand (more than 15 per second), decreases accuracy. This is because taking more keyframes causes them to be marginalized earlier (since the window size N_f is fixed), thereby accumulating linearizations around earlier (and less accurate) linearization points.
Residual Pattern.
We test different residual patterns, i.e., different shapes and sizes for the pixel neighborhood over which the photometric error of a point is accumulated, covering smaller or larger areas. The result is shown in Figure 19.
4.3 Geometric vs. Photometric Noise Study
The fundamental difference between the proposed direct model and the indirect model is the noise assumption. The direct approach models photometric noise, i.e., additive noise on pixel intensities. In contrast, the indirect approach models geometric noise, i.e., additive noise on the position of a point in the image plane, assuming that keypoint descriptors are robust to photometric noise. It therefore comes as no surprise that the indirect approach is significantly more robust to geometric noise in the data. In turn, the direct approach performs better in the presence of strong photometric noise – which keypoint descriptors (operating on a purely local level) fail to filter out. We verify this by analyzing tracking accuracy on the TUM monoVO dataset when artificially adding (a) geometric noise and (b) photometric noise to the images.
Geometric Noise.
For each frame, we separately generate a low-frequency random flow-map n_δ by upsampling a 3×3 grid filled with uniformly distributed random values from [−δ, δ]² (using bicubic interpolation). We then perturb the original image by shifting each pixel x by n_δ(x):
I'(x) := I(x + n_δ(x))    (21)
This procedure simulates noise originating from (unmodeled) rolling shutter or inaccurate geometric camera calibration. Figure 20 visualizes an example of the resulting noise pattern, as well as the accuracy of ORB-SLAM and DSO for different values of δ. As expected, we can clearly observe how DSO's performance quickly deteriorates with added geometric noise, whereas ORB-SLAM is much less affected. This is because the first step in the indirect pipeline – keypoint detection and extraction – is not affected by low-frequency geometric noise, as it operates on a purely local level. The second step then optimizes a geometric noise model – which not surprisingly deals well with geometric noise. In the direct approach, in turn, geometric noise is not modeled, and thus has a much more severe effect – in fact, for large δ there likely exists no state for which all residuals are within the validity radius of the linearization of the photometric error; thus optimization fails entirely (which can be alleviated by using a coarser pyramid level). Note that this result also suggests that the proposed direct model is more susceptible to inaccurate intrinsic camera calibration than the indirect approach – in turn, it may benefit more from accurate, non-parametric intrinsic calibration.
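The geometric perturbation can be sketched as follows (illustrative only: bilinear instead of bicubic upsampling, and nearest-neighbour resampling instead of interpolated lookup; the 3×3 grid and the range [−delta, delta] follow the text):

```python
import numpy as np

def geometric_noise(img, delta, grid=3, rng=None):
    """Warp an image with a low-frequency random flow-map: a coarse
    grid of uniform shifts in [-delta, delta]^2 is upsampled to full
    resolution and every pixel is resampled at its shifted location."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = img.shape
    coarse = rng.uniform(-delta, delta, (grid, grid, 2))
    # bilinear upsampling of the coarse flow field
    ys = np.linspace(0, grid - 1, h)
    xs = np.linspace(0, grid - 1, w)
    y0 = np.clip(ys.astype(int), 0, grid - 2)
    x0 = np.clip(xs.astype(int), 0, grid - 2)
    fy = (ys - y0)[:, None, None]
    fx = (xs - x0)[None, :, None]
    flow = ((1 - fy) * (1 - fx) * coarse[y0][:, x0]
            + (1 - fy) * fx * coarse[y0][:, x0 + 1]
            + fy * (1 - fx) * coarse[y0 + 1][:, x0]
            + fy * fx * coarse[y0 + 1][:, x0 + 1])
    # nearest-neighbour lookup at the shifted positions
    yy, xx = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(yy + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xx + flow[..., 1]).astype(int), 0, w - 1)
    return img[sy, sx]
```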
Photometric Noise.
For each frame, we separately generate a high-frequency random blur-map n_δ by upsampling a grid filled with uniformly distributed random values in [0, δ]. We then perturb the original image by adding anisotropic blur with standard deviation n_δ(x) to pixel x:

I'(x) := (I ∗ G_{n_δ(x)})(x),    (22)

where G_σ denotes a 2D Gaussian kernel with standard deviation σ.
Figure 21 shows the result. We can observe that DSO is slightly more robust to photometric noise than ORB-SLAM – this is because (purely local) keypoint matching fails for high photometric noise, whereas a joint optimization of the photometric error better overcomes the introduced distortions.
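The photometric perturbation can be sketched analogously (simplifications of ours: isotropic per-pixel Gaussian blur instead of anisotropic, nearest-neighbour upsampling of the sigma map, and a slow loop-based form for clarity):

```python
import numpy as np

def photometric_noise(img, delta, grid=3, rng=None):
    """Blur every pixel with a Gaussian whose standard deviation is
    drawn from a coarse random map in [0, delta] (nearest-neighbour
    upsampling; isotropic blur for simplicity)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = img.shape
    coarse = rng.uniform(0.0, delta, (grid, grid))
    sigma = coarse[(np.arange(h) * grid // h)[:, None],
                   (np.arange(w) * grid // w)[None, :]]
    r = max(1, int(np.ceil(3 * delta)))      # kernel radius
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    pad = np.pad(img.astype(float), r, mode="edge")
    out = np.empty((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            s = sigma[y, x]
            if s < 1e-6:                      # no blur for tiny sigma
                out[y, x] = img[y, x]
                continue
            k = np.exp(-(yy**2 + xx**2) / (2.0 * s**2))
            out[y, x] = np.sum(k * pad[y:y + 2*r + 1, x:x + 2*r + 1]) / k.sum()
    return out
```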
To summarize: While the direct approach outperforms the indirect approach on well-calibrated data, it is ill-suited in the presence of strong geometric noise, e.g., originating from a rolling shutter or inaccurate intrinsic calibration. In practice, this makes the indirect model superior for smartphones or off-the-shelf webcams, since these were designed to capture videos for human consumption – prioritizing resolution and light sensitivity over geometric precision. In turn, the direct approach offers superior performance on data captured with dedicated cameras for machine vision, since these put more importance on geometric precision, rather than capturing appealing images for human consumption. Note that this can be resolved by tightly integrating the rolling shutter into the model, as done, e.g., in [19, 18, 15].
4.4 Qualitative Results
In addition to accurate camera tracking, DSO computes 3D points on all gradient-rich areas, including edges – resulting in point-cloud reconstructions similar to the semi-dense reconstructions of LSD-SLAM. The density then directly corresponds to the number of points N_p we keep in the active window. Figure 22 shows some examples.
Figure 23 shows three more scenes (one from each dataset), together with some corresponding depth maps. Note that our approach is able to track through scenes with very little texture, whereas indirect approaches fail. All reconstructions shown are simply accumulated from the odometry, without integrating loop-closures. See the supplementary video for more qualitative results.
5 Conclusion
We have presented a novel direct and sparse formulation for Structure from Motion. It combines the benefits of direct methods (seamless ability to use & reconstruct all points instead of only corners) with the flexibility of sparse approaches (efficient, joint optimization of all model parameters). This is possible in real time by omitting the geometric prior used by other direct methods, and instead evaluating the photometric error for each point over a small neighborhood of pixels, to well-constrain the overall problem. Furthermore, we incorporate full photometric calibration, completing the intrinsic camera model that traditionally only reflects the geometric component of the image formation process.
We have implemented our direct & sparse model in the form of a monocular visual odometry algorithm (DSO), incrementally marginalizing / eliminating old states to maintain real-time performance. To this end we have developed a front-end that performs data selection and provides accurate initialization for optimizing the highly non-convex energy function. Our comprehensive evaluation on several hours of video shows the superiority of the presented formulation relative to state-of-the-art indirect methods. We furthermore present an exhaustive parameter study, indicating that (1) simply using more data does not increase tracking accuracy (although it makes the 3D models denser), (2) using all points instead of only corners does provide a real gain in accuracy and robustness, and (3) incorporating photometric calibration does increase performance, in particular compared to the basic “brightness constancy” assumption.
We have also shown experimentally that the indirect approach – modeling a geometric error – is much more robust to geometric noise, e.g., originating from a poor intrinsic camera calibration or a rolling shutter. The direct approach is in turn more robust to photometric noise, and achieves superior accuracy on well-calibrated data. We believe this to be one of the main explanations for the recent revival of direct formulations after a dominance of indirect approaches for more than a decade: For a long time, the predominant source of digital image data were cameras originally designed to capture images for human viewing (such as off-the-shelf webcams or integrated smartphone cameras). In this setting, the strong geometric distortions caused by rolling shutters and imprecise lenses favored the indirect approach. In turn, with 3D computer vision becoming an integral part of mass-market products (including autonomous cars and drones, as well as mobile devices for VR and AR), cameras are being developed specifically for this purpose, featuring global shutters, precise lenses, and high frame rates – which allows direct formulations to realize their full potential.
Since the structure of the proposed direct sparse energy formulation is the same as that of indirect methods, it can be integrated with other optimization frameworks like (double-windowed) bundle adjustment [25] or incremental smoothing and mapping [14]. The main challenge here is the greatly increased degree of non-convexity compared to the indirect model, which originates from the inclusion of the image in the error function – this is likely to restrict the use of our model to video processing.
References
 [1] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. Achtelik, and R. Siegwart. The EuRoC micro aerial vehicle datasets. International Journal of Robotics Research, 2016.
 [2] D. Caruso, J. Engel, and D. Cremers. Large-scale direct SLAM for omnidirectional cameras. In International Conference on Intelligent Robot Systems (IROS), 2015.
 [3] J. Civera, A. Davison, and J. Montiel. Inverse depth parametrization for monocular SLAM. Transactions on Robotics, 24(5):932–945, 2008.
 [4] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29, 2007.
 [5] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), 2014.
 [6] J. Engel, J. Stueckler, and D. Cremers. Large-scale direct SLAM with stereo cameras. In International Conference on Intelligent Robot Systems (IROS), 2015.
 [7] J. Engel, J. Sturm, and D. Cremers. Scale-aware navigation of a low-cost quadrocopter with a monocular camera. Robotics and Autonomous Systems (RAS), 62(11):1646–1656, 2014.
 [8] J. Engel, V. Usenko, and D. Cremers. A photometrically calibrated benchmark for monocular visual odometry. arXiv preprint arXiv:1607.02555, 2016.
 [9] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In International Conference on Robotics and Automation (ICRA), 2014.
 [10] A. Handa, T. Whelan, J. McDonald, and A. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In International Conference on Robotics and Automation (ICRA), 2014.
 [11] G. P. Huang, A. I. Mourikis, and S. I. Roumeliotis. A first-estimates Jacobian EKF for improving SLAM consistency. In International Symposium on Experimental Robotics, 2008.
 [12] H. Jin, P. Favaro, and S. Soatto. Real-time 3D motion and structure of point features: Front-end system for vision-based control and interaction. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2000.
 [13] H. Jin, P. Favaro, and S. Soatto. A semi-direct approach to structure from motion. The Visual Computer, 19(6):377–394, 2003.
 [14] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F. Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. International Journal of Robotics Research, 31(2):217–236, 2012.
 [15] C. Kerl, J. Stueckler, and D. Cremers. Dense continuous-time tracking and mapping with rolling shutter RGB-D cameras. In International Conference on Computer Vision (ICCV), 2015.
 [16] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In International Symposium on Mixed and Augmented Reality (ISMAR), 2007.
 [17] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. International Journal of Robotics Research, 34(3):314–334, 2015.
 [18] M. Li, B. Kim, and A. Mourikis. Real-time motion estimation on a cellphone using inertial sensing and a rolling-shutter camera. In International Conference on Robotics and Automation (ICRA), 2013.
 [19] S. Lovegrove, A. Patron-Perez, and G. Sibley. Spline fusion: A continuous-time representation for visual-inertial fusion with application to rolling shutter cameras. In British Machine Vision Conference (BMVC), 2013.
 [20] R. Mur-Artal, J. Montiel, and J. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. Transactions on Robotics, 31(5):1147–1163, 2015.
 [21] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), 2011.
 [22] M. Pizzoli, C. Forster, and D. Scaramuzza. REMODE: Probabilistic, monocular dense reconstruction in real time. In International Conference on Robotics and Automation (ICRA), 2014.
 [23] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [24] T. Schöps, J. Engel, and D. Cremers. Semi-dense visual odometry for AR on a smartphone. In International Symposium on Mixed and Augmented Reality (ISMAR), 2014.
 [25] H. Strasdat, A. J. Davison, J. M. M. Montiel, and K. Konolige. Double window optimisation for constant time visual SLAM. In International Conference on Computer Vision (ICCV), 2011.
 [26] J. Stühmer, S. Gumhold, and D. Cremers. Real-time dense geometry from a handheld camera. In Pattern Recognition (DAGM), 2010.
 [27] L. Valgaerts, A. Bruhn, M. Mainberger, and J. Weickert. Dense versus sparse approaches for estimating the fundamental matrix. International Journal of Computer Vision (IJCV), 96(2):212–234, 2012.