# Tightly-coupled Monocular Visual-odometric SLAM using Wheels and a MEMS Gyroscope

###### Abstract

In this paper, we present a novel tightly-coupled probabilistic monocular visual-odometric Simultaneous Localization and Mapping (SLAM) algorithm using wheels and a MEMS gyroscope, which provides accurate, robust and long-term localization for a ground robot moving on a plane. First, we present an odometer preintegration theory that integrates the wheel encoder and gyroscope measurements in a local frame. The preintegration theory properly addresses the manifold structure of the rotation group SO(3) and carefully deals with uncertainty propagation and bias correction. The novel odometer error term is then formulated using the odometer preintegration model and tightly integrated into the visual optimization framework. Furthermore, we introduce a complete tracking framework that provides different strategies for motion tracking when (1) both measurements are available, (2) visual measurements are not available, and (3) the wheel encoder experiences slippage, which makes the system accurate and robust. Finally, the proposed algorithm is evaluated through extensive experiments, and the experimental results demonstrate the superiority of the proposed system.

## I Introduction

Simultaneous localization and mapping (SLAM) from on-board sensors is a fundamental technology that allows an autonomous mobile robot to interact safely within its workspace. SLAM builds a globally consistent representation of the environment (i.e. the map) and simultaneously estimates the state of the robot in that map. Because SLAM can be used in many practical applications, such as autonomous driving, virtual or augmented reality and indoor service robots, it has received considerable attention from the robotics and computer vision communities.

In this paper, we propose a novel tightly-coupled probabilistic optimization-based monocular visual-odometric SLAM (VOSLAM) system. By combining a monocular camera with wheels and a MEMS gyroscope, the method provides accurate and robust motion tracking for domestic service robots moving on a plane, e.g. cleaning robots, nursing robots and restaurant waiter robots. A single camera provides rich information about the environment, which allows for building a 3D map, tracking the camera pose and recognizing places already visited. However, the scale of the environment cannot be determined using a monocular camera, and a visual tracking system is sensitive to motion blur, occlusions and illumination changes. Most ground robots are equipped with wheel encoders that provide precise and stable translational measurements of each wheel most of the time, and these measurements contain absolute scale information. However, the wheel encoder cannot provide accurate rotational estimates and occasionally provides faulty measurements. In addition, the MEMS gyroscope is a low-cost and widely used commercial sensor that provides accurate and robust inter-frame rotational estimates. However, the estimated rotation is noisy and diverges within a few seconds. This analysis shows that the wheel encoder and the gyroscope are complementary to the monocular camera. Therefore, tightly fusing the measurements from the wheel encoder and gyroscope into monocular visual SLAM can not only dramatically improve the accuracy and robustness of the system, but also recover the scale of the environment. In the following, we refer to the wheel encoder and MEMS gyroscope together as the odometer.

In order to tightly fuse the odometer measurements into the visual SLAM system in the framework of graph-based optimization, it is important to provide the integrated odometer measurements between the selected keyframes. Therefore, motivated by the inertial measurement unit (IMU) preintegration theory proposed in [1], we present a novel odometer preintegration theory together with the corresponding uncertainty propagation and bias correction theory on manifold. The preintegration theory integrates the measurements from the wheel encoder and gyroscope into a single relative motion constraint that is independent of changes in the linearization point, so repeated computation is eliminated. Then, based on the proposed odometer preintegration model, we formulate the new preintegrated odometer factor and seamlessly integrate it into a visual-odometric pipeline under the optimization framework.

Furthermore, visual and odometer measurements are not always both available. Therefore, we present a complete visual-odometric tracking framework to ensure accurate and robust motion tracking in different situations. When both measurements are available, we exploit both sensing cues as fully as possible to provide accurate motion tracking. When visual information is not available, we use odometer measurements to improve the robustness of the system and offer strategies to make the visual information available again as quickly as possible. In addition, to address the critical drawback of the wheel sensor, we provide a strategy to detect and compensate for wheel slippage. In this way, we can track the motion of the ground robot accurately and robustly.

The final contribution of the paper is the extensive evaluation of our system. Extensive experiments are performed to demonstrate the accuracy and robustness of the proposed algorithm. The presented algorithm is shown in Fig. 1, and the details are presented in later sections.

## II Related Work

There is an extensive body of work on monocular visual SLAM, which relies on either filtering methods or nonlinear optimization methods. Filtering based approaches require fewer computational resources due to the continuous marginalization of past states. The first real-time monocular visual SLAM system, MonoSLAM [2], is an extended Kalman filter (EKF) based method. The standard way of computing Jacobians in the filter gives the system incorrect observability properties, so the estimator is inconsistent and achieves slightly lower accuracy. To solve this problem, the first-estimates Jacobian approach was proposed in [3], which computes Jacobians with the first-ever available estimate instead of different linearization points to ensure the correct observability of the system and thereby improve its consistency and accuracy. In addition, the observability-constrained EKF [4] was proposed to explicitly enforce the unobservable directions of the system, hence improving its consistency and accuracy.

On the other hand, nonlinear optimization based approaches can achieve better accuracy because they re-linearize the measurement models at each iteration to better deal with their nonlinearity, although this incurs a high computational cost. The first real-time optimization based monocular visual SLAM system is PTAM [5], proposed by Klein and Murray. The method achieves real-time performance by dividing the SLAM system into two parallel threads. In one thread, the system performs bundle adjustment over selected keyframes and constructed map points to obtain an accurate map of the environment. In the other thread, the camera pose is tracked by minimizing the reprojection error of the features that match the reconstructed map. Based on PTAM, the versatile monocular SLAM system ORB-SLAM [6] was presented. The system introduced a third, loop closing thread to eliminate the accumulated error when revisiting an already reconstructed area, which is achieved by taking advantage of bag-of-words [7] and a 7 degree-of-freedom (dof) pose graph optimization [8].

In addition, according to the definition of the visual residual model, monocular SLAM can also be categorized into feature based approaches and direct approaches. The methods mentioned above are all feature based approaches, which are quite mature and able to provide accurate estimates. However, these approaches fail to track in poorly textured environments and consume extra computational resources to extract and match features. In contrast, direct methods work on pixel intensities and can exploit all the information in the image, even in places where the gradient is small. Therefore, direct methods can outperform feature based methods in low-texture environments and in the case of camera defocus and motion blur. DTAM [9], SVO [10] and LSD-SLAM [11] are direct monocular SLAM systems, which build dense or semi-dense maps from monocular images in real time; however, their accuracy is still lower than that of the feature based semi-dense mapping technique [12].

Monocular visual SLAM is scale ambiguous and sensitive to motion blur, occlusions and illumination changes. Therefore, it is often combined with other odometric sensors, especially IMUs, to achieve an accurate and robust tracking system. Tightly-coupled visual-odometric SLAM, in which visual and odometric measurements are fused at the raw measurement level, can also be categorized into filtering based methods and optimization based methods. The works [13, 14, 15, 16, 17] are filtering based monocular visual-inertial SLAM approaches, which use the inertial measurements to accurately predict the motion between two consecutive frames. An elegant example of filtering based visual-inertial odometry (VIO) is the MSCKF [16], which exploits all the available geometric information provided by the visual measurements with a computational complexity that is only linear in the number of features. This is achieved by excluding point features from the state vector.

OKVIS [18] is an optimization based monocular visual-inertial SLAM system, which tightly integrates the inertial measurements into a keyframe-based visual-inertial pipeline under the framework of graph optimization. However, in this system, the IMU integration is recomputed whenever the linearization point changes. To eliminate this repeated computation, Forster et al. presented an IMU preintegration theory and tightly integrated the preintegrated IMU factor and the visual factor in a fully probabilistic manner in [1]. Later, a real-time tightly-coupled monocular visual-inertial SLAM system, ORB-VISLAM [19], was presented. The system can close loops and reuse the previously estimated 3D map, and therefore achieves high accuracy and robustness. Recently, another tightly-coupled monocular visual-inertial odometry was proposed in [20][21]. It provides accurate and robust motion tracking by performing local bundle adjustment (BA) for each frame and by closing loops.

There are also several works on visual-odometric SLAM that fuse visual measurements and wheel encoder measurements. In [22], wheel encoder measurements are incorporated into a visual motion tracking system for accurate motion prediction, so that the true scale of the system is recovered. In addition, [23] proved that VINS has additional unobservable directions when a ground robot moves along straight lines or circular arcs. Therefore, a system that fuses the wheel encoder measurements into the VINS estimator in a tightly-coupled manner was proposed to render the scale of the system observable.

## III Preliminaries

We begin by briefly defining the notations used throughout the paper. We employ to denote the world reference frame, , and to denote the wheel odometer frame, camera frame and inertial frame for the image. In following, we employ to represent rotation from frame to and to describe the 3D position of frame with respect to the frame .

The rotation and translation between the rigidly mounted wheel encoder and camera sensor are and respectively, and denotes the rotation from the inertial frame to the wheel encoder frame, these parameters are obtained from calibration. In addition, the pose of the image is the rigid-body transformation , and the 3D position of the map point in the global frame and the camera frame are denoted as and respectively.

In order to provide a minimal representation for the rigid-body transformation during the optimization, we use a vector computed from the Lie algebra of to represent the over-parameterized rotation matrix . The Lie algebra of is denoted as , which is the tangent space of the manifold and coincides with the space of skew symmetric matrices. The logarithm map associates a rotation matrix to a skew symmetric matrix:

(1) |

where operator maps a 3-dimensional vector to a skew symmetric matrix, thus the vector can be computed using inverse operator:

(2) |

Inversely, the exponential map associates the Lie algebra to the rotation matrix :

(3) |
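The maps described above can be illustrated with a minimal sketch, assuming the standard Rodrigues formula for the exponential map on SO(3) and the usual trace-based logarithm. The function names (`hat`, `so3_exp`, `so3_log`) are hypothetical and are not taken from the paper; the logarithm as written is only intended for rotation angles well below π.

```python
import math

def hat(w):
    """Map a 3-vector to the corresponding 3x3 skew-symmetric matrix."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def so3_exp(w):
    """Exponential map: rotation vector -> rotation matrix (Rodrigues' formula)."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul3(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def so3_log(R):
    """Logarithm map: rotation matrix -> rotation vector (angle well below pi)."""
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * math.sin(theta))
    # The vector is read off the skew-symmetric part R - R^T.
    return [scale * (R[2][1] - R[1][2]),
            scale * (R[0][2] - R[2][0]),
            scale * (R[1][0] - R[0][1])]

# Round trip: the logarithm of the exponential recovers the rotation vector.
w = [0.1, -0.2, 0.3]
w_back = so3_log(so3_exp(w))
```

The round trip at the end is a quick sanity check that the two maps are mutual inverses for moderate rotation angles.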

The input of our estimation problem is a stream of measurements from the monocular camera and the odometer. The visual measurement is a set of point features extracted from the captured intensity image at time-step . Such measurement is obtained by camera projection model , which projects the map point expressed in the current camera frame onto the image coordinate :

(4) |

where is the corresponding feature measurement, and is the measurement noise with covariance . The projection function is determined by the intrinsic parameters of the camera, which is known from calibration.

In addition, the gyroscope of the odometer measures the angular velocity at time-step k, the measurement is assumed to be affected by a slowly varying sensor bias with covariance and a discrete-time zero-mean Gaussian white noise with covariance :

(5) |

The wheel encoder of the odometer measures the traveled distances and of both wheels from time-step to , each of which is assumed to be affected by a discrete-time zero-mean Gaussian white noise with variance :

(6) |

Therefore, the measured 3D position of frame with respect to frame from wheel encoder is:

(7) |

where and constitute the pose of frame , and and constitute the pose of frame .

In many cases, the ground robot is moving on a plane. The motion on a plane has 3 dof in contrast to 6 dof of 3D motion, i.e. the roll, pitch angle and translation on z-axis of frame in the frame of physical plane should be close to zero. Since the additional information can improve the accuracy of the system, we also provide planar measurement with covariance for each frame, where the first two elements correspond to the planar rotational measurement and the third element corresponds to the planar translational measurement.

## IV Tightly-coupled Visual-odometric Nonlinear Optimization on Manifold

We use to denote the set of successive keyframes from to , and to denote all the landmarks visible from the keyframes in . Then the variables to be estimated in the window of keyframes from to are:

(8) |

where is the state of the keyframe .

We denote the visual measurements of at the keyframe as . In addition, we denote the odometer measurements obtained between two consecutive keyframes and as . Therefore, the set of measurements collected for optimizing the state is:

(9) |

### IV-A Maximum a Posteriori Estimation

The optimum value of state is estimated by solving the following maximum a posteriori (MAP) problem:

(10) |

which means that given the available measurements , we want to find the best estimate for state . Assuming measurements are independent, then using Bayes’ rule, we can rewrite as:

(11) |

The equation can be interpreted as a factor graph. The variables in correspond to nodes in the factor graph. The terms , , and are called factors, which encode probabilistic constraints between nodes. A factor graph representing the problem is shown in Fig. 2.

The MAP estimate is equal to the minimum of the negative log-posterior. Under the assumption of zero-mean Gaussian noise, the MAP estimate in (10) can be written as the minimization of the sum of squared residual errors:

(12) |

where , , and are the prior error, odometer error, reprojection error and plane error respectively, , , and are the corresponding covariance matrices, and is the Huber robust cost function. In the following subsections, we provide expressions for these residual errors and introduce the Gauss-Newton optimization method on manifold.
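To make the role of the covariance weighting and the robust cost concrete, the following is a minimal sketch of a Huber cost applied to a squared Mahalanobis error. It assumes a diagonal covariance and one common Huber convention (quadratic up to a threshold `delta`, linear beyond it); the function names, the threshold and the numbers are illustrative assumptions, not values from the paper.

```python
import math

def squared_mahalanobis(residual, variances):
    """Squared Mahalanobis norm of a residual for a diagonal covariance."""
    return sum(r * r / v for r, v in zip(residual, variances))

def huber(squared_error, delta):
    """Huber cost on a squared error: quadratic up to delta, linear beyond."""
    e = math.sqrt(squared_error)
    if e <= delta:
        return squared_error
    return 2.0 * delta * e - delta * delta

# A small residual keeps its full quadratic cost ...
inlier = huber(squared_mahalanobis([0.5, -0.5], [0.25, 0.25]), delta=2.0)
# ... while a large residual is down-weighted to grow only linearly.
outlier = huber(squared_mahalanobis([2.0, 0.0], [0.25, 1.0]), delta=2.0)
```

The effect is that a few badly matched features contribute much less to the total cost than they would under a purely quadratic penalty.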

### IV-B Preintegrated Odometer Measurement

In this section, we derive the odometer preintegration between two consecutive keyframes and by assuming the gyro bias of keyframe is known. We firstly define the rotation increment and position increment in the wheel odometer frame as:

(13) |

Then, using the first-order approximation and dropping higher-order noise terms, we split each increment in (13) into a preintegrated measurement and its noise. For rotation, we have:

(14) |

where . Therefore, we obtain the preintegrated rotation measurement:

(15) |

For position, we have:

(16) |

Therefore, we obtain the preintegrated position measurement:

(17) |
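The idea of accumulating odometer samples into one relative motion between two keyframes can be sketched in a simplified planar setting. This is only an illustration under stated assumptions, not the paper's SO(3) formulation: the robot is assumed to move in a plane, the forward distance per step is taken as the mean of the two wheel distances, and only the gyroscope z-axis rate (minus a known bias) drives the heading. All names are hypothetical.

```python
import math

def preintegrate_planar(samples, gyro_bias_z):
    """Accumulate (dx, dy, dtheta) expressed in the frame of the first keyframe.

    samples: list of (d_left, d_right, omega_z, dt) tuples, one per time-step.
    """
    dx = dy = dtheta = 0.0
    for d_left, d_right, omega_z, dt in samples:
        d = 0.5 * (d_left + d_right)               # forward distance this step
        # Position increment uses the rotation accumulated so far.
        dx += d * math.cos(dtheta)
        dy += d * math.sin(dtheta)
        # Rotation increment uses the bias-corrected angular rate.
        dtheta += (omega_z - gyro_bias_z) * dt
    return dx, dy, dtheta

# Ten steps straight ahead; the gyro reads only its bias, so heading stays zero.
straight = preintegrate_planar([(0.1, 0.1, 0.01, 0.02)] * 10, gyro_bias_z=0.01)
```

Because the result depends only on the raw samples and the bias, it does not need to be recomputed when the keyframe poses change during optimization.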

### IV-C Noise Propagation

We start with rotation noise. From (14), we obtain:

(18) |

The rotation noise term is zero-mean and Gaussian, since it is a linear combination of zero-mean white Gaussian noise .

Furthermore, from (16), we obtain the position noise:

(19) |

The position noise is also zero-mean Gaussian noise, because it is a linear combination of the noise and the rotation noise .

We write (18) and (19) in iterative form, then the noise propagation can be written in matrix form as:

(20) |

or more simply:

(21) |

Given the linear model (21) and the covariance of the odometer measurements noise , it is possible to compute the covariance of the odometer preintegration noise iteratively:

(22) |

with initial condition .
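The iterative covariance computation can be sketched as follows, assuming the usual form for a linear noise model, in which the covariance at the next step is A Σ Aᵀ + B Σ_η Bᵀ, and assuming a zero initial covariance. The 2×2 matrices `A`, `B` and `Sigma_eta` below are placeholder numbers for illustration only and are not the Jacobians derived in the paper.

```python
def matmul(A, B):
    """Matrix product for nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def propagate(Sigma, A, B, Sigma_eta):
    """One propagation step: A * Sigma * A^T + B * Sigma_eta * B^T."""
    return add(matmul(matmul(A, Sigma), transpose(A)),
               matmul(matmul(B, Sigma_eta), transpose(B)))

A = [[1.0, 0.1], [0.0, 1.0]]             # placeholder state-transition matrix
B = [[1.0, 0.0], [0.0, 1.0]]             # placeholder noise-input matrix
Sigma_eta = [[0.01, 0.0], [0.0, 0.04]]   # placeholder measurement noise

Sigma = [[0.0, 0.0], [0.0, 0.0]]         # assumed zero initial covariance
for _ in range(2):                       # two odometer samples between keyframes
    Sigma = propagate(Sigma, A, B, Sigma_eta)
```

The uncertainty grows with every integrated sample, which is why the preintegrated factor between distant keyframes is weighted less than one between close keyframes.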

Therefore, we can fully characterize the preintegrated odometer measurements noise as:

(23) |

### IV-D Bias Update

In the previous section, we assumed that the gyro bias is fixed. Given the bias change , we can update the preintegrated measurements using a first-order approximation. For preintegrated rotation measurement:

(24) |

where . For preintegrated position measurement:

(25) |

where .

### IV-E Preintegrated Odometer Measurement Model

From the geometric relations between two consecutive keyframes and , we get our preintegrated measurement model as:

(26) |

Therefore, preintegrated odometer residual is:

(27) |

### IV-F Gyro Bias Model

Gyro bias is slowly time-varying, so the relation of gyro bias between two consecutive keyframes and is:

(28) |

where is the discrete-time zero-mean Gaussian noise with covariance . Therefore, we can express the gyro bias residual as:

(29) |

### IV-G Odometer Factor

### IV-H Visual Factor

Through the measurement model in (4), the map point expressed in the world reference frame can be projected onto the image plane of the keyframe as:

(31) |

Therefore, the reprojection error for the map point seen by the keyframe is:

(32) |

### IV-I Plane Factor

The x-y plane of the first wheel encoder frame coincides with the physical plane, so the planar measurement in Section III states that the roll and pitch angles and the z-axis translation between frame and should be close to zero. Therefore, we express the plane factor as:

(33) |

### IV-J On-Manifold Optimization

The MAP estimate in (12) can be written in general form on manifold as:

(34) |

We use the retraction approach to solve the optimization problem on manifold. The retraction is a bijective map between the tangent space and the manifold. Therefore, we can re-parameterize our problem as follows:

(35) |

where is an element of the tangent space and the minimum dimension error representation. The objective function is defined on the Euclidean space, so it is easy to compute Jacobian.

For the rigid-body transformation , the retraction at is:

(36) |

where , .

However, since the gyro bias and position of map points are already in a vector space, the corresponding retraction at and are:

(37) |

where and .

We adopt the Gauss-Newton algorithm to solve (34) since a good initial guess can be obtained. Firstly, we linearize each error function in (34) with respect to by its first order Taylor expansion around the current initial guess :

(38) |

where is the Jacobian of with respect to , which is computed in , and . Substituting (38) into each error term of (34), we obtain:

(39) |

where , , are formed by stacking , , respectively. Then we take the derivative of with respect to and set the derivative to zero, which leads to the following linear system:

(40) |

Finally, the state is updated by adding the increment to the initial guess :

(41) |
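The linearize, solve and update loop described above can be illustrated on a toy problem. The sketch below is not the paper's estimator: it assumes a 2D position estimated from range measurements to three hypothetical landmarks, so the state lives in a vector space and the retraction reduces to plain addition. The normal equations H δ = −b are formed with H = Jᵀ Σ⁻¹ J and b = Jᵀ Σ⁻¹ e and solved in closed form for the 2×2 case.

```python
import math

ANCHORS = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]   # hypothetical landmarks
TRUE_POS = (1.0, 2.0)                            # used only to simulate ranges
RANGES = [math.hypot(TRUE_POS[0] - ax, TRUE_POS[1] - ay) for ax, ay in ANCHORS]

def gauss_newton(x, y, sigma2=1.0, iterations=20):
    """Estimate a 2D position by Gauss-Newton on range residuals."""
    for _ in range(iterations):
        h11 = h12 = h22 = b1 = b2 = 0.0
        for (ax, ay), d in zip(ANCHORS, RANGES):
            dist = math.hypot(x - ax, y - ay)
            e = dist - d                                  # residual at the current guess
            jx, jy = (x - ax) / dist, (y - ay) / dist     # Jacobian row
            h11 += jx * jx / sigma2
            h12 += jx * jy / sigma2
            h22 += jy * jy / sigma2
            b1 += jx * e / sigma2
            b2 += jy * e / sigma2
        # Solve the 2x2 linear system H * delta = -b.
        det = h11 * h22 - h12 * h12
        dx = (-b1 * h22 + b2 * h12) / det
        dy = (-b2 * h11 + b1 * h12) / det
        # Update the guess (in a vector space the retraction is addition).
        x, y = x + dx, y + dy
        if math.hypot(dx, dy) < 1e-12:
            break
    return x, y

x_hat, y_hat = gauss_newton(0.5, 0.5)
```

As in the text, a reasonable initial guess matters: Gauss-Newton has no step-size control, so it is best suited to problems where a good prediction is available.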

## V Monocular Visual-odometric SLAM

Our monocular visual-odometric SLAM system is inspired by the ORB-SLAM [6] and visual-inertial ORB-SLAM [19] methods. Fig. 1 shows an overview of our system. In this section, we detail the main changes of our visual-odometric SLAM system with respect to the referenced systems.

### V-A Map Initialization

The map initialization is in charge of constructing an initial set of map points by using the visual and odometer data. First, we extract ORB features in the current frame and search for feature correspondences with the reference frame . If there are sufficient feature matches, we perform the next step; otherwise, we set the current frame as the reference frame. The second step is to check the parallax of each correspondence and pick out a set of feature matches that have sufficient parallax. When the size of is greater than a threshold, we use the odometer measurements to compute the relative transformation between the two frames and triangulate the matched features . Finally, if the number of successfully created map points is greater than a threshold, a global BA that minimizes all reprojection, odometer and plane errors in the initial map is applied to refine the initial map.

### V-B Tracking when Previous Visual Tracking is Successful

Once the initial pose of current frame is predicted using the odometer measurements, the map points in the local map are projected into the current frame and matched with the keypoints extracted from the current frame. Then the pose of current frame is optimized by minimizing the corresponding energy function. Depending on whether the map in back-end is updated, the pose prediction and optimization methods are different, which will be described in detail below. In addition, we provide a detection strategy and solution for wheel slippage. The tracking mechanism is illustrated in Fig. 3.

#### V-B1 Tracking when Map Updated

When tracking is performed just after a map update in the back-end, we firstly compute the preintegrated odometer measurement between current frame and last keyframe . Then the computed relative transformation is combined with the optimized pose of last keyframe to predict the initial pose of the current frame. The reason for this state prediction is that the pose estimate of the last keyframe is accurate enough after performing a local or global BA in the back-end. Finally, the state of the current frame is optimized by minimizing the following energy function:

(42) |

After the optimization, the resulting estimate and Hessian matrix serve as a prior for the next optimization.
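The pose prediction step, in which a relative transformation from the odometer is combined with a previously optimized pose, can be sketched in a planar simplification. This is an illustrative assumption: the system described here works with full rigid-body transformations, whereas the snippet composes SE(2) poses `(x, y, theta)`; the function name and the numbers are hypothetical.

```python
import math

def compose_se2(pose, delta):
    """Apply a relative motion `delta`, expressed in the frame of `pose`."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * dx - s * dy, y + s * dx + c * dy, theta + dtheta)

# Optimized pose of the last keyframe and the preintegrated relative motion.
keyframe_pose = (1.0, 0.0, math.pi / 2)
relative_motion = (1.0, 0.0, 0.1)
predicted_pose = compose_se2(keyframe_pose, relative_motion)
```

The predicted pose is only an initial guess; it is then refined by the optimization that follows.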

#### V-B2 Tracking when no Map Updated

When the map is not changed in the back-end, we compute the preintegrated odometer measurement between the current frame and the last frame , and predict the initial pose of the current frame by applying the relative transformation to the pose of the last frame. Then, we optimize the pose of the current frame by performing a nonlinear optimization that minimizes the following objective function:

(43) |

where the residual is a prior error term of last frame:

(44) |

where , , and are the estimated states and resulting covariance matrix from the previous pose optimization. The optimized result also serves as a prior for the next optimization.

#### V-B3 Detecting and Solving Wheel Slippage

The wheel encoder is an ambivalent sensor: it provides a precise and stable relative transformation most of the time, but it can also deliver very faulty data when the robot experiences slippage. If we perform visual-odometric joint optimization using such faulty data, the optimization tries to simultaneously satisfy the constraints of both the slipped odometer measurements and the visual measurements, which leads to a false estimate. Therefore, we provide a strategy to detect and solve this case. We consider that the current frame has experienced slippage if the above optimization makes more than half of the originally matched features become outliers. Once wheel slippage is detected, we set a slippage flag on the current frame and reset the initial pose of the current frame to the pose of the last frame . Then we re-project the map points in the local map and re-match them with features in the current frame. Finally, the state of the current frame is optimized using only those matched features:

(45) |

After the optimization, the resulting estimate and Hessian matrix of the current frame serve as a prior for the next optimization.
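The slippage test stated above (more than half of the originally matched features become outliers after the joint optimization) and the fallback initial pose can be sketched as follows. The function names and arguments are hypothetical; how matches and inliers are counted in the actual system is not specified here.

```python
def slippage_detected(num_matches_before, num_inliers_after):
    """True if more than half of the original matches became outliers."""
    if num_matches_before == 0:
        return False  # nothing to judge from; assumed to mean "no detection"
    num_outliers = num_matches_before - num_inliers_after
    return num_outliers > num_matches_before / 2

def initial_pose_for_retracking(predicted_pose, last_frame_pose, slipped):
    """On slippage, discard the odometer prediction and start from the last frame."""
    return last_frame_pose if slipped else predicted_pose

slipped = slippage_detected(num_matches_before=100, num_inliers_after=40)
start_pose = initial_pose_for_retracking((2.0, 0.0, 0.0), (1.0, 0.0, 0.0), slipped)
```

In this example 60 of 100 matches are rejected, so the odometer-based prediction is discarded and re-matching starts from the pose of the last frame.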

### V-C Tracking when Previous Visual Tracking is Lost

If visual information is not available in current frame, only odometer measurements can be used to compute the pose of the frame. So in order to obtain more accurate pose estimate, we should make the visual information available as early as possible.

Suppose the previous visual tracking is lost. Then one of three cases applies to the current frame: (1) the robot revisits an already reconstructed area; (2) the robot visits a new environment where sufficient map points are newly constructed; (3) the visual features are still unavailable wherever the robot is. We use a different strategy to estimate the pose of the current frame in each case. For case 1, a global relocalization method as in [6], i.e. using DBOW [7] and the PnP algorithm [24], is performed to compute the pose of the current frame and make the visual information available. For case 2, we first use the odometer measurements to predict the initial pose of the current frame, then project the map points seen by the last keyframe into the current frame and optimize the pose of the current frame using the matched features. For case 3, we use the odometer measurements to compute the pose of the current frame.

When enough features are extracted from the current frame after visual tracking is lost, we first assume that the robot may have returned to an already reconstructed environment, and therefore perform the global relocalization method (the solution for case 1). However, if the relocalization keeps failing until a second keyframe with enough features is selected, which enables the reconstruction of a new map, we assume that the robot has entered a new environment, and localization in the newly constructed map is performed as in the solution for case 2. We deem the visual information available for motion tracking of the current frame when the camera pose is supported by enough matched features. So if the pose computed in case 1 or case 2 is not supported by enough matched features, or too few features are extracted from the current frame, we consider the visual information still unavailable for motion tracking of the current frame and set the pose of the current frame according to the solution for case 3.
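The case selection described above can be summarized as a small decision function. This is a hedged sketch: the argument names, the `None` convention for "not attempted or failed" and the `min_support` threshold are assumptions introduced for illustration, since the text does not give concrete values.

```python
def select_pose_source(enough_features, reloc_support, new_map_support, min_support):
    """Choose how to compute the current pose after visual tracking was lost.

    reloc_support / new_map_support: number of matched features supporting the
    pose from global relocalization (case 1) or from localization in a newly
    built map (case 2); None if that step was not attempted or failed.
    """
    if not enough_features:
        return "odometer_only"                      # case 3
    if reloc_support is not None and reloc_support >= min_support:
        return "relocalization"                     # case 1
    if new_map_support is not None and new_map_support >= min_support:
        return "new_map"                            # case 2
    return "odometer_only"                          # case 3 as the fallback

source = select_pose_source(True, reloc_support=None, new_map_support=45,
                            min_support=30)
```

Relocalization is checked first, matching the order in the text, and the odometer-only solution is the fallback whenever neither visual pose is sufficiently supported.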

### V-D Keyframe Decision

If the visual tracking is successful, we have two criteria for keyframe selection: (1) the current frame tracks less than 50 features than the last keyframe; (2) local BA is finished in the back-end. These criteria insert as many keyframes as possible so that visual tracking and mapping keep working all the time, thereby ensuring good performance of the system.

In addition, if visual tracking is lost, we insert a keyframe into the back-end when one of the following conditions is satisfied: (1) the traveled distance from the last keyframe is larger than a threshold; (2) the relative rotation angle from the last keyframe is beyond a threshold; (3) local BA is finished in the local mapping thread. These conditions ensure that when the previous map is not available and the robot enters a new environment where there are enough features, the system can still build a new map that is consistent with the previous map.
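The keyframe conditions for the lost-tracking case can be written as a single predicate. The threshold values below are placeholders chosen for illustration; the text does not state the values used in the actual system.

```python
def insert_keyframe_when_lost(distance, rotation_angle, local_ba_finished,
                              distance_threshold, angle_threshold):
    """Keyframe decision when visual tracking is lost (any condition suffices)."""
    return (distance > distance_threshold
            or abs(rotation_angle) > angle_threshold
            or local_ba_finished)

# Hypothetical thresholds: 0.3 m of travel or 0.2 rad of rotation.
decision = insert_keyframe_when_lost(distance=0.1, rotation_angle=-0.25,
                                     local_ba_finished=False,
                                     distance_threshold=0.3, angle_threshold=0.2)
```

Here the robot has barely translated but has rotated by more than the angle threshold, so a keyframe is inserted.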

### V-E Back-End

The back-end includes the local mapping thread and the loop closing thread. The local mapping thread aims to construct new map points of the environment and optimize the local map. When a new keyframe is inserted into the local mapping thread, we make small changes to the covisibility graph update and local BA with respect to [19]. If the visual tracking of the new keyframe is lost, we update the covisibility graph by adding a new node for the keyframe and an edge connecting it to the last keyframe, to ensure the ability to build a new map. In addition, visual-odometric local BA is performed to optimize the last N keyframes (the local window) and all points seen by those N keyframes, which is achieved by minimizing the cost function (12) in the window. Note that the odometer constraint linking to the previous keyframe is only constructed for keyframes without the slippage flag. The loop closing thread is in charge of eliminating the accumulated drift when returning to an already reconstructed area, and it is implemented in the same way as in [19].

## VI Experiments

In the following, we perform a number of experiments to evaluate the proposed approach both qualitatively and quantitatively. First, we perform a qualitative and quantitative analysis of our algorithm to show the accuracy of our system in Section VI-A. Then the validity of the proposed strategy for detecting and solving wheel slippage is demonstrated in Section VI-B. Finally, in Section VI-C, we test the tracking performance of the algorithm when the previous visual tracking is lost. The experiments are performed on a laptop with an Intel Core i5 2.2GHz CPU and 8GB of RAM, and the corresponding videos are available at: https://youtu.be/EaDTC92hQpc. In addition, our system is able to work robustly on a Raspberry Pi platform that has a quad-core 1.2GHz Broadcom BCM2837 64-bit CPU and 1GB of RAM, at a processing frequency of 5Hz.

### VI-A Algorithm Evaluation

We evaluate the accuracy of the proposed algorithm on a dataset provided by the authors of [23]. The dataset was recorded by a Pioneer 3 DX robot with a Project Tango device, and provides 640×480 grayscale images at 30 Hz, inertial measurements at 100 Hz and wheel-encoder measurements at 10 Hz. In addition, the dataset provides a ground truth that is computed offline by batch least squares using all available visual, inertial and wheel measurements.

We process images at a frequency of 10 Hz. A qualitative comparison of the estimated trajectory and the ground truth is shown in Fig. 4. The estimated trajectory and the ground truth are aligned in closed form using the method of Horn [25]. We can qualitatively compare our estimated trajectory with the result provided by the approach of Wu et al. [23] in their Figure 6. Our algorithm produces a more accurate trajectory estimate, which is achieved by executing the complete visual-odometric tracking strategies, performing local BA to optimize the local map and closing loops to eliminate the accumulated error when returning to an already mapped area. Quantitatively, the sequence is 1080 m long and the positioning Root Mean Square Error (RMSE) of our algorithm is 0.606 m, which is 0.056% of the total traveled distance, compared with 0.25% for the approach of [23].
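The reported percentage can be checked with a short calculation. The sketch below computes a position RMSE between two already aligned trajectories and expresses an RMSE as a percentage of the traveled distance; the helper names and the two-point toy trajectories are illustrative, while 0.606 m and 1080 m are the figures quoted above.

```python
import math

def position_rmse(estimated, ground_truth):
    """RMSE of position errors between two aligned trajectories of 3D points."""
    squared = [sum((e - g) ** 2 for e, g in zip(pe, pg))
               for pe, pg in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))

def rmse_percent_of_distance(rmse, traveled_distance):
    """Express an RMSE as a percentage of the total traveled distance."""
    return 100.0 * rmse / traveled_distance

# Toy trajectories (illustrative only): one point is off by 2 m along z.
toy_rmse = position_rmse([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                         [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])

# Figures quoted in the text: 0.606 m RMSE over a 1080 m sequence.
reported = rmse_percent_of_distance(0.606, 1080.0)
```

With the quoted figures, `reported` evaluates to about 0.0561, which rounds to the 0.056% stated in the text.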

### VI-B Demonstration of Robustness to Wheel Slippage

In the following experiments, we use data recorded from a DIY robot with an upward-looking OV7251 camera mounted on it for visual sensing. The sensor suite provides 640×480 grayscale images at 30 Hz, and wheel-odometer and gyroscope measurements at 50 Hz. Since there is no ground truth available, we only perform a qualitative analysis.

The wheel slippage experiment is performed in two situations. In the first experiment, we first let the ground robot move normally, then hold the robot so that it is static while the wheels keep spinning, and finally let it move normally again. The estimated results at some critical moments are shown in Fig. 5 and Fig. 6. Fig. (a) is the image captured at the first critical moment, when the platform starts to experience wheel slippage, and the trajectories estimated by our method and by the odometer from the beginning to this moment are shown in Fig. (a). Both methods accurately estimate the position of the sensor suite under normal motion. The image and the estimated trajectory obtained at the second moment, when the wheel slippage is over, are given in Fig. (b) and Fig. (b). The images at the first and second moments are almost the same, and our method gives very close poses for these two moments, whereas the odometer reports positions that are far apart due to the wheel slippage. This supports the validity of the proposed strategy for detecting and solving wheel slippage. The reconstructed 3D map for the sequence is shown in Fig. (c); the map is globally consistent, which is achieved by effectively solving the problem of wheel slippage. The situation is also tested in an environment with artificial lighting and relatively low texture. The intermediate and final results are given in Fig. 7 and Fig. 8, which also demonstrate the robustness of our system to wheel encoder slippage.

The second experiment is performed as follows. The sensor suite first moves normally; then, while the wheels keep turning normally, the platform is manually moved to another location; finally, it moves normally again. The test results for the second situation are shown in Fig. 9 and Fig. 10. Before the sensor is moved away, the trajectories estimated by our method and by the odometer are close to each other, as shown in Fig. (a). Fig. (a) and Fig. (b) are the images captured at the first moment, when the platform starts to be moved, and at the second moment, when the platform has been moved to another location. Compared with the motion estimated by the odometer, the proposed method tracks the movement precisely, as shown in Fig. (b). This again demonstrates the performance of the proposed strategy for wheel slippage.
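
Both slippage scenarios share one signature: the displacement reported by the odometer disagrees with the displacement observed visually. The sketch below illustrates this as a simple consistency check between two frames. It is not the detection strategy of the proposed system, whose criterion is defined in Section V; the function name `slippage_suspected` and the thresholds `abs_tol` and `rel_tol` are hypothetical, and the visual displacement is assumed to be already expressed in metric scale.

```python
import numpy as np


def slippage_suspected(odom_translation, visual_translation,
                       abs_tol=0.05, rel_tol=0.5):
    """Flag a frame pair whose odometer and visual displacements disagree.

    Both inputs are inter-frame translations in meters, in the same frame.
    abs_tol and rel_tol are illustrative thresholds, not values from the paper.
    """
    odom = np.asarray(odom_translation, dtype=float)
    vis = np.asarray(visual_translation, dtype=float)
    disagreement = np.linalg.norm(odom - vis)
    # Scale the tolerance with the larger displacement so that ordinary
    # measurement noise on long steps is not flagged.
    scale = max(np.linalg.norm(odom), np.linalg.norm(vis))
    return bool(disagreement > max(abs_tol, rel_tol * scale))


# Normal motion: both sensors agree.
normal = slippage_suspected([0.10, 0.0], [0.098, 0.002])
# First experiment: wheels spin while the robot is held static.
spinning = slippage_suspected([0.30, 0.0], [0.0, 0.0])
# Second experiment: the platform is carried away while the wheels turn normally.
carried = slippage_suspected([0.10, 0.0], [0.10, 0.80])
```

Under these assumed thresholds, only the two slippage cases are flagged, at which point the odometer measurement would be treated as unreliable for that frame pair.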

### VI-C Demonstration of Tracking Performance when Previous Visual Tracking is Lost

The tracking performance of our system when previous visual tracking is lost is tested on two sequences: sequence 1 includes case 1 and case 3 of Section V-C, and sequence 2 includes case 2 and case 3 of Section V-C. First, we use sequence 1 to test the proposed solution for case 1 and case 3; the estimated results at some critical moments are shown in Fig. 11 and Fig. 12. The robot first moves through areas where enough visual information is available to build a map of the environment, shown in Fig. (a). Then we turn off the lights to make the visual information unavailable. The motion of the robot is continuously computed during the period of visual loss, as shown in Fig. (b), by using the odometer measurements as the solution to case 3. Finally, we turn the lights back on and let the robot revisit an already reconstructed area, at which moment global relocalization is triggered. The reconstructed map at the end of the sequence is shown in Fig. (c); it is globally consistent without closing the loop. This demonstrates the validity of the proposed solution for case 1 and case 3. Furthermore, we conclude that our system is robust to visual loss, thanks to the stable measurements from the odometer.
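
How odometer measurements can carry the pose through a period of visual loss is sketched below as planar dead reckoning. This is a simplification, assuming pure planar motion, a body-frame forward distance from the wheel encoders and a yaw rate from the gyroscope; the system itself uses the odometer preintegration on SO(3) with bias correction. The function name `propagate_planar_pose` is hypothetical.

```python
import math


def propagate_planar_pose(pose, wheel_distance, yaw_rate, dt):
    """Advance a planar pose (x, y, theta) by one odometer sample.

    wheel_distance: forward distance traveled during dt, in meters.
    yaw_rate: gyroscope angular rate about the vertical axis, in rad/s.
    """
    x, y, theta = pose
    dtheta = yaw_rate * dt
    # Midpoint heading: assume the turn is spread evenly over the step.
    heading = theta + 0.5 * dtheta
    x += wheel_distance * math.cos(heading)
    y += wheel_distance * math.sin(heading)
    theta += dtheta
    # Wrap the heading to (-pi, pi].
    theta = math.atan2(math.sin(theta), math.cos(theta))
    return (x, y, theta)


def dead_reckon(pose, samples):
    """Apply a sequence of (wheel_distance, yaw_rate, dt) samples."""
    for wheel_distance, yaw_rate, dt in samples:
        pose = propagate_planar_pose(pose, wheel_distance, yaw_rate, dt)
    return pose


# One second of straight motion at 50 Hz, 2 cm per sample.
straight = dead_reckon((0.0, 0.0, 0.0), [(0.02, 0.0, 0.02)] * 50)
```

Because such dead reckoning accumulates drift, it only bridges the gap until visual information returns and relocalization or loop closure can correct the pose.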

Second, we perform the case 2 experiment on sequence 2; the test results are shown in Fig. 13. The ground robot first moves through areas where enough visual information is available to build the map of the environment, shown in Fig. 13(a). Then the robot enters a low-texture environment, and later enters a new environment where enough features are available. Fig. 13(b) shows that a new map is created when there are enough feature points in the new environment; this map, however, is not consistent with the previously reconstructed map. Finally, the robot returns to an already mapped area, which triggers loop closure to eliminate the accumulated error, so that a globally consistent map is constructed, as shown in Fig. 13(c).

## VII Conclusion and future work

In this paper, we have proposed a tightly-coupled monocular visual-odometric SLAM system. It tightly integrates the proposed odometer factor and the visual factor in the optimization framework to optimally exploit both sensor cues, which ensures the accuracy of the system. In addition, the system uses the odometer measurements to compute the motion of a frame when visual information is not available, and is able to detect and reject false information from the wheel encoder, thereby ensuring the robustness of the system. The experiments have demonstrated that our system can provide accurate, robust and long-term localization for wheeled robots that mostly move on a plane.

In future work, we aim to exploit line features to improve the performance of our algorithm in environments where only a few point features are available. In addition, the camera-to-wheel-encoder calibration parameters are only known with finite precision, which can degrade the results, so we intend to estimate the extrinsic calibration parameters online and optimize these parameters by BA. Finally, we will add full IMU measurements to our system for accurate motion tracking when neither visual information nor wheel odometric information is valid for localization.

## Acknowledgment

## References

- [1] Christian Forster, Luca Carlone, Frank Dellaert, and Davide Scaramuzza. Imu preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. Georgia Institute of Technology, 2015.
- [2] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, June 2007.
- [3] Guoquan P. Huang, Anastasios I. Mourikis, and Stergios I. Roumeliotis. A first-estimates jacobian ekf for improving slam consistency. Experimental Robotics. Springer, Berlin, Heidelberg, pages 373–382, 2009.
- [4] J. A Hesch and S. I Roumeliotis. Consistency analysis and improvement for single-camera localization. In Computer Vision and Pattern Recognition Workshops, pages 15–22, 2013.
- [5] G. Klein and D. Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234, Nov 2007.
- [6] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, Oct 2015.
- [7] Dorian Galvez-Lopez and Juan D. Tardos. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, 2012.
- [8] H. Strasdat, J. M. M. Montiel, and A. J. Davison. Scale drift-aware large scale monocular slam. In Robotics: Science and Systems (RSS), June 2010.
- [9] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. Dtam: Dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pages 2320–2327, Nov 2011.
- [10] C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 15–22, May 2014.
- [11] Jakob Engel, Thomas Schops, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European Conf. on Computer Vision, pages 834–849, 2014.
- [12] R. Mur-Artal and J. D. Tardós. Probabilistic semi-dense mapping from highly accurate feature-based monocular slam. In Robotics: Science and Systems (RSS), July 2015.
- [13] Pedro Pinies, Todd Lupton, Salah Sukkarieh, and Juan D. Tardos. Inertial aiding of inverse depth slam using a monocular camera. pages 2797–2802, 2007.
- [14] Markus Kleinert and Sebastian Schleith. Inertial aided monocular slam for gps-denied navigation. In 2010 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 20–25, Sep 2010.
- [15] Eagle S Jones and Stefano Soatto. Visual-inertial navigation, mapping and localization: A scalable real-time causal approach. International Journal of Robotics Research, 30(4):407–430, 2011.
- [16] A. I. Mourikis and S. I. Roumeliotis. A multi-state constraint kalman filter for vision-aided inertial navigation. In Proceedings IEEE International Conference on Robotics and Automation, pages 3565–3572, April 2007.
- [17] Mingyang Li and A. I. Mourikis. High-precision, consistent ekf-based visual-inertial odometry. 32(6):690–711, 2013.
- [18] Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. International Journal of Robotics Research, 34(3):314–334, 2015.
- [19] R. Mur-Artal and J. D. Tardós. Visual-inertial monocular slam with map reuse. IEEE Robotics and Automation Letters, 2(2):796–803, 2017.
- [20] Peiliang Li, Tong Qin, Botao Hu, Fengyuan Zhu, and Shaojie Shen. Monocular visual-inertial state estimation for mobile augmented reality. In 2017 IEEE International Symposium on Mixed and Augmented Reality(ISMAR), pages 11–21, Oct 2017.
- [21] Tong Qin, Peiliang Li, and Shaojie Shen. Vins-mono: A robust and versatile monocular visual-inertial state estimator. arXiv, abs/1708.03852, 2017.
- [22] J. Civera, O. G. Grasa, A. J. Davison, and J. M. M. Montiel. 1-point ransac for ekf filtering. application to real-time structure from motion and visual odometry. In J. Field Rob, volume 27, pages 609–631, 2010.
- [23] Kejian J. Wu, Chao X. Guo, Georgios Georgiou, and Stergios I. Roumeliotis. Vins on wheels. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5155–5162, May 2017.
- [24] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o(n) solution to the pnp problem. International Journal of Computer Vision, 81(2):155–166, 2009.
- [25] Berthold K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4(4):629–642, 1987.