Spatiotemporal Decoupling Based LiDARâCamera Calibration under Arbitrary Configurations
Abstract
LiDARcamera calibration is a precondition for many heterogeneous systems that fuse data from LiDAR and camera. However, the constraint from the common field of view and the requirement for strict time synchronization make the calibration a challenging problem. In this paper, we propose a hybrid LiDARcamera calibration method aiming to solve these two difficulties. The configuration between LiDAR and camera is free from their common field of view as we move the camera to cover the scenario observed by LiDAR. 3D visual reconstruction of the environment can be achieved from the sequential visual images obtained by the moving camera, which later can be aligned with the single 3D laser scan captured when both the scene and the equipment are stationary. Under this design, our method can further get rid of the influence from time synchronization between LiDAR and camera. Moreover, the extended field of view obtained by the moving camera can improve the calibration accuracy. We derive the conditions of minimal observability for our method and discuss the influence on calibration accuracy from different placements of chessboards, which can be utilized as a guideline for designing highaccuracy calibration procedures. We validate our method on both simulation platform and realworld datasets. Experiments show that our method can achieve higher accuracy than other comparable methods.
The camera coordinate system.
The laser coordinate system.
A visual 3D point on the chessboard in .
A visual 3D point on the chessboard in .
A visual 3D point of a landmark in .
A laser 3D point in .
The normal vector of a laser point.
The pose of the camera in .
The extrinsic parameters of w.r.t. .
I Introduction
A perception system that employs only one sensor will not be robust. For example, LiDARbased odometry [1] will fail when working in a long corridor, and the camerabased algorithm [2] , [3], [4] cannot be applied to a textureless scene [5]. Fusing the visual and laser information can eliminate the outliers from the algorithm, and solve various limitations for the algorithms imposed by the single sensor. For example, the fusion of the range sensor and the camera can improve the accuracy of object detection [6]. What’s more, heterogeneous localization methods, such as visual localization on a laser map [7], can enable lowcost and longterm localization. The precondition of all the above algorithms is the calibration of different sensors, and to that end, we focus on extrinsic calibration of the LiDAR and camera in this work.
Numerous efforts have been carried out to perform LiDARâcamera extrinsic calibration [8], [9], [10]. The current calibration approaches can be classified into two groups [11]: one is appearancebased and the other is motionbased. The appearancebased methods can obtain the extrinsic parameters by directly matching 2D images with 3D points on the laser point cloud. In the motionbased methods, the motion of the camera is estimated from images, while the motion of the LiDAR is estimated from the laser points, and then calibration is performed by aligning the two trajectories.
First, we will consider the appearancebased methods. Methods such as [12], [13] use targets that can be detected on both 2D images and 3D laser point clouds. However, these algorithms always require the sensors sharing a common field of view, which some application scenarios cannot satisfy. Even in the application scenario where the condition is met, the requirement of the common field of view constrains the scale of the scene and limits the number of targets that can be detected, thus affecting the accuracy of the calibration, which prevents the utilization of pinhole cameras from the LiDARcamera system. In some methods, panoramic or wideangle cameras are used to solve this problem [14]. In comparison, the motionbased methods [15], [16] perform calibration by aligning the estimated motion trajectories. Yet it is not easy to perform time synchronization of the data obtained by the two sensors. In some scenarios where time synchronization is not completed, an additional variable (i.e. time offset) should be introduced such as in [15], which increases the number of variables to be optimized. Ultimately, the motionbased method is a loosely coupled calibration method that cannot lead to high calibration accuracy.
In this work, we propose a hybrid calibration method, which combines the advantages of appearancebased calibration and motionbased calibration. The demonstration of proposed method is shown in Fig. 1. In our method, a number of chessboards in various poses are placed around the sensors, and one frame laser scan of the chessboards is obtained under stationary. Then the sensors are moved around to obtain images of each chessboard to reconstruct the visual 3D point cloud. Note that this differs from previous approaches [8], [9] which require multiple images and the LiDAR data of a single chessboard presented at different poses as inputs; the hidden limitation of these methods is that a common field of view between sensors is needed. Our method expands the camera’s field of view by moving the sensor, so even though there is no common view at the starting position, the LiDAR and the âexpanded cameraâ can also have overlap in their measurement ranges, which removes the configuration limitation for a common field of view. Moreover, the extended field of view obtained can remove the constraints of the observed scale of the scene and increase the number of chessboards that can be detected, which can lead to an increase in accuracy. Additionally, since the LiDAR and camera remain stationary during acquisition of the laser data, there is no time synchronization problem in our system, and no need to introduce a time offset. As part of our contribution, we also examine the observability properties of our system and present the minimal necessary conditions for concurrently estimating the LiDARcamera extrinsic parameters. Further, we derive the influence of the angle and distance between chessboards on the calibration accuracy, which proves that sharing a larger field of view between sensors is beneficial for better calibration accuracy. The relevant theory provides a guideline for designing highaccuracy calibration procedures. The contributions of this paper are shown below:

We remove the limitation of sharing a common field of view by expanding the field of view of the camera, and use spatiotemporal decoupling to make our method suitable for situations without time synchronization, and to improve the accuracy of the calibration.

We theoretically prove the observability of the calibration system and present the minimum necessary conditions for calibration, which provides a practical guideline.

We derive the influence of the angle and distance between chessboards on the calibration accuracy, which proves that the extended field of view improves the accuracy and convergence of calibration.
This work is structured as follows. The next section starts with a discussion of the related work. Section III gives a detailed description of the proposed method. Then we prove the theory in Section IV and Section V and evaluate it in Section VI. We present our conclusions in Section VII.
Ii Related Work
Our work is related to appearancebased calibration and motionbased calibration, and the camera has been calibrated by default [17], [18].
Iia Appearancebased calibration
Appearancebased LiDARcamera calibration methods refer to matching the 2D feature points detected on the image with the 3D points in the scene. There have been many methods based on point correspondences, such as [19]. When the point correspondences are given, the extrinsic parameters can be calculated by methods solving for perspectivenpoints (PnP) [20]. However, the point correspondences are not easily obtained because manually identifying the corresponding 3D feature points in the point cloud is very difficult. Later on, [21] presented a method to automatically calibrate the extrinsic parameters with one shot of multiple chessboards, which recovered the 3D structure from the detected image corners. After that, the approach used the constraint that the chessboard planes should coincide with the detected LiDAR planes to perform calibration. The method was applied in the KITTI dataset [22] to calibrate the extrinsic parameters between the cameras and the LiDAR sensor. However, when recovering the 3D structure of the detected corners of multiple chessboards, the method requires the sensors to share a sufficiently common field of view with the chessboards. Additionally, in order to obtain enough matching information for calibration, the camera must be placed far away from the chessboards, resulting in fewer laser lines hitting the chessboard, which limits this method in achieving high calibration accuracy on sparse LiDAR. Some methods lead to the tedious focus process in order to expand the field of view such as [10].
Unlike the approaches above, [23] utilizes the reflectance intensity to estimate the corners of the chessboard from the 3D laser point cloud. If the corners of the 3D laser point cloud are identified, the extrinsic calibration is converted to a 3D2D matching problem. However, the noise and sparsity of the point cloud data present a challenge in this approach. Moreover, like the other appearancebased methods, this method has a limitation where there must be a shared common field of view. Even in the application scenario where the condition is met, the requirement of the common field of view constrains the scale of the scene and limits the number of targets that can be detected, thus affecting the accuracy of the calibration.
IiB Motionbased Calibration
The motionbased extrinsic calibration method does not require the assumption that there is a shared common field of view. In the motionbased LiDARcamera calibration method [24], the motions of the sensors are estimated respectively and calibration is performed by aligning the estimated motion trajectories. Each sensor motion is calculated respectively and the extrinsic parameters are derived numerically from each motion. Early motionbased calibration methods were based on handeye calibration [25]. In [11], the initial extrinsic parameters are obtained from scalefree camera motion and LiDAR motion. Next, the camera motion is recalculated using the initial extrinsic parameters and the point cloud from the LiDAR, and then the extrinsic parameters are calculated again using the motion, and this is repeated until the estimate converges. However, the motionbased method is a loosely coupled calibration method that cannot lead to high calibration accuracy. In addition, the motionbased calibration method needs to complete time synchronization before performing calibration, which is not easy in some cases. In scenarios where time synchronization is not completed, an additional variable(i.e. time offset) should be introduced. In [15], they propose a method to obtain the motion of a sensor in 2D3D calibration and estimate the extrinsic parameters and the time offset between the sensors. Obviously, introducing new variables will reduce the calibration accuracy.
Iii Spatiotemporal Decoupling Calibration
Our optimization method is tightly coupled, and we use spatiotemporal decoupling to perform calibration, which also means that our method is applicable to situations where there is no time synchronization and no additional variables introduced. Additionally, in order to remove the configuration limitation, our method reconstructs visual point clouds from continuous images, which also expands the camera’s field of view and improves calibration accuracy.
An overview of our method is shown in Fig. 2. This method can be roughly divided into two steps. In the first step, we use the region growing method to segment the obtained laser data into several point cloud planes associated with the chessboard planes, and we use the corner extraction method to extract the chessboard corner points from the camera images and then reconstruct the visual 3D point clouds. Then we construct pointtoplane optimization equations from the visual points to the LiDAR plane, and estimate the extrinsic parameters using the GaussâNewton method [26]. In the following Section IIIA, we will describe how both sets of points can be generated, and be aligned to get the extrinsic parameters in Section IIIB.
Iiia LiDAR And Visual Chessboard Plane Extraction
Our method does not limit the calibration target, as long as it is a flat textured plate; for example, the AprilTag [27] calibration plate can be used as a calibration target. The reason we choose the chessboards as the calibration cues is that they are easy to obtain and cheap, which is same as [17], [28]. Compared to existing methods for capturing multiple sets of images and LiDAR data for a single chessboard, we disperse multiple chessboards in space arranged in various poses.
LiDAR Plane Extraction
This step describes how to extract the chessboard plane from one frame scan of the LiDAR. We perform the segmentation by growing and clustering the points in the laser data to several point sets which potentially correspond to the chessboard planes. First, the normal vector is computed for each laser point . Second, several seed points are randomly selected from the laser points. Then we grow each random seed point greedily into each region represented by a corresponding point set . The growing principle is that if a point is a neighbor of the seed and its normal vector is similar to the seed’s normal, the point will be added to the corresponding set of the seed.
After using the region growing method, we get several hypotheses of point sets extracted from the laser data . Next, we filter out a subset where each point set represents a chessboard plane, which means the planes which are either insufficiently planar or significantly smaller than a chessboard will be removed. After this part, any mentioned is a laser point on the chessboard.
Visual Plane Extraction
To extract the 3D points of the chessboard from continuous camera images, we use the following method to extract the chessboard pixels and reconstruct them into 3D points. First, we run ORBSLAM [29] using the acquired camera image to obtain the pose of the camera at each moment and the position of the 3D points in the map. In order to distinguish the 3D points of the chessboard, we extract the corners of the chessboard while processing each image. After detecting the corners of the chessboard, we mark the corresponding 3D points in the map. In the case of monocular camera, we obtain scale by measuring the size of chessboard square, while in the case of stereo camera, the scale is known. Because of the existence of pixel point observation error, we set a reconstruction score for each marked pixel point indicating the reconstruction quality:
(1) 
where is the number of keyframes that observe the point, and is the uncertainty of the depth of the point, and is the a scale factor chosen to keep the expected value of and approximately equal.
The pixel points marked with low scores will be discarded and are not part of the later process. After obtaining the filtered chessboard reconstruction corner points, we perform bundle adjustment to ensure that the resulting 3D points of the chessboard are relatively accurate in space. The visual 3D points obtained by this method can lead to a larger common field of view between the two sensors, which removes the configuration limitation and is shown to improve the calibration accuracy compared with the appearancebased calibration method in Section V.
Our calibration system can be considered to be a visual localization on the laser map (i.e. the first frame of the laser). That is to say, the camera’s pose for the first frame is , which is equal to , so the global frame is the laser frame.
IiiB Optimization For Calibration
The motionbased method perform calibration by aligning two trajectories which has a time synchronization problem. In order to explore the effects of time synchronization, we assume that the pose of LiDAR in at time can be denoted by , and . And the pose of the camera in at time is , and . In the ideal case, there would be an equation like this:
(2) 
Because of the existence of time synchronization error, the equation becomes:
(3) 
where is the pose of LiDAR in at time , and is a small time offset. However, in our system, the two sensors remain stationary during laser data acquirement as shown in Fig. 3, so we can obtain:
(4) 
In this way, our calibration method is applicable for cases without time synchronization and will not introduce additional variables compared to the motionbased calibration method.
So far, 3D LiDAR chessboard points and 3D visual chessboard points have been obtained from the LiDAR data and continuous images, respectively. Next, we need to optimize the extrinsic parameters through the correspondences between these point clouds. First of all, the data association needs to be performed. These point clouds obtained in Section IIIA are not onetoone relevant. In fact, there is no need for pointtopoint correspondence in our method; what we need is just chessboardtochessboard correspondence. Because the point clouds of the chessboard are very sparse, it is simple to get chessboardtochessboard correspondence. To obtain the data association, rough extrinsic parameters are obtained by mechanical parameters as the initial value of . Then we use a Kdimension tree structure (KDtree) to search for the nearest 3 laser points (i.e. , , , and ) for each and save the above data association.
After the data association we need to filter the point pairs before putting them into the optimization process. We use the laser information to delete the visual points that should not be involved in optimization. Our score function is based on the distance from the visual point to the associated laser plane.
(5) 
where , , and are the normal vectors corresponding to , , and . We remove with scores less than a certain threshold.
Then we optimize the remaining points. As shown in Fig. 4, the state variables of the system are the camera state (i.e. the position of the keyframe) , and landmark (the point belonging to the chessboard is represented as ). and represent the point sets of and , respectively, and their relations are . The cost function used to optimize the state variables is derived as
(6) 
where represents the feature reprojection error for the th camera pose and the th feature point.
(7) 
where is the Huber robust cost function, and is the projection function that projects onto the image under pose , denotes the corresponding image feature point. is the information matrix of the reprojection error. stands for the pointtoplane error term for th feature point and th laser point.
(8) 
where is the information matrix of the pointtoplane error.
We solve this optimization problem with the GaussâNewton algorithm implemented in g2o [30]. After the optimization we get the resulting calibrated extrinsic parameters.
Iv Observability Analysis
In this work, we propose a LiDARâcamera calibration method based on 3D SLAM and discuss the observability of the system. Unlike the general SLAM system, our system adds new observations that will change the system’s observability. In contrast to the observability analysis of LiDARLadybug calibration [14], our method uses a camera to obtain continuous images to reconstruct the 3D point cloud and analyzes the observability from the perspective of the dynamic system. Our idea can be divided into two steps. In the first step, we analyze the observability matrix and nullspace of the standard SLAM system. In the second step, we substitute the nullspace of the SLAM system into our calibration system, and examine the dimensions that remain unobservable.
Iva Observability Of Standard SLAM
We follow the observability analysis of [31] about 3D EKFbased VIO and their derivation about rotationerror propagation equation. However, our positionerror propagation equation is different from [31], and there is no inertial measurement unit (IMU) in our system. In our analysis, the laser frame is set to global frame of SLAM system, and the pose of camera is expressed as the translation and the rotation in quaternion. Thus, the state vector at time is given by
(9) 
where is the unit quaternion representing the rotation from the laser frame to the camera frame at time , and are the camera position and landmark position in the laser frame. As for the error state of position, we exploit the standard additive error definition (e.g. ). And we define the rotation error based on the quaternion :
(10) 
where denotes quaternion multiplication, is a small rotation used to transform the estimated laser frame to match with the true one. As proposed in [31], we rewrite to obtain a minimal 3dimensional representation for this rotation:
(11) 
where is a vector describing the rotation errors about the three axes. With the above error definition, the errorstate is defined as:
(12) 
We now turn attention to computing the errorstate transition matrix. We note that the rotationerror definition satisfies:
(13) 
where is the rotation between camera and laser frame. And in the time interval , rotationerror satisfies:
(14) 
From (13), we can obtain:
(15) 
where is a error vector. Therefore, we can obtain the linearized expression for the rotationerror propagation:
(16) 
where is the shorthand notation of . Additionally, the position propagation equation is:
(17) 
In order to get the linearized positionerror propagation, we use (13) to linearize (17):
(18) 
where denotes the error in . Since the landmark is static, its state estimate is invariant with time. Similarly, the landmarkerror propagation is derived:
(19) 
According to (16), (18) and (19), we can obtain:
(20) 
where is the number of landmarks, can be represented as , is the errorstate transition matrix, is the noise process. Different from [31], our errorstate transition matrix has only three variables. Additionally, the measurement Jacobian matrix at time for feature is given by:
(21) 
where the Jacobian of reprojection measurement with respect to the feature position and camera pose are given by:
(22) 
(23) 
where is the estimate of the feature position with respect to the camera. Then we analyze the observability matrix, the observability matrix [32] for the time interval between timestep and is defined as:
(24) 
For the block row of the observability matrix corresponding to the observation of feature at time between timestep and :
(25) 
(26) 
In the above equations, we have used the symbol to denote the matrix of general SLAM system. At this point, if we define the nullspace as follows:
(27) 
It is easy to verify that . Since this holds for any and any (i.e. for all block rows of the observability matrix), we conclude that . That is to say, the general 3D SLAM system is unobservable, with unobservable 6DoF corresponding to the global pose of the state vector.
IvB Observability Of Our Calibration System
Comparing with the standard system, our system just adds a new pointtoplane error observation. So the errorstate transition matrix is the same as the standard one. Turning to the measurement Jacobian matrix , besides the feature reprojection error, there is pointtoplane error in our system. Thus, the measurement Jacobian matrix at time for feature is given by:
(28) 
where refers to the Jacobian matrix of the pointtoplane measurement with respect to the camera pose and feature position. Then the observability matrix of feature at time becomes:
(29) 
What follows is the analysis about the minimal necessary conditions of the chessboards setting to solve the accurate 6DoF extrinsic calibration problem.
Observation of one plane: Suppose there are enough points on each chessboard plane, the observability matrix of chessboard’s feature and feature at time becomes:
(30) 
Then, we can obtain:
(31) 
where , ,. Next, we perform an elementary linear transformation on the above equation. That is, both sides of (31) are multiplied by matrix .
(32) 
(33) 
As shown, there are three columns of that become all zeros. So the second, third and sixth columns of are the nullspace of , we denote it as and can get . Since this holds for any , and any , we conclude that . Note that the first three columns of correspond to global translations of the state vector, while the last three columns to global rotations. Therefore, when observing only one plane, any translation parallel to the plane’s normal and any rotation around the plane’s normal vector is unobservable.
Observation of two planes: the observability matrix of feature and feature at time from two chessboards, described by ,and :
(34) 
Then, we can obtain:
(35) 
where ,.
Next, we perform an elementary linear transformation on the above equation. That is, both sides of (35) are multiplied by matrix .
(36) 
(37) 
(38) 
As shown, the third column of the is the nullspace of , we denote it as and can get . Therefore, when observing two planes, one degree of freedom of the translation is unobservable.
Observation of three planes: Similar to the previous derivation process, we conclude that when three planes with noncollinear normal vectors are observed, we can determine all the unknowns. That is to say, our calibration system is observable. The above observability analysis proves that at least three chessboards are needed to calibrate the 6DoF extrinsic parameters. In other words, a larger common field of view between the two sensors is needed, which is difficult for the appearancebased calibration method.
V Placement Of Chessboards
The visual information in this method is represented by a point cloud in space of a set of sparse visual 3D chessboards. The previous analysis concluded that in order to calibrate the 6DoF extrinsic parameters, 3 chessboards were needed at least. How to place these 3 chessboards in space to get the most comprehensive visual information and better calibration accuracy is what we will discuss next. In the appearancebased method, the increase in the number of chessboards has little effect on the calibration accuracy, so we investigate the effect of the placement of the chessboards on the calibration accuracy. The following theory can provide a guideline for designing highaccuracy calibration procedures.
To simplify the problem we are analyzing, we consider that the visual 3D points reconstructed by our method are one frame of visual data. We derive the problem in 2D and consider each two chessboards which can still provide insights into realworld applications. Analyzing the calibration accuracy refers to analyzing the uncertainty of the pointtoplane error function, which can be represented as a determination of the corresponding Hessian matrix. The larger the determinant of the Hessian matrix, the smaller the uncertainty. More specifically, we are going to explore the influence of the angle and distance between two chessboards respectively.
Va Angle Between Chessboards
As shown in Fig. 5, the angle of chessboard relative to chessboard is , the angle of relative to is , and is reduced to 3DoF represented by
(39) 
Under the simplified condition, the pointtoplane error item can be rewritten as:
(40) 
where is the visual 2D point in the camera coordinate system, and are the 2D extrinsic parameters and is the laser 2D point in the laser coordinate system, is the 2D normal vector of .
Since the GaussâNewton method [26] is one of the simplest and most versatile methods for optimization algorithms, our proof uses this method to solve the gradient descent direction. The GaussâNewton method uses as an approximation of the secondorder Hessian matrix. For our system, the Jacobian matrix of the error function relative to the extrinsic parameters , is derived as follows:
(41) 
where denotes the derivatives of with respect to . Further, the Hessian matrix is as follows:
(42) 
In order to explore the placement of the calibration plate, we have further simplified the situation. Assuming that the normal vector of the calibration plate is
(43) 
Then the normal vector of the calibration plate is
(44) 
Thereafter, we take two points and from the calibration plate , and one point from the calibration plate . Then, for these three pairs of points, the Hessian matrix of the error function that needs to be jointly optimized is:
(45) 
Finding the determinant of the Hessian matrix and simplifying it we can obtain:
(46) 
where is the second row of , and is the second row of .
It can be seen that when , the determinant of the matrix takes the maximum value, that is, when the two plates are at , the uncertainty is the smallest. Obviously, it is difficult to observe the obverse sides of the above two chessboards at the same time for the common field of view provided by the appearancebased method.
VB Distance Between Chessboards
The conclusion of Section VA is that the calibration error uncertainty is the smallest when the two chessboards are placed orthogonal to each other. In this section, when the angles of the two chessboards are fixed, we will discuss the effect of the distance between two chessboards.
However, the distance between the two chessboards is not directly involved in the optimization; in fact it is the points selected on the chessboard that are pertinent to the optimization. Therefore, the problem is equal to how the distance between the points influences the calibration accuracy, based on the premise that the existence of measurement errors is not considered in our derivation.
Again, we simplify the problem to 2D situation. One should notice that the reference frame of the normal vectors is the laser frame. Assume that the angle of calibration plate relative to the axis of is , and the angle of calibration plate with respect to the axis of is . Combining the conclusions of Section VA, and to simplify the derivation, we set and . Therefore, the normal vector of calibration plate is:
(47) 
The normal vector of calibration plate is:
(48) 
As before, we take two points and from calibration plate and one point from calibration plate . The Jacobian matrix of the error function relative to the extrinsic parameters , is as follows:
(49) 
Then, the Hessian matrix is:
(50) 
Solving the determinant of the Hessian matrix and simplifying it, we can get
(51) 
where , , and is the value projected into the axis of the laser coordinate system. is also a value projected into the axis of the laser coordinate system. The illustration of (51) can be seen in Fig. 6.
Therefore, we can find that the farther the projection distance on the axis is, the smaller the calibration error uncertainty is. Furthermore, since the size of the board cannot be infinitely large, multiple calibration plates are necessary to improve the calibration accuracy. When there are multiple calibration plates in the space, the angle between each two calibration plates should be as close as possible to for better calibration accuracy in 2D. Therefore, the pose between each two calibration plates should be as different as possible to reduce the calibration error.
Comments

Combining the analysis of observability and the minimal necessary conditions for calibration we conclude that at least three chessboards are required and more chessboards can lead to better calibration accuracy.

With the same number of chessboards, a scattered placement is better than a centralized one, which is expected to be true in 3D.

The extended camera field of view obtained by our method meets the requirement of observing multiple chessboards, which is difficult in those methods that keep the sensors stationary.

Observing multiple chessboards arranged in various poses in the expanded the field of view also gives our method an advantage compared to the appearancebased method.
Vi Experimental Results
In order to evaluate our method, we performed simulation verification and realworld experiments separately. In the simulation experiments, we showed that the placement theory derived in Section VI is reasonable by performing our calibration method with chessboards placed in scattered and centralized arrangements, respectively. Then compared with the other methods, the results show the advantage of the proposed method on the calibration accuracy. With regard to the evaluation criteria, the simulation environment provides the ground truth for the LiDAR and the camera’s extrinsic parameters, so we could directly evaluate the calibration accuracy. We built a simulation environment in VREP, using a stereo vision sensor and a Velodyne VLP16 LiDAR to obtain data as shown in the top of Fig. 7
In the realworld experiments, the comparison with other methods is also performed demonstrating the practicality of the proposed method. With regard to the evaluation criteria, the ground truth of the extrinsic parameters is not available. In order to evaluate the result of the estimated extrinsic parameters, we divided the collected data into two groups, one for training the pointtoplane error with different extrinsic results obtained by different methods, and the other for testing the accuracy. As shown in the bottom of Fig. 7, we fixed two Pointgrey cameras with a Velodyne VLP16 LiDAR on the robot to perform the realworld experiments and the cameras were calibrated by default.
Via VREP Simulation
Theoretical Verification
In the simulation environment, we first verify the theoretically derived conclusions. Due to the existence of observation errors, the angle between the chessboards and the distance between the chessboards are highly coupled, and it is impossible to perform strict control variables to verify the influence of angle and distance separately. We only verify the final conclusions derived from the theory. As shown in Fig. 8 (a) and (b) we placed four chessboards around the sensor, and used our method to obtain data for calibration. In order to simulate the situation in 2D as much as possible, four chessboards are placed vertical to the ground, and one set is centralized in front of the field of view, the other is scattered around the sensor.
The final calibration is expressed as rotation , and translation . , were compared against ground truth , , which were obtained from VREP. Following [21], for translation error, we computed in meters. For rotation error, we first computed the relative rotation and represented it in degrees.
The results, depicted in Fig. 9, indicate that the calibration accuracy is better when the angle between the two chessboards is as shown in Fig. 8 (b), which is consistent with the conclusion of Section VA. This placement method does not provide sufficient dimensional constraints in 3D, when we use four chessboards vertical to the ground to calibrate. Since our chessboards are not completely vertical to the ground during the experiment, the existing placement error makes it possible to calibrate the rough extrinsic parameters. As shown in Fig. 8 (a) and (b), in order to obtain a larger common field of view between the two sensors to observe four chessboards, the camera’s extended field of view obtained by our method is needed.
Accuracy Comparison
Next, we compared our method with the KITTI single shot method [21] on calibration accuracy. We placed seven chessboards in various poses as shown in Fig. 8 (c), and obtained data for the KITTI single shot calibration method to calibrate. The KITTI single shot calibration method can automatically give the extrinsic calibration results in one acquisition, which is convenient to use. The method requires placing multiple chessboards in front of the field of view and obtaining the LiDAR and camera data in one shot, respectively. However, if one wants to obtain as much data as possible from the chessboards, the sensor should be placed much farther away from the chessboards which can be easily seen from Fig. 8 (c). Once the sensor is too far from the chessboards, it is often difficult to extract the corner points from the obtained camera image, and the laser lines hitting the chessboards are also reduced. Another main limiting assumption of the KITTI approach is the common field of view between the camera and the LiDAR.
As shown in Fig. 8 (d) top and bottom, we then obtained two sets of data for seven chessboards concentrated and scattered in a certain field of view and evaluated our calibration method. Unlike the KITTI single shot method, when we captured the camera images, we moved the robot around nearly to obtain images of each chessboard, and then reconstructed the global visual 3D points in the space. In the theoretical derivation part, we explored the influence of the placement of the chessboards on the calibration accuracy, and concluded that the scattered placement of the chessboards is beneficial for improving the calibration accuracy. The results, depicted in Fig. 10, indicate that the calibration results of the scattered layout of seven chessboards are better than the centralized arrangement, which is consistent with the conclusion of Section V. Our method achieves better calibration results than the KITTI single shot method, because the camera’s field of view limits the number of laser lines hitting the chessboards and the number of observed chessboards in the single shot method.
ViB RealWorld Experiment
We conducted realworld experiments and sufficiently compared the three different calibration methods: the KITTI single shot calibration method [21]; MO methods: multiple images and LiDAR data of a single calibration target presented in different directions as input which is fussy due to the need to move the chessboard, similar to [8], [9]; and the motionbased calibration method: based on trajectory alignment.
We obtained one set of data for the KITTI single shot, two sets of data in which the chessboards were placed scattered and centralized for our method, two sets of data for the motionbased calibration, and one set of data for the MO calibration method, including 67 corresponding laser scans and camera image data under different poses. The 67 pairs of laser scans and camera images were divided into two parts. The first 30 data pairs were used for the MO method to calibrate the extrinsic parameters, and the last 37 remaining data pairs were used to test the accuracy of all methods. As mentioned before, we evaluated the calibration results estimated by different methods with the pointtoplane error using the same data. The calibration error is shown in Fig. 11.
We made two experiments for the motionbased method, one is called the Motionbased nonsync method, for which we did not complete the hardware synchronization between the two sensors. The other is called the Motionbased with sync method, for which we completed a rough hardware synchronization by finding the nearest neighbor on the timestamps of the two sensors. For the motionbased calibration method, time synchronization is not easy; even if a rough time synchronization is completed, the calibration result is not as good as with the appearancebased method.
The KITTI single shot calibration method does not give a very accurate result in the case of the Velodyne VLP16 LiDAR. The calibration results of our scattered placement calibration method are better than the calibration results of the centralized placement, which is consistent with the conclusion of Section V. The traditional appearancebased method is limited by the narrow common field of view, and the calibration accuracy is not as good as our method. This result proves once again that a larger common field of view between the two sensors obtained by our method can lead to a better calibration accuracy.
To discuss the calibration results of the MO calibration method, we did another set of experiments. The extrinsic calibration was performed using the data of 10 pairs (MO10), 20 pairs (MO20), and 30 pairs (MO30), respectively. It can be seen that with more data used for calibration, the accuracy of the MO calibration method also increases and the MO calibration method achieves results that are only worse than our calibration method when using 30 angles. The MO method is time consuming to use and converges more slowly than our method. Since the MO method needs to have a common field of view between the sensors, this limits its use to some extent.
In summary, our method has the best calibration accuracy, and we only used the information of five chessboards. First, the extended field of view obtained by our method can improve the calibration accuracy, and our method can be applied to a case with an arbitrary configuration, which is beneficial and necessary in practical use. For example, for a robot used for laser localization and navigation, it is necessary to fix a camera aimed diagonally downward to solve the LiDARâs blind spot. This sensor fixing method results in no common field of view. Second, our method is decoupled in space and time, so it is applicable to the cases lacking time synchronization and will not introduce additional variables. That is to say, our error term does not include laser motion estimation error and time offset error. Third, as shown by our experiments, the calibration accuracy is higher in the case where the calibration plates are placed in a scattered manner, which is consistent with our theoretical derivation.
Finally, we show the results of our method of the laser data reprojected into the image, which allows us to see the accuracy of the calibration more intuitively, as shown in Fig. 12.
Vii Conclusion
We proposed a LiDARcamera extrinsic calibration method based on spatiotemporal decoupling and eliminated the limitation of sharing a common field of view. Furthermore, we analyzed the observability of the calibration system and derived how the chessboards can be placed better to improve the accuracy of the extrinsic calibration. Then we made a full comparison with other methods through both simulation and realworld experiments, which showed that our method can give a higher precision calibration result. In order to simplify the calibration, our next work is to study a calibration method that does not require a calibration plate and still maintains calibration accuracy.
Bo Fu received his BS from the Department of Control Science and Engineering, Shandong University, Jinan, P.R. China in 2017.

Yue Wang received his PhD the from Department of Control Science and Engineering, Zhejiang University, Hangzhou, P.R. China in 2016.

Yanmei Jiao received her BS from the Department of Computer Science and Control Engineering, Nankai University, Tianjin, P.R. China in 2017.

Xiaqing Ding received her BS from the Department of Control Science and Engineering, Zhejiang University, Hangzhou, P.R. China in 2016.

Li Tang received his BS from the Department of Control Science and Engineering, Zhejiang University, Hangzhou, P.R. China in 2015.

Rong Xiong received her PhD from the Department of Control Science and Engineering, Zhejiang University, Hangzhou, P.R. China in 2009.

Footnotes
 Throughout this paper, is used to denote the estimate of random variable , while is the error estimate. denotes the skew symmetric matrix corresponding to vector .
References
 Y. Zhuang, N. Jiang, H. Hu, and F. Yan, “3dlaserbased scene measurement and place recognition for mobile robots in dynamic indoor environments,” IEEE Transactions on Instrumentation and Measurement, vol. 62, no. 2, pp. 438–450, 2013.
 S. Hussmann and T. Liepert, “Threedimensional tof robot vision system,” IEEE Transactions on Instrumentation and Measurement, vol. 58, no. 1, pp. 141–146, 2009.
 S. Zhu and Y. Gao, “Noncontact 3d coordinate measurement of crosscutting feature points on the surface of a largescale workpiece based on the machine vision method,” IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 7, pp. 1874–1887, 2010.
 Y. Li, Y. F. Li, Q. L. Wang, D. Xu, and M. Tan, “Measurement and defect detection of the weld bead based on online vision inspection,” IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 7, pp. 1841–1849, 2010.
 L. Tang, Y. Wang, X. Ding, H. Yin, R. Xiong, and S. Huang, “Topological localmetric framework for mobile robots navigation: a long term perspective,” Autonomous Robots, pp. 1–15, 2018.
 Q. Luo, H. Ma, Y. Wang, L. Tang, and R. Xiong, “3dssd: Learning hierarchical features from rgbd images for amodal 3d object detection,” arXiv preprint arXiv:1711.00238, 2017.
 X. Ding, Y. Wang, D. Li, L. Tang, H. Yin, and R. Xiong, “Laser map aided visual inertial localization in changing environment,” arXiv preprint arXiv:1803.01104, 2018.
 B. Zheng, X. Huang, R. Ishikawa, T. Oishi, and K. Ikeuchi, “A new flying range sensor: Aerial scan in omnidirections,” in 3D Vision (3DV), 2015 International Conference on, pp. 623–631, IEEE, 2015.
 J. Zhang, M. Kaess, and S. Singh, “A realtime method for depth enhanced visual odometry,” Autonomous Robots, vol. 41, no. 1, pp. 31–43, 2017.
 J. L. L. Galilea, J.M. Lavest, C. A. L. Vazquez, A. G. Vicente, and I. B. Munoz, “Calibration of a highaccuracy 3d coordinate measurement sensor based on laser beam and cmos camera,” IEEE Transactions on Instrumentation and Measurement, vol. 58, no. 9, pp. 3341–3346, 2009.
 R. Ishikawa, T. Oishi, and K. Ikeuchi, “Lidar and camera calibration using motions estimated by sensor fusion odometry,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7342–7349, IEEE, 2018.
 Q. Zhang and R. Pless, “Extrinsic calibration of a camera and laser range finder (improves camera calibration).,” in IROS, vol. 3, pp. 2301–2306, 2004.
 V. Fremont, P. Bonnifait, et al., “Extrinsic calibration between a multilayer lidar and a camera,” in Multisensor Fusion and Integration for Intelligent Systems, 2008. MFI 2008. IEEE International Conference on, pp. 214–219, IEEE, 2008.
 F. M. Mirzaei, D. G. Kottas, and S. I. Roumeliotis, “3d lidar–camera intrinsic and extrinsic calibration: Identifiability and analytical leastsquaresbased initialization,” The International Journal of Robotics Research, vol. 31, no. 4, pp. 452–467, 2012.
 Z. Taylor and J. Nieto, “Motionbased calibration of multimodal sensor extrinsics and timing offset estimation,” IEEE Transactions on Robotics, vol. 32, no. 5, pp. 1215–1229, 2016.
 Y. C. Shiu and S. Ahmad, “Calibration of wristmounted robotic sensors by solving homogeneous transform equations of the form ax= xb,” IEEE Transactions on robotics and automation, vol. 5, no. 1, pp. 16–29, 1989.
 Z. Song and R. Chung, “Use of lcd panel for calibrating structuredlightbased range sensing system,” IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 11, pp. 2623–2630, 2008.
 H. Bacakoglu and M. S. Kamel, “A threestep camera calibration method,” IEEE Transactions on Instrumentation and Measurement, vol. 46, no. 5, pp. 1165–1172, 1997.
 D. Scaramuzza, A. Harati, and R. Siegwart, “Extrinsic self calibration of a camera and a 3d laser range finder from natural scenes,” in Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, pp. 4164–4169, IEEE, 2007.
 V. Lepetit, F. MorenoNoguer, and P. Fua, “Epnp: An accurate o (n) solution to the pnp problem,” International journal of computer vision, vol. 81, no. 2, p. 155, 2009.
 A. Geiger, F. Moosmann, Ö. Car, and B. Schuster, “Automatic camera and range sensor calibration using a single shot,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 3936–3943, IEEE, 2012.
 A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
 W. Wang, K. Sakurada, and N. Kawaguchi, “Reflectance intensity assisted automatic and accurate extrinsic calibration of 3d lidar and panoramic camera using a printed chessboard,” Remote Sensing, vol. 9, no. 8, p. 851, 2017.
 I. Fassi and G. Legnani, “Hand to sensor calibration: A geometrical interpretation of the matrix equation ax= xb,” Journal of Robotic Systems, vol. 22, no. 9, pp. 497–506, 2005.
 J. D. Hol, T. B. Schön, and F. Gustafsson, “Modeling and calibration of inertial and vision sensors,” The international journal of robotics research, vol. 29, no. 23, pp. 231–244, 2010.
 A. Bjorck, Numerical methods for least squares problems, vol. 51. Siam, 1996.
 E. Olson, “Apriltag: A robust and flexible visual fiducial system,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 3400–3407, IEEE, 2011.
 J. Liu, Y. Li, and S. Chen, “Robust camera calibration by optimal localization of spatial control points,” IEEE Transactions on Instrumentation and Measurement, vol. 63, no. 12, pp. 3076–3087, 2014.
 R. MurArtal and J. D. Tardós, “Orbslam2: An opensource slam system for monocular, stereo, and rgbd cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g 2 o: A general framework for graph optimization,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 3607–3613, IEEE, 2011.
 M. Li and A. I. Mourikis, “Highprecision, consistent ekfbased visualinertial odometry,” The International Journal of Robotics Research, vol. 32, no. 6, pp. 690–711, 2013.
 Z. Chen, K. Jiang, and J. C. Hung, “Local observability matrix and its application to observability analyses,” in Industrial Electronics Society, 1990. IECON’90., 16th Annual Conference of IEEE, pp. 100–103, IEEE, 1990.