Insights into the robustness of control point configurations for homography and planar pose estimation
Abstract
In this paper, we investigate the influence of the spatial configuration of a number of control points on the accuracy and robustness of space resection methods, e.g. used by a fiducial marker for pose estimation. We find robust configurations of control points by minimizing the first order perturbed solution of the DLT algorithm which is equivalent to minimizing the condition number of the data matrix. An empirical statistical evaluation is presented verifying that these optimized control point configurations not only increase the performance of the DLT homography estimation but also improve the performance of planar pose estimation methods like IPPE and EPnP, including the iterative minimization of the reprojection error which is the most accurate algorithm. We provide the characteristics of stable control point configurations for real world noisy camera data that are practically independent on the camera pose and form certain symmetric patterns dependent on the number of points. Finally, we present a comparison of optimized configuration versus number of control points.
I Introduction
The PerspectivenPoint (PnP) problem and the special case of planar pose estimation via homography estimation are some of the most researched topics in the fields of computer vision and photogrammetry. Even though the research in these areas has been wide, there is a surprising lack of information regarding the effect of 3D control point configurations on the accuracy and robustness of the estimation methods.
As will be shown in Sec. II, it is clear from the literature, that the control point configurations are relevant and they do influence the accuracy and robustness of pose estimates. However, the findings are rather general, since they are based on handson experience and thus far only lead to some rules of thumb. Most obvious and widely accepted is, that increasing the number of control points increases the accuracy of the results in presence of noise. Further on, in several studies when simulations are performed to compare methods, great care is given to possible singular point configurations, such as noncentered data or nearplanar cases which are singularities or degenerate cases for certain estimation methods.
A more thorough evaluation is given for the normalized DLT algorithm, whereas the normalization has already shown to improve the estimation because it is related to the condition number of the set of DLT equations [1]. The only error analysis for homography estimation found so far by the authors in the literature presents a statistical analysis and simulations of the errors in the homography coefficients [2].
However, none of the above give an answer to the question: Is there an optimal perspectivenpoint configuration, which can increase the accuracy and robustness of space resection methods?
If there is an optimal configuration, this question includes several followup questions: Is this optimal configuration dependent on the pose, or is there only one configuration that is optimal for all poses? What are the specifics of this/these configuration(s) in relation to absolute coordinates and relative distances between coordinates? Are there similarities between configurations that differ in the number of control points? When does an increase in number of points that are arbitrarily configured outperform the optimal configuration of a small number point set?
In this paper, we give an answer to all of these questions for planar control points by proposing an optimization objective to find optimized planar control point configurations. Figure 1 sketches the main idea of optimizing the proposed objective via a gradient descent approach and the stepwise improvement of the accuracy of the pose estimate starting from some initial control point configuration. Each descent step leads to a change in control point configuration and thus defines a stable dynamics for the control points that are placed on a planar visual fiducial marker (object plane) converging to stable control point configurations.
The paper is structured as follows: In Sec. II, we classify pose estimation methods and summarize known findings about control point configurations. In Sec. III, we derive the optimization objective based on golden standard algorithms for pose estimation. In Sec. V, we describe the simulation results, and finally in Sec. VI, we discuss the results and give conclusions.
Ii state of the art
Iia Brief history of space resection and PnP methods
Camera or space resection is a term used in the field of photogrammetry in which the spatial position and orientation of a photo is obtained by using image measurements of control points present on the photo. A similar concept comes from the computer vision community, where it is known as the PerspectivenPoint (PnP) problem. PnP can be considered an overconstrained (only for ) and generic solution to the pose estimation problem from point correspondences. PnP methods can be classified into those which solve for a small and predefined number of points, and those which can handle the general case. The minimal number of points to solve the PnP problem is three. Several solutions have been presented in the literature [3], which in general provide four solutions for noncollinear points. Thus, prior knowledge has to be included to choose the correct solution.
Since it has been proven that pose accuracy usually increases with the number of points [3], other PnP approaches that use more points () are usually preferred. The general PnP methods can be broadly divided into whether they are iterative or noniterative. Iterative approaches formulate the problem as a nonlinear leastsquares problem. They differ in the choice of the cost function to minimize, which is usually associated to an algebraic or geometric error. Some of the most important iterative methods in chronological order are: the POSIT algorithm [4], the LHM [5], the Procrustes PnP method or PPnP [6] and the global optimization method SDP [7].
Most iterative methods have the disadvantage that they return only a single pose solution, which might not be the true one. Most of them can only guarantee a local minimum, and the ones that find a global minimum remain computational intensive. The major limitation of iterative methods is that they are rather slow, neither convergence nor optimality can be guaranteed and a good initial guess is usually needed to converge to the right solution.
Noniterative methods try to reformulate the problem so it may be solved by a potentially large equation system. However, early noniterative solvers were also computational demanding and worse for a larger number of points. The first efficient and noniterative solution was EPnP [8], which was later improved by using an iterative method to increase accuracy.
More recent approaches are based on polynomial solvers trying to achieve linear performance without the problems of EPnP and with higher accuracy. The first successful method is the DLS [9], which uses a Cayley parametrization of the rotation, unfortunately with some degenerate cases.
Robust PnP or RPnP [10] is the first method that is more accurate when a low number of points is used , with similar accuracy to LHM but faster.
The OPnP (Optimal PnP) [11] is similar to DLS but a nonunit quaternion representation of the rotation is used.
UPnP [12] is a linear noniterative method that generalizes the solution into the NPnP (Non perspective npoint) problem. In contrast to OPnP the authors used normalized unit quaternions to represent rotations.More recently, in optDLS [13] a return to the Cayley rotation parametrization used on DLS is proposed, mentioning a simple trick to avoid the singularities, the method is three times faster than OPnP.
A special case of PnP is planar pose estimation, or PPE, which is a space resection problem that involves the process of recovering the relative pose of a plane with respect to a camera’s coordinate frame from a single image measurement. A PPE problem can be solved by calculating the objectplane to imageplane homography transformation and then extracting the pose from the homography matrix. This is known as homography decomposition [14, 15], or by using a set of points in the plane as the measurement with a special case of the PnP methods (planar PnP). Some of the most important planar PnP methods are the iterative RPPSP [16] and the more recent direct method IPPE [17]. In general planar PnP methods outperform the best homography decomposition methods when noise is present. Additionally, homography decomposition methods only provide a single solution in contrast to modern planarPnP methods.
IiB Homography estimation
The homography estimation is a key part of the homography decomposition methods and the IPPE algorithm. The standard linear algorithm for homography estimation is the Direct Linear Transform (DLT) [18], which was improved later in [19] using an orthogonalization step. For both methods the normalization of the measurements is a key step to improve the quality of the estimated homography [18]. However, the normalization has some disadvantages [20]: First, the normalization matrices are calculated from noisy measurements and are sensitive to outliers, and second, for a given measurement the noise affecting each point is independent of the others. However, in normalized measurements this independence is removed [2]. A method is proposed in [20] to overcome this problem by avoiding the normalization and using a Taubin estimator instead, obtaining similar results as the normalized one.
IiC Control points configurations
It is pointed out in [8, 10] that 3D point configurations have an influence on the local minima of the PnP problem. In the RPnP paper [10] a broad classification of the control points configurations into three groups is presented. The classification is based on the rank of the matrix , where , is the 3D coordinate of control point and is the amount of control points. The defined groups are: 1) Ordinary 3D case, when the and the smallest singular value of is different to zero. 2) Planar case, when the and 3) Quasisingular case, when the and the ratio of the smallest eigenvalue to the largest one is very small ().
In EPnP [8] it is shown that if the control points are taken from the uncentered data or the region where the image projections of the control points cover only a small part of the image, the stability of the compared methods greatly degrades. In RPnP it is elaborated that based on the previous classification this uncentered data is a configuration that lays between the ordinary 3D case and the planar case.
Some assumptions about the influence of the control points configurations are also present in the IPPE paper [17]. Through statistical evaluations the authors found out that the accuracy for the 4point case decreases if the points are uniformly sampled from a given region. They circumvent this problem by selecting the corners of the region as the positions for the control points and then refer the reader to the Chen and Suter paper [2], were the analysis of the stability of the homography estimation to 1st order perturbations is presented. In this analysis it is clear that the error in homography estimate is dependent on the singular values of the matrix in the DLT algorithm (see also next section).
Iii Basics of golden standard algorithms for pose estimation
Before we explain the optimization method for optimizing point configurations, we shortly summarize the golden standard optimization methods for pose estimation from general and planar point configurations which are the minimization of the reprojection (geometric) error (MRE) for iterative methods and the minimization of the algebraic error for noniterative methods via the DLT algorithm, respectively.
Iiia General point configuration for pose estimation
Given a 3D2D point correspondence of th 3D control point with world coordinates and its corresponding projection onto a planar calibrated camera^{1}^{1}1Assuming the calibration matrix to be known, the homogeneous image coordinates in pixel can be transformed to homogeneous normalized image coordinates in metric units . with normalized image coordinates the relation between these points is given by the relative pose^{2}^{2}2The rotation matrix is given by: (Euclidean transformation) between world and camera frame followed by a projection with .
This leads to the relation:
(1) 
Including additive noise on the errorfree image coordinates we get noisy measurements of the image coordinates . Thus, we can solve for the reprojection error of each point which is a squared 2norm. Minimizing the squared 2norm of all points for the optimal pose leads to the following leastsquares estimator
(2) 
Iterative gradient descent optimization of (2) leads to the most accurate pose estimation results in the literature so far, also for planar point configurations.
IiiB Planar points configuration for pose estimation
If the control points are all on a plane , we can define a 2D subspace in the 3D world with coordinates^{3}^{3}3Corresponding homogeneous coordinates are . . Plugging the planar constraint in equation (1), extending to homogeneous coordinates and rearranging the equation, leads to an homography mapping
(3) 
Eliminating , we get , where each point correspondence produces two linearly independent equations
(4) 
with and .
Again, assuming noisy measurements of the image coordinates , we get noisy matrices
(5)  
From we can solve for the algebraic error of each point, because holds. Minimizing the squared 2norm of all points for the optimal homography leads to the following leastsquares estimator
(6) 
Since contains 9 entries, but is defined only up to scale the total number of degrees of freedom is 8. Thus, the additional constraint is included to solve the optimization.
Now, stacking all and as and respectively, we arrive at solving the noisy homogeneous linear equation system
(7) 
The solution of (7) is equivalent to the solution of (6) and is given by the DLT algorithm applying a singular value decomposition (SVD) of , whereas with being the right singular vector of , associated with the least singular value . Usually, an additional normalization step of the coordinates of the control points and its projections is performed leading to the normalized DLT algorithm which is the golden standard for noniterative pose estimation, because it is very easy to handle and serves as a basis for other noniterative as well as iterative pose estimation methods.
Iv Optimizing points configuration for pose estimation
In order to find an optimal control points configuration in the field of view of a camera for estimating the pose of the same camera, we need a proper optimization criterion. In the following, we propose an optimization criterion that is optimal for planar pose estimation using the (normalized) DLT algorithm. We start with availing ourselves of perturbation theory applied to singular value decomposition of noisy matrices [23] and have a look at the first order perturbation expansion for the perturbed solution of the DLT algorithm, given in [2], which is
(8) 
Equation (8) clearly shows that the optimal solution for the homography that equals the right singular vector of the unperturbed matrix , associated with the least singular value^{4}^{4}4The singular values are arranged in descending order: . , is perturbed by the second term in (8). The second term is a weighted sum of the first eight optimal right singular vectors , whereas the weights are the influence of the measurement errors on the unperturbed solution along the different dimensions of the model space. The presence of very small in the denominator can give us very large weights for the corresponding model space basis vector and dominate the error. Hence, small singular values cause the estimation to be extremely sensitive to small amounts of noise in the data and correlates with the singular value spectrum^{5}^{5}5Here, the singular value spectrum between the first and secondlast singular value is relevant, because holds. as follows: The smaller the singular value spectrum, the less perturbed the estimation is. It is also well known, that the condition number of a matrix with respect to the 2norm is given by the ratio between the largest and, in our case, secondsmallest singular value [24]
(9) 
which is minimal if the singular value spectrum is minimal. The normalization of the control points and its projections which leads to the normalized DLT algorithm has already shown to improve the condition of matrix [1]. Thus, we simply try to minimize the condition number of matrix with respect to all control points like follows:
(10) 
Optimization of (10) is realized with a gradient descent minimization, whereas for calculation of the gradient vector we use automatic differentiation^{6}^{6}6For implementation, we used autograd [25]. [26]. This leads to the final discrete control points dynamics
(11) 
for each iteration and stepsize , which is adapted using SuperSAB [27]. The control points dynamics can now be used to find optimal control point configurations for pose estimation from planar markers.
Given perturbations on the matrix the relative error on the estimation of the homography parameters is defined as and from perturbation theory the following inequality defines an upper bound for the relative error:
(12) 
To find a lower bound, we can use the error of using a perturbed matrix with the true homography defined as and the error of using the optimal homography estimation with the same perturbed matrix defined as to build the following inequality:
(13) 
which then divided by leads to a lower bound of the relative error:
(14) 
The upper bound implies that there are only two ways to improve the maximum values of the relative error, either by reducing the perturbation of the matrix, which usually can’t be controlled, or by improving the condition number of the matrix, which can be done by a normalization step in the DLT transform and by selecting optimized control point configurations.
V Simulation and real experiment results
Our simulation setup is based on a perspective camera model and a circular planar visual marker of radius meters on the plane centered in the origin of world coordinates. A set of control points are defined inside the limits of this circular plane, which are then projected onto the camera image^{7}^{7}7Camera parameters: size , intrinsic parameters .. A uniform distribution of 400 camera poses is defined around the circular plane as displayed in Fig. 2.
Va Evaluations
To evaluate the improvement of the gradient descent optimization, we consider the optimization objective, which is the condition number (9) at each iteration , given by in the DLT algorithm. To evaluate the effect of the optimization (10) on the underlying homography estimate using a given set of control points , we rely on the reprojection error induced by the estimated homography given by
(15) 
for a fixed set of validation control points that are evenly distributed on the object plane covering an area larger than the limits of the circle. Thus, it is possible to measure how good is able to represent the true homography beyond the space of the control points.
Each simulation for a given camera pose is then performed in the following way: 1) An initial random point set is defined inside the circular plane 2) For each iteration step an improved set of control points is obtained by (11) and projected to image coordinates using the true camera pose and calibration matrix . Then, the correspondences are used to calculate and . 3) For each a statistically meaningful measure of the homography estimation robustness against noise is desired. Thus, 1000 runs of the homography estimation using the DLT algorithm were performed^{8}^{8}8The homography estimation method presented in [19] and the gradient based one of OpenCV were also tested. The results almost do not differ for low point configurations to the DLT, so it was the chosen one for the experiments.. In each of these runs Gaussian noise with standard deviation was added to the image coordinates for the simulation of real camera measurements . Finally, the error is calculated in each run and the average and standard deviation of this error for all runs is computed.
As illustration of the gradient minimization process an example case of a simulation in a frontoparallel camera pose for a 4point configuration is presented. A Gaussian noise of pixel is added to image coordinates for the homography estimation runs. In Fig. 3 the initial object and image point configurations are shown. The movement of points present two main characteristics: first, they are driven to separate from each other, and second, they tend to equalize the distance to each other in object space. This results in squarelike shaped configurations.
The evolution of as well as and is presented in Fig. 4. The condition number decreases drastically in the first iterations of the gradient descent, and by doing so the mean and standard deviation of is also reduced. With more iterations both metrics slowly and smoothly converge to a stable minimum value.
This first result in itself is highly representative as it proves that some point configurations increase the accuracy of homography estimation methods as well as the robustness to noise and it is also possible to obtain optimal point configurations.
Motivated by the homography results, it was of interest to test if the optimization of control point configurations could improve as well the accuracy of pose estimation algorithms. Thus, three different pose estimation algorithms^{9}^{9}9For the EPnP and LM methods, the OpenCV implementations were used, and for IPPE the Python implementation provided in the author’s github repository. were run at each iteration of the optimization process, namely: 1) a noniterative PnP method EPnP [8], 2) a planar pose estimation method IPPE [17], and 3) an iterative one based on the LevenbergMarquardt optimization denoted as LM.
As in similar works [8, 17], we denote as the estimated rotation and translation for a given camera pose at iteration and by the true rotation and translation. The error metrics for pose estimation are defined as follows:

RE is the rotational error (in degrees) defined as the minimal rotation needed to align to . It is obtained from the axisangle representation of .

TE is the relative error in translation.
Similar to the homography simulation, for each iteration , 1000 runs of the pose estimation with noisy correspondences for each of the PnP methods were performed. Then, the mean and standard deviation of RE and TE for the 1000 runs were calculated for each iteration. The PnP simulation results for the frontoparallel case are presented in Fig. 5 comparing the performance of all methods together and in Fig. 6 details about the standard deviation of each method are shown.
A real experiment was also implemented in order to test if the simulation assumptions (Gaussian image noise and perfect intrinsics) may affect the results in practical applications. A computer screen was used as planar fiducial marker to dynamically display the points during gradient descent. A set of 4 circles was displayed for each iteration of the optimization. These circles were then captured by a camera and detected using a Hough transform based circle detector. We performed 100 detections for each gradient descent iteration. An Optitrack system was used to measure the ground truth pose of the camera relative to the marker screen. The calibration of the Optitrack markers to the relevant coordinate frames was performed using the ROS package easy_handeye and the calibration of the PointGrey Blackfly camera intrinsics^{10}^{10}10Camera parameters: size , intrinsic parameters . was done using the camera_calibration ROS package. The results of running the optimization process for a set of 4 random initial points are shown in figures 7, 8 and 9.
Next the relationship between badly configured points and optimized points was studied. For each camera pose in the distribution of Fig. 2, 100 different initial random point configurations with were simulated and the optimization process was performed. In this case only the initial and final values of the point configuration metrics are stored. Thus, it is possible to compare the methods based on illconditioned (random initial points) and wellconditioned (after optimization) point configurations. In Fig. 10 the results for the homography estimation are presented and in Fig. 11 the results of the pose estimation. Finally, in figures 12 and 13 the final point configurations for all the camera poses are shown as a 2D histogram and some example configurations are shown for the 4point and 5point case.
VB Discussion
From the results it is possible to observe that the control point configurations indeed have a strong effect on the accuracy of homography and planar PnP methods. The amount of improvement when having optimized points in EPnP and IPPE is more pronounced than for LM, which is in itself an interesting result, since IPPE for example is a method that takes considerable less computation time than LM. By having well configured points, the methods converge to the similar error values (see Fig. 5, 6 and Fig. 8). Additionally, as can be seen in Fig. 6, not only the mean is reduced but also the variance, with a more pronounced effect on translation for the three methods. It is remarkable that the point configuration also affects the performance of LM although our optimization objective is not directly related to the minimization of the reprojection error. The results of the real experiment closely match the simulations. An improvement on the condition number is associated with improvements both in homography and pose estimation.
Optimized point configurations are on average better than random configurations independent on the camera pose. Further on, the smaller the number of control points the more is the relative improvement on the estimates for all of the evaluated methods. For example, the homography estimate using 4 optimal control points is always better than arbitrary point configurations with more points as can be seen in Fig. 10. Choosing more than 4 optimal control points slightly improves the homography estimates and converges for a number of 8 optimal control points. Thus, configuration of the control points has more effect on the accuracy than the number of control points. But increasing number of optimized control points to using LM method even outperforms a large number of random control points.
As can be seen on Fig. 12 the most common minima for the 4point case is a squarelike shape, and as shown on Fig 13 even for the 5point case the corners of a squarelike shape are common. In Figures 4, 6, 10 and 11 it can be seen that the square is always the lower boundary both for the condition number and the estimation error. On average a perfect square is the most robust shape for homography and pose estimation from four points. This makes sense, since orthogonality and point separation improve the robustness of the SVD.
Vi Conclusions and future work
A method for obtaining optimal control points for homography estimation using an optimization approach is presented. The lower the number of control points the more the configuration of the points on the object plane has an influence on the accuracy of homography and PnP estimation methods. It was proven that a square is a very stable and robust configuration for all camera poses. For more point configurations the configurations meet the general rule of being maximally separated in object space with equal distances between each other including the optimized 4point configuration as a subset. Finally, we found that there is almost no difference in accuracy between IPPE and LM for optimized point configurations but IPPE is much faster because it is a direct and not an iterative method. In future work, we will try to generalize the results to nonplanar point configurations.
References
 [1] R. Hartley, “In defence of the 8point algorithm,” in Proc. of IEEE Int. Conf. on Computer Vision, 1997.
 [2] P. Chen and D. Suter, “Error analysis in homography estimation by first order approximation tools: A general technique,” Journal of Mathematical Imaging and Vision, 2009.
 [3] E. Marchand, H. Uchiyama, and F. Spindler, “Pose Estimation for Augmented Reality: A HandsOn Survey,” IEEE Trans. on Vis. and Comput. Graphics, 2016.
 [4] D. Oberkampf, D. F. DeMenthon, and L. S. Davis, “Iterative Pose Estimation Using Coplanar Feature Points,” Computer Vision and Image Understanding, 1996.
 [5] C. P. Lu, G. D. Hager, and E. Mjolsness, “Fast and globally convergent pose estimation from video images,” IEEE Trans. Pattern Anal. Mach. Intell., 2000.
 [6] V. Garro, F. Crosilla, and A. Fusiello, “Solving the PnP problem with anisotropic orthogonal procrustes analysis,” 3DIMPVT, 2012.
 [7] G. Schweighofer and A. Pinz, “Globally Optimal O(n) Solution to the PnP Problem for General Camera Models,” BMVC, 2008.
 [8] V. Lepetit, F. MorenoNoguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, 2008.
 [9] J. A. Hesch and S. I. Roumeliotis, “A direct leastsquares (DLS) method for PnP,” in IEEE Int. Conf. on Computer Vision, 2011.
 [10] S. Li, C. Xu, and M. Xie, “A robust O(n) solution to the perspectivenpoint problem,” IEEE Trans. Pattern Anal. Mach. Intell., 2012.
 [11] Y. Zheng, Y. Kuang, S. Sugimoto, K. Astrom, and M. Okutomi, “Revisiting the PnP problem: A fast, general and optimal solution,” in Proc. of the IEEE Int. Conf. on Computer Vision, 2013.
 [12] L. Kneip, H. Li, and Y. Seo, “UPnP: An Optimal O(n) Solution to the Absolute Pose Problem with Universal Applicability,” in European Conference on Computer Vision, 2014.
 [13] G. Nakano, “Globally Optimal DLS Method for PnP Problem with Cayley parameterization,” in BMVC, 2015.
 [14] P. Sturm, “Algorithms for PlaneBased Pose Estimation,” IEEE Conf. on Computer Vision and Pattern Recognition, 2000.
 [15] Z. Zhang, “A Flexible New Technique for Camera Calibration,” IEEE Trans. Pattern Anal. Mach. Intell., 2000.
 [16] G. Schweighofer and A. Pinz, “Robust pose estimation from a planar target,” IEEE Trans. Pattern Anal. Mach. Intell., 2006.
 [17] T. Collins and A. Bartoli, “Infinitesimal planebased pose estimation,” International Journal of Computer Vision, 2014.
 [18] A. Z. Richard Hartley, Multiple View Geometry, 2nd ed. Cambridge University Press, 2004.
 [19] M. J. Harker and P. L. O’Leary, “Computation of Homographies,” in BMVC, 2005.
 [20] P. Rangarajan and P. Papamichalis, “Estimating homographies without normalization,” in Proc. of Int. Conf. on Image Processing, 2009.
 [21] V. Willert, “Optical indoor positioning using a camera phone,” in Proc. of the 2010 int. conf. on indoor positioning and indoor navigation, 2010.
 [22] D. H. S. Chung, M. L. Parry, P. A. Legg, I. W. Griffiths, R. S. Laramee, and M. Chen, “Visualizing multiple errorsensitivity fields for single camera positioning,” Computing and Visualization in Science, 2014.
 [23] G. W. Stewart, “Perturbation theory for the singular value decomposition,” Tech. Rep., 1998.
 [24] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed., 2013.
 [25] D. Maclaurin, “Modeling, inference and optimization with composable differentiable procedures,” Tech. Rep., 2016.
 [26] L. B. Rall, “Automatic differentiation: Techniques and applications,” 1981.
 [27] T. Tollenaere, “Supersab: fast adaptive back propagation with good scaling properties,” pp. Neural Networks 3(5), 561–573, 1990.