Resolving Marker Pose Ambiguity by Robust Rotation Averaging
with Clique Constraints
Planar markers are useful in robotics and computer vision for mapping and localisation. Given a detected marker in an image, a frequent task is to estimate the 6DOF pose of the marker relative to the camera, which is an instance of planar pose estimation (PPE). Although there are mature techniques, PPE suffers from a fundamental ambiguity problem, in that there can be more than one plausible pose solution for a PPE instance. Especially when localisation of the marker corners is noisy, it is often difficult to disambiguate the pose solutions based on reprojection error alone. Previous methods choose between the possible solutions using heuristic criteria, or simply ignore ambiguous markers.
We propose to resolve the ambiguities by examining the consistencies of a set of markers across multiple views. Our specific contributions include a novel rotation averaging formulation that incorporates long-range dependencies between possible marker orientation solutions that arise from PPE ambiguities. We analyse the combinatorial complexity of the problem, and develop a novel lifted algorithm to effectively resolve marker pose ambiguities, without discarding any marker observations. Results on real and synthetic data show that our method is able to handle highly ambiguous inputs, and provides more accurate and/or complete marker-based mapping and localisation.
In many robotic vision pipelines, fiducial markers are often employed to simplify feature extraction. In particular, planar markers [wang2016apriltag, fiala2004artag, fiala2005artag, garrido2014automatic, romero2018speeded, hu2019deep], which are designed to be easily detected and associated across images, find extensive use in laboratory and commercial settings (factories, warehouses, mines, etc.). In applications that perform planar marker-based SfM or SLAM [shaya2012self, munoz2018mapping, degol2018improved, munoz2019spm], there is a basic need to estimate the 6DOF pose of an observed marker relative to the camera coordinate frame. This is often solved as a special case of planar pose estimation (PPE), which functions by determining the relative pose between a plane of known dimensions and its projection onto the image [oberkampf1993iterative, schweighofer2006robust, collins2014infinitesimal].
While in theory 6DOF pose can be determined uniquely from four non-collinear but co-planar points, the situation is less clear in non-ideal conditions where perspective effects are not apparent, e.g., when the imaged marker is small or the marker is at a distance that is significantly larger than the focal length. In such conditions there is a two-fold rotational ambiguity that corresponds to an unknown reflection of the plane about the z-axis of the camera [oberkampf1993iterative, schweighofer2006robust, collins2014infinitesimal]. For one observed planar marker (specifically its four corners), state-of-the-art PPE methods [schweighofer2006robust, collins2014infinitesimal] may return two physically plausible pose solutions, with one of them being the correct one (i.e., the one closer to the ground truth pose).
Fig. 1 shows an example from the dataset of [degol2018improved]. Note that the two solutions returned by PPE can be very different, thus it is unwise to arbitrarily choose one of the two poses, or take the midpoint of the two solutions as the pose estimate.
A common way to disambiguate the two returned poses is to compute the reprojection error of each pose, i.e., the discrepancy between the 2D observations of the 4 corners of the detected marker and the projections of their reference 3D positions onto the image, given the camera intrinsic parameters and the pose under test. The PPE pose with the lower reprojection error is then selected.
However, comparing reprojection errors is not foolproof [wu2012stable, munoz2019spm]: if the corner localisation is noisy, the reprojection errors of the two poses can be very close. In fact, the correct solution can have the higher reprojection error; see Fig. 1.
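The selection rule above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a pinhole camera without distortion, a sum of squared pixel residuals as the error, and hypothetical intrinsics and marker size; all function names are illustrative.

```python
import math

def rot_x(theta):
    """Rotation about the x-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def project(K, R, t, X):
    """Pinhole projection of marker-frame point X under the M2C pose (R, t)."""
    fx, fy, cx, cy = K
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)

def reprojection_error(K, R, t, corners_3d, corners_2d):
    """Sum of squared pixel residuals over the marker corners (assumed form)."""
    total = 0.0
    for X, (u, v) in zip(corners_3d, corners_2d):
        pu, pv = project(K, R, t, X)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total

def select_by_reprojection(K, poses, corners_3d, corners_2d):
    """Return (index of the pose with the lowest error, list of errors)."""
    errors = [reprojection_error(K, R, t, corners_3d, corners_2d) for R, t in poses]
    return min(range(len(poses)), key=errors.__getitem__), errors

# Toy example: a 10 cm marker seen 2 m away, with two candidate poses that
# differ only in orientation (hypothetical intrinsics fx, fy, cx, cy).
K = (500.0, 500.0, 320.0, 240.0)
corners_3d = [(-0.05, -0.05, 0.0), (0.05, -0.05, 0.0),
              (0.05, 0.05, 0.0), (-0.05, 0.05, 0.0)]
t = (0.0, 0.0, 2.0)
poses = [(rot_x(0.5), t), (rot_x(-0.5), t)]
# Noise-free observation generated from the first pose.
corners_2d = [project(K, poses[0][0], t, X) for X in corners_3d]
best, errors = select_by_reprojection(K, poses, corners_3d, corners_2d)
```

In this toy setup the two candidate orientations differ by about one radian, yet the wrong pose still reprojects within a fraction of a pixel of each observed corner, so a small amount of corner noise could plausibly flip the decision.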
In practice, marker pose ambiguity occurs regularly [munoz2018mapping]. Fig. 2(a) is the histogram of the reprojection error ratio of the PPE-derived poses for all the markers detected in sequence Hotel2 (H2) from [dai2017scannet]. About 25% of the PPE solutions are considered ambiguous, i.e., their ratio value does not fall below the default threshold of 0.6 [munoz2018mapping].
While current theory and algorithms for PPE [schweighofer2006robust, collins2014infinitesimal] have characterised the ambiguity issue and are able to compute all physically plausible solutions stably, using the PPE outputs under ambiguity, particularly in marker-based SfM or SLAM pipelines, remains a fundamental challenge. In the following, we further survey efforts to deal with marker pose ambiguity, before outlining the proposed solution.
I-A Related work
Tanaka et al. [tanaka2014solution, tanaka2017solving] modified the conventional planar marker design to directly incorporate orientation information. They attach two one-dimensional moire patterns onto the marker to obtain appearance variation for pose disambiguation, as well as lenticular lenses that introduce 3D deviations to the marker surface. Though this largely alleviates the ambiguity problem, the marker fabrication is non-trivial.
For planar target camera tracking, a filtering method with a well-tuned camera motion model [wu2012stable, uematsu2007improvement] can be exploited to disambiguate the marker poses. However, this assumes temporal continuity in the images, which may not be valid in SfM with wide baseline images; moreover, there are no mature filtering methods for marker SLAM. Jin et al. [jin2017sensor] showed improved marker pose estimation accuracy by fusing depth information. However, this requires an RGBD camera.
Marker-based SfM/SLAM is an active research area [shaya2012self, neunert2016open, munoz2018mapping, degol2018improved, munoz2019spm]. Marker ambiguity is not dealt with explicitly in [neunert2016open, shaya2012self, degol2018improved], though [degol2018improved] combined feature-based SfM with marker-based SfM. Munoz-Salinas et al. applied the ratio test of [collins2014infinitesimal] in their marker-based SfM [munoz2018mapping] and SLAM pipeline [munoz2019spm]. Basically, if the ratio (2) is below a threshold (default is 0.6 [munoz2018mapping]), the PPE solution with the lower reprojection error is used in subsequent SfM/SLAM processing; else, the marker detection is discarded. A weakness of this approach is the sensitivity to the threshold. If it is too low, many marker detections will be excluded, leading to data wastage or even SfM/SLAM failure. On the other hand, a high threshold risks using bad marker poses (recall that the pose with the lower reprojection error may not be the correct one) for SfM/SLAM. Sec. VI will demonstrate this shortcoming.
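The ratio test described above can be sketched as below. The exact form of the ratio is an assumption here: the lower reprojection error divided by the higher one, which is consistent with smaller ratios indicating less ambiguity. The function name `ratio_test` is hypothetical, and the default threshold of 0.6 is the value quoted from [munoz2018mapping].

```python
def ratio_test(err_a, err_b, threshold=0.6):
    """Per-marker ratio test on the two PPE reprojection errors.

    Returns (selection, ratio): selection is 0 or 1 for the pose with the
    lower error when the ratio is below the threshold, or None when the
    detection is deemed too ambiguous and is discarded.
    """
    lo, hi = min(err_a, err_b), max(err_a, err_b)
    # Two zero errors are indistinguishable, so treat them as fully ambiguous.
    ratio = lo / hi if hi > 0 else 1.0
    if ratio < threshold:
        return (0 if err_a <= err_b else 1), ratio
    return None, ratio

kept = ratio_test(1.0, 4.0)       # clearly separated errors
dropped = ratio_test(3.0, 4.0)    # errors too close: detection discarded
```

The sketch makes the threshold sensitivity concrete: lowering `threshold` turns more detections into `None`, while raising it accepts pairs whose errors are nearly equal.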
I-B Our contributions
Unlike previous works that have used a per-marker approach to resolve marker ambiguity, we exploit multi-view constraints for disambiguation. From the input marker detections, we first construct a multigraph of relative rotation measurements, which incorporates all PPE pose ambiguities. Then, we formulate a novel rotation averaging problem with clique constraints that respects consistency (details later) between subsets of relative pose measurements. We examine the combinatorial complexity of the new problem, and develop a lifted optimisation method to efficiently solve it. Then, a series of small maximum weighted clique problems are solved to make the final pose selections. Our method allows all valid PPE pose combinations to be examined, and leads to more accurate and/or complete marker-based SfM.
II Problem formulation
Consider input images that observe a set of markers of known sizes in a static scene. We assume calibrated cameras. A standard marker detection and identification algorithm [opencv_library] is applied to each image. Denote by
as the set of markers detected in . Using a PPE technique [schweighofer2006robust, collins2014infinitesimal] on the corners of detected in , the marker-to-camera (M2C) relative pose of to is computed, which can potentially yield two solutions
Without loss of generality, we assume that each marker observation has exactly two relative pose solutions. Note that the pose ambiguity is due to orientation ambiguity, thus the translation component is the same, i.e.,
Given the set of all M2C relative pose measurements
our overall aim is SfM, i.e., find the absolute poses of the markers and cameras . To do so, pose ambiguity must be resolved, i.e., for each such that , choose either or for SfM computations.
Previous pipelines [munoz2018mapping, munoz2019spm] make the choice using per-marker heuristics, or discard the marker observation. This “preprocessing” yields the reduced measurement set
where each is either or , and . The reduced measurement set is then subjected to the rest of the SfM/SLAM pipeline. Our new method exploits multi-view consistency to disambiguate the PPE marker poses in a way that avoids premature decisions; details as follows.
III Multigraph with rotational ambiguity
Since the ambiguity lies in the orientations, it is natural to model the ambiguity using only the M2C relative rotations
To this end, we construct a multigraph , whose vertices are the markers , and whose edges indicate covisibility between the markers. More specifically, if and are detected in , four edges
connect vertices and in ; assuming , the edges correspond to the marker-to-marker (M2M) relative rotations
where is a bit string composed of two binary indicators . The edges in are undirected; if , the edge has the associated M2M relative rotation
Thus, in our notation
The set of all edges (without repetitions) is thus
Similarly, the set of unique M2M relative rotations is
The existence of four M2M relative rotations per pair is a direct consequence of ambiguity in marker pose estimation, and the bit string selects a particular combination of M2C relative rotations to derive the M2M relative rotation.
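A small sketch of how the four M2M relative rotations for one covisible pair could be composed. It assumes that each M2C rotation maps marker coordinates to camera coordinates, so that the rotation from marker j to marker k is the transpose of k's M2C rotation times j's M2C rotation; the composition order and the helper names are assumptions for illustration.

```python
import math
from itertools import product

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rot_y(theta):
    """Rotation about the y-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def m2m_candidates(Rj, Rk):
    """Rj, Rk: the two PPE rotation solutions of markers j and k in one image.

    Returns {(a, b): Rk[b]^T Rj[a]}, one M2M relative rotation per bit string,
    where bit a picks the solution for marker j and bit b the one for marker k.
    """
    return {(a, b): matmul(transpose(Rk[b]), Rj[a])
            for a, b in product((0, 1), repeat=2)}

# Toy pair of ambiguous detections (two orientation hypotheses each).
Rj = (rot_y(0.4), rot_y(-0.4))
Rk = (rot_y(0.1), rot_y(-0.1))
edges = m2m_candidates(Rj, Rk)
```

Each covisible pair in each image contributes four such parallel edges to the multigraph, which is why no PPE hypothesis has to be discarded before averaging.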
Note that our multigraph construction method is a significant extension of that in [munoz2018mapping], in that our multigraph incorporates all ambiguous marker poses, whereas [munoz2018mapping] generates from the preprocessed data (7) with no ambiguities.
III-A Consistent cliques
We assume that the multigraph is connected, i.e., there is a path that connects every pair of vertices (markers) in .
Definition 1 (Consistent clique). Given multigraph as defined above, a consistent clique for image is a fully connected subgraph such that
Every two vertices are connected by exactly one edge , where is one of .
For every two vertices that are connected to vertex , the associated edges and satisfy the condition .
Fig. 3 provides examples. Intuitively, a consistent clique for image corresponds to a set of M2M relative rotations that are composed using a constant selection of one of the two M2C relative poses for each marker detected in .
Since there are multiple valid combinations of constant M2C relative pose selections, there are multiple consistent cliques for an image. Assuming that markers are detected in each image, there are number of consistent cliques per image. For images, there are thus unique combinations of consistent cliques across the images.
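The counting argument can be checked by brute force. The sketch below assumes that a consistent clique is fully determined by one binary choice per detected marker, as the intuition above suggests, so that m markers give 2^m consistent cliques per image and n images give (2^m)^n combinations; the edge encoding `(j, k, a, b)` is hypothetical.

```python
from itertools import product

def consistent_cliques(marker_ids):
    """Enumerate the consistent cliques of one image as sets of edges.

    Every clique is induced by one selection bit per marker; the edge between
    markers j < k then carries the bit string (bit of j, bit of k).
    """
    ids = sorted(marker_ids)
    cliques = []
    for bits in product((0, 1), repeat=len(ids)):
        sel = dict(zip(ids, bits))
        edges = frozenset((j, k, sel[j], sel[k])
                          for idx, j in enumerate(ids) for k in ids[idx + 1:])
        cliques.append(edges)
    return cliques

def num_combinations(markers_per_image, num_images):
    """Number of joint clique choices across all images."""
    return (2 ** markers_per_image) ** num_images

cliques = consistent_cliques([1, 2, 3])
```

For instance, `num_combinations(4, 100)` already equals 2 to the power 400, which illustrates why exhaustive search over all images is impractical.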
IV Disambiguation with rotation averaging
Based on the multigraph, our technique resolves the ambiguities by first solving a novel rotation averaging formulation, then - based on the averaging results - building and solving a maximum weighted clique problem. The key outcome of this step is marker pose disambiguation; Sec. V will incorporate this step into a marker-based SfM pipeline.
IV-A Rotation averaging with clique constraints
While standard rotation averaging is defined over a graph of relative rotations [hartley2013rotation, chatterjee2013efficient], extending the formulation to a multigraph of relative rotations is straightforward, and existing algorithms (we used [chatterjee2013efficient]) can be applied with minor adjustments. Let be the absolute rotations of the markers. A rotation averaging problem over multigraph is
where is a robust norm. The motivation behind (16) is to attempt to identify the incorrect poses from PPE as the contributors to outlying measurements in the averaging task.
However, our tests (Sec. VI) suggest that this approach is ineffective for disambiguation, most probably because (16) does not enforce clique consistency (Def. 1). Thus, error terms that are regarded as inliers could correspond to choosing both PPE poses for the same marker detection.
To enforce clique consistency into rotation averaging, we introduce a set of binary indicator variables
where the setting implies selecting M2C relative rotation for the detection of in , while implies selecting . We then formulate the clique-constrained rotation averaging problem
Intuitively, selects the M2C relative rotations to compose the M2M relative rotations in a consistent way. Searching over thus allows different consistent cliques in all images to be examined. Finally, since are shared across images, multi-view consistency is exploited to choose the best combinations of the PPE relative rotations.
IV-B Efficient algorithm using lifting approach
A naive method to solve (18) is to enumerate , and for each instantiation, collect the non-zero terms in (18) and solve the resulting rotation averaging problem. Then, return the with the lowest optimised error as the disambiguation decision. Since there are possible instantiations of (assuming markers seen per image), this is infeasible.
which yields the “smoothed” version of (18)
Intuitively, the contribution of an error term in (20) is now weighted according to correctness of the corresponding M2C relative poses that define the error term.
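One way to read the weighting described above is sketched below. Everything specific in it is an assumption made for illustration: the indicators are taken to be relaxed to values in [0, 1], the weight of an error term is taken to be the product of the two per-marker weights, and the robust norm is replaced by the plain rotation angle. It only evaluates the cost; the minimisation over rotations and indicators is left to a generic non-linear optimiser.

```python
import math

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def angle(R):
    """Rotation angle (radians) of a 3x3 rotation matrix."""
    tr = R[0][0] + R[1][1] + R[2][2]
    return math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0)))

def bit_weight(z, bit):
    # Assumed convention: z close to 1 favours the first PPE solution (bit 0).
    return z if bit == 0 else 1.0 - z

def smoothed_cost(Q, z, measurements):
    """Q: {marker: absolute rotation}; z: {(marker, image): value in [0, 1]};
    measurements: {(image, j, k): {(a, b): measured M2M rotation from j to k}}."""
    total = 0.0
    for (i, j, k), candidates in measurements.items():
        predicted = matmul(transpose(Q[k]), Q[j])
        for (a, b), R_meas in candidates.items():
            residual = angle(matmul(transpose(R_meas), predicted))
            weight = bit_weight(z[(j, i)], a) * bit_weight(z[(k, i)], b)
            total += weight * residual
    return total

# Toy instance: two markers seen in image 0; only bit string (0, 0) is correct.
Q = {0: rot_z(0.0), 1: rot_z(0.4)}
measurements = {(0, 0, 1): {(0, 0): rot_z(-0.4), (0, 1): rot_z(0.1),
                            (1, 0): rot_z(-1.0), (1, 1): rot_z(0.5)}}
cost_binary = smoothed_cost(Q, {(0, 0): 1.0, (1, 0): 1.0}, measurements)
cost_relaxed = smoothed_cost(Q, {(0, 0): 0.5, (1, 0): 0.5}, measurements)
```

With binary indicators only one of the four terms per pair survives, so the sketch reduces to a clique-constrained cost; with fractional indicators all four terms contribute in proportion to how plausible their underlying M2C choices currently are.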
Problem (20) can be solved using an iterative non-linear optimiser (e.g., fmincon in MATLAB). We initialise via a minimum spanning tree on , choosing the M2M relative rotations with the lower combined reprojection errors for chaining, and is set to reflect these choices. As we will show in Sec. VI, our method is not biased by such an initialisation, since it is capable of providing more accurate disambiguation than comparing reprojection errors alone.
IV-C Selecting the marker poses
for all . Similar to (2), the ratio (21) indicates how “disambiguable” the PPE poses are for each marker detection (smaller ratios are better), but now based on the value of . Although is not discrete, the percentage of marker poses that are still ambiguous is now significantly reduced.
To conclusively select one PPE pose per detected marker, a simple solution would be to threshold each with ; however, we would like to avoid such a per-marker decision. To this end, for each image we construct the multigraph , where , and
Note that is a submultigraph of , and there exist consistent cliques in (see Sec. III-A). Further, each edge in has the weight
Given , define edge indicator variables
and the maximum weighted clique (MWC) problem
Basically, the aim of is to find a consistent clique in with the largest edge weights. Though MWC is intractable in general [tomita2003efficient], each instance is small, since the number of detected markers in is small (usually ).
We use the efficient clique solver of [eppstein2011listing] on each . The optimised provides a consistent selection of the PPE poses for all markers detected in . Specifically, for each detected in , find a that is nonzero, and set if , or otherwise.
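Because each per-image instance is small, the selection step can also be illustrated by exhaustive search over consistent cliques. This is a stand-in for illustration only, not the clique solver of [eppstein2011listing], and the edge weights in the example (products of per-marker scores) are an assumed form rather than the weight defined in the text.

```python
from itertools import product

def best_consistent_clique(marker_ids, weight):
    """Brute-force maximum weighted consistent clique for one image.

    weight: {(j, k, a, b): edge weight} for markers j < k and bits a, b.
    Returns (selection, total), where selection[j] is the chosen bit of j.
    """
    ids = sorted(marker_ids)
    best_sel, best_total = None, float("-inf")
    for bits in product((0, 1), repeat=len(ids)):
        sel = dict(zip(ids, bits))
        total = sum(weight[(j, k, sel[j], sel[k])]
                    for idx, j in enumerate(ids) for k in ids[idx + 1:])
        if total > best_total:
            best_sel, best_total = sel, total
    return best_sel, best_total

# Hypothetical relaxed indicator values for three markers in one image.
z = {1: 0.9, 2: 0.2, 3: 0.55}
score = lambda j, bit: z[j] if bit == 0 else 1.0 - z[j]
weight = {(j, k, a, b): score(j, a) * score(k, b)
          for j in z for k in z if j < k
          for a, b in product((0, 1), repeat=2)}
selection, total = best_consistent_clique(z.keys(), weight)
```

The search visits 2^m selections for m detected markers, which stays cheap as long as only a handful of markers are visible per image.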
Algorithm 1 summarises the proposed method for marker pose disambiguation.
V Marker-based SfM pipeline
To carry out marker-based SfM using our marker pose disambiguation method, we largely follow the pipeline of the state-of-the-art MarkerMapper [munoz2018mapping]. Briefly, a robust pose graph optimisation is first invoked on the resolved M2C relative poses (7) from Algorithm 1 to yield absolute marker poses - in our case, the absolute rotation component is initialised using the output from solving (20). Then, each camera pose is initialised using single pose averaging from the M2C poses, before all marker and camera poses are refined simultaneously by bundle adjustment on the observed corners of all detected markers. We refer to [munoz2018mapping] for details of the SfM pipeline.
To assess the efficacy of the proposed marker pose disambiguation technique, we compared the following methods:
Reprojection error (M1): For each marker detection, select the PPE solution with the lower reprojection error.
Default ratio test (M3): The threshold of 0.6 is applied on the reprojection error ratio (the default setting in [munoz2018mapping]).
Proposed method (Ours): As described in Sec. IV.
When applying the above disambiguation methods to perform marker-based SfM, we simply used them to preprocess the input marker detections, then executed the rest of the MarkerMapper pipeline [munoz2018mapping] (see Sec. V). All the experiments were conducted on a machine with a 3.5 GHz CPU and 8 GB of RAM.
VI-A Experiments on hybrid data
VI-A1 Data generation
We used the ScanNet Dataset [dai2017scannet], which contains a number of sequences with ground truth 6DOF camera poses and depth. A test sequence was created from an original sequence by warping a number of ArUco markers [garrido2014automatic, romero2018speeded] based on known/ground truth M2C relative poses onto parts of the images that correspond to planar surfaces; see the supplementary video (https://www.youtube.com/watch?v=LtwavEeCkQ4&t=) for a sample sequence. Using the ground truth camera absolute pose , the ground truth marker absolute pose is .
Table I: |Seq||Precision (%)||# markers mapped||# cameras localised|
Table II: |Seq||Average marker pose error (deg, cm)||Average camera pose error (deg, cm)|
VI-A2 Marker detection
Using the steps above, we generated five testing sequences from Bedroom(B), Hotel1(H1), Hotel2(H2), Office1(O1) and Office2(O2). We used [garrido2014automatic] to detect, identify and localise the corners of each marker in each frame; see Table I for the number of frames and unique detected markers in each sequence. Though the markers were synthetically warped into the images, our analysis suggests that corner localisation suffered from errors of 1–7 pixels.
Table IV: Mean and median camera position error, relative to FM (four values per cell, one per compared method).
|Dataset||Mean err. (m)||Median err. (m)|
|ece floor4 wall||5.28, 2.72, 20.95, 2.56||5.35, 2.03, 18.09, 2.12|
|ece floor5 stairs||1.58, 3.18, 4.07, 1.14||0.96, 2.64, 3.72, 0.82|
|cee night cw||30.21, 34.79, 75.57, 19.06||19.25, 24.21, 76.42, 10.12|
VI-A3 Ground truth M2C pose selection
On the noisy corner localisations, PPE [collins2014infinitesimal] is invoked, which yields two M2C relative poses for each detected marker. To decide the ground truth selection, we compute the angular difference between and as
The ground truth selection of the PPE poses is taken as the one with the lower angular difference .
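The ground truth selection can be sketched with the standard geodesic angle between two rotations, arccos((trace(A^T B) - 1) / 2). That this is the exact angular difference used in the experiments is an assumption, and the function names are illustrative.

```python
import math

def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def angular_difference(Ra, Rb):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(Ra^T Rb) equals the sum of the element-wise products.
    tr = sum(Ra[r][c] * Rb[r][c] for r in range(3) for c in range(3))
    return math.degrees(math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0))))

def ground_truth_selection(R_gt, candidates):
    """Index of the PPE rotation closest to the ground truth, plus all angles."""
    diffs = [angular_difference(R_gt, R) for R in candidates]
    return min(range(len(diffs)), key=diffs.__getitem__), diffs

# Toy example: the first PPE solution is slightly off, the second is flipped.
R_gt = rot_x(0.3)
index, diffs = ground_truth_selection(R_gt, [rot_x(0.35), rot_x(-0.3)])
```

Clamping the cosine to [-1, 1] guards against round-off pushing the argument of `acos` just outside its domain when the two rotations are nearly identical.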
For the hybrid data experiment, we evaluated all the approaches on two main aspects; see supplementary video for demonstration of our pose disambiguation method.
Precision in pose disambiguation
For each testing sequence, precision in pose disambiguation is defined as
Table I shows that Ours generally has higher precision than the others. The fact that M4 (the control method) is much poorer than Ours indicates that enforcing the proposed clique consistency is crucial for disambiguating the PPE poses. Amongst the per-marker disambiguation methods (M1–M3), M1 has the lowest precision, validating observations in previous works that comparing reprojection errors alone is not foolproof. Adding a ratio test to avoid decisions on cases that are too ambiguous helps to improve precision in M2 and M3. In particular, the precision of M2 is on par with Ours. However, as we show next, this gain by M2 comes at a cost.
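A hedged reading of the precision measure is sketched below: the percentage of correct selections among the detections for which a method actually made a decision, so that discarded detections do not count against it. This interpretation is an assumption, chosen because it is consistent with a strict ratio test gaining precision by abstaining; the data structures are hypothetical.

```python
def precision(selected, ground_truth):
    """Percentage of decided detections whose selection matches ground truth.

    selected: {detection: 0, 1, or None when the detection was discarded}
    ground_truth: {detection: 0 or 1}
    """
    decided = {d: s for d, s in selected.items() if s is not None}
    if not decided:
        return float("nan")
    correct = sum(1 for d, s in decided.items() if ground_truth[d] == s)
    return 100.0 * correct / len(decided)

# Four detections keyed by (marker, image); one of them was discarded.
ground_truth = {("m1", 0): 0, ("m2", 0): 0, ("m1", 1): 1, ("m2", 1): 1}
selected = {("m1", 0): 0, ("m2", 0): 1, ("m1", 1): None, ("m2", 1): 1}
p = precision(selected, ground_truth)
```

Under this reading, precision says nothing about how many detections survive, which is why it has to be read together with the completeness figures.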
Completeness and accuracy of SfM
To assess the effects of marker pose disambiguation on SfM, we evaluate
the number of markers mapped and cameras localised; and
the error (in deg and cm) of the marker and camera poses
estimated by marker-based SfM from the disambiguated PPE poses; see Tables I and II, respectively. Although M2 is precise, it yields a much sparser map than the others; moreover, as it has pruned away many useful detections, there are insufficient data to allow accurate SfM. Using our pose disambiguation technique leads to more complete and accurate maps.
VI-B Real world dataset experiment
Testing was performed on sequences from [degol2018improved]. We selected 3 indoor scenes with different difficulty levels: ece floor4 wall, ece floor5 stairs and cee night cw. There are unique markers placed in the scene in each sequence. To enable comparisons, we invoked [degol2018improved] (denoted as FM), which conducts both feature- and marker-based SfM on the sequences. Since SfM with M2 failed in all 3 sequences due to insufficient data for optimisation, no comparison with M2 is made.
Qualitative results in Table III show that Ours is more accurate than M1 and M3 in marker-based SfM - of course, Ours is visibly not as complete as FM, but the latter uses features on top of markers, which entails heavier computations. Using the estimated camera positions by FM as reference, we obtain the position errors (in m) computed by the marker-based SfM methods - normalised and plotted as a cumulative density in Fig. 4. It is apparent that Ours is much more accurate in camera localisation, especially in the most challenging sequence cee night cw. Table IV lists the mean and median position error, relative to FM.