Resolving Marker Pose Ambiguity by Robust Rotation Averaging
with Clique Constraints*
Abstract
Planar markers are useful in robotics and computer vision for mapping and localisation. Given a detected marker in an image, a frequent task is to estimate the 6DOF pose of the marker relative to the camera, which is an instance of planar pose estimation (PPE). Although there are mature techniques, PPE suffers from a fundamental ambiguity problem, in that there can be more than one plausible pose solution for a PPE instance. Especially when localisation of the marker corners is noisy, it is often difficult to disambiguate the pose solutions based on reprojection error alone. Previous methods choose between the possible solutions using heuristic criteria, or simply ignore ambiguous markers.
We propose to resolve the ambiguities by examining the consistencies of a set of markers across multiple views. Our specific contributions include a novel rotation averaging formulation that incorporates long-range dependencies between possible marker orientation solutions that arise from PPE ambiguities. We analyse the combinatorial complexity of the problem, and develop a novel lifted algorithm to effectively resolve marker pose ambiguities, without discarding any marker observations. Results on real and synthetic data show that our method is able to handle highly ambiguous inputs, and provides more accurate and/or complete marker-based mapping and localisation.
I Introduction
In many robotic vision pipelines, fiducial markers are employed to simplify feature extraction. In particular, planar markers [wang2016apriltag, fiala2004artag, fiala2005artag, garrido2014automatic, romero2018speeded, hu2019deep], which are designed to be easily detected and associated across images, find extensive use in laboratory and commercial settings (factories, warehouses, mines, etc.). In applications that perform planar marker-based SfM or SLAM [shaya2012self, munoz2018mapping, degol2018improved, munoz2019spm], there is a basic need to estimate the 6DOF pose of an observed marker relative to the camera coordinate frame. This is often solved as a special case of planar pose estimation (PPE), which determines the relative pose between a plane of known dimensions and its projection onto the image [oberkampf1993iterative, schweighofer2006robust, collins2014infinitesimal].
While in theory 6DOF pose can be determined uniquely from four non-collinear but coplanar points, the situation is less clear in non-ideal conditions where perspective effects are not apparent, e.g., when the imaged marker is small or the marker is at a distance which is significantly larger than the focal length. In such conditions there is a two-fold rotational ambiguity that corresponds to an unknown reflection of the plane about the z-axis of the camera [oberkampf1993iterative, schweighofer2006robust, collins2014infinitesimal]. For one observed planar marker (specifically its four corners), state-of-the-art PPE methods [schweighofer2006robust, collins2014infinitesimal] may return two physically plausible pose solutions, with one of them being the correct one (i.e., the one closer to the ground truth pose).
Fig. 1 shows an example from the dataset of [degol2018improved]. Note that the two solutions returned by PPE can be very different, thus it is unwise to arbitrarily choose one of the two poses, or take the midpoint of the two solutions as the pose estimate.
A common way to disambiguate the two returned poses is to compute the reprojection error of each pose
(1) 
where and are the reference 3D position and 2D observation of the 4 corners of the detected marker, is the camera intrinsic parameter and projects onto the image with camera pose . The PPE pose with the lower reprojection error is then selected.
However, comparing reprojection errors is not foolproof [wu2012stable, munoz2019spm], for if the corner localisation is noisy, and can be very close. In fact, the correct solution can have the higher reprojection error; see Fig. 1.
In practice, marker pose ambiguity occurs regularly [munoz2018mapping]. Fig. 2(a) is the histogram of the reprojection error ratio
(2) 
of the PPE-derived poses for all the markers detected in sequence Hotel2 (H2) from [dai2017scannet]. About 25% of the PPE solutions are considered ambiguous (ratio value above the 0.6 threshold of [munoz2018mapping]).
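The reprojection error and the error ratio can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a pinhole camera with intrinsic matrix `K`, a sum-of-squares error over the four corners, and a ratio defined as the lower error divided by the higher one (so that values near 1 mean "ambiguous"). The function names are hypothetical.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of Nx3 marker-frame points X under the pose (R, t)."""
    Xc = X @ R.T + t              # points expressed in the camera frame
    x = Xc @ K.T                  # homogeneous pixel coordinates
    return x[:, :2] / x[:, 2:3]   # perspective division

def reprojection_error(K, R, t, X, u):
    """Sum of squared pixel residuals over the marker corners (assumed norm)."""
    return float(np.sum((project(K, R, t, X) - u) ** 2))

def error_ratio(err_a, err_b):
    """Lower error divided by higher error; a value near 1 is ambiguous."""
    lo, hi = sorted((err_a, err_b))
    return lo / hi if hi > 0 else 1.0
```

With noisy corners the two errors can be close, in which case the ratio approaches 1 and the choice by reprojection error alone becomes unreliable.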
While current theory and algorithms for PPE [schweighofer2006robust, collins2014infinitesimal] have characterised the ambiguity issue and are able to compute all physically plausible solutions stably, using the PPE outputs under ambiguity, particularly in marker-based SfM or SLAM pipelines, remains a fundamental challenge. In the following, we further survey efforts to deal with marker pose ambiguity, before outlining the proposed solution.
I-A Related work
Tanaka et al. [tanaka2014solution, tanaka2017solving] modified the conventional planar marker design to directly incorporate orientation information. They attach two one-dimensional moiré patterns onto the marker to obtain appearance variation for pose disambiguation, as well as lenticular lenses that introduce 3D deviations to the marker surface. Though this largely alleviates the ambiguity problem, the marker fabrication is non-trivial.
For planar target camera tracking, a filtering method with a well-tuned camera motion model [wu2012stable, uematsu2007improvement] can be exploited to disambiguate the marker poses. However, this assumes temporal continuity in the images, which may not be valid in SfM with wide baseline images; moreover, there are no mature filtering methods for marker SLAM. Jin et al. [jin2017sensor] showed improved marker pose estimation accuracy by fusing depth information. However, this requires an RGB-D camera.
Marker-based SfM/SLAM is an active research area [shaya2012self, neunert2016open, munoz2018mapping, degol2018improved, munoz2019spm]. Marker ambiguity is not dealt with explicitly in [neunert2016open, shaya2012self, degol2018improved], though [degol2018improved] combined feature-based SfM with marker-based SfM. Muñoz-Salinas et al. applied the ratio test of [collins2014infinitesimal] in their marker-based SfM [munoz2018mapping] and SLAM pipeline [munoz2019spm]. Basically, if the ratio (2) is below a threshold (default is 0.6 [munoz2018mapping]), the PPE solution with the lower reprojection error is used in subsequent SfM/SLAM processing; else, the marker detection is discarded. A weakness of this approach is the sensitivity to the threshold. If it is too low, many marker detections will be excluded, leading to data wastage or even SfM/SLAM failure. On the other hand, a high threshold risks using bad marker poses (recall that the pose with the lower reprojection error may not be the correct one) for SfM/SLAM. Sec. VI will demonstrate this shortcoming.
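The per-marker ratio test described above can be sketched in a few lines. This is an illustrative reading of the text, assuming the ratio is the lower reprojection error divided by the higher one; the function name and the `None` return value for a discarded detection are hypothetical conventions, not taken from [munoz2018mapping].

```python
def ratio_test(err0, err1, threshold=0.6):
    """Return the index (0 or 1) of the accepted PPE solution, or None to discard.

    err0, err1: reprojection errors of the two PPE solutions.
    threshold:  0.6 is the default quoted in the text.
    """
    lo, hi = min(err0, err1), max(err0, err1)
    ratio = lo / hi if hi > 0 else 1.0
    if ratio < threshold:
        # Clearly separated errors: keep the solution with the lower error.
        return 0 if err0 <= err1 else 1
    # Errors too similar: the detection is treated as ambiguous and dropped.
    return None
```

Lowering `threshold` discards more detections, while raising it accepts more decisions that may be wrong, which is the sensitivity discussed above.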
I-B Our contributions
Unlike previous works that have used a per-marker approach to resolve marker ambiguity, we exploit multi-view constraints for disambiguation. From the input marker detections, we first construct a multigraph of relative rotation measurements, which incorporates all PPE pose ambiguities. Then, we formulate a novel rotation averaging problem with clique constraints that respects consistency (details later) between subsets of relative pose measurements. We examine the combinatorial complexity of the new problem, and develop a lifted optimisation method to efficiently solve it. Then, a series of small maximal weighted clique problems are solved to make the final pose selections. Our method allows all valid PPE pose combinations to be examined, and leads to more accurate and/or complete marker-based SfM.
II Problem formulation
Consider input images that observe a set of markers of known sizes in a static scene. We assume calibrated cameras. A standard marker detection and identification algorithm [opencv_library] is applied to each image. Denote by
(3) 
as the set of markers detected in . Using a PPE technique [schweighofer2006robust, collins2014infinitesimal] on the corners of detected in , the marker-to-camera (M2C) relative pose of to is computed, which can potentially yield two solutions
(4) 
Without loss of generality, we assume that each marker observation has exactly two relative pose solutions. Note that the pose ambiguity is due to orientation ambiguity, thus the translation component is the same, i.e.,
(5) 
Given the set of all M2C relative pose measurements
(6) 
our overall aim is SfM, i.e., find the absolute poses of the markers and cameras . To do so, pose ambiguity must be resolved, i.e., for each such that , choose either or for SfM computations.
Previous pipelines [munoz2018mapping, munoz2019spm] make the choice using per-marker heuristics, or discard the marker observation. This “preprocessing” yields the reduced measurement set
(7) 
where each is either or , and . The reduced measurement set is then subjected to the rest of the SfM/SLAM pipeline. Our new method exploits multi-view consistency to disambiguate the PPE marker poses in a way that avoids premature decisions; details as follows.
III Multigraph with rotational ambiguity
Since the ambiguity lies in the orientations, it is natural to model the ambiguity using only the M2C relative rotations
(8) 
To this end, we construct a multigraph , where the vertex set is the set of markers , and the edges indicate co-visibility between the markers. More specifically, if and are detected in , four edges
(9) 
connect vertices and in ; assuming , the edges correspond to the marker-to-marker (M2M) relative rotations
(10)  
Fig. 3 shows an example. Since multiple edges connect two vertices, is a multigraph. We summarise (9) and (10) as
(11) 
where is a bit string composed of two binary indicators . The edges in are undirected; if , the edge has the associated M2M relative rotation
(12) 
Thus, in our notation
(13) 
The set of all edges (without repetitions) is thus
(14) 
Similarly, the set of unique M2M relative rotations is
(15) 
The existence of four M2M relative rotations per pair is a direct consequence of ambiguity in marker pose estimation, and the bit string selects a particular combination of M2C relative rotations to derive the M2M relative rotation.
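The four M2M relative rotations for one co-visible marker pair can be sketched as below. The composition convention is an assumption: each M2C rotation is taken to map the marker frame to the camera frame, so the relative rotation between markers i and j is written as `Ri.T @ Rj`; the document's own convention may order or invert the factors differently. `rot_z` and `m2m_candidates` are hypothetical helper names.

```python
import itertools
import numpy as np

def rot_z(angle):
    """Rotation by `angle` radians about the z-axis (helper for examples)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def m2m_candidates(Ri, Rj):
    """Compose the four candidate M2M rotations for markers i and j in one image.

    Ri, Rj: pairs (solution 0, solution 1) of 3x3 M2C rotations from PPE.
    Returns a dict keyed by the bit string (bi, bj) that records which PPE
    solution of each marker was used.
    """
    return {
        (bi, bj): Ri[bi].T @ Rj[bj]   # assumed convention, see lead-in
        for bi, bj in itertools.product((0, 1), repeat=2)
    }
```

Each key of the returned dictionary plays the role of the bit string in the text: it selects one M2C solution per marker, so only one of the four entries is built from the two correct PPE solutions.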
Note that our multigraph construction method is a significant extension of that in [munoz2018mapping], in that our multigraph incorporates all ambiguous marker poses, whereas [munoz2018mapping] generates from the preprocessed data (7) with no ambiguities.
III-A Consistent cliques
We assume that the multigraph is connected, i.e., there is a path that connects every pair of vertices (markers) in .
Definition 1
(Consistent clique) Given multigraph as defined above, a consistent clique for image is a fully connected subgraph such that

;

Every two vertices are connected by exactly one edge , where is one of .

For every two vertices that are connected to vertex , the associated edges and satisfy the condition .
Fig. 3 provides examples. Intuitively, a consistent clique for image corresponds to a set of M2M relative rotations that are composed using a constant selection of one of the two M2C relative poses for each marker detected in .
Since there are multiple valid combinations of constant M2C relative pose selections, there are multiple consistent cliques for an image. Assuming that markers are detected in each image, there are number of consistent cliques per image. For images, there are thus unique combinations of consistent cliques across the images.
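To make the counting argument concrete, the sketch below enumerates the consistent cliques of a single image. It relies on the reading given above that a consistent clique corresponds to one fixed choice of PPE solution per detected marker, so an image with m detected markers yields 2^m of them. The function name and the edge representation are hypothetical.

```python
import itertools

def consistent_cliques(marker_ids):
    """Yield (choice, edges) for every consistent clique of one image.

    choice: dict marker id -> selected PPE solution (0 or 1).
    edges:  dict (a, b) -> bit string (choice[a], choice[b]), one edge per pair.
    """
    for bits in itertools.product((0, 1), repeat=len(marker_ids)):
        choice = dict(zip(marker_ids, bits))
        edges = {
            (a, b): (choice[a], choice[b])
            for a, b in itertools.combinations(marker_ids, 2)
        }
        yield choice, edges
```

For three markers this produces 8 cliques with 3 edges each. Since the choices in different images are independent, the number of combinations across images is the product of the per-image counts, which is why exhaustive enumeration quickly becomes impractical.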
IV Disambiguation with rotation averaging
Based on the multigraph, our technique resolves the ambiguities by first solving a novel rotation averaging formulation, then, based on the averaging results, building and solving a maximum weighted clique problem. The key outcome of this step is marker pose disambiguation; Sec. V will incorporate this step into a marker-based SfM pipeline.
IV-A Rotation averaging with clique constraints
While standard rotation averaging is defined over a graph of relative rotations [hartley2013rotation, chatterjee2013efficient], extending the formulation to a multigraph of relative rotations is straightforward, and existing algorithms (we used [chatterjee2013efficient]) can be applied with minor adjustments. Let be the absolute rotations of the markers. A rotation averaging problem over multigraph is
(16) 
where is a robust norm. The motivation behind (16) is to attempt to identify the incorrect poses from PPE as the contributors to outlying measurements in the averaging task.
However, our tests (Sec. VI) suggest that this approach is ineffective for disambiguation, most probably because (16) does not enforce clique consistency (Def. 1). Thus, error terms that are regarded as inliers could correspond to choosing both PPE poses for the same marker detection.
To enforce clique consistency into rotation averaging, we introduce a set of binary indicator variables
(17) 
where the setting implies selecting M2C relative rotation the detection of in , while implies selecting . We then formulate the cliqueconstrained rotation averaging problem
(18)  
Intuitively, selects the M2C relative rotations to compose the M2M relative rotations in a consistent way. Searching over thus allows different consistent cliques in all images to be examined. Finally, since are shared across images, multiview consistency is exploited to choose the best combinations of the PPE relative rotations.
IV-B Efficient algorithm using lifting approach
A naive method to solve (18) is to enumerate , and for each instantiation, collect the nonzero terms in (18) and solve the resulting rotation averaging problem. Then, return the with the lowest optimised error as the disambiguation decision. Since there are possible instantiations of (assuming markers seen per image), this is infeasible.
To enable an efficient algorithm for (18), we apply the lifting approach [sunderhauf2012towards]. First, we relax the indicator variables and replace them in (18) with a sigmoid function
(19) 
which yields the “smoothed” version of (18)
(20)  
Intuitively, the contribution of an error term in (20) is now weighted according to correctness of the corresponding M2C relative poses that define the error term.
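The weighting idea can be illustrated with a small sketch. It is only one plausible reading of the relaxation, not the exact form of (20): each detection is given a real-valued variable `theta`, `sigmoid(theta)` is treated as the soft weight of PPE solution 1 and `1 - sigmoid(theta)` as the weight of solution 0, and the weight of an error term is assumed to be the product of the weights of the two M2C solutions that define it. All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def edge_weights(theta_i, theta_j):
    """Soft weights of the four M2M error terms for one marker pair.

    theta_i, theta_j: relaxed indicator variables of the two detections.
    Returns a dict keyed by the bit string (bi, bj).
    """
    si, sj = sigmoid(theta_i), sigmoid(theta_j)
    pi = (1.0 - si, si)   # soft weights of solutions 0 and 1 for marker i
    pj = (1.0 - sj, sj)   # soft weights of solutions 0 and 1 for marker j
    return {(bi, bj): pi[bi] * pj[bj] for bi in (0, 1) for bj in (0, 1)}
```

Under this reading the four weights of a pair always sum to 1, and as the relaxed variables move away from zero the weight concentrates on a single bit string, which recovers a hard selection in the limit.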
Problem (20) can be solved using an iterative nonlinear optimiser (e.g., fmincon in MATLAB). We initialise via a minimum spanning tree on , choosing the M2M relative rotations with the lower combined reprojection errors for chaining, and is set to reflect these choices. As we will show in Sec. VI, our method is not biased by such an initialisation, since it is capable of providing more accurate disambiguation than comparing reprojection errors alone.
IV-C Selecting the marker poses
Let be the optimised relaxed indicator variables from solving (20). For the same sequence used in Fig. 2(a), we plot in Fig. 2(b) the histogram of the ratios
(21) 
for all . Similar to (2), the ratio (21) indicates how “disambiguable” the PPE poses are for each marker detection (smaller ratios are better), but now based on the value of . Although is not discrete, the percentage of marker poses that are still ambiguous is now significantly reduced.
To conclusively select one PPE pose per detected marker, a simple solution would be to threshold each with ; however, we would like to avoid such a per-marker decision. To this end, for each image we construct the multigraph , where , and
(22) 
Note that is a sub-multigraph of , and there exist consistent cliques in (see Sec. III-A). Further, each edge in has the weight
(23) 
Given , define edge indicator variables
and the maximum weighted clique (MWC) problem
() 
Basically, the aim of is to find a consistent clique in with the largest edge weights. Though MWC is intractable in general [tomita2003efficient], each instance is small, since the number of detected markers in is small (usually ).
We use the efficient clique solver of [eppstein2011listing] on each . The optimised provides a consistent selection of the PPE poses for all markers detected in . Specifically, for each detected in , find a that is nonzero, and set if , or otherwise.
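Because each per-image instance is small, the selection step can also be illustrated by plain enumeration. The sketch below is a brute-force stand-in, not the clique solver of [eppstein2011listing]: it tries every per-marker choice, sums the weights of the edges that the choice induces, and keeps the best one. The weight dictionary layout `(a, b, bit_a, bit_b)` and the function name are hypothetical.

```python
import itertools

def best_consistent_clique(marker_ids, weight):
    """Return (choice, score) of the consistent clique with the largest weight.

    marker_ids: markers detected in one image.
    weight:     dict (a, b, bit_a, bit_b) -> edge weight, for every pair (a, b)
                in the order produced by itertools.combinations(marker_ids, 2).
    """
    pairs = list(itertools.combinations(marker_ids, 2))
    best_choice, best_score = None, float("-inf")
    for bits in itertools.product((0, 1), repeat=len(marker_ids)):
        choice = dict(zip(marker_ids, bits))
        score = sum(weight[(a, b, choice[a], choice[b])] for a, b in pairs)
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score
```

The cost grows as 2^m in the number m of detected markers, which is acceptable only because m is small per image; the returned `choice` directly gives the selected PPE solution for every marker in that image.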
Algorithm 1 summarises the proposed method for marker pose disambiguation.
V Marker-based SfM pipeline
To carry out marker-based SfM using our marker pose disambiguation method, we largely follow the pipeline of the state-of-the-art MarkerMapper [munoz2018mapping]. Briefly, a robust pose graph optimisation is first invoked on the resolved M2C relative poses (7) from Algorithm 1 to yield absolute marker poses; in our case, the absolute rotation component is initialised using the output from solving (20). Then, each camera pose is initialised using single pose averaging from the M2C poses, before all marker and camera poses are refined simultaneously by bundle adjustment on the observed corners of all detected markers. We refer to [munoz2018mapping] for details of the SfM pipeline.
VI Results
To assess the efficacy of the proposed marker pose disambiguation technique, we compared the following methods:


Reprojection error (M1): For each marker detection, select the PPE solution with the lower reprojection error.

Default ratio test (M3): The threshold of 0.6 is applied on the reprojection error ratio (the default setting in [munoz2018mapping]).

Proposed method (Ours): As described in Sec. IV.
When applying the above disambiguation methods to perform marker-based SfM, we simply used them to preprocess the input marker detections, then executed the rest of the pipeline of MarkerMapper [munoz2018mapping] (see Sec. V). All the experiments were conducted on a machine with a 3.5GHz CPU and 8GB of RAM.
VI-A Experiments on hybrid data
VI-A1 Data generation
We used the ScanNet dataset [dai2017scannet], which contains a number of sequences with ground truth 6DOF camera poses and depth. A test sequence was created from an original sequence by warping a number of ArUco markers [garrido2014automatic, romero2018speeded] based on known/ground truth M2C relative poses onto parts of the images that correspond to planar surfaces; see the supplementary video (https://www.youtube.com/watch?v=LtwavEeCkQ4&t=) for a sample sequence. Using the ground truth camera absolute pose , the ground truth marker absolute pose is .
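Seq | # markers | # frames | Precision (%) | | | | | # markers mapped | | | | | # cameras localised | | | | |
 | | | M1 | M2 | M3 | M4 | Ours | M1 | M2 | M3 | M4 | Ours | M1 | M2 | M3 | M4 | Ours
B | 3 | 31 | 94.32 | 100 | 92.31 | 31.82 | 100 | 3 | 0 | 3 | 3 | 3 | 31 | 0 | 31 | 31 | 31
H1 | 5 | 41 | 80.68 | 100 | 82.61 | 22.16 | 100 | 5 | 0 | 5 | 5 | 5 | 41 | 0 | 40 | 41 | 41
O1 | 7 | 51 | 77.08 | 96.97 | 78.8 | 14.58 | 96.52 | 7 | 7 | 7 | 7 | 7 | 51 | 41 | 51 | 51 | 51
O2 | 6 | 91 | 92.64 | 100 | 98.95 | 37.94 | 99.41 | 6 | 4 | 6 | 6 | 6 | 91 | 46 | 91 | 91 | 91
H2 | 14 | 151 | 93.42 | 98.94 | 97.89 | 48.16 | 100 | 14 | 13 | 14 | 14 | 14 | 151 | 101 | 151 | 151 | 151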
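Seq | Average marker pose error (deg, cm) | | | | | | | | | | Average camera pose error (deg, cm) | | | | | | | | |
 | M1 deg | M1 cm | M2 deg | M2 cm | M3 deg | M3 cm | M4 deg | M4 cm | Ours deg | Ours cm | M1 deg | M1 cm | M2 deg | M2 cm | M3 deg | M3 cm | M4 deg | M4 cm | Ours deg | Ours cm
B | 5.4 | 11.7 | - | - | 6.3 | 15.0 | 19.0 | 37.5 | 2.3 | 2.2 | 7.0 | 15.9 | - | - | 11.9 | 19.5 | 32.0 | 10.0 | 0.8 | 2.0
H1 | 11.7 | 13.0 | - | - | 12.5 | 15.0 | 39.1 | 26.3 | 3.3 | 8.6 | 14.8 | 27.5 | - | - | 17.6 | 41.6 | 37.9 | 28.8 | 5.0 | 3.2
O1 | 26.2 | 30.3 | 15.2 | 8.0 | 25.4 | 29.0 | 55.3 | 120.9 | 3.5 | 4.3 | 17.3 | 69.8 | 7.6 | 16.0 | 19.2 | 69.4 | 85.8 | 49.7 | 5.7 | 13.7
O2 | 8.7 | 6.6 | 4.4 | 4.2 | 4.1 | 2.6 | 28.0 | 63.2 | 4.2 | 2.4 | 6.2 | 10.5 | 0.8 | 2.4 | 17.4 | 4.0 | 41.6 | 40.1 | 1.3 | 3.4
H2 | 4.3 | 5.1 | 7.7 | 3.1 | 5.4 | 5.5 | 20.3 | 14.2 | 3.6 | 4.9 | 4.3 | 3.8 | 2.2 | 2.3 | 3.3 | 3.1 | 32.0 | 10.0 | 3.4 | 2.4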
VI-A2 Marker detection
Using the steps above, we generated five testing sequences from Bedroom(B), Hotel1(H1), Hotel2(H2), Office1(O1) and Office2(O2). We used [garrido2014automatic] to detect, identify and localise the corners of each marker in each frame; see Table I for the number of frames and unique detected markers in each sequence. Though the markers were synthetically warped into the images, our analysis suggests that corner localisation suffered from errors of 1–7 pixels.
[Qualitative results for M1, M3, M4, Ours and FM; images not reproduced.]

Dataset | Mean err. (m) | | | | Median err. (m) | | |
 | M1 | M3 | M4 | Ours | M1 | M3 | M4 | Ours
ece floor4 wall | 5.28 | 2.72 | 20.95 | 2.56 | 5.35 | 2.03 | 18.09 | 2.12
ece floor5 stairs | 1.58 | 3.18 | 4.07 | 1.14 | 0.96 | 2.64 | 3.72 | 0.82
cee night cw | 30.21 | 34.79 | 75.57 | 19.06 | 19.25 | 24.21 | 76.42 | 10.12
VI-A3 Ground truth M2C pose selection
On the noisy corner localisations, PPE [collins2014infinitesimal] is invoked, which yields two M2C relative poses for each detected marker. To decide the ground truth selection, we compute the angular difference between and as
(24) 
The ground truth selection of the PPE poses is taken as the one with the lower angular difference .
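The selection rule can be sketched with the standard geodesic angle between two rotation matrices. This is an assumption about the form of the angular difference in (24), which is not reproduced here; the function names are hypothetical.

```python
import numpy as np

def angular_difference(Ra, Rb):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    # Clip to guard against round-off pushing the value outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def ground_truth_selection(R_gt, R0, R1):
    """Index (0 or 1) of the PPE rotation closer to the ground truth rotation."""
    d0 = angular_difference(R_gt, R0)
    d1 = angular_difference(R_gt, R1)
    return 0 if d0 <= d1 else 1
```

Only the rotation enters the comparison, in line with the earlier observation that the two PPE solutions differ in orientation.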
VI-A4 Results
For the hybrid data experiment, we evaluated all the approaches on two main aspects; see supplementary video for demonstration of our pose disambiguation method.
Precision in pose disambiguation
For each testing sequence, precision in pose disambiguation is defined as
(25) 
Table I shows that Ours generally has higher precision than the others. The fact that M4 (the control method) is much poorer than Ours indicates that enforcing the proposed clique consistency is crucial for disambiguating the PPE poses. Amongst the per-marker disambiguation methods (M1–M3), M1 has the lowest precision, validating observations in previous works that comparing reprojection errors alone is not foolproof. Adding a ratio test to avoid decisions on cases that are too ambiguous helps to improve precision in M2 and M3. In particular, the precision of M2 is on par with Ours. However, as we show next, this gain by M2 comes at a cost.
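One way to compute such a precision figure is sketched below. It assumes that precision is the percentage of correct selections among the detections for which a method actually made a selection, so that discarded detections (as in the ratio tests) do not count against it; the exact definition in (25) is not reproduced here and may differ. The function name and the use of `None` for a discarded detection are hypothetical.

```python
def disambiguation_precision(selected, ground_truth):
    """Percentage of correct PPE selections among the decided detections.

    selected:     per-detection choice (0 or 1), or None if discarded.
    ground_truth: per-detection ground truth choice (0 or 1).
    """
    decided = [(s, g) for s, g in zip(selected, ground_truth) if s is not None]
    if not decided:
        return float("nan")   # no decisions were made at all
    correct = sum(1 for s, g in decided if s == g)
    return 100.0 * correct / len(decided)
```

Under this assumed definition a method can reach high precision simply by deciding on few detections, which is why precision is read together with the completeness figures.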
Completeness and accuracy of SfM
To assess the effects of marker pose disambiguation on SfM, we evaluate


the number of markers mapped and cameras localised; and

the error (in deg and cm) of the marker and camera poses
estimated by marker-based SfM from the disambiguated PPE poses, in Tables I and II respectively. Although M2 is precise, it yields a much sparser map than the others; moreover, as it has pruned away many useful detections, there are insufficient data to allow accurate SfM. Using our pose disambiguation technique leads to more complete and accurate maps.
VI-B Real world dataset experiment
Testing was performed on sequences from [degol2018improved]. We selected 3 indoor scenes with different difficulty levels: ece floor4 wall, ece floor5 stairs and cee night cw. There are unique markers placed in the scene in each sequence. To enable comparisons, we invoked [degol2018improved] (denoted as FM), which conducts both feature-based and marker-based SfM on the sequences. Since SfM with M2 failed in all 3 sequences due to insufficient data for optimisation, it is excluded from the comparison.
Qualitative results in Table III show that Ours is more accurate than M1 and M3 in marker-based SfM. Ours is visibly not as complete as FM, but the latter uses features on top of markers, which entails heavier computation. Using the camera positions estimated by FM as reference, we obtain the position errors (in m) of the marker-based SfM methods, normalised and plotted as a cumulative density in Fig. 4. It is apparent that Ours is much more accurate in camera localisation, especially in the most challenging sequence cee night cw. Table IV lists the mean and median position error, relative to FM.