Resolving Marker Pose Ambiguity by Robust Rotation Averaging with Clique Constraints*


Shin-Fang Ch’ng, Naoya Sogi, Pulak Purkait, Tat-Jun Chin and Kazuhiro Fukui

*This work was supported by the ARC Centre of Excellence on Robotic Vision CE140100016 and the Mawson Lakes Fellowship Program.
School of Computer Science, The University of Adelaide, Australia.
Department of Computer Science, University of Tsukuba, Japan.

Planar markers are useful in robotics and computer vision for mapping and localisation. Given a detected marker in an image, a frequent task is to estimate the 6DOF pose of the marker relative to the camera, which is an instance of planar pose estimation (PPE). Although there are mature techniques, PPE suffers from a fundamental ambiguity problem, in that there can be more than one plausible pose solution for a PPE instance. Especially when localisation of the marker corners is noisy, it is often difficult to disambiguate the pose solutions based on reprojection error alone. Previous methods choose between the possible solutions using heuristic criteria, or simply ignore ambiguous markers.

We propose to resolve the ambiguities by examining the consistencies of a set of markers across multiple views. Our specific contributions include a novel rotation averaging formulation that incorporates long-range dependencies between possible marker orientation solutions that arise from PPE ambiguities. We analyse the combinatorial complexity of the problem, and develop a novel lifted algorithm to effectively resolve marker pose ambiguities, without discarding any marker observations. Results on real and synthetic data show that our method is able to handle highly ambiguous inputs, and provides more accurate and/or complete marker-based mapping and localisation.



I Introduction

In many robotic vision pipelines, fiducial markers are often employed to simplify feature extraction. In particular, planar markers [wang2016apriltag, fiala2004artag, fiala2005artag, garrido2014automatic, romero2018speeded, hu2019deep], which are designed to be easily detected and associated across images, find extensive use in laboratory and commercial settings (factories, warehouses, mines, etc.). In applications that perform planar marker-based SfM or SLAM [shaya2012self, munoz2018mapping, degol2018improved, munoz2019spm], there is a basic need to estimate the 6DOF pose of an observed marker relative to the camera coordinate frame. This is often solved as a special case of planar pose estimation (PPE), which functions by determining the relative pose between a plane of known dimensions and its projection onto the image [oberkampf1993iterative, schweighofer2006robust, collins2014infinitesimal].

While in theory 6DOF pose can be determined uniquely from four non-colinear but co-planar points, the situation is less clear in non-ideal conditions where perspective effects are not apparent, e.g., when the imaged marker is small or the marker is at a distance which is significantly larger than the focal length. In such conditions there is a two-fold rotational ambiguity that corresponds to an unknown reflection of the plane about the z-axis of the camera [oberkampf1993iterative, schweighofer2006robust, collins2014infinitesimal]. For one observed planar marker (specifically its four corners), state-of-the-art PPE methods [schweighofer2006robust, collins2014infinitesimal] may return two physically plausible pose solutions, with one of them being the correct one (i.e., the one closer to the ground truth pose).

Fig. 1 shows an example from the dataset of [degol2018improved]. Note that the two solutions returned by PPE can be very different, thus it is unwise to arbitrarily choose one of the two poses, or take the midpoint of the two solutions as the pose estimate.

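A common way to disambiguate the two returned poses $\mathbf{P}^{(0)}$ and $\mathbf{P}^{(1)}$ is to compute the reprojection error of each pose

$$e(\mathbf{P}) = \sum_{c=1}^{4} \left\| \mathbf{x}_c - \pi(\mathbf{X}_c \mid \mathbf{K}, \mathbf{P}) \right\|_2^2, \tag{1}$$

where $\mathbf{X}_c$ and $\mathbf{x}_c$ are the reference 3D position and 2D observation of the 4 corners of the detected marker, $\mathbf{K}$ is the camera intrinsic parameter and $\pi$ projects $\mathbf{X}_c$ onto the image with camera pose $\mathbf{P}$. The PPE pose with the lower reprojection error is then selected.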

However, comparing reprojection errors is not foolproof [wu2012stable, munoz2019spm], for if the corner localisation is noisy, the reprojection errors of the two poses can be very close. In fact, the correct solution can have the higher reprojection error; see Fig. 1.
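The selection rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera without skew or distortion, a sum of squared pixel distances as the error, and a hypothetical marker size, focal length and tilt angle.

```python
import math

def project(K, R, t, X):
    """Project 3D point X (marker frame) into the image with pose (R, t) and intrinsics K."""
    # Camera-frame coordinates: Xc = R X + t
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Pinhole model without skew or lens distortion (assumption).
    return (K[0][0] * Xc[0] / Xc[2] + K[0][2],
            K[1][1] * Xc[1] / Xc[2] + K[1][2])

def reprojection_error(K, R, t, corners_3d, corners_2d):
    """Sum of squared pixel distances over the marker corners."""
    total = 0.0
    for X, x in zip(corners_3d, corners_2d):
        u, v = project(K, R, t, X)
        total += (u - x[0]) ** 2 + (v - x[1]) ** 2
    return total

def rot_x(deg):
    """Rotation about the x-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

# Hypothetical setup: a 20 cm marker, 5 m in front of the camera.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
corners_3d = [(-0.1, -0.1, 0.0), (0.1, -0.1, 0.0), (0.1, 0.1, 0.0), (-0.1, 0.1, 0.0)]
t = (0.0, 0.0, 5.0)
pose_a, pose_b = (rot_x(25.0), t), (rot_x(-25.0), t)     # two candidate poses
observed = [project(K, *pose_a, X) for X in corners_3d]  # noise-free corners of pose_a

err_a = reprojection_error(K, *pose_a, corners_3d, observed)
err_b = reprojection_error(K, *pose_b, corners_3d, observed)
selected = 0 if err_a <= err_b else 1
```

In this noise-free toy case the wrong candidate is off by less than one squared pixel in total, so corner noise of a few pixels could easily reverse the ranking.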

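In practice, marker pose ambiguity occurs regularly [munoz2018mapping]. Fig. 2(a) is the histogram of the reprojection error ratio

$$\frac{\min\{e(\mathbf{P}^{(0)}),\, e(\mathbf{P}^{(1)})\}}{\max\{e(\mathbf{P}^{(0)}),\, e(\mathbf{P}^{(1)})\}} \tag{2}$$

(i.e., the lower reprojection error divided by the higher one) of the PPE-derived poses for all the markers detected in sequence Hotel2 (H2) from [dai2017scannet]. About 25% of the PPE solutions are considered ambiguous (ratio value above the default threshold of 0.6 [munoz2018mapping]).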

Fig. 1: (a) A detected marker with bounding box from a frame in the dataset of [degol2018improved]. (b) The two poses (yellow and blue) returned by PPE [collins2014infinitesimal]. Though one of them has the lower reprojection error, it is an incorrect pose, cf. the ground truth pose (green).

While current theory and algorithms for PPE [schweighofer2006robust, collins2014infinitesimal] have characterised the ambiguity issue and are able to compute all physically plausible solutions stably, using the PPE outputs under ambiguity, particularly in marker-based SfM or SLAM pipelines, remains a fundamental challenge. In the following, we further survey efforts to deal with marker pose ambiguity, before outlining the proposed solution.

(a) Histogram of error ratio (2).
(b) Histogram of weight ratio (21).
Fig. 2: Histogram of reprojection error ratio (2) and weight ratio (21) from proposed method (Sec. IV-C) for all markers detected in Hotel2 [dai2017scannet].

I-A Related work

Tanaka et al. [tanaka2014solution, tanaka2017solving] modified the conventional planar marker design to directly incorporate orientation information. They attach two one-dimensional moire patterns onto the marker to obtain appearance variation for pose disambiguation, as well as lenticular lenses that introduce 3D deviations to the marker surface. Though this largely alleviates the ambiguity problem, the marker fabrication is non-trivial.

For planar target camera tracking, a filtering method with a well-tuned camera motion model [wu2012stable, uematsu2007improvement] can be exploited to disambiguate the marker poses. However, this assumes temporal continuity in the images, which may not be valid in SfM with wide baseline images; moreover, there are no mature filtering methods for marker SLAM. Jin et al. [jin2017sensor] showed improved marker pose estimation accuracy by fusing depth information. However, this requires an RGBD camera.

Marker-based SfM/SLAM is an active research area [shaya2012self, neunert2016open, munoz2018mapping, degol2018improved, munoz2019spm]. Marker ambiguity is not dealt with explicitly in [neunert2016open, shaya2012self, degol2018improved], though [degol2018improved] combined feature-based SfM with marker-based SfM. Munoz-Salinas et al. applied the ratio test of [collins2014infinitesimal] in their marker-based SfM [munoz2018mapping] and SLAM pipeline [munoz2019spm]. Basically, if the ratio (2) is below a threshold (default is 0.6 [munoz2018mapping]), the PPE solution with the lower reprojection error is used in subsequent SfM/SLAM processing; else, the marker detection is discarded. A weakness of this approach is the sensitivity to the threshold. If it is too low, many marker detections will be excluded, leading to data wastage or even SfM/SLAM failure. On the other hand, a high threshold risks using bad marker poses (recall that the pose with the lower reprojection error may not be the correct one) for SfM/SLAM. Sec. VI will demonstrate this shortcoming.
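The ratio test described above can be sketched as follows. The sketch assumes the ratio is the lower reprojection error divided by the higher one (so it lies in [0, 1]); the function name and the handling of two zero errors are illustrative choices, not taken from [munoz2018mapping].

```python
def ratio_test(err_a, err_b, threshold=0.6):
    """Return the index (0 or 1) of the pose to keep, or None if the
    detection is too ambiguous and should be discarded."""
    low, high = min(err_a, err_b), max(err_a, err_b)
    if high == 0.0:
        return None  # both errors are zero: the two poses cannot be separated
    if low / high < threshold:
        return 0 if err_a <= err_b else 1
    return None

clear_case = ratio_test(1.0, 4.0)       # ratio 0.25: keep pose 0
ambiguous_case = ratio_test(3.0, 4.0)   # ratio 0.75: discard the detection
```

Lowering `threshold` discards more detections, while raising it admits selections in which the lower-error pose may still be the wrong one.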

I-B Our contributions

Unlike previous works that have used a per-marker approach to resolve marker ambiguity, we exploit multi-view constraints for disambiguation. From the input marker detections, we first construct a multigraph of relative rotation measurements, which incorporates all PPE pose ambiguities. Then, we formulate a novel rotation averaging problem with clique constraints that respects consistency (details later) between subsets of relative pose measurements. We examine the combinatorial complexity of the new problem, and develop a lifted optimisation method to efficiently solve it. Then, a series of small maximal weighted clique problems are solved to make the final pose selections. Our method allows all valid PPE pose combinations to be examined, and leads to more accurate and/or complete marker-based SfM.

II Problem formulation

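Consider $N$ input images $\{I_i\}_{i=1}^{N}$ that observe a set of markers $\{M_j\}_{j=1}^{m}$ of known sizes in a static scene. We assume calibrated cameras. A standard marker detection and identification algorithm [opencv_library] is applied to each image. Denote by

$$\mathcal{D}_i \subseteq \{M_1, \dots, M_m\} \tag{3}$$

the set of markers detected in $I_i$. Using a PPE technique [schweighofer2006robust, collins2014infinitesimal] on the corners of $M_j$ detected in $I_i$, the marker-to-camera (M2C) relative pose of $M_j$ to $I_i$ is computed, which can potentially yield two solutions

$$\mathbf{P}^{(0)}_{j,i} = (\mathbf{R}^{(0)}_{j,i}, \mathbf{t}^{(0)}_{j,i}), \qquad \mathbf{P}^{(1)}_{j,i} = (\mathbf{R}^{(1)}_{j,i}, \mathbf{t}^{(1)}_{j,i}). \tag{4}$$

Without loss of generality, we assume that each marker observation has exactly two relative pose solutions. Note that the pose ambiguity is due to orientation ambiguity, thus the translation component is the same, i.e.,

$$\mathbf{t}^{(0)}_{j,i} = \mathbf{t}^{(1)}_{j,i}. \tag{5}$$

Given the set of all M2C relative pose measurements

$$\mathcal{P} = \left\{ \mathbf{P}^{(0)}_{j,i}, \mathbf{P}^{(1)}_{j,i} \;\middle|\; M_j \in \mathcal{D}_i,\ i = 1, \dots, N \right\}, \tag{6}$$

our overall aim is SfM, i.e., find the absolute poses of the markers $\{M_j\}$ and cameras $\{I_i\}$. To do so, pose ambiguity must be resolved, i.e., for each $(j, i)$ such that $M_j \in \mathcal{D}_i$, choose either $\mathbf{P}^{(0)}_{j,i}$ or $\mathbf{P}^{(1)}_{j,i}$ for SfM computations.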

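Previous pipelines [munoz2018mapping, munoz2019spm] make the choice using per-marker heuristics, or discard the marker observation. This “preprocessing” yields the reduced measurement set

$$\tilde{\mathcal{P}} = \left\{ \tilde{\mathbf{P}}_{j,i} \;\middle|\; M_j \in \tilde{\mathcal{D}}_i,\ i = 1, \dots, N \right\}, \tag{7}$$

where each $\tilde{\mathbf{P}}_{j,i}$ is either $\mathbf{P}^{(0)}_{j,i}$ or $\mathbf{P}^{(1)}_{j,i}$, and $\tilde{\mathcal{D}}_i \subseteq \mathcal{D}_i$. The reduced measurement set is then subjected to the rest of the SfM/SLAM pipeline. Our new method exploits multi-view consistency to disambiguate the PPE marker poses in a way that avoids premature decisions; details as follows.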

III Multigraph with rotational ambiguity

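Since the ambiguity lies in the orientations, it is natural to model the ambiguity using only the M2C relative rotations

$$\mathcal{R} = \left\{ \mathbf{R}^{(0)}_{j,i}, \mathbf{R}^{(1)}_{j,i} \;\middle|\; M_j \in \mathcal{D}_i,\ i = 1, \dots, N \right\}. \tag{8}$$

To this end, we construct a multigraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the vertex set $\mathcal{V}$ is the set of markers $\{M_j\}_{j=1}^{m}$, and the edges $\mathcal{E}$ indicate covisibility between the markers. More specifically, if $M_j$ and $M_k$ are detected in $I_i$, four edges

$$\epsilon^{00}_{j,k,i},\quad \epsilon^{01}_{j,k,i},\quad \epsilon^{10}_{j,k,i},\quad \epsilon^{11}_{j,k,i} \tag{9}$$

connect vertices $j$ and $k$ in $\mathcal{G}$; assuming $j < k$, the edges correspond to the marker-to-marker (M2M) relative rotations

$$\mathbf{R}^{00}_{j,k,i} = (\mathbf{R}^{(0)}_{k,i})^\top \mathbf{R}^{(0)}_{j,i},\quad \mathbf{R}^{01}_{j,k,i} = (\mathbf{R}^{(1)}_{k,i})^\top \mathbf{R}^{(0)}_{j,i},\quad \mathbf{R}^{10}_{j,k,i} = (\mathbf{R}^{(0)}_{k,i})^\top \mathbf{R}^{(1)}_{j,i},\quad \mathbf{R}^{11}_{j,k,i} = (\mathbf{R}^{(1)}_{k,i})^\top \mathbf{R}^{(1)}_{j,i}. \tag{10}$$

Fig. 3 shows an example. Since multiple edges connect two vertices, $\mathcal{G}$ is a multigraph. We summarise (9) and (10) as

$$\epsilon^{\mathbf{b}}_{j,k,i} \;\leftrightarrow\; \mathbf{R}^{\mathbf{b}}_{j,k,i} = (\mathbf{R}^{(b_2)}_{k,i})^\top \mathbf{R}^{(b_1)}_{j,i}, \tag{11}$$

where $\mathbf{b} = b_1 b_2$ is a bit string composed of two binary indicators $b_1, b_2 \in \{0, 1\}$. The edges in $\mathcal{G}$ are undirected; if $j > k$, the edge has the associated M2M relative rotation

$$\mathbf{R}^{b_1 b_2}_{j,k,i} = (\mathbf{R}^{b_2 b_1}_{k,j,i})^\top. \tag{12}$$

Thus, in our notation

$$\epsilon^{b_1 b_2}_{j,k,i} = \epsilon^{b_2 b_1}_{k,j,i}. \tag{13}$$

The set of all edges (without repetitions) is thus

$$\mathcal{E} = \left\{ \epsilon^{\mathbf{b}}_{j,k,i} \;\middle|\; M_j, M_k \in \mathcal{D}_i,\ j < k,\ \mathbf{b} \in \{00, 01, 10, 11\},\ i = 1, \dots, N \right\}. \tag{14}$$

Similarly, the set of unique M2M relative rotations is

$$\left\{ \mathbf{R}^{\mathbf{b}}_{j,k,i} \;\middle|\; M_j, M_k \in \mathcal{D}_i,\ j < k,\ \mathbf{b} \in \{00, 01, 10, 11\},\ i = 1, \dots, N \right\}. \tag{15}$$

The existence of four M2M relative rotations per pair is a direct consequence of ambiguity in marker pose estimation, and the bit string $\mathbf{b}$ selects a particular combination of M2C relative rotations to derive the M2M relative rotation.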
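To make the four-edge construction concrete, the sketch below derives the four candidate M2M relative rotations for one covisible marker pair. It assumes an M2C rotation maps marker coordinates to camera coordinates and that the relative rotation is formed as the transpose of the second marker's rotation times the first marker's rotation; the helper names and the tilt angles are hypothetical.

```python
import math
from itertools import product

def matmul(A, B):
    """3x3 matrix product on nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def transpose(A):
    return [[A[c][r] for c in range(3)] for r in range(3)]

def rot_x(deg):
    """Rotation about the x-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def m2m_candidates(R_j, R_k):
    """R_j, R_k: (solution 0, solution 1) M2C rotations of two markers in one image.

    Returns the four candidate M2M relative rotations keyed by the bit string
    'b1b2': b1 picks the solution for marker j, b2 the solution for marker k.
    """
    return {f"{b1}{b2}": matmul(transpose(R_k[b2]), R_j[b1])
            for b1, b2 in product((0, 1), repeat=2)}

# Two markers, each with an ambiguous pair of orientations (tilts of +/-20 and +/-35 deg).
candidates = m2m_candidates((rot_x(20), rot_x(-20)), (rot_x(35), rot_x(-35)))
```

Each key of `candidates` plays the role of one edge label between the two marker vertices for that image; a pair seen in two images would contribute two such dictionaries, i.e. eight edges.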

Fig. 3: Multigraph and consistent cliques. (a) The scene has 4 markers captured in 3 images. All markers were detected in one image, while only a subset was detected in each of the other two. (b) Multigraph with the edges labelled following (9). Two markers that were covisible in two images are connected by 8 edges. (c) Two consistent cliques (red and blue) for one image.

Note that our multigraph construction method is a significant extension of that in [munoz2018mapping], in that our multigraph incorporates all ambiguous marker poses, whereas [munoz2018mapping] generates its graph from the preprocessed data (7) with no ambiguities.

III-A Consistent cliques

We assume that the multigraph $\mathcal{G}$ is connected, i.e., there is a path that connects every pair of vertices (markers) in $\mathcal{G}$.

Definition 1

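(Consistent clique) Given multigraph $\mathcal{G}$ as defined above, a consistent clique for image $I_i$ is a fully connected subgraph $\mathcal{C}_i = (\mathcal{V}_i, \mathcal{E}'_i)$ such that

  • $\mathcal{V}_i = \mathcal{D}_i$;

  • Every two vertices $j, k \in \mathcal{V}_i$ are connected by exactly one edge $\epsilon^{\mathbf{b}}_{j,k,i}$, where $\mathbf{b}$ is one of $\{00, 01, 10, 11\}$.

  • For every two vertices $j, k$ that are connected to vertex $l$, the associated edges $\epsilon^{\mathbf{b}}_{j,l,i}$ and $\epsilon^{\mathbf{b}'}_{k,l,i}$ satisfy the condition $b_2 = b'_2$.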

Fig. 3 provides examples. Intuitively, a consistent clique for image $I_i$ corresponds to a set of M2M relative rotations that are composed using a constant selection of one of the two M2C relative poses for each marker detected in $I_i$.

Since there are multiple valid combinations of constant M2C relative pose selections, there are multiple consistent cliques for an image. Assuming that $n$ markers are detected in each image, there are $2^n$ consistent cliques per image. For $N$ images, there are thus $(2^n)^N = 2^{nN}$ unique combinations of consistent cliques across the images.
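The counting argument can be checked by enumeration. The sketch below is illustrative only: it represents each consistent clique as a mapping from a marker pair to the two-bit label of the chosen edge, with one fixed 0/1 choice per marker; the marker names are hypothetical.

```python
from itertools import combinations, product

def consistent_cliques(marker_ids):
    """Enumerate all consistent cliques of one image.

    Each clique fixes one PPE solution (0 or 1) per detected marker and keeps,
    for every marker pair, the single edge whose bit string matches that choice.
    """
    cliques = []
    for choice in product((0, 1), repeat=len(marker_ids)):
        selected = dict(zip(marker_ids, choice))
        edges = {(j, k): f"{selected[j]}{selected[k]}"
                 for j, k in combinations(marker_ids, 2)}
        cliques.append(edges)
    return cliques

cliques = consistent_cliques(["M1", "M2", "M3"])  # 3 markers in one image
num_images = 4
joint_combinations = len(cliques) ** num_images   # combinations across 4 such images
```

With only 3 markers per image and 4 images there are already 4096 joint combinations, which illustrates how quickly exhaustive search becomes impractical.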

IV Disambiguation with rotation averaging

Based on the multigraph, our technique resolves the ambiguities by first solving a novel rotation averaging formulation, then - based on the averaging results - building and solving a maximum weighted clique problem. The key outcome of this step is marker pose disambiguation; Sec. V will incorporate this step into a marker-based SfM pipeline.

IV-A Rotation averaging with clique constraints

While standard rotation averaging is defined over a graph of relative rotations [hartley2013rotation, chatterjee2013efficient], extending the formulation to a multigraph of relative rotations is straightforward, and existing algorithms (we used [chatterjee2013efficient]) can be applied with minor adjustments. Let $\{\mathbf{R}_j\}_{j=1}^{m}$ be the absolute rotations of the markers. A rotation averaging problem over multigraph $\mathcal{G}$ is

$$\min_{\{\mathbf{R}_j\}} \sum_{\epsilon^{\mathbf{b}}_{j,k,i} \in \mathcal{E}} \left\| \mathbf{R}^{\mathbf{b}}_{j,k,i} - \mathbf{R}_k^\top \mathbf{R}_j \right\|_{\rho}, \tag{16}$$

where $\|\cdot\|_{\rho}$ is a robust norm. The motivation behind (16) is to attempt to identify the incorrect poses from PPE as the contributors to outlying measurements in the averaging task.

However, our tests (Sec. VI) suggest that this approach is ineffective for disambiguation, most probably because (16) does not enforce clique consistency (Def. 1). Thus, error terms that are regarded as inliers could correspond to choosing both PPE poses for the same marker detection.

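To enforce clique consistency in rotation averaging, we introduce a set of binary indicator variables

$$\mathbf{z} = \left\{ z_{j,i} \in \{0, 1\} \;\middle|\; M_j \in \mathcal{D}_i,\ i = 1, \dots, N \right\}, \tag{17}$$

where the setting $z_{j,i} = 0$ implies selecting M2C relative rotation $\mathbf{R}^{(0)}_{j,i}$ for the detection of $M_j$ in $I_i$, while $z_{j,i} = 1$ implies selecting $\mathbf{R}^{(1)}_{j,i}$. We then formulate the clique-constrained rotation averaging problem

$$\min_{\{\mathbf{R}_j\},\, \mathbf{z}} \sum_{i=1}^{N} \sum_{\substack{M_j, M_k \in \mathcal{D}_i \\ j < k}} \sum_{b_1, b_2 \in \{0,1\}} \mathbb{I}(z_{j,i} = b_1)\, \mathbb{I}(z_{k,i} = b_2) \left\| \mathbf{R}^{b_1 b_2}_{j,k,i} - \mathbf{R}_k^\top \mathbf{R}_j \right\|_{\rho}, \tag{18}$$

where $\mathbb{I}(\cdot)$ equals 1 if its argument holds and 0 otherwise. Intuitively, $\mathbf{z}$ selects the M2C relative rotations to compose the M2M relative rotations in a consistent way. Searching over $\mathbf{z}$ thus allows different consistent cliques in all images to be examined. Finally, since the absolute rotations $\{\mathbf{R}_j\}$ are shared across images, multi-view consistency is exploited to choose the best combinations of the PPE relative rotations.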

IV-B Efficient algorithm using lifting approach

A naive method to solve (18) is to enumerate the indicator variables $\mathbf{z}$, and for each instantiation, collect the non-zero terms in (18) and solve the resulting rotation averaging problem. Then, return the $\mathbf{z}$ with the lowest optimised error as the disambiguation decision. Since there are $2^{nN}$ possible instantiations of $\mathbf{z}$ (assuming $n$ markers seen per image in $N$ images), this is infeasible.

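To enable an efficient algorithm for (18), we apply the lifting approach [sunderhauf2012towards]. First, we relax the indicator variables $z_{j,i}$ to be real-valued and replace them in (18) with a sigmoid function

$$s(z) = \frac{1}{1 + e^{-z}}, \tag{19}$$

which yields the “smoothed” version of (18)

$$\min_{\{\mathbf{R}_j\},\, \mathbf{z}} \sum_{i=1}^{N} \sum_{\substack{M_j, M_k \in \mathcal{D}_i \\ j < k}} \sum_{b_1, b_2 \in \{0,1\}} s_{b_1}(z_{j,i})\, s_{b_2}(z_{k,i}) \left\| \mathbf{R}^{b_1 b_2}_{j,k,i} - \mathbf{R}_k^\top \mathbf{R}_j \right\|_{\rho}, \tag{20}$$

where $s_1(z) = s(z)$ and $s_0(z) = 1 - s(z)$. Intuitively, the contribution of an error term in (20) is now weighted according to correctness of the corresponding M2C relative poses that define the error term.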
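The weighting idea can be illustrated for a single marker pair in a single image. The sketch below is an assumption-laden illustration rather than the authors' objective: it assumes each relaxed indicator is passed through a sigmoid, that the weight of an error term is the product of the two markers' soft selections, and that the four residuals are already given (their values here are made up).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_selection(z, bit):
    """Soft weight for choosing solution `bit` (0 or 1) given relaxed indicator z."""
    s = sigmoid(z)
    return s if bit == 1 else 1.0 - s

def edge_weights(z_j, z_k):
    """Weights of the four error terms of one covisible marker pair."""
    return {f"{b1}{b2}": soft_selection(z_j, b1) * soft_selection(z_k, b2)
            for b1 in (0, 1) for b2 in (0, 1)}

def smoothed_cost(z_j, z_k, residuals):
    """Weighted sum of the four residuals, keyed by bit string."""
    weights = edge_weights(z_j, z_k)
    return sum(weights[b] * residuals[b] for b in weights)

# Made-up residuals: the combination '10' agrees best with the other views.
residuals = {"00": 5.0, "01": 7.0, "10": 0.1, "11": 9.0}
undecided = smoothed_cost(0.0, 0.0, residuals)   # all four terms weighted 0.25
decided = smoothed_cost(20.0, -20.0, residuals)  # nearly all weight on '10'
```

Because the weights are smooth in the relaxed indicators, a gradient-based optimiser can move them towards the combination with small residuals instead of enumerating all binary assignments.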

Problem (20) can be solved using an iterative non-linear optimiser (e.g., fmincon in MATLAB). We initialise the absolute rotations via a minimum spanning tree on the multigraph, choosing the M2M relative rotations with the lower combined reprojection errors for chaining, and the relaxed indicator variables are set to reflect these choices. As we will show in Sec. VI, our method is not biased by such an initialisation, since it is capable of providing more accurate disambiguation than comparing reprojection errors alone.

IV-C Selecting the marker poses

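Let $\mathbf{z}^* = \{z^*_{j,i}\}$ be the optimised relaxed indicator variables from solving (20). For the same sequence used in Fig. 2(a), we plot in Fig. 2(b) the histogram of the ratios

$$\frac{\min\{ s(z^*_{j,i}),\, 1 - s(z^*_{j,i}) \}}{\max\{ s(z^*_{j,i}),\, 1 - s(z^*_{j,i}) \}} \tag{21}$$

for all $M_j \in \mathcal{D}_i$, $i = 1, \dots, N$, where $s$ is the sigmoid function (19). Similar to (2), the ratio (21) indicates how “disambiguable” the PPE poses are for each marker detection (smaller ratios are better), but now based on the value of $\mathbf{z}^*$. Although $\mathbf{z}^*$ is not discrete, the percentage of marker poses that are still ambiguous is now significantly reduced.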

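To conclusively select one PPE pose per detected marker, a simple solution would be to threshold each optimised relaxed indicator; however, we would like to avoid such a per-marker decision. To this end, for each image $I_i$ we construct the multigraph $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i)$, where $\mathcal{V}_i = \mathcal{D}_i$, and

$$\mathcal{E}_i = \left\{ \epsilon^{\mathbf{b}}_{j,k,i} \;\middle|\; M_j, M_k \in \mathcal{D}_i,\ j < k,\ \mathbf{b} \in \{00, 01, 10, 11\} \right\}. \tag{22}$$

Note that $\mathcal{G}_i$ is a submultigraph of $\mathcal{G}$, and there exist $2^{|\mathcal{D}_i|}$ consistent cliques in $\mathcal{G}_i$ (see Sec. III-A). Further, each edge $\epsilon^{b_1 b_2}_{j,k,i}$ in $\mathcal{G}_i$ has the weight

$$w^{b_1 b_2}_{j,k,i} = s_{b_1}(z^*_{j,i})\, s_{b_2}(z^*_{k,i}), \tag{23}$$

where $s_1(z) = s(z)$ is the sigmoid function (19) and $s_0(z) = 1 - s(z)$.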
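Given $\mathcal{G}_i$, define edge indicator variables

$$\mathbf{y}_i = \left\{ y^{\mathbf{b}}_{j,k} \in \{0, 1\} \;\middle|\; \epsilon^{\mathbf{b}}_{j,k,i} \in \mathcal{E}_i \right\} \tag{24}$$

and the maximum weighted clique (MWC) problem

$$\max_{\mathbf{y}_i} \sum_{\epsilon^{\mathbf{b}}_{j,k,i} \in \mathcal{E}_i} w^{\mathbf{b}}_{j,k,i}\, y^{\mathbf{b}}_{j,k} \quad \text{s.t. the edges with } y^{\mathbf{b}}_{j,k} = 1 \text{ form a consistent clique in } \mathcal{G}_i. \tag{25}$$

Basically, the aim of (25) is to find a consistent clique in $\mathcal{G}_i$ with the largest edge weights. Though MWC is intractable in general [tomita2003efficient], each instance is small, since the number of detected markers in an image is usually small.

We use the efficient clique solver of [eppstein2011listing] on each $\mathcal{G}_i$. The optimised edge indicators provide a consistent selection of the PPE poses for all markers detected in $I_i$. Specifically, for each $M_j$ detected in $I_i$, find an edge incident to $j$ whose indicator is nonzero, and select $\mathbf{R}^{(0)}_{j,i}$ if the bit associated with $M_j$ in that edge is 0, or $\mathbf{R}^{(1)}_{j,i}$ otherwise.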
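For intuition, the per-image selection step can be sketched with a brute-force search. This is not the clique solver of [eppstein2011listing]: it simply enumerates every 0/1 choice per marker, which is only viable because few markers are detected per image. The weight layout `weights[(j, k)][bits]` and the toy values are hypothetical.

```python
from itertools import combinations, product

def best_consistent_clique(marker_ids, weights):
    """Return (selection, total weight) of the heaviest consistent clique.

    weights[(j, k)][bits] is the weight of the edge between markers j and k
    (in marker_ids order) labelled with the two-bit string `bits`.
    """
    best_selection, best_total = None, float("-inf")
    for choice in product((0, 1), repeat=len(marker_ids)):
        selection = dict(zip(marker_ids, choice))
        # Sum the weights of the edges implied by this per-marker choice.
        total = sum(weights[(j, k)][f"{selection[j]}{selection[k]}"]
                    for j, k in combinations(marker_ids, 2))
        if total > best_total:
            best_selection, best_total = selection, total
    return best_selection, best_total

# Toy weights that favour solution 0 for markers 1 and 3, and solution 1 for marker 2.
marker_ids = [1, 2, 3]
favoured = {1: 0, 2: 1, 3: 0}
weights = {(j, k): {f"{a}{b}": (1.0 if (a, b) == (favoured[j], favoured[k]) else 0.1)
                    for a in (0, 1) for b in (0, 1)}
           for j, k in combinations(marker_ids, 2)}
selection, total = best_consistent_clique(marker_ids, weights)
```

The returned `selection` assigns exactly one PPE solution to every detected marker, so the decision is made jointly for the image rather than marker by marker.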

Algorithm 1 summarises the proposed method for marker pose disambiguation.

V Marker-based SfM pipeline

To carry out marker-based SfM using our marker pose disambiguation method, we largely follow the pipeline of the state-of-the-art MarkerMapper [munoz2018mapping]. Briefly, a robust pose graph optimisation is first invoked on the resolved M2C relative poses (7) from Algorithm 1 to yield absolute marker poses - in our case, the absolute rotation component is initialised using the output from solving (20). Then, each camera pose is initialised using single pose averaging from the M2C poses, before all marker and camera poses are refined simultaneously by bundle adjustment on the observed corners of all detected markers. We refer to [munoz2018mapping] for details of the SfM pipeline.

Algorithm 1 Method for marker pose disambiguation
Input: M2C relative poses (6) with PPE ambiguity.
1: Construct a multigraph from the input (Sec. III).
2: Solve (20) based on the multigraph (Sec. IV-B).
3: for each image do
4:      Solve the MWC problem (25) on the image's submultigraph (Sec. IV-C).
5:      Based on the optimised edge indicators, select one of two M2C poses for all markers in the image (Sec. IV-C).
6: end for
Output: One M2C relative pose per detected marker.

VI Results

To assess the efficacy of the proposed marker pose disambiguation technique, we compared the following methods:

  • Reprojection error (M1): For each marker detection, select the PPE solution with the lower reprojection error.

  • Strict ratio test (M2): A stricter (lower) threshold than the default is applied on the reprojection error ratio (2) (see Sec. I-A for details).

  • Default ratio test (M3): The threshold of 0.6 is applied on the reprojection error ratio (the default setting in [munoz2018mapping]).

  • Robust rotation averaging and post hoc clique consistency enforcement (M4): Solve (16) by IRLS [chatterjee2013efficient], then use the IRLS-optimised weights for the error terms as inputs to our M2C pose selection method in Sec. IV-C.

  • Proposed method (Ours): As described in Sec. IV.

When applying the above disambiguation methods to perform marker-based SfM, we simply used them to preprocess the input marker detections, then executed the rest of the pipeline of MarkerMapper [munoz2018mapping] (see Sec. V). All the experiments were conducted on a machine with a 3.5 GHz CPU and 8 GB of RAM.

VI-A Experiments on hybrid data

VI-A1 Data generation

We used the ScanNet dataset [dai2017scannet], which contains a number of sequences with ground truth 6DOF camera poses and depth. A test sequence was created from an original sequence by warping a number of ArUco markers [garrido2014automatic, romero2018speeded] based on known/ground truth M2C relative poses onto parts of the images that correspond to planar surfaces; see the supplementary video for a sample sequence. The ground truth marker absolute pose is obtained by composing the ground truth camera absolute pose with the ground truth M2C relative pose.

Seq  # markers  # frames | Precision (%)                       | # markers mapped      | # cameras localised
                         | M1     M2     M3     M4     Ours    | M1  M2  M3  M4  Ours  | M1   M2   M3   M4   Ours
B    3          31       | 94.32  100    92.31  31.82  100     | 3   0   3   3   3     | 31   0    31   31   31
H1   5          41       | 80.68  100    82.61  22.16  100     | 5   0   5   5   5     | 41   0    40   41   41
O1   7          51       | 77.08  96.97  78.8   14.58  96.52   | 7   7   7   7   7     | 51   41   51   51   51
O2   6          91       | 92.64  100    98.95  37.94  99.41   | 6   4   6   6   6     | 91   46   91   91   91
H2   14         151      | 93.42  98.94  97.89  48.16  100     | 14  13  14  14  14    | 151  101  151  151  151
TABLE I: Precision in pose disambiguation on hybrid data.
Average marker pose error (deg, cm)
Seq  M1           M2          M3           M4            Ours
B    5.4, 11.7    -, -        6.3, 15.0    19.0, 37.5    2.3, 2.2
H1   11.7, 13.0   -, -        12.5, 15.0   39.1, 26.3    3.3, 8.6
O1   26.2, 30.3   15.2, 8.0   25.4, 29.0   55.3, 120.9   3.5, 4.3
O2   8.7, 6.6     4.4, 4.2    4.1, 2.6     28.0, 63.2    4.2, 2.4
H2   4.3, 5.1     7.7, 3.1    5.4, 5.5     20.3, 14.2    3.6, 4.9

Average camera pose error (deg, cm)
Seq  M1           M2          M3           M4            Ours
B    7.0, 15.9    -, -        11.9, 19.5   32.0, 10.0    0.8, 2.0
H1   14.8, 27.5   -, -        17.6, 41.6   37.9, 28.8    5.0, 3.2
O1   17.3, 69.8   7.6, 16.0   19.2, 69.4   85.8, 49.7    5.7, 13.7
O2   6.2, 10.5    0.8, 2.4    17.4, 4.0    41.6, 40.1    1.3, 3.4
H2   4.3, 3.8     2.2, 2.3    3.3, 3.1     32.0, 10.0    3.4, 2.4
TABLE II: SfM accuracy for different pose disambiguation methods on hybrid data. ‘-’ denotes failed reconstruction.

VI-A2 Marker detection

Using the steps above, we generated five testing sequences from Bedroom(B), Hotel1(H1), Hotel2(H2), Office1(O1) and Office2(O2). We used [garrido2014automatic] to detect, identify and localise the corners of each marker in each frame; see Table I for the number of frames and unique detected markers in each sequence. Though the markers were synthetically warped into the images, our analysis suggests that corner localisation suffered from errors of 1–7 pixels.

TABLE III: Qualitative result: Reconstruction results for marker-based SfM methods M1, M3, M4, and Ours, as well as feature- and marker-based SfM method FM [degol2018improved] (columns, left to right: M1, M3, M4, Ours, FM). Row 1: ece floor4 wall, Row 2: ece floor5 stairs, Row 3: cee night cw. For the marker-based methods, red = reconstructed reference marker, blue = reconstructed markers, green = estimated camera positions.
Fig. 4: Comparison of camera position error (relative to FM) of M1, M3, M4 and Ours.
Dataset             | Mean err. (m)                 | Median err. (m)
                    | M1     M3     M4     Ours     | M1     M3     M4     Ours
ece floor4 wall     | 5.28   2.72   20.95  2.56     | 5.35   2.03   18.09  2.12
ece floor5 stairs   | 1.58   3.18   4.07   1.14     | 0.96   2.64   3.72   0.82
cee night cw        | 30.21  34.79  75.57  19.06    | 19.25  24.21  76.42  10.12
TABLE IV: Mean and median camera position error, relative to FM.

VI-A3 Ground truth M2C pose selection

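On the noisy corner localisations, PPE [collins2014infinitesimal] is invoked, which yields two M2C relative poses for each detected marker. To decide the ground truth selection, we compute the angular difference between the rotation $\mathbf{R}$ of each PPE pose and the ground truth M2C relative rotation $\mathbf{R}^{gt}$ as

$$\theta(\mathbf{R}) = \arccos\left( \frac{\operatorname{tr}\left( (\mathbf{R}^{gt})^\top \mathbf{R} \right) - 1}{2} \right). \tag{26}$$

The ground truth selection of the PPE poses is taken as the one with the lower angular difference $\theta$.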
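A minimal sketch of this ground truth selection is given below. It assumes the angular difference is the geodesic angle between two rotation matrices, computed from the trace of their relative rotation; the function names and test angles are illustrative.

```python
import math

def rot_x(deg):
    """Rotation about the x-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def angular_difference_deg(R_a, R_b):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of elementwise products of R_a and R_b.
    trace = sum(R_a[r][c] * R_b[r][c] for r in range(3) for c in range(3))
    # Clamp to [-1, 1] to guard against rounding before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def ground_truth_selection(R_gt, R_0, R_1):
    """Index (0 or 1) of the PPE rotation closer to the ground truth rotation."""
    d0 = angular_difference_deg(R_gt, R_0)
    d1 = angular_difference_deg(R_gt, R_1)
    return 0 if d0 <= d1 else 1

# Ground truth tilt of 18 deg; PPE returns tilts of +20 and -20 deg.
label = ground_truth_selection(rot_x(18), rot_x(20), rot_x(-20))
```

Here the first candidate is 2 degrees from the ground truth and the second is 38 degrees away, so the first is taken as the correct label for that detection.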

VI-A4 Results

For the hybrid data experiment, we evaluated all the approaches on two main aspects; see the supplementary video for a demonstration of our pose disambiguation method.

Precision in pose disambiguation

For each testing sequence, precision in pose disambiguation is defined as

$$\text{precision} = \frac{\#\ \text{marker detections with the correct PPE pose selected}}{\#\ \text{marker detections with a PPE pose selected}} \times 100\%.$$

Table I shows that Ours generally has higher precision than the others. The fact that M4 (the control method) is much poorer than Ours indicates that enforcing the proposed clique consistency is crucial for disambiguating the PPE poses. Amongst the per-marker disambiguation methods (M1–M3), M1 has the lowest precision in most sequences, validating observations in previous works that comparing reprojection errors alone is not foolproof. Adding a ratio test to avoid decisions on cases that are too ambiguous helps to improve precision in M2 and M3. In particular, the precision of M2 is on par with Ours. However, as we show next, this gain by M2 comes at a cost.

Completeness and accuracy of SfM

To assess the effects of marker pose disambiguation on SfM, we evaluate

  • the number of markers mapped and cameras localised; and

  • the error (in deg and cm) of the marker and camera poses

estimated by marker-based SfM from the disambiguated PPE poses, reported in Tables I and II respectively. Although M2 is precise, it yields a much sparser map than the others; moreover, as it has pruned away many useful detections, there are insufficient data to allow accurate SfM. Using our pose disambiguation technique leads to more complete and accurate maps.

VI-B Real world dataset experiment

Testing was performed on sequences from [degol2018improved]. We selected 3 indoor scenes with different difficulty levels: ece floor4 wall, ece floor5 stairs and cee night cw. Unique markers are placed in the scene in each sequence. To enable comparisons, we invoked [degol2018improved] (denoted as FM), which conducts both feature- and marker-based SfM on the sequences. Since SfM with M2 failed in all 3 sequences due to insufficient data for optimisation, it is excluded from the comparison.

Qualitative results in Table III show that Ours is more accurate than M1 and M3 in marker-based SfM - of course, Ours is visibly not as complete as FM, but the latter uses features on top of markers, which entails heavier computations. Using the estimated camera positions by FM as reference, we obtain the position errors (in m) computed by the marker-based SfM methods - normalised and plotted as a cumulative density in Fig. 4. It is apparent that Ours is much more accurate in camera localisation, especially in the most challenging sequence cee night cw. Table IV lists the mean and median position error, relative to FM.

