Robot-to-Robot Relative Pose Estimation using Humans as Markers

Md Jahidul Islam, Jiawei Mo, and Junaed Sattar. This paper is in review at the Journal of Field Robotics (JFR). The authors are with the Interactive Robotics and Vision Laboratory, Department of Computer Science and Engineering, University of Minnesota Twin Cities, US.

In this paper, we propose a method to determine the 3D relative pose of pairs of communicating robots by using human pose-based key-points as correspondences. We adopt a ‘leader-follower’ framework, where at first, the leader robot visually detects and triangulates the key-points using the state-of-the-art pose detector named OpenPose. Afterward, the follower robots match the corresponding 2D projections on their respective calibrated cameras and find their relative poses by solving the perspective-n-point (PnP) problem. In the proposed method, we design an efficient person re-identification technique for associating the mutually visible humans in the scene. Additionally, we present an iterative optimization algorithm to refine the associated key-points based on their local structural properties in the image space. We demonstrate that these refinement processes are essential to establish accurate key-point correspondences across viewpoints. Furthermore, we evaluate the performance of the proposed relative pose estimation method through several experiments conducted in terrestrial and underwater environments. Finally, we discuss the relevant operational challenges of this approach and analyze its feasibility for multi-robot cooperative systems in human-dominated social settings and feature-deprived environments such as underwater.

I Introduction

Accurate computation of relative pose is essential in multi-robot estimation problems such as cooperative tracking, localization [27], planning, mapping [30], and more. Unless global positioning information (e.g., GPS) is available, the robots need to estimate their positions and orientations relative to each other based on their exteroceptive sensory measurements and noisy odometry [40]. This process is necessary for registering their measurements to a common frame of reference in order to maintain coordination. Therefore, robust estimation of robot-to-robot relative pose is crucial for deploying a team of robots in GPS-denied environments.

In a cooperative setting, robots with visual sensing capabilities solve the relative pose estimation problem by triangulating mutually visible local features and landmarks. A lack of salient features significantly affects the accuracy of this estimation [34], which eventually hampers the overall success of the operation. Such difficulties often arise in poor visibility conditions underwater due to a lower number of salient features and natural landmarks [6, 29]. Nevertheless, close proximity of human divers to robots is a fairly common occurrence in underwater applications [11]. In addition, humans are frequently present and clearly visible in many social scenarios [12, 15] where natural landmarks are not reliably identifiable due to repeated textures, noisy visual conditions, etc. Hence, the problem of having limited natural landmarks can be alleviated by using mutually visible humans as markers (i.e., feature correspondences), particularly in human-robot collaborative applications. Despite the potential, the feasibility of using human presence or body-pose for robot-to-robot relative pose estimation has not been explored in the literature.

Fig. 1: A simplified illustration of 3D relative pose estimation between robot 1 and robot 2 (and robot 3). The robots know the transformations between their intrinsically calibrated cameras and respective global frames, i.e., {1}, {2}, and {3}. Robot 1 is considered the leader (equipped with a stereo camera) and its pose in global coordinates is known. Robot 2 (and robot 3) finds its unknown global pose by cooperatively localizing itself relative to robot 1 using the human pose-based key-points as common landmarks.

In this paper, we propose a method for computing the six degrees-of-freedom (6-DOF) robot-to-robot transformation between pairs of communicating robots by using mutually detected humans’ pose-based key-points as correspondences. As illustrated in Figure 1, we adopt a leader-follower framework where one of the robots (equipped with a stereo camera) is assigned as a leader. First, the leader robot detects and triangulates 3D positions of the key-points in its own frame of reference. Then the follower robot matches the corresponding 2D projections on its intrinsically calibrated camera and localizes itself by solving the perspective-n-point (PnP) problem [39]. It is to be noted that this entire process of extrinsic calibration is automatic and does not require prior knowledge about the robots’ initial positions. Additionally, it is straightforward to extend the leader-follower framework for multi-robot teams from the pairwise solutions. Furthermore, if the leader robot has global positioning information, i.e., has a GPS or an ultra-short baseline (USBL) receiver, the follower robots can use that information to localize themselves in the global frame as well.

In addition to presenting the conceptual model and implementation details, we provide efficient solutions to the practicalities involved in the proposed robot-to-robot pose estimation method. As mentioned, we use OpenPose [5] for detecting human body-poses in the image space. Although it provides reliable detection performance, the extracted 2D key-points across different views do not necessarily associate as a correspondence. We propose a twofold solution to this; first, we design an efficient person re-identification technique by evaluating the hierarchical similarities of the key-point regions in the image space. We also demonstrate that the state-of-the-art appearance-based person re-identification models [1, 17] fail to provide acceptable performance under real-time constraints. Subsequently, we formulate an iterative optimization algorithm that refines the preliminarily associated noisy key-points by further exploiting their local structural properties in respective images. This two-stage process facilitates efficient and robust key-point correspondences across viewpoints for accurate robot-to-robot relative pose estimation. Furthermore, we evaluate the proposed estimation method over a number of terrestrial and underwater experiments; the results validate its effectiveness in real-world applications. Lastly, we analyze its feasibility in various multi-robot cooperative systems and discuss relevant operational considerations.

II Related Work

The following sections present a brief discussion on the existing literature that is relevant to our problem of interest.

II-A Robot-to-robot Relative Pose Estimation

The problem of robot-to-robot relative pose estimation has been thoroughly studied for 2D planar robots, particularly using range and bearing sensors. Analytic solutions for determining the 3-DOF robot-to-robot transformation using mutual distance and/or bearing measurements involve solving an over-determined system of nonlinear equations [40, 32]. Similar solutions for the 3D case, i.e., for determining the 6-DOF transformation using inter-robot distance and/or bearing measurements, have been proposed as well [41, 33]. In practice, these analytic solutions are used as an initial estimate for the relative pose, which is then iteratively refined using optimization techniques (e.g., nonlinear weighted least-squares) in order to account for noisy observations and uncertainty in the robots’ motion.

Robots that rely on visual perception (i.e., use cameras as exteroceptive sensors) solve the relative pose estimation problem by triangulating mutually visible features and landmarks [35]. The task thus reduces to solving the PnP problem using sets of 2D-3D correspondences between geometric features and their projections on respective image planes [39]. Although high-level geometric features (e.g., lines, conics, etc.) have been proposed, point features are typically used in practice for relative pose estimation [13]. Moreover, the PnP problem is solved either using iterative approaches that formulate the over-constrained system (n > 3) as a nonlinear least-squares problem, or by using sets of three non-collinear points (n = 3) in combination with Random Sample Consensus (RANSAC) to remove outliers [8]. Additionally, vision-based approaches often use temporal-filtering methods, the extended Kalman-filter (EKF) in particular, to reduce the effect of noisy measurements and provide near-optimal pose estimates [35, 13]. On the other hand, it is also common to simplify the relative pose estimation by attaching specially designed calibration-patterns on each robot [28]. However, this requires that the robots operate at a sufficiently close range and remain mutually visible.

II-B Human Body-Pose Detection

Visual detection of 2D human pose has made significant progress over the last decade. The state-of-the-art methodologies can be categorized into the top-down and bottom-up approaches. The top-down approaches [9, 25] detect the humans in the image space first, and then perform localization and association of their body-parts. One major limitation of these approaches is that their running times are proportional to the number of persons in the image. Additionally, the robustness of the pose estimation largely depends on the accuracy of their person detectors. In contrast, the bottom-up approaches [5, 24] do not suffer from these two issues. However, they require solving a more computationally challenging inference problem of learning global contextual cues for simultaneous detection and association of the body-parts.

The classical approaches [7, 3] use pictorial structures to model the appearance of human body-parts. A set of densely sampled shape descriptors are used for localizing the body-parts, and then classifiers such as AdaBoost, SVMs, etc., are used for detection. Associating the detected body-parts is rather challenging; a mixture of tree-based models is typically used to learn separate pairwise relationships for different body-part configurations [14]. Graph-based connectivity models are then used to formulate the inference (association) as a graph-cut problem. These pairwise connectivity models can be further generalized [23] to capture the anatomical relationships among multiple body-parts. Recently proposed approaches use Deep Neural Networks (DNNs) to learn human pose detection from large training datasets in order to perform fast and accurate global inference. DeepPose [31], for instance, formulates the problem as a regression problem and uses a cascade of DNNs to learn the inference in a holistic fashion. On the other hand, OpenPose [5] jointly learns to detect and associate using pose machines [26]. In contrast to DNNs, each module of a pose machine is trained locally; the sequential predictions of these modules are then refined to perform a hierarchical joint inference. Such hierarchical structures facilitate fast inference for multi-person pose estimation in addition to achieving state-of-the-art detection performance. Due to these compelling reasons, we use OpenPose in this work.

II-C Human-aware Robot Control

Human-awareness is essential for autonomous mobile robots operating in social settings and human-robot collaborative applications. A large body of literature and systems exist [11, 19, 12] which focus on the areas of understanding human motion, instructions, behaviors, etc. Additionally, tracking human pose relative to the robot is particularly common in applications such as person tracking [20], following [12], collaborative manipulation [18], behavior imitation [16], etc. However, the feasibility of using humans’ presence or their body-poses as markers for robot-to-robot relative pose estimation has not been explored in the literature, which we attempt to address in this work.

III Methodology

The proposed robot-to-robot relative pose estimation method incorporates a number of computational components: detection of human body-poses in images captured from different views (by the leader and follower robots), pair-wise association of the detected humans across multiple images, geometric refinement of the key-point correspondences, and 3D pose estimation of the follower robot (relative to the leader) using 2D-3D correspondences. We elaborate on each of these components in the following sections.

III-A Human Body-Pose Detection

OpenPose [5] is an open-source library for real-time multi-human 2D pose detection in images, originally developed using Caffe and OpenCV. We use a TensorFlow implementation based on the MobileNet model, which provides faster inference than the original model (also known as the CMU model). Specifically, it processes each frame in a fraction of a second on the embedded computing board NVIDIA Jetson TX2 [21], whereas the original model takes multiple seconds per frame.

OpenPose generates annotated 2D key-points pertaining to the nose, neck, shoulders, elbows, wrists, hips, knees, ankles, eyes, and ears of a human body. As shown in Figure 2, a subset of these 18 key-points and their pair-wise anatomical relationships are generated for each human. We represent the key-points generated by OpenPose in an image by an N x 18 array of (x, y) coordinate pairs, where N is the number of detected humans in the image; that is, each row contains the ordered key-points for a particular person.

If a particular key-point is not detected, its values are left as (-1, -1). Additionally, we configure the array so that the first row belongs to the left-most person, the second row belongs to the next left-most person, and so on until the last row, which belongs to the right-most person in the image. This is achieved by sorting the rows based on the average of their non-negative x-coordinates. Formatting the key-points this way helps speed up the process of associating correspondences between the leader’s and the follower’s key-point arrays.
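A minimal sketch of this left-to-right row ordering, assuming the array layout described above (undetected key-points stored as (-1, -1); the function name and the toy dimensions are illustrative):

```python
import numpy as np

def sort_rows_left_to_right(kp):
    """Sort an (N x K x 2) key-point array so that row 0 is the
    left-most person. Undetected key-points, stored as (-1, -1),
    are ignored when averaging the x-coordinates."""
    xs = kp[:, :, 0].astype(float)
    xs[xs < 0] = np.nan                       # mask undetected key-points
    order = np.argsort(np.nanmean(xs, axis=1))
    return kp[order]

# Two people, three key-points each; the second row stands farther left.
kp = np.array([[[50, 10], [60, 20], [-1, -1]],
               [[10, 10], [20, 20], [30, 30]]])
sorted_kp = sort_rows_left_to_right(kp)
```

Sorting once per frame keeps the later association step close to a row-by-row comparison rather than a full bipartite matching.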

In order to establish key-point correspondences between mutually visible humans in the scene, the rows of the leader’s and the follower’s key-point arrays need to be associated. That is, the follower robot needs to make sure that it is pairing the key-points of the same individuals. This is important because, in practice, the robots might be looking at different individuals, or at the same individuals in a different spatial order. Associating multiple persons across different images is a well-studied problem known as person re-identification.

Fig. 2: Multi-human 2D body-pose detection using OpenPose in various human-robot collaborative settings.
Fig. 3: An illustration of how the hierarchical body-parts are extracted for person re-identification based on their structural similarities; once the persons are associated, the pair-wise key-points are refined and used as correspondences.

III-B Person Re-identification using Hierarchical Similarities

Although a number of existing deep visual models provide good solutions for person re-identification [1, 17], we design a simple and efficient model to meet the real-time on-board computational constraints. The idea is to avoid using a computationally demanding feature extractor by making use of the hierarchical anatomical structures that are already embedded in the key-points. First, we bundle the subsets of key-points in several spatial bounding-boxes (Bbox) as follows:

  • Face Bbox: nose, eyes, and ears

  • Upper-body Bbox: neck, shoulders, and hips

  • Lower-body Bbox: hips, knees, and ankles

  • Left-arm Bbox: left shoulder, elbow, and wrist

  • Right-arm Bbox: right shoulder, elbow, and wrist

  • Full-body Bbox: encloses all the key-points

Figure 3 illustrates the spatial hierarchy of these bounding boxes and their corresponding key-points. They are extracted by spanning the corresponding key-points’ coordinate values in both the x and y dimensions. We add an offset in each dimension in order to capture more spatial information around the key-points. A Bbox is discarded if its area is below an empirically chosen threshold; this happens when the corresponding body-part is not detected or its resolution is too small to be informative.
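The hierarchical Bbox extraction above can be sketched as follows; the key-point indices assume OpenPose's 18-point COCO ordering, and the offset and minimum-area values are illustrative placeholders rather than the paper's settings:

```python
import numpy as np

# Indices into the 18 OpenPose (COCO) key-points, grouped per the
# hierarchy described above.
BODY_PARTS = {
    "face":       [0, 14, 15, 16, 17],     # nose, eyes, ears
    "upper_body": [1, 2, 5, 8, 11],        # neck, shoulders, hips
    "lower_body": [8, 11, 9, 12, 10, 13],  # hips, knees, ankles
    "left_arm":   [5, 6, 7],               # left shoulder, elbow, wrist
    "right_arm":  [2, 3, 4],               # right shoulder, elbow, wrist
}

def extract_bboxes(kp_row, offset=10, min_area=100):
    """Span each key-point subset into an axis-aligned Bbox, pad it by
    `offset` pixels per side, and discard boxes below `min_area`.
    The offset and threshold values are illustrative."""
    boxes = {}
    for name, idx in BODY_PARTS.items():
        pts = kp_row[idx]
        pts = pts[(pts >= 0).all(axis=1)]  # drop undetected key-points
        if len(pts) == 0:
            continue
        x0, y0 = pts.min(axis=0) - offset
        x1, y1 = pts.max(axis=0) + offset
        if (x1 - x0) * (y1 - y0) >= min_area:
            boxes[name] = (x0, y0, x1, y1)
    # the full-body Bbox encloses all detected key-points
    valid = kp_row[(kp_row >= 0).all(axis=1)]
    if len(valid):
        boxes["full_body"] = (*(valid.min(axis=0) - offset),
                              *(valid.max(axis=0) + offset))
    return boxes
```

Each returned box can then be cropped from the image to obtain the body-part patches compared in the next step.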

Once these bounding boxes are extracted, we use the structural properties of their respective image-patches as features for person re-identification; specifically, we compare the structural similarities [36] between image patches pertaining to the face, upper-body, lower-body, left-arm, right-arm, and the full body of a person. Based on these aggregated similarities, we evaluate the pair-wise association between each person seen by the leader and each person seen by the follower.

The structural similarity [36] for a particular pair of single-channel rectangular image-patches (x, y) is evaluated based on three properties: luminance l(x, y), contrast c(x, y), and structure s(x, y). The standard way of computing these is:

l(x, y) = \frac{2\mu_x\mu_y}{\mu_x^2 + \mu_y^2}, \quad c(x, y) = \frac{2\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}, \quad s(x, y) = \frac{\sigma_{xy}}{\sigma_x\sigma_y}.

Here, \mu_x (\mu_y) denotes the mean of image patch x (y), \sigma_x^2 (\sigma_y^2) denotes the variance of x (y), and \sigma_{xy} denotes the cross-correlation between x and y. The structural similarity metric (SSIM) is then defined as SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y). In order to ensure numeric stability, two constants C_1 and C_2 are added as:

SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}. \quad (1)

We use small stabilizing constants C_1 and C_2 and a square sliding window in our implementation. Additionally, we resize the patches extracted from one image so that their corresponding pairs in the other image have the same dimensions. Then, we apply Equation 1 on every channel (RGB) and use the average value as the similarity metric. Specifically, we use this metric for person re-identification as follows:

  • We only consider the mutually visible body-parts for evaluating pair-wise SSIM values and take their average. This choice is important to enforce meaningful comparisons; otherwise, it is equivalent to using only the full-body Bbox, which we found to be highly inaccurate.

  • Each person in the leader’s view is associated with the most similar person (i.e., the one corresponding to the maximum SSIM value) in the follower’s view. However, the association is discarded if the maximum SSIM value is less than a predefined threshold (chosen empirically in our implementation). This reduces the risk of inaccurate associations, particularly when there are mutually exclusive people in the scene.
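Putting these steps together, a simplified sketch of the SSIM-based association follows; it uses a single global window instead of the paper's sliding window, and the threshold value is illustrative:

```python
import numpy as np

def ssim(x, y, c1=6.5025, c2=58.5225):
    """Global (single-window) SSIM between two equal-size gray patches;
    c1, c2 are the usual stabilizing constants for 8-bit intensities."""
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

def associate(leader_patches, follower_patches, threshold=0.4):
    """Match each leader person (a dict of body-part patches) to the
    follower person with the highest average SSIM over mutually visible
    parts; discard matches below `threshold` (illustrative value)."""
    matches = {}
    for i, lp in enumerate(leader_patches):
        best, best_score = None, threshold
        for j, fp in enumerate(follower_patches):
            common = set(lp) & set(fp)   # mutually visible body-parts only
            if not common:
                continue
            score = np.mean([ssim(lp[k], fp[k]) for k in common])
            if score > best_score:
                best, best_score = j, score
        if best is not None:
            matches[i] = best
    return matches
```

Restricting the average to mutually visible parts mirrors the first bullet above: comparing a visible arm against a missing one would only dilute the score.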

III-C Key-point Refinement

Once the specific persons are identified, i.e., the rows of the leader’s and follower’s key-point arrays are associated, the mutually visible key-points are paired together to form correspondences. Although the key-points are ordered and OpenPose localizes them reasonably well, they cannot be readily used as geometric correspondences due to perspective distortions and noise. We attempt to solve this problem by designing an iterative optimization algorithm that refines the noisy correspondences based on their structural properties in a local neighborhood. By denoting \Pi_I(p) as the image-patch centered at pixel p in image I, we define a loss function for each correspondence (p_l, p_f) between the leader’s image I_l and the follower’s image I_f as:

\mathcal{L}(p_l, p_f) = 1 - SSIM(\Pi_{I_l}(p_l), \Pi_{I_f}(p_f)). \quad (2)

Then, we refine each initial key-point correspondence by minimizing the following function:

p_f^* = \arg\min_{p_f} \mathcal{L}(p_l, p_f). \quad (3)

As Equation 3 suggests, we fix p_l and refine p_f to maximize the SSIM between the two patches. We use a small square refinement region around each initial key-point in our implementation. In addition, we use a gradient-based refinement; specifically, we perform the following iterative update:

p_f^{(t+1)} = p_f^{(t)} - \eta \, \nabla_{p_f} \mathcal{L}(p_l, p_f^{(t)}). \quad (4)

Similar formulations using an SSIM-based loss function for optimization are fairly standard in the literature [22, 4]. We follow the procedure suggested in [4] for computing the gradient of SSIM. In practice, we stack all the key-points and their gradients (separately) to perform the optimization simultaneously. Additionally, we use a fixed learning rate \eta and a maximum number of iterations in our implementation.
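A minimal sketch of this refinement objective follows; for clarity it substitutes an exhaustive search over a small refinement region for the paper's gradient-based SSIM updates, and the patch and search sizes are illustrative:

```python
import numpy as np

def ssim(x, y, c1=6.5025, c2=58.5225):
    # global (single-window) SSIM between two equal-size gray patches
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

def patch(img, p, r=4):
    # (2r+1) x (2r+1) patch centered at pixel p = (x, y)
    x, y = p
    return img[y - r:y + r + 1, x - r:x + r + 1].astype(float)

def refine_keypoint(img_l, p_l, img_f, p_f, search=3, r=4):
    """Fix the leader key-point p_l and move the follower key-point p_f
    within a (2*search+1)^2 region to maximize patch SSIM, i.e., to
    minimize the loss in Equation 2 by brute force."""
    target = patch(img_l, p_l, r)
    best, best_score = p_f, -1.0
    for dx in range(-search, search + 1):
        for dy in range(-search, search + 1):
            cand = (p_f[0] + dx, p_f[1] + dy)
            score = ssim(target, patch(img_f, cand, r))
            if score > best_score:
                best, best_score = cand, score
    return best
```

The gradient update of Equation 4 replaces this brute-force scan with a few first-order steps, which scales better when all key-points are refined simultaneously.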

(a) A group of people seen from multiple views and their 2D body-poses (detected by OpenPose).
(b) Person association and pose-based key-point correspondences for a particular image pair; a unique identifier is assigned to each association, matched key-points are shown in green lines for the right-most person.
(c) The reconstructed 3D key-points of the humans’ structure and the estimated camera poses (up to scale).
Fig. 4: Results of estimating structure from motion using only human pose-based key-points as features.
(a) Three humans seen from two different views.
(b) Key-point correspondences and epipolar lines.
(c) The reconstructed 3D key-points and the estimated camera poses (up to scale).
(d) Reduction of the average re-projection error by the iterative key-point refinement process.
(e) Re-projected points (red crosses) on the left image are shown; the blue circles represent true locations.
Fig. 5: Structure from motion for a two-view case using only human pose-based key-points as features.

III-D Robot-to-robot Pose Estimation

Once the mutually visible key-points are associated and refined, the follower robot uses their corresponding 3D positions (provided by the leader) to estimate its relative pose by solving a PnP problem. Thus, we require that the leader robot is equipped with a stereo camera (or an RGBD camera) so that it can triangulate the refined key-points using epipolar constraints (or use the depth sensor) to represent the key-points in 3D. Let X_i denote the 3D locations of the key-points in the leader’s coordinate frame, and z_i denote their corresponding 2D projections on the follower’s camera. Then, assuming the cameras are synchronized, the PnP problem is formulated as follows:

\{R^*, t^*\} = \arg\min_{R, t} \sum_i \left\| z_i - \pi\!\left( K (R X_i + t) \right) \right\|^2,

where \pi(\cdot) denotes the perspective projection. Here, K is the intrinsic matrix of the follower’s camera and [R | t] is its 6-DOF transformation relative to the leader. In our implementation, we follow the standard iterative solution for PnP using RANSAC [39].
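For illustration, a self-contained linear PnP solver via the Direct Linear Transform (DLT) is sketched below; the paper's implementation instead uses the standard iterative PnP with RANSAC, but this outlier-free linear version shows the underlying 2D-3D geometry:

```python
import numpy as np

def pnp_dlt(X, x, K):
    """Linear PnP: estimate [R | t] of the follower camera from n >= 6
    noise-free 3D-2D correspondences. X is a list of 3D points in the
    leader's frame, x the matching pixel coordinates, K the intrinsics."""
    A = []
    for Xw, u in zip(X, x):
        Xh = np.append(Xw, 1.0)
        # two DLT rows per correspondence: u * (P3 . Xh) = P1 . Xh, etc.
        A.append(np.concatenate([Xh, np.zeros(4), -u[0] * Xh]))
        A.append(np.concatenate([np.zeros(4), Xh, -u[1] * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)          # projection matrix, up to scale
    M = np.linalg.inv(K) @ P          # proportional to [R | t]
    scale = np.cbrt(np.linalg.det(M[:, :3]))
    M /= scale                        # fix scale and sign
    U, _, Vt2 = np.linalg.svd(M[:, :3])
    R = U @ Vt2                       # project onto SO(3)
    t = M[:, 3]
    return R, t
```

In practice this linear estimate is what an iterative PnP solver would refine, with RANSAC rejecting key-point correspondences that survived association but are still geometrically inconsistent.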

IV Experimental Analysis

We conduct several experiments with 3-DOF and 6-DOF robots to evaluate the applicability and performance of the relative pose estimation method. We present the experiments, analyze the results, and discuss various operational considerations in the following sections.

IV-A Structure from Motion using Human Pose

We perform experiments to validate that the human pose-based key-points can be used as reliable correspondences for relative pose estimation. As illustrated in Figure 4(a), we emulate an experimental set-up for structure from motion with humans; we use an intrinsically calibrated monocular camera to capture a group of nine (static) people from multiple views. Here, the goal is to estimate the camera poses and reconstruct the 3D structures of the humans using only their body-poses as features.

First, we use OpenPose to detect the human pose-based 2D key-points in the images (Figure 4(a)). Then, we apply the proposed person re-identification and key-point refinement processes to obtain the feature correspondences across multiple views (Figure 4(b)). Subsequently, we follow the standard procedures for structure from motion [10]: fundamental matrix computation using the 8-point algorithm with RANSAC, essential matrix computation, camera pose estimation by enforcing the cheirality constraint, and linear triangulation. Finally, the triangulated 3D points and camera poses are refined using bundle adjustment. As demonstrated in Figure 4(c), the spatial structure of the reconstructed points on the human bodies and the camera poses are consistent with our setup. Results of another experiment for a two-view case are demonstrated in Figure 5.
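Among the pipeline steps above, the linear triangulation step can be sketched compactly; assuming 3x4 projection matrices P1 and P2 recovered from the essential matrix decomposition, each 3D point is the null vector of a small stacked DLT system:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point observed in two views.
    x1, x2 are pixel coordinates; P1, P2 are 3x4 projection matrices."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # homogeneous 3D point (null vector of A)
    return X[:3] / X[3]
```

With noisy key-points, the triangulated points and camera poses are then jointly polished by bundle adjustment, as described above.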

(a) Quantitative performance compared to using SIFT-based features.
(b) Inaccurate 3D reconstruction using raw key-point correspondences (without refinement).
Fig. 6: Necessity of the proposed key-point refinement process (described in Section III-C); results correspond to the experiment illustrated in Figure 4.
TABLE I: A quantitative performance comparison for various person ReId models on standard datasets; a set of test images from each dataset is used for comparison.

| Person ReId models | Market-1501 Rank-1 Acc. (%) | Market-1501 mAP (%) | CUHK-03 Rank-1 Acc. (%) | CUHK-03 mAP (%) | FPS (Jetson TX2) |
| Aligned ReId | | | | | |
| Deep person ReId | | | | | |
| Tripled-loss ReId | | | | | |
| Proposed person ReId | | | | | 7.45 |

TABLE II: Effectiveness of the proposed person ReId method on real-world data; each set contains images of multiple humans in ground and underwater scenarios.

| Person ReId models | Set A (1-2 humans per image) Rank-1 Acc. (%) | Set A FPS | Set B (3-5 humans per image) Rank-1 Acc. (%) | Set B FPS |
| Aligned ReId | | | | |
| Deep person ReId | | | | |
| Tripled-loss ReId | | | | |
| Proposed person ReId | 76.55 | 6.81 | 71.56 | 5.45 |

IV-B Impact of the Person Re-Identification and Key-point Refinement Processes

It is easy to notice that person re-identification (ReId) is essential for associating mutually visible persons across different views. As mentioned in Section III-B, we focus on achieving fast association by making use of the local structural properties around the key-points in the image space instead of using an additional feature extractor. The state-of-the-art person ReId approaches adopt deep visual feature extractors that are computationally demanding. In Table I, we quantitatively evaluate the state-of-the-art models named Aligned ReId [37], Deep person ReId [17], and Tripled-loss ReId [38] based on Rank-1 accuracy and mean average precision (mAP) on two standard datasets. Specifically, a set of test images from the Market-1501 and CUHK-03 datasets is used for the evaluation; the respective run-times on an NVIDIA Jetson TX2 are also shown for comparison. The results indicate that although these models (once trained on similar data) perform well on standard datasets, they are computationally too expensive for a real-time embedded platform. Moreover, as demonstrated in Table II, these off-the-shelf models do not perform well on high-resolution real-world images. Although their performance can be improved by training on more comprehensive real-world data, the computational complexity remains a barrier. To this end, the proposed person ReId method provides a significantly faster run-time and better portability (i.e., it does not require rigorous training).

On the other hand, Figure 6(a) demonstrates the necessity and effectiveness of the proposed key-point refinement process (see Section III-C). It shows the average re-projection errors before and after the refinement compared to the ground truth (i.e., SIFT feature-based 3D reconstruction). The refinement achieves a low average re-projection error, which is acceptable considering that there are roughly ten times fewer anatomical key-points than SIFT-based key-points. More importantly, as shown in Figure 6(b), the 3D reconstruction and camera pose estimation without the refinement process are inaccurate; this indicates that the raw key-point correspondences are invalid in a perspective-geometric sense. We provide an illustration pertaining to the iterative refinement of noisy geometric correspondences in Figure 5.

(a) 3-DOF ground experiment: one leader and one follower robot; the follower robot’s trajectory is shown by red arrows.
(b) 6-DOF underwater experiment: one leader and two follower robots (aerial view).
Fig. 7: The experimental setups for using multiple robots and cameras to capture the humans’ body-poses from different perspectives.

IV-C Field Experiments

We also perform several field experiments for 2D and 3D robot-to-robot relative pose estimation. Specifically, we use 3-DOF planar robots for ground experiments and 6-DOF robots for underwater experiments; typical setups are illustrated in Figure 7. First, we capture multiple human body-poses from the leader and follower robots’ perspectives. Then we use the proposed estimation method to find the follower-to-leader relative pose and analyze the results.

IV-C1 3-DOF Pose Estimation

In the particular scenario shown in Figure 7(a), we use two planar robots (one leader and one follower) and two mutually visible humans in the scene. The robot with an AR-tag on its back is used as the follower robot while the other robot is used as the leader. The AR-tag is used to obtain the follower’s ground-truth relative pose for comparison. The leader robot is equipped with an RGBD camera; it communicates with the follower and shares the 3D locations of the mutually visible key-points. Specifically, it detects the human pose-based 2D key-points and associates the corresponding depth information to represent them in 3D. Subsequently, the follower robot uses this information to localize itself relative to the leader by following the proposed estimation method.

As demonstrated in Figure 7(a), we move the follower robot in a rectangular pattern and evaluate the 3-DOF pose estimates relative to the static leader robot. We present the qualitative results in Figure 8; it shows that the follower robot’s pose estimates are very close to their respective ground truth. Overall, we observe small average errors in translation (measured in centimeters) and rotation, which indicates reasonably accurate estimation. We obtain similar qualitative and quantitative performance with a dynamic leader as well. However, a few practicalities can affect this performance, e.g., the number of common key-points, synchronized communication, etc.; we discuss these issues in detail in Section IV-D.

IV-C2 6-DOF Pose Estimation

Figure 7(b) shows the setup of a particular underwater experiment where we capture human body-poses from different perspectives to estimate the 6-DOF transformations of multiple follower robots relative to a leader robot. In our experiments, the leader robot is equipped with a stereo camera; hence, the 3D information of the human pose-based key-points is obtained via stereo triangulation. Then we find their corresponding 2D projections on the follower robots’ cameras using the proposed person re-identification and key-point refinement processes. Finally, we estimate the follower-to-leader relative poses from the respective PnP solutions. We present a particular scenario in Figure 9(a); it illustrates the leader and follower robots’ perspectives and the associated human pose-based key-points. Additionally, Figure 9(b) shows the reconstructed 3D key-points, which are consistent with the mutually visible humans’ body-poses. Finally, the estimated 6-DOF poses of the follower robots relative to the leader robot are shown in Figure 9(c).

(a) The leader robot detects the pose-based key-points and shares the 3D locations.
(b) Estimated poses of the follower relative to the leader (the green cones represent the respective ground truth).
Fig. 8: An experiment to evaluate the accuracy of 2D relative pose estimation with two planar robots and two mutually visible humans.
(a) A group of people seen from multiple perspectives; the detected key-points and their associations are annotated in respective images.
(b) Stereo triangulation of the human pose-based key-points (seen by the leader robot).
(c) Estimated relative poses of the follower robots.
Fig. 9: An underwater experiment for 3D relative pose estimation using one leader and two follower robots.

IV-D Discussion: Operational Challenges and Practicalities

There are a few operational considerations and challenges involved in practical implementations of the proposed pose estimation method. We now discuss these aspects and their possible solutions based on our experimental findings.

  • Synchronized cooperation: A major operational requirement of multi-robot cooperative systems is the ability to register synchronized measurements in a common frame of reference. However, it is quite challenging in practice. For problems such as ours, an effective solution is to maintain a buffer of time-stamped measurements and register them as a batch using a temporal sliding window. The follower robots can use such techniques to find their relative poses at regular time intervals. However, the challenge remains in finding the instantaneous relative poses, especially when both robots are in motion.

  • Number of humans and relative viewing angle: We have observed a couple of other practical issues during the experiments. First, the presence of multiple humans in the scene is needed to ensure reliable pose estimation performance; we have found that two or more mutually visible humans are ideal for establishing a large set of reliable correspondences. In addition, we have found that the pose estimation performance is affected by the relative viewing angle; specifically, the method often fails to find correct associations when the leader-human-follower angle is too large. This results in a situation where the robots are exclusively looking at opposite sides of the person without enough common key-points.

  • Trade-off between robustness and efficiency: It is quite challenging to ensure fast yet robust performance for visual feature-based body-pose estimation and person ReId with the limited on-board computational resources of an embedded platform. As demonstrated in Section IV-B, this trade-off between robustness and efficiency led us to design the proposed person ReId and key-point refinement methodologies. These efficient modules enable near real-time relative pose estimation on an NVIDIA™ Jetson TX2. Nevertheless, there is significant room for improvement in order to achieve better performance margins at a faster rate.

V Conclusions and Future Work

In this paper, we explore the feasibility of using human body-poses as markers to establish reliable multi-view geometric correspondences and eventually solve the robot-to-robot relative pose estimation problem. First, we use OpenPose to extract the pose-based 2D key-points pertaining to the humans in the scene. Then we associate the humans seen from multiple views using an efficient person re-identification model. Subsequently, we refine the key-point correspondences using an iterative optimization algorithm based on their local structural similarities in the image space. Finally, we use the 3D locations of the key-points (triangulated by the leader robot) and their corresponding 2D projections (seen by the follower robot) to formulate a PnP problem and solve for the unknown pose of the follower robot relative to the leader. We perform extensive experiments in terrestrial and underwater environments to investigate the applicability of the proposed method; the results validate its effectiveness for both 2D and 3D robots. We also discuss the relevant operational challenges and propose efficient solutions to deal with them. In the future, we seek to improve the end-to-end run-time of the proposed system and plan to use it in practical applications such as multi-robot convoying and cooperative source-to-destination planning. Additionally, we aim to investigate the applicability of DensePose [2] in our work, which can potentially provide significantly more key-point correspondences per person compared to OpenPose.


Acknowledgments

We would like to thank Hyun Soo Park (Assistant Professor, University of Minnesota) for his valuable insights, which immensely enriched this paper. We gratefully acknowledge the support of the MnDrive initiative and thank NVIDIA Corporation for donating two Titan-class GPUs for this research. In addition, we are grateful to the Bellairs Research Institute of Barbados for providing us with the facilities for field experiments; we also acknowledge our colleagues at the IRVLab and the participants of the Marine Robotics Sea Trials for their assistance in collecting data and conducting the experiments.


  • [1] E. Ahmed, M. Jones, and T. K. Marks (2015) An Improved Deep Learning Architecture for Person Re-identification. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3908–3916. Cited by: §I, §III-B.
  • [2] R. Alp Güler, N. Neverova, and I. Kokkinos (2018) DensePose: Dense Human Pose Estimation in the Wild. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7297–7306. Cited by: §V.
  • [3] M. Andriluka, S. Roth, and B. Schiele (2009) Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1014–1021. Cited by: §II-B.
  • [4] A. N. Avanaki (2009) Exact Global Histogram Specification Optimized for Structural Similarity. Optical Review 16 (6), pp. 613–621. Cited by: §III-C.
  • [5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime Multi-person 2D Pose Estimation using Part Affinity Fields. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7291–7299. Cited by: §I, §II-B, §III-A.
  • [6] H. Damron, A. Q. Li, and I. Rekleitis (2018) Underwater Surveying via Bearing only Cooperative Localization. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3957–3963. Cited by: §I.
  • [7] V. Ferrari, M. Marin-Jimenez, and A. Zisserman (2008) Progressive Search Space Reduction for Human Pose Estimation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. Cited by: §II-B.
  • [8] M. A. Fischler and R. C. Bolles (1981) Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §II-A.
  • [9] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik (2014) Using K-poselets for Detecting People and Localizing their Keypoints. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3582–3589. Cited by: §II-B.
  • [10] R. Hartley and A. Zisserman (2003) Multiple View Geometry in Computer Vision. Cambridge University Press. Cited by: §IV-A.
  • [11] M. J. Islam, M. Ho, and J. Sattar (2018) Understanding Human Motion and Gestures for Underwater Human-Robot Collaboration. Journal of Field Robotics (JFR), pp. 1–23. External Links: Document Cited by: §I, §II-C.
  • [12] M. J. Islam, J. Hong, and J. Sattar (2018) Person Following by Autonomous Robots: A Categorical Overview. arXiv preprint arXiv:1803.08202. Cited by: §I, §II-C.
  • [13] F. Janabi-Sharifi and M. Marey (2010) A Kalman-filter-based Method for Pose Estimation in Visual Servoing. IEEE Transactions on Robotics (TRO) 26 (5), pp. 939–947. Cited by: §II-A.
  • [14] S. Johnson and M. Everingham (2011) Learning Effective Human Pose Estimation from Inaccurate Annotation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1465–1472. Cited by: §II-B.
  • [15] R. Kümmerle, M. Ruhnke, B. Steder, C. Stachniss, and W. Burgard (2013) A Navigation System for Robots Operating in Crowded Urban Environments. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3225–3232. Cited by: §I.
  • [16] J. Lei, M. Song, Z. Li, and C. Chen (2015) Whole-body Humanoid Robot Imitation with Pose Similarity Evaluation. Signal Processing 108, pp. 136–146. Cited by: §II-C.
  • [17] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: Deep Filter Pairing Neural Network for Person Re-identification. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 152–159. Cited by: §I, §III-B, §IV-B.
  • [18] J. Mainprice and D. Berenson (2013) Human-robot Collaborative Manipulation Planning using Early Prediction of Human Motion. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 299–306. Cited by: §II-C.
  • [19] R. Mead and M. J. Matarić (2017) Autonomous Human–robot Proxemics: Socially aware Navigation based on Interaction Potential. Autonomous Robots 41 (5), pp. 1189–1201. Cited by: §II-C.
  • [20] M. Montemerlo, S. Thrun, and W. Whittaker (2002) Conditional Particle Filters for Simultaneous Mobile Robot Localization and People-tracking. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), Vol. 1, pp. 695–701. Cited by: §II-C.
  • [21] NVIDIA™ (2014) Embedded Computing Boards. Note: accessed 8-2-2019. Cited by: §III-A.
  • [22] D. Otero and E. R. Vrscay (2014) Solving Optimization Problems that Employ Structural Similarity as the Fidelity Measure. In Proc. of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 1. Cited by: §III-C.
  • [23] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele (2013) Poselet Conditioned Pictorial Structures. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 588–595. Cited by: §II-B.
  • [24] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele (2016) DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4929–4937. Cited by: §II-B.
  • [25] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele (2012) Articulated People Detection and Pose Estimation: Reshaping the Future. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3178–3185. Cited by: §II-B.
  • [26] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh (2014) Pose Machines: Articulated Pose Estimation via Inference Machines. In Proc. of the European Conference on Computer Vision (ECCV), pp. 33–47. Cited by: §II-B.
  • [27] I. M. Rekleitis, G. Dudek, and E. E. Milios (2002) Multi-robot Cooperative Localization: A Study of Trade-offs Between Efficiency and Accuracy. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. 3, pp. 2690–2695. Cited by: §I.
  • [28] I. Rekleitis, D. Meger, and G. Dudek (2006) Simultaneous Planning, Localization, and Mapping in a Camera Sensor Network. Robotics and Autonomous Systems 54 (11), pp. 921–932. Cited by: §II-A.
  • [29] J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguere, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon, et al. (2008) Enabling Autonomous Capabilities in Underwater Robotics. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3628–3634. Cited by: §I.
  • [30] S. Se, D. G. Lowe, and J. J. Little (2005) Vision-based Global Localization and Mapping for Mobile Robots. IEEE Transactions on Robotics (TRO) 21 (3), pp. 364–375. Cited by: §I.
  • [31] A. Toshev and C. Szegedy (2014) DeepPose: Human Pose Estimation via Deep Neural Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1653–1660. Cited by: §II-B.
  • [32] N. Trawny and S. I. Roumeliotis (2010) On the Global Optimum of Planar, Range-based Robot-to-robot Relative Pose Estimation. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3200–3206. Cited by: §II-A.
  • [33] N. Trawny, X. S. Zhou, K. Zhou, and S. I. Roumeliotis (2010) Inter-robot transformations in 3D. IEEE Transactions on Robotics (TRO) 26 (2), pp. 226–243. Cited by: §II-A.
  • [34] C. Valgren and A. J. Lilienthal (2010) SIFT, SURF & Seasons: Appearance-based Long-term Localization in Outdoor Environments. Robotics and Autonomous Systems 58 (2), pp. 149–156. Cited by: §I.
  • [35] J. Wang and W. J. Wilson (1992) 3D Relative Position and Orientation Estimation using Kalman Filter for Robot Control. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2638–2645. Cited by: §II-A.
  • [36] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image Quality Assessment: from Error Visibility to Structural Similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §III-B, §III-B.
  • [37] L. Zhao, X. Li, Y. Zhuang, and J. Wang (2017) Deeply-learned Part-aligned Representations for Person Re-identification. In Proc. of the IEEE International Conference on Computer Vision (ICCV), pp. 3219–3228. Cited by: §IV-B.
  • [38] W. Zheng, S. Gong, and T. Xiang (2011) Person Re-identification by Probabilistic Relative Distance Comparison. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 649–656. Cited by: §IV-B.
  • [39] Y. Zheng, Y. Kuang, S. Sugimoto, K. Astrom, and M. Okutomi (2013) Revisiting the PnP Problem: A Fast, General and Optimal Solution. In Proc. of the IEEE International Conference on Computer Vision (ICCV), pp. 2344–2351. Cited by: §I, §II-A, §III-D.
  • [40] X. S. Zhou and S. I. Roumeliotis (2008) Robot-to-robot Relative Pose Estimation from Range Measurements. IEEE Transactions on Robotics (TRO) 24 (6), pp. 1379–1393. Cited by: §I, §II-A.
  • [41] X. S. Zhou and S. I. Roumeliotis (2011) Determining the Robot-to-robot 3D Relative Pose using Combinations of Range and Bearing Measurements (Part II). In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pp. 4736–4743. Cited by: §II-A.