Enhanced Human-Machine Interaction by Combining Proximity Sensing with Global Perception

Enhanced Human-Machine Interaction by Combining Proximity Sensing with Global Perception

Christoph Heindl, Markus Ikeda, Gernot Stübl, Andreas Pichler and Josef Scharinger Visual Computing and Robotics, PROFACTOR GmbH, Austria christoph.heindl@profactor.atInstitute of Computational Perception, Johannes Kepler University, Austria josef.scharinger@jku.at

The raise of collaborative robotics has led to wide range of sensor technologies to detect human-machine interactions: at short distances, proximity sensors detect nontactile gestures virtually occlusion-free, while at medium distances, active depth sensors are frequently used to infer human intentions. We describe an optical system for large workspaces to capture human pose based on a single panoramic color camera. Despite the two-dimensional input, our system is able to predict metric 3D pose information over larger field of views than would be possible with active depth measurement cameras. We merge posture context with proximity perception to reduce occlusions and improve accuracy at long distances. We demonstrate the capabilities of our system in two use cases involving multiple humans and robots.


2nd Workshop on Proximity Perception
2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Macau, China
Published under the Creative Commons license (CC BY-NC-ND) at KITopen

I Introduction

Proximity perception is an active research field, aiming to equip robotics with nontactile near-field sensors to advance robot autonomy and human-machine interoperability. While proximity sensing is a compelling concept on close-up range, it fails to recognize spatio-temporal events on moderate distances from the robot [hughes2018robotic]. These events, however, provide important information to create a more robust and intuitive set of interaction patterns.

In previous works, optical systems were increasingly used to close this information gap. Many approaches integrate active depth cameras to detect human key points and measure distances between objects in the environment. The downsides of active depth sensors are the short operating range, the sensitivity to extraneous light and the increased occlusion caused by the camera-projector arrangement.

Fig. 1: Our system complements proximity sensing (left) with global perception from a bird’s-eye perspective in real-time (right). Based on a color input image, we estimate the following features shown superimposed in the right image: human pose (dots connected by lines), worker orientation (white triangle) and the recent movement trajectory (red path on floor). The majority of these measurements are available in a metric 3D space and combined with proximity perception to coordinate human-machine interactions.

We describe an optical system that complements proximity perception by observing human and machines from a bird’s-eye perspective (see Figure 1). Even though our system works solely with color images from one single panoramic camera, it delivers pose information for humans and robots in a metric space. We combine the provided pose information with near-field measurements from proximity sensors to robustly detect human-machine interaction patterns. We demonstrate the benefits of merging complementary input modalities in two use cases involving multiple robots and humans 111Demonstration video https://youtu.be/1X0xF3m36BA.

I-a Related Work

Vision based human-machine interaction has been studied before. Guanglong et al. [du2018online] combine RGB-D cameras and inertial measurement units to detect human gestures for robot learning. Zimmermann et al. [zimmermann20183d] use depth-based devices to record human instructions for teaching a service robot.

Systems based on depth cameras are also used to study human-robot safety aspects. Fabrizio et al. [fabrizio2016real] describe a depth sensing enabled device to compute distances to dynamic obstacles for collision avoidance. Švarnỳ et al. [petr2018] propose using 2D keypoint detection merged with RGB-D data to measure distances between a robot and a single human operator.

I-B Contributions

The proposed system has a number of benefits. Our system works with a readily available single wide-angle color camera, eliminating the range limitations of active depth devices. The inference step builds on recent advances in pose estimation, enabling our system to answer complex image related queries robustly. In addition to instantaneous pose, we track individual humans to provide pose trajectories over time. Although the detection step takes place in image space, the majority of predictions are elevated to a metric 3D space, enabling natural fusion with near-field sensors to create a richer set of human-machine interactions.

Ii Method

Figure 3 illustrates the core components of our system. These are detailed in the following sub-sections.

Ii-a View Synthesis

In order to process panoramic color images, we first synthesize one or more synthetic rectilinear views (see Figure 2). View synthesis assumes a virtual pinhole camera that is potentially rotated with respect to the physical source frame . Next, every virtual camera pixel is mapped to a corresponding source pixel using the method described in [heindl2018]. The computed pixel position is bilinearly interpolated to determine the final color value associated with .

Fig. 2: View synthesis from panoramic color images. A virtual rectilinear camera image (right) is created from a highly distorted spherical image region (left).
Fig. 3: System Overview. Our approach considers panoramic images as input; these are synthesized to form one or more rectilinear views. Deeply learned neural networks then predict 2D human and robot pose keypoints. These detections are subsequently lifted to a 3D metric space using homographies of planes. Next, our system fuses near-field measurements with human/robot pose context to create location-aware events. These events in turn lead to environmental reactions defined by the application scenario.

Ii-B Pose Estimation

To compute the 2D pose of humans and robots, we first predict 2D keypoints from synthesized color views (see Figure 7 b,d). We use neural network architectures, depicted in Figure 4 based on works of Cao et al. [cao2017realtime] and Heindl et al. [cheind2019disp] to perform keypoint localization. Each network, composed of a series of Deep Convolutional Neural Networks (DCNNs), regresses keypoint coordinates by transforming the input color images into keypoint belief maps . Here are image height and width and is the number of output keypoints (6 for robots, 25 for humans). Each belief map encodes the likelihood of observing a particular keypoint in a specific image region. We train these networks with both real [cao2017realtime] and artificially generated images [cheind2019disp].

In a subsequent step, 2D poses are transformed into a metric space via a set of homographies. In particular, we propose the use of an image-to-ground homography , to map image positions to metric ground coordinates as follows


where is the dehomogenization operator . Because such mappings are accurate only for body parts sufficiently close to the ground (such as foot positions), we use a statistical body model to gain the ability to map extra keypoints such as hips and shoulders. These additional points serve to predict body orientation and to stabilize body positions in case of partial occlusions. Regarding robots, we map only the base joints to determine their location.

Ii-C Multiple Object Tracking

To distinguish individuals and smooth poses over time, we perform object tracking (see Figure 1, 7d). We filter human trajectories using a Kalman filter, assuming a linear dynamic motion model. All operations are performed in ground plane coordinates to avoid perspective effects. To assign newly detected poses to existing ones, we setup a linear assignment problem on the following bipartite graph: Given a set of pose detections and a set of forward predicted poses at time , we add an edge for every possible combination of detected and tracked pose to the graph. The cost associated with edge is computed as the Euclidean distance between the ground positions of and . We use the method of Kuhn et al. [kuhn1955hungarian] to solve for the optimal assignments and update the Kalman filters correspondingly.

Fig. 4: Joint localization architecture [cheind2019disp]. First, base features are extracted from input image . An initial Deep Convolutional Neural Network (DCNN) performs initial joint belief prediction . A series of DCNNs then refines belief prediction to generate the final prediction .

Ii-D Merging Pose Information with Proximity Measurements

Pose information alone has already proven useful in our experiments with human-robot interaction. However, line-of-sight occlusions and large object distances limit the system’s ability to measure small movements accurately. Therefore, we merge global pose context with gestures detected by a proximity device222https://www.leapmotion.com/.

In particular, we treat data fusion as a probabilistic classification problem and consider the presence of gestures intended for robot to be binary latent random variables . At each time-step we observe the following noisy features: a) the proximity device’s confidence level for an active gesture (note that we consider the device to be held by the operator and this feature is thus shared among robots), b) the relative position of the operator closest to the robot in ground plane coordinates and c) the orientation of the operator with respect to the robot, expressed as the cosine of the angle .

Given observed features , we compute the posterior probability of the presence of a command using Bayes theorem

For computational reasons, we introduce the following independence assumptions

That is, all observed features are independent from each other given the state of command . Thus, the posterior probability simplifies to

where is the partition function given by

Our model, shown graphically in Figure 5, follows the structure of a naïve Bayes classifier. We assume the following underlying probability distributions

Here, refers to a two-dimensional distribution with bin probabilities . We have chosen to model the conditional probability using a normal distribution for practical reasons. Strictly speaking, the use of a normal distribution is inappropriate, as the support of the normal distribution is and not . However, probabilities beyond quickly drop to zero and therefore do not pose a problem for our use case.

The parameters of our model are estimated in a supervised fashion. That is, we consider along with to be observed in each training sample. We estimate the parameters of each distribution by maximizing the likelihood. For multiple robots, we learn a separate classifier for each of them. A single gesture may hence trigger actions on multiple robots.




Fig. 5: Probabilistic data fusion. The presence of a command intended for robot is modelled as a binary latent random variable . Our model jointly considers a) gesture device confidence , b) position of human and c) orientation of operator to predict the posterior probability of .

Iii Evaluation

Iii-a Experimental Setup

We perform all experiments in an environment that covers the volume of using a single color camera with resolution , mounted above ground. We calibrate the camera intrinsically and estimate the ground plane homography using a chessboard object. Our setup consists of two robots: an UR10 and a KUKA iiwa, both of which are placed in close proximity to each other (see Figure 7a). We use a standard computing unit equipped with a single NVIDIA GeForce GTX 1080 Ti for pose estimation at . The gesture sensing device is connected to a portable Laptop via USB. Both robots are connected to separate controlling units. ZeroMQ [hintjens2013zeromq] is used to exchange messages between all computing entities.

Iii-B Pose Estimation Accuracy

In this work we consider metric accuracies, as image accuracies have been reported previously [heindl2018, cheind2019disp]. Assuming an error-free intrinsic camera matrix, the uncertainty in detecting point correspondences for homography estimation is measured to be . The parameter uncertainties of are estimated using the method of Criminsi et al. [criminisi1999plane]. The uncertainty of 2D keypoint detection is . We transform input uncertainties to measurement uncertainties by propagating errors through Equation 1 to the first order. Table I lists measurement uncertainties as a function of object-to-camera distance.

Distance Uncertainty
TABLE I: Measurement uncertainties as a function of object distance.

Iii-C Data Fusion Accuracy

To evaluate the data fusion approach, we conduct several experiments using the robot orchestration use case (see Section IV). For training the classifier we record 900, 2700 and 4500 events of each feature in . This corresponds to , and of potential human-robot interaction. In a downstream step, we label the latent variable for each event based on visual inspection. We train separate classifiers for each sample set and evaluate on 5000 test events. Figure 6 shows the results in terms of precision-recall curves. Even for the smallest training size the average precision is around 0.87 and increases moderately as we train on more data.

Fig. 6: Classification accuracy based on fusing proximity sensing with global vision. Each curve represents the precision/recall of a probabilistic model trained on a specific number of training samples at various classification thresholds. Besides the number of the training samples, the average precision (AP) is shown.

Iv Demonstrations

We demonstrate the interaction potential of our system in two use cases (UCs)333Video link is provided in footnote 1.:


We orchestrate multiple robots by placing a near-field gesture-sensitive device in the hands of an operator. The operator’s position and orientation with respect to the robots is used to predict the intended receiver of gesture commands (see Figure 7 a,b).


We show an adaptive robot speed control to simplify human-machine cooperation. As humans approach, we automatically decelerate the robot. Likewise, we accelerate the robot to full operating speed as humans leave (see Figure 7 c,d).

Fig. 7: Use Cases (UCs); (a,b) UC1 Robot orchestration - operator’s position and orientation selects the intended robot to receive near-field gesture commands; (c,d) UC2 Robot Interaction - adjusting robot speed in human presence for near-field interaction.

V Conclusion and Discussion

We have developed a single monocular camera-based system to provide real-time pose detection for human-machine interaction at large scales. Albeit our detection is performed in 2D, we develop a homography based solution to lift predictions to a metric 3D space. We show that probabilistic fusing proximity perception with global vision is beneficial in creating robust human-robot interaction detection.

In its present state we identify the following shortcomings of our system: a) The use of homographies is limited to keypoints with statistically predefined heights above ground. In the future, we plan to predict 3D keypoints directly. b) Proximity perception is integrated based on detection confidence levels provided by the gesture device. We expect to gain additional robustness, by directly fusing raw data from both, the camera and the proximity sensor, in a common model.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description