Enhanced Human-Machine Interaction by Combining Proximity Sensing with Global Perception
The rise of collaborative robotics has led to a wide range of sensor technologies for detecting human-machine interactions: at short distances, proximity sensors detect nontactile gestures virtually occlusion-free, while at medium distances, active depth sensors are frequently used to infer human intentions. We describe an optical system for large workspaces that captures human pose based on a single panoramic color camera. Despite the two-dimensional input, our system predicts metric 3D pose information over larger fields of view than would be possible with active depth measurement cameras. We merge posture context with proximity perception to reduce occlusions and improve accuracy at long distances. We demonstrate the capabilities of our system in two use cases involving multiple humans and robots.
2nd Workshop on Proximity Perception
2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Macau, China
Published under the Creative Commons license (CC BY-NC-ND) at KITopen
Proximity perception is an active research field that aims to equip robots with nontactile near-field sensors to advance robot autonomy and human-machine interoperability. While proximity sensing is a compelling concept at close range, it fails to recognize spatio-temporal events at moderate distances from the robot [hughes2018robotic]. These events, however, provide important information for creating a more robust and intuitive set of interaction patterns.
In previous works, optical systems were increasingly used to close this information gap. Many approaches integrate active depth cameras to detect human keypoints and measure distances between objects in the environment. The downsides of active depth sensors are their short operating range, their sensitivity to extraneous light, and the increased occlusion caused by the camera-projector arrangement.
We describe an optical system that complements proximity perception by observing humans and machines from a bird's-eye perspective (see Figure 1). Even though our system works solely with color images from a single panoramic camera, it delivers pose information for humans and robots in a metric space. We combine the provided pose information with near-field measurements from proximity sensors to robustly detect human-machine interaction patterns. We demonstrate the benefits of merging complementary input modalities in two use cases involving multiple robots and humans (demonstration video: https://youtu.be/1X0xF3m36BA).
I-A Related Work
Vision-based human-machine interaction has been studied before. Du et al. [du2018online] combine RGB-D cameras and inertial measurement units to detect human gestures for robot learning. Zimmermann et al. [zimmermann20183d] use depth-based devices to record human instructions for teaching a service robot.
Systems based on depth cameras are also used to study human-robot safety aspects. Fabrizio et al. [fabrizio2016real] describe a depth sensing enabled device to compute distances to dynamic obstacles for collision avoidance. Švarný et al. [petr2018] propose using 2D keypoint detection merged with RGB-D data to measure distances between a robot and a single human operator.
The proposed system has a number of benefits. Our system works with a readily available single wide-angle color camera, eliminating the range limitations of active depth devices. The inference step builds on recent advances in pose estimation, enabling our system to answer complex image-related queries robustly. In addition to instantaneous pose, we track individual humans to provide pose trajectories over time. Although the detection step takes place in image space, the majority of predictions are lifted into a metric 3D space, enabling natural fusion with near-field sensors to create a richer set of human-machine interactions.
Figure 3 illustrates the core components of our system. These are detailed in the following sub-sections.
II-A View Synthesis
In order to process panoramic color images, we first synthesize one or more rectilinear views (see Figure 2). View synthesis assumes a virtual pinhole camera that is potentially rotated with respect to the physical source frame. Next, every virtual camera pixel is mapped to a corresponding source pixel using the method described in [heindl2018]. The color at the computed source position is bilinearly interpolated to determine the final color value of the virtual pixel.
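The mapping can be sketched as follows, assuming an equirectangular panorama; the function name and parametrization are illustrative placeholders, not the exact method of [heindl2018]:

```python
import numpy as np

def synthesize_view(pano, K_inv, R, out_hw):
    """Render a rectilinear view from an equirectangular panorama.

    pano: (H, W, 3) source image, K_inv: inverse virtual intrinsics,
    R: rotation of the virtual camera w.r.t. the panorama frame.
    """
    H, W = pano.shape[:2]
    h, w = out_hw
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = R @ (K_inv @ pix)                  # viewing rays in the pano frame
    rays /= np.linalg.norm(rays, axis=0)
    lon = np.arctan2(rays[0], rays[2])        # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(rays[1], -1, 1))  # latitude in [-pi/2, pi/2]
    x = (lon / (2 * np.pi) + 0.5) * (W - 1)   # fractional source pixel coords
    y = (lat / np.pi + 0.5) * (H - 1)
    # Bilinear interpolation of the source color at (x, y).
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    fx, fy = (x - x0)[:, None], (y - y0)[:, None]
    c = (pano[y0, x0] * (1 - fx) * (1 - fy) + pano[y0, x1] * fx * (1 - fy)
         + pano[y1, x0] * (1 - fx) * fy + pano[y1, x1] * fx * fy)
    return c.reshape(h, w, 3)
```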
II-B Pose Estimation
To compute the 2D pose of humans and robots, we first predict 2D keypoints from synthesized color views (see Figure 7 b,d). We use neural network architectures, depicted in Figure 4, based on the works of Cao et al. [cao2017realtime] and Heindl et al. [cheind2019disp] to perform keypoint localization. Each network, composed of a series of Deep Convolutional Neural Networks (DCNNs), regresses keypoint coordinates by transforming the input color images into keypoint belief maps of size H x W x K, where H and W are the image height and width and K is the number of output keypoints (6 for robots, 25 for humans). Each belief map encodes the likelihood of observing a particular keypoint in a specific image region. We train these networks with both real [cao2017realtime] and artificially generated images [cheind2019disp].
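A minimal sketch of how keypoint coordinates might be read out of such belief maps; the function and the threshold value are hypothetical, not part of the networks described above:

```python
import numpy as np

def keypoints_from_belief_maps(B, threshold=0.1):
    """Extract one (x, y, score) keypoint per belief-map channel.

    B: (H, W, K) array of per-keypoint belief maps. Keypoints whose peak
    belief falls below `threshold` are reported as missing (None).
    """
    H, W, K = B.shape
    keypoints = []
    for k in range(K):
        idx = np.argmax(B[:, :, k])     # flat index of the strongest belief
        y, x = divmod(idx, W)           # convert back to row/column
        score = B[y, x, k]
        keypoints.append((x, y, score) if score >= threshold else None)
    return keypoints
```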
In a subsequent step, 2D poses are transformed into a metric space via a set of homographies. In particular, we propose the use of an image-to-ground homography H_g to map an image position (u, v) to metric ground coordinates as follows:

g = pi(H_g * (u, v, 1)^T),

where pi is the dehomogenization operator pi((x, y, w)^T) = (x/w, y/w). Because such mappings are accurate only for body parts sufficiently close to the ground (such as foot positions), we use a statistical body model to gain the ability to map additional keypoints such as hips and shoulders. These additional points serve to predict body orientation and to stabilize body positions in case of partial occlusions. Regarding robots, we map only the base joints to determine their location.
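A minimal NumPy sketch of this image-to-ground mapping (the function name is illustrative):

```python
import numpy as np

def image_to_ground(H_g, points_2d):
    """Map 2D image points to metric ground-plane coordinates.

    H_g: (3, 3) image-to-ground homography; points_2d: (N, 2) pixel positions.
    Returns (N, 2) ground coordinates after dehomogenization.
    """
    pts = np.hstack([points_2d, np.ones((len(points_2d), 1))])  # homogeneous
    g = (H_g @ pts.T).T
    return g[:, :2] / g[:, 2:3]  # dehomogenize: divide by w
```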
II-C Multiple Object Tracking
To distinguish individuals and smooth poses over time, we perform object tracking (see Figures 1 and 7d). We filter human trajectories using a Kalman filter, assuming a linear motion model. All operations are performed in ground-plane coordinates to avoid perspective effects. To assign newly detected poses to existing tracks, we set up a linear assignment problem on the following bipartite graph: given the set of pose detections and the set of forward-predicted poses at time t, we add an edge for every possible combination of detected and tracked pose. The cost associated with an edge is the Euclidean distance between the ground positions of the two poses. We use the Hungarian method of Kuhn [kuhn1955hungarian] to solve for the optimal assignment and update the Kalman filters accordingly.
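The assignment step can be sketched as follows, using SciPy's implementation of the Hungarian method; the gating distance and function signature are assumptions for illustration, and the Kalman prediction/update steps are omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_detections(tracks, detections, max_dist=1.0):
    """Match forward-predicted track positions to new detections.

    tracks, detections: (T, 2) and (D, 2) ground-plane positions.
    Returns a list of (track_idx, det_idx) pairs whose Euclidean
    distance is below `max_dist`; unmatched entries are left out.
    """
    if len(tracks) == 0 or len(detections) == 0:
        return []
    # Pairwise Euclidean distances form the bipartite edge costs.
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian method [kuhn1955hungarian]
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```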
II-D Merging Pose Information with Proximity Measurements
Pose information alone has already proven useful in our experiments with human-robot interaction. However, line-of-sight occlusions and large object distances limit the system's ability to measure small movements accurately. Therefore, we merge global pose context with gestures detected by a proximity device (https://www.leapmotion.com/).
In particular, we treat data fusion as a probabilistic classification problem and consider the presence of a gesture intended for robot r to be a binary latent random variable z_r. At each time-step we observe the following noisy features: a) the proximity device's confidence level c for an active gesture (note that we consider the device to be held by the operator, so this feature is shared among robots), b) the relative position x_r of the operator closest to robot r in ground-plane coordinates, and c) the orientation of the operator with respect to robot r, expressed as the cosine o_r of the angle between them.
Given the observed features c, x_r and o_r, we compute the posterior probability of the presence of a command using Bayes' theorem:

P(z_r | c, x_r, o_r) = P(c, x_r, o_r | z_r) P(z_r) / P(c, x_r, o_r).

For computational reasons, we introduce the following independence assumptions:

P(c, x_r, o_r | z_r) = P(c | z_r) P(x_r | z_r) P(o_r | z_r).

That is, all observed features are independent of each other given the state of command z_r. Thus, the posterior probability simplifies to

P(z_r | c, x_r, o_r) = P(c | z_r) P(x_r | z_r) P(o_r | z_r) P(z_r) / Z,

where Z is the partition function given by

Z = sum over z_r in {0, 1} of P(c | z_r) P(x_r | z_r) P(o_r | z_r) P(z_r).
Our model, shown graphically in Figure 5, follows the structure of a naïve Bayes classifier. We model the likelihood of the operator position x_r with a two-dimensional histogram distribution with per-bin probabilities, and, for practical reasons, the likelihood of the orientation cosine o_r with a normal distribution. Strictly speaking, the use of a normal distribution is inappropriate, as the support of the normal distribution is (-inf, inf) and not [-1, 1]. However, probabilities beyond [-1, 1] quickly drop to zero and therefore do not pose a problem for our use case.
The parameters of our model are estimated in a supervised fashion. That is, we consider z_r along with the observed features to be given in each training sample. We estimate the parameters of each distribution by maximum likelihood. For multiple robots, we learn a separate classifier for each robot. A single gesture may hence trigger actions on multiple robots.
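A minimal sketch of the fused classifier under the independence assumptions above; the model container and the concrete likelihood choices shown here are illustrative placeholders, not the trained distributions of our system:

```python
import numpy as np

def gaussian(mu, sigma):
    """Return a Gaussian density function with mean mu and std sigma."""
    return lambda v: (np.exp(-0.5 * ((v - mu) / sigma) ** 2)
                      / (sigma * np.sqrt(2 * np.pi)))

def posterior_command(c, x, o, model):
    """Naive-Bayes posterior P(z=1 | c, x, o) for one robot.

    model holds per-class likelihood functions and priors (hypothetical):
      p_c[z](c)  - likelihood of the gesture-device confidence c
      p_x[z](x)  - likelihood of the operator ground position x
      p_o[z](o)  - likelihood of the orientation cosine o
      prior[z]   - P(z)
    """
    joint = {}
    for z in (0, 1):
        joint[z] = (model['p_c'][z](c) * model['p_x'][z](x)
                    * model['p_o'][z](o) * model['prior'][z])
    Z = joint[0] + joint[1]  # partition function
    return joint[1] / Z
```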
III-A Experimental Setup
We perform all experiments in a large environment observed by a single color camera mounted above the ground. We calibrate the camera intrinsics and estimate the ground-plane homography using a chessboard target. Our setup consists of two robots, a UR10 and a KUKA iiwa, placed in close proximity to each other (see Figure 7a). We use a standard computing unit equipped with a single NVIDIA GeForce GTX 1080 Ti for pose estimation. The gesture sensing device is connected to a portable laptop via USB. Both robots are connected to separate controlling units. ZeroMQ [hintjens2013zeromq] is used to exchange messages between all computing entities.
III-B Pose Estimation Accuracy
In this work we consider metric accuracies, as image-space accuracies have been reported previously [heindl2018, cheind2019disp]. Assuming an error-free intrinsic camera matrix, we measure the uncertainty in detecting point correspondences for homography estimation. The parameter uncertainties of the homography are estimated using the method of Criminisi et al. [criminisi1999plane]. Together with the uncertainty of 2D keypoint detection, we transform these input uncertainties into measurement uncertainties by propagating errors through Equation 1 to first order. Table I lists the measurement uncertainties as a function of object-to-camera distance.
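First-order propagation through the ground-plane mapping can be sketched numerically; the isotropic pixel covariance and the function name are assumptions for illustration:

```python
import numpy as np

def propagate_pixel_uncertainty(H_g, p, sigma_px):
    """First-order propagation of keypoint uncertainty to the ground plane.

    H_g: (3, 3) image-to-ground homography, p: (u, v) pixel position,
    sigma_px: isotropic pixel standard deviation.
    Returns the 2x2 ground-plane covariance J S J^T.
    """
    def to_ground(q):
        g = H_g @ np.array([q[0], q[1], 1.0])
        return g[:2] / g[2]

    eps = 1e-4
    J = np.empty((2, 2))  # numeric Jacobian of the mapping at p
    for i in range(2):
        dp = np.zeros(2)
        dp[i] = eps
        J[:, i] = (to_ground(p + dp) - to_ground(p - dp)) / (2 * eps)
    S = (sigma_px ** 2) * np.eye(2)  # isotropic pixel covariance
    return J @ S @ J.T
```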
III-C Data Fusion Accuracy
To evaluate the data fusion approach, we conduct several experiments using the robot orchestration use case (see Section IV). For training the classifier we record 900, 2700 and 4500 events of each feature, corresponding to increasing amounts of potential human-robot interaction. In a downstream step, we label the latent variable for each event based on visual inspection. We train a separate classifier on each sample set and evaluate on 5000 test events. Figure 6 shows the results in terms of precision-recall curves. Even for the smallest training size the average precision is around 0.87, and it increases moderately as we train on more data.
We demonstrate the interaction potential of our system in two use cases (UCs), both shown in the demonstration video:
We orchestrate multiple robots by placing a near-field gesture-sensitive device in the hands of an operator. The operator's position and orientation with respect to the robots are used to predict the intended recipient of gesture commands (see Figure 7 a,b).
We show adaptive robot speed control to simplify human-machine cooperation. As humans approach, the robot automatically decelerates. Likewise, it accelerates back to full operating speed as humans leave (see Figure 7 c,d).
V Conclusion and Discussion
We have developed a system based on a single monocular color camera that provides real-time pose detection for human-machine interaction at large scales. Although detection is performed in 2D, we develop a homography-based solution to lift predictions into a metric 3D space. We show that probabilistically fusing proximity perception with global vision is beneficial for creating robust human-robot interaction detection.
In its present state, we identify the following shortcomings of our system: a) The use of homographies is limited to keypoints with statistically predefined heights above ground. In the future, we plan to predict 3D keypoints directly. b) Proximity perception is integrated based on the detection confidence levels provided by the gesture device. We expect to gain additional robustness by directly fusing raw data from both the camera and the proximity sensor in a common model.