Early Estimation of User’s Intention of Tele-Operation
Object Affordance and Hand Motion in a Dual First-Person Vision
This paper describes a method of estimating the intention of a user’s motion in a robot tele-operation scenario. One of the issues in tele-operation is latency, which occurs due to various reasons such as a slow robot motion and a narrow communication channel. An effective way of reducing the latency is to estimate human intention of motions and to move the robot proactively. To enable a reliable early intention estimation, we use both hand motion and object affordances in a dual first-person vision (robot and user) with an HMD. Experimental results in an object pickup scenario show the effectiveness of the method.
Lifestyle support is one of the promising application domains of robotic technologies. A mobile manipulator is a useful tool for the user with a disability to perform various home tasks such as bringing a user-specified object from a remote room. Tele-operation through a robot’s view is a common approach to easily controlling such a robot. Human body motion-based control (e.g., ) is promising because it can provide an intuitive way of tele-operation. There are, however, two issues in body motion-based control. One is the body motion observation, and the other is the latency due to motion estimation or robot control.
Concerning body motion estimation, existing devices such as MoCap, RGB-D cameras, and wearable sensors are costly and/or not very reliable. Concerning the latency, even if the communication channel is fast enough, the latency is sometimes unavoidable due to, for example, the difficulty in controlling a robot precisely. A hybrid approach will then be a solution, which combines the user’s command and robot’s autonomous movement based on object recognition and motion planning capabilities, but keeping the intuitiveness of operation remains as a problem.
To address these issues, we propose an approach based on early intention estimation of human actions with a dual first-person view of the robot and the user. We suppose the situation where a user tries to pick up one of the objects on a remote table. Fig. 1 shows an intended scenario and an overview of the proposed method. The user obtains the robot’s current view through a head-mounted display (HMD). The user reaches out a hand to touch a target object in the view. Of course, the hand never touches the object at a remote site. The system uses a stream of hand images for intention estimation. For an early but reliable estimation of intention, we use a strong relationship between human hand shape (i.e., preshaping ) and object affordances. Unlike the previous work on action recognition in a first-person vision, we combine data from different views.
2 Related Work
2.1 Action recognition by first-person vision
Many methods have been proposed for End-to-End action recognition by first-person vision [11, 7]. These methods analyzes the relationships of objects and hand motions which both appear in first-person image sequences. In our tele-operation scenario, these methods cannot be directly adopted because hands and objects are at different locations and therefore observed in separate images.
2.2 Hand motion intention estimation using preshaping and affordances
Characteristics of an object which naturally induces a human action is called affordance . Arbib et al.  show the relationship between the shape and the motion of a hand and the visible characteristics of the target object. This implies that preshaping provides useful information for identifying the target object . In our problem setting, the user cannot get a visual feedback of her/his own hand position and motion. Some work show that even in such a case, as long as the location of the target of interaction is known, the reaching motion can be executed similarly [12, 10]; This supports the use of separately observed object locations/types and a hand motion stream.
3 Object Affordance Extraction
Do et al.  developed AffordanceNet. Affordance here is a part of an object which is closely related to some human action. AffordanceNet outputs object names and bounding boxes, and affordance-wise segmentation.
We use cups as target objects. We modified the original AffordanceNet such that the class output is only “cup” and the affordance output has the following four classes: contain, wrap-grasp, handle-grasp, and background. We also constructed a new dataset accordingly by ourselves. Fig. 2 shows examples in the new dataset and an affordance extraction result.
4 Intention Estimation Network
4.1 Network structure
Early intention estimation or action prediction usually necessitates integration and analysis of time series data. We design a network, called Intention Estimation Network (IEN), which is to calculate probable target location(s) of the current hand motion based on an integration of affordance information and hand information (see Fig. 3). For the former, we use the output of AffordanceNet trained with our dataset, which is composed of a set of four affordance masks and a bounding box mask. For the latter, we use the output of the ConvLSTM  with a sequence of hand images as an input. We concatenate both outputs and then supply it to a variant of U-Net , one of the popular networks for semantic segmentation. The final heatmap is normalized so that the values of all pixels sum to one.
4.2 Dataset construction for intention estimation
The input to IEN includes the extracted object and affordance infomration as well as a hand image sequence. The latter is the heatmap of the target location. Since the output of AffordanceNet for a real setting is not free from errors, we first generated a variety of simulated scenes, which are with ground truth information of affordances and object bounding boxes, and then provided them to the user for taking a hand-reaching motion.
Our system uses a stereoscopic head-mounted display (HMD) (Occulus Rift CV1) for presenting images of the remote site to the user. We use an RGB-D camera (Intel RealSense D435) for acquiring hand motion sequences, which are recorded both in RGB and depth images. Fig. 1 shows a user wearing the system on the bottom left.
The steps for collecting a hand motion sequence for an affordance/stereo pair are as follows (see Fig. 4):
Put two or three cups on the table. Textures, orientations, and locations of cups are set randomly.
A stereo image pair is shown on the HMD.
The user moves the right hand to try to pick up the designated object in the scene, and the hand image sequence is recorded with the RGB-D camera.
The recorded sequence is saved with the output of AffordanceNet.
5 Performance Evaluation
5.1 Training and testing the network
We collected 156 pairs of affordance and hand motion information for training. One hand motion sequence consists of 60 frames (30fps, 2sec) and a set of 10-frame sequences are extracted with a sliding-window fashion. As a result, we have 50 sequences for each original sequence and 7,800 sequences in total. The ground truth of the output heatmap is generated by putting a Gaussian distribution at the location of the designated object. We used the KL-divergence as the loss function to evaluate the distance between the true and the predicted heatmap.
We generated an additional test dataset for evaluation by 50 trials. The input to the intention estimation network (IEN) in testing is not limited to a fixed-length (10 frames) sequence but a set of sequences with variable lengths from 1 to 15 to see how early the intention can be estimated.
We examined the following three points: (1) effectiveness of affordance information, (2) comparison of RGB and depth images, and (3) effectiveness of hand region extraction in RGB images.
5.2 Estimation examples
Fig. 5 shows the inputs and the corresponding results for depth images. There are two cups, and the user is asked to pick up the right one. With affordance information, the probability of the right cup gradually increased, while without affordance information, the probabilities for both objects remain almost the same.
Fig. 6 examines the usefulness of hand extraction for RGB images. There are two cups, and the user is asked to pick up the right one. With hand region information, the probability of the right cup increased at the 6th frame, while both cups are equally probable without that information. Since we learned the use of hand extraction is always useful, we include hand extraction for RGB images in the subsequent quantitative comparison.
5.3 Quantitative evaluation
The purpose of the intention estimation is to determine the target object as early as possible. Since a heatmap can be interpreted as a probability distribution of the location of the target object, we use the sum of probabilities with the bounding box of an object as the confidence value. Such confidence values for all objects in the scene are normalized to be used as probabilities of being the target. The object whose probability exceeds a threhold is judged as the intended target object. We use the averaged f-value as a measure of estimation quality.
Table 1 compares actual f-values for four cases in order to summarize the above comparisons. Using depth images is more effective in the intention estimation because the hand region and shape information is more clearly obtained compared to RGB images. Using affordance information is useful for both depth and RGB images, and using RGB images with hand extraction and affordances is comparable to using depth images without affordances.
*-AO: with affordance, *-O: without affordance
This paper proposes a method for early estimation of the user’s intention of tele-operation using a dual first-person images and object affordances. We deal with a tele-operated object pickup scenario where the user sees a view of a remote site through an HMD and moves his/her hand to try to pick up a target object. AffordanceNet extracts object and affordance information and ConvLSTM extracts hand motion and shape information. The U-Net combines these two kinds of information to generate a heatmap of the target location. We generated two new datasets, one is for training AffordanceNet and the other is for training the whole network. We compared several cases and show that affordance information is useful in early intention estimation both for depth and RGB images. Testing the developed method on a real robot and applying it to other objects are future work.
This work is in part supported by JSPS KAKENHI Grant Numbers 17H01799. The authors would like to thank Mr. Hiroaki Masuzawa for fruitful discussion.
-  M. Arbib, T. Iberall, and D. Lyons. Coordinated control programs for movements of the hand. In Hand Function and the Neocortex, pages 135–170. Springer Verlag, 1985.
-  C. Bard, J. Troccaz, and G. Vercelli. Shape analysis and hand preshaping for grasping. In Proc. IROS-91, pages 64–69, 1991.
-  T.-T. Do, A. Nguyen, and I. Reid. AffordanceNet: An end-to-end deep learning approach for object affordance detection. In Proc. ICRA-2018, 2018.
-  J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
-  C. Ha, S. Park, J. Her, I. Jang, Y. Lee, G. Cho, H. Son, and D. Lee. Whole-body multi-modal semi-autonomous teleoperation of mobile manipulator systems. In Proc. ICRA-2015, pages 164–170, 2015.
-  N. Nakamaura, T. Miyasato, and F. Kishino. Preshaping analysis and target prediction in grasping hand motion. IEICE Trans. on Fundamentals, J80-A(6):1003–1013, 1997. (in Japanese).
-  N. Rhinehart and K. Kitani. First-person activity forecasting from video with online inverse reinforcement learning. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. arXiv preprint arXiv:1505.04597, 2015.
-  X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, volume 25, pages 802–810, 2015.
-  S. Shimawaki and N. Sakai. Effect of viewing conditions on preshaping movement in hand grasp motion. The Japanese J. of Ergonomics, 47(1):31–35, 2011.
-  S. Singh, C. Arora, and C. Jawahar. First person action recognition using deep learned descriptors. In Proc. CVPR-2016, 2016.
-  S. Winges, D. Weber, and M. Santello. The role of vision on hand preshaping during reach to grasp. Experimental Brain Research, 152:489–498, 2003.