Learning Articulated Motions From Visual Demonstration
Many functional elements of human homes and workplaces consist of rigid components which are connected through one or more sliding or rotating linkages. Examples include doors and drawers of cabinets and appliances; laptops; and swivel office chairs. A robotic mobile manipulator would benefit from the ability to acquire kinematic models of such objects from observation. This paper describes a method by which a robot can acquire an object model by capturing depth imagery of the object as a human moves it through its range of motion. We envision that in future, a machine newly introduced to an environment could be shown by its human user the articulated objects particular to that environment, inferring from these “visual demonstrations” enough information to actuate each object independently of the user.
Our method employs sparse (markerless) feature tracking, motion segmentation, component pose estimation, and articulation learning; it does not require prior object models. Using the method, a robot can observe an object being exercised, infer a kinematic model incorporating rigid, prismatic and revolute joints, then use the model to predict the object’s motion from a novel vantage point. We evaluate the method’s performance, and compare it to that of a previously published technique, for a variety of household objects.
A long-standing challenge in robotics is to endow robots with the ability to interact effectively with the diversity of objects common in human-made environments. Existing approaches to manipulation often assume that objects are simple and drawn from a small set. The models are then either pre-defined or learned from training, for example requiring fiducial markers on object parts, or prior assumptions about object structure. Such requirements may not scale well as the number and variety of objects increase. This paper describes a method with which robots can learn kinematic models for articulated objects in situ, simply by observing a user manipulate the object. Our method learns open kinematic chains that involve rigid linkages, and prismatic and revolute motions, between parts.
There are three primary contributions of our approach that make it effective for articulation learning. First, we propose a feature tracking algorithm designed to perceive articulated motions in unstructured environments, avoiding the need to embed fiducial markers in the scene. Second, we describe a motion segmentation algorithm that uses kernel-based clustering to group feature trajectories arising from each object part. A subsequent optimization step recovers the 6-DOF pose of each object part. Third, the method enables use of the learned articulation model to predict the object’s motion when it is observed from a novel vantage point. Figure 1 illustrates a scenario where our method learns kinematic models for a refrigerator and microwave from separate user-provided demonstrations, then predicts the motion of each object in a subsequent encounter. We present experimental results that demonstrate the use of our method to learn kinematic models for a variety of everyday objects, and compare our method’s performance to that of the current state of the art.
II Related Work
Providing robots with the ability to learn models of articulated objects requires a range of perceptual skills such as object tracking, motion segmentation, pose estimation, and model learning. It is desirable for robots to learn these models from demonstrations provided by ordinary users. This necessitates the ability to deal with unstructured environments and estimate object motion without requiring tracking markers. Traditional tracking algorithms such as KLT , or those based on SIFT  depend on sufficient object texture and may be susceptible to drift when employed over an object’s full range of motion. Alternatives such as large-displacement optical flow  or particle video methods  tend to be more accurate but require substantially more computation.
Articulated motion understanding generally requires a combination of motion tracking and segmentation. Existing motion segmentation algorithms use feature-based trackers to construct spatio-temporal trajectories from sensor data, and cluster these trajectories based on rigid-body motion constraints. Recent work by Brox and Malik  in segmenting feature trajectories has shown promise in analyzing and labeling motion profiles of objects in video sequences in an unsupervised manner. Recent work by Elhamifar and Vidal  has proven effective at labeling object points based purely on motion visible in a sequence of standard camera images. Our framework employs similar techniques, and introduces a segmentation approach for features extracted from RGB-D data.
Researchers have studied the problem of learning models from visual demonstration. Yan and Pollefeys  and Huang et al.  employ structure from motion techniques to segment the articulated parts of an object, then estimate the prismatic and rotational degrees of freedom between these parts. These methods are sensitive to outliers in the feature matching step, resulting in significant errors in pose and model estimates. Closely related to our work, Katz et al.  consider the problem of extracting segmentation and kinematic models from interactive manipulation of an articulated object. They take a deterministic approach, first assuming that each object linkage is prismatic and then fitting a rotational degree of freedom only if the residual is above a specified threshold. Katz et al. learn from observations made in clean, clutter-free environments and primarily consider objects in close proximity to the RGB-D sensor. Recently, Katz et al.  proposed an improved learning method that has equally good performance with reduced algorithmic complexity. However, the method does not explicitly reason over the complexity of the inferred kinematic models, and tends to over-fit to observed motion. In contrast, our algorithm targets in situ learning in unstructured environments with probabilistic techniques that provide robustness to noise. Our method adopts the work of Sturm et al. , which used a probabilistic approach to reason over the likelihood of the observations while simultaneously penalizing complexity in the kinematic model. Their work differs from ours in two main respects: they required that fiducial markers be placed on each object part in order to provide nearly noise-free observations; and they assumed that the number of unique object parts is known a priori.
III Articulation Learning From Visual Demonstration
This section introduces the algorithmic components of our method. Figure 2 illustrates the steps involved.
Our approach consists of a training phase and a prediction phase. The training phase proceeds as follows: (i) Given RGB-D data, a feature tracker constructs long-range feature trajectories in 3-D. (ii) Using a relative motion similarity metric, clusters of rigidly moving feature trajectories are identified. (iii) The 6-DOF motion of each cluster is then estimated using 3-D pose optimization. (iv) Given a pose estimate for each identified cluster, the most likely kinematic structure and model parameters for the articulated object are determined. Figure 3 illustrates the steps involved in the training phase with inputs and outputs for each component.
Once the kinematic model of an articulated object is learned, our system can predict the motion trajectory of the object during future encounters. In the prediction phase: (i) given RGB-D data, a description of the objects in the scene is extracted using SURF descriptors; (ii) given this set of descriptors, the best-matching object and its kinematic model are retrieved; and (iii) from these correspondences and the kinematic model parameters of the matching object, the object’s articulated motion is predicted. Figure 4 illustrates the steps involved in the prediction phase.
III-A Spatio-Temporal Feature Tracking
The first step in articulation learning from visual demonstration involves visually observing and tracking features on the object while it is being manipulated. We focus on unstructured environments without fiducial markers. Our algorithm combines interest-point detectors and feature descriptors with traditional optical flow methods to construct long-range feature trajectories. We employ Good Features To Track (GFTT)  to initialize up to 1500 salient features with a quality level of 0.04 or greater, across multiple image scales. Once the features are detected, we populate a mask image that captures regions where interest points are detected at each pyramid scale. We use techniques from previous work on dense optical flow  to predict each feature at the next timestep. Our implementation also employs median filtering as suggested by Wang et al.  to reduce false positives.
We bootstrap the detection and tracking steps with a feature description step that extracts and learns the description of the feature trajectory. At each image scale, we compute the SURF descriptor  over features that were predicted from the previous step, denoted as , and compare them with the description of the detected features at time , denoted as . Subsequently, detected features that are sufficiently close to predicted features and that successfully meet a desired match score are added to the feature trajectory, while the rest are pruned. To combat drift, we use the detection mask as a guide to reinforce feature predictions with feature detections. Additionally, we incorporate flow failure detection techniques  to reduce drift in feature trajectories.
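The flow-failure detection mentioned above can be illustrated with a forward-backward consistency test: a point is tracked forward one frame, tracked back again, and discarded if it does not return close to its starting location. The following is a minimal sketch only; the function name `forward_backward_filter` and the 1-pixel threshold are assumptions made for illustration and are not values reported in this paper.

```python
import math


def forward_backward_filter(points, back_tracked, max_error_px=1.0):
    """Return indices of features that pass a forward-backward check.

    points       : list of (x, y) feature locations at time t
    back_tracked : list of (x, y) obtained by tracking each point to t+1
                   and then back to t with the same flow method
    max_error_px : hypothetical rejection threshold, in pixels
    """
    kept = []
    for i, (p, b) in enumerate(zip(points, back_tracked)):
        # Forward-backward error: distance between the original point
        # and the location it returns to after the round trip.
        error = math.hypot(p[0] - b[0], p[1] - b[1])
        if error <= max_error_px:
            kept.append(i)
    return kept
```

Features that fail the test would be pruned from their trajectories rather than propagated, which limits the accumulation of drift.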
Like other feature-based methods  our method requires visual texture. In typical video sequences, some features are continuously tracked, while other features are lost due to occlusion or lack of image saliency. To provide rich trajectory information, we continuously add features to the scene as needed. We maintain a constant number of feature trajectories tracked, by adding newly detected features in regions that are not yet occupied. From RGB-D depth information, image-space feature trajectories can be easily extended to 3-D. As a result, each feature key-point is represented by its normalized image coordinates , position and surface normal , represented as . We denote as the resulting set of feature trajectories constructed, where . To combat noise inherent in our consumer-grade RGB-D sensor, we post-process the point cloud with a fast bilateral filter  with parameters px, cm.
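As an illustration of how an image-space key-point can be lifted to 3-D using depth, the sketch below back-projects a pixel through a pinhole camera model. The pinhole assumption, the function name `backproject`, and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are ours for illustration; the paper does not specify the camera model or calibration used.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame.

    fx, fy : focal lengths in pixels (hypothetical calibration values)
    cx, cy : principal point in pixels (hypothetical calibration values)
    Returns the 3-D point (x, y, z) in metres.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Usage with placeholder intrinsics: a pixel at the principal point
# maps onto the optical axis at the measured depth.
point = backproject(320.0, 240.0, 2.0, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```

Applying this to every tracked pixel at every timestep turns each 2-D trajectory into a 3-D trajectory, provided the depth reading at that pixel is valid.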
III-B Motion Segmentation
To identify the kinematic relationships among parts in an articulated object, we first distinguish the trajectory taken by each part. In particular, we analyze the motions of the object parts with respect to each other over time, and infer whether or not pairs of object parts are rigidly attached. To reason over candidate segmentations, we formulate a clustering problem to identify the different motion subspaces in which the object parts lie. After clustering, similar labels imply rigid attachment, while dissimilar labels indicate non-rigid relative motion between parts.
If two features in belong to the same rigid part, the relative displacement and angle between the features will be consistent over the common span of their trajectories. The distribution over the relative change in displacement vectors and angle subtended is modeled as a zero-mean Gaussian, , where is the expected noise covariance for rigidly-connected feature pairs. The similarity of two feature trajectories can then be defined as:
where and are the observed time instances of the feature trajectories , and respectively, , and is a parameter characterizing the relative motion of the two trajectories. For a pair of 3-D key-point features , and , we estimate the mean relative displacement between a pair of points moving rigidly together as:
where . For 3-D key-points, we use in Eqn. 1. Figure 5 illustrates an example of rigid and non-rigid motions of feature trajectory pairs, and their corresponding distribution of relative displacements.
For a pair of surface normals and , we define the mean distance as
where . In this case, we use in Eqn. 1.
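The intuition behind the similarity measure can be sketched in code: if two 3-D key-points move rigidly together, the distance between them stays close to its mean over their common time span, so a Gaussian kernel on the deviation from that mean stays near 1. The sketch below is one plausible instance of this idea and is not a transcription of Eqn. 1; the function name `trajectory_similarity`, the dictionary representation of a trajectory, and the bandwidth argument `gamma` are assumptions.

```python
import math


def trajectory_similarity(traj_a, traj_b, gamma):
    """Similarity in [0, 1] between two 3-D feature trajectories.

    traj_a, traj_b : dict mapping timestamp -> (x, y, z)
    gamma          : kernel bandwidth (assumed; larger values penalise
                     deviations from rigid motion more strongly)
    """
    common = sorted(set(traj_a) & set(traj_b))
    if not common:
        return 0.0  # no temporal overlap, so no evidence of rigidity
    dists = [math.dist(traj_a[t], traj_b[t]) for t in common]
    mean_d = sum(dists) / len(dists)
    # Average Gaussian kernel response of the deviation from the mean
    # relative displacement over the common time span.
    return sum(math.exp(-gamma * (d - mean_d) ** 2) for d in dists) / len(dists)
```

An analogous computation could be applied to the angle between surface normals, with a separate bandwidth.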
Since the bandwidth parameter for a pair of feature trajectories can be intuitively predicted from the expected variance in relative motions of trajectories, we employ DBSCAN , a density-based clustering algorithm, to find rigidly associated feature trajectories. The resulting cluster assignments are denoted as , where cluster consists of a set of rigidly-moving feature trajectories.
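For readers unfamiliar with density-based clustering, the following is a compact sketch of DBSCAN operating on a precomputed pairwise distance matrix. It is a textbook version written for illustration, not our implementation; in particular, how a trajectory similarity is converted into a distance (for example, one minus the similarity) and the values of `eps` and `min_pts` are assumptions.

```python
def dbscan(dist, eps, min_pts):
    """Cluster items given a square distance matrix (list of lists).

    Returns a list of labels; label -1 marks noise, and labels
    0, 1, 2, ... identify clusters.
    """
    n = len(dist)
    labels = [None] * n
    cluster = -1

    def neighbors(i):
        # Includes i itself, since dist[i][i] is 0.
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # core point: keep growing the cluster
    return labels
```

One attraction of this family of algorithms in the present setting is that the number of clusters, and hence the number of object parts, need not be specified in advance.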
III-C Multi-Rigid-Body Pose Optimization
Given the cluster label assignment for each feature trajectory, we subsequently determine the 6-DOF motion of each cluster. We define as the set of features belonging to cluster at time . Additionally, we define as the set of poses estimated for each of clusters considered, and as the pose estimated for the cluster at time .
For each cluster , we consider the synchronized sensor observations of position and surface normals for each of its trajectories, and use the arbitrary pose as the reference frame for the remaining pose estimates of the cluster. Subsequently, we compute the relative transformation between successive time steps and for the cluster using the known correspondences between and . Since this step can lead to drift, we add an additional sparse set of relative pose constraints every 10 frames, denoted as . Our implementation employs a correspondence rejection step that eliminates outliers falling outside the inlier distance threshold of cm, as in RANSAC , making the pose estimation routine more robust to sensor noise.
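The correspondence rejection step can be illustrated as follows: given a candidate rigid transform between two time steps, correspondences whose residual exceeds an inlier distance are discarded before the transform is re-estimated. This is a simplified sketch; the function names, the row-major nested-list rotation matrix, and the threshold passed as `max_dist` are assumptions, and the threshold value used in our experiments is not reproduced here.

```python
import math


def apply_transform(R, t, p):
    """Apply rotation R (3x3 nested list) and translation t to point p."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))


def reject_outliers(src, dst, R, t, max_dist):
    """Return indices of correspondences consistent with (R, t).

    src, dst : lists of corresponding 3-D points at successive time steps
    max_dist : inlier distance threshold in metres (hypothetical value)
    """
    inliers = []
    for i, (p, q) in enumerate(zip(src, dst)):
        # Residual between the transformed source point and its match.
        if math.dist(apply_transform(R, t, p), q) <= max_dist:
            inliers.append(i)
    return inliers
```

In a RANSAC-style loop, the candidate transform would be proposed from a small random subset of correspondences, and the candidate with the largest inlier set would be retained.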
We augment the estimation step with an optimization phase to provide smooth and continuous pose estimates for each cluster by incorporating a motion model. We use the 3-D pose optimizer iSAM  to incorporate the relative pose constraints within a factor graph, with node factors derived directly from the pose estimates. A constant-velocity edge factor term is also added to provide continuity in the articulated motion.
III-D Articulation Learning
Once the 6-DOF pose estimates of the individual object parts are computed, the kinematic model of the full articulated object is determined using tools developed in Sturm et al. . Given multiple 6-DOF pose observations of object parts, the problem is to estimate the most likely kinematic configuration for the articulated object. Formally, given the observed poses , we estimate the kinematic graph configuration that maximizes the posterior probability
We employ notation similar to that of Sturm et al.  to denote the relative transformation between two object parts and as , using standard motion composition operator notation . The kinematic model between part and is then defined as , with its associated parameter vector , where are the number of parameters associated with the description of the link. We construct a graph consisting of a set of vertices that denote the object parts involved in the articulated object, and a set of undirected edges describing the kinematic linkage between two object parts.
where is the sequence of observed relative transformations between parts and .
Since we are particularly interested in household objects, we focus on kinematic models involving rigid, prismatic, and revolute linkages. We then estimate the parameters that maximize the data likelihood of the object pose observations given the kinematic model:
Once we fit each candidate kinematic model to the given observation sequence, we select the kinematic model that best explains the data. Specifically, we compute the posterior probability of each kinematic model, given the data, as:
Due to the evaluation complexity of this posterior term, the BIC score is computed instead as the approximation:
$$\mathrm{BIC}(M) = -2 \log p(D \mid M, \hat{\theta}) + k \log n$$
where $k$ is the number of parameters involved in the kinematic model $M$, $n$ is the number of observations in the data set $D$, and $\hat{\theta}$ is the maximum likelihood parameter vector. This implies that the model that best explains the observations is the one with the lowest BIC score.
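As a worked illustration of model selection with the Bayesian Information Criterion, the sketch below scores candidate link models using the standard form, minus twice the maximized log-likelihood plus the number of parameters times the logarithm of the number of observations, and picks the lowest score. The log-likelihood values and parameter counts in the usage example are invented placeholders, not results from our experiments.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Standard BIC: lower is better."""
    return -2.0 * log_likelihood + num_params * math.log(num_obs)


def select_model(candidates, num_obs):
    """Pick the candidate with the lowest BIC.

    candidates : dict mapping model name -> (max log-likelihood, #params)
    Returns (best model name, dict of all scores).
    """
    scores = {name: bic(ll, k, num_obs) for name, (ll, k) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fits for a single link observed 100 times.
best, scores = select_model(
    {"rigid": (-500.0, 6), "prismatic": (-120.0, 9), "revolute": (-118.0, 12)},
    num_obs=100,
)
```

In this made-up example the revolute model fits marginally better than the prismatic one, but its extra parameters cost more than the gain in likelihood, so the simpler model is preferred.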
The kinematic structure selection problem is subsequently reduced to computing the minimum spanning tree of the graph with edges defined by . The resulting minimum spanning kinematic tree weighted by BIC scores is the most likely kinematic model for the articulated object given the pose observations. For a more detailed description, we refer the reader to Sturm et al. . Figure 6 shows a few examples of kinematic structures extracted given pose estimates as described in the previous section. Our limitation of linkage types to rigid, prismatic, and rotational does exclude various household objects such as lamps, garage doors, toys etc. with more complex kinematics.
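The structure selection step can be sketched with Kruskal's algorithm: each pair of object parts contributes an edge weighted by the BIC score of its best-fitting link model, and the minimum spanning tree over these edges gives the kinematic structure. The function name `minimum_spanning_tree`, the edge tuple layout, and the numbers in the usage example are assumptions for illustration.

```python
def minimum_spanning_tree(num_parts, edges):
    """Kruskal's algorithm over BIC-weighted candidate links.

    num_parts : number of object parts (graph vertices 0..num_parts-1)
    edges     : list of (bic_score, part_i, part_j, model_name)
    Returns the selected links as a list of (part_i, part_j, model_name).
    """
    parent = list(range(num_parts))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _score, i, j, model in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this link does not create a cycle
            parent[ri] = rj
            tree.append((i, j, model))
    return tree


# Hypothetical three-part object: the two lowest-scoring links are kept.
links = minimum_spanning_tree(
    3,
    [(10.0, 0, 1, "revolute"), (50.0, 0, 2, "rigid"), (12.0, 1, 2, "prismatic")],
)
```

Because a spanning tree contains no cycles, this construction yields open kinematic chains only, consistent with the scope of the method.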
III-E Learning to Predict Articulated Motion
Our daily environment is filled with articulated objects with which we repeatedly interact. A robot in our environment can identify instances of articulated objects that it has observed in the past, then use a learned model to predict the motion of an object when it is used.
Once the kinematic model of an articulated object is learned, the kinematic structure and its model parameters are stored in a database, along with its appearance model. The feature descriptors extracted (described in Section III-A) for each cluster of the articulated object are also retained for object recognition in future encounters. Demonstrations involving the same instance of the articulated object are represented in a single arbitrarily selected reference frame, and kept consistent across encounters by registering newer demonstrations into the initial object frame. Each of these attributes is stored in the bag-of-words driven database  for convenient querying in the future. Thus, on encountering the same object instance in the future, the robot can match the descriptors extracted from the current scene with those extracted from object instances it learned in the past. It then recovers the original demonstration reference frame along with the relevant kinematic structure of the articulated object for prediction purposes. We identify the surface of the manipulated object by extracting Maximally Stable Extremal Regions (MSER)  (Figure 7) for each object part undergoing motion. We use this surface to visualize the motion manifold of the articulated object.
IV Experiments and Analysis
Our experimental setup consists of a single sensor providing RGB-D imagery. Each visual demonstration involved a human manipulating an articulated object and its parts at a normal pace, while avoiding obscuration of the object from the robot’s perspective. Demonstrations were performed for multiple robot viewpoints, to capture variability in depth imagery. We performed 43 demonstration sessions by manipulating a variety of household objects: refrigerators, doors, drawers, laptops, chairs, etc. Each demonstration was recorded for about 30-60 seconds. AprilTags  were used to recover ground truth estimates of each articulated object’s motion, which we adopted as a baseline for evaluation. In order to avoid any influence on our method of observations arising from fiducial markers, the RGB-D input was pre-processed to mask out regions containing the tags.
We then compared the pose estimation, model selection and estimation performance of our method to that of an alternative state-of-the-art method (re-implemented by us based on ), and to traditional methods using fiducial markers. We incorporated several improvements ,  to Katz’s algorithm, as previously described in Section III-A, to enable fair comparison with our proposed method.
IV-A Qualitative and Overall Performance
Figure 8 shows the method in operation for household objects including a laptop, a microwave, a refrigerator and a drawer. Tables I and II compare the performance of our method in estimating the kinematic model parameters for several articulated objects observed from a variety of viewpoints. Our method recovered a correct model for more objects, and for almost every object tested recovered model parameters more accurately, than Katz’s method.
IV-B Pose Estimation Accuracy
For each visual demonstration, we compared the segmentation and pose of each object part estimated by our method with those produced by Katz. We also obtained pose estimates for each object part by tracking attached fiducial markers. Synchronization across pose observations was ensured by evaluating only poses in the set intersection of the timestamps of each pose sequence. For each overlapping time step, we compared the relative pose of the estimated object segment obtained from both algorithms with that obtained via fiducial markers (Figure 9). For consistency in evaluation, the poses of individual object parts were initialized identically for both algorithms.
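The synchronization described above amounts to evaluating errors only at timestamps present in both pose sequences. The sketch below shows this for the positional component; orientation error is omitted for brevity, and the function name `synchronized_errors` and the dictionary representation of a pose sequence are assumptions.

```python
import math


def synchronized_errors(estimated, reference):
    """Positional error at timestamps shared by both pose sequences.

    estimated, reference : dict mapping timestamp -> (x, y, z) in metres
    Returns a dict mapping each common timestamp to the Euclidean
    distance between the estimated and reference positions.
    """
    common = sorted(set(estimated) & set(reference))
    return {t: math.dist(estimated[t], reference[t]) for t in common}


def mean_error(errors):
    """Average error over the synchronized timestamps (0.0 if none)."""
    return sum(errors.values()) / len(errors) if errors else 0.0
```

Timestamps seen by only one of the two sources are simply ignored, so a tracking dropout in either sequence does not bias the comparison.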
Figure 10 compares the absolute poses estimated by the three methods described above, given observations of a chair being moved on the ground plane. Figure 10(a) illustrates a scenario in which both algorithms, ours and Katz’s, perform reliably. Katz’s method is within cm and , on average, of the ground truth pose produced with fiducial markers. Our method achieves comparable average accuracy of cm and . Using data from another demonstration, Katz’s method failed to track the object motion robustly, resulting in drift and incorrect motion estimates (Figure 10(b)). Such failures can be attributed to: (i) the KLT tracker that is known to cause drift during feature tracking; (ii) SVD least squares minimization in the relative pose estimation stage, without appropriate outlier rejection.
For a variety of articulated objects (Table I), our method achieves average accuracies of cm and with respect to ground truth estimated from noisy Kinect RGB-D data. In comparison, Katz’s method  achieved average accuracies of cm and for the same objects. Our method achieved an average error of less than cm and in 37 of 43 demonstrations, vs. 23 of 43 for Katz.
IV-C Model Estimation Accuracy
Once the poses of the object parts are estimated, we compare the kinematic structure and model parameters of the articulated object estimated by our method with those produced by Katz. As in our other experiments, we use the kinematic structure and model parameters identified from fiducial marker-based solutions as a baseline. Table II summarizes the model estimation and parameter estimation performance achieved with our method and Katz’s. The model fit error is defined as the average spatial and orientation error between the observations and the estimated articulation manifold (i.e. prismatic or rotational manifold). For the dataset of articulated objects evaluated (Table II), our method achieved an average model fit error of cm spatially, and in orientation, an improvement over Katz’s method (average model fit errors of cm and respectively). Of 43 demonstrations evaluated, our method determined the correct kinematic structure and accurate parameters in 30 cases, whereas Katz did so in only 15 cases.
We also compared the model parameters estimated by our method and Katz’s method with ground truth from markers, by transforming poses estimated by both methods into the fiducial marker’s reference frame based on the initial configuration of the articulated object. This allows us to directly compare model parameters estimated through our proposed framework, the current state-of-the-art and marker-based solutions. For multi-DOF objects, the model parameter error averaged across each corresponding object part is reported. In each demonstration, the model parameters estimated via our method are closer to the marker-based solution than those obtained by Katz.
We introduced a framework that enables robots to learn kinematic models for everyday objects from RGB-D data acquired during user-provided demonstrations. We combined sparse feature tracking, motion segmentation, object pose estimation and articulation learning to learn the underlying kinematic structure of the observed object. We demonstrated the qualitative and quantitative performance of our method; it recovers the correct structure more often, and more accurately, than its predecessor in the literature, and achieves accuracy similar to that of a marker-based solution. Our framework also enables the robot to predict the motion of articulated objects it has previously learned. Even given our method’s limitation to recovering open kinematic chains involving only rigid, prismatic or revolute linkages, its prediction capability may be useful in future robotic encounters requiring manipulation.
- Bay et al.  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
- Bouguet  J.-Y. Bouguet. Pyramidal implementation of the affine Lucas-Kanade feature tracker description of the algorithm. Intel Corporation, 2001.
- Brox and Malik  T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In Proc. European Conf. on Computer Vision (ECCV), pages 282–295, 2010.
- Brox and Malik  T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(3):500–513, 2011.
- Elhamifar and Vidal  E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2790–2797, 2009.
- Ester et al.  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. ACM Int’l Conf. on Knowledge Discovery and Data Mining (KDD), pages 226–231, 1996.
- Farnebäck  G. Farnebäck. Two-frame motion estimation based on polynomial expansion. In Image Analysis, pages 363–370. 2003.
- Fischler and Bolles  M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24(6):381–395, 1981.
- Galvez-Lopez and Tardos  D. Galvez-Lopez and J. D. Tardos. Bags of binary words for fast place recognition in image sequences. Trans. on Robotics, 28(5):1188–1197, 2012.
- Huang et al.  X. Huang, I. Walker, and S. Birchfield. Occlusion-aware reconstruction and manipulation of 3d articulated objects. In Proc. IEEE Int’l Conf. on Robotics and Automation (ICRA), pages 1365–1371, 2012.
- Kaess et al.  M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. Int’l J. of Robotics Research, 31(2):216–235, 2012.
- Kalal et al.  Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In Proc. Int’l Conf. on Pattern Recognition (ICPR), pages 2756–2759, 2010.
- Katz et al.  D. Katz, A. Orthey, and O. Brock. Interactive perception of articulated objects. In Proc. Int’l. Symp. on Experimental Robotics (ISER), 2010.
- Katz et al.  D. Katz, M. Kazemi, J. Andrew Bagnell, and A. Stentz. Interactive segmentation, tracking, and kinematic modeling of unknown 3d articulated objects. In Proc. IEEE Int’l Conf. on Robotics and Automation (ICRA), pages 5003–5010, 2013.
- Lowe  D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int’l J. of Computer Vision, 60(2):91–110, 2004.
- Matas et al.  J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004.
- Olson  E. Olson. AprilTag: A robust and flexible visual fiducial system. In Proc. IEEE Int’l Conf. on Robotics and Automation (ICRA), pages 3400–3407, 2011.
- Paris and Durand  S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. In Proc. European Conf. on Computer Vision (ECCV), pages 568–580, 2006.
- Sand and Teller  P. Sand and S. Teller. Particle video: Long-range motion estimation using point trajectories. Int’l J. of Computer Vision, 80(1):72–91, 2008.
- Shi and Tomasi  J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 593–600, 1994.
- Smith et al.  R. Smith, M. Self, and P. Cheeseman. Estimating uncertain spatial relationships in robotics. In Autonomous Robot Vehicles, pages 167–193. Springer-Verlag, 1990.
- Sturm et al.  J. Sturm, C. Stachniss, and W. Burgard. A probabilistic framework for learning kinematic models of articulated objects. J. of Artificial Intelligence Research, 41(2):477–526, 2011.
- Wang et al.  H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176, 2011.
- Yan and Pollefeys  J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Proc. European Conf. on Computer Vision (ECCV), pages 94–106, 2006.