Classifying Object Manipulation Actions based on Grasp-types and Motion-Constraints
In this work, we address the challenging problem of fine-grained and coarse-grained recognition of object manipulation actions. Due to variations in geometrical and motion constraints, an object can be manipulated in different ways to perform different sets of actions. Also, most object manipulation actions involve subtle movements. This makes object manipulation action recognition difficult with motion information alone. We propose to use grasp and motion-constraints information to recognise and understand action intention with different objects. We also provide an extensive experimental evaluation on the recent Yale Human Grasping dataset, consisting of a large set of 455 manipulation actions. The evaluation involves a) different contemporary multi-class classifiers, and binary classifiers with a one-vs-one multi-class voting scheme, b) differential comparison results based on subsets of attributes involving grasp and motion-constraints information, c) fine-grained and coarse-grained object manipulation action recognition based on fine-grained as well as coarse-grained grasp type information, and d) a comparison between instance-level and sequence-level modeling of object manipulation actions. Our results justify the efficacy of grasp attributes for the task of fine-grained and coarse-grained object manipulation action recognition.
Human action recognition for full body parts has been studied in ,  for various applications such as video surveillance, content-based video search, and human-robot interaction. Most studies consider short-lived actions, where the beginning and end of each action are explicitly specified. Later efforts have addressed the recognition of movements along with the associated objects; both problems are of great interest to the study of action analysis. Still, these methods are less reliable for manipulation actions, which are performed at a finer level. In household and work environment tasks, it is important to consider actions involving local body parts. The reason is that most of these tasks are accomplished with slight hand movements, which are not clearly perceivable through motion information from sensors. Recognizing such actions is crucially important for modeling and monitoring the behavior of individuals, and also for transferring object manipulation capabilities to robots performing both household and workforce tasks. Such action recognition based technologies can also benefit various domains such as entertainment, smart homes, elderly care, health rehabilitation, and analysis of the productivity of human work-tasks.
Human-object interactions are largely correlated with the actions performed using a particular object. Although recognition of object-specific actions has been studied in some works , , the recognition and understanding of varied object manipulation actions is still largely unresolved. Everyday manipulation tasks include a considerable amount of variation in how a particular task is performed. The same action can be performed in varied ways according to the habits and style of the subject. Most manipulation actions exhibit only very subtle variations in the observed whole-body motion trajectories. These factors make recognizing a large set of manipulation tasks a challenging problem.
Most action recognition frameworks are designed for a small, specific set of actions, since motion-dynamics-based models are trained to distinguish only among the actions in that set. While motion dynamics is important, it may not uniquely represent manipulation actions. For example, brushing the teeth and drinking water both involve similar movements at a coarse level if we consider the whole human skeleton instead of the hand pose specifically. Thus, actions that consist of picking up an object, bringing it to the mouth, and finally releasing it are hard to distinguish from one another based on motion dynamics alone. So, there is an inherent need for other aspects that augment the motion dynamics to achieve better inter-class differentiation of actions.
The manipulation actions performed by humans can also be correlated with the hand grasps used to perform the specific actions with an object. This hypothesis is based on the fact that an object manipulation action is initiated at the point when the object is first grasped. Thus, hand grasp types also aid in the temporal segmentation of video sequences of object manipulation. Human actions can be described at different levels of abstraction, and an action at the lower level consists of multiple sub-actions where an object is first grasped, then manipulated, and finally released. In most actions, the point of initiation is the same as the action manipulation, and thus a single grasp type is uniquely correlated with the action being performed. However, some actions may involve multiple grasps, which requires sequential modeling of the grasp types. Another interesting point is that these grasp types are much easier to capture through normal RGB images than the motion dynamics.
We propose and evaluate different approaches that use grasp and object motion-constraints information for fine-grained and coarse-grained recognition of everyday manipulation actions. These actions are performed in workforce and household environments by workers with years of experience in these tasks (which allows us to better evaluate generalisation). We believe that classifying object manipulation actions using local body configurations (aspects of grasps) and motion information can enable good-quality automated recognition of a larger set of everyday actions, because it allows us to define properties that are specific to these actions. Grasp-based action recognition is especially appropriate for objects that can be manipulated in different ways for different actions. For instance, a particular object (e.g. a bottle) can be opened using a precision grasp and can also be used for drinking with a power grasp, as illustrated in Figure 1. This type of classification allows us to identify very useful information about the task intended by the user based on the grasp information, and thus also facilitates the prediction of actions in scenarios where interaction between humans and robots is required. As we focus on the task of action recognition, we assume that the information about grasp attributes and motion-constraints is available to us, as in the case of the Yale human grasping dataset , which we use in this paper.
II Related work
Most action recognition methodologies model the action using full-body motion based features, which work well only for the specific class of action recognition problems where the action set is relatively small, such as in , , . These approaches do not appear useful when applied to real everyday actions. Research in the area of human action recognition has mainly focused on full-body motions that can be characterized by movement and change of posture, like walking, waving, etc.
In many action recognition approaches , , , human motion information has been used. The problem of action recognition has been addressed using motion trajectories obtained from depth cameras like Kinect. These approaches (e.g. see ) are typically considered more robust at generating human pose information, which can be used for action recognition. However, Kinect body pose recognition is not accurate when there are human-object interactions, due to occlusions. Motion dynamics based action recognition still cannot capture a representation of subtle object manipulations. Another interesting aspect is that the goal of the task can vary even when the motion dynamics are similar.
Hand gesture recognition is closer to the problem of object manipulation action recognition. It has also been addressed using depth data generated from Kinect in Kurakin et al.  and Wang et al. . But these techniques mainly target sign language gestures and not human hand-object interactions. Wang et al.  treat an action sequence as a 4D shape and propose random occupancy pattern (ROP) features, extracted from randomly sampled 4D subvolumes of different sizes and at different locations. In gesture depth sequences, the semantics of the gestures are mainly understood from the large movements of the hand. These approaches crop the hand region using a hand detection approach to determine these large hand movements and model different gestures. But such clear motion information is not easily perceivable in the case of object manipulation tasks.
At this point, we note that the above-mentioned works involve processing low-level information (e.g. feature extraction from videos/images), whereas our goal in this work is to convey the importance of grasp and motion-constraints information at the higher semantic level (e.g. types of grasps and motion-constraints). Such high-level attributes for manipulation actions are indeed available , ,  as a part of the Yale human grasping dataset that we consider in this work.
The problem of understanding manipulation actions is of great interest in robotics as well, where the focus is on simplifying methods to implement action execution on robots. There has been a considerable amount of work in robot task planning based on imitation learning , which is essentially the problem of object manipulation by robots through imitating the real-world trajectories observed when people perform the action. Understanding the specific types of grasp required in the action sequence aids imitation learning as well. Knowledge about how to grasp the object is significant, so that the robot can position its effectors accordingly. For example, a humanoid robot with one parallel gripper and one vacuum gripper should select the vacuum gripper for a power grasp, but when a precision grasp is needed, the parallel gripper is a better choice. Yang et al.  present a system that learns manipulation action plans by processing unconstrained videos from the World Wide Web. It recognizes the objects and hand grasp types using CNNs (convolutional neural networks) and then finds the candidate actions that can be performed using the recognized objects from a trained language model. Finally, it provides an action tree which can be reversely parsed for action execution by a robot.
To the best of our knowledge, apart from ,  and , there has been no work using grasp information for action recognition. Yang et al.  semantically group action intentions using grasp-based information into three coarse and somewhat abstract classes: Force-oriented, Skill-oriented, and Casual actions. They use hand grasps recognized through a convolutional neural network to determine the class of action each image belongs to. Yang et al.  develop a grammatical formalism for parsing and interpreting action sequences. Their basic idea is to divide actions into sub-actions at the points where the object is grasped and released, or where the grasp type changes during the course of an action motion. This grammatical formalism provides a syntax and semantics of action, over which basic tools for the understanding of actions can be developed. Feix et al.  consider the problem of grasp classification on the Yale human grasping dataset, again based on coarsely defined task attributes such as force (interaction and weight), motion-constraints on objects, and functional class (use and hold), whereas we propose a solution to task or manipulation action classification based on the grasp information, motion-constraints, and object class.
Unlike the works of  and , we consider fine-grained and physically interpretable action categories, also including object information. For instance, we consider the manipulation actions of towel wiping and cloth wiping as two different tasks, whereas Feix et al.  consider them a single task. We believe that manipulation actions need to be classified at such a fine level to serve the purpose of recognizing everyday manipulation actions and transferring complex task capabilities to robotic manipulation. We differentiate between object manipulation actions by focusing on the functional property of an object. Thus, we demonstrate that information related to grasps, objects, and their motion-constraints is useful in achieving high recognition accuracy for a large set of action classes in an everyday manipulation action dataset. Our work  considers fine-grained recognition of object manipulation actions using coarse-grained grasp attributes with instance-level modeling of manipulation actions. In this work, we additionally perform sequence-level modeling and coarse-level recognition of object manipulation actions. We also perform fine-grained action recognition based on fine-grained grasp attributes.
The important aspects of our work include: a) A compact representation of the grasp and motion-constraints using some popular and some contemporary schemes. b) A demonstration of the usefulness of information from coarse-grained and fine-grained grasp attributes as well as motion-constraints for fine-grained and coarse-grained action recognition. c) A differential experimental analysis involving subsets of grasp and motion-constraints features, to provide more insight into the usefulness of grasp information alone, motion-constraints information alone, and grasp and motion-constraints information together for the intended classification problem. d) Comparisons between instance-level and sequence-level modeling of object manipulation actions using fine-grained grasp information. e) An extensive experimental evaluation using different contemporary multi-class and binary classifiers (with a multi-class voting strategy), which also serves as a useful comparative study of popular classifiers for the manipulation action recognition problem. This analysis also helps to demonstrate that different classification frameworks largely arrive at a consensus with respect to our hypothesis about using grasp and motion-constraints for fine-grained action classification. We demonstrate our results on the large Yale Human Grasping dataset , which involves various tasks on different objects.
In this work, the attributes which we consider for recognition of manipulation actions include object information, grasps, and motion-constraints of objects.
The object name (or a corresponding symbol) is a simple string that identifies the object. As we want to classify actions based on the grasp and motion-constraints information of a known object, we include the object name in the feature representation of an instance.
III-B Grasp attributes
We use both coarse and fine level categorizations of grasp types. The remaining grasp attributes are described in terms of grasped dimension and opposition type.
Coarse grained grasp categorization
A large number of grasp taxonomies are available from earlier research on grasp types. Grasp types have been classified at both coarser and finer levels (e.g. ), with Power, Precision, and Intermediate grasps at the coarser level, as also discussed in . Both fine level and coarse level grasp categorizations are quite popular, but the coarse level categorization is relatively simple. We note that our assumption about the availability of grasp attributes is better suited to the coarse level attributes than to the finer level ones, as the latter are arguably more difficult to estimate. Figure 2 illustrates the coarse and fine level grasp categorization for the 33 grasp types specified in Feix et al. .
At the coarse level, each grasp can be classified by its need for precision or power to be properly executed. The differentiation is very important, and the idea has influenced many previous studies. In the power grasp, there is a rigid contact between the object and the hand, which implies that all motion of the object comes from the human arm. In the precision grasp, the hand is able to perform intrinsic movements on the object without having to move the arm. In the third category, i.e. the Intermediate grasp, characteristics of power and precision grasps are present in roughly equal proportion. We demonstrate that such a coarse division among grasp attributes is also useful for the purpose of manipulation action recognition.
Fine grained grasp categorization
As emphasized previously, grasps can also be categorized at a finer level with 33 grasp types , as illustrated in Fig. 2. We use this finer level of grasp classification to compare the action recognition rates with those of the coarser level of grasp categorization, in order to understand how useful the more detailed representation of grasp type is in the context of manipulation action recognition.
Apart from grasp type, we further use the opposition type, defined by three basic directions relative to the hand coordinate frame, as illustrated in Figure 2 for the 33 grasp types . These are the directions in which the hand can apply forces on the object to hold it securely. Pad Opposition occurs between hand surfaces along a direction generally parallel to the palm. Palm Opposition occurs between hand surfaces along a direction generally perpendicular to the palm. Side Opposition occurs between hand surfaces along a direction generally transverse to the palm.
Opposition type mainly contains information about the direction in which the object is grasped, whereas grasp type contains information about the force on the object. The two attributes thus carry complementary information.
In addition, we employ the grasped dimension as another feature, which signifies the specific dimensions (sides) of the object along which the object is grasped. For instance, a knife needs to be grasped along the blade to be usable for cutting. We use the grasped dimension stated in  as the part of the object that lies between the fingers when grasped. The values are drawn from a set of axis labels that indicate which axes best determine the hand opening; one label lies along the longest object dimension and another along the shortest dimension. An example is illustrated in Figure 3. The grasped dimension contains crucial information about the handling of the object. It gives a spatial relationship between the human hand and the object.
III-C Motion-Constraints on object being manipulated
Depending on the task (and also the object properties), an object is only allowed to translate and rotate in certain directions in order to successfully complete the task. To categorize motion-constraints for a manipulation action, each of the three axes is assigned a symbol for its motion-constraint, as abbreviated in Table I. Thus, the resultant attribute can be represented as a string of three characters (symbols). Moreover, not all combinations of symbols across the three axes are practically valid, and only a set of 20 possible relative motions between two rigid bodies, specified in , , is used. The nomenclature defines the relationship between the object and the environment (a fixed reference frame). Table I lists the symbols used for the motion-constraints along each axis of the object being manipulated by the human hand, showing whether motion of the object along an axis is unconstrained, allows translation/rotation, or is fixed.
IV-A Instance level modeling of manipulation actions
As discussed above, we represent an instance of a manipulation action using the grasp label (power, precision, or intermediate), the opposition type (palm, side, or pad), the grasped dimensions of the object, the object name, and the motion-constraints on the object.
To represent an instance, we concatenate these string data, abbreviated in Table II, to form a feature vector.
During our experimentation for differential analysis, i.e. to see the effect of individual attributes or their subsets, we define the instance by removing one or more attributes from the representation in equation 1.
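As a rough illustration of the instance representation and the differential analysis described above, the following Python sketch concatenates the string attributes of an instance and one-hot encodes them for a classifier. The field names, the example attribute values (such as "d1" and "m1"), the "|" separator, and the one-hot encoding are all assumptions made for illustration; the actual abbreviations are those of Tables I and II, which are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical field names for the five attributes of an instance.
FIELDS = ("obj", "grasp_type", "opposition", "grasped_dim", "motion")


@dataclass(frozen=True)
class Instance:
    obj: str          # object name
    grasp_type: str   # power / precision / intermediate
    opposition: str   # palm / side / pad
    grasped_dim: str  # placeholder symbol, not the dataset's own
    motion: str       # placeholder symbol, not the dataset's own


def concat_features(inst, keep=FIELDS):
    """Concatenate the selected string attributes into one feature string."""
    return "|".join(getattr(inst, f) for f in keep)


def build_vocab(instances, keep=FIELDS):
    """Map every (field, value) pair seen in the data to a column index."""
    vocab = {}
    for f in keep:
        for value in sorted({getattr(i, f) for i in instances}):
            vocab[(f, value)] = len(vocab)
    return vocab


def one_hot(inst, vocab, keep=FIELDS):
    """Encode an instance as a 0/1 vector over the vocabulary columns."""
    vec = [0] * len(vocab)
    for f in keep:
        idx = vocab.get((f, getattr(inst, f)))
        if idx is not None:  # unseen values are simply left as zeros
            vec[idx] = 1
    return vec
```

Passing a shorter `keep` tuple, for example `("obj", "grasp_type")`, drops attributes from the representation, which mirrors the attribute-subset experiments of the differential analysis.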
IV-B Sequence level modeling of manipulation actions
In addition to action instances, we also model each manipulation action as a sequence of instances, where the grasp type changes within the sequence. For evaluation, we use only those sequences that consist of at least two instances. For each instance we have the grasp type information and the object information. We then take the fine level grasp information and count the occurrences of each grasp type within a sequence to represent that sequence of action. The representation is similar to the histogram of visual words in the bag-of-words (BOW) model, where the visual words are clusters (estimated using clustering techniques such as K-Means) of all the feature vectors from the data. In BOW, the histogram feature representation of a sample is estimated by counting the number of features falling in each cluster. In the context of our approach, the visual words are the 33 fine-grained grasp types. Each sequence of action is finally represented by a 34-dimensional feature vector, as in total there are 33 fine level grasp types and one object label.
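A minimal sketch of this sequence representation is given below. It assumes that the 33 fine-grained grasp types are numbered 1 to 33 and that the object label has already been mapped to an integer id; both conventions are assumptions for illustration rather than the encoding used in the experiments.

```python
from collections import Counter

NUM_GRASP_TYPES = 33  # fine-grained grasp types, assumed numbered 1..33


def sequence_features(grasp_ids, object_id):
    """Return a 34-dimensional vector: 33 grasp-type counts plus the object id."""
    if len(grasp_ids) < 2:
        raise ValueError("a sequence needs at least two instances")
    if any(g < 1 or g > NUM_GRASP_TYPES for g in grasp_ids):
        raise ValueError("grasp ids must lie in 1..33")
    counts = Counter(grasp_ids)
    # One histogram bin per grasp type, in id order.
    hist = [counts.get(g, 0) for g in range(1, NUM_GRASP_TYPES + 1)]
    return hist + [object_id]
```

For a sequence with grasp ids `[1, 1, 5]` on an object with id `7`, the first bin holds 2, the fifth bin holds 1, and the last entry is 7.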
IV-C Coarse level classification of manipulation actions
We also perform coarse level classification of each instance at the force level (i.e. weight and interaction) and at the motion-constraint level , instead of using the manipulation action label of the instance (i.e. fine-grained manipulation action recognition). The force property specifies what type of force is necessary to complete the task. Since the forces required can be complex and difficult to discern visually, we use a simplified description that still provides useful information about the task. Specifically, we assign a value of either "weight" or "interaction". We assign "weight" if the grasp force is closely related to lifting the object. This can be the case for tasks other than object transport, such as using a drill. In that case, the dominating force requirement is to lift the drill; squeezing the trigger usually needs less force. In the second category, "interaction", the grasp force is determined by factors other than object weight, usually through the interaction with the environment. There are two main mechanisms for this decoupling: the weight of the object is supported by the constraints, making the force needed to move the object less than would be required to lift it (such as opening a drawer or door); or the interaction force is primarily intended to apply a force through the object, such as when scrubbing with a sponge (where the force needed to lift the sponge is much less than the force needed to scrub effectively).
IV-D Classification models
To our knowledge, there is no other work that uses grasp and motion-constraints attributes for fine-grained classification of manipulation actions. Hence, we take this opportunity to provide classification results using various contemporary classification frameworks. These include multi-class decision forests, multi-class neural networks, and multi-class classifiers constructed from binary classifiers. The latter include locally deep support vector machines, support vector machines, binary boosted decision trees, and binary neural networks. We briefly discuss these below.
Multi-class decision forests  and binary boosted decision trees , are extensions of decision tree based classifiers. A decision forest is an ensemble model that very rapidly builds a series of decision trees, learning from labeled data. Decision trees subdivide the feature space into regions with largely the same label. These can be regions of consistent category or of constant value, depending on whether we are doing classification or regression. Boosted decision trees avoid overfitting by limiting how many times they can subdivide and how few data points are allowed in each region.
In both multi-class and binary neural networks which we use, input features are passed forward (never backward) through a sequence of layers before being turned into outputs. In each layer, inputs are weighted in various combinations, summed, and passed on to the next layer. This combination of simple calculations results in the ability to learn non-linear class boundaries and data trends.
Support vector machines (SVMs)  find the boundary that separates classes by as wide a margin as possible. When the two classes cannot be linearly separated, one can use kernel transformation to project the data into higher dimension, wherein classes may be arguably more separable. Two-class locally deep SVM is a non-linear variant of SVM proposed in Jose et al. .
As indicated above, one can perform multi-class classification using binary classifiers. Typically, such schemes use one-vs-one classification and construct one classifier per pair of classes. This approach requires modeling K(K-1)/2 classifiers, where K denotes the number of classes. During the testing stage, the class label receiving the most votes is assigned to the test sample. In the event of a tie (among classes with an equal number of votes), the label with the highest aggregate classification confidence is selected, obtained by summing the pair-wise classification confidence levels computed by the underlying binary classifiers.
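The voting rule can be sketched as follows. The input format is an assumption made for illustration: each pairwise classifier for classes (a, b) is taken to report a confidence in [0, 1] for class a, with the complement going to class b. The sketch is not tied to any particular classifier library.

```python
from collections import defaultdict


def num_pairwise_classifiers(k):
    """Number of one-vs-one classifiers needed for k classes."""
    return k * (k - 1) // 2


def ovo_predict(pairwise):
    """Pick a label from pairwise outputs.

    pairwise: iterable of (class_a, class_b, conf_a), where conf_a in [0, 1]
    is the (assumed) confidence of that binary classifier for class_a.
    """
    votes = defaultdict(int)
    confidence = defaultdict(float)
    for a, b, conf_a in pairwise:
        winner = a if conf_a >= 0.5 else b
        votes[winner] += 1
        votes[a] += 0  # make sure both classes appear as candidates
        votes[b] += 0
        confidence[a] += conf_a
        confidence[b] += 1.0 - conf_a
    # Most votes wins; ties are broken by the summed pairwise confidence.
    return max(votes, key=lambda c: (votes[c], confidence[c]))
```

With the 455 action classes considered in this work, `num_pairwise_classifiers(455)` evaluates to 103285 binary classifiers.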
V Experiments and results
As mentioned earlier, we evaluate our proposed hypothesis on the Yale human grasping dataset , which consists of everyday manipulation actions. We also emphasize that, to our knowledge, this is the only publicly available dataset that covers such a large set of everyday manipulation actions in an unstructured environment. We evaluate the classification using different multi-class classifiers and binary classifiers with a one-vs-one multi-class voting scheme to model the grasp and motion-constraints information. We also provide a differential analysis over the attributes to study their effect on classification.
V-A Yale human grasping dataset
This dataset consists of a large collection of annotated videos of housekeepers and machinists grasping in unstructured environments. The full dataset contains 27.7 hours of tagged video and represents a wide range of manipulative behaviors spanning much of typical human hand usage. It involves a total of 455 distinct manipulation actions (excluding holding actions and actions without proper grasp information) performed by two machinists and two housekeepers, with 6188 action instances. Some example images from this dataset are shown in Figure 4, involving different grasps on common objects like a screwdriver, a hammer, and a pen. The videos are acquired by a head-mounted camera on each subject. All subjects have normal physical ability, are right-handed, and generated at least 8 hours of data. The labels for the task attributes, grasp attributes, and object attributes are available with the dataset itself. The dataset is annotated by raters experienced in the domain. We use attributes such as grasp, motion-constraints, and object name as features, and the task attributes available in the dataset as the label for each action instance.
V-B Experimental settings
We evaluate the proposed hypothesis on the Yale human grasping dataset using a two-fold cross-validation scheme, where 50% of the instances of each action associated with an object are used for training and the rest are used for testing. As the actions performed by one machinist/housekeeper are not necessarily performed by the other machinist/housekeeper, we do not use a cross-subject evaluation here. We remove the task instances for which the raters were not able to annotate any grasp information. Also, the holding task is trivial as a manipulation action, so we remove those instances too. We then concatenate the object and task string data of each instance to obtain manipulation action labels. These labels serve as our manipulation actions, as our goal is to classify which action is being performed with a particular object. Finally, after cleaning the dataset for our purpose, we have 455 different manipulation actions with a total of 6188 manipulation action instances.
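The label construction and the per-action 50/50 split can be sketched as below. The "_" separator, the random seed, the handling of actions with an odd number of instances, and the swapping of the two folds between training and testing are assumptions for illustration, not details taken from the experiments.

```python
import random
from collections import defaultdict


def action_label(obj, task):
    """Manipulation action label: object and task strings concatenated."""
    return f"{obj}_{task}"


def two_fold_split(labels, seed=0):
    """Split instance indices into two folds, half of each action in each."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    rng = random.Random(seed)
    fold_a, fold_b = [], []
    for lab in sorted(by_label):
        idxs = by_label[lab]
        rng.shuffle(idxs)
        half = (len(idxs) + 1) // 2  # odd counts: the extra one goes to fold A
        fold_a.extend(idxs[:half])
        fold_b.extend(idxs[half:])
    return sorted(fold_a), sorted(fold_b)
```

One fold is then used for training and the other for testing, and the roles are exchanged for the second run.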
V-C Results and discussion
Fine-Grained action recognition based on coarse-level grasp, motion-constraints, and the remaining grasp attributes
We first provide recognition results (Table III) using only the object and grasp attributes (without motion-constraints). These results indicate that even partial grasp information is useful for classifying a large set of 455 complex manipulation actions. This suggests that even with methods that recognize grasp types at a much coarser level, one can distinguish between complex manipulation actions to some extent. Table III also shows differential recognition rates based on the individual grasp attributes (grasp type, grasped dimension, and opposition type). From these results, we can infer that opposition type contributes relatively more to the recognition results. Moreover, most of the classifiers agree that the combined attributes perform better than the individual ones, as expected. In general, this highlights that grasp attributes provide useful information for manipulation action recognition, and the fact that we are using a large dataset supports this hypothesis. Even with a large set of action classes, we are able to differentiate tasks based on the object and grasp information with an accuracy of up to 0.7085. One can also observe in the above experiments that all the classifiers perform similarly, with neural networks performing somewhat better.
|Classifier||Grasp Type (PIP)||Opposition Type||Grasped Dimension||Grasp Information(All)|
|Multi-class decision forest||0.6460||0.6532||0.6508||0.6966|
|Multi-class neural network||0.6810||0.6820||0.6474||0.6929|
|Locally deep SVM (Binary)||0.6688||0.6908||0.6677||0.6943|
|Neural network (Binary)||0.6973||0.7041||0.6508||0.7085|
|Classifier||Objects and motion-constraints|
|Multi-class decision forest||0.8235|
|Multi-class neural network||0.8262|
|Locally deep SVM (Binary)||0.8445|
|Support vector machine (Binary)||0.8408|
|Neural network (Binary)||0.8327|
We next provide, in Table IV, the recognition results with objects and motion-constraints alone (without grasp attributes) and, in Table V, the results with all attributes. These results indicate that motion-constraints help manipulation action classification considerably more than grasp information. However, in Table V, one can notice that most classifiers agree that grasp attributes further improve the overall classification to some extent. Below, we take a closer look at the difference between grasp and motion-constraints by considering certain specific classes.
|Classifier||Objects, grasp attributes and motion-constraints|
|Multi-class decision forest||0.8310|
|Multi-class neural network||0.8388|
|Locally deep SVM (Binary)||0.8293|
|Support vector machine (Binary)||0.8150|
|Neural networks (Binary)||0.8446|
The failure cases of grasp-based action recognition mainly involve objects that do not have a rigid structure, e.g., towel and paper. Such objects do not have a particular way of being handled to complete an action, which explains the lower recognition rates for manipulation actions using these objects when only grasp information is used. Out of 6188 total action instances, 1189 instances (19%) consist of manipulation actions using a towel. These object manipulation actions can still achieve better recognition rates based on the motion-constraints attributes, as most actions with these objects allow only limited degrees of freedom for the motion of the object; e.g., motion on a plane surface does not usually involve rotation about two axes and translation along one axis.
We perform another experiment to support this hypothesis by removing the instances of the objects towel, cloth, and paper (which constitute 19% of the instances of the whole dataset). Table VI provides the results of this experiment. In the earlier recognition results (across Tables III and IV), the difference between the results with grasp and with motion-constraints is of the order of 10% to 15%. However, after removing the classes that are “non-informative” from the grasp perspective, the classification using grasp attributes improves dramatically. While the motion-constraints still contribute more to the recognition, the difference between recognition using grasp and using motion-constraints is now reduced to 2% to 3%. Moreover, combining grasps and motion-constraints consistently improves performance over either one individually. Such a differential analysis highlights that while motion-constraints are generally useful for recognition, grasp attributes are also important, except for a small fraction of classes.
|Classifier||Grasp attributes||Motion-constraints||Grasp attributes and motion-constraints|
|Multi-class decision forest||0.7840||0.8022||0.8088|
|Multi-class neural networks||0.7913||0.8166||0.8378|
|Locally Deep SVM (Binary)||0.7876||0.8236||0.8286|
|Neural Networks (Binary)||0.8045||0.8218||0.8318|
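The differential comparison on Table VI can be reproduced with a few lines of arithmetic. The sketch below copies the four rows of the table and assumes, based on the accompanying discussion, that the three columns are grasp attributes, motion-constraints, and their combination; the function name `gaps` and the dictionary keys are illustrative.

```python
# Rows copied from Table VI. The column order (grasp, motion-constraints,
# combined) is an assumption inferred from the surrounding discussion.
table_vi = {
    "Multi-class decision forest": (0.7840, 0.8022, 0.8088),
    "Multi-class neural networks": (0.7913, 0.8166, 0.8378),
    "Locally Deep SVM (Binary)": (0.7876, 0.8236, 0.8286),
    "Neural Networks (Binary)": (0.8045, 0.8218, 0.8318),
}


def gaps(rows):
    """Per classifier: motion-vs-grasp gap and gain of the combination."""
    out = {}
    for name, (grasp, motion, combined) in rows.items():
        out[name] = {
            "motion_minus_grasp": round(motion - grasp, 4),
            "combined_minus_best_single": round(combined - max(grasp, motion), 4),
        }
    return out


vi_gaps = gaps(table_vi)
for name, g in vi_gaps.items():
    print(name, g)
```

Under the assumed column order, a positive `combined_minus_best_single` for a classifier means that combining grasp and motion-constraints beats both individual attribute sets for that classifier.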
Such an inference is vital considering that the motion-constraints information, i.e., the degrees of freedom of the object during the manipulation action, is relatively difficult to estimate with existing methods compared to grasp information at a coarser level. Thus, coarse grasp information, which is easier to compute, can still prove useful for manipulation action recognition.
The above analysis also provides a comparison among different contemporary classifiers for the current task, which involves the categorical features provided in the Yale human grasping dataset. We note that in the majority of cases the binary neural network yields high classification accuracies. In addition, SVMs and multi-class neural networks also perform well, and often come close to the highest accuracies. The decision forest classifiers yield relatively low classification rates.
Fine-Grained action recognition based on fine-level grasp, motion-constraints, and the remaining grasp attributes
Recently, there has been a considerable amount of research on the fine-grained recognition of grasps. Intuitively, this problem is more challenging than coarse-grained grasp recognition because the classification is performed at a much finer level. As we focus on action recognition, we consider here the reverse problem, which uses fine-grained grasp information (Table VII). For this, the 33 fine-grained grasp types illustrated in Fig. 2 are used. The other grasp attributes mentioned in Table VII consist of opposition type and grasped dimension.
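The binary classifiers in this comparison are combined through a one-vs-one multi-class voting scheme, as stated in the abstract. The following sketch illustrates only the voting logic on categorical features. The binary learner used here is a deliberately simple count-based scorer chosen for brevity; it is a stand-in, not the locally deep SVM or neural network from the experiments, and all function names, feature values, and labels are made up for the example.

```python
from collections import Counter
from itertools import combinations


def train_pair(samples, class_a, class_b):
    """Stand-in binary learner: count (position, value) pairs per class."""
    counts = {class_a: Counter(), class_b: Counter()}
    for feats, label in samples:
        if label in counts:
            counts[label].update(enumerate(feats))
    return counts


def predict_pair(model, feats):
    """Pick the class whose training samples share more feature values."""
    scores = {c: sum(cnt[(i, v)] for i, v in enumerate(feats))
              for c, cnt in model.items()}
    return max(sorted(scores), key=scores.get)  # ties resolved alphabetically


def train_ovo(samples):
    """Train one binary model for every unordered pair of classes."""
    classes = sorted({label for _, label in samples})
    return {(a, b): train_pair(samples, a, b)
            for a, b in combinations(classes, 2)}


def predict_ovo(models, feats):
    """Each pairwise model casts one vote; the most voted class wins."""
    votes = Counter(predict_pair(m, feats) for m in models.values())
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)


# Toy categorical instances: (object, grasp type, opposition type) -> action.
toy_samples = [
    (("screwdriver", "power", "pad"), "screwdriver|tighten"),
    (("screwdriver", "power", "pad"), "screwdriver|tighten"),
    (("pen", "precision", "pad"), "pen|write"),
    (("pen", "precision", "pad"), "pen|write"),
    (("towel", "power", "palm"), "towel|wipe"),
    (("towel", "power", "palm"), "towel|wipe"),
]
ovo_models = train_ovo(toy_samples)
prediction = predict_ovo(ovo_models, ("pen", "precision", "pad"))
```

A one-vs-one scheme needs one binary model per pair of classes, so 455 action classes would correspond to 455 × 454 / 2 = 103,285 pairwise models.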
|Classifier||Grasp type (33)||Grasp type (33) and other grasp attributes||Grasp type (33) and motion-constraints||Grasp (fine), Grasp (coarse), other grasp attributes & motion-constraints|
|Multi-class decision forest||0.7102||0.7197||0.8378||0.8327|
|Multi-class neural network||0.7703||0.7740||0.8805||0.8809|
|Locally deep SVM (Binary)||0.7224||0.7288||0.8517||0.8548|
|Neural network (Binary)||0.7404||0.7397||0.8541||0.8538|
There are a few interesting observations to note from these experiments. First, the recognition accuracy is nearly equal whether we use the fine grasp information alone or the fine grasp information together with the rest of the grasp attributes. Columns 2 and 3 in Table VII indicate that the finer level of grasp classification substitutes for the information carried by the other grasp attributes, such as grasped dimension and opposition type. This is further supported by the fact that, once motion-constraints information is included, the recognition rates are again nearly equal with and without the rest of the grasp attributes (columns 4 and 5 in Table VII).
Apart from that, by comparing Tables III and VII, we note a clear increase in recognition accuracy when we use fine-grained grasp class labels and object class labels instead of coarse-grained grasp labels and object labels for the task of fine-grained object manipulation action recognition. This result is expected, as we are adding a finer level of information to our grasp labels. Thus, fine-grained grasp information should be preferred over coarse-level grasp information whenever it is available.
After adding motion-constraints data to our instance representation, the difference in recognition rates between coarse and fine-grained grasp information becomes somewhat smaller. This is because motion-constraints show a better ability to model the instances for the task of object manipulation action recognition.
Coarse level action recognition based on grasp and object information
We now perform coarse-level action classification using all the grasp information, i.e., fine and coarse-level grasp types, object labels, and other grasp attributes (Table VIII). The coarse-level action recognition experiments are performed at the force level (weight and interaction classes) and at the motion-constraints level (20 classes).
|Classifiers||Fine level (Manipulation Actions)||Coarse level (Motion-Constraints)||Coarse level (Force)|
|Multi-class decision forest||0.7197||0.8382||0.8394|
|Multi-class neural networks||0.7740||0.8761||0.8953|
|Locally Deep SVM (Binary)||0.7288||0.8380||0.8542|
|Neural Networks (Binary)||0.7397||0.8340||0.8552|
As expected, using the full grasp information, recognition is more accurate for coarse-level classification than for fine-level classification. Since we achieve up to 88% recognition accuracy at the motion-constraints level (20 classes), one interesting observation is the interdependency between the grasp and motion-constraints information. This suggests that grasp-based action recognition can be a good substitute for motion-constraints-based action recognition in cases where determining the motion-constraints for each action instance is difficult.
A high recognition rate at the force level based on the grasp and object information again highlights the efficacy of the grasp information. It allows one to infer what type of force (weight or interaction) is applied with a specific grasp on an object. It indicates a high-level understanding of actions; e.g., drilling requires lifting the machine and therefore involves weight force, whereas writing with a pen involves interaction force. Such an observation is important for transferring manipulation capabilities to robots.
Sequence level action recognition based on fine level grasp information
|Classifiers||Sequence level||Instance level|
|Multi-class decision forest||0.7029||0.7503|
|Multi-class neural networks||0.7256||0.7799|
|Locally Deep SVM (Binary)||0.7664||0.7534|
|Neural Networks (Binary)||0.7551||0.7653|
Finally, we show the recognition results for sequence-level modeling of actions based on the fine-level grasp information. Sequence-level modeling is based on a 34-dimensional feature vector in which each dimension represents the count of a specific fine-grained grasp type involved in that action sequence. This type of modeling is expected to be a better way to model complex actions in which multiple grasp types are involved, whereas instance-level modeling may get confused for such an action sequence, as the same action sequence is associated with different fine-grained grasp types. We only use actions having more than one sequence (105 actions out of the total 455 manipulation actions) in Table IX, and actions having more than five sequences (39 actions out of the total 455 manipulation actions) in Table X.
In Table IX, we note that the instance-level recognition results are generally better than the sequence-level recognition results. This could be due to the small number of training examples in the sequence-level case (as we may have as few as one example for training and one for testing). To model the sequence-level information, we need more training examples, even with an approach similar to bag-of-words.
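The sequence-level representation described above amounts to a bag-of-words style histogram over grasp types. A minimal sketch is given below; the integer coding of grasp types as 0 to 33 and the function name `sequence_histogram` are assumptions made for illustration, while the dimensionality of 34 is taken from the text.

```python
from collections import Counter

NUM_GRASP_BINS = 34  # feature dimensionality stated in the text


def sequence_histogram(grasp_ids, num_bins=NUM_GRASP_BINS):
    """Count how often each fine-grained grasp type occurs in one sequence.

    `grasp_ids` holds one integer per instance of the action sequence, using
    a hypothetical coding of grasp types as integers in [0, num_bins).
    """
    counts = Counter(grasp_ids)
    invalid = [g for g in counts if not 0 <= g < num_bins]
    if invalid:
        raise ValueError(f"grasp ids out of range: {invalid}")
    return [counts.get(i, 0) for i in range(num_bins)]


# Toy sequence with five instances that use three different grasp types.
example_sequence = [3, 3, 17, 3, 28]
sequence_vector = sequence_histogram(example_sequence)
```

Each action sequence then yields a single fixed-length vector regardless of how many instances it contains, which is also why only a few training vectors are available per action at the sequence level.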
|Classifiers||Sequence level||Instance level|
|Multi-class decision forest||0.7694||0.7972|
|Multi-class neural networks||0.8055||0.8015|
|Locally Deep SVM (Binary)||0.8028||0.7982|
|Neural Networks (Binary)||0.8055||0.7977|
Finally, in Table X, we consider only those actions which have more than five sequences, to address the issue of fewer training examples. Here, the sequence-level modeling performs marginally better than the instance-level modeling for most classifiers.
In this paper, we present a novel approach for the recognition of everyday manipulation actions based on grasp and motion-constraints information. We evaluate our hypothesis on the large Yale human grasping dataset consisting of 455 action classes. Our results and varied experimental analysis clearly show that grasp information contains important clues about everyday manipulation actions. We consider the differentiation between the functionality of the object and show that this approach to recognition has a clear advantage over traditional methods of action recognition based on human dynamics. Another advantage of this approach is that this type of action analysis is shown to work over a large set of action classes with very subtle variations in their motion dynamics. Our work indicates that, by considering grasp information and object motion-constraints, one can transfer advanced task capabilities to robotics applications and model human behavior in complex environments.
- F. Zhu, L. Shao, J. Xie, and Y. Fang, “From handcrafted to learned representations for human action recognition: A survey,” Image and Vision Computing, 2016.
- R. Poppe, “A survey on vision-based human action recognition,” Image and vision computing, vol. 28, no. 6, pp. 976–990, 2010.
- R. Filipovych and E. Ribeiro, “Recognizing primitive interactions by exploring actor-object states,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–7.
- H. Kjellström, J. Romero, and D. Kragić, “Visual object-action recognition: Inferring object affordances from human demonstration,” Computer Vision and Image Understanding, vol. 115, no. 1, pp. 81–90, 2011.
- I. M. Bullock, T. Feix, and A. M. Dollar, “The Yale human grasping dataset: Grasp, object, and task data in household and machine shop environments,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 251–255, 2015.
- K. Gupta and A. Bhavsar, “Scale invariant human action detection from depth cameras using class templates,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 38–45.
- R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recognition by representing 3d skeletons as points in a lie group,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
- A. Sharaf, M. Torki, M. E. Hussein, and M. El-Saban, “Real-time multi-scale action detection from 3d skeleton data,” in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 998–1005.
- A. F. Bobick and A. D. Wilson, “A state-based approach to the representation and recognition of gesture,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, no. 12, pp. 1325–1337, 1997.
- C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and recognition of actions,” International Journal of Computer Vision, vol. 50, no. 2, pp. 203–226, 2002.
- J. Sullivan and S. Carlsson, “Recognizing and tracking human action,” in Computer Vision–ECCV 2002. Springer, 2002, pp. 629–644.
- L. Chen, H. Wei, and J. Ferryman, “A survey of human motion analysis using depth imagery,” Pattern Recognition Letters, vol. 34, no. 15, pp. 1995–2006, 2013.
- A. Kurakin, Z. Zhang, and Z. Liu, “A real time system for dynamic hand gesture recognition with a depth sensor,” in Signal Processing Conference (EUSIPCO), 2012 Proceedings of the 20th European. IEEE, 2012, pp. 1975–1979.
- J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, “Robust 3d action recognition with random occupancy patterns,” in Computer vision–ECCV 2012. Springer, 2012, pp. 872–885.
- T. Feix, I. M. Bullock, and A. M. Dollar, “Analysis of human grasping behavior: Object characteristics and grasp type,” Haptics, IEEE Transactions on, vol. 7, no. 3, pp. 311–323, 2014.
- ——, “Analysis of human grasping behavior: Correlating tasks, objects and grasps,” IEEE transactions on haptics, vol. 7, no. 4, pp. 430–441, 2014.
- T. Feix, J. Romero, H.-B. Schmiedmayer, A. M. Dollar, and D. Kragic, “The grasp taxonomy of human grasp types.”
- B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
- Y. Yang, Y. Li, C. Fermüller, and Y. Aloimonos, “Robot learning manipulation action plans by ‘watching’ unconstrained videos from the world wide web,” in AAAI, 2015, pp. 3686–3693.
- Y. Yang, C. Fermuller, Y. Li, and Y. Aloimonos, “Grasp type revisited: A modern perspective on a classical feature for vision,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 400–408.
- Y. Yang, A. Guha, C. Fermuller, and Y. Aloimonos, “A cognitive system for understanding human manipulation actions,” Advances in Cognitive Systems, vol. 3, pp. 67–86, 2014.
- T. Feix, I. M. Bullock, and A. M. Dollar, “Analysis of human grasping behavior: Correlating tasks, objects and grasps,” Haptics, IEEE Transactions on, vol. 7, no. 4, pp. 430–441, 2014.
- K. Gupta, D. Burschka, and A. Bhavsar, “Effectiveness of grasp attributes and motion-constraints for fine-grained recognition of object manipulation actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 27–34.
- J. D. Morrow and P. K. Khosla, “Manipulation task primitives for composing robot skills,” in Robotics and Automation, 1997. Proceedings., 1997 IEEE International Conference on, vol. 4. IEEE, 1997, pp. 3354–3359.
- G. H. Morris and L. S. Haynes, “Robotic assembly by constraints,” in Robotics and Automation. Proceedings. 1987 IEEE International Conference on, vol. 4. IEEE, 1987, pp. 1507–1515.
- A. Criminisi, J. Shotton, and E. Konukoglu, “Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning,” Foundations and Trends® in Computer Graphics and Vision, vol. 7, no. 2–3, pp. 81–227, 2012.
- J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
- C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
- C. Jose, P. Goyal, P. Aggrwal, and M. Varma, “Local deep kernel learning for efficient non-linear svm prediction,” in Proceedings of the 30th international conference on machine learning (ICML-13), 2013, pp. 486–494.
- G. Rogez, J. S. Supancic, and D. Ramanan, “Understanding everyday hands in action from rgb-d images,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3889–3897.
- M. Cai, K. M. Kitani, and Y. Sato, “A scalable approach for understanding the visual structures of hand /media/arxiv_projects/328697/grasps,” in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 1360–1366.