A Comparison of Directional Distances for Hand Pose Estimation
Abstract
Benchmarking methods for 3d hand tracking is still an open problem due to the difficulty of acquiring ground-truth data. We introduce a new dataset and benchmarking protocol that is insensitive to the accumulative error of other protocols. To this end, we create testing frame pairs of increasing difficulty and measure the pose estimation error separately for each of them. This approach gives new insights and allows the performance of each feature or method to be studied accurately without employing a full tracking pipeline. Following this protocol, we evaluate various directional distances in the context of silhouette-based 3d hand tracking, expressed as special cases of a generalized Chamfer distance. An appropriate parameter setup is proposed for each of them, and a comparative study reveals the best performing method in this context.
1 Introduction
Benchmarking methods for 3d hand tracking has been identified in the review [7] as an open problem due to the difficulty of acquiring ground truth data. As in one of the earliest works on markerless 3d hand tracking [19], quantitative evaluations are still mostly performed on synthetic data, e.g., [26, 2, 34, 4, 21]. The vast majority of the related literature, however, is limited to visual, qualitative performance evaluation, where the estimated model is overlaid on the images.
While several datasets and evaluation protocols for benchmarking human pose estimation methods are publicly available, relying on markers [28, 1], inertial sensors [3], or a semi-automatic annotation approach [32] to acquire ground-truth data, no datasets are available for benchmarking articulated hand pose estimation. We therefore propose a benchmark dataset consisting of 4 sequences of two interacting hands captured by 8 cameras, where the ground-truth positions of the 3d joints have been manually annotated.
Tracking approaches are usually evaluated by providing the pose for the first frame and measuring the accumulated pose estimation error over all consecutive frames of the sequence, e.g., [28]. While this protocol is well suited for comparing full tracking systems, it makes it difficult to analyze the impact of individual components of a system. For instance, a method that estimates the joint positions with high accuracy, but fails in a few cases and is unable to recover from errors, will have a high tracking error if an error occurs very early in a test sequence, but a very low one if the error occurs at the end of the sequence. The accumulation of tracking errors makes it difficult to analyze in depth the situations in which an approach works or fails. We therefore propose a benchmark that measures the error not over a full sequence, but over a set of pairs consisting of a starting pose and a test frame. Depending on the start pose and the test frame, the pairs have different grades of difficulty.
In this work, we use the proposed benchmark to analyze various silhouette-based distance measures for hand pose estimation. Distance measures based on a closest-point distance, like the Chamfer distance, are commonly used due to their efficiency [19] and are often extended with directional information [9, 33]. Recently, a fast method that computes a directional Chamfer distance using a 3d distance tensor has been proposed for shape matching [16]. In this work, we introduce a general form of the Chamfer distance for hand pose estimation and quantitatively compare several special cases.
2 Related Work
Since the earliest days of vision-based hand pose estimation [24, 7], low-level features like silhouettes [19], edges [13], depth [6], optical flow [19], shading [14], or combinations of them [17] have been used for hand pose estimation. Although Chamfer distances combined with an edge orientation term have been used in [33, 2, 31, 29], the different distances have not been thoroughly evaluated for hand pose estimation. While a kd-tree is used in [31] to compute a directional Chamfer distance, Liu et al. [16] recently proposed a distance-transform approach to efficiently use a directional Chamfer distance for shape matching. Different shape matching methods for pose estimation have been compared in the context of rigid objects [12] or articulated objects [22]. While previous work mainly considered estimating the pose of a hand in isolation, recent works consider more complicated scenarios where two hands interact with each other [21, 4] or with objects [11, 25, 10, 20, 4].
3 Hand Pose Estimation
For evaluation, we use a publicly available hand model [4], consisting of a set of vertices, an underlying kinematic skeleton with 35 degrees of freedom (DOF) per hand, and skinning weights. The vertices and the joints of the skeleton are shown in Fig. 1. Each 3d vertex $v$ is associated to a bone $b$ by the skinning weights $w_{v,b}$, where $\sum_b w_{v,b} = 1$. The articulated deformations of the skeleton are encoded by the vector $\theta$ that represents the rigid bone transformations $T_b(\theta)$, i.e., rotation and translation, by twists [18, 5]. Each twist-encoded rigid body transformation for a bone can be converted into a homogeneous transformation matrix by the exponential map operator, i.e., $T_b(\theta) = \exp(\hat{\xi}_b \theta_b)$. The mesh deformations based on the pose parameters $\theta$ are obtained by the linear blend skinning operator [15] using homogeneous coordinates:
$v(\theta) = \sum_b w_{v,b}\, T_b(\theta)\, v . \qquad (1)$
In order to estimate the hand pose for a given frame, correspondences between the mesh and the image of each camera are established. Each correspondence associates a vertex $v_i$ to a 2d point $p_i$ in a camera view. Assuming that the cameras are calibrated, the point $p_i$ can be converted into a projection ray, represented by the direction $n_i$ and moment $m_i$ of the line [30, 27]. The hand pose can then be determined by the pose parameters $\theta$ that minimize the shortest distance between the 3d vertices $v_i(\theta)$ and the 3d projection rays $(n_i, m_i)$:
$\hat{\theta} = \arg\min_{\theta} \sum_i \left\| v_i(\theta) \times n_i - m_i \right\|^2 . \qquad (2)$
This non-linear least-squares problem can be iteratively solved [27]:

1. Extract correspondences for all camera views.
2. Solve (2) using the linearization $\exp(\hat{\xi}\theta) \approx I + \hat{\xi}\theta$.
3. Update the vertex positions by (1).
In this work, we reformulate (2) as a Chamfer distance minimization problem.
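The core geometric quantity in (2) is the shortest distance between a 3d point and a projection ray in Plücker form. A minimal sketch, assuming a unit line direction and illustrative function names (not the authors' implementation):

```python
import numpy as np

def pluecker_line(p, q):
    """Plücker coordinates (direction n, moment m) of the line through p and q."""
    n = (q - p) / np.linalg.norm(q - p)   # unit direction
    m = np.cross(p, n)                    # moment with respect to the origin
    return n, m

def point_to_ray_distance(v, n, m):
    """Shortest distance between a 3d point v and the line (n, m): ||v x n - m||."""
    return np.linalg.norm(np.cross(v, n) - m)

# A ray along the z-axis through the origin:
n, m = pluecker_line(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
# A point offset by 2 units in x has distance 2 to that ray, regardless of z:
d = point_to_ray_distance(np.array([2.0, 0.0, 5.0]), n, m)  # 2.0
```

Summing this squared residual over all correspondences yields the objective in (2).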
4 Generalized Chamfer Distance
As discussed in Section 2, the Chamfer distance is commonly used for shape matching and has also been used for pose estimation by shape matching. In our context, the Chamfer distance between the pixels of a contour $C$ in a given camera view and the set of projected rim vertices $V(\theta)$, which depend on the pose parameters $\theta$ and project onto the contour of the projected surface, is
$d_{CH}(C, V(\theta)) = \frac{1}{|V(\theta)|} \sum_{v \in V(\theta)} \min_{c \in C} \left\| c - v \right\| . \qquad (3)$
This expression can be efficiently computed using a 2d distance transform [8].
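To illustrate the role of the distance transform, the sketch below precomputes an exact Euclidean distance field by brute force (the linear-time algorithm of [8] would replace this in practice) and then evaluates the Chamfer cost with one lookup per projected vertex; all names are hypothetical:

```python
import numpy as np

def distance_transform(contour_pixels, shape):
    """Exact Euclidean distance transform by brute force (fine for tiny images).
    contour_pixels: (N, 2) array of (row, col) contour coordinates."""
    ys, xs = np.indices(shape)
    grid = np.stack([ys, xs], axis=-1).astype(float)            # (H, W, 2)
    diffs = grid[:, :, None, :] - contour_pixels[None, None]    # (H, W, N, 2)
    return np.sqrt((diffs ** 2).sum(-1)).min(-1)                # (H, W)

def chamfer(contour_pixels, projected_vertices, shape):
    """Average closest-contour distance for the projected model points,
    read off the precomputed distance field in O(1) per point."""
    dt = distance_transform(np.asarray(contour_pixels, float), shape)
    pts = np.asarray(projected_vertices)
    return dt[pts[:, 0], pts[:, 1]].mean()

# Contour = a single pixel at (0, 0); two model points at distances 3 and 4:
d = chamfer([(0, 0)], [(3, 0), (0, 4)], shape=(8, 8))  # (3 + 4) / 2 = 3.5
```

The distance field is computed once per image, so evaluating (3) for many candidate poses amortizes its cost.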
The Chamfer distance (3) can be generalized by
$d(C, V(\theta)) = \frac{1}{n} \sum_{v \in V(\theta)} \min_{c \in C} \Phi\left( d_{2d}(c, v), c, v \right) , \qquad (4)$
where $d_{2d}$ is a 2d distance function that computes the distance between two points, $\Phi$ is a penalty function for two closest points, and $n$ is a normalization factor. If we use
$\Phi\left( d_{2d}(c, v), c, v \right) = d_{2d}(c, v) , \qquad (5)$
(4) is the standard Chamfer distance (3). In order to increase the robustness to outliers, $\Phi = \min\left( d_{2d}(c, v)^2, \tau \right)$ is used in [29], where $\tau$ is a threshold on the maximum squared distance.
Orientation can be integrated by penalizing correspondences with inconsistent orientations:
$\Phi\left( d_{2d}(c, v), c, v \right) = d_{2d}(c, v) + \lambda \left[ d_{\phi}(\phi_c, \phi_v) > \tau_{\phi} \right] , \qquad (6)$
or by computing the closest distance only to points of similar orientation, based on a circular distance threshold [9]:
$d(C, V(\theta)) = \frac{1}{n} \sum_{v \in V(\theta)} \min_{\{ c \in C \,:\, d_{\phi}(\phi_c, \phi_v) \le \tau_{\phi} \}} \left\| c - v \right\| , \qquad (7)$
where $d_{\phi}$ is the circular distance between two angles, which can be signed, i.e., with angles in the range of $[0^\circ, 360^\circ)$, or unsigned, i.e., in the range of $[0^\circ, 180^\circ)$.
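The difference between the two circular distances can be made concrete with a small sketch (hypothetical helper names; angles in degrees):

```python
def circ_dist_signed(a, b):
    """Circular distance for directed angles in [0, 360): result in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def circ_dist_unsigned(a, b):
    """Circular distance for undirected angles (mod 180): result in [0, 90]."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

# An edge direction flipped by 180 degrees is maximally far in the signed
# sense but identical in the unsigned sense:
s = circ_dist_signed(10.0, 190.0)    # 180.0
u = circ_dist_unsigned(10.0, 190.0)  # 0.0
```

The unsigned variant thus ignores the polarity of the gradient, while the signed variant distinguishes opposite edge directions.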
The directional Chamfer distance [16] can be written as
$d_{DCH}(C, V(\theta)) = \frac{1}{n} \sum_{v \in V(\theta)} \min_{c \in C} \left( \left\| c - v \right\| + \lambda\, d_{\phi}(\phi_c, \phi_v) \right) . \qquad (8)$
To compute (8) efficiently, the orientation can be quantized in order to compute a 3d distance transform [16]. As in [16], we compute $\phi_c$ by converting the contour $C$ into a line representation [23]. $\phi_v$ is obtained by projecting the normals of the corresponding vertices in $V(\theta)$.
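For intuition, a brute-force evaluation of the directional cost in (8) might look as follows; `lam` plays the role of the orientation weight, the names are illustrative, and the quantized 3d distance transform of [16] would replace the inner minimization in an efficient implementation:

```python
import math

def directional_chamfer(contour, model, lam):
    """Brute-force directional Chamfer cost: for every projected model point,
    find the contour point minimizing position + lam * orientation distance.
    contour, model: lists of (x, y, angle_deg); angles treated as directed."""
    def ang(a, b):  # signed circular distance in degrees
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    total = 0.0
    for (mx, my, ma) in model:
        total += min(math.hypot(mx - cx, my - cy) + lam * ang(ma, ca)
                     for (cx, cy, ca) in contour)
    return total / len(model)

# A near contour point with a wrong orientation and a farther one with a
# matching orientation; with enough orientation weight the latter wins:
cost = directional_chamfer([(0, 0, 90.0), (5, 0, 0.0)],
                           [(1, 0, 0.0)], lam=0.1)  # min(1 + 9, 4 + 0) = 4.0
```

This exhaustive inner loop is O(|C|) per vertex, which is exactly what the quantized 3d distance transform reduces to a constant-time lookup.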
In order to use the generalized Chamfer distance for pose estimation from multiple views (2), only the correspondences and the normalization need to be adapted. Let $C_k$ denote the contour of camera view $k$ and $V_k(\theta)$ the set of projected vertices for pose parameters $\theta$ and camera $k$. (2) can be rewritten as
$\hat{\theta} = \arg\min_{\theta} \sum_{k} \sum_{(v, c) \in E_k(\theta)} \left\| X_v(\theta) \times n_c - m_c \right\|^2 \qquad (9)$
with $\; E_k(\theta) = \left\{ (v, c) \,:\, v \in V_k(\theta),\; c = \arg\min_{c' \in C_k} \Phi\left( d_{2d}(c', v), c', v \right) \right\} , \qquad (10)$
where $X_v(\theta)$ is the 3d vertex corresponding to the projected vertex $v$ and $(n_c, m_c)$ is the 3d projection ray corresponding to the contour point $c$. $\Phi$ can be any of the functions (5)-(8).
In the case of (6), instead of adding a fixed penalty term $\lambda$, correspondences with inconsistent orientation can simply be removed, and $E_k(\theta)$ becomes the set of correspondences with $d_{\phi}(\phi_c, \phi_v) \le \tau_{\phi}$.
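This rejection rule can be sketched as a simple filter over candidate correspondences before solving (9); the tuple layout and names below are assumptions for illustration:

```python
def filter_correspondences(correspondences, threshold_deg):
    """Keep only (vertex, contour point) pairs whose orientations agree
    within a circular threshold; rejected vertices simply contribute no
    constraint to the pose optimization."""
    def circ(a, b):  # signed circular distance, result in [0, 180]
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return [(v, c) for (v, v_ang, c, c_ang) in correspondences
            if circ(v_ang, c_ang) <= threshold_deg]

# One consistent and one inconsistent candidate pair:
kept = filter_correspondences(
    [("v1", 10.0, "c1", 20.0),   # 10 degrees apart -> kept
     ("v2", 0.0, "c2", 90.0)],   # 90 degrees apart -> rejected at 45
    threshold_deg=45.0)
```

Dropping a correspondence rather than penalizing it keeps the least-squares system (9) small and avoids pulling the pose toward inconsistent contour points.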
5 Benchmark
We propose a benchmarking protocol that analyzes the error not over full sequences, but over a sampled set of testing pairs. Each pair consists of a starting pose and a test frame; the intermediate frames are ignored to simulate various difficulties. This approach gives new insights and provides the means to analyze in depth the contributions of various features or methods to the overall tracking pipeline under varying difficulty, and to thoroughly study failure cases.
In this respect, 4 publicly available sequences are used^1, containing realistic scenarios of two strongly interacting hands [4]. A subset of the total frames is randomly selected, forming the set of test frames of the final pairs. This is the basis for creating 4 different sets of image pairs, with a difference of 1, 5, 10, and 15 frames, respectively, between the starting pose and the test frame, thus presenting increasing difficulty for tracking systems. These 4 sets and their overall combination constitute a challenging dataset, representing realistic scenarios that occur due to low frame rates, fast motion, or estimation errors in the previous frame.
^1 Model, videos, and motion data are provided at http://cvg.ethz.ch/research/ihmocap. Sequences: Finger tips touching and praying; Fingers crossing and twisting; Fingers folding; Fingers walking. Video: 50 fps, 8 camera views.
The created testing sets are used in two experimental setups: a purely synthetic one and a realistic one. In both cases, the starting pose is given by the publicly available motion data output by the tracker of [4]. In the synthetic setup, the test frame is synthesized from the hand model and the aforementioned motion data, so the required ground truth is inherently available. In the realistic setup, the test frame is given by the camera images, for which no ground-truth data are available; these frames have therefore been manually annotated^2. As error measure, we use the average of the Euclidean distances between the estimated and the ground-truth 3d positions of the joints. For the realistic setup, we use only the joints of the model that could be annotated, which are depicted in black in Fig. 1. For the synthetic setup, all joints of the model (black and red) are taken into account.
^2 The ground-truth annotated dataset, along with a viewer application, is available at http://files.is.tue.mpg.de/dtzionas/GCPR_2013.html.
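The error measure reduces to averaging Euclidean distances over corresponding joints; a minimal sketch (joint correspondences assumed given):

```python
import math

def average_joint_error(estimated, ground_truth):
    """Mean Euclidean distance between corresponding estimated and
    ground-truth 3d joint positions."""
    assert len(estimated) == len(ground_truth)
    return sum(math.dist(e, g)
               for e, g in zip(estimated, ground_truth)) / len(estimated)

# One joint off by a 3-4-5 triangle (error 5), one joint exact (error 0):
err = average_joint_error([(0, 0, 0), (1, 1, 1)],
                          [(3, 4, 0), (1, 1, 1)])  # (5 + 0) / 2 = 2.5
```

For the realistic setup, the sum would simply run over the annotated (black) joints only.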
6 Experiments
6.1 Implementation Details
The aforementioned benchmark is used to evaluate four special cases of the generalized Chamfer distance (Section 4) for hand pose estimation.
CH denotes the Chamfer distance without any orientation information (5).
DCH-Thres rejects correspondences if the orientations are inconsistent, depending on the circular distance threshold (6).
DCH-Quant computes a 2d distance field for each quantization bin of the orientation and assigns a vertex to one bin based on the orientation of its normal (7). Instead of hard binning, soft binning can also be performed, denoted by DCH-Quant2. In this case, the two closest bins are used, yielding two correspondences per vertex.
DCH-DT3 denotes the directional Chamfer distance (8), computed with a 3d distance transform over the quantized orientations [16].
6.2 Results
We have evaluated all Chamfer distances on both the synthetic and the realistic dataset, in order to compare the distances for 3d hand pose estimation, but also to investigate how well synthetic test data predicts performance. As measure, we use the average joint error per test frame and compute the percentage of frames with an error below a given threshold. We first evaluated the differences between the signed and the unsigned circular distance for DCH-Thres and varied the threshold parameter $\tau_{\phi}$. The results are plotted in Fig. 2. The plot shows that the signed distance outperforms the unsigned distance. Since we observed the same result for DCH-DT3, we only report results for the signed distance (360°) in the remaining experiments.
For DCH-DT3, we evaluated the impact of the two parameters: the orientation weight $\lambda$ and the number of quantization bins for the orientation. The results are plotted in Fig. 3. Figs. 3(a) and 3(b) show the importance of directional information for hand pose estimation, and reveal that there is a large range of $\lambda$ that works well. With a finer quantization of the orientation, the original directional Chamfer distance (8) is better approximated. Figs. 3(c) and 3(d) show that 16 bins are sufficient for this task.
We finally evaluated the number of bins for DCH-Quant and DCH-Quant2. Fig. 4 shows that DCH-Quant2 performs better than DCH-Quant. In this case, a large number of bins results in a very orientation-sensitive measure, and the performance decreases with a finer quantization, in contrast to DCH-DT3.
Fig. 5 summarizes the results for each distance with its best parameter setting. As expected, the results show that directional information improves the estimation accuracy. However, it is not DCH-DT3 that performs best for hand pose estimation, but DCH-Thres, which is also more efficient to compute. While for DCH-DT3 the full hand model converges smoothly to the final pose, the thresholding yields a better fit to the silhouette after convergence (see supplementary video^3). Comparing the performance on synthetic and real data, we conclude that synthetic data is a good performance indicator, but can sometimes be misleading. For instance, CH performs well on the synthetic data but worst on the real data. This is also reflected by the mean errors for the various frame differences provided in Table 1, which introduce an increasing difficulty into the benchmark. The term initial denotes the average 3d distance of the joints before running the pose estimation algorithm. The result of a full tracking system [4] is provided for comparison; it expectedly performs better due to the number of features combined. Finally, the runtime is provided for the synthetic experiments, giving some intuition about the time efficiency of each method.
^3 http://youtu.be/Cbu3eEcl1qk
[Table 1: mean joint error for frame differences of 1, 5, 10, and 15 frames, the overall mean (All), and the runtime (Time). Rows for the synthetic setup: CH, DCH-DT3, DCH-Quant, DCH-Thres; rows for the realistic setup: Initial, Ballan et al. [4], CH, DCH-DT3, DCH-Quant, DCH-Thres. Numeric values not recoverable from this extraction.]
7 Conclusion
In this work, we propose a new benchmark dataset for hand pose estimation that allows single components of a hand tracker to be evaluated without running a full system. As an example, we discuss a generalized Chamfer distance and evaluate four special cases of it. The experiments reveal that directional information is important and that a signed circular distance performs better than an unsigned distance in the case of silhouettes. Interestingly, a distance using a circular threshold outperforms the smooth directional Chamfer distance in terms of both accuracy and runtime. We finally conclude that synthetic data can be a good indicator of performance, but might be misleading when comparing different methods. Future plans include adding frame pairs from other sequences with more background clutter and segmentation noise.
8 Acknowledgments
The authors acknowledge financial support from the DFG Emmy Noether program (GA 1927/11) and the Max Planck Society.
References
 [1] van der Aa, N.P., Luo, X., Giezeman, G.J., Tan, R.T., Veltkamp, R.C.: UMPM benchmark: A multi-person dataset with synchronized video and motion capture data for evaluation of articulated human motion and interaction. In: Workshop on Human Interaction in Computer Vision. pp. 1264–1269 (2011)
 [2] Athitsos, V., Sclaroff, S.: Estimating 3d hand pose from a cluttered image. In: CVPR. pp. 432–439 (2003)
 [3] Baak, A., Helten, T., Müller, M., Pons-Moll, G., Rosenhahn, B., Seidel, H.P.: Analyzing and evaluating markerless motion tracking using inertial sensors. In: Workshop on Human Motion. pp. 137–150 (2010)
 [4] Ballan, L., Taneja, A., Gall, J., Van Gool, L., Pollefeys, M.: Motion capture of hands in action using discriminative salient points. In: ECCV. pp. 640–653 (2012)
 [5] Bregler, C., Malik, J., Pullen, K.: Twist based acquisition and tracking of animal and human kinematics. IJCV 56(3), 179–194 (2004)
 [6] Delamarre, Q., Faugeras, O.D.: 3d articulated models and multiview tracking with physical forces. CVIU 81(3), 328–357 (2001)
 [7] Erol, A., Bebis, G., Nicolescu, M., Boyle, R.D., Twombly, X.: Vision-based hand pose estimation: A review. CVIU 108(1-2), 52–73 (2007)
 [8] Felzenszwalb, P.F., Huttenlocher, D.P.: Distance transforms of sampled functions. Theory of Computing 8(19), 415–428 (2012)
 [9] Gavrila, D.: Multifeature hierarchical template matching using distance transforms. In: ICPR. pp. 439–444 (1998)
 [10] Hamer, H., Gall, J., Weise, T., Van Gool, L.: An objectdependent hand pose prior from sparse training data. In: CVPR. pp. 671–678 (2010)
 [11] Hamer, H., Schindler, K., KollerMeier, E., Van Gool, L.: Tracking a hand manipulating an object. In: ICCV. pp. 1475–1482 (2009)
 [12] Han, D., Rosenhahn, B., Weickert, J., Seidel, H.P.: Combined registration methods for pose estimation. In: ISVC. pp. 913–924 (2008)
 [13] Heap, T., Hogg, D.: Towards 3d hand tracking using a deformable model. In: FG. pp. 140–145 (1996)
 [14] de La Gorce, M., Fleet, D.J., Paragios, N.: Modelbased 3d hand pose estimation from monocular video. PAMI 33(9), 1793–1805 (2011)
 [15] Lewis, J.P., Cordner, M., Fong, N.: Pose space deformation: a unified approach to shape interpolation and skeletondriven deformation. In: SIGGRAPH (2000)
 [16] Liu, M.Y., Tuzel, O., Veeraraghavan, A., Chellappa, R.: Fast directional chamfer matching. In: CVPR. pp. 1696–1703 (2010)
 [17] Lu, S., Metaxas, D., Samaras, D., Oliensis, J.: Using multiple cues for hand tracking and model refinement. In: CVPR. pp. 443–450 (2003)
 [18] Murray, R.M., Sastry, S.S., Zexiang, L.: A Mathematical Introduction to Robotic Manipulation. CRC Press, Inc., Boca Raton, FL, USA (1994)
 [19] Nirei, K., Saito, H., Mochimaru, M., Ozawa, S.: Human hand tracking from binocular image sequences. In: IECON. pp. 297–302 (1996)
 [20] Oikonomidis, I., Kyriazis, N., Argyros, A.: Full dof tracking of a hand interacting with an object by modeling occlusions and physical constraints. In: ICCV (2011)
 [21] Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: CVPR. pp. 1862–1869 (2012)
 [22] Pons-Moll, G., Leal-Taixé, L., Truong, T., Rosenhahn, B.: Efficient and robust shape matching for model based human motion capture. In: DAGM (2011)
 [23] Ramer, U.: An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing 1(3), 244 – 256 (1972)
 [24] Rehg, J.M., Kanade, T.: Visual tracking of high dof articulated structures: an application to human hand tracking. In: ECCV. pp. 35–46 (1994)
 [25] Romero, J., Kjellström, H., Kragic, D.: Hands in action: realtime 3d reconstruction of hands in interaction with objects. In: ICRA. pp. 458–463 (2010)
 [26] Rosales, R., Athitsos, V., Sigal, L., Sclaroff, S.: 3d hand pose reconstruction using specialized mappings. In: ICCV. pp. 378–387 (2001)
 [27] Rosenhahn, B., Brox, T., Weickert, J.: Threedimensional shape knowledge for joint image segmentation and pose tracking. IJCV 73, 243–262 (2007)
 [28] Sigal, L., Balan, A., Black, M.: Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV 87, 4–27 (2010)
 [29] Stenger, B., Thayananthan, A., Torr, P.: Modelbased hand tracking using a hierarchical bayesian filter. PAMI 28(9), 1372–1384 (2006)
 [30] Stolfi, J.: Oriented Projective Geometry: A Framework for Geometric Computations. Academic Press, Boston (1991)
 [31] Sudderth, E., Mandel, M., Freeman, W., Willsky, A.: Visual Hand Tracking Using Nonparametric Belief Propagation. In: Workshop on Generative Model Based Vision. pp. 189–189 (2004)
 [32] Tenorth, M., Bandouch, J., Beetz, M.: The TUM Kitchen Data Set of Everyday Manipulation Activities for Motion Tracking and Action Recognition. In: International Workshop on Tracking Humans for the Evaluation of their Motion in Image Sequences. pp. 1089–1096 (2009)
 [33] Thayananthan, A., Stenger, B., Torr, P.H.S., Cipolla, R.: Shape context and chamfer matching in cluttered scenes. In: CVPR. pp. 127–133 (2003)
 [34] Zhou, H., Huang, T.: Okapichamfer matching for articulate object recognition. In: ICCV. pp. 1026–1033 (2005)