Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image
Abstract
Articulated hand pose estimation is a challenging task for human-computer interaction. State-of-the-art hand pose estimation algorithms work only with one or a few subjects for which they have been calibrated or trained. In particular, hybrid methods based on learning followed by model fitting, or on model-based deep learning, do not explicitly consider varying hand shapes and sizes. In this work, we introduce a novel hybrid algorithm for estimating the 3D hand pose as well as the bone-lengths of the hand skeleton at the same time, from a single depth image. The proposed CNN architecture simultaneously learns hand pose parameters and scale parameters associated with the bone-lengths. Subsequently, a new hybrid forward kinematics layer employs both sets of parameters to estimate the 3D joint positions of the hand. For end-to-end training, we combine three public datasets, NYU, ICVL and MSRA2015, into one unified format to achieve large variation in hand shapes and sizes. Among hybrid methods, our method shows improved accuracy over the state-of-the-art on the combined dataset and on the ICVL dataset, which contain multiple subjects. Our algorithm is also demonstrated to work well on unseen images.
1 Introduction
The human hand is an example of a complex articulated object that exhibits many degrees of freedom (DoFs), self-similarities, self-occlusions and constrained parameters. With the arrival of commodity depth cameras and notable progress in machine learning in the past few years, research on human hand tracking and pose inference has gained popularity and has become an active area of research.
Mainly, three approaches exist for hand pose estimation: generative (model-based), discriminative (learning-based) and hybrid. Generative methods start by defining a calibrated hand model geometry and optimize an energy function to obtain the hand pose parameters [14, 16, 24, 26]. These methods achieve higher accuracy at the cost of optimizing complex energy functions [17]. On the other hand, discriminative approaches try to infer a coarse hand pose from single depth, RGB-D or RGB images, based on information learned during training [12, 18, 35, 7, 3, 33, 10, 27]. Recently published CNN-based methods such as hierarchical tree-like structured CNNs [11], multi-view CNNs [5, 6] and the region ensemble network [8] have shown significant improvement in accuracy over their counterparts, random-forest-based methods [22, 29, 30]. Despite the fact that direct joint regression using CNNs has achieved higher accuracy than other existing methods and our approach, the estimated pose is coarse and does not exploit the hand geometry, i.e. kinematics and physical constraints. Hence, independent learning of hand joints is likely to produce invalid hand poses, especially during tracking. In hybrid methods, the pose inferred by a discriminative method can be fed as coarse input to a generative process to obtain a refined hand pose [28, 13, 21, 20, 32]. In particular, Zhou et al. [34] propose an efficient model-based deep learning approach as an alternative to the generative post-processing step of hybrid methods. However, a big limitation of this work is the assumption of a hand model geometry with fixed bone-lengths during end-to-end training. Clearly, this limitation restricts the generalization of the approach over different hand shapes and sizes. Our idea is to estimate not only the 3D hand pose but also the bone-lengths of the hand skeleton at the same time. To the best of our knowledge, this problem has never been explicitly addressed before.
Therefore, we introduce a novel hybrid algorithm that simultaneously estimates the 3D hand pose and the bone-lengths of the hand skeleton. To this end, hand scale parameters are learned to facilitate the end-to-end training process of the model-based deep learning approach, thereby leading to promising results for 3D hand pose estimation.
In order to show the validity of our approach, a hand pose dataset with large variation in hand shapes and sizes is necessary. Several real hand pose datasets are publicly available, but individually these datasets lack variation in the hand shapes and sizes of their subjects, a large number of original depth images and complex hand poses [1]. Therefore, we combine the most commonly used real hand pose datasets and convert them into a single unified format, which we call HandSet. We summarize our main contributions as follows:

A novel hybrid approach for simultaneous estimation of the 3D hand pose and the bone-lengths of the hand skeleton.

A combined real hand pose dataset that offers large variation in hand shapes and sizes, an increased number of pre-processed depth frames from different depth cameras and complex hand poses. The dataset will be made publicly available.
2 Related Work
Comprehensive reviews of hand pose estimation methods using depth sensors are given in [1, 23, 4]. Our work is related to hybrid methods and to real hand pose datasets captured from a frontal camera view. Hence, we focus on the most related works in the following subsections.
2.1 Hand Pose Datasets Based on Real Depth Data
In this subsection, we briefly introduce the most commonly used real hand pose datasets.
The NYU hand pose dataset [28] provides RGB-D frames acquired with a PrimeSense Carmine 1.09 depth camera. The test set contains images. The dataset covers a wide range of complex hand poses. To acquire the ground truth, the direct search method proposed by [14] is adopted with modifications, and the annotations are quite accurate. However, this dataset has no variation in hand shapes and sizes, because it has only one subject in the training set and two subjects in the test set.
The ICVL dataset [25] contains original depth frames including subjects, and two test sets with frames each. However, by applying rotations, the total size of the dataset exceeds images, along with the ground truth. An Intel Creative gesture camera was used to acquire the depth images. The dataset has a good number of complex hand poses, though not as complex as those of the NYU dataset [1]. Ground truth is created using a search method guided by a binary Latent Tree Model (LTM) [2]. However, the ground truth is not very accurate and the variation in hand shapes and sizes is low.
The MSRA dataset [22] contains depth frames captured with the Creative gesture camera. Images are captured from subjects, each performing hand gestures. Ground truth is annotated using a semi-automatic, iterative process followed by manual corrections [16]. However, the annotations are less accurate.
In order to benefit from the individual strengths of the datasets described above, and to gain the advantages of a bigger dataset with more variation in hand shapes, sizes and depth cameras, we propose to combine them into one unified format, described in Section 3.
Some other real hand pose datasets captured from a frontal camera view exist, e.g. Dexter [21], SHREC2017 (http://wwwrech.telecomlille.fr/shrec2017hand/), MSRA2014 [16] and ASTAR [31]. However, these datasets either contain a small number of original images, lack depth information, provide few ground-truth joint positions or contain many outliers in the annotations. Therefore, they are not considered in this work.
2.2 Hybrid Methods for Hand Pose Estimation
The first CNN-based hand pose estimation method was introduced by [28]: joint locations are predicted by a CNN in the form of heatmaps, and an Inverse Kinematics (IK) step is then applied to estimate the 3D hand pose from the predicted joints. Poier et al. [15] use a model-based optimization step based on multiple 3D joint hypotheses (proposal distributions) received from a random regressor. In [19], coarse joints are predicted using a pixel-classification random forest; in the generative model fitting step, a similarity function between the predicted and generated joints is optimized. In the methods mentioned above, the model fitting (generative) part is separate from the joint estimation part. In [13], Oberweger et al. perform a complex training of a feedback loop to infer the correct hand pose. It uses three neural networks: the first estimates a coarse hand pose, the second synthesizes the input image and the third updates the pose. Ye et al. [32] introduce a hierarchical hybrid method with a spatial attention mechanism and hierarchical Particle Swarm Optimization (PSO). Zhou et al. [34] propose a low-latency framework that seamlessly integrates a generative hand model layer with a neural network; the layer maps the received joint angles to 3D positions. However, the hand model needs to be calibrated for a specific user. Inspired by this work, we propose a new low-latency hybrid algorithm for estimating the hand skeleton bone-lengths and pose simultaneously. The end-to-end training of our pipeline is simple and highly efficient, and the forward kinematic function in the generative layer is differentiable with respect to both the joint angles and the hand scale parameters.
3 Combined Dataset and PreProcessing
The first step in merging different datasets is to select the common joint positions present in all datasets. The ICVL dataset has the smallest number of joints; we therefore consider the corresponding 16 joints in the NYU and MSRA datasets and remove the additional joints for consistency. Since each dataset uses a different depth camera to acquire its images, we need to pre-process the depth frames according to their respective camera intrinsics, frame resolutions and depth ranges. Inspired by the method in [34], for depth invariance, the images are cropped around the palm center in all three dimensions (u, v and depth) using a fixed-size bounding box. Then, depth values are normalized to , . The 3D joint locations are also normalized in range , using the bounding box. The final pre-processed image is of size 128 x 128 and its ground-truth annotations include the internal joints shown in Figure 1 and four fingertips. The HandSet contains preprocessed training depth images, test images and different subjects.
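The crop-and-normalize step can be sketched as follows. This is a minimal illustration: the bounding-box extent (`box_px`, `box_mm`), the palm-center inputs and the nearest-neighbour resize are assumptions for the example, not the exact values or interpolation used in our pipeline.

```python
import numpy as np

def preprocess_depth(depth, palm_uv, palm_z, box_px=64, box_mm=150.0, out_size=128):
    """Crop a fixed-size box around the palm center and normalize depth to [-1, 1].

    depth   : (H, W) depth image in millimetres.
    palm_uv : (u, v) pixel coordinates of the palm center.
    palm_z  : depth of the palm center in millimetres.
    """
    u, v = palm_uv
    h, w = depth.shape
    # Crop in u and v, clamping at the image border.
    u0, u1 = max(u - box_px, 0), min(u + box_px, w)
    v0, v1 = max(v - box_px, 0), min(v + box_px, h)
    crop = depth[v0:v1, u0:u1].astype(np.float32)
    # Clip depth to the cube around the palm and map it linearly to [-1, 1].
    crop = np.clip(crop, palm_z - box_mm, palm_z + box_mm)
    crop = (crop - palm_z) / box_mm
    # Nearest-neighbour resize to the fixed network input resolution.
    ys = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    return crop[np.ix_(ys, xs)]
```

The same bounding box would then be used to normalize the 3D joint annotations so that network targets and inputs share one coordinate convention.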
4 Hand Pose and BoneLengths Estimation
In this section, we explain our approach for the simultaneous estimation of the hand pose and the bone-lengths of the hand skeleton using a hybrid forward kinematics layer and deep architectures.
4.1 Hybrid Forward Kinematics Layer
Figure 1 shows our hand skeleton. We assume a zero pose vector (i.e., a pose with all parameters set to zero) as the reference hand pose; all other poses are defined relative to this reference pose. We initialize the hand skeleton with the averages of the individual bone-lengths computed from the ground-truth annotations of each dataset. Given the hand pose and scale parameters, the hybrid forward kinematics layer (see Figure 4) implements a forward kinematic function defined as:
J = F(Θ, S)   (1)
where Θ is the vector of hand pose parameters, S defines the hand scale factors associated with the bone-lengths and J is the vector of predicted 3D joint positions.
The 3D transformation of each joint in J is derived from its joint angles (rotation) and scaled bone-lengths (translation). The global 3D position of a joint is obtained by applying a series of rotational and translational transformations along the path from the hand root joint to that joint, as shown in Figure 2.
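The accumulation of rotations and scaled translations can be sketched for a single kinematic chain. This is an illustrative simplification: a real hand layer applies the same recursion over a tree of chains with per-joint rotation axes and DoF counts, whereas here every joint rotates about the z-axis for brevity.

```python
import numpy as np

def rot_z(theta):
    """3x3 rotation about the z-axis (stand-in for per-joint rotation axes)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(thetas, scales, rest_lengths, root=np.zeros(3)):
    """Joint positions along one chain: each bone is rotated relative to its
    parent and translated by its scaled rest bone-length along the local x-axis."""
    joints = [root.astype(float)]
    R = np.eye(3)
    p = root.astype(float)
    for theta, s, l in zip(thetas, scales, rest_lengths):
        R = R @ rot_z(theta)                       # accumulate rotation down the chain
        p = p + R @ np.array([s * l, 0.0, 0.0])    # translate by the scaled bone
        joints.append(p)
    return np.array(joints)
```

With all angles zero and unit scales, the chain lies along the x-axis and joint positions are the cumulative rest bone-lengths; doubling a scale doubles the corresponding bone's contribution, which is exactly the behaviour the layer exploits to adapt the skeleton.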
The cost function uses the Euclidean 3D joint location loss, given as:
L(Θ, S) = ‖F(Θ, S) − J_gt‖²   (2)
where J_gt is the vector of 3D ground-truth joint positions.
Since Equation 1 is differentiable with respect to both the pose parameters Θ and the hand scales S, it can be used in a deep network to compute gradients for backpropagation. The Jacobian of F with respect to Θ is defined as:
∂F(Θ, S)/∂Θ = [ ∂J/∂θ_1  ∂J/∂θ_2  ⋯ ]   (3)
The Jacobian of F with respect to S can be defined in a similar way. The partial derivative of a joint position p_u in J with respect to a pose parameter θ_t is calculated as:
∂p_u/∂θ_t = ω_t × (p_u − p_t),  if the joint of θ_t lies in P(u), and 0 otherwise,   (4)
where P(u) is the set of joints along the kinematic chain from p_u to the root joint, p_t is the position of the joint actuated by θ_t and ω_t is its rotation axis in world coordinates.
Similarly, we compute the partial derivative of a joint position p_u in J with respect to a scale parameter s_k as:
∂p_u/∂s_k = Σ_{t ∈ Q(u, k)} l_t b_t,   (5)
where l_t is the rest bone-length of the bone ending at joint p_t, b_t is that bone's unit direction in world coordinates, and Q(u, k) is the set of parent joints of p_u that share the same scale parameter s_k.
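Analytic derivatives of this kind can be checked against central finite differences on a toy chain. The `chain_fk` helper below, its single z rotation axis and its planar layout are illustrative assumptions, not our actual hand model; the point is that rotating a root angle moves every descendant joint by ω × (p_u − p_root), and perturbing a scale moves it along the associated bone directions.

```python
import numpy as np

def chain_fk(thetas, scales, lengths):
    """Planar chain: joint i+1 = joint i + R(theta_0 + ... + theta_i) @ [s*l, 0, 0]."""
    pts, ang, p = [np.zeros(3)], 0.0, np.zeros(3)
    for t, s, l in zip(thetas, scales, lengths):
        ang += t
        p = p + np.array([np.cos(ang), np.sin(ang), 0.0]) * s * l
        pts.append(p.copy())
    return np.array(pts)

thetas = np.array([0.3, -0.2, 0.5])
scales = np.array([1.0, 1.1, 0.9])
lengths = np.array([40.0, 30.0, 20.0])
pts = chain_fk(thetas, scales, lengths)

# Pose derivative of the end joint w.r.t. the root angle: omega x (p_u - p_root),
# with omega the world-frame rotation axis (z for a planar chain).
omega = np.array([0.0, 0.0, 1.0])
analytic = np.cross(omega, pts[-1] - pts[0])

eps = 1e-6  # central finite difference
tp, tm = thetas.copy(), thetas.copy()
tp[0] += eps; tm[0] -= eps
numeric = (chain_fk(tp, scales, lengths)[-1] - chain_fk(tm, scales, lengths)[-1]) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-4)

# Scale derivative w.r.t. s_0: the unit direction of bone 0 times its rest length.
ds_analytic = np.array([np.cos(thetas[0]), np.sin(thetas[0]), 0.0]) * lengths[0]
sp, sm = scales.copy(), scales.copy()
sp[0] += eps; sm[0] -= eps
ds_numeric = (chain_fk(thetas, sp, lengths)[-1] - chain_fk(thetas, sm, lengths)[-1]) / (2 * eps)
assert np.allclose(ds_analytic, ds_numeric, atol=1e-4)
```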
4.2 Deep Architectures with Hand Scales
Human hands differ in the sizes of individual fingers and palms, and such differences need to be considered explicitly during training. Therefore, we introduce various hand scales as additional learning parameters to facilitate CNN training on HandSet, as shown in Figure 4. These scale factors are learned by the CNN along with the pose parameters.
We propose three implementations of our method, explained in the following subsections, and compare their performance in Section 6. We build our CNN architecture on the baseline architecture proposed in [12], mainly for the sake of fair comparison. The pipeline of our algorithm is shown in Figure 4. The CNN comprises three convolutional layers with kernel sizes 5, 5 and 3, respectively. Max pooling layers with strides 4, 2 and 1 and zero padding follow the convolutional layers, and the resulting feature maps are of size 12 x 12 x 8. Two fully connected layers of 1024 neurons each follow, with dropout layers of dropout ratio 0.3. All convolutional layers use ReLU activations.
4.2.1 Global-Scale
In this architecture, we define a global scale for the hand skeleton such that it can vary its size symmetrically. In Figure 4, the last fully connected layer outputs the pose parameters and one additional global hand scale parameter shared by all bones of the hand skeleton. A larger scale value results in a bigger hand skeleton and vice versa. The hybrid forward kinematics layer takes this scale parameter as input along with the pose parameters and computes the 3D joint positions according to Equation 1. The partial derivative of a joint with respect to the global scale parameter can be computed using Equation 5.
4.2.2 5-Scales
This architecture associates five separate hand scale parameters, one per finger, from the fingertip to the palm center (root joint). These parameters allow the individual fingers to vary their lengths according to their respective scale values, thereby adding flexibility to both the shape and the size of the hand skeleton. The scale vector is defined as:
S = (s_1, s_2, s_3, s_4, s_5)   (6)
Given the pose parameters Θ and the scales S, the forward kinematic function defined by Equation 1 is applied to estimate more accurate 3D joint locations. Using Equation 5, the partial derivative of a joint with respect to its associated finger scale parameter is calculated.
4.2.3 Multi-Scale
In this architecture, we assign a separate scale to each bone of our hand skeleton, so each bone-length can be estimated independently of the other bones (see Section 4.1). Hence, this architecture provides the maximum flexibility to adapt the shape and size of the hand skeleton.
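The three architectures differ only in how the learned scale values are broadcast onto the bones before the forward kinematics layer. A minimal sketch, assuming a hypothetical layout of 5 fingers with 4 bones each (the exact bone count of the skeleton may differ):

```python
import numpy as np

# Hypothetical bone layout for illustration: 5 fingers x 4 bones = 20 bones.
N_FINGERS, BONES_PER_FINGER = 5, 4
N_BONES = N_FINGERS * BONES_PER_FINGER

def global_scale(s):
    """Global-Scale: one scalar shared by every bone of the skeleton."""
    return np.full(N_BONES, float(s))

def five_scales(s5):
    """5-Scales: one scalar per finger, shared along that finger's chain."""
    return np.repeat(np.asarray(s5, dtype=float), BONES_PER_FINGER)

def multi_scale(s_all):
    """Multi-Scale: an independent scalar for each bone."""
    return np.asarray(s_all, dtype=float)
```

The broadcast also explains the gradient sharing in Equation 5: with a global or per-finger scale, several bones contribute to the derivative of one scale parameter, whereas in the Multi-Scale case each bone-length is driven by its own parameter.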
5 Implementation Details
For end-to-end training of our model, we use the Caffe open-source framework for deep networks [9]. The network is trained until convergence with a fixed learning rate of using as SGD momentum. We perform data augmentation, i.e. rotations and scalings, during the training phase. The complete framework runs on a PC with an Nvidia GeForce 1070 GPU. One forward pass takes .
(Figure: qualitative results on unseen depth images from three subjects, USER 1, USER 2 and USER 3.)
6 Results
In this section, we illustrate the accuracy of our model through qualitative and quantitative results and through comparisons with state-of-the-art hybrid methods. We do not claim to exceed the accuracy of recently published discriminative methods [6, 8], which neglect the hand model geometry, i.e. kinematics and physical constraints. Instead, we provide a performance comparison with existing hybrid methods to validate our algorithm, which fully exploits a flexible hand model geometry and estimates the 3D hand pose and the bone-lengths of the hand skeleton simultaneously. Notably, popular public datasets such as NYU and ICVL contain little variation in hand shapes and sizes (see Section 2.1); we nevertheless report results on these datasets for completeness. We use two common evaluation metrics: first, the average 3D joint location error over the test set; second, the fraction of test frames for which the maximum predicted 3D joint error is below a certain threshold in millimetres.
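Both metrics are simple reductions over predicted and ground-truth joint arrays. A sketch, assuming inputs of shape (frames, joints, 3) in millimetres:

```python
import numpy as np

def avg_joint_error(pred, gt):
    """Mean Euclidean 3D joint error over all frames and joints, in mm."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def frames_within(pred, gt, thresh_mm):
    """Fraction of frames whose *maximum* per-joint error is below thresh_mm.

    This is the stricter of the two metrics: a single badly predicted joint
    makes the whole frame count as a failure at that threshold.
    """
    per_frame_max = np.linalg.norm(pred - gt, axis=-1).max(axis=1)
    return (per_frame_max < thresh_mm).mean()
```

Sweeping `thresh_mm` over a range of values yields the success-rate curves commonly plotted for this task.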
6.1 Qualitative Evaluation
Some challenging hand pose images from the three datasets, along with the joint positions predicted by our model, are shown in Figure 23. We show sample images with the overlaid hand skeleton from our 5-Scales model and the model-based deep model [34] in Figure 30. Our model shows very good results, whereas the compared model is unable to converge successfully, leading to inaccurate 3D hand joint positions and bone-lengths. We also tested the 5-Scales model and Zhou et al. [34] on unseen images acquired from three different users. Our model infers the hand pose quite accurately, whereas the other model fails to converge (see Figure 40). Some failure cases of our two other architectures (Global-Scale and Multi-Scale) are shown in Figure 45. Incorrect bone-length estimation by the Global-Scale architecture can occur because a single scale parameter is associated with all bones of the hand skeleton. On the other hand, in the Multi-Scale architecture, the independent learning of each bone-length of the hand skeleton may result in incorrect bone-length estimates.
Table 1: Average 3D joint location error on HandSet.

Method                3D Joint Location Error
Zhou et al. [34]      18.7 mm
Multi-Scale [Ours]    15.1 mm
Global-Scale [Ours]   15.3 mm
5-Scales [Ours]       12.7 mm
6.2 Quantitative Evaluation
We trained our three architectures (Global-Scale, Multi-Scale and 5-Scales) as well as the publicly available model-based deep architecture [34] on HandSet. Notably, [34] fails when trained on HandSet, mainly because it assumes a fixed hand model geometry during end-to-end training. We summarize the accuracy comparison in Figure 48 and Table 1. Our 5-Scales architecture shows the best accuracy, demonstrating that our approach works well under large variation in hand shapes and sizes. On the NYU dataset, our accuracy is comparable to Oberweger et al. [13] on the common joints (see Table 2). On the ICVL dataset, our method shows improved performance in comparison to other state-of-the-art hybrid methods (see Table 3). Since the NYU dataset has no variation (one subject) and ICVL has low variation in hand shapes and sizes, one can see a clear advantage of our method on the ICVL dataset and a comparable performance on the NYU dataset. Figure 50 shows a more detailed comparison on individual joints of the ICVL dataset.


7 Conclusion and Future Work
In this work, we presented a novel hybrid method that outputs the 3D hand pose as well as the bone-lengths of the hand skeleton simultaneously. We demonstrated the effectiveness of our approach on depth images captured from unseen subjects. Our method uses one CNN and a hybrid forward kinematics layer to predict the 3D joint positions of the hand from a single depth image. The CNN estimates hand scale parameters (associated with the bones of the hand) and pose parameters. In the hybrid forward kinematics layer, the initial hand skeleton is reshaped according to the estimated hand scale parameters and a differentiable forward kinematic function is applied. Three different implementations of our method are introduced that describe the hand scale parameters in distinct ways. In addition, we presented a unified pre-processing method that combines popular real hand pose datasets for better training of the CNN, thereby gaining the advantage of a bigger dataset with, in particular, large variation in hand shapes and sizes. The training process is simple and efficient, and the proposed algorithm is well suited for real-time applications. Qualitative and quantitative results verify that our method achieves improved performance over state-of-the-art hybrid methods.
This work can be extended in several interesting directions. It can be combined with high-performing discriminative methods to achieve higher accuracy. We plan to extend this work to stable real-time hand tracking; in this case, small variations in hand scale may occur for the same person, which can be addressed by automatically fixing the estimated bone-lengths after a few frames. The hand skeleton can be upgraded to a skinned hand model for a finer representation of hand shape and size, thereby learning more complex hand shape parameters with the CNN. We further plan to enlarge the combined dataset with a greater variety of hand shapes and sizes using both real and synthetic images. The same approach can be extended to simultaneous human body pose and shape estimation.
Acknowledgements
This work was partially funded by the European project Eyes of Things (EoT) under contract number GA643924.
References
 [1] E. Barsoum. Articulated hand pose estimation review. arXiv preprint arXiv:1604.06195, 2016.
 [2] M. J. Choi, V. Y. Tan, A. Anandkumar, and A. S. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12(May):1771–1812, 2011.
 [3] X. Deng, S. Yang, Y. Zhang, P. Tan, L. Chang, and H. Wang. Hand3D: Hand pose estimation using 3D neural network. arXiv preprint arXiv:1704.02224, 2017.
 [4] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly. Vision-based hand pose estimation: A review. Computer Vision and Image Understanding, 108(1):52–73, 2007.
 [5] L. Ge, H. Liang, J. Yuan, and D. Thalmann. Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3593–3601, 2016.
 [6] L. Ge, H. Liang, J. Yuan, and D. Thalmann. 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [7] H. Guo, G. Wang, and X. Chen. Two-stream convolutional neural network for accurate RGB-D fingertip detection using depth and edge information. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 2608–2612. IEEE, 2016.
 [8] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional network for hand pose estimation. arXiv preprint arXiv:1702.02447, 2017.
 [9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
 [10] P. Li, H. Ling, X. Li, and C. Liao. 3D hand pose estimation using randomized decision forest with segmentation index points. In Proceedings of the IEEE International Conference on Computer Vision, pages 819–827, 2015.
 [11] M. Madadi, S. Escalera, X. Baro, and J. Gonzalez. End-to-end global to local CNN learning for hand pose recovery in depth data. arXiv preprint arXiv:1705.09606, 2017.
 [12] M. Oberweger, P. Wohlhart, and V. Lepetit. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807, 2015.
 [13] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3316–3324, 2015.
 [14] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3D tracking of hand articulations using Kinect. In BMVC, volume 1, page 3, 2011.
 [15] G. Poier, K. Roditakis, S. Schulter, D. Michel, H. Bischof, and A. A. Argyros. Hybrid one-shot 3D hand pose estimation by exploiting uncertainties. arXiv preprint arXiv:1510.08039, 2015.
 [16] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1113, 2014.
 [17] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3633–3642. ACM, 2015.
 [18] A. Sinha, C. Choi, and K. Ramani. DeepHand: Robust hand pose estimation by completing a matrix imputed with deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4150–4158, 2016.
 [19] S. Sridhar, F. Mueller, A. Oulasvirta, and C. Theobalt. Fast and robust hand tracking using detection-guided optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2015.
 [20] S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from RGB-D input. In European Conference on Computer Vision, pages 294–310. Springer, 2016.
 [21] S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive markerless articulated hand motion tracking using RGB and depth data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2456–2463, 2013.
 [22] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 824–832, 2015.
 [23] J. S. Supancic, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan. Depth-based hand pose estimation: data, methods, and challenges. In IEEE International Conference on Computer Vision, pages 1868–1876, 2015.
 [24] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly. Robust articulated-ICP for real-time hand tracking. In Computer Graphics Forum, volume 34, pages 101–114. Wiley Online Library, 2015.
 [25] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3D articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3786–3793, 2014.
 [26] D. Tang, J. Taylor, P. Kohli, C. Keskin, T.-K. Kim, and J. Shotton. Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In Proceedings of the IEEE International Conference on Computer Vision, pages 3325–3333, 2015.
 [27] J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, T. Sharp, E. Soto, D. Sweeney, J. Valentin, B. Luff, et al. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics (TOG), 35(4):143, 2016.
 [28] J. Tompson, M. Stein, Y. LeCun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (TOG), 33(5):169, 2014.
 [29] C. Wan, A. Yao, and L. Van Gool. Hand pose estimation from local surface normals. In European Conference on Computer Vision, pages 554–569. Springer, 2016.
 [30] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng. Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups. International Journal of Computer Vision, pages 1–25, 2017.
 [31] C. Xu, A. Nanjappa, X. Zhang, and L. Cheng. Estimate hand poses efficiently from single depth images. International Journal of Computer Vision, 116(1):21–45, 2016.
 [32] Q. Ye, S. Yuan, and T.-K. Kim. Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In European Conference on Computer Vision, pages 346–361. Springer, 2016.
 [33] Y. Zhang, C. Xu, and L. Cheng. Learning to search on manifolds for 3D pose estimation of articulated objects. arXiv preprint arXiv:1612.00596, 2016.
 [34] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei. Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854, 2016.
 [35] C. Zimmermann and T. Brox. Learning to estimate 3D hand pose from single RGB images. arXiv preprint arXiv:1705.01389, 2017.