A Survey of Efficient Regression of General-Activity Human Poses from Depth Images

Wenye He

This paper presents a comprehensive review of regression-based methods for human pose estimation. The problem of human pose estimation has been intensively studied and enables many applications, from entertainment to training. Traditional methods often rely on color images alone, which cannot fully resolve the ambiguity of a joint's 3D position, especially in complex scenes. With the growing availability of depth sensors, the precision of 3D estimation has improved significantly. In this paper, we give a detailed analysis of the state of the art in human pose estimation, including both depth-image-based and RGB-D-based approaches. The experimental results demonstrate their advantages and limitations in different scenarios.


Introduction

Human pose estimation from images has been studied for decades in computer vision. With recent developments in cameras and sensors, depth images have received widespread attention from researchers, from body pose estimation [1] to 3D reconstruction [2]. Girshick et al. [1] present an approach to locate the joint positions of the human body from depth images. They address the problem of general-activity pose estimation, and their regression-based approach successfully computes joint positions even under occlusion. Their method can be viewed as a new combination of two existing lines of work: implicit shape models [3] and Hough forests [4]. The following sections cover related work, an explanation of the method from testing to training, and results and comparisons.

Related Works

In previous works, one common idea in human pose estimation is to focus on finding different body parts. Bourdev et al. [5] put forward a two-layer regression model that trains segment classifiers for local pattern detection and combines the classifiers' outputs. Plagemann et al. [6] create a novel interest point detector for catching body components in depth images. Shotton et al. [7] design a system that converts a single input depth image into an inferred per-pixel body part distribution and localizes the 3D joint positions. Pictorial-structures-based methods have also been used for human pose estimation to enhance the body shape [8]; this approach can optimally remove the ambiguity of 3D inference from a single viewpoint. Pictorial structures can also be combined with segmentation to efficiently localize body parts and predict joint positions [9]. However, this idea has problems with the required definition of body alignment, joints inside the body, and body occlusion. The implicit shape model [10] can address these problems. Random forest [11] based methods also have advantages over body-part-based methods.


Method

The researchers call their approach Joint Position Regression. Their algorithm finds the 3D joint positions of the human body by aggregating votes from a regression forest. The testing and training procedures are introduced as follows.
A regression forest is made up of a group of decision trees that together give the predicted outputs. Every tree is binary and contains split nodes and leaf nodes. Each split node carries a test: the feature value is computed by comparing the depths at pixels offset from the current pixel against a threshold, and the result of the test determines whether to branch to the left or right child. The leaf nodes are the ends of a tree. Given an input pixel at the root, the pixel passes through the sequence of tests at the split nodes of each depth and finally reaches a leaf node that returns the corresponding output. The researchers store a few relative votes at each leaf node. They define the set of relative votes for joint j at leaf node l as V_lj = {(Δ_ljk, w_ljk)} for k = 1, …, K, where Δ_ljk is a 3D relative vote vector, w_ljk is a confidence weight attached to each vote, and K is the number of votes stored per leaf node. The vectors Δ_ljk are obtained by taking the centers of the K largest modes (the most frequently occurring offsets) found by mean shift, and the weights w_ljk are given by the sizes of their clusters. K is kept small for efficiency, e.g. K = 1 or K = 2. The testing procedure is shown in Algorithm 1, and the aggregation of pixel votes at test time is shown in Figure 1.
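The depth-comparison test at a split node can be sketched as follows. This is an illustrative sketch in the spirit of the features used by Shotton et al. [7] and Girshick et al. [1]; the function names, the offset convention, and the background value are assumptions, not the paper's implementation.

```python
import numpy as np

def depth_feature(depth, q, u, v, background=10.0):
    """Depth-comparison feature: the difference of depths probed at two
    offsets u and v around pixel q.  Offsets are scaled by 1/depth(q) so
    the feature is invariant to the distance of the body from the camera.
    `depth` is an HxW depth map in metres; out-of-bounds probes return a
    large background value."""
    def probe(offset):
        d_q = depth[q[0], q[1]]
        y = q[0] + int(round(offset[0] / d_q))
        x = q[1] + int(round(offset[1] / d_q))
        if 0 <= y < depth.shape[0] and 0 <= x < depth.shape[1]:
            return depth[y, x]
        return background
    return probe(u) - probe(v)

def split_goes_left(depth, q, u, v, threshold):
    """A split node branches left when the feature is below its threshold."""
    return depth_feature(depth, q, u, v) < threshold
```

On a flat depth map the feature is zero; any depth discontinuity between the two probe points produces a nonzero response, which is what lets the trees pick up body silhouette edges.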

  // Collect absolute votes
  initialize Z_j = ∅ for all joints j
  for all pixels q in the test image do
     lookup 3D pixel position x_q
     for all trees in forest do
        descend tree to reach leaf node l
        for all joints j do
           lookup weighted relative vote set V_lj
           for all (Δ_ljk, w_ljk) ∈ V_lj do
              if ‖Δ_ljk‖ ≤ λ_j then
                 compute absolute vote z = x_q + Δ_ljk
                 adapt confidence weight w = w_ljk · z_q²
                 add (z, w) to Z_j
  // Aggregate weighted votes
  sub-sample Z_j to contain at most N votes
  aggregate Z_j using mean shift on Eq. 1
  return weighted modes as final hypotheses
Algorithm 1 Inferring joint position hypotheses
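The vote-collection loop of Algorithm 1 can be sketched for a single joint as below. This is a toy sketch, not the paper's implementation: each tree is represented as a callable mapping a pixel's 3D position to its leaf's relative vote set, and `infer_joint_votes`, the weight adaptation by squared depth, and the sub-sampling step are illustrative assumptions based on the algorithm above.

```python
import numpy as np

def infer_joint_votes(pixels, forest, vote_length_threshold, max_votes=200):
    """Collect the absolute vote set Z_j for one joint.  `pixels` are 3D
    world positions (x, y, depth); `forest` is a list of callables, each
    standing in for a decision tree that maps a pixel to its leaf's list
    of (relative_offset, weight) pairs."""
    votes = []  # the set Z_j of (absolute_position, confidence) pairs
    for x in pixels:
        for tree in forest:
            for delta, w in tree(x):            # leaf's relative vote set V_lj
                if np.linalg.norm(delta) <= vote_length_threshold:
                    z = x + delta               # absolute vote in world space
                    confidence = w * x[2] ** 2  # adapt weight by squared depth
                    votes.append((z, confidence))
    # Sub-sample: keep only the most confident votes for efficiency.
    votes.sort(key=lambda v: -v[1])
    return votes[:max_votes]
```

The length threshold discards long-range votes, which the paper found to be less reliable, and the depth-squared weighting compensates for distant pixels covering more world-space area.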

Figure 1: Each pixel (black square) casts a 3D vote (orange line) for each joint. Mean shift is used to aggregate these votes and produce a final set of hypotheses for each joint. Note the accurate predictions of internal body joints even when occluded. The highest-confidence hypothesis for each joint is shown.

Let Z_j denote the set of absolute votes cast by all pixels for body joint j.
The algorithm comprises three steps: collecting absolute votes, aggregating the weighted votes, and computing the final hypotheses. The set of absolute votes is updated by adding, for each reliable relative vote, the 3D pixel position plus the relative offset, together with the adapted confidence weight. The threshold λ_j in the algorithm is used to eliminate unreliable long-range predictions. Finally, the algorithm returns the hypotheses aggregated with mean shift using Eq. 1:


p_j(z') ∝ Σ_{(z, w) ∈ Z_j} w · exp( −‖ (z' − z) / b_j ‖² )        (1)

where b_j is a learned per-joint bandwidth and z' is a position in 3D world space. Figure 1 shows the aggregation of pixel votes at test time.
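The mode-seeking step on the density of Eq. 1 can be sketched as a weighted Gaussian mean-shift iteration. This is a minimal sketch assuming a single starting point and a fixed iteration budget; the function name and convergence test are illustrative choices, not the paper's code.

```python
import numpy as np

def mean_shift_mode(votes, weights, bandwidth, start, iters=50):
    """One mean-shift ascent on the weighted Gaussian density of Eq. 1:
    p(z') ∝ sum_i w_i * exp(-||(z' - z_i)/b||^2).  From `start`, repeatedly
    move to the kernel-weighted mean of the votes until the step size is
    negligible; the fixed point is a mode, i.e. a joint hypothesis."""
    z = np.asarray(start, dtype=float)
    votes = np.asarray(votes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    for _ in range(iters):
        # Gaussian kernel responsibilities of every vote for the current point.
        k = weights * np.exp(-np.sum(((z - votes) / bandwidth) ** 2, axis=1))
        z_new = (k[:, None] * votes).sum(axis=0) / k.sum()
        if np.linalg.norm(z_new - z) < 1e-8:
            break
        z = z_new
    return z
```

In the full method, ascents are started from several votes and the resulting modes are ranked by the total weight of the votes in their basin, giving a confidence-ordered list of hypotheses per joint.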
After the explanation of the testing procedure, training is described here. Training is composed of three learning problems: the leaf node regression models, the hyper-parameters, and the tree structure.
For the first learning problem, the main objective is to learn the set of relative votes V_lj. Algorithm 2 below shows how this is achieved.

  // Collect relative offsets
  initialize R_lj = ∅ for all leaf nodes l and joints j
  for all pixels q in all training images i do
     lookup ground truth joint positions z_ij
     lookup 3D pixel position x_iq
     compute relative offset Δ = z_ij − x_iq
     descend tree to reach leaf node l
     store Δ in R_lj with reservoir sampling
  // Cluster
  for all leaf nodes l and joints j do
     cluster offsets R_lj using mean shift
     take top K weighted modes as V_lj
  return relative votes V_lj for all nodes and joints
Algorithm 2 Learning relative votes
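The storage step of Algorithm 2 uses reservoir sampling so that each leaf keeps only a bounded, uniformly sampled subset of the offsets that reach it, regardless of how many training pixels land there. A minimal sketch of that step (the helper name and interface are assumptions):

```python
import random

def reservoir_add(reservoir, item, seen_count, capacity, rng=random):
    """Maintain a uniform random sample of at most `capacity` items out of
    the `seen_count` items observed so far, in O(capacity) memory.
    `seen_count` is the number of items seen so far, including this one."""
    if len(reservoir) < capacity:
        # Reservoir not full yet: keep everything.
        reservoir.append(item)
    else:
        # Replace a random slot with probability capacity / seen_count.
        j = rng.randrange(seen_count)
        if j < capacity:
            reservoir[j] = item
    return reservoir
```

This keeps per-leaf memory constant during training, which matters because the method trains on thousands of images and millions of pixels.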

In simple terms, computing the relative votes means taking the differences between the ground truth joint positions and the 3D pixel positions, grouping those differences into clusters with mean shift, and picking the best modes as relative votes. The goal of the second learning problem is to find the optimal bandwidths b_j and vote-length thresholds λ_j; the researchers tune these hyper-parameters on a validation set (the learned values, in metres, are reported in [1]). For the third learning problem, the researchers use standard greedy decision tree training: they repeatedly split the set of training pixels into left and right subsets using Eq. 2.


φ* = argmin_φ Σ_{s ∈ {l, r}} ( |Q_s(φ)| / |Q| ) · E(Q_s(φ))        (2)

where E(·) is an error function that scores the quality of the candidate partitions Q_l(φ) and Q_r(φ) induced by the split parameters φ. The researchers use both a regression error function, E_reg, and a classification one, E_cls, and observe the resulting accuracy. For regression, they apply the objective proposed for Hough forests [4], while for classification, they employ the objective presented by Shotton et al. [7].
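The two error functions and the split score of Eq. 2 can be sketched as follows. This is an illustrative sketch, assuming the regression error is the mean squared deviation of offsets from their mean (in the spirit of Hough forests [4]) and the classification error is the Shannon entropy of body-part labels (in the spirit of Shotton et al. [7]); the exact formulations in the paper may differ in details.

```python
import numpy as np

def regression_error(offsets):
    """Mean squared deviation of relative offsets from their mean:
    low when the offsets reaching a node agree on where the joint is."""
    offsets = np.asarray(offsets, dtype=float)
    return float(np.mean(np.sum((offsets - offsets.mean(axis=0)) ** 2, axis=1)))

def classification_error(labels):
    """Shannon entropy of body-part labels: low when a node's pixels
    mostly belong to a single body part."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def split_score(error_fn, left, right):
    """Eq. 2: size-weighted error of the two candidate partitions.
    Greedy training picks the split parameters minimizing this score."""
    n = len(left) + len(right)
    return (len(left) / n) * error_fn(left) + (len(right) / n) * error_fn(right)
```

A split that separates the data into two internally consistent subsets scores lower than one that leaves both sides mixed, which is exactly what the greedy training loop exploits.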

Experiments and results

In this section, the researchers evaluate their tree-structure objectives and compare their work with existing methods. They evaluate their method on the MSRC [7] dataset. To measure accuracy, they report average precision per joint and its mean across joints (mAP). They use forests of 3 trees, each trained to depth 20 on 5000 images. Figure 2 shows some examples of joint inference with the researchers' method.
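The mAP metric used throughout the experiments can be sketched as below. This is a minimal sketch under the assumption that a hypothesis counts as correct when it lies within a fixed distance of the ground-truth joint (the paper uses a threshold on 3D distance), and that AP is the average of precision at the ranks of correct hypotheses; the function names are illustrative.

```python
import numpy as np

def average_precision(confidences, is_correct):
    """AP for one joint: rank hypotheses by confidence, then average the
    precision at every rank where a correct hypothesis appears."""
    order = np.argsort(confidences)[::-1]
    correct = np.asarray(is_correct)[order]
    if not correct.any():
        return 0.0
    # Precision at each rank: correct-so-far / hypotheses-so-far.
    precisions = np.cumsum(correct) / (np.arange(len(correct)) + 1)
    return float(precisions[correct].mean())

def mean_average_precision(per_joint):
    """mAP: the mean of per-joint APs.  `per_joint` is a list of
    (confidences, is_correct) pairs, one per body joint."""
    return float(np.mean([average_precision(c, t) for c, t in per_joint]))
```

Ranking by confidence matters here because the method outputs multiple weighted hypotheses per joint, and mAP rewards placing the correct ones first.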

Figure 2: In the left group, each example shows an input depth image with colored ground truth joint positions, and inferred joint positions from the front, right, and top views; the size of each box indicates the inferred confidence. In the right group, example inference results on flattened 2D silhouettes are shown; the crosses are the ground truth joint positions, and the circles, with size indicating confidence, are the highest-scoring hypotheses.

At the end of the previous section, regression and classification objective functions were mentioned. Figure 3 shows their average precision on all joints. The classification objective function gives the highest accuracy, so it is used for the comparisons in the later experiments.

Figure 3: Comparison of tree structure training objectives. The threshold values shown are those used in the regression objective functions.

Comparison with Hough forests

Figure 4 shows the results of this comparison. The researchers run Hough forests on the MSRC-5000 test data with different numbers of votes and tree structures.

Figure 4: Comparison with Hough forests. The threshold values shown are those used in the regression objective functions; different vote counts and objective functions are used in the Hough forest variants.

Comparison with Shotton et al. [7]

Figure 5 shows the results of this comparison. The researchers' algorithm achieves a higher mAP than Shotton et al.'s across different training set sizes.

Figure 5: Comparison with Shotton et al. (a) Mean average precision versus total number of training images. (b) Average precision on each of the 16 test body joints.


References

  • 1. Girshick, Ross, et al. "Efficient regression of general-activity human poses from depth images." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
  • 2. J. Shen and S. C. S. Cheung, “Layer Depth Denoising and Completion for Structured-Light RGB-D Cameras,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1187-1194, 2013.
  • 3. Müller, Jürgen, and Michael Arens. ”Human pose estimation with implicit shape models.” Proceedings of the first ACM international workshop on Analysis and retrieval of tracked events and motion in imagery streams. ACM, 2010.
  • 4. Gall, Juergen, and Victor Lempitsky. ”Class-specific hough forests for object detection.” Decision forests for computer vision and medical image analysis. Springer London, 2013. 143-157.
  • 5. Bourdev, Lubomir, and Jitendra Malik. ”Poselets: Body part detectors trained using 3d human pose annotations.” Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.
  • 6. Plagemann, Christian, et al. ”Real-time identification and localization of body parts from depth images.” Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010.
  • 7. Shotton, Jamie, et al. ”Real-time human pose recognition in parts from single depth images.” Communications of the ACM 56.1 (2013): 116-124.
  • 8. J. Shen and J. Yang, “Automatic human animation for non-humanoid 3d characters,” International Conference on Computer-Aided Design and Computer Graphics (CAD/Graphics), pp. 220-221, 2015.
  • 9. J. Shen and Y. Yang “Automatic pose tracking and motion transfer to arbitrary 3d characters,” International Conference on Image and Graphics, pp. 640-653, 2015.
  • 10. Leibe, Bastian, Aleš Leonardis, and Bernt Schiele. "Robust object detection with interleaved categorization and segmentation." International Journal of Computer Vision 77.1-3 (2008): 259-289.
  • 11. Rogez, Grégory, et al. ”Randomized trees for human pose detection.” Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.