A Bayesian Framework for Sparse Representation-Based 3D Human Pose Estimation


Behnam Babagholami-Mohamadabadi, Amin Jourabloo, Ali Zarghami, and Shohreh Kasaei  The authors are with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran, e-mail: skasaei@sharif.edu
Abstract

A Bayesian framework for 3D human pose estimation from monocular images based on sparse representation (SR) is introduced. Our probabilistic approach aims at simultaneously learning two overcomplete dictionaries (one for the visual input space and the other for the pose space) with a shared sparse representation. Existing SR-based pose estimation approaches only offer a point estimation of the dictionary and the sparse codes. Therefore, they might be unreliable when the number of training examples is small. Our Bayesian framework estimates a posterior distribution for the sparse codes and the dictionaries from labeled training data. Hence, it is robust to overfitting on small-size training data. Experimental results on various human activities show that the proposed method is superior to the state-of-the-art pose estimation algorithms.

Bayesian learning, dictionary learning, Gibbs sampling, Metropolis-Hastings algorithm.

I Introduction

Recently, 3D human pose estimation from monocular images has attracted much attention in the computer vision community due to its significant role in various applications, such as visual surveillance, activity recognition, and motion capture. Although many algorithms have been proposed for estimating the 3D human pose from single images, it remains a challenging task due to the lack of depth information and significant variations in visual appearance, human shape, lighting conditions, clutter, and so forth.
Existing methods for monocular 3D pose estimation can be divided into three main categories. Model-based approaches assume a known parametric body model and estimate the human pose by inverting the kinematics or by optimizing an objective function of the pose variables [Agarwal, Andriluka]. These computationally expensive approaches need an accurate body model and a good initialization stage. Furthermore, due to the non-convexity of their objective functions, their solutions might be sub-optimal. Learning-based approaches, on the other hand, learn a direct mapping between the visual input space and the human pose space [Elgammal, Lee]. Despite the good performance of these approaches, one of their major drawbacks is that their pose estimation accuracy depends heavily on the amount of training data. Finally, exemplar-based approaches estimate the pose of an unknown visual input (image) by searching the training data (a set of visual inputs whose corresponding 3D pose descriptors are known) and interpolating the pose from the training visual input(s) most similar to the unknown input [Jiang, Odobez]. The problem with these methods is their high computational complexity, since the whole visual input database must be searched to find the sample(s) similar to an unknown input.
Some researchers have recently utilized the sparse representation and dictionary learning (SRDL) framework to estimate the human pose [Huang, Huang2, Su]. Huang et al. [Huang] proposed an SR-based method in which each test data point is expressed as a compact linear combination of the training visual inputs; the method is capable of dealing with occlusion. The pose of the test sample is then recovered by applying the same linear combination to the training poses. Ji et al. [Su] introduced a robust dual dictionaries learning (DDL) approach which can handle corrupted input images, together with an efficient algorithm for solving the DDL optimization model.
Although the results of SRDL approaches are comparable with those of the state-of-the-art methods, they suffer from two shortcomings. First, since these algorithms only provide a point estimate of the dictionary and the sparse codes (which might be sensitive to the choice of training examples), they tend to overfit the training data, especially when the number of training examples is small. Second, none of these methods uses the information in the pose training data. More precisely, all SRDL-based methods learn the dictionary and sparse codes without enforcing that samples with similar poses should have similar sparse codes. In order to overcome these shortcomings, this paper presents a Bayesian framework for SRDL-based pose estimation that targets the common case in which the number of training examples is limited. Moreover, by employing appropriate prior distributions on the latent variables of the proposed model, the dictionary is learned under the constraint that samples with similar poses must have similar sparse codes.
The remainder of this letter is organized as follows: The proposed method is introduced in Section II. Experimental results are presented in Section III. Finally, the conclusion and future work are given in Section IV.

II Proposed 3D Human Pose Estimation Method

Following [Su], we aim at learning two dictionaries (visual input dictionary and pose dictionary) with a shared sparse representation based on a Bayesian learning framework that utilizes the information of the pose training data.
Let , and denote the training set of visual input features and their corresponding pose features, respectively. We model each input feature and pose feature as a sparse combination of the atoms of dictionaries and with additive noise and , respectively. The matrix form of the model is given as

(1)

where is the set of -dimensional sparse codes, , and are the zero-mean Gaussian noise with precision value ( and are and identity matrices, respectively). We model each sparse code as an element-wise multiplication of a binary vector and a weight vector , as

(2)

where denotes the -th coefficient of the -th sparse code, is a binary random variable, and is a zero-mean Gaussian random variable with precision value . The intuition behind the above model is that if , then and the -th atoms of the dictionaries and are inactive. If , then the -th atoms of the dictionaries are active, and the value of the coefficient is drawn from . We also place a prior distribution on each binary random variable by using the logistic sigmoid function, as

(3)

where are the hyper-parameters of the model. In order to exploit the information of the training pose data, a prior Gaussian distribution is considered for as

(4)

where

(5)

and is a valid kernel (one satisfying Mercer's condition) whose value decreases as the distance between and increases. Using the above distributions, the process of generating the sparse codes is as follows:

  • Draw the parameters by using (4).

  • Draw the binary random vectors by using (3).

  • If , draw , else .

As can be seen from this process, if two input features have similar pose features, they tend to use the same dictionary atoms (imposed by the kernel ) and hence obtain similar sparse codes.
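The three-step generative process above can be sketched in a few lines of NumPy (a minimal illustration; the names `beta` and `gamma_s` and the dictionary size of 20 atoms are hypothetical placeholders, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sparse_code(beta, gamma_s, rng):
    """Draw one spike-and-slab sparse code x = z * s.

    beta   : (K,) logits for the Bernoulli atom-activity variables (Eq. 3)
    gamma_s: precision of the Gaussian slab over the weights
    """
    pi = 1.0 / (1.0 + np.exp(-beta))            # logistic prior on each z_k
    z = rng.random(beta.shape) < pi             # binary atom-activity vector
    s = rng.normal(0.0, 1.0 / np.sqrt(gamma_s), size=beta.shape)
    return np.where(z, s, 0.0)                  # inactive atoms give zero coefficients

# Hypothetical sizes: K = 20 dictionary atoms.
beta = rng.normal(0.0, 1.0, size=20)
x = draw_sparse_code(beta, gamma_s=4.0, rng=rng)
```

Because the binary vector gates the Gaussian weights, atoms with small logistic probabilities are switched off and the resulting code is sparse.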
In our method, we also impose a prior zero-mean Gaussian distribution on the dictionary atoms of and , as

(6)

To complete the Bayesian treatment, we place non-informative Gamma hyper-priors on the parameters , and . Given the training data , the proposed hierarchical probabilistic model can be expressed as

Fig. 1: Proposed model (blue shadings indicate observations).
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)

where is the element-wise multiplication operator, , , are the hidden variables, are the parameters (the precision values, i.e., inverse variances, of the Gaussian noise distributions), and are the hyper-parameters of the proposed model. The graphical representation of the proposed probabilistic model is shown in Fig. 1.

II-A Posterior Inference

Since computing the exact posterior distribution of the hidden variables is intractable, inference is performed using Gibbs sampling, which approximates the posterior with samples. In the proposed model, all of the distributions are in the conjugate exponential family except for the logistic function. Due to the non-conjugacy between the logistic function and the Gaussian distribution, the Gibbs update equation for cannot be derived in closed form. To overcome this problem, one can place an exponential upper bound on the logistic functions of (8) based on the convex duality theorem [Jordan]. Using this theorem and the fact that the log of a logistic function is concave, an upper bound on the logistic functions is obtained in the form of

(18)

where

(19)

and are the variational parameters which should be optimized to obtain the tightest bound. Using the upper bound of (18), we propose a Gaussian distribution as the proposal distribution in a Metropolis-Hastings (MH) independence chain algorithm [Hastings] and derive the posterior samples for by using this algorithm (the details of generating samples for and the other hidden variables are given in Appendix A).
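The mechanics of an MH independence chain can be sketched as follows (an illustrative stand-in: the target below is a toy standard normal, not the paper's conditional, and the proposal is a fixed Gaussian as in the method described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_independence(log_target, mu, sigma, n_samples, rng):
    """Metropolis-Hastings independence chain with a N(mu, sigma^2) proposal.

    Because the proposal q(x) does not depend on the current state, the
    acceptance ratio is [p(x*) q(x)] / [p(x) q(x*)].
    """
    def log_q(x):  # unnormalized log-density of the proposal
        return -0.5 * ((x - mu) / sigma) ** 2

    x = mu
    samples = []
    for _ in range(n_samples):
        x_new = rng.normal(mu, sigma)
        log_r = (log_target(x_new) + log_q(x)) - (log_target(x) + log_q(x_new))
        if np.log(rng.random()) < log_r:
            x = x_new                       # accept the proposed state
        samples.append(x)
    return np.array(samples)

# Toy target: standard normal; the proposal is deliberately wider.
samples = mh_independence(lambda x: -0.5 * x**2, mu=0.0, sigma=2.0,
                          n_samples=5000, rng=rng)
```

In the paper's setting, the Gaussian proposal is obtained from the variational upper bound, so it closely matches the true conditional and keeps the acceptance rate high.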

II-B Pose Prediction

After computing the posterior distribution of the hidden variables, the target pose of a given test instance is determined by first computing the predictive distribution of the target pose, integrating out the hidden variables, as

(20)

where

(21)

and . The mean of this distribution is taken as the target pose for . Since the expression in (20) cannot be computed in closed form, one can resort to Monte Carlo sampling to approximate it. The distribution can thus be approximated with samples as

(22)
(23)

where is the -th sample of the hidden variable . Since an equally weighted average of Gaussian distributions is a mixture whose mean is available in closed form, is computed analytically by

(24)

Sampling from is straightforward (we use the posterior samples; see Section II-A). However, since the true pose of the unknown visual input is unknown, cannot be obtained, and hence the posterior samples of cannot be generated. To overcome this problem, the samples of are derived from the posterior samples as

(25)

where if belongs to the nearest neighbors of , and otherwise. This derivation rests on the fact that neighboring visual inputs are likely to have similar poses. Using the above samples for , one can generate samples from by using (3).
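The kNN-weighted construction of (25) can be sketched as follows (a hedged illustration with uniform weights over the neighbors; the array names and shapes are assumptions, not the paper's notation):

```python
import numpy as np

def beta_samples_for_test(x_test, X_train, beta_post, k):
    """Approximate posterior samples of beta for an unseen input.

    x_test   : (d,) visual feature of the test instance
    X_train  : (N, d) training visual features
    beta_post: (T, N, K) posterior beta samples for the N training points
    Returns  : (T, K) beta samples averaged over the k nearest neighbors,
               mirroring the kNN weighting of Eq. (25).
    """
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nn = np.argsort(dists)[:k]               # indices of the k nearest inputs
    return beta_post[:, nn, :].mean(axis=1)  # uniform weights over neighbors
```

Each averaged sample can then be pushed through the logistic prior of (3) to generate binary activity vectors for the test instance.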

Activity     Tr.   RVM             TGP             SR              DDL             PM
Acrobatics   30    15.963 ± 2.73   15.411 ± 2.97   14.005 ± 2.32   16.731 ± 3.82   12.595 ± 1.24
Acrobatics   60    13.294 ± 2.53   13.353 ± 2.49   12.805 ± 2.12   14.734 ± 3.41    9.328 ± 0.99
Acrobatics   100   10.651 ± 1.52    9.882 ± 1.76    8.104 ± 1.44   10.323 ± 1.97    6.443 ± 0.96
Acrobatics   200    7.247 ± 1.14    6.896 ± 1.05    5.506 ± 0.93    6.989 ± 1.31    4.862 ± 0.79
Navigate     30    10.821 ± 1.19   10.917 ± 1.31   10.455 ± 0.99   11.421 ± 1.53    7.623 ± 0.61
Navigate     60     6.674 ± 0.96    6.819 ± 0.89    6.662 ± 0.71    7.872 ± 1.24    5.782 ± 0.34
Navigate     100    4.434 ± 0.22    5.029 ± 0.36    5.550 ± 0.49    5.753 ± 0.57    3.229 ± 0.20
Navigate     200    3.567 ± 0.17    4.194 ± 0.16    3.866 ± 0.27    4.331 ± 0.41    3.075 ± 0.12
Golf         30    14.514 ± 2.88   14.728 ± 2.02   13.909 ± 1.82   15.688 ± 2.93    8.241 ± 1.13
Golf         60     9.337 ± 2.51    8.964 ± 1.79    9.745 ± 1.11   10.949 ± 2.21    5.752 ± 0.88
Golf         100    7.652 ± 1.32    7.515 ± 0.69    5.467 ± 0.42    7.442 ± 0.61    3.931 ± 0.52
Golf         200    5.220 ± 1.48    5.333 ± 0.74    4.535 ± 0.50    5.273 ± 0.57    3.034 ± 0.37
TABLE I: Average error (in degrees) with standard deviation of different methods (Tr. = number of training samples).

III Experimental Results

In order to evaluate the performance of the proposed method, the activities in the CMU Mocap dataset (http://mocap.cs.cmu.edu/) are used in the BVH format to generate the silhouettes of real sequences. The method is tested on various activities ("Acrobatics", "Navigate", "Golf", etc.). We use the histograms of shape contexts [Agarwal], which encode the visual input (silhouette) into a 100-dimensional descriptor, as the input feature. The human body pose is encoded by 57 joint angles (three angles for each joint). The error is the average (over all angles) of the root mean square (RMS) error. We captured 600 frames from each sequence and used 30, 60, 100, and 200 of them as the training data, and the rest as the test data. In all experiments, all hyper-parameters are set to to make the prior Gamma distributions uninformative. We also used the exponential kernel , for which the kernel parameter is set to

(26)

In order to determine an appropriate number of dictionary atoms, , and the number of nearest neighbors of unknown data samples, , five-fold cross-validation is performed to find the best pair (, ). The tested values for are and for are . In the analysis that follows, 1200 MCMC iterations are used (700 burn-in and 500 collection, from a random start). For the proposal distribution in the MH algorithm, the acceptance rates were greater than . We compared the performance of the proposed method with that of the relevance vector machine (RVM), a well-known supervised regression method; the twin Gaussian process (TGP) [Bo], a state-of-the-art method; and DDL [Su] and SR [Huang2], two state-of-the-art SRDL-based 3D human pose estimation methods. The average estimation errors (over 10 runs) together with the standard deviations for three activities are shown in Table I (the results for other activities are available in Appendix B), from which we can see that the proposed method significantly outperforms the other methods. The improvement in performance has two causes. First, the number of labeled data is small, so the competing methods may overfit the labeled data. Second, these methods cannot utilize the information in the pose data. Figure 2 shows subjective results of the proposed method, SR, and DDL for 4 sequences in the database. These outputs are obtained using 200 training samples, with the remaining 400 frames used for testing. As can be seen, the proposed method reconstructs the poses better than the other methods.
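The evaluation metric described above, the average over all joint angles of the per-angle RMS error, can be sketched as follows (a minimal implementation under the assumption that the RMS is taken per angle across all test frames):

```python
import numpy as np

def avg_rms_error(pose_true, pose_pred):
    """Average over joint angles of the per-angle RMS error (in degrees).

    pose_true, pose_pred: (n_frames, n_angles) arrays of joint angles,
    e.g. n_angles = 57 for the 19-joint, 3-angle encoding used here.
    """
    per_angle_rms = np.sqrt(np.mean((pose_true - pose_pred) ** 2, axis=0))
    return per_angle_rms.mean()
```

This is the quantity reported (with its standard deviation over runs) in Table I.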

Fig. 2: Subjective comparison. Columns show input silhouettes, real labels, and outputs of PM, SR, and DDL, respectively.

IV Conclusion

In this letter, a fully probabilistic framework for SR-based 3D human pose estimation was proposed that utilizes the information of the pose space. Experimental results demonstrated the high performance of the proposed method, especially in cases where only a small number of training samples is available.

Appendix A: MCMC Inference

In the following equations, denotes the conditional probability of parameter given the current values of all other parameters. The sampling equations are as follows:

Sample :

(27)

Since is a Bernoulli random variable (), we have

(28)

where

(29)
(30)

Hence, it is obvious that is drawn from a Bernoulli distribution

(31)

where

(32)
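Numerically, a Gibbs step for a binary variable of this kind reduces to normalizing the two unnormalized masses and drawing a Bernoulli sample. A log-space sketch (with `log_p0` and `log_p1` standing in for the logs of the two quantities in (29) and (30); the log-sum-exp shift keeps the normalization stable):

```python
import numpy as np

def sample_z(log_p1, log_p0, rng):
    """Gibbs step for a binary variable: draw z ~ Bernoulli(p1 / (p0 + p1)).

    log_p0, log_p1: unnormalized log-probabilities of z = 0 and z = 1.
    """
    m = max(log_p0, log_p1)                       # shift for numerical stability
    p1 = np.exp(log_p1 - m) / (np.exp(log_p0 - m) + np.exp(log_p1 - m))
    return int(rng.random() < p1), p1
```

Working in log-space matters here because the Gaussian likelihood terms can easily underflow when exponentiated directly.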

Sample :

(33)

It is easy to show that if , then is drawn from

(34)

and if , is drawn from

(35)

where

(36)
(37)

Sample :

(38)

From the above equation, we can see that cannot be sampled from directly. However, we can place an exponential upper bound on the logistic functions of the above equation based on the convex duality theorem [Jordan]. Using this theorem and the fact that the log of a logistic function is concave, we obtain an upper bound on the logistic functions of the form

(39)

where

(40)

are the variational parameters which should be optimized to get the tightest bound.
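For reference, this bound is commonly written in the Jaakkola-Jordan form (stated here in generic notation, not the paper's symbols):

```latex
\sigma(x) \;\ge\; \sigma(\xi)\,
  \exp\!\left\{ \frac{x-\xi}{2} - \lambda(\xi)\left(x^{2}-\xi^{2}\right) \right\},
\qquad
\lambda(\xi) \;=\; \frac{1}{4\xi}\tanh\!\left(\frac{\xi}{2}\right)
  \;=\; \frac{\sigma(\xi)-\tfrac{1}{2}}{2\xi}
```

with equality at x = ±ξ. Because the exponent is quadratic in x, substituting the bound into the Gaussian prior yields a Gaussian proposal, and the bound being exact at ±ξ is what makes re-tightening it around the current sample effective.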
By substituting the above upper bound back into (38), we obtain

(41)

where

(42)
(43)

We use this normal distribution (right-hand side of Eq. 41) as the proposal distribution in a Metropolis-Hastings (M-H) independence chain [Hastings], and accept with probability , where

(44)

Since the proposal distribution should be accurate around the current sample (), we can optimize the variational parameters to make the upper bound (right-hand side of Eq. 39) as tight as possible around the current sample. Hence, by replacing with in the right-hand side of Eq. 39 and setting its derivative with respect to equal to zero, we optimize as

(45)

Sample :

(46)

It can be demonstrated that is drawn from a Normal distribution

(47)

where

(48)
(49)

Sample :

(50)

It can be demonstrated that is drawn from a Normal distribution

(51)

where

(52)
(53)

Sample :

(54)

It can be shown that can be drawn from a Gamma distribution

(55)

where

(56)
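This is the standard conjugate Gamma update for a Gaussian noise precision, which can be sketched as follows (the prior shape/rate names `a0`, `b0` and the residual construction are generic placeholders, not the paper's exact symbols):

```python
import numpy as np

def sample_noise_precision(residual, a0, b0, rng):
    """Conjugate Gibbs update for a Gaussian noise precision.

    With a Gamma(a0, b0) prior (shape/rate) and zero-mean Gaussian
    residuals, the conditional is Gamma(a0 + n/2, b0 + ||residual||^2 / 2).
    """
    n = residual.size
    shape = a0 + 0.5 * n
    rate = b0 + 0.5 * float(residual @ residual)
    return rng.gamma(shape, 1.0 / rate)  # NumPy parameterizes Gamma by scale
```

With the near-zero hyper-parameters used in the experiments, the prior is effectively uninformative and the sampled precision concentrates around the inverse of the empirical residual variance.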

Sample :

(57)

It is easy to show that can be drawn from a Gamma distribution

(58)

where

(59)
(60)
Activity                        Tr.   RVM            TGP            SR             DDL            PM
Throw and catch football        30    26.68 ± 3.94   19.55 ± 2.61   15.68 ± 2.03   18.62 ± 2.84   13.51 ± 1.17
Throw and catch football        60    25.43 ± 3.86   16.14 ± 2.34   13.91 ± 1.85   15.80 ± 2.53   10.47 ± 1.05
Throw and catch football        100   23.64 ± 3.11   10.49 ± 1.47    9.09 ± 1.08   11.13 ± 1.38    9.54 ± 0.91
Throw and catch football        200    8.27 ± 1.73    8.68 ± 0.71    7.43 ± 0.56    8.59 ± 1.02    7.76 ± 0.42
Michael Jackson styled motions  30    19.71 ± 2.17   18.92 ± 2.33   17.38 ± 1.97   19.92 ± 2.87   14.89 ± 1.11
Michael Jackson styled motions  60    16.48 ± 1.88   16.33 ± 1.76   15.72 ± 1.69   17.62 ± 2.44   13.69 ± 0.98
Michael Jackson styled motions  100   13.23 ± 0.99   13.42 ± 0.84   12.34 ± 1.02   12.95 ± 1.39   11.32 ± 0.85
Michael Jackson styled motions  200    8.86 ± 0.73   10.49 ± 0.33    8.69 ± 0.41    8.83 ± 1.04    7.57 ± 0.32
Kick soccer ball                30    13.69 ± 2.24   15.53 ± 2.43   13.05 ± 2.41   14.97 ± 2.74   10.21 ± 1.00
Kick soccer ball                60    11.44 ± 1.68   12.36 ± 1.93   10.94 ± 1.95   12.90 ± 2.03    8.96 ± 0.88
Kick soccer ball                100    8.65 ± 0.87   10.41 ± 1.09    8.26 ± 1.26    9.24 ± 1.29    7.63 ± 0.72
Kick soccer ball                200    6.12 ± 0.43    7.63 ± 0.64    6.23 ± 0.77    6.97 ± 0.91    5.83 ± 0.38
Run                             30    15.71 ± 2.89   15.92 ± 2.16   14.17 ± 2.00   15.75 ± 2.51   10.00 ± 0.97
Run                             60    13.49 ± 2.71   13.89 ± 1.82   10.63 ± 1.42   12.70 ± 1.66    8.95 ± 0.79
Run                             100    9.85 ± 1.53   10.36 ± 0.69    8.96 ± 0.74    9.33 ± 0.88    7.34 ± 0.48
Run                             200    6.44 ± 0.78    7.24 ± 0.33    6.42 ± 0.25    6.75 ± 0.54    5.89 ± 0.34
Walk                            30    16.81 ± 2.63   15.92 ± 2.74   15.34 ± 2.80   17.00 ± 2.63   11.81 ± 1.02
Walk                            60    14.47 ± 2.02   14.31 ± 2.10   13.75 ± 1.92   15.41 ± 2.16   10.32 ± 0.88
Walk                            100   10.44