An Improved Tobit Kalman Filter with Adaptive Censoring Limits


Kostas Loumponias, Nicholas Vretos, George Tsaklidis, Petros Daras

K. Loumponias and G. Tsaklidis are with the Department of Mathematics, Aristotle University of Thessaloniki, GR-54124, Thessaloniki, Greece. N. Vretos and P. Daras are with the Information Technologies Institute, Centre for Research and Technology - Hellas, 6th km Charilaou - Thermi, GR-57001, Thessaloniki, Greece.

This paper deals with the Tobit Kalman filtering (TKF) process when the measurements are correlated and censored. The case of interval censoring, i.e., the case of measurements which belong to some interval with given censoring limits, is considered. Two improvements of the standard TKF process are proposed, in order to estimate the hidden state vectors. Firstly, the exact covariance matrix of the censored measurements is calculated by taking into account the censoring limits. Secondly, the probability of a latent (normally distributed) measurement to belong in or out of the uncensored region is calculated by taking into account the Kalman residual. The designed algorithm is tested using both synthetic and real data sets. The real data set includes human skeleton joints’ coordinates captured by the Microsoft Kinect II sensor. In order to cope with certain real-life situations that cause problems in human skeleton tracking, such as (self)-occlusions, closely interacting persons etc., adaptive censoring limits are used in the proposed TKF process. Experiments show that the proposed method outperforms other filtering processes in minimizing the overall Root Mean Square Error (RMSE) for synthetic and real data sets.


Keywords: Censored data, Adaptive Tobit Kalman filter, Human skeleton tracking.

1 Introduction

Human skeleton motion tracking has been studied for several decades and remains a highly active research field due to its importance in diverse domains such as surveillance, medical applications, serious games, education, high-performance sports monitoring and others [1]-[4]. With the advent of commercial RGB-D sensors [5], [6], human skeleton motion tracking has attracted a lot of attention due to the capacity of these sensors to reliably track skeletal joints. However, despite the significant progress achieved in both sensor development and human skeleton motion tracking research, many applications require more accurate tracking of the human skeleton position and motion. On the sensor side, high-performance systems (such as the Vicon system), which are able to track accurately at high rates, are very expensive and cumbersome to deploy. On the other hand, affordable commercial RGB-D solutions (e.g., the Microsoft Kinect, the Xtion Pro and others) often produce low-quality human skeleton motion tracking due to their inherent problems (low sampling frequency, moderate depth resolution, UV light interference, etc.) and also due to their simplistic setup (usually only one such sensor is deployed, resulting in occluded areas and human self-occlusion).

To overcome these issues and provide an affordable and, at the same time, reliable solution to the human skeleton motion tracking task, research has been steered towards two general categories of methods: methods that exploit multiple RGB-D sensors [6], [7] and methods that use various filters able to improve and smooth the sensors’ measurements [8]-[10]. For the first category, we are confronted with two major flaws: 1) the increase of the cost for monitoring, capturing and processing, and 2) the interferences between devices, which add more noise and restrictions to the problem at hand, thus, making it harder to solve. For the latter, the main drawback is the lack of a framework able to provide reliable estimations of the human skeleton joints.

In this paper, we introduce a new method, which belongs to the second category. We improve human skeleton motion tracking by smoothing the Kinect skeleton joints’ measurements through a novel Kalman-type filtering method adapted to restrictive conditions concerning human skeleton movements. The measurements that we correct and smooth are the 25 skeletal joints of the Kinect V2, which are time series of 3D spatial coordinates in a coordinate system centred at the physical centre of the Kinect’s infrared sensor.

In the literature, various filters have been proposed for smoothing spatial coordinates (or a signal), e.g. the Kalman Filter (KF) [11], [12], the Savitzky-Golay filter (SGF) [13], Particle Filtering [14] and others. One of the most common filters for signal smoothing is KF, under the assumption that the signal’s measurements are normally distributed. However, KF smooths poorly when the noisy signal contains extreme measurements (outliers), since the hypothesis of normally distributed measurements then turns out to be inappropriate. When certain bounds on the denoised signal’s values are known, we can deal with the extreme measurements by providing this information to the KF process. To this end, we introduce the censored normal distribution into the KF estimation procedure [15], [16]. The use of censored probability theory in data filtering was first introduced in [17], where the Tobit Kalman Filter (TKF) was proposed, aiming to estimate an unknown state vector, x, when censored measurements, y, are present. In our previous works [18], [19], TKF was utilized to filter the spatial coordinates of the human skeleton; however, no proofs for the TKF process were provided.

In this paper, we propose a new filter, the so-called Adaptive Tobit Kalman Filter (ATKF), which considers an occluded or self-occluded Kinect’s skeletal joint as a censored measurement. Our work takes advantage of the approaches presented in [17]-[19] and proposes a new proof. The proposed approach results in a more accurate estimation of the probability of a measurement to fall into the censoring region and as a consequence, it leads to a more accurate estimation of the state. The proposed ATKF also adapts its censoring region at each time step by considering previous states. The main contributions of this paper are:

  1. A proof for accurately calculating the covariance matrix of the censored measurements in Tobit Kalman filtering, by incorporating the censoring limits into the equation of censored covariance.

  2. A proof for accurately calculating the probabilities of a latent measurement to belong in or out of the uncensored region, by taking into consideration the Kalman residual.

  3. A new Adaptive Tobit Kalman Filter able to adapt the censoring limits at each time step.

  4. As an application of contributions 1, 2 and 3, a new method that improves human skeleton tracking in real-time applications.

  5. A new evaluation metric for human skeleton motion filtering to measure the performance of a filtering technique, when no ground truth data are available.

The rest of the paper is organised as follows. In Section 2, related works are described, while in Section 3, the proposed Adaptive Tobit Kalman Filter is presented in detail. In Section 4, experimental results are presented, using artificial data as well as real human skeleton motion tracking data. Finally, Section 5 concludes the paper.

2 Related Work

Many approaches exist for filtering and motion tracking of the human skeleton either from images, videos or depth information. We mention only methods that are most relevant to our work (based on data filtering). For a more detailed discussion we refer to the books [20] and [21] for data filtering and human skeleton motion, respectively.

Similar to our work, Microsoft [22] proposed various filters for smoothing human skeleton motion data from Kinect devices. Two of them are the simple and the exponential moving average [23], [13], but no guidance is given on how the time windows and the weights should be chosen, since these are application dependent. Edwards et al. [10] smoothed human skeleton motion data (obtained by a Kinect V2 sensor) using four different filters: 1) the moving average, 2) KF, 3) the Holt double exponential filter [24] and 4) their proposed filter, consisting of a Kalman filter with a Wiener Process Acceleration (WPA) model [25]. Both the averaging filter and KF smoothed well but introduced relatively large amounts of latency, while the other two combined good performance with low latency. Overall, the WPA Kalman filter exhibited the best performance.

Regarding the filtering process per se, the best-known and most well-established filtering method is the Kalman Filter (KF). In order to overcome several drawbacks of KF (mainly due to its linear nature), the Extended Kalman Filter (EKF) was proposed in [26]. Although EKF is not an optimal estimator, unlike its linear counterpart, it has been shown to perform better than KF in smoothing and correcting signals in non-linear problems, as is the case in most real-life problems. However, EKF tends to be unstable in many applications due to its local nature, leading to incorrect smoothing of a signal that exhibits a high degree of non-linearity. To overcome these problems, the Unscented Kalman Filter (UKF) was proposed in [27], [28]. UKF uses a deterministic sampling technique known as the unscented transform [29] to gather a minimal set of points around a local mean. By doing so, it provides better results than EKF when the predict and the update functions are highly non-linear and EKF typically performs poorly. Finally, a very successful method is particle filtering [30], which is a Monte Carlo based filtering method. Although particle filtering is generally very adaptable, it carries a high computational burden, making it practically unsuitable for many real-time applications.

In the area of censored statistics, all the above mentioned methods have their drawbacks. In Allik [17], it is stated that the formulation of a standard KF, as an estimator for censored data, results in a biased estimation of the unknown state. EKF suffers from an undefined Jacobian at the censored region and thus exhibits poor performance. UKF is a less computationally expensive approach; however, it has been shown to be non-robust when the measurements are close to the censored region [17]. Furthermore, while particle filtering is suitable for estimating the state values when the measurements are censored in certain cases, it has a substantial computational cost. Finally, TKF provides unbiased, recursive estimates of the latent state variables in or near the uncensored regions. TKF is completely recursive and computationally inexpensive, making it a strong candidate for real-time applications such as human skeleton motion tracking. Nevertheless, TKF neither takes into account the censored area in calculating the variance of the censored measurements nor adapts the limits of the censored area [31].

Fei Han et al. [32] considered TKF for a class of linear discrete-time systems with random parameters. The elements of the state space matrices are allowed to be random variables in order to reflect reality. Furthermore, they established a novel weighting covariance formula to address the quadratic terms associated with the random matrices. However, their proposed method copes with only one censoring limit.

In the area of human skeleton motion tracking, several methods involving multiple RGB-D sensors have been proposed, which increase the complexity and the cost of the solution, as mentioned before. In [6], Sungphil et al. proposed a new method for human skeleton motion tracking using multiple Kinect V1 sensors. They determined the reliability of each 3D joint’s position by combining multiple observations based on the Kinect measurement confidence (a value gathered from the sensor). They used the variances of the measurement noise in order to identify the contribution of an observation (i.e., a weight) and create a series of fused measurements. Furthermore, they explained how to estimate this variance for each joint through KF. Finally, they presented the average 3D position error of ten activities produced by: 1) their method, 2) a single Kinect and 3) a simple average. In all activities but one (running), their method gave better results than the other methods compared in the paper.

Finally, it is worth mentioning works on activity recognition that use human skeleton motion filtering as an initial step. In [9], [33], a simple SGF is used in order to correct the data. This is achieved through a convolution process that fits successive subsets of adjacent observations with a low-degree polynomial in a least squares sense [34]. Amor et al. [35] dealt with human activity recognition as well, achieving state-of-the-art classification results by using RGB-D sensors. They represented the human body as dynamical skeletons and studied the evolution of the skeletons’ shapes as trajectories on manifolds. They performed median filtering in the temporal dimension in order to de-noise the skeletons’ trajectories before applying their proposed method.

3 Proposed Method

In this section, we briefly describe the censoring data theory and the well-known TKF [17] in order to better highlight the proposed contributions. Then, we demonstrate an alternative approach to the classical TKF, where the update function is generated by taking into account the censoring limits in the measurements covariance matrix calculation, and thus, resulting in a more accurate evaluation of the censored measurements. Finally, we introduce ATKF for human skeleton motion tracking, where the censored region limits (boundaries) are not constant, as is the case in the standard TKF.

3.1 Censored and Truncated Data

Censoring is a condition in which the value of a measurement or observation is only partially known [36]. Censoring occurs when a value falls outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 140 kg. If a 150 kg individual is weighed using that scale, the observer would only know that the individual’s weight is at least 140 kg (partially known). Censoring should not be confused with the related idea of truncation: under censoring, an observation either yields the exact value or the knowledge that the value lies within an interval, whereas under truncation only observations in a given range are considered and all others are ignored. Different types of censoring exist [37]:

  • Left censoring: a data point is below a certain value but it is unknown by how much.

  • Interval censoring: a data point is somewhere on an interval between two values.

  • Right censoring: a data point is above a certain value but it is unknown by how much.

  • Type I censoring: occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored.

  • Type II censoring: occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored.

  • Random (or non-informative) censoring: each subject has a censoring time that is statistically independent of its failure time.
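The difference between censoring and truncation can be illustrated with the bathroom-scale example above. The following is a minimal sketch; the function names `censor` and `truncate` and the sample weights are hypothetical and chosen only for illustration.

```python
def censor(value, lower=None, upper=None):
    """Return the recorded value when readings saturate at the given limits."""
    if lower is not None and value <= lower:
        return lower  # left-censored: we only know the value is at most `lower`
    if upper is not None and value >= upper:
        return upper  # right-censored: we only know the value is at least `upper`
    return value      # uncensored: the exact value is observed


def truncate(values, lower=None, upper=None):
    """Keep only the observations that fall inside the range; drop the rest."""
    kept = []
    for v in values:
        if lower is not None and v < lower:
            continue
        if upper is not None and v > upper:
            continue
        kept.append(v)
    return kept


# Hypothetical weights (kg) on a scale that saturates at 140 kg.
SCALE_MAX_KG = 140.0
weights_kg = [72.0, 150.0, 95.5, 141.2]

censored = [censor(w, upper=SCALE_MAX_KG) for w in weights_kg]
truncated = truncate(weights_kg, upper=SCALE_MAX_KG)

print(censored)   # every subject is still present, heavy ones read 140.0
print(truncated)  # heavy subjects are missing altogether
```

Censoring keeps the sample size and records the limit, while truncation silently loses the out-of-range observations.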

In real-life problems, censored data are very frequent and to the best of our knowledge the concept of censoring in human skeleton motion tracking has never been used before.

3.2 Tobit Kalman Filters

Tobit Kalman filters [17], [31], [38], provide a classification scheme for all aforementioned types of censoring. In the case of scalar measurements, the Tobit model is called censored regression model and is characterised by the stochastic difference non-linear equation


where , stand for the censored measurement and the latent variable, respectively, is a multiplicative scalar and , are the lower and the upper limits of the uncensored region, respectively. The random variable is drawn from a Gaussian distribution with mean and variance . From (1), it is obvious that the TKF process is a non-linear one, since when the latent measurement falls outside the uncensored region, the censored measurement does not depend on the variable .
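For orientation, the scalar censored regression model described in this paragraph is commonly written in the following form. The symbols $y_t$ (censored measurement), $y_t^*$ (latent variable), $\beta$ (multiplicative scalar), $x_t$, $u_t$, $a$, $b$ and $\sigma^2$ are assumed notation, and the zero mean of the noise is also an assumption; the authors' own symbols may differ.

```latex
y_t =
\begin{cases}
a,      & y_t^* \le a,\\
y_t^*,  & a < y_t^* < b,\\
b,      & y_t^* \ge b,
\end{cases}
\qquad
y_t^* = \beta x_t + u_t, \quad u_t \sim N(0, \sigma^2).
```

The first and third branches make the non-linearity explicit: once $y_t^*$ leaves $(a, b)$, the recorded value is a constant and carries no further information about $x_t$ beyond the fact that a limit was reached.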

As has been already stated, KF does not provide optimal or unbiased estimates for the states when the measurements are in the censored region. This is due to the fact that the assumptions of KF [39] are not met when the noise measurements are censored.

The scalar case can be easily extended to the general case TKF, which is defined as,


where stands for the discrete time step and and are random vector variables following and , respectively, where denotes the normal distribution with mean and covariance matrix . A and H are the transition and the observation matrices, respectively, while are the saturated observations (that are Left and Right censoring at the same time), and the latent observations, respectively. Finally, designates the dimensionality of the process (which is three in the case of 3D human skeleton motion data). The predict and the update functions of TKF for saturated measurements are described in detail in [17].
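A compact way to write such a saturated linear state space model is sketched below. The matrices A, H and the noise covariances Q, R follow the names used in the text; the remaining symbols ($x_t$, $y_t$, $y_t^*$, $w_t$, $v_t$, the limit vectors $a$, $b$ and the dimension $m$) and the zero-mean noise are assumed notation.

```latex
\begin{aligned}
x_t     &= A x_{t-1} + w_t, & w_t &\sim N(0, Q),\\
y_t^*   &= H x_t + v_t,     & v_t &\sim N(0, R),\\
y_{t,i} &= \min\{\max\{y_{t,i}^*,\, a_i\},\, b_i\}, & i &= 1, \dots, m.
\end{aligned}
```

The last line is the componentwise saturation: each latent component is reported exactly inside $(a_i, b_i)$ and replaced by the nearest limit otherwise.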

3.3 Censored Moments

In this section, we calculate the first and second moments of a censored (not truncated) measurement, as well as its covariance. For that purpose, the following Proposition is needed [40]:

Proposition 1.

If the random variable follows a normal distribution with density function , mean value and non-singular covariance matrix then, the expected values of and given that , , are:


The functions and are given by:


where and
Next, the following Lemma is provided in order to calculate the censored moments.

Lemma 1.

Let be a continuous random variable on a probability space and a discrete random variable with outcomes . Then, the expected value of the joint probability function can be given by


We have that

Now the following Proposition can be proved (see Appendix A) using Lemma 1 and Proposition 1:

Proposition 2.

The mean value of the censored variable with censoring limits and (1), depends only on the censoring limits and can be written as:
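As a numerical illustration, the mean of a normal variable saturated at two limits has a well-known closed form, which the sketch below evaluates and checks against simulation. This is the textbook formula for a scalar censored normal variable, not a transcription of the paper's equation; the function names and the example parameters are hypothetical.

```python
import math
import random


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_mean(mu, sigma, a, b):
    """Mean of y = min(max(y*, a), b) where y* ~ N(mu, sigma^2)."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    p_low = norm_cdf(alpha)        # P(y* <= a): censored from below
    p_up = 1.0 - norm_cdf(beta)    # P(y* >= b): censored from above
    p_un = 1.0 - p_low - p_up      # P(a < y* < b): uncensored
    # p_un * E[y* | a < y* < b] = p_un * mu + sigma * (pdf(alpha) - pdf(beta))
    inside = p_un * mu + sigma * (norm_pdf(alpha) - norm_pdf(beta))
    return a * p_low + b * p_up + inside


def monte_carlo_mean(mu, sigma, a, b, n=200_000, seed=0):
    """Simulation-based estimate of the same mean, used as a sanity check."""
    rng = random.Random(seed)
    return sum(min(max(rng.gauss(mu, sigma), a), b) for _ in range(n)) / n


# Hypothetical parameters: latent N(1, 2^2) saturated to [0, 3].
print(censored_mean(1.0, 2.0, 0.0, 3.0))
print(monte_carlo_mean(1.0, 2.0, 0.0, 3.0))
```

The three terms correspond to the lower limit, the upper limit and the uncensored region, each weighted by its probability.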


Furthermore, it can be proved (see Appendix B) that:

Proposition 3.

The variance and the joint mean value of the censored variable (1) and , respectively, depend only on the censoring limits and , respectively, and are given by:




The probabilities and are defined as follows:

The truncated expected values in (9) are calculated in Appendix B. Hence, the covariance matrix of the censored variable y can be calculated by (7)-(9).
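For a single component, the censored variance can be computed in closed form from the standard truncated-normal moments, which shows how the censoring limits enter the second moment. The sketch below covers only the scalar (diagonal) case and does not attempt the off-diagonal terms; function names and parameters are hypothetical.

```python
import math
import random


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_moments(mu, sigma, a, b):
    """Mean and variance of y = min(max(y*, a), b) where y* ~ N(mu, sigma^2)."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    pdf_a, pdf_b = norm_pdf(alpha), norm_pdf(beta)
    p_low = norm_cdf(alpha)
    p_up = 1.0 - norm_cdf(beta)
    p_un = 1.0 - p_low - p_up
    mean = a * p_low + b * p_up + p_un * mu + sigma * (pdf_a - pdf_b)
    # p_un * E[y*^2 | a < y* < b], from the truncated-normal mean and variance
    inside_sq = (p_un * (mu * mu + sigma * sigma)
                 + sigma * sigma * (alpha * pdf_a - beta * pdf_b)
                 + 2.0 * mu * sigma * (pdf_a - pdf_b))
    # The limits enter the second moment through a^2 and b^2.
    second = a * a * p_low + b * b * p_up + inside_sq
    return mean, second - mean * mean


def monte_carlo_moments(mu, sigma, a, b, n=200_000, seed=1):
    """Simulation-based mean and variance, used as a sanity check."""
    rng = random.Random(seed)
    ys = [min(max(rng.gauss(mu, sigma), a), b) for _ in range(n)]
    m = sum(ys) / n
    return m, sum((y - m) ** 2 for y in ys) / (n - 1)


print(censored_moments(1.0, 2.0, 0.0, 3.0))
print(monte_carlo_moments(1.0, 2.0, 0.0, 3.0))
```

With very wide limits the function returns the latent mean and variance, and with tight limits the variance shrinks well below the latent one.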

3.4 Corrected Tobit Kalman Filter

In this paper as in [32],[41], we calculate the a posteriori estimation, , as a linear combination of the a priori estimation, , and the censored measurement . Although these estimations are not optimal, it is proved that they minimize the trace of state error covariance [42]. Next, we provide the predict and update function of the proposed .
The Predict function:


The Update function:


The predict function is the same as in case of standard KF [11], since, the censored measurements are not used in this stage. Matrix has been calculated in [41] and takes the form


where is a diagonal matrix, and its entries are the probabilities of a measurement to be uncensored, at time step . More specifically, the diagonal element of , is the probability that a latent measurement belongs to the uncensored region. Furthermore, we denote by , the diagonal matrices, where its entries are the probabilities of a measurement to be censored from below or above, respectively, at time step . It is proved (see Appendix C) that:


where stands for the cumulative function of . In [17], and are calculated as (we denoted them with to not confuse them with the proposed)


where , and . We notice that the information from the Kalman residual, , is omitted. In our case (see Appendix C) these amounts are as follows:


In (23) and (24), as opposed to [17], we have incorporated in the denominator the term , which consequently, adds information into (23) and (24), concerning the Kalman residual. By doing so, the probability of a measurement to belong to the uncensored region is estimated more accurately.
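The effect of the denominator can be illustrated numerically for one scalar component. The sketch below assumes that the extra term is the a priori state variance projected into measurement space, so that the standard deviation is the square root of h²·P⁻ + R instead of the square root of R alone; this reading, the function name and all numbers are assumptions rather than the paper's exact expressions.

```python
import math


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def region_probabilities(pred_mean, noise_var, a, b, prior_var=0.0, h=1.0):
    """Return (p_low, p_un, p_up) for a latent measurement with mean pred_mean.

    prior_var = 0 uses the measurement noise only; prior_var > 0 also
    includes the uncertainty of the a priori state estimate (assumed form).
    """
    sd = math.sqrt(h * h * prior_var + noise_var)
    p_low = norm_cdf((a - pred_mean) / sd)       # censored from below
    p_up = 1.0 - norm_cdf((b - pred_mean) / sd)  # censored from above
    return p_low, 1.0 - p_low - p_up, p_up


# Hypothetical numbers: predicted measurement 0.5, limits [0, 1], R = 0.01.
noise_only = region_probabilities(0.5, 0.01, 0.0, 1.0)
with_prior = region_probabilities(0.5, 0.01, 0.0, 1.0, prior_var=0.09)
print(noise_only)
print(with_prior)
```

When the a priori estimate is uncertain, the wider predictive distribution assigns visibly more probability to the censored regions.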

The mean vector of the censored measurement given the previous censored measurement , can be written (in matrix notation) using (7) as:


The covariance matrix, , of the censored measurement, , given the last censored measurement, , is calculated via Proposition 3. In particular, the diagonal elements, , of are calculated as (8), where the mean vector, , and covariance matrix, , in our proposed model are equal with and , respectively, and the probabilities for are given in . In the same way, the off-diagonal elements, , of are calculated as (9).
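To make the flow of one filter iteration concrete, the following scalar sketch combines a standard Kalman prediction with a censoring-aware update. It is not the authors' implementation: it assumes a single state and measurement, uses the generic linear minimum-variance update with cross-covariance P⁻·h·p_un, and all names and numbers are hypothetical.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_step(x_prev, p_prev, y, a_coef, h, q, r, lo, hi):
    """One predict/update cycle for a scalar state with a saturated measurement."""
    # Predict: identical to the standard Kalman filter.
    x_pred = a_coef * x_prev
    p_pred = a_coef * p_prev * a_coef + q

    # Predictive distribution of the latent measurement y* = h x + v.
    mu = h * x_pred
    sd = math.sqrt(h * p_pred * h + r)
    al, be = (lo - mu) / sd, (hi - mu) / sd
    pdf_a, pdf_b = norm_pdf(al), norm_pdf(be)
    p_low = norm_cdf(al)
    p_up = 1.0 - norm_cdf(be)
    p_un = 1.0 - p_low - p_up

    # Mean and variance of the censored measurement (limits included).
    y_mean = lo * p_low + hi * p_up + p_un * mu + sd * (pdf_a - pdf_b)
    y_sq = (lo * lo * p_low + hi * hi * p_up + p_un * (mu * mu + sd * sd)
            + sd * sd * (al * pdf_a - be * pdf_b) + 2.0 * mu * sd * (pdf_a - pdf_b))
    y_var = y_sq - y_mean * y_mean
    if y_var <= 1e-12:           # measurement carries no information
        return x_pred, p_pred

    # Update: linear combination of the prediction and the censored residual.
    r_xy = p_pred * h * p_un     # Cov(x, y) for the saturated measurement
    gain = r_xy / y_var
    x_new = x_pred + gain * (y - y_mean)
    p_new = p_pred - gain * r_xy
    return x_new, p_new


# Hypothetical saturated reading: the sensor reports the upper limit 1.0.
print(tobit_step(0.0, 1.0, 1.0, 1.0, 1.0, 0.01, 0.1, -1.0, 1.0))
```

With very wide limits the uncensored probability tends to one and the step reduces to the ordinary Kalman update, which is a convenient check on the formulas.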

In what follows we refer to the filter described through (10)-(16) as the corrected TKF and to the filter described in [17], [41] as the standard TKF. In [41], the covariance matrix, , of the censored measurement , is given by


where is a diagonal matrix, where its entries are the truncated variances of for (2).

The main difference between (13) and (26) is that in (26) the limits and appear only in the matrices and . We notice that if and is big enough (that is, only non-negative measurements are considered), then (26) provides a satisfactory approximation of the covariance matrix of the censored measurements. In order to clarify the notation and illustrate the difference between and , we provide an illustrative example as follows: we examine the censored covariance matrix for the random multidimensional with censoring limits and . We define the mean vector, , and the covariance matrix to be equal with,


while, without loss of generality, we define H and to be equal with the identity matrix. Then, we proceed as follows: 1) we produce random measurements from 100 times. 2) Each time, we calculate the sample covariance matrix derived from the censored measurements. 3) We calculate the arithmetic mean, , of the 100 sample covariance matrices. 4) The covariance matrices and are calculated by (13) and (26), respectively. As can be seen from (27)-(29), the proposed covariance matrix, , is almost identical to the sample covariance matrix, .
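Steps 1) to 3) of this procedure can be sketched as follows for a two-dimensional case. The mean vector, covariance matrix, censoring limits and sample size used here are hypothetical placeholders, not the values of the paper's example.

```python
import math
import random


def sample_censored(rng, mean, cov, lo, hi, n):
    """Draw n points from a 2D normal and saturate each component to [lo, hi]."""
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[0][1] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        latent = (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)
        points.append(tuple(min(max(v, l), u) for v, l, u in zip(latent, lo, hi)))
    return points


def sample_covariance(points):
    """Unbiased 2x2 sample covariance matrix."""
    n = len(points)
    m = [sum(p[i] for p in points) / n for i in range(2)]
    return [[sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / (n - 1)
             for j in range(2)] for i in range(2)]


def mean_sample_covariance(mean, cov, lo, hi, n=500, repeats=100, seed=0):
    """Average the censored sample covariance over independent repetitions."""
    rng = random.Random(seed)
    total = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(repeats):
        c = sample_covariance(sample_censored(rng, mean, cov, lo, hi, n))
        for i in range(2):
            for j in range(2):
                total[i][j] += c[i][j] / repeats
    return total


# Hypothetical setting: unit variances, correlation 0.5, limits [-1, 1].
avg_cov = mean_sample_covariance((0.0, 0.0), [[1.0, 0.5], [0.5, 1.0]],
                                 (-1.0, -1.0), (1.0, 1.0))
print(avg_cov)
```

The averaged matrix is the empirical reference against which an analytic censored covariance can be compared; its diagonal is markedly smaller than the latent variances because the tails are folded onto the limits.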


The marginal probability function, , of the component of the censored measurement given the last measurement, is,


where and are the probability density and the cumulative distribution functions of the standard normal distribution, respectively, is the Kronecker delta function and stands for the Heaviside function, where , when and , otherwise.

The next step in our procedure is to calculate the likelihood function by taking into consideration the censored data distribution. The likelihood function for the censored measurements by (30), (19) and (20) can be calculated as:


In the case that the components of are mutually independent, the likelihood function of the censored measurements takes the form:


In the case of [17], the likelihood function becomes


Note that the denominator does not take into account the specific distribution of the measurements.
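For mutually independent components, a censored likelihood of the classical Tobit type can be evaluated as in the sketch below: uncensored components contribute a normal density and censored components contribute the probability mass beyond the corresponding limit. This is the textbook form and only an assumption about the structure of the paper's expression; `mu` and `sd` stand for the predictive mean and standard deviation of each latent component, and all names are hypothetical.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_log_likelihood(y, mu, sd, lo, hi):
    """Log-likelihood of saturated measurements with independent components."""
    tiny = 1e-300  # guards against log(0) for extremely unlikely events
    total = 0.0
    for yi, mi, si, ai, bi in zip(y, mu, sd, lo, hi):
        if yi <= ai:    # censored from below: probability mass at the limit
            total += math.log(max(norm_cdf((ai - mi) / si), tiny))
        elif yi >= bi:  # censored from above
            total += math.log(max(1.0 - norm_cdf((bi - mi) / si), tiny))
        else:           # uncensored: ordinary normal density
            total += math.log(max(norm_pdf((yi - mi) / si) / si, tiny))
    return total


# Hypothetical 3D measurement whose second component sits at the upper limit.
print(censored_log_likelihood([0.1, 1.0, -0.2], [0.0, 0.8, 0.0],
                              [0.1, 0.1, 0.1], [-1.0] * 3, [1.0] * 3))
```

Summing this quantity over time and maximising it with respect to an unknown noise covariance is one common way to tune such a filter.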

3.5 Adaptive Tobit Kalman Filter for Human Skeleton Tracking

In what follows, we use the Microsoft Kinect V2 sensor to record 3D point sequences (human skeletons) of a human in motion [43]. In human skeleton tracking, the body is represented by a number of joints (25 in total), corresponding to different body parts such as head, neck, shoulders, etc (see Fig. 1 [44]). Each joint is represented by the vector of its Euclidean 3D space coordinates and our aim is to denoise the measurements for every joint in order to improve the representation of human movements. Thus, we denoise each one of the joints’ coordinates separately; the input is the vector of the joints’ coordinates, (latent measurement), and the output is the vector of the denoised states coordinates, .

Figure 1: Human skeleton’s joints map of the Kinect V2 sensor.

To start tracking, we define the initial observation and the transition matrices to be equal to the identity matrix. Therefore, we define the covariance matrix of the noise measurement, to be


We chose to initialize R in that way, under the assumption that Kinect exhibits significant errors in human skeleton tracking. To support our claim, we conduct small scale experiments proving that even if a person is at rest and in front of the Kinect, the error in the displacement estimation between measurement and ground truth data is almost 0.02 meters [45], [46], thus a variance of 0.01 m seems to be a valid choice.

In KF [8], [10], no restrictions in joints’ movements have been taken into account, as opposed to the proposed method. To that end, in our experiments we have used, beyond the Kinect V2 sensor, the state-of-the-art Vicon tracking system as a ground truth reference. In Vicon data, for various recordings, we observe that the velocity of the spatial coordinates and did not exceed 34 cm per frame, for every joint, and the coordinate did not exceed 18 cm per frame. In what follows we will use these restrictions in order to correct the data produced by the Kinect sensor. By applying these restrictions we constructed ATKF with limits and for the vector of the spatial coordinates, , as follows:




where the observation matrix, H, is the identity matrix for smoothing approaches, and are the limits of ATKF at time , which depend on the previous estimation of spatial coordinates, , and the vector c, which for human skeleton tracking is experimentally found to be


Thus, for the latent measurement at time we get


This model corrects Kinect measurements, when they have high abnormal velocity. It should be noted that, if and (i.e, the range of ATKF limits becomes very large), ATKF tends to the standard KF, because in this case the Kinect measurements belong to the uncensored region and consequently they are known. Due to this fact, we expect in some recordings, which do not include big or fast joints’ movements (thus, the Kinect measurements always belong to the uncensored region), to get almost the same results concerning RMSE for KF as well as for ATKF.
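The adaptive limits can be sketched as a window centred on the previous estimate. The code below assumes limits of the form previous estimate minus or plus c, applied per coordinate; the vector `C_METERS` is a placeholder loosely based on the Vicon velocity bounds quoted above (with an assumed coordinate ordering), not the experimentally determined vector of the paper.

```python
# Placeholder per-coordinate half-widths in metres (hypothetical values).
C_METERS = (0.34, 0.34, 0.18)


def adaptive_limits(prev_estimate, c=C_METERS):
    """Censoring limits centred on the previous estimate of a joint."""
    lower = [x - ci for x, ci in zip(prev_estimate, c)]
    upper = [x + ci for x, ci in zip(prev_estimate, c)]
    return lower, upper


def censor_measurement(measurement, lower, upper):
    """Saturate each coordinate of a raw measurement at the adaptive limits."""
    return [min(max(z, l), u) for z, l, u in zip(measurement, lower, upper)]


# Hypothetical head joint: the raw second coordinate suddenly drops by 0.8 m.
prev = [0.0, 1.0, 2.0]
raw = [0.1, 0.2, 2.05]
lower, upper = adaptive_limits(prev)
print(censor_measurement(raw, lower, upper))
```

A measurement implying an implausibly fast jump is treated as censored at the nearest limit, while ordinary measurements pass through unchanged; very large entries in c leave every measurement uncensored.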

In order to create a general model for smoothing Kinect V2 measurements without having to estimate the matrix for every time-window (because this is time consuming), we assume that this matrix is constant. Substituting for R in the likelihood function (31), the covariance matrix of the noise process, Q, can be estimated. By experimenting on various joints’ movements, it is derived that the values of Q are smaller than those of matrix R and generally they depend on the speed of the human skeleton’s joints. Regarding slow joints’ movements, the entries of Q are smaller than and for faster joints’ movements they lie between and . We notice that in some cases, where the entries of Q appeared to be quite large (in the order of ), the human skeleton moved too quickly in an abnormal manner due to occlusions and/or self-occlusions. Thereafter, we assume that the covariance matrix of the noise process is


otherwise, if we define smaller or larger values, ATKF will be either over-smoothed or will not denoise the Kinect measurements. Therefore, the matrix Q given in (39), seems to be an appropriate choice for smoothing the Kinect V2 sensor measurements of human skeleton tracking.

4 Experiments

In this section, we conduct three sets of experiments to evaluate the corrected TKF and ATKF against other methods. We use 1) the standard TKF and 2) the corrected TKF in the first experimental set (oscillator), which is employed in [41]. Next, we use 1) SGF, 2) KF, 3) the standard TKF, 4) the corrected TKF and 5) ATKF in order to smooth data for two different experimental sets: a) real-life data captured by a Kinect sensor, b) real-life data captured by both a Kinect sensor and a Vicon system.

4.1 Oscillator

In the first experimental set, we present a motivating example of tracking a sinusoidal model with the standard TKF and the corrected TKF, when the measurements are saturated. Let the state space equations have the form of (2), with state space matrices




where and . The disturbance is assumed to be normally distributed, i.e. , where


while, the measurement noise, , is normally distributed, . The initial state vector is equal to with covariance matrix , the censored limits are and . Therefore, by the above example we produce censored (saturated) measurements, , where .

Next, we repeat the above process 100 times and we calculate the filters’ RMSEs for each iteration. The means of the filters’ RMSEs over the 100 iterations are presented in Table 1, where we provide separate RMSEs for the two estimated coordinates of the state vector. It can be observed that the corrected TKF outperforms the standard TKF in state estimation (Fig. 2). This is due to the fact that in the standard TKF some important terms are ignored when calculating (26), while these terms are included in the corrected TKF process (13).
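An experiment of this kind can be set up along the following lines. The sketch assumes a scaled-rotation transition matrix (a common way to obtain a sinusoidal state), observation of the first state coordinate only, and placeholder values for the frequency, noise levels, limits and initial state; none of these are taken from the paper.

```python
import math
import random


def simulate_oscillator(steps=200, omega=0.1, alpha=1.0, q_sd=0.05, r_sd=0.1,
                        lo=-1.0, hi=1.0, x0=(5.0, 0.0), seed=0):
    """Simulate a 2D rotating state and saturated measurements of its first coordinate."""
    rng = random.Random(seed)
    c, s = alpha * math.cos(omega), alpha * math.sin(omega)
    x1, x2 = x0
    states, measurements = [], []
    for _ in range(steps):
        # State transition: scaled rotation plus Gaussian disturbance.
        x1, x2 = (c * x1 - s * x2 + rng.gauss(0.0, q_sd),
                  s * x1 + c * x2 + rng.gauss(0.0, q_sd))
        latent = x1 + rng.gauss(0.0, r_sd)             # latent measurement
        measurements.append(min(max(latent, lo), hi))  # saturated measurement
        states.append((x1, x2))
    return states, measurements


def rmse(estimates, truth):
    """Root mean square error between two equally long sequences."""
    n = len(truth)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / n)


states, ys = simulate_oscillator()
# Baseline: using the saturated measurements directly as state estimates.
print(rmse(ys, [x1 for x1, _ in states]))
```

Running a filter on `ys`, computing `rmse` per state coordinate and averaging over repeated runs with different seeds gives the kind of summary reported in Table 1.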

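| Filter | Mean RMSE (1st state coordinate) | Mean RMSE (2nd state coordinate) |
|---|---|---|
| TKF (standard) | 0.4434 | 0.5464 |
| TKF (corrected) | 0.4066 | 0.5192 |

Table 1: The mean of RMSEs for the standard TKF and the corrected TKF, respectively.

Figure 2: The difference between the RMSEs of the two TKF variants for each iteration.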

4.2 Recordings by the Kinect Sensor

In the second experiment set, we record various human movements by a single Kinect V2 sensor. In some of the recordings, the human skeleton motion exhibits an important error on the axis (practically, the human skeleton seems to "fall down") for one or two frames. We apply the above mentioned filters to correct this specific error.

In order to evaluate the performance of the different filters, we propose a novel metric , to better examine the result of smoothing the joints’ movements. Let us denote by the filtering of the component of the measurement at time . Then,


where and is the number of measurements.

In the case of the standard TKF and the corrected TKF we use the device limits. For instance, the ranges of Kinect spatial coordinates and (depth) are approximately , (if the Kinect V2 sensor is located over the ground) and , respectively. Thus, we can use these limits for the Kinect measurements in order to test both TKF variants. The covariance matrices of the standard TKF and the corrected TKF for the noise measurement, R, are defined as in ATKF (34), while the covariance matrices for the noise process, Q, can be estimated using the likelihood functions (33) and (32), respectively. By experimenting on various joints’ movements, we find that the entries of Q are the same as in the case of ATKF; therefore, we can use the same matrix Q given by (39). In the case of KF, the covariance matrix, R, is defined as in ATKF (34) and the covariance matrix for the noise process, Q, is estimated by the log-likelihood function given in [47]. The results showed (in the same experiments as mentioned before) that the entries of Q are almost the same as in the case of ATKF; thus, the matrix Q is defined as in (39).

In our experiments we take the overall average of the metrics for various recordings. The results showed that ATKF achieves better performance in noise reduction than the other filters (see Table 2), especially in the cases where the skeleton seems to collapse, while KF, the standard TKF and the corrected TKF have almost the same overall average and SGF performs poorly. As can be seen in Fig. 3 for two different experiments, the head’s spatial coordinates of the human skeleton resulting from ATKF (correctly) do not follow the error produced by the Kinect sensor. It can be seen (Fig. 3) that although KF, the standard TKF and the corrected TKF improve the human skeleton motion, they provide inferior results to the ones produced by ATKF, while SGF has the worst performance of all. In the first experiment illustrated in Fig. 3a, the ATKF skeleton followed the sharp "fall" for almost 5 cm, while the KF, standard TKF and corrected TKF skeletons followed it for 12 cm, and the SGF skeleton for 20 cm. The joint based average as opposed to the overall experiments average of ATKF in this experiment is , while in KF, TKF and TKF is and in SGF is .

Filter Overall Average
Kinect V2
Table 2: The overall average of the recordings for the Kinect V2 sensor and the filters.
Figure 3: The head’s spatial coordinates of Kinect V2 sensor, Savitzky-Golay, KF, TKF, TKF and ATKF.
Figure 4: The right hand’s coordinates by Kinect V2 sensor, SGF, KF, TKF, TKF, ATKF and Ground truth.

To better illustrate the superiority of ATKF, the first row of subfigures in Fig. 5 shows the motion of the human skeleton (obtained by Kinect) under heavy occlusion for four consecutive frames. The first subfigure shows the human skeleton one frame before "collapsing", the next two show the human skeleton under heavy occlusion and the last one shows the skeleton after it has partially recovered. The next five rows of Fig. 5 show the motion of the human skeleton as produced by SGF, KF, TKF, TKF and ATKF, respectively. All filters had a delay of 1-2 frames due to the occluded area, but ATKF clearly outperforms all other methods (see the last row in Fig. 5).

Delay: 92 frames
Angles         Kin. v2   SGF     KF      TKF     TKF     ATKF
Right Elbow    39.31     37.44   36.60   36.60   36.60   36.32
Left Elbow     31.58     30.65   27.98   27.98   27.98   26.50
Right Knee     16.70     16.79   15.79   15.79   15.79   14.90
Left Knee      26.25     25.81   25.14   25.14   25.14   25.11

Delay: 93 frames
Angles         Kin. v2   SGF     KF      TKF     TKF     ATKF
Right Elbow    38.76     36.86   35.90   35.90   35.90   35.57
Left Elbow     32.18     31.27   28.43   28.43   28.43   27.02
Right Knee     17.03     17.12   15.75   15.75   15.75   14.93
Left Knee      26.38     26.01   24.85   24.85   24.85   24.82

Delay: 94 frames
Angles         Kin. v2   SGF     KF      TKF     TKF     ATKF
Right Elbow    38.43     36.63   35.40   35.40   35.40   35.06
Left Elbow     32.99     32.09   29.08   29.08   29.08   27.75
Right Knee     17.77     17.79   16.04   16.04   16.04   15.26
Left Knee      26.67     26.46   24.90   24.90   24.90   24.89

Delay: 95 frames
Angles         Kin. v2   SGF     KF      TKF     TKF     ATKF
Right Elbow    38.39     36.64   35.25   35.25   35.25   34.93
Left Elbow     33.96     33.06   29.92   29.92   29.92   28.70
Right Knee     18.78     18.78   16.58   16.58   16.58   15.77
Left Knee      27.14     27.02   25.24   25.24   25.24   25.23

Table 3: RMSEs for the angles by Kinect V2, SGF, KF, TKF, TKF and ATKF for time delays of 92, 93, 94 and 95 frames.

4.3 Recording by Kinect Sensor and Vicon System

In this subsection, we evaluate the proposed method with respect to ground truth data. To that end, we monitor an athlete throwing a ball with his right hand, and we record this motion with a Kinect V2 sensor and the Vicon system at the same time. We use Vicon as the ground truth against which the results of the proposed method on the Kinect measurements are compared. The Kinect and Vicon recordings contain 266 frames (almost 8.8667 sec.) and 139 frames (4.4480 sec.), respectively. We note that the Kinect time-stamp is almost 0.033 sec. per frame, while the Vicon time-stamp is constantly 0.032 sec. We interpolate the Vicon data in order to deal with the time-stamp problem; after interpolation, the new Vicon data include 133 frames. Next, we temporally synchronize the two sensors so that they start together. To do so, we first calculate the angles of the knees and elbows from the Kinect and Vicon data and then calculate the RMSE between these angles for different delays. The results show that the minimum RMSE for every angle occurs for delays of 92-95 frames. The fact that the best delay differs slightly between angles is to be expected, because Kinect records fast movements with a delay (i.e., after some frames).
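The synchronization procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the helper names (`joint_angle`, `resample_linear`, `best_delay`) are hypothetical, linear interpolation is assumed for the resampling step, and a Kinect period of 1/30 sec. is assumed for the "almost 0.033 sec." time-stamp.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3-D points a, b, c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def resample_linear(values, dt_src, dt_dst):
    """Linearly interpolate a uniformly sampled series onto a new period."""
    duration = (len(values) - 1) * dt_src
    n_dst = int(duration / dt_dst) + 1
    out = []
    for k in range(n_dst):
        t = k * dt_dst / dt_src           # position in source-sample units
        i = min(int(t), len(values) - 2)  # left neighbour index
        w = t - i                         # interpolation weight
        out.append((1.0 - w) * values[i] + w * values[i + 1])
    return out


def rmse(x, y):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / len(x))


def best_delay(kinect_angles, vicon_angles, max_delay):
    """Delay d (in frames) minimising the RMSE between
    kinect_angles[d : d + len(vicon_angles)] and vicon_angles."""
    best = None
    n = len(vicon_angles)
    for d in range(max_delay + 1):
        segment = kinect_angles[d:d + n]
        if len(segment) < n:
            break  # Vicon window no longer fits inside the Kinect recording
        err = rmse(segment, vicon_angles)
        if best is None or err < best[1]:
            best = (d, err)
    return best
```

Under these assumptions, resampling 139 frames recorded at 0.032 sec. onto a 1/30 sec. grid yields 133 samples, which agrees with the frame count reported above. In practice `best_delay` would be run once per angle (each knee and elbow), which is how slightly different optimal delays per angle can arise.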

We notice that KF smooths the spatial coordinates without affecting the movement (see Fig. 4). TKF and TKF perform exactly the same smoothing in all joints as KF, while SGF does not smooth satisfactorily at some points where the measurements have a significant error. Table 3 reports the RMSEs for the angles for delays of 92, 93, 94 and 95 frames, respectively. In all cases, the RMSEs are fairly large because some joints were occluded during the recording.

In Fig. 4 the right hand's coordinates produced by KF, TKF, TKF and ATKF are almost the same, because all measurements belong to the uncensored region, while the SGF coordinates are almost the same as Kinect's coordinates. However, as can be seen in Table 3, ATKF yields lower RMSEs in all cases than standard KF, TKF and TKF. The RMSEs of SGF are almost the same as the Kinect RMSEs.

5 Conclusion and Discussion

The aim of this paper was to improve 1) the well-known TKF process [17] and 2) human skeleton motion tracking using a single Kinect V2 sensor, which often generates noisy measurements due to occlusion, lighting conditions, etc. To that end, we proposed a novel filtering method, called ATKF, which relies on the theory of censored data statistics for human skeleton motion tracking in real time. In order to estimate the hidden state vector from the censored measurement, we first evaluated the probabilities of a latent measurement belonging in or out of the uncensored region (Appendix C) and then evaluated the exact covariance matrix of the censored normal distribution (Appendix B). In this approach, we had to define the limits of the uncensored region for the Kinect measurements in a reasonable manner for every time step. To do so, we tested many data sets with various joint movements, which were obtained by a ground truth sensor such as the Vicon tracking system.
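For intuition, the two quantities mentioned above can be illustrated in the scalar case. The sketch below uses the standard textbook formulas for an interval-censored normal variable; it is not the multivariate derivation of Appendices B and C, it does not include the Kalman residual correction, and the function name `censored_normal_stats` is hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_normal_stats(mu, sigma, a, b):
    """Region probabilities, mean and variance of y = min(max(y*, a), b),
    where the latent y* is N(mu, sigma^2) and a < b are the censoring limits."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    p_low = norm_cdf(alpha)         # P(y* <= a)
    p_high = 1.0 - norm_cdf(beta)   # P(y* >= b)
    p_un = 1.0 - p_low - p_high     # P(a < y* < b)
    d_pdf = norm_pdf(alpha) - norm_pdf(beta)

    mean = a * p_low + b * p_high + mu * p_un + sigma * d_pdf
    # Contribution of the uncensored region to E[y^2].
    second_un = ((mu * mu + sigma * sigma) * p_un
                 + 2.0 * mu * sigma * d_pdf
                 + sigma * sigma * (alpha * norm_pdf(alpha) - beta * norm_pdf(beta)))
    second = a * a * p_low + b * b * p_high + second_un
    return p_low, p_un, p_high, mean, second - mean * mean
```

For example, with a standard normal latent variable and limits at -1 and 1, the censored variance is about 0.516 rather than 1, which shows why a filter that ignores the censoring limits when forming the measurement covariance can be miscalibrated.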

We evaluated the proposed method against 1) standard KF, 2) TKF, 3) TKF with constant limits and 4) SGF in three different setups: 1) artificial data, 2) Kinect and 3) Kinect plus Vicon human skeleton motion data. We also introduced a new metric in order to evaluate results when no ground truth is available. Finally, we calculated the covariance matrix of the process noise, Q, under a specific experimental methodology, as opposed to previous methods where random or simple experimental covariance matrices were used. Among the five approaches, ATKF gave the best results in all the different setups for human skeleton tracking.

In future work, it would be interesting to use the proposed filtering method for action recognition tasks in the wild, where uncontrolled environments and situations in which RGB-D sensors perform poorly often occur. Moreover, as a further step, it would be interesting to treat the state vector as a censored state, with the aim of filtering the human skeleton motion data more accurately.

Figure 5: Each row represents the human skeleton motion for four consecutive frames as it is obtained by 1) Kinect V2 sensor, 2) SGF, 3) KF, 4) TKF, 5) TKF and 6) ATKF, respectively.

Appendix A: The censored mean value

For a discrete random variable (Bernoulli distribution) in Lemma 1, it is derived that