A Neural Network Approach to Missing Marker Reconstruction

Taras Kucherenko, Hedvig Kjellström
Department of Robotics, Perception, and Learning
KTH Royal Institute of Technology, Stockholm, Sweden
Email: {tarask,hedvig}@kth.se

Abstract

Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. These are then used to reconstruct the motion of rigid objects or human articulated bodies, to which the markers are attached.

The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a fraction of the markers in each timestep are missing from the reconstruction.

In this paper, we propose to use a neural network approach to learn how human motion is temporally and spatially correlated, and to reconstruct missing marker positions through this model. We experiment with two different models, one LSTM-based and one window-based. Experiments on the CMU Mocap dataset show that we outperform the state of the art by 20% - 400%.

I Introduction

Often a digital representation of human motion is needed. This representation is useful in a wide range of scenarios: mapping an actor performance to a virtual avatar (in movie productions or in the game industry) [1]; predicting or classifying a motion (in robotics or human-robot interaction) [2]; trying clothes in a digital mirror [3]; etc.

A common way to obtain this digital representation is marker-based optical motion capture (mocap) systems. Such systems use a large number of cameras to triangulate the position of optical markers. These are then used to reconstruct the motion of rigid objects or human articulated bodies, to which the markers are attached.

All motion capture systems suffer to a greater or lesser degree from missing marker detections, due either to occlusion (fewer than two of the cameras see the marker) or to marker detection failures (the marker is correctly detected in fewer than two of the cameras).

In this paper, we propose a method for reconstruction of missing markers to create a more complete pose estimate (see Fig. 1). The method exploits knowledge about spatial and temporal correlation in human motion, learned from data examples to fill in missing parts of the pose estimate (Sec. IV).

Fig. 1: Illustration of our method for missing marker reconstruction. Due to errors in the capturing process, some markers are not captured. The proposed method exploits spatial and temporal correlations to reconstruct the pose of the missing markers.

A number of methods have been proposed within the Graphics community to address the problem of missing marker reconstruction. The traditional approach [4, 5] is interpolation within the current sequence. However, recently a learning approach has been proposed, which exploits a body of motion examples to learn typical correlations [6]. The novelty with respect to this method is that while they learn linear dependencies, we employ a Neural Network (NN) methodology which enables modeling of more complicated spatial and temporal correlations in sequences of a human pose.

In Sec. V we demonstrate the effectiveness of our network, showing that our method significantly outperforms the state of the art in missing marker reconstruction.

Finally, we discuss our results and give suggestions for future work (Sec. VI).

II Related Work

The task of modeling human motion from mocap data has been studied quite extensively in the past. We here give a short review of the works most related to ours.

II-A Missing Marker Reconstruction

It is possible to do 3D pose estimation even with affordable sensors such as Kinect. However, all motion capture systems suffer to some degree from missing data. This has created a need for methods for missing marker reconstruction.

The missing marker problem has been traditionally formulated as a matrix completion task. Peng et al. [5] solve it by non-negative matrix factorization, using the hierarchy of the body to break the motion into blocks. Wang et al. [6] follow the idea of decomposing the motion and do dictionary learning for each body part. They train their system separately for each type of motion. Burke and Lasenby [4] apply PCA first and do Kalman smoothing afterwards, in the lower dimensional space. All those methods are based on linear algebra. They make strong assumptions about the data: each marker is often assumed to be present at least at one time-step in the sequence. Moreover, due to the linear models, they often struggle to reconstruct complex types of motion.

To the best of our knowledge, this paper is the first attempt to apply Neural Networks to missing marker reconstruction. The limitations discussed above motivate the presented novel approach to the missing marker problem.

II-B Prediction

A highly related problem is to predict human motion some time into the future.

State-of-the-art methods try to ensure continuity either by using Recurrent Neural Networks (RNNs) [7, 8] or by feeding many time-frames at the same time [9]. While our focus is not on prediction, our network architectures are inspired by those methods.

The work most related to ours is that of Martinez et al. [10], who evaluate existing RNN-based methods for motion prediction [7, 8] and show that a primitive baseline (just taking the previous time-frame) beats many advanced methods. They also propose a simple network architecture with just one layer and analyze it. Since our application is not prediction, our architecture is slightly different: we do not have a residual connection (from the output to the input). Instead, we use an LSTM [11] at the output layer, hence having a connection from the previous to the current output.

Another related paper is the work of Bütepage et al. [9], who use a sliding window and a Fully Connected Neural Network (FCNN) to do motion prediction and classification. Again, since our problem is different, we modify their network, using a much shorter window length, fewer layers and no bottleneck.

III Dataset

We evaluate our method on the popular benchmark CMU Mocap dataset [12]. This database contains 2235 mocap sequences of 144 different subjects. We use the recordings of 25 subjects, sampled at the rate of 120 Hz, covering a wide range of activities, such as boxing, dancing, acrobatics and running.

III-A Preprocessing

We start preprocessing by transforming every mocap sequence into the hips-center coordinate system (translating it to the center of the hips). Then we normalize the data into the range [-1,1] by subtracting the mean pose over the whole dataset and dividing by the absolute maximal value in the dataset.
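The two preprocessing steps can be sketched as follows. This is a minimal illustration rather than the released implementation: the function names, the array layout (frames, markers, 3) and the choice of hip marker indices are assumptions, and the maximal absolute value is taken after mean subtraction, which is one reading of the description that guarantees the range [-1,1].

```python
import numpy as np


def center_on_hips(seq, hip_marker_ids):
    """Translate each frame so that the center of the hip markers is the origin.

    seq: array of shape (frames, markers, 3).
    hip_marker_ids: indices of the hip markers (hypothetical; depends on the marker set).
    """
    hips_center = seq[:, hip_marker_ids, :].mean(axis=1, keepdims=True)
    return seq - hips_center


def fit_normalizer(data):
    """Return the mean pose and the largest absolute deviation from it."""
    mean_pose = data.mean(axis=0)
    # Assumption: the maximum is computed after subtracting the mean pose.
    max_abs = np.abs(data - mean_pose).max()
    return mean_pose, max_abs


def normalize(seq, mean_pose, max_abs):
    """Map poses into [-1, 1] using statistics from fit_normalizer."""
    return (seq - mean_pose) / max_abs
```

In practice the statistics would be computed once and reused for the validation and test sequences, so that all data share the same scale.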

Fig. 2: Illustration of a typical marker placement.

III-B Data Explanation

This data contains 3D positions of a set of markers, which were recorded by the mocap system at CMU. An example of the marker placement during the capture can be seen in Fig. 2. All details can be found in the dataset description [12].

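The human pose at each time-frame $t$ is represented as a vector of the marker 3D coordinates: $x_t = [x_{t,1}, y_{t,1}, z_{t,1}, \dots, x_{t,n}, y_{t,n}, z_{t,n}]^\top$, where $n$ denotes the number of markers used during the mocap collection. In the CMU data, $n$ is the size of the marker set described in [12], and the dimensionality of a pose is $3n$.

A sequence of poses is denoted $X = [x_1, \dots, x_T]$.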

III-C Missing Markers

Missing markers in real life correspond to the failure of a sensor in the motion capture system.

In our experiments we use mocap data without missing markers. Missing markers are emulated by nullifying some marker positions in each timestep. This process can be mathematically formulated as a multiplication of the mocap frame $x_t$ by a binary diagonal matrix $C_t$:

$$\hat{x}_t = C_t x_t \qquad (1)$$

where $C_t = \mathrm{diag}(c_{t,1}, c_{t,1}, c_{t,1}, \dots, c_{t,n}, c_{t,n}, c_{t,n})$ with $c_{t,i} \in \{0, 1\}$ for each of the $n$ markers, such that all 3 coordinates of any marker are either missing or present at the same time. The percentage of missing values is usually referred to as the missing rate.
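As an illustration, the corruption of a single frame can be sketched as below. The function names are hypothetical, and the pose is assumed to be a flat vector laid out as (x, y, z) triplets, one per marker; the number of removed markers is the missing rate times the number of markers, rounded to the nearest integer.

```python
import numpy as np


def corruption_matrix(n_markers, missing_rate, rng):
    """Binary diagonal matrix that zeroes all 3 coordinates of randomly chosen markers."""
    n_missing = int(round(missing_rate * n_markers))
    present = np.ones(n_markers)
    missing_ids = rng.choice(n_markers, size=n_missing, replace=False)
    present[missing_ids] = 0.0
    # Repeat each marker flag 3 times so x, y and z share the same fate.
    return np.diag(np.repeat(present, 3))


def corrupt_frame(pose, missing_rate, rng):
    """Apply the corruption to one pose vector of length 3 * n_markers."""
    n_markers = pose.shape[0] // 3
    C = corruption_matrix(n_markers, missing_rate, rng)
    return C @ pose, C


rng = np.random.default_rng(0)
pose = rng.uniform(-1.0, 1.0, size=30)  # toy pose with 10 markers
corrupted, C = corrupt_frame(pose, missing_rate=0.2, rng=rng)
```

Building the full matrix mirrors the formulation in the text; an element-wise product with the diagonal gives the same result at lower cost.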

III-D Training, Validation, and Test Data Configurations

The validation dataset contains 2 sequences from each of the following motions: pantomime (subjects 32 and 54), sports (subjects 86 and 127), jumping (subject 118) and general motions (subject 143).

The test dataset contains the following sequences: 102_03 (basketball), 14_01 (boxing) and 85_02 (jump-turn).

The training dataset contains all the sequences not used for validation or testing, from 25 different folders in the CMU Mocap Dataset, such as 6, 14, 32, 40, 141, 143.

IV Method

In this section the main idea of the proposed method is presented. Thereafter, the two versions of the method are explained. Finally, the training of the model is described.

We assume that the data has missing markers, as explained in Sec. III. Missing marker reconstruction is defined in the following way: given a human motion sequence $\hat{X}$ corrupted by missing markers, the goal is to reconstruct the true pose $x_t$ for every frame $t$.

IV-A Missing Marker Reconstruction as Function Approximation

We approach missing marker reconstruction as a function approximation problem: the goal is to learn a reconstruction function $f$ that approximates the inverse of the corruption function in Eq. (1). This function would map the sequence of corrupted poses $\hat{X}$ to an approximation of the true poses $X$:

$$f(\hat{X}) \approx X \qquad (2)$$

The mapping from $\hat{X}$ to $X$ is under-determined, so the corruption is not invertible. However, its inverse can be approximated by learning spatial and temporal correlations in human motion in general, from a set of other pose sequences.

We propose to learn $f$ with a Neural Network (NN), an approach well known for being a powerful tool for function approximation [13]. We employ two different types of neural network models, which are described in the following subsections. They are compared to each other and to the state of the art in missing marker reconstruction in Sec. V.

(a) LSTM-based
(b) Window-based
Fig. 3: Illustration of the two architecture types. (a) LSTM-based architecture (Sec. IV-B). (b) Window-based architecture (Sec. IV-C).

IV-B LSTM-Based Neural Network Architecture

Long Short-Term Memory (LSTM) [11] is a special type of Recurrent Neural Network (RNN). It was designed as a solution to the vanishing gradient problem [14] and has become a default choice for many problems that involve sequence-to-sequence mapping [15, 16, 17].

Our network is based on LSTM and illustrated in Fig. 3a. The input layer is a corrupted pose $\hat{x}_t$, and the output LSTM layer is the corresponding true pose $x_t$.

IV-C Window-Based Neural Network Architecture

An alternative approach is to use a range of previous time-steps explicitly, and to train a regular Fully Connected Neural Network (FCNN) with the current pose along with a short history, i.e., a window of $T$ poses over time.

This network is illustrated in Fig. 3b. The input layer is a window of concatenated corrupted poses $[\hat{x}_{t-T+1}, \dots, \hat{x}_t]$. The output layer is the corresponding window of true poses $[x_{t-T+1}, \dots, x_t]$ (Sec. III). In between, there are zero or more hidden fully connected layers.
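The data flow through the window-based network can be sketched as a plain forward pass. This is only an illustration of the shapes involved: the function names are hypothetical, the weights are random and untrained, the hidden size is a toy value, and the tanh activation is an assumption, since the text does not state the non-linearity.

```python
import numpy as np


def init_fcnn(in_dim, hidden_dim, rng):
    """Random weights for one hidden layer; the output has the size of the input window."""
    return {
        "W1": rng.normal(0.0, 0.01, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.01, size=(hidden_dim, in_dim)),
        "b2": np.zeros(in_dim),
    }


def fcnn_forward(params, corrupted_window):
    """Map a window of corrupted poses to a window of reconstructed poses.

    corrupted_window: array of shape (window_len, pose_dim).
    """
    flat = corrupted_window.reshape(-1)  # concatenate the poses in the window
    hidden = np.tanh(flat @ params["W1"] + params["b1"])  # activation is an assumption
    out = hidden @ params["W2"] + params["b2"]
    return out.reshape(corrupted_window.shape)


rng = np.random.default_rng(0)
window = rng.uniform(-1.0, 1.0, size=(10, 6))  # 10 frames, toy pose with 2 markers
params = init_fcnn(in_dim=window.size, hidden_dim=16, rng=rng)
reconstructed = fcnn_forward(params, window)
```

Because the input and output windows have the same size and there is no bottleneck, each output coordinate stays closely tied to the corresponding input coordinate.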

This structure is inspired by the sliding time window-based method of Bütepage et al. [9], but is adapted to pose reconstruction. For example, there is no bottleneck middle layer and fewer layers in general, to create a tighter coupling between the corrupted and real pose, rather than learning a high-level or holistic mapping of a pose. We also use window length T=10, instead of 100, based on the performance on the validation dataset.

IV-D Training

For training purposes, we extract short sequences from the dataset with a sliding window, then shuffle them and feed them to the network.
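A minimal sketch of this extraction step is given below. The function name and the stride parameter are assumptions; the text only states that a sliding window is used and that the resulting chunks are shuffled.

```python
import numpy as np


def extract_chunks(sequences, chunk_len, stride, rng):
    """Cut every sequence into overlapping chunks and shuffle the chunks.

    sequences: list of arrays, each of shape (frames, pose_dim).
    Returns an array of shape (n_chunks, chunk_len, pose_dim).
    """
    chunks = []
    for seq in sequences:
        for start in range(0, len(seq) - chunk_len + 1, stride):
            chunks.append(seq[start:start + chunk_len])
    chunks = np.stack(chunks)
    rng.shuffle(chunks)  # in-place shuffle along the first axis
    return chunks


rng = np.random.default_rng(0)
toy_sequences = [rng.uniform(-1.0, 1.0, size=(40, 6)) for _ in range(3)]
training_chunks = extract_chunks(toy_sequences, chunk_len=32, stride=1, rng=rng)
```

Shuffling at the chunk level keeps the frames inside each chunk in temporal order while breaking the ordering between chunks.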

The training was done using the Adam optimizer [18]. All hyper-parameters were optimized on the validation dataset.

Note that for the non-missing markers the existing input values were used instead of the network output, so that only missing values were reconstructed.
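This merging step amounts to a per-coordinate selection between the input and the network output. A minimal sketch, with a hypothetical function name and a boolean mask that is True where a coordinate was observed:

```python
import numpy as np


def merge_with_observed(corrupted, network_output, present_mask):
    """Keep observed input values and use the network output only where data is missing."""
    return np.where(present_mask, corrupted, network_output)


corrupted = np.array([0.1, 0.2, 0.3, 0.0, 0.0, 0.0])   # second marker is missing
network_output = np.array([0.15, 0.18, 0.33, 0.4, 0.5, 0.6])
present_mask = np.array([True, True, True, False, False, False])
final_pose = merge_with_observed(corrupted, network_output, present_mask)
```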

V Experiments

The models presented above are evaluated in the standard setting for random missing marker reconstruction: a fixed fraction of the markers (10%, 20%, or 30%) is removed at random in each time-frame, each method is applied to recover them, and the reconstruction error is measured.

V-A Error Measure

We use the common [5, 6, 4] Root Mean Squared Error (RMSE) to measure pose reconstruction error. This measure calculates the average distance between a true and a reconstructed motion according to:

$$\mathrm{RMSE}(X, \tilde{X}) = \sqrt{\frac{1}{N} \lVert X - \tilde{X} \rVert_F^2} \qquad (3)$$

where $X$ is the original motion, $\tilde{X}$ is the reconstructed one and $N$ is the number of missing markers.
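One plausible reading of this measure, restricted to the missing markers, can be sketched as follows. The function name, the array layout and the use of the squared Euclidean distance per marker are assumptions made for illustration.

```python
import numpy as np


def rmse_missing(true_seq, recon_seq, missing_mask):
    """RMSE over the missing markers only.

    true_seq, recon_seq: arrays of shape (frames, markers, 3).
    missing_mask: boolean array of shape (frames, markers), True where a marker was missing.
    """
    # Squared Euclidean distance between true and reconstructed position, per marker.
    sq_dist = ((true_seq - recon_seq) ** 2).sum(axis=-1)
    n_missing = missing_mask.sum()
    return float(np.sqrt(sq_dist[missing_mask].sum() / n_missing))
```

Markers that were observed do not contribute, so the value is not diluted by entries that are copied unchanged from the input.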

There is randomness in the system (in the initialization of the network weights), so every experiment is repeated 3 times and error mean and standard deviation are measured.

V-B Implementation Details

All methods were implemented using TensorFlow [19]. The code is publicly available at https://github.com/Svito-zar/NN-for-Missing-Marker-Reconstruction.

V-C Hyperparameters

The hyperparameters for both architectures were optimized w.r.t. the validation dataset (Sec. III-D) using grid search.

Hyperparameters for the LSTM-based network (Sec. IV-B) were: hidden layer size = 2048, number of hidden layers = 1, dropout = 0.9, batch size = 64, initial learning rate = 0.0005, sequence length = 32 (number of frames unrolled). These hyperparameters correspond approximately to those found in related work [8, 10, 7].

Hyperparameters for the window-based network (Sec. IV-C) were: hidden layer size = 1024, number of hidden layers = 1, dropout = 0.9, batch size = 32, initial learning rate = 0.0005, time window size = 10.

Architecture    Basketball   Boxing   Jump turn
LSTM-based      1.57         0.91     1.44
Window-based    2.32         1.63     2.45

TABLE I: Comparison of architectures for missing marker reconstruction. RMSE in marker position (cm) for the LSTM-based (Sec. IV-B) and window-based (Sec. IV-C) architectures. The training set comprises all activities; 20% of the markers (randomly selected) in each input frame are missing.

V-D Architecture Comparison

Table I shows a comparison of the two architectures on the missing marker problem. The LSTM-based network performed better than the window-based one. A probable reason is that the LSTM is better than the window-based FCNN at modeling temporal correlations. Based on this, the LSTM-based architecture was chosen for the following state-of-the-art comparison.

V-E Comparison to the State of the Art

Table II compares the performance of our system with three state-of-the-art methods and with a trivial solution (substituting mean values) as a baseline, on 3 action classes from the CMU Mocap dataset. We re-ran the method of [4] using the same hyperparameters as in the original paper. The results of the Wang method [6] were taken from the diagram in their paper. Finally, the errors of the Peng method [5] were recomputed from their original measure, which averages the error over all markers, to our measure, where the error is computed only over the missing markers.

Each sub-table has a different missing rate (percentage of markers missing at each time-frame). Our system substantially outperforms the state of the art.

Method        Basketball   Boxing   Jump turn
Mean value    13.2         12.62    19.2
Burke [4]     10.1         1.1      9.6
Wang [6]      3.5          2.5      n.a.
Peng [5]      n.a.         n.a.     n.a.
LSTM-based    1.57         0.91     1.44
(a) 10% of the markers (randomly selected) in each input frame are missing.

Method        Basketball   Boxing   Jump turn
Mean value    12.4         12.92    19.9
Burke [4]     12.57        12.78    19.43
Wang [6]      8            4        n.a.
Peng [5]      n.a.         4.94     5.12
LSTM-based    1.62         0.94     1.36
(b) 20% of the markers (randomly selected) in each input frame are missing.

Method        Basketball   Boxing   Jump turn
Mean value    12.56        12.77    19.98
Burke [4]     12.6         12.76    19.85
Wang [6]      30           35       n.a.
Peng [5]      n.a.         4.36     4.9
LSTM-based    1.85         1.01     1.42
(c) 30% of the markers (randomly selected) in each input frame are missing.

TABLE II: Comparison to the state of the art in missing marker reconstruction. RMSE in marker position (cm) for the LSTM-based (Sec. IV-B) architecture, compared to three related methods. The errors obtained by using the mean pose for the missing markers are given as a baseline. The training set comprises all activities. The numbers from [6] were extracted from a diagram.

The method of Burke and Lasenby [4] performs well for repetitive motion if the noise level is low (see Table IIa). However, it requires all the markers to be present at least at one time-step in the sequence and is unable to reconstruct markers that failed early on during the mocap, as noted by the authors.

The method of Wang et al. [6] performs well for low missing marker rate (see Table IIa), but degrades steeply with an increase of missing rate (see Table IIc). In contrast, our method is robust to an increasing missing rate.

The method of Peng et al. [5] produces very promising results, even if a big fraction of data is missing. Still, it is considerably less accurate than our method; probably because this method relies on linear models, while the dependencies between different markers are non-linear in reality.

V-F Visualization of the Results

(a) Ground truth markers
(b) 20% of markers are missing
(c) Reconstruction result
Fig. 4: Four keyframes from the boxing test sequence, illustration of the reconstruction using the LSTM-based method.

Fig. 4 illustrates the marker reconstruction in one of the test sequences, jump-turn. The markers have been colored for visibility as: head-red, torso-yellow, arms-green, legs-blue. The subject is lifting their arms and jumping. The observed marker cloud (Fig. 4b) misses 20% of the markers, in this case, many of the markers on the lower torso, which makes the poses difficult to see. Our reconstruction result (Fig. 4c) is visually close to the ground truth (Fig. 4a), which is also supported by the results in Tables I and II.

V-G Development of Error over Time

(a) Reconstructing basketball: 1 marker of a hand is missing for all time frames except the first.
(b) Reconstructing basketball: 3 markers of a hand are missing for all time frames except the first.
(c) Reconstructing boxing: 1 marker of a hand is missing for all time frames except the first.
(d) Reconstructing boxing: 3 markers of a hand are missing for all time frames except the first.
Fig. 5: Long-term missing marker experiment. The plots show the development of error over time when specific markers were missing for an extended period of time.

So far we have been considering a scenario where some percentage of the markers are missing randomly at each time-step. This is a common setting in which to evaluate missing marker reconstruction algorithms. However, specific markers are often missing over extended periods of time. It is therefore interesting to simulate this type of missing data as well.

The following experiments compare our system with the method of Burke and Lasenby [4] (for which we have access to the implementation) in this setup. We focus on the markers of the hand, because this part of the body is the most articulated and the hardest to reconstruct.
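The long-term setup can be emulated with a visibility mask in which the selected markers are observed only in the first frame. The sketch below uses a hypothetical function name and hypothetical marker indices; which indices correspond to hand markers depends on the marker set.

```python
import numpy as np


def long_gap_mask(n_frames, n_markers, lost_marker_ids):
    """Boolean (frames, markers) mask, True where a marker is observed.

    The markers in lost_marker_ids are visible in the first frame only.
    """
    present = np.ones((n_frames, n_markers), dtype=bool)
    present[1:, lost_marker_ids] = False
    return present


# Toy example: markers 2 and 4 stand in for hand markers (hypothetical indices).
mask = long_gap_mask(n_frames=90, n_markers=6, lost_marker_ids=[2, 4])
```

Unlike the random setting, the same markers are absent in every frame after the first, so neither their own recent history nor a changing set of observed neighbors is available to the model.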

It should be noted that while the method of Burke and Lasenby [4] requires a complete sequence, our method only uses the past frames, making it suitable for online estimation.

Results for the basketball test sequence, frames 20-110, are illustrated in Fig. 5a. Results for the other motions, frames 480-570, are illustrated in Fig. 5b,c,d. We can observe that the Burke and Lasenby method performs much better for the "boxing" sequence than for "basketball". The probable reason is that the "basketball" sequence is much longer and hence provides more information for interpolation. Our system performs better than that of Burke and Lasenby when just one marker is missing, but degrades steeply when the number of missing markers increases. During training, the data had randomly missing markers, so for each missing marker either its previous time-steps or its neighboring markers were present. In the test condition, the missing markers have information neither about previous time-steps nor about their neighbors, so our system could not recover them.

VI Conclusions

We propose a data-driven, Neural Network-based approach to missing marker reconstruction. We suggest two methods to model spatial and temporal correlations. The CMU Mocap dataset was used to evaluate and benchmark the method.

Our experiments show that both suggested methods outperform the state of the art in missing marker reconstruction; the LSTM-based method (Sec. IV-B) reduces the reconstruction error by a factor of 4 relative to the state of the art when the rate of missing markers is high, and by a factor of 2 when the missing marker rate is low.

VI-A Future Work

We envision two future directions for developing the method further. Firstly, to reconstruct the aspects of the human motion that are difficult to capture in a full-body scenario, such as fingers, gaze, and facial features. Secondly, to use the method for human-to-robot mapping of communicative motion behavior.

Acknowledgments

The authors would like to thank Rika Antonova, Simon Alexanderson, and Hossein Azizpour for useful discussions and comments. This work is funded by Stiftelsen för Strategisk Forskning.

References

  • [1] M. J. Black and P. Guan, “System and method for simulating realistic clothing,” Jun. 13 2017, US Patent 9,679,409.
  • [2] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3d human pose estimation,” in ICCV, 2017.
  • [3] P. X. Nathan, C. J. Cox, F. T. Leibert, M. S. Meadows, and J. S. Mallis, “Systems and methods for managing a persistent virtual avatar with migrational ability,” Nov. 16 2006, US Patent App. 11/560,743.
  • [4] M. Burke and J. Lasenby, “Estimating missing marker positions using low dimensional kalman smoothing,” Journal of Biomechanics, vol. 49, no. 9, pp. 1854–1858, 2016.
  • [5] S.-J. Peng, G.-F. He, X. Liu, and H.-Z. Wang, “Hierarchical block-based incomplete human mocap data recovery using adaptive nonnegative matrix factorization,” Computers & Graphics, vol. 49, pp. 10–23, 2015.
  • [6] Z. Wang, S. Liu, R. Qian, T. Jiang, X. Yang, and J. J. Zhang, “Human motion data refinement unitizing structural sparsity and spatial-temporal information,” in IEEE International Conference on Signal Processing, 2016.
  • [7] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in ICCV, 2015.
  • [8] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deep learning on spatio-temporal graphs,” in CVPR, 2016.
  • [9] J. Bütepage, M. Black, D. Kragic, and H. Kjellström, “Deep representation learning for human motion prediction and classification,” in CVPR, 2017.
  • [10] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in CVPR, 2017.
  • [11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.
  • [12] CMU, “Carnegie-mellon mocap database.”
  • [13] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [14] S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107–116, 1998.
  • [15] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014.
  • [16] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
  • [17] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” PAMI, vol. 39, no. 11, pp. 2298–2304, 2017.
  • [18] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
  • [19] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.