Automatic Error Analysis of Human Motor Performance for Interactive Coaching in Virtual Reality
In the context of fitness coaching or for rehabilitation purposes, the motor actions of a human participant must be observed and analyzed for errors in order to provide effective feedback. This task is normally carried out by human coaches, and it needs to be solved automatically in technical applications that are to provide automatic coaching (e.g. training environments in VR). However, most coaching systems only provide coarse information on movement quality, such as a scalar value per body part that describes the overall deviation from the correct movement. Further, they are often limited to static body postures or rather simple movements of single body parts. While there are many approaches to distinguish between different types of movements (e.g., between walking and jumping), the detection of more subtle errors in a motor performance is less investigated. We propose a novel approach to classify errors in sports or rehabilitation exercises such that feedback can be delivered in a rapid and detailed manner: Homogeneous sub-sequences of exercises are first temporally aligned via Dynamic Time Warping. Next, we extract a feature vector from the aligned sequences, which serves as a basis for feature selection using Random Forests. The selected features are used as input for Support Vector Machines, which finally classify the movement errors. We compare our algorithm to a well established state-of-the-art approach in time series classification, 1-Nearest Neighbor combined with Dynamic Time Warping, and show our algorithm’s superiority regarding classification quality as well as computational cost.
Coaching environments for motor learning have become an increasingly popular research topic in the field of Virtual Reality (VR). They are promising in areas such as rehabilitation or fitness training. High-quality feedback on the coachee’s performance is crucial for the success of such systems. Therefore, an intelligent coaching system must not only detect which task (in the following called motor action) is executed. It must also detect the specific errors the coachee makes during an exercise and address them with appropriate feedback. While many approaches exist for the classification of motor actions, fewer consider the analysis of performance quality. If they do, authors often focus on reporting simple scores, which summarize the performance quality in terms of a deviation from a desired performance. Others provide scoring functions which describe overall improvement or decline in quality for a specific exercise. However, many types of complex sports movements can be executed correctly yet with different individual styles. Moreover, some parts of the body are often completely irrelevant for the successful execution of the movement. For instance, the orientation of the hands is negligible when analyzing the quality of a body weight squat. Consequently, feedback that relies only on an overall deviation from a prerecorded desired performance, including task-irrelevant deviations, is suboptimal when aiming to improve the coachee’s performance.
For many types of motor actions, a set of typical errors can be found. Often, there is only a very subtle distinction between a correct movement and the occurrence of a certain error. For many known errors, coaches have established feedback strategies to support a coachee in improving her performance. This could be, for instance, a verbal description of the error together with best practices on how to eliminate it. Intelligent coaching environments in VR need to be able to detect such error patterns automatically and to provide elaborate feedback, e.g. taken from real-world coaching experience. Such feedback must be provided online or rapidly, i.e., either directly after a coachee has finished the movement or, even better, while the motor action is still being performed. Some approaches try to achieve this using manually designed rules that can be evaluated online. However, this requires enormous manual effort and bears the risk of gaps or under-fitting of the designed rules.
In this paper, we present an approach to automatic error analysis of human motor performance in an immersive VR coaching environment for sports and rehabilitation exercises (see Figure 1). We focus on the squat movement as a test case for our approach. The squat is a full-body motor action that is frequently used in the context of rehabilitation as well as for sports training. When novice coachees execute a squat, various error patterns can be observed. We consider the detection of such error patterns as a time series classification problem. In the field of time series classification, 1-Nearest-Neighbor combined with Dynamic Time Warping (1NN-DTW) has proved to be state of the art and difficult to beat by other classifiers. We aim to extend the current state of the art in the classification of typical error patterns in motor performance. Our contribution is as follows:
We propose a novel approach towards the classification of error patterns in motor performances which uses a reference-based Dynamic Time Warping of movement segments as a basis for a feature selection using Random Forest. The selected features are in a final step classified by a Support Vector Machine (SVM).
We show that this classifier outperforms the 1NN-DTW approach, in both classification performance as well as time needed for classification.
We show the effectiveness of the approach on an exemplary data set and demonstrate the impact of all components on classification performance as well as on time needed for classification.
In the next section, we discuss related work on motor performance analysis and time series classification. Then, we describe how we obtain our data set, which consists of a list of typical error patterns, together with annotated movement data. In Section 4, we first evaluate the performance of 1NN-DTW on our data set. Next, we provide a step-by-step evaluation of the components of our approach. In Section 5, we discuss the results and conclude the paper. The video in the online material demonstrates how we use the proposed analysis to generate verbal feedback inside our “Intelligent Coaching Space”.
Two main approaches have been applied to assess the quality of human motor performances. The first approach (Section 2.1) is to engineer a highly specialized method, e.g., for the evaluation of feedback strategies for a very specific type of motor action. In this approach, a common choice is to assess quality by determining the overall distance of the performed motion to the desired motion. Often, a model for these specific performance patterns is manually designed drawing from expert knowledge. The second approach (Section 2.2) consists in using more general, data-based methods, such as well-established techniques from time series classification. In the following, we present and discuss work stemming from both directions.
2.1 Specific, Manually Designed Approaches
 use a manually designed scoring function to represent patients’ performance changes in a rehabilitation setting . Even though this approach provides compelling results in the field of application, no detailed information on occurred error patterns is gained, which would be necessary for the application of complex coaching strategies.
Other approaches make use of rule-based systems to detect the occurrence of certain error patterns. In the context of yoga training, Rector et al. define optimal yoga poses. De Kok et al. went one step further by manually defining error patterns that cover the whole trajectory. Rules are implemented, first to split the motion into sequential movement segments, and then to describe the error patterns. A state machine performs the classification.
One major advantage of the approaches by Rector et al. or de Kok et al. is their real-time capability: Specific feedback strategies linked to typical error patterns can be applied immediately. Further, the results are deterministic: If the rules are correct and exhaustive and the motion capture system works properly, an incorrect classification is unlikely to occur. This directly leads to the major disadvantage: As the rules have to be designed manually, they are prone to errors during the design phase, which might be difficult to track down later on. A single error during the design of only one pattern might have a devastating effect on the resulting system in terms of effectiveness and even safety of the training. Moreover, it is mostly not trivial, even when interviewing sports coaches, to obtain exact information about which features are significant or where to draw the border between a correct and an incorrect movement. Finally, the design of rules requires enormous manual effort: For each motor action and for each type of error, a detailed investigation on how to describe the motor action and the error has to be performed. For complex error patterns, this quickly becomes infeasible. Thus, it is desirable to focus on approaches that automatically learn most of their information from data.
 focus on classifying error patterns in rehabilitation exercises using a combination of rule-based segmentation and AdaBoost on a set of manually defined features . In a within-subject cross validation, the authors obtain highly convincing results. However, classification performance decreases significantly when generalizing to new subjects. Furthermore, the design of feature sets requires additional manual work.
 present an approach towards distinguishing between good, moderate, and bad performances of squat movements . They use a feature vector based on manually designed features, such as skewness and range, whose dimensionality is reduced using Sparse Principal Component Analysis (SPCA). Finally, Decision Trees are used for classification. The classification accuracy to distinguish between good, moderate, and bad squats in a leave-one-subject-out cross validation is 73 %. For the distinction between only two classes (good and bad), a higher accuracy of 98.6 % was achieved. The presented approach is only able to distinguish between three coarse classes of quality and cannot spot single error patterns. In addition, manual effort is needed for feature preparation. Furthermore, SPCA is an unsupervised algorithm, which searches for a set of sparse principal components which cover as much as possible of the variance inside the data . This is problematic when most of the variance is due to individual differences rather than performance errors, which holds for sports movements that can differ considerably between subjects.
 use a neural network classifier to differentiate between correct and incorrect performances of squats and to classify error patterns. A leave-one-out cross validation resulted in an accuracy of 80 % to distinguish between correct and incorrect, but only in an accuracy of 57 % for the classification of error patterns. Similar experiments were conducted by  .
 proposed an extension of Dynamic Time Warping (DTW) that is able to detect multiple occurrences of multiple exercise types in trajectories as well as to classify error patterns . Classification is performed by comparing the just performed motion to pre-recorded templates and then selecting the best matching one. This leads to a very high accuracy of 93 % for exercise classification and 89 % for the classification of errors in motor performances (inter-subject performance was not tested). However, combinations of multiple error patterns cannot be considered as long as they are not included as individually pre-recorded templates.
Overall, the data-based approaches employed in the context of sports and rehabilitation applications have three weaknesses: First, it is often not analyzed how well the trained classifiers generalize to new subjects. Many approaches require the system be re-trained for each user. This leads to problems as subjects are often physically not able to provide all the required training data. For instance, in the context of sports performances, some users are not able to perform the desired motor action correctly or, on purpose, with a certain type of error. Second, the motor actions and error patterns are often rather simple. Some of the presented systems only distinguish between, e.g., “good” or “bad” for a motor action that only involves a very small number of joints. Especially algorithms using variance-based dimensionality reduction or pure comparisons with prototypes will perform worse on more subtle errors or more complex movements: Most of the variance and also the similarity to prototypes would be covered by inter-subject variations instead of the movement patterns underlying the errors. Finally, for most algorithms, no information on the applicability in interactive or real-time systems is given. Especially algorithms which require expensive calculations for each classification do not meet the requirements of VR coaching systems as, e.g., described in .
Another group of data-based approaches has been developed in the field of Computer Graphics to capture and synthesize human motion with particular styles. Analysis of observed movements is then often possible through “analysis by synthesis”. Giese et al. introduced Spatio-Temporal Morphable Models for analysis and synthesis of morphs between gait styles . First, recordings of prototypical performances are brought into spatio-temporal correspondence. Then, new trajectories can be described as spatio-temporal blends between prototypes. The underlying assumption is that a clearly defined prototype can be obtained for each desired style. In our case, these styles would be the possible error patterns in a motor performance. However, in the context of motor learning, movements often contain a combination of different error patterns and prerecorded single prototypical errors do not work equally well for different subjects.
A related approach has been proposed by  : Their model, called Motion Graphs++, describes human movements by (a) discrete structural variations that define the motor action together with (b) continuous variations that capture the movement style. Style variations are represented using Principal Component Analysis (PCA) together with a Mixture of Gaussians. MotionGraphs++ are powerful as they do not need an isolated demonstration of each prototype. However, if a targeted variation in style is not covered by the PC dimensions, the model cannot detect this style pattern. In the case of typical error patterns in motor performances, the differences between users who perform the same error may be relatively big, whereas the difference between error patterns within a user can be very subtle. Thus, MotionGraphs++ would rather encode the inter-individual differences than the characteristics of the error patterns.
Finally, the classification of errors in motor performances is a special case of time series classification, for which several machine learning algorithms have been proposed. Ground-breaking work was performed by , who used hidden Markov models (HMM) for the recognition of gestures . Other methods are based on decision trees , SVMs , or Multi-Layer Perceptrons (MLP) . Dynamic Time Warping (DTW) is usually used to temporally align two recorded trajectories. As a pseudo-metric combined with a subsequent classification, DTW has a highly positive impact on motion classification . Xi et al. provide an extensive review comparing a large set of available classification methods, such as HMMs, MLPs, and decision trees on time series data . They show that no tested classifier is able to beat a combination of DTW and 1-Nearest-Neighbor (1NN-DTW), which basically compares the query trajectory to each available training trajectory using DTW as distance measure. Then the most similar training trajectory is used to predict the label of the query trajectory. The superiority of this approach in comparison with nine classifiers, including Random Forests, SVM, Bayes Networks, et cetera, is supported by work from  . Likewise,  achieved good classification results using a method similar to 1NN-DTW, which, however, was limited to simple movement patterns and was not evaluated with respect to generalization to movements of other persons .
To sum up, the approaches discussed in this section suffer from a number of limitations that prevent their use for real-time coaching of human motor performances. We aim to go beyond this by developing a classification approach that can classify subtle errors in a complex motor action with high accuracy, works on a small or unbalanced dataset, achieves good generalization over different users, and provides its results very quickly and already after relevant parts of the performance have been observed. We will base our approach on knowledge from Sports Science about which errors are particularly relevant, and we present an approach that determines discriminatory features of these errors and then realizes classifiers with the desired properties. We will take 1NN-DTW as a baseline in evaluating them.
3 Domain and Dataset
Sports coaches and sports scientists have developed coaching strategies to address specific error patterns during a coaching session. Before developing a VR coaching system, and to enable it to detect those errors automatically, it is important to identify relevant error patterns along with corresponding feedback strategies for each motor action of interest. To this end, we analyzed 21 video recordings of real-world squat coaching sessions. Part of these data comes from a previously described corpus; additional videos were recorded in our lab. We used the videos together with information from sports scientists as well as the literature to compile a list of 21 relevant error patterns. For instance, one error pattern is an incorrect weight distribution (depicted in Figure 2), which occurs if the coachee shifts too much of the body weight to the front.
Motion data was recorded using an OptiTrack motion capture system consisting of ten Prime 13W cameras. Passive markers were mostly attached to a customized motion capture suit; markers at the arms and the hands were attached directly to the subjects’ skin (see Figure ?). The motion capture system outputs kinematic features for 19 joints (see Figure ?) per frame at 120 Hz. In our representation, each frame consists of joint rotations $q_j$ as well as joint positions $p_j$ (with $j = 1, \dots, 19$). Joint rotations are represented as quaternions $q_j$. Each quaternion denotes the rotation of a joint with respect to its parent. The root rotation describes the rotation of the root with respect to its rotation at the beginning of the movement. As root joint, we use the hips. The joint positions are represented by vectors $p_j \in \mathbb{R}^3$. Each $p_j$ denotes the y-component of the translation (height) of the joint as well as the translation relative to the x- and z-position of the root joint at the beginning of the movement, after removing the subject’s orientation at the beginning of the movement. In addition, we use joint angles as Euler angles, calculated from the quaternion representation, which correspond to flexion/extension, abduction/adduction, and twist of the corresponding joint.
We asked 49 subjects to perform squats inside the capture volume. Up to two squats per participant were annotated by an expert for the presence of any of the error patterns. The expert had to add confidence and intensity ratings for each decision. These ratings were combined into a score in the interval by averaging. Only ratings with a score above were used for the experiment. Trajectories that contained severe errors caused by the motion capture system (e.g. due to missing markers) were excluded. The final training data set consisted of squat movements coming from 49 subjects. We selected the error patterns that appeared with sufficient frequency (at least 15 positive and negative examples) for training. The ten resulting patterns and their frequency in the training data are listed in Table 1.
| Performance Error Pattern | #Erroneous Executions | #Correct Executions |
|---|---|---|
| feet distance not sufficient | 45 | 33 |
| hips do not initiate movement | 23 | 51 |
| incorrect weight distribution | 51 | 16 |
| knees tremble sideways | 23 | 33 |
| legs extended at end | 42 | 38 |
The combination of Dynamic Time Warping and 1-Nearest-Neighbor (1NN-DTW) is one of the most successful classifiers for time series classification . Thus it will serve as our baseline. In the following, we first report how we evaluate classifier performance. Then we describe the 1NN-DTW baseline approach and carve out its drawbacks for motor performance analysis in interactive coaching sessions. Then, we develop classifiers to eliminate or mitigate its weaknesses step by step. Finally, we verify that our approach is suitable for error analysis of human motor performances in the context of interactive VR coaching sessions.
Motor actions in sports or rehabilitation training often exhibit large inter-subject variation . Consequently, it is important to ensure that classifiers are tested on data from persons whose performances are not included in the training data. This hypothesis is experimentally supported by , who measure a huge difference in classifier scores when testing on samples from participants included in the training set, as compared to samples from participants who were not included in the training data . We made sure that for the results described in the following, no data from subjects who provided a recording to the training set is contained in the test set. We applied 5-fold cross validation under this constraint for each error pattern. In each fold, we aimed at achieving a similar proportion of positive and negative labels as in the overall data set. For our experiments, the variables of interest are the quality of the classification and the time needed for the classification of a single query trajectory.
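The subject-disjoint splitting described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the feature matrix, labels, and subject ids are synthetic stand-ins, and scikit-learn's `GroupKFold` is an assumed choice. `GroupKFold` only guarantees that no subject appears in both training and test data; it does not balance the label proportions per fold, which the text additionally aims for.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data (hypothetical): 20 squats from 10 subjects, 2 each.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 8))        # one feature vector per trajectory
labels = rng.integers(0, 2, size=20)       # 1 = error pattern present
subject_ids = np.repeat(np.arange(10), 2)  # subject who performed each squat

folds = []
splitter = GroupKFold(n_splits=5)
for train_idx, test_idx in splitter.split(features, labels, groups=subject_ids):
    train_subjects = set(subject_ids[train_idx].tolist())
    test_subjects = set(subject_ids[test_idx].tolist())
    folds.append((train_subjects, test_subjects))
```

Each entry of `folds` holds the subjects used for training and for testing in one fold; the two sets never overlap.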
To investigate the quality of a classification, different types of scores can be used. We report the accuracy of the described classifier, defined as the number of correctly classified samples divided by the overall number of samples:

$$\mathrm{accuracy} = \frac{TP + TN}{P + N}$$

Here, $TP$ is the number of true positives and $TN$ the number of true negatives. $P$ is the overall number of positive examples and $N$ the overall number of negative examples in the training data. Additionally, at the end of Section 4, we provide plots for F1 scores. The F1 score is the harmonic mean of precision and recall of the classifier:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

Here, $FP$ is the number of false positives and $FN$ the number of false negatives. All measured scores and standard deviations for the cross validation folds can be found in the supplementary online material.
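As a small worked illustration of the two scores, the following sketch computes accuracy and F1 from the four confusion counts. The function names and the example counts are hypothetical and only serve to make the definitions concrete.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples among all samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denominator = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical counts for one error pattern.
example_accuracy = accuracy(tp=40, tn=30, fp=5, fn=10)
example_f1 = f1_score(tp=40, fp=5, fn=10)
```

For strongly unbalanced patterns, the two scores can differ noticeably, which is one reason to report both.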
In addition to the quality of classification, we report information on the time each algorithm needs to classify a new query trajectory. As DTW is an essential part of each of the proposed algorithms, we report the time that is approximately needed for a DTW without any parallelization. Furthermore, to be able to compare the algorithms that only have to perform one DTW per query, we report the average time per query needed for the classification of a single error pattern. All experiments were conducted on a machine with an Intel Xeon E5-1620 CPU at 3.6 GHz.
As described above, we take as baseline one of the most successful classification algorithms for time series: 1-Nearest-Neighbor as classification algorithm together with Dynamic Time Warping as distance measure (1NN-DTW). For an input query, 1NN searches for the data point that is most similar to the input. Then it returns the classification label of this nearest neighbor in the training set. The underlying assumption is that data points that lie nearby belong to the same class. To determine which points lie nearby, a frame-wise comparison is problematic in time series such as motion trajectories. If the trajectories were simply compared frame-to-frame, results would be highly distorted: Even if the movement is performed in exactly the same way in space, but with a slight temporal offset, this measure would report a very high distance, whereas if a movement is performed with similar timing but different postures (e.g. a slightly weaker movement of some joints), the distance would be very low. Dynamic Time Warping (DTW) is typically used to solve this problem, as it establishes a frame-to-frame correspondence between two trajectories by warping in time and then allows the distance between them to be determined.
We implemented 1NN-DTW as follows. Given two trajectories $A$ and $B$, consisting of $n$ and $m$ frames, respectively, we use DTW to calculate the optimal match between them. First, a local cost matrix $C \in \mathbb{R}^{n \times m}$ is constructed. Each element $C_{a,b}$ of this matrix corresponds to the distance between the postures $A_a$ and $B_b$. This distance is defined as the sum of the quaternion distances of the corresponding joints. As quaternion distance $d$, we use the inner product as evaluated by  . Thus, each element in the matrix is calculated as follows:

$$C_{a,b} = \sum_{j=1}^{J} d\left(q_j^{A_a}, q_j^{B_b}\right)$$

Here, $q_j^{A_a}$ is the quaternion of joint $j$ in frame $a$ of trajectory $A$, and $J$ is the number of joints.

To establish a frame-to-frame correspondence, an optimal path through $C$ from $C_{1,1}$ to $C_{n,m}$ is determined based on dynamic programming. The distance between the two trajectories $A$ and $B$ can now be defined as the mean value of the $C_{a,b}$ on the warping path. Comparison of classification results using different features, such as joint angles or joint positions, yielded no significant improvements in the 1NN step. Results of these comparisons can be found in the supplementary online material.
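The DTW procedure described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the quaternion distance is assumed to be $1 - |\langle q_1, q_2 \rangle|$ (one common inner-product-based form), the allowed steps are the three standard ones (vertical, horizontal, diagonal), and all function names are hypothetical.

```python
def quat_distance(q1, q2):
    """Assumed inner-product distance 1 - |<q1, q2>| between unit quaternions."""
    dot = sum(a * b for a, b in zip(q1, q2))
    return max(0.0, 1.0 - abs(dot))  # clamp tiny negative rounding errors


def posture_distance(frame_a, frame_b):
    """Sum of quaternion distances over corresponding joints of two frames."""
    return sum(quat_distance(qa, qb) for qa, qb in zip(frame_a, frame_b))


def dtw(traj_a, traj_b, dist=posture_distance):
    """Return (mean local cost on the warping path, warping path)."""
    n, m = len(traj_a), len(traj_b)
    cost = [[dist(a, b) for b in traj_b] for a in traj_a]  # local cost matrix
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]  # accumulated cost
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best_prev
    # Backtrack from the last cell to the first to recover the path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    mean_cost = sum(cost[a][b] for a, b in path) / len(path)
    return mean_cost, path


# Toy single-joint trajectories: same movement, different timing.
identity = (1.0, 0.0, 0.0, 0.0)
flipped = (0.0, 1.0, 0.0, 0.0)  # 180 degree rotation about the x-axis
slow = [[identity], [identity], [flipped]]
fast = [[identity], [flipped]]
distance, path = dtw(slow, fast)
```

Both nested loops visit every cell of the cost matrix once, which makes the quadratic cost in the trajectory length directly visible.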
We applied the above procedure to the relevant error patterns: For each query trajectory $Q$, we compute DTW to each training trajectory $T_i$. Next, the trajectory $T_i$ with the smallest DTW distance to $Q$ that is annotated with respect to the error pattern is selected. The label of this trajectory is then returned for $Q$. As shown in Figure 3, 1NN-DTW is able to detect some of the error patterns with accuracies of more than 60 percent. This is comparable to the results by   and   for simple rehabilitation exercises. The computational cost of DTW is quadratic with respect to the lengths of the trajectories. In our setting, a single DTW takes about 55 ms on average per trajectory. On average, the trajectories used for this experiment consist of 500 frames. For each trajectory to be classified, DTW has to be calculated with each of our training trajectories. This leads to an average time of over 5 seconds to calculate the DTWs necessary for one single query trajectory. Thus, even if the classification led to optimal results, it would not be applicable in an interactive setting.
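The nearest-neighbor step itself is short; the cost lies in the one DTW per training trajectory. The following sketch illustrates this with scalar toy sequences instead of full-body postures. The local cost `|x - y|`, the use of the accumulated (rather than mean) path cost, and the labels are simplifying assumptions made for illustration only.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two scalar sequences (toy local cost |x - y|)."""
    inf = float("inf")
    acc = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    acc[0][0] = 0.0
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            acc[i][j] = abs(x - y) + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    return acc[len(a)][len(b)]


def classify_1nn(query, training_set, distance=dtw_distance):
    """Return the label of the training sequence closest to the query.

    One distance computation is needed per training sequence, so the
    classification time grows with the size of the training set.
    """
    _, nearest_label = min(training_set, key=lambda item: distance(query, item[0]))
    return nearest_label


# Hypothetical training data: (sequence, label) pairs.
training_set = [
    ([0, 1, 2, 1, 0], "correct"),
    ([0, 1, 1, 1, 0], "error"),
]
predicted = classify_1nn([0, 0, 1, 2, 2, 1, 0], training_set)
```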
4.3 Reducing Alignment Cost: 1NN-RefDTW
To reduce computational cost, we can exploit the general similarity between the trajectories, which all represent the same motor action (squat). We can thus warp all training trajectories to a normalized timing in an offline preprocessing step. This is done by selecting one reference trajectory $R$ and warping all trajectories to its timing. If the reference is a very short trajectory (i.e. a fast movement), information from the original trajectories gets lost due to the warping. Thus, as reference trajectory, we select the longest trajectory that contains all available movement segments. The warping exploits the correspondences found by DTW. For each frame of $R$, the corresponding frame in the to-be-warped trajectory is selected according to the correspondence path from DTW.
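Given a DTW correspondence path, the warping to the reference timing can be sketched as below. The function name is hypothetical, and one detail is an assumption: when several frames of the trajectory correspond to the same reference frame, this sketch keeps the first one, since the text does not specify the choice.

```python
def warp_to_reference(trajectory, path, reference_length):
    """Resample `trajectory` so that it has one frame per reference frame.

    `path` holds (reference_index, trajectory_index) pairs as produced by a
    DTW between the reference and the trajectory.
    """
    warped = [None] * reference_length
    for ref_idx, traj_idx in path:
        # Assumption: keep the first matching frame for each reference frame.
        if warped[ref_idx] is None:
            warped[ref_idx] = trajectory[traj_idx]
    return warped


# A short trajectory is stretched to a reference of four frames ...
stretched = warp_to_reference(["a", "b"], [(0, 0), (1, 0), (2, 1), (3, 1)], 4)
# ... and a longer one is compressed to a reference of two frames.
compressed = warp_to_reference(["a", "b", "c"], [(0, 0), (0, 1), (1, 2)], 2)
```

The second example shows the information loss mentioned above: frame `"b"` has no slot in a shorter reference, which is why a long reference trajectory is preferable.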
For classification, we perform 1NN using the mean of the frame-by-frame distance between the warped query trajectory and the warped training trajectories as distance measure:

$$D(Q', T_i') = \frac{1}{L} \sum_{l=1}^{L} \sum_{j=1}^{J} d\left(q_{l,j}^{Q'}, q_{l,j}^{T_i'}\right)$$

Here, $q_{l,j}^{Q'}$ is the quaternion describing the $j$-th joint in the $l$-th frame of the warped query trajectory $Q'$, whereas $q_{l,j}^{T_i'}$ refers to the corresponding joint of the warped training trajectory $T_i'$; $d$ is the quaternion distance and $J$ the number of joints. $L$ is the length of the reference trajectory. In our case, we have . For each classification, the calculation of one DTW for the query trajectory is sufficient: All comparisons between warped query and training trajectories can now be done frame-by-frame with computational cost linear in $L$. In our setting, this process needs on average 25 ms per trajectory. We call the resulting algorithm 1NN-RefDTW and expect it to have similar classification performance as 1NN-DTW while incurring reduced computational cost.
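Once query and training trajectories share the reference timing, the distance reduces to a single pass over the frames. The sketch below illustrates this; as before, the quaternion distance $1 - |\langle q_1, q_2 \rangle|$ is an assumed form of the inner-product distance, and the function names are hypothetical.

```python
def quat_distance(q1, q2):
    """Assumed inner-product distance 1 - |<q1, q2>| between unit quaternions."""
    dot = sum(a * b for a, b in zip(q1, q2))
    return max(0.0, 1.0 - abs(dot))


def framewise_distance(warped_a, warped_b):
    """Mean over frames of the summed joint distances; linear in the length."""
    if len(warped_a) != len(warped_b):
        raise ValueError("both trajectories must be warped to the same reference")
    total = sum(
        quat_distance(qa, qb)
        for frame_a, frame_b in zip(warped_a, warped_b)
        for qa, qb in zip(frame_a, frame_b)
    )
    return total / len(warped_a)


# Two warped toy trajectories with two frames and two joints each.
identity = (1.0, 0.0, 0.0, 0.0)
flipped = (0.0, 1.0, 0.0, 0.0)
query = [[identity, identity], [identity, identity]]
training = [[identity, identity], [identity, flipped]]
d = framewise_distance(query, training)
```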
Figure 4 summarizes the classification results of 1NN-RefDTW. The classification accuracy is comparable to 1NN-DTW, with some error patterns detected slightly better. Still, the classification accuracy is insufficient for being applied in a coaching scenario. Concerning the computational cost, the new classifier only needs one DTW per query trajectory. Warping a training trajectory into the timing of the reference trajectory needs on average 90 ms. Additionally, the frame-to-frame distance between the warped query and the training trajectories has to be calculated. The computational effort for classification is thus $O(n^2 + K \cdot n)$ instead of $O(K \cdot n^2)$ if all trajectories are of size $n$ and $K$ denotes the number of training trajectories. In our setting, the classification process needs approximately 2.5 s. However, the time needed for classification still depends on the number of trajectories in the data set, which is problematic for large training sets.
4.4 Separate Classification of Error Patterns: RefDTW-SVM
Errors during the performance of motor actions can occur in many different combinations. 1NN-DTW and its extension 1NN-RefDTW only return the whole set of labels of the nearest neighbor as classification for each query. Combinations of error patterns that do not exist in the training data cannot be detected by the algorithm, unless the training data contains all possible combinations of error patterns. As this is typically not the case, it is desirable to learn a separate classifier for each pattern. Furthermore, we would like to provide a classifier with even lower computational cost, ideally independent of the size of the training set. Both goals can be achieved using Support Vector Machines (SVM), one of the most successful machine learning algorithms in general. An SVM learns a decision hyperplane which maximizes the margin between two classes. For classification, the SVM only has to determine on which side of a hyperplane an input query lies. In our case, we can learn a classifier for each error pattern, considering each training trajectory as one data point with the label “pattern occurs” or “pattern does not occur”. To train the SVM, we first warp all training trajectories to the timing of the reference trajectory. Then, for each warped training trajectory, a feature vector is constructed and standardized by removing the mean and scaling to unit variance. This vector consists of all joint angles in Euler angle representation as well as the joint positions for each frame in the warped trajectory. The feature vector thus has size $6 \cdot L \cdot J$ (three Euler angles and three position components per joint and frame), where $L$ is the number of frames of the reference trajectory and $J$ the number of joints ($J = 19$ in our case). Again, we tested different features and found that using joint angles in Euler angle representation together with joint positions leads to good classification results (cf. supplementary online material).
We trained one two-class SVM for each error pattern on the feature vectors obtained from the warped trajectories. In our experiments, a non-linear RBF kernel was unable to beat the linear kernel, thus we decided to use SVMs with linear kernel (cf. supplementary online material). We use the standard SVM implementation from scikit-learn  in version 0.17.1. For classification, a query trajectory is first warped to the timing of the reference trajectory. Then the feature vector is constructed and classified by the trained SVMs. The resulting algorithm is called RefDTW-SVM.
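The per-pattern training described above can be sketched with scikit-learn as follows. This is an illustration only: the feature matrix and labels are synthetic stand-ins for the feature vectors of warped trajectories, and the use of a `Pipeline` with `StandardScaler` and `SVC(kernel="linear")` with default parameters is an assumption, not the authors' exact configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for feature vectors of warped trajectories (hypothetical).
rng = np.random.default_rng(0)
n_trajectories, n_features = 60, 200
X = rng.normal(size=(n_trajectories, n_features))

# One binary label vector per error pattern (1 = pattern occurs); toy labels.
labels = {
    "incorrect weight distribution": (X[:, 0] > 0).astype(int),
    "legs extended at end": (X[:, 1] > 0).astype(int),
}

# Train one standardizing linear SVM per error pattern.
classifiers = {}
for pattern, y in labels.items():
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    classifiers[pattern] = model.fit(X, y)

# A query is classified independently for each pattern, so arbitrary
# combinations of error patterns can be reported.
query = X[:1]
detected = {p: int(m.predict(query)[0]) for p, m in classifiers.items()}
```

Because each classifier is independent, the time per query scales with the number of error patterns rather than with the number of training trajectories.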
Results can be seen in Figure 5: Now, three of the error patterns are classified with an accuracy greater than 80 %. Also, most of the other patterns reach higher results than with the previous 1NN approaches. However, the overall classification performance is still not sufficient. One explanation is the immense number of features per trajectory. We will approach this problem in the next section. Concerning the time needed for classification, for each error pattern, the classifier now only needs a mean of 9.7 ms. Before starting the classification of error patterns, one DTW has to be calculated, which takes about 90 ms as described in Section 4.3.
4.5 Reducing Features: RefDTW-RF-SVM
Our feature vector of size comprises many irrelevant features: For instance, we intuitively do not consider the rotation of the wrist to be related to having a straight back. The SVM classifier might suffer from this high number of irrelevant features as shown by   and  . According to their results, we assume a robust feature selection method to be able to help improving classifier performance. To this end, we use Random Forests (RF) for feature selection . Random Forests perform feature selection as well as classification. They are based on Decision Trees, which learn a hierarchical set of rules to distinguish between classes. Thereby, they implicitly weight the importance of each feature. Random Forests extend Decision Trees and reduce their susceptibility to overfitting via training multiple randomized Decision Trees and averaging them. This leads to an improved accuracy of the estimator as well as a reduced overfitting . See  for an in-depth analysis of the statistical properties and the mathematical background of Random Forests.
Direct classification using Random Forests leads to high computational cost, as all trees in the forest must be considered. We are interested in a model that provides good classification performance with minimal time for classification. As the SVM-based classification presented in Section 4.4 provides almost acceptable results in real-time, we boosted it with a Random-Forest-based feature selection: We trained one Random Forest for each error pattern. The Random Forests are trained on the same feature vectors extracted from the warped trajectories as described for RefDTW-SVM. To train the trees, we used the Gini impurity as criterion to optimize the decision rules. As break condition for growing, we require all leaves to contain only a single class or less than two samples. We observed a number of 200 trees to lead to good results.
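For readers unfamiliar with the splitting criterion, the Gini impurity mentioned above can be written down directly. This is a generic sketch of the standard definition, not code from the scikit-learn implementation; the function names are chosen for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted impurity of the two children produced by a candidate split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n
```

A leaf that contains only a single class has impurity 0, which corresponds to the break condition described above.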
The idea of our new algorithm RefDTW-RF-SVM is to use the Random Forests only for feature selection during training: For each error pattern, the Random Forest assigns an importance value to each feature by averaging the relative importance of the feature over all decision trees. Following an idea of  , we add 20 random features to each frame before performing the feature weighting by Random Forests. The average of their importance values is used as a threshold to discard irrelevant features. This leads to 580 features on average per error pattern (from originally around 100,000 features), which we use as input for the SVMs. We trained the SVMs with the same parameters as for RefDTW-SVM. For the Decision Trees as well as the Random Forests, we use the scikit-learn implementation.
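The thresholding step can be illustrated as follows. This is a sketch under stated assumptions: the importance values are hypothetical inputs (with scikit-learn they could be read from a fitted forest's `feature_importances_` attribute), and the function name and argument layout are invented for this example.

```python
from statistics import mean


def select_features(importances, is_probe):
    """Keep real features whose importance exceeds the mean probe importance.

    importances: one importance value per column (real and random columns).
    is_probe:    parallel list of booleans, True for the added random columns.
    Returns the indices of the selected real features and the threshold.
    """
    probe_values = [imp for imp, probe in zip(importances, is_probe) if probe]
    threshold = mean(probe_values)
    selected = [i
                for i, (imp, probe) in enumerate(zip(importances, is_probe))
                if not probe and imp > threshold]
    return selected, threshold
```

The random columns carry no information about the error pattern, so their average importance gives a data-driven estimate of what "irrelevant" looks like for this particular forest.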
Figure 6 shows the resulting classification accuracy, which outperforms RefDTW-SVM for nearly all patterns. Five patterns reach accuracies higher than 80 %. Concerning the classification time, only 0.1 ms per error pattern is needed in addition to the DTW step. This leads to a total time of around 1 ms to classify all ten patterns after DTW.
4.6 Getting Classification Results Earlier: Segment-based RefDTW-RF-SVM
RefDTW-RF-SVM and all other approaches presented before only allow classification after the whole motor action is completed, as the full query trajectory needs to be warped by DTW. However, some error patterns are limited to parts of the motor action. For instance, the desired depth of the squat is relevant only at the deepest point of the motion. This information can be exploited by using the concept of movement segments: Each performance of a motor action can be considered a combination of simpler sequential sub-actions. These movement segments are homogeneous and functionally meaningful parts of a more complex movement. For the squat, we define the movement segments preparation, going down, is down, going up, and wrap up.
The underlying idea of a segment-based RefDTW-RF-SVM is to simply apply RefDTW-RF-SVM to a single movement segment once it has been completed. The segmentation is done based on a state machine which splits the trajectory at boundary points (state changes) where important joints like the knees start or stop moving. This is similar to the approach proposed in . The segmentation takes less than 1 ms per frame.
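A state machine of this kind can be sketched on a single one-dimensional signal. The example below is a simplification and not the segmentation used in our system: it assumes a hypothetical per-frame knee flexion value and a hypothetical velocity threshold `eps`, and it switches states whenever the knee starts or stops moving.

```python
SEGMENTS = ["preparation", "going down", "is down", "going up", "wrap up"]


def segment_squat(knee_flexion, eps=1.0):
    """Label each frame with a movement segment via a simple state machine.

    knee_flexion: per-frame knee flexion values (hypothetical 1-D signal).
    eps:          change per frame below which the knee counts as resting
                  (hypothetical value).
    """
    state = 0
    labels = []
    previous = knee_flexion[0] if knee_flexion else 0.0
    for value in knee_flexion:
        velocity = value - previous
        previous = value
        flexing = velocity > eps       # knee bends: athlete moves down
        extending = velocity < -eps    # knee straightens: athlete moves up
        resting = not flexing and not extending
        if state == 0 and flexing:
            state = 1
        elif state == 1 and resting:
            state = 2
        elif state == 2 and extending:
            state = 3
        elif state == 3 and resting:
            state = 4
        labels.append(SEGMENTS[state])
    return labels


def boundaries(labels):
    """Frame indices at which the state (movement segment) changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```

Each frame is processed once with a constant amount of work, so the labels can be produced while the movement is still being performed.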
As shown in Figure 7, the classification results are comparable to the results obtained with RefDTW-RF-SVM, which, however, works on the complete trajectories. For each pattern, the maximum accuracy per movement segment is reported. Seven error patterns are classified with an accuracy above 80 %. We performed the classification with the automatically segmented trajectories as well as with manually segmented trajectories. Both led to similar results. Concerning the time needed for classification, as the trajectories for the movement segments are shorter than for the whole motor actions, DTW only needs about 10 ms per movement segment instead of about 90 ms for a whole trajectory. The classification step itself using Segment-based RefDTW-RF-SVM needs around 0.1 ms per error pattern. Overall, a single error pattern is classified on average around 10.1 ms after the movement segment of interest has been performed. As the DTW, which is responsible for around 10 ms of this time, has to be performed only once, classifying all ten of our error patterns takes approximately 11.0 ms.
4.7 Summary of the Results
All algorithms except for 1NN-RefDTW were able to beat the classification performance of our baseline 1NN-DTW. The best classification quality is achieved by RefDTW-RF-SVM and Segment-based RefDTW-RF-SVM. Most error patterns, including the most frequent ones “wrong dynamics”, “incorrect weight distribution”, and “too deep”, can be detected with an accuracy above 80 %. The patterns “incorrect weight distribution” and “feet distance not sufficient” are even nearly perfectly classified. Only the error patterns with the fewest occurrences in our training data, namely “not symmetric” (17 occurrences) and “knees tremble sideways” (23 occurrences), are classified with an accuracy below 70 %. Additionally, Figure 8 reports the F1 score of all presented approaches. Concerning the F1 score, the data looks similar: Only four patterns are classified with a score below 0.8. This enables our system to make use of various feedback strategies (cf. video in the supplementary online material). The exact scores and their standard deviation in the 5-fold cross validation can be found in the supplementary online material. All algorithms except for Segment-based RefDTW-RF-SVM require the calculation of DTW on the whole trajectory, which takes on average about 90 ms. Segment-based RefDTW-RF-SVM only needs single movement segments to be warped, which can be performed in around 10 ms. Table 2 summarizes the time needed to classify a query trajectory with respect to the ten error patterns. Segment-based RefDTW-RF-SVM is clearly the fastest classifier, as the classification step itself only needs 1 ms and the result is potentially available already during the execution of the movement, directly after a single movement segment has been completed.
|1NN-DTW|1NN-RefDTW|RefDTW-SVM|RefDTW-RF-SVM|Segment-based RefDTW-RF-SVM|
|5000 ms|approx. 2500 ms|187 ms|90 ms|11 ms|
5 Discussion and Conclusion
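The totals for the SVM-based classifiers follow from the per-step timings reported in Sections 4.4 to 4.6. The short calculation below recomputes them; the constant and function names are chosen for this example, and the inputs are the approximate timings stated in the text.

```python
DTW_FULL_MS = 90.0      # DTW on a whole trajectory
DTW_SEGMENT_MS = 10.0   # DTW on a single movement segment
NUM_PATTERNS = 10       # number of error patterns


def total_time_ms(dtw_ms, per_pattern_ms, num_patterns=NUM_PATTERNS):
    """One DTW step plus one classifier evaluation per error pattern."""
    return dtw_ms + num_patterns * per_pattern_ms


refdtw_svm = total_time_ms(DTW_FULL_MS, 9.7)        # about 187 ms
refdtw_rf_svm = total_time_ms(DTW_FULL_MS, 0.1)     # about 91 ms
segment_based = total_time_ms(DTW_SEGMENT_MS, 0.1)  # about 11 ms
```

The value of roughly 91 ms for RefDTW-RF-SVM appears to correspond to the 90 ms entry in the table, which is plausible given that all timings are approximate.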
We have presented steps to yield a novel classifier for a fast detection of a variety of error patterns in movement trajectories, as required for interactive coaching applications, e.g., in virtual reality environments. We evaluated all algorithms on a complex motor task involving a high number of relevant error patterns. All scores were measured using cross validation, in a setup where data from one single subject is not allowed to be distributed over multiple folds. Thus our results capture the algorithms’ abilities to generalize across subjects. The resulting algorithm, Segment-based RefDTW-RF-SVM, provides the best balance between quality of classification and computation time: Besides being the fastest classifier in our set, it is among the two classifiers with the highest accuracy scores. Nearly all error patterns, especially the most frequent ones, are classified with accuracies above 80 %. In contrast to many related approaches, this classifier is able to work in interactive setups as shown in our demonstration of how online verbal feedback can be triggered through our automatic error analysis (see the video in the supplementary material).
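The subject-wise split can be illustrated with a small helper. This is a sketch of the general idea, not the evaluation code used for the reported results; the function name and the round-robin assignment of subjects to folds are assumptions. Libraries such as scikit-learn offer comparable functionality, for example `GroupKFold`.

```python
def subject_wise_folds(subject_ids, n_folds=5):
    """Split sample indices into folds so that each subject stays in one fold.

    subject_ids: one subject identifier per sample.
    Returns a list of n_folds lists of sample indices.
    """
    subjects = sorted(set(subject_ids))
    # Round-robin assignment of subjects (not samples) to folds.
    fold_of_subject = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for index, subject in enumerate(subject_ids):
        folds[fold_of_subject[subject]].append(index)
    return folds
```

With such a split, every test fold contains only subjects that the classifier has never seen during training.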
Overall, from the evaluation of each of the different steps taken in the previous section, we can derive the following conclusions about automatic error analysis of human motor performances:
If the data consists of structurally similar movements such as the same type of motor actions, it is sufficient to temporally align all trajectories via performing DTW with one reference trajectory. Thereby we were able to reduce the computational effort while keeping the quality of the classification in a similar range for nearly all error patterns.
For the classification of multiple error patterns, independent classifiers should be trained. A nearest neighbor-based classification, which only copies all labels from the nearest neighbor of a query, is insufficient especially for small training data sets. Learning independent classifiers for all error patterns increased the classifier performance for nearly all examined error patterns.
Random Forests help to select relevant features from high-dimensional input trajectories, even if the number of training examples is small. Such a preprocessing step significantly improves the performance of SVM-based classification. This holds especially for error patterns which are characterized only by very few features such as the “hollow back”.
By classifying data from appropriate movement segments, instead of whole trajectories, the time needed for classification can be drastically minimized while keeping the classification performance high.
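The first of these conclusions, aligning every trajectory to one reference via DTW, can be sketched as follows. This is a textbook DTW with a simple backtracking step, written for illustration only: the frame distance `dist` is passed in by the caller, and the rule of keeping the last query frame aligned to each reference frame is an assumption of this example, not necessarily the choice made in our implementation.

```python
def dtw_path(query, reference, dist):
    """Return the optimal DTW alignment as a list of (query_idx, ref_idx)."""
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the last cell to the first to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def warp_to_reference(query, reference, dist):
    """Resample the query so that it has exactly one frame per reference frame."""
    warped = [None] * len(reference)
    for qi, rj in dtw_path(query, reference, dist):
        # If several query frames map to one reference frame, keep the last.
        warped[rj] = query[qi]
    return warped
```

After warping, every trajectory has the same number of frames as the reference, which is what allows a fixed-length feature vector to be built per trajectory.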
Note that even though the general classification performance of our algorithm is high, the performance is not convincing for two specific error patterns: The pattern “not symmetric” is detected only with F1 scores around 0.43. This error pattern is annotated in trajectories where some joints are not symmetric between the left and the right side of the body. As this can occur in almost all joints and all phases of the movement, the feature selection cannot easily spot the features that are relevant. Further, the classifier has no possibility to infer information on the relationship between multiple joints with respect to symmetry. For the other problematic pattern, “knees tremble sideways”, our best classifier only achieves an F1 score of 0.51. This pattern describes a very subtle movement. Also, it can occur at different points in time: Exactly the frames that are problematic for subject A can be correct for subject B and vice versa. Finally, the number of trembles can differ between subjects, which also makes classification harder. One way to deal with these two problematic patterns is the construction of more complex higher-level features. A higher-level feature could, for instance, describe the relationship between certain parts of the body or the movement of the athlete’s center of mass. The automatic generation and inclusion of such higher-level features is a promising field of future work. Another limitation is that temporal properties of the movements are not covered directly by our algorithm. For motor actions where the user’s timing has an influence on whether certain errors occur, temporal information could be included by adding velocity as well as information on the warping function extracted from DTW.
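To make the idea of higher-level and temporal features more concrete, two very simple candidates are sketched below. Both are purely hypothetical illustrations of the future work described above and were not evaluated: the first relates a left joint angle to its mirrored right counterpart, the second approximates velocity by frame-to-frame differences.

```python
def symmetry_feature(left_angles, right_angles):
    """Per-frame absolute difference between mirrored left/right joint angles."""
    return [abs(l - r) for l, r in zip(left_angles, right_angles)]


def velocity_feature(values):
    """Frame-to-frame difference as a simple velocity estimate."""
    return [b - a for a, b in zip(values, values[1:])]
```

Features of this kind would be appended to the feature vector before feature selection, so that the Random Forest can decide whether they are informative for a given error pattern.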
This research was supported by the Cluster of Excellence Cognitive Interaction Technology CITEC (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG).
A Supplementary Online Materials
Here, we report the measured classification performance of all tested classifiers with respect to accuracy (see Figure 10 and Table 3), F1 score (see Table 4) and Receiver Operating Characteristics Area Under the Curve (ROC AUC) (see Figure 9 and Table 5). ROC curves provide a plot which describes the relationship between recall and fall-out. The true positive rate is plotted on the y axis, the false positive rate on the x axis. The higher the curve, the better the classification. The area under the curve (AUC) is thus often used as a score for classifier performance, as it gives the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative one. Thus the higher the result, the better the classifier performs. This section also contains results for the pure Random-Forest-based classification (RefDTW-RF). This leads to a classification performance in a similar range to RefDTW-RF-SVM, but also to more computational effort: We need around 160 ms in addition to the time needed for the DTW step to classify all of our 10 error patterns, even if the trees inside the Random Forests are evaluated in parallel. All further components of the system, such as dialogue planning, Text-to-Speech, coaching animation, et cetera, have to wait this period of time until they can start planning the feedback corresponding to the motion the trainee just performed in the virtual environment. Thus, for RefDTW-RF-SVM, we only use the Random Forests for feature selection during training to significantly speed up the classification time.
A.2 Comparison of Different Feature Types
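The ranking interpretation of the AUC can be computed directly by comparing every positive instance with every negative one. The sketch below follows this definition literally; it is meant as an illustration and is quadratic in the number of samples, so it is not how one would compute the score for large data sets. The function name and the convention of counting ties as one half are choices made for this example.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    scores: classifier scores, higher means "pattern occurs" is more likely.
    labels: 1 for positive instances, 0 for negative instances.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means that every positive instance receives a higher score than every negative one, while 0.5 corresponds to a random ranking.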
First, we compare the classification results of the baseline 1NN-DTW when using rotations as quaternions, rotations as Euler angles, or joint positions. In the nearest neighbor step, the Euclidean distance between the warped frames is used. All approaches lead to results in a similar range on average over all error patterns (see Figure 11 and Figure 12).
Second, we compare the classification results of our own final classifier Segment-based RefDTW-RF-SVM with respect to different feature sets. The feature weighting using Random Forests on quaternions is implemented component-wise. All quaternions with at least one feature weight above the threshold are completely used for SVM classification. For some error patterns, we observe that the quality of the classification complements each other for joint angles and joint translations: Some patterns (such as “feet distance not sufficient”) can be classified best based on the translations, others (such as “hollow back”) are classified much better based on the angles. We thus combine joint angles and joint translations which leads to a slight enhancement of the overall performance. Here, we finally decide to use Euler angles instead of quaternions for the sake of better interpretability of the selected features and a slightly shorter feature vector. In general, all classifiers behave similarly (see Figure 13 for the accuracies and Figure 14 for the F1 scores).
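The rule for quaternion features, where a whole quaternion is kept as soon as one of its components passes the threshold, can be sketched as follows. The flat layout with four consecutive weights per quaternion and the function name are assumptions of this example.

```python
def select_whole_quaternions(weights, threshold):
    """Select all four components of every quaternion with a relevant component.

    weights:   flat list of feature weights, four consecutive values per
               quaternion (assumed layout).
    threshold: importance threshold, e.g. the mean weight of random features.
    Returns the indices of all selected components.
    """
    if len(weights) % 4 != 0:
        raise ValueError("expected four weights per quaternion")
    selected = []
    for q in range(len(weights) // 4):
        components = weights[4 * q:4 * q + 4]
        if any(w > threshold for w in components):
            selected.extend(range(4 * q, 4 * q + 4))
    return selected
```

Keeping quaternions whole preserves a valid rotation for each selected joint, at the price of a somewhat longer feature vector than with component-wise selection.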
A.3 RBF Kernel vs. Linear Kernel in SVM
In this part, we compare the classification performance of Segment-based RefDTW-RF-SVM using a linear kernel compared to using a radial basis function kernel. Results are in a similar range (see Figure 15 for the accuracies and Figure 16 for the F1 scores). We finally decide to use the linear kernel for the sake of simplicity.
A.4 1NN-DTW Based on Movement Segments
Finally, we evaluated the performance of 1NN-DTW on movement segments which leads to Segment-based 1NN-DTW. Here, the results are again worse than for our own classifier Segment-based RefDTW-RF-SVM (see Figure 17 for the accuracies and Figure 18 for the F1 scores).
- An approach to ballet dance training through ms kinect and visualization in a cave virtual reality environment.
Matthew Kyan, Guoyu Sun, Haiyan Li, Ling Zhong, Paisarn Muneesawang, Nan Dong, Bruce Elder, and Ling Guan. ACM Transactions on Intelligent Systems and Technology (TIST)
- A multimodal system for real-time action instruction in motor skill learning.
Iwan de Kok, Julian Hough, Felix Hülsmann, Mario Botsch, David Schlangen, and Stefan Kopp. In Proceedings of the International Conference on Multimodal Interaction, pages 355–362. ACM, 2015.
- Sonification and haptic feedback in addition to visual feedback enhances complex motor task learning.
Roland Sigrist, Georg Rauter, Laura Marchal-Crespo, Robert Riener, and Peter Wolf. Experimental brain research
- A virtual reality dance training system using motion capture technology.
Jacky CP Chan, Howard Leung, Jeff KT Tang, and Taku Komura. IEEE Transactions on Learning Technologies
- Bayesian approaches for learning of primitive-based compact representations of complex human activities.
Dominik Endres, Enrico Chiovetto, and Martin A Giese. In Dance Notations and Robot Motion, pages 117–137. Springer, 2016.
- The use of inertial sensors for the classification of rehabilitation exercises.
Oonagh Giggins, Kevin T Sweeney, and Brian Caulfield. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 2965–2968. IEEE, 2014a.
- Online segmentation with multi-layer svm for knee osteoarthritis rehabilitation monitoring.
Hsieh-Ping Chen, Hsieh-Chung Chen, Kai-Chun Liu, and Chia-Tai Chan. In Wearable and Implantable Body Sensor Networks (BSN), International Conference on, pages 55–60. IEEE, 2016.
- Toward automatic activity classification and movement assessment during a sports training session.
Amin Ahmadi, Edmond Mitchell, Chris Richter, Francois Destelle, Marc Gowing, Noel E O’Connor, and Kieran Moran. IEEE Internet of Things Journal
- Exercise motion classification from large-scale wearable sensor data using convolutional neural networks.
Terry Taewoong Um, Vahid Babakeshizadeh, and Dana Kulic. arXiv preprint arXiv:1610.07031
- Efficient metric learning for the analysis of motion data.
Babak Hosseini and Barbara Hammer. In IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE, 2015.
- The assessment of learning performance using dynamic time warping algorithm for the virtual reality of full-body motion sensing control.
Huey-Min Sun. In Human-Computer Interaction (SIGHCI), 2016.
- Movement analysis of rehabilitation exercises: Distance metrics for measuring patient progress.
R. Houmanfar, M. Karg, and D. Kulić. IEEE Systems Journal
- A functional approach to movement analysis and error identification in sports and physical education.
Ernst-Joachim Hossner, Frank Schiebl, and Ulrich Göhner. Frontiers in Psychology
- Augmented visual, auditory, haptic, and multimodal feedback in motor learning: a review.
Roland Sigrist, Georg Rauter, Robert Riener, and Peter Wolf. Psychonomic bulletin & review
- Evidence for the flexible sensorimotor strategies predicted by optimal feedback control.
Dan Liu and Emanuel Todorov. The Journal of Neuroscience
- Introduction to sports biomechanics: Analysing human movement patterns.
Roger Bartlett.
- Eyes-free yoga: an exergame using depth cameras for blind & low vision exercise.
Kyle Rector, Cynthia L Bennett, and Julie A Kientz. In Proceedings of the International ACM SIGACCESS Conference on Computers and Accessibility, pages 12–19, 2013.
- Multi-level analysis of motor actions as a basis for effective coaching in virtual reality.
Felix Hülsmann, Cornelia Frank, Thomas Schack, Stefan Kopp, and Mario Botsch. In Proceedings of the International Symposium on Computer Science in Sports (ISCSS), pages 211–214. Springer, 2016.
- The single leg squat test in the assessment of musculoskeletal function: a review.
Robert Bailey, James Selfe, and Jim Richards. Physiotherapy Practice and Research
- Knee biomechanics of the dynamic squat exercise.
Rafael F Escamilla. Medicine and science in sports and exercise
- Fast time series classification using numerosity reduction.
Xiaopeng Xi, Eamonn Keogh, Christian Shelton, Li Wei, and Chotirat Ann Ratanamahatana. In Proceedings of the 23rd international conference on Machine learning, pages 1033–1040. ACM, 2006.
- An experimental evaluation of nearest neighbour time series classification.
Anthony Bagnall and Jason Lines. arXiv preprint arXiv:1406.4757
- Realizing a low-latency virtual reality environment for motor learning.
Thomas Waltemate, Felix Hülsmann, Thies Pfeiffer, Stefan Kopp, and Mario Botsch. In Proceedings of the 21st ACM Symposium on Virtual Reality Software and Technology, pages 139–147, 2015.
- Classifying human motion quality for knee osteoarthritis using accelerometers.
Portia E Taylor, Gustavo JM Almeida, Takeo Kanade, and Jessica K Hodgins. In Annual International Conference of the IEEE Engineering in Medicine and Biology, pages 339–343, 2010.
- Classification of squat quality with inertial measurement units in the single leg squat mobility test.
Rezvan Kianifar, Alex Lee, Sachin Raina, and Dana Kulić. In Engineering in Medicine and Biology Society (EMBC), Annual International Conference of the, pages 6273–6276. IEEE, 2016.
- Sparse principal component analysis.
Hui Zou, Trevor Hastie, and Robert Tibshirani. Journal of computational and graphical statistics
- Evaluating squat performance with a single inertial measurement unit.
Martin O’Reilly, Darragh Whelan, Charalampos Chanialidis, Nial Friel, Eamonn Delahunt, Tomás Ward, and Brian Caulfield. In International Conference on Wearable and Implantable Body Sensor Networks (BSN), pages 1–6. IEEE, 2015.
- Evaluating rehabilitation exercise performance using a single inertial measurement unit.
Oonagh Giggins, Daniel Kelly, and Brian Caulfield. In Proceedings of the International Conference on Pervasive Computing Technologies for Healthcare, pages 49–56, 2013.
- Rehabilitation exercise assessment using inertial sensors: a cross-sectional analytical study.
Oonagh M Giggins, Kevin T Sweeney, and Brian Caulfield. Journal of Neuroengineering and Rehabilitation
- Automated evaluation of physical therapy exercises using multi-template dynamic time warping on wearable sensor signals.
Aras Yurtman and Billur Barshan. Computer methods and programs in biomedicine
- Morphable models for the analysis and synthesis of complex motion patterns.
Martin A Giese and Tomaso Poggio. International Journal of Computer Vision
- Motion graphs++: a compact generative model for semantic motion analysis and synthesis.
Jianyuan Min and Jinxiang Chai. ACM Transactions on Graphics (TOG)
- Parametric hidden markov models for gesture recognition.
Andrew D Wilson and Aaron F Bobick. IEEE transactions on pattern analysis and machine intelligence
- Interval and dynamic time warping-based decision trees.
Juan J Rodríguez and Carlos J Alonso. In Proceedings of the ACM symposium on Applied computing, pages 548–552, 2004.
- Distance-function design and fusion for sequence data.
Yi Wu and Edward Y Chang. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 324–333, 2004.
- Feature-based classification of time-series data.
Alex Nanopoulos, Rob Alcock, and Yannis Manolopoulos. International Journal of Computer Research
- Motion classification using dynamic time warping.
Kevin Adistambha, Christian H Ritz, and Ian S Burnett. In Multimedia Signal Processing, IEEE Workshop on, pages 622–627, 2008.
- Dynamic time warping averaging of time series allows faster and more accurate classification.
François Petitjean, Germain Forestier, Geoffrey I Webb, Ann E Nicholson, Yanping Chen, and Eamonn Keogh. In International Conference on Data Mining, pages 470–479. IEEE, 2014.
- Dialogue structure of coaching sessions.
Iwan de Kok, Julian Hough, Cornelia Frank, David Schlangen, and Stefan Kopp. In Proceedings of the SemDial Workshop on the Semantics and Pragmatics of Dialogue (DialWatt), pages 167–169, 2014.
- NASM essentials of personal fitness training.
Micheal A Clark, Scott Lucett, and Brian G Sutton.
- Dynamic programming algorithm optimization for spoken word recognition.
Hiroaki Sakoe and Seibi Chiba. IEEE transactions on acoustics, speech, and signal processing
- Metrics for 3d rotations: Comparison and analysis.
Du Q Huynh. Journal of Mathematical Imaging and Vision
- Do we need hundreds of classifiers to solve real world classification problems.
Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. J. Mach. Learn. Res
- Pattern Recognition and Machine Learning (Information Science and Statistics).
Christopher M. Bishop.
- Scikit-learn: Machine learning in Python.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Journal of Machine Learning Research
- Feature selection for svms.
Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Advances in Neural Information Processing Systems (NIPS)
- Combining svms with various feature selection strategies.
Yi-Wei Chen and Chih-Jen Lin. In Isabelle Guyon, Masoud Nikravesh, Steve Gunn, and Lotfi A. Zadeh, editors, Feature Extraction: Foundations and Applications, volume 207, chapter 12, pages 315–324. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
- Variable selection using random forests.
Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. Pattern Recognition Letters
- Random forests.
Leo Breiman. Machine learning
- Analysis of a random forests model.
Gérard Biau. Journal of Machine Learning Research
- Dimensionality reduction via sparse support vector machines.
Jinbo Bi, Kristin Bennett, Mark Embrechts, Curt Breneman, and Minghu Song. Journal of Machine Learning Research