Classify, predict, detect, anticipate and synthesize: Hierarchical recurrent latent variable models for human activity modeling
Human activity modeling operates on two levels: high-level action modeling, such as classification, prediction, detection and anticipation, and low-level motion trajectory prediction and synthesis. In this work, we propose a semi-supervised generative latent variable model that addresses both of these levels by modeling continuous observations as well as semantic labels. We extend the model to capture the dependencies between different entities, such as a human and objects, and to represent hierarchical label structure, such as high-level actions and sub-activities. In the experiments we investigate our model’s capability to classify, predict, detect and anticipate semantic action and affordance labels and to predict and generate human motion. We train our models on data extracted from depth image streams from the Cornell Activity 120, the UTKinect-Action3D and the Stony Brook University Kinect Interaction Dataset. We observe that our model performs well in all of the tasks and often outperforms task-specific models.
Human behavior is determined by many factors such as intention, need, belief and environmental aspects. For example, when standing at a red traffic light, a person might wait or walk depending on whether a car is approaching, a police car is parked next to the light or they are looking at their mobile phone. This poses a problem for computer vision systems as they often only have access to a visual signal such as single view image sequences or 3D skeletal recordings. Based on this signal, the current class label or subsequent labels and trajectories need to be determined. In the following discussion we focus on action labels, but the class labels can also describe other factors, e.g. environmental aspects such as affordances or object identity.
The field of human activity modeling can be divided into several tasks, some of which are listed in Table 1 and visualized in Figure 1. We identify classification, prediction, detection and anticipation of semantic labels as well as prediction and synthesis of continuous human motion. The methods concerned with discrete labels usually use segmented feature sequences and their respective labels to train a model. They can be distinguished by the nature of the testing data and the specific task or cost function. While classification (Figure 1a) requires an inferred label at the end of the observed sequence, prediction, also known as early action recognition (Figure 1b), should predict a label as early as possible. Detection and anticipation (Figures 1c and 1d) are not provided with segmented sequences at test time. They are required to detect the action onset and, respectively, to determine the current action label or to anticipate future action labels after a given point in time. Finally, human motion prediction and synthesis (Figures 1e and 1f) need to operate on continuous sequences, which are often noisy and stochastic. While motion prediction is required to produce a single best guess of how the human will move after a given point in time, motion synthesis aims at generating different, probable future motion trajectories. Both of these methods can also be provided with label information.
|Method||Training data||Testing input data||Task|
|Action classification||segmented features, labels||segmented features||classify at the end of sequence|
|Action prediction||segmented features, labels||segmented features||classify as early as possible|
|Action detection||segmented features, labels||features||detect action onset and classify|
|Action anticipation||segmented features, labels||features up to a given time||predict actions after that time|
|Motion prediction||features, (labels)||features, (labels) up to a given time||predict sequence after that time|
|Motion synthesis||features, (labels)||features, (labels) up to a given time||generate different sequences after that time|
The features used by these methods are generally extracted from color and depth images and contain 3D skeletal coordinates of the human and/or meta-data such as the location of objects in the environment. Most often, the labels comprise a single action class for an observed sequence. However, they can be hierarchical in nature, with one label being the parent of others, and/or cover meta-data such as affordance labels.
Each of the methods in Table 1 addresses an important aspect of human activity understanding in its own right, and most developed models fall into one or several of these categories, as discussed in the related work in Section 4. In order to solve these problems with a single model, we require a generative model that can represent complex spatio-temporal patterns. It should be able to model the label space and the feature space over time, even if meta-data and meta-labels or hierarchical label structures are present. Given such a model, we can make inferences over current and future labels as well as future feature sequences.
The key contribution of this paper is to address the problems in Table 1 with a generative, temporal latent variable model that can capture the complex dependencies of continuous features as well as discrete labels over time. Extensions of this model allow us to integrate hierarchical label structure as well as multiple entities. In detail, we propose a semi-supervised variational recurrent neural network (SVRNN), as described in Section 3.1, which inherits the generative capacities of a variational autoencoder (VAE) [17, 28], extends these to temporal data and combines them with a discriminative model in a semi-supervised fashion. The semi-supervised VAE can handle labeled and unlabeled data. This property allows us to propagate label information over time even during testing and therefore to generate possible future action and motion sequences. We propose two extensions.
Firstly, we make use of the hierarchical label structure in human activities in the form of a hierarchical SVRNN (HSVRNN), as described in Section 3.2. The hierarchical structure is incorporated by conditioning the discrete distributions in the lower level of the hierarchy on their parent distributions.
Secondly, we model the dependencies between multiple entities, such as a human and objects or two interacting humans, by extending the model to a multi-entity SVRNN (ME-SVRNN), as introduced in Section 3.3. The ME-SVRNN propagates information about the current state of an entity to other entities which increases the predictive power of the model.
We evaluate our model on the Cornell Activity Dataset 120 (CAD-120), to test action detection and anticipation, the UTKinect-Action3D Dataset, to test action detection and prediction, and the Stony Brook University Kinect Interaction Dataset (SBU), to test action classification and motion prediction and synthesis. We find that our model outperforms state-of-the-art methods in both action and affordance detection and anticipation (Section 5.2) and action detection and prediction (Section 5.3) while performing comparably in action classification (Section 5.4). Further, our generative approach outperforms a state-of-the-art non-probabilistic model in human motion prediction and is able to generate possible long-term motion sequences (Section 5.5). We conclude this paper with a final discussion of these findings in Section 6.
Our approach builds on three basic ingredients, namely variational autoencoders, or VAEs (Section 2.1), their semi-supervised equivalent (Section 2.2) and a recurrent version (Section 2.3). To ease understanding of later sections, we introduce each of these concepts here and refer to relevant literature for further details. First, we introduce the notation used in this paper.
Notation We denote random variables by bold characters . We represent continuous data points by , discrete labels by and and latent variables by . The hidden state of a recurrent neural network (RNN) unit at time is denoted by . Similarly, time-dependent random variables are indexed by , e.g. . Distributions commonly depend on parameters . For the sake of brevity, we will neglect this dependency in the following discussion. is a set of data points in the interval 1 to . denotes a pair, denotes the intersection and the concatenation of variables and .
2.1 Variational autoencoders
Our model builds on VAEs, latent variable models that are combined with an amortized version of variational inference (VI). Amortized VI employs neural networks to learn a function from the data to a distribution over the latent variables that approximates the posterior. Likewise, they learn the likelihood distribution as a function of the latent variables. This mapping is depicted in Figure 2a). Instead of having to infer local latent variables for each observed data point, as is common in VI, amortized VI requires only the learning of the neural network parameters of these two functions. We call the first the recognition network and the second the generative network. To sample from a VAE, we first draw a latent sample from the prior, which is then fed to the generative network to yield an observation. We refer to  for more details.
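For reference, the objective on which the recognition and generative networks are jointly trained can be written out. The following is the standard evidence lower bound of a VAE; the symbols ($\mathbf{x}$ for an observation, $\mathbf{z}$ for the latent variable, $q$ for the recognition network, $p$ for the generative network and the prior) are notational assumptions made for this sketch.

```latex
\log p(\mathbf{x}) \;\geq\;
\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\big[\log p(\mathbf{x} \mid \mathbf{z})\big]
\;-\; \mathrm{KL}\big(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\big)
```

The first term rewards accurate reconstruction through the generative network, while the second keeps the output of the recognition network close to the prior.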
2.2 Semi-supervised variational autoencoders
To incorporate label information when available, semi-supervised VAEs (SVAE)  include a label in both the generative process and the recognition network, as shown in Figure 2b). To handle unobserved labels, an additional approximate distribution over labels is learned, which can be interpreted as a classifier. When no label is available, the discrete label distribution can be marginalized out.
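The marginalization over an unobserved label can be illustrated numerically. The sketch below assumes that a lower bound has already been computed for every candidate label and that the classifier returns a probability vector; the function and argument names are hypothetical and not taken from the original work. It follows the usual semi-supervised VAE construction, in which the unlabeled bound is the expectation of the labeled bounds under the classifier plus the classifier's entropy.

```python
import numpy as np


def unlabeled_bound(label_probs, labeled_bounds):
    """Marginalize per-label lower bounds under q(y | x) and add its entropy.

    label_probs:    classifier output q(y | x), one probability per label
    labeled_bounds: lower bound obtained when each label is assumed observed
    """
    q = np.asarray(label_probs, dtype=float)
    bounds = np.asarray(labeled_bounds, dtype=float)
    # Small constant avoids log(0) for labels with zero probability.
    entropy = -np.sum(q * np.log(q + 1e-12))
    return float(np.sum(q * bounds) + entropy)


# A confident classifier reduces the unlabeled bound to the labeled one.
print(unlabeled_bound([1.0, 0.0, 0.0], [-1.0, -2.0, -3.0]))
```

When the classifier is uncertain, the entropy term contributes and all candidate labels influence the bound.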
2.3 Recurrent variational autoencoders
VAEs can also be extended to temporal data in the form of so-called variational recurrent neural networks (VRNN). Instead of being stationary as in vanilla VAEs, the prior over the latent variables depends in this case on past observations, which are encoded in the hidden state of an RNN. Similarly, the approximate distribution depends on the history, as can be seen in Figure 2c). The advantage of this structure is that data sequences can be generated by sampling from the temporal prior instead of an uninformed prior.
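Generation from a temporal prior can be sketched as a simple loop. The code below is a toy illustration under strong assumptions: the weights are random stand-ins for trained networks, the recurrence is a plain tanh layer rather than an LSTM, and the decoder returns only a mean. All names and dimensions are hypothetical.

```python
import numpy as np


def generate_sequence(steps, x_dim=3, z_dim=2, h_dim=4, seed=0):
    """Sample a sequence by repeatedly drawing latents from a history-dependent prior."""
    rng = np.random.default_rng(seed)
    # Randomly initialised weights standing in for trained networks (assumption).
    w_prior = rng.normal(scale=0.1, size=(h_dim, 2 * z_dim))  # h -> prior mean, log-variance
    w_dec = rng.normal(scale=0.1, size=(z_dim + h_dim, x_dim))  # (z, h) -> observation mean
    w_rec = rng.normal(scale=0.1, size=(x_dim + z_dim + h_dim, h_dim))  # recurrence

    h = np.zeros(h_dim)
    observations = []
    for _ in range(steps):
        # The prior over z depends on the hidden state, i.e. on the history.
        prior_params = h @ w_prior
        mean, log_var = prior_params[:z_dim], prior_params[z_dim:]
        z = mean + np.exp(0.5 * log_var) * rng.normal(size=z_dim)
        # Decode an observation from the latent sample and the history.
        x = np.concatenate([z, h]) @ w_dec
        # Fold the generated observation and latent back into the history.
        h = np.tanh(np.concatenate([x, z, h]) @ w_rec)
        observations.append(x)
    return np.stack(observations)


sequence = generate_sequence(steps=5)
print(sequence.shape)
```

Because each latent sample is drawn from a prior that has seen the previously generated frames, consecutive samples are coupled through the hidden state rather than being independent draws.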
Equipped with the background knowledge introduced in the previous section, we will now describe the structure of our proposed model, semi-supervised variational RNNs, and the inference procedure applied to train them (Section 3.1). We will further detail how to extend the model to capture hierarchical label structure (Section 3.2) and to jointly model multiple entities (Section 3.3).
3.1 Semi-supervised variational RNNs
In the SVRNN, we assume that we are given a dataset with temporal structure consisting of labeled time steps and unlabeled observations . denotes the empirical distribution. Further we assume that the temporal process is governed by latent variables , whose distribution depends on a deterministic function of the history up to time : . The generative process is as follows
where the distributions over the labels and the latent variables are time-dependent priors, as shown in Figure 3a). To fit this model to the dataset at hand, we need to estimate the posterior over the unobserved labels and latent variables, which is intractable. Therefore we resort to amortized VI and approximate the posterior with a simpler distribution, as shown in Figure 3b). To minimize the distance between the approximate and the true posterior distributions, we optimize the variational lower bound of the marginal likelihood. As the distribution over the label is only required when the label is unobserved, the bound decomposes as follows
The bound consists of lower bounds for the labeled and the unlabeled data points, respectively, and an additional term that encourages the prior and the approximate distribution over the labels to follow the data distribution over the labels. This lower bound is optimized jointly. We assume the latent variables to be i.i.d. Gaussian distributed. The categorical distribution over the labels is determined by its weights. To model such discrete distributions, we apply the Gumbel trick [13, 25]. The history is modeled with a long short-term memory (LSTM) unit . For more details, we refer the reader to the related work discussed in Section 2 and the Supplementary material.
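The Gumbel trick replaces a non-differentiable categorical sample with a relaxed sample on the probability simplex. The following is a minimal NumPy sketch of the Gumbel-softmax relaxation described in [13, 25]; the function name, the temperature value and the use of NumPy instead of a differentiable framework are assumptions made for illustration.

```python
import numpy as np


def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Draw a relaxed one-hot sample from a categorical distribution."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise obtained from uniform samples; the lower limit avoids log(0).
    uniform = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(uniform))
    scaled = (logits + gumbel_noise) / temperature
    scaled -= scaled.max()  # numerical stability of the softmax
    weights = np.exp(scaled)
    return weights / weights.sum()


sample = gumbel_softmax_sample([1.0, 2.0, 3.0], temperature=0.5,
                               rng=np.random.default_rng(0))
print(sample, sample.sum())
```

Lower temperatures push the sample towards a one-hot vector, while higher temperatures yield smoother vectors at the cost of a less faithful approximation of the discrete distribution.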
3.2 Hierarchical SVRNN
Human activity can often be described by hierarchical semantic labels. For example, the label cleaning might be parent to the labels vacuuming and scrubbing. While we here describe how to model a hierarchy consisting of two label layers, the number of layers is not constrained. Each label is assigned a parent random variable. To incorporate the parent label, we extend the model by an additional prior and an additional approximate distribution over it. The latent state at each time step depends on both the label and its parent label. Thus, the distributions over the latent variables and over the label are conditioned on the parent label.
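Instead of partitioning the dataset into two parts, labeled and unlabeled, the additional variable requires us to divide it into four parts, according to whether the label and its parent label are each observed or unobserved. This means that the lower bound in Equation 2 is extended with corresponding terms for each of these four parts.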
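Conditioning a lower-level label on its parent can be illustrated with a small conditional probability table. In the sketch below the table stands in for the learned conditional distribution, and the second parent label, the third child label and all probabilities are hypothetical values chosen for illustration only.

```python
import numpy as np

parents = ["cleaning", "having a meal"]
children = ["vacuuming", "scrubbing", "eating"]

# Hypothetical p(child | parent); each row sums to one.
child_given_parent = np.array([
    [0.6, 0.4, 0.0],  # parent: cleaning
    [0.0, 0.1, 0.9],  # parent: having a meal
])


def child_distribution(parent_probs):
    """Marginal over child labels: sum over parents of p(child | parent) p(parent)."""
    return np.asarray(parent_probs, dtype=float) @ child_given_parent


belief = child_distribution([0.3, 0.7])
print(dict(zip(children, belief.round(2))))
```

A confident belief in the parent label concentrates the mass on its children, which is the effect that conditioning on the parent distribution is meant to provide.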
3.3 Multi-entity SVRNN
To model different entities, we allow them to share information with each other over time. The structure and information flow of this model is a design choice. In our case, the entities consist of the human and additional entities, such as objects or other humans. We index the variables of each entity by their source, and we summarize the history and the current observation of all additional entities in joint variables. Instead of only conditioning on its own history and observation, as described in Section 3.1, each entity also conditions on others’ history and observations. Specifically, the model of the human receives information from all additional entities, while these receive information from the human model. The prior and approximate distributions of the human are thus additionally conditioned on the summarized history and observations of the additional entities, and those of each additional entity on the history and observation of the human. We assume that the labels for all entities are observed and unobserved at the same points in time. Therefore, the lower bound in Equation 2 is only extended by summing over all entities:
where the labeled and unlabeled bounds depend on the probability distributions associated with the respective entity and take the same form as in Equation 2. This model can be extended to a hierarchical version, the ME-HSVRNN.
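The information flow between entities can be sketched as the construction of conditioning inputs. The code below assumes that the states of the additional entities are summarized by summation and that sharing is implemented by concatenation; both choices, as well as all names, are assumptions for illustration and are not specified in this section.

```python
import numpy as np


def conditioning_inputs(human_state, object_states):
    """Build the inputs each entity conditions on in a multi-entity model."""
    human_state = np.asarray(human_state, dtype=float)
    object_states = np.asarray(object_states, dtype=float)
    # Assumed summary of all additional entities: element-wise sum.
    summary = object_states.sum(axis=0)
    # The human receives its own state plus the summary of all other entities.
    human_input = np.concatenate([human_state, summary])
    # Each additional entity receives its own state plus the human state.
    object_inputs = [np.concatenate([obj, human_state]) for obj in object_states]
    return human_input, object_inputs


human_input, object_inputs = conditioning_inputs(
    [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
)
print(human_input, len(object_inputs))
```

A summary whose size is independent of the number of objects keeps the human model's input dimension fixed, which matters when recordings contain different numbers of objects.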
3.4 Classify, predict, detect, anticipate and generate
To solve the problems listed in Table 1, we make use of the different components of our model in the following way. We describe only the procedures for the SVRNN; the other models follow the same ideas.
Classify, predict and detect actions: To classify or detect at a given time step, we choose the label with the largest weight in the categorical distribution over the labels. Classification is performed at the end of the sequence, while prediction and detection are performed at all time steps.
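The decision rule above amounts to an argmax over the categorical weights. A minimal sketch, assuming the weights are available as an array with one row per time step (the function names are hypothetical):

```python
import numpy as np


def detect_labels(label_weights):
    """Per-time-step label: index of the largest categorical weight."""
    return np.argmax(np.asarray(label_weights), axis=-1)


def classify_sequence(label_weights):
    """Sequence label: the detected label at the final time step."""
    return int(detect_labels(label_weights)[-1])


weights = [[0.7, 0.2, 0.1],
           [0.4, 0.5, 0.1],
           [0.1, 0.2, 0.7]]
print(detect_labels(weights), classify_sequence(weights))
```

Detection and prediction read off the label at every step, whereas classification only uses the end of the sequence.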
Anticipate actions: To anticipate a label after a given time step, we make use of the prior, which does not depend on the current observation. Thus, for the next time step, we choose the label with the largest weight in the categorical prior distribution. To anticipate several steps into the future, we need to generate both future observations and future labels with the help of the priors, as described below.
Predict and generate motion: To sample an observation sequence after a given time step, we follow the generative process in Equation 1 for each subsequent time step. Alternatively, we can propagate the sampled observations and generate with the help of the approximate distribution at each subsequent time step. This method is used to predict a sequence by averaging over several samples of the distributions.
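Prediction by averaging over samples can be sketched independently of the model itself. In the code below, `toy_sampler` is a hypothetical stand-in for sampling a future sequence from the trained model; only the averaging step corresponds to the procedure described above.

```python
import numpy as np


def toy_sampler(rng, steps=10, dims=3):
    """Hypothetical stand-in for one model sample: a noisy linear drift."""
    drift = np.linspace(0.0, 1.0, steps)[:, None] * np.ones(dims)
    return drift + rng.normal(scale=0.1, size=(steps, dims))


def predict_sequence(sample_future, num_samples=50, seed=0):
    """Single best guess: the mean over several sampled future sequences."""
    rng = np.random.default_rng(seed)
    samples = np.stack([sample_future(rng) for _ in range(num_samples)])
    return samples.mean(axis=0)


prediction = predict_sequence(toy_sampler, num_samples=200)
print(prediction.shape)
```

Keeping the individual samples instead of their mean corresponds to synthesis, where several distinct but probable futures are of interest.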
4 Related work
Before presenting the experimental results, we will point to relevant prior work both when it comes to methodology (Section 4.1) and to human action classification (Section 4.2), action detection and prediction (Section 4.3), action anticipation (Section 4.4) and human motion prediction and synthesis (Section 4.5). As each of these fields is rich in literature, we will concentrate on a few, highly related works that consider 3D skeletal recordings.
4.1 Recurrent latent variable models with class information
Recurrent latent variable models that encode explicit semantic information have mostly been developed in the natural language processing community. The aim of  is sequence classification. They encode a whole sequence into a single latent variable, while static class information, such as sentiment, is modeled in a semi-supervised fashion. A similar model is suggested in  for sequence transduction. Multiple semantic labels, such as part of speech or tense, are encoded into the control signal. Sequence transduction is also the topic of . In contrast to , the latent space is assumed to resemble a morphological structure, i.e. every word in a sentence is assigned latent lemmata and morphological tags. While this discrete structure is optimal for language, continuous variables, such as trajectories, require continuous latent dynamics. These are modeled by , who divide the latent space into static (e.g. appearance) and dynamic (e.g. trajectory) variables which are approximated in an unsupervised fashion. While this model lends itself to sequence generation, it is not able to incorporate explicit semantic information. In contrast to [34, 36] and , our model incorporates semantic information that changes over the course of the sequence, such as composable action sequences, and simultaneously models continuous dynamics.
4.2 Human activity classification
3D human action classification is a broad field which has been covered by several surveys, e.g.  and . Traditionally, the problem of classifying a motion sequence has been a two-stage process of feature extraction followed by time series modeling, e.g. with Hidden Markov Models . Developments in deep learning have led to fusing these two steps. Both convolutional neural networks, e.g. [6, 15], and recurrent neural network architectures, e.g. [7, 23], have been adapted to this task. Recent developments include the explicit modeling of co-occurrences between joints  and the introduction of attention mechanisms that focus on action-relevant joints, the so-called Global Context-Aware Attention LSTM (GCA-LSTM) . A different approach is taken by View Adaptive LSTMs (VA-LSTM), which learn to transform the skeleton spatially to facilitate classification . Compared to these approaches, we adopt a semi-supervised, probabilistic latent variable model which learns to propagate action hypotheses over time and is robust to view and skeletal differences.
4.3 Human activity prediction and detection
Activity prediction and detection are related in the sense that both methods require classification before the whole sequence has been observed. Detection, however, also aims at determining the onset of an action within a data stream.
To encourage early recognition,  define a loss that penalizes immediate mistakes more than distant false classifications. A more adaptive approach is proposed by , namely a convolutional neural network with different scales which are automatically selected such that actions can be predicted early on.
In order to detect action onsets,  combine class-specific pose templates with dynamic time warping. Similarly, such pose templates are used by  who couple these with variables describing actions at different levels in a hierarchical model. Instead of templates,  introduce an LSTM that is trained to both classify and predict the onset and offset of an action. In contrast to these approaches, we propose a generative, semi-supervised model, which proposes action hypotheses from the first frame onward. As we do not constrain the temporal dynamics of the distribution over labels, the model learns to detect action changes online.
4.4 Human activity anticipation
Activity anticipation aims at predicting semantic labels of actions that have not yet been initiated. This spatio-temporal problem has been addressed with anticipatory temporal conditional random fields (ATCRF) , which augment conditional random fields (CRFs) with predictive, temporal components. In a more recent work, structural RNNs (S-RNNs) have been used to classify and predict activity and affordance labels by modeling the edges and nodes of CRFs with RNNs . In contrast to these supervised approaches, our semi-supervised generative model naturally propagates label information over time and anticipates the label of the next action.
4.5 Human motion prediction and synthesis
Recent advances in human motion prediction are based on deep neural networks. As an example, S-RNNs have been adapted to model the dependencies between limbs as nodes of a graphical model . However, RNN-based models have been outperformed by a window-based representation learning method  and suffer, among other issues, from an initial prediction error and propagated errors . These problems can be overcome when the network has to predict residuals, or velocity, in an unsupervised fashion (residual unsupervised, RU) .
Human motion modeling with generative models has previously been approached with Restricted Boltzmann Machines , Gaussian Processes  and Variational Autoencoders . In , a recurrent variational autoencoder is developed to synthesize human motion with a control signal. Our model differs in several aspects from this approach as we explicitly learn a generative model over both observations and labels and make use of time-dependent priors.
In this section, we describe both experimental design and results. First, we detail the datasets (Section 5.1). We then investigate the ability of our model to detect and anticipate human activity (Section 5.2), to detect and predict actions (Section 5.3) and to classify actions (Section 5.4). The final experiments center around the prediction and synthesis of continuous human motion (Section 5.5).
|Joint Feat. ||80.3|
|Joint Feat. ||86.9|
|Co-occ. RNN ||90.4|
CAD-120 The CAD-120 dataset  consists of 4 subjects performing 10 high-level tasks, such as cleaning a microwave or having a meal, in 3 trials each. These activities are further annotated with 10 sub-activities, such as moving and eating, and 12 object affordances, such as movable and openable. In this work we focus on detecting and anticipating the activities, sub-activities and affordances. Our results rely on four-fold cross-validation with the same folds as used in . For comparison, we train S-RNN models, for which code is provided online, on these four folds and under the same conditions as described in . We use the features extracted in  and pre-process these as in . The object models share all parameters, i.e. we effectively learn one human model and one object model in both the single- and the multi-entity case.
UTKinect-Action3D Dataset The UTKinect-Action3D Dataset (UTK)  consists of 10 subjects each recorded twice performing 10 actions in a row. The sequences are recorded with a Kinect device (30 fps) and the extracted skeletons consist of 20 joints. Due to high inter-subject, intra-class and view-point variations, this dataset is challenging. While most previous work has used the segmented action sequences for action classification, we aim at action detection and prediction, i.e. the model has to detect action onset and classify the actions correctly. This is demanding as the longest recording contains 1388 frames that need to be categorized. The actions in each recording do not immediately follow each other but are interrupted by long periods of unlabeled frames. As our model is semi-supervised, these unobserved data labels can be incorporated naturally and do not require the introduction of e.g. an additional label class. We train our model on five subjects and test on the remaining five subjects.
|Observed||25 %||50 %||75 %||100 %|
|SVRNN with H||61.0||78.0||84.0||89.0|
|SVRNN without H||29.0||48.0||67.0||74.0|
SBU Kinect Interaction Dataset The SBU dataset  contains around 300 recordings of seven actors (21 pairs of two actors) performing eight different interactive activities such as hugging, pushing and shaking hands. The data was collected with a Kinect device at 15 fps. While the dataset contains color and depth images, we make use of the 3D coordinates of 15 joints of each subject. As these measurements are very noisy, we smooth the joint locations over time . We follow the five-fold cross-validation suggested by , which splits the dataset into five folds of four to five actors. On the basis of the SBU dataset, we investigate sequence classification as well as prediction and generation of interactive human motion over a range of around 660 ms (10 frames). In order to model two distinct entities, we assign the two actors in each recording the label active or passive. For example, during the action kicking the active subject kicks while the passive subject avoids the kick. In a more equal interaction such as shaking hands, the active actor is the one who initiates the action. We list these labels for all recorded sequences in the Supplementary material.
5.2 Action detection and anticipation
In this section, we focus on the capabilities of our models to detect and anticipate semantic labels. We present experimental results on the inference of activity as well as sub-activity and affordance labels based on the CAD-120 dataset.
CAD-120 Following related work [12, 18], we investigate the detection and anticipation performance of our model for sub-activities (SAct), object affordances (Aff) and high-level activities (Act). Detection entails classification of the current observation, and anticipation measures the predictive performance for the next observation.
In Table 2 we present the results for the baseline models ATCRF  and S-RNN (as reported in  and reproduced on the same folds (SF) as we use here). We compare these to the performance of the vanilla SVRNN, the multi-entity ME-SVRNN and a multi-entity hierarchical ME-HSVRNN. We see that especially the anticipation of sub-activities gains in performance when incorporating information from the object entities (ME-SVRNN). Further improvements are achieved when the hierarchical label structure is included (ME-HSVRNN).
5.3 Action detection and prediction
In this section, we focus on the capabilities of our models to detect and predict semantic labels. We test the performance of our model on the UTK dataset.
UTKinect-Action3D Dataset - detection As far as we are aware, only one comparable work, based on class templates , has attempted to detect actions on the UTK. This work only reports results on jointly detecting which actions are performed and which body parts are used (F1 score=69.0). We assume an action to be detected if the majority of observations within its ground truth time interval are inferred to belong to that action. We compare the F1 score averaged over all classes after having observed 100 % of each action to this baseline in Table 5. Further, we visualize the detected and ground truth action sequence of one test sample in Figure 4. In this test sequence, one action is partially confused with another, which might be caused by the lack of meta-data, such as the fact that the subject is holding an object.
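The majority criterion for detection and the F1 score can be sketched as follows. The function names, the half-open interval convention and the way true and false positives are counted are assumptions for illustration; the evaluation code of the original work is not described in this section.

```python
import numpy as np


def is_detected(predicted, start, end, action):
    """True if more than half of the frames in [start, end) carry the label `action`."""
    window = np.asarray(predicted[start:end])
    return bool(np.sum(window == action) > 0.5 * len(window))


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * true_positives + false_positives + false_negatives
    return 2 * true_positives / denom if denom else 0.0


predicted = [0, 0, 1, 1, 1, 2, 2, 1]
print(is_detected(predicted, 2, 6, action=1))  # 3 of 4 frames match
print(is_detected(predicted, 4, 8, action=2))  # only 2 of 4 frames match
print(f1_score(8, 2, 2))
```

Under this criterion a short delay at the onset of an action does not prevent a detection, as long as most of the interval is labeled correctly.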
UTKinect-Action3D Dataset - prediction More generally, we can see that the model is able to detect actions with only a short or no delay. This is also apparent when we measure the F1 score for partially observed sequences, namely when the model has observed 25 %, 50 %, 75 % or 100 % of the current action with history of the previous actions (with H) or without history (without H). On average, this corresponds to having observed 8, 16, 25 or 33 frames of the ongoing action. As listed in Table 5 (with H), the F1 score increases continuously the more of the action has been observed. At 75 % the SVRNN outperforms the results reported in , which are based on 100 % of the action interval. Without history, the performance is lower as our model has not been trained to predict actions without history.
5.4 Action classification
Action classification aims at determining the action class at the end of an observed motion sequence. We apply this method to classify interactive actions of two actors (SBU).
SBU Kinect Interaction Dataset To classify a sequence, we average over the last 3 time steps of the sequence. The classification accuracy of our model is compared to state-of-the-art models in Table 4. Our model achieves comparable performance to most of the related work. It needs to be kept in mind that the other models are task-specific and are not able to e.g. predict labels or human motion. Thus, the computational resources of the other models are directed towards classification.
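Averaging over the final time steps can be sketched as below. The code assumes that the quantity being averaged is the vector of categorical label weights at each time step, which is an interpretation of the description above; the function name is hypothetical.

```python
import numpy as np


def classify_from_last_steps(label_weights, num_steps=3):
    """Average the label weights of the final time steps and pick the largest."""
    weights = np.asarray(label_weights, dtype=float)
    mean_weights = weights[-num_steps:].mean(axis=0)
    return int(np.argmax(mean_weights))


weights = [[0.90, 0.10],
           [0.40, 0.60],
           [0.45, 0.55],
           [0.60, 0.40]]
print(classify_from_last_steps(weights))               # averaged decision
print(classify_from_last_steps(weights, num_steps=1))  # final step only
```

In this toy example the averaged decision differs from the decision at the final step alone, which illustrates how averaging can reduce the influence of a single noisy frame.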
5.5 Motion prediction and synthesis
Finally, we turn to feature prediction and synthesis. We report results on the SBU dataset, which means that we predict and generate two interacting subjects. Additional results on the Human3.6M dataset  (H36M) are given in the Supplementary material. The H36M dataset consists of motion capture data and is often used for human motion prediction experiments.
SBU Kinect Interaction Dataset We compare the predictive performance to a state-of-the-art human motion prediction model (RU) . This model learns the residuals (velocity) in an unsupervised fashion and is provided with a one-hot vector indicating the class. To be comparable, we also model the residuals. Thus, the main differences between the RU and the ME-SVRNN are that we a) formulate a probabilistic, latent variable model, b) combine information of both subjects and c) model an explicit belief over the class distribution. To compare, we let both models predict ten frames given the first six frames of the actions , , , and . The error is computed as the accumulated squared distance between the ground truth and the prediction of both subjects up to frame . We present the results for 260, 400, 530 and 660 ms in Table 4. While the RU outperforms our model for and some measurements at +260 ms, the ME-SVRNN does perform better during long-term predictions.
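The error measure and the reported time horizons can be made concrete with a short sketch. The array layout (frames, subjects, joints, coordinates) and the absence of any normalization are assumptions; the frame rate of 15 fps is taken from the dataset description, under which the reported horizons appear to correspond approximately to frames 4, 6, 8 and 10.

```python
import numpy as np

FPS = 15  # SBU frame rate stated in the dataset description


def accumulated_squared_error(ground_truth, prediction):
    """Cumulative sum over frames of the squared distance over all coordinates."""
    diff = np.asarray(ground_truth, dtype=float) - np.asarray(prediction, dtype=float)
    per_frame = np.sum(diff.reshape(diff.shape[0], -1) ** 2, axis=1)
    return np.cumsum(per_frame)


def frame_to_ms(frame, fps=FPS):
    """Time elapsed after `frame` predicted frames, in milliseconds."""
    return 1000.0 * frame / fps


# Toy data: 10 frames, 2 subjects, 15 joints, 3 coordinates.
ground_truth = np.zeros((10, 2, 15, 3))
prediction = np.ones((10, 2, 15, 3))
errors = accumulated_squared_error(ground_truth, prediction)
for frame in (4, 6, 8, 10):
    print(round(frame_to_ms(frame)), errors[frame - 1])
```

Because the error is accumulated, it cannot decrease with the horizon, so differences between models at long horizons also reflect errors made early in the prediction.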
In addition to prediction, our generative model allows us to sample possible future trajectories. In the case of multiple entities, we can either generate all entity sequences or provide the observation sequence of one entity while generating the other. In Figure 5 we present samples of the action . The upper row shows the ground truth. The middle row was produced by providing the model with the sequence of the subject while generating the sequence of the subject. In the lower row, the sequences of both subjects are generated.
Human activity modeling poses a number of challenging spatio-temporal problems. In this work we proposed a semi-supervised generative model which learns to represent semantic labels and continuous feature observations over time. As a result, the model is able to classify, predict, detect and anticipate discrete labels and to predict and generate feature sequences. When extended to model multiple entities and hierarchical label structures, our approach is able to tackle complex human activity sequences. While most previous work has been centered around task-specific modeling, we suggest that joint modeling of continuous observations and semantic information, whenever available, forces the model to learn a more holistic representation which can be used to solve many different tasks. In future work, we plan to extend our model to more challenging semantic information such as raw text and to incorporate multiple modalities.
-  M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging lstms to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV), volume 1, 2017.
-  S. Berretti, M. Daoudi, P. Turaga, and A. Basu. Representation, analysis, and recognition of 3d humans: A survey. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(1s):16, 2018.
-  J. Bütepage, M. Black, D. Kragic, and H. Kjellström. Deep representation learning for human motion prediction and classification. 2017.
-  J. Bütepage, H. Kjellström, and D. Kragic. Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration. In ICRA, 2018.
-  J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS, pages 2980–2988, 2015.
-  Y. Du, Y. Fu, and L. Wang. Skeleton based action recognition with convolutional neural network. In Pattern Recognition (ACPR), 2015 3rd IAPR Asian Conference on, pages 579–583. IEEE, 2015.
-  Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1110–1118, 2015.
-  K. Gupta and A. Bhavsar. Scale invariant human action detection from depth cameras using class templates. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 38–45, 2016.
-  I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura. A recurrent variational autoencoder for human motion synthesis. In British Machine Vision Conference (BMVC), 2017.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
-  A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
-  Y. Ji, G. Ye, and H. Cheng. Interactive body part contrast mining for human interaction recognition. In Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on, pages 1–6. IEEE, 2014.
-  T. S. Kim and A. Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1623–1631. IEEE, 2017.
-  D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:, 2013.
-  H. S. Koppula, R. Gupta, and A. Saxena. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research, 32(8):951–970, 2013.
-  H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE transactions on pattern analysis and machine intelligence, 38(1):14–29, 2016.
-  Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu. Online human action detection using joint classification-regression recurrent neural networks. In European Conference on Computer Vision, pages 203–220. Springer, 2016.
-  I. Lillo, J. Carlos Niebles, and A. Soto. A hierarchical pose-based approach to complex action understanding using dictionaries of actionlets and motion poselets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1981–1990, 2016.
-  J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, and A. C. Kot. Ssnet: Scale selection network for online 3d action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8349–8358, 2018.
-  J. Liu, A. Shahroudy, D. Xu, A. K. Chichung, and G. Wang. Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot. Skeleton-based human action recognition with global context-aware attention lstm networks. IEEE Transactions on Image Processing, 27(4):1586–1599, 2018.
-  C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
-  J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4674–4683. IEEE, 2017.
-  J. Naradowsky, R. Cotterell, S. J. Mielke, and L. Wolf-Sonkin. A structured variational autoencoder for contextual morphological inflection. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, pages 2631–2640, 2018.
-  D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
-  S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, volume 1, pages 4263–4270, 2017.
-  G. W. Taylor, G. E. Hinton, and S. T. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.
-  J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE transactions on pattern analysis and machine intelligence, 30(2):283–298, 2008.
-  P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera. Rgb-d-based human motion recognition with deep learning: A survey. Computer Vision and Image Understanding, 2018.
-  L. Xia, C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3d joints. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 20–27. IEEE, 2012.
-  W. Xu, H. Sun, C. Deng, and Y. Tan. Variational autoencoder for semi-supervised text classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3358–3364, 2017.
-  J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden markov model. In Computer Vision and Pattern Recognition, 1992. Proceedings CVPR’92., 1992 IEEE Computer Society Conference on, pages 379–385. IEEE, 1992.
-  L. Yingzhen and S. Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning, pages 5656–5665, 2018.
-  K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 28–35. IEEE, 2012.
-  C. Zhang, J. Bütepage, H. Kjellström, and S. Mandt. Advances in variational inference. arXiv preprint arXiv:1711.05597, 2017.
-  P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2136–2145, Oct 2017.
-  C. Zhou and G. Neubig. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 310–320. Association for Computational Linguistics, 2017.
-  W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, et al. Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In AAAI, volume 2, page 6, 2016.
7 Supplementary material
This is the supplementary material of the paper Classify, predict, detect, anticipate and synthesize: Hierarchical recurrent latent variable models for human activity modeling. Here we describe details of the derivation of the semi-supervised variational recurrent neural network (SVRNN) in Section 7.1 and its hierarchical version in Section 7.2. Furthermore, we describe the network architectures and data processing steps for all experiments in Sections 7.3 and 7.4 respectively. We present additional results on human motion prediction in Section 7.5. Finally, we list the labels of active and passive subjects for the Stony Brook University Kinect Interaction Dataset (SBU) in Section 7.6.
The derivation of the lower bound follows the description in  with a number of exceptions. First of all, we assume that the approximate distribution factorizes and the prior over the latent variable does depend on the label and the history .
Secondly, we use two different priors on the discrete random variable depending on whether the data point has been observed or is unobserved . We apply a uniform prior for and a history-dependent prior for , which follows the Gumbel-Softmax distribution.
Finally, as the prior on the discrete variables is history-dependent, we want to encourage it to encode the information provided in the labeled data points. Therefore, we add not only an additional term for the approximate distribution but also for the prior distribution .
7.2 Hierarchical SVRNN
where denotes the conditional (background) variables of variable . When both labels are present the lower bound and additional term take the following form
When only the label has been observed , the lower bound and additional term take the following form
When only the label has been observed , the lower bound and additional term take the following form
When no label has been observed , the lower bound takes the following form
7.3 Network architecture
In this section, we begin by describing the overall structure and follow up with details on the specific number of units for each experiment.
We represent the unobserved labels as a stochastic vector and the observed labels as a one-hot vector. The distributions over labels are given by fully connected neural networks with a Gumbel-Softmax output layer. The input is given by a concatenation for the approximate label distribution and by for the prior label distribution. In the case of a hierarchical structure, we also concatenate the parent label, e.g. for .
The distributions over latent variables are given by fully connected neural networks that output the parameters of a Gaussian . The input is given by for the approximate distribution in the case of SVRNN and in the case of HSVRNN and by for the prior distribution. When a label has not been observed, we propagate a sample from the respective Gumbel-Softmax distribution.
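The handling of observed and unobserved labels described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation: the function names `gumbel_softmax_sample` and `label_input`, the example logits and the temperature value are assumptions made for illustration. An observed label is passed on as a one-hot vector, while an unobserved label is replaced by a relaxed sample from a Gumbel-Softmax distribution.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed one-hot sample from a Gumbel-Softmax distribution."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)).
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def label_input(observed_label, logits, temperature, rng):
    """Return a one-hot vector for an observed label, else a relaxed sample."""
    if observed_label is not None:
        return [1.0 if k == observed_label else 0.0 for k in range(len(logits))]
    return gumbel_softmax_sample(logits, temperature, rng)


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]  # hypothetical unnormalised class scores
print(label_input(1, logits, temperature=0.5, rng=rng))     # observed: one-hot
print(label_input(None, logits, temperature=0.5, rng=rng))  # unobserved: sample
```

Lower temperatures push the sample towards a one-hot vector, whereas higher temperatures yield smoother vectors.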
The recurrent unit receives the as input in the case of SVRNN and in the case of HSVRNN.
Fully connected neural networks are used to reconstruct the next observation based on the input in the case of SVRNN and in the case of HSVRNN.
When multiple entities are combined, the same structure as discussed above is used. However, in this case the observations and history features are concatenated and for for the respective entities.
We use the non-linearity for all layers except for the output and latent variables layers. The recurrent layers consist of LSTM units.
CAD-120 - action detection and anticipation
We always map the input to 256 dimensions for each entity with a fully connected layer. As each entity follows the same pattern, the details below do not distinguish between them. The approximate distribution and prior over of dimension is given by . The approximate distribution and prior over of dimension is given by . The approximate distribution and prior over of dimension are given by . The size of the hidden state of the recurrent layer is . The reconstruction of the observation of dimension is given by .
UTKinect-Action3D - action detection and prediction
We always map the input to 516 dimensions for each entity with a fully connected layer. The approximate distribution and prior over of dimension is given by . The approximate distribution and prior over of dimension is given by . The approximate distribution and prior over of dimension are given by . The size of the hidden state of the recurrent layers is . We have three layers. The reconstruction of the observation of dimension is given by .
SBU - action classification and motion generation and prediction
We always map the input to 516 dimensions for each entity with a fully connected layer. As each entity follows the same pattern, the details below do not distinguish between them. The approximate distribution and prior over of dimension is given by . The approximate distribution and prior over of dimension is given by . The approximate distribution and prior over of dimension are given by . The size of the hidden state of the recurrent layers is . We have three layers. The reconstruction of the observation of dimension is given by . The same model is used for the additional experiments presented in Section 7.5.
7.4 Data preprocessing and training
We always set equal to the number of all features and the temperature parameter of the Gumbel-Softmax distribution to . For all fully connected layers except the output layers and parameters of the latent variables, we apply a dropout rate of 0.1.
CAD-120 - action detection and anticipation
We use the features extracted in  and preprocess these as in . The features contain information about the human subject and the objects, which are modeled jointly in the multi-entity condition. The hierarchical structure is given by the high-level task governing the sub-activities. The features extracted by  assign a sub-activity label and affordance labels to the human subject and the objects respectively at each time step. In each batch, we mark approximately 25% of the labels as unobserved. We apply a learning rate of and clip the gradients at an absolute value of . The results are averaged over 20 evaluations of our probabilistic models.
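Marking a fraction of the labels in a batch as unobserved can be sketched as follows. This is an illustrative sketch only: the function name `mask_labels`, the use of `None` to denote an unobserved label, and the choice to hide an exact number of labels (rather than drawing each one independently) are assumptions, not details given in the text.

```python
import random


def mask_labels(labels, unobserved_fraction, rng):
    """Return a copy of `labels` with a fixed fraction replaced by None."""
    n_hidden = round(unobserved_fraction * len(labels))
    # Choose which positions lose their label, without replacement.
    hidden = set(rng.sample(range(len(labels)), n_hidden))
    return [None if i in hidden else y for i, y in enumerate(labels)]


rng = random.Random(0)
batch_labels = [0, 1, 2, 3, 4, 5, 6, 7]  # hypothetical integer class labels
print(mask_labels(batch_labels, unobserved_fraction=0.25, rng=rng))
```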
UTKinect-Action3D - action detection and prediction
The skeletons are centered around the root joint to reduce variability between recordings. All unlabeled observations between two labeled action intervals are set to be unobserved. In each training epoch, 10% of the remaining action labels are randomly assigned to be unobserved. We apply a learning rate of and clip the gradients at an absolute value of . The results are averaged over 10 evaluations of our probabilistic model.
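Limiting gradients to a maximum absolute value is read here as element-wise clipping. The following minimal sketch makes this reading concrete; the function name `clip_by_value` and the threshold of 1.0 are placeholders chosen for illustration, since the threshold used in the experiments is not reproduced here.

```python
def clip_by_value(gradients, limit):
    """Clamp every gradient entry to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]


# Hypothetical gradient entries and threshold.
print(clip_by_value([-3.0, 0.2, 7.5], limit=1.0))  # [-1.0, 0.2, 1.0]
```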
SBU - action classification
For classification, the model is provided with the whole sequence as evidence and is trained to predict a single frame at each time step. To force the network to encode the sequence label over a long time period, we label only the last seven frames of each recording with the respective interaction label and assume the remaining labels to be unobserved. We apply a learning rate of and clip the gradients at an absolute value of .
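The labeling scheme for classification can be written down in a few lines. The sketch below is illustrative: the function name `keep_last_labels` and the use of `None` for unobserved labels are assumptions.

```python
def keep_last_labels(sequence_label, num_frames, num_labeled=7):
    """Label only the final `num_labeled` frames; all earlier frames get None."""
    n_obs = min(num_labeled, num_frames)
    return [None] * (num_frames - n_obs) + [sequence_label] * n_obs


# A hypothetical recording of 10 frames belonging to interaction class 3.
print(keep_last_labels(3, num_frames=10))
```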
SBU - motion generation and synthesis
For sequence prediction, we provide the model with six frames of observations from which it needs to predict the ten following frames. In this case, we label only the last three frames of each data point in the mini-batch. The remaining labels are assumed to be unobserved. The results are averaged over 20 evaluations of our probabilistic model. We apply a learning rate of and clip the gradients at an absolute value of .
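The split into six observed and ten predicted frames can be sketched as a sliding window over a recording. The function name `make_windows` and the stride of one frame are assumptions made for illustration; how the windows were actually extracted is not specified in the text.

```python
def make_windows(frames, num_observed=6, num_predicted=10):
    """Split a frame sequence into (observed, target) pairs with stride 1."""
    span = num_observed + num_predicted
    windows = []
    for start in range(len(frames) - span + 1):
        observed = frames[start:start + num_observed]
        target = frames[start + num_observed:start + span]
        windows.append((observed, target))
    return windows


frames = list(range(20))  # stand-in for 20 skeleton frames
for observed, target in make_windows(frames):
    print(observed, "->", target)
```

Recordings shorter than 16 frames yield no windows under this scheme.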
7.5 Human motion prediction - H36M
In this section we present additional results for human motion prediction on the Human3.6M dataset  (H36M). This motion capture dataset consists of seven actors performing 15 actions in two trials each. We follow the data processing and error computation as described in . Both the residual unsupervised model (RU)  and our SVRNN model are trained to predict 10 frames given the last 6 frames. The network architecture and modeling approach of the SVRNN follow the model used in the SBU - motion generation and synthesis experiments. The resulting errors for different actions at prediction horizons of 80, 160, 320 and 400 ms are listed in Table 6. The SVRNN outperforms the RU model especially in long-term (>80 ms) predictions.
7.6 SBU active and passive labeling
Below we list the labeling of active and passive subjects in the SBU dataset . Each recording sequence is labeled by action class, subject id and activity level (active or passive) as follows , e.g. . We use for active and for passive. During actions with distinct roles, such as , the assignment of and is straightforward. Actions such as offer a less clear labeling. We aimed to label the action-initiating actor as in these cases.
s01s02;01;001;1 s01s02;01;002;0 s01s02;02;001;1 s01s02;02;002;0 s01s02;03;001;1 s01s02;03;002;0 s01s02;04;001;1 s01s02;04;002;0 s01s02;05;001;1 s01s02;06;001;1 s01s02;07;001;1 s01s02;07;002;0 s01s02;07;003;0 s01s02;08;001;1 s01s02;08;002;0 s03s04;01;001;0 s03s04;01;002;1 s03s04;02;001;0 s03s04;02;002;1 s03s04;03;001;0 s03s04;03;002;1 s03s04;04;001;0 s03s04;04;002;1 s03s04;05;001;0 s03s04;06;001;0 s03s04;07;001;0 s03s04;07;002;1 s03s04;08;001;0 s03s04;08;002;1 s05s02;01;001;1 s05s02;01;002;0 s05s02;02;001;1 s05s02;02;002;0 s05s02;03;001;1 s05s02;03;002;0 s05s02;04;001;1 s05s02;04;002;0 s05s02;04;003;1 s05s02;06;001;1 s05s02;08;001;1 s05s02;08;002;0 s06s04;01;001;0 s06s04;01;002;1 s06s04;02;001;0 s06s04;02;002;1 s06s04;03;001;0 s06s04;03;002;1 s06s04;04;001;0 s06s04;04;002;1 s06s04;05;001;0 s06s04;06;001;0 s06s04;07;001;0 s06s04;07;002;1 s06s04;08;001;0 s06s04;08;002;1
s02s03;01;001;0 s02s03;01;002;1 s02s03;02;001;0 s02s03;02;002;1 s02s03;03;001;0 s02s03;03;002;1 s02s03;04;001;0 s02s03;04;002;1 s02s03;06;001;0 s02s03;07;001;0 s02s03;07;002;1 s02s03;08;001;1 s02s07;01;001;0 s02s07;01;002;1 s02s07;02;001;0 s02s07;02;002;1 s02s07;03;001;0 s02s07;03;002;0 s02s07;03;003;1 s02s07;04;001;0 s02s07;04;002;1 s02s07;05;001;0 s02s07;06;001;0 s02s07;07;001;0 s02s07;07;002;1 s02s07;08;001;0 s02s07;08;002;1 s03s05;01;001;1 s03s05;02;001;1 s03s05;02;002;0 s03s05;03;001;1 s03s05;03;002;0 s03s05;04;001;1 s03s05;04;002;0 s03s05;05;001;0 s03s05;06;001;1 s03s05;07;001;0 s03s05;07;002;1 s03s05;08;001;0 s03s05;08;002;1 s05s03;01;001;0 s05s03;01;002;1 s05s03;02;001;0 s05s03;02;002;1 s05s03;03;001;0 s05s03;04;001;0 s05s03;04;002;1 s05s03;05;001;0 s05s03;06;001;0 s05s03;07;001;0 s05s03;08;001;0 s05s03;08;002;1
s01s03;01;001;0 s01s03;01;002;1 s01s03;02;001;0 s01s03;02;002;1 s01s03;03;001;0 s01s03;03;002;1 s01s03;04;001;0 s01s03;04;002;1 s01s03;05;001;0 s01s03;06;001;0 s01s03;07;001;0 s01s03;08;001;0 s01s03;08;002;1 s01s03;08;003;1 s01s07;01;001;1 s01s07;01;002;0 s01s07;02;001;1 s01s07;02;002;0 s01s07;03;001;1 s01s07;03;002;0 s01s07;04;001;1 s01s07;04;002;0 s01s07;05;001;1 s01s07;06;001;1 s01s07;07;001;1 s01s07;07;002;0 s01s07;08;001;1 s01s07;08;002;0 s07s01;01;001;0 s07s01;01;002;1 s07s01;02;001;0 s07s01;02;002;1 s07s01;03;001;0 s07s01;03;002;1 s07s01;04;001;0 s07s01;04;002;1 s07s01;05;001;0 s07s01;06;001;1 s07s01;07;001;0 s07s01;07;002;1 s07s01;08;001;0 s07s01;08;002;1 s07s03;01;001;1 s07s03;01;002;0 s07s03;02;001;1 s07s03;02;002;0 s07s03;03;001;1 s07s03;03;002;0 s07s03;04;001;1 s07s03;04;002;0 s07s03;05;001;1 s07s03;06;001;1 s07s03;07;001;1 s07s03;07;002;0 s07s03;08;001;1 s07s03;08;002;0
s02s01;01;001;0 s02s01;01;002;1 s02s01;01;003;1 s02s01;02;001;0 s02s01;02;002;1 s02s01;02;003;1 s02s01;03;001;0 s02s01;03;002;1 s02s01;03;003;0 s02s01;04;001;0 s02s01;05;001;0 s02s01;06;001;0 s02s01;07;001;0 s02s01;07;002;1 s02s01;08;001;0 s02s06;01;001;0 s02s06;01;002;1 s02s06;02;001;0 s02s06;02;002;1 s02s06;03;001;0 s02s06;03;002;1 s02s06;04;001;0 s02s06;04;002;1 s02s06;05;001;0 s02s06;06;001;0 s02s06;07;001;0 s02s06;07;002;1 s02s06;08;001;0 s02s06;08;002;1 s03s02;01;001;0 s03s02;01;002;1 s03s02;02;001;0 s03s02;02;002;1 s03s02;03;001;0 s03s02;03;002;1 s03s02;04;001;1 s03s02;05;001;0 s03s02;06;001;1 s03s02;07;001;0 s03s02;07;002;1 s03s02;08;001;0 s03s02;08;002;1 s03s06;01;001;1 s03s06;01;002;0 s03s06;02;001;1 s03s06;02;002;0 s03s06;03;001;0 s03s06;04;001;1 s03s06;04;002;0 s03s06;06;001;0 s03s06;07;001;1 s03s06;07;002;0 s03s06;08;001;1 s03s06;08;002;0
s04s03;01;001;0 s04s03;01;002;1 s04s03;02;001;0 s04s03;02;002;1 s04s03;03;001;0 s04s03;03;002;1 s04s03;04;001;0 s04s03;04;002;1 s04s03;05;001;1 s04s03;06;001;0 s04s03;07;001;0 s04s03;07;002;1 s04s03;08;001;0 s04s03;08;002;1 s04s06;01;001;0 s04s06;01;002;1 s04s06;02;001;0 s04s06;02;002;1 s04s06;03;001;0 s04s06;03;002;1 s04s06;04;001;0 s04s06;04;002;1 s04s06;05;001;0 s04s06;06;001;1 s04s06;07;001;0 s04s06;07;002;1 s04s06;08;001;0 s04s06;08;002;1 s06s02;01;001;1 s06s02;01;002;0 s06s02;02;001;1 s06s02;02;002;0 s06s02;03;001;0 s06s02;04;001;1 s06s02;04;002;0 s06s02;05;001;1 s06s02;06;001;1 s06s02;07;001;0 s06s02;07;002;1 s06s02;08;001;0 s06s03;01;001;1 s06s03;01;002;0 s06s03;02;001;1 s06s03;02;002;0 s06s03;03;001;1 s06s03;03;002;0 s06s03;04;001;1 s06s03;04;002;0 s06s03;05;001;1 s06s03;06;001;0 s06s03;07;001;1 s06s03;07;002;1
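The entries listed above can be read programmatically with a short parser. The sketch below is illustrative: the field names `subjects`, `action`, `recording` and `flag` are our assumed interpretation of the four semicolon-separated fields and should be checked against the dataset documentation before use.

```python
from collections import namedtuple

# Field names are assumptions about the meaning of the four fields.
Entry = namedtuple("Entry", ["subjects", "action", "recording", "flag"])


def parse_entries(text):
    """Parse whitespace-separated 'sXXsYY;AA;RRR;F' tokens into Entry tuples."""
    entries = []
    for token in text.split():
        subjects, action, recording, flag = token.split(";")
        entries.append(Entry(subjects, int(action), int(recording), int(flag)))
    return entries


sample = "s01s02;01;001;1 s01s02;01;002;0 s01s02;05;001;1"
for entry in parse_entries(sample):
    print(entry)
```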