Human Action Generation with Generative Adversarial Networks


Mohammad Ahangar Kiasari, Dennis Singh Moirangthem, Minho Lee
School of Electronics Engineering
Kyungpook National University
80 Daehakro, Bukgu, Daegu - 41566, South Korea
{ahangar100, mdennissingh, mholee}

Inspired by recent advances in generative models, we introduce a human action generation model that generates consecutive sequences of human motions to formulate novel actions. We propose a framework combining an autoencoder and a generative adversarial network (GAN) to produce multiple, consecutive human actions conditioned on an initial state and a given class label. The proposed model is trained in an end-to-end fashion, where the autoencoder is jointly trained with the GAN. The model is trained on the NTU RGB+D dataset and we show that it can generate different styles of actions. Moreover, the model can successfully generate a sequence of novel actions given different action labels as conditions. Conventional human action prediction and generation models lack these features, which are essential for practical applications.



Preprint. Work in progress.

1 Introduction

Generating novel sequences of human motion to form an action has been a challenging problem. Human action classification and prediction have been studied in detail, but few works have explored the generation of novel actions. With the advent of powerful generative models such as generative adversarial networks (GANs) (Goodfellow et al., 2014), novel data generation has become a possibility. However, in human action generation, simply generating a random sequence of motions may not meet practical requirements. We need proper control over the action label of the generated actions as well as the style of the motion sequences within a given action class.

Recent studies have explored the possibility of human action prediction given a number of prior motion sequences (Bütepage et al., 2017; Martinez et al., 2017; Barsoum et al., 2017). Bütepage et al. (2017) proposed a sparse autoencoder model to predict human motions in an unsupervised manner. Martinez et al. (2017) proposed a prediction model based on the sequence-to-sequence (Seq2seq) architecture to predict short-term human actions. On the other hand, Barsoum et al. (2017) introduced a Seq2seq model trained with an adversarial cost to predict future skeleton frames given a sequence of prior frames.

The conventional human motion prediction models described above have several limitations, as they are designed only to predict future frames given a series of inputs. These models have limited ability to generate novel actions. Moreover, they offer no control, such as the class of action or the initial position, over the predicted motions, since generation is conditioned on the input frames. The recently proposed prediction model (Barsoum et al., 2017) that utilizes a GAN to enhance generation capability cannot generate a desired class of action. Furthermore, its generated future motion sequences are highly dependent on the given prior poses, and the effect of the random vector z on the generated action style is minimal. To address these issues, we need a generative model that can produce desired action classes with different styles.

The aim of this paper is to explore a new algorithm to generate multiple classes of human actions as well as consecutive sequences of human actions. Indeed, most movement information is captured by the human skeleton sequence, so generating the corresponding skeleton sequence plays an important role in the human motion generation task. This fact inspired us to introduce a model that can generate a novel sequence of human skeleton poses with the help of a generative model.

In this work, we introduce a new framework to generate novel human actions with the help of a GAN. We propose a model that combines an autoencoder and a conditional GAN (Mirza and Osindero, 2014) to generate a sequence of human actions. The autoencoder and the conditional GAN are trained simultaneously in an end-to-end manner. The model generates human actions based on a given action class and an initial start position, and the random vector z can be changed to generate different styles of actions. The proposed model can also generate a sequence of human actions without continuity issues. We train our model on the NTU RGB+D dataset (Shahroudy et al., 2016), which was collected using Microsoft Kinect, and show that the model can successfully generate a variety of combinations of multiple actions.

The major contributions of this paper are as follows:

  • We introduce a semi-supervised model based on GAN to generate novel human actions.

  • The proposed model takes into account the class of the action to be generated as well as the initial state.

  • Our model has the capability to generate multiple styles of a single action class by changing the random vector z.

  • It can also generate a consecutive sequence of different actions with the help of the initial position and different class labels during generation. This allows the model to generate a series of action sequences with seamless transition.

2 Related works

Recurrent neural networks (RNNs) such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRU) (Bahdanau et al., 2014) have been at the forefront of human motion prediction. However, such deep neural network based models are primarily deterministic. There have been attempts to modify RNN encoder-decoder frameworks to combine deterministic and probabilistic human motion prediction (Fragkiadaki et al., 2015). More recently, Jain et al. (2016) and Martinez et al. (2017) introduced several structures and frameworks with deep RNNs to produce state-of-the-art results in human motion prediction. On the other hand, Bütepage et al. (2017) proposed a sparse autoencoder model to predict human motions in an unsupervised manner without using RNNs. Even though these models perform well in human motion prediction, generating novel human actions remains a challenge for them.

Generative models have received a significant boost in performance and applicability with the advent of generative adversarial networks (GANs) (Goodfellow et al., 2014; Mirza and Osindero, 2014; Arjovsky et al., 2017; Li et al., 2017; Liu et al., 2017). Recent studies have incorporated GANs into RNN sequence-to-sequence (Seq2seq) (Sutskever et al., 2014) architectures to improve motion prediction. Barsoum et al. (2017) developed a model called HP-GAN to model the probability of future sequences conditioned on a given incomplete input sequence. To predict, the model employs a Seq2seq framework, mapping a random vector onto the encoder output of the Seq2seq model. However, the model cannot generate new actions based on different class labels. Moreover, generating multiple sequences without prior human poses is not considered in that study.

One recent work (Cai et al., 2017) introduced human motion prediction using a GAN, aiming to close the gap between prediction and generation. The model contains a two-step generation pipeline. The human pose generation part consists of two generative models, trained separately, that predict a sequence of human skeletons. The generative models take a random input along with two conditions, which specify the initial pose of the generated sequence and the class label, respectively. However, this part of the model is not end-to-end. Moreover, the paper lacks sufficient exploration of the effect of these two conditions on the generation process.

Despite the use of RNNs and GANs, current models lack the complete features of a true generative model for human actions. We develop an end-to-end model that can generate different human actions with different styles. Our generative model is based only on a GAN: we can generate particular classes of human motions conditioned on a given class label and initial pose, and generate different styles within each class by varying the random vector z. The amount of control and flexibility introduced in our model makes it more usable in real applications.

3 Methodology

Figure 1: The proposed model consists of an autoencoder and a conditional GAN that can take multiple conditions to generate multiple classes of human actions with different styles.

In this section, we describe the proposed model, which consists of an autoencoder and a GAN as shown in Fig. 1. All the components of the model are trained simultaneously end-to-end. The trained model is then configured to the generation phase in order to produce the human motions.

3.1 Generative Adversarial Networks (GAN)

Generative adversarial networks (GANs) (Goodfellow et al., 2014) pose a push-pull game between a generator and a discriminator. The discriminator is trained to determine whether data comes from the generator (fake data) or from the real training data. The generator's objective is to produce realistic data that the discriminator cannot distinguish from the true training data. In the standard GAN, we train the discriminator to maximize the probability of assigning the correct labels to both generated samples and training examples, while simultaneously training the generator to minimize the probability that the discriminator identifies its samples as fake. The objective function is:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (1)
where z denotes the random input with a uniform or normal distribution p_z(z), and G denotes the generator. D indicates the discriminator, and x is the training data with empirical distribution p_data(x).
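As a concrete illustration (not the authors' code), the value function in Eq. (1) can be estimated from mini-batches. The toy generator and discriminator below are hypothetical stand-ins for trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x):
    # Toy discriminator: squashes a per-sample score into (0, 1).
    return 1.0 / (1.0 + np.exp(-x.sum(axis=1)))

def generator(z):
    # Toy generator: a fixed linear map from noise to data space.
    return 0.5 * z

# Monte Carlo estimate of V(D, G) over one mini-batch.
x_real = rng.normal(size=(64, 8))   # samples from p_data
z = rng.normal(size=(64, 8))        # samples from p_z
v = np.mean(np.log(discriminator(x_real))) + \
    np.mean(np.log(1.0 - discriminator(generator(z))))
print(v)
```

In training, D takes gradient steps to increase this estimate while G takes steps to decrease its second term.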

3.1.1 Conditional Generative Adversarial Networks

In the conventional GAN, there is no control over the mode of the generated data. Therefore, Mirza and Osindero (2014) proposed the conditional GAN to address this problem. In the conditional GAN, both the random input z to the generator and the input of the discriminator are concatenated with a label vector, which encodes the mode of the data, such as the class label. Recently, several papers have successfully applied conditional GANs for different purposes (Chen et al., 2016; Odena et al., 2016; Makhzani et al., 2015). In this paper, we also utilize the conditional GAN to control the class label and initial pose of the generated action sequences.
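The conditioning mechanism is simply vector concatenation; a minimal sketch (with illustrative dimensions, not the paper's actual sizes):

```python
import numpy as np

batch, z_dim, num_classes = 4, 16, 10
rng = np.random.default_rng(1)

z = rng.normal(size=(batch, z_dim))   # random input to the generator
labels = np.array([3, 0, 7, 7])       # desired action classes
y = np.eye(num_classes)[labels]       # one-hot label vectors

# Conditional GAN input: noise and condition are concatenated
# before being fed into the generator network.
g_input = np.concatenate([z, y], axis=1)
print(g_input.shape)  # (4, 26)
```

The discriminator input is conditioned the same way, by concatenating the (real or fake) data with the same label vector.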

3.2 Proposed model

We employ an adversarial generative model to address the human action generation problem. The goal of the model is to generate sequences of human poses from a random input z conditioned on an initial pose. The initial pose condition enables the model to generate multiple consecutive human actions without continuity problems. To this end, we develop a framework combining a conditional generative adversarial network and an autoencoder.

The input and output of the autoencoder are vector representations of human-pose frames, and the reconstruction loss function of the autoencoder is a norm of the reconstruction error. The latent variable at the bottleneck of the autoencoder, e, represents the low-dimensional space of the original vector of each frame of human pose. Therefore, with a sequence of frames as the input of the autoencoder, we obtain a sequence of low-dimensional vectors, H, in the latent space. The dimension of H is d × T, where d is the dimension of e and T is the length of each sequence. We utilize this low-dimensional representation of the original skeleton as the real data in the discriminator. Fig. 1 shows the architecture of the proposed model.
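The frame-wise encoding and the resulting d × T latent sequence can be sketched as follows; the linear encoder and all dimensions are hypothetical placeholders for the trained bottleneck:

```python
import numpy as np

rng = np.random.default_rng(2)
n_joints, d, T = 25, 32, 50   # 3D skeleton joints, latent dim d, sequence length T
W_enc = rng.normal(scale=0.1, size=(n_joints * 3, d))

def encode(frame):
    # Hypothetical per-frame encoder standing in for the trained bottleneck.
    return np.tanh(frame @ W_enc)

# One skeleton sequence: T frames, each a flattened 25-joint 3D pose.
sequence = rng.normal(size=(T, n_joints * 3))
H = np.stack([encode(f) for f in sequence])   # latent sequence, shape (T, d)
print(H.shape)
```

It is this H, not the raw skeleton, that plays the role of real data for the discriminator.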

The generative adversarial model includes a generator and a discriminator. The input of the generator is a concatenation of three vectors. The first vector, z, is drawn from a random normal distribution. The second vector, s_0, represents the initial state of the human pose. The third vector, y, is the action class label, encoded as a one-hot vector over the actions defined in the dataset. We concatenate the generator output, G(z, s_0, y), with the class label and the initial pose, and feed it into the discriminator as fake data. We optimize the parameters of the discriminator and generator using the min-max optimization given in Eq. (2).

\min_G \max_D V(D, G) = \mathbb{E}_{H \sim p_{data}}[\log D(H, s_0, y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z, s_0, y), s_0, y))]   (2)
where G and D are the generative and discriminative models, and p_data and p_z represent the distribution of real sequences and the random input distribution, respectively. As a result, the generator learns to imitate human action sequences with respect to the given initial posture state s_0 and action label y. The aim of employing the input z is to generate a variety of action styles with fixed s_0 and y. In addition to the generator and discriminator losses, we add another loss to keep the consistency between adjacent frames in each generated sequence, as shown in Eq. (3) (Barsoum et al., 2017).

L_{const} = \frac{1}{B} \sum_{i=1}^{B} \sum_{t=1}^{T-1} \| h_{t+1}^{(i)} - h_t^{(i)} \|^2   (3)
where h_t^{(i)} denotes the t-th frame of the i-th generated sequence, B is the batch size, and T is the length of the sequence. Since T can be large, this loss can dramatically increase the computation cost. Hence, to compute the consistency loss in each iteration, instead of the full sequence we randomly select a part of each sequence with starting point t_s and length l. Eq. (4) shows the new consistency loss.

L_{const} = \frac{1}{B} \sum_{i=1}^{B} \sum_{t=t_s}^{t_s + l - 2} \| h_{t+1}^{(i)} - h_t^{(i)} \|^2   (4)
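A minimal sketch of this windowed consistency loss (dimensions and the uniform window sampling are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
B, T, d, l = 8, 100, 32, 10   # batch size, full length, latent dim, window length

H = rng.normal(size=(B, T, d))   # a batch of generated latent sequences

def windowed_consistency_loss(H, l, rng):
    # Sample a random start point t_s, then penalize squared differences
    # between adjacent frames inside the length-l window only.
    t_s = rng.integers(0, H.shape[1] - l + 1)
    window = H[:, t_s:t_s + l]
    diffs = window[:, 1:] - window[:, :-1]
    return np.mean(np.sum(diffs ** 2, axis=-1))

loss = windowed_consistency_loss(H, l, rng)
print(loss)
```

Restricting the sum to a random window keeps the per-iteration cost independent of T while still discouraging jitter between adjacent frames.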
Eq. (5) shows the total objective function in the generative and discriminative model.

\min_G \max_D V(D, G) + \lambda L_{const}   (5)
where λ is a value between 0 and 1. We found that a very small λ results in losing consistency between adjacent frames in a sequence. Conversely, with a very large value the generator cannot learn different patterns of sequences; as a result, the generated sequences are almost constant even with different random vectors z. We set λ to 0.01 in all our experiments.
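For illustration, the generator side of the total objective can be sketched as an adversarial term plus the λ-weighted consistency term. The non-saturating log-loss form and all values below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def generator_loss(d_fake, consistency, lam=0.01):
    # Adversarial term (non-saturating form of the generator loss)
    # plus the lambda-weighted frame-consistency term.
    adv = -np.mean(np.log(d_fake + 1e-8))
    return adv + lam * consistency

# d_fake: discriminator outputs on generated (fake) sequences.
d_fake = np.array([0.3, 0.4, 0.25])
loss_val = generator_loss(d_fake, consistency=2.0)
print(loss_val)
```

With λ = 0.01 the consistency term acts as a mild regularizer rather than dominating the adversarial signal, matching the trade-off described above.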

Figure 2: The model in the generation phase utilizes the trained encoder, generator, and decoder to produce the generated actions.

In the generation phase of our model, shown in Fig. 2, we use the trained encoder to compute the initial condition s_0 from a single given frame. The initial condition s_0, a random vector z, and the label y of the desired class are fed into the trained generator to produce the output latent sequence. This output is then passed to the decoder to construct the sequence of generated human poses.
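The generation pipeline above (encode the start frame, run the generator, decode frame by frame) can be sketched as follows. All networks are stubbed as hypothetical linear maps, and every dimension is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
frame_dim, d, T, z_dim, n_cls = 75, 32, 50, 16, 10  # 25 joints x 3D, etc.

# Hypothetical trained components, stubbed as random linear maps.
W_enc = rng.normal(scale=0.1, size=(frame_dim, d))
W_dec = rng.normal(scale=0.1, size=(d, frame_dim))
W_gen = rng.normal(scale=0.1, size=(z_dim + d + n_cls, T * d))

encode = lambda f: np.tanh(f @ W_enc)   # frame -> latent initial condition
decode = lambda h: h @ W_dec            # latent frame -> skeleton frame

def generate(z, s0, y):
    # Generator: concatenated (z, s0, y) -> latent sequence of shape (T, d).
    h = np.tanh(np.concatenate([z, s0, y]) @ W_gen)
    return h.reshape(T, d)

start_frame = rng.normal(size=frame_dim)   # the single given frame
s0 = encode(start_frame)                   # initial condition from the encoder
z = rng.normal(size=z_dim)                 # style vector
y = np.eye(n_cls)[2]                       # one-hot desired action class

H_gen = generate(z, s0, y)                      # generated latent sequence
poses = np.array([decode(h) for h in H_gen])    # decoded skeleton sequence
print(poses.shape)
```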

4 Experiments and results

We evaluate our model on the benchmark NTU RGB+D dataset (Shahroudy et al., 2016). The NTU RGB+D action recognition dataset consists of action samples containing RGB videos, depth map sequences, 3D skeletal data, and infrared videos for each sample, captured concurrently by 3 Microsoft Kinect V2 cameras. We utilize the 3D skeletal data, which contains the three-dimensional locations of 25 major body joints at each frame. The dataset covers a range of action classes, including daily, mutual, and health-related actions, collected from multiple actors.

All parameters of the model are trained using the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate. The ReLU activation (Nair and Hinton, 2010) is used throughout the network, with the exception of the output layers of the autoencoder, generator, and discriminator, where different output activation functions are applied. Random inputs z are drawn from a normal distribution with zero mean and unit variance, and the model is trained with mini-batches.

Figure 3: Generated sequences mapped into 2-dimensional space with three different random vectors z and a fixed initial pose s_0. The star marker shows the given initial pose. The generated sequences are distinguished by different colors and markers.

The entire model is trained end-to-end, with the autoencoder and the GAN jointly optimized on their corresponding losses. Once the model is trained, new actions are generated using the generation phase of the model shown in Fig. 2. In a generative model like the one proposed in this work, the effect of the random vector z should be clearly demonstrated in the generated results. By using z effectively, we can generate different action styles that may not be included in the training data. Therefore, to evaluate the impact of the z space on the sequence generation task, we generate three sequences with three different random vectors z, drawn from a Gaussian distribution, and a fixed initial condition s_0. Fig. 3 illustrates the effect of applying different z to generate various motion sequences, visualized in 2-dimensional space. The star marker in the figure indicates the given initial position s_0. All three generated sequences start relatively close to the given initial pose but diverge later in the generation. Hence, these results indicate that the proposed model is able to generate sequences with different styles using multiple z.

Figure 4: Generated actions of sitting-down and throwing with an identical initial pose s_0. The results show the impact of changing the class label in the generative model under the same initial pose condition.

Further experiments were conducted to show the effect of the class label y on the generator output. Fig. 4 shows the generated sequences with the same initial pose but with different class labels. The first and second rows in Fig. 4 show the sitting-down and throwing actions, respectively. The generated sequences not only follow their corresponding actions but also start relatively near the given initial pose.

To generate two consecutive actions, after generating the first action sequence we reinitialize s_0 with the last frame of the first generated sequence and set y to the new class label. Fig. 5 shows the sitting-down action followed by the standing-up action with identical initial poses and two different random vectors z. Similarly, Fig. 6 illustrates the results of standing-up followed by sitting-down.

Figure 5: Generated sitting-down followed by standing-up sequences with two different random vectors z. The initial pose conditions are identical in both (a) and (b). Even though (a) and (b) show the same sequence of actions, the generated results demonstrate different styles arising from different z.
Figure 6: Generated standing-up followed by sitting-down sequences with two different random vectors z. The initial pose conditions are identical in both (a) and (b).
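The re-initialization procedure for consecutive actions can be sketched as follows. As before, all components are hypothetical linear stubs with illustrative dimensions, not the trained networks:

```python
import numpy as np

rng = np.random.default_rng(5)
frame_dim, d, T, z_dim, n_cls = 75, 32, 50, 16, 10

# Hypothetical trained encoder, decoder, and generator (linear stubs).
W_enc = rng.normal(scale=0.1, size=(frame_dim, d))
W_dec = rng.normal(scale=0.1, size=(d, frame_dim))
W_gen = rng.normal(scale=0.1, size=(z_dim + d + n_cls, T * d))

encode = lambda f: np.tanh(f @ W_enc)
decode = lambda h: h @ W_dec
generate = lambda z, s0, y: np.tanh(
    np.concatenate([z, s0, y]) @ W_gen).reshape(T, d)

def chain_actions(start_frame, class_ids):
    # Generate one sequence per class label, re-initializing s_0 from the
    # last decoded frame of the previous sequence for a seamless transition.
    frames, last = [], start_frame
    for c in class_ids:
        s0 = encode(last)
        z = rng.normal(size=z_dim)
        seq = np.array([decode(h) for h in generate(z, s0, np.eye(n_cls)[c])])
        frames.append(seq)
        last = seq[-1]
    return np.concatenate(frames)

# e.g. two hypothetical class ids for sit-down then stand-up
full = chain_actions(rng.normal(size=frame_dim), [4, 1])
print(full.shape)
```

Because each new sequence is conditioned on the last frame of the previous one, the concatenated result avoids the discontinuities that arise when sequences are generated independently.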

5 Discussion and future work

The results of the proposed model illustrate several improvements over existing human action prediction models (Barsoum et al., 2017; Cai et al., 2017) that use generative models such as GANs. The proposed model can generate a novel sequence of human motions without any prior sequence of input frames. It also utilizes only a single GAN for the generative part, whereas other models use multiple GANs (Cai et al., 2017) as well as RNNs (Barsoum et al., 2017; Cai et al., 2017). Another important feature of the proposed framework is the ability to generate multiple actions in a seamless, consecutive manner, a major feature that is lacking in existing models.

The proposed model’s ability to generate different styles of actions using different random vectors z is well illustrated in the results. The result shown in Fig. 3 clearly demonstrates the generative capability: the generated actions differ considerably from each other under different random vectors z, given the same initial condition and class. Additionally, we visualized the generated sequences in 2-dimensional space with different initial poses and random vectors z in Fig. 7. The results in both Fig. 7(a) and Fig. 7(b) indicate that the generated sequences start from different locations corresponding to the given initial poses. This result also shows the effect of the initial condition in the generation process, where a different s_0 results in a different start position of the generated sequence. As far as we know, this kind of analysis has not been demonstrated in prior works.

Figure 7: Generated sequences mapped into 2-dimensional space with three different random vectors z and two different initial poses. The star markers show the given initial poses. Each generated sequence is distinguished by different colors and markers.

In the future, we will try to generate a blend of multiple actions, combining two or more action classes. We also plan to generate videos with the help of skeleton-to-image transformation models (Yan et al., 2017).

6 Conclusion

We introduced a new human action generation model that can generate a sequence of novel human actions. Unlike existing human action prediction models that use RNNs, we proposed a framework of an autoencoder and a conditional GAN trained end-to-end to produce multiple, consecutive human actions. The model, trained on the NTU RGB+D dataset, showed that it can generate different styles of actions. Furthermore, the proposed model was able to generate a set of consecutive actions with seamless transitions given different action labels as conditions. We also showed the effect of the random vector z on the generated results, where different z produced different styles of actions.


  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein gan. arXiv preprint arXiv:1701.07875.
  • Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Barsoum et al. (2017) Barsoum, E., Kender, J., and Liu, Z. (2017). Hp-gan: Probabilistic 3d human motion prediction via gan. arXiv preprint arXiv:1711.09561.
  • Bütepage et al. (2017) Bütepage, J., Black, M. J., Kragic, D., and Kjellström, H. (2017). Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Cai et al. (2017) Cai, H., Bai, C., Tai, Y.-W., and Tang, C.-K. (2017). Deep video generation, prediction and completion of human action sequences. arXiv preprint arXiv:1711.08682.
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180.
  • Fragkiadaki et al. (2015) Fragkiadaki, K., Levine, S., Felsen, P., and Malik, J. (2015). Recurrent network models for human dynamics. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4346–4354. IEEE.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.
  • Jain et al. (2016) Jain, A., Zamir, A. R., Savarese, S., and Saxena, A. (2016). Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317.
  • Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. (2017). Mmd gan: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200–2210.
  • Liu et al. (2017) Liu, M.-Y., Breuel, T., and Kautz, J. (2017). Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708.
  • Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
  • Martinez et al. (2017) Martinez, J., Black, M. J., and Romero, J. (2017). On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4674–4683. IEEE.
  • Mirza and Osindero (2014) Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814.
  • Odena et al. (2016) Odena, A., Olah, C., and Shlens, J. (2016). Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585.
  • Shahroudy et al. (2016) Shahroudy, A., Liu, J., Ng, T.-T., and Wang, G. (2016). Ntu rgb+d: A large scale dataset for 3d human activity analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Yan et al. (2017) Yan, Y., Xu, J., Ni, B., Zhang, W., and Yang, X. (2017). Skeleton-aided articulated motion generation. In Proceedings of the 2017 ACM on Multimedia Conference, pages 199–207. ACM.