DIY Human Action Data Set Generation

Mehran Khodabandeh
Simon Fraser University
The work was performed during an internship at Microsoft.
   Hamid Reza Vaezi Joze
   Illya Zharkov
   Vivek Pradeep

The recent successes in applying deep learning techniques to standard computer vision problems have inspired researchers to propose new computer vision problems in different domains. As is well established in the field, training data plays a significant role in the machine learning process, especially for deep learning approaches, which are data hungry. To solve each new problem with decent performance, a large amount of data needs to be captured, which in many cases poses logistical difficulties. Therefore, the ability to generate de novo data or expand an existing data set, however small, in order to satisfy the data requirements of current networks may be invaluable. Herein, we introduce a novel way to partition an action video clip into action, subject and context. Each part is manipulated separately and reassembled with our proposed video generation technique. Furthermore, our novel human skeleton trajectory generation, along with our proposed video generation technique, enables us to generate unlimited action recognition training data. These techniques enable us to generate video action clips from a small set without costly and time-consuming data acquisition. Lastly, we show through an extensive set of experiments on two small human action recognition data sets that this new data generation technique can improve the performance of current action recognition neural nets.

Figure 1: Our algorithm takes as input an action label, a set of reference images and an arbitrary background. The output is a generated video of the person in the reference images performing the given action. We approach this problem in two stages. First (left side), a generative model trained on a small labeled data set of skeleton trajectories of human actions generates a sequence of human skeletons conditioned on the action label. Second (right side), another generative model trained on an unlabeled set of human action videos generates a sequence of photo-realistic frames conditioned on the given background, the generated skeletons, and the person's appearance given in the reference frames. This produces an arbitrary number of human action videos.

1 Introduction

After significant successes in face detection, face recognition and object detection, all now commonly used in daily life, computer vision researchers are aiming at understanding video, which adds one more dimension of difficulty. These successes rely on advanced machine learning techniques, mainly deep networks, which require both computational power and training data. Hence, the process of data acquisition may be as vital as the technique used. Large data sets, such as a million object and animal photos [23], hundreds of thousands of faces [21] or millions of scenes [26], enable complex neural networks to train successfully. Similar results, however, are hard to achieve with the small data sets that researchers capture manually themselves. Video data sets, and human action data sets specifically, are more difficult to compile. There are two common scenarios for generating a human action data set: (1) asking subjects to perform a series of actions in front of a camera, or (2) labeling existing videos from the internet. The first scenario is not scalable, considering the number of subjects and the limitations imposed by the capturing environment. These types of data sets are no longer common due to their small size. Some examples of the second scenario are UCF 101 [44], containing 101 actions across thousands of online clips, Hollywood2 [29], containing 12 actions in around 3 thousand clips extracted from movies, and Kinetics [20], including 400 actions from hundreds of thousands of YouTube videos. Although these data sets are very useful for benchmarking the accuracy of different algorithms, the clips or actions are not necessarily useful for real-world action recognition tasks such as security surveillance, sports analysis, smart home devices or health monitoring, as each scenario has different settings and sets of actions. One solution would be for researchers to collect their own data sets, which may prove costly and time consuming.

In this paper, we introduce a novel way to partition an action video clip into action, subject and context. We show that we can manipulate each part separately and assemble them into new clips with our proposed video generation model. The actions are represented by a series of skeletons, the context is a still image or a video clip, and the subject is represented by random images of the same person. We can change an action by extracting it from an arbitrary video clip, by generating it through our proposed skeleton trajectory model, or by applying a perspective transform to an existing skeleton. Additionally, we can change the subject and the context using arbitrary video clips, enabling us to generate arbitrary action clips. This is particularly useful for action recognition models, which require large data sets to increase their accuracy. With a large amount of unlabeled data and a small set of labeled data, we can synthesize a realistic set of training data for training a deep model.

We call it DIY (do it yourself) because we can eventually build our own data set from a small one. Similar to actual data collection, not only can we add a new person or action to the data set, but we can also internally expand the data set or capture the same data from different angles with very little time and effort.

Lastly, to quantitatively evaluate our data generation technique, we applied it to UT Kinect [57], a human action data set comprising 10 actions in 200 video clips. We generated new video clips by adding new subjects or actions, or by expanding the current actions and subjects. We show that the generated data, used along with the existing data, can improve the performance of well-performing video representation networks, I3D [4] and C3D [47], on the action recognition task. For further investigation, we applied our method and the action recognition task to two-person actions in the SBU Interact [61] data set. The outline of this paper is as follows. In §2 we describe related work in action recognition, data augmentation and video generative models. Section 3 introduces our video generation method as well as our skeleton trajectory generation method, with samples and use cases. In §4, we discuss the data sets and action recognition methods used to evaluate our work. In §5 we present the experimental data backing our claims. The paper is concluded in §6.

Figure 2: Structure of the network. On the left side, the "generator network" takes as input the background, the target skeleton, and the reference images transformed to the target skeleton along with their masks. On the right side, the "discriminator" takes as input a generated image or the ground truth and outputs "fake" or "real".

2 Related Works

2.1 Action Recognition

Human action recognition has drawn attention for some time. Before the deep learning era of computer vision, many researchers tried to inflate successful 2D features or descriptors to solve this problem, for example 3D SIFT [41], 3D bag of features [24] or dense trajectories [54]. Please refer to [33] for a comprehensive survey of these types of algorithms.

Deep learning networks significantly outperformed traditional approaches and are therefore the focus of this paper. Unlike image representation network architectures, video representation networks have not yet seen comparably satisfactory advances. There have been different approaches to this problem. Some used 2D (image-based) convolutional layers [7, 60] while some used 3D (video-based) kernels [15, 47, 4]. The input to the network could be RGB video alone [47], or optical flow could be used as an additional input [9, 4]. Information could propagate across frames either through LSTMs [7, 60] or feature aggregation [18].

Data Augmentation. Using synthetic data or data warping for training classifiers has been proven effective [23, 63, 43]. Sato et al. [39] propose a method for training a neural network classifier using augmented data. Wong et al. [56] thoroughly investigated the benefits of data augmentation for classification tasks. In action recognition tasks, data is usually very limited, since collecting and annotating videos is difficult. Although one can use our algorithm for data augmentation by generating videos varying in background, human appearance, and type of action, this is not the purpose of our work. Unlike data augmentation, which is limited to manipulating existing data, our method is capable of generating new data with new content and visual features.

2.2 Video Generative Models

Video generation has posed a challenge for a number of years. Early work in the field focused on generating texture [8, 46, 55]. In recent years, with the success of generative models in image generation such as GANs [11], VAEs [22, 35], Plug&Play Generative Networks [31], Moment Matching Networks [25], and PixelCNNs [50], a new window of opportunity has opened towards generating videos using generative models. In this paper, we use GANs to generate human skeleton trajectories and realistic video sequences. A GAN consists of a discriminator and a generator, trained in a 2-player zero-sum game. Although GANs have shown promising results on image generation [6, 34, 62, 28, 27], they have proven difficult to train. To address this issue, Arjovsky et al. [1] proposed the Wasserstein GAN, which trains more stably and combats mode collapse. Salimans et al. [38] introduced several tricks for training GANs. Karras et al. [19] proposed a novel method for training GANs by progressively adding new layers. Ronneberger et al. [36] proposed U-Net, a convolutional network for segmentation.

GANs have previously been used for video generation. There are two lines of work in video generation. First is video prediction where given the first few frames of a video, the goal is to predict the future frames. Several papers focus on producing pixel values conditioned on the past observed frames [59, 45, 32, 30, 17, 58, 51]. Another group of papers aimed at reordering the pixels from the previous frames to generate the new ones [49, 10].

In the second line of work, the goal is to generate a sequence of video frames conditioned on a label, a single frame, etc. Early attempts assumed video clips to be of fixed length and embedded in a latent space [52, 37]. Tulyakov et al. [48] proposed to decompose motion from content and generate videos using a recurrent neural net. Our work differs from [48] in that their model learns motion and content in the same network, whereas we separate them completely. Furthermore, [48] is not capable of generating complex human motions, and filling gaps in the background initially blocked by the person in the input video is a difficult task for their method. Our method handles these challenges by completely separating appearance, background, and motion. Our work is somewhat similar to [53], which performs video forecasting using pose estimation by modeling human movement with a VAE and then using a GAN to predict the pixel values of future frames.

Figure 3: Architecture of the discriminator.

Our work lies in the "video generation" category, where we focus on employing video generation techniques to generate human action videos. In our proposed method we completely separate background, skeleton motion, and appearance, allowing us to model frame generation and skeleton trajectory generation independently. As a result, only the skeleton trajectory model requires labeled data, while the frame generation model can benefit from the unlimited unlabeled human action videos available on the internet.

3 Method

We define the problem as follows: given an action label and a small set of reference images, each containing a human subject, generate a sequence of video frames featuring a human with the same appearance as the subject in the reference images performing the given action. Modeling the (human/camera) motion and generating photo-realistic video frames may be challenging, but knowing the location and motion of the human skeleton in each frame simplifies the problem. Hence, we subdivided it into two simpler tasks (inspired by [48, 51]).

  • The first task takes the reference images, a background image, and a sequence of target skeletons, and renders photo-realistic video frames of the person in the reference images moving according to the target skeletons on the given background.

  • The second task produces the target skeleton sequences for the first task. In other words, given an action label, it generates a sequence of skeletons of a random person performing that action.

By combining the two tasks, we created a novel algorithm that can generate an arbitrary number of human action videos with varying backgrounds, human appearances, actions, and ways in which each action is performed.

3.1 Video Generation from Skeleton and Reference Appearance

In this section, we explain the algorithm used to generate a video sequence of a person based on a given appearance and a series of target skeletons against an arbitrary background. In our proposed model, we use a GAN conditioned on the appearance, the target skeleton, and the background. Our generator network works in a frame-by-frame fashion, where each frame is generated independently of the others. We tried using LSTMs and RNNs to take the smoothness of the videos into account. However, our experiments show that frames generated separately are sharper, as RNNs/LSTMs may introduce blurriness into the generated frames.

Generator Input. Our generator network needs a reference image of the person in order to generate images of the same person with arbitrary poses and backgrounds. However, one reference image may not contain all the appearance information due to occlusions in some poses (e.g. the face is not visible when the person is not facing the camera). To overcome this issue to some extent, we provided multiple reference images of the person to the network. In both training and testing, these images were selected completely at random, so that the network is responsible for choosing the right pieces of appearance features from the set of input images. These images could be selected with a better heuristic to produce better results, though this is outside the scope of this work.

(a) UT dataset. Subjects from the same dataset.
(b) SBU dataset. None of the subjects exist in this dataset.
Figure 4: Generated images on two different datasets.

The reference images were pre-processed before being fed into the network. First we extracted the human skeleton from each reference image (using [3]), then used an offline transform to map the RGB pixel values of each skeleton part from the image to the target skeleton. We also created a binary mask marking where the transformed skeleton is located. All these transformed images and masks, along with the background and the target skeleton, were stacked to form the generator input.
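The stacking step can be illustrated with a small sketch. It is not the authors' implementation: the number of reference images, the frame size, and the assumption that the target skeleton is rendered as a 3-channel image are all hypothetical choices made only to show how the channel count of the stacked input arises.

```python
def constant_image(height, width, channels, value=0.0):
    """Build a height x width x channels image as nested lists."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


def stack_channels(images):
    """Concatenate images of equal height and width along the channel axis."""
    height, width = len(images[0]), len(images[0][0])
    for image in images:
        if len(image) != height or len(image[0]) != width:
            raise ValueError("all inputs must share the same height and width")
    stacked = []
    for r in range(height):
        row = []
        for c in range(width):
            pixel = []
            for image in images:
                pixel.extend(image[r][c])
            row.append(pixel)
        stacked.append(row)
    return stacked


# Assumed sizes, chosen only for illustration.
NUM_REFERENCES = 3       # hypothetical number of reference images
HEIGHT, WIDTH = 4, 4     # tiny frame so the example runs instantly
SKELETON_CHANNELS = 3    # assumption: target skeleton drawn as an RGB image

inputs = []
for _ in range(NUM_REFERENCES):
    inputs.append(constant_image(HEIGHT, WIDTH, 3))  # transformed reference image
    inputs.append(constant_image(HEIGHT, WIDTH, 1))  # its binary mask
inputs.append(constant_image(HEIGHT, WIDTH, 3))      # background
inputs.append(constant_image(HEIGHT, WIDTH, SKELETON_CHANNELS))  # target skeleton

generator_input = stack_channels(inputs)
num_channels = len(generator_input[0][0])
print(num_channels)
```

Under these assumptions the stacked input has 4 channels per reference image (RGB plus mask), 3 for the background and 3 for the skeleton.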

Conditional GAN. Inspired by pix2pix [14], we used a U-net style conditional GAN. The generator is conditioned on the set of transformed images and corresponding masks, along with the background and target skeleton. It maps this stacked input to the target frame such that it fools the discriminator. The discriminator, on the other hand, is trained to discriminate between real images and the fake images produced by the generator. The architecture of the discriminator is illustrated in Fig. 3. The pipeline and architecture of the generator are illustrated in Fig. 2. Fig. 4(a) illustrates some of the results.

The objective function of GAN is expressed as:

Following [14], we added an L1 loss to the objective function, which resulted in sharper generated frames.

In initial experiments, we noticed that using only the L1 loss and the GAN loss was not enough: the output background would be sharp, but the region where the target person is supposed to be was blurry. We therefore introduced a "Regional L1 loss" with a larger weight, as follows:

where "masked" restricts the image to the region where the person is located. This mask was generated from the target skeleton using morphological functions (erode, etc.).
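A minimal sketch of the idea behind the regional loss is given below. It is an illustration under stated assumptions rather than the authors' code: the mask is built by dilating the joint pixels with a square window (a stand-in for whichever morphological operations were actually used), images are single-channel nested lists, and the loss is averaged over pixels. The function names are hypothetical.

```python
def dilate(mask, radius=1):
    """Binary dilation with a square window of side 2 * radius + 1."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def skeleton_mask(joints, height, width, radius=1):
    """Mark the joint pixels, then grow them into a person-region mask."""
    mask = [[0] * width for _ in range(height)]
    for x, y in joints:
        if 0 <= y < height and 0 <= x < width:
            mask[y][x] = 1
    return dilate(mask, radius)


def mean_abs_error(target, generated, mask=None):
    """Mean absolute difference, restricted to mask pixels when a mask is given."""
    total, count = 0.0, 0
    for r in range(len(target)):
        for c in range(len(target[0])):
            if mask is None or mask[r][c]:
                total += abs(target[r][c] - generated[r][c])
                count += 1
    return total / count if count else 0.0


HEIGHT, WIDTH = 5, 5
target = [[1.0] * WIDTH for _ in range(HEIGHT)]
generated = [row[:] for row in target]
generated[2][2] = 0.0  # a single wrong pixel inside the person region

mask = skeleton_mask([(2, 2)], HEIGHT, WIDTH, radius=1)
global_l1 = mean_abs_error(target, generated)
regional_l1 = mean_abs_error(target, generated, mask)
print(global_l1, regional_l1)
```

In this toy case the same error contributes 1/25 to the global L1 term but 1/9 to the regional term, which is why the regional term pushes the generator harder on the person region.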

Our final objective is as follows:

where the two weights scale the L1 and regional L1 losses, respectively, and the goal is to solve the following optimization problem.
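For reference, the losses described in this section can be written out as follows. This is a sketch under assumed notation, not necessarily the authors' exact formulation: $c$ stands for the stacked conditioning input (transformed reference images, masks, background and target skeleton), $y$ for the ground-truth frame, $G$ and $D$ for the generator and discriminator, and $\lambda$ and $\beta$ for the weights of the L1 and regional L1 terms. The discriminator is written as taking only a frame, in line with the Figure 2 caption; pix2pix [14] additionally feeds the conditioning input to the discriminator.

```latex
\begin{align}
\mathcal{L}_{GAN}(G, D) &= \mathbb{E}_{y}\big[\log D(y)\big]
  + \mathbb{E}_{c}\big[\log\big(1 - D(G(c))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{c, y}\big[\lVert y - G(c) \rVert_{1}\big] \\
\mathcal{L}_{R1}(G) &= \mathbb{E}_{c, y}\big[\lVert \mathrm{masked}(y) - \mathrm{masked}(G(c)) \rVert_{1}\big] \\
\mathcal{L}(G, D) &= \mathcal{L}_{GAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G) + \beta\, \mathcal{L}_{R1}(G) \\
G^{*} &= \arg\min_{G} \max_{D} \mathcal{L}(G, D)
\end{align}
```

Since the regional term is described as carrying the larger weight, $\beta > \lambda$ in this notation.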


Multi-person Video Generation. In a nutshell, our algorithm merges the transformed images of a person in an arbitrary pose with an arbitrary background in a natural, photo-realistic way. We went beyond simple one-person human action videos and extended our method to multi-person interaction videos as well. For this purpose, we trained our model on a two-person interaction data set [61]. The only difference from the single-person frame generation process is in the pre-processing phase: for each person in the input reference image, we need to know the corresponding skeleton in the target frame, and we then transform each person's body parts to his or her own body parts in the target skeleton. There are some challenges in this task, such as occlusions in certain interactions (e.g. passing by, hugging, etc.). The data set that we used contains these occlusions to some extent. Our method handles some simple occlusions that occur in such interactions relatively well. We acknowledge that there is room for improvement in this area, but that is outside the scope of this work. Fig. 4(b) illustrates some of the generated videos.

3.2 Skeleton Trajectory Generation

In the previous section, we explained how we designed a method that enables us to generate videos of an arbitrary person against any background based on any given sequence of skeletons. Although the number of backgrounds and persons is unlimited, the number of labeled skeleton sequences is limited to those in the existing data sets. We propose a novel solution to this problem: using a generative model to learn the distribution of skeleton sequences conditioned on the action labels. This allows us to generate as many skeleton sequences as needed for the actions in the data set. Fig. 6 shows a few sample generated skeleton sequences.

We used small data sets for training our model. Due to the nature of the problem and the limited amount of data, generating long sequences of natural-looking skeletons proved challenging. We therefore aimed at generating relatively short fixed-length sequences. Even so, training a GAN in this way is still prone to problems such as mode collapse and divergence. We took these problems into account when designing the generator and discriminator networks (e.g. by introducing batch diversity in the discriminator and using multiple discriminators).

Skeleton Trajectory Representation. Each skeleton consists of 18 joints. We represented each skeleton with a vector (a flattened version of the matrix of joint coordinates). We normalized the coordinates by dividing them by the "height" and "width" of the original image.
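The representation can be sketched in a few lines. The sketch assumes 2D pixel coordinates given as (x, y) pairs and an interleaved x, y ordering in the flattened vector; neither detail is specified in the text, and the function name is hypothetical.

```python
NUM_JOINTS = 18


def skeleton_to_vector(joints, image_width, image_height):
    """Flatten (x, y) joints into one vector, normalized by the image size."""
    if len(joints) != NUM_JOINTS:
        raise ValueError("expected %d joints, got %d" % (NUM_JOINTS, len(joints)))
    vector = []
    for x, y in joints:
        vector.append(x / image_width)   # x is divided by the image width
        vector.append(y / image_height)  # y is divided by the image height
    return vector


# Toy skeleton: joints placed along a diagonal of a 640 x 480 image.
joints = [(10 * i, 5 * i) for i in range(NUM_JOINTS)]
vector = skeleton_to_vector(joints, image_width=640, image_height=480)
print(len(vector))
```

With 2D joints this yields 36 values per skeleton, all in the range [0, 1] for joints inside the image.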

Generator Network. We used a conditional GAN model to generate sequences of skeletal positions corresponding to different actions. Our generator has a "U"-shaped architecture whose input consists of an action label and noise, and whose output is a tensor representing a human skeleton trajectory over a fixed number of time steps.

Based on our results, providing a vector of random noise for each time step helps the generator to learn and generalize better. The input noise is therefore a tensor holding one noise vector per time step, drawn from a uniform distribution. The one-hot encoding of the action label is replicated and concatenated along the 3rd dimension of the noise tensor. The rest is a "U"-shaped network with skip connections that maps this input to a skeleton sequence. Fig. 5(a) illustrates the network architecture. We also used Dense-net [12] blocks in our network.
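The construction of the generator input can be sketched for a single sample as follows. The number of time steps, the noise dimension, and the range of the uniform distribution are not given in the text, so the values below are placeholders, and the function names are hypothetical.

```python
import random


def one_hot(label, num_classes):
    """One-hot encode an integer action label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def build_generator_input(label, num_classes, time_steps, noise_dim, rng):
    """Return a time_steps x (noise_dim + num_classes) input for one sample.

    A fresh noise vector is drawn for every time step, and the same one-hot
    label is replicated and appended to each of them.
    """
    label_vector = one_hot(label, num_classes)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(noise_dim)] + label_vector
        for _ in range(time_steps)
    ]


rng = random.Random(0)   # seeded so the example is reproducible
NUM_CLASSES = 10         # e.g. ten action labels
TIME_STEPS = 8           # assumed sequence length
NOISE_DIM = 4            # assumed noise size per time step

generator_input = build_generator_input(3, NUM_CLASSES, TIME_STEPS, NOISE_DIM, rng)
print(len(generator_input), len(generator_input[0]))
```

A batched version would add a leading batch dimension, which makes the feature axis the 3rd dimension along which the label is concatenated.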

Discriminator Network. The architecture of the discriminator is three-fold. Its base is a 1D convolutional neural net along the time dimension. To allow the discriminator to distinguish "human"-looking skeletons, we used a sigmoid layer on top of the fully convolutional net. To discriminate the "trajectory", we used a set of convolutions along time with stride 2, shrinking the output to a single vector containing features of the whole sequence. To prevent mode collapse, we first grouped the fully convolutional net outputs across the batch dimension. We then applied min, max and mean operations across the batch and provided this statistical information to the discriminator. This method seems to provide enough information about the distribution of values across the batch, and it allows the batch size to change during training. For the detailed discriminator architecture see Fig. 5(b).
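The batch statistics used against mode collapse can be sketched as follows. This is an illustration only: features are plain lists of floats rather than network activations, and how the statistics are fed back into the discriminator is not modeled. The function name is hypothetical.

```python
def batch_statistics(features):
    """Per-feature min, max and mean across the batch, concatenated.

    `features` is a list with one equal-length feature vector per sample.
    The output length depends only on the feature size, not the batch size.
    """
    if not features:
        raise ValueError("batch must not be empty")
    columns = list(zip(*features))
    minima = [min(column) for column in columns]
    maxima = [max(column) for column in columns]
    means = [sum(column) / len(column) for column in columns]
    return minima + maxima + means


small_batch = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
large_batch = small_batch + [[3.0, 8.0]]

small_stats = batch_statistics(small_batch)
large_stats = batch_statistics(large_batch)
print(small_stats)
print(large_stats)
```

Because the summary has the same length for any batch size, the batch size can change during training without changing the discriminator's input shape. A collapsed generator produces nearly identical samples, so min, max and mean coincide, which gives the discriminator an easy cue.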

(a) Generator Network.
(b) Trajectory Discriminator Network. The discriminator is the sum of the three discriminators illustrated in this figure.
Figure 5: Trajectory GAN network architecture.

Our objective function is:

Figure 6: Samples of generated skeleton sequences, conditioned on action label (e.g. throwing, hand waving, sitting).

where the two arguments of the objective are the action label and the skeleton trajectory, respectively. We aim to solve the following:
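Under assumed notation, a standard conditional GAN objective matching this description can be written as below. The symbols are illustrative and may differ from the authors': $l$ is the action label, $s$ a real skeleton trajectory, $z$ the input noise, and $G$ and $D$ the trajectory generator and the (combined) trajectory discriminator.

```latex
\begin{align}
\mathcal{L}_{T}(G, D) &= \mathbb{E}_{l, s}\big[\log D(l, s)\big]
  + \mathbb{E}_{l, z}\big[\log\big(1 - D(l, G(l, z))\big)\big] \\
G^{*} &= \arg\min_{G} \max_{D} \mathcal{L}_{T}(G, D)
\end{align}
```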

In this work, we have shown that generative models can be adopted to learn human skeleton trajectories. We trained a conditional GAN on a very small data set (200 sequences) and managed to generate natural-looking skeleton trajectories conditioned on action labels. This can be used to generate a variety of human action sequences that do not exist in the data set. However, our work is limited to a fixed number of frames. In future work, we will improve our method so that it accommodates longer sequences of varying length. We also explained that, in addition to the generated skeletons, we can use real skeleton sequences from other sources (other data sets, or the current data set with different subjects) to largely expand existing data sets.

4 Datasets and Action Recognition Methods

4.1 Data Sets

In this paper, we claim to expand a small set of action videos by adding newly generated videos. We targeted smaller action recognition data sets and expanded them to meet the large data requirements of recent action recognition algorithms, which are typically trained on large data sets such as UCF 101 [44], Kinetics [20] or NTU RGB+D [42]. This eliminates the need for time- and cost-inefficient data acquisition processes.

UT Kinect [57]: One of the data sets widely used in our experiments is UT Kinect, which includes 10 action labels: Walk, Sit-down, Stand-up, Pick-up, Throw, Push, Pull, Wave-hand, Carry and Clap-hand. Ten subjects perform each of these actions twice in front of a rig with an RGB camera and a Kinect. In total there are therefore 200 action clips of RGB and depth, though we ignore depth. All videos are taken in an office environment with similar lighting conditions, and the position of the camera is fixed.

For the training setup, 2 random subjects were left out (20%, used for testing) and the experiments were carried out using the remaining 80% of the subjects. The reported results are the average of six individual runs. The 6 train/test splits are kept constant throughout our experiments.
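A protocol of this kind can be sketched as follows. The seed, the subject numbering and the function name are assumptions made for illustration; the point is only that fixing the random generator once makes the six splits reusable across every experiment.

```python
import random


def leave_two_out_splits(subjects, num_runs=6, seed=0):
    """Draw `num_runs` train/test splits, each holding out 2 random subjects.

    A fixed seed makes the splits reproducible, so the same six splits can
    be reused for every experiment.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(num_runs):
        test = sorted(rng.sample(subjects, 2))
        train = [s for s in subjects if s not in test]
        splits.append((train, test))
    return splits


subjects = list(range(1, 11))  # ten subjects, numbered 1..10 for illustration
splits = leave_two_out_splits(subjects)
for train, test in splits:
    print("train:", train, "test:", test)
```

The reported accuracy would then be the mean of the per-split accuracies.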

SBU Interact [61]: Since our method works with multiple human subjects in a scene, we picked SBU Interact. It is a Kinect-captured human activity recognition data set depicting two-person interactions. It contains 294 sequences of 8 classes (Kicking, Punching, Pushing, Hugging, Shaking-hand, Approaching, Departing and Exchanging objects) and comes with a subject-independent 5-fold cross validation split. The original data includes RGB, depth and skeleton, but we only use RGB for our purpose. We used the 5-fold cross validation throughout our experiments and report the average accuracy.

KTH [40]: The KTH action recognition data set was commonly used in the early stages of action recognition research. It includes 600 low-resolution clips of 6 actions (Walk, Wave-hand, Clap-hand, Jogging, Running and Boxing), divided into train, test and validation sets. The first three action labels are shared with the UT data set while the last three are new. We used this data set to add new actions to the UT data set and for cross-data-set evaluation.

4.2 Action Recognition Methods

We used the following deep learning networks which have previously shown decent performance on recent action recognition data sets.

Convolutional 3D (C3D) [47]: a simple and efficient 3-dimensional ConvNet for spatiotemporal features, which shows decent performance on video processing benchmarks such as action recognition when trained with a large amount of data. We used their proposed network with 8 convolutional layers, 5 pooling layers and 2 fully connected layers, with 16 frames of RGB input. They released a network pre-trained on UCF Sport [44], which we used in our experiments in addition to training from scratch; the two settings are denoted C3D(p) and C3D(s), respectively. Unfortunately, we could not get C3D to converge when training from scratch on the UT data set, but it converged successfully on SBU.

Inflated 3D ConvNets (I3D) [4]: a more complex model which has recently been proposed as the state of the art for the action recognition task. It builds upon Inception-v1 [13], but inflates its filters and pooling kernels into 3D. It is a two-stream network which uses both RGB and optical flow inputs. We only used RGB for simplicity. They released a network pre-trained on ImageNet [5] followed by Kinetics [20]. We used this network in our experiments in addition to training from scratch; the two settings are denoted I3D(p) and I3D(s), respectively.

We used data augmentation by translation and clipping, as described in [4], for all experiments. For testing, we used only original clips, and we made sure that no generated clip in the training set used skeletons or subjects (subject pairs) from the test data of that run.

5 Experiments

So far, we have introduced our video generation method, which enables us to generate new action clips for the action recognition training process. In this section, we present different scenarios for generating new data and run experiments for each to see whether adding the generated data to the training process improves the accuracy of the action recognizer. We applied our proposed skeleton-conditioned video generation model in all the experiments. The model was trained using data from the UT and SBU data sets as well as 41 un-annotated clips (between 10 and 30 seconds long) that we captured of our colleagues. In future work, we will train our model again using a large amount of data from the web. For the time being, we are satisfied with the current model, as higher resolution is currently unnecessary for action recognition. Our technique for generating new action video clips allows experiments to be run under numerous varying settings. Here, we show five experiments which can be evaluated quantitatively.

5.1 Generated Trajectory

The first experiment combines our proposed video generation technique with skeleton trajectory generation. We generated around 200 random skeleton trajectories from the action labels in the UT data set using the method described in §3.2. Each of these skeleton trajectories was turned into a video by applying the proposed video generation method to a person in the UT data set, meaning that our new data set is doubled in size, with half of it being generated data. We then trained I3D and C3D using the training setting described in §4.1. Table 1 shows an improvement of about 3 percentage points for I3D, both with and without pre-training, as well as a significant improvement (15 percentage points) for the less complex C3D network.

Method    Org.      Org. + Gen.
I3D(s)    64.58%    67.50%
I3D(p)    86.25%    89.17%
C3D(p)    55.83%    70.83%
Table 1: Action recognition accuracy on the UT data set using the original data alone (Org.) compared with the original data plus data generated from scratch (Org. + Gen.) with the methods proposed in §3.1 and §3.2.

5.2 New Subjects

One common way to extend a video data set is to invite new people to perform a series of actions in front of a camera. Diversity [2] in body shape, clothes and behaviour will clearly help with the generalization of ML methods. In this experiment, we aimed to virtually add new subjects to the data set. To this end, we collected a small set of unannotated clips of 10 distinct persons and fed them as new subjects into our proposed video generation method. For UT, each subject was replaced by a new one in all of his or her actions, which is similar to adding 10 new subjects to UT. The same was done with SBU to double the data set, the only difference being that each pair was replaced with a new subject pair. Figure 4(b) shows a few new subjects with their generated action videos from the SBU data set. The results are presented in Table 2.

Method    UT Org.    UT Exp.    SBU Org.    SBU Exp.
I3D(s)    64.58%     67.08%     86.48%      91.23%
I3D(p)    86.25%     89.17%     97.30%      98.65%
C3D(s)    -          -          83.52%      87.00%
C3D(p)    55.83%     70.43%     92.02%      96.25%

Table 2: Performance comparison of multiple algorithms trained on the original data (Org.) and on the data expanded with additional subjects (Exp.).

Figure 7: Screenshot of videos generated by UT Kinect expansion. The first row shows skeleton clips extracted from an arbitrary action. The second to fourth rows show the generated videos for subjects from different clips carrying out that specific action.

5.3 New Actions

In real computer vision problems, one might decide to add a new label class after the data collection process has been completed. Adding a new action label to an existing data set could cost as much as gathering a data set from scratch, as all the subjects are needed to re-enact that single action. In this experiment, we introduced new action labels to the UT data set. As mentioned in §4.1, UT consists of 10 action labels. We used training data from a third data set, KTH [40], to generate 3 new actions (running, jogging and boxing) in addition to those of UT. For each subject in the UT data set and each of these 3 new actions, we randomly picked 5 action clips from the KTH training clips and extracted the skeletons with OpenPose [3]; together with an input background image, this allowed us to generate 150 new action clips. We then trained a new model using the pre-trained I3D network, where in each run we used the training data from the original set and all the data generated for the new set of actions. Since the KTH data consists of grey-scale images, we randomly converted both the original and the generated training clips to grey scale in the training phase. For each run, we computed the per-class accuracy on the UT test set (see §4.1 for the UT train/test setup) as well as on the KTH test set. Table 3 shows the average per-class accuracy for both test sets. The KTH test results for walk, wave-hand and clap-hand may be considered a measure of cross-data-set accuracy. On the new action labels boxing, running and jogging, our trained network achieved 72.14%, 44.44% and 63.20%, respectively. This suggests that the generated clips for the new actions perform comparably to data captured by camera.
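The random grey-scale step can be sketched as below. Several details are assumptions rather than facts from the text: the decision is made once per clip, the probability is a free parameter, frames are nested lists of RGB tuples, and the conversion uses the common ITU-R BT.601 luma weights. The function names are hypothetical.

```python
import random


def to_gray(frame):
    """Replace each RGB pixel by its luma, repeated over the three channels."""
    gray_frame = []
    for row in frame:
        gray_row = []
        for r, g, b in row:
            luma = 0.299 * r + 0.587 * g + 0.114 * b  # BT.601 weights
            gray_row.append((luma, luma, luma))
        gray_frame.append(gray_row)
    return gray_frame


def maybe_gray_clip(clip, probability, rng):
    """With the given probability, convert every frame of the clip to grey."""
    if rng.random() < probability:
        return [to_gray(frame) for frame in clip]
    return clip


# Toy clip: 3 frames of 1 x 2 pixels (one red pixel, one green pixel).
clip = [[[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]] for _ in range(3)]
rng = random.Random(0)

always_gray = maybe_gray_clip(clip, 1.0, rng)
never_gray = maybe_gray_clip(clip, 0.0, rng)
print(always_gray[0][0])
```

Keeping three identical channels leaves the input shape of an RGB network unchanged, so colour and grey clips can be mixed freely in one batch.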

Action       UT Test     Label        KTH Test
Walk         91.67%      Walk         67.18%
Wave-hand    100.0%      Wave-hand    58.59%
Clap-hand    91.67%      Clap-hand    28.90%
Push         33.33%      Boxing       72.14%
Pull         58.33%      Running      44.44%
Pick-up      100.0%      Jogging      63.20%
Sit-down     87.50%
Stand-up     95.83%
Throw        54.17%
Carry        79.17%

Table 3: Per-class average accuracy for the I3D model trained on the original UT training data plus the new action clips generated by our method using skeletons extracted from the KTH training set.

5.4 Data set Expansion

So far, we have shown that our proposed method can generate video clips with any number of arbitrary actions and subjects. In an action data set where a number of subjects each carry out a number of distinct actions, there is one video per subject-action pair. Applying our action video generation method to these subjects and videos pairs every subject with every recorded action video, so the resulting set consists of the original videos plus a much larger number of generated ones. This approach enabled us to expand the UT Kinect data set from 200 clips to 4000 clips and SBU Interact from 283 clips to 5943, using only the original data sets. We trained I3D and C3D on the expanded data sets as described in §4.1. Table 4 shows the results of this experiment.
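The reported sizes are consistent with rendering every original clip with a fixed number of subject appearances: 4000 / 200 = 20 for UT Kinect and 5943 / 283 = 21 for SBU Interact. These factors are inferred from the numbers above rather than stated in the text, and the sketch below only enumerates the resulting (clip, appearance) pairs; the function name is hypothetical.

```python
from itertools import product


def expand(clip_ids, appearance_ids):
    """Pair every recorded skeleton clip with every subject appearance."""
    return list(product(clip_ids, appearance_ids))


# Factors 20 and 21 are inferred from the reported totals (assumption).
ut_expanded = expand(range(200), range(20))
sbu_expanded = expand(range(283), range(21))

print(len(ut_expanded), len(sbu_expanded))
```

Each pair corresponds to one rendered clip; the pairs in which a clip meets its own subject are the original videos, and the rest are generated.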

          UTK                 SUB
          Org.      Exp.      Org.      Exp.
i3d(s)    64.58%    69.58%    86.48%    93.54%
i3d(p)    86.25%    90.42%    97.30%    99.13%
c3d(s)    -         -         83.52%    86.03%
c3d(p)    55.83%    71.25%    92.02%    97.41%

Table 4: Comparison of models trained on the original (Org.) and expanded (Exp.) data for the UTK and SUB data sets.
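The arithmetic behind the expansion in §5.4 can be checked with a short sketch. It assumes the expansion pairs every original clip with every available subject appearance; the factors it prints are simply the ratios implied by the reported clip counts, and the helper names are hypothetical.

```python
from itertools import product


def expand(clips, appearances):
    """Pair every source clip with every subject appearance (assumed scheme)."""
    return [(clip, who) for clip, who in product(clips, appearances)]


def implied_factor(original, expanded):
    """Return how many appearances per clip the reported counts imply."""
    factor, remainder = divmod(expanded, original)
    if remainder:
        raise ValueError("expanded size is not a whole multiple of the original")
    return factor


print(implied_factor(200, 4000))  # UT Kinect
print(implied_factor(283, 5943))  # SUB Interact
print(len(expand(range(200), range(implied_factor(200, 4000)))))
```

Both reported sizes are whole multiples of the original sizes, which is consistent with a full cross product of clips and subject appearances.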

Figure 7 shows screenshots of clips from the UTK and SUB data sets. The first row shows skeleton clips extracted from an arbitrary action, while rows 2-4 show the generated video for subjects from different clips performing that specific action.

5.5 Real World

In this section, we carried out 4 different experiments on 2 data sets for benchmarking. Although the generated data improved network performance in all experiments, we believe none of them shows the actual strength and convenience of our proposed method in real-world scenarios. In both data sets, as in other commonly used small data sets, the environmental setup for data acquisition, such as the distance from the camera view [16] and the lighting conditions, was kept as uniform as possible across test and training clips. This would be unattainable in real-life data acquisition. One way to overcome this obstacle is to collect diverse data in order to train robust neural network models. We have previously shown that partitioning the video into action, subject and context allows us to easily manipulate the background or change the camera view. In this experiment, we applied a perspective transform to the skeletons while using diverse backgrounds. Although the model trained with these data did not outperform our previous experiments, a live demo showed it to be qualitatively better on unseen cases. Figure 8 illustrates an input skeleton and its perspective transform, as well as the generated clip.

Figure 8: Perspective transform example.
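A perspective transform of a 2D skeleton can be sketched as a 3x3 homography applied to each joint. How the transform was parameterized in the experiment is not stated, so the matrix `H` below is a hypothetical example and `perspective_transform` is an illustrative helper, not the code used for Figure 8.

```python
def perspective_transform(joints, H):
    """Apply a 3x3 homography H to a list of (x, y) joint positions."""
    out = []
    for x, y in joints:
        xh = H[0][0] * x + H[0][1] * y + H[0][2]
        yh = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if w == 0:
            raise ValueError("joint maps to a point at infinity")
        # Divide by the homogeneous coordinate to return to image coordinates.
        out.append((xh / w, yh / w))
    return out


# Hypothetical matrix with a mild horizontal foreshortening term.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.001, 0.0, 1.0]]
skeleton = [(0.0, 0.0), (100.0, 50.0)]
print(perspective_transform(skeleton, H))
```

Applying the same matrix to every frame of a skeleton sequence simulates a fixed change of camera view for the whole clip.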

6 Conclusion and Future Work

In this paper, we introduced a novel way to partition an action video clip into action, subject and context. We showed that we can manipulate each part separately, reassemble the parts into new clips with our proposed video generation model, and use these clips as input for action recognition models that require large amounts of data. We can change an action by extracting it from an arbitrary video clip, by generating it with our proposed skeleton trajectory model, or by applying a perspective transform to an existing skeleton. Additionally, we can change the subject and the context using arbitrary video clips.

In future work, we will replace our 2D skeleton with a 3D skeleton to achieve 3D transformations and handle occlusions. Additionally, while our video generation technique demonstrated acceptable results for images, we believe it can be extended further to achieve higher resolution by feeding it more unannotated data.


  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [2] M. Bagheri, Q. Gao, S. Escalera, A. Clapes, K. Nasrollahi, M. B. Holte, and T. B. Moeslund. Keep it accurate and diverse: Enhancing action recognition performance by ensemble learning. In CVPRW, pages 22–29, 2015.
  • [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. arXiv preprint arXiv:1705.07750, 2017.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [6] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
  • [7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
  • [8] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. International Journal of Computer Vision, 51(2):91–109, 2003.
  • [9] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1933–1941, 2016.
  • [10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
  • [14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
  • [15] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
  • [16] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. PAMI, 33(1):172–185, 2011.
  • [17] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
  • [18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [19] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [20] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [21] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, pages 4873–4882, 2016.
  • [22] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  • [24] W. Li, Z. Zhang, and Z. Liu. Action recognition based on a bag of 3d points. In CVPRW, pages 9–14. IEEE, 2010.
  • [25] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718–1727, 2015.
  • [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
  • [27] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
  • [28] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in neural information processing systems, pages 469–477, 2016.
  • [29] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
  • [30] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • [31] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
  • [32] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
  • [33] R. Poppe. A survey on vision-based human action recognition. Image and vision computing, 28(6):976–990, 2010.
  • [34] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [35] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. In International Conference on Machine Learning, 2014.
  • [36] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • [37] M. Saito and E. Matsumoto. Temporal generative adversarial nets. arXiv preprint arXiv:1611.06624, 2016.
  • [38] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
  • [39] I. Sato, H. Nishimura, and K. Yokoi. Apac: Augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229, 2015.
  • [40] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In ICPR, volume 3, pages 32–36. IEEE, 2004.
  • [41] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional sift descriptor and its application to action recognition. In Proceedings of the 15th ACM international conference on Multimedia, pages 357–360. ACM, 2007.
  • [42] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In CVPR, pages 1010–1019, 2016.
  • [43] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3, pages 958–962, 2003.
  • [44] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [45] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning, pages 843–852, 2015.
  • [46] M. Szummer and R. W. Picard. Temporal texture modeling. In Image Processing, 1996. Proceedings., International Conference on, volume 3, pages 823–826. IEEE, 1996.
  • [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015.
  • [48] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.
  • [49] J. van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran, and S. Chintala. Transformation-based models of video sequences. arXiv preprint arXiv:1701.08435, 2017.
  • [50] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
  • [51] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. ICLR, 1(2):7, 2017.
  • [52] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
  • [53] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. arXiv preprint arXiv:1705.00053, 2017.
  • [54] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, pages 3169–3176. IEEE, 2011.
  • [55] L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 479–488. ACM Press/Addison-Wesley Publishing Co., 2000.
  • [56] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell. Understanding data augmentation for classification: when to warp? In Digital Image Computing: Techniques and Applications (DICTA), 2016 International Conference on, pages 1–6. IEEE, 2016.
  • [57] L. Xia, C. Chen, and J. Aggarwal. View invariant human action recognition using histograms of 3d joints. In CVPRW, pages 20–27. IEEE, 2012.
  • [58] T. Xue, J. Wu, K. Bouman, and B. Freeman. Probabilistic modeling of future frames from a single image. In NIPS, 2016.
  • [59] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, pages 91–99, 2016.
  • [60] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.
  • [61] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In CVPRW, pages 28–35. IEEE, 2012.
  • [62] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
  • [63] X. Zhang, Y. Fu, A. Zang, L. Sigal, and G. Agam. Learning classifiers from synthetic data using a multichannel autoencoder. arXiv preprint arXiv:1503.03163, 2015.