Learn to Dance with AIST++: Music Conditioned 3D Dance Generation



In this paper, we present a transformer-based learning framework for 3D dance generation conditioned on music. We carefully design our network architecture and empirically study the keys to obtaining qualitatively pleasing results. The critical components include a deep cross-modal transformer, which learns the correlation between music and dance motion well, and a full-attention mechanism with future-N supervision, which is essential for producing long-range non-freezing motion. In addition, we propose a new dataset of paired 3D motion and music called AIST++, which we reconstruct from the AIST multi-view dance videos. This dataset contains 1.1M frames of 3D dance motion in 1408 sequences, covering 10 genres of dance choreographies and accompanied by multi-view camera parameters. To our knowledge it is the largest dataset of this kind. Extensive experiments on AIST++ demonstrate that our method produces much better results than state-of-the-art methods both qualitatively and quantitatively. Please see the videos, project page and dataset at https://google.github.io/aichoreographer.



1 Introduction

Figure 1: Cross-Modal 3D Motion Generation Overview. Our proposed cross-modal learning framework takes in a music piece and a 2-second sequence of seed motion, then generates long-range future motion that correlates with the input music.

The ability to dance by composing movement patterns that align to musical beats is a fundamental aspect of human behavior. Dancing is a universal language found in all cultures [45], and today, many people express themselves through dance on contemporary online media platforms. The most watched videos on YouTube are dance-centric music videos such as “Baby Shark Dance” and “Gangnam Style” [70], making dance an ever more powerful tool for spreading messages across the internet. However, dancing is a form of art that requires practice—even for humans, professional training is required to equip a dancer with a rich repertoire of dance motions to create an expressive choreography. Computationally, this is even more challenging, as the task requires generating continuous motion with high kinematic complexity that captures the non-linear relationship with the accompanying music.

In this work, we address these challenges by presenting a novel cross-modal transformer-based learning framework along with a new 3D dance motion dataset, AIST++, which can be used to train a model that generates 3D dance motion conditioned on music. Specifically, given a piece of music and a short (2-second) seed motion, our model generates a long sequence of realistic 3D dance motion. Our model effectively learns the music-motion correlation and generates dance sequences that vary with the input music. We represent dance as a 3D motion sequence consisting of joint rotations and global translation, which enables easy transfer of our output to applications such as motion retargeting, as shown in our teaser figure.

For the learning framework, we propose a novel transformer-based cross-modal architecture for generating 3D motion conditioned on music. We build on recent attention-based networks [14, 62, 2, 71], which have been shown to be especially effective for long sequence generation, and take inspiration from the cross-modal vision-and-language literature [71] to design a framework with three transformers: one for the audio sequence representation, one for the motion representation, and one for the cross-modal audio-motion correspondence. The motion and audio transformers encode the input sequences, while the cross-modal transformer learns the correlation between the two modalities and generates future motion sequences.

We also carefully design our novel cross-modal transformer to be auto-regressive but with full attention and future-N supervision, which are shown to be key for preventing the 3D motion from freezing or drifting after several iterations, as reported in prior work on 3D motion generation [3, 2]. The resulting model generates different dance sequences for different music, while producing long-term realistic motion that does not suffer from freezing or drifting at inference time.

In order to train the proposed model, we also address the problem of data. While a few motion capture datasets of dancers dancing to music exist, collecting mocap data requires heavily instrumented environments, making these datasets severely limited in the number of available dance sequences and in dancer and music diversity. We therefore propose a new dataset, AIST++, which we build from the existing multi-view dance video database AIST [78]. We use the multi-view information to recover reliable 3D motion from this data. Note that while this database has multi-view shots, the cameras are not calibrated, making 3D reconstruction a non-trivial challenge. The resulting AIST++ dataset contains 1.1M frames of 3D dance motion accompanied by music, which to our knowledge is the largest dataset of its kind. AIST++ also spans 10 music genres, 30 subjects, and 9 views per dance with recovered camera parameters, which has ample potential to be useful for other human body and motion research. This dataset is available at https://google.github.io/aistplusplus_dataset/.

We conduct a thorough set of experiments comparing our approach to two prior works quantitatively, qualitatively, and with user studies. We propose novel metrics to better evaluate 3D motion conditioned on music, and ablate our models to identify key aspects of our architecture. The resulting learning framework has a wide variety of applications, such as exercise tools where users may learn to dance by observing the dance sequence from multiple angles with music of their own choice. Our computational model is also useful for content creation and animation, where the generated 3D motion can be directly transferred to a novel 3D character, as illustrated in our teaser figure.

2 Related Work

3D Human Motion Synthesis Realistic and controllable 3D human motion synthesis from past motion has long been studied. Earlier works employ statistical models such as kernel-based probability distributions [60, 6, 21, 7] to synthesize motion, but abstract away motion details. Motion graphs [47, 5, 43] address this problem by generating motion in a non-parametric manner. A motion graph is a directed graph constructed on a corpus of motion capture data, where each node is a pose and the edges represent transitions between poses; motion is generated by a random walk on this graph. A challenge with motion graphs is generating plausible transitions, which some approaches address by parameterizing the transition [26]. With the development of deep learning, many approaches explore the applicability of neural networks to 3D motion generation by training on large-scale motion capture datasets, with architectures such as CNNs [30, 31], GANs [27], RBMs [75] and RNNs [20, 3, 36, 23, 11, 15, 84, 8, 51, 83]. Auto-regressive models like RNNs can generate unbounded motion in theory, but in practice suffer from regression to the mean, where the motion “freezes” after several iterations or drifts to unnatural motions [3, 2]. Phase-functioned neural networks and their variations [90, 29, 68, 69] address this issue by conditioning the network weights on phase; however, they do not scale well to a wide variety of motion.

In this work, we present a transformer-based approach for generating 3D motion conditioned on music. The use of transformers is similar to a recently proposed approach by Aksan et al. [2]; however, we employ a full-attention transformer with future-N supervision, which our experiments show to be key for long-range 3D motion generation.

Cross-Modal Sequence-to-Sequence Generation Beyond the scope of human motion generation, our work is closely related to research on neural cross-modal sequence-to-sequence generation. In natural language processing and computer vision, tasks like text-to-speech (TTS) [64, 37, 39, 79], speech-to-gesture [18, 24, 19], and image/video captioning (pixels to text) [9, 40, 54, 44] involve solving a cross-modal sequence-to-sequence generation problem. Initially, combinations of CNNs and RNNs [82, 81, 87, 89] were prominent approaches. More recently, with the development of the attention mechanism [80], transformer-based networks achieve top performance on visual-text [91, 72, 16, 49, 34, 71] and visual-audio [22, 85] cross-modal sequence-to-sequence generation tasks. Our work explores audio-to-3D-motion generation with a transformer-based architecture.

While every cross-modal problem poses its own challenges, music-to-3D-dance generation is uniquely challenging in that there are many ways to dance to the same music, and the same dance choreography may be used for multiple pieces of music. We hope the proposed AIST++ dataset advances research in this relatively under-explored problem.

Audio To Human Motion Generation Music-to-dance generation has been studied in the 2D pose context with either optimization-based [76] or learning-based approaches [48, 67, 46, 63], where 2D pose skeletons are generated conditioned on music. Training data for 2D pose and music is abundant thanks to the high reliability of 2D pose detectors [10]. However, predicting motion in 2D limits expressiveness and potential for downstream applications. For 3D dance generation, earlier approaches explore matching existing 3D motion to music [66] and motion-graph-based methods [17]. More recent approaches employ LSTM [4, 74, 86, 93, 38, 88] or convolutional [1] sequence-to-sequence models. Closest to our work is that of Li et al. [50], which also employs a transformer-based architecture for audio and motion. However, their approach discretizes the output joint space to account for multi-modality, which generates unrealistic motion. In this work we introduce a novel full-attention-based cross-modal transformer for audio and motion, which not only better preserves the correlation between music and 3D motion, but also generates more realistic long 3D human motion with global translation. One of the biggest bottlenecks in 3D dance generation is data. The recent work of Li et al. [50] reconstructs 3D motion from dance videos on the Internet, but the data is not public. Further, 3D motion reconstructed from monocular videos may not be reliable and lacks accurate global 3D translation.

In this work we also reconstruct 3D motion from 2D dance videos, but from multi-view video sequences, which addresses these issues. While there are many large-scale 3D motion capture datasets [35, 55, 57, 33], mocap data of 3D dance is quite limited, as it requires heavy instrumentation and expert dancers to capture. As such, many previous works operate on small-scale motion capture datasets such as Dance with Melody [74], which is 94 minutes long with 4 types of music; GrooveNet [4], which is 23 minutes long with one dancer and one genre of electronic dance music; and DanceNet [92], which consists of two dance sequences totaling about an hour. In this paper, we present the AIST++ dataset, a 5-hour-long 3D dance dataset with 10 genres of music and 30 dancers.

3 AIST++ Dataset

Dataset                  Music  3D Kpt  3D SMPL  2D Kpt  Views  Genres  Subjects  Sequences  Seconds
AMASS [55]               -      yes     yes      -       -      -       344       11265      145251
Human3.6M [35]           -      yes     -        yes     4      -       11        210        71561
Dance with Melody [74]   yes    yes     -        -       -      4       Unknown   61         5640
GrooveNet [4]            yes    yes     -        -       -      1       1         2          1380
DanceNet [92]            yes    yes     -        -       -      2       2         2          3472
AIST++ (ours)            yes    yes     yes      yes     9      10      30        1408       18694



Table 1: 3D Motion Dataset Comparison. Here we present a detailed comparison of our AIST++ dataset against other published 3D motion datasets. Length-wise, our AIST++ dataset ranks third. Motion-wise, AIST++ has 10 different types of dance motion accompanied by music, whereas Human3.6M [35], the second largest, only has simple motions such as walking and sitting down.

Here we first discuss the content of the proposed AIST++ dataset, and then the process by which we obtain 3D motion from the original AIST Dance Database (AIST) [78], which is an uncalibrated multi-view collection of dance videos.

AIST++ is a large-scale 3D human dance motion dataset that contains a wide variety of 3D motion paired with music. It has the following extra annotations for each frame:

  • 9 views of camera intrinsic and extrinsic parameters;

  • COCO-format[65] human joint locations in both 2D and 3D;

  • SMPL [52] pose parameters along with the global scaling and translation.

This dataset is designed to serve as a benchmark for both motion generation and prediction tasks. It can also potentially benefit other tasks such as 2D/3D human pose estimation. To our knowledge, AIST++ is the largest 3D human dance dataset, with 1408 sequences, 30 subjects and 10 dance genres spanning basic and advanced choreographies. See Table 1 for a comparison with other 3D motion and dance datasets. AIST++ is complementary to existing 3D motion datasets such as AMASS [55], which contains only a small amount of dance motion and no accompanying music.

Thanks to AIST, AIST++ contains 10 dance genres: Old School (Break, Pop, Lock and Waack) and New School (Middle Hip-hop, LA-style Hip-hop, House, Krump, Street Jazz and Ballet Jazz), shown in Figure 2. Please see Appendix 7.1 for more details and statistics. The motions are equally distributed among the dance genres, covering a wide variety of music tempos, denoted in beats per minute (BPM) [58]. Each genre contains both basic choreographies (short fundamental dance movements) and advanced choreographies (longer movements freely designed by the dancers). Note, however, that AIST is an instructional database: it records multiple dancers performing the same choreography to different music with varying BPM, a common practice in dance. This poses a unique challenge for cross-modal sequence-to-sequence generation.

3.1 3D Motion Reconstruction

Next we describe how we reconstruct 3D motion from the AIST dataset. Although AIST contains multi-view videos, the cameras are not calibrated, meaning their intrinsic and extrinsic parameters are not available. Without camera parameters, it is not trivial to automatically and accurately reconstruct the 3D human motion. We start with 2D human pose detection [59] and manually initialized camera parameters, and then apply bundle adjustment [77] to refine the camera parameters. With the refined camera parameters, the 3D joint locations are triangulated from the multi-view 2D keypoint locations. During the triangulation phase, we introduce temporal smoothness and bone-length constraints to improve the quality of the reconstructed 3D joint locations. We further fit the SMPL human body model [53] to obtain 3D joint rotations, the most common form of motion data used in animation and other graphics applications. SMPL can be thought of as a function that takes as input 3D joint rotation (pose) parameters θ, low-dimensional shape parameters β, a global scale coefficient s and a global translation t, and outputs a mesh and a set of joints J(θ, β). We fit this model to the triangulated joint locations Ĵ by minimizing the following objective with respect to θ, the global scale s and the global translation t for each frame:

    min over θ, s, t of  Σ_j || s · J_j(θ, β̄) + t − Ĵ_j ||²

We fix the shape β to the average shape β̄, as the problem is under-constrained from 3D joint locations alone. We verify the quality of the recovered 3D motion using multi-view re-projection in Section 5.1.
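The inner step of this fitting, aligning a fixed set of model joints to the triangulated joints with a global scale and translation, has a closed-form least-squares solution. The sketch below is our own illustration of that inner step in NumPy (in the full pipeline the pose θ would be optimized iteratively around it); the function name and array shapes are assumptions, not the paper's actual code:

```python
import numpy as np

def fit_scale_translation(model_joints, target_joints):
    """Closed-form least-squares fit of a global scale s and translation t
    minimizing sum_j || s * J_j + t - J_target_j ||^2 for one frame.

    model_joints, target_joints: (num_joints, 3) arrays.
    """
    p_mean = model_joints.mean(axis=0)
    q_mean = target_joints.mean(axis=0)
    p_c = model_joints - p_mean          # centered model joints
    q_c = target_joints - q_mean         # centered target joints
    # Optimal scale from the normal equations of the centered problem.
    s = (p_c * q_c).sum() / (p_c ** 2).sum()
    # Optimal translation given s.
    t = q_mean - s * p_mean
    return s, t
```

Given noiseless targets, this recovers the scale and translation exactly, which makes it a convenient unit-testable building block for the per-frame objective above.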

Figure 2: AIST++ Motion Diversity Visualization. Here we show the 10 types of 3D human dance motion in our dataset.

4 Music Conditioned 3D Dance Generation

With the 3D dataset in hand, next we describe our approach to music conditioned 3D dance generation.

Problem statement Given a 2-second seed sample of motion, represented as X = (x_1, ..., x_T), and a music sequence, represented as Y = (y_1, ..., y_{T'}), the problem is to generate a sequence of future motion X' = (x_{T+1}, ..., x_{T'}) from time step T+1 to T', where T' > T.

4.1 Cross-Modal Motion Generation Transformer

We propose a transformer-based network architecture that learns the music-motion correlation and generates realistic, non-freezing motion sequences. An overview of this architecture is shown in Figure 1. While previous works have leveraged transformers [50, 2], we introduce several critical design choices that assist in learning cross-modal correspondence and, more importantly, prevent the generated motion from freezing. These choices include the cross-modal transformer architecture (the number of attention layers), the attention mechanism (causal attention [62] vs. full attention [14]) used in each transformer, and the supervision scheme. Here, we explain our design choices in detail.

We introduce three transformers in our model: the motion transformer, which embeds the motion features into a motion embedding; the audio transformer, which similarly embeds the audio features into an audio embedding; and the cross-modal transformer, which learns the correspondence between the two modalities and generates the future motion. To better learn the correlation between the two modalities, we employ a deep cross-modal transformer. We find that increasing the depth of the cross-modal transformer greatly helps the model pay attention to both modalities (as shown in Figure 5).

Transformer networks are built around attention [80], and there are two common types: causal attention and full attention. They differ in their data computation dependencies. Figure 3 illustrates the relation between the inputs (bottom row), context vectors (middle row) and outputs (top row) of a simplified two-layer transformer, where the edges represent computation dependencies. With causal attention, each context tensor is computed only from the current and past inputs, and each output tensor likewise only sees the current and previous context tensors. With full attention, all positions depend on each other. Full attention is more commonly used in transformer encoders [14, 12], while causal attention is usually used in transformer decoders [62, 13]. Concretely, the output of an attention layer, the context vector C, is computed from the query Q and the key-value pair (K, V) derived from the input, with or without a mask M, via

    C = Attention(Q, K, V) = softmax( Q K^T / sqrt(D) + M ) V

where D is the number of channels in the attention layer. In causal attention, the mask M is a triangular matrix, also referred to as the look-ahead mask, whose non-zero elements are close to negative infinity, while in full attention M = 0. We apply full attention in all three transformers: the motion transformer, the audio transformer and the cross-modal transformer.
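A minimal NumPy sketch may make the masking concrete. The function names are our own illustration of the standard scaled dot-product attention, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D) + M) V."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)
    if mask is not None:
        scores = scores + mask          # additive mask M
    return softmax(scores) @ V

def causal_mask(T):
    """Look-ahead mask: strictly upper-triangular entries set to a large
    negative value, so position i attends only to positions <= i.
    Full attention corresponds to mask=None (i.e. M = 0)."""
    return np.triu(np.full((T, T), -1e9), k=1)
```

With the causal mask, the first output position can attend only to the first key, so its context reduces to the first value row; with `mask=None` every position mixes information from the whole sequence.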

Aside from the network architecture, the supervision scheme can critically affect the model’s performance, as it is tightly related to how the gradients are computed. Many previous works [62, 50] on sequence-to-sequence generation apply the shift-by-1 (auto-regressive) supervision scheme with a causal-attention transformer. When supervised on shift-by-1 outputs, the attention layer learns to predict the immediate next frame of the input vector. We find that motions generated with this shift-by-1 supervision freeze after a few steps. Instead, we combine our full-attention transformer with the proposed future-N supervision scheme: the output of the transformer is supervised on the N future time steps following the last observed timestamp in the input sequence. At test time, our approach can still be applied auto-regressively. Compared with causal attention (shown on the left of Figure 3) and shift-by-1 supervision, our full-attention transformer with future-N supervision yields non-freezing, more realistic long motion generation. We show these comparison results in Sec. 5.2.4.
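The difference between the two supervision schemes can be sketched as how training pairs are cut from a sequence. This is our own illustration under assumed shapes (a 1D frame index stands in for a full motion feature):

```python
import numpy as np

def shift_by_1_targets(seq):
    """Causal scheme: every input frame is supervised to predict the
    immediately following frame."""
    return seq[:-1], seq[1:]

def future_n_targets(seq, T, N):
    """Future-N scheme: a T-frame input window is supervised on the N
    frames that follow the last observed timestamp."""
    inputs = seq[:T]
    targets = seq[T:T + N]
    return inputs, targets
```

At inference time both schemes are rolled out auto-regressively; the difference is only in which outputs receive gradients during training.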

Figure 3: Attention Mechanism Comparison. Here we show the data tensor relation for a causal-attention transformer (used in models like GPT [62] and the motion generator of [50]) and our full-attention transformer, as a simplified two-layer transformer for illustration purposes. The dots on the bottom row are the input tensors, which are computed into context tensors through causal (left) and full (right) attention transformer layer. The output (predictions) are shown on the top. While causal models are often supervised to predict the immediate next future for each input tensor, in our work we employ full attention and predict the future time steps from the last timestamp (3 shown here, 20 used in practice). We empirically show that this results in a non-freezing, more realistic motion generation.

5 Experiments


Method              Pos. Frechet Dist  Vel. Frechet Dist  Pos. Var  Vel. Var  Beat Align. Score  Beat DTW Cost  Our Winning Rate
AIST++              -                  -                  -         -         0.295              10.51          -
AIST++ (unpaired)   -                  -                  -         -         0.212              13.54          25.4%
Li et al. [50]      5595.91            3.40               0.019     121.36*   0.231              12.56          80.6%
DanceNet [92]       2367.26            1.13               0.215     1.05      0.232              12.17          71.1%
Ours                113.56             0.45               0.509     6.51      0.241              12.16          -



Table 2: Conditional Motion Generation Evaluation on the AIST++ dataset. Compared to the two baseline methods, our model generates motion that is more realistic, better correlated with the input music, and more diverse when conditioned on different music. *Note that Li et al. [50]’s generated motions are highly jittery, which makes their velocity variation extremely high.

5.1 AIST++ Motion Quality Validation

We first carefully validate the quality of our 3D motion reconstruction. Possible error sources include inaccurate 2D keypoint detection and the estimated camera parameters. As there is no 3D ground truth for the AIST dataset, our validation is based on the observation that the re-projected 2D keypoints should be consistent with the predicted 2D keypoints that have high prediction confidence. We use the 2D mean per-joint position error (MPJPE-2D), commonly used for 3D reconstruction quality measurement [42, 35, 61], to evaluate the consistency between the predicted 2D keypoints and the reconstructed 3D keypoints under the estimated camera parameters. Note that we only consider 2D keypoints with prediction confidence over 0.5 to avoid noise. The MPJPE-2D is small over the entire dataset relative to the image resolution, with the large majority of frames having only a few pixels of error. Please refer to Appendix 7.1 for the distribution of MPJPE-2D on AIST++.
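The consistency check can be sketched as follows, assuming a simple pinhole camera model; the function names and shapes are our own, not the paper's code:

```python
import numpy as np

def project(joints_3d, K, R, t):
    """Pinhole projection of 3D joints with intrinsics K and extrinsics [R|t].

    joints_3d: (J, 3) world-space joints; returns (J, 2) pixel coordinates.
    """
    cam = joints_3d @ R.T + t           # world -> camera coordinates
    uv = cam @ K.T                      # camera -> homogeneous image coords
    return uv[:, :2] / uv[:, 2:3]       # perspective divide

def mpjpe_2d(joints_3d, joints_2d, conf, K, R, t, thresh=0.5):
    """Mean per-joint position error (pixels) between re-projected 3D joints
    and detected 2D keypoints, keeping only detections with conf > thresh."""
    proj = project(joints_3d, K, R, t)
    keep = conf > thresh
    return np.linalg.norm(proj[keep] - joints_2d[keep], axis=-1).mean()
```

If the reconstructed 3D joints and camera parameters are perfect, the re-projection matches the 2D detections and the error is zero; in practice the residual measures reconstruction quality.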

5.2 Music Conditioned 3D Motion Generation

Experimental Setup

Dataset Split All experiments in this paper are conducted on our AIST++ dataset, which to our knowledge is the largest dataset of this kind. We split AIST++ into train and test sets, and report performance on the test set only. We carefully split the dataset so that the music and dance motion in the test set do not overlap with those in the train set. To build the test set, we first select one music piece from each of the 10 genres. Then for each music piece, we randomly select two dancers, each with two different choreographies paired with that music, resulting in 40 unique choreographies in the test set. The train set is built by excluding all test music and test choreographies from AIST++. Note that for the test set we intentionally pick music pieces with different BPMs so that it covers the full range of BPMs in AIST++.

Implementation Details In all our experiments, the input of the model contains a seed motion sequence of 120 frames (2 seconds) and a music sequence of 240 frames (4 seconds), where the two sequences are aligned on the first frame. The output of the model is the future motion sequence of N = 20 frames (except for the attention ablation experiments in Sec. 5.2.4). During training, we supervise the future motion sequence with an L2 loss. During inference, we generate long-range future motion auto-regressively at 60 FPS. We use the publicly available audio processing toolbox Librosa [56] to extract the music features: a 1-dim envelope, 20-dim MFCC, 12-dim chroma, a 1-dim one-hot peak indicator and a 1-dim one-hot beat indicator, resulting in a 35-dim music feature. This 35-dim music feature is the input to the audio transformer. For the motion feature, we use the 9-dim rotation-matrix representation for all 24 joints, along with a 3-dim global translation vector, resulting in a 219-dim motion feature as the input to the motion transformer. All three (audio, motion, cross-modal) transformers use multi-head attention with the same hidden size across layers; the number of attention layers in each transformer varies by experiment, as described in Sec. 5.2.4. Learnable position encodings are used for the motion and audio transformers, respectively. All our experiments are trained with the Adam [41] optimizer using a step-decayed learning rate; training takes several days on TPUs. For baselines, we compare with DanceNet [92] and Li et al. [50], which we train and test on the same dataset as ours using the official code provided by the authors.
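The 219-dim motion feature (24 SMPL joint rotations as flattened 3x3 matrices plus a 3-dim global translation: 24 * 9 + 3 = 219) can be assembled as below. This is a sketch assuming the pose is stored in axis-angle form, converted with Rodrigues' formula; the function names are our own:

```python
import numpy as np

def axis_angle_to_matrix(aa):
    """Rodrigues' formula: 3-dim axis-angle vector -> 3x3 rotation matrix."""
    angle = np.linalg.norm(aa)
    if angle < 1e-8:
        return np.eye(3)
    k = aa / angle                       # unit rotation axis
    Kx = np.array([[0.0, -k[2], k[1]],   # cross-product (skew) matrix
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * Kx + (1 - np.cos(angle)) * (Kx @ Kx)

def motion_feature(pose_aa, translation):
    """Flatten 24 joint rotations (9 values each) plus a 3-dim global
    translation into a single 219-dim motion feature for one frame.

    pose_aa: (24, 3) axis-angle pose; translation: (3,) vector.
    """
    mats = [axis_angle_to_matrix(aa).reshape(-1) for aa in pose_aa]
    return np.concatenate(mats + [translation])
```

A zero pose maps every joint to the identity rotation, which is a convenient sanity check on the layout.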

Evaluation Metrics

Figure 4: Beats Alignment between Music and Generated Dance. Here we visualize the kinetic velocity (blue curve) and kinematic beats (green dotted line) of our generated dance motion, as well as the music beats (orange dotted line). The kinematic beats are extracted by finding local minima from the kinetic velocity curve.

Motion Quality Metric We evaluate motion quality by calculating the distribution distance between generated and ground-truth motion clips, inspired by FID [28], which is widely used in image generation evaluation. As there is no standard motion feature extractor analogous to the Inception network [73] for images, we directly calculate the Frechet Distance (FD) in the joint position space and the joint velocity space. All experiments using this metric are conducted on motion clips randomly sampled from the 20-second generated sequences, using all joints.
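The Frechet distance between Gaussians fitted to two feature sets has the standard closed form ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}). The sketch below is our own pure-NumPy version, computing the trace of the matrix square root via the (real, non-negative) eigenvalues of S1 S2 instead of an explicit sqrtm:

```python
import numpy as np

def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two feature sets.

    x, y: (num_samples, feature_dim) arrays, e.g. flattened joint positions
    or joint velocities of motion clips.
    """
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    # Tr((S1 S2)^{1/2}) = sum of sqrt of eigenvalues of S1 S2, which are
    # real and non-negative for PSD covariances (clipped for round-off).
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return ((mu1 - mu2) ** 2).sum() + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt
```

Identical distributions give a distance near zero, and shifting one set's mean increases the distance quadratically, matching the first term.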

Motion Diversity Metric The diversity of long-range dance motions generated from various music reflects how well the model learns the cross-modal relationship. We use the motion variation in both the joint position space and the joint velocity space to measure diversity. All experiments using this metric are conducted on the 20-second generated sequences, using all joints.
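One plausible reading of this metric, the average feature variance across motions generated from the same seed but different music, can be sketched as below; this is our interpretation, not necessarily the paper's exact protocol:

```python
import numpy as np

def motion_diversity(motions):
    """Average per-dimension variance across motions generated from the
    same seed motion but different music.

    motions: (num_music, T, feature_dim) array of generated sequences.
    """
    feats = motions.reshape(motions.shape[0], -1)  # one row per sequence
    return feats.var(axis=0).mean()
```

Identical outputs for every music piece would score zero, while outputs that genuinely respond to the conditioning score higher.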

Motion-Music Correlation Metric Furthermore, we propose two metrics, Beat Alignment Score and Beat Dynamic-Time-Warping (DTW) Cost, to evaluate how well the generated motion is correlated with the input music. The correlation is defined through the similarity between kinematic beats and music beats. The kinematic beats are computed as the local minima of the kinetic velocity, as shown in Figure 4. The Beat Alignment Score is the average distance between every kinematic beat and its nearest music beat. Specifically, our Beat Alignment Score is defined as:

    BeatAlign = (1/m) Σ_{i=1}^{m} exp( − min_j || t_i^x − t_j^y ||² / (2σ²) )

where {t_i^x} are the kinematic beats, {t_j^y} are the music beats and σ is a parameter to normalize sequences with different FPS. We set σ = 3 in all our experiments, as the FPS of all our experiment sequences is 60. A similar metric, Beat Hit Rate, was introduced in [46, 32], but it requires a dataset-dependent handcrafted threshold to decide on alignment (a “hit”), while ours directly measures the distances. To measure the similarity between music beats and kinematic beats, we also employ Dynamic Time Warping (DTW) [25]. The Beat DTW Cost is computed as Σ_{(i,j)∈P} || t_i^x − t_j^y ||, where P represents the paired frame indexes on the warping path.
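The two ingredients of the Beat Alignment Score, local-minima beat detection and the exponentially mapped nearest-beat distance, can be sketched as follows (our own simplified illustration; a real pipeline would smooth the velocity curve before taking minima):

```python
import numpy as np

def kinematic_beats(velocity):
    """Kinematic beats = local minima of the kinetic velocity curve.
    Returns frame indices of strict local minima."""
    v = np.asarray(velocity)
    return np.where((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:]))[0] + 1

def beat_align_score(kin_beats, music_beats, sigma=3.0):
    """Mean of exp(-d^2 / (2 sigma^2)) over the distance d from each
    kinematic beat to its nearest music beat (sigma normalizes FPS)."""
    dists = np.abs(kin_beats[:, None] - music_beats[None, :]).min(axis=1)
    return np.exp(-dists ** 2 / (2.0 * sigma ** 2)).mean()
```

When every kinematic beat lands exactly on a music beat the score is 1, and it decays smoothly toward 0 as the beats drift apart, avoiding the hard threshold of a hit-rate metric.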

Quantitative Evaluation

In this section, we report the quantitative evaluation of our method compared with the two baselines, Li et al. [50] and DanceNet [92], on our AIST++ test set. The results are shown in Table 2.

Motion Quality In this experiment, we first generate 20-second motion sequences using the paired data from our AIST++ test set, and then compute our motion quality metrics on them. As shown in Table 2, the joint position and velocity distributions of our generated motion are much closer to the ground-truth motion than those of the two baselines. We also visualize the generated sequences from the two baselines in our supplemental video. The participants of our user study (Sec. 5.2.5) often commented “too jittery” on the results of Li et al. [50] and “very limited movements” on the results of DanceNet [92].

Figure 5: Attention Weights Visualization. We compare the attention weights from the last layer of the (a) 12-layer cross-modal transformer and (b) 1-layer cross-modal transformer. The deeper cross-modal transformer pays equal attention to motion and music, while the shallower one tends to pay more attention to motion.
Figure 6: Motion Diversity Results. Here we visualize 4 different dance motions generated by our proposed model when given different music but the same seed motion. Each row is a sampled 2-second clip from a 20-second generated 3D dance motion. Notably, in the second row, when given more modern dance music but a hip-hop seed motion, our model can generate ballet motions (refer to our supplementary video).

Motion Diversity We evaluate our model’s ability to generate diverse dance motions given various input music, compared with the two baseline methods. In this experiment, we pair all 10 test music pieces with the seed motions from the AIST++ test set to generate motion sequences. Then we compute the diversity metrics over motions generated from the same seed motion but different music. Table 2 shows that our method generates more diverse dance motions than the baselines. Based on our model ablation study (Sec. 5.2.4), our careful network design, particularly the cross-modal transformer, is the main reason for this difference. We also visualize the diversity of our generated motions in Figure 6.

Motion-Music Correlation Further, we evaluate how well the generated 3D motion correlates with the input music. To calibrate the results, we compute the correlation metrics on the entire AIST++ dataset (upper bound) and on randomly paired data (lower bound). As shown in Table 2, our generated motion correlates better with the input music than the baselines’. We also show an example in Figure 4 where the kinematic beats of our generated motion align well with the music beats. However, compared with the real data, all three methods, including ours, leave large room for improvement, reflecting that music-motion correlation remains a challenging problem.

Ablation Study

Cross-modal Transformer We study the role of our cross-modal transformer using three settings: (1) a 14-layer motion transformer only; (2) 13-layer motion/audio transformers with a 1-layer cross-modal transformer; (3) 2-layer motion/audio transformers with a 12-layer cross-modal transformer. For fair comparison, we change the number of attention layers in the motion/audio transformers and the cross-modal transformer simultaneously to keep the total number of attention layers the same. Table 3 shows that the cross-modal transformer is critical for generating motion that is well correlated with the input music. As shown in Figure 5, a deeper cross-modal transformer pays more attention to the music and thus learns the music-motion correlation better.


Cross-Modal Transformer   Vel. Var   Pos. Var   Beat Align. Score   Beat DTW Cost
w/o Audio                 –          –          0.227*              12.64*
w/ Audio + CM-1           0.480      0.78       0.233               12.59
w/ Audio + CM-12          0.509      6.51       0.241               12.16



Table 3: Ablation Study on the Cross-modal Transformer. The deeper the cross-modal transformer, the better it learns the motion-music correlation: it follows the beats more closely and generates more diverse motion from different music. *Note these numbers are calculated using the music paired with the input motion.

Causal-Attention or Full-Attention Transformer Here we study the effectiveness of our full-attention mechanism and future-N supervision scheme. We set up four experiments with different settings: causal-attention with shift-by-1 supervision, and full-attention with future-1/10/20 supervision. Qualitatively, we find that motion generated by the causal-attention mechanism with shift-by-1 supervision (similar to the GPT model [62]) starts to freeze after several seconds. A similar problem was reported with this type of setting in motion prediction works [3, 2]. Quantitatively, as shown in Table 4, there is a huge distribution difference between the generated motion and the ground-truth motion in long-range generation when using the causal-attention mechanism. For the full-attention mechanism with future-1 supervision, the results rapidly drift during long-range generation. However, when the model is supervised with 10 or 20 future frames, it generates good-quality (non-freezing, non-drifting) long-range motion.
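A minimal sketch of the attention masks and supervision targets being compared. The windowed reading of future-N supervision here (predict the next N frames at once, keep only the first at inference) is one simple interpretation of the scheme:

```python
import numpy as np

T, N = 6, 3   # seed-motion length and number of supervised future frames

# Causal attention: frame i may attend only to frames <= i (GPT-style).
causal_mask = np.tril(np.ones((T, T), dtype=bool))
# Full attention: every frame attends to the entire seed window.
full_mask = np.ones((T, T), dtype=bool)

gt = np.arange(T + N)               # stand-in ground-truth frame indices
inputs = gt[:T]                     # frames fed to the transformer
shift_by_1_targets = gt[1:T + 1]    # causal baseline: predict frame i+1 at step i
targets = gt[T:T + N]               # future-N: supervise the next N frames at once
# At inference only the first predicted frame is kept and the input
# window slides forward by one frame (auto-regressive generation).
```

The contrast the ablation draws is that shift-by-1 only ever penalizes one-step errors, while future-N forces the model to commit to a longer horizon, which discourages the frozen-pose shortcut.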


Attn-Supervision         Pos. Frechet Dist   Vel. Frechet Dist
Causal-Attn-Shift-by-1   206.7               1.60
Full-Attn-F1             188.3               2.34
Full-Attn-F10            142.4               0.50
Full-Attn-F20            113.6               0.45



Table 4: Ablation Study on Attention and Supervision Design Choices. The causal-attention transformer with shift-by-1 supervision tends to produce frozen motion in long-range generation. Supervising the full-attention transformer with more future frames boosts its ability to generate realistic dance motion.
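The position and velocity Frechet distances above compare the distribution of generated motion against ground truth, in the spirit of the FID [28]. A minimal, dependency-free sketch, assuming diagonal covariances (the full formula requires a matrix square root of the covariance product):

```python
import numpy as np

def frechet_distance(x, y):
    """Frechet (FID-style) distance between two feature sets x, y of
    shape (N, D), here simplified to diagonal covariances: the mean
    term plus a per-dimension variance term."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    var_x, var_y = x.var(axis=0), y.var(axis=0)
    mean_term = np.sum((mu_x - mu_y) ** 2)
    cov_term = np.sum(var_x + var_y - 2.0 * np.sqrt(var_x * var_y))
    return float(mean_term + cov_term)
```

Identical distributions score (near) zero; frozen generated motion collapses the variance and shifted motion moves the mean, and both inflate the distance, which is why the metric separates the attention variants above.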

5.2.5 User Study

Finally, we perceptually evaluate the motion-music correlation with a user study comparing our method against the two baseline methods and a “random” baseline that randomly combines AIST++ motion and music (refer to Appendix 7.2 for user study details). In this study, each user is asked to watch 10 random videos out of 120 and, for each video, answer the question “which person is dancing more to the music? LEFT or RIGHT”. In total, 90 participants took part, ranging from professional dancers to people who rarely dance. Analyzing the feedback, we find: (1) of our generated dance motion is rated better than Li \etal [50]; (2) of our generated dance motion is rated better than Dancenet [92]; (3) 75% of the unpaired AIST++ dance motion is rated better than ours. Clearly, we surpass the baselines in the user study. However, because the “random” baseline consists of real, advanced dance motions that are extremely expressive, participants are biased to prefer it over ours; the quantitative metrics nevertheless show that our generated dance is better aligned with the music.

6 Conclusion and Discussion

In this paper, we present a cross-modal transformer-based neural network architecture that not only learns the audio-motion correspondence but also generates non-freezing, high-quality 3D motion sequences conditioned on music. As generating 3D movement from music is a nascent area of study, we hope our work will pave the way for future cross-modal audio-to-3D-motion generation. We also construct the largest 3D human dance dataset: AIST++. This large, multi-view, multi-genre, cross-modal 3D motion dataset can help not only research on conditional 3D motion generation but also human understanding research in general. While our results show a promising direction for music-conditioned 3D motion generation, there is more to explore. In particular, although our approach generates realistic motion, the model is currently deterministic; exploring how to generate multiple realistic dances per piece of music is an interesting direction.

7 Appendix

7.1 AIST++ Dataset Details

Statistics We show the detailed statistics of our AIST++ dataset in Table 5. Thanks to the AIST Dance Video Database [78], our dataset contains in total 5.2 hours (1.1M frames, 1408 sequences) of 3D dance motion accompanied with music. The dataset covers 10 dance genres and 60 pieces of music. For each genre, there are 6 different pieces of music of varying length, with tempos from 80 BPM to 130 BPM (except for the House genre, which ranges from 110 BPM to 135 BPM). Among the motion sequences, 85% are basic choreographies and 15% are advanced. Advanced choreographies are longer, more complicated dances improvised by the dancers. Note that for the basic dance motion, dancers are asked to perform the same choreography on all the pieces of music at different speeds to follow the different music BPMs, so the number of unique choreographies per genre is smaller than the number of motion sequences. In our experiments we split the AIST++ dataset such that there is no overlap between train and test for both music and choreographies (see Sec. 5.2.1 in the paper).

Validation As described in Sec. 5.1 in the paper, we validate the quality of our reconstructed 3D motion by calculating the overall MPJPE-2D (in pixels) between the re-projected 2D keypoints and the predicted 2D keypoints. We provide the distribution of MPJPE-2D over all motion sequences and all images in Figure 7.
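MPJPE-2D here reduces to a mean of per-joint pixel distances between the re-projected and detected keypoints. A minimal sketch of how such a metric can be computed (the array shapes are assumptions for illustration):

```python
import numpy as np

def mpjpe_2d(reprojected, detected):
    """Mean per-joint position error in pixels between re-projected 2D
    keypoints and detected 2D keypoints, both of shape (T, J, 2)."""
    return float(np.linalg.norm(reprojected - detected, axis=-1).mean())

# toy check: every joint offset by a (3, 4) pixel shift -> 5 px error
pred = np.zeros((2, 17, 2))
det = pred + np.array([3.0, 4.0])
print(mpjpe_2d(pred, det))  # 5.0
```

Low MPJPE-2D across all cameras indicates the recovered 3D joints re-project consistently onto the independent 2D detections.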

Genres Musics Music Tempo Motions Choreographs Motion Duration (sec.) Total Seconds
ballet jazz 6 80 - 130 141 85% basic + 15% advanced 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1910.8
street jazz 6 80 - 130 141 7.4 - 12.0 basic / 14.9 - 48.0 adv. 1875.3
krump 6 80 - 130 141 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1904.3
house 6 110 - 135 141 7.1 - 8.7 basic / 28.4 - 34.9 adv. 1607.6
LA-style hip-hop 6 80 - 130 141 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1935.8
middle hip-hop 6 80 - 130 141 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1934.0
waack 6 80 - 130 140 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1897.1
lock 6 80 - 130 141 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1898.5
pop 6 80 - 130 140 7.4 - 12.0 basic / 29.5 - 48.0 adv. 1872.9
break 6 80 - 130 141 7.4 - 12.0 basic / 23.8 - 48.0 adv. 1858.3
total 60 1408 18694.6
Table 5: AIST++ Dataset. We reconstruct the subset of the AIST Dance Video Database that contains single-person dance sequences.

7.2 User Study Details

Figure 7: AIST++ 3D Keypoints Re-projection Error Distribution. We show the average pixel distance between the re-projected 2D keypoints and the detected 2D keypoints [59] for each video at 1920x1080 resolution.

Comparison User Study

As mentioned in Sec. 5.2.5 in the main paper, we qualitatively compare our generated results with several baselines in a user study; here we describe its details. Figure 8 shows the interface that we developed for this study. We visualize the dance motion with stick figures and conduct side-by-side comparisons between our generated results and the baseline methods. The left-right order is randomly shuffled for each video so that participants cannot tell which result is ours. Each video is accompanied with its music. We ask each participant “which person is dancing more to the music? LEFT or RIGHT”, and collect the answers through a Google Form. At the end of the study, an exit survey asks about the participants’ dance experience with two questions: “How many years have you been dancing?” and “How often do you watch dance videos?”. Figure 9 shows that our participants range from professional dancers to people who rarely dance, with the majority having at least one year of dance experience.

Realism User Study

Here we further provide a second user study that focuses on the realism of the generated dance motion. Our ultimate goal is to generate dance motions that humans perceive as realistic. In this study, we ask each participant to watch one dance video with music (also rendered as stick figures) at a time and answer the question “Is it a real dance or a machine synthesized one?”. The 3D dance motions are randomly selected from a pool containing both real dances from AIST++ and dances generated by our method, so participants sometimes see real dances and sometimes generated ones. Each participant is asked to watch a set of videos selected randomly from the pool. We use a similar interface to our first user study, as shown in Figure 8, except with a different Google Form and one video at a time. The participants have various years of dance experience and computer vision experience (see Figure 10). The feedback shows that on average of our generated dance motion was selected as REAL. Interestingly, only of the real dance motion sequences were selected as REAL. This shows that participants hold quite a high bar for identifying a video as real dance.

7.3 Motion Genre Analysis

Genre Similarity in Real Dance AIST++ contains ten different genres of dance motion. To study how different the dance motions are among genres, we calculate the motion similarity matrix between genres of dance motion in AIST++ (see Figure 11). The similarity is based on the Frechet Distance in the joint space (see Sec. 5.2.3 in the main paper) between two sets of motion sequences. Specifically, we define the similarity between two sets of motion A and B as

S(A, B) = exp(−FD(A, B) / σ)    (4)

where FD is the Frechet Distance and σ is a dataset-dependent scale factor chosen for AIST++. As shown in Figure 11, motions within the same genre have high similarity while motions from different genres are less similar to each other. In particular, the ballet jazz dance motion is quite different from every other kind of dance in AIST++. This poses a unique challenge for cross-modal learning between music and motion.
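A sketch of turning pairwise Frechet distances between genre motion sets into a similarity matrix. The exponential kernel and the scale value used here are illustrative assumptions:

```python
import numpy as np

def genre_similarity(fd_matrix, sigma=200.0):
    """Map a matrix of pairwise Frechet distances between genre motion
    sets to similarities in (0, 1]. Identical sets (distance 0) get
    similarity 1; large distances decay toward 0."""
    return np.exp(-np.asarray(fd_matrix, dtype=float) / sigma)
```

Applied to the 10x10 matrix of genre-to-genre Frechet distances, the diagonal stays at 1 and an outlier genre such as ballet jazz shows up as a row of uniformly low similarities.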

Figure 8: User Study Interface. We ask each participant to watch 10 videos and answer the question “which person is dancing more to the music? LEFT or RIGHT”.

Genre Consistency of Generated Dance A good model should not only generate long-range, non-freezing dance motion that follows the beats of the music, but also follow the genre of the music. We therefore analyze the genre consistency between the generated dance motion and the conditioning music. As motion and music live in two different domains, there is no direct way to compute a genre similarity between them. Fortunately, AIST++ contains paired motion-music data, so we can safely assume that the dance motion in AIST++ also represents the genre of its paired music well. Under this assumption, genre consistency is defined as the similarity between the generated dance motion and the real AIST++ dance motion of the same genre, again computed with Equation 4 between two sets of motion sequences. Figure 12 shows the genre consistency of the generated results from our method as well as the two baselines Dancenet [92] and Li \etal [50]. Clearly, our method generates motions that correlate better with the genre of the music, which further demonstrates that our model learns the audio-motion correspondence better.

Figure 9: Participant Demography of the Comparison User Study.
Figure 10: Participant Demography of the Realism User Study.
Figure 11: Motion Similarity in AIST++. We analyze the similarity between different genres of dance motion in AIST++, which reveals distribution differences among genres. For example, ballet jazz is quite different from every other kind of dance.
Figure 12: Genre Consistency between Generated Dance and Music. We analyze the genre consistency between the generated dance motion and the input music for all three methods on all ten genres. The higher, the better.


  1. equal contribution. Work performed while Ruilong was an intern at Google.
  2. https://www.youtube.com/watch?v=VrVsAcgFK_4


  1. H. Ahn, J. Kim, K. Kim and S. Oh (2020) Generative autoregressive networks for 3d dancing move synthesis from music. IEEE Robotics and Automation Letters 5 (2), pp. 3500–3507. Cited by: §2.
  2. E. Aksan, P. Cao, M. Kaufmann and O. Hilliges (2020) Attention, please: a spatio-temporal transformer for 3d human motion prediction. arXiv preprint arXiv:2004.08692. Cited by: §1, §1, §2, §2, §4.1, §5.2.4.
  3. E. Aksan, M. Kaufmann and O. Hilliges (2019) Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7144–7153. Cited by: §1, §2, §5.2.4.
  4. O. Alemi, J. Françoise and P. Pasquier (2017) GrooveNet: real-time music-driven dance movement generation using artificial neural networks. networks 8 (17), pp. 26. Cited by: §2, §2, Table 1.
  5. O. Arikan and D. A. Forsyth (2002) Interactive motion generation from examples. ACM Transactions on Graphics (TOG) 21 (3), pp. 483–490. Cited by: §2.
  6. R. Bowden (2000) Learning statistical models of human motion. In IEEE Workshop on Human Modeling, Analysis and Synthesis, CVPR, Vol. 2000. Cited by: §2.
  7. M. Brand and A. Hertzmann (2000) Style machines. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 183–192. Cited by: §2.
  8. J. Bütepage, M. J. Black, D. Kragic and H. Kjellström (2017) Deep representation learning for human motion prediction and classification. In CVPR, pp. 2017. Cited by: §2.
  9. G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg and T. L. Berg (2011) Baby talk: understanding and generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.
  10. Z. Cao, G. Hidalgo, T. Simon, S. Wei and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008, Cited by: §2.
  11. H. Chiu, E. Adeli, B. Wang, D. Huang and J. C. Niebles (2019) Action-agnostic human pose forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1423–1432. Cited by: §2.
  12. A. Conneau and G. Lample (2019) Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pp. 7059–7069. Cited by: §4.1.
  13. Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §4.1.
  14. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4.1, §4.1.
  15. X. Du, R. Vasudevan and M. Johnson-Roberson (2019) Bio-lstm: a biomechanically inspired recurrent neural network for 3-d pedestrian pose and gait prediction. IEEE Robotics and Automation Letters 4 (2), pp. 1501–1508. Cited by: §2.
  16. X. Duan, W. Huang, C. Gan, J. Wang, W. Zhu and J. Huang (2018) Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems, pp. 3059–3069. Cited by: §2.
  17. R. Fan, S. Xu and W. Geng (2011) Example-based automatic music-driven conventional dance motion synthesis. IEEE transactions on visualization and computer graphics 18 (3), pp. 501–515. Cited by: §2.
  18. Y. Ferstl and R. McDonnell (2018) Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the 18th International Conference on Intelligent Virtual Agents, pp. 93–98. Cited by: §2.
  19. Y. Ferstl, M. Neff and R. McDonnell (2020) Adversarial gesture generation with realistic gesture phasing. Computers & Graphics. Cited by: §2.
  20. K. Fragkiadaki, S. Levine, P. Felsen and J. Malik (2015) Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4346–4354. Cited by: §2.
  21. A. Galata, N. Johnson and D. Hogg (2001) Learning variable-length markov models of behavior. Computer Vision and Image Understanding 81 (3), pp. 398–413. Cited by: §2.
  22. C. Gan, D. Huang, P. Chen, J. B. Tenenbaum and A. Torralba (2020) Foley music: learning to generate music from videos. arXiv preprint arXiv:2007.10984. Cited by: §2.
  23. P. Ghosh, J. Song, E. Aksan and O. Hilliges (2017) Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pp. 458–466. Cited by: §2.
  24. S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens and J. Malik (2019) Learning individual styles of conversational gesture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3497–3506. Cited by: §2.
  25. T. Giorgino (2009) Computing and visualizing dynamic time warping alignments in r: the dtw package. Journal of statistical Software 31 (7), pp. 1–24. Cited by: §5.2.2.
  26. R. Heck and M. Gleicher (2007) Parametric motion graphs. In Proceedings of the 2007 symposium on Interactive 3D graphics and games, pp. 129–136. Cited by: §2.
  27. A. Hernandez, J. Gall and F. Moreno-Noguer (2019) Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7134–7143. Cited by: §2.
  28. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §5.2.2.
  29. D. Holden, T. Komura and J. Saito (2017) Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–13. Cited by: §2.
  30. D. Holden, J. Saito, T. Komura and T. Joyce (2015) Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs, pp. 1–4. Cited by: §2.
  31. D. Holden, J. Saito and T. Komura (2016) A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35 (4), pp. 1–11. Cited by: §2.
  32. R. Huang, H. Hu, W. Wu, K. Sawada and M. Zhang (2020) Dance revolution: long sequence dance generation with music via curriculum learning. arXiv preprint arXiv:2006.06119. Cited by: §5.2.2.
  33. Y. Huang, M. Kaufmann, E. Aksan, M. J. Black, O. Hilliges and G. Pons-Moll (2018-11) Deep inertial poser learning to reconstruct human pose from sparseinertial measurements in real time. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) 37 (6), pp. 185:1–185:15. Cited by: §2.
  34. V. Iashin and E. Rahtu (2020) Multi-modal dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 958–959. Cited by: §2.
  35. C. Ionescu, D. Papava, V. Olaru and C. Sminchisescu (2014-07) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339. Cited by: §2, Table 1, §5.1.
  36. A. Jain, A. R. Zamir, S. Savarese and A. Saxena (2016) Structural-rnn: deep learning on spatio-temporal graphs. In Proceedings of the ieee conference on computer vision and pattern recognition, pp. 5308–5317. Cited by: §2.
  37. Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C. Chiu, N. Ari, S. Laurenzo and Y. Wu (2019) Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7180–7184. Cited by: §2.
  38. H. Kao and L. Su (2020) Temporally guided music-to-body-movement generation. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 147–155. Cited by: §2.
  39. S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto and X. Wang (2019) A comparative study on transformer vs rnn in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456. Cited by: §2.
  40. A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §2.
  41. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.1.
  42. M. Kocabas, N. Athanasiou and M. J. Black (2019) VIBE: video inference for human body pose and shape estimation. External Links: 1912.05656 Cited by: §5.1.
  43. L. Kovar, M. Gleicher and F. Pighin (2008) Motion graphs. In ACM SIGGRAPH 2008 classes, pp. 1–10. Cited by: §2.
  44. R. Krishna, K. Hata, F. Ren, L. Fei-Fei and J. Carlos Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pp. 706–715. Cited by: §2.
  45. K. LaMothe (2019-06) The dancing species: how moving together in time helps make us human. Aeon. External Links: Link Cited by: §1.
  46. H. Lee, X. Yang, M. Liu, T. Wang, Y. Lu, M. Yang and J. Kautz (2019) Dancing to music. External Links: 1911.02001 Cited by: §2, §5.2.2.
  47. J. Lee and S. Y. Shin (1999) A hierarchical approach to interactive motion editing for human-like figures. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 39–48. Cited by: §2.
  48. J. Lee, S. Kim and K. Lee (2018) Listen to dance: music-driven choreography generation using autoregressive encoder-decoder network. arXiv preprint arXiv:1811.00818. Cited by: §2.
  49. G. Li, L. Zhu, P. Liu and Y. Yang (2019) Entangled transformer for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8928–8937. Cited by: §2.
  50. J. Li, Y. Yin, H. Chu, Y. Zhou, T. Wang, S. Fidler and H. Li (2020) Learning to generate diverse dance motions with transformer. arXiv preprint arXiv:2008.08171. Cited by: §2, Figure 3, §4.1, §4.1, §5.2.1, §5.2.3, §5.2.3, §5.2.5, Table 2, §7.3.
  51. Z. Li, Y. Zhou, S. Xiao, C. He, Z. Huang and H. Li (2018) Auto-conditioned recurrent networks for extended complex human motion synthesis. ICLR. Cited by: §2.
  52. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll and M. J. Black (2015) SMPL: a skinned multi-person linear model. SIGGRAPH Asia. Cited by: 3rd item.
  53. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM transactions on graphics (TOG) 34 (6), pp. 248. Cited by: §3.1.
  54. J. Lu, J. Yang, D. Batra and D. Parikh (2018) Neural baby talk. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7219–7228. Cited by: §2.
  55. N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll and M. J. Black (2019) AMASS: archive of motion capture as surface shapes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5442–5451. Cited by: §2, Table 1, §3.
  56. B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg and O. Nieto (2015) Librosa: audio and music signal analysis in python. In Proceedings of the 14th python in science conference, Vol. 8. Cited by: §5.2.1.
  57. Mixamo. Note: \urlhttps://www.mixamo.com/ Cited by: §2.
  58. D. Moelants (2003) Dance music, movement and tempo preferences. In Proceedings of the 5th Triennial ESCOM Conference, pp. 649–652. Cited by: §3.
  59. G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler and K. Murphy (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911. Cited by: §3.1, Figure 7.
  60. K. Pullen and C. Bregler (2000) Animating by multi-level sampling. In Proceedings Computer Animation 2000, pp. 36–42. Cited by: §2.
  61. H. Qiu, C. Wang, J. Wang, N. Wang and W. Zeng (2019) Cross view fusion for 3d human pose estimation. In International Conference on Computer Vision (ICCV), Cited by: §5.1.
  62. A. Radford, K. Narasimhan, T. Salimans and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1, Figure 3, §4.1, §4.1, §4.1, §5.2.4.
  63. X. Ren, H. Li, Z. Huang and Q. Chen (2019) Music-oriented dance video synthesis with pose perceptual loss. arXiv preprint arXiv:1912.06606. Cited by: §2.
  64. Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao and T. Liu (2019) Fastspeech: fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3171–3180. Cited by: §2.
  65. M. R. Ronchi and P. Perona (2017-10) Benchmarking and error diagnosis in multi-instance pose estimation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: 2nd item.
  66. T. Shiratori, A. Nakazawa and K. Ikeuchi (2006) Dancing-to-music character animation. In Computer Graphics Forum, Vol. 25, pp. 449–458. Cited by: §2.
  67. E. Shlizerman, L. Dery, H. Schoen and I. Kemelmacher-Shlizerman (2018) Audio to body dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7574–7583. Cited by: §2.
  68. S. Starke, H. Zhang, T. Komura and J. Saito (2019) Neural state machine for character-scene interactions.. ACM Trans. Graph. 38 (6), pp. 209–1. Cited by: §2.
  69. S. Starke, Y. Zhao, T. Komura and K. Zaman (2020) Local motion phases for learning multi-contact character movements. ACM Transactions on Graphics (TOG) 39 (4), pp. 54–1. Cited by: §2.
  70. Statista (2020) Note: \urlhttps://www.statista.com/statistics/249396/top-youtube-videos-views/Accessed: 2020-11-09 Cited by: §1.
  71. C. Sun, F. Baradel, K. Murphy and C. Schmid (2019) Learning video representations using contrastive bidirectional transformer. arXiv preprint arXiv:1906.05743. Cited by: §1, §2.
  72. C. Sun, A. Myers, C. Vondrick, K. Murphy and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7464–7473. Cited by: §2.
  73. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §5.2.2.
  74. T. Tang, J. Jia and H. Mao (2018) Dance with melody: an lstm-autoencoder approach to music-oriented dance synthesis. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1598–1606. Cited by: §2, §2, Table 1.
  75. G. W. Taylor and G. E. Hinton (2009) Factored conditional restricted boltzmann machines for modeling motion style. In Proceedings of the 26th annual international conference on machine learning, pp. 1025–1032. Cited by: §2.
  76. P. Tendulkar, A. Das, A. Kembhavi and D. Parikh (2020) Feel the music: automatically generating a dance for an input song. arXiv preprint arXiv:2006.11905. Cited by: §2.
  77. B. Triggs, P. F. McLauchlan, R. I. Hartley and A. W. Fitzgibbon (1999) Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pp. 298–372. Cited by: §3.1.
  78. S. Tsuchida, S. Fukayama, M. Hamasaki and M. Goto (2019-11) AIST dance video database: multi-genre, multi-dancer, and multi-camera database for dance information processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, Netherlands, pp. 501–510. Cited by: §1, §3, §7.1.
  79. J. Valin and J. Skoglund (2019) LPCNet: improving neural speech synthesis through linear prediction. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5891–5895. Cited by: §2.
  80. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §4.1.
  81. S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell and K. Saenko (2015) Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pp. 4534–4542. Cited by: §2.
  82. S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney and K. Saenko (2014) Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729. Cited by: §2.
  83. R. Villegas, J. Yang, D. Ceylan and H. Lee (2018) Neural kinematic networks for unsupervised motion retargetting. In CVPR, Cited by: §2.
  84. B. Wang, E. Adeli, H. Chiu, D. Huang and J. C. Niebles (2019) Imitation learning for human pose prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7124–7133. Cited by: §2.
  85. H. Xu, R. Zeng, Q. Wu, M. Tan and C. Gan (2020) Cross-modal relation-aware networks for audio-visual event localization. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 3893–3901. Cited by: §2.
  86. N. Yalta, S. Watanabe, K. Nakadai and T. Ogata (2019) Weakly-supervised deep recurrent neural networks for basic dance step generation. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
  87. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle and A. Courville (2015) Describing videos by exploiting temporal structure. In Proceedings of the IEEE international conference on computer vision, pp. 4507–4515. Cited by: §2.
  88. Z. Ye, H. Wu, J. Jia, Y. Bu, W. Chen, F. Meng and Y. Wang (2020) ChoreoNet: towards music to dance synthesis with choreographic action unit. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 744–752. Cited by: §2.
  89. H. Yu, J. Wang, Z. Huang, Y. Yang and W. Xu (2016) Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4584–4593. Cited by: §2.
  90. H. Zhang, S. Starke, T. Komura and J. Saito (2018) Mode-adaptive neural networks for quadruped motion control. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–11. Cited by: §2.
  91. L. Zhou, Y. Zhou, J. J. Corso, R. Socher and C. Xiong (2018) End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8739–8748. Cited by: §2.
  92. W. Zhuang, C. Wang, S. Xia, J. Chai and Y. Wang (2020) Music2Dance: music-driven dance generation using wavenet. arXiv preprint arXiv:2002.03761. Cited by: §2, Table 1, §5.2.1, §5.2.3, §5.2.3, §5.2.5, Table 2, §7.3.
  93. W. Zhuang, Y. Wang, J. Robinson, C. Wang, M. Shao, Y. Fu and S. Xia (2020) Towards 3d dance motion synthesis and control. arXiv preprint arXiv:2006.05743. Cited by: §2.