To React or not to React: End-to-End Visual Pose Forecasting for Personalized Avatar during Dyadic Conversations

Chaitanya Ahuja (cahuja@andrew.cmu.edu), Carnegie Mellon University
Shugao Ma (shugao@fb.com), Facebook Reality Labs, Pittsburgh
Louis-Philippe Morency (morency@cs.cmu.edu), Carnegie Mellon University
Yaser Sheikh (yasers@fb.com), Facebook Reality Labs, Pittsburgh
Abstract.

Non-verbal behaviours such as gestures, facial expressions, body posture, and para-linguistic cues have been shown to complement or clarify verbal messages. Hence, to improve telepresence in the form of an avatar, it is important to model these behaviours, especially in dyadic interactions. Creating such personalized avatars not only requires modeling the intrapersonal dynamics between an avatar's speech and body pose, but also the interpersonal dynamics with the interlocutor present in the conversation. In this paper, we introduce a neural architecture named Dyadic Residual-Attention Model (DRAM), which integrates intrapersonal (monadic) and interpersonal (dyadic) dynamics using selective attention to generate sequences of body pose conditioned on the audio and body pose of the interlocutor and the audio of the human operating the avatar. We evaluate our proposed model on dyadic conversational data consisting of pose and audio of both participants, confirming the importance of adaptive attention between monadic and dyadic dynamics when predicting avatar pose. We also conduct a user study to analyze the judgments of human observers. Our results confirm that the generated body pose is more natural, and models intrapersonal and interpersonal dynamics better than non-adaptive monadic/dyadic models.

Figure 1. Overview of the visual pose forecasting task, which takes the avatar's audio and predicted pose history along with the human's audio and pose to forecast the avatar's future pose and generate a natural-looking avatar animation. The model dynamically decides whether to focus on monadic or dyadic dynamics to make the prediction.

dyadic interactions, pose forecasting, multimodal fusion
copyright: acmcopyright; journal year: 2019
conference: 2019 International Conference on Multimodal Interaction; October 14–18, 2019; Suzhou, China
booktitle: 2019 International Conference on Multimodal Interaction (ICMI '19), October 14–18, 2019, Suzhou, China
price: 15.00; doi: 10.1145/3340555.3353725; isbn: 978-1-4503-6860-5/19/10
ccs: Computing methodologies: Procedural animation; Motion processing; Neural networks
ccs: Human-centered computing: Virtual reality; Auditory feedback

1. Introduction

Telepresence has the potential to evolve the way people communicate. With the application of immersion theory, stereoscopic vision and spatial audio, a 3D virtual space has characteristics inspired by the real world (Singh and Singh, 2017). Communicating in a virtual world poses some interesting challenges. For a person sitting thousands of miles away, where only speech signals are available, an avatar will need to reproduce not only his/her facial expressions (Lombardi et al., 2018), but also realistic non-verbal body cues.

Non-verbal behaviours such as hand gestures, head nods, body posture and para-linguistic cues play a crucial role in human communication (Wagner et al., 2014). These can range from simple actions like pointing at objects or nodding in agreement to body pose mirroring. Consider a person giving a monologue. Barring the minimal reaction of the audience (e.g. laughing at a joke he/she made), the speaker relies on his/her audio, hand gestures, head motions and body posture to convey a message to the audience. These behaviours can be combined under the umbrella term of intrapersonal behaviours. Realism in intrapersonal behaviours is crucial to communication in the virtual world (Bailenson et al., 2006). People can display different kinds of gesture patterns, hence there is a need to drive the body pose of personalized avatars using audio as input.

Figure 2. Overview of the proposed Dyadic Residual-Attention Model (DRAM), designed for the end-to-end visual pose forecasting task. The avatar's monadic pose forecast, along with the human's audio and pose history, drives a dyadic pose forecast conditioned on dyadic (or interpersonal) dynamics. The avatar's monadic and dyadic pose predictions are inputs to DRAM, which first calculates the dyadic residual attention vector, followed by an attention layer over the monadic and dyadic pose predictions to make the final forecast.

During a dyadic interaction, the behaviours of a person are influenced by the behaviour of the interlocutor (Steed and Schroeder, 2015). In other words, forecasting an avatar's pose should take interpersonal dynamics into consideration. This brings up an interesting challenge: how to integrate back-channel feedback (Ward and Tsukahara, 2000) and other interpersonal dynamics while animating the avatar's behaviour. Examples of such behaviour can be seen in situations where people mimic head nods in agreement (Cassell and Thorisson, 1999) or mirror a posture shift at the end of a conversation turn. Modeling such interpersonal behaviour can aid in building a more realistic avatar.

Speaker and listener roles in a dyadic conversation can change multiple times during its course. A speaker's behaviour is affected by a combination of their own non-verbal signatures and interpersonal feedback from the listener. Similarly, a listener's behaviour is affected by a combination of their own non-verbal signatures and, mostly, the act of providing feedback to the speaker in the form of head nods, pose changes and short utterances (like 'yes', 'ya', 'ah' and so on). Hence, to produce avatars capable of dyadic interactions with a human interlocutor, pose forecasting models need to anthropomorphise the character based on two facets of a conversation: interpersonal and intrapersonal dynamics.

In this paper, we learn to predict the non-verbal behaviours (i.e. body pose) of an avatar¹ conditioned on the para-linguistic cues extracted from input audio and the behaviours of the interlocutor, as described in Figure 1. Central to our approach is a dynamic attention module that can toggle between monadic-focused (e.g. speaking with limited input from the listener) and dyadic-focused (e.g. interacting with the interlocutor) modes, where interpersonal dynamics are also integrated. Our Dyadic Residual-Attention Model (DRAM) allows us to dynamically integrate intrapersonal (a.k.a. monadic) and interpersonal (a.k.a. dyadic) dynamics by attending to the interlocutor as and when needed. We present two variants of our model based on recurrent neural networks and temporal convolutional networks. These models are trained on a dataset consisting of conversations between two people. We study the avatar pose forecasting of one participant generated by these models on three challenges: (1) Naturalness, (2) Intrapersonal Dynamics, and (3) Interpersonal Dynamics, by analyzing the effects of missing audio or pose information. Finally, we conduct a user study to get an overall human evaluation of the generated avatar pose sequences.

¹Project webpage: http://chahuja.com/trontr/

2. Related Work

Figure 3. An example demonstrating a hybrid of Interpersonal and Intrapersonal dynamics in predictions made by DRAM. For the avatar, the black skeleton is the current pose and the red skeleton is the pose from one second in the past. Similarly, for the interlocutor, the black skeleton is the current pose and the purple skeleton is the pose from one second in the past. For the first 3 seconds, the avatar focuses on the words 'Next Month' and DRAM forecasts hand raises denoting emphasis; the mean of the dyadic residual attention vector stays mostly low. As soon as the interlocutor chimes in with the exclamation 'Oh!', the mean residual attention rises, implying more focus on the interlocutor. DRAM forecasts head nods denoting agreement with the interlocutor. Beat motions are also predicted by the model, probably due to emphasis on the words 'Um Hum'.

Pose forecasting has been previously studied with approaches ranging from goal conditioned forecasting (Peng et al., 2018; Agrawal and van de Panne, 2016), image (Chao et al., ) and video (Fragkiadaki et al., 2015) conditioned forecasting to pose synthesis using high-level control parameters (Pavllo et al., 2018; Habibie et al., 2017). These vision-only approaches do not make use of audio signals from the speech.

It has been shown that fusing audio and visual information can give more robust predictions hence leading to improved performance (Baltrušaitis et al., 2018; Jaimes and Sebe, 2007) especially for emotion modeling (Wöllmer et al., 2013; Zadeh et al., 2018). Emotions are correlated to body motions (Schindler et al., 2008) implying that audio is also correlated to body pose. Earlier work directly studied rhythm relationships of audio with body pose (Dittmann, 1972), correlation of head motion and speech disfluencies (Hadar et al., 1984) and influence of audio on gestures in a dyadic setting (Wagner et al., 2014).

In the context of audio-conditioned generation of facial expression and head pose, previous work includes creating voice-driven puppets (Brand, 1999); more recently, deep learning approaches have improved the quality of lip-movement generation (Suwajanakorn et al., 2017), facial expression generation (Taylor et al., 2017; Karras et al., 2017; Ezzat et al., 2002) and facial expression generation in a conversation setting (Chu et al., 2018). A related topic is generating speech by measuring vibrations in a video (Davis et al., 2014). Follow-up works include separating input audio signals into a set of components that correspond to different objects in a given video (Gao et al., 2018), and separating the audio corresponding to each pixel (Zhao et al., 2018).

Cassell et al. (2004) created the Behavior Expression Animation Toolkit (BEAT), which takes text as input to generate synthesized speech along with gestures and other non-verbal behaviors such as gaze and facial expression. The assignment is based on linguistic and contextual analysis of the input text, relying on rules predefined from evidence in previous research on human conversational behavior. Scherer et al. (2012) propose a markup language for generalizing perceptual features and show its effectiveness by integrating it into an automated virtual agent. Non-verbal behaviours generated by these approaches are drawn from a fixed set of body gestures (Chiu et al., 2015; Chiu and Marsella, 2011; Lee and Marsella, 2006), posing generation as a classification problem. A fixed set of gestures cannot generalize to new behaviours, which is a drawback of this approach.

Parameterizing avatars with joint angles instead can alleviate this shortcoming. Extending this idea to audio-conditioned pose forecasting, Takeuchi et al. (2017) use linguistic features extracted from audio to predict future body poses using a bi-directional LSTM. As this method uses audio information from the future, it cannot be used for pose forecasting in real time. In comparison, our models are auto-regressive in nature, using only information from the past. We note that we target scenarios where manual text transcription may not be available, so our focus stays on the non-linguistic components of the audio signals.

To our knowledge, our proposed work is the first to integrate both intrapersonal and interpersonal dynamics for body pose forecasting. Our aim is to generate natural-looking sequences of body poses which correlate with the audio signals driving the avatar as well as with the paralinguistic cues and behaviour of the interlocutor.

3. Problem Statement

Consider a conversation between two human participants, one of whom is interacting remotely (henceforth referred to as the avatar) and for whom only audio is available. For the local participant (henceforth referred to as the human or interlocutor), we have both pose and audio information. The goal of the forecasting task is to model the future body pose of the avatar. Formally, given a sequence of local human audio features $X^h_t$, human pose $P^h_t$, the avatar's audio features $X^a_t$ and the history of the avatar's pose $\hat{P}^a_t$, we want to predict the avatar's next pose $\hat{P}^a_{t+1}$. Let $X^h_t$ and $X^a_t$ be vectors of dimension $d_x$, and $P^h_t$ and $\hat{P}^a_t$ be vectors of dimension $d_p$, for all $t$. $T$ is the size of the feature history used by the model. Hence, $X^h_{t-T:t}$, $X^a_{t-T:t}$, $P^h_{t-T:t}$ and $\hat{P}^a_{t-T:t}$ are matrices.

This is equivalent to modeling the probability distribution $\Pr(\hat{P}^a_{t+1} \mid X^a_{t-T:t}, \hat{P}^a_{t-T:t}, X^h_{t-T:t}, P^h_{t-T:t})$. Concatenating all input features into $Z_{t-T:t} = [X^a_{t-T:t}; \hat{P}^a_{t-T:t}; X^h_{t-T:t}; P^h_{t-T:t}]$, we define a joint model $\mathcal{F}$ which predicts the future pose

(1) $\hat{P}^a_{t+1} = \mathcal{F}(Z_{t-T:t}; \theta)$

where $\theta$ are the trainable parameters of the function $\mathcal{F}$. As the history of the avatar's previously predicted pose sequence is used to predict the future pose, $\mathcal{F}$ is an autoregressive model.
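The autoregressive rollout described above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `dummy_model`, the window size `T` and the feature dimensions are placeholder assumptions.

```python
import numpy as np

T = 32        # feature-history window (placeholder)
D_POSE = 48   # pose dimensionality, e.g. 12 joints x 4 quaternion values

def dummy_model(window):
    # Stand-in for the learned forecaster: maps a stacked feature
    # history (T, D_total) to the next pose vector (D_POSE,).
    return window.mean(axis=0)[:D_POSE]

def rollout(avatar_audio, human_audio, human_pose, seed_pose, steps):
    """Autoregressive forecasting: every predicted pose is appended to
    the avatar's pose history and fed back in at the next step."""
    pose_hist = [p for p in seed_pose]
    preds = []
    for t in range(T, T + steps):
        window = np.concatenate([avatar_audio[t - T:t],
                                 human_audio[t - T:t],
                                 human_pose[t - T:t],
                                 np.stack(pose_hist[-T:])], axis=1)
        next_pose = dummy_model(window)
        preds.append(next_pose)
        pose_hist.append(next_pose)      # feedback loop: autoregression
    return np.stack(preds)

rng = np.random.default_rng(0)
a_audio = rng.normal(size=(64, 23))      # hypothetical audio features
h_audio = rng.normal(size=(64, 23))
h_pose = rng.normal(size=(64, D_POSE))
seed = rng.normal(size=(T, D_POSE))      # initial ground-truth pose history
out = rollout(a_audio, h_audio, h_pose, seed, steps=5)
print(out.shape)   # (5, 48)
```

The key property is that only past frames (indices below `t`) ever enter the window, which is what makes the model usable in real time.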

In this section we discuss the challenges of jointly modeling interpersonal and intrapersonal dynamics. In the next section, we propose our Dyadic Residual-Attention Model (DRAM) to tackle these challenges while maintaining the generalizability of the model.

3.1. Challenges of Jointly Modeling Interpersonal and Intrapersonal Dynamics

Interpersonal dynamics are important in providing a realistic social experience in a virtual environment. Ignoring verbal or non-verbal cues from the interlocutor may result in generating avatar body poses which are not synchronous with the interlocutor (Jones and LeBaron, 2002). Given enough dyadic conversation data, the function $\mathcal{F}$ in Equation 1 has the capacity to jointly model interpersonal and intrapersonal dynamics.

However, an imbalance between interpersonal and intrapersonal dynamics in dyadic conversations is common, with generally more instances of intrapersonal dynamics (e.g. between the speech and gestures of the same person). This naturally occurring imbalance ends up treating the avatar's own signals ($X^a$ and $\hat{P}^a$) from intrapersonal dynamics as a stronger prior than the signals from the interlocutor ($X^h$ and $P^h$) while solving the optimization problem,

(2) $\theta^{*} = \arg\min_{\theta} \sum_{t} \left\| P^a_{t+1} - \mathcal{F}(Z_{t-T:t}; \theta) \right\|_2^2$

where $\|\cdot\|_2$ is the L2 norm.

Hence, interpersonal dynamics could end up being ignored, leaving the pose generation largely monadic and intrapersonal.

4. Dyadic Residual-Attention Model

To combat this imbalance in dyadic conversations, our proposed approach, shown in Figure 2, decomposes pose generation of the avatar into a monadic function ($\mathcal{F}_{mo}$) and a dyadic function ($\mathcal{F}_{dy}$), which model intrapersonal and interpersonal dynamics respectively.

We propose the Dyadic Residual-Attention Model (DRAM) with a time-continuous vector $\Delta_t$. $\Delta_t$ allows for smooth transitions between the monadic and dyadic models; it is a vector with the same dimensions as the pose of the avatar/interlocutor. We use $\Delta_t$ like an attention vector which attends to different joints at different points in time. Hence, our proposed model can be written as

(3) $\hat{P}^a_{t+1} = \Delta_t \odot \hat{P}^{dy}_{t+1} + (1 - \Delta_t) \odot \hat{P}^{mo}_{t+1}$

where $\Delta_t \in [0, 1]^{d_p}$ is the Dyadic Residual Vector, $\odot$ denotes the element-wise product, and $\hat{P}^{mo}_{t+1}$ and $\hat{P}^{dy}_{t+1}$ are the monadic and dyadic pose predictions. $\Delta_t$ enables the model to implicitly learn attention weights for each joint at each time-step.

In this section, we describe the formulation of the Dyadic Residual Vector ($\Delta_t$), which is further used as soft-attention weights over the dyadic ($\mathcal{F}_{dy}$) and monadic ($\mathcal{F}_{mo}$) models, resulting in the formation of DRAM. We end the section by explaining the loss function and the training curriculum for the model.

4.1. Dyadic Residual Vector

The monadic model learns the intrapersonal dynamics (e.g. demonstrating emphasis using head nods or hand gestures) of the avatar. This is equivalent to the avatar giving a monologue without any interlocutor,

(4) $\hat{P}^{mo}_{t+1} = \mathcal{F}_{mo}\left(X^a_{t-T:t}, \hat{P}^a_{t-T:t}; \theta_{mo}\right)$

The dyadic model can be written as,

(5) $\hat{P}^{dy}_{t+1} = \mathcal{F}_{dy}\left(\hat{P}^{mo}_{t+1}, X^h_{t-T:t}, P^h_{t-T:t}; \theta_{dy}\right)$

where $\theta_{mo}$ and $\theta_{dy}$ are trainable parameters.

The dyadic model depends on the monadic prediction and the features of the interlocutor². Hence, it learns to model the interpersonal dynamics (e.g. head nods, pose switches, interruptions) between the avatar and interlocutor.

²For computational efficiency, we use the monadic prediction as an input for the dyadic network. An alternative, equivalent model is one where the raw avatar features (audio and pose history) are used as inputs.

The absolute difference between the monadic and dyadic predictions, or the residual, is representative of the joints that were affected by interpersonal dynamics³. If, at a given time $t$, the residual for some joints is high, the interlocutor is influencing those joints; if the residual is low, the avatar's own audio and pose history dominate the avatar's current pose behaviour.

³It may be possible to use a separate network to model the attention vector. Our proposed Dyadic Residual-Attention Model, in a manner of speaking, shares weights with the existing networks $\mathcal{F}_{mo}$ and $\mathcal{F}_{dy}$ to estimate the attention vector $\Delta_t$. This helps us limit the number of trainable parameters.

The residual for all joints is always non-negative, so to compress all dimensions of the residual vector between 0 and 1, we use a tanh non-linearity to get the dyadic residual vector $\Delta_t$,

(6) $\Delta_t = \tanh\left(\left|\hat{P}^{dy}_{t+1} - \hat{P}^{mo}_{t+1}\right|\right)$

The interpretability of the Dyadic Residual Vector is an added advantage. We know which joints have the maximum influence at every time $t$, and hence can estimate whether the non-verbal behaviours of the avatar are due to interpersonal or intrapersonal dynamics.

Using Equations 3, 4, 5 and 6, the predicted pose of our proposed DRAM can be re-written as:

(7) $\hat{P}^a_{t+1} = \tanh\left(\left|\hat{P}^{dy}_{t+1} - \hat{P}^{mo}_{t+1}\right|\right) \odot \hat{P}^{dy}_{t+1} + \left(1 - \tanh\left(\left|\hat{P}^{dy}_{t+1} - \hat{P}^{mo}_{t+1}\right|\right)\right) \odot \hat{P}^{mo}_{t+1}$
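As a concrete illustration of the residual-attention blend, here is a minimal NumPy sketch. It assumes a tanh squashing of the absolute monadic-dyadic residual, per the reconstruction above; joint values are toy numbers.

```python
import numpy as np

def dram_fuse(p_mo, p_dy):
    """Blend monadic and dyadic pose predictions with per-joint
    residual attention: delta = tanh(|p_dy - p_mo|) lies in [0, 1)."""
    delta = np.tanh(np.abs(p_dy - p_mo))          # dyadic residual vector
    fused = delta * p_dy + (1.0 - delta) * p_mo   # soft-attention blend
    return fused, delta

p_mo = np.array([0.0, 1.0, 2.0])   # monadic prediction (3 toy joints)
p_dy = np.array([0.0, 1.5, 0.0])   # dyadic prediction
fused, delta = dram_fuse(p_mo, p_dy)
# Joint 0: the models agree, so delta = 0 and the monadic output
# passes through unchanged; joints with larger residuals lean
# further towards the dyadic output.
```

Because the attention is computed from the two predictions themselves, no extra attention network (and no extra parameters) is needed.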

4.2. Loss Function

Pose is a continuous variable, hence we use a Mean Squared Error (or L2) loss. Based on the predicted pose $\hat{P}^a_{t+1}$ in Equation 7, the loss function is

(8) $\mathcal{L}(\theta_{mo}, \theta_{dy}) = \sum_{t} \mathcal{L}_t$
(9) $\mathcal{L}_t = \left\| P^a_{t+1} - \hat{P}^a_{t+1} \right\|_2^2$
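One common reading of this L2 objective can be sketched in code; the exact reduction (squared error summed over pose dimensions, averaged over frames) is an assumption here, not a detail confirmed by the text.

```python
import numpy as np

def l2_pose_loss(pred, target):
    # pred, target: (T, D) sequences of predicted / ground-truth pose.
    # Squared L2 per frame, averaged over the sequence.
    return float(np.mean(np.sum((target - pred) ** 2, axis=-1)))

pred = np.zeros((10, 48))
target = np.ones((10, 48))
print(l2_pose_loss(pred, target))   # 48.0
```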

4.3. Design Choices for $\mathcal{F}_{mo}$ and $\mathcal{F}_{dy}$

Average Position Error (APE) in cms

| Dynamics | Model | Net | Avg. | Torso | Head | Neck | RArm | LArm | RWrist | LWrist |
|---|---|---|---|---|---|---|---|---|---|---|
| Interpersonal | Human Audio Only | LSTM | 4.9 | 0.2 | 1.1 | 0.4 | 0.2 | 0.2 | 14.4 | 21.3 |
| | | TCN | 3.3 | 0.2 | 1.2 | 0.5 | 0.2 | 0.3 | 9.5 | 13.8 |
| | Human Monadic Only | LSTM | 3.6 | 0.2 | 1.1 | 0.4 | 0.2 | 0.2 | 13.2 | 12.5 |
| | | TCN | 3.3 | 0.2 | 1.1 | 0.4 | 0.2 | 0.2 | 9.3 | 13.8 |
| Intrapersonal | Avatar Audio Only | LSTM | 3.9 | 0.6 | 1.5 | 1.4 | 1.0 | 0.5 | 10.7 | 15.0 |
| | | TCN | 3.5 | 0.2 | 1.1 | 0.5 | 0.3 | 0.3 | 9.3 | 15.5 |
| | Avatar Monadic Only | LSTM | 3.4 | 0.2 | 1.3 | 0.4 | 0.3 | 0.2 | 9.5 | 14.4 |
| | | TCN | 3.0 | 0.2 | 1.4 | 0.5 | 0.2 | 0.3 | 8.8 | 12.1 |
| Interpersonal and Intrapersonal | Early Fusion | LSTM | 3.0 | 0.2 | 1.1 | 0.5 | 0.3 | 0.2 | 9.3 | 11.7 |
| | | TCN | 3.2 | 0.2 | 1.1 | 0.4 | 0.2 | 0.3 | 10.5 | 12.2 |
| | DRAM w/o Attention | LSTM | 3.1 | 0.2 | 1.8 | 0.4 | 0.2 | 0.2 | 9.7 | 12.1 |
| | | TCN | 3.0 | 0.1 | 1.0 | 0.4 | 0.2 | 0.2 | 8.8 | 12.6 |
| Adaptive Interpersonal and Intrapersonal | DRAM | LSTM | 2.8 | 0.1 | 1.0 | 0.4 | 0.2 | 0.2 | 9.0 | 10.8 |
| | | TCN | 2.8 | 0.2 | 1.1 | 0.4 | 0.2 | 0.2 | 8.8 | 11.1 |
Table 1. The objective metric Average Position Error (APE) for DRAM compared with all baseline models. Lower values are better. The first group of networks, Human Audio Only and Human Monadic Only, models Interpersonal dynamics, while the second group, Avatar Audio Only and Avatar Monadic Only, models Intrapersonal dynamics. The third group, Early Fusion and DRAM w/o Attention, non-adaptively models Interpersonal and Intrapersonal dynamics jointly. The fourth group, DRAM, adaptively chooses between Interpersonal and Intrapersonal dynamics. Two-tailed pairwise t-tests were run between all TCN models and DRAM-TCN, with significance markers denoting p<0.01 and p<0.05.

The monadic ($\mathcal{F}_{mo}$) and dyadic ($\mathcal{F}_{dy}$) prediction functions of our proposed model can work with any autoregressive temporal network that depends only on features from the past. This gives our model the flexibility of incorporating temporal models that may be domain dependent or pre-trained on another dataset.

Recurrent neural architectures have been shown to perform very well in such autoregressive tasks (Graves, 2012; Hochreiter and Schmidhuber, 1997; Ahuja and Morency, 2018), especially for pose forecasting algorithms (Chao et al., ; Pavllo et al., 2018). Recent work demonstrates the utility of bi-directional LSTMs to model speech signals to forecast body pose (Takeuchi et al., 2017). One weakness of this approach is the dependency of the pose prediction on future speech input, hence making it unusable for real-time applications. A uni-directional LSTM model is used as a baseline model (Takeuchi et al., 2017).

Temporal convolutional networks (TCNs) work just as well in many practical applications (Bai et al., 2018). It was shown that adding residual connections and dilation layers (Van Den Oord et al., ) can boost the empirical performance of TCNs equal to, if not better than, LSTM and GRU based models.

In our experiments, both TCNs and LSTMs are used as the temporal models for $\mathcal{F}_{mo}$ and $\mathcal{F}_{dy}$, which demonstrates the versatility of our proposed approach.
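The causality constraint shared by both design choices can be illustrated with a toy dilated convolution. This is a pure-NumPy sketch of a single causal filter, not a full TCN layer; the kernel and dilation values are arbitrary.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation=1):
    """1D causal convolution: the output at time t depends only on
    inputs at times <= t, the property that keeps a TCN usable for
    real-time pose forecasting.

    x: (T,) signal, w: (K,) kernel.
    """
    K = len(w)
    pad = (K - 1) * dilation                 # left-pad only => causal
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[k] * xp[t + pad - k * dilation]
                         for k in range(K))
                     for t in range(len(x))])

x = np.arange(5, dtype=float)            # [0, 1, 2, 3, 4]
y = causal_dilated_conv(x, np.array([1.0, 1.0]), dilation=2)
# y[t] = x[t] + x[t-2] (zeros before the start): [0, 1, 2, 4, 6]
```

Stacking such layers with exponentially growing dilations is what gives a TCN a large receptive field over the feature history without looking ahead.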

5. Experiments

Figure 4. Two examples demonstrating Interpersonal Dynamics in predictions made by DRAM. In the first example, the interlocutor performs a pose switch, which is followed by a predicted pose switch by the avatar. The mean of the dyadic residual attention vector stays mostly low until the interlocutor performs the pose switch. DRAM then estimates the need to focus on the human (i.e. interpersonal dynamics) as the attention value increases, and predicts a pose switch for the avatar. Similarly, in the second example the interlocutor performs a head nod, which is followed by a forecast head nod by the avatar. The mean residual attention rises just after the interlocutor's head nod, implying the need for interpersonal dynamics.

Visual pose forecasting of an avatar during dyadic conversations can be broken down into three core challenges:

  1. Naturalness: How natural is the flow of poses, and how close are they to the ground truth?

  2. Intrapersonal Dynamics: How correlated is the generated pose sequence with the avatar's speech?

  3. Interpersonal Dynamics: Is the generated pose sequence reacting realistically to the interlocutor's behaviour and speech?

Experiments, both subjective and objective, are designed to evaluate these three aspects of pose forecasting.

In the following subsections, we describe the dataset and briefly discuss the pre-processing of the audio and pose features. This is followed by constructing a set of competitive baseline models against which to compare our proposed DRAM model.

5.1. Dataset

Our models are trained and evaluated on a previously recorded dataset of dyadic face-to-face interactions. The dataset contains one person who is the same across all conversations. This person interacts with 11 different participants for around 1 hour each. The participants were given different topics (like sports, school, hobbies etc.) to choose from, and the conversation was guided by these topics. No script was given to either participant. The capture system included 24 OptiTrack Prime 17W cameras surrounding a capture area of approximately 3 m × 3 m and two directional microphone headsets. Twelve of the 24 cameras were placed at a height of 1.6 m. Participants wore marker-attached suits and gloves. The standard marker arrangement provided by OptiTrack for body capture and the glove marker arrangement suggested by Han et al. (Han et al., 2018) were followed.

For each conversation, there are separate channels of audio signals for each person with a sampling rate of 44.1 kHz. Body pose was collected at a frequency of 90 Hz using a Motion-Capture (MoCap) setup, and gives 12 joint positions of the upper body, which can be grouped⁴ into Torso, Head, Neck, RArm (Right Arm), LArm (Left Arm), RWrist (Right Wrist) and LWrist (Left Wrist).

⁴Our modeling is performed for all 12 joints, but we group them in our results to help with interpretability.

5.2. Feature Representation

Body pose has been shown to correlate with affect and emotion. GeMAPS (Eyben et al., 2016) is a minimalist set of low-level audio descriptors, including prosodic, excitation, vocal-tract and spectral descriptors, which improve the accuracy of affect recognition. OpenSmile (Eyben et al., 2013) is used to extract the GeMAPS features, sub-sampled at a rate of 90 Hz to match the body-pose frequency.
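Matching the audio-feature rate to the 90 Hz pose rate can be done by per-dimension linear interpolation. A sketch follows; the 100 Hz source rate and 23-dimensional feature size are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def resample_features(feats, src_hz, dst_hz):
    """Linearly interpolate per-dimension feature tracks to a new rate.

    feats: (T_src, D) array of frame-level descriptors.
    """
    t_src = np.arange(feats.shape[0]) / src_hz
    duration = feats.shape[0] / src_hz
    t_dst = np.arange(0.0, duration, 1.0 / dst_hz)
    return np.stack([np.interp(t_dst, t_src, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)

audio = np.random.randn(100, 23)    # 1 s of hypothetical 100 Hz LLDs
resampled = resample_features(audio, 100, 90)
print(resampled.shape)              # (90, 23)
```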

In this work, translation of the body is not considered, as it is largely absent in the data. Instead, rotation angles are modeled, which form the crux of dyadic interactions in a conversation setting. In our experiments, the 3-dimensional joint coordinates are converted to local rotation vectors and parameterized as quaternions (Pavllo et al., 2018).
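A rotation-vector-to-quaternion conversion of the kind this parameterization relies on can be sketched as follows. This is a single-vector NumPy illustration; a production pipeline would more likely use a library such as SciPy's `Rotation`.

```python
import numpy as np

def rotvec_to_quat(v):
    """Convert a 3D rotation vector (axis * angle, in radians) to a
    unit quaternion (w, x, y, z)."""
    angle = np.linalg.norm(v)
    if angle < 1e-8:                       # near-zero: identity rotation
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = v / angle
    w = np.cos(angle / 2.0)
    xyz = axis * np.sin(angle / 2.0)
    return np.concatenate([[w], xyz])

q = rotvec_to_quat(np.array([np.pi / 2, 0.0, 0.0]))  # 90 deg about x
# q ~ [0.7071, 0.7071, 0, 0], a unit quaternion
```

Quaternions avoid the discontinuities of Euler angles, which matters when a network regresses rotations frame by frame.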

5.3. Baselines

There has been limited work in the domain of gesture generation from audio signals using neural architectures. Existing models take into account only the monadic behaviours of a person, using a bidirectional LSTM (Takeuchi et al., 2017). Bidirectional LSTMs depend on future time-steps, hence they are unusable in real-time applications. An adapted uni-directional version of this network (referred to as Avatar Audio Only - LSTM in Table 1) and TCNs are used as temporal models for the dyadic and monadic models.

To gauge the naturalness of our proposed DRAM models, they are compared with Early Fusion ($\mathcal{F}$ from Equation 1) and DRAM w/o Attention ($\mathcal{F}_{dy}$ from Equation 5).

To demonstrate the presence of Intrapersonal Dynamics in a dyadic conversation, a reasonable baseline is the monadic model ($\mathcal{F}_{mo}$ from Equation 4) with the avatar's audio as input (referred to as Avatar Audio Only) or the avatar's audio and pose history as input (referred to as Avatar Monadic Only). Both of these models forecast the pose of the avatar.

To demonstrate the presence of Interpersonal Dynamics in a dyadic conversation, a reasonable baseline is the monadic model ($\mathcal{F}_{mo}$ from Equation 4) with the human's audio as input (referred to as Human Audio Only) or the human's audio and pose history as input (referred to as Human Monadic Only). Both of these models forecast the pose of the avatar.

5.4. Objective Evaluation Metrics

We evaluate all models on the held-out set with the Average Position Error (APE) metric. Given a particular keypoint $p$, it is defined as

(10) $\mathrm{APE}(p) = \frac{1}{T'} \sum_{t=1}^{T'} \left\| y^p_t - \hat{y}^p_t \right\|_2$

where $y^p_t$ is the true location and $\hat{y}^p_t$ is the predicted location of keypoint $p$ at time $t$, and $T'$ is the length of the held-out sequence.

Another metric, Probability of Correct Keypoints (PCK) (Andriluka et al., 2014; Simon et al., 2017), is also used to evaluate all models. If a predicted keypoint lies inside a sphere (of radius $\sigma$) around the ground truth, the prediction is deemed correct. Given a particular keypoint $p$, PCK is defined as follows,

(11) $\mathrm{PCK}(p) = \frac{1}{T'} \sum_{t=1}^{T'} \mathbb{1}\left( \left\| y^p_t - \hat{y}^p_t \right\|_2 \leq \sigma \right)$

where $\mathbb{1}$ is an indicator function.
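Both metrics reduce to a few lines over per-frame keypoint positions. A NumPy sketch (array shapes are illustrative):

```python
import numpy as np

def ape(true, pred):
    """Average Position Error: mean Euclidean distance between true
    and predicted keypoint positions over time."""
    return float(np.mean(np.linalg.norm(true - pred, axis=-1)))

def pck(true, pred, sigma):
    """Probability of Correct Keypoints: fraction of predictions that
    fall inside a sphere of radius sigma around the ground truth."""
    return float(np.mean(np.linalg.norm(true - pred, axis=-1) <= sigma))

true = np.zeros((100, 3))             # (frames, xyz) for one keypoint
pred = np.full((100, 3), 0.5)         # constant offset of sqrt(0.75)
print(round(ape(true, pred), 2))      # 0.87
print(pck(true, pred, sigma=1.0))     # 1.0
```

APE summarizes the average miss distance, while sweeping `sigma` in PCK (as in Figure 6) reveals how the error is distributed.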

Figure 5. Histograms of subjective scores for Naturalness, Intrapersonal Dynamics, Interpersonal Dynamics and the mean across all criteria. Higher scores are better. Two-tailed pairwise t-tests were run between the Avatar Monadic Only and Early Fusion TCN models and DRAM-TCN, with significance markers denoting p<0.05, p<0.01, and p<0.001.

5.5. User Study: Subjective Evaluation Metrics

Pose generation during dyadic interactions can be a subjective task as reactions of the avatar depend on its own audio, and the human’s audio and pose. A human’s subjective judgement on the quality of prediction is an important metric for this task.

To achieve this, we design a user study over videos generated from a held-out set. During the study, an acclimation phase is performed by showing reference clips (ground-truth poses taken from the training set) to annotators to get them acquainted with the natural motion of the avatar. The main part of the study consists of showing annotators multiple one-minute clips from the test set. Each video contains the predicted avatar pose, the avatar's audio, and the ground-truth audio and poses of the human. Pose is represented in the form of a stick figure (see Figure 7). The avatar's predicted poses come from one of the models in Figure 5 or from the ground truth. Annotators do not know which model was used to animate the avatar. They are instructed to judge the animation based on the following statements:

  1. Naturalness: The motion of the avatar looks natural and matches his/her audio.

  2. Intrapersonal Dynamics: The avatar behaves like himself/herself (recall the reference video).

  3. Interpersonal Dynamics: The avatar reacts realistically to the interlocutor (in terms of the interlocutor's audio and motion).

At the end of each clip, annotators give a score for each statement on a 1-to-7 Likert scale where,

1: Disagree    3: Somewhat Disagree    5: Somewhat Agree    7: Agree

A fourth question is asked to gauge how confident annotators were in scoring each video based on all input modalities. Each video is rated by a minimum of 2 human annotators, and the final score is a weighted average with the confidence ratings as weights.
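The confidence-weighted aggregation can be sketched as below; the exact confidence scale used in the study is not stated, so the weights here are illustrative.

```python
import numpy as np

def weighted_score(scores, confidences):
    """Confidence-weighted average of annotator Likert scores, as used
    to aggregate at least two ratings per video."""
    scores = np.asarray(scores, dtype=float)
    w = np.asarray(confidences, dtype=float)
    return float(np.sum(w * scores) / np.sum(w))

# Two annotators: one confident (weight 7), one unsure (weight 2).
# The aggregate leans towards the confident annotator's score of 5.
print(weighted_score([5, 2], [7, 2]))   # 4.333...
```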

Figure 6. Plots of average Probability of Correct Keypoint (PCK) values over multiple values of PCK threshold () for Early Fusion, DRAM w/o Attention, and DRAM models with TCNs. Higher values are better.

6. Results and Discussion

Figure 7. An example demonstrating Intrapersonal Dynamics in predictions made by DRAM. The black skeleton is the current pose, while the red skeleton is the pose from one second in the past. Paralinguistic cues of emphasis are denoted by a larger font in the spoken sentence. The avatar performs a beat motion for the first five seconds to emphasize the word 'go'. For the next two seconds, the avatar nods its head, also to denote emphasis, while speaking the words 'our bags'. The mean of the dyadic residual attention vector stays mostly low, which implies that the monadic prediction dominates the final avatar body pose prediction.

6.1. Objective Evaluation

Average Position Error (APE): Models with only interpersonal dynamics achieve a best APE of around 3.3, which is slightly worse than the best APE of 3.0 among models with only intrapersonal dynamics (Table 1). Early Fusion and DRAM w/o Attention are models with both interpersonal and intrapersonal dynamics as input, but they are not able to surpass the Avatar Monadic Only model. This is not surprising, as the avatar's speech is highly correlated with its body pose. Models with adaptive interpersonal and intrapersonal dynamics (i.e. DRAM), which achieved an APE of 2.8, are able to exploit the changing dynamics in a conversation, unlike non-adaptive methods such as Early Fusion and DRAM w/o Attention.

Each joint has different characteristics in a conversation setting. Some of them, like the Torso and Neck, do not move much during the course of a conversation. It is clear from the low APE values for these joints in Table 1 that modeling them is easier than modeling frequently moving joints like the wrists. It is also evident from the table that forecasting the wrists has a much higher APE than all other joints, across all models. For the wrists, the Dyadic Residual-Attention Model gives almost a 1.0 reduction in APE compared to the best non-adaptive model.

The Head joint shows some interesting characteristics. It can be fairly hard to predict head motion from just the monadic data of the avatar, as it sometimes mirrors head nods coming from the interlocutor (or human); dyadic information becomes crucial in this scenario. It is interesting to see that head predictions are around 1.0 APE worse for models with only intrapersonal dynamics. The monadic model conditioned only on human features ends up performing reasonably well, probably because it can learn to map the avatar's sporadic head nods to those of the human.

Probability of Correct Keypoints (PCK): The gap between the PCK values of DRAM and the other baselines increases with the PCK threshold (Figure 6), which implies that the variance of erroneous predictions by DRAM is lower than that of the other baselines, making our proposed model more robust.

6.2. Subjective Evaluation

Based on the user study conducted on the generated avatar pose (see Figure 5), humans find that DRAM generates more natural pose sequences, which correlate better with the audio signals (i.e. intrapersonal dynamics) and the human's body pose (i.e. interpersonal dynamics) than the other models. DRAM gets an average score of 4.6, which implies that annotators 'somewhat agree' with all the statements in Section 5.5. In contrast, annotators are neutral towards the pose generated by the Avatar Monadic Only model.

6.3. Qualitative Analysis

Conversations in a dyad contain non-verbal social cues which might go unnoticed by us humans, but play an important role in maintaining the naturalness of the interaction. Head-nod mirroring and torso pose switching are two of the most common cues. We pick out two cases in Figure 4 where such cues exist in the conversation. Our model detects the head nod and the pose switch in the human's pose and is successfully able to react to them.

Another aspect of social cues is hand gestures during conversation. Emphasized utterances usually lead to an almost instantaneous switch in hand position. Our model is able to detect emphasis and moves its hand(s) up and down (Figure 7) in sync with the speech.

Real conversations are a mixture of changing interpersonal and intrapersonal dynamics. Our model is able to detect these changes and react appropriately. In Figure 3, the avatar raises its hands in response to emphasis in its own speech, but when the interlocutor interrupts with an exclamation, the avatar starts nodding in agreement while still performing beat gestures to accompany emphasis in its audio signal.

The mean value of the Dyadic Residual Vector is plotted along the animation to analyze its effects and its correlation with the changing dynamics in the conversation. First, its mean value appears to correlate with changing interpersonal and intrapersonal dynamics: in Figure 3, the mean rises as soon as the interlocutor interrupts the avatar, while in Figure 7 it remains almost constant, as the interlocutor does not play a large role in that part of the sequence. Second, even though its value correlates with changing roles in a conversation, its absolute value is not extreme (i.e., on average it is closer to 0.5 than to 0 or 1). This is not surprising, as the contributions of interpersonal and intrapersonal dynamics often overlap, requiring both the monadic and the dyadic model.
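The adaptive behavior described above can be illustrated with a minimal sketch of residual-attention blending between monadic and dyadic predictions. The gating layer (`w`, `b`), the feature shapes, and the sigmoid gate are assumptions made for illustration, not the paper's exact DRAM architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dram_blend(monadic_pred, dyadic_pred, features, w, b):
    """Blend monadic and dyadic pose predictions with a
    residual-attention vector whose entries lie in [0, 1].

    monadic_pred, dyadic_pred: (D,) candidate pose predictions.
    features: (F,) conditioning features (e.g. audio + pose history).
    w: (D, F) weights and b: (D,) bias of a hypothetical gating layer.
    """
    attn = sigmoid(w @ features + b)  # per-dimension attention in [0, 1]
    pose = attn * dyadic_pred + (1.0 - attn) * monadic_pred
    # The scalar mean of `attn` is the kind of quantity one could plot
    # along the animation, as described above.
    return pose, attn.mean()
```

With a gate of exactly 0.5 the two predictions are averaged, matching the observation that the residual vector tends to sit between the two extremes rather than committing fully to either model.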

7. Conclusions

In this paper we introduce a new model for the task of generating body pose in a conversation setting, conditioned on the avatar's audio signal and the interlocutor's audio and body pose. This person-specific model, the Dyadic Residual-Attention Model, learns to selectively attend to interpersonal and intrapersonal dynamics. The attention mechanism successfully captures social cues such as head nods and pose switches while generating a sequence of poses that appears natural to the human eye. It is a first step towards an avatar for remote communication that is anthropomorphised with non-verbal cues.

Acknowledgements.
This material is based upon work partially supported by the National Science Foundation (Award #1722822). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, and no official endorsement should be inferred.
