Contextualized Multimodal Representations
Multimodal Embeddings from Language Models
Word embeddings such as ELMo have recently been shown to model word semantics with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant improvements to the state of the art across many natural language tasks. In this work we integrate acoustic information into contextualized lexical embeddings by adding multimodal inputs to a pretrained bidirectional language model. The language model is trained on spoken language comprising text and audio modalities. The resulting representations from this model are multimodal and contain paralinguistic information which can modify word meanings and provide affective information. We show that these multimodal embeddings can be used to improve over previous state-of-the-art multimodal models in emotion recognition on the CMU-MOSEI dataset.
Department of Electrical and Computer Engineering
University of Southern California
Los Angeles, CA, USA
Acoustic and visual elements in human communication, such as intonation or facial expressions, infuse semantic content with additional paralinguistic cues which may modify intent and can convey affective meaning more clearly, if not exclusively. For this reason many works have proposed multimodal systems which integrate information from multiple modalities to improve natural language understanding. This effort encompasses research in many applications such as human-robot interfaces [2, 3], video summarization [4, 5, 6], dialogue systems [7, 8], and emotion and sentiment analysis [9, 10, 11].
The study of multimodal fusion in affective systems is a prevalent and important topic. This follows from the fact that human behavioral expression is fundamentally a multifaceted phenomenon that spans multiple modalities [12, 13] and can be more accurately identified through multimodal classifiers. Another factor is the importance of affective information as an intermediate step in a variety of downstream tasks; examples include the use of these human behavioral states in language modeling, dialogue systems [16, 17], and video summarization.
Many multimodal systems for recognition of sentiment, emotion, and behaviors have been proposed in prior work. In feature-level fusion, Tzirakis et al. combined auditory and visual modalities by extracting features from convolutional neural networks (CNNs) on each modality, which were then concatenated as input to an LSTM network. Hazarika et al. proposed the use of a self-attention mechanism to assign scores for a weighted combination of modalities. Other work has applied multimodal integration using late fusion methods [21, 22].
For deeper integration between modalities, many works have proposed multimodal neural architectures. Lee et al. proposed the use of an attention matrix calculated from speech and text features to selectively focus on specific regions of the audio feature space. The memory fusion network, introduced by Zadeh et al., accounts for intra- and inter-modal dependencies across time. Akhtar et al. proposed a contextual inter-modal attention network which leverages sentiment and emotion labels in a multi-task learning framework.
The strength of deep models arises from the ability to internally learn meaningful representations of features from multiple modalities. This is learned implicitly by the model over the course of training on the respective datasets. In this work we propose a model to explicitly learn informative joint representations of speech and text. This is achieved by modeling the dynamics between lexical content and paralinguistics from audio through a language modeling task on spoken language. We augment a bidirectional language model (biLM) with word-aligned acoustic features and optimize the model first on large-scale text corpora and then on spoken articles. The internal states of this biLM are not tied to a specific task but rather model the intricacies of human communication through speech and language. We show the effectiveness of representations extracted from this model in capturing multimodal information by evaluating on the task of emotion recognition. Using these representations we improve the state of the art in emotion recognition on the CMU-MOSEI dataset.
2 Related Work
Lexical representations such as ELMo and BERT have recently been shown to model word semantics and syntax with greater efficacy. This is achieved through contextualized learning on large-scale language corpora, which allows the internal states of the model to capture both the complex characteristics of word use and polysemy across different contexts. The integration of these word embeddings into downstream models has improved the state of the art in many NLP tasks through their rich representation of language use.
To learn representations from multimodal data, Hsu et al. proposed the use of variational autoencoders to encode inter- and intra-modal factors into separate latent variables. Later, Tsai et al. factorized representations into multimodal discriminative and modality-specific generative factors using inference and generative networks. During the course of writing this paper, Rahman et al. concurrently proposed infusing multimodal information into the BERT model. There the authors combined the generative capabilities of BERT with a sentiment prediction task, allowing the model to implicitly learn rich multimodal representations through a joint generative-discriminative objective.
In this work we propose to explicitly learn multimodal representations of spoken words by augmenting the biLM in ELMo with acoustic information. This is motivated by how humans integrate acoustic characteristics in speech to interpret the meaning and intent of a speaker's lexical content. Our work differs from prior work in that we do not include or target any discriminative objectives, instead relying on the generative task of language modeling to learn meaningful multimodal representations. We show how this model can be easily trained with large-scale unlabeled data and demonstrate how potent the multimodal embeddings from this model are in tasks such as emotion recognition.
3 Multimodal Embeddings
In this section we describe the network architecture of the bidirectional language model with acoustic information that generates the multimodal embeddings. The biLM comprises two layers of bidirectional LSTMs which operate over lexical and audio embeddings. The lexical and audio embeddings are calculated from respective convolutional layers and combined using a sigmoid-gating function. Multimodal embeddings are then computed using a linear function over the internal states of the recurrent layers. The overall architecture of the multimodal biLM is shown in Figure 1.
3.1 Bidirectional language model
A language model (LM) computes the probability distribution of a sequence of words by approximating it as the product of conditional probabilities of each word given previous words. This has been implemented using neural networks in many prior works, yielding state-of-the-art results [32, 33]. In this work we apply the biLM used in ELMo, which is similar to the character-level RNN-LM described by Józefowicz et al. and Kim et al.
The biLM is composed of a forward and backward LM each implemented by a two-layer LSTM. The forward LM predicts the probability distribution of the next token given past context while the backward LM predicts the probability distribution of the previous token given future context. Each LM operates on the same input, which is a token embedding of the current token calculated through a character-level convolutional neural network (CharCNN). A softmax layer is used to estimate token probabilities from the output of the two-layer LSTM in the LMs. The parameters of the softmax layer are shared between the LMs in both directions.
Different from ELMo, our input to the biLM includes acoustic features in addition to word tokens. The forward LM now models, at each timestep, the conditional probability of the next token $t_{k+1}$ given the current token $t_k$, acoustic features $a_k$, and the previous internal states of the two-layer LSTM $\overrightarrow{h}_{k,1}, \overrightarrow{h}_{k,2}$:

$$p\big(t_{k+1} \mid t_k, a_k, \overrightarrow{h}_{k,1}, \overrightarrow{h}_{k,2}\big)$$
The backward LM operates similarly but predicts the previous token $t_{k-1}$ given the current token $t_k$, acoustic features $a_k$, and the internal states resulting from future context $\overleftarrow{h}_{k,1}, \overleftarrow{h}_{k,2}$:

$$p\big(t_{k-1} \mid t_k, a_k, \overleftarrow{h}_{k,1}, \overleftarrow{h}_{k,2}\big)$$
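As in ELMo, the forward and backward LMs can be trained jointly by maximizing the log likelihood in both directions. A plausible form of the objective, using hypothetical notation ($t_k$ for the $k$-th token, $a_k$ for its acoustic features, $\overrightarrow{\Theta}$ and $\overleftarrow{\Theta}$ for the directional LM parameters), is:

```latex
\mathcal{L} = \sum_{k=1}^{N} \Big( \log p\big(t_{k+1} \mid t_k, a_k;\, \overrightarrow{\Theta}\big)
            + \log p\big(t_{k-1} \mid t_k, a_k;\, \overleftarrow{\Theta}\big) \Big)
```

with the softmax parameters shared between the two directions, as noted above.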
3.2 Acoustic convolution layers
In our implementation, time-aligned acoustic features for each word are provided alongside the word tokens. We build on ELMo by adding convolutional layers to calculate acoustic embeddings from these features. The convolutional layers provide a feature transformation of the acoustic features which we combine with the token embeddings using a gating function.
Due to the varying lengths of words in time, the acoustic features are first padded to a fixed size in the temporal dimension before being passed to the CNN. Each convolution layer in the CNN comprises a 1-D convolution layer followed by a max-pooling layer. Finally, the feature map is projected to the same dimension size as token embeddings to allow for element-wise combination.
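A minimal numpy sketch of one such layer follows: a 1-D convolution over the padded frames, temporal max-pooling, and projection to the token-embedding dimension. The layer sizes and filter width here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def acoustic_embedding(frames, conv_w, proj_w, width=3):
    """1-D convolution over padded acoustic frames, temporal max-pool,
    then projection to the token-embedding dimension."""
    n_frames, n_feats = frames.shape          # padded frames x acoustic features
    n_filters = conv_w.shape[0]               # conv_w: (n_filters, width, n_feats)
    outs = np.empty((n_frames - width + 1, n_filters))
    for t in range(n_frames - width + 1):
        window = frames[t:t + width]          # (width, n_feats)
        outs[t] = np.tensordot(conv_w, window, axes=([1, 2], [0, 1]))
    pooled = outs.max(axis=0)                 # temporal max-pool -> (n_filters,)
    return pooled @ proj_w                    # project to token-embedding size

rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 74))            # 20 padded frames, 74 COVAREP features
conv_w = rng.normal(size=(32, 3, 74))         # 32 filters of width 3 (toy sizes)
proj_w = rng.normal(size=(32, 512))           # project to a 512-dim token embedding
emb = acoustic_embedding(frames, conv_w, proj_w)
print(emb.shape)                              # (512,)
```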
3.3 Multimodal ELMo
We combine token and acoustic embeddings using a sigmoid gating function:

$$x_k = w_k \odot \sigma(a_k)$$

where $w_k$ and $a_k$ are the embeddings calculated from the token and corresponding acoustic features, respectively, $\sigma$ is the sigmoid function, and $\odot$ represents element-wise multiplication. The resulting multimodal embeddings $x_k$ are used as input to the forward and backward LM.
The sigmoid gate scales the token embedding based on corresponding acoustic features of the word. This serves as a modifier of semantic meaning using paralinguistic information which we hypothesize will be useful in capturing affective expressions in downstream tasks.
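The gating described above reduces to an element-wise product between the token embedding and the sigmoid of the acoustic embedding, which can be sketched directly (the embedding values below are illustrative):

```python
import numpy as np

def gate(token_emb, acoustic_emb):
    """Sigmoid-gated combination: the acoustic embedding scales each
    dimension of the token embedding element-wise."""
    sig = 1.0 / (1.0 + np.exp(-acoustic_emb))
    return token_emb * sig

w = np.array([1.0, -2.0, 0.5])     # toy token embedding
a = np.array([0.0, 10.0, -10.0])   # acoustic gate: halve / pass through / suppress
x = gate(w, a)
print(x)   # approximately [0.5, -2.0, 0.0]
```

A large positive acoustic activation passes the token dimension through unchanged, while a large negative one suppresses it, matching the "modifier" interpretation above.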
Word embeddings are extracted for use in downstream models in a similar fashion to ELMo; that is, we compute a task-specific weighted sum of all LSTM outputs for each word embedding. To obtain sentence embeddings for downstream models we additionally average the word embeddings over the sentence. The final multimodal ELMo (M-ELMo) sentence embedding is given as

$$\text{M-ELMo} = \gamma \, \frac{1}{N} \sum_{k=1}^{N} \sum_{j} s_j \, h_{k,j}$$

where $h_{k,j}$ are the concatenated outputs of the LSTMs in both directions at the $j$-th layer for the $k$-th token. The values $s_j$ are softmax-normalized weights and $\gamma$ is a scalar, all of which are tunable parameters in the downstream model.
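This weighted-sum-then-average can be sketched in a few lines of numpy, with $h$ as a (tokens, layers, dim) tensor of concatenated bidirectional LSTM outputs, layer logits that are softmax-normalized into the weights $s_j$, and a scalar $\gamma$ (toy sizes throughout):

```python
import numpy as np

def m_elmo_sentence(h, s_logits, gamma):
    """Task-weighted sum over biLM layers, then average over tokens.
    h: (num_tokens, num_layers, dim) concatenated biLSTM outputs."""
    s = np.exp(s_logits) / np.exp(s_logits).sum()   # softmax layer weights s_j
    per_token = np.einsum('j,kjd->kd', s, h)        # sum_j s_j * h_{k,j}
    return gamma * per_token.mean(axis=0)           # average over the N tokens

rng = np.random.default_rng(1)
h = rng.normal(size=(6, 3, 8))                      # 6 tokens, 3 layers, dim 8
emb = m_elmo_sentence(h, np.zeros(3), gamma=1.0)    # uniform layer weights
print(emb.shape)                                    # (8,)
```

With zero logits the layer weights are uniform, so the result is simply the mean over layers and tokens; a downstream model would instead learn the logits and $\gamma$.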
4 Experimental Setup
Table 1: Emotion recognition results on CMU-MOSEI, reported as WA / F1 pairs per emotion followed by the average over all emotions. Baseline models use the audio, lexical, and visual modalities (A + L + V), while our model uses audio and lexical only (A + L). M-ELMo + NN: 65.8 / 74.7, 74.2 / 81.7, 63.2 / 85.1, 67.0 / 65.2, 63.1 / 72.0, 63.8 / 83.3; average WA 66.2, average F1 77.0.
4.1 Pre-training the multimodal biLM
The multimodal biLM is pre-trained in two stages. In the first stage the lexical components of the biLM are optimized prior to the inclusion of acoustic features. This is achieved by training with a text corpus while fixing the acoustic input to zero. We use the 1 Billion Word Language Model Benchmark for this purpose and train the biLM for 10 epochs. After training, the model achieves perplexities of around 35, similar to previously reported values.
In the second stage of training we optimize the biLM using the multimodal dataset CMU-MOSEI (described in Section 4.3). In our experiments we use only the text and audio of segments in the training split of the dataset to train the model. In terms of word count, CMU-MOSEI is much smaller than the 1-billion-word LM benchmark; therefore, to prevent overfitting, we reduce the learning rate used in the previous stage by a factor of 10 and train for an additional 5 epochs.
4.2 Input features

Since a CharCNN is used as the lexical embedder, input words to the biLM are first transformed into a character map and padded to a fixed length. The character-level representation of each word is then given as a $d \times l$ matrix, where $d$ is the dimension size of the character embedding and $l$ is the maximum number of characters in a word.
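A sketch of this character-map construction, with a hypothetical embedding table and padding length (the sizes are illustrative, not the paper's):

```python
import numpy as np

def char_matrix(word, char_emb, max_chars=16):
    """Look up a d-dim embedding per character and pad to max_chars,
    giving the (d, max_chars) character-level word representation."""
    d = char_emb.shape[1]
    mat = np.zeros((d, max_chars))
    for i, ch in enumerate(word[:max_chars]):
        mat[:, i] = char_emb[ord(ch) % char_emb.shape[0]]  # toy char -> id map
    return mat

char_emb = np.random.default_rng(2).normal(size=(256, 16))  # 256 chars, d = 16
m = char_matrix("hello", char_emb)
print(m.shape)   # (16, 16)
```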
Acoustic features were extracted from each recording at 10ms frame intervals using the COVAREP software version 1.4.2 . There are 74 features in total and include, among others, pitch, voiced/unvoiced segment features, mel-frequency cepstral coefficients, glottal flow parameters, peak slope parameters, and harmonic model parameters.
The acoustic features are aligned with word timings to provide acoustic information for each word. Since the time duration varies between words we pad the number of acoustic frames per token to a fixed length. Thus, word-aligned acoustic features are given as a $d_a \times l_a$ matrix, where $d_a$ is the number of acoustic features and $l_a$ is the maximum number of acoustic frames in a word.
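The per-word padding step can be sketched as follows; the maximum frame count of 100 (i.e. one second at 10 ms frames) is an assumed value for illustration:

```python
import numpy as np

def pad_word_frames(frames, max_frames=100):
    """Pad (or truncate) a word's acoustic frames to a fixed length,
    giving a (n_features, max_frames) matrix."""
    frames = frames[:max_frames]
    n_feats = frames.shape[1]
    out = np.zeros((n_feats, max_frames))
    out[:, :frames.shape[0]] = frames.T       # real frames first, zeros after
    return out

word_frames = np.ones((37, 74))   # a 370 ms word at 10 ms frames, 74 COVAREP features
x = pad_word_frames(word_frames)
print(x.shape)                    # (74, 100)
```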
4.3 Emotion recognition as a downstream task
After pre-training, the multimodal biLM is used to extract multimodal sentence embeddings for use in downstream models. In our experiments we adopt emotion recognition as the downstream task and evaluate on the CMU-MOSEI dataset.
CMU-MOSEI contains 23,453 single-speaker video segments from YouTube which have been manually transcribed and annotated for sentiment and emotion. Emotions are annotated on a [0, 3] Likert scale and include happiness, sadness, anger, fear, disgust, and surprise. We binarize these annotations to obtain class labels indicating the presence of each emotion, i.e. any emotion with a rating greater than one is marked present. Since video segments have ratings for all emotions this becomes a multi-label classification task.
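A sketch of this binarization (the threshold follows the text above; the emotion ordering and ratings are illustrative):

```python
import numpy as np

def binarize(ratings, threshold=1.0):
    """Turn [0, 3] Likert ratings into multi-label targets: an emotion
    is marked present if its rating exceeds the threshold."""
    return (np.asarray(ratings) > threshold).astype(int)

# One segment rated on six emotions (happy, sad, anger, fear, disgust, surprise)
labels = binarize([2.3, 0.0, 1.0, 0.3, 1.7, 0.0])
print(labels)   # [1 0 0 0 1 0]
```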
As our goal is to evaluate the efficacy of the multimodal sentence embeddings, we used a simple feedforward neural network for emotion recognition. The network takes the multimodal sentence embeddings as input and predicts the presence of each emotion. The tunable parameters described in Section 3.3 are also included in this network. We trained the network using data from the training split provided in the dataset and validated using the validation split. We also used the validation split as a development set when choosing hyper-parameters of the network.
4.4 Evaluation methods
We evaluated the emotion recognition model using weighted accuracy (WA) and F1 score on each emotion. Weighted accuracy, as used in prior work, is equivalent to the macro-averaged recall. We also averaged the metrics across all emotions to obtain an average WA and F1 score.
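For a binary label, macro-averaged recall is simply the mean of the recall on the positive class and the recall on the negative class, as the following sketch shows (the toy predictions are illustrative):

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA for a binary label = macro-averaged recall: the mean of the
    recall on the positive class and the recall on the negative class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recall_pos = (y_pred[y_true == 1] == 1).mean()
    recall_neg = (y_pred[y_true == 0] == 0).mean()
    return (recall_pos + recall_neg) / 2

# 3 positives (2 recovered) and 1 negative (recovered): WA = (2/3 + 1) / 2
wa = weighted_accuracy([1, 1, 1, 0], [1, 0, 1, 0])
print(wa)   # 0.8333...
```

Unlike plain accuracy, this weighting keeps a majority-class predictor at 0.5 regardless of label imbalance.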
The downstream model was trained for 30 epochs and separately optimized on WA and F1 score using the validation set. We randomly initialized each downstream model ten times and a best model was selected based on the average scores on validation over the ten runs. The final model was a neural network with two hidden layers using ReLU activation functions.
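The forward pass of such a two-hidden-layer ReLU network with per-emotion sigmoid outputs can be sketched as follows; the layer sizes are toy values, not the tuned configuration:

```python
import numpy as np

def emotion_net(x, params):
    """Downstream classifier sketch: two ReLU hidden layers over the
    M-ELMo sentence embedding, sigmoid outputs for multi-label
    emotion-presence prediction."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = np.maximum(0, x @ w1 + b1)            # hidden layer 1 (ReLU)
    h2 = np.maximum(0, h1 @ w2 + b2)           # hidden layer 2 (ReLU)
    logits = h2 @ w3 + b3
    return 1.0 / (1.0 + np.exp(-logits))       # per-emotion probabilities

rng = np.random.default_rng(3)
dims = [8, 16, 16, 6]                          # embedding -> hidden -> hidden -> 6 emotions
params = [(rng.normal(size=(i, o)), np.zeros(o)) for i, o in zip(dims, dims[1:])]
probs = emotion_net(rng.normal(size=8), params)
print(probs.shape)   # (6,)
```

Independent sigmoid outputs (rather than a softmax) fit the multi-label setup, since several emotions can be present in one segment.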
Due to the lack of prior work focusing only on text and audio, we compared with models that also consider the visual modality. We compared our performance with two recent state-of-the-art emotion recognition models on CMU-MOSEI: the graph Memory Fusion Network (Graph-MFN) and the contextual inter-modal attention framework (CIM-Att). To match learning conditions, we compared with the single-task learning (STL) variant of the latter, where only emotion labels are used in training.
The results of the final model averaged across ten runs are shown in Table 1. Our simple feedforward neural network using multimodal embeddings achieved state-of-the-art results in terms of average WA and F1 over all emotions at 66.2% and 77.0%, respectively. On individual emotions our model yielded results comparable to or better than the state of the art. Specifically, we observed improvements in the weighted accuracy of all emotions except sad, as well as improvements in the F1 score of disgust and sad.
5 Conclusion

In this work we proposed a method for extending ELMo word embeddings to include acoustic information. We used convolutional layers over word-aligned acoustic features to calculate acoustic embeddings, which we then combined with token embeddings in ELMo using a sigmoid gating function. The model was trained on a language modeling task, first with a text corpus and then with the inclusion of audio from a multimodal dataset. We then showed the effectiveness of sentence embeddings extracted from this multimodal biLM in emotion recognition. The results are notable given that our downstream model, a neural network with only two hidden layers, outperformed state-of-the-art architectures. This demonstrates how well the multimodal embeddings capture inter- and intra-modal dynamics in spoken language.
-  Emma Rodero, “Intonation and emotion: influence of pitch levels and contour type on creating emotions,” Journal of Voice, 2011.
-  Erica L Meszaros, Meghan Chandarana, Anna Trujillo, and B Danette Allen, “Compensating for limitations in speech-based natural language processing with multimodal interfaces in uav operation,” in International Conference on Applied Human Factors and Ergonomics. Springer, 2017, pp. 183–194.
-  Iñaki Maurtua, Izaskun Fernández, Alberto Tellaeche, Johan Kildal, Loreto Susperregi, Aitor Ibarguren, and Basilio Sierra, “Natural multimodal communication for human–robot collaboration,” International Journal of Advanced Robotic Systems, vol. 14, no. 4, pp. 1729881417716043, 2017.
-  Fumio Nihei, Yukiko I Nakano, and Yutaka Takase, “Predicting meeting extracts in group discussions using multimodal convolutional neural networks,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 421–425.
-  Fumio Nihei, Yukiko I Nakano, and Yutaka Takase, “Fusing verbal and nonverbal information for extractive meeting summarization,” in Proceedings of the Group Interaction Frontiers in Technology. ACM, 2018, p. 9.
-  Shruti Palaskar, Jindřich Libovický, Spandana Gella, and Florian Metze, “Multimodal abstractive summarization for How2 videos,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019, pp. 6587–6596, Association for Computational Linguistics.
-  Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-seng Chua, “Knowledge-aware multimodal dialogue systems,” in 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 2018, pp. 801–809.
-  Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, et al., “End-to-end audio visual scene-aware dialog using multimodal attention-based video features,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2352–2356.
-  Anthony Hu and Seth Flaxman, “Multimodal sentiment analysis to explore the structure of emotions,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 350–358.
-  Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung, “Multimodal speech emotion recognition using audio and text,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 112–118.
-  Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis-Philippe Morency, “Multimodal language analysis with recurrent multistage fusion,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 150–161.
-  Shrikanth Narayanan and Panayiotis G Georgiou, “Behavioral signal processing: Deriving human behavioral informatics from speech and language,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1203–1233, 2013.
-  Sidney K D’mello and Jacqueline Kory, “A review and meta-analysis of multimodal affect detection systems,” ACM Computing Surveys (CSUR), vol. 47, no. 3, pp. 43, 2015.
-  Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017.
-  Prashanth Gurunath Shivakumar, Shao-Yen Tseng, Panayiotis Georgiou, and Shrikanth Narayanan, “Behavior gated language models,” arXiv preprint arXiv:1909.00107, 2019.
-  Johannes Pittermann, Angela Pittermann, and Wolfgang Minker, “Emotion recognition and adaptation in spoken dialogue systems,” International Journal of Speech Technology, vol. 13, no. 1, pp. 49–60, 2010.
-  Dario Bertero, Farhad Bin Siddique, Chien-Sheng Wu, Yan Wan, Ricky Ho Yin Chan, and Pascale Fung, “Real-time speech emotion and sentiment recognition for interactive dialogue systems,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1042–1047.
-  Ashish Singhal, Pradeep Kumar, Rajkumar Saini, Partha Pratim Roy, Debi Prosad Dogra, and Byung-Gyu Kim, “Summarization of videos by analyzing affective state of the user through crowdsource,” Cognitive Systems Research, vol. 52, pp. 917–930, 2018.
-  Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
-  Devamanyu Hazarika, Sruthi Gorantla, Soujanya Poria, and Roger Zimmermann, “Self-attentive feature-level fusion for multimodal emotion detection,” in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2018, pp. 196–201.
-  Shao-Yen Tseng, Haoqi Li, Brian Baucom, and Panayiotis Georgiou, “‘Honey, I learned to talk’: Multimodal fusion for behavior analysis,” in Proceedings of the 20th ACM International Conference on Multimodal Interaction, New York, NY, USA, 2018, ICMI ’18, pp. 239–243, ACM.
-  Nathaniel Blanchard, Daniel Moreira, Aparna Bharati, and Walter J Scheirer, “Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities,” arXiv preprint arXiv:1807.01122, 2018.
-  Chan Woo Lee, Kyu Ye Song, Jihoon Jeong, and Woo Yong Choi, “Convolutional attention networks for multimodal emotion recognition from speech and text data,” ACL 2018, p. 28, 2018.
-  Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency, “Memory fusion network for multi-view sequential learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Md Shad Akhtar, Dushyant Chauhan, Deepanway Ghosal, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya, “Multi-task learning for multi-modal emotion recognition and sentiment analysis,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
-  Yoshua Bengio, Aaron Courville, and Pascal Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
-  Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, “Deep contextualized word representations,” in Proceedings of NAACL-HLT, 2018, pp. 2227–2237.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
-  Wei-Ning Hsu and James Glass, “Disentangling by partitioning: A representation learning framework for multimodal sensory data,” arXiv preprint arXiv:1805.11264, 2018.
-  Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov, “Learning factorized multimodal representations,” in International Conference on Learning Representations, 2019.
-  Wasifur Rahman, Md Kamrul Hasan, Amir Zadeh, Louis-Philippe Morency, and Mohammed Ehsan Hoque, “M-BERT: Injecting multimodal information in the BERT structure,” arXiv preprint arXiv:1908.05787, 2019.
-  Stephen Merity, Nitish Shirish Keskar, and Richard Socher, “Regularizing and optimizing LSTM language models,” in International Conference on Learning Representations (ICLR), 2018.
-  Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu, “Frage: Frequency-agnostic word representation,” in Advances in neural information processing systems (NIPS), 2018.
-  Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016.
-  Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush, “Character-aware neural language models,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  Amir Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency, “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
-  Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” in Interspeech, 2014.
-  Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer, “COVAREP - A collaborative voice analysis repository for speech technologies,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.