Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement Learning


Minghai Chen
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Sen Wang
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Paul Pu Liang
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Tadas Baltrušaitis
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Amir Zadeh
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213, USA

Louis-Philippe Morency
Language Technologies Institute
Carnegie Mellon University, Pittsburgh, PA 15213, USA


With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal sentiment analysis has received increasing attention from the scientific community. Contrary to previous works in multimodal sentiment analysis which focus on holistic information in speech segments such as bag of words representations and average facial expression intensity, we develop a novel deep architecture for multimodal sentiment analysis that performs modality fusion at the word level. In this paper, we propose the Gated Multimodal Embedding LSTM with Temporal Attention  (GME-LSTM(A)) model that is composed of 2 modules. The Gated Multimodal Embedding alleviates the difficulties of fusion when there are noisy modalities. The LSTM with Temporal Attention performs word level fusion at a finer fusion resolution between input modalities and attends to the most important time steps. As a result, the GME-LSTM(A) is able to better model the multimodal structure of speech through time and perform better sentiment comprehension. We demonstrate the effectiveness of this approach on the publicly-available Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving state-of-the-art sentiment classification and regression results. Qualitative analysis on our model emphasizes the importance of the Temporal Attention Layer in sentiment prediction because the additional acoustic and visual modalities are noisy. We also demonstrate the effectiveness of the Gated Multimodal Embedding in selectively filtering these noisy modalities out. Our results and analysis open new areas in the study of sentiment analysis in human communication and provide new models for multimodal fusion.

Multimodal Sentiment Analysis, Multimodal Fusion, Human Communication, Deep Learning, Reinforcement Learning

ICMI ’17: 19th ACM International Conference on Multimodal Interaction, November 13–17, 2017, Glasgow, UK. Copyright 2017, licensed to ACM. Price: 15.00. DOI: 10.1145/3136755.3136801. ISBN: 978-1-4503-5543-8/17/11.


Equal contribution.







CCS Concepts: Computing methodologies → Artificial Intelligence; Computing methodologies → Computer Vision; Computing methodologies → Natural Language Processing; Computing methodologies → Machine Learning; Information systems → Sentiment Analysis

1 Introduction

Multimodal sentiment analysis is an emerging field at the intersection of natural language processing, computer vision, and speech processing. Sentiment analysis aims to find the attitude of a speaker or writer towards a document, topic or an event (Pang et al., 2008). Sentiment can be expressed by the spoken words, the emotional tone of the delivery and the accompanying facial expressions. As a result, it is helpful to combine visual, language, and acoustic modalities for sentiment prediction (Perez-Rosas et al., 2013). To combine cues from different modalities, previous work mainly focused on holistic video-level feature fusion. This was done with simple features (such as bag-of-words and average smile intensity) calculated over an entire video as representations of verbal, visual and acoustic features (Wang et al., 2016; Zadeh et al., 2016b). These simplistic fusion approaches mostly ignore the structure of speech by focusing on simple statistics from videos and combining modalities at an abstract level.

The cornerstone of our approach is capturing the full structure of speech using a time-dependent recurrent approach that is able to properly perform modality fusion at every timestep. This understanding of speech structure is important for two major reasons: 1) There are local interactions between modalities, such as how loudly a word is uttered, which has roots in the language and acoustic modalities, or whether a word was accompanied by a smile, which has roots in language and vision. Considering local interactions helps in dealing with commonly researched problems in natural language processing such as ambiguity, sarcasm or limited context by providing more information from the visual and acoustic modalities. Consider the word “crazy”; this word can have a positive sentiment if accompanied by a smile or a negative sentiment if accompanied by a frown. At the same time, the word “great” accompanied by a frown implies sarcastic speech. Also, in limited context, inference of sentiment intensity is difficult. For example, the word “good” accompanied by neutral nonverbal behavior could mean that the utterance is positive, but the same word accompanied by a big smile could mean highly positive sentiment. 2) There are global interactions between modalities, mostly established based on temporal relations between modalities. Examples include delayed laughter due to a speaker’s utterance of words or a delayed smile because of a speech pause. Each modality also has its own intramodal interactions (such as how different gestures happen over the utterance), which can be characterized as the global structure of speech.

To properly model the structure of speech, two key questions need to be answered: “what modality to look for at each moment in time?” and “what moments in speech are important in the communication?”. To address the first question, a model should be able to “gate” the useful modalities at each moment in time. If a modality does not contain useful information, or is so noisy that it negatively affects the performance of the model, the model should be able to shut off that modality and perform inference based on the information present in the other modalities. To address the second question, a model should be able to divert its “attention” to key moments of communication, such as when a polarized word is uttered or when a smile happens. In this paper we introduce the Gated Multimodal Embedding LSTM with Temporal Attention model, which explicitly accounts for these two key questions by using a gated mechanism for multimodal fusion at each time step and a Temporal Attention Layer for sentiment prediction. Our model is able to explore the structure of speech through a stateful recurrent mechanism and perform fusion at the word level between different modalities. This gives our model the ability to account for local (by word level fusion of multimodal information) and global (by stateful multimodal memory) interactions between modalities.

The remainder of the paper is organized as follows. In Section 2, we review the related work in multimodal sentiment analysis. In Section 3, we give a formal definition of our problem and present our approaches in detail. In Section 4, we describe the CMU-MOSI dataset, experimental methodology and baseline models. The results on the CMU-MOSI dataset are presented in Section 5. A detailed analysis of our model’s components is provided in Section 6. Section 7 concludes the paper.

2 Related Work

Deep learning approaches have become extremely popular in the past few years (Young et al., 2017). The field of multimodal machine learning in particular has gained unprecedented momentum (Baltrušaitis et al., 2017). Multimodal models have been used for sentiment analysis (Zadeh et al., 2016b; Poria et al., 2017; Zadeh et al., 2016a); medical purposes, such as detection of suicidal risk, PTSD and depression (Scherer et al., 2016; Venek et al., 2016; Yu et al., 2013; Valstar et al., 2016); emotion recognition (Poria et al., 2017); image captioning and media description (You et al., 2016; Donahue et al., 2015); question answering (Antol et al., 2015); and multimodal translation (Specia et al., 2016). Our work is specifically connected to the following areas:

Sentiment analysis on the written text modality has been well-studied (Pang et al., 2008) with models that predict sentiment from language. Early works used bag of words or n-gram representations (Yang and Cardie, 2012) to derive sentiment from individual words. Other approaches focused on opinionated words (Taboada et al., 2011; Poria et al., 2012, 2016), and some applied more complicated structures such as trees (Socher et al., 2013) and graphs (Poria et al., 2014). These structures aimed to derive the sentiment of a sentence based on the sentiment of individual words and their compositions. More recent works used dependency-based semantic analysis (Agarwal et al., 2015), distributional representations for sentiment (Iyyer et al., 2015) and a convolutional architecture for the semantic modeling of sentences (Kalchbrenner et al., 2014). However, we are primarily working on spoken text rather than written text, which gives us the opportunity to integrate additional audio and visual modalities. These modalities provide helpful additional information, but may sometimes be noisy.

Multimodal sentiment analysis integrates verbal and nonverbal behavior to predict user sentiment. Though various multimodal datasets with sentiment labels exist (Wöllmer et al., 2013; Morency et al., 2011; Pérez-Rosas et al., 2013), the CMU-MOSI dataset (Zadeh et al., 2016b) is the first dataset with opinion level sentiment labels.

Recent multimodal sentiment analysis approaches focus on deep neural networks, including Convolutional Neural Networks (CNN) with multiple-kernel learning (Poria et al., 2015a) and Select-Additive Learning (SAL) CNN (Wang et al., 2016), which learns generalizable features across speakers. (Behnaz Nojavanasghari, 2016) uses a unimodal deep neural network for each of three modalities (language, acoustic and visual) and explores the effectiveness of early fusion and late fusion. The features of the three modalities are similar to our work, but to fuse the multimodal information their work simply uses a concatenation approach at the video level, while we propose and justify the use of more advanced methods for multimodal fusion.

Besides sentiment, other speaker attributes such as persuasion, passion and confidence can also be analyzed by similar methods (Park et al., 2014; Majumder et al., 2017). (Chatterjee et al., 2015) proposes an ensemble classification approach that combines two different perspectives on classifying multimodal data: the first perspective assumes inter-modality conditional independence, while the second perspective explicitly captures the correlation between modalities and is recognized by a clustering based kernel similarity approach. These methods can also be applied to multimodal sentiment analysis.

Our gated controller for the visual and acoustic modalities is inspired by the controller of Zoph and Le (2016), who use a Recurrent Neural Network (RNN) controller to determine the structure of a CNN.

Compared to previous work, our method has two major advantages. First, to the best of our knowledge, our model is the first to use word level modality fusion, which means that we align each word to its corresponding video frames and audio segments. Second, we are also the first to propose an attention layer and an input gate controller trained by reinforcement learning to approach the problem of noisy modalities.

3 Proposed Approaches

In this section, we give a detailed description of our proposed approach, which is divided into two modules: the Gated Multimodal Embedding Layer and the LSTM with Temporal Attention model. The Gated Multimodal Embedding Layer performs selective multimodal fusion at each time step (word level) using input modality gates, and the LSTM with Temporal Attention performs sentiment prediction with attention to each time step. Together, these modules combine to form the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)). The section ends with the training details for the GME-LSTM(A).

Figure 1: Architecture of the GME-LSTM(A) model for the visual modality. The controller for the visual modality selectively allows visual inputs to pass. FC-ReLU is a fully-connected layer with a rectified linear unit (ReLU) activation. After obtaining a sentiment prediction and its loss, we use the negative loss as the reward signal to train the visual input gate controller.

3.1 Gated Multimodal Embedding

The first component is the Gated Multimodal Embedding, which performs multimodal fusion by learning the local interactions between modalities. Suppose the dataset contains a set of video clips, each containing an opinion mapped to a sentiment intensity. Each video clip consists of a sequence of time steps, where each time step corresponds to a word, and is labeled with a ground truth sentiment value. We align words with their corresponding video and audio segments using the Penn Phonetics Lab Forced Aligner (P2FA) (ucbvislab, 2013). P2FA is software that computes an alignment between a speech audio file and a verbatim text transcript. Following the alignment, at each word level time step we obtain the aligned feature vectors for the language (word), acoustic and visual modalities.

One problem of previous models is that of noisy modalities during multimodal fusion since the textual modality may be negatively affected by the visual and audio modalities. As a result, useful textual features may be ignored because the corresponding visual or acoustic feature is noisy and important information may be lost.

Our solution is to introduce an on/off input gate controller that determines if the acoustic or visual modality at each time step should contribute to the overall prediction. The reason why we apply the input gate controller on acoustic/visual features while always letting textual features in is that the language modality is much more reliable for multimodal sentiment analysis than visual or acoustic. Also visual and acoustic modalities can be noisy or unreliable since audio visual feature extraction is done automatically using methods that add additional noise.

Mathematically, we formalize this with the audio and visual inputs at each time step. We have one controller, with its own weights, that switches the audio modality on or off, and a second controller, with its own weights, that does the same for the visual modality. These controllers are implemented as deep neural networks that take the audio and visual inputs, respectively, and each output a binary label (0/1). The binary output of these controllers mimics the act of accepting or rejecting a modality based on its noise level. The new inputs to the network are the original audio and visual inputs multiplied by their binary gate outputs, so a rejected input is replaced with zeros and an accepted input passes through unchanged.
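The gating described above can be written compactly. The symbols below are assumed notation introduced here for illustration (audio and visual features $a_t$, $v_t$ at time step $t$, controllers $C_a$, $C_v$ with weights $\theta_a$, $\theta_v$, and gate outputs $c_{a_t}$, $c_{v_t}$), since the original display equation is not available in this text:

```latex
\begin{aligned}
c_{a_t} &= C_a(a_t;\, \theta_a) \in \{0, 1\}, &
c_{v_t} &= C_v(v_t;\, \theta_v) \in \{0, 1\}, \\
a'_t &= c_{a_t}\, a_t, &
v'_t &= c_{v_t}\, v_t .
\end{aligned}
```

Under this reading, a gate value of 0 zeroes out the whole feature vector for that word, and a gate value of 1 leaves it unchanged.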


We concatenate features from three different modalities, visual, audio and text, to form the inputs to the word level LSTM with Temporal Attention, described in the next section. By extracting features with alignment at the word level, we exploit the temporal correlation among different modalities, while at the same time the model is less affected by noisy modalities during multimodal fusion.
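As a rough illustration of the word level gated concatenation, the following sketch builds one fused input vector. It is not the authors' implementation: the single logistic unit standing in for the controller, the function names and the toy feature values are all assumptions, and only the Python standard library is used.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_probability(features, weights, bias):
    """Hypothetical controller: probability of letting a feature vector pass.

    The paper uses a small neural network; a single logistic unit is used
    here only to keep the sketch short.
    """
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def gated_embedding(word_vec, audio_vec, visual_vec, audio_gate, visual_gate):
    """Concatenate text features with gated audio and visual features.

    A gate of 0 replaces the modality with zeros; a gate of 1 passes it
    through unchanged. Text features are never gated.
    """
    gated_audio = [audio_gate * a for a in audio_vec]
    gated_visual = [visual_gate * v for v in visual_vec]
    return list(word_vec) + gated_audio + gated_visual


# Toy features for a single word-level time step (made-up values).
word = [0.2, -0.1, 0.4]
audio = [0.9, 0.3]
visual = [0.5, -0.7]

rng = random.Random(0)
p_audio = gate_probability(audio, weights=[1.0, -1.0], bias=0.0)
audio_gate = 1 if rng.random() < p_audio else 0  # sampled on/off decision

# The visual gate is forced closed here to show the effect of rejection.
embedding = gated_embedding(word, audio, visual, audio_gate, visual_gate=0)
```

The fused vector always has the same length regardless of the gate values, which keeps the LSTM input dimension fixed.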


3.2 LSTM with Temporal Attention 

The second component is a sentiment prediction model that captures the temporal interactions on the multimodal embedding layer. This component learns the global interactions between modalities for sentiment prediction. To do so, we use an LSTM with Temporal Attention (LSTM(A)). The Gated Multimodal Embedding is passed as input to the LSTM at each time step. An LSTM (Hochreiter and Schmidhuber, 1997) with a forget gate (Gers et al., 2000) is used to learn global temporal information on the multimodal input data.
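For reference, a standard LSTM with a forget gate can be written as below. This is the textbook formulation, given as a sketch: the symbols ($x_t$ for the gated multimodal input, $h_t$ for the output, $c_t$ for the memory cell, $m_t$ for the candidate update, and the per-gate weights and biases) are assumed notation, and the authors' exact parameterization may differ.

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_x^{(i)} x_t + W_h^{(i)} h_{t-1} + b^{(i)}\right), \\
f_t &= \sigma\!\left(W_x^{(f)} x_t + W_h^{(f)} h_{t-1} + b^{(f)}\right), \\
o_t &= \sigma\!\left(W_x^{(o)} x_t + W_h^{(o)} h_{t-1} + b^{(o)}\right), \\
m_t &= \tanh\!\left(W_x^{(m)} x_t + W_h^{(m)} h_{t-1} + b^{(m)}\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot m_t, \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```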


The LSTM has input, forget and output gates, a memory cell, and an output at each time step. Two weight matrices linearly transform the current input and the previous LSTM output, respectively, into the LSTM gate space, and the gates are applied using the Hadamard product (entry-wise product).

We use an attention model similar to (Wang, 2016) to selectively combine temporal information from the input modalities by adaptively predicting the time steps that matter most for the sentiment of a video clip. We expect information relevant to sentiment to have high attention weights. For example, if a person is “crying” or “laughing”, this information is relevant to the sentiment of the opinion and should have higher importance than a neutral word such as “movie”. This attention mechanism also allows the modalities to act as complementary information. In cases where language is not helpful, the model can adaptively focus on the presence of sentiment related non-verbal behaviors such as facial gestures and tone of voice.

Mathematically, we add a soft attention layer on top of the sequence of LSTM hidden outputs. The attention weights are obtained by multiplying the hidden layer at each time step with a shared vector and passing the resulting sequence through a softmax function.


The attention units are used to weight the importance of each time step’s hidden layer in the final sentiment prediction. The hidden units of the LSTM at all time steps are collected into a matrix, and the final sentiment prediction is obtained from their attention-weighted combination.
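One way to write the attention and prediction steps is shown below. The notation is assumed for illustration ($h_t$ for the LSTM output at time step $t$, $T$ for the number of time steps in a clip, $w$ for the shared vector, $\alpha_t$ for the attention weights, $r$ for the attended summary, $f$ for the dense layer, and $N$ clips with labels $y_i$ in the loss $L$); it follows the verbal description in this section rather than a recovered equation.

```latex
\begin{aligned}
\alpha_t &= \frac{\exp\!\left(w^{\top} h_t\right)}{\sum_{s=1}^{T} \exp\!\left(w^{\top} h_s\right)}, \qquad
H = [\,h_1, \dots, h_T\,], \\
r &= H \alpha = \sum_{t=1}^{T} \alpha_t\, h_t, \qquad
\hat{y} = f(r), \\
L &= \frac{1}{N} \sum_{i=1}^{N} \left|\hat{y}_i - y_i\right| .
\end{aligned}
```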


The attention-weighted combination is passed through a dense layer with a non-linear activation to produce the prediction. We select Mean Absolute Error (MAE) as the loss function. Though Mean Squared Error (MSE) is a more popular choice of loss function, MAE is a common criterion for sentiment analysis (Zadeh et al., 2016a).
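The attention, prediction and loss steps can be sketched in a few lines. This is an illustrative example under stated assumptions, not the authors' code: the function names, the final linear output unit after the ReLU layer, and all numeric values are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden_states, shared_vector):
    """Return attention weights and the attention-weighted sum of hidden states."""
    scores = [sum(w * h for w, h in zip(shared_vector, h_t)) for h_t in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    summary = [sum(a * h_t[d] for a, h_t in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, summary


def dense_relu_then_linear(summary, w_hidden, b_hidden, w_out, b_out):
    """Fully-connected ReLU layer followed by a linear output (assumed head)."""
    hidden = [max(0.0, sum(w * s for w, s in zip(row, summary)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def mean_absolute_error(predictions, targets):
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / len(targets)


# Three time steps with two hidden units each (made-up values).
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.0, -0.2]]
attention, summary = temporal_attention(hidden_states, shared_vector=[1.0, 1.0])
prediction = dense_relu_then_linear(
    summary,
    w_hidden=[[1.0, 0.0], [0.0, 1.0]], b_hidden=[0.0, 0.0],
    w_out=[1.0, -1.0], b_out=0.0,
)
toy_mae = mean_absolute_error([1.0, -1.0], [0.5, -2.0])
```

In this toy input the second time step has the largest score, so it receives the largest attention weight and dominates the summary vector.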


Figure 1 shows the full structure of the GME-LSTM(A) model.

3.3 Training Details for GME-LSTM(A)

To train the GME-LSTM(A), we need to know how the output decisions of the controller affect the performance of our LSTM(A) model. Given the weights of the gate controller and the audio and visual input data, the controller decides whether each input should be rejected or accepted. Rejected inputs are replaced with zeros, while accepted inputs are left unchanged. In this way we obtain the new gated inputs. After we train the LSTM(A) with the new inputs, we measure its MAE loss on the validation set. This loss can be seen as an indicator of how well the controller affects the performance of the model. Note that lower MAE implies better performance, so we use the negative MAE as the reward signal to train the controllers.

Take the visual controller as an example: we maximize the expected reward with respect to the controller weights, where the expectation is taken over the gate decisions sampled for all time steps in the dataset. The sentiment prediction MAE in the reward signal is non-convex and non-differentiable with respect to the parameters of the GME, since changes in the outputs of the GME change the MAE in a discrete manner. Straightforward gradient descent methods will not explore all the possible regions of the function. This form of problem has been studied in reinforcement learning, where policy gradient methods balance exploration and optimization by randomly sampling many possible outputs of the GME controller before optimizing for best performance. Specifically, the REINFORCE algorithm (Williams, 1992) is used to iteratively update the controller weights.

The REINFORCE gradient is approximated empirically by sampling the outputs of the controller (Zoph and Le, 2016): the controller samples several different input datasets, the model is trained on each sampled input set, and the MAE on the validation dataset is recorded for each.

In order to reduce the variance of this estimate, we employ a baseline function (Zoph and Le, 2016), which is an exponential moving average of the previous MAEs on the validation set.

The detailed training algorithm for the visual input gate controller is shown in Algorithm 1. The acoustic input gate is trained in the same manner.
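The training objective and its gradient estimate can be summarized as follows for the visual controller. The notation is assumed for illustration, following the form used by Zoph and Le (2016): $\theta_v$ are the controller weights, $c_{v_t}$ is the gate decision at time step $t$, $T$ is the total number of time steps in the dataset, $L$ is the validation MAE, $m$ is the number of sampled input sets, $L_k$ is the validation MAE for the $k$-th sample, and $b$ is the moving-average baseline.

```latex
\begin{aligned}
J(\theta_v) &= \mathbb{E}_{P(c_{v_{1:T}};\, \theta_v)}\!\left[-L\right], \\
\nabla_{\theta_v} J(\theta_v) &= \sum_{t=1}^{T}
  \mathbb{E}_{P(c_{v_{1:T}};\, \theta_v)}\!\left[\nabla_{\theta_v} \log P\!\left(c_{v_t};\, \theta_v\right)\,(-L)\right], \\
\nabla_{\theta_v} J(\theta_v) &\approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T}
  \nabla_{\theta_v} \log P\!\left(c_{v_t}^{(k)};\, \theta_v\right)\,(-L_k), \\
\nabla_{\theta_v} J(\theta_v) &\approx \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T}
  \nabla_{\theta_v} \log P\!\left(c_{v_t}^{(k)};\, \theta_v\right)\,(b - L_k).
\end{aligned}
```

The last line is the variance-reduced estimate: a sampled gating pattern is reinforced when its validation MAE is lower than the running baseline and discouraged when it is higher.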

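function trainGateController()
      for each training iteration do
            for each sampled input set do
                 for each time step in the dataset do
                        pass the visual input with the probability given by the controller; otherwise replace it with zeros
                 end for
                 train the LSTM(A) on the gated inputs and record the validation MAE
            end for
            update the baseline and update the controller weights with the REINFORCE gradient
      end for
end function
Algorithm 1 Train gate controller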
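The following sketch shows the shape of a REINFORCE update with a moving-average baseline for a single gate controller. It is a toy under explicit assumptions, not the authors' training code: `toy_validation_mae` is a hypothetical stand-in for training the LSTM(A) and measuring validation MAE, the controller is a single logistic unit over a scalar feature, and the hyperparameter values are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_validation_mae(gates, is_noisy):
    """Hypothetical stand-in for training LSTM(A) and measuring validation MAE.

    Each wrong decision (passing a noisy input or blocking a clean one)
    raises the loss by 1 / len(gates).
    """
    mistakes = sum(1 for g, noisy in zip(gates, is_noisy) if g == int(noisy))
    return 1.0 + mistakes / len(gates)


def train_gate_controller(features, is_noisy, steps=200, num_samples=5,
                          learning_rate=0.5, decay=0.9, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # controller weights (single logistic unit)
    baseline = None      # exponential moving average of previous losses
    for _ in range(steps):
        grad_w, grad_b, losses = 0.0, 0.0, []
        for _ in range(num_samples):
            probs = [sigmoid(w * x + b) for x in features]
            gates = [1 if rng.random() < p else 0 for p in probs]
            loss = toy_validation_mae(gates, is_noisy)
            losses.append(loss)
            # Reward is the negative loss; subtracting the baseline gives
            # (baseline - loss). No update until a baseline exists.
            advantage = 0.0 if baseline is None else baseline - loss
            # For a Bernoulli gate with p = sigmoid(z): d log P / dz = g - p.
            for x, g, p in zip(features, gates, probs):
                grad_w += advantage * (g - p) * x
                grad_b += advantage * (g - p)
        # Gradient ascent on the expected reward.
        w += learning_rate * grad_w / num_samples
        b += learning_rate * grad_b / num_samples
        mean_loss = sum(losses) / num_samples
        baseline = (mean_loss if baseline is None
                    else decay * baseline + (1.0 - decay) * mean_loss)
    return w, b


# Scalar feature: +1.0 marks a noisy input, -1.0 a clean one (toy encoding).
features = [1.0, -1.0, 1.0, -1.0]
is_noisy = [True, False, True, False]
w, b = train_gate_controller(features, is_noisy)
p_noisy = sigmoid(w * 1.0 + b)    # probability of passing a noisy input
p_clean = sigmoid(w * -1.0 + b)   # probability of passing a clean input
```

In this toy setting the controller should learn to pass clean inputs more often than noisy ones. In the real system each loss evaluation involves training the LSTM(A), which is why only a handful of samples are drawn per update.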

4 Experimental Methodology

In this section, we describe the experimental methodology including the dataset, data splits for training, validation and testing, the input features and their preprocessing, the experimental details and finally review the baseline models that we compare our results to.

4.1 CMU-MOSI Dataset

We test on the Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset (Zadeh et al., 2016a), which is a collection of online videos in which a speaker expresses his or her opinions towards a movie. Each video is split into multiple clips, and each clip contains one opinion expressed in one or more sentences. A clip has one sentiment label, which is a continuous value representing the speaker’s sentiment towards a certain aspect of the movie. Figure 2 depicts a snapshot from the CMU-MOSI dataset.

The CMU-MOSI dataset consists of 93 videos / 2199 labeled clips and training is performed on the labeled clips. Each video in the CMU-MOSI dataset is from a different speaker. We use the train and test sets defined in (Wang et al., 2016) which trains on 52 videos/1284 clips (52 distinct speakers), validates on 10 videos/229 clips (10 distinct speakers) and tests on 31 videos/686 clips (31 distinct speakers). There is no speaker dependent contamination in our experiments so our model is generalizable and learns speaker-independent features.

Figure 2: A snapshot from the CMU-MOSI dataset, where text, visual and audio features are aligned. For example, in the bottom row of Figure 2, the first scene is labeled with text - the speaker is currently saying the word “It”, this is aligned with the video clip of her speaking that word where she looks excited.

4.2 Input Features

We use text, video, and audio as input modalities for our task. For text inputs, we use pre-trained word embeddings (glove.840B.300d) (Pennington et al., 2014) to convert the transcripts of videos in the CMU-MOSI dataset into word vectors. This is a 300 dimensional word embedding trained on 840 billion tokens from the Common Crawl dataset. For audio inputs, we use COVAREP (Degottex et al., 2014) to extract acoustic features including 12 Mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters and maxima dispersion quotients. For video inputs, we use Facet (iMotions, 2017) and OpenFace (Baltrušaitis et al., 2016; Zadeh et al., 2017) to extract a set of features including facial action units, facial landmarks, head pose, gaze tracking and HOG features (Zhu et al., 2006).

4.3 Implementation Details

Before training, we select the best 20 features from Facet and 5 from COVAREP using univariate linear regression tests. The selected Facet and COVAREP features are linearly normalized by the maximum absolute value in the training set.
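A minimal sketch of this preprocessing is given below, assuming that the univariate linear regression test ranks each feature by the F-statistic of a one-variable regression against the sentiment label. The helper names and toy data are hypothetical, and the example selects 2 of 3 features rather than 20 or 5.

```python
import math


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def f_statistic(xs, ys):
    """F-statistic of a univariate linear regression of ys on xs."""
    n = len(xs)
    r2 = min(pearson_r(xs, ys) ** 2, 1.0 - 1e-12)  # avoid division by zero
    return r2 / (1.0 - r2) * (n - 2)


def select_top_k(columns, targets, k):
    """Return the sorted indices of the k features with the highest F-statistic."""
    scores = [f_statistic(col, targets) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def max_abs_scale(train_col, other_col):
    """Scale both columns by the maximum absolute value seen in training."""
    scale = max(abs(v) for v in train_col) or 1.0
    return [v / scale for v in train_col], [v / scale for v in other_col]


# Toy data: five clips, three candidate features (made-up values).
targets = [-2.0, -1.0, 0.0, 1.0, 2.0]
columns = [
    [-4.0, -2.0, 0.0, 2.0, 4.0],   # perfectly correlated with the label
    [1.0, -1.0, 1.0, -1.0, 1.0],   # uncorrelated with the label
    [0.0, 0.5, 0.4, 1.5, 1.6],     # strongly but not perfectly correlated
]
selected = select_top_k(columns, targets, k=2)
scaled_train, scaled_test = max_abs_scale(columns[0], [8.0])
```

Note that the scale is computed on the training column only, so held-out values can fall outside the range from -1 to 1, as the test value does here.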

For the LSTM(A) model, we set the number of hidden units of the LSTM to 64. The maximum sequence length of the LSTM is 115. There are 50 units in the ReLU fully connected layer. The model is trained using ADAM (Kingma and Ba, 2014) with learning rate 0.0005 and MAE (mean absolute error) as the loss function.

For the GME-LSTM(A) model, the visual and audio controllers are each implemented as a neural network with one hidden layer of 32 units and sigmoid activation. The number of samples generated from the controller at each training step is 5. Each sampled LSTM(A) model is trained using ADAM (Kingma and Ba, 2014) with learning rate 0.0005 and MAE (mean absolute error) as the loss function. The input gate controller is then trained using ADAM (Kingma and Ba, 2014) with learning rate 0.0001.

5 Experimental Results

5.1 Baseline Models

We compare the performance of our methods to the following state-of-the-art multimodal sentiment analysis models:

SAL-CNN (Select-Additive Learning CNN) (Wang et al., 2016) is a multimodal sentiment analysis model that attempts to prevent identity-dependent information from being learned so as to improve generalization based only on accurate indicators of sentiment.

SVM-MD (Support Vector Machine Multimodal Dictionary) (Zadeh et al., 2016b) is an SVM trained for classification or regression on multimodal concatenated features for each video clip.

C-MKL (Convolutional Multiple Kernel Learning) (Poria et al., 2015b) is a multimodal sentiment analysis model which uses a CNN for textual feature extraction and multiple kernel learning for prediction.

RF (Random Forest) is a baseline intended for comparison to a non neural network approach.

Random is a baseline which always predicts a random sentiment intensity between -3 and 3 (Zadeh et al., 2016b). This is designed as a lower bound to compare model performance.

Human performance was recorded when humans are asked to predict the sentiment score of each opinion segment (Zadeh et al., 2016b). This acts as a future target for machine learning methods.

Since sentiment analysis based on language has been well-studied, we also compare our methods with following text-based models:

RNTN (Recursive Neural Tensor Network) (Socher et al., 2013) is a well-known sentiment analysis method that leverages the sentiment of words and their dependency structure.

DAN (Deep Average Network) (Iyyer et al., 2015) is a simple but efficient sentiment analysis model that uses information only from distributional representation of the words.

D-CNN (DynamicCNN) (Kalchbrenner et al., 2014) is among the state of the art models in text-based sentiment analysis which uses a convolutional architecture adopted for the semantic modeling of sentences.

Finally, any model with “text” appended denotes the model trained only on the textual modality of the CMU-MOSI video clips.

5.2 Results

In this section, we summarize the results on multimodal sentiment analysis. In Table 1, we compare our proposed approaches with previous state-of-the-art multimodal as well as language-based baseline models for sentiment analysis (described in Section 5.1).

Modalities   Method                                Acc      F-score   MAE
Text         RNTN (Socher et al., 2013)            (73.7)   (73.4)    (0.990)
Text         DAN (Iyyer et al., 2015)              70.0     69.4      -
Text         D-CNN (Kalchbrenner et al., 2014)     69.0     65.1      -
Text         SAL-CNN text (Wang et al., 2016)      73.5     -         -
Text         SVM-MD text (Zadeh et al., 2016b)     73.3     72.1      1.186
Text         RF text (Zadeh et al., 2016b)         57.6     57.5      -
Text         LSTM text (ours)                      67.8     51.2      1.234
Text         LSTM(A) text (ours)                   71.3     67.3      1.062
Multimodal   Random                                50.2     48.7      1.880
Multimodal   SAL-CNN (Wang et al., 2016)           73.0     -         -
Multimodal   SVM-MD (Zadeh et al., 2016b)          71.6     72.3      1.100
Multimodal   C-MKL (Poria et al., 2015b)           73.5     -         -
Multimodal   RF (Zadeh et al., 2016b)              57.4     59.0      -
Multimodal   LSTM (ours)                           69.4     63.7      1.245
Multimodal   LSTM(A) (ours)                        75.7     72.1      1.019
Multimodal   GME-LSTM(A) (ours)                    76.5     73.4      0.955
             Human                                 85.7     87.5      0.710
Table 1: Sentiment prediction results on test set using different text-based and multimodal methods. Numbers are reported in binary classification accuracy (Acc), F-score and MAE, and the best scores are highlighted in bold (excluding human performance). shows improvement over the state-of-the-art. Results for RNTN are parenthesized because the model was trained on the Stanford Sentiment Treebank dataset (Socher et al., 2013) which is much larger than CMU-MOSI.

The multimodal section of Table 1 shows the performance of our two proposed approaches compared to other multimodal baseline methods. Both the proposed GME-LSTM(A) model and the version without the gated controller, LSTM(A), outperform the multimodal and single modality sentiment analysis models. The GME-LSTM(A) model gives the best result achieved across all models, improving upon the state of the art by 4.08% in binary classification accuracy and 13.2% in MAE. Since GME-LSTM(A) is able to attend both in time, using soft attention, and in input modality, using the Gated Multimodal Embedding Layer, it is not a surprise that this model outperforms all others.
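The reported improvements are consistent with relative changes computed from Table 1, assuming the reference points are the best previous accuracy of 73.5 and the best previous multimodal MAE of 1.100 (the parenthesized RNTN results are excluded). This pairing is an inference from the table, not something the text states explicitly:

```latex
\begin{aligned}
\text{Accuracy:}\quad & \frac{76.5 - 73.5}{73.5} = \frac{3.0}{73.5} \approx 0.0408 = 4.08\%, \\
\text{MAE:}\quad & \frac{1.100 - 0.955}{1.100} = \frac{0.145}{1.100} \approx 0.132 = 13.2\% .
\end{aligned}
```

Both figures are therefore relative improvements, not absolute differences in percentage points.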

The language section of Table 1 shows that LSTM(A) on a single modality, language, obtains slightly worse performance than some language-based methods. This is because these methods use more complicated language models such as dependency-based parse trees. However, by combining cues from audio and video with careful multimodal fusion, GME-LSTM(A) outperforms all language-based and multimodal baseline models. This jump in performance shows that good temporal attention and multimodal fusion are key: our model benefits from the addition of input modalities more than other models did.

6 Discussion

In this section, we analyze the usefulness of our model’s different components, demonstrating that the Temporal Attention Layer and the Gated Multimodal Embedding over input modalities are both crucial towards multimodal fusion and sentiment prediction.

6.1 LSTM with Temporal Attention Analysis

Method    Modalities             Acc    F-score   MAE
LSTM      text                   67.8   51.2      1.234
LSTM      audio                  44.9   61.9      1.511
LSTM      video                  44.9   61.9      1.505
LSTM      text + audio           66.8   55.3      1.211
LSTM      text + video           63.0   65.6      1.302
LSTM      text + audio + video   69.4   63.7      1.245
Table 2: Sentiment prediction results on test set using LSTM model with different combinations of modalities. Numbers are reported in binary classification accuracy (Acc), F-score and MAE, and the best scores are highlighted in bold.

Method    Modalities             Acc    F-score   MAE
LSTM(A)   text                   71.3   67.3      1.062
LSTM(A)   audio                  55.4   63.0      1.451
LSTM(A)   video                  52.3   57.3      1.443
LSTM(A)   text + audio           73.5   70.3      1.036
LSTM(A)   text + video           74.3   69.9      1.026
LSTM(A)   text + audio + video   75.7   72.1      1.019
Table 3: Sentiment prediction results on test set using LSTM(A) model with different combinations of modalities. Numbers are reported in binary classification accuracy (Acc), F-score and MAE, and the best scores are highlighted in bold.

Language is most important in predicting sentiment. In both the LSTM model (Table 2) and the LSTM(A) model (Table 3), using only the text modality provides a better sentiment prediction than using unimodal audio and visual modalities.

Acoustic and visual modality are noisy. When we provide additional modalities to the LSTM model without attention (Table 2), the performance does not improve significantly. Using all three modalities actually leads to slightly worse performance in F-score and MAE as compared to using fewer input modalities. This allows us to deduce that the audio and video features are probably noisy and may hurt the model’s performance if multimodal fusion is not carefully performed.

Temporal Attention improves sentiment prediction. On the other hand, when we use the LSTM(A) model, Table 3 shows that adding more modalities improves sentiment regression and classification. The LSTM(A) (Table 3) consistently outperforms the LSTM (Table 2) across all modality combinations. We hypothesize that by using temporal attention, the model will assign the largest attention weights to time steps where all 3 modalities give strong, consistent sentiment predictions and abandon noisy frames altogether. As a result, temporal attention improves sentiment prediction despite the presence of noisy acoustic and visual modalities.

Successful cases of the LSTM(A) model. To obtain a better insight into our model’s performance, we provide some successful cases to demonstrate the contribution of the Temporal Attention Layer. By further studying the attention weights in the LSTM(A), we can find which words/time steps the model focuses on. The following are examples of successful cases when we look at the textual modality alone. Each line represents a single transcript and the bold word indicates the word which the model assigns the highest attention weight to.

I thought it was fun.

And she really enjoyed the film.

But a lot of the footage was kind of unnecessary.

The highlighted words are all words that are good indicators of positive or negative sentiment.

He’s not gonna be looking like a chirper bright young man but early thirties really you want me to buy that.

Visual modality: Looks disappointed

LSTM sentiment prediction: 1.24

LSTM(A) sentiment prediction: -0.94

Ground truth sentiment: -1.8

Figure 3: Successful case 1: Although the highest weighted word extracted from the transcript (top) is “want”, with ambiguous sentiment, the LSTM(A) leverages the visual modality (center), where the speaker looks disappointed, to make a prediction on video sentiment much closer to ground truth (bottom).

The only actor who can really sell their lines is Erin.

Visual modality: Looks sad

LSTM sentiment prediction: 1.86

LSTM(A) sentiment prediction: -0.3

Ground truth sentiment: -1.0

Figure 4: Successful case 2: Although the highest weighted word extracted from the transcript (top) is “lines”, with ambiguous sentiment, the LSTM(A) leverages the visual modality (center), where the speaker looks sad, to make a prediction on video sentiment much closer to ground truth (bottom).

The LSTM(A) model combines word meanings with audio and visual indicators. Figure 3 and Figure 4 are examples where the LSTM(A) model is successful when we use all 3 modalities. In these examples, the LSTM(A) model is able to leverage the word-level alignment of audio and visual modalities to overcome the ambiguity in the corresponding aligned word. As a result, the LSTM(A) model determines overall video sentiment more accurately than the LSTM model without attention.
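The improvement in these two cases can be quantified as the absolute error of each prediction. The short sketch below uses only the values reported in Figures 3 and 4; the dictionary layout and the function name are illustrative choices.

```python
# Predictions and ground truth copied from Figures 3 and 4.
cases = {
    "Figure 3": {"LSTM": 1.24, "LSTM(A)": -0.94, "truth": -1.8},
    "Figure 4": {"LSTM": 1.86, "LSTM(A)": -0.3, "truth": -1.0},
}

def absolute_errors(case):
    """Absolute error of each model's prediction against the ground truth."""
    truth = case["truth"]
    return {name: abs(pred - truth)
            for name, pred in case.items() if name != "truth"}

for label, case in cases.items():
    errors = absolute_errors(case)
    print(label, {name: round(err, 2) for name, err in errors.items()})
```

The absolute error drops from 3.04 to 0.86 in Figure 3 and from 2.86 to 0.70 in Figure 4.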

Word-level fusion enables fine-grained multimodal analysis. We see that the model is indeed capturing the meaning of words and implicitly classifying them based on their sentiment: positive, negative or neutral. For neutral words, the model correctly looks at the aligned visual and audio modalities to make a prediction. Therefore, the model is learning the indicators of sentiment from facial gestures and tone of voice as well. This is a benefit of word-level fusion, since we can examine exactly what the model is learning at a finer resolution.

6.2 Gated Multimodal Embedding Analysis

Gated Multimodal Embedding helps multimodal fusion. The LSTM(A) model is still susceptible to noisy modalities. Table 4 shows that the GME-LSTM(A) model outperforms the LSTM(A) model on all metrics, indicating that there is value in selectively attending to modalities using the Gated Multimodal Embedding.

Method Modalities Acc F-score MAE
LSTM text + audio + video 69.4 63.7 1.245
LSTM(A) text + audio + video 75.7 72.1 1.019
GME-LSTM(A) text + audio + video 76.5 73.4 0.955
Table 4: Sentiment prediction results on test set using LSTM, LSTM(A) and GME-LSTM(A) multimodal models. Numbers are reported in binary classification accuracy (Acc), F-score and MAE, and the best scores are highlighted in bold.
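For readers who wish to compute the same three metrics from real-valued sentiment predictions, a minimal sketch follows. It assumes that binary labels are obtained by thresholding scores at 0 (with 0 treated as positive) and that the F-score is the F1 of the positive class; both conventions are assumptions of this sketch rather than a restatement of our evaluation protocol, and the toy values are made up for illustration.

```python
def sentiment_metrics(preds, truths):
    """Binary accuracy, F1 of the positive class, and mean absolute error.

    Assumption: a score >= 0 counts as positive sentiment.
    """
    n = len(truths)
    mae = sum(abs(p - t) for p, t in zip(preds, truths)) / n

    pred_pos = [p >= 0 for p in preds]
    true_pos = [t >= 0 for t in truths]
    acc = sum(a == b for a, b in zip(pred_pos, true_pos)) / n

    tp = sum(a and b for a, b in zip(pred_pos, true_pos))
    fp = sum(a and not b for a, b in zip(pred_pos, true_pos))
    fn = sum(b and not a for a, b in zip(pred_pos, true_pos))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"acc": acc, "f1": f1, "mae": mae}

# Toy values, not taken from the dataset.
toy_preds = [1.5, -0.5, 0.3, -2.0]
toy_truths = [2.0, 1.0, -1.0, -1.5]
print(sentiment_metrics(toy_preds, toy_truths))
```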

GME-LSTM(A) model correctly selects helpful modalities. To obtain a better insight into the effect of the Gated Multimodal Embedding Layer, a successful example is shown in Figure 5, where the input gate controller for the visual modality correctly identifies frames where obvious facial expressions are displayed, and rejects those with a blank expression.

LSTM(A) sentiment prediction: -2.00

GME-LSTM(A) sentiment prediction: 1.48

Ground truth sentiment: 1.2

Figure 5: Successful case 1: Across the entire video, the speaker’s facial features were rather monotonic except for one frame where she smiled brightly (left). Our visual input gate rejects the visual input at time steps before and after, but allows this frame to pass since the speaker is displaying obvious facial gestures. The prediction was much closer to ground truth as compared to without the input gate controller (right).
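The effect of the input gates on the word-level input can be sketched as follows. This is a simplified illustration: the gate decisions are supplied by hand, whereas in our model they are produced by the input gate controllers. The function name, feature dimensions and the choice to pass text through ungated are assumptions of the sketch.

```python
import numpy as np

def gated_multimodal_embedding(text, audio, visual, audio_gate, visual_gate):
    """Build per-word inputs, zeroing out rejected audio/visual features.

    text:   (T, d_text) word features, passed through ungated (assumption).
    audio:  (T, d_audio) word-aligned acoustic features.
    visual: (T, d_visual) word-aligned visual features.
    audio_gate, visual_gate: (T,) arrays of 0/1 decisions per time step.
    Returns an array of shape (T, d_text + d_audio + d_visual).
    """
    gated_audio = audio * audio_gate[:, None]     # broadcast gate over features
    gated_visual = visual * visual_gate[:, None]
    return np.concatenate([text, gated_audio, gated_visual], axis=1)

# Three time steps; only the middle frame is allowed through the visual gate,
# mimicking the single expressive frame described in Figure 5.
T = 3
text = np.ones((T, 2))
audio = np.ones((T, 4))
visual = np.ones((T, 5))
fused = gated_multimodal_embedding(
    text, audio, visual,
    audio_gate=np.array([1, 1, 1]),
    visual_gate=np.array([0, 1, 0]),
)
print(fused.shape)
```

Rejected time steps contribute zeros in the visual slice of the fused input, so the downstream LSTM sees only the text and audio features at those steps.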

GME-LSTM(A) model correctly rejects noisy modalities. We now revisit a failure case of the LSTM(A) model, where the speaker is covering her mouth during the word that gives the best sentiment prediction, “cute” (Figure 6). The LSTM(A) model focuses on an uninformative time step and makes a poor sentiment prediction. In other words, the model may be confused if the added visual and audio modalities are uninformative or noisy. We found that the Gated Multimodal Embedding correctly rejects the noisy visual input at the time step of “cute” and the GME-LSTM(A) model gives a sentiment prediction closer to the ground truth (Figure 6). This is a good example where the GME-LSTM(A) model directly tackles the problem that motivated its development: noisy modalities that hurt performance when multimodal fusion is not carefully performed. Specifically, the GME-LSTM(A) model was able to learn that the visual modality was mismatched with the textual modality, recognizing that the visual modality was noisy while the corresponding word was a good indicator of positive speaker sentiment.

First of all I’d like to say little James or Jimmy he’s so cute he’s so …

Visual modality: Hands cover mouth

LSTM sentiment prediction: 1.23

LSTM(A) sentiment prediction: -0.94

GME-LSTM(A) sentiment prediction: 1.57

Ground truth sentiment: 3.0

Figure 6: Successful case 2: The LSTM(A) extracts the wrong word from the sentence, selecting “little” instead of the more informative word “cute” (top). Upon inspection, the speaker is covering her mouth when the word “cute” is spoken (center), which leads to less attention weight on the word “cute” since the modalities are not consistently strong at that frame. As a result, the LSTM(A) model makes a prediction on video sentiment that is further away from ground truth (bottom). However, the Gated Multimodal Embedding correctly rejects the noisy visual input at the time step of “cute” (bottom). Including the Gated Multimodal Embedding brings the sentiment prediction back closer to ground truth.

7 Conclusion

In this paper, we proposed the Gated Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model for multimodal sentiment analysis. Our approach is the first of its kind to perform multimodal fusion at the word level. Furthermore, to build a model that is suitable for the complex structure of speech, we introduced selective word-level fusion between modalities using a gating mechanism trained with reinforcement learning. We use an attention model to direct the focus of our model to important moments in speech. The stateful nature of our model allows long interactions between different modalities to be captured. We show state-of-the-art performance on the CMU-MOSI dataset and provide a qualitative analysis of how our model deals with various challenges of understanding communication dynamics.

