# Speech Emotion Recognition with Dual-Sequence LSTM Architecture

###### Abstract

Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that combines handcrafted and raw features for audio signals. Each utterance is preprocessed into a handcrafted input and two mel-spectrograms at different time-frequency resolutions. An LSTM processes the handcrafted input, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%—a 6% improvement over current state-of-the-art models—and is comparable with multimodal SER models that leverage textual information.

Jianyou Wang, Michael Xue, Ryan Culhane, Enmao Diao, Jie Ding, Vahid Tarokh†

Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA

School of Statistics, University of Minnesota Twin Cities, Minneapolis, MN, USA

† This work was supported in part by Office of Naval Research Grant No. N00014-18-1-2244.

**Keywords:** Speech Emotion Recognition, Mel-Spectrogram, LSTM, Dual-Sequence LSTM, Dual-Level Model

## 1 Introduction

As the field of Automatic Speech Recognition (ASR) rapidly matures, people are beginning to realize that the information conveyed in speech goes beyond its textual content. Recently, by employing deep learning, researchers have found promising directions within the topic of Speech Emotion Recognition (SER). As one of the most fundamental characteristics distinguishing intelligent life forms from the rest, emotion is an integral part of our daily conversations. From the broad perspective of general-purpose artificial intelligence, the ability to detect the emotional content of human speech has far-reaching applications and benefits. For instance, affective voice assistants could make human-machine conversations more engaging. Furthermore, the notion that machines can understand, and perhaps some day produce, emotions may profoundly change the way humans and machines interact.

Previous work on SER models for the benchmark IEMOCAP dataset [1] can generally be divided into two categories: unimodal and multimodal. Research that focuses on unimodal data uses only raw audio signals, whereas research on multimodal data leverages both audio signals and lexical information, and in some cases visual information as well. Not surprisingly, since they take advantage of more information, multimodal models generally outperform unimodal models by 6–7%. Unimodal models can be further subdivided into two categories. Prior to 2014, the common approach was to extract handcrafted features such as prosodic features and MFCCs from raw audio signals frame by frame [2] and then pass the resulting sequence through a recurrent layer. With the advent of deep learning, researchers now transform raw audio signals into spectrograms [3, 4], which are mapped into a latent time series through several convolutional layers before going through a recurrent layer.

Some researchers believe that audio data alone is not enough to make an accurate prediction [5], and thus many have turned to textual information as well. However, two utterances with the same textual content can have entirely different meanings when delivered with different emotions, so relying too heavily on textual information may lead to misleading predictions. In our view, the full potential of audio signals has not yet been realized, and we propose several changes to the existing state-of-the-art framework for unimodal SER [6, 7].

In this paper, we make three major contributions to the existing unimodal SER framework. First, we propose a dual-level model that contains a neural network for the spectrogram input and another neural network for the handcrafted input, both of which are independent of each other until the final classification, but are trained jointly. We have found that the combination of these two networks provides a significant increase in accuracy. Second, inspired by the time-frequency trade-off [8], from each utterance we calculate two mel-spectrograms of different time-frequency resolutions instead of just one. Since these two spectrograms contain complementary information—namely, one has a better resolution along the time axis and the other has a better resolution along the frequency axis—we propose a novel variant of the LSTM [9], denoted the Dual-Sequence LSTM (DS-LSTM), that can process two sequences of data simultaneously. Third, we propose a novel mechanism for data preprocessing that uses nearest-neighbor interpolation to address the problem of variable lengths across audio signals. We have found that interpolation works better than the more typical methods of truncation and padding, which respectively lose information and increase the computational cost.

The outline of this paper is given next. First, we discuss how we preprocess the data into a handcrafted input as well as two mel-spectrograms of different time-frequency resolutions. Next, we introduce our proposed dual-level model and describe how it processes these inputs and eventually classifies the utterance. Finally, we outline our experimental procedure and how our model compares to state-of-the-art methods as well as several baseline methods.

## 2 Research Methodology

### 2.1 Dataset Description

We used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [1] in this work, a benchmark dataset containing about 12 hours of audio and video data, as well as text transcriptions. The dataset contains five sessions, each of which involves two distinct professional actors conversing with one another in both scripted and improvised manners. We utilize data from both scripted and improvised conversations, and only audio data, to stay consistent with the vast majority of prior work. We train and evaluate our model on four emotions: happy, neutral, angry, and sad, resulting in a total of 5531 utterances (happy: 29.5%, neutral: 30.8%, angry: 19.9%, sad: 19.5%). These 5531 utterances constitute the dataset used in all subsequent experiments.

### 2.2 Preprocessing

To extract handcrafted features, we used the openSMILE toolkit, a software package that automatically extracts features from an audio signal. Using the MFCC12_E_D_A configuration file, we extract 13 Mel-Frequency Cepstral Coefficients (MFCCs), as well as 13 delta and 13 acceleration coefficients, for a total of 39 acoustic handcrafted features. These features are extracted from 25 ms frames, resulting in a sequence of 39-dimensional handcrafted feature vectors per utterance.
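The frame-level extraction above implies slicing each waveform into overlapping 25 ms windows before computing features. A minimal sketch of that framing step is below; the 10 ms hop and 16 kHz sample rate are illustrative assumptions (typical defaults, not stated in the text), and the feature computation itself is left to openSMILE:

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Slice a 1-D audio signal into overlapping frames.
    frame_ms matches the paper's 25 ms windows; the 10 ms hop and
    16 kHz rate are assumptions for illustration only."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # gather indices for every frame at once via broadcasting
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]                      # shape: (n_frames, frame_len)

frames = frame_signal(np.zeros(16000))      # one second of silence
print(frames.shape)                         # (98, 400)
```

Each of the 98 frames would then yield one 39-dimensional feature vector, giving the per-utterance sequence fed to the handcrafted-input LSTM.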

For each utterance, we also derive two mel-spectrograms of different time-frequency resolutions, instead of just one as done in previous research. One is a mel-scaled spectrogram with a narrower window and thus a better time resolution, while the other uses a wider window and thus has a better frequency resolution. In our work, the two spectrograms are calculated from short-time Fourier transforms with 256 and 512 FFT points, respectively.
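The time-frequency trade-off between the two FFT sizes can be made concrete with a small shape calculation. This sketch assumes a hop length of one quarter of the window (a common STFT convention, not specified in the text) and a simplified frame count:

```python
def spec_shape(n_samples, n_fft, hop):
    """Approximate (frequency bins, time steps) of a one-sided STFT.
    Simplified frame count; real STFT implementations differ slightly
    depending on padding/centering conventions."""
    n_freq = n_fft // 2 + 1        # one-sided spectrum
    n_time = 1 + n_samples // hop
    return n_freq, n_time

# hop = n_fft // 4 is an assumed convention for both settings
print(spec_shape(16000, 256, 64))    # narrow window: fewer freq bins, more time steps
print(spec_shape(16000, 512, 128))   # wide window: more freq bins, fewer time steps
```

For one second of 16 kHz audio, the 256-point spectrogram has roughly twice the time steps but half the frequency bins of the 512-point one, which is exactly the complementary structure the DS-LSTM is designed to exploit.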

The standard methods for handling variable-length utterances are padding and truncation. Since there are rises and cadences in human conversation, we cannot assume the emotional content is uniformly distributed within each utterance; by truncating data, critical information is therefore inevitably lost. Padding, on the other hand, is computationally expensive. We instead propose nearest-neighbor interpolation: we interpolate each mel-spectrogram along the time axis to the median number of time steps across all spectrograms, followed by a logarithmic transformation.
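The interpolation step can be sketched as follows. The spectrogram sizes are toy values, and the small additive offset before the logarithm is an assumption to avoid `log(0)` (the text does not specify how zero bins are handled):

```python
import numpy as np

def resample_time_axis(spec, target_len):
    """Nearest-neighbor interpolation of a (freq, time) mel-spectrogram
    along the time axis, followed by the log transform from the text."""
    src_len = spec.shape[1]
    # map each target column to its nearest source column
    idx = np.round(np.linspace(0, src_len - 1, target_len)).astype(int)
    return np.log(spec[:, idx] + 1e-6)  # offset is an assumption, avoids log(0)

specs = [np.ones((128, n)) for n in (80, 100, 140)]   # toy spectrograms
target = int(np.median([s.shape[1] for s in specs]))  # median time steps
out = [resample_time_axis(s, target) for s in specs]
print([o.shape for o in out])   # all (128, 100)
```

Unlike truncation, every column of the original spectrogram can influence the output, and unlike padding, short utterances do not inflate the sequence length the recurrent network must process.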

### 2.3 Proposed Model

#### 2.3.1 Dual-Level Architecture

Our proposed dual-level architecture is illustrated in Figure 1. It contains two separate models, one for the handcrafted input and one for the two mel-spectrograms. Each of these two models has its own classification layer, and their outputs are averaged to make the final prediction. The loss function is likewise the average of the two models' cross-entropy losses.
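The averaging of the two branches can be sketched numerically. Whether the averaged classification-layer outputs are raw logits or softmax probabilities is not specified in the text, so averaging logits here is an assumption; the toy values are illustrative only:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class."""
    return -np.log(softmax(logits)[label])

# toy classification-layer outputs from the two branches (4 emotions)
logits_handcrafted = np.array([2.0, 0.5, -1.0, 0.0])
logits_spectrogram = np.array([1.5, 1.0, -0.5, 0.2])
label = 0  # index of the true emotion class

# final prediction: average the two branch outputs, then argmax
pred = int(np.argmax((logits_handcrafted + logits_spectrogram) / 2))
# joint training loss: average of the two cross-entropy losses
loss = (cross_entropy(logits_handcrafted, label)
        + cross_entropy(logits_spectrogram, label)) / 2
print(pred)   # 0
```

Because the averaged loss backpropagates into both branches, the two networks remain architecturally independent yet are optimized jointly, as the text describes.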

#### 2.3.2 LSTM for Handcrafted Input

#### 2.3.3 CNN for Mel-Spectrograms

As mentioned earlier, for each utterance we produce two mel-spectrograms with different time-frequency resolutions. We pass these two spectrograms into two independent 2D CNN blocks, each of which consists of two convolution and max-pooling layers. After both spectrograms go through these layers, they have different numbers of time steps, with the narrower-window spectrogram yielding a sequence roughly twice as long. Before passing both sequences into the DS-LSTM, we use an alignment procedure to ensure they have the same number of time steps: we take the average of adjacent time steps in the longer sequence. After alignment, both sequences have the same number of time steps.
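The alignment step can be sketched as follows, assuming (as the pairwise averaging implies) that the longer sequence has exactly twice as many time steps; the sequence lengths and feature sizes are toy values:

```python
import numpy as np

def align(seq_a, seq_b):
    """Average adjacent time steps of the longer sequence so both
    branches have equal length; assumes len(seq_a) == 2 * len(seq_b)."""
    t, d = seq_a.shape
    # group time steps into consecutive pairs and average each pair
    aligned = seq_a.reshape(t // 2, 2, d).mean(axis=1)
    return aligned, seq_b

a = np.arange(12.0).reshape(6, 2)   # 6 time steps, 2 features each
b = np.zeros((3, 2))                # 3 time steps, 2 features each
a2, b2 = align(a, b)
print(a2.shape, b2.shape)           # both (3, 2)
```

After this step the two sequences advance in lockstep, one pair of feature vectors per time step, which is what the DS-LSTM cell below consumes.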

#### 2.3.4 Dual-Sequence LSTM

Following the alignment operation, we obtain two sequences of data with the same number of time steps. The first comes from the mel-spectrogram with the narrower window, which records more information along the time axis, and the second comes from the mel-spectrogram with the wider window, which records more information along the frequency axis. It is entirely conceivable that the two sequences complement each other, and that it is therefore beneficial to process them through a recurrent network simultaneously.

As Figure 2 indicates, we propose a Dual-Sequence LSTM (DS-LSTM) that can process two sequences of data simultaneously. In the equations below, $\odot$ denotes the Hadamard product, $[\,\cdot\,;\,\cdot\,]$ the concatenation of vectors, $\sigma$ the sigmoid activation function, $\tanh$ the hyperbolic tangent activation function, and $\mathrm{BN}$ the recurrent batch normalization layer, which keeps a separate running mean and variance for each time step [10].

$$f_t = \sigma(\mathrm{BN}(W_f \cdot [h_{t-1}; x_t^A; x_t^B] + b_f)) \tag{1}$$

$$i_t^A = \sigma(\mathrm{BN}(W_i^A \cdot [h_{t-1}; x_t^A; x_t^B] + b_i^A)) \tag{2}$$

$$i_t^B = \sigma(\mathrm{BN}(W_i^B \cdot [h_{t-1}; x_t^A; x_t^B] + b_i^B)) \tag{3}$$

$$o_t = \sigma(\mathrm{BN}(W_o \cdot [h_{t-1}; x_t^A; x_t^B] + b_o)) \tag{4}$$

$$\tilde{C}_t^A = \tanh(W_C^A \cdot [h_{t-1}; x_t^A] + b_C^A) \tag{5}$$

$$\tilde{C}_t^B = \tanh(W_C^B \cdot [h_{t-1}; x_t^B] + b_C^B) \tag{6}$$

$$C_t = f_t \odot C_{t-1} + i_t^A \odot \tilde{C}_t^A + i_t^B \odot \tilde{C}_t^B \tag{7}$$

$$h_t = o_t \odot \tanh(C_t) \tag{8}$$

After the execution of (8), $h_t$ serves as the hidden state for the next time step, and also goes through a batch normalization layer to become the input to the next layer of the DS-LSTM at time $t$.

While an LSTM is a four-gated RNN, the DS-LSTM is a six-gated RNN, with one extra input gate at (3) and one extra intermediate memory cell at (6). The two intermediate memory cells $\tilde{C}_t^A$ and $\tilde{C}_t^B$ are derived from $x_t^A$ and $x_t^B$, respectively, with the intuition that $\tilde{C}_t^A$ captures more information along the time axis, while $\tilde{C}_t^B$ captures more information along the frequency axis. Empirical experiments suggest that the forget gate, the two input gates, and the output gate should incorporate the maximum amount of information, namely the concatenation of $h_{t-1}$, $x_t^A$, and $x_t^B$.

A recurrent batch normalization layer ($\mathrm{BN}$) is used to normalize the outputs of the forget gate, input gates, and output gate in order to speed up training and provide the model with a more robust regularization effect.
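One step of the six-gated cell can be sketched as follows. This is a reconstruction from the description above, not the authors' implementation: recurrent batch normalization is omitted for brevity, the bias terms are folded out, and the unweighted sum in the cell update is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ds_lstm_step(h, C, xA, xB, W):
    """One DS-LSTM step (BN and biases omitted for brevity)."""
    full = np.concatenate([h, xA, xB])               # max-information input
    f  = sigmoid(W["f"] @ full)                      # forget gate
    iA = sigmoid(W["iA"] @ full)                     # input gate, branch A
    iB = sigmoid(W["iB"] @ full)                     # extra input gate, branch B
    o  = sigmoid(W["o"] @ full)                      # output gate
    cA = np.tanh(W["cA"] @ np.concatenate([h, xA]))  # intermediate cell, time-rich
    cB = np.tanh(W["cB"] @ np.concatenate([h, xB]))  # intermediate cell, freq-rich
    C_new = f * C + iA * cA + iB * cB                # cell update (assumed unweighted sum)
    h_new = o * np.tanh(C_new)                       # new hidden state
    return h_new, C_new

rng = np.random.default_rng(0)
H, D = 8, 4                                          # toy hidden / input sizes
W = {k: rng.standard_normal((H, H + 2 * D)) for k in ("f", "iA", "iB", "o")}
W["cA"] = rng.standard_normal((H, H + D))
W["cB"] = rng.standard_normal((H, H + D))
h, C = np.zeros(H), np.zeros(H)
h, C = ds_lstm_step(h, C, rng.standard_normal(D), rng.standard_normal(D), W)
print(h.shape)   # (8,)
```

Note that the cell keeps a single hidden state and a single memory cell, so the two input streams are fused at every time step rather than only at a final merge layer.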

## 3 Experimental Setup and Results

### 3.1 Experimental Setup

For the CNN blocks used to process the mel-spectrograms, a 4×4 kernel is used without padding, and the max-pooling kernel is 2×2 with a 2×2 stride. The output channels of the two CNN layers are 64 and 16, respectively. For both the LSTM on the handcrafted input and the DS-LSTM on the mel-spectrograms, all hidden dimensions are 200, and each network is unidirectional with two layers. The weight and bias of the recurrent batch normalization layers are initialized to 0.1 and 0, respectively, as suggested by the original paper [10]. An Adam optimizer is used with the learning rate set to 0.0001.
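The effect of these layer settings on the sequence length can be checked with a small calculation, assuming stride-1 convolutions (not stated explicitly) and an illustrative input of 100 time steps:

```python
def conv_pool_out(n, k=4, pool=2):
    """Output length along one axis after a k x k convolution
    (no padding, stride 1 assumed) followed by pool x pool max
    pooling with stride pool."""
    return (n - k + 1) // pool

t = 100                    # toy number of spectrogram time steps
for _ in range(2):         # two conv + max-pool layers
    t = conv_pool_out(t)
print(t)                   # 22
```

Each conv-pool stage roughly halves the time axis, so the DS-LSTM sees a sequence a few times shorter than the interpolated spectrogram, keeping the recurrent computation manageable.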

### 3.2 Baseline Methods

Since several modifications are proposed, we create six baseline models that consist of various parts of the whole model in order to better evaluate the value of each modification.

Base 1: an LSTM-based model that takes the handcrafted input.

Base 2: a CNN+LSTM model whose inputs are mel-spectrograms computed with 256 FFT points. Inputs are passed through a CNN followed by an LSTM. Models of this kind are developed in [6] and [11].

Base 3: a CNN+LSTM model whose inputs are mel-spectrograms computed with 512 FFT points. Inputs are passed through a CNN followed by an LSTM. Note that the architecture is the same as Base 2.

Base 4: a combination of Base 2 and Base 3, whose inputs are both sequences of mel-spectrograms. In this model, two LSTMs process the two sequences separately, and their respective outputs are averaged to make the final classification. Note this is different from our proposed DS-LSTM, which processes the two sequences within a single DS-LSTM cell.

Base 5: A combination of models of Base 1 and Base 4.

Base 6: A combination of models of Base 1 and Base 2.

In addition to the above six baseline models, we propose two models: the DS-LSTM model and the dual-level model, which combines the DS-LSTM with the handcrafted-feature LSTM. We compare these models with the baseline models, as well as with four state-of-the-art models that use standard 5-fold cross-validation for evaluation.

### 3.3 Results and Analysis

| Model | Mean WA (%) | Mean UA (%) |
| --- | --- | --- |
| Base 1: LSTM (handcrafted) | 64.7 ± 1.4 | 65.5 ± 1.7 |
| Base 2: CNN+LSTM (256 FFT points) | 63.5 ± 1.6 | 64.5 ± 1.5 |
| Base 3: CNN+LSTM (512 FFT points) | 62.9 ± 1.0 | 64.3 ± 0.9 |
| Base 4 = Base 2 + Base 3 | 64.4 ± 1.8 | 65.2 ± 1.8 |
| Base 5 = Base 1 + Base 4 | 68.3 ± 1.3 | 69.3 ± 1.2 |
| Base 6 = Base 1 + Base 2 | 68.5 ± 0.8 | 68.9 ± 1.2 |
| D. Dai et al. (2019) [12] | 65.4 | 66.9 |
| S. Mao et al. (2019) [13] | 65.9 | 66.9 |
| R. Li et al. (2019) [6] | — | 67.4 |
| S. Yoon et al. (2018) [14]\* | 71.8 ± 1.9 | — |
| Proposed DS-LSTM | 69.4 ± 0.6 | 69.5 ± 1.1 |
| Proposed dual-level model | 72.7 ± 0.7 | 73.3 ± 0.8 |

Table 1: Mean weighted accuracy (WA) and unweighted accuracy (UA) of baseline, state-of-the-art, and proposed models. \* indicates the model uses textual information.

Table 1 clearly indicates that our proposed dual-level model outperforms all baseline models by at least 4.2% in mean weighted accuracy and by at least 4.0% in mean unweighted accuracy. It also outperforms state-of-the-art unimodal SER models by at least 6.8% in mean weighted accuracy and 5.9% in mean unweighted accuracy. Although multimodal SER models typically achieve higher accuracy, as they take into account more information than audio data alone, our proposed model achieves comparable and even slightly better performance, with a 0.9% increase in mean weighted accuracy.

To further investigate the effectiveness of each integrated part of the proposed dual-level model, we interpret the results in Table 1. Both Base 2 and Base 3 take a single sequence of mel-spectrograms, and both perform slightly worse than Base 1, which uses only handcrafted features. This supports the claim that raw features in mel-spectrograms are harder to learn from than handcrafted features. Base 4 is a naive combination of Base 2 and Base 3; because the two LSTMs in Base 4 do not interact with each other, the complementary information between the two sequences of mel-spectrograms is not fully exploited, and Base 4 is therefore also slightly worse than Base 1. Base 5 and Base 6 are both dual-level models that consider both handcrafted inputs and mel-spectrograms, and they both outperform Base 1–4. This demonstrates the effectiveness of the dual-level design.

More importantly, we observe that the proposed DS-LSTM model significantly outperforms Base 1–4. Comparing it with Base 4, we see that when the two separate LSTMs are replaced by the DS-LSTM, which has only six neural networks in its cell instead of the eight in two LSTMs combined, the weighted accuracy increases by 5% and memory usage is reduced by 25%. This shows that the DS-LSTM is a successful upgrade over two separate LSTMs. The dual-level model, which combines the DS-LSTM with the handcrafted-feature LSTM, significantly outperforms all baseline methods.

## 4 Conclusion and Future Work

In this paper, we have demonstrated the effectiveness of combining handcrafted and raw features in audio signals for emotion recognition. Furthermore, we introduced a novel LSTM architecture, denoted as DS-LSTM, which can process two sequences of data at once. We also proposed several modifications to the data preprocessing step. Our proposed model significantly outperforms baseline models and current state-of-the-art models on the IEMOCAP dataset, showing that unimodal models, which only rely on audio signals, have not reached their full potential.

Since it is beneficial to examine spectrograms at multiple time-frequency resolutions, one possible future direction is to enable the DS-LSTM to process more than two sequences of data in a fast and memory-efficient way without overfitting. We would also like to explore different options for increasing speaker variability, since we consider the lack of speaker variability to be one of the primary challenges of classifying utterances in the IEMOCAP dataset.

## References

- [1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. Narayanan, “IEMOCAP: interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
- [2] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2227–2231.
- [3] A. Satt, S. Rozenberg, and R. Hoory, “Efficient emotion recognition from speech using deep learning on spectrograms,” in INTERSPEECH, 2017.
- [4] Jianfeng Zhao, Xia Mao, and Lijiang Chen, “Speech emotion recognition using deep 1d and 2d cnn lstm networks,” Biomed. Signal Proc. and Control, vol. 47, pp. 312–323, 2019.
- [5] E. Kim and J. W. Shin, “Dnn-based emotion recognition based on bottleneck acoustic features and lexical features,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6720–6724.
- [6] R. Li, Z. Wu, J. Jia, S. Zhao, and H. Meng, “Dilated residual network with multi-head self-attention for speech emotion recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6675–6679.
- [7] S. Yeh, Y. Lin, and C. Lee, “An interaction-aware attention network for speech emotion recognition in spoken dialogs,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6685–6689.
- [8] D. Donoho and P. Stark, “Uncertainty principles and signal recovery,” SIAM Journal on Applied Mathematics, vol. 49, no. 3, pp. 906–931, 1989.
- [9] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, pp. 1735–1780, 1997.
- [10] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville, “Recurrent batch normalization,” CoRR, vol. abs/1603.09025, 2016.
- [11] Caroline Etienne, Guillaume Fidanza, Andrei Petrovskii, Laurence Devillers, and Benoit Schmauch, “Cnn+lstm architecture for speech emotion recognition with data augmentation,” 2018.
- [12] D. Dai, Z. Wu, R. Li, X. Wu, J. Jia, and H. Meng, “Learning discriminative features from spectrograms using center loss for speech emotion recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 7405–7409.
- [13] S. Mao, D. Tao, G. Zhang, P. C. Ching, and T. Lee, “Revisiting hidden markov models for speech emotion recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6715–6719.
- [14] S. Yoon, S. Byun, and K. Jung, “Multimodal speech emotion recognition using audio and text,” CoRR, vol. abs/1810.04635, 2018.