Emotion Recognition From Speech With Recurrent Neural Networks

Vladimir Chernykh
MIPT, Skoltech
Pavel Prikhodko

This paper considers the task of emotion recognition from speech. The proposed approach uses a deep recurrent neural network trained on a sequence of acoustic features calculated over small speech intervals. In addition, the probabilistic CTC loss function makes it possible to handle long utterances containing both emotional and neutral parts. The effectiveness of this approach is shown in two ways. First, it is compared with recent advances in the field. Second, human performance on the same task is measured. Both criteria show the high quality of the proposed method.


  Vladimir Chernykh, MIPT, Skoltech, Moscow, vladimir.chernykh@phystech.edu
  Pavel Prikhodko, Skoltech, Moscow, p.prikhodko@skoltech.ru

1 Introduction

Nowadays machines can successfully recognize human speech. Automatic speech recognition (ASR) services can be found everywhere. Voice input interfaces are used in many applications, from navigation systems in mobile phones to Internet-of-Things devices. Many personal assistants such as Apple Siri [1], Amazon Alexa [2], Yandex Alisa [3], and Google Duplex [4] were released recently and have already become an integral part of everyday life.

Nevertheless, this field is still developing rapidly. Last year Google released its Cloud API for speech recognition [5]. The latest Windows 10 has the Cortana voice interface [6] integrated. Small startups all over the world, as well as IT giants like Google, Microsoft and Baidu, are actively doing research in this area.

The market for speech recognition hardware and software reached 55 billion dollars in 2016 and continues to grow by approximately 11% a year [7].

Therefore the authors believe that this field is promising and deserves attention.

1.1 Problem

Virtually all ASR algorithms and services simply transcribe audio recordings into written words. But that is only the first level of speech understanding.

During a conversation humans receive a lot of meta-information apart from the text. Examples include the identity of the speaker, their intonation and emotion, loudness, shades of meaning, etc. These factors may considerably influence the true intended meaning of a phrase, or even turn it into its opposite, which is what we call sarcasm or irony. Humans take all these elements into consideration while processing the phrase, and only after that is the final meaning formed.

Accounting for these factors in pure retrieval systems, e.g. search engines, may be superfluous. But it becomes crucial in more human-centered systems like voice assistants, where close communication with a human is needed. To detect the meaning of a spoken message correctly, one needs to account not only for the semantics but also for the kind of meta-information discussed above. Thus, to build a more complete human-computer interaction system it is necessary to extract these features from the audio signal.

This paper addresses only one of the questions raised above: how to correctly recognize the emotional background of the voice? The main obstacles that complicate the solution are:

  • Emotions are subjective.

    They are complex psychological and social phenomena. People understand emotions differently. Thus there are many difficulties in defining the notion of emotion [8].

    Altrov et al. [9] collected a corpus of Estonian speech with 4 emotions: joy, anger, sadness and neutral. They then asked people of different nationalities to evaluate it. Estonians, Latvians, Italians, Finns, Swedes, Danes, Norwegians and Russians took part in the experiment. Almost all of these nationalities are close to Estonians both geographically and culturally. Nevertheless, Estonians performed much better than any other nationality, showing about 69% mean class accuracy. All other groups performed 10-15% worse, and the only emotion that they recognized relatively well was sadness.

    The work of Altrov et al. [9] showed that there are significant intercultural differences in the understanding of emotions. But even inside one culture this understanding may vary greatly.

  • Assignment of the emotions to the audio recording.

    It is not obvious how one should assign emotional labels to a long audio recording or even a continuous flow of speech. Should it be one emotion per whole recording or per utterance? If one chooses the utterance-based solution, how should the split be done? Can an utterance have multiple emotions? These and a few other questions bring the methodology to the forefront.

  • Complexity and cost of database collection.

    Databases for the usual speech recognition task are relatively easy to collect: one can take dialogues from films, YouTube blogs, news, etc. and annotate them. Almost the only requirement is high quality of the audio recording.

    When it comes to emotions, there is a serious problem with all of these sources: the emotions in them are heavily biased. In news most of the speech is neutral. In films the set of emotions depends on the genre, but the distribution is almost always biased towards one prevailing emotion.

    Another way is to collect the database artificially. The following problem arises here: how to record a predefined emotion in a natural way? Douglas-Cowie et al. suggest using professional actors [10]. Actors are either given a topic and asked to improvise on it, or given scripted material to read. While reading, actors are to show the predefined emotion. Busso et al. give an overview and comparison of these two approaches in their paper [11].

    The set of emotions to use is another important question. There should be enough emotions to cover all the basic human reactions, but not so many that they cannot be played and assessed reliably. Picard et al. describe how and why the emotions should be chosen in their work [12]. They suggest using at least 5 basic emotions: happiness, anger, sadness, neutral and frustration.

    The other side of this coin is how the emotions should be measured and evaluated. Cowie et al. give their view on this problem in their paper [13]. They propose using the 3D Valence-Arousal-Dominance ordinal space as well as categorical labels for the evaluation of utterances. Moreover, several assessors are needed for each utterance to evaluate it consistently.

    Altogether, these peculiarities make the collection of a database a very complicated, time-consuming and expensive task.

    One good example of methodology and collection is the IEMOCAP database presented by Busso et al. in [14]. IEMOCAP is used in this work and is described in more detail in section 2.

Some of these questions are resolved by the authors of this paper, others are tackled by the authors of the database used, and the rest are inherent to the problem and cannot be avoided.

1.2 Related works

The problem described in section 1.1 has previously been considered in a number of works.

The majority of works state the emotion recognition task as a classification problem where one utterance has exactly one label.

Before the deep learning era, many different methods were proposed, which mostly extract complex low-level handcrafted features from the audio recording of the utterance and then apply conventional classification algorithms. One approach is to use generative models like Hidden Markov Models or Gaussian Mixture Models to learn the underlying probability distribution of the features and then to train a Bayesian classifier using the maximum likelihood principle. Variations of this method were introduced by Schuller et al. in 2003 [15] and by Lee et al. in 2004 [16]. Another common approach is to gather global statistics over local low-level features computed over parts of the signal and apply a classification model. Eyben et al. in 2009 [17] and Mower et al. in 2011 [18] used this approach with a Support Vector Machine (SVM) as the classification model. Lee et al. in 2011 [19] used Decision Trees, and Kim et al. in 2013 [20] utilized K Nearest Neighbours instead of SVM. Popular speech recognition methods were also adapted to the task of emotion recognition: see the works of Hu et al. in 2007 [21] and Nwe et al. in 2013 [22].

One of the first deep learning end-to-end approaches was presented by Han et al. in 2014 [23]. Their idea is to first split each utterance into frames and calculate low-level features. Then a densely connected neural network with three hidden layers transforms this sequence of features into a sequence of probability distributions over the target emotion labels. These probabilities are then aggregated into utterance-level features using simple statistics such as maximum, minimum, average, percentiles, etc. After that an Extreme Learning Machine (ELM) [24] is trained to classify utterances by emotional state.

Continuing the work of Han et al., Lee and Tashev presented their paper [25] in 2015. They used the same idea and approach as Han et al. [23]. Their main contribution is replacing the simple densely connected network with a recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units. Lee and Tashev also introduced a probabilistic approach to learning which is in some respects similar to the approach presented in this paper. But they continued to aggregate local probabilities into a global feature vector and to apply an ELM on top of it.

The main drawbacks of these two approaches are that they use very simple and naive aggregation functions and ELMs. The latter have been actively criticized by the research community in recent years, and by Yann LeCun in particular [26].

The first edition of this work was written in early 2017 [27] and aimed to get rid of the drawbacks discussed above by applying a fully end-to-end pipeline without handcrafted parts in the middle.

Since then, a few purely deep learning, end-to-end approaches based on modern architectures have appeared. Neumann and Vu in their 2017 paper [28] used the currently popular attentive architecture. Attention is a mechanism that was first introduced by Bahdanau et al. in 2015 [29] and is now state-of-the-art in the field of machine translation [30]. Xia et al. in their 2017 work [31] used a slightly different approach based on Deep Belief Networks (DBN) and a continuous problem statement in the 2D Valence-Arousal space. Each utterance can be assessed on an ordinal scale and then embedded into a multidimensional space. Regions in this space are associated with different emotions. The task then is to learn how to embed the utterances in this space. One of the most recent and interesting works was presented in 2018 by Lakomkin et al. [32]. They suggested transfer learning from the usual speech recognition task to emotion recognition. One might expect this method to work well because the corpora for speech recognition are far better developed: they are bigger and better annotated. The authors fine-tuned a DeepSpeech-like [33] network trained on LibriSpeech [34].

Despite the existence of a few more recent papers on this topic, the quality of the model proposed in this paper is on par with them. At the same time it allows for some extensions, such as a sequence of emotion labels as the output, which other approaches do not support to the best of the authors' knowledge.

2 Data

All experiments are carried out with audio recordings from the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [14]. There are a few other emotional speech databases, an overview of which can be found in [35, 36]. IEMOCAP is chosen because it has one of the most elaborate acquisition methodologies, a free academic license, long recording duration and good markup.

2.1 Database structure

IEMOCAP [14] consists of approximately 12 hours of recordings. Audio, video and facial keypoint data were captured during live sessions. Each session is a sequence of dialogues between a man and a woman. In total 10 people, split into 5 pairs, took part in the process. All of them are professional actors and actresses from the Drama Department of the University of Southern California [14]. The recording took place at a professional cinema studio. Actors were seated across from each other at a "social" distance of 3 meters, which enables more realistic communication.

Before the recording, actors were given the topic of the conversation and the emotional tone in which they should perform. There are two types of dialogues: scripted (actors were given the text) and improvised.

After recording these conversations, the database authors divided them into utterances with speech (see figure 1(a)).

(a) Utterance duration distribution
(b) Emotional labels distribution
Figure 1: Data overview

Note that the audio was captured using two microphones. Therefore the recordings contain two channels, which correspond to the male and female voices. Sometimes the speakers interrupt each other, and in these moments the utterances may overlap. This overlap takes about 9% of the total utterance time. It may lead to undesired results because the microphones were placed relatively near each other and thus inevitably captured both voices.

After the recording, assessors (3 or 4) were asked to evaluate each utterance based on both audio and video streams. The evaluation form contained 10 options (neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited, other). In this work only 4 of them are taken for the analysis: anger, excitement, neutral and sadness (as some of the most common ones [12]). Figure 1(b) shows the distribution of the considered emotions among the utterances.

(a) Expert consistency
(b) Utterances taken for the work
Figure 2: Markup details

An emotion is assigned to an utterance if and only if at least half of the experts were consistent in their evaluation. About 25% of the utterances do not satisfy this condition, and no emotion label was assigned to them at all (see figure 2(b)). Moreover, significantly less than half of the remaining utterances have a consistent assessment from all the experts (figure 2(a)). These statistics confirm the statement from section 1.1 that emotion is a subjective notion. Therefore it is reasonable to assume that a model cannot classify emotions perfectly accurately when even humans fail to do so.
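The labelling rule above can be illustrated with a short sketch. The function name `assign_label` and the handling of an exact tie (the first-seen label wins) are assumptions made for illustration; the database authors' actual procedure may differ.

```python
from collections import Counter


def assign_label(votes):
    """Return the label given by at least half of the assessors, else None."""
    if not votes:
        return None
    # most_common(1) returns the label with the highest vote count;
    # among equally frequent labels the first-seen one is returned.
    label, count = Counter(votes).most_common(1)[0]
    # "At least half" follows the rule stated in the text.
    return label if 2 * count >= len(votes) else None


print(assign_label(["anger", "anger", "neutral"]))    # two of three agree
print(assign_label(["anger", "neutral", "sadness"]))  # no agreement
```

Utterances for which the function returns `None` correspond to the roughly 25% that receive no label.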

2.2 Preprocessing

The raw signal has a sample rate of 16 kHz, and thus working with it directly requires enormous computational power. There are technologies (e.g. Google WaveNet [37, 38]) that deal with the raw signal, but for now these algorithms can hardly work online even with Google's computational power.

The goal is to reduce the amount of computation to an acceptable level while preserving as much information as possible. Each utterance is divided into overlapping intervals (frames) of 200 milliseconds, with an overlap of 100 milliseconds. Then acoustic features are calculated over each frame. The resulting sequence of feature vectors represents the initial utterance in a low-dimensional space and serves as the input to the model.

The authors also experimented with different frame durations from 30 milliseconds to 200 milliseconds. 30 milliseconds roughly corresponds to the duration of one phoneme in the normal flow of spoken English, and 200 milliseconds is the approximate duration of one word. The experiments do not show a significant difference in quality, but computation time rises as the frame duration decreases because of the larger number of frames. Thus the authors decided to stay with 200 ms.

Note that labels are provided only for whole utterances. This means that the task is weakly labelled in the sense that individual frames are not labelled.
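A minimal sketch of the framing step, assuming a 16 kHz signal, 200 ms frames and a 100 ms step. The function name `frame_bounds` is hypothetical, and dropping the incomplete trailing frame is an assumption; the actual pipeline relies on pyAudioAnalysis.

```python
def frame_bounds(n_samples, sample_rate=16000, frame_ms=200, step_ms=100):
    """Return (start, end) sample indices of overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000  # 3200 samples per frame
    step = sample_rate * step_ms // 1000        # 1600 samples between starts
    bounds = []
    start = 0
    # Assumption: a trailing piece shorter than a full frame is dropped.
    while start + frame_len <= n_samples:
        bounds.append((start, start + frame_len))
        start += step
    return bounds


# A 10-second utterance at 16 kHz yields about 100 frames.
print(len(frame_bounds(10 * 16000)))
```

With these settings consecutive frames share half of their samples, and the sequence length grows linearly with the utterance duration.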

The key point here is the set of features to calculate. All possible features can be classified into 3 buckets:

  • Acoustic

    They describe the wave properties of speech. They include Fourier frequencies, energy-based features, Mel-Frequency Cepstral Coefficients (MFCC) and similar.

  • Prosodic

    This type of features measures peculiarities of speech such as pauses between words, prosody and loudness. These details depend on the speaker, and their use in speaker-independent systems is debatable. Therefore they are not used in this work.

  • Linguistic

    These features are based on the semantic information contained in speech. Exact transcriptions require a lot of assessors' work. In the future it is possible to include speech recognition in the pipeline and use automatically recognized text. But for now the authors do not use linguistic features.

The current feature extraction algorithm uses only acoustic features. The pyAudioAnalysis [39] library by Giannakopoulos is used. More precisely, 34 features are calculated:

  • 3 Time-domain: zero crossing rate, energy, entropy of energy

  • 5 Spectral-domain: spectral centroid, spectral spread, spectral entropy, spectral flux, spectral rolloff

  • 13 MFCCs

  • 13 Chroma: 12-dimensional chroma vector, standard deviation of chroma vector

In the future the authors plan to get rid of the handcrafted features and switch to a Convolutional Neural Network (CNN) based feature extraction algorithm.

The final output of the preprocessing step is the sequence of 34-dimensional vectors for each utterance. The length of the sequence depends on the duration of the utterance.
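To give a feeling for what the time-domain part of such a feature vector contains, here is a simplified pure-Python sketch of the three time-domain features. These are textbook definitions written for illustration; they are not the pyAudioAnalysis implementation, and normalization details (for example the number of sub-blocks, `n_blocks`) are assumptions.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def energy_entropy(frame, n_blocks=10):
    """Entropy of the energy distribution across equal sub-blocks of the frame."""
    block_len = len(frame) // n_blocks
    total = sum(x * x for x in frame[:block_len * n_blocks])
    if total == 0:
        return 0.0
    entropy = 0.0
    for b in range(n_blocks):
        block = frame[b * block_len:(b + 1) * block_len]
        share = sum(x * x for x in block) / total
        if share > 0:
            entropy -= share * math.log2(share)
    return entropy


frame = [1.0, -1.0, 1.0, -1.0]
print(zero_crossing_rate(frame), energy(frame))
```

A frame whose energy is spread evenly over the sub-blocks has the maximal entropy, while a short burst inside an otherwise silent frame gives a value close to zero.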

3 Method

In this paper the Connectionist Temporal Classification (CTC) [40] approach is used to classify speakers by emotional state from the audio recording.

The raw input data is the sound signal, which is a high-frequency time series. After the preprocessing steps described in section 2.2 this sound signal is represented as a sequence of multidimensional frame feature vectors. The task is to map this long input sequence into a short sequence of emotions that are present in the recording.

The major difficulty is the significant difference between the input and output sequence lengths. The input sequence length may be about 100, which corresponds to about 10 seconds with the chosen preprocessing settings. The output sequence length is usually no more than 2-4. That is a difference of up to two orders of magnitude. In this case the usual solutions, such as padding of the output sequence or bucketing (which is used in Google Neural Machine Translation [41]), can hardly be applied.

CTC addresses this problem in a natural way by using three main concepts:

  • Introduce an additional NULL label, which corresponds to the absence of any other label and extends the initial label set.

  • Bijective sequence-to-sequence learning, i.e., a one-to-one mapping from the sequence of frame features to a sequence of extended labels of the same length.

  • Collapse the resulting sequence by removing duplicated labels and the introduced NULL label.

In the case of emotion recognition these features follow naturally from the essence of the task. On the one hand, one utterance may contain several different emotions; on the other hand, there may be considerable parts of the recording without any sign of emotion.

Thus there are strong reasons to believe that one can benefit from using the Connectionist Temporal Classification approach in this problem.

3.1 Notation

Let $E = \{e_1, \dots, e_k\}$ be the set of labels and $L = E \cup \{\mathrm{NULL}\}$ the extended label set.

Assume that $D = \{(X_i, z_i)\}_{i=1}^{n}$ is the dataset, where $z_i \in E^{*}$ is the true sequence of labels and $X_i \in (\mathbb{R}^{d})^{*}$ is the corresponding $d$-dimensional feature sequence. It is worth mentioning that the lengths of these sequences, $|z_i| = U_i$ and $|X_i| = T_i$, may differ in the general case; the only condition is that $U_i \le T_i$.

Next, let us introduce the set of decision functions (models) $\mathcal{F} = \{f : (\mathbb{R}^{d})^{*} \to E^{*}\}$ in which the best model is to be found. For a neural network with fixed architecture it is natural to associate the set of functions $\mathcal{F}$ with the space of network weights $W$, so that a function $f \in \mathcal{F}$ and a vector of weights $w \in W$ are interchangeable.

Having the set of functions, one needs to know how to choose the best one. For that purpose the probabilistic approach with maximum likelihood training is used (one can learn more in [42]). Assume that the model $f$ can also calculate the probability $p(z \mid X, f)$ of any sequence $z$ being its output for the input $X$. Then one wants the likelihood of the dataset to be as high as possible:

$$P(D \mid f) = \prod_{i=1}^{n} p(z_i \mid X_i, f) \to \max_{f \in \mathcal{F}}$$

The optimal model can then be found as:

$$f^{*} = \arg\max_{f \in \mathcal{F}} \prod_{i=1}^{n} p(z_i \mid X_i, f)$$

This method can also be seen from the angle of loss functions and the Empirical Risk Minimizer (see [43]):

$$f^{*} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} -\log p(z_i \mid X_i, f)$$

In the case of neural network models the optimization is usually carried out with gradient descent type algorithms.

3.2 CTC approach

CTC is one of the sequence-to-sequence prediction methods that deal with different lengths of the input and output sequences. The main advantage of CTC is that it chooses the most probable label sequence (labeling) with respect to all the possible ways of aligning it with the input sequence. The probability of a particular labeling is the sum of the probabilities of all its alignments.

The pipeline of the CTC method is depicted in figure 3.

Figure 3: CTC pipeline

A recurrent neural network (RNN) with fixed architecture (see details in section 4.3) is chosen as the space of classifiers $\mathcal{F}$. The only requirement on its structure is that it outputs a sequence of the same length as its input.

Think of the RNN as a mapping from the input space to sequences of probability distributions over the extended label set $L$:

$$Y = f(X) = (y^{1}, \dots, y^{T}), \qquad y^{t} \in [0, 1]^{|L|}, \quad \sum_{c \in L} y_c^{t} = 1,$$

where $y_c^{t}$ is the output of the softmax layer and represents the estimate of the probability of observing class $c$ at timestep $t$.

For every input $X$ of length $T$ let us define a path $\pi$: it is an arbitrary sequence from $L^{T}$, i.e. a sequence of extended labels of length $T$. Then the conditional probability of the path is

$$p(\pi \mid X) = \prod_{t=1}^{T} y_{\pi_t}^{t}$$

The problem is that a path can contain the NULL class, which is unacceptable in the final output. First of all one needs to get rid of the NULLs. For that purpose the mapping $M : L^{T} \to E^{\le T}$ from paths to label sequences of length at most $T$ is introduced. It consists of two steps:

  1. Delete all consecutive repeated labels

  2. Delete all NULLs

Consider the following example: $M(a, \mathrm{NULL}, a, b, b, \mathrm{NULL}) = M(\mathrm{NULL}, a, a, \mathrm{NULL}, a, b) = (a, a, b)$. Notice that $M$ is a surjective mapping. By means of it the paths are transformed into labelings. To compute the probability of a labeling $z$ one needs to sum up the probabilities of all paths that collapse into this particular labeling:

$$p(z \mid X) = \sum_{\pi \in M^{-1}(z)} p(\pi \mid X)$$
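The two-step mapping from paths to labelings can be sketched in a few lines. The names `NULL` and `collapse` are chosen for illustration only and are not taken from the authors' code.

```python
NULL = "NULL"  # placeholder for the extra "no label" class


def collapse(path):
    """Map a path to a labeling: merge consecutive repeats, then drop NULLs."""
    labeling = []
    prev = object()  # sentinel that differs from every real label
    for label in path:
        # Keep a label only when it starts a new run and is not NULL.
        if label != prev and label is not NULL:
            labeling.append(label)
        prev = label
    return labeling


print(collapse(["a", NULL, "a", "b", "b", NULL]))
print(collapse([NULL, "a", "a", NULL, "a", "b"]))
```

Both paths collapse to the same labeling, which is why the probability of a labeling is a sum over many paths. Note that a NULL between two equal labels keeps them separate, whereas equal labels that are adjacent are merged.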

The direct calculation of the labeling probability $p(z \mid X)$ requires summation over all corresponding paths, which is an exhaustive task: there are $|L|^{T}$ possible paths of length $T$ over the extended label set $L$. Graves et al. [40] derived a new efficient forward-backward dynamic programming algorithm for that. The initial idea was taken from the HMM decoding algorithm introduced by Rabiner [44].
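The forward half of this dynamic programming scheme can be sketched as follows. This is an illustrative pure-Python version of the standard CTC forward recursion, not the authors' implementation; the blank index `BLANK = 0` and the function names are assumptions. A brute-force enumeration is included only to check the recursion on tiny inputs.

```python
from itertools import product

BLANK = 0  # index of the NULL label; using index 0 is an assumption


def ctc_forward(probs, labeling):
    """Probability of `labeling`, summed over all paths that collapse to it.

    probs[t][c] is the softmax output for class c at timestep t.
    """
    # Interleave the labeling with blanks: [NULL, z1, NULL, z2, ..., NULL].
    ext = [BLANK]
    for label in labeling:
        ext += [label, BLANK]
    size = len(ext)

    # alpha[s]: total probability of path prefixes that end in ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                   # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]          # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or in the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_brute_force(probs, labeling):
    """Same quantity by explicit enumeration; feasible only for tiny inputs."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        collapsed, prev = [], None
        for label in path:
            if label != prev and label != BLANK:
                collapsed.append(label)
            prev = label
        if collapsed == list(labeling):
            p = 1.0
            for t, label in enumerate(path):
                p *= probs[t][label]
            total += p
    return total
```

The recursion costs on the order of $T \cdot |z|$ operations per labeling instead of enumerating all $|L|^{T}$ paths, which is what makes training with this objective practical.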

Finally, the objective function is the negative log-likelihood over the $n$ training pairs $(X_i, z_i)$:

$$\sum_{i=1}^{n} -\log p(z_i \mid X_i) \to \min$$

The neural network here plays the role of an evaluator of the probability measure $p$, and the more it trains the more accurate probability estimates it gives. To enable neural network training with standard gradient-based methods, Graves et al. [40] suggested a differentiation technique naturally embedded into the dynamic programming algorithm.

The final model chooses the labeling with the highest probability:

$$h(X) = \arg\max_{z} p(z \mid X)$$

However, the number of labelings is exponential, and thus finding the most probable one exactly is intractable. There are two main heuristics for tackling this problem:

  1. Best path search

    It approximates the most probable labeling with the collapsed version of the most probable path, i.e. the most probable path after the collapsing transformation described above.

  2. Beam search

    It keeps track of a fixed number of the most probable prefixes (the beam width) and extends them label by label at each step. Best path search is a special case of beam search with the beam width equal to 1.

Both heuristics are tested during the experiments.
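Best path search is simple enough to show in full. The sketch below assumes that the network output is a list of per-timestep probability lists and that the NULL label has index 0; both are illustrative assumptions, and the function name is hypothetical. Beam search is omitted for brevity.

```python
def best_path_decode(probs, blank=0):
    """Greedy CTC decoding: take the argmax at each step, then collapse."""
    # Most probable class at every timestep (the most probable path).
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    decoded = []
    prev = None
    for label in path:
        # Merge consecutive repeats and drop the blank (NULL) label.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


probs = [
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
print(best_path_decode(probs))
```

The result is only an approximation: the most probable path does not always collapse into the most probable labeling, because a labeling accumulates probability from many paths.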

4 Experiments

In a series of experiments the authors investigate the proposed approach and compare it to different baselines for emotion recognition. All the code can be found in the GitHub repository [45].

One of the main obstacles in the speech emotion recognition task in practice is that it is usually weakly supervised (as described in section 2.2). Here this means that there are many frames in an utterance but only one emotional label. At the same time it is clear that for sufficiently long periods of speech not all frames contain emotion. The CTC loss function suggests one way to overcome this issue.

The authors choose two more methods and compare them with CTC in the same setting. The algorithms are described in section 4.2, while the results are reported in section 4.4.

In all the methods and algorithms discussed below the frame features are calculated as described in section 2.2.

Please also note that in the IEMOCAP database each utterance has only one emotion. Therefore in the CTC approach the length of every true output sequence equals one. Thus one can treat the output sequence of emotion labels as a single emotion assigned to the utterance, and the true and predicted label vectors as scalars.

4.1 Metrics

First of all, one needs to decide on the evaluation criteria. In this work the authors follow the suggestion of Lee and Tashev [25] and use two main metrics to evaluate and compare the models:

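  • Overall (weighted) accuracy

    It is the usual accuracy, calculated as the fraction of correct answers over all examples:

    $$A_{\mathrm{overall}} = \frac{1}{n} \sum_{i=1}^{n} [\hat{z}_i = z_i]$$

  • Mean class (unweighted) accuracy

    The idea is to take the accuracy inside each class separately and then average these values across all classes:

    $$A_{\mathrm{class}} = \frac{1}{k} \sum_{c=1}^{k} \frac{\sum_{i=1}^{n} [\hat{z}_i = z_i][z_i = c]}{\sum_{i=1}^{n} [z_i = c]}$$

In both formulas above the square brackets denote the indicator function, $n$ is the number of utterances, $k$ is the number of classes, $z_i$ is the true label of the $i$-th utterance and $\hat{z}_i$ is the predicted one.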
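Both metrics are easy to compute directly. The following is a small sketch with hypothetical function names; it averages over the classes that occur in the true labels.

```python
def overall_accuracy(y_true, y_pred):
    """Weighted accuracy: fraction of correctly classified examples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted accuracy: per-class accuracy averaged over classes."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


# A classifier that always answers "neutral" on an imbalanced toy set.
y_true = ["neutral", "neutral", "neutral", "sadness"]
y_pred = ["neutral"] * 4
print(overall_accuracy(y_true, y_pred), mean_class_accuracy(y_true, y_pred))
```

On this toy set the constant classifier reaches 0.75 overall accuracy but only 0.5 mean class accuracy, which shows how the second metric penalizes ignoring small classes.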

Overall accuracy is the standard metric; it is commonly used and thus easy to compare with the results from other papers. But it has one major drawback: it does not account for class imbalance. In the IEMOCAP dataset, for example, the neutral class is approximately 1.7 times bigger than excitement. Therefore the authors also use mean class accuracy, which takes into account the differences in class sizes and removes the influence of the imbalance on the metric value.

4.2 Baselines

This subsection describes the baseline algorithms and reports their performance.

4.2.1 Framewise

The core idea of this method is to classify each frame separately. Since the task is weakly supervised, the following workflow is chosen:

  • Take the two loudest frames from each utterance. Loudness in this context is measured by the spectral power

  • Assign the emotion of the utterance to these frames

  • Train the frame classification model

  • Label all frames in all utterances using the fitted model

  • Classify utterances based on the obtained frame-level labels

The naive assumption here is that the whole utterance can be represented by its 2 loudest frames.

A Random Forest Classifier [46] is used as the classification model. To assign an emotion to the utterance, majority voting is applied to the emotion labels of its frames. A more detailed description of the algorithm, hyperparameter settings and code can be found in the GitHub repository [45].
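The two non-learning steps of this workflow, selecting the loudest frames and aggregating frame labels, can be sketched as follows. The function names are hypothetical, the per-frame power values are assumed to be precomputed, and the Random Forest itself is not shown.

```python
from collections import Counter


def two_loudest(frame_powers):
    """Indices of the two frames with the highest power, in temporal order."""
    order = sorted(range(len(frame_powers)),
                   key=lambda i: frame_powers[i], reverse=True)
    return sorted(order[:2])


def majority_vote(frame_labels):
    """Utterance label: the most frequent label among its frames."""
    return Counter(frame_labels).most_common(1)[0][0]


print(two_loudest([0.1, 0.9, 0.3, 0.8]))
print(majority_vote(["anger", "neutral", "anger"]))
```

During training only the selected frames inherit the utterance label; at prediction time every frame is labelled and the vote decides the utterance label.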

Figure 4: Framewise classification model. The title of each plot is the real emotion of the utterance. Each emotion is depicted with its own color, the x-axis shows the frame number, and the y-axis gives the probability of classifying the frame with the emotion.

Figure 4 shows the results of this method for randomly chosen validation set utterances. One can observe that it works fine for short utterances, but for longer utterances the predictions become sawtooth-like and unstable.

For the methodology and results of the overall comparison with other methods please see section 4.4 and table 1.

4.2.2 One-label

The one-label approach implies that every utterance has only one emotional label regardless of its length. In other words, the sequence-to-label learning paradigm is used here, in contrast with sequence-to-sequence learning in CTC.

An important detail is that all major modern deep learning frameworks (TensorFlow, Keras, PyTorch, etc.) group data into batches. A batch is in fact a multidimensional tensor. Mini-batch gradient descent and its modifications are the de facto standard method of training neural networks. But only tensors of the same dimensions can be packed into a batch. After the preprocessing steps described in section 2.2 the input data consists of sequences with the same feature dimension (34) but different lengths, which depend on the duration of the utterance. Thus they cannot be packed into a batch directly, and the network cannot be trained efficiently.

There are a couple of solutions to this problem, e.g., padding or bucketing [41]. Here the authors use padding. The idea is to make all the sequences the same length: short sequences are padded with zeros, and long sequences are truncated to the unified length. In this work the unified length equals 78, which is approximately the 90th percentile of all sequence lengths. After this step the training can be done efficiently using mini-batch approaches. The authors used the Adam [47] optimizer for training.
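A minimal sketch of the padding step, assuming each utterance is a list of 34-dimensional frame vectors. The function name `pad_or_truncate` is hypothetical, and returning the number of real frames alongside the padded sequence is an illustrative design choice that is convenient when padded positions have to be ignored later.

```python
def pad_or_truncate(sequence, target_len=78, n_features=34):
    """Bring a feature sequence to a fixed length.

    Returns the fixed-length sequence and the number of real (unpadded) frames.
    """
    # Long sequences are cut at the unified length.
    clipped = [list(frame) for frame in sequence[:target_len]]
    # Short sequences are extended with all-zero frames.
    padding = [[0.0] * n_features for _ in range(target_len - len(clipped))]
    return clipped + padding, len(clipped)


short = [[1.0] * 34 for _ in range(3)]
padded, real_len = pad_or_truncate(short)
print(len(padded), real_len)
```

After this step all sequences have the same shape and can be stacked into a single batch tensor.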

The one-label approach also requires the definition of the network architecture. The authors decided to use the same architecture for all approaches in order to compare them fairly. The one-label architecture is depicted in figure 8 of Appendix A. It contains stacked bidirectional LSTM units with dense classification layers on top of them. The categorical cross-entropy loss function is used. For a more detailed description of the network structure and training procedure see figure 8 in Appendix A and the code in [45].

The methodology and results of the overall comparison with other methods are described in section 4.4 and table 1.

4.3 CTC

Although the CTC approach can inherently account for more than one label in an utterance, the design of the IEMOCAP database implies only one emotion per utterance (see sections 2.2 and 4). Consequently there are four valid types of label sequences over the extended label set that can be generated by the network (see figure 5).

Figure 5: Valid sequences of labels. The "Emo" label in all schemes represents exactly one emotion. It can be one and only one of the 4 emotions discussed in section 2.1: anger, excitement, neutral and sadness.

Each type of sequence is later collapsed by the transformation described in section 3.2 during the CTC decoding step. Note that all 4 valid sequence types are collapsed into a single "Emo" label.

When applying the CTC approach one faces the same problem of different input sequence lengths as in the one-label approach in section 4.2.2. The solution here is the same: input sequences are padded or truncated to the length of 78. The only difference is that one keeps track of the initial sequence length so that padded positions are not taken into account when decoding the resulting output sequence (see figure 9 and the code [45] for more details).

The CTC approach also requires a neural network architecture. As mentioned in section 4.2.2, the authors decided to use the same architecture for all approaches in order to compare them fairly. The CTC architecture is shown in figure 9 of Appendix A. It contains stacked bidirectional LSTM units with dense classification layers on top of them. The CTC loss function is used. For a more detailed description of the network structure and training procedure see figure 9 in Appendix A and the code in [45].

The methodology and results of the overall comparison with other methods are described in section 4.4 and table 1.

4.4 Comparison

In this section we provide a comparison between all three approaches described above in sections 4.2.1, 4.2.2, 4.3.

Each method is tested using grouped cross-validation approach. In usual k-fold cross-validation approach the dataset is randomly split into into k disjoint folds. At each of k steps the the fold is used as a test set and all other folds are used as a training set.

Grouped cross-validation assumes that each data sample has an additional label. This label shows the group of the sample. Group in this context might be any kind of common property that samples share. In this work the group is a speaker. It means that the group labels contains all samples that were spoken by one person (and only them). Grouped cross-validation splits the data in such a way that samples from the one group can not be in both training and test sets simultaneously.

The grouped cross-validation technique ensures that the model quality is measured in a speaker-independent way. It means that the measured quality cannot benefit from overfitting to the manner of particular speakers present in the training set.

The IEMOCAP dataset contains 10 speakers who were recorded in pairs. Each speaker has roughly the same number of utterances. If one were to split the data into groups by speaker, each test set would contain only about 10% of the data, which might be too unstable. Thus authors decided to form groups not by individual speakers but by the pairs of speakers that were recorded together. In that way about 20% of the data is held out for each test, which is more stable.
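The splitting scheme can be sketched in a few lines of plain Python. The speaker ids, the rule that speakers 2k and 2k+1 form a recorded pair, and the function name are hypothetical and serve only as an illustration; a library splitter such as scikit-learn's GroupKFold [46] could be used instead.

```python
from collections import defaultdict


def grouped_folds(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        train_idx = sorted(
            i for g, idxs in by_group.items() if g != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy data: 10 speakers with 2 utterances each (made-up numbers).
speakers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 2
# Assumed pairing rule: speakers 2k and 2k+1 were recorded together.
pairs = [speaker // 2 for speaker in speakers]
folds = list(grouped_folds(pairs))
```

With 5 pairs this gives 5 folds, each holding out one pair (20% of the toy data), and no speaker appears in both the training and test parts of a fold.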

The results of 5-fold grouped cross-validation, averaged across folds, are shown in table 1.

Method | Overall accuracy | Mean class accuracy
Dummy | 35% | 25%
Framewise | 45% | 41%
One-label | 51% | 49%
CTC | 54% | 54%
Human | 69% | 70%
Table 1: Methods comparison

The first row, "Dummy", corresponds to a naive classification model which always answers with the label of the largest training class. In the IEMOCAP case it is the neutral class. The "Framewise" and "One-label" rows represent the described baseline models. "CTC" is the model investigated in this paper. As one can notice, CTC performs slightly better than the One-label approach and much better than Framewise and Dummy.
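The difference between the two metrics in table 1 can be made concrete with a short sketch. It assumes that mean class accuracy is the unweighted average of per-class accuracies; the class counts below are made up so that the largest class takes 35% of the data and are not the real IEMOCAP counts.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Share of all samples that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracies (assumed definition)."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / totals[c] for c in totals) / len(totals)


# Made-up label counts; "neu" is the largest class with 35% of samples.
y_true = ["neu"] * 35 + ["ang"] * 20 + ["exc"] * 25 + ["sad"] * 20

# Dummy model: always answer with the largest training class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_dummy = [majority_label] * len(y_true)

dummy_overall = overall_accuracy(y_true, y_dummy)
dummy_mean_class = mean_class_accuracy(y_true, y_dummy)
```

On this toy data the dummy model reaches an overall accuracy equal to the share of the largest class, while its mean class accuracy is 1/4, since it is fully correct on one of the four classes and always wrong on the other three.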

The last row of table 1 shows the human performance on the same task. Authors conducted a series of experiments to measure it. This process is described in more detail in section 4.6.

4.5 Error structure

Having observed the quality of the CTC model in section 4.4, authors decided to investigate it further. Graves et al. [40] report a huge gap in quality over the classical models, whereas here the gain over the One-label approach is only about 3-5 percentage points (see table 1). For that reason the error structure is studied.

First of all, let’s look at the distribution of predictions in comparison with the real expert labels. This is done by means of the confusion matrix shown in figure 6(a). Busso et al. [14] mention that the audio signal plays the main role in sadness recognition, while anger and excitement are better detected via the video signal which accompanied the audio during the assessors’ work on IEMOCAP. This hypothesis seems to hold for the CTC model: the recognition percentage for sadness is much higher than for the other emotions.

Figure 6: CTC BLSTM error structure. (a) Confusion matrix. (b) Misclassification rates.

In section 2.1 authors have already described that expert answers are sometimes not fully consistent (see figure 1(a)). This makes it possible to speak about the reliability of a label. Figure 6(b) shows how the model quality depends on the degree of expert confidence. The x-axis shows the number of experts whose answer differs from the final emotion assigned to the utterance. The y-axis shows the emotion label. Each cell of the table contains the model error percentage when classifying the corresponding emotion at the corresponding confidence level. The redder the cell, the bigger the error.

In fact this matrix gives an interesting piece of information. If one takes into account only those utterances on which the experts were consistent, one gets approximately 65% accuracy. It sounds more promising than the 54% obtained on all utterances.

Going further, authors investigate the wrong predictions themselves and not only their distribution. In inconsistent samples some experts give answers that differ from the final emotion assigned to the utterance. These answers can be any emotion from the full IEMOCAP list. Here authors keep only those wrong answers that fall into the four considered emotions.

The first row of table 2 shows, for utterances whose final label is given in the column header, the percentage of inconsistent expert answers that fall into the four considered emotions. For example, 17% in the column "Anger" means the following: utterances finally labeled as angry have some inconsistent expert answers, and 17% of these answers have labels from the set of the 4 considered emotions.

The second row shows the percentage of model answers that coincide with the inconsistent expert answer in this case. Note that there cannot be more than one inconsistent answer, because otherwise half of the experts would be inconsistent and the utterance would not be included in the dataset at all.

 | Anger | Excitement | Neutral | Sadness
Considered ratio | 17% | 22% | 36% | 39%
Model accuracy | 51% | 73% | 71% | 74%
Table 2: Residual accuracy

In other words, table 2 shows how frequently the errors of our model coincide with the human divergence in emotion assessment. If the errors of the model were random, the second row of the table would contain approximately 33% in each cell, because a wrong answer can be any of the three remaining emotions. In the case of the CTC model this percentage is much higher. It means that the model makes mistakes which are similar to human mistakes. This topic is further discussed in section 4.6.
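One possible reading of the second row of table 2 can be sketched as follows. The sketch assumes that the percentage is computed over utterances where the model is wrong and where one expert gave a dissenting answer within the four considered emotions. The data layout, the function name and the toy samples are hypothetical and are not taken from the code in [45].

```python
def residual_accuracy(samples):
    """Share of model errors that coincide with the dissenting expert label.

    Each sample is (final_label, dissenting_label_or_None, model_prediction).
    Only model errors on utterances with a dissenting answer are counted;
    this is an assumed interpretation of table 2.
    """
    hits = 0
    total = 0
    for final, dissent, predicted in samples:
        if predicted == final or dissent is None:
            continue
        total += 1
        hits += int(predicted == dissent)
    return hits / total if total else float("nan")


# Toy samples (made up): labels are shortened emotion names.
toy_samples = [
    ("ang", "exc", "exc"),  # model error that matches the dissenting expert
    ("ang", "exc", "sad"),  # model error that does not match
    ("ang", "neu", "ang"),  # model is correct, not counted
    ("sad", None, "neu"),   # no dissenting answer, not counted
]
toy_residual = residual_accuracy(toy_samples)
```

Under this reading a model with random errors would score about one third, since three wrong emotions remain for every utterance.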

4.6 Human performance

Given the inconsistency of experts and the other problems of the markup described in sections 4.5 and 2, authors came up with the idea of measuring how humans perform at this task.

This question has already been raised in earlier papers. As described in section 1.1, Altrov et al. did similar work in [9]. They used almost the same 4 classes (joy, anger, sadness, neutral), thus the results might be comparable. Native speakers of the language scored about 69% mean class accuracy. All other listeners performed 10-15% worse.

In this work a simple interface (figure 7(b)) for relabelling the speech corpus was developed. The idea is to see how well humans can solve this classification task. One can consider a human assessor as a humanized machine learning model.

Five people were involved in the experiment. All of them were authors’ lab colleagues (not professional actors or psychologists) and their native language is Russian. Each of them was asked to assess a random subset of the utterances. An assessor could see the correct answer after giving their own answer. This creates a feedback loop and a kind of "model training" in terms of the humanized machine learning model. During the experiment a small fraction of the utterances (2 from each emotion, 8 in total) was excluded from the main dataset. These utterances were given to the assessors prior to the main experiment as training examples. Through this mechanism assessors were able to get used to the system and to the way the actors talk, and to tune the volume level and other parameters. Answers at this preliminary stage were not included in the final statistics. Finally, each utterance was assessed by at least 2 assessors.

Figure 7(a) shows the results of this experiment.

Figure 7: Human labeling. (a) Human error structure. (b) Labelling interface.

Both overall accuracy and mean class accuracy are about 70% (see table 1). These numbers confirm the idea that emotion is a subjective notion and that it is hardly probable for any model to achieve even this 70%. In this light the model error structure investigated in section 4.5 becomes crucial, because human errors are not random. Humans make mistakes in the cases where the emotion is indeed unclear. For example, it is hard to confuse anger and sadness, but it is easy to confuse excitement and happiness.

It leads to the conclusion that, to see the real quality of a model, one should look not only at the accuracy numbers but also at the error structure. The error structure should be reasonable and resemble the human one. If both criteria are satisfied (high enough accuracy and reasonable error structure), one can say that the model is good. According to the error structure analysis carried out in section 4.5, the CTC model satisfies both criteria and thus can be considered to work well.

5 Conclusion

In this paper authors propose a novel algorithm for emotion recognition from audio based on the Connectionist Temporal Classification approach. The suggested method has two main advantages:

  • It takes into account that even an emotional utterance might contain parts where there is no emotion

  • It can predict the sequence of emotions for one utterance

The conducted experiments lead to results that are comparable with the state-of-the-art in this field. Authors provide an in-depth analysis of the model's answers and errors. Moving further, the human performance on this task is measured in order to understand the possible limits of model improvement. The initial suggestion that emotion is a subjective notion is confirmed, and it turns out that the gap between the human and the proposed model is not so big. Moreover, the error structure of the humans and of the model is similar, which is one more argument in favor of the model.

Authors have several plans for the future development of the current work. One way is to get rid of the handcrafted MFCC feature extraction and switch to learnable methods like Convolutional Neural Networks. Another way is to apply domain adaptation techniques and transfer the knowledge from speech recognition methods to emotion detection using pretraining and fine-tuning.


References

  • [1] Apple Siri. https://www.apple.com/ios/siri/, 2018.
  • [2] Amazon Alexa. https://developer.amazon.com/alexa, 2018.
  • [3] Yandex Alisa. https://alice.yandex.ru/, 2018.
  • [4] Google Duplex. https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html, 2018.
  • [5] Google Speech API. https://cloud.google.com/speech-to-text/, 2017.
  • [6] Microsoft Cortana. https://www.microsoft.com/en-us/cortana, 2017.
  • [7] Grand View Research. Voice recognition market size, share & trends analysis report by component, by application (artificial intelligence, non-artificial intelligence), by vertical, by regions, and segment forecasts, 2018 - 2024. Technical report, 2018.
  • [8] L. Devillers, L. Vidrascu, and L. Lamel. Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18(4):407 – 422, 2005.
  • [9] R. Altrov and H. Pajupuu. The influence of language and culture on the understanding of vocal emotions. Journal of Estonian and Finno-Ugric Linguistics, 6(3), 2015.
  • [10] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach. Emotional speech: Towards a new generation of databases. Speech Communication, 40(1):33 – 60, 2003.
  • [11] C. Busso and S. Narayanan. Scripted dialogs versus improvisation: Lessons learned about emotional elicitation techniques from the iemocap database. In Interspeech 2008, 2008.
  • [12] R. W. Picard. Affective computing. Technical Report 321, MIT Media Laboratory Perceptual Computing Section, 1995.
  • [13] R. Cowie and R. R. Cornelius. Describing the emotional states that are expressed in speech. Speech Communication, 40(1):5 – 32, 2003.
  • [14] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 42(4), 2008.
  • [15] B. Schuller, G. Rigoll, and M. Lang. Hidden markov model-based speech emotion recognition. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 401–404, 2003.
  • [16] C. M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan. Emotion recognition based on phoneme classes. In ICSLP, pages 889–892, 2004.
  • [17] F. Eyben, M. Wöllmer, and B. Schuller. Openear - introducing the munich open-source emotion and affect recognition toolkit. In 3rd International Conference on Affective Computing and Intelligent Interaction, pages 1–6, 2009.
  • [18] E. Mower, M. J. Mataric, and S. Narayanan. A framework for automatic human emotion classification using emotion profiles. IEEE Transactions on Audio, Speech, and Language Processing, 19(5):1057–1070, 2011.
  • [19] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. Narayanan. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9):1162 – 1171, 2011.
  • [20] Y. Kim and E. M. Provost. Emotion classification via utterance-level dynamics: A pattern-based approach to characterizing affective expressions. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3677–3681, 2013.
  • [21] H. Hu, M. X. Xu, and W. Wu. Gmm supervector based svm with spectral features for speech emotion recognition. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 413–416, 2007.
  • [22] T. L. Nwe, N. T. Hieu, and D. K. Limbu. Bhattacharyya distance based emotional dissimilarity measure for emotion classification. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7512–7516, 2013.
  • [23] K. Han, D. Yu, and I. Tashev. Speech emotion recognition using deep neural network and extreme learning machine. In Interspeech 2014, 2014.
  • [24] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70(1):489 – 501, 2006.
  • [25] J. Lee and I. Tashev. High-level feature representation using recurrent neural network for speech emotion recognition. In Interspeech 2015, 2015.
  • [26] Y. LeCun. https://www.facebook.com/yann.lecun/posts/10152872571572143, 2015.
  • [27] V. Chernykh, G. Sterling, and P. Prihodko. Emotion recognition from speech with recurrent neural networks. ArXiv e-prints, 2017.
  • [28] M. Neumann and N. T. Vu. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. In Interspeech 2017, 2017.
  • [29] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, 2017.
  • [31] R. Xia and Y. Liu. A multi-task learning framework for emotion recognition using 2d continuous space. IEEE Transactions on Affective Computing, 8:3–14, 2017.
  • [32] E. Lakomkin, C. Weber, S. Magg, and S. Wermter. Reusing neural speech representations for auditory emotion recognition. ArXiv e-prints, 2018.
  • [33] D. Amodei et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In 33rd International Conference on Machine Learning, volume 48, 2016.
  • [34] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an asr corpus based on public domain audio books. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
  • [35] D. Ververidis and C. Kotropoulos. A review of emotional speech databases. In Panhellenic Conference on Informatics, pages 560–574, 2003.
  • [36] The Association for the Advancement of Affective Computing. http://emotion-research.net/wiki/Databases, 2014.
  • [37] A. Van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. ArXiv e-prints, 2016.
  • [38] A. Van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, and K. Kavukcuoglu. Parallel wavenet: Fast high-fidelity speech synthesis. ArXiv e-prints, 2017.
  • [39] T. Giannakopoulos. Pyaudioanalysis: An open-source python library for audio signal analysis. PloS one, 10(12), 2015.
  • [40] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning, 2006.
  • [41] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, and M. Norouzi. Google’s neural machine translation system: Bridging the gap between human and machine translation. ArXiv e-prints, 2016.
  • [42] C. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
  • [43] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • [44] L.R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 1989.
  • [45] V. Chernykh. https://github.com/vladimir-chernykh/emotion_recognition, 2018.
  • [46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2011.
  • [47] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.

Appendix A

Figure 8: BLSTM network architecture for the one-label approach. The lstm_1 and lstm_3 layers process the input sequences in the forward order while lstm_2 and lstm_4 do it in the backward order. After processing the sequence in the backward order, the output of lstm_2 and lstm_4 is reversed one more time to be in the forward order. After that the outputs of lstm_1, lstm_2 and of lstm_3, lstm_4 are stacked as shown. Note that here (in contrast with the CTC architecture in figure 9) the last LSTM layers lstm_3 and lstm_4 output only the last state and not the whole sequence. Thus one does not need the TimeDistributed Keras wrapper and can use a simple Dense layer. The last softmax layer has 4 output units because the chosen subset of the IEMOCAP dataset has 4 emotions (see section 2.1). Note also that the real length of the initial input sequence (without padding) is not taken into account in this approach.

Figure 9: BLSTM network architecture for the CTC approach. The lstm_1 and lstm_3 layers process the input sequences in the forward order while lstm_2 and lstm_4 do it in the backward order. After processing the sequence in the backward order, the output of lstm_2 and lstm_4 is reversed one more time to be in the forward order. After that the outputs of lstm_1, lstm_2 and of lstm_3, lstm_4 are stacked as shown. The TimeDistributed Keras wrapper applies one and the same dense layer to each element of the input sequence. The last Lambda layer performs the CTC decoding step. The additional input layer data_len contains the real length of the initial input sequence (without padding), which allows for more precise decoding.