Speech Emotion Recognition with Data Augmentation and Layer-wise Learning Rate Adjustment
In this work, we design a neural network for recognizing emotions in speech, using the standard IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones for aggregating long-term dependencies. Applying techniques of data augmentation, layer-wise learning rate adjustment and batch normalization, we obtain highly competitive results, with weighted accuracy and unweighted accuracy on four emotions. Moreover, we show that the model performance is strongly correlated with the labeling confidence, which highlights a fundamental difficulty in emotion recognition.
Providing high-quality interaction between a human and a machine is a very challenging and active field of research with numerous applications. An important part of this domain is the recognition of emotions in human speech by computer systems. In recent years, impressive progress has been achieved in speech recognition by means of deep learning (Baidu; Medennikov+2016; Saon+2016; gatedCNN). These achievements also include significant results on speech emotion recognition (SER); see e.g. (KimLP13; microsoft; tspredictor).
In this work we build a neural network for SER on the IEMOCAP dataset (Busso2008) and achieve a result highly competitive with the state of the art. To our knowledge, the present state of the art has been achieved in (microsoft). However, the cross-validation procedure performed in that paper (as in other works publishing results obtained on this dataset) includes only five folds of the dataset out of the ten possible (see below for more details). On the other hand, our experiments showed (see section 3) that the performance strongly depends on the part of the data which is used for measuring the scores. As a consequence, results obtained by 5-fold cross-validation, without clarification of which data has been used for the measurement, cannot be meaningfully compared. With some choices of the test dataset, our model outperforms the result of (microsoft); with others it does not. Therefore we propose to use 10-fold cross-validation as the correct way of measuring the scores on the IEMOCAP dataset and present our scores correspondingly.
When treating an SER problem with deep learning, one either creates hand-crafted acoustic features (MFCC, pitch, energy, ZCR, ...), which are used as inputs to a neural network, or sends the data, after some preprocessing (e.g. Fourier transform), directly to a neural network. We apply the second strategy by transforming the audio signal to a spectrogram, which is then used as an input to convolutional layers, followed by recurrent ones. Such a choice of architecture, which has recently demonstrated very competitive performance (Baidu; tspredictor), is motivated by the fact that training deep long short-term memory (LSTM, (lstm)) or gated recurrent unit (GRU, (gru)) networks is very hard. In this sense, adding a few convolutional layers at the beginning of the network is an efficient way to reduce the dimensionality of the data and can significantly simplify the training procedure. On the other hand, it is also possible to use a deep CNN for extracting high-level features, which are then fed to an RNN for final time aggregation. We test a variety of architectures with different depths for the convolutional (1-6 layers) and recurrent (1-4 layers) modules, achieving the best scores with a 4+1 scenario (4 convolutional layers and 1 Bi-LSTM layer).
To address the challenges of class imbalance and data scarcity, we examine vocal tract length perturbation for the purpose of data augmentation, and show that it improves the performance. In line with (BNrecNN; Baidu; RBN; LN), we apply batch normalization to the recurrent layers and analyze its action on the considered data. We show that, even when batch normalization is applied conservatively, it may still result in data distortion, leading to faster overfitting and performance degradation. We also used soft-labeling in order to reflect the fact that multiple labels can be assigned to each sample of the IEMOCAP dataset. Although we did not manage to obtain better results by taking this information into account, we demonstrate a clear dependence of the model performance on the data labeling confidence. Finally, our experiments show that per-layer learning rate adjustment appears to be a crucial factor of the model performance, which can be related either to the particular architecture choice or to a more general phenomenon.
1.1 Dataset description
IEMOCAP (Interactive Emotional Dyadic Motion Capture), collected at the University of Southern California (USC) (Busso2008), is one of the standard datasets for emotion recognition. It consists of twelve hours of audio and video recordings performed by 10 professional actors (five women and five men) and organized in 5 sessions of dialogues between two actors of different genders, either playing a script or improvising. The dataset also provides text corresponding to the recordings and face images. However, in this work, we only deal with audio data. Each sample of the audio set is an utterance associated with an emotion label. Labeling was performed by six students of USC, three at a time for each utterance. The annotators were allowed to assign multiple labels if necessary. The final true label for each utterance was chosen by majority vote if the emotion category with the highest vote was unique. Since the annotators reached consensus more often when labeling improvised utterances (83.1%) than scripted ones (66.9%) (Busso2008), we concentrate only on the improvised part of the dataset. For the sake of comparison with state-of-the-art approaches, we predict four of the most represented emotions: neutrality, sadness, anger and happiness, which leaves us with 2280 utterances.
1.2 Spectrograms generation
Here we briefly discuss the data preprocessing we used. The signal is converted to a spectrogram by means of the short-time Fourier transform (STFT) with upper cut-off frequency of and Hann windowing, a standard choice for narrow-band spectral analysis:
$$X(t, f) = \sum_{n=0}^{N-1} w(n)\, x(n + tL)\, e^{-2\pi i f n / N},$$
where $w(n)$ is the Hann window, defined as follows:
$$w(n) = \frac{1}{2}\left(1 - \cos\frac{2\pi n}{N-1}\right),$$
$x(n)$ is the utterance signal at timestep $n$, $f$ stands for frequency, $t$ defines the window position, and $L$ and $N$ correspond to the window shift and size, respectively. We used the following spectrogram generation parameters: , and frequency ranges of and .
We compute the spectrogram as the magnitude squared of the STFT. Although the phase information seems to be dropped, it is still reconstructible (sturmel2011signal):
Finally, we rescale the spectrogram logarithmically in order to reflect human hearing ability (Weber-Fechner law) and to avoid compression of the low-frequency band, where the fundamental frequencies lie.
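The preprocessing chain described in this subsection (Hann-windowed STFT, squared magnitude, logarithmic rescaling) can be illustrated with a minimal, dependency-free sketch. It assumes that the logarithm is applied to the power values, and the window size, shift and toy signal are hypothetical choices for illustration only, not the parameters used in the experiments.

```python
import cmath
import math


def hann(size):
    """Symmetric Hann window of the given size."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (size - 1))) for n in range(size)]


def log_power_spectrogram(signal, size, shift, eps=1e-10):
    """Return one list of log(|STFT|^2) values per window position."""
    window = hann(size)
    frames = []
    for start in range(0, len(signal) - size + 1, shift):
        chunk = [w * x for w, x in zip(window, signal[start:start + size])]
        bins = []
        for k in range(size // 2 + 1):  # keep non-negative frequencies only
            coeff = sum(c * cmath.exp(-2j * math.pi * k * n / size)
                        for n, c in enumerate(chunk))
            bins.append(math.log(abs(coeff) ** 2 + eps))  # eps avoids log(0)
        frames.append(bins)
    return frames


# Toy input (hypothetical): a pure tone with 4 cycles per 16-sample window.
tone = [math.sin(2.0 * math.pi * 4 * n / 16) for n in range(64)]
spec = log_power_spectrogram(tone, size=16, shift=8)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

In practice one would use an FFT routine rather than the explicit sum; the direct form is kept here only because it mirrors the definition of the transform.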
2 Data augmentation
One of the main difficulties when dealing with the IEMOCAP dataset is class imbalance (see figure 1). The most abundant class corresponds to the neutral emotion. In addition, it is reasonable to suppose that emotionally neutral speech may be present in other utterances as a background to the labeled emotion. This makes it harder to distinguish between neutral and other emotions. An interesting approach to cope with this problem has been proposed in (microsoft). In the spirit of the Connectionist Temporal Classification (CTC) method (CTC), the authors assigned to each time step of the utterance a random label taking the value of either the emotion corresponding to the utterance or a Null label corresponding to non-emotional frames, modeling in this way other emotions which can appear in the utterance. Performing training by means of the expectation-maximization algorithm, the authors increased weighted and unweighted accuracies by 2-3% (see section 3 for terminology clarification). Another approach has been applied in (tspredictor), where the prediction procedure was realized in two steps. When the main model predicted the neutral emotion, the utterance was directed to three other models performing binary classification between neutral and one of the other emotions. This strategy resulted in a 2.5% increase of unweighted accuracy, but in turn decreased the weighted accuracy by 1.5%.
Apart from class imbalance, the IEMOCAP dataset presents another major drawback: it is relatively small, which makes the validation procedure unstable. To cope with both obstacles, we examined data augmentation by means of vocal tract length perturbation (VTLP), at the same time oversampling the least represented classes of the dataset, happiness and anger. VTLP is based on the speaker normalization technique considered in (FWN), where it was implemented to reduce inter-speaker variability. Differences in human vocal tract length can be modeled by rescaling the peaks of significant formants along the frequency axis with a factor taking values in the approximate range . Therefore, in order to get rid of this variability, one should estimate the factor for each speaker and normalize the spectrograms accordingly. Applied inversely, the same idea can be used for data augmentation (Jaitly2013VocalTL; IBMAUG; yerevan2016): in order to generate new samples, one just has to rescale the original spectrograms along the frequency axis while keeping the scaling factor in the range . Both approaches, normalization and augmentation, pursue the same objective: to enforce the invariance of the model to speaker-dependent features, since they are not relevant to the classification criterion. Augmentation, however, is easier to implement because we do not need to estimate the scaling factor of each speaker, and therefore we stick to this option.
Rescaling of frequencies has been performed as follows (FWN):
where is the upper cut-off frequency and is defined to be larger than the highest significant formants (we took ). Therefore, we rescale the frequencies below with , and then rescale the rest to ensure that the considered diapason stays constant (see Fig. 2).
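As an illustration of the rescaling just described, the sketch below implements one piecewise-linear warp that matches the verbal description: frequencies below a boundary are multiplied by the scaling factor, and the remaining band is mapped linearly so that the upper cut-off frequency stays fixed. The names `alpha`, `f0` and `f_max`, as well as the numerical values in the example, are assumptions made for illustration; the exact form used in (FWN) may differ in detail.

```python
def warp_frequency(f, alpha, f0, f_max):
    """Piecewise-linear frequency warp (assumed form).

    Below f0 the frequency is scaled by alpha; above f0 it is mapped
    linearly so that the warp is continuous at f0 and f_max maps to f_max.
    """
    if f <= f0:
        return alpha * f
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (f - f0)


# Hypothetical values: boundary at 4500 Hz, cut-off at 8000 Hz, factor 1.1.
low = warp_frequency(1000.0, 1.1, 4500.0, 8000.0)
edge = warp_frequency(4500.0, 1.1, 4500.0, 8000.0)
top = warp_frequency(8000.0, 1.1, 4500.0, 8000.0)
```

With these values, a frequency well below the boundary is simply stretched, while the cut-off itself is left unchanged, so the overall frequency range of the spectrogram is preserved.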
We tried two strategies of data augmentation. In the first one, a single uniformly distributed value was sampled at each epoch and used to rescale all training examples, and no rescaling was applied to the validation set. In the second strategy, each spectrogram was rescaled with an individually generated for the training, as well as for the validation sets. For evaluation, we used the majority vote of the model predictions on eleven copies of the test set with . We present the scores obtained with the second augmentation strategy, which provided the best result.
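The second strategy and the evaluation by majority vote can be sketched as follows. This is only an illustration: the sampling bounds passed to `sample_factors` are placeholders, the label strings are hypothetical, and ties in the vote are broken in favour of the label seen first, which is an assumption rather than a documented choice.

```python
import random
from collections import Counter


def sample_factors(n_samples, low, high, rng):
    """Draw one scaling factor per spectrogram (second strategy)."""
    return [rng.uniform(low, high) for _ in range(n_samples)]


def majority_vote(predictions_per_copy):
    """Combine predictions made on several rescaled copies of the test set.

    predictions_per_copy[c][i] is the label predicted for utterance i
    on copy c; the most frequent label per utterance is returned.
    """
    n_utterances = len(predictions_per_copy[0])
    return [
        Counter(copy[i] for copy in predictions_per_copy).most_common(1)[0][0]
        for i in range(n_utterances)
    ]


rng = random.Random(0)
factors = sample_factors(4, 0.9, 1.1, rng)  # placeholder bounds

# Three copies instead of eleven, to keep the example short.
votes = majority_vote([
    ["neu", "sad"],
    ["neu", "ang"],
    ["hap", "ang"],
])
```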
3 Model description and experiments
As mentioned above, the IEMOCAP dataset consists of five sessions, each being a conversation between a man and a woman, giving 10 speakers in total. In order to see how well the model can generalize to different speakers, we took the validation and test sets to correspond to the two different speakers of one of the sessions. The training set was composed of the four remaining sessions. In the course of the experiments, we observed that the performance strongly depends on which speakers are chosen for the test set (see Tab. 2). Therefore we chose a 10-fold cross-validation strategy, in order to average over all possible choices of the test set. Interestingly, to the best of our knowledge, all the other results reported on the IEMOCAP dataset were obtained by 5-fold cross-validation. In this case the choice of the validation and test sets is not rigorously defined (for instance, one could systematically use female speakers as validation and male speakers as test, or inversely), and the scores obtained in this way cannot be compared with one another.
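The fold construction described in this paragraph can be written down explicitly. The sketch below enumerates the ten folds; the session numbers and the "F"/"M" speaker tags are illustrative identifiers, not names taken from the dataset files.

```python
SESSIONS = [1, 2, 3, 4, 5]
GENDERS = ["F", "M"]


def speaker_folds():
    """Enumerate the 10 folds: each speaker serves as the test set once.

    The other speaker of the same session is used for validation and the
    four remaining sessions form the training set.
    """
    folds = []
    for session in SESSIONS:
        for test_gender in GENDERS:
            val_gender = "M" if test_gender == "F" else "F"
            folds.append({
                "train_sessions": [s for s in SESSIONS if s != session],
                "validation": (session, val_gender),
                "test": (session, test_gender),
            })
    return folds


folds = speaker_folds()
```

A 5-fold scheme would keep only one of the two folds per session, which is exactly where the ambiguity discussed above comes from.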
For evaluating the model performance, we chose weighted (WA) and unweighted (UA) accuracies. WA is the standard accuracy computed over the whole test set. UA is an average over accuracies computed for each emotion separately. First, we compute the metrics for each fold and then present the scores as the average over all the folds. Since for imbalanced datasets, UA is a more relevant characteristic, we rather concentrated our efforts on getting a high UA, in line with most of the other works on IEMOCAP.
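The two metrics can be made concrete with a short sketch. The toy labels below are hypothetical and only serve to show how the two accuracies diverge on imbalanced data.

```python
def weighted_accuracy(y_true, y_pred):
    """Standard accuracy over the whole test set (WA)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (UA)."""
    per_class = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        per_class.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return sum(per_class) / len(per_class)


# A classifier that always answers "neu" on an imbalanced toy set.
y_true = ["neu"] * 6 + ["hap"] * 2
y_pred = ["neu"] * 8
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

Here the trivial classifier reaches a WA of 0.75 but a UA of only 0.5, which illustrates why UA is the more informative metric when one class dominates.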
We considered architectures with 1-6 convolutional layers, 1-4 Bi-LSTM layers and a dense layer with softmax nonlinearity on top of the network (see Fig. 3). As an optimization procedure, we used stochastic gradient descent with Nesterov momentum. For the regularization of weights we used L2-regularization.
Due to the significant variability of the data samples in the time length (from 21 to 909 time steps for window size and shift ) we perform zero-padding of the samples along the time axis. In order to avoid the aggregation of the artificially added time steps by Bi-LSTM, we put a masking layer between the convolutional and Bi-LSTM modules. The size of the mask has been derived from the temporal size of the corresponding spectrogram and action of the convolutional strides on it.
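A minimal sketch of the padding and mask-size computation is given below. It assumes convolutions with "same"-style padding, so that a time stride s maps n steps to ceil(n / s) steps; the actual padding mode and the strides used in the network are not specified here, so both are assumptions.

```python
import math


def pad_batch(spectrograms, pad_value=0.0):
    """Zero-pad spectrograms (lists of frames) to a common time length."""
    max_len = max(len(spec) for spec in spectrograms)
    n_bins = len(spectrograms[0][0])
    return [
        spec + [[pad_value] * n_bins for _ in range(max_len - len(spec))]
        for spec in spectrograms
    ]


def valid_steps_after_convs(n_steps, time_strides):
    """Number of non-padded time steps left after the strided convolutions.

    Assumes each layer maps n steps to ceil(n / stride) steps.
    """
    for stride in time_strides:
        n_steps = math.ceil(n_steps / stride)
    return n_steps


batch = pad_batch([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]])
short_mask = valid_steps_after_convs(21, [2, 2])   # hypothetical strides
long_mask = valid_steps_after_convs(909, [2, 2])
```

The mask passed to the Bi-LSTM then marks only the first `valid_steps_after_convs(...)` positions of each sample as real data.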
Finally, we normalize the samples according to the general statistics of the dataset:
$$\hat{X} = \frac{X - \mu}{\sigma},$$
where $\mu$ and $\sigma$ are the average and standard deviation of the spectrogram pixels computed over the whole dataset along both time and frequency axes. Such normalization significantly improves the convergence time of the model. However, applied to networks of small depth ( convolutional layers), it results in strong overfitting.
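The global normalization can be sketched as follows, with a single mean and standard deviation shared by all pixels of all spectrograms. The function names and the toy data are illustrative assumptions.

```python
import math


def dataset_statistics(spectrograms):
    """Mean and standard deviation over all pixels of all spectrograms."""
    values = [v for spec in spectrograms for frame in spec for v in frame]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalize(spec, mean, std):
    """Shift and scale every pixel with the dataset-level statistics."""
    return [[(v - mean) / std for v in frame] for frame in spec]


data = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
mean, std = dataset_statistics(data)
normalized = [normalize(spec, mean, std) for spec in data]
```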
As mentioned above, we conducted a variety of experiments with different depths of the convolutional and Bi-LSTM modules. The presence of pooling layers alternating with the convolutions noticeably decreased the performance, and pooling was discarded at the beginning of the experiments. We examined different scenarios: "shallow CNN + deep Bi-LSTM", "deep CNN + shallow Bi-LSTM" and "deep CNN + deep Bi-LSTM". The best results have been achieved with a choice of 4 convolutional layers and 1 Bi-LSTM layer.
Table 1 presents the results of the best model together with the contribution of each of the techniques we applied. One can see that oversampling increased UA by , but resulted in a decrease of WA. Data augmentation with VTLP led to an increase of both metrics, by and for UA and WA respectively. As we discuss in section 3.2, a per-layer gradient analysis of the network led us to the idea of adjusting the learning rate layer-wise. This resulted in a significant improvement of the UA by . Finally, considering a larger range of frequencies (8 kHz) increased the UA by . The experiments with a deeper Bi-LSTM module did not lead to any improvement despite the use of batch normalization (see section 3.1).
|Augmentation during training||-||-||+aug||-||+aug||+aug|
|Oversampling () of happiness and anger||-||+over||+over||+over||+over||+over|
|Frequency range (kHz)||4||4||4||4||4||8|
|Fold||Session||Gender||WA (%)||UA (%)|
3.1 Batch normalization
Different kinds of techniques have been applied for normalization of the recurrent layers (BNrecNN; Baidu; RBN; LN). In some cases they were successful, demonstrating acceleration of the convergence and better performance, in some others (see e.g. (BNrecNN)) they rather led to stronger overfitting and degradation of the outcome. We suppose that such uncertainty in the results can be caused by characteristics of the considered data. Batch normalization technique proved to be extremely efficient when applied to images, which are usually characterized by the presence of very clear and robust correlations. Unlike images, time series data such as speech are much more fragile and applying normalization techniques might lead to destroying important information.
The most potentially destructive normalization is the so-called frame-wise one (see e.g. (BNrecNN)), in which the statistics are accumulated for each feature and each time step separately:
$$\mu_{t,f} = \frac{1}{B}\sum_{b=1}^{B} a_{b,t,f}, \qquad \sigma_{t,f}^2 = \frac{1}{B}\sum_{b=1}^{B}\left(a_{b,t,f} - \mu_{t,f}\right)^2.$$
Here, $b$, $t$ and $f$ are the batch, temporal and frequency indices respectively, $B$ is the batch size and $a$ is the preactivation. Then a batch-normalized recurrent layer can be written in the following way,
$$h_t = \phi\left(\mathrm{BN}(W_h h_{t-1}) + \mathrm{BN}(W_x x_t)\right),$$
if the hidden and the input parts are treated separately as in (RBN). Here, $\mathrm{BN}$ stands for the standard batch normalization operation (DBLP:journals/corr/IoffeS15), $\phi$, $h_t$, $x_t$ are the activation, hidden state and input, and $W_h$, $W_x$ are the corresponding weights. Because averaging is performed only along the batch axis, the frame-wise normalization may cause strong signal distortion along both time and frequency axes. Increasing the batch size may partly reduce this effect, but it is often preferable to keep the batch size small, in order to provide greater variability.
A more delicate approach is to average also over the temporal axis, which leads to the sequence-wise batch normalization successfully applied in (Baidu):
$$\mu_{f} = \frac{1}{T}\sum_{b=1}^{B}\sum_{t=1}^{T_b} a_{b,t,f}, \qquad \sigma_{f}^2 = \frac{1}{T}\sum_{b=1}^{B}\sum_{t=1}^{T_b}\left(a_{b,t,f} - \mu_{f}\right)^2,$$
where $T_b$ is the time length of sample $b$ and $T = \sum_{b} T_b$ stands for the sum of the sample time lengths over the batch. In this case, batch normalization is applied only to the input part of the recurrent layer:
$$h_t = \phi\left(W_h h_{t-1} + \mathrm{BN}(W_x x_t)\right). \tag{12}$$
Nevertheless, it can still cause distortion along the frequency axis. This is why we choose an even more conservative normalizing strategy, which averages the samples over all the axes:
$$\mu = \frac{1}{M}\sum_{b=1}^{B}\sum_{t=1}^{T_b}\sum_{f} a_{b,t,f}, \qquad \sigma^2 = \frac{1}{M}\sum_{b=1}^{B}\sum_{t=1}^{T_b}\sum_{f}\left(a_{b,t,f} - \mu\right)^2,$$
and $M$ is the product of $T$ and the feature number. Here batch normalization is applied as in (12). In this case, normalization is performed layer-wise (as in (LN)) and batch-wise at the same time (further, for simplicity, we refer to this normalization method as layer-wise batch normalization).
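The difference between the three ways of accumulating statistics can be illustrated on a toy batch of variable-length sequences. The sketch below only computes the means and variances (it does not include the learned scale and shift of batch normalization); the function names and the data layout `batch[sample][time][feature]` are assumptions made for the example.

```python
def mean_var(values):
    """Mean and (biased) variance of a list of numbers."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def frame_wise(batch):
    """One (mean, var) per time step and feature, averaged over the batch only."""
    n_feat = len(batch[0][0])
    max_len = max(len(seq) for seq in batch)
    stats = []
    for t in range(max_len):
        frames = [seq[t] for seq in batch if len(seq) > t]  # skip finished samples
        stats.append([mean_var([fr[f] for fr in frames]) for f in range(n_feat)])
    return stats


def sequence_wise(batch):
    """One (mean, var) per feature, averaged over batch and time."""
    n_feat = len(batch[0][0])
    return [mean_var([fr[f] for seq in batch for fr in seq]) for f in range(n_feat)]


def layer_wise(batch):
    """A single (mean, var), averaged over batch, time and features."""
    return mean_var([v for seq in batch for fr in seq for v in fr])


# Two samples with 2 features; the second sample has a single time step.
toy = [[[1.0, 3.0], [2.0, 4.0]], [[3.0, 5.0]]]
fw, sw, lw = frame_wise(toy), sequence_wise(toy), layer_wise(toy)
```

On this toy batch the frame-wise statistics at the second time step rely on a single sample (variance zero), which illustrates how little data may stand behind each frame-wise estimate when the batch is small.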
We examined layer-wise batch normalization applied to the recurrent module for models with 4 convolutional and 1-4 Bi-LSTM layers. The experiments with a small batch size demonstrated faster overfitting and degradation of the performance compared to the baseline. The fact that batch normalization has been applied not only batch-wise, but also layer-wise, should have reduced the influence of the batch size, which is often crucial when batch normalization is used. However, experimenting with a larger batch size, we realized that, in our case, it still strongly affects the performance (see Tab. 3). Therefore, it is possible that further increasing the batch size would lead to even better results. Unfortunately, due to GPU memory restrictions, we could not verify this.
3.2 Layer-wise learning rate adjustment
When deepening the convolutional module of the baseline model (starting from 3-4 convolutional layers), we observed degradation rather than improvement of the performance. Then, by analyzing the gradients corresponding to the different layers, we noticed an interesting phenomenon: the gradient with respect to the weights of the convolutional module was significantly larger than the gradient with respect to the Bi-LSTM weights (see Fig. 4). Therefore, to make the convolutional module learn better, we increased the learning rate with respect to the weights of the convolutional layers. In order to compensate for the possible overfitting effect of this action, we also increased the regularization of the convolutional weights. This modification noticeably improved the performance (see Tab. 1) and also decreased the convergence time. Interestingly, the same kind of phenomenon has recently been observed in (gradnorm). Considering different kinds of neural networks, the authors demonstrated that decreasing the learning rate with the network depth can significantly improve the convergence speed. Thus, this observation might be related to a more general phenomenon.
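The mechanics of a per-layer learning rate and per-layer L2 strength can be sketched with a hand-written Nesterov-momentum update on scalar parameters. All numerical values, the group names `conv` and `lstm`, and the convention that the L2 penalty is (l2 / 2) * w^2 are assumptions for illustration; they are not the settings used in the experiments.

```python
def nesterov_sgd_step(params, grads, velocity, groups, momentum=0.9):
    """One SGD step with Nesterov momentum and per-group hyperparameters.

    groups maps a layer name to its own (learning_rate, l2) pair.
    """
    for name in params:
        lr, l2 = groups[name]
        g = grads[name] + l2 * params[name]  # gradient of loss + (l2 / 2) * w^2
        velocity[name] = momentum * velocity[name] - lr * g
        params[name] += momentum * velocity[name] - lr * g  # look-ahead update
    return params, velocity


# Hypothetical setting: larger learning rate for the convolutional group.
params = {"conv": 1.0, "lstm": 1.0}
grads = {"conv": 0.5, "lstm": 0.5}
velocity = {"conv": 0.0, "lstm": 0.0}
groups = {"conv": (0.1, 0.0), "lstm": (0.01, 0.0)}
params, velocity = nesterov_sgd_step(params, grads, velocity, groups)
```

With identical gradients, the group with the larger learning rate moves ten times further in this step. Deep learning frameworks generally expose the same idea through per-parameter-group optimizer options.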
3.3 Annotations and soft-labeling
Emotional content of natural human speech is complex, being an intertwinement of different emotions. In addition, perception of human emotions is rather subjective. That is why labeling of the IEMOCAP dataset has been performed by several annotators, who were allowed to assign more than one emotion label (Busso2008). The authors of (mower2009) took into account this multi-label assignment. They divided the dataset in groups depending on the agreement of the annotators between each other. Following this idea, we introduce two subsets of the data. When all three evaluators agree on a common label, we refer to labeling as unanimous (prototypical in (mower2009)). When evaluators don’t agree on the emotion, we refer to labeling as ambiguous (non-prototypical majority-vote consensus in (mower2009)). Among the IEMOCAP improvisation utterances only are labeled unanimously, while compose the ambiguous subset. In particular, for utterances labeled as neutral and happiness, the percentage of unanimous samples falls to and respectively (see Tab. 4), which demonstrates huge ambiguity in labeling of these classes.
|Utterance||True||Ann.1||Ann.2||Ann.3||Hard label||Soft label||Weight||Subset|
|Ses01F_impro05_M020||neu||oth||neu, sad||neu||1 0 0 0||0.75 0 0.25 0||0.67||amb|
|Ses02M_impro08_F023||hap||hap||hap||neu||0 1 0 0||0.33 0.67 0 0||1||amb|
|Ses02M_impro06_M012||sad||sad, ang||sad||sad||0 0 1 0||0 0 0.83 0.17||1||una|
|Ses01F_impro01_M011||ang||ang||oth, ang||oth||0 0 0 1||0 0 0 1||0.5||amb|
In this section, we analyze the per-class performance of our best model and show how it changes depending on the subset (unanimous or ambiguous) to which the samples belong. Table 6 summarizes the prediction results. One can see that per-class accuracies are not determined solely by the number of available samples (e.g. sadness is recognized much better than the neutral emotion, even though it is much less represented in the dataset), but are also related to the agreement between the annotators. Indeed, the best predicted emotions are those with the highest proportion of unanimously labeled samples (see Tab. 4). Although oversampled, happiness is by far the least accurately recognized emotion (), while anger () and sadness () are most often correctly predicted. The best model's UA is , with a significant difference between the unanimous () and the ambiguous () subset. Considering each emotion separately, per-class accuracies are higher on the unanimous subset than on the ambiguous one (except for the neutral emotion), with a maximum difference of for anger (see shaded columns in Tab. 6).
When the classifier failed to predict correctly, we checked whether the emotion ranked second by the network (looking at the softmax outputs) was the right one (see columns -- in Tab. 6). We observed that for happiness and the neutral emotion (the classes predicted least confidently), the label predicted as the second choice of the model very often coincided with the true label. In this case, a possible supplementary technique to improve the scores is the two-step prediction already tested in (tspredictor). However, in this work we explore another method to improve the classification. We take into account the available multi-label annotation by introducing soft-labeling during training. In order to reflect the confidence of a given label, we assign it a probability depending on the multiple labels given by the annotators for the corresponding utterance (e.g. see shaded columns in Tab. 5). For instance, if an utterance is labeled as the neutral emotion by two annotators and as sadness by the third one, its hard label is "neutral" (which can be encoded in a one-hot vector as (1, 0, 0, 0)), while its soft label is a mixture of two emotions: the neutral emotion, with a weight of 0.67, and sadness, with a weight of 0.33 (which can be encoded as (0.67, 0, 0.33, 0)). Sometimes an annotator assigns a label outside the set we are considering (e.g. "excitement"). In order to take this into account we use appropriate weighting. When all the multi-labels assigned to an utterance belong to the set of interest, the utterance has weight 1, while utterances with at least one of the multi-labels outside of this set have a smaller weight (see Tab. 5). The loss function of the training process is still categorical cross-entropy, but with the soft labels in place of the hard ones.
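The soft labels and weights listed in Tab. 5 are consistent with the following rule, which is inferred here from the example rows and should be read as an assumption rather than a specification: each annotator's vote is split evenly among the labels that annotator assigned, the mass falling on the four classes of interest is renormalized to give the soft label, and the weight is the fraction of the total mass that falls inside this set. The class order (neutral, happiness, sadness, anger) is likewise read off the table.

```python
CLASSES = ["neu", "hap", "sad", "ang"]


def soft_label(annotations):
    """Return (soft label, sample weight) for one utterance.

    annotations holds one list of labels per annotator. Assumes at least
    one annotator assigned a label from CLASSES.
    """
    mass = {label: 0.0 for label in CLASSES}
    for labels in annotations:
        for label in labels:
            if label in mass:
                mass[label] += 1.0 / len(labels)  # split the vote evenly
    in_set = sum(mass.values())
    soft = [round(mass[label] / in_set, 2) for label in CLASSES]
    weight = round(in_set / len(annotations), 2)
    return soft, weight


row1 = soft_label([["oth"], ["neu", "sad"], ["neu"]])
row2 = soft_label([["hap"], ["hap"], ["neu"]])
row3 = soft_label([["sad", "ang"], ["sad"], ["sad"]])
row4 = soft_label([["ang"], ["oth", "ang"], ["oth"]])
```

Under this rule the four example rows of Tab. 5 are reproduced, and the example from the text (two annotators choosing neutral, one choosing sadness) yields (0.67, 0, 0.33, 0) with weight 1.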
The results are presented in Table 6. Looking at the per-class performance, one can see that the only class which benefits from soft-labeling is the neutral emotion. Performance on the other classes is significantly worse. Since the neutral emotion is the most abundant class, this results in a higher WA, but the UA decreases.
In this work, we investigated several techniques to enhance speech emotion recognition from spectrograms, demonstrating highly competitive performance. Moreover, a careful analysis of the results allows us to disentangle the contribution of each of the applied techniques. Our work addresses hyperparameter optimization as well as exploration of the data.
Following the modern trends in speech analysis, we used a mixed CNN-LSTM architecture, exploiting the capacity of convolutional layers to extract high-level representations from raw inputs. Interestingly, we noticed that parameters of convolutional and LSTM layers are trained at a very different pace, which hinders the exploitation of the model’s potential. Therefore, the learning rate adjustment turns out to be essential for taking full benefit from this architecture. This technique accounts for a 1.2-1.4% improvement of unweighted accuracy. We also investigated the effect of batch normalization, an indispensable tool in most image recognition tasks. However, application of batch normalization to time-series data is not always advised and might lead to data distortion. In order to preserve the signal structure as much as possible we performed the normalization layer-wise as well as batch-wise. Nevertheless, we did not manage to increase performance compared to the baseline, which might be caused by the small batch size we had to use in order to fit into the available GPU memory.
Gathering and labeling speech data relevant to automatic emotion recognition is difficult. Although one of the standard and appropriate datasets for this task, IEMOCAP is still flawed with scarcity and class imbalance. Consequently, as noted in previous works, cross-validation is essential for an unbiased measure of model performance, since the results vary considerably depending on which speaker is held out for measuring accuracy. Here, we advocate in favour of 10-fold rather than 5-fold cross-validation, which leaves less ambiguity on the result. We exploit data augmentation and minor class oversampling, which proved to be successful in enhancing the detection of underrepresented classes. The combination of both techniques resulted in a 1.8% increase of unweighted accuracy with respect to the baseline. Finally, besides limitations of the dataset, the task itself presents an intrinsic difficulty, reflected by the fact that in most cases, human annotators themselves do not agree on the emotion. As a consequence, our neural network often misclassifies the ambiguous samples. To overcome this issue, we tried to make use of available information on individual annotators by introducing soft labels. However, this turned out to be detrimental to unweighted accuracy, since it only favors the detection of the major class.
In view of the success of the mixed CNN-LSTM architecture for the emotion recognition task, a possible direction of future work would be to use convolutional LSTM (conv_lstm), where the matrix products defining the LSTM components are replaced with convolutions. Given the importance of data augmentation, another promising idea is to employ generative adversarial networks (gan) for the purpose of data augmentation. This approach, which has proven successful in image classification (gan_aug), would be an alternative to VTLP for synthesizing new realistic samples.
We are very grateful to DreamQuark team and Axel Orgogozo for discussions and helpful suggestions on the manuscript. This research was supported by the French National Association of Research and Technology, the DreamQuark Company (Paris, France) and the Computer Science Laboratory for Mechanics and Engineering Sciences (LIMSI-CNRS, Orsay, France).