Emotion Classification from Noisy Speech - A Deep Learning Approach
This paper investigates the performance of “Deep Learning” for speech emotion classification when the speech is compounded with noise. It reports on the classification accuracy and concludes with the future directions for achieving greater robustness for emotion recognition from noisy speech.
Deep Learning has revolutionised many fields of science including pattern recognition and machine learning. Its tremendous power in learning the non-linear relationships between visible and hidden layers over many layers of Deep Neural Networks (DNNs) makes it suitable for learning the most powerful features. Likewise, Deep Learning has revolutionised the field of speech recognition. A Deep Learning approach called, “Deep Speech”  has significantly outperformed the state-of-the-art commercial speech recognition systems, such as Google Speech API and Apple Dictation. It achieves a word error rate of 19.1%, whereas the best commercial systems achieved 30.5% error. We capitalise on this success and aim to develop a deep learning framework for classifying emotion in speech.
Emotion classification from speech is a widely studied topic. Before Deep Learning, Gaussian Mixture Models jointly with Hidden Markov Model were the most popular method for emotion classification from speech. After the Deep Learning breakthrough in speech recognition, the landscape of emotion recognition from speech has started to change. A number of studies have emerged where Deep Learning is used for speech emotion recognition. Linlin Chao  et al. use Autoencoders, which is the simplest form of DNN. It is a simple 3-layer neural network where output units are directly connected back to input units. Typically, the number of hidden units is much less than the number of visible (input/output) ones. As a result, when data is passed through the network, the input vector is first compressed (encoded) to ”fit” in a smaller representation, and then reconstructed (decoded) back. The task of training is to minimise an error or reconstruction, i.e. find the most efficient compact representation (encoding) of the input data. Autoencoders are efficient in discriminating some data vectors in favour of others, however, they are not generative. That is they cannot generate new data, and therefore cannot generate as rich features as the advanced DNNs.
The other form of Deep Neural Networks is the Deep Belief Networks, which uses stacked Restricted Boltzmann Machines (RBMs) to form a deep architecture. RBM has a two layer construction: a hidden layer and a visible layer and the learning procedure consists of several steps of Gibbs sampling, that is, in propagation, hidden layer is sampled given the visible layer and in reconstruction visible layer is sampled given the hidden layers. The reconstruction and the propagation passes are repeated adjusting the weights to minimise reconstruction error. Unlike Autoencoder, RBM is a generative model. It can generate samples from learned hidden representations.
A number of studies have used DBNs for emotion classification from voice. In , DBN is used in conjunction with Hidden Markov Model. This work provides insights into similarities and differences between speech and emotion recognition. Authors in  use a similar DBN-HMM architecture. In , authors test DBNs for feature learning and classification and also in combination with other classifiers (while using DBNs for learning features only) like k-Nearest Neighbour (kNN), Support Vector Machine (SVM) and others, which are widely used for classification . An interesting approach called Optimized Multi-Channel Deep Neural Network (OMC-DNN) has been proposed in , which for speech emotion recognition uses input features generated as simple 2D black and white images representing graphs of the MFCC coefficients.
Finally, Convolution Neural Networks (CNNs) , which is another deep architecture, has also been used for emotion recognition from speech. CNNs are somewhat similar to RBMs, but instead of learning single global weight matrix between two layers, they aim to find a set of locally connected neurons, i.e., neurons that are spatially close to each other. CNNs are mostly used in image recognition. This is because, when dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such network architecture does not take the spatial structure of the data into account. Convolutional networks exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume. For the same reason, CNNs are hardly applicable for input other than images. Furthermore, this paper does not consider noisy speech.
Despite a number of attempts of using DBNs for emotion recognition from speech, no study has particularly looked into emotion recognition from “noisy speech”. In this paper, we consider this challenge.
Ii Work in Progress
Apart from the capacity of producing powerful discriminative features, DBNs are also naturally robust to noise. We use simulations to validate the robustness of DBNs. As shown in Figure 1, artificial noise is imputed to clean emotional utterance. Utterances are divided into segments and passed onto the DBNs for segment level emotion classification.
The Berlin emotional speech database  is used in experiments for classifying discrete emotions. In this database, ten actors (5m/5f) each uttered ten sentences (5 short and 5 longer, typically between 1.5 and 4 seconds) in German to simulate seven different emotions: anger, boredom, disgust, anxiety/fear, happiness, and sadness. Utterances scoring higher than 80% emotion recognition rate in a subjective listening test are included in the database. We classify all the seven emotions in this work. The numbers of speech files for these emotion categories in the presented Berlin database are: anger (127), boredom (81), disgust (46), fear (69), joy (71), neutral (79) and sadness (62). In order to simulate noise-corrupted speech signals, the DEMAND noise database  has been used in this paper. This database involves 18 types of noises including white noise and noises at the cafeteria, car, restaurant, etc. The DBN implementation has been adopted from . The DBN used in the paper uses three RBMs, where the first two RBMs use 1000 hidden unit each, and the third RBM uses 2000 hidden units. For each speech segment, a 13 coefficient MFCC vector 111MFCC features are widely used for acoustic signals  is generated and used as the input to the DBN. The output is the classification result.
Emotion recognition performance under different noise conditions is shown in Figure 2. To provide emphasis on robustness, we do not group the results based on the type of emotions, we rather present the overall classification accuracy. For each noise category, the results show the “percentage difference in accuracy” between emotion classification from clean and noisy speech. We made a number of observations:
The percentage errors can be categorised into three groups: error less than 10%, error less than 20% but greater than 10% percent, error less than 30% but greater than 20%.
From listening, the noise types causing the smallest error (error less than 10%) have low magnitude and low variability, such as noises produced inside a car, inside kitchen etc. Noises causing the higher errors have relatively higher variabilities.
In general, accuracy is diminished in the presence of noise except in the case of “washing”, when the accuracy increased. It is not unnatural for DBN to perform better in the presence of noise as this helps avoid overfitting, when the noise is not dominant.
Iii Conclusion and Future Work
We are currently developing methods to achieve greater robustness with DBNs for speech emotion recognition. This involves designing algorithms to obtain the optimal configuration (e.g., number of layers, nodes per layer etc.) of the DBNs. We are also developing an algorithm for optimal feature selection to achieve the best classification performance. Finally, we are conducting some in-depth analysis on the effect of various noise types in speech emotion recognition and how to overcome those effects.
-  A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” CoRR, vol. abs/1412.5567, 2014. [Online]. Available: http://arxiv.org/abs/1412.5567
-  L. Chao, J. Tao, M. Yang, and Y. Li, “Improving generation performance of speech emotion recognition by denoising autoencoders,” in Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on. IEEE, 2014, pp. 341–344.
-  D. Le and E. M. Provost, “Emotion recognition from spontaneous speech using hidden markov models with deep belief networks,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 216–221.
-  J. Niu, Y. Qian, and K. Yu, “Acoustic emotion recognition using deep neural network,” in Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on. IEEE, 2014, pp. 128–132.
-  M. E. Sánchez-Gutiérrez, E. M. Albornoz, F. Martinez-Licona, H. L. Rufiner, and J. Goddard, “Deep learning for emotional speech recognition,” in Pattern Recognition. Springer, 2014, pp. 311–320.
-  B. Wei, M. Yang, R. K. Rana, C. T. Chou, and W. Hu, “Distributed sparse approximation for frog sound classification,” in Proceedings of the 11th international conference on Information Processing in Sensor Networks. ACM, 2012, pp. 105–106.
-  M. N. Stolar, M. Lech, and I. S. Burnett, “Optimized multi-channel deep neural network with 2d graphical representation of acoustic speech features for emotion recognition,” in Signal Processing and Communication Systems (ICSPCS), 2014 8th International Conference on. IEEE, 2014, pp. 1–6.
-  Z. Huang, M. Dong, Q. Mao, and Y. Zhan, “Speech emotion recognition using cnn,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 801–804.
-  F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of german emotional speech.” in Interspeech, vol. 5, 2005, pp. 1517–1520.
-  J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings,” The Journal of the Acoustical Society of America, vol. 133, no. 5, pp. 3591–3591, 2013.
-  M. A. Keyvanrad and M. M. Homayounpour, “A brief survey on deep belief networks and introducing a new object oriented matlab toolbox (deebnet v2. 1),” arXiv preprint arXiv:1408.3264, 2014.
-  B. Wei, M. Yang, Y. Shen, R. Rana, C. T. Chou, and W. Hu, “Real-time classification via sparse representation in acoustic sensor networks,” in Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems. ACM, 2013, p. 21.