Paul commented this

The work uses a new dual-level model that combines handcrafted and raw features for audio signals to determine emotions…

JI replied this

Hi Paul S, thank you for the comment. Allow me to answer your questions.

Firstly, IEMOCAP is a well-known dataset for speech emotion analysis. A detailed description of how the dataset is composed can be found in previous work and is omitted from the paper for brevity. In short, each sentence is manually annotated by several experts, and each sentence is labeled with the emotion that the majority of the experts agree on.

Secondly, handcrafted features have proven useful for speech emotion analysis in previous work, but one limitation is that they are derived from deterministic mathematical formulas, for example, fundamental frequency. Another limitation is that the audio signal cannot be recreated from them. Researchers have tried to use raw audio signals directly, but an 8 s sentence contains 128k data points, which has proven computationally hard to learn from, so in recent years spectrograms have been used instead. You are correct that spectrograms are also features, but they are more comprehensive in the sense that the audio signal can be recreated from them with little loss. For this reason, mel-spectrograms have been widely used as 2D representations of audio signals.

Lastly, in the analysis section of the paper, a significant increase in accuracy is observed when the input is two spectrograms instead of just one. The reason why two spectrograms at each time point are more effective than one is the following. Spectrograms are derived under the assumption that the audio is stationary within each window, so it can be beneficial to vary the window size; in this paper, window sizes of 256 and 512 are used. More importantly, by the time-frequency trade-off, the longer the window, the more information about frequency is obtained in each spectrogram, yet the less information is obtained about how the frequency varies over time. Ideally the model benefits from as much information about both frequency and time as possible, so the paper uses a new LSTM architecture that incorporates both spectrograms.
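
To make the window-size trade-off concrete, here is a minimal sketch, not the code from the paper. It assumes a 16 kHz sampling rate (which is what 8 s and 128k data points imply), that the window sizes 256 and 512 are in samples, a Hann window, and a shared hop of 128 samples. It uses plain magnitude spectrograms computed with NumPy rather than mel-spectrograms, and a synthetic tone in place of a real utterance. The names `stft_magnitude` and `describe` are hypothetical helpers for illustration only.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumption: 8 s * 16 kHz = 128k samples


def stft_magnitude(signal, win_size, hop):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + win_size] * window for i in range(n_frames)]
    )
    # rfft keeps the win_size // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))


def describe(win_size, hop, n_samples=8 * SAMPLE_RATE):
    """Return (frames, frequency bins, Hz per bin, ms per window)."""
    n_frames = 1 + (n_samples - win_size) // hop
    n_bins = win_size // 2 + 1
    return n_frames, n_bins, SAMPLE_RATE / win_size, 1000 * win_size / SAMPLE_RATE


# Synthetic 8 s test tone standing in for an utterance (hypothetical input)
t = np.arange(8 * SAMPLE_RATE) / SAMPLE_RATE
audio = np.sin(2 * np.pi * 440.0 * t)

HOP = 128  # assumed shared hop, so both spectrograms have a similar number of time steps
spec_short = stft_magnitude(audio, win_size=256, hop=HOP)
spec_long = stft_magnitude(audio, win_size=512, hop=HOP)

print(spec_short.shape, describe(256, HOP))
print(spec_long.shape, describe(512, HOP))
```

Under these assumptions, the 512-sample window halves the width of each frequency bin (31.25 Hz instead of 62.5 Hz) but doubles the duration each frame averages over (32 ms instead of 16 ms). This is the sense in which the two spectrograms carry complementary information.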

