Characterizing Types of Convolution in Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition


Emotion plays a fundamental role in our daily lives for effective communication, underlying the abilities of humans to interact, collaborate, empathize and even compete with others. Researchers have been working for years on understanding human emotion, or human behavior in general [1], from both psychological and computational perspectives, for several reasons, one being that it serves as a lens to observe the dynamics of one's internal mental state. Moreover, with the advent of artificially intelligent agents, it is hardly an overstatement to stress the importance of emotion recognition in supporting natural and engaging human-machine interaction.

Human behavioral cues often mix and manifest multiple sources of information together. The need to robustly recover affective information from such multiplexed behavioral cues renders emotion recognition a challenging task. For example, speech contains not only the linguistic content of what is said but also attributes of the speaker, such as identity, gender, age, speaking style and language background, as well as information about the environment and context. All of these factors are entangled and transmitted through a single channel during speech articulation. Speech emotion recognition, therefore, involves the inverse process of disentangling these signals and identifying the affective information.

A multitude of studies on the subject of emotion recognition have discovered a number of emotion-related parameters based on prior knowledge of psychology, speech science and vision science, and through signal processing and machine learning approaches. Commonly used features include pitch, log-Mel filterbank energies (log-Mels), Mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction in the acoustic modality, and Haar features, local binary patterns, histograms of oriented gradients and the scale-invariant feature transform in the visual modality. A variety of classifiers based on these features have been reported to perform well. In particular, an extensive feature set consisting of thousands of hand-engineered parameters has been recommended in the past few INTERSPEECH challenges [2] and in a recent meta-research review article [3].

In addition to hand-crafted feature engineering, deep learning [4] provides an alternative approach to formulate appropriate features for the task at hand. In the last few years, convolutional neural networks (CNNs) [7] have demonstrated outstanding performance in various applications including image recognition, object detection and, recently, speech acoustic modeling. Compared to hand-crafted features, a CNN that learns from a large number of training samples via a deep architecture can capture a higher-level representation of the task-specific knowledge distilled from annotated data. In the area of speech emotion recognition, several researchers have investigated the effectiveness of CNNs in automatically learning affective information from signal data [12].

Information encoded in speech signals is inherently sequential. Moreover, psychological studies have shown that affective information involves a slow temporal evolution of mental states [16]. Based on this observation, previous studies have also investigated the use of architectures that explicitly model the temporal dynamics, such as hidden Markov models (HMM) [17] or recurrent neural networks (RNN) [18] for recognizing human emotion in speech.

Furthermore, there is a growing trend of combining a CNN and an RNN into one architecture and training the entire model in an end-to-end fashion. The motivation behind holistic training derives from the need to avoid greedily forcing the distributions of intermediate layers to approximate that of the labels, which is believed to maximally exploit the advantage of deep learning over traditional learning methods and to lead to improved performance. For example, Sainath et al. proposed an architecture, called the Convolutional Long Short-Term Memory Deep Neural Network (CLDNN) model, made up of a few convolutional layers, long short-term memory (LSTM) gated RNN layers and fully connected (FC) layers, in that order. They trained CLDNNs on the log-Mel filterbank energies [21] and on the raw waveform speech signal [22] for speech recognition, and showed that both CLDNN models outperform a CNN and an LSTM alone or combined. Likewise, Huang et al. [23] and Lim et al. [24] reported CLDNN-based speech emotion recognition experiments, on log-Mels and spectrograms respectively, using similar benchmark settings to highlight the superior performance resulting from end-to-end training.

In this work, we extend our previous work in [23] to characterize four types of convolutional operations in a CLDNN for speech emotion recognition. We use log-Mels and MFCCs as input to the proposed models depending on their spectral-temporal correlation. In particular, we compare the spectral decorrelation power of one type of convolutional operation with that of the discrete cosine transformation (DCT), under both clean and noisy conditions. In addition, we quantitatively and visually analyze modules in the proposed CLDNN-based models in order to gain insights into the information flows within the models.

The outline of this paper is as follows. Section 1 reviews previous related work. Section 2 presents the architecture of the proposed models and Section 3 describes three competitive baseline models. Section 4 introduces the corpus and data augmentation procedure. Section 5 details the experimental settings and the results are interpreted in Section 6. Section 7 concludes this paper.

1 Related Work

Before the present era of deep learning, speech emotion recognition systems prevalently relied on a two-stage training approach, where feature engineering and classifier training were performed separately. Commonly used hand-crafted features include pitch, MFCCs, log-Mels and the recommended feature sets from the INTERSPEECH challenges. The support vector machine (SVM) and the extreme learning machine (ELM) were two of the most competitive classifiers. For ease of model comparison, Eyben et al. [3] summarized the performances of an SVM trained on the INTERSPEECH challenge feature sets over several public corpora. Yan et al. [25] recently proposed a sparse kernel reduced-rank regression (SKRRR) method for bimodal emotion recognition from facial expressions and speech, which has achieved one of the state-of-the-art performances on the eNTERFACE’05 [26] corpus.

Han et al. [27] employed a multilayer perceptron (MLP) to learn from spliced data frames and took statistics of aggregated frame posteriors as utterance-level features. An MLP-ELM supervised by these utterance features and the corresponding labels has been shown to outperform the MLP-SVM model.

It has been known that emotion involves temporal variations of mental state. To exploit this fact, Wöllmer et al. [18] and Metallinou et al. [19] conducted experiments at the conversation-level to show that human emotion depends on the context of a long-term temporal relationship using HMM and Bi-directional LSTM (BLSTM). Lee et al. [20] posed speech emotion recognition at the utterance level as a sequence learning problem and trained an LSTM with a connectionist temporal classification objective to align voiced frames with emotion activation.

Deep CNN models were initially applied to computer vision related tasks and have achieved many ground-breaking results [7]. Recently, researchers have started to consider their use in the acoustic domain, including speech recognition [28], audio event detection [31] and speech emotion recognition [12]. Abdel-Hamid et al. [28] concluded that one of the advantages of using CNNs to learn from less processed features such as raw waveforms, spectrograms and log-Mels is their ability to reduce spectral variation, including speaker and environmental variabilities; this capability is attributed to structures such as local connectivity, weight sharing, and pooling. For training a CNN model for speech emotion recognition, Mao et al. [12] proposed to learn the filters of a CNN on spectrally whitened spectrograms; the learning, however, is carried out by a sparse auto-encoder in an unsupervised fashion. Anand et al. [13] benchmarked two types of convolutional operations in their CNN-based speech emotion recognition systems: the spectral-temporal convolution and the full-spectrum temporal convolution. Their results showed that the full-spectrum temporal convolution is more favorable for speech emotion recognition. They also reported the performance of an LSTM trained on raw spectrograms.

Recently, Sainath et al. proposed the CLDNN architecture for speech recognition based on the log-Mels [21] and the raw waveform signal [22], in which both models have been shown to be more competitive than an LSTM and a CNN model alone or combined. They also demonstrated that, with a sufficient amount of training data, a CLDNN trained on the raw waveform signal can match one trained on the log-Mels. Moreover, they found that the raw waveform and the log-Mels in fact provide complementary information. Based on the CLDNN architecture, Trigeorgis et al. [33] published a model using the raw waveform signal for continuous emotion tracking. Huang et al. [23] trained a CLDNN model on the log-Mels for speech emotion recognition and quantitatively analyzed the difference in spectral decorrelation power between the discrete cosine transformation and the convolutional operation. Lim et al. [24] repeated the comparison between CNN, LSTM and CLDNN for speech emotion recognition using spectrograms. Ma et al. [34] applied the CLDNN architecture to classifying depression based on the log-Mels and spectrograms. They employed the full-spectrum temporal convolution on the log-Mels but the temporal-only convolution on the spectrograms.

On the multi-modal side, Zhang et al. [14] fine-tuned the AlexNet on spectrograms and images, separately, for audio-visual emotion recognition but only applied time-averaging for temporal pooling. Tzirakis et al. [35] extended the uni-modal work in [33] to make use of visual cues. They fine-tuned the pre-trained ResNet model [11] for facial expression recognition and then re-trained the concatenated bimodal network with the LSTM layers re-initialized.

We extend our work in [23] to study four types of convolutional operations for speech emotion recognition. Our contributions are multi-fold. First, we consider all commonly used convolutional operations, including the two types covered in [13], in order to offer a comprehensive understanding. Second, unlike previous studies [13] that increased the training corpus size internally, we perform data augmentation with a noise corpus. As a result, we evaluate the proposed models under both clean and noisy conditions to quantitatively measure the influence of noise on different types of convolutional operations. To the best of our knowledge, we are the first to study the influence of noise on types of convolutional operations. Furthermore, we carry out module-wise evaluation and visualization to analyze the information flows of different factors encoded in speech and their interplay along the depth of an architecture.

This work is similar to Anand et al. [13] in that we both report the benchmarking of convolutional types. However, in addition to the aforementioned novelty, we train our models in an end-to-end fashion on log-Mels and MFCCs depending on their locally spectral-temporal correlation. Moreover, we keep the testing partition speaker-independent of the training partition. Ma et al. [34] also experimented with two types of convolutional operations, but they applied them to different features; as a result, it is difficult to draw a fair conclusion from the comparison. This work is also similar to Trigeorgis et al. [33], Lim et al. [24] and Huang et al. [23], all of which adopt the CLDNN architecture for speech emotion recognition/tracking, but the underlying features and the intended goals are different.

Figure 1: An overview of the proposed neural networks for speech emotion recognition. Upper Left: Four types of the convolutional operation over a given two-dimensional input \mathbf{X}_t=[\mathbf{x}_{t-l},\cdots,\mathbf{x}_t,\cdots,\mathbf{x}_{t+r}] are defined, including the full-spectrum temporal convolution (FST-Conv), the spectral-temporal convolution (ST-Conv), the temporal only convolution (T-Conv) and the spectral only convolution (S-Conv). The shape (height h and width w together) of a filter determines the type of a convolutional operation. Filters of shape M\times w (FST-Conv) consider all (M) frequency bands for w frames per scan. Filters of shape h\times w (ST-Conv) only process local spectral-temporal information. Filters of shape 1\times w (T-Conv) and of shape h\times 1 (S-Conv) only observe local information along their designated directions, respectively. Upper Right: An LDNN model, consisting of a bi-directional long short-term memory (BLSTM) gated recurrent neural layer followed by four fully connected feed-forward neural layers (FC), serves as the common sub-network architecture for each of the proposed models. Bottom: we propose eight models, labelled (i)-(viii), with or without the aforementioned convolutional operations on the log-Mels and the MFCCs to present a thorough comparison. Since MFCCs do not possess locality in the spectral domain, we do not apply the S-Conv and ST-Conv convolutional operations to MFCCs. Please refer to Section 2 for more details.

2 Deep Convolutional Recurrent Models

In this section we describe the proposed deep convolutional recurrent networks and detail the structurally different convolutional operations on the log-Mels and the MFCCs. Figure 1 gives an overview of the models we design for speech emotion recognition. In the upper left part of Figure 1, we define four types of convolutional operations depending on the shape of their feature maps. By dividing the convolutional operations into four types, we expect to understand their differences through a finer analysis after they have been optimized to learn from spectral-temporal signals. In the upper right part of Figure 1, we depict a deep recurrent neural network, called the LDNN model, which serves as the common sub-network architecture for every model. As a convolutional layer is applied locally in time, the LDNN model is intended to capture the long-term temporal relationships within an utterance. In the bottom part of Figure 1, all models are presented for a comprehensive study of the role a convolutional layer plays in learning the affective information in speech. Each model only takes spectral-temporal features as its input. Specifically, an emotional utterance is represented by a sequence of spectral feature vectors, which can be either the log-Mels or the MFCCs depending on the application scenario. Overall, we present eight models based on the combinations of two factors: the input features (the log-Mels or the MFCCs) and the type of convolutional operation (spectral only, temporal only, spectral-temporal or full-spectrum temporal). In the following subsections, we give a brief review of the convolutional and recurrent neural layers and introduce the corresponding notation.

2.1 Types of Convolutional Operations

A convolutional neural layer that receives an input tensor $\mathbf{X}$ consists of a convolutional function $f$, an activation function $\sigma$ and an optional pooling function $p$.

The convolutional function $f$ is defined by a set of feature maps $\{\mathbf{W}^{(k)}\}$ of shape $h\times w$, where the $(i,j)$th component of the $k$th output channel is given as

$$f(\mathbf{X})_{k,i,j}=\sum_{u=1}^{h}\sum_{v=1}^{w}\mathbf{W}^{(k)}_{u,v}\,\mathbf{X}_{(i-1)s_{1}+u,\,(j-1)s_{2}+v}+b_{k},\qquad(1)$$

in which $b_{k}$ denotes the bias of the $k$th feature map.

Likewise, it is straightforward to formulate the pooling function $p$ acting on an input $\mathbf{Z}$ through a filter of shape $p_{1}\times p_{2}$ by the component-wise definition:

$$p(\mathbf{Z})_{k,i,j}=\rho\big(\tilde{\mathbf{Z}}_{k,i,j}\big),\qquad(2)$$

where $\tilde{\mathbf{Z}}_{k,i,j}$ is a $p_{1}\times p_{2}$ sub-tensor of $\mathbf{Z}$ lying on the $k$th slice of $\mathbf{Z}$ with its first entry aligned to position $\big((i-1)t_{1}+1,\,(j-1)t_{2}+1\big)$, and $\rho$ is the pooling operation, usually the max or the mean function. In Eqs. (1) and (2), $s_{1}$, $s_{2}$, $t_{1}$ and $t_{2}$ are the strides, i.e. the amount of shift, of the filters in the convolutional or the pooling operations in their respective directions.

Typical choices of the activation function include the sigmoid function $\sigma(x)=1/(1+e^{-x})$, the hyperbolic tangent function $\tanh(x)$ and the rectified linear unit (ReLU) $\max(0,x)$.

Concisely, a convolutional neural layer can be summarized as the function composition

$$\mathbf{X}\mapsto p\big(\sigma(f(\mathbf{X}))\big),$$

where $\sigma$ denotes the element-wise application of the activation function.
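To make the composition concrete, the following NumPy sketch implements a naive valid cross-correlation, a ReLU and a non-overlapping max pooling, and chains them as one layer. All sizes (40 Mel bands, 100 frames, eight 5x5 filters) are chosen purely for illustration and are not the paper's settings.

```python
import numpy as np

def conv2d_valid(x, filters, stride=(1, 1)):
    """Valid cross-correlation of a 2-D input with a bank of 2-D filters."""
    h, w = filters.shape[1], filters.shape[2]
    sh, sw = stride
    out_h = (x.shape[0] - h) // sh + 1
    out_w = (x.shape[1] - w) // sw + 1
    out = np.empty((filters.shape[0], out_h, out_w))
    for k, f in enumerate(filters):
        for i in range(out_h):
            for j in range(out_w):
                out[k, i, j] = np.sum(x[i*sh:i*sh+h, j*sw:j*sw+w] * f)
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool(x, size=(2, 2)):
    """Non-overlapping max pooling over the last two axes."""
    k, h, w = x.shape
    ph, pw = size
    x = x[:, :h - h % ph, :w - w % pw]          # crop to a multiple of the pool size
    return x.reshape(k, h // ph, ph, -1, pw).max(axis=(2, 4))

# One convolutional layer as the composition pool(act(conv(X))).
X = np.random.randn(40, 100)            # 40 Mel bands x 100 frames
W = np.random.randn(8, 5, 5)            # 8 feature maps of shape 5x5 (an ST-Conv)
Y = max_pool(relu(conv2d_valid(X, W)))
```

The valid convolution maps the 40x100 input to 36x96 per feature map, and the 2x2 pooling halves both axes.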

In this work, we concentrate on the convolutional function $f$ and adjust the pooling function $p$ accordingly. In particular, we are interested in the relationship between the acoustic emotional patterns learnt by the model and the shape of the filters in the feature maps. To this end, we divide the filter shapes into four categories to highlight their structural differences: the full-spectrum temporal (FST-Conv), the spectral-temporal (ST-Conv), the temporal only (T-Conv) and the spectral only (S-Conv) convolutional operations. In what follows, we define each category.

FST-Conv First of all, we consider filters of shape $M\times w$, where $M$ denotes the number of spectral bands and $w$ specifies the width on the temporal axis. Since this type of filter covers the entire spectrum, it convolves with the input tensor only in the temporal direction, and as a result the pooling function can only perform temporal pooling.

ST-Conv A ST-Conv layer contains filters of shape $h\times w$, where $1<h<M$ and $w>1$. This type of filter observes local spectral-temporal information at a time and is free to convolve with the input tensor in both directions. Accordingly, the pooling function also operates on the convolved tensor through a two-dimensional filter.

T-Conv A T-Conv layer is similar to a FST-Conv layer except that the filters in a T-Conv layer have a shape of $1\times w$. These filters convolve with the input tensor along the temporal direction, from one frequency band to another, and ignore spectrally contextual information. The pooling function correspondingly acts on the convolved tensor along the temporal direction.

S-Conv A spectrally only convolutional neural layer consists of filters of shape $h\times 1$, where $h\le M$, and the pooling function down-samples the convolved tensor along the spectral direction. Note that the S-Conv type is closely related to traditional signal processing techniques; for example, the DCT transformation from log-Mels to MFCCs belongs to this category when $h=M$, except that the filters in the DCT are mathematically pre-defined; see Section 3.3 for more details.
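The four filter shapes differ only in how much of the input each filter covers per scan, which directly determines the output shape and hence where pooling can act. A small shape-arithmetic sketch (all sizes illustrative, not the paper's settings):

```python
import numpy as np

M, T = 40, 100                      # spectral bands x frames (illustrative)
X = np.random.randn(M, T)           # e.g. a log-Mel segment

def conv_out_shape(input_shape, filter_shape):
    """Output height/width of a valid convolution with stride 1."""
    return tuple(i - f + 1 for i, f in zip(input_shape, filter_shape))

h, w = 8, 5
shapes = {
    "FST-Conv": (M, w),   # full spectrum, w frames: slides in time only
    "ST-Conv":  (h, w),   # local patch: slides in both directions
    "T-Conv":   (1, w),   # one band, w frames: temporal only
    "S-Conv":   (h, 1),   # h bands, one frame: spectral only
}
outs = {name: conv_out_shape(X.shape, s) for name, s in shapes.items()}
# FST-Conv leaves no spectral freedom, so its pooling can only be temporal.
```

The FST-Conv output collapses the spectral axis to a single row, which is exactly why its pooling function is restricted to the temporal direction.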

For each type of convolutional operation, we employ a fixed stride. Since our focus is on the convolutional operations, we also employ a fixed pooling size and a fixed pooling stride in the respective direction(s) of convolution. Table 1 summarizes the parameters for all Conv layers.

Table 1: A summary of the parameters for each model architecture. $M$ denotes the spectral dimensionality and “var” stands for variable parameters subject to tuning. The dash symbol indicates that parameter tuning is not applicable.

2.2 Deep Recurrent Neural Network

Suppose the input is a sequence of vectors $(\mathbf{x}_1,\cdots,\mathbf{x}_T)$. The Elman-type simple recurrent neural network (RNN) [37] is defined through the following equations:

$$\mathbf{h}_t=\sigma_h(\mathbf{W}_{xh}\mathbf{x}_t+\mathbf{W}_{hh}\mathbf{h}_{t-1}+\mathbf{b}_h),\qquad(3)$$
$$\mathbf{y}_t=\sigma_y(\mathbf{W}_{hy}\mathbf{h}_t+\mathbf{b}_y),$$

where $\mathbf{h}_t$, as a non-linear recurrent transformation of all past history, represents the system memory at time $t$, $\mathbf{W}_{ab}$ is an affine mapping from a space of type $a$ to one of type $b$, and $\sigma_a$ is the activation function for type $a$. Here $\mathbf{x}_t$, $\mathbf{h}_t$ and $\mathbf{y}_t$ denote the input, hidden and output vectors, respectively. However, training a simple RNN with the back-propagation algorithm may cause the issues of gradient vanishing or explosion. Although heuristic techniques such as gradient clipping can alleviate the issue of gradient explosion, the gradient vanishing problem is mitigated by an enhanced architecture: the LSTM architecture [6].
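As a minimal NumPy sketch of the Elman recurrence, the snippet below unrolls one hidden state over a sequence of frames; the dimensions and the 0.1 weight scaling are hypothetical choices for illustration only.

```python
import numpy as np

def elman_step(x_t, h_prev, Wxh, Whh, bh):
    """One Elman step: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh)."""
    return np.tanh(Wxh @ x_t + Whh @ h_prev + bh)

rng = np.random.default_rng(0)
d_in, d_h = 40, 16                        # illustrative input/hidden sizes
Wxh = rng.standard_normal((d_h, d_in)) * 0.1
Whh = rng.standard_normal((d_h, d_h)) * 0.1
bh = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(100):                      # unroll over a 100-frame utterance
    x_t = rng.standard_normal(d_in)
    h = elman_step(x_t, h, Wxh, Whh, bh)  # h summarizes all past frames
```

The repeated multiplication by Whh inside tanh is precisely the multiplicative recurrence that makes gradients vanish or explode when back-propagating through many steps.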

An LSTM is able to decide when to read from the input, forget the memory or write an output by controlling a gating mechanism. By definition an LSTM learns the following internal controlling functions:

$$\mathbf{i}_t=\sigma(\mathbf{W}_{xi}\mathbf{x}_t+\mathbf{W}_{hi}\mathbf{h}_{t-1}+\mathbf{b}_i),$$
$$\mathbf{f}_t=\sigma(\mathbf{W}_{xf}\mathbf{x}_t+\mathbf{W}_{hf}\mathbf{h}_{t-1}+\mathbf{b}_f),$$
$$\mathbf{o}_t=\sigma(\mathbf{W}_{xo}\mathbf{x}_t+\mathbf{W}_{ho}\mathbf{h}_{t-1}+\mathbf{b}_o),$$
$$\mathbf{g}_t=\tanh(\mathbf{W}_{xg}\mathbf{x}_t+\mathbf{W}_{hg}\mathbf{h}_{t-1}+\mathbf{b}_g),$$
$$\mathbf{c}_t=\mathbf{f}_t\odot\mathbf{c}_{t-1}+\mathbf{i}_t\odot\mathbf{g}_t,\qquad(4)$$
$$\mathbf{h}_t=\mathbf{o}_t\odot\tanh(\mathbf{c}_t),$$

where $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$, $\mathbf{g}_t$, $\mathbf{c}_t$ and $\mathbf{h}_t$ represent the input, forget, output, gate, cell and hidden vectors, respectively. In particular, the change from the non-linear multiplicative recurrence in Eq. (3) to the linear additive recurrence in Eq. (4) theoretically prevents gradients from vanishing during back-propagation through time. Moreover, studies have found that a BLSTM layer can further improve upon a unidirectional LSTM in applications such as speech recognition [38], translation [39] and emotion recognition [18], as it fuses information from the past and the future.

Suppose a forward $\overrightarrow{\text{LSTM}}$ takes in a sequence $(\mathbf{x}_1,\cdots,\mathbf{x}_T)$ and returns $(\overrightarrow{\mathbf{h}}_1,\cdots,\overrightarrow{\mathbf{h}}_T)$, and a backward $\overleftarrow{\text{LSTM}}$ takes in the reversed sequence $(\mathbf{x}_T,\cdots,\mathbf{x}_1)$ and returns $(\overleftarrow{\mathbf{h}}_T,\cdots,\overleftarrow{\mathbf{h}}_1)$. A BLSTM, which is made of the two LSTMs, runs on the two sequences and gives another sequence $(\mathbf{y}_1,\cdots,\mathbf{y}_T)$, where $\mathbf{y}_t$ is the concatenation of $\overrightarrow{\mathbf{h}}_t$ and $\overleftarrow{\mathbf{h}}_t$.
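The gating equations and the bidirectional concatenation can be sketched in NumPy as follows. For brevity the two directions here share one randomly initialized parameter set, which a real BLSTM would not do, and all sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the stacked i, f, o, g parameters."""
    z = W @ x_t + U @ h_prev + b
    d = h_prev.size
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c_prev + i * g          # linear additive recurrence on the cell
    h = o * np.tanh(c)              # gated hidden vector
    return h, c

def run_lstm(xs, W, U, b, d):
    h, c = np.zeros(d), np.zeros(d)
    out = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, U, b)
        out.append(h)
    return out

rng = np.random.default_rng(0)
d_in, d, T = 40, 16, 30
W = rng.standard_normal((4 * d, d_in)) * 0.1
U = rng.standard_normal((4 * d, d)) * 0.1
b = np.zeros(4 * d)

xs = [rng.standard_normal(d_in) for _ in range(T)]
fw = run_lstm(xs, W, U, b, d)               # forward pass
bw = run_lstm(xs[::-1], W, U, b, d)[::-1]   # backward pass, re-reversed
ys = [np.concatenate([f_t, b_t]) for f_t, b_t in zip(fw, bw)]
```

Each output y_t fuses a summary of frames 1..t from the forward direction with a summary of frames t..T from the backward direction, which is the property exploited for emotion recognition.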

2.3 CLDNN-based Models

Before defining the CLDNN-based models, we introduce a sub-network architecture shared among them. The sub-network contains one BLSTM layer followed by four fully connected feed-forward (FC) layers. Rather than using the output vector at the last time step, we take a mean pooling over the output of the BLSTM layer to obtain the utterance representation. A dropout mechanism [40] with a fixed probability is applied to the representation to regularize the learning process. The first three FC layers are activated by the ReLU and the last one, whose size equals the number of emotion classes, by the softmax function for classification. This architecture based on (B)LSTM and FC layers is called an LDNN model [21]. Note that we employ a BLSTM layer instead of an LSTM layer as in [21] because the ability of a BLSTM to integrate future information into the representation learning has been shown to be beneficial to emotion recognition.
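The LDNN head described above, mean pooling over the BLSTM outputs followed by ReLU FC layers and a softmax, can be sketched as below. The layer widths are placeholders (the paper's actual sizes are tuned elsewhere), only one ReLU FC layer is shown, and dropout is noted but omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d2, n_hidden, n_classes = 30, 32, 64, 6   # six emotion classes
H = rng.standard_normal((T, d2))             # a BLSTM output sequence

u = H.mean(axis=0)                           # mean pooling over time
# (a dropout mask would randomly zero entries of u during training only)
W1, b1 = rng.standard_normal((n_hidden, d2)) * 0.1, np.zeros(n_hidden)
Wo, bo = rng.standard_normal((n_classes, n_hidden)) * 0.1, np.zeros(n_classes)
hidden = np.maximum(0.0, W1 @ u + b1)        # ReLU-activated FC layer
p = softmax(Wo @ hidden + bo)                # emotion class posteriors
```

Mean pooling makes the utterance representation depend on every frame, instead of relying on the last BLSTM state alone.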

Below the LDNN sub-network sit two Conv layers, each with a number of feature maps activated by the ReLU. Formally, we define an X-CLDNN model to be the LDNN sub-network specified above on top of two X-Conv layers, where X ∈ {FST, ST, T, S}.

A Conv layer is often said to be local because its feature maps, when computed at a local region of the input tensor, depend only on the entries that the feature maps currently overlap with. As a result, we generally expect the input tensor to preserve locality in both the spectral and the temporal directions. However, due to the aforementioned structural differences, it is reasonable to relax this expectation accordingly. For example, a ST-Conv certainly requires its input tensor to maintain local spectral-temporal correlation, while a (FS)T-Conv and a S-Conv only need such locality preservation in the temporal or the spectral direction, respectively. Taking this issue into consideration, we apply all four types of Conv layer to the log-Mels and denote the corresponding CLDNN-based models as X-CLDNN (log-Mels) for X ∈ {FST, ST, T, S}. On the other hand, because the discrete cosine transformation decorrelates the spectral energies, the MFCCs may not maintain locality in the spectral domain. Therefore, we apply only the temporal convolutional operations to the MFCCs and denote these CLDNN-based models as X-CLDNN (MFCCs) for X ∈ {FST, T}.

3 Baseline Models

We evaluate our CLDNN-based models and the convolutional operations therein by comparing them with three baseline models on a speech emotion recognition task. The first baseline model uses the low-level descriptors and their statistical functionals within an utterance to train a support vector machine. The other two baseline models are based on BLSTM recurrent neural networks and take the log-Mels and the MFCCs as their inputs, respectively.

3.1 Support Vector Machine with the Low-Level Descriptors and Their Statistical Functionals

Many speech science studies have empirically found emotion-correlating parameters, also known as the low-level descriptors (LLDs), along different aspects of phonation and articulation in speech, such as speech rate in the time domain, fundamental frequency or formant frequencies in the frequency domain, intensity or energy in the amplitude domain, and relative energy in different frequency bands in the spectral energy domain. Furthermore, statistical functionals over an entire emotional utterance are derived from the LLDs to obtain global information, complementary to the local information captured by the frame-level LLDs. Popular selections of these parameters for developing machine learning algorithms in practical applications often amount to several thousand features; for example, the feature set recommended in the INTERSPEECH 2013 computational paralinguistics challenge contains thousands of parameters derived from the LLDs and their statistical functionals [2]. Researchers have identified the support vector machine as one of the most effective learners for these hand-crafted high-dimensional features [3].

To make our work comparable to the published results, we set up the first baseline model similarly to the evaluation experiments conducted in [3]. We use the openSMILE toolkit [41] to extract the acoustic feature sets of the INTERSPEECH Challenges from 2009 to 2013, including the Emotion Challenge (EC), the Paralinguistic Challenge (PC), the Speaker State Challenge (SSC), the Speaker Trait Challenge (STC) and the Computational Paralinguistics ChallengE (ComParE). On each of these feature sets, we train an SVM for speech emotion recognition.

3.2 LDNN with the log-Mels

As suggested by previous studies [17], explicit temporal modeling is beneficial for speech emotion recognition, and a recurrent neural network is a better choice than a hidden Markov model owing to its outstanding ability to model longer-term temporal relationships. Meanwhile, in order to build a baseline that is both competitive and architecturally compatible with the CLDNN-based models, we take the LDNN architecture defined in Section 2.3 as our second baseline model. In particular, we use the log-Mels as the “raw” input feature set to the LDNN model, without temporal or spectral convolutional operations. We denote this model as the LDNN (log-Mels).

3.3 LDNN with the MFCCs

MFCCs are related to log-Mels via a mathematical construct: the discrete cosine transformation (DCT). Specifically, the relationship is defined as follows:

$$\text{MFCC}[i]=\sqrt{\frac{2}{M}}\sum_{j=1}^{M}\text{log-Mel}[j]\,\cos\!\left(\frac{\pi i}{M}\Big(j-\frac{1}{2}\Big)\right),\qquad(5)$$

where MFCC[$i$] and log-Mel[$j$] are the $i$th and the $j$th coefficients of the MFCCs and the log-Mels, respectively, and $M$ is the number of the Mel-scaled filter banks.

We can easily convert Eq. (5) into a convolutional operation along the spectral direction, in which case all feature maps are tensors of shape $M\times 1$. For the $i$th feature map $\mathbf{W}^{(i)}$, its $j$th component

$$\mathbf{W}^{(i)}_{j}=\sqrt{\frac{2}{M}}\cos\!\left(\frac{\pi i}{M}\Big(j-\frac{1}{2}\Big)\right)$$

is pre-defined mathematically based on prior knowledge of signal processing, rather than task-specifically learnt from training samples. With this development, Eq. (5) can be succinctly summarized as

$$\text{MFCC}=f_{\text{DCT}}(\text{log-Mel}),$$

where $f_{\text{DCT}}$ represents the mathematically pre-defined spectrally only convolutional layer transforming log-Mels into MFCCs. Note that the properties of a conventional convolutional layer, such as the pooling function and the non-linear activation function, are missing in this special configuration of a convolutional layer. In fact, there is no convolutional operation per se. Nevertheless, the purpose of identifying the DCT as a convolutional operation is to encapsulate this spectral modeling in the language of convolutional operations, to help us focus on the differences among the various convolutional operations and, mostly, to contrast the DCT with the S-Conv layer.
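To make the identification concrete, the sketch below builds an HTK-style DCT-II matrix whose rows play the role of pre-defined M x 1 filters; applying it to one log-Mel frame is then exactly a spectral-only convolution with stride 1 (the filter-bank size is an illustrative assumption).

```python
import numpy as np

M = 40                                       # number of Mel filter banks (assumed)
j = np.arange(1, M + 1)
# Row i of D is the pre-defined i-th M x 1 "filter" of the DCT layer.
D = np.array([np.sqrt(2.0 / M) * np.cos(np.pi * i / M * (j - 0.5))
              for i in range(1, M)])         # coefficients i = 1, ..., M-1

log_mel = np.random.randn(M)                 # one frame of log-Mel energies
mfcc = D @ log_mel                           # spectral-only "convolution"
```

A flat spectrum carries no spectral shape, so every row of D annihilates the constant vector; this is the decorrelating behavior that a learned S-Conv layer is free to deviate from.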

Our third baseline model is an LDNN model that takes the MFCCs as its input. Similarly, we denote this model as the LDNN (MFCCs). By comparing the performances of the LDNN (MFCCs) and the S-CLDNN (log-Mels), we can quantitatively demonstrate the advantages of the S-Conv layer over the DCT-CNN layer.

4 Database Description

4.1 The Clean Set

We use the eNTERFACE’05 emotion database [26], a publicly available multi-modal corpus of elicited emotional utterances, to evaluate the performance of our proposed models. Although the entire database contains speech, facial expressions and text, in this work we only conduct experiments on the audio modality. The database includes both male and female subjects from various countries. Each subject was asked to listen carefully to six short stories, each designed to elicit a particular emotion among the archetypal emotions defined by Ekman et al. [42]. The subjects then reacted to each of the scenarios to express their emotion according to a proposed script in English. Each subject was asked to speak five utterances per emotion class for the six emotion classes (anger, disgust, fear, happiness, sadness, and surprise). For each recorded emotional utterance, there is one corresponding global label describing the affective information conveyed by the whole utterance. The resulting corpus, however, is slightly unbalanced in the emotion class distribution because one subject has only two utterances portraying happiness. We call the set of these utterances the clean set. The utterances are a few seconds long on average, and the total duration of the clean set amounts to roughly one hour. We believe it is the moderate number of speakers and the variety of their cultural backgrounds that render it one of the most popular corpora for benchmarking speech emotion recognition models.

4.2 The Noisy Set

Deep neural networks have a well-known reputation for being data-hungry. Despite the aforementioned diversity, the clean set alone is too small to train a deep neural network as big as a CLDNN without incurring a high risk of over-fitting. Various techniques have been proposed to implicitly or explicitly regularize the training of deep neural networks in order to prevent over-fitting and improve generalization, such as dropout [40], early stopping [43], data augmentation [36], transfer learning [15] and the recent group convolution approach [44]. In addition to the dropout mechanism and the early-stopping strategy, we adopt data augmentation to artificially increase the number of data samples for the purpose of implicit regularization. Specifically, we mix samples from the clean set with samples from another publicly available database, the MUSAN corpus [46], at several randomly chosen levels of signal-to-noise ratio (SNR).

The MUSAN corpus consists of three portions: music, speech and noise. As speech and music may inherently convey affective information, mixing samples from these two portions with clean emotional utterances would unnecessarily complicate the learning process and could result in a suboptimal system due to a mixture of inconsistent emotion types. Therefore, to avoid adding confounding factors to the clean emotional utterances, we only use the noise portion of the MUSAN corpus for data augmentation. The noise portion contains samples of assorted noise types, including technical noises, such as dual-tone multi-frequency (DTMF) tones, dialtones and fax machine noises, and ambient sounds, such as car idling, thunder, wind, footsteps, paper rustling, rain and animal noises. The total duration of the noise portion is about six hours. We generate artificially corrupted data from the clean set using the following recipe. For each clean utterance, a set of noise samples is uniformly selected from the noise portion and a set of SNR levels is uniformly chosen from a fixed interval. Mixing the clean utterance with the combinations of noise samples and SNR levels augments the clean set by a constant factor. Note that randomly selecting samples from the noise portion has an advantage over simply using a fixed subset of the noise portion: owing to the stochasticity, the probability of choosing the same set of noise samples twice is vanishingly small. By carefully eliminating potential artificial patterns, we hope the deep neural networks can capture the true underlying acoustic emotion prototypes.
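The mixing step can be sketched as follows; the cropping/tiling policy and the function interface are our own assumptions rather than the paper's implementation:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Mix a clean waveform with a noise waveform at a target SNR (in dB).

    Both signals are 1-D float arrays; the noise is tiled and randomly
    cropped to match the length of the clean utterance (an assumption).
    """
    rng = rng or np.random.default_rng()
    # Tile the noise if it is shorter than the clean utterance, then crop.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Applying this to each clean utterance for every sampled (noise, SNR) pair yields the augmented copies described above.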

We call this set of resulting noisy utterances the noisy set, as opposed to the clean set defined above; its total duration is correspondingly many times that of the clean set. In the following sections, the clean condition means the experiments are conducted on the clean set, whereas the noisy condition means they are conducted on the union of the noisy and clean sets. Moreover, we further randomly divide the set of subjects into training, validation and testing (TVT) partitions under a fixed percentage constraint for experimental convenience. Partitioning the subject set, instead of the utterance set, allows us to maintain speaker independence across all experiments.
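Speaker independence follows from partitioning subject identifiers rather than utterances; a minimal sketch, where the 7:1:2 ratio is a placeholder since the exact percentages are not reproduced here:

```python
import random

def split_speakers(subject_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Randomly partition subjects (not utterances) into training, validation
    and testing sets so that no speaker appears in more than one partition.
    The 7:1:2 ratio is an illustrative placeholder, not the paper's value."""
    ids = sorted(subject_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_valid = round(ratios[1] * len(ids))
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test
```

All utterances of a subject then inherit that subject's partition, which is exactly what keeps the evaluation speaker-independent.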

5Speech Emotion Recognition Experiments

In this section, we evaluate the proposed models with the following experiments:

  1. Baseline models

    1. SVM with openSMILE features

    2. LDNN (MFCCs)

    3. LDNN (log-Mels)

  2. CLDNN-based models

    1. T-CLDNN (MFCCs)

    2. FST-CLDNN (MFCCs)

    3. T-CLDNN (log-Mels)

    4. S-CLDNN (log-Mels)

    5. ST-CLDNN (log-Mels)

    6. FST-CLDNN (log-Mels)

The purposes of these experiments are two-fold. The comparison between the baseline models and the CLDNN-based models aims to demonstrate the effectiveness of the convolutional operations in learning affective information. Within the category of CLDNN-based models, the goal is to quantify the differences between the types of convolutional operations.

5.1SVM with openSMILE features

For the first set of baseline experiments, we employ two evaluation strategies. In the first, we perform a leave-one-subject-out (LOSO) cross validation. Since we train our deep neural network models on the TVT partitions, the second strategy evaluates the performance of the SVM classifiers on the TVT partitions for a fair comparison. In addition, we apply the regular pre-processing procedures, including speaker standardization to remove speaker characteristics and class weighting to compensate for the slight class imbalance. We conduct the baseline experiments using SVM classifiers trained on the acoustic feature sets from the past INTERSPEECH challenges. The SVM classifiers are trained on these hand-crafted high-dimensional features using the Scikit-Learn machine learning toolkit [47] with linear, polynomial and radial basis function (RBF) kernels. All SVM experiments are conducted under the clean condition.
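A minimal sketch of the LOSO strategy with per-speaker standardization and class weighting, using the Scikit-Learn primitives mentioned above; the RBF kernel choice here and the function interface are illustrative:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

def loso_ua(features, labels, speakers):
    """Leave-one-subject-out evaluation of an RBF-kernel SVM with
    per-speaker feature standardization and balanced class weights.
    Returns the mean balanced (unweighted) accuracy over the folds."""
    X = np.asarray(features, dtype=float).copy()
    y = np.asarray(labels)
    groups = np.asarray(speakers)
    # Speaker standardization: z-normalize each speaker's features.
    for s in np.unique(groups):
        idx = groups == s
        X[idx] = (X[idx] - X[idx].mean(axis=0)) / (X[idx].std(axis=0) + 1e-8)
    clf = SVC(kernel="rbf", class_weight="balanced")
    scores = cross_val_score(clf, X, y, groups=groups,
                             cv=LeaveOneGroupOut(),
                             scoring="balanced_accuracy")
    return scores.mean()
```

The `balanced_accuracy` scorer is the macro-averaged per-class recall, i.e. the same unweighted accuracy used as the metric later in the paper.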

5.2CLDNN-based Models with the MFCCs and the log-Mels

To begin with, we extract the log-Mels and the MFCCs using the KALDI toolkit [48] with standard window size and window shift settings; the number of Mel-frequency filterbanks is the same in both cases. It has been shown [3] that, due to the strong energy compaction property of the discrete cosine transformation (DCT), the lower-order MFCCs are more important for affective and paralinguistic analysis, while the higher-order MFCCs are more related to phonetic content. In fact, the INTERSPEECH challenge feature sets contain the leading orders of MFCCs, and the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) [3] recommends using only the first four MFCCs. In this work, we keep the conventional set of leading coefficients when computing the MFCCs. After feature extraction, we splice the raw log-Mels and raw MFCCs with a symmetric context of frames to the left and to the right, so that each spliced log-Mel or spliced MFCC lives in a correspondingly higher-dimensional space. An emotional utterance is then represented as a sequence of spliced spectral vectors. We train the LDNN (log-Mels) and the LDNN (MFCCs) as depicted in Figure 1 with their corresponding inputs.
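Frame splicing can be sketched as follows; the context width in the example and the edge-padding policy (repeating the boundary frames) are our assumptions:

```python
import numpy as np

def splice(frames, left=2, right=2):
    """Splice a (T, D) feature matrix with `left` context frames to the left
    and `right` to the right, giving a (T, (left+1+right)*D) matrix.
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)], axis=0)
    T = len(frames)
    # Row t of the result is the concatenation of frames t-left .. t+right.
    return np.concatenate([padded[i:i + T] for i in range(left + 1 + right)],
                          axis=1)
```

Each row of the spliced matrix is one of the spliced spectral vectors fed to the LDNN models.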

A summary of the ranges for parameter tuning for each type of convolutional layer; some ranges are expressed in terms of the spectral dimensionality, and the subscripts correspond to the first and the second convolutional layers, respectively.

To accommodate the inputs to the various CLDNN models in Figure 1, we further reshape each spliced vector into a two-dimensional spectral-temporal matrix. We then train the X-CLDNN (log-Mels) and X-CLDNN (MFCCs) models on the emotional utterances, where X ranges over the convolution types {S, T, ST, FST}. The ranges of the tunable parameters for the convolutional layers are summarized in Table ?, where, as shown, we focus mostly on the first Conv layer. We exhaust all parameter combinations for the S-Conv, T-Conv and FST-Conv types when tuning the architectural parameters. Note, however, that the search space of the optimal parameter set for the ST-Conv is rather large. Therefore, instead of exploring all combinations aimlessly, we limit our attention to combinations of the top-performing parameters from the S-Conv and the T-Conv.
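Exhausting the parameter combinations is a plain Cartesian product over the per-parameter ranges; a generic sketch, where the parameter names and values in the usage example are placeholders rather than the actual search space:

```python
import itertools

def parameter_grid(**ranges):
    """Enumerate every combination of architectural hyper-parameters, as in
    an exhaustive search over the Conv-layer settings. Each yielded dict is
    one candidate configuration."""
    keys = list(ranges)
    for combo in itertools.product(*(ranges[k] for k in keys)):
        yield dict(zip(keys, combo))
```

For example, `list(parameter_grid(filters=[32, 64], kernel_size=[3, 5]))` yields four candidate settings; restricting one range to the top-performing values from a previous sweep gives the reduced ST-Conv search described above.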

We use the Keras library [49] on top of the Theano [50] backend to specify the network architectures and execute the learning processes on an NVIDIA K40 Kepler GPU. The weights of all deep neural network models are learnt by minimizing the cross-entropy objective with the Adam method [51] for stochastic optimization at the default initial learning rate. The mini-batch size is fixed to a small value owing to the capacity of the GPU memory as well as the pursuit of better generalization [52]. An early-stopping strategy [43] with a fixed patience in epochs is employed to avoid over-training. We train all deep neural network models on the emotional utterances in the training partition under the noisy condition; we perform parameter tuning on the validation partition, and the most competitive model on the validation partition under the noisy (clean) condition is tested under the noisy (clean) condition, respectively.
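The early-stopping strategy amounts to a simple patience counter on the validation loss (Keras exposes the same behavior through its `EarlyStopping` callback); a minimal sketch with an illustrative patience value, since the exact patience is not reproduced here:

```python
class EarlyStopping:
    """Minimal early-stopping monitor: stop once the validation loss has not
    improved for `patience` consecutive epochs. The patience of 5 is an
    illustrative placeholder."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss   # new best: reset the counter
            self.wait = 0
        else:
            self.wait += 1         # no improvement this epoch
        return self.wait >= self.patience
```

Calling `should_stop` once per epoch with the current validation loss reproduces the stopping rule used during training.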

6Experimental Results

We present our experimental results for speech emotion recognition in this section. Even though the class imbalance in the corpus is insignificant, throughout the entire section we use the unweighted accuracy (UA) as the performance metric to avoid being biased toward the larger classes.
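The UA is the macro-average of the per-class recalls; a minimal reference implementation:

```python
import numpy as np

def unweighted_accuracy(y_true, y_pred):
    """Unweighted accuracy (UA): per-class recall averaged over classes, so
    every emotion class contributes equally regardless of its size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

Unlike plain accuracy, a classifier that ignores a minority class is penalized: predicting the majority class everywhere scores high on accuracy but low on UA.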

6.1SVM with openSMILE features

Table ? summarizes the results of using SVM classifiers to assign each emotional utterance one of the 6 archetypal emotions. Under the LOSO evaluation strategy, an SVM with the STC feature set gives the best baseline performance, while under the TVT evaluation strategy, an SVM with the ComParE feature set stands out among the feature sets. It is clear from these results that an SVM learns better from higher-dimensional feature sets such as the ComParE and the STC sets, a phenomenon also consistently observed in [3]. Yan et al. [25] recently published a baseline result on the eNTERFACE’05 corpus using the PC feature set. They trained an SVM classifier on the PC feature set with a speaker-dependent five-fold cross validation strategy as one of their baseline models. Their baseline is comparable to ours and is included in Table ? as well.

The performances (UA (%)) of the optimal SVM model, the LDNN-based models and the CLDNN-based models. The sparse kernel reduced rank regression (SKRRR) [25] is one of the state-of-the-art models on the eNTERFACE’05 corpus.
Model (features)      noisy  clean
LDNN (MFCCs)          75.51  88.33
LDNN (log-Mels)       78.87  90.42
T-CLDNN (MFCCs)       83.44  87.92
FST-CLDNN (MFCCs)     84.45  92.92
T-CLDNN (log-Mels)    84.23  92.92
S-CLDNN (log-Mels)    82.73  91.67
ST-CLDNN (log-Mels)   84.26  93.75
FST-CLDNN (log-Mels)  86.21  94.58

6.2LDNN with the MFCCs and the log-Mels

We present the results of the LDNN-based models in Table ?. Under the noisy condition, the LDNN (MFCCs) and the LDNN (log-Mels) models accurately classify 75.51% and 78.87% of the testing samples, respectively; under the clean condition, they reach 88.33% and 90.42%, respectively. One can easily observe a gap of 3.36% and 2.09%, respectively, between the LDNN (MFCCs) and the LDNN (log-Mels) under each condition. Since the MFCCs are DCT-transformed log-Mels, this implies that the DCT may have removed a certain amount of affective information when transforming the log-Mels into the MFCCs. The widened gap under the noisy condition also suggests that the MFCCs are more sensitive to noise than the log-Mels, which renders learning from the MFCCs a more challenging task. Nevertheless, both LDNN models achieve promising results comparable to those of one of the state-of-the-art models on the eNTERFACE’05 corpus, the sparse kernel reduced rank regression (SKRRR) [25].

6.3CLDNN with the MFCCs and the log-Mels

Finally, Table ? also presents the effectiveness of the CLDNN-based models for classifying emotional utterances into one of the 6 archetypal emotions. First of all, notice that, with the CNN layers, all CLDNN-based models improve upon their LDNN-based counterparts under both the noisy and clean conditions, except that the T-CLDNN (MFCCs) yields a slightly inferior performance under the clean condition. Since the MFCCs are rather sensitive to noise, it is likely that its T-Conv layers are mainly optimized to reduce the prominent variations due to the artificial noise while neglecting subtler factors of variation such as speaker or gender. Yet, the result of the FST-CLDNN (MFCCs) also suggests that the MFCCs still contain a reasonable amount of affective information that is learnable by a suitable architecture.

Among the X-CLDNN (log-Mels) models, the order of performance from high to low is the FST-CLDNN (log-Mels), the ST-CLDNN (log-Mels), the T-CLDNN (log-Mels) and the S-CLDNN (log-Mels). The fact that the FST-Conv outperforms the ST-Conv is consistent with the conclusion of [13]; however, the margin is not as significant when an LDNN sub-network helps with temporal modeling. It has been reported that the S-Conv layer in an S-CLDNN (log-Mels) degrades speech recognition performance under a moderately noisy condition [53]. The authors attributed this deterioration to the noise-enhanced difficulty for local filters to make decisions when learning to capture translational invariance. This attribution seems valid when we contrast the FST-Conv with the other three types. In fact, on closer inspection, the degree of noise-induced difficulty varies across the types of convolutional operations: the S-Conv suffers from noise the most, followed by the T-Conv and the ST-Conv to a roughly equivalent degree, and finally the FST-Conv the least. Even though we validate on the clean validation partition when selecting the model to be tested on the clean testing partition, the performances under the clean condition exhibit a similar noise-influenced trend, since the training process was carried out under the noisy condition.

One of our goals is to benchmark the strengths of the S-Conv and the discrete cosine transformation for spectral modeling. Specifically, the fair comparison is between the LDNN (MFCCs) and the S-CLDNN (log-Mels), where the DCT and the S-Conv layers, respectively, act on the spliced log-Mels along the spectral direction, and both have an LDNN sub-network for further temporal modeling. Despite the negative impact of noise on the S-Conv layer, it is interesting to observe a stark performance gap between them under the noisy condition. Even under the clean condition, the S-CLDNN (log-Mels) still leads by more than 3%. Owing to its task independence, the DCT is not particularly designed to separate the affective information from the other factors. Moreover, since the DCT is shallow and structurally simple, the S-Conv layer has an advantage as it is deeper and thus better at disentangling the underlying factors of variation [54]. This strength manifests most clearly for the noise-related factors. Given that the MFCCs still carry a reasonable amount of affective information, the significant performance differences between the S-Conv and the DCT are best explained by the inability of the DCT to adequately disentangle the affective information from the other irrelevant factors of variation.

Last but not least, we notice that temporal convolution and temporal recurrence learn complementary information. For instance, the LDNN (log-Mels) models the evolution of affective information through temporal recurrence alone, while the FST-CLDNN (log-Mels) fits the dynamics via temporal convolution followed by temporal recurrence, which improves upon the LDNN (log-Mels) and results in a more competitive system.

6.4Module-wise evaluations

We have so far analyzed the proposed models from an end-to-end perspective and observed interesting phenomena. Although this kind of external analysis has distilled certain working knowledge, we are equally interested in the internal mechanisms of these models. Along these lines, a key step is to track the flow of relevant information using techniques such as an information regularizer [57] or layer-wise evaluation [58]. In this work, we take the second approach due to its simplicity. To be clear, we only evaluate the intermediate representations at the module level, where by module we mean the CNN module (two Conv layers), the BLSTM module (a BLSTM layer) and the multi-layer perceptron (MLP) module (four FC layers) that make up a CLDNN model.

To begin with, we take the trained CLDNN-based models as feature extractors and the activated responses of each layer as discriminative features. For each CLDNN model, we only keep the extraction from the output layer of each module. In addition, the raw spectral-temporal features are included to serve as a lower bound. A mean pooling over the temporal direction is applied to the raw features, the output of the CNN module and the output of the BLSTM module to form an utterance representation for each of them. To quantify the improvement of the representations for speech emotion recognition achieved by each module, we train an SVM classifier on the utterance representation from the output of each module as well as on the raw features. The experimental setting is similar to the SVM baseline, where only the clean set is used and the evaluation is based on the TVT strategy.
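A sketch of this probing procedure, assuming the module activations have already been extracted as per-utterance (T, D) arrays; the function name and SVM settings are illustrative, not the paper's exact configuration:

```python
import numpy as np
from sklearn.svm import SVC

def probe_module(activations, labels, train_idx, test_idx):
    """Train a probing SVM on mean-pooled module activations and report its
    accuracy on held-out utterances.

    `activations` is a list of (T_i, D) arrays, one per utterance; mean
    pooling over the temporal axis yields a fixed-length representation.
    """
    X = np.stack([np.asarray(a).mean(axis=0) for a in activations])
    y = np.asarray(labels)
    clf = SVC(kernel="rbf", class_weight="balanced")
    clf.fit(X[train_idx], y[train_idx])
    return clf.score(X[test_idx], y[test_idx])
```

Running this probe on the raw features and on each module's output, on the same TVT split, gives the per-module discriminative power reported below.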

Quantitative Analysis

The performances (UA (%)) of a SVM classifier trained on the spliced log-Mels, the spliced MFCCs and the output of each module from all CLDNN-based models under the clean condition.
Model (features)      Raw    CNN    BLSTM  MLP
T-CLDNN (MFCCs)       23.75  52.50  88.75  88.75
FST-CLDNN (MFCCs)     23.75  56.25  88.75  92.50
T-CLDNN (log-Mels)    27.92  59.17  93.33  93.33
S-CLDNN (log-Mels)    27.92  45.83  88.33  91.67
ST-CLDNN (log-Mels)   27.92  55.83  89.17  93.75
FST-CLDNN (log-Mels)  27.92  54.17  89.17  94.58

Figures 2–10: LDA visualizations of the module outputs of the T-CLDNN (log-Mels). Rows correspond to the affective (Figs. 2–4), speaker (Figs. 5–7) and gender (Figs. 8–10) information; columns correspond to the CNN, BLSTM and MLP module outputs. The subplot titles give the clustering ratio values: 2.506, 1.148, 0.263 (affective); 1.298, 2.800, 7.762 (speaker); 2.988, 5.292, 17.292 (gender).

Table ? summarizes the results of the module-wise evaluation. As shown in the second column, even though the training and the testing are carried out under the clean condition, the discrete cosine transformation degrades the performance once again. Nevertheless, most of the CNN modules lift the discriminative power to above 50% regardless of the raw features, except for one particularly under-performing model, the S-CLDNN (log-Mels), which, based on the previous analysis, is known to suffer drastically from noise. One can easily observe that each type of Conv layer learns a different representation and hence yields a different level of discriminative power.

It is interesting to note that the SVMs trained on the activations of the CNN module in the {T,ST}-CLDNN (log-Mels) give a better accuracy than the one based on the FST-CLDNN (log-Mels), yet from a holistic perspective the FST-Conv based system is the most robust one. This may reflect one of the biggest advantages of the end-to-end training approach over the traditional layer-wise approach, which treats feature engineering and classifier training separately; i.e., a greedy layer-wise training that forces the distribution of an intermediate layer to prematurely approximate the distribution of the labels is likely to result in a suboptimal system.

Going deeper into the networks, we see that most of the BLSTM modules further improve the discriminative power to the level of 88%–89%, except for the T-CLDNN (log-Mels), which reaches 93.33%. In fact, taking a closer look at the T-CLDNN (MFCCs) and the T-CLDNN (log-Mels), we find that they both attain one of their optimal forms of affective representation at the output of the BLSTM module. Rather than implying that their MLP modules do nothing, the constant performance may suggest that their MLP modules are integrating out irrelevant information while maintaining the optimal representation. Finally, in the other CLDNN models, the MLP modules further refine the representation to make the prediction an easier task. To sum up, in terms of the UA, the average contributions of the CNN module, the BLSTM module and the MLP module are roughly 27%, 36% and 3%, respectively.


Qualitative Analysis

In addition to the quantitative analysis of each module, we also present visualizations of the representations to gain intuition about the internal working mechanism. To demonstrate the interplay between the modules and the other irrelevant information, we consider two other types of information that, along with the affective information, are embedded in the original utterances at the same time: the gender and the speaker information. For every representation extracted from each module, we assign three labels: the gender of the speaker (female, male), the serial number of the speaker (sN, where N indexes the subjects) and the emotion class (ang for anger, dis for disgust, fea for fear, hap for happiness, sad for sadness, and sur for surprise). On the clean training partition, a linear discriminant analysis (LDA) is applied to the representations and projects them onto the space spanned by the first C−1 components, where C stands for the number of classes. Each LDA is carried out with respect to these three labels separately, and a class prior is employed to match the number of samples in each class. Moreover, we also compute the intra-cluster and inter-cluster inertia [59] on the extracted representations for each label using the following definitions

$$\mathcal{I}_{\mathrm{intra}} = \frac{1}{|X|}\sum_{c}\sum_{x\in X_c} d(x,\mu_c)^2, \qquad \mathcal{I}_{\mathrm{inter}} = \frac{1}{|X|}\sum_{c} |X_c|\, d(\mu_c,\mu)^2,$$

where $X$ is the training set, $X_c$ is the subset of $X$ containing only a specific class $c$, $\mu_c$ is the arithmetic center of $X_c$, $\mu$ is the arithmetic center of $X$, $x$ is a member of $X_c$ and $d$ is the Euclidean distance. Note that the vectors are the original representations rather than the LDA projections. One expects a small intra-cluster inertia and a large inter-cluster inertia in a good clustering; in other words, the following ratio measures the quality of a clustering

$$\rho = \frac{\mathcal{I}_{\mathrm{intra}}}{\mathcal{I}_{\mathrm{inter}}},$$

where the smaller the value of the ratio, the better the clustering.
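The ratio can be computed directly from the extracted representations; a sketch assuming the conventional sum-of-squares inertia definitions (any common normalization constant cancels in the ratio):

```python
import numpy as np

def inertia_ratio(X, y):
    """Intra-cluster inertia divided by inter-cluster inertia; smaller values
    indicate tighter, better-separated clusters."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu = X.mean(axis=0)                      # global center
    intra = inter = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)               # class center
        intra += np.sum((Xc - mu_c) ** 2)    # spread within the class
        inter += len(Xc) * np.sum((mu_c - mu) ** 2)  # spread between classes
    return intra / inter
```

Evaluating this on the representations at each module output, for each of the three label types, yields the values shown in the subplot titles of the visualizations.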

Fig. ? shows an example of the visualization for the modules in the T-CLDNN (log-Mels). The first, second and third rows correspond to the affective, speaker and gender information, while the first, second and third columns denote the outputs of the CNN, the BLSTM and the MLP modules, respectively. In each subplot, every dot indicates an utterance; utterances within the same class are painted in the same color, and their class centers are marked with the according labels such as hap, s07 and female. The title of each subplot is the value of the clustering ratio for the distributions in the subplot.

Based on the visualization and the ratio values in the first row, it is clear that the CLDNN model gradually learns to discriminate different affective patterns. Of the six emotion classes, anger is consistently the most prominent class across different architectures, with sadness ranked second. The progressively improving separability in the first row confirms our quantitative analysis as well.

On the other hand, the speaker and the gender information are rather salient at the beginning of the architecture. As the forward propagation proceeds, these two types of information are filtered out incrementally. Note that the LDA projections in the second and third rows are computed on the raw extracted representations with their respective labels, i.e. speaker and gender labels, and yet the deteriorating separability is evident from the scatter plots and the increasing trend of the clustering ratio.

Based on the results of the quantitative analysis, we had thought it was the BLSTM module that discards the largest amount of irrelevant information compared to the other modules. However, contrary to our initial expectation, it is the MLP module that most drastically degrades the separability among speaker or gender classes. For instance, even at the output of the BLSTM module, the model still keeps a fair amount of gender information (Figure 9), but at the output of the MLP module the centers of the male and female utterances practically overlap (Figure 10). Previous studies have shown that the higher-level representations of a deep neural network can better disentangle the underlying factors of variation embedded in the input signals [54]. This visualization suggests that the CNN and the BLSTM modules mostly play the role of lifting the input tensor onto a high-dimensional manifold, similar to the kernel method, in order to disentangle the affective factor from the others, and that the MLP module is consequently mainly responsible for integrating out the other factors of variation so as to optimize the objective function. In addition, this observation vividly explains the working mechanisms of multi-task learning, which learns multiple related tasks jointly by sharing a common sub-network in the front, and of the transfer learning approach, which freezes the underlying layers of a pre-trained model and re-learns or fine-tunes the top few, often fully-connected, layers.

The progression from the second column to the third corroborates our working hypothesis from the quantitative analysis about the T-CLDNN models as well: instead of doing nothing, the MLP module in a T-CLDNN model refines the representations while keeping the affective information.

For the visualization of all CLDNN-based models, please refer to Appendix Section 8.


7Conclusion

We report a benchmarking of four types of convolutional operations in deep convolutional recurrent neural networks for speech emotion recognition: the spectral-only (S-Conv), the temporal-only (T-Conv), the spectral-temporal (ST-Conv) and the full-spectrum temporal (FST-Conv) convolutions. We found that these types suffer from noise to varying degrees: noise negatively influences the S-Conv the most, followed by the T-Conv and the ST-Conv, and the FST-Conv the least. Under both conditions, the FST-Conv outperforms the other three types, as well as one of the state-of-the-art models under the clean condition.

Even though the S-Conv is the weakest type, the comparison between the S-CLDNN (log-Mels) and the LDNN (MFCCs) shows a significant performance gap between them, which can mostly be attributed to the difference between the S-Conv and the discrete cosine transformation. On the other hand, the FST-CLDNN (MFCCs) is still able to achieve a reasonably good accuracy. These two experiments suggest that although the DCT may discard a certain amount of affective information, that loss does not entirely account for the performance gap. Rather, we may link the mediocre performance of the LDNN (MFCCs) to the inability of the DCT to adequately disentangle the affective information from other correlated irrelevant factors of variation, such as speaker and gender differences and those caused by noise. Based on previous studies of deep neural networks, it is likely that the shallow, structurally simple and task-independent nature of the DCT leads to this incapability.

Meanwhile, we also found that the temporal convolution and the temporal recurrence are able to learn complementary information, and the combination of both results in a robust model such as the FST-CLDNN. Nevertheless, we only consider the architecture of a CNN module followed by a BLSTM module. It would be interesting to see if an architecture of a BLSTM module followed by a CNN module would make any difference.

In order to understand the internal mechanism of a CLDNN model, we quantitatively analyzed the module-wise discriminative power by training an SVM on the activations extracted from the output of each module. The reported accuracy can be viewed as an approximate measure of quality in the sense of readiness to exploit the affective information. From the results in Table ?, we found that the CNN module, the BLSTM module and the MLP module contribute average refinements of roughly 27%, 36% and 3% to the quality, respectively. This ranking is not surprising, as studies in psychology [16] and computational paralinguistics [17] all point out that emotion is characterized by temporally dependent dynamics. Nevertheless, our findings show that the CNN module is capable of significantly enhancing the separability of the emotion classes compared to the raw features, particularly under a noisy condition.

In addition, we visualized three types of information along the depth of the proposed models: the affective, speaker and gender information. From the visualization, we observe that a model progressively learns to discriminate different emotional patterns, with anger and sadness being two of the most prominent emotion classes across all models. More interestingly, the other irrelevant factors of variation are integrated out at varying rates from one module to another. Specifically, the CNN and the BLSTM modules still keep a moderate portion of the gender and speaker information, but in the end the MLP module refines the learnt representations by drastically reducing the other types of variation.

8Visualization of All Models

Figures 11–55: LDA visualizations of the module outputs of the remaining CLDNN-based models, arranged as in Figs. 2–10 (rows: affective, speaker and gender information; columns: CNN, BLSTM and MLP module outputs). The subplot titles give the clustering ratio values:

Figs. 11–19: 2.821, 1.193, 0.314; 1.278, 2.857, 7.376; 1.777, 4.303, 19.403.
Figs. 20–28: 2.655, 1.329, 0.317; 1.298, 2.615, 7.747; 2.015, 4.821, 23.587.
Figs. 29–37: 2.670, 1.731, 0.361; 1.391, 2.507, 7.057; 3.338, 5.656, 16.753.
Figs. 38–46: 2.850, 1.023, 0.315; 1.326, 2.436, 7.273; 2.441, 4.318, 13.363.
Figs. 47–55: 2.495, 1.593, 0.301; 1.486, 2.563, 6.749; 2.502, 4.185, 12.230.
Figure 55:


  1. S. S. Narayanan and P. Georgiou, “Behavioral signal processing: Deriving human behavioral informatics from speech and language,” Proceedings of IEEE, vol. 101, no. 5, pp. 1203–1233, May 2013.
  2. B. W. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. R. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, “The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism,” in Proceedings of Interspeech, 2013.
  3. F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. Andre, C. Busso, L. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. Truong, “The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective Computing, 2015.
  4. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1,” 1986, ch. Learning Internal Representations by Error Propagation, pp. 318–362.
  5. Y. L. Cun, B. Boser, J. S. Denker, R. E. Howard, W. Habbard, L. D. Jackel, and D. Henderson, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
  6. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, 1997.
  7. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
  8. M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of the European Conference on Computer Vision, 2014.
  9. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations, 2015.
  10. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  11. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  12. Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient features for speech emotion recognition using convolutional neural networks,” IEEE Trans. Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
  13. N. Anand and P. Verma, “Convoluted feelings convolutional and recurrent nets for detecting emotion from audio data,” Technical Report, Stanford University, 2015.
  14. S. Zhang, S. Zhang, T. Huang, and W. Gao, “Multimodal deep convolutional neural network for audio-visual emotion recognition,” in Proceedings of the International Conference on Multimedia Retrieval, 2016.
  15. B. Milde and C. Biemann, “Using representation learning and out-of-domain data for a paralinguistic speech task,” in Proceedings of Interspeech, 2015.
  16. K. Oatley, D. Keltner, and J. Jenkins, Understanding Emotions. Blackwell, 1996.
  17. B. Schuller, G. Rigoll, and M. Lang, “Hidden markov model-based speech emotion recognition,” in Proceedings of the International Conference on Multimedia and Expo, 2003.
  18. M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan, “Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling,” in Proceedings of Interspeech, 2010.
  19. A. Metallinou, M. Wöllmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan, “Context-sensitive learning for enhanced audiovisual emotion classification,” IEEE Transactions on Affective Computing, vol. 3, no. 2, Apr. 2012.
  20. J. Lee and I. Tashev, “High-level feature representation using recurrent neural network for speech emotion recognition,” in Proceedings of Interspeech, 2015.
  21. T. N. Sainath, O. Vinyals, A. W. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
  22. T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Proceedings of Interspeech, 2015.
  23. C. W. Huang and S. S. Narayanan, “Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 2017.
  24. W. Lim, D. Jang, and T. Lee, “Speech emotion recognition using convolutional and recurrent neural networks,” in Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016.
  25. J. Yan, W. Zheng, Q. Xu, G. Lu, H. Li, and B. Wang, “Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech,” IEEE Transactions on Multimedia, vol. 18, no. 7, pp. 1319–1329, 2016.
  26. O. Martin, I. Kotsia, B. M. Macq, and I. Pitas, “The eNTERFACE’05 audio-visual emotion database,” in Proceedings of the International Conference on Data Engineering Workshops, 2006.
  27. K. Han, D. Yu, and I. Tashev, “Speech emotion recognition using deep neural network and extreme learning machine,” in Proceedings of Interspeech, 2014.
  28. O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
  29. D. Li, O. Abdel-Hamid, and D. Yu, “A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
  30. W. Chan and I. Lane, “Deep convolutional neural networks for acoustic modeling in low resource languages,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
  31. S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” 2016, arXiv:1609.09430.
  32. N. Takahashi, M. Gygli, and L. V. Gool, “AEnet: Learning deep audio features for video analysis,” 2017, arXiv:1701.00599.
  33. G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
  34. X. Ma, H. Yang, Q. Chen, D. Huang, and Y. Wang, “DepAudioNet: An efficient deep model for audio based depression classification,” in Proceedings of the International Workshop on Audio/Visual Emotion Challenge, 2016.
  35. P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” 2017, arXiv:1704.08619.
  36. G. Keren, J. Deng, J. Pohjalainen, and B. Schuller, “Convolutional neural networks with data augmentation for classifying speakers' native language,” in Proceedings of Interspeech, 2016.
  37. J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
  38. A. Graves, S. Fernández, and J. Schmidhuber, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in Proceedings of the International Conference on Artificial Neural Networks, 2005.
  39. M. Sundermeyer, T. Alkhouli, J. Wuebker, and H. Ney, “Translation modeling with bidirectional recurrent neural networks,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
  40. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, Jan. 2014.
  41. F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: The Munich versatile and fast open-source audio feature extractor,” in Proceedings of the ACM International Conference on Multimedia, 2010.
  42. P. Ekman, E. R. Sorenson, W. V. Friesen et al., “Pan-cultural elements in facial displays of emotion,” Science, vol. 164, no. 3875, pp. 86–88, 1969.
  43. D. Maclaurin, D. Duvenaud, and R. P. Adams, “Early stopping is nonparametric variational inference,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2016.
  44. T. S. Cohen and M. Welling, “Group equivariant convolutional networks,” in Proceedings of the International Conference on Machine Learning, 2016.
  45. S. Dieleman, J. De Fauw, and K. Kavukcuoglu, “Exploiting cyclic symmetry in convolutional neural networks,” in Proceedings of the International Conference on Machine Learning, 2016.
  46. D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1.
  47. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  48. D. Povey, A. Ghoshal, G. Boulianne, N. Goel, M. Hannemann, Y. Qian, P. Schwarz, and G. Stemmer, “The KALDI speech recognition toolkit,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2011.
  49. F. Chollet, “Keras,” 2015.
  50. Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv:1605.02688, May 2016.
  51. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
  52. N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proceedings of the International Conference on Learning Representations, 2017.
  53. T. N. Sainath and B. Li, “Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks,” in Proceedings of Interspeech, 2016.
  54. R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” in Proceedings of the International Conference on Learning Representations, 2014.
  55. X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Proceedings of the International Conference on Machine Learning, 2011.
  56. I. J. Goodfellow, Q. V. Le, A. M. Saxe, H. Lee, and A. Y. Ng, “Measuring invariances in deep networks,” in Proceedings of the International Conference on Neural Information Processing Systems, 2009, pp. 646–654.
  57. C. W. Huang and S. S. Narayanan, “Flow of Rényi information in deep neural networks,” in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, 2016.
  58. G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv:1610.01644, Oct. 2016.
  59. L. Lebart, A. Morineau, and J. Fenelon, Traitement des donn´ees statistiques.1em plus 0.5em minus 0.4emDunod, 1979.
  60. J.-C. Lamirel, P. Cuxac, R. Mall, and G. Safi, A New Efficient and Unbiased Approach for Clustering Quality Evaluation. Springer Berlin Heidelberg, 2012, pp. 209–220.