Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks


Abstract

This paper presents an audiovisual-based emotion recognition hybrid network. While most previous work focuses either on deep models or on hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neural networks (RNN) trained on facial images, the hybrid network also contains one SVM classifier trained on holistic acoustic feature vectors, one long short-term memory network (LSTM) trained on short-term feature sequences extracted from segmented audio clips, and one Inception(v2)-LSTM network trained on image-like maps, which are built from short-term acoustic feature sequences. Experimental results show that the proposed hybrid network outperforms the baseline method by a large margin.

Xin Guo, Luisa F. Polanía, Kenneth E. Barner¹

University of Delaware, Department of Electrical and Computer Engineering, Newark, DE, USA
{guoxin, barner}@udel.edu

Target Corporation, Sunnyvale, CA, USA
Luisa.PolaniaCabrera@target.com

Keywords: Audio-video emotion recognition, multimodal fusion, long short-term memory networks

1 Introduction

Emotion recognition is relevant in many computing areas that take into account the affective state of the user, such as human-computer interaction [1], human-robot interaction [2], music and image recommendation [3], affective video summarization [4], and personal wellness and assistive technologies [5]. Although emotion recognition is an interesting problem, it is also very challenging unless the recording conditions are well controlled. Emotion recognition “in the wild” suffers from many issues that need to be overcome, such as cluttered backgrounds, large variances in face pose and illumination, video and audio noise, and occlusions.

Figure 1: The overall structure of the proposed hybrid network. For visualization purposes, only one VGG-LSTM on faces is shown in the diagram; however, note that the hybrid network contains two VGG-LSTMs with the same network structure but trained on faces detected by different methods.

Recently, hybrid neural networks combining CNNs and RNNs [6, 7, 8] have become the state-of-the-art for emotion recognition. Of particular interest are the top-performing works of the EmotiW Challenge, whose goal is to advance emotion recognition in unconstrained conditions by providing researchers with a platform to benchmark the performance of their algorithms on “in-the-wild” datasets. One of the sub-challenges of the EmotiW challenge is the audio-video emotion recognition sub-challenge, which is based on an augmented version of the AFEW dataset [9] that contains short video clips extracted from movies that have been annotated for seven different emotions.

Deep learning has played an important role in most of the sub-challenge winning submissions. In 2013, the winners presented a method that combines CNNs for static faces, an auto-encoder for human action recognition, a deep belief network for audio information, and a shallow network architecture for feature extraction of the mouth region [10]. The winners of the 2014 sub-challenge used CNNs for feature extraction of the aligned faces provided by the challenge organizers [11], while in 2016, the winners of the sub-challenge proposed a hybrid network architecture that combines 3D CNNs and a CNN-RNN in a late-fusion fashion [6].

While a variety of image-based methods have been proposed, the audio channel has been explored to a lesser extent. Existing approaches that exploit the audio channel for emotion recognition include support vector machines (SVM) [12, 6], random forests [13, 14], and CNNs trained on comprehensive acoustic feature vectors extracted with openSMILE [15]. In this paper, we propose to fully exploit the audio-channel information. Inspired by the recurrent support vector machines designed by Wang and Metze [16] for event detection, we propose an LSTM [17] trained on short-term audio features extracted from segmented audio clips. Furthermore, a CNN-RNN network trained on image-like maps, formed by stacking short-term audio features, is also presented in this paper. The proposed hybrid network (Figure 1) combines a CNN-RNN network trained on face images with these audio-based models, and achieves overall accuracies of and on the validation and testing sets, respectively, surpassing the audio-video emotion recognition sub-challenge baseline of 38.81% on the validation set with significant gains.

2 The proposed method

2.1 VGG-LSTM based on Faces

A traditional CNN-LSTM neural network [6, 14] is explored to learn emotions from faces. Video frames are extracted at a frequency of 60 fps. Faces and facial landmarks are first detected within each frame using the method described in [18]; then a 2D affine transformation is applied so that the left and right eye corners of all images are aligned to the same positions (the face detection and alignment code is developed based on [19]).
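A minimal sketch of such an eye-corner-based alignment is shown below. The target eye positions and the output size are illustrative assumptions; the actual implementation in the paper is built on [18] and [19].

```python
# Sketch of eye-corner-based face alignment: a similarity transform (rotation,
# uniform scale, translation) maps the two detected eye corners to fixed
# positions in the output crop. Target positions and crop size are assumptions.
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_size=224,
               target_left=(0.3, 0.4), target_right=(0.7, 0.4)):
    """Warp `image` so the detected eye points land on fixed positions."""
    src = np.float32([left_eye, right_eye])
    dst = np.float32([(target_left[0] * out_size, target_left[1] * out_size),
                      (target_right[0] * out_size, target_right[1] * out_size)])

    # Closed-form similarity transform from the two eye correspondences.
    src_vec, dst_vec = src[1] - src[0], dst[1] - dst[0]
    scale = np.linalg.norm(dst_vec) / np.linalg.norm(src_vec)
    angle = np.arctan2(dst_vec[1], dst_vec[0]) - np.arctan2(src_vec[1], src_vec[0])
    cos_a, sin_a = scale * np.cos(angle), scale * np.sin(angle)
    M = np.float32([[cos_a, -sin_a, 0.0], [sin_a, cos_a, 0.0]])
    M[:, 2] = dst[0] - M[:, :2] @ src[0]          # translation mapping the left eye

    return cv2.warpAffine(image, M, (out_size, out_size))
```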

Aligned faces are used as input to the VGG-16 convolutional neural network [20]. The VGG architecture is modified by changing the number of neurons in the last layer to 7, corresponding to the 7 emotion classes. This modified VGG architecture is initialized with the parameters of the VGG-FACE model, except for the last fully-connected layer, which is initialized with weights sampled from a zero-mean Gaussian distribution and trained from scratch with the learning rate for its weight and bias filters set to 10 times the overall learning rate. The VGG-FACE model was obtained by training the 16-layer VGG architecture for face recognition on a large-scale dataset containing 2.6M images of 2.6K celebrities and public figures [21].
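The following PyTorch-style sketch illustrates the last-layer replacement and the 10x learning-rate setting. The framework, the 0.01 standard deviation, and the base learning rate are illustrative assumptions, not values confirmed by the paper.

```python
# Minimal sketch: replace the final VGG-16 layer with a 7-way classifier,
# initialize it from a zero-mean Gaussian, and give it a 10x learning rate.
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7

vgg = models.vgg16(weights=None)   # in practice, load VGG-FACE weights [21];
                                   # older torchvision uses pretrained=False
vgg.classifier[6] = nn.Linear(4096, NUM_EMOTIONS)        # new last FC layer
nn.init.normal_(vgg.classifier[6].weight, mean=0.0, std=0.01)  # std assumed
nn.init.zeros_(vgg.classifier[6].bias)

# 10x learning rate on the new layer, base rate elsewhere (values illustrative).
base_lr = 1e-4
optimizer = torch.optim.SGD(
    [{"params": [p for n, p in vgg.named_parameters() if "classifier.6" not in n],
      "lr": base_lr},
     {"params": vgg.classifier[6].parameters(), "lr": 10 * base_lr}],
    momentum=0.9)
```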

The training procedure is three-fold. First, the modified VGG network is trained on the facial expression recognition 2013 (FER-2013) database [22], which contains images labeled with the seven basic emotions. The idea of this step is to transfer the knowledge from face recognition to facial emotion recognition. Second, the resulting model is fine-tuned on the detected faces of the AFEW dataset. Third, the layers of the fine-tuned model after the “fc6” layer are replaced by a one-layer LSTM and a final fully connected layer with 7 output units. The weights of the LSTM are initialized with values drawn from a uniform distribution over [-0.01, 0.01] and the bias terms are initialized to 0. The combined VGG-LSTM is trained end-to-end. The LSTM layer has nodes and the length of the input sequence is . Face images extracted every 8 frames of the original video sequence are used as input to the VGG-LSTM network, as sketched below. Experimental results show that this frame gap improves the classification accuracy, since facial changes are more apparent between frames sampled farther apart.
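As a concrete illustration, the frame-gap sampling can be sketched as follows. The sequence length of 16 and the padding of short clips by repeating the last face are assumptions, since the exact values are not stated above.

```python
# Sketch of frame-gap sampling for the VGG-LSTM input: keep one aligned face
# every `gap` frames, pad short clips, truncate long ones. seq_len is assumed.
def sample_face_sequence(face_images, gap=8, seq_len=16):
    """face_images: list of aligned face crops in temporal order."""
    sampled = face_images[::gap]                      # keep every gap-th face
    if len(sampled) < seq_len:                        # pad short clips
        sampled = sampled + [sampled[-1]] * (seq_len - len(sampled))
    return sampled[:seq_len]                          # truncate long clips
```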

Unlike some existing works that first train the CNN and then use its “fc6” features as precomputed input sequences for the LSTM network, the proposed structure connects the VGG and LSTM networks end-to-end and learns all the parameters simultaneously, as illustrated in the sketch below. Experimental results show that our VGG-LSTM outperforms the results of the winner of the 2016 audio-video emotion recognition sub-challenge.
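A minimal PyTorch-style sketch of such an end-to-end CNN-LSTM is given below. The framework, the LSTM hidden size, and the use of the last time step for classification are illustrative assumptions; only the overall structure (VGG trunk up to “fc6”, one-layer LSTM, final fully connected layer) follows the description above.

```python
# Sketch: VGG trunk up to "fc6" feeding a one-layer LSTM, trained end-to-end
# so gradients from the sequence loss flow back into the CNN.
import torch.nn as nn
from torchvision import models

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=7, hidden_size=128):     # hidden size assumed
        super().__init__()
        vgg = models.vgg16(weights=None)
        self.features = vgg.features
        # keep everything up to and including "fc6" (first Linear of the classifier)
        self.fc6 = nn.Sequential(nn.Flatten(), vgg.classifier[0], nn.ReLU())
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                  # x: (batch, seq, 3, 224, 224)
        b, t = x.shape[:2]
        feats = self.fc6(self.features(x.flatten(0, 1)))     # (b*t, 4096)
        out, _ = self.lstm(feats.view(b, t, -1))             # (b, t, hidden)
        return self.head(out[:, -1])                         # classify from last step
```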

AN DI FE HA NE SA SU
AN 53.12 6.25 7.81 0 17.19 3.12 12.50
DI 17.50 27.50 7.50 2.50 25.00 15.00 5.00
FE 21.74 4.35 23.91 13.04 6.52 17.39 13.04
HA 7.94 1.59 0 84.13 0 4.76 1.59
NE 11.11 11.11 7.94 6.35 53.97 6.35 3.17
SA 8.20 4.92 1.64 6.56 22.95 55.74 0
SU 21.74 6.52 17.39 4.35 17.39 4.35 28.26
Table 1: Confusion matrix of the VGG-LSTM network, trained on aligned faces, when tested on the validation set (AN: anger, DI: disgust, FE: fear, HA: happiness, NE: neutral, SA: sadness, SU: surprise; rows are ground-truth labels, columns are predictions, values in %). The attained overall accuracy is and the unweighted average of the per-class accuracies is .
AN DI FE HA NE SA SU
AN 64.06 1.56 7.81 1.56 12.50 6.25 6.25
DI 22.50 15.00 5.00 10.00 25.00 20.00 2.50
FE 32.61 8.70 26.09 4.35 13.04 8.70 6.52
HA 9.52 3.17 0 73.02 6.35 6.35 1.59
NE 14.29 11.11 1.59 3.17 63.49 6.35 0
SA 16.39 11.48 6.56 8.20 13.11 40.98 3.28
SU 32.61 6.52 17.39 0 15.22 8.70 19.57
Table 2: Confusion matrix results of the VGG-LSTM network, trained on the faces provided by the challenge organizers, tested on the validation set. The attained overall accuracy is and the unweighted average of the per-class accuracies is .
AN DI FE HA NE SA SU
AN 76.56 0 3.12 9.38 7.81 3.12 0
DI 25.00 0 0 42.50 20.00 12.50 0
FE 23.91 0 30.43 23.91 13.04 8.70 0
HA 15.87 1.59 9.52 42.86 20.63 9.52 0
NE 12.70 1.59 3.17 34.92 46.03 1.59 0
SA 11.48 0 11.48 26.23 21.31 27.87 1.64
SU 19.57 0 17.39 36.96 13.04 13.04 0
Table 3: Confusion matrix of the audio SVM model on the validation set, with overall accuracy of and unweighted average of the per-class accuracies of .
AN DI FE HA NE SA SU
AN 48.44 1.56 0 15.62 15.62 18.75 0
DI 15.00 2.50 0 35.00 32.50 15.00 0
FE 30.43 0 0 19.57 28.26 21.74 0
HA 12.70 4.76 0 33.33 30.16 19.05 0
NE 6.35 3.17 0 17.46 52.38 20.63 0
SA 11.48 6.56 0 22.95 32.79 26.23 0
SU 17.39 0 2.17 30.43 36.96 13.04 0
Table 4: Confusion matrix of the audio LSTM model on the validation set, with overall accuracy of and unweighted average of the per-class accuracies of .

2.2 Acoustic SVM Classifier

An SVM classifier, trained on the 1582-dimensional acoustic feature vectors extracted with openSMILE, is incorporated into the hybrid network. The acoustic features include low-level descriptors, such as energy, mel-frequency cepstral coefficients (MFCCs), linear predictive coding (LPC) coefficients, zero-crossing rate (ZCR), spectral flux, spectral roll-off, and chroma vectors, summarized by statistical functionals such as the mean and standard deviation.
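A minimal sketch of this classifier is shown below, assuming the 1582-dimensional openSMILE vectors have already been extracted (one per clip). The SVM kernel and the feature standardization are assumptions, since they are not specified here.

```python
# Sketch of the acoustic SVM: one 1582-dimensional openSMILE vector per clip
# fed to a standard SVM classifier. Kernel choice and scaling are assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_audio_svm(X_train, y_train):
    """X_train: (n_clips, 1582) openSMILE features, y_train: emotion labels."""
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="linear", probability=True))  # kernel assumed
    clf.fit(X_train, y_train)
    return clf
```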

2.3 LSTM based on Audio Clips

Instead of extracting holistic features, each audio signal is divided into segments of length 100 ms, using an overlapping factor of 50%, and segment-level features are extracted to form a sequence of vectors. Specifically, short-term features are extracted for each segment using pyAudioAnalysis [23]; these include the ZCR, energy, entropy of energy, spectral centroid, spectral spread, spectral flux, spectral roll-off, MFCCs, chroma vector, and chroma deviation. Assume that the audio signal has length ; then the number of sequences is , and this feature extraction process results in sequences of dimension . Since each audio signal has a different length, a sequence length converter is applied to make the number of sequences at least 16 by copying the last feature vector as many times as needed whenever the number of sequences is less than 16. A one-layer LSTM with neurons is trained on the sequence of feature vectors, as sketched below. Unlike the audio SVM model, which focuses on the holistic properties of the signal, the audio LSTM model focuses on learning the dynamic temporal behavior of the audio signal.
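The sketch below illustrates this segment-level pipeline with pyAudioAnalysis. The module and function names follow recent releases of the library and may differ in older versions; the minimum sequence length of 16 follows the description above.

```python
# Sketch of the segment-level audio features: 100 ms windows with 50% overlap,
# pyAudioAnalysis short-term features, padding to a minimum of 16 time steps
# by repeating the last feature vector.
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

def audio_feature_sequence(wav_path, min_len=16):
    fs, signal = audioBasicIO.read_audio_file(wav_path)
    if signal.ndim > 1:                             # convert stereo to mono
        signal = signal.mean(axis=1)
    # 100 ms window, 50 ms step (50% overlap); returns (n_features, n_frames)
    feats, _ = ShortTermFeatures.feature_extraction(
        signal, fs, int(0.100 * fs), int(0.050 * fs))
    seq = feats.T                                   # (n_frames, n_features)
    if seq.shape[0] < min_len:                      # repeat the last vector
        pad = np.repeat(seq[-1:], min_len - seq.shape[0], axis=0)
        seq = np.vstack([seq, pad])
    return seq                                      # input to the one-layer LSTM
```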

2.4 Inception(v2)-LSTM based on Audio Maps

In this section, the sequence of feature vectors from Section 2.3 is converted into sequential image-like maps. Specifically, the feature vectors are organized in matrix form to build an image-like map of dimension . The next step is to segment this image-like map into smaller maps of size , using an overlapping factor of . For the architecture proposed in this section, the sequence length must be a multiple of 17 and greater than or equal to . If this condition is not satisfied, the last column of the image-like map is replicated until its length reaches the closest multiple of 17 larger than the original length. This approach results in a sequence of image-like maps, whose sequence length is . A sketch of this map construction is given below.
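The following sketch illustrates how such image-like maps could be built from the feature sequence. The map width of 17 columns and the 50% overlap are assumptions used only for illustration; the padding to the nearest multiple of 17 follows the description above.

```python
# Sketch of building overlapping image-like maps from the short-term feature
# sequence. map_width and overlap are assumed values; padding replicates the
# last column up to the nearest multiple of 17, as described in the text.
import numpy as np

def build_audio_maps(seq, map_width=17, overlap=0.5):
    """seq: (n_frames, n_features) short-term feature sequence."""
    audio_map = seq.T                               # (n_features, L) image-like map
    L = audio_map.shape[1]
    target = int(np.ceil(L / 17.0)) * 17            # nearest multiple of 17 >= L
    if target > L:                                  # replicate the last column
        audio_map = np.hstack([audio_map,
                               np.repeat(audio_map[:, -1:], target - L, axis=1)])
    step = max(1, int(map_width * (1 - overlap)))
    maps = [audio_map[:, s:s + map_width]
            for s in range(0, audio_map.shape[1] - map_width + 1, step)]
    return np.stack(maps)                           # (n_maps, n_features, map_width)
```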

A network similar to the VGG-LSTM described in Section 2.1, referred to as Inception(v2)-LSTM, is developed to train on the image-like maps. Instead of the VGG architecture, Inception-v2 [24] is first trained on the individual image-like maps. The number of output units of the last layer is changed to 7, and the training parameters, such as the learning rate and the weight decay, are set to the same values used to train Inception-v2 on ImageNet [25].

After training the modified Inception-v2 on the individual image-like maps, the layers after the “global_pool” layer of the Inception-v2 architecture are replaced by a one-layer LSTM with neurons and a fully connected layer with outputs. The resulting network is referred to as Inception(v2)-LSTM. This network takes a sequence of 8 image-like maps at a time and learns the features end-to-end to model the dynamic temporal properties of the sequence. Since the sequence of image-like maps needs to contain at least 8 maps to serve as input to the Inception(v2)-LSTM architecture, the last image-like map of the sequence is copied as many times as needed to meet this minimum length requirement, as sketched below. The initial learning rate is set to and decreases every iterations by a factor of . The batch size, the weight decay, and the maximum number of iterations are set to , and , respectively.
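A short sketch of the minimum-length handling for the map sequences is given below; it simply repeats the last map until 8 maps are available, following the description above.

```python
# Sketch: pad a sequence of image-like maps to the minimum length of 8 by
# repeating the last map.
import numpy as np

def pad_map_sequence(maps, min_len=8):
    """maps: (n_maps, height, width) sequence of image-like maps."""
    if maps.shape[0] < min_len:
        pad = np.repeat(maps[-1:], min_len - maps.shape[0], axis=0)
        maps = np.concatenate([maps, pad], axis=0)
    return maps
```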

AN DI FE HA NE SA SU
AN 56.25 0 0 29.69 10.94 3.12 0
DI 12.50 0 0 57.50 27.50 2.50 0
FE 13.04 0 2.17 45.65 36.96 2.17 0
HA 11.11 0 0 52.38 26.98 9.52 0
NE 6.35 0 0 52.38 39.68 1.59 0
SA 8.20 0 0 63.93 16.39 11.48 0
SU 6.52 2.17 2.17 56.52 26.09 6.52 0
Table 5: Confusion matrix of the audio Inception(v2)-LSTM model on the validation set, with overall accuracy of and unweighted average of the per-class accuracies of .
AN DI FE HA NE SA SU
AN 77.55 0 4.08 8.16 7.14 2.04 1.02
DI 32.50 10.00 2.50 12.50 20.00 20.00 2.50
FE 31.43 0 50.00 1.43 5.71 7.14 4.29
HA 20.83 0 1.39 63.89 10.42 3.47 0
NE 16.69 1.04 7.77 6.22 50.78 11.92 2.59
SA 22.50 1.25 11.25 11.25 16.25 36.25 1.25
SU 10.71 3.57 35.71 10.71 14.29 25.00 0
Table 6: Confusion matrix results of submission 6 when evaluating the hybrid network on the testing dataset. The attained overall accuracy is and the unweighted average of the per-class accuracies is .

3 Experimental Results

3.1 Database

The AFEW database used in EmotiW 2017 contains 773, 383, and 653 audio-video movie clips in the training, validation, and testing datasets, respectively. The task is to assign a single label to each video clip from the seven basic emotions, namely anger, disgust, fear, happiness, neutral, sadness, and surprise. Participants compete on the accuracy on the testing data².

3.2 Results of the Proposed Models

Confusion matrices for each model are shown in Tables 1 through 5. One VGG-LSTM model is trained on the aligned faces, which are obtained as described in Section 2.1, and another VGG-LSTM model is trained on the faces provided by the challenge organizers. Our best VGG-LSTM model achieves an overall classification accuracy of , outperforming the accuracy obtained by the winner of the 2016 audio-video emotion recognition sub-challenge, which suggests that the frame gap introduced by the proposed VGG-LSTM model better captures the dynamics of facial expressions in video. The second VGG-LSTM model, trained on the faces provided by the challenge organizers, complements the proposed model, and the combination of the two achieves a classification accuracy of on the validation set.

The audio models, namely the audio SVM, audio LSTM, and audio Inception(v2)-LSTM, have lower accuracy than the VGG-LSTM models trained on faces. However, they perform well on the anger class and therefore improve the overall accuracy of the hybrid network.

The aforementioned deep models are combined using decision fusion. Grid search is employed to find the model weights that maximize the classification accuracy on the validation set. The fused hybrid network achieves a classification accuracy of on the validation set, while the challenge baseline accuracy is 38.81%. When trained on a combination of the training and validation sets, the proposed hybrid network achieves an accuracy of on the testing set. The corresponding confusion matrix is shown in Table 6.
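A hedged sketch of this decision-level fusion is given below. The per-model score representation and the resolution of the weight grid are assumptions, since they are not specified above.

```python
# Sketch of decision fusion: each model contributes a per-class score vector,
# and a grid search over non-negative model weights picks the combination with
# the best validation accuracy. The weight grid resolution is assumed.
import itertools
import numpy as np

def grid_search_fusion(val_probs, val_labels, grid=np.linspace(0, 1, 11)):
    """val_probs: list of (n_samples, n_classes) score arrays, one per model."""
    best_acc, best_w = -1.0, None
    for w in itertools.product(grid, repeat=len(val_probs)):
        if sum(w) == 0:
            continue
        fused = sum(wi * p for wi, p in zip(w, val_probs))
        acc = np.mean(np.argmax(fused, axis=1) == val_labels)
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc
```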

4 Conclusions

In this paper, we proposed an audiovisual-based hybrid network that combines the predictions of multiple models for emotion recognition in the wild, with an emphasis on exploiting the audio channel. The proposed method achieves and classification accuracy on the validation and testing datasets, respectively, surpassing the audio-video emotion recognition sub-challenge baseline of 38.81% on the validation set with significant gains.

Footnotes

  1. This work is supported by the National Science Foundation under Grant No. 1319598.
  2. Note that since the class distribution is unbalanced, the accuracy participants compete on is the overall accuracy, which is computed on all the samples of the testing set. The unweighted average of the per-class accuracies is also provided in this paper.

References

  1. R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
  2. D. Kulic and E.A. Croft, “Affective state estimation for human–robot interaction,” IEEE Transactions on Robotics, vol. 23, no. 5, pp. 991–1000, 2007.
  3. M. Shan, F. Kuo, M. Chiang, and S. Lee, “Emotion-based music recommendation by affinity discovery from film music,” Expert systems with applications, vol. 36, no. 4, pp. 7666–7674, 2009.
  4. H. Joho, J.M. Jose, R. Valenti, and N. Sebe, “Exploiting facial expressions for affective video summarisation,” in Proceedings of the ACM international conference on image and video retrieval. ACM, 2009, p. 31.
  5. M. Pantic, A. Pentland, A. Nijholt, and T.S. Huang, “Human computing and machine understanding of human behavior: A survey,” in Artifical Intelligence for Human Computing, pp. 47–71. Springer, 2007.
  6. Y. Fan, X. Lu, D. Li, and Y. Liu, “Video-based emotion recognition using CNN-RNN and C3D hybrid networks,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 445–450.
  7. P. Khorrami, T. Le Paine, K. Brady, C. Dagli, and T.S. Huang, “How deep neural networks can improve emotion recognition on video data,” in IEEE International Conference on Image Processing. IEEE, 2016, pp. 619–623.
  8. S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, “Recurrent neural networks for emotion recognition in video,” in Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 2015, pp. 467–474.
  9. A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, “Collecting large, richly annotated facial-expression databases from movies,” IEEE MultiMedia, vol. 19, no. 3, pp. 34–41, July 2012.
  10. S.E. Kahou et al., “Combining modality specific deep neural networks for emotion recognition in video,” in Proceedings of the 15th International conference on multimodal interaction. ACM, 2013, pp. 543–550.
  11. M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, “Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild,” in Proceedings of the 16th International Conference on Multimodal Interaction. ACM, 2014, pp. 494–501.
  12. C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sept. 1995.
  13. L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
  14. B. Sun, Q. Wei, L. Li, Q. Xu, J. He, and L. Yu, “LSTM for dynamic emotion and group emotion recognition in the wild,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, New York, NY, USA, 2016, ICMI 2016, pp. 451–457, ACM.
  15. J. Yan et al., “Multi-clue fusion for emotion recognition in the wild,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, New York, NY, USA, 2016, ICMI 2016, pp. 458–463, ACM.
  16. Y. Wang and F. Metze, “Recurrent support vector machines for audio-based multimedia event detection,” in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, New York, NY, USA, 2016, ICMR ’16, pp. 265–269, ACM.
  17. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
  18. V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in CVPR, 2014.
  19. T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in unconstrained images,” in CVPR, June 2015.
  20. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  21. O.M. Parkhi, A. Vedaldi, A. Zisserman, et al., “Deep face recognition,” in BMVC, 2015, vol. 1, p. 6.
  22. I.J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” in International Conference on Neural Information Processing. Springer, 2013, pp. 117–124.
  23. T. Giannakopoulos, “pyAudioAnalysis: An open-source Python library for audio signal analysis,” PLoS ONE, vol. 10, no. 12, 2015.
  24. C. Szegedy et al., “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015.
  25. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.