End-to-End Audiovisual Fusion with LSTMs
(Accepted to the Auditory-Visual Speech Processing Conference 2017)

Stavros Petridis, Yujiang Wang, Zuwei Li, Maja Pantic
Dept. of Computing, Imperial College London
EEMCS, University of Twente
\firstname.lastname@example.org, email@example.com

Abstract: Several end-to-end deep learning approaches have recently been presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and to perform classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one for each modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modelled by a BLSTM and the fusion of multiple streams/modalities takes place via another BLSTM. An absolute improvement of 1.9% in the mean F1 measure over 4 nonlinguistic vocalisation classes compared to audio-only classification is reported on the AVIC database. At the same time, the proposed end-to-end audiovisual fusion system improves the state of the art on the AVIC database, leading to a 9.7% absolute increase in the mean F1 measure. We also perform audiovisual speech recognition experiments on the OuluVS2 database using different views of the mouth, frontal to profile. The proposed audiovisual system significantly outperforms the audio-only model for all views when the acoustic noise is high.

Index Terms: Audiovisual Fusion, End-to-end Deep Learning, Audiovisual Speech Recognition
1 Introduction

Audiovisual fusion approaches have been successfully applied to various problems such as speech recognition [1, 2], emotion recognition [3, 4], laughter recognition [5] and biometric applications [6]. The addition of the visual modality is particularly useful in noisy environments, where the performance of audio-only classifiers degrades: the visual information, which is not affected by acoustic noise, can then significantly improve performance.
Recently, several deep learning approaches for audiovisual fusion have been presented. The vast majority of them follow a two-step approach where features are first extracted from the audio and visual modalities and are then fed to a classifier. Ngiam et al. [7] applied principal component analysis (PCA) to the mouth region of interest (ROI) and spectrograms and trained a deep autoencoder to extract bottleneck features. The features from the entire utterance were fed to a support vector machine (SVM), ignoring the temporal dynamics of speech. Hu et al. [8] used a similar approach where PCA was applied to mouth ROIs and spectrograms and a recurrent temporal multimodal restricted Boltzmann machine was trained to extract features which were fed to an SVM. Ninomiya et al. [9] applied PCA to the mouth ROIs and concatenated Mel-Frequency Cepstral Coefficients (MFCCs) and trained a deep autoencoder to extract bottleneck features which were fed to a Hidden Markov Model (HMM) in order to take the temporal dynamics into account. Mroueh et al. [10] used concatenated MFCCs together with scattering coefficients extracted from the mouth ROI in order to train a deep network with a bilinear softmax layer. Takashima et al. [11] used a convolutional neural network to extract bottleneck features from lip images and Mel-maps which were fed to an HMM. It is clear that none of the above works follows an end-to-end architecture.
A few works which follow an end-to-end approach for visual speech recognition (lipreading) have been presented very recently. Wand et al. [12] used a fully connected layer followed by two LSTM layers to perform lipreading directly from raw mouth ROIs. Petridis et al. [13] used a deep autoencoder together with an LSTM for end-to-end lipreading from raw pixels. Assael et al. [14] used a CNN with gated recurrent units for end-to-end sentence-level lipreading.
To the best of our knowledge, the only work which performs end-to-end training for audiovisual speech recognition is [15]. An attention mechanism is applied to both the mouth ROIs and MFCCs and the model is trained end-to-end. However, the system does not use the raw audio signal or spectrogram but relies on MFCC features.
In this paper, we extend our previous work [13] and present an end-to-end audiovisual fusion model for speech recognition and nonlinguistic vocalisation classification which jointly learns to extract audio/visual features directly from raw inputs and perform classification (Fig. 1). To the best of our knowledge, this is the first end-to-end model which performs audiovisual fusion from raw mouth ROIs and spectrograms. The proposed model consists of multiple identical streams, one per modality, which extract features directly from the raw images and spectrograms. Each stream consists of an encoder which compresses the high-dimensional input to a low-dimensional representation. The encoding layers in each stream are followed by a BLSTM which models the temporal dynamics. Finally, the information of the different streams/modalities is fused via a BLSTM which also provides a label for each input frame. We perform classification of nonlinguistic vocalisations on the AVIC database achieving state-of-the-art performance for audiovisual fusion, with an absolute increase in the mean F1 measure of 9.7%. The proposed system also results in an absolute increase of 1.9% in the mean F1 measure compared to the audio-only model. In addition, we perform experiments on audiovisual speech recognition using different lip views, from frontal to profile, on OuluVS2. The end-to-end audiovisual fusion outperforms the audio-only model when the noise level is high and results in the same performance when clean audio is used.
2 Databases

The databases used in this study are OuluVS2 [16] and AVIC [17]. OuluVS2 contains 52 speakers saying 10 utterances, 3 times each, so in total there are 156 examples per utterance. The utterances are the following: “Excuse me”, “Goodbye”, “Hello”, “How are you”, “Nice to meet you”, “See you”, “I am sorry”, “Thank you”, “Have a good time”, “You are welcome”. The mouth ROIs are provided and they are downscaled as shown in Table 1 in order to keep the aspect ratio of the original videos constant. Video is recorded at 30 frames per second (fps) and audio at 48 kHz. The unique feature of OuluVS2 is that it provides multiple lip views. To the best of our knowledge it is the only publicly available database with 5 lip views between 0° and 90°. The LiLiR dataset [18] also contains five views but it is not publicly available at the moment, and the TCD-TIMIT database [19] contains only two views, frontal and 30°.
The AVIC corpus is an audiovisual dataset containing scenario-based dyadic interactions. A subject interacts with an experimenter who plays the role of a product presenter and leads the subject through a commercial presentation. The subject's role is to listen to the presentation and interact with the experimenter depending on his/her interest in the product.
Annotations for laughter, hesitation, consent and other human noises, which are grouped into one class called garbage, are provided with the database and are used in this study. In total 21 subjects were recorded, 11 males and 10 females, with most subjects being non-native speakers. Similarly to previous works [17, 20, 21], vocalisations that were very short (under 120 ms) were excluded. In total, 247, 1136, 308 and 582 examples were used for the laughter, hesitation, consent and garbage classes, respectively. Examples of laughter and hesitation are shown in Figs. 2 and 3, respectively.
A video camera, positioned in front of the subject, was used to record his/her reaction at 25 fps. The audio signal was recorded by a lapel microphone at 44.1 kHz.
AVIC does not provide mouth ROIs, so sixty-eight facial points were tracked using the tracker proposed in [22]. The faces were first aligned to a neutral reference frame in order to normalise for rotation and size differences. This is done with an affine transform estimated from five stable points: the two corners of each eye and the tip of the nose. The centre of the mouth is then located from the tracked points and a bounding box of 85 by 129 pixels is used to extract the mouth ROI. Finally, the mouth ROIs are downscaled to 30 by 45 pixels.
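The alignment and cropping steps above can be sketched as follows. This is a minimal numpy illustration under our own assumptions (least-squares affine estimation, simple centre crop with clamping), not the authors' implementation; all point coordinates are hypothetical.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2-D affine transform mapping src points onto dst points.

    src, dst: (N, 2) arrays of corresponding points (N >= 3, e.g. the five
    stable points: four eye corners and the nose tip).
    Returns a 2x3 matrix A such that dst ~= [x, y, 1] @ A.T
    """
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])        # homogeneous coordinates
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2) solution
    return A.T                                   # (2, 3)

def align_points(points, A):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    n = points.shape[0]
    X = np.hstack([points, np.ones((n, 1))])
    return X @ A.T

def crop_mouth(image, mouth_points, height=85, width=129):
    """Crop a fixed-size box centred on the mean of the mouth landmarks,
    clamped to the image borders."""
    cx, cy = mouth_points.mean(axis=0)
    top = int(round(cy - height / 2))
    left = int(round(cx - width / 2))
    top = max(0, min(top, image.shape[0] - height))
    left = max(0, min(left, image.shape[1] - width))
    return image[top:top + height, left:left + width]
```

In practice the estimated transform would be applied to the whole frame (image warping) before cropping; here only the point mapping is shown.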
3 End-to-end Audiovisual Fusion
The proposed deep learning system for audiovisual fusion is shown in Fig. 1. It consists of two identical streams which extract features directly from the raw input images and the spectrograms (spectrogram frames are computed over 40 ms windows with 30 ms overlap), respectively. Each stream consists of two parts: an encoder and a BLSTM. The encoder follows a bottleneck architecture in order to compress the high-dimensional input to a low-dimensional representation at the bottleneck layer. The same architecture as in [13] is used, with 3 hidden layers of sizes 2000, 1000 and 500, respectively, followed by a linear bottleneck layer. The rectified linear unit is used as the activation function for the hidden layers. The Δ (first derivative) and ΔΔ (second derivative) features [24] are also computed from the bottleneck features and appended to the bottleneck layer. In this way, during training we force the encoding layers to learn compact representations which are discriminative for the task at hand and also produce discriminative Δ and ΔΔ features. This is in contrast to traditional approaches which pre-compute the Δ and ΔΔ features at the input level and consequently have no control over their discriminative power.
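A minimal numpy sketch of the encoder forward pass and of HTK-style Δ regression on the bottleneck features may help make the architecture concrete. The layer sizes follow the text; the bottleneck size (50), the weight initialisation and the delta window are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class Encoder:
    """Bottleneck encoder: 2000 -> 1000 -> 500 ReLU layers, linear bottleneck."""
    def __init__(self, in_dim, bottleneck=50, sizes=(2000, 1000, 500)):
        dims = [in_dim, *sizes, bottleneck]
        self.weights = [rng.normal(0, 0.01, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(b) for b in dims[1:]]

    def forward(self, x):
        # x: (T, in_dim) sequence of vectorised frames
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = relu(x @ W + b)
        return x @ self.weights[-1] + self.biases[-1]   # linear bottleneck

def deltas(feats, theta=2):
    """HTK-style delta regression over a window of +/- theta frames."""
    T = feats.shape[0]
    padded = np.pad(feats, ((theta, theta), (0, 0)), mode='edge')
    denom = 2 * sum(t * t for t in range(1, theta + 1))
    out = np.zeros_like(feats)
    for t in range(1, theta + 1):
        out += t * (padded[t + theta:t + theta + T] - padded[theta - t:theta - t + T])
    return out / denom
```

The per-frame input to the BLSTM would then be the concatenation of the bottleneck output with its Δ and ΔΔ: `np.hstack([b, deltas(b), deltas(deltas(b))])`.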
The second part is a BLSTM layer added on top of the encoding layers in order to model the temporal dynamics of the features in each stream. The BLSTM outputs of each stream are concatenated and fed to another BLSTM in order to fuse the information from all streams. The output layer is a softmax layer which provides a label for each input frame. The majority label over each utterance is used in order to label the entire utterance. The entire system is trained end-to-end which enables the joint learning of features and classifier. In other words, the encoding layers learn to extract features from raw images and spectrograms which are useful for classification using BLSTMs.
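The fusion and per-utterance labelling steps above can be sketched as follows. The BLSTMs themselves are omitted; this illustrates only the concatenation of the stream outputs and the majority vote over per-frame predictions, with hypothetical function names.

```python
import numpy as np

def fuse(stream_outputs):
    """Concatenate per-frame stream outputs (e.g. the BLSTM outputs of the
    audio and video streams) along the feature axis before the fusion BLSTM.

    stream_outputs: list of (T, D_i) arrays -> (T, sum(D_i)) array.
    """
    return np.concatenate(stream_outputs, axis=1)

def utterance_label(frame_scores):
    """Label an utterance by majority vote over per-frame predictions.

    frame_scores: (T, num_classes) array of per-frame class scores
    (e.g. the softmax output of the fusion BLSTM).
    """
    frame_labels = frame_scores.argmax(axis=1)
    return int(np.bincount(frame_labels).argmax())
```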
4 Experimental Setup
4.1 Evaluation Protocol
We first partition the data into training, validation and test sets. The same protocol as in [21] is used for the AVIC dataset, where the first 7 subjects are used for testing, the next 7 for training and the last 7 for validation.
The protocol suggested in [16] is used for the OuluVS2 dataset, where 40 subjects are used for training and validation and 12 for testing. We randomly divided the 40 subjects into 35 for training and 5 for validation. This results in 1050 training utterances, 150 validation utterances and 360 test utterances.
Since all experiments are subject-independent, we first need to reduce the impact of subject-dependent characteristics. This is done by subtracting the mean image, computed over the entire utterance, from each frame.
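This per-utterance mean-image subtraction is a one-liner; a small numpy sketch for clarity (the function name is ours):

```python
import numpy as np

def remove_subject_mean(frames):
    """Subtract the per-utterance mean image from every frame.

    frames: (T, H, W) array of grayscale mouth ROIs for one utterance.
    Returns frames with the temporal mean removed at every pixel.
    """
    return frames - frames.mean(axis=0, keepdims=True)
```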
As mentioned in Section 2, the audio and visual features are extracted at different frame rates, so they need to be synchronised. This is achieved by upsampling the visual features to match the frame rate of the audio features (100 fps) using linear interpolation.
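The upsampling step can be sketched with per-dimension linear interpolation; a minimal numpy version, assuming uniformly sampled frames and interpolation strictly inside the recorded interval (both our assumptions):

```python
import numpy as np

def upsample(features, src_fps=30.0, dst_fps=100.0):
    """Upsample a (T, D) visual feature sequence to the audio frame rate
    by per-dimension linear interpolation."""
    T, D = features.shape
    t_src = np.arange(T) / src_fps
    t_dst = np.arange(int(np.ceil(T * dst_fps / src_fps))) / dst_fps
    t_dst = t_dst[t_dst <= t_src[-1]]      # stay inside the recorded interval
    out = np.empty((len(t_dst), D))
    for d in range(D):
        out[:, d] = np.interp(t_dst, t_src, features[:, d])
    return out
```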
Finally, due to randomness in initialisation, every time a deep network is trained the results are slightly different. In order to present a more objective evaluation we run each experiment 10 times and we report the mean and standard deviation of the performance measures.
4.3.1 Single Stream Training
Initialisation: First, each stream is trained independently. The encoding layers are pre-trained in a greedy layer-wise manner using Restricted Boltzmann Machines (RBMs) [23]. Since the input (pixels or spectrograms) is real-valued and the hidden layers are either rectified linear or linear (bottleneck layer), four Gaussian RBMs [26] are used. Each RBM is trained for 20 epochs with a mini-batch size of 100 and an L2 regularisation coefficient of 0.0002 using contrastive divergence. The learning rate is fixed to 0.001, as suggested in [26] when the visible/hidden units are linear.
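A compact sketch of one contrastive-divergence (CD-1) update for a Gaussian-Bernoulli RBM may clarify the pre-training step. The learning rate and L2 coefficient follow the text; the weight initialisation scale, the Bernoulli hidden units and the unit-variance visible assumption (valid after z-normalisation, see below) are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianRBM:
    """Gaussian-Bernoulli RBM trained with 1-step contrastive divergence.

    Assumes z-normalised real-valued visible units (unit variance), so the
    visible conditional mean is simply h @ W.T + b_vis.
    """
    def __init__(self, n_vis, n_hid, lr=0.001, l2=0.0002):
        self.W = rng.normal(0, 0.01, (n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)
        self.lr, self.l2 = lr, l2

    def hidden_probs(self, v):
        return 1.0 / (1.0 + np.exp(-(v @ self.W + self.b_hid)))

    def cd1_step(self, v0):
        """One CD-1 update on a (batch, n_vis) mini-batch; returns the
        mean squared reconstruction error."""
        h0 = self.hidden_probs(v0)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = h_sample @ self.W.T + self.b_vis          # Gaussian mean reconstruction
        h1 = self.hidden_probs(v1)
        batch = v0.shape[0]
        grad_W = (v0.T @ h0 - v1.T @ h1) / batch
        self.W += self.lr * (grad_W - self.l2 * self.W)
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        self.b_hid += self.lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))
```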
As recommended in [26], the data should be z-normalised, i.e., the mean and standard deviation should be equal to 0 and 1, respectively, before training an RBM with linear input units. Hence, each image is z-normalised before pre-training the encoding layers, and so is each spectrogram frame.
End-to-End Training: Once the encoder has been pre-trained, the BLSTM is added on top and its weights are initialised using Glorot initialisation [27]. The Adam training algorithm [28] is used for end-to-end training with a mini-batch size of 10 utterances. The default learning rate of 0.001 led to unstable training, so it was reduced to 0.0003. Early stopping with a delay of 5 epochs was used in order to avoid overfitting, and gradient clipping was applied to the LSTM layers.
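The early-stopping and gradient-clipping logic can be sketched in a few lines. The patience of 5 epochs follows the text; the global-norm clipping style and its threshold are our assumptions (the paper does not specify the clipping scheme).

```python
import numpy as np

class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; returns True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def clip_gradients(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their global L2 norm is <= max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```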
4.3.2 Audiovisual Training
Initialisation: Once the single streams have been trained, they are used to initialise the corresponding streams in the multi-stream architecture. Then another BLSTM is added on top of all streams in order to fuse the single-stream outputs. Its weights are initialised using Glorot initialisation [27].
End-to-End Training: Finally, the entire network is trained jointly using Adam with a mini-batch size of 10 utterances. Since the individual streams have already been initialised at good values, a lower learning rate of 0.0001 is used to fine-tune the entire network. Early stopping and gradient clipping were applied as in single-stream training.
5 Results

In this section we report results on the OuluVS2 and AVIC databases. We have experimented with the end-to-end audiovisual system shown in Fig. 1 as well as with the individual streams, i.e., audio- and video-only classification. In the latter case, we simply use the corresponding single stream (encoder + BLSTM).
5.1 Results on AVIC database
Results for the AVIC database are shown in Table 2. Since this is an imbalanced dataset (see Section 2), the classification rate alone can be misleading. Hence, we also report the unweighted average recall (UAR) rate and the mean F1 measure over all 4 classes. First of all, we see that the proposed end-to-end system significantly outperforms the current state of the art on the AVIC database, which is based on handcrafted features and prediction-based audiovisual fusion [21]. It results in a statistically significant absolute mean F1 improvement of 19% and 9.8% for audio-only and audiovisual classification, respectively.
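The UAR and mean (macro-averaged) F1 measures used here are straightforward to compute from a confusion matrix; a small numpy sketch (the function name is ours, and classes with no predictions would need a guard against division by zero):

```python
import numpy as np

def uar_and_mean_f1(conf):
    """UAR and mean F1 from a confusion matrix (rows: true, cols: predicted).

    UAR is the unweighted mean of per-class recalls; mean F1 is the
    unweighted mean of per-class F1 scores.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    recall = tp / conf.sum(axis=1)       # per-class recall
    precision = tp / conf.sum(axis=0)    # per-class precision
    f1 = 2 * precision * recall / (precision + recall)
    return recall.mean(), f1.mean()
```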
It is also clear that the audio-only classifier performs much better than the video-only classifier. This is expected since most of the information is carried by the audio channel. In addition, some vocalisations can be accompanied by subtle facial expressions, like hesitation in Fig. 3, or even no facial expression at all. However, the visual modality is still useful, and the audiovisual combination using the end-to-end model results in a statistically significant absolute improvement of 2% in the mean F1 over the audio-only model.
Table 2: Results on the AVIC database: mean F1, unweighted average recall (UAR) and classification rate (CR), with standard deviations in parentheses.

Model                   Stream   F1           UAR          CR
State-of-the-art [21]   A        54.1 (2.2)   58.7 (2.4)   58.8 (2.4)
State-of-the-art [21]   V        44.0 (2.0)   48.9 (2.5)   48.5 (2.6)
State-of-the-art [21]   A + V    65.3 (2.9)   64.9 (3.0)   72.6 (3.0)
End-to-end (proposed)   A        73.1 (2.3)   72.6 (3.3)   79.6 (1.7)
End-to-end (proposed)   V        45.4 (5.2)   48.4 (4.1)   66.9 (1.4)
End-to-end (proposed)   A + V    75.1 (1.5)   73.8 (1.5)   80.3 (1.5)
Table 3: Classification rate (%) on OuluVS2 with clean audio, per lip view (standard deviations in parentheses).

View      V            A + V
Frontal   91.8 (1.1)   98.6 (0.5)
30°       87.3 (1.6)   98.7 (0.5)
45°       88.8 (1.4)   98.3 (0.4)
60°       86.4 (0.6)   98.6 (0.6)
Profile   91.2 (1.3)   98.9 (0.5)

Audio-only (clean): 98.5 (0.6)
5.2 Results on OuluVS2 database
We consider a single-view scenario where we train and test models on data recorded from a single view. Results are shown in Table 3. This dataset is balanced, so we report only the classification rate, which is the default performance measure for this database [16]. The best performance in video-only experiments is achieved by the frontal and profile views, followed by the 45°, 30° and 60° views. The audio-only model achieves a very high classification accuracy of 98.5%. This is due to the audio signal being clean, without any background noise, and the participants uttering phrases which are much longer than the vocalisations in the AVIC database. We also notice that audiovisual fusion does not lead to an improvement over the audio-only model. This is not surprising, given the very high accuracy already achieved by the audio classifier in clean conditions.
In order to test the benefits of audiovisual fusion, we have run experiments under varying noise levels. The audio signal is corrupted by additive babble noise from the NOISEX database [29] so that the signal-to-noise ratio (SNR) varies from 0 dB to 20 dB. Results are shown in Table 4. As expected, the audio model is significantly affected by the addition of noise and its performance degrades further as the noise level increases, leading to a classification rate of 28.4% at 0 dB. All audiovisual models significantly outperform the audio-only model due to the presence of the visual modality, which is not affected by acoustic noise.
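Mixing noise into a clean signal at a target SNR amounts to scaling the noise so that the signal-to-noise power ratio matches the requested value; a minimal numpy sketch (the function name is ours, and the noise is assumed to be at least as long as the signal):

```python
import numpy as np

def add_noise(signal, noise, snr_db):
    """Mix a noise signal into a clean signal at a target SNR (in dB).

    The noise is scaled so that 10*log10(P_signal / P_noise) == snr_db.
    """
    noise = noise[:len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```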
It is worth pointing out that although there are significant differences in performance between the views in the video-only case, they all result in almost the same performance in the audiovisual case when the audio is clean or only mildly corrupted (no noise added, 15 dB and 20 dB). However, as the acoustic noise level increases, their differences become more evident. Interestingly, the combination of noisy audio with the different views does not follow exactly the same pattern as observed in Table 3. Between 0 dB and 10 dB, the combinations of audio with the 45° and frontal views are the best ones. The combination of audio and the 60° view is the worst one, which is consistent with Table 3, but surprisingly the combination of audio and the profile view also performs poorly at 0 dB. This indicates a possible non-linear interaction between audio and the different views as the noise level increases, which deserves further investigation.
We should also mention that at SNRs of 10 dB and below, the audiovisual fusion model performs worse than the video-only system, whose classification rate varies between 86.4% (60° view) and 91.8% (frontal view). This is probably because the audiovisual system is trained on clean audio. Given the very high classification accuracy achieved by the audio-only model under clean conditions, the fusion model is probably heavily biased towards audio; the fact that audiovisual fusion results in the same performance as audio-only classification under clean conditions also points in that direction. As a consequence, when the level of acoustic noise increases, the fusion model's performance falls below that of the video-only model. It is nevertheless still able to extract some useful information from the visual modality and significantly outperforms the audio-only classifier.
Finally, we should also mention that we experimented with CNNs for the encoding layers, but this led to worse performance than the proposed system. Chung and Zisserman [30] report that it was not possible to train a CNN on OuluVS2 without the use of external data. Similarly, Saitoh et al. [25] report that they were able to train CNNs on OuluVS2 only after data augmentation was used. This is likely due to the rather small training set. We also experimented with data augmentation, which improved performance but did not exceed that of the proposed system.
Table 4: Classification rate (%) on OuluVS2 under additive babble noise at different SNRs (standard deviations in parentheses).

Stream        20 dB        15 dB        10 dB        5 dB         0 dB
Audio         96.5 (1.5)   91.1 (3.2)   73.3 (5.5)   48.1 (6.6)   28.4 (4.7)
Audio + 0°    97.8 (0.6)   94.6 (0.9)   85.1 (1.4)   71.0 (1.3)   57.5 (1.2)
Audio + 30°   97.9 (0.4)   94.2 (0.7)   84.2 (1.4)   70.6 (1.1)   56.8 (3.1)
Audio + 45°   97.4 (0.7)   94.5 (1.2)   85.8 (1.9)   72.1 (2.6)   58.1 (3.3)
Audio + 60°   97.6 (0.6)   94.2 (1.2)   84.1 (1.3)   69.3 (1.7)   53.3 (3.6)
Audio + 90°   97.6 (0.8)   95.3 (1.1)   84.8 (1.9)   70.8 (3.2)   53.9 (4.5)
6 Conclusion

In this work, we presented an end-to-end audiovisual fusion system which jointly learns to extract features directly from pixels and spectrograms and to perform classification using BLSTM networks. Results on audiovisual classification of nonlinguistic vocalisations demonstrate that the proposed model achieves state-of-the-art performance on the AVIC database. In addition, audiovisual speech recognition experiments using different lip views on OuluVS2 demonstrate that the proposed end-to-end model outperforms the audio-only classifier at high levels of acoustic noise. The model can easily be extended to multiple streams, so we plan to perform audiovisual multi-view speech recognition and investigate the influence of audio on the different views.
Acknowledgements

This work has been funded by the European Community Horizon 2020 under grant agreement no. 645094 (SEWA) and no. 688835 (DE-ENIGMA).
References

-  S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Trans. on Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
-  G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proc. of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
-  Z. Zeng, M. Pantic, G. Roisman, and T. Huang, “A survey of affect recognition methods: Audio, visual and spontaneous expressions,” IEEE Trans. Pattern Analysis and Mach. Intel., vol. 31, no. 1, pp. 39–58, 2009.
-  H. Gunes and B. Schuller, “Categorical and dimensional affect analysis in continuous input: Current trends and future directions,” Image and Vision Computing, vol. 31, no. 2, pp. 120–136, 2013.
-  S. Petridis and M. Pantic, “Audiovisual discrimination between speech and laughter: Why and when visual information might help,” IEEE Trans. on Multimedia, vol. 13, no. 2, pp. 216 –234, April 2011.
-  P. S. Aleksic and A. K. Katsaggelos, “Audio-visual biometrics,” Proceedings of the IEEE, vol. 94, no. 11, pp. 2025–2044, 2006.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proc. of ICML, 2011, pp. 689–696.
-  D. Hu, X. Li et al., “Temporal multimodal learning in audiovisual speech recognition,” in IEEE CVPR, 2016, pp. 3574–3582.
-  H. Ninomiya, N. Kitaoka, S. Tamura, Y. Iribe, and K. Takeda, “Integration of deep bottleneck features for audio-visual speech recognition,” in Conf. of the International Speech Communication Association, 2015.
-  Y. Mroueh, E. Marcheret, and V. Goel, “Deep multimodal learning for audio-visual speech recognition,” in IEEE ICASSP, 2015, pp. 2130–2134.
-  Y. Takashima, R. Aihara, T. Takiguchi, Y. Ariki, N. Mitani, K. Omori, and K. Nakazono, “Audio-visual speech recognition using bimodal-trained bottleneck features for a person with severe hearing loss,” Interspeech 2016, pp. 277–281, 2016.
-  M. Wand, J. Koutník, and J. Schmidhuber, “Lipreading with long short-term memory,” in IEEE ICASSP, 2016, pp. 6115–6119.
-  S. Petridis, Z. Li, and M. Pantic, “End-to-end visual speech recognition with lstms,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 2592–2596.
-  Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “Lipnet: Sentence-level lipreading,” arXiv preprint arXiv:1611.01599, 2016.
-  J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” Accepted to IEEE CVPR, 2017.
-  I. Anina, Z. Zhou, G. Zhao, and M. Pietikäinen, “Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis,” in IEEE FG, 2015, pp. 1–5.
-  B. Schuller, R. Mueller, F. Eyben, J. Gast, B. Hoernler, M. Woellmer, G. Rigoll, A. Hoethker, and H. Konosu, “Being bored? Recognising natural interest by extensive audiovisual integration for real-life application,” Image and Vision Comp., vol. 27, no. 12, pp. 1760–1774, 2009.
-  Y. Lan, B. J. Theobald, and R. Harvey, “View independent computer lip-reading,” in IEEE International Conference on Multimedia and Expo, 2012, pp. 432–437.
-  N. Harte and E. Gillen, “Tcd-timit: An audio-visual corpus of continuous speech,” IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603–615, 2015.
-  F. Eyben, S. Petridis, B. Schuller, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks,” in Proc. IEEE ICASSP, 2011, pp. 5844–5847.
-  S. Petridis and M. Pantic, “Prediction-based audiovisual fusion for classification of non-linguistic vocalisations,” IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 45–58, 2016.
-  V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in IEEE CVPR, 2014, pp. 1867–1874.
-  G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., “The HTK book,” vol. 3, p. 175, 2002.
-  T. Saitoh, Z. Zhou, G. Zhao, and M. Pietikäinen, “Concatenated frame image based cnn for visual speech recognition,” in Asian Conference on Computer Vision. Springer, 2016, pp. 277–289.
-  G. Hinton, “A practical guide to training restricted boltzmann machines,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 599–619.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.” in Aistats, vol. 9, 2010, pp. 249–256.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Varga and H. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–251, 1993.
-  J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103.