Visual Speech Enhancement using Noise-Invariant Training
Visual speech enhancement is used on videos shot in noisy environments to enhance the voice of a visible speaker and to reduce background noise. While most existing methods use audio-only inputs, we propose an audio-visual neural network model for this purpose. The visible mouth movements are used to separate the speaker’s voice from the background sounds.
Instead of training our speech enhancement model on a wide range of possible noise types, we train the model on videos where other speech samples of the target speaker are used as background noise. A model trained using this paradigm generalizes well to various noise types, while also substantially reducing training time.
The proposed model outperforms prior audio-visual methods on two public lipreading datasets. It is also the first to be demonstrated on a general dataset not designed for lipreading: a collection of weekly addresses by Barack Obama.
Speech enhancement aims to improve speech quality and intelligibility when audio is recorded in noisy environments. Applications include telephone conversations, video conferences, TV reporting and more. In hearing aids, speech enhancement can reduce discomfort and increase intelligibility . Speech enhancement has also been applied as a preliminary step in speech recognition and speaker identification systems [44, 32].
We propose an audio-visual end-to-end neural network model for separating the voice of a visible speaker from background noise. Once the system is trained on a specific speaker, it can be used to enhance the voice of the same speaker. We assume a video showing the face of the target speaker is available along with the noisy soundtrack, and use the visible mouth movements to isolate the desired voice from the background noise.
Making a speech enhancement neural network model robust to different noises is commonly achieved by training on diverse noise mixtures. In this work, we avoid training against a wide variety of noise mixtures: the number of possible noise combinations is unbounded, so any finite selection would inevitably bias the trained model. Instead, we train the model on the most challenging case, where speech of the target speaker interferes with his or her own voice. This training strategy makes our model effective on different noise types and substantially reduces training time.
We show the effectiveness of our approach by evaluating the performance of the trained model in different enhancement experiments. First, we assess its performance on two public benchmark audio-visual datasets: GRID corpus  and TCD-TIMIT , both designed for continuous audio-visual speech recognition and lip reading research. On both datasets, our performance exceeds that of prior work. We also demonstrate speech enhancement on public weekly addresses of Barack Obama.
1.1 Related work
Traditional speech enhancement methods include spectral restoration [40, 11], Wiener filtering  and statistical model-based methods . Recently, deep neural networks have been adopted for speech enhancement [30, 42, 37], generally outperforming the traditional methods .
Audio-only deep learning based speech enhancement
Previous methods for single-channel speech enhancement mostly use audio-only input. Lu et al. train a deep auto-encoder for denoising the speech signal. Their model predicts a mel-scale spectrogram representing the clean speech. Pascual et al. use generative adversarial networks and operate at the waveform level. Separating mixtures of several people speaking simultaneously has also become possible by training a deep neural network to differentiate between the unique speech characteristics of different sources, e.g., spectral bands, pitches and chirps, as shown in [22, 3]. Despite their decent overall performance, audio-only approaches separate similar human voices poorly, as commonly observed in same-gender mixtures.
Visually-derived speech and sound generation
Different approaches exist for generation of intelligible speech from silent video frames of a speaker [13, 12, 8]. Ephrat et al. generate speech from a sequence of silent video frames of a speaking person. Their model uses the video frames and the corresponding optical flow to output a spectrogram representing the speech. Owens et al. use a recurrent neural network to predict sound from silent videos of people hitting and scratching objects with a drumstick.
Audio-visual multi-modal learning
Recent research in audio-visual speech processing makes extensive use of neural networks. The work of Ngiam et al.  is a seminal work in this area. They demonstrate cross modality feature learning, and show that better features for one modality (e.g., video) can be learned if both audio and video are present at feature learning time. Multi-modal neural networks with audio-visual inputs have also been used for lip reading , lip sync  and robust speech recognition .
Audio-visual speech enhancement
Work has also been done on audio-visual speech enhancement and separation. Khan and Milner [25, 24] use hand-crafted visual features to derive binary and soft masks for speaker separation. Hou et al. propose a convolutional neural network model to enhance noisy speech. Their network gets a sequence of frames cropped to the speaker’s lips region and a spectrogram representing the noisy speech, and outputs a spectrogram representing the enhanced speech. Gabbay et al. feed the video frames into a trained speech generation network, and use the spectrogram of the predicted speech to construct masks for separating the clean voice from the noisy input.
2 Neural Network Architecture
Our speech enhancement neural network model gets two inputs: (i) a sequence of video frames showing the mouth of the speaker; and (ii) a spectrogram of the noisy audio. The output is a spectrogram of the enhanced speech. The network layers are stacked in encoder-decoder fashion, as shown in Figure 1. The encoder module consists of a dual tower Convolutional Neural Network which takes the video and audio inputs and encodes them into a shared embedding representing the audio-visual features. The decoder module consists of transposed convolutional layers and decodes the shared embedding into a spectrogram representing the enhanced speech. The entire model is trained end-to-end.
2.1 Video Encoder
In our experiments the video input to the neural network is a sequence of 5 consecutive grayscale video frames of size , cropped and centered on the mouth region. While using 5 frames worked well, other numbers of frames might also work.
The video frames are the input to the video encoder, comprised of 6 consecutive convolution layers described in Table 1. Each layer is followed by Batch Normalization , Leaky-ReLU  for non-linearity, max pooling, and Dropout  of 0.25.
2.2 Audio encoder
Both input and output audio are represented by log mel-scale spectrograms comprised of 80 mel frequencies from 0 to 8000 Hz. The spectrogram length corresponds to the duration of 5 video frames.
As previously done in several audio encoding networks [9, 18], we also design our audio encoder as a convolutional neural network using the spectrogram as input. The network consists of 5 convolution layers as described in Table 2. Each layer is followed by Batch Normalization  and Leaky-ReLU  for non-linearity. We use strided convolutions instead of max pooling in order to maintain temporal order.
|layer||# filters||kernel||stride|
|1||64||5 × 5||2 × 2|
|2||64||4 × 4||1 × 1|
|3||128||4 × 4||2 × 2|
|4||128||2 × 2||2 × 1|
|5||128||2 × 2||2 × 1|
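As a sanity check on the layer table above, the following minimal sketch traces the spectrogram shape through the five audio encoder layers. It reads the second column of each row as the number of filters and the last column as the stride. It assumes "same" padding (each layer divides a dimension by its stride, rounding up) and that each stride pair is ordered (frequency, time); neither assumption is stated in the text, and the names are illustrative.

```python
import math

# (filters, stride_freq, stride_time) for the five audio encoder layers.
# Assumption: each stride pair in the table is ordered (frequency, time).
AUDIO_ENCODER_LAYERS = [
    (64, 2, 2),
    (64, 1, 1),
    (128, 2, 2),
    (128, 2, 1),
    (128, 2, 1),
]


def audio_encoder_output_shape(n_mels=80, n_frames=20):
    """Trace (frequency, time, channels) through the layers, assuming 'same' padding."""
    freq, time, channels = n_mels, n_frames, 1
    for filters, stride_f, stride_t in AUDIO_ENCODER_LAYERS:
        freq = math.ceil(freq / stride_f)
        time = math.ceil(time / stride_t)
        channels = filters
    return freq, time, channels


freq, time, channels = audio_encoder_output_shape()
print(freq, time, channels, freq * time * channels)
```

Under these assumptions the flattened output has 5 × 5 × 128 = 3,200 values, which agrees with the audio feature vector size reported in Section 2.3.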
2.3 Shared representation
The video encoder outputs a feature vector of 2,048 neurons and the audio encoder outputs a feature vector of 3,200. The feature vectors are concatenated into a shared embedding representing the audio-visual features, having 5,248 values. The shared embedding is then fed into a block of 3 consecutive fully-connected layers, each having 1,312 neurons (compressing the representation size of 5,248 by 4). The resulting vector is then fed into the audio decoder.
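The sizes quoted above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch: the constants come from the text, while the variable names and the (input, output) layout of the layer shapes are illustrative.

```python
VIDEO_FEATURES = 2048  # video encoder output size (from the text)
AUDIO_FEATURES = 3200  # audio encoder output size (from the text)
COMPRESSION = 4        # compression factor of the fully-connected block
NUM_FC_LAYERS = 3

shared_size = VIDEO_FEATURES + AUDIO_FEATURES
fc_width = shared_size // COMPRESSION

# (input size, output size) of each fully-connected layer.
fc_shapes = []
in_size = shared_size
for _ in range(NUM_FC_LAYERS):
    fc_shapes.append((in_size, fc_width))
    in_size = fc_width

print(shared_size, fc_width, fc_shapes)
```

Only the first fully-connected layer sees the full 5,248-value embedding; the remaining two operate on the compressed 1,312-value representation.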
2.4 Audio decoder
The audio decoder consists of 5 transposed convolution layers, mirroring the layers of the audio encoder. The last layer is of the same size as the input spectrogram, representing the enhanced speech.
The network is trained to minimize the mean square error () loss between the output spectrogram and the target speech spectrogram. We use the Adam optimizer with an initial learning rate of for back propagation. The learning rate is decreased by 50% once learning stagnates, i.e., when the validation error does not improve for 5 epochs.
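The learning rate schedule described above can be sketched as follows. This is a framework-independent illustration rather than the training code used in our experiments: the class name is hypothetical, "does not improve for 5 epochs" is read as 5 consecutive epochs without a new best validation error, and the initial learning rate in the usage example is a placeholder value.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation error stops improving."""

    def __init__(self, initial_lr, factor=0.5, patience=5):
        self.lr = initial_lr
        self.factor = factor          # 0.5 corresponds to a 50% decrease
        self.patience = patience      # epochs without improvement before decay
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_error):
        """Update the schedule with one epoch's validation error; return the new rate."""
        if val_error < self.best:
            self.best = val_error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


# Placeholder initial rate, chosen only for illustration.
scheduler = HalveOnPlateau(initial_lr=1.0)
for error in [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95]:
    current_lr = scheduler.step(error)
print(current_lr)
```

In the usage example the validation error fails to improve on 0.9 for five epochs in a row, so the rate is halved once.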
3 Noise-invariant training
Neural networks with multi-modal inputs can often be dominated by one of the inputs . Different approaches have been considered to overcome this issue in previous work. Ngiam et al.  introduce a strategy for audio-visual multi-modal training. In part of the training data, they zero out one of the input modalities (e.g., video) and only have the other input modality (e.g., audio) available. This idea has been adopted in lip reading  and speech enhancement . Hou et al.  output the same input video as an auxiliary for learning features from both modalities.
In this work, we introduce a new training strategy for audio-visual speech enhancement. We train our model on mixtures where the noise is another speech sample from the same speaker. Since separating two overlapping sentences spoken by the same person is not possible using audio-only information, the network is forced to exploit the visual features in addition to the audio features. This technique also greatly reduces training time and the amount of training data needed compared to typical training on a wide range of noises.
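A training mixture of this kind can be built roughly as in the sketch below. The function name is hypothetical, and two details are assumptions not specified in the text: the two waveforms are summed with equal gain, and both are truncated to the shorter length.

```python
import random


def make_same_speaker_mixture(utterances, rng):
    """Pick two different utterances of one speaker and sum them sample by sample.

    `utterances` is a list of waveforms (lists of floats) from a single speaker.
    Returns (mixture, target): the noisy network input and the clean target.
    """
    target_idx, noise_idx = rng.sample(range(len(utterances)), 2)
    target, noise = utterances[target_idx], utterances[noise_idx]
    length = min(len(target), len(noise))  # assumption: truncate to the shorter one
    mixture = [target[i] + noise[i] for i in range(length)]
    return mixture, target[:length]


rng = random.Random(0)
speaker_utterances = [[0.1, 0.2, 0.3], [0.5, 0.5], [1.0, -1.0, 0.0, 0.0]]
mixture, target = make_same_speaker_mixture(speaker_utterances, rng)
```

Only the video of the target utterance would accompany the mixture, so the visual stream is the sole cue that tells the network which of the two voices to keep.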
4 Implementation details
In all our experiments, video is resampled to 25 fps. The video file is cut to non-overlapping segments of 5 frames each, representing 200 ms. In each frame we find 68 facial landmarks suggested by . Using only the subset of 20 mouth landmarks, we crop a mouth-centered window of size pixels from each frame. The network input is a sequence of 5 consecutive grayscale frames, and therefore input size is . We normalize the video inputs over the entire training data by subtracting the mean video frame and dividing by the standard deviation.
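The segmentation of the video stream can be sketched as follows. The helper name is illustrative, and dropping an incomplete final segment is an assumption; the text does not say how a trailing remainder is handled.

```python
def split_into_segments(frames, segment_len=5):
    """Cut a frame sequence into non-overlapping segments, dropping any incomplete tail."""
    n_segments = len(frames) // segment_len
    return [frames[i * segment_len:(i + 1) * segment_len] for i in range(n_segments)]


FPS = 25
SEGMENT_LEN = 5

frame_indices = list(range(75))  # e.g. a 3-second clip at 25 fps
segments = split_into_segments(frame_indices, SEGMENT_LEN)
segment_ms = 1000 * SEGMENT_LEN / FPS
print(len(segments), segment_ms)
```

For a 75-frame clip this yields 15 segments of 200 ms each.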
The corresponding audio signal is resampled to 16 kHz. A Short-Time Fourier Transform (STFT) is applied to the waveform signal. The spectrogram (STFT magnitude) is used as input to the neural network, and the phase is saved for reconstruction of the enhanced signal. We set the STFT window size to 640 samples, which equals 40 milliseconds and corresponds to the length of a single video frame. We shift the window by a hop length of 160 samples at a time, creating an overlap of 75%. The log mel-scale spectrogram is computed by multiplying the spectrogram by a mel-spaced filterbank. It comprises 80 mel frequencies from 0 to 8000 Hz. We slice the spectrogram into pieces of length 200 milliseconds, corresponding to the length of 5 video frames, resulting in spectrograms of size 80 × 20: 20 temporal samples, each having 80 frequency bins. The noisy waveform signal is peak-normalized to 1.
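The timing relations between the STFT parameters and the video frames follow directly from the numbers above. The short sketch below only restates that arithmetic; the constant names are illustrative.

```python
SAMPLE_RATE = 16000  # Hz, after resampling
WINDOW = 640         # STFT window size in samples
HOP = 160            # hop length in samples
SEGMENT_MS = 200     # duration of 5 video frames at 25 fps

window_ms = 1000 * WINDOW / SAMPLE_RATE        # duration of one STFT window
hop_ms = 1000 * HOP / SAMPLE_RATE              # time step between spectrogram columns
overlap = 1 - HOP / WINDOW                     # fraction shared by adjacent windows
frames_per_segment = int(SEGMENT_MS / hop_ms)  # spectrogram columns per segment

print(window_ms, hop_ms, overlap, frames_per_segment)
```

Each 40 ms video frame therefore spans four spectrogram columns, and a 5-frame segment spans 20.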
Audio post processing
Each segment of 200 ms of the original video consists of 5 video frames and 20 temporal samples of the spectrogram. After processing by our model, 20 temporal samples of the spectrogram, covering a 200 ms segment of an enhanced spectrogram, are obtained. All the enhanced segments are concatenated together to create the complete enhanced spectrogram. The waveform is then reconstructed by multiplying the mel-scale spectrogram by the pseudo-inverse of the mel-spaced filterbank, followed by applying the inverse STFT. The phase information was obtained directly from the noisy input signal, preserving the original phase of each frequency. The reconstructed waveform signal is peak-denormalized by multiplying the signal by the normalization factor of the noisy input signal.
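The peak normalization applied before enhancement and its inverse applied afterwards can be sketched as below. The function names are hypothetical, and the sketch takes the "normalization factor" to be the peak absolute amplitude of the noisy input, which is our reading of the description above.

```python
def peak_normalize(waveform):
    """Scale a waveform so its largest absolute sample is 1; also return the peak."""
    peak = max(abs(s) for s in waveform)
    if peak == 0:
        return list(waveform), 1.0  # silent input: nothing to scale
    return [s / peak for s in waveform], peak


def peak_denormalize(waveform, peak):
    """Undo peak normalization by multiplying with the stored peak."""
    return [s * peak for s in waveform]


noisy = [0.1, -0.4, 0.2]
normalized, peak = peak_normalize(noisy)
# ... the enhancement model would operate on `normalized` here ...
restored = peak_denormalize(normalized, peak)
```

Storing the peak of the noisy input is what lets the enhanced waveform be returned at the original loudness level.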
Sample frames from the datasets are shown in Figure 2.
We perform our base experiments on the GRID audio-visual sentence corpus , a large dataset of audio and video (facial) recordings of 1,000 sentences spoken by 34 people (18 male, 16 female). A total of 51 different words are contained in the GRID corpus. Videos have a fixed duration of 3 seconds at a frame rate of 25 fps with a resolution of pixels, resulting in sequences comprising 75 frames.
We also perform experiments on the TCD-TIMIT dataset . This dataset consists of 60 volunteer speakers with around 200 videos each, as well as three lipspeakers, people specially trained to speak in a way that helps the deaf understand their visual speech. The speakers are recorded saying various sentences from the TIMIT dataset , and are recorded using both front-facing and 30 degree cameras. The TIMIT dataset has about 2 orders of magnitude more words in its vocabulary than GRID.
Mandarin Sentences Corpus
Hou et al. prepared an audio-visual dataset containing video recordings of 320 utterances of Mandarin sentences spoken by a native speaker. Each sentence contains 10 Chinese characters, with phonemes designed to be equally distributed. The length of each utterance is approximately 3-4 seconds. The utterances were recorded in a quiet room with sufficient light, and the speaker was captured from a frontal view.
Obama Weekly Addresses
We assess our model’s performance in more general conditions compared to datasets specifically prepared for lip-reading. For this purpose we use a dataset containing weekly addresses given by Barack Obama. This dataset consists of 300 videos, each 2-3 minutes long. The dataset varies greatly in scale (zoom), background, lighting and face angle, as well as in audio recording conditions, and includes an unbounded vocabulary.
While we report an evaluation of our model using common speech quality metrics, listening to audio samples is essential to understand the effectiveness of speech enhancement methods. For this purpose, examples of speech enhancement are available as supplementary material on our project web page (http://www.vision.huji.ac.il/speaker-separation).
6.1 Evaluation of Speech Quality
The results of our experiments are assessed using the Perceptual Evaluation of Speech Quality (PESQ), which evaluates the quality of a degraded (or enhanced) signal relative to a reference signal (in our case, the original clean speech signal). PESQ scores vary on a scale of 1 to 4.5, where higher is better. In addition to these measurements, we assessed the intelligibility and quality of our results using informal human listening.
GRID and TCD-TIMIT
We compare the performance of our proposed model to previous audio-visual speech enhancement results on the two public lipreading datasets of GRID and TCD-TIMIT. For each dataset, we test our model on noisy mixtures consisting of the voice of the target speaker, where the noise is the voice of a different speaker of the same gender. The noise waveform is amplified to have the same peak amplitude as the speech signal. The comparison is summarized in Table 3.
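The construction of these test mixtures can be sketched as follows. The function name is hypothetical, and truncating both signals to the shorter length before measuring peaks is an assumption made for the sketch.

```python
def match_peak_and_mix(speech, noise):
    """Scale `noise` to the peak amplitude of `speech`, then add the two signals."""
    length = min(len(speech), len(noise))  # assumption: truncate to the shorter one
    speech, noise = speech[:length], noise[:length]
    speech_peak = max(abs(s) for s in speech)
    noise_peak = max(abs(n) for n in noise)
    gain = speech_peak / noise_peak if noise_peak > 0 else 0.0
    scaled_noise = [n * gain for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


speech = [0.5, -1.0, 0.25]
noise = [0.1, 0.2, -0.05, 0.3]
mixture, scaled_noise = match_peak_and_mix(speech, noise)
```

Matching peaks puts the interfering voice at a level comparable to the target, which is a demanding condition for same-gender mixtures.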
[Table 3: comparison with prior audio-visual speech enhancement methods]
We further examine the performance of our approach on different noise types (Table 4). Here, we selected one male speaker and one female speaker from the GRID dataset, and mixed 80 of their speech samples with three different noise types: other males, other females, and non-speech ambient noise. For the ambient noise we used 20 types of noises such as rain, motorcycle engine, basketball bouncing, etc.
|speaker||noise||noisy signal||enhanced signal|
To better demonstrate the capability of our method, we use a dataset comprising the weekly addresses of Barack Obama. To our knowledge, this is the first work demonstrating audio-visual speech enhancement on a dataset that was not created for research purposes. Speech samples from the LibriSpeech dataset were normalized to the same peak amplitude as, and mixed with, speech samples from the Obama dataset, and then enhanced by our model. The results are presented in Table 5.
|speaker||noise||noisy signal||enhanced signal|
In the final experiment, we want to confirm the effectiveness of our training strategy compared to the conventional training on multiple noise types. For this purpose, we use the Mandarin dataset described earlier and compare our method to Hou et al. .
Hou et al. train their model using a training set containing 91 different noise types in addition to a constant car engine noise in all of the samples. The clean utterances were mixed with the noises at 5 different signal-to-noise ratios. In the testing stage, they used 10 other noise types with another sample of car engine noise. We did not train our model on all possible noises. Instead, we trained our model on overlapping sentences spoken by the same Mandarin speaker. For testing, we used the exact same test set and setup as in their experiment.
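For reference, mixing at a prescribed signal-to-noise ratio, as in the conventional multi-noise training setup, can be sketched as below. This illustrates the standard SNR definition only; it is not the data preparation code of Hou et al., the function name is hypothetical, and the SNR values in the usage example are placeholders.

```python
import math


def scale_noise_to_snr(speech, noise, snr_db):
    """Return `noise` rescaled so that 10*log10(P_speech / P_noise) equals `snr_db`."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [n * gain for n in noise]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [2.0, 2.0, -2.0, -2.0]
# Placeholder SNR values, chosen only for illustration.
noisy_0db = [s + n for s, n in zip(speech, scale_noise_to_snr(speech, noise, 0))]
noisy_10db = [s + n for s, n in zip(speech, scale_noise_to_snr(speech, noise, 10))]
```

Covering many noise types at several such ratios multiplies the amount of training data, which is the cost our same-speaker training strategy avoids.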
We evaluate both enhancement methods in a human listening study with native Mandarin speakers. The study showed that both methods, that of Hou et al. and ours, improve intelligibility over the initial noisy inputs, and that neither has a significant advantage over the other. In contrast to the human study, our model achieves a slightly lower PESQ score (see Table 3), but PESQ is known to be inaccurate on Chinese speech. The human listening study shows that similar performance can be obtained even without a wide range of noisy samples in the training data.
Furthermore, the improved quality measured by the objective score in Tables 4 and 5 indicates that our approach is robust to the noise type. It should be noted that the intelligibility slightly decreases when the ambient noise is very loud in comparison to the speech, as was observed in our informal listening study.
Finally, in Figure 3 we present spectrograms from our enhancement flow on one test sample.
7 Concluding remarks
In this work we have provided several key contributions. First, we have proposed an audio-visual end-to-end neural network model for enhancing the voice of a visible speaker from background noise. Furthermore, we have introduced an effective training strategy for audio-visual speech enhancement using training data of overlapping sentences spoken by the same person. Thus, our trained model is robust to similar vocal characteristics of the target and noise speakers, due to the utilization of visual information.
We have shown that the proposed model consistently improves the quality and intelligibility of noisy speech, and outperforms previous works on two public benchmark datasets. Finally, we have demonstrated for the first time audio-visual speech enhancement on a general dataset not generated for lipreading research.
It is interesting to note that our method is not successful when one input modality, video or audio, is missing. Models specially trained on voice only or video only perform better in these cases. This follows from our assumption, reflected in the training data, that both audio and video carry useful information.
Acknowledgment. This research was supported by Israel Science Foundation and by Israel Ministry of Science and Technology.
- [1] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. Lipnet: Sentence-level lipreading. arXiv:1611.01599, 2016.
- [2] A. W. Bronkhorst. The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica, 86(1):117–128, January 2000.
- [3] Z. Chen. Single Channel auditory source separation with neural network. PhD thesis, Columbia Univ., 2017.
- [4] F. L. Chong, I. V. McLoughlin, and K. Pawlikowski. A methodology for improving PESQ accuracy for Chinese speech. In TENCON’05, 2005.
- [5] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. arXiv:1611.05358, 2016.
- [6] J. S. Chung and A. Zisserman. Out of time: automated lip sync in the wild. In ACCV’16, pages 251–263, 2016.
- [7] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. J. Acoustical Society of America, 120(5):2421–2424, 2006.
- [8] T. L. Cornu and B. Milner. Generating intelligible audio speech from visual speech. In IEEE/ACM Trans. Audio, Speech, and Language Processing, 2017.
- [9] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi. Neural audio synthesis of musical notes with wavenet autoencoders. arXiv:1704.01279, 2017.
- [10] Y. Ephraim. Statistical-model-based speech enhancement systems. Proc. of the IEEE, 80(10):1526–1555, 1992.
- [11] Y. Ephraim and D. Malah. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. ASSP, 32(6):1109–1121, 1984.
- [12] A. Ephrat, T. Halperin, and S. Peleg. Improved speech reconstruction from silent video. In ICCV’17 Workshop on Computer Vision for Audio-Visual Media, 2017.
- [13] A. Ephrat and S. Peleg. Vid2speech: speech reconstruction from silent video. In ICASSP’17, 2017.
- [14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR’16, pages 1933–1941, 2016.
- [15] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg. Seeing through noise: Speaker separation and enhancement using visually-derived speech. arXiv:1708.06767, 2017.
- [16] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NISTIR 4930, 1993.
- [17] L. Girin, J. L. Schwartz, and G. Feng. Audio-visual enhancement of speech in noise. J. of the Acoustical Society of America, 109(6):3007–3020, 2001.
- [18] E. M. Grais and M. D. Plumbley. Single channel audio source separation using convolutional denoising autoencoders. arXiv:1703.08019, 2017.
- [19] N. Harte and E. Gillen. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Trans. Multimedia, 17(5):603–615, 2015.
- [20] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang. Audio-visual speech enhancement based on multimodal deep convolutional neural network. arXiv:1703.10893, 2017.
- [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML’15, pages 448–456, 2015.
- [22] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey. Single-channel multi-speaker separation using deep clustering. arXiv:1607.02173, 2016.
- [23] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR’14, pages 1867–1874, 2014.
- [24] F. Khan. Audio-visual speaker separation. PhD thesis, University of East Anglia, 2016.
- [25] F. Khan and B. Milner. Speaker separation using visually-derived binary masks. In Auditory-Visual Speech Processing (AVSP), 2013.
- [26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
- [27] M. Kolbæk, Z.-H. Tan, and J. Jensen. Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 25(1):153–167, 2017.
- [28] J. Lim and A. Oppenheim. All-pole modeling of degraded speech. IEEE Trans. on Acoustics, Speech, and Signal Processing, 26(3):197–210, 1978.
- [29] P. C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, Inc., 2nd edition, 2013.
- [30] X. Lu, Y. Tsao, S. Matsuda, and C. Hori. Speech enhancement based on deep denoising autoencoder.
- [31] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML’13, volume 30, 2013.
- [32] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng. Recurrent neural networks for noise reduction in robust ASR. In 13th Conf. of the International Speech Communication Association, 2012.
- [33] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML’11, pages 689–696, 2011.
- [34] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
- [35] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR’16, 2016.
- [36] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE, 2015.
- [37] S. Parveen and P. Green. Speech enhancement with missing data techniques using recurrent neural networks. In ICASSP’04, volume 1, pages I–733, 2004.
- [38] S. Pascual, A. Bonafonte, and J. Serrà. SEGAN: Speech enhancement generative adversarial network. arXiv:1703.09452, 2017.
- [39] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In ICASSP’01, volume 2, pages 749–752, 2001.
- [40] P. Scalart et al. Speech enhancement based on a priori signal to noise estimation. In ICASSP’96, volume 2, pages 629–632, 1996.
- [41] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [42] S. Tamura and A. Waibel. Noise reduction using connectionist models. In ICASSP’88, pages 553–556, 1988.
- [43] L.-P. Yang and Q.-J. Fu. Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. J. of the Acoustical Society of America, 117(3):1001–1004, 2005.
- [44] D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero. A minimum-mean-square-error noise reduction algorithm on mel-frequency cepstra for robust speech recognition. In ICASSP’08, pages 4041–4044, 2008.