Time Domain Audio Visual Speech Separation
Audio-visual multi-modal modeling has been demonstrated to be effective in many speech-related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the previous TasNet (time-domain speech separation network) to enable multi-modal learning, and at the same time extends classical audio-visual speech separation from the frequency domain to the time domain. The main components of the proposed architecture are an audio encoder, a video encoder that extracts lip embeddings from video streams, a multi-modal separation network and an audio decoder. Experiments on simulated mixtures based on the recently released LRS2 dataset show that our method brings Si-SNR improvements of more than 3 dB and 4 dB in the 2- and 3-speaker cases, respectively, compared to audio-only TasNet and frequency-domain audio-visual networks.
Jian Wu, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, Dong Yu
School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Tencent AI Lab, Shenzhen, China (This work was done when the first author was an intern at Tencent AI Lab.)
Tencent AI Lab, Bellevue, USA
Index Terms: audio visual speech separation, speech enhancement, TasNet, multi-modal learning
1 Introduction
The goal of speech separation is to separate each source speaker from the mixture signal. Although it has been studied for many years, speech separation is still a difficult problem, especially in noisy and reverberant environments. Several audio-only speech separation methods were recently proposed, such as uPIT [1], DPCL [2], DANet [3] and TasNet [4]. However, in these approaches, the number of target speakers has to be known as prior information and is assumed to be unchanged between training and testing. In addition, the separation results of these systems cannot be associated with the speakers, which greatly limits their application scenarios.
If we can extract some target-speaker-dependent features, the task of speech separation becomes a target speaker extraction problem. This has several clear advantages over blind separation approaches. First, as the model extracts only one target speaker at a time from the mixture, prior knowledge of the number of speakers is no longer needed. Second, as the target speaker features are given to the model, the label permutation issue is avoided. These properties make target speaker separation more practical than blind separation.
Several target speaker separation approaches have been explored in the past [5, 6, 7]. In [7], the authors proposed a system named VoiceFilter that used d-vectors [8] as the embedding of the target speaker for the separation network. Similarly, a short anchor utterance has been used as auxiliary input for target speaker separation. However, speaker- or utterance-based features are not robust enough, which can heavily affect system performance, especially when noise exists or same-gender speakers are mixed.
Alternatively, visual information is insensitive to acoustic noise and highly correlated with the speech content. Combining audio and visual information has previously been investigated for speech separation [9, 10], speech enhancement [11, 12], and speech recognition [13, 14, 15]. The results have shown the great potential of visual features as a complementary information source. For speech separation, [11] proposed an encoder-decoder model architecture to separate the voice of a visible speaker from background noise. The noisy spectrogram and center-cropped video frames were used as the inputs of the audio and visual encoders, respectively. [9] built a deeper network by stacking depth-wise separable convolution blocks, which includes a magnitude network to estimate Time-Frequency (TF) masks [16] and a phase network to predict the clean phase from the mixture phase. The magnitude network used pre-trained lip embeddings [17] as visual features. Similarly, [10] built a model based on dilated convolutions and LSTMs using face embeddings as visual features and yielded good results on a large-scale audio-visual dataset.
Most previous audio-visual separation systems have handled the audio stream in the TF domain, and thus the accuracy of the estimated TF mask is the key to the success of those systems. On the other hand, the phase of the signals is considered less often, although it can significantly affect separation quality [18, 19, 20, 21]. To incorporate the phase information, [9] used a phase subnet to refine the original noisy phase, and [10] adopted the newly proposed complex masks [19] instead of traditional real masks (e.g., IRM, PSM [22], etc.). Recently, [4, 23] proposed a new encoder-decoder framework named TasNet, which separates speech directly in the time domain with impressive results on the public wsj0-2mix dataset. In this paper, we propose a new separation network that generalizes TasNet to enable multi-modal fusion of auditory and visual signals. Given a raw waveform of mixture speech and the corresponding video stream of the target speaker, our audio-visual speech separation model can extract the audio of the target speaker directly.
The contributions of this work include: 1) A new structure for multi-modal target separation is proposed. To illustrate its effectiveness, a comprehensive comparison is performed with three typical separation models: uPIT [1] (frequency-domain audio-only), Conv-TasNet [23] (time-domain audio-only) and Conv-FavsNet (frequency-domain audio-visual). Audio samples of the compared systems can be checked out at https://funcwj.github.io/online-demo/page/tavs. 2) To the best of our knowledge, this is the first work that performs audio-visual separation directly in the time domain. Experiments on recently released in-the-wild videos show that the proposed structure brings significant improvements over all baseline models. 3) Previous visual features are not well designed for speech separation. In this work, the visual (lip) embeddings are specifically trained to represent phonetic information. Different modeling units for the embedding network, such as words, phonemes and context-dependent phones, are also investigated and compared.
2 Proposed system
This section introduces the architecture of our proposed time-domain audio-visual speech separation network. Generally speaking, the network is fed with chunks of raw waveform and the corresponding video frames, and predicts the speech of the target speaker directly. Scale-invariant source-to-noise ratio (Si-SNR) is used as the training objective function, which is defined as
$$\text{Si-SNR}(\mathbf{s}_e, \mathbf{s}_t) = 20 \log_{10} \frac{\|\alpha \cdot \mathbf{s}_t\|}{\|\mathbf{s}_e - \alpha \cdot \mathbf{s}_t\|},$$
where $\mathbf{s}_e$ and $\mathbf{s}_t$ are the estimated signal and the target source, respectively, both normalized to have zero mean, and $\alpha$ is an optimal scaling factor computed via
$$\alpha = \frac{\mathbf{s}_e^{T}\mathbf{s}_t}{\mathbf{s}_t^{T}\mathbf{s}_t}.$$
We also use Si-SNR as the evaluation metric, which is thought to be a more robust measure of separation quality than the original SDR [24].
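As an illustration of the Si-SNR objective described above, the following is a minimal pure-Python sketch. It assumes the usual scale-invariant formulation (remove the mean of both signals, project the estimate onto the target, and compare the projected energy with the residual energy). The function name `si_snr` and the stabilizer `eps` are illustrative choices, not taken from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sequences."""
    n = len(target)
    # Normalize both signals to zero mean.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Optimal scaling factor: alpha = <e, t> / <t, t>.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - b for a, b in zip(e, scaled_target)]
    num = math.sqrt(sum(v * v for v in scaled_target))
    den = math.sqrt(sum(v * v for v in residual))
    # eps (an assumption of this sketch) avoids division by zero and log(0).
    return 20.0 * math.log10(num / (den + eps) + eps)
```

Because the target is rescaled by the optimal factor before the residual is computed, multiplying the estimate by a constant leaves the value essentially unchanged, which is the property that distinguishes Si-SNR from a plain SNR.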
The proposed structure is mainly inherited from the audio-only Conv-TasNet proposed in [23], which contains three parts: an audio encoder, an audio decoder and a separation network. Given an audio mixture chunk $\mathbf{x}$, the encoder encodes the mixture samples as a sequence of non-negative vectors $\mathbf{w}$ in another feature space, while the separation network estimates a mask $\mathbf{m}_i$ for each source defined on that space. The audio decoder reconstructs each masked result back into the time domain. The whole framework can be described as
$$\mathbf{w} = \text{Encoder}(\mathbf{x}),$$
$$[\mathbf{m}_1, \cdots, \mathbf{m}_C] = \text{Separator}(\mathbf{w}),$$
$$\mathbf{s}_i = \text{Decoder}(\mathbf{w} \odot \mathbf{m}_i), \quad i = 1, \cdots, C,$$
where $C$ means the number of speaker sources in the mixture.
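However, when a video stream is available, our motivation is to reformulate the separation network so that it is fed with both audio and video embedding sequences and generates only the mask of the target speaker $\mathbf{m}_v$, which can be written as
$$\mathbf{m}_v = \text{Separator}(\text{Encoder}_a(\mathbf{x}), \text{Encoder}_v(\mathbf{v})),$$
where $\mathbf{x}$ is the audio mixture chunk, $\mathbf{v}$ means the video embeddings of the target speaker, and $\text{Encoder}_{a/v}$ means the audio/video encoder. Finally, the separated result $\mathbf{s}_v$ is obtained through the formula below:
$$\mathbf{s}_v = \text{Decoder}(\text{Encoder}_a(\mathbf{x}) \odot \mathbf{m}_v).$$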
2.2 Video encoder
There are several options for the input of the video encoder: raw image frames, lip embeddings or face embeddings. Following previous work, we extract embeddings from a single spatio-temporal convolutional layer followed by an 18-layer ResNet [25]. We improve the lip embedding, however, by trying different classification targets, e.g., word, context-dependent phone (CD-phone) and context-independent phone (CI-phone). More details are discussed in the experimental section below.
The structure of the video encoder is similar to , which contains several temporal convolutional blocks () with pre-activation. In our experiments, and residual connection is kept, although they do not have significant impact according to our results. ReLU and batch normalization  are also included in each block. We use depth-wise separable convolution layer  to implement temporal convolutional blocks, which greatly reduce the model parameters.
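To make the parameter saving from depth-wise separable convolution concrete, the sketch below counts weights for a standard 1D convolution and for its depth-wise separable counterpart (a per-channel temporal filter followed by a 1x1 point-wise convolution). The channel count and kernel size in the example are hypothetical values chosen for illustration, and biases are ignored.

```python
def conv1d_params(c_in, c_out, kernel):
    """Weights of a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in, c_out, kernel):
    """Weights of a depth-wise separable 1D convolution (bias ignored)."""
    depthwise = c_in * kernel   # one temporal filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical block size: 256 input/output channels, kernel size 3.
standard = conv1d_params(256, 256, 3)
separable = separable_conv1d_params(256, 256, 3)
print(standard, separable)  # prints "196608 66304"
```

For these illustrative sizes the separable block needs roughly a third of the weights, and the saving grows with the kernel size.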
2.3 Audio encoder/decoder
We keep the implementation of the audio encoder/decoder the same as in the original Conv-TasNet: the encoder performs a 1D convolution on the time-domain audio signal, and the decoder performs a 1D deconvolution on the masked mixture representation. Thus the audio encoder/decoder can be represented as:
and denote size of kernel and stride in 1D convolution operation, respectively. In this paper, we use and by default.
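The relation between the kernel size, the stride and the number of encoded frames can be sketched as follows. This assumes an unpadded 1D convolution in the encoder and a matching transposed convolution in the decoder. The kernel size 40 and stride 20 used in the example are hypothetical values for illustration only, not settings confirmed by the text.

```python
def encoder_frames(num_samples, kernel, stride):
    """Number of frames produced by an unpadded 1D convolution."""
    if num_samples < kernel:
        return 0
    return (num_samples - kernel) // stride + 1


def decoder_samples(num_frames, kernel, stride):
    """Number of samples produced by the matching 1D transposed convolution."""
    if num_frames == 0:
        return 0
    return (num_frames - 1) * stride + kernel


# Hypothetical settings: a 2 s chunk at 16 kHz, kernel 40, stride 20.
frames = encoder_frames(32000, kernel=40, stride=20)
samples = decoder_samples(frames, kernel=40, stride=20)
print(frames, samples)  # prints "1599 32000"
```

With these example values the decoder restores the original chunk length exactly, because the chunk length minus the kernel size is a multiple of the stride.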
2.4 Separation network
Separation network is designed for estimating masks of target speaker, conditioned on the outputs of the video and audio encoders. This network also includes an audio-visual fusion layer and several audio/fusion convolutional blocks (/), as depicted in Fig.2.
The encoded audio features are fed into the audio-visual fusion layer together with video stream . Features of the two streams are fused in a fusion layer, which is performed through a simple concatenation operation over the channel dimensions, followed by a position-wise projection onto new dimensions. In order to synchronize the resolution of audio and video streams, upsampling is done on video stream before concatenation if necessary. The description above could be written as:
Finally, fusion convolutional blocks are fed with fused features and produce target masks:
here means an arbitrary non-linear function. We use ReLU in all our experiments, without loss of generality.
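The fusion step described in this section (upsample the video stream to the audio resolution, then concatenate over the channel dimension) can be sketched in plain Python. Nearest-neighbour repetition is an assumption of this sketch, since the text does not specify the upsampling method, and the learned position-wise projection that follows the concatenation is omitted. The helper names are illustrative.

```python
def upsample_nearest(frames, target_len):
    """Repeat video frames so that the sequence has target_len time steps."""
    src_len = len(frames)
    return [frames[i * src_len // target_len] for i in range(target_len)]


def fuse(audio_feats, video_feats):
    """Concatenate audio and upsampled video features at every time step.

    Each feature sequence is a list of per-step channel vectors (lists).
    The learned projection applied after concatenation is not modeled here.
    """
    video_up = upsample_nearest(video_feats, len(audio_feats))
    return [a + v for a, v in zip(audio_feats, video_up)]


# Toy example: 4 audio steps with 2 channels, 2 video steps with 1 channel.
audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
video = [[0.1], [0.2]]
fused = fuse(audio, video)  # 4 steps with 3 channels each
```

In this toy case each video frame is repeated twice, so the fused sequence keeps the audio time resolution while gaining the video channels.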
3 Experimental setup and results
In the experiments, we created two-speaker and three-speaker mixtures using utterances from the Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset, which consists of thousands of spoken sentences from BBC television with their corresponding transcriptions. The training, validation and test sets are split according to the broadcast date, and thus the sets do not overlap.
Short utterances (less than 2s) are dropped and 21075 utterances in total are used for data generation, with 19445 for training and the rest for validation and testing. The details of the utterances used for simulation are summarized in Table 1. Two- and three-speaker mixtures are generated by randomly selecting different utterances and mixing them at various signal-to-noise ratios (SNR) between -5 dB and 5 dB. The sampling rate is 16 kHz. To ensure that the video of each source is available in a mixture, longer sources are truncated to align with the shortest one. The source segment is synchronized with the video stream at 25 fps. Finally, we simulated 40k (25h+ in total), 5k and 3k utterances for the training, validation and test sets, respectively. (Footnote: the Si-SNR on the two- and three-speaker mixtures is 0.01 dB and -3.33 dB, respectively.)
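The mixing procedure can be illustrated for the two-speaker case with the sketch below. It assumes that the SNR is the power ratio between the target and the rescaled interferer, and that both sources are truncated to the shorter one; the exact simulation script of the paper is not given, so the function `mix_at_snr` is only an illustration.

```python
import math


def mix_at_snr(target, interferer, snr_db):
    """Mix two sources so that target/interferer power equals snr_db."""
    # Truncate the longer source so both are aligned with the shortest one.
    n = min(len(target), len(interferer))
    t, i = target[:n], interferer[:n]
    power_t = sum(x * x for x in t) / n
    power_i = sum(x * x for x in i) / n
    # Gain applied to the interferer to reach the requested SNR.
    gain = math.sqrt(power_t / (power_i * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in i]
    mixture = [a + b for a, b in zip(t, scaled)]
    return mixture, t, scaled
```

In a data simulation loop, `snr_db` would be drawn from the range given above, for example with `random.uniform(-5, 5)`.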
3.2 Training details
Similar to previous work, we first train a lipreading network with a spatio-temporal convolutional front-end on the LRW dataset, which was proposed for the word-level classification task, with 76.02% classification accuracy on the test set. However, the LRW dataset does not have utterance-level audio transcripts, which makes it infeasible to replace word-level training targets with smaller units, such as CD/CI-phones. Instead, we choose LRS2's pre-train set to train the phone-level lip embedding model, which differs from previous work. Following Kaldi [28] recipes, phoneme alignments are derived from the GMM system and sub-sampled to the video sampling rate. The training process is similar to that of the word-level task, except that a frame-level cross-entropy loss is used instead of a sequence-level one. After training is done, lip embeddings of each video stream are extracted. We use 256-dimensional embeddings for each video frame in all experiments.
The audio visual network is trained with 2s audio/video chunks using Adam  optimizer for 80 epochs with early stopping when there is no improvement on validation loss for 6 epochs. Initial learning rate is set to and halved during training if there is no improvement for 3 epochs on validation loss.
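The learning-rate and early-stopping rules above can be read in more than one way; the sketch below shows one possible interpretation, in which the learning rate is halved after every 3 consecutive epochs without validation improvement and training stops after 6 such epochs. The function name and the initial learning rate passed in by the caller are illustrative, since the value used in the paper is not recoverable from the text.

```python
def run_schedule(val_losses, init_lr, halve_patience=3, stop_patience=6,
                 max_epochs=80):
    """Return the learning rate in effect after each epoch."""
    lr = init_lr
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without validation improvement
    history = []
    for loss in val_losses[:max_epochs]:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs % halve_patience == 0:
                lr *= 0.5
        history.append(lr)
        if bad_epochs >= stop_patience:
            break  # early stopping
    return history
```

For example, with validation losses that stop improving after the second epoch, this reading halves the learning rate at the fifth epoch and stops training at the eighth.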
3.3 Results and comparisons
On both datasets, we report oracle mask results as well as two conventional audio-only methods in the frequency and time domains, respectively: uPIT-BLSTM with PSM as the training target, and Conv-TasNet with the best non-causal configuration. Results are shown in Table 2. The audio streams in LRS2 are not as clean as those in WSJ0, which degrades the performance of the audio-only methods compared to previously reported results, especially when the number of speakers increases.
3.3.1 Results with different lip embeddings
We first report the audio-visual results with word-level lip embeddings trained on the LRW dataset, which already lead to significant improvements over Conv-TasNet. The video resolutions of the LRW and LRS2 datasets do not match, so we crop the center 70×70 pixel region and then resample it to 112×112. Considering that the word-level classification task cannot produce robust lip embeddings for separation, we instead train embeddings with phoneme targets on the pre-train set of LRS2. This leads to better results than word-level embeddings, particularly on difficult tasks, as shown in Table 3.
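The cropping and resampling step can be sketched on a frame stored as a nested list. The 160×160 input size in the usage note and the nearest-neighbour interpolation are assumptions of this sketch, since the text does not state the original frame size or the interpolation method.

```python
def center_crop(frame, size):
    """Crop the central size x size region of a 2D frame (list of rows)."""
    h, w = len(frame), len(frame[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def resize_nearest(frame, out_size):
    """Resample a 2D frame to out_size x out_size by nearest neighbour."""
    h, w = len(frame), len(frame[0])
    return [[frame[r * h // out_size][c * w // out_size]
             for c in range(out_size)]
            for r in range(out_size)]


def preprocess(frame):
    """Center-crop to 70x70, then resample to 112x112."""
    return resize_nearest(center_crop(frame, 70), 112)
```

For a hypothetical 160×160 frame, `preprocess` keeps rows and columns 45 to 114 and stretches them to a 112×112 grid.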
Normalization techniques are critical in our experiments. As shown in Table 3, replacing batch normalization with global normalization leads to improvements on both the 2- and 3-speaker mixture datasets. Since we do not use a very deep video encoder, the residual connection in the video block has little effect on the final results. Increasing the number of blocks in the video encoder brings no significant improvement, possibly because the visual features are already well trained.
3.3.2 Impact of separation networks
The final results largely depend on the target speaker masks produced by the separation network. In addition to the normalization layer mentioned in Section 3.3.1, we also tuned the number of fusion blocks in our experiments. By fixing the number of blocks in the separation network (), increasing the number of fusion blocks brings more context information of fused features. Table 4 illustrates that using one audio block and three fusion blocks achieves the best performance.
Based on the above discussion, we further compared CD- and CI-phones on the 3-speaker dataset with different numbers of fusion blocks. Results show that using CD-phones (3048 units) as training targets gives slightly worse results. We therefore choose CI-phones in the following experiments.
3.3.3 Comparison with frequency domain networks
To further investigate the advantages of the time-domain approach, we trained a frequency-domain audio-visual separation model named Conv-FavsNet for comparison. Conv-FavsNet removes the audio encoder from the proposed time-domain network and replaces the decoder with a linear layer, which transforms the output of the convolutional blocks into TF masks. We use a linear spectrogram computed with a 40 ms Hanning window and a 10 ms shift as the input audio features, and PSM as the training target. The loss function is defined as:
where denotes the estimated target speaker masks, and denote the magnitude of target speaker and mixture signal, respectively. Results are shown in Table 5. We can find that the noisy phase certainly affects the quality of the reconstructed waveform as oracle phase can bring additional 3dB Si-SNR improvement which thus matches our time domain audio-visual results.
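The frame arithmetic behind the frequency-domain front-end can be checked with a few lines of Python. The window and shift durations and the 16 kHz sampling rate come from the text; the unpadded framing convention and the helper names are assumptions of this sketch.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a number of samples."""
    return duration_ms * sample_rate // 1000


def num_stft_frames(num_samples, win, hop):
    """Number of analysis frames without padding."""
    if num_samples < win:
        return 0
    return (num_samples - win) // hop + 1


SAMPLE_RATE = 16000
win = ms_to_samples(40, SAMPLE_RATE)   # 640 samples
hop = ms_to_samples(10, SAMPLE_RATE)   # 160 samples
stft_frames = num_stft_frames(2 * SAMPLE_RATE, win, hop)  # 197 frames in 2 s
video_frames = 2 * 25                  # 50 video frames in 2 s at 25 fps
```

Under these assumptions a 10 ms shift yields about four spectral frames per 40 ms video frame, so the video stream has to be upsampled by roughly a factor of four before it can be fused with the spectrogram features.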
3.3.4 Multi-speaker training
For the target separation architecture, we find that models trained on hard tasks work well on easy tasks. For example, in our experiments, the model trained on the 3-speaker dataset also performs well on 2-speaker mixture signals. This motivates us to train models on a mixed training set of 2- and 3-speaker mixtures, which is denoted as multi-speaker training. In fact, the architecture of the target isolating network is independent of the number of speakers. The label of each training sample is determined by the auxiliary features, e.g., lip embeddings in this work. This is applicable to real-world scenarios, where the number of speakers is difficult to detect most of the time. Results in Table 6 show that training on the combined 2- and 3-speaker dataset achieves 14.02 dB and 9.92 dB Si-SNR on the 2- and 3-speaker test sets, respectively.
4 Conclusions
In this paper, we have proposed a time-domain audio-visual target speech separation architecture that takes the raw waveform and the target speaker's video stream as input and predicts the target speech waveform directly. A lip embedding extractor is pre-trained to extract movement information from the video streams. We find that both word-level and phoneme-level lip embeddings benefit the separation network. Compared with the audio-only methods and the frequency-domain audio-visual method, the proposed approach improves Si-SNR by more than 3 dB and 4 dB on the 2- and 3-speaker test sets, respectively.
References
[1] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
[2] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 31–35.
[3] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 246–250.
[4] Y. Luo and N. Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.
[5] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Interspeech, 2017.
[6] J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” arXiv preprint arXiv:1807.08974, 2018.
[7] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” arXiv preprint arXiv:1810.04826, 2018.
[8] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
[9] T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” arXiv preprint arXiv:1804.04121, 2018.
[10] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” arXiv preprint arXiv:1804.03619, 2018.
[11] A. Gabbay, A. Shamir, and S. Peleg, “Visual speech enhancement,” Proc. of Interspeech 2018, pp. 1170–1174, 2018.
[12] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing through noise: Visually driven speaker separation and enhancement,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3051–3055.
[13] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in CVPR, 2017, pp. 3444–3453.
[14] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[15] B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett et al., “Large-scale visual speech recognition,” arXiv preprint arXiv:1807.05162, 2018.
[16] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[17] T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with lstms for lipreading,” arXiv preprint arXiv:1703.04105, 2017.
[18] Z.-Q. Wang, J. L. Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” arXiv preprint arXiv:1804.10204, 2018.
[19] D. S. Williamson, Y. Wang, and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 3, pp. 483–492, 2016.
[20] J. L. Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” 2018.
[21] Z. Q. Wang, K. Tan, and D. L. Wang, “Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,” 2018.
[22] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 708–712.
[23] Y. Luo and N. Mesgarani, “Tasnet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
[24] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “Sdr-half-baked or well done?” arXiv preprint arXiv:1811.02508, 2018.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[27] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.