VoiceFilter: Targeted Voice Separation by
Speaker-Conditioned Spectrogram Masking
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
1 Google Inc., USA; 2 Idiap Research Institute, Switzerland
Index Terms— Source separation, speaker recognition, spectrogram masking, speech recognition
Recent advances in speech recognition have led to performance improvement in challenging scenarios such as noisy and far-field conditions. However, speech recognition systems still perform poorly when the speaker of interest is recorded in crowded environments, i.e., with interfering speakers in the foreground.
One way to deal with this issue is to first apply a speech separation system on the noisy audio in order to separate the voices from different speakers. Therefore, if the noisy signal contains N speakers, this approach would yield N outputs, with a potential additional output for the noise. A classical speech separation task like this needs to cope with two main challenges. The first is identifying the number of speakers in the recording, which in realistic scenarios is unknown. The second is that the optimization of a speech separation system may need to be invariant to the permutation of speaker labels, as the order of the speakers should not have an impact during training. Leveraging advances in deep neural networks, several successful works have been introduced to address these problems, such as deep clustering, deep attractor network, and permutation invariant training.
This work addresses the task of isolating the voices of a subset of speakers of interest from the commonality of all the other speakers and noises. For example, such subset can be formed by a single target speaker issuing a spoken query to a personal mobile device, or the members of a house talking to a shared home device. We will also assume that the speaker(s) of interest can be individually characterized by previous reference recordings, e.g. through an enrollment stage. This task is closely related to classical speech separation, but in a way that it is speaker-dependent. In this paper, we will refer to the task of speaker-dependent speech separation as voice filtering. We argue that for voice filtering, speaker-independent techniques such as those presented in [1, 2, 3] may not be a good fit. In addition to the challenges described previously, these techniques require an extra step to determine which output – out of the possible outputs from the speech separation system – corresponds to the target speaker(s), by e.g. choosing the loudest speaker, running a speaker verification system on the outputs, or matching a specific keyword.
A more end-to-end approach for the voice filtering task is to treat the problem as a binary classification problem, where the positive class is the speech of the speaker of interest, and the negative class is formed by the combination of all foreground and background interfering speakers and noises. By speaker-conditioning the system, this approach suppresses the three aforementioned challenges: unknown number of speakers, permutation problem, and selection from multiple outputs. In this work, we aim to condition the system on the speaker embedding vector of a reference recording. The proposed approach is the following. We first train an LSTM-based speaker encoder to compute robust speaker embedding vectors. We then separately train a time-frequency mask-based system that takes two inputs: (1) the embedding vector of the target speaker, previously computed with the speaker encoder; and (2) the noisy multi-speaker audio. This system is trained to remove the interfering speakers and output only the voice of the target speaker. This approach can be easily extended to more than one speaker of interest by repeating the process in turn for the reference recording of each target speaker.
Related literature exists for the task of voice filtering. For example, in [4, 5], the authors achieved impressive results by indirectly conditioning the system on visual information (lip movement). However, such a solution requires using speech and visual information simultaneously, which may not be available in certain types of applications, where a reference speech signal may be more practical. In [6, 7], the authors use either a one-hot vector or a speaker posterior as an additional input to the system. We argue that such a solution may be difficult to generalize to speakers unseen during training. In contrast, our system relies purely on the audio signal and can easily generalize to unknown speakers by using a highly representative embedding vector for the speaker, a.k.a. the d-vector, which we trained using the state-of-the-art generalized end-to-end loss. We also demonstrate how our system can significantly improve ASR in scenarios with speaker interference, in addition to reporting source-to-distortion ratio results.
This paper is organized as follows. In Section 2, we describe our approach to the problem, and provide the details of how we train the neural networks. In Section 3, we describe our experimental setup, including the datasets we use and the evaluation metrics. The experimental results are presented in Section 4. We draw our conclusions in Section 5, with discussions on future work directions.
The system architecture is shown in Fig. 1. The system consists of two separately trained components: the speaker encoder (in red), and the VoiceFilter system (in blue), which uses the output of the speaker encoder as an additional input. In this section, we will describe these two components.
2.1 Speaker encoder
The purpose of the speaker encoder is to produce a speaker embedding from an audio sample of the target speaker. This system is based on a recent work from Wan et al., which achieves high performance on text-dependent and text-independent speaker verification, as well as on speaker diarization and multispeaker TTS.
The speaker encoder is a 3-layer LSTM network trained with the generalized end-to-end loss. It takes as inputs log-mel filterbank energies extracted from windows of 1600 ms, and outputs speaker embeddings, called d-vectors, which have a fixed dimension of 256. To compute a d-vector on one utterance, we extract sliding windows with 50% overlap, and average the L2-normalized d-vectors obtained on each window.
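The utterance-level averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `encode_window` is a hypothetical stand-in for the trained LSTM encoder, `window_len` is a frame count chosen by the caller, and the averaged vector is not re-normalized because the text does not mention that step.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sliding_windows(frames, window_len, overlap=0.5):
    """Split a frame sequence into fixed-length windows with the given overlap."""
    hop = max(1, int(window_len * (1 - overlap)))
    last_start = len(frames) - window_len
    return [frames[s:s + window_len] for s in range(0, last_start + 1, hop)]


def utterance_dvector(frames, encode_window, window_len):
    """Average the L2-normalized per-window embeddings of one utterance.

    `encode_window` is a hypothetical callable standing in for the speaker
    encoder; it maps one window of frames to an embedding (a list of floats).
    """
    windows = sliding_windows(frames, window_len)
    if not windows:
        raise ValueError("utterance is shorter than one window")
    embeddings = [l2_normalize(encode_window(w)) for w in windows]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
```

With 50% overlap the hop is half the window length, so each frame (except near the utterance edges) contributes to two windows.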
2.2 VoiceFilter system
The VoiceFilter system is based on the recent work of Wilson et al., developed for speech enhancement. As shown in Fig. 1, the neural network takes two inputs: a d-vector of the target speaker, and a magnitude spectrogram computed from a noisy audio. The network predicts a soft mask, which is element-wise multiplied with the input (noisy) magnitude spectrogram to produce an enhanced magnitude spectrogram. To obtain the enhanced waveform, we directly combine the phase of the noisy audio with the enhanced magnitude spectrogram, and apply an inverse STFT on the result. (Samples of output audios are available at: https://google.github.io/speaker-id/publications/VoiceFilter) The network is trained to minimize the difference between the masked magnitude spectrogram and the target magnitude spectrogram computed from the clean audio.
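The masking and phase-reuse steps can be illustrated with a small sketch on complex STFT values. Two assumptions are made here: the spectrogram is a nested list indexed as `[time][frequency]`, and the "difference" being minimized is taken to be a mean squared error, which the text does not specify. The inverse STFT itself is omitted.

```python
import cmath


def apply_mask(noisy_stft, mask):
    """Scale each noisy magnitude by the soft mask and keep the noisy phase."""
    enhanced = []
    for frame, mask_frame in zip(noisy_stft, mask):
        row = []
        for value, m in zip(frame, mask_frame):
            magnitude = abs(value) * m    # masked (enhanced) magnitude
            phase = cmath.phase(value)    # phase taken from the noisy audio
            row.append(cmath.rect(magnitude, phase))
        enhanced.append(row)
    return enhanced


def magnitude_loss(enhanced_stft, clean_stft):
    """Mean squared difference between magnitudes (squared error is an assumption)."""
    total, count = 0.0, 0
    for enh_frame, clean_frame in zip(enhanced_stft, clean_stft):
        for e, c in zip(enh_frame, clean_frame):
            total += (abs(e) - abs(c)) ** 2
            count += 1
    return total / count
```

Because only magnitudes are compared in the loss, any phase mismatch between the noisy and clean signals is not penalized in this sketch.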
The VoiceFilter network is composed of 8 convolutional layers, 1 LSTM layer, and 2 fully connected layers, each with ReLU activations except the last layer, which has a sigmoid activation. The values of the parameters are provided in Table 1. The d-vector is repeatedly concatenated to the output of the last convolutional layer in every time frame. The resulting concatenated vector is then fed as the input to the following LSTM layer. We chose to inject the d-vector between the convolutional layers and the LSTM layer, and not before the convolutional layers, for two reasons. First, the d-vector is already a compact and robust representation of the target speaker, so we do not need to modify it by applying convolutional layers on top of it. Second, convolutional layers assume time and frequency homogeneity, and thus cannot be applied to an input composed of two completely different signals: a magnitude spectrogram and a speaker embedding.
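The per-frame concatenation of the d-vector can be pictured with the sketch below. It assumes the output of the last convolutional layer has already been flattened to one feature vector per time frame; the function name and the toy dimensions are illustrative only.

```python
def concat_dvector(conv_frames, dvector):
    """Append the same d-vector to the feature vector of every time frame."""
    return [list(frame) + list(dvector) for frame in conv_frames]


# Toy shapes: 3 time frames with 4 convolutional features each, 2-dim d-vector.
conv_out = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
lstm_input = concat_dvector(conv_out, [0.6, 0.8])
```

Each frame passed to the LSTM layer therefore carries both the local spectral features and the identity of the target speaker.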
|Layer||Width||Dilation||Filters / Nodes|
While training the VoiceFilter system, the input audios are divided into segments of 3 seconds each and are converted, if necessary, to single channel audios with a sampling rate of 16 kHz.
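A rough sketch of this preprocessing is given below. The text does not say how multi-channel audio is reduced to a single channel or what happens to a trailing partial segment, so averaging the channels and dropping the remainder are both assumptions; resampling to 16 kHz is not covered.

```python
SAMPLE_RATE = 16000      # Hz, as stated in the text
SEGMENT_SECONDS = 3      # segment length, as stated in the text


def to_mono(channels):
    """Average the channels sample by sample (averaging is an assumption)."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def split_segments(samples, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Cut audio into full-length segments; a trailing partial segment is dropped."""
    seg_len = sample_rate * seconds
    return [samples[i:i + seg_len]
            for i in range(0, len(samples) - seg_len + 1, seg_len)]
```

At 16 kHz, each 3-second segment contains 48,000 samples.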
3 Experimental setup
In this section, we describe our experimental setup: the data used to train separately the two components of the system, as well as the metrics to assess the systems.
Although our speaker encoder network has exactly the same network topology as the text-independent model described in , we use more training data in this system. Our speaker encoder is trained with two datasets combined by the MultiReader technique introduced in . The first dataset consists of anonymized voice query logs in English from mobile and farfield devices. It has about 34 million utterances from about 138 thousand speakers. The second dataset consists of LibriSpeech , VoxCeleb , and VoxCeleb2 . This model has a 3.06% equal error rate (EER) on our internal en-US phone audio test dataset, compared to the 3.55% EER of the one reported in .
For training and evaluating the VoiceFilter network, we use the VCTK database and the LibriSpeech database. For VCTK, we randomly take 99 speakers for training, and 10 speakers for testing. For LibriSpeech, we use the training and development sets defined in the protocol of the database: the training set contains 2338 speakers, and the development set contains 73 speakers. These two databases contain read speech, and each utterance contains the voice of one speaker. Thus we had to generate data to train the VoiceFilter system, as explained next.
3.1.2 Data generation
From the system diagram in Fig. 1, we see that one training step involves three inputs: (1) the clean audio from the target speaker, which is the ground truth; (2) the noisy audio containing multiple speakers; and (3) a reference audio from the target speaker (different from the clean audio) over which the d-vector will be computed.
This training triplet can be obtained by using three audios from a clean dataset, as shown in Fig. 2. The reference audio is picked randomly among all the utterances of the target speaker, and is different from the clean audio. The noisy audio is generated by mixing the clean audio and an interfering audio randomly selected from a different speaker. More specifically, it is obtained by directly summing the clean audio and the interfering audio, then trimming the result to the length of the clean audio.
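The triplet generation can be sketched as follows. The mapping `utterances_by_speaker` (speaker id to a list of waveforms, each a list of samples) is a hypothetical data layout, and zero-padding an interfering utterance that is shorter than the clean one is an assumption; the text only states that the sum is trimmed to the length of the clean audio.

```python
import random


def mix_and_trim(clean, interference):
    """Sum two waveforms sample by sample, keeping the length of the clean audio."""
    mixed = []
    for i, sample in enumerate(clean):
        # Assumption: a shorter interfering utterance is padded with silence.
        other = interference[i] if i < len(interference) else 0.0
        mixed.append(sample + other)
    return mixed


def make_triplet(utterances_by_speaker, rng=random):
    """Return (clean, noisy, reference) for a randomly chosen target speaker."""
    target, interferer = rng.sample(sorted(utterances_by_speaker), 2)
    # Two distinct utterances of the target: ground truth and reference.
    clean, reference = rng.sample(utterances_by_speaker[target], 2)
    interference = rng.choice(utterances_by_speaker[interferer])
    noisy = mix_and_trim(clean, interference)
    return clean, noisy, reference
```

Sampling the target and interfering speakers without replacement guarantees that the interfering utterance comes from a different speaker, and sampling two utterances of the target without replacement keeps the reference distinct from the clean audio.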
We have also tried to multiply the interfering audio by a random weight following a uniform distribution either within or within . However, this did not affect the performance of the VoiceFilter system in our experiments.
To evaluate the performance of different VoiceFilter models, we use two metrics: the speech recognition Word Error Rate (WER) and the Source to Distortion Ratio (SDR).
3.2.1 Word error rate
As mentioned in Sec. 1, the main goal of our system is to improve speech recognition. Specifically, we want to reduce the WER in multi-speaker scenarios, while preserving the same WER in single-speaker scenarios. The speech recognizer we use for WER evaluation is a version of the conventional phone models discussed in , which is trained on a YouTube dataset.
For each VoiceFilter model, we care about four WER numbers:
Clean WER: Without VoiceFilter, the WER on the clean audio.
Noisy WER: Without VoiceFilter, the WER on the noisy (clean + interference) audio.
Clean-enhanced WER: the WER on the clean audio processed by the VoiceFilter system.
Noisy-enhanced WER: the WER on the noisy audio processed by the VoiceFilter system.
A good VoiceFilter model should have these two properties:
Noisy-enhanced WER is significantly lower than Noisy WER, meaning that the VoiceFilter is improving speech recognition in multi-speaker scenarios.
Clean-enhanced WER is very close to Clean WER, meaning that the VoiceFilter has minimal negative impact on single-speaker scenarios.
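For readers unfamiliar with the metric, the sketch below computes WER in its standard form: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is a generic illustration and not the scoring tool used in the experiments; whitespace tokenization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

The four numbers listed above would be obtained by scoring the recognizer's transcripts of the clean, noisy, clean-enhanced, and noisy-enhanced audio against the same reference transcripts.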
3.2.2 Source to distortion ratio
The SDR is a very common metric for evaluating source separation systems, and computing it requires both the clean signal and the filtered signal. It is an energy ratio, expressed in dB, between the energy of the target signal contained in the enhanced signal and the energy of the errors (coming from the interfering speakers and artifacts). Thus, the higher it is, the better.
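The idea of an energy ratio in dB can be illustrated with the simplified sketch below, which treats the whole difference between the clean and enhanced waveforms as error. This is a deliberate simplification: the standard blind source separation evaluation first projects the estimate onto the target signal before measuring the error terms, so values from this sketch are not directly comparable to reported SDR figures.

```python
import math


def simple_sdr_db(clean, enhanced):
    """Simplified SDR: clean-signal energy over error energy, in dB."""
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - e) ** 2 for s, e in zip(clean, enhanced))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)
```

An error energy one hundred times smaller than the signal energy corresponds to 20 dB, and every further tenfold reduction of the error adds 10 dB.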
4.1 Word error rate
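|VoiceFilter Model||Clean WER (%)||Noisy WER (%)|
|VoiceFilter: no LSTM||12.2||35.3|
|VoiceFilter Model||Clean WER (%)||Noisy WER (%)|
|VoiceFilter trained on VCTK||21.1||37.0|
|VoiceFilter trained on LibriSpeech||5.9||34.3|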
In Table 2, we present the results of VoiceFilter models trained and evaluated on the LibriSpeech dataset. The architecture of the VoiceFilter system is shown in Table 1, with a few different variations of the LSTM layer: (1) no LSTM layer, i.e., only convolutional layers directly followed by fully connected layers; (2) a uni-directional LSTM layer; (3) a bi-directional LSTM layer. In general, after applying VoiceFilter, the WER on the noisy data is significantly lower than before, while the WER on the clean dataset remains close to its original value. There is a significant gap between the first and second model, meaning that processing the data sequentially with an LSTM is an important component of the system. Moreover, using a bi-directional LSTM layer, we achieve the best WER on the noisy data. With this model, applying the VoiceFilter system on the noisy data reduces the speech recognition WER by a relative 58.1%. In the clean scenario, the performance degradation caused by the VoiceFilter system is very small: the WER is 11.1% instead of 10.9%.
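The relative figure can be checked with a short calculation. The helper name below is illustrative; the noisy WER values (55.9% before and 23.4% after VoiceFilter) are the ones quoted in the conclusions, and the clean values are those given above.

```python
def relative_wer_reduction(wer_before, wer_after):
    """Relative WER reduction in percent; negative values mean degradation."""
    return 100.0 * (wer_before - wer_after) / wer_before


noisy_reduction = relative_wer_reduction(55.9, 23.4)  # noisy data
clean_change = relative_wer_reduction(10.9, 11.1)     # clean data
```

Rounded to one decimal, the noisy case gives the 58.1% relative reduction reported in the text, while the clean case corresponds to a relative degradation of about 1.8%.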
In Table 3, we present the WER results of VoiceFilter models evaluated on the VCTK dataset. With a VoiceFilter model trained also on VCTK, the WER on the noisy data after applying VoiceFilter is significantly lower than before, reduced relatively by 38.9%. However, the WER on the clean data after applying VoiceFilter is significantly higher. This is mostly because the VCTK training set is too small, containing only 99 speakers. If we use a VoiceFilter model trained on LibriSpeech instead, the WER on the noisy dataset further decreases, while the WER on the clean data reduces to 5.9%, which is even smaller than before applying VoiceFilter. This means: (1) The VoiceFilter model is able to generalize from one dataset to another; (2) We are improving the acoustic quality of the original clean audios, even if we didn’t explicitly train it this way.
Note that the LibriSpeech training set contains about 20 times more speakers than VCTK (2338 speakers instead of 99 speakers), which is the major difference between the two models shown in Table 3. Thus, the results also imply that we can further improve our VoiceFilter model by training with even more speakers.
4.2 Source to distortion ratio
|VoiceFilter Model||SDR (dB)|
|VoiceFilter: no LSTM||11.9|
In the same table, we also present the SDR of a blind source separation model trained with the permutation-invariant loss. Its SDR is close to that of the VoiceFilter model when both use a bi-directional LSTM.
In Table 2, we tried a few variants of the VoiceFilter model on LibriSpeech, and the best WER performance was achieved by the bi-directional LSTM. However, it is likely that similarly good performance could also be achieved by adding more layers or nodes to the uni-directional LSTM. Future work includes exploring more variants and fine-tuning the hyper-parameters to achieve better performance with lower computational cost, but that is beyond the focus of this paper.
5 Conclusions and future work
In this paper, we have demonstrated the effectiveness of using a discriminatively-trained speaker encoder to condition the speech separation task. Such a system is more applicable to real scenarios because it does not require prior knowledge about the number of speakers and removes the permutation problem. We have shown that a VoiceFilter model trained on the LibriSpeech dataset reduces the speech recognition WER from 55.9% to 23.4% in two-speaker scenarios, while the WER stays approximately the same on single-speaker scenarios.
This system could be improved by taking a few steps: (1) training on larger and more challenging databases such as VoxCeleb 1 and 2; (2) adding more interfering speakers; and (3) computing the d-vectors over several utterances instead of only one to obtain more robust speaker embeddings. Another interesting direction would be to train the VoiceFilter system to perform joint voice separation and speech enhancement, i.e., to remove both the interfering speakers and the noise. To do so, we could add different noises when mixing the clean audio with interfering utterances. This approach will be part of future investigations. Finally, the VoiceFilter system could also be trained jointly with the speech recognition system to further increase the WER improvement.
The authors would like to thank Yiteng (Arden) Huang, Jason Pelecanos, and Fadi Biadsy for the helpful discussions.
-  John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
-  Zhuo Chen, Yi Luo, and Nima Mesgarani, “Deep attractor network for single-microphone speaker separation,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
-  Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
-  Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” arXiv preprint arXiv:1804.03619, 2018.
-  Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman, “The conversation: Deep audio-visual speech enhancement,” arXiv preprint arXiv:1804.04121, 2018.
-  Katerina Zmolikova, Marc Delcroix, Keisuke Kinoshita, Takuya Higuchi, Atsunori Ogawa, and Tomohiro Nakatani, “Speaker-aware neural network based beamformer for speaker extraction in speech mixtures,” in Interspeech, 2017.
-  Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, and Dong Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” arXiv preprint arXiv:1807.08974, 2018.
-  Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
-  Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, “Speaker diarization with LSTM,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5239–5243.
-  Ye Jia, Yu Zhang, Ron J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Conference on Neural Information Processing Systems (NIPS), 2018.
-  Kevin Wilson, Michael Chinen, Jeremy Thorpe, Brian Patton, John Hershey, Rif A. Saurous, Jan Skoglund, and Richard F. Lyon, “Exploring tradeoffs in models for low-latency speech enhancement,” in International Workshop on Acoustic Signal Enhancement (iWAENC), 2018.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
-  Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
-  Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
-  Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., “Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit,” 2016.
-  Hagen Soltau, Hank Liao, and Hasim Sak, “Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016.
-  Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE transactions on audio, speech, and language processing, vol. 14, no. 4, pp. 1462–1469, 2006.