Personal VAD: Speaker-Conditioned Voice Activity Detection
In this paper, we propose “personal VAD”, a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-like neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
In modern speech processing systems, voice activity detection (VAD) usually lives in the upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signals, but also significantly reduces the overall computational cost due to its relatively small size.
A typical VAD system applies a frame-level classifier on acoustic features to make speech/non-speech decisions for each audio frame (e.g. with 25ms width and 10ms step). A poor VAD system can either mistakenly accept background noise as speech or falsely reject speech. Falsely accepting non-speech as speech slows down the downstream automatic speech recognition (ASR) processing; it is also computationally expensive, as ASR models are normally much larger than VAD models. On the other hand, falsely rejecting speech leads to deletion errors in ASR transcriptions (even a brief stretch of falsely rejected audio can remove an entire word). A good VAD model needs to work accurately in challenging environments, including noisy conditions, reverberant environments, and environments with competing speech. Significant research has been devoted to finding the optimal VAD features and models [17, 5, 3, 4, 15]. In the literature, LSTM-based VAD is a popular architecture for sequential modeling of the VAD task, showing state-of-the-art performance [3, 15].
In many scenarios, especially on-device speech recognition, computational resources such as CPU, memory, and battery are typically limited. In such cases, we wish to run computationally intensive components such as speech recognition only when the target user is talking to the device. Falsely triggering these components in the background, when only speech from other talkers or TV noise is present, causes battery drain and a bad user experience. Thus, a tiny model that only passes through speech signals from the target user is highly desirable, which is our motivation for developing the personal VAD system.
Although standard speaker recognition and speaker diarization techniques [19, 12, 16, 20, 22, 13] can be directly used for the same task, we argue that a dedicated personal VAD system is largely preferred here, for several reasons:
To minimize the latency of the whole system, an accept/reject decision is needed immediately upon the arrival of each frame, which calls for frame-level inference. However, many state-of-the-art speaker recognition and diarization systems require window-based or segment-based inference, or even offline full-sequence inference.
To minimize battery consumption on the device, the model must be very small, while most speaker recognition and diarization models are relatively large (typically millions of parameters).
Unlike speaker recognition or diarization, in personal VAD, we don’t need to distinguish between different non-target speakers.
In fact, we implemented a baseline system by directly combining a standard speaker verification model and a standard VAD model for the personal VAD task, as described in Section 2.2.1, and found that its performance is worse than a dedicated personal VAD model. To the best of our knowledge, this work is the first lightweight solution that aims at directly detecting the voice activity of a target speaker in real time.
The proposed personal VAD is a VAD-like neural network, conditioned on the target speaker embedding or the speaker verification score. Instead of determining whether a frame is speech or non-speech as in standard VAD, personal VAD extends the decision to three classes: non-speech, target speaker speech, and non-target speaker speech.
The rest of the paper is organized as follows. In Section 2.1, we first briefly describe our speaker verification system, which is used during the training of personal VAD. Then in Section 2.2, we propose four different architectures to achieve personal VAD. In training, we first treat personal VAD as a three-class classification problem and use the cross entropy loss to optimize the model. In addition, we noticed that the discriminability between non-speech and non-target speaker speech is relatively less important than that between target speaker speech and the other two classes. Therefore, we further propose a weighted pairwise loss to encourage the model to learn these differences, as introduced in Section 2.3. We evaluate the model on an augmented version of the LibriSpeech dataset, with the experimental setup described in Section 3.2, model configuration described in Section 3.3, metrics explained in Section 3.4, and results presented in Section 3.5. Conclusions are drawn in Section 4.
2.1 Recap of speaker verification system
Personal VAD relies on a pre-trained text-independent speaker recognition/verification model to encode the speaker identity into embedding vectors. In this work, we use the “d-vector” model, which has been successfully applied to various applications including speaker diarization [20, 22], speech synthesis, source separation, and speech translation. We retrained the 3-layer LSTM speaker verification model using data from 8 languages for language robustness and better performance. During inference, the model produces embeddings on sliding windows, and a final aggregated embedding named the “d-vector” is used to represent the voice characteristics of the utterance, as illustrated in Fig. 1. The cosine similarity between two d-vector embeddings measures the similarity of two voices.
In a real application, users are required to follow an enrollment process before enabling speaker verification or personal VAD. During enrollment, d-vector embeddings are computed from the target user’s recordings and stored on the device. Since enrollment is a one-off experience and can happen server-side, we can assume the embeddings of the target speakers are available at runtime at no cost.
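As an illustrative sketch of how the enrolled embedding is used at runtime: the code below assumes a simple average-then-normalize aggregation of per-window embeddings, which is one common choice rather than a confirmed detail of our retrained model.

```python
import numpy as np

def aggregate_dvector(window_embeddings):
    """Aggregate per-window embeddings into a single d-vector.

    Averages the window-level embeddings and L2-normalizes the result,
    so cosine similarity reduces to a dot product of unit vectors.
    """
    d = np.mean(window_embeddings, axis=0)
    return d / np.linalg.norm(d)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

At runtime, the enrolled d-vector is loaded from storage and compared against frame-level embeddings with `cosine_similarity`.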
2.2 System architecture
A personal VAD system should produce frame-level class labels for three categories: non-speech (ns), target speaker speech (tss), and non-target speaker speech (ntss). We implemented four different architectures to achieve personal VAD, as illustrated by Fig. 2. All four architectures rely on the embedding of the target speaker, which is acquired via the enrollment process.
Score combination (SC)
Our first approach to implement personal VAD is to simply combine a standard pre-trained speaker verification system and a standard VAD system, as shown in Fig. 2(a). We use this implementation as a baseline for other approaches, since it does not require training any new model.
We denote the frame of the input acoustic features at time $t$ as $\mathbf{x}_t \in \mathbb{R}^d$, where $d$ is the dimensionality of the acoustic features. For example, we use 40-dimensional log Mel-filterbank energies as the features, so $d = 40$. We use the subscript $1\!:\!t$ to denote the subsequence ending at time $t$, i.e. $\mathbf{x}_{1:t} = (\mathbf{x}_1, \dots, \mathbf{x}_t)$. A standard VAD model and a speaker verification model run independently on the acoustic features. The standard VAD produces unnormalized probabilities of speech (s) and non-speech (ns) for each frame:
$$ z_t^{\mathrm{s}},\, z_t^{\mathrm{ns}} = f_{\mathrm{VAD}}(\mathbf{x}_{1:t}), \quad (1) $$
where $z_t^{\mathrm{s}}, z_t^{\mathrm{ns}} \in \mathbb{R}$. The speaker verification model produces an embedding at each frame:
$$ \mathbf{e}_t = f_{\mathrm{SV}}(\mathbf{x}_{1:t}). \quad (2) $$
Then the embedding is verified against the target speaker embedding $\mathbf{e}^{\mathrm{target}}$, which was acquired during the enrollment process:
$$ s_t = \cos(\mathbf{e}_t, \mathbf{e}^{\mathrm{target}}). \quad (3) $$
To transform the standard VAD probability to the personal VAD probabilities $z_t^{\mathrm{tss}}$ and $z_t^{\mathrm{ntss}}$, we combine it with the resulting speaker verification cosine similarity score $s_t$, such that:
$$ z_t^{\mathrm{tss}} = s_t \cdot z_t^{\mathrm{s}}, \qquad z_t^{\mathrm{ntss}} = (1 - s_t) \cdot z_t^{\mathrm{s}}. \quad (4) $$
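The combination step can be sketched as follows; mapping the raw cosine similarity from [-1, 1] to [0, 1] before gating is our assumption here, as the exact rescaling is not specified above.

```python
def rescale_cosine(c):
    """Map a cosine similarity in [-1, 1] to a gate value in [0, 1]
    (an illustrative choice, not necessarily the deployed mapping)."""
    return (c + 1.0) / 2.0

def combine_scores(z_s, z_ns, s_t):
    """Combine standard VAD scores with the speaker verification score.

    z_s, z_ns: unnormalized speech / non-speech scores from standard VAD.
    s_t: speaker verification score in [0, 1] for the current frame.
    Returns (z_ns, z_tss, z_ntss): the three personal VAD scores.
    """
    z_tss = s_t * z_s            # target speaker speech
    z_ntss = (1.0 - s_t) * z_s   # non-target speaker speech
    return z_ns, z_tss, z_ntss
```

Note that the non-speech score passes through unchanged; only the speech score is split between the two speech classes.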
There are two major disadvantages of this architecture. First, it runs a window-based speaker verification model at the frame level without any adaptation, and this inconsistency can cause significant performance degradation. However, training frame-level speaker verification models is often unscalable due to the difficulty of batching utterances of different lengths. Second, this architecture requires running a speaker verification system at runtime, which can be expensive, since speaker verification models are usually much bigger than VAD models.
Score conditioned training (ST)
As shown in Fig. 2(b), our second approach uses the speaker verification model to produce a cosine similarity score for each frame, as explained in Eq. (3), then concatenates this cosine similarity score to the acoustic features:
$$ \hat{\mathbf{x}}_t = [\mathbf{x}_t,\, s_t]. \quad (5) $$
The concatenated feature vector $\hat{\mathbf{x}}_t$ is 41-dimensional, as $\mathbf{x}_t$ represents the 40-dimensional log Mel-filterbank energies. We train a new personal VAD network that takes the concatenated features as input, and outputs the probabilities of the three class labels for each frame:
$$ z_t^{\mathrm{ns}},\, z_t^{\mathrm{tss}},\, z_t^{\mathrm{ntss}} = f_{\mathrm{PVAD}}(\hat{\mathbf{x}}_{1:t}). \quad (6) $$
This approach still requires running the speaker verification model at runtime. However, since it retrains the personal VAD model based on the speaker verification scores, it is expected to perform better than simply combining the scores of two individually trained systems.
Embedding conditioned training (ET)
As shown in Fig. 2(c), the third approach directly concatenates the target speaker embedding $\mathbf{e}^{\mathrm{target}}$ (acquired in the enrollment process) with the acoustic features:
$$ \hat{\mathbf{x}}_t = [\mathbf{x}_t,\, \mathbf{e}^{\mathrm{target}}]. \quad (7) $$
Since our embedding is 256-dimensional, the concatenated feature vector here is 296-dimensional. Then we train a new personal VAD network, which outputs the probabilities of the three classes at the frame level, similar to Eq. (6).
This approach is similar to a knowledge distillation process. The large speaker verification model was individually pre-trained on a large-scale dataset. Then, when we train the personal VAD model, we use the embedding of the target speaker to “distill the knowledge” from the large speaker verification model into the small personal VAD model. As a result, this approach does not require running the large speaker verification model at runtime, making it the most lightweight solution among all architectures.
Score and embedding conditioned training (SET)
As shown in Fig. 2(d), this approach concatenates both the frame-level speaker verification score and the target speaker embedding to the acoustic features to train a new personal VAD model:
$$ \hat{\mathbf{x}}_t = [\mathbf{x}_t,\, s_t,\, \mathbf{e}^{\mathrm{target}}]. \quad (8) $$
The concatenated feature vector in this approach is 297-dimensional. This approach makes use of the most information from the speaker verification system. However, it still requires running the speaker verification model at runtime, so it is not a lightweight solution.
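The three conditioned variants differ only in how the per-frame input vector is assembled. The dimensionalities (41 / 296 / 297) follow the text; the function itself is an illustrative sketch.

```python
import numpy as np

def make_features(x_t, mode, s_t=None, e_target=None):
    """Assemble the per-frame input for one personal VAD variant.

    x_t: 40-dim log Mel-filterbank energies.
    s_t: scalar cosine similarity score against the enrolled speaker.
    e_target: 256-dim enrolled target speaker embedding.
    """
    if mode == "ST":    # score conditioned: 40 + 1 = 41 dims
        return np.concatenate([x_t, [s_t]])
    if mode == "ET":    # embedding conditioned: 40 + 256 = 296 dims
        return np.concatenate([x_t, e_target])
    if mode == "SET":   # score and embedding: 40 + 1 + 256 = 297 dims
        return np.concatenate([x_t, [s_t], e_target])
    raise ValueError(f"unknown mode: {mode}")
```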
2.3 Weighted pairwise loss
With an input frame $\mathbf{x}_t$ and the corresponding ground truth label $y_t \in \{\text{ns}, \text{tss}, \text{ntss}\}$, personal VAD can be thought of as a ternary classification problem.
However, in personal VAD, our goal is to detect the voice activity from only the target speaker. Audio frames that are classified into class ns and ntss will be discarded similarly by downstream components. As a result, confusion errors between <ns,ntss> have less impact on the system performance than errors between <tss,ntss> and <tss,ns>. Inspired by Tuplemax loss, here we propose a weighted pairwise loss to model the different tolerance to each class pair. Given $y_t$ and the network outputs $\mathbf{z}_t = (z_t^{\mathrm{ns}}, z_t^{\mathrm{tss}}, z_t^{\mathrm{ntss}})$, we define the weighted pairwise loss as:
$$ \mathcal{L}(y_t, \mathbf{z}_t) = \sum_{k \neq y_t} w_{\langle y_t, k \rangle} \cdot \log\left(1 + e^{\,z_t^{k} - z_t^{y_t}}\right), $$
where $w_{\langle i,j \rangle}$ is the weight between class $i$ and class $j$. By setting a lower weight for <ns,ntss> errors than for <tss,ntss> and <tss,ns> errors, we encourage the model to be more tolerant of confusion between <ns,ntss>, and to focus on distinguishing tss from ns and ntss.
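A minimal per-frame sketch of this loss, assuming the pairwise logistic form suggested by the definition above; the weight values are illustrative placeholders, not tuned settings.

```python
import numpy as np

# Illustrative symmetric weights: the <ns, ntss> pair is penalized less
# than any pair involving the target speaker class tss.
WEIGHTS = {("tss", "ns"): 1.0, ("tss", "ntss"): 1.0, ("ns", "ntss"): 0.1}
CLASSES = ("ns", "tss", "ntss")

def pair_weight(i, j):
    """Look up the symmetric weight for the class pair <i, j>."""
    return WEIGHTS.get((i, j), WEIGHTS.get((j, i)))

def weighted_pairwise_loss(y, z):
    """Weighted pairwise loss for ground truth class y and scores z (a dict).

    Each wrong class k contributes a softplus of its score margin over the
    true class, scaled by the tolerance weight for the pair <y, k>.
    """
    return sum(
        pair_weight(y, k) * float(np.log1p(np.exp(z[k] - z[y])))
        for k in CLASSES if k != y
    )
```

With all weights equal to 1 this reduces to an unweighted pairwise logistic loss; shrinking the <ns,ntss> weight makes ns/ntss confusions cheaper while keeping tss-related errors fully penalized.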
An ideal dataset to train and evaluate personal VAD would satisfy two requirements: (1) each utterance contains natural speaker turns; and (2) it contains enrollment utterances for each individual speaker. Unfortunately, to the best of our knowledge, no public dataset satisfies both requirements. Although some datasets for speaker diarization have natural speaker turns, they do not provide enrollment utterances for individual speakers. Conversely, datasets containing enrollment utterances for individual speakers usually do not have natural speaker turns.
To address this limitation, we conducted experiments on an augmented version of the LibriSpeech dataset. To simulate speaker turns, we concatenate single-speaker utterances from different speakers into multi-speaker utterances (see Section 3.2.1). We also noisify the concatenated utterances with reverberant room simulators to mitigate the concatenation artifacts (see Section 3.2.2).
In the LibriSpeech dataset, the training set contains 960 hours of speech, where 460 hours of them are “clean” speech and the other 500 hours are “noisy” speech. The testing set also consists of both “clean” and “noisy” speech. In all the experiments, we use the concatenated LibriSpeech training set to train the models. We use both the original LibriSpeech testing set and the concatenated LibriSpeech testing set for evaluation, as described in the following sections. For all the datasets, we use forced alignment to produce the frame-level ground truth labels used in training and evaluation.
| Method | Loss | AP without MTR (ns / tss / ntss / mAP) | AP with MTR (ns / tss / ntss / mAP) | Network parameters (millions) |
|---|---|---|---|---|
| SC (baseline) | N/A | 0.886 / 0.970 / 0.872 / 0.900 | 0.777 / 0.908 / 0.768 / 0.801 | 4.88 (SV) + 0.06 (VAD) |
| ST | CE | 0.956 / 0.968 / 0.956 / 0.957 | 0.905 / 0.885 / 0.905 / 0.901 | 4.88 (SV) + 0.06 (PVAD) |
| SET | CE | 0.970 / 0.969 / 0.972 / 0.969 | 0.938 / 0.888 / 0.938 / 0.928 | 4.88 (SV) + 0.13 (PVAD) |
The baseline system does not require training any new model.
3.2 Experimental settings
In the training corpora of standard VAD, each utterance usually only contains the speech from one single speaker. However, personal VAD aims to find the voice activity of a target speaker in a conversation where multiple speakers could be engaged. Therefore, we cannot directly use the standard VAD training corpora to train personal VAD. To simulate the conversational speech, we concatenate utterances from multiple speakers into a longer utterance, and then we randomly select one of the speakers as the target speaker in the concatenated utterance.
To generate a concatenated utterance, we draw a random number $n$, indicating the number of utterances used for concatenation, from a uniform distribution:
$$ n \sim \mathrm{Uniform}(n_{\min}, n_{\max}), $$
where $n_{\min}$ and $n_{\max}$ are the minimal and maximal numbers of utterances used for concatenation. The waveforms of the randomly selected utterances are concatenated, and one of the speakers is designated as the target speaker of the concatenated utterance. At the same time, we modify the VAD ground truth label of each frame according to the target speaker: “non-speech” frames remain the same, while “speech” frames are modified to either “target speaker speech” or “non-target speaker speech” according to whether the source utterance is from the target speaker.
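The concatenation and relabeling step can be sketched as follows; the tuple format and label strings are assumptions for illustration, and the `n_min` / `n_max` defaults are placeholders rather than the values used in our experiments.

```python
import random

def concatenate_utterances(utterances, n_min=1, n_max=3):
    """Build one concatenated training example.

    utterances: list of (speaker_id, frame_labels) pairs, where each
    frame label is 's' (speech) or 'ns' (non-speech).
    Returns (target_speaker, relabeled_frames) with frames in
    {'ns', 'tss', 'ntss'}.
    """
    n = random.randint(n_min, n_max)      # n ~ Uniform(n_min, n_max)
    chosen = random.sample(utterances, n)
    target = random.choice([spk for spk, _ in chosen])
    labels = []
    for spk, frames in chosen:
        for lab in frames:
            if lab == "ns":
                labels.append("ns")       # non-speech frames are unchanged
            else:
                labels.append("tss" if spk == target else "ntss")
    return target, labels
```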
In our experiments, we generated concatenated utterances for the training and testing sets, using the same $n_{\min}$ and $n_{\max}$ for both, chosen to cover both single-speaker and multi-speaker scenarios.
For both training and evaluation, we apply a data augmentation technique named “multistyle training” (MTR) to our datasets to avoid domain overfitting and to mitigate concatenation artifacts. During MTR, the original (concatenated) source utterance is noisified with multiple randomly selected noise sources, using a randomly selected room configuration. Our noise sources include:
827 audio clips of ambient noise recorded in cafes;
786 audio clips recorded in silent environments;
6433 YouTube segments containing background music or noise.
We generated 3 million room configurations using a room simulator to cover different reverberation conditions. The distribution of the signal-to-noise ratio (SNR) of our MTR is shown in Fig. 3.
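The noisification step can be sketched as mixing a noise source into an utterance at a target SNR; the actual MTR pipeline also convolves with simulated room impulse responses, which is omitted here.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mixture has the requested SNR.

    SNR (dB) = 10 * log10(P_signal / P_noise), so the noise is scaled by
    sqrt(P_signal / (P_noise * 10^(SNR/10))).
    """
    p_signal = float(np.mean(signal ** 2))
    p_noise = float(np.mean(noise ** 2))
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise
```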
3.3 Model configuration
The acoustic features are 40-dimensional log Mel-filterbank energies, extracted from frames with 25ms width and 10ms step. For both the standard VAD model and the personal VAD model, we used a 2-layer LSTM network with 64 cells, followed by a fully-connected layer with 64 neurons. We also tried larger networks but did not see performance improvements, possibly due to the limited variety in the training data. We used TensorFlow for training and inference. During training, we used the Adam optimizer. For the models with weighted pairwise loss, we fixed $w_{\langle \text{tss},\text{ns} \rangle}$ and $w_{\langle \text{tss},\text{ntss} \rangle}$, and explored different values for $w_{\langle \text{ns},\text{ntss} \rangle}$.
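The roughly 130K-parameter figure quoted for the ET model can be sanity-checked with a back-of-the-envelope count for this topology (a vanilla 2-layer LSTM on the 296-dimensional ET input, with biases, ignoring peephole or projection variants):

```python
def lstm_params(input_dim, units):
    """Parameters of one vanilla LSTM layer: 4 gates, each with input
    weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Parameters of a fully-connected layer with bias."""
    return input_dim * units + units

def pvad_et_params(feature_dim=296, lstm_units=64, fc_units=64, n_classes=3):
    total = lstm_params(feature_dim, lstm_units)   # first LSTM layer
    total += lstm_params(lstm_units, lstm_units)   # second LSTM layer
    total += dense_params(lstm_units, fc_units)    # fully-connected layer
    total += dense_params(fc_units, n_classes)     # softmax output layer
    return total

print(pvad_et_params())  # 129795, i.e. roughly 130K parameters
```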
To reduce the model size and accelerate the runtime inference, we quantized the parameters of the model to 8-bit integer values following . With this quantization, our model using the ET architecture, which has only around 130 thousand parameters and is the smallest among all architectures (see Table 1), will be only 130 KB in size.
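Symmetric linear quantization is one standard way to realize the 8-bit compression described above; this is a sketch, and the referenced quantization scheme may differ in details such as per-channel scales.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of a float weight array to int8.

    Each value is mapped to round(w / scale), with scale chosen so the
    largest magnitude lands on +/-127.
    """
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale
```

The maximum reconstruction error per weight is half a quantization step, i.e. `scale / 2`.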
To evaluate the performance of the proposed method, we computed the Average Precision (AP) for each class, and the mean Average Precision (mAP) over all classes. AP and mAP are among the most common metrics for multi-class classification problems. AP summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:
$$ \mathrm{AP} = \sum_{n} (R_n - R_{n-1}) \, P_n, $$
where $R_n$ and $P_n$ are the recall and precision at the $n$-th threshold, respectively. We adopted the micro-mean when averaging AP over classes to compute mAP.
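The AP formula above translates directly into code (recalls assumed sorted in increasing order, with R_0 = 0):

```python
def average_precision(recalls, precisions):
    """AP = sum_n (R_n - R_{n-1}) * P_n over thresholds."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

For example, two thresholds with recall/precision pairs (0.5, 1.0) and (1.0, 0.5) give AP = 0.75.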
We conducted three groups of experiments to evaluate the proposed method. First, we compared the four architectures for personal VAD. Following this, we examined the effectiveness of weighted pairwise loss and compared it against conventional cross entropy loss. Finally, we evaluated personal VAD on a standard VAD task, to see if personal VAD can replace standard VAD without performance degradation.
In the first group of experiments, we compared the performance of the four personal VAD architectures described in Fig. 2. We evaluated these systems on the concatenated LibriSpeech testing set. Additionally, to explore the performance of personal VAD on noisy speech, we also applied the data augmentation technique (MTR) to the testing set. In personal VAD tasks, the most important metric is the AP for class tss, as downstream processes will only be applied to speech produced by the target speaker.
We report the evaluation results on the testing set with and without MTR in Table 1. The results show that ST, ET, and SET significantly outperform the baseline SC system in all cases. When applying MTR to the testing set, we observed an even larger performance gain of the proposed methods over the baseline. Among the proposed systems, SET achieved the highest AP for tss, and ST slightly outperforms ET. However, both ST and SET require running the speaker verification model (4.88 million parameters) to compute the cosine similarity score at inference time, which would largely increase both the number of parameters in the system and the inference computational cost. By contrast, ET obtained 0.932 (without MTR) / 0.878 (with MTR) AP for class tss on the testing set with a model of only 0.13 million parameters (about 40 times smaller), which is more appropriate for on-device applications.
Loss function comparisons
In the second group of experiments, we compared the proposed weighted pairwise loss against the conventional cross entropy loss. Here we only consider the ET architecture, as it is much more lightweight while achieving reasonably good performance. Similarly, we evaluated the systems on the concatenated LibriSpeech testing set with and without MTR.
In Fig. 4, we plot the AP for tss against different values of $w_{\langle \text{ns},\text{ntss} \rangle}$ in the weighted pairwise loss. From the results, we observed that using a smaller value of $w_{\langle \text{ns},\text{ntss} \rangle}$ than $w_{\langle \text{tss},\text{ns} \rangle}$ and $w_{\langle \text{tss},\text{ntss} \rangle}$ improves the performance, which demonstrates that confusion errors between <ns,ntss> have less impact on the system performance than errors between <tss,ntss> and <tss,ns>.
However, when $w_{\langle \text{ns},\text{ntss} \rangle}$ becomes too small, we found performance degradations on the curve. This result shows that completely ignoring the difference between ntss and ns is harmful to the system performance as well. In other words, it is insufficient to simply treat the personal VAD task as a binary classification problem (target speaker speech vs. everything else). The best performance is reached at an intermediate value of $w_{\langle \text{ns},\text{ntss} \rangle}$, with detailed results listed in Table 1.
Personal VAD on standard VAD tasks
If we want to replace a standard VAD component with personal VAD, we also need to guarantee that the performance degradation on the standard speech/non-speech task is minimal. Thus, we conducted a final experiment evaluating personal VAD on standard VAD tasks. We evaluated two personal VAD models (the ET architecture with cross entropy loss, and the ET architecture with weighted pairwise loss) on the non-concatenated LibriSpeech testing data, where each utterance only contains the target speaker. For comparison purposes, we also implemented a standard VAD model with the same network structure (a 2-layer LSTM network with 64 cells, followed by a fully-connected layer with 64 neurons).
The results are shown in Table 2. We can see that the AP for class speech (s) is very close between personal VAD models and the standard VAD model, which justifies replacing standard VAD by personal VAD. Additionally, the architectures of personal VAD models and the standard VAD model are the same in this experiment, so replacing standard VAD by personal VAD will not increase the model size or computational cost at inference time.
| Method | Loss | AP without MTR (s / mAP) | AP with MTR (s / mAP) |
|---|---|---|---|
| Personal VAD (ET) | CE | 0.991 / 0.965 | 0.979 / 0.893 |
| Personal VAD (ET) | WPL | 0.991 / 0.967 | 0.979 / 0.901 |
In this paper, we proposed four different architectures to implement personal VAD, a system that detects the voice activity of a target user in real time. Among the different architectures, using a single small network that takes acoustic features and the enrolled target speaker embedding as inputs achieves near-optimal performance with the smallest runtime computational cost. To model the tolerance to different types of errors, we proposed a new loss function, the weighted pairwise loss, which proves to perform better than a conventional cross entropy loss. Our experiments also show that personal VAD and standard VAD perform equally well on a standard VAD task. In summary, our findings suggest that, by focusing only on the desired target speaker, a personal VAD can reduce the overall computational cost of speech recognition systems operating in noisy environments.
- thanks: * Equal contribution. Shaojin performed this work as an intern at Google.
- Without loss of generality, we ignore the time subscripts here, and use the same symbol to represent both original and concatenated input features in our notation.
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §3.3.
- (2016) On the efficient representation and execution of deep acoustic models. arXiv preprint arXiv:1607.04683. Cited by: §3.3.
- (2017) Endpoint detection using grid long short-term memory networks for streaming speech recognition. In Interspeech, pp. 3812–3816. Cited by: §1.
- (2018) Temporal modeling using dilated convolution and gating for voice-activity-detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5549–5553. Cited by: §1.
- (2013) All for one: Feature combination for highly channel-degraded speech activity detection. In Interspeech, pp. 709–713. Cited by: §1.
- (2019) Streaming end-to-end speech recognition for mobile devices. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385. Cited by: §1.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.2.3.
- (2019) Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. In Interspeech 2019, pp. 1123–1127. Cited by: §2.1.
- (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in neural information processing systems, pp. 4480–4490. Cited by: §2.1.
- (2017) Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home. In Interspeech, pp. 379–383. Cited by: §3.2.2.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
- (2017) Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304. Cited by: §1.
- (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §1.
- (2015) Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §1, §3.1.
- (2017) Improved end-of-query detection for streaming speech recognition. In Interspeech, pp. 1909–1913. Cited by: §1.
- (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1.
- (2015) Improvements to the IBM speech activity detection system for the DARPA RATS program. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1.
- (2019) Tuplemax loss for language identification. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5976–5980. Cited by: §2.3.
- (2018) Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883. Cited by: §1, Figure 1, §2.1.
- (2018) Speaker diarization with LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5239–5243. Cited by: §1, §2.1, §3.1.
- (2019) VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking. In Interspeech 2019, pp. 2728–2732. Cited by: §2.1.
- (2019) Fully supervised speaker diarization. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6301–6305. Cited by: §1, §2.1.
- (2004) Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo 2, pp. 30. Cited by: §3.4.