Personal VAD: Speaker-Conditioned Voice Activity Detection

Abstract

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-like neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with 130K parameters that outperforms a baseline system in which individually trained standard VAD and speaker recognition networks are combined to perform the same task.

Shaojin Ding*2, Quan Wang*1, Shuo-yiin Chang1, Li Wan1, Ignacio Lopez Moreno1
1Google Inc., USA   2Texas A&M University, USA
shjd@tamu.edu   {quanw, shuoyiin, liwan, elnota}@google.com

1 Introduction

In modern speech processing systems, voice activity detection (VAD) usually sits upstream of other speech components such as speech recognition and speaker recognition. As a gating module, VAD not only improves the performance of downstream components by discarding non-speech signals, but also significantly reduces the overall computational cost thanks to its relatively small size.

A typical VAD system applies a frame-level classifier on acoustic features to make a speech/non-speech decision for each audio frame (e.g. with 25ms width and 10ms step). A poor VAD system can either mistakenly accept background noise as speech or falsely reject speech. Falsely accepting non-speech as speech significantly slows down the downstream automatic speech recognition (ASR) processing; it is also computationally expensive, as ASR models are normally much larger than VAD models. On the other hand, falsely rejecting speech leads to deletion errors in ASR transcriptions (a few milliseconds of missed audio can remove an entire word). A good VAD model needs to work accurately in challenging environments, including noisy conditions, reverberant environments, and environments with competing speech. Significant research has been devoted to finding the optimal VAD features and models [17, 5, 3, 4, 15]. In the literature, LSTM-based VAD is a popular architecture for sequential modeling of the VAD task, showing state-of-the-art performance [3, 15].

In many scenarios, especially on-device speech recognition [6], computational resources such as CPU, memory, and battery are typically limited. In such cases, we wish to run computationally intensive components such as speech recognition only when the target user is talking to the device. Falsely triggering these components in the background, when only speech from other talkers or TV noise is present, causes battery drain and a poor user experience. A tiny model that passes through only the target user's speech is therefore highly desirable, and this is our motivation for developing the personal VAD system.

Although standard speaker recognition and speaker diarization techniques [19, 12, 16, 20, 22, 13] can be directly used for the same task, we argue that a personal VAD system is largely preferred here for several reasons:

  1. To minimize the latency of the whole system, an accept/reject decision is needed immediately upon the arrival of each frame, which calls for frame-level inference. However, many state-of-the-art speaker recognition and diarization systems require window-based or segment-based inference, or even offline full-sequence inference.

  2. To minimize battery consumption on the device, the model must be very small, while most speaker recognition and diarization models are considerably larger (typically millions of parameters).

  3. Unlike speaker recognition or diarization, in personal VAD, we don’t need to distinguish between different non-target speakers.

In fact, we implemented a baseline system for the personal VAD task by directly combining a standard speaker verification model and a standard VAD model, as described in Section 2.2.1, and found that its performance is worse than that of a dedicated personal VAD model. To the best of our knowledge, this work is the first lightweight solution that aims at directly detecting the voice activity of a target speaker in real time.

The proposed personal VAD is a VAD-like neural network conditioned on the target speaker embedding or the speaker verification score. Instead of the binary speech/non-speech decision made by standard VAD, personal VAD distinguishes three classes: non-speech, target speaker speech, and non-target speaker speech.

The rest of the paper is organized as follows. In Section 2.1, we first briefly describe our speaker verification system, which is used during the training of personal VAD. Then in Section 2.2, we propose four different architectures for personal VAD. During training, we first treat personal VAD as a three-class classification problem and optimize the model with a cross entropy loss. In addition, we note that distinguishing non-speech from non-target speaker speech is less important than distinguishing target speaker speech from the other two classes. We therefore propose a weighted pairwise loss that encourages the model to focus on the more important distinctions, as introduced in Section 2.3. We evaluate the model on an augmented version of the LibriSpeech dataset [14], with the experimental setup described in Section 3.2, model configuration described in Section 3.3, metrics explained in Section 3.4, and results presented in Section 3.5. Conclusions are drawn in Section 4.

2 Approach

2.1 Recap of speaker verification system

Personal VAD relies on a pre-trained text-independent speaker recognition/verification model to encode the speaker identity into embedding vectors. In this work, we use the “d-vector” model introduced in [19], which has been successfully applied to various applications including speaker diarization [20, 22], speech synthesis [9], source separation [21], and speech translation [8]. We retrained the 3-layer LSTM speaker verification model using data from 8 languages for language robustness and better performance. During inference, the model produces embeddings on sliding windows, and a final aggregated embedding named “d-vector” is used to represent the voice characteristics of this utterance, as illustrated in Fig. 1. The cosine similarity between two d-vector embeddings can be used to measure the similarity of two voices.

In a real application, users are required to complete an enrollment process before enabling speaker verification or personal VAD. During enrollment, d-vector embeddings are computed from the target user's recordings and stored on the device. Since enrollment is a one-off process and can happen server-side, we can assume the target speaker embeddings are available at runtime at no additional cost.

Figure 1: The speaker verification system [19] produces an utterance-level d-vector by aggregating window-level embeddings.
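For concreteness, the following NumPy sketch illustrates how window-level embeddings could be aggregated into an utterance-level d-vector and compared by cosine similarity. The L2-normalize-then-average aggregation rule and all variable names are illustrative assumptions, not the exact implementation of [19].

```python
import numpy as np

def aggregate_dvector(window_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate window-level embeddings into an utterance-level d-vector.

    Assumption: each window embedding is L2-normalized before averaging,
    and the average is re-normalized. The exact aggregation in [19] may differ.
    """
    normed = window_embeddings / np.linalg.norm(window_embeddings, axis=1, keepdims=True)
    dvec = normed.mean(axis=0)
    return dvec / np.linalg.norm(dvec)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two d-vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical enrollment: window-level embeddings from the target user's recordings.
enrollment_windows = np.random.randn(12, 256)   # 12 sliding windows, 256-dim embeddings
target_dvector = aggregate_dvector(enrollment_windows)

# At runtime, compare a test d-vector against the enrolled one.
test_dvector = aggregate_dvector(np.random.randn(8, 256))
print("similarity:", cosine_similarity(target_dvector, test_dvector))
```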

2.2 System architecture

A personal VAD system should produce frame-level class labels for three categories: non-speech (ns), target speaker speech (tss), and non-target speaker speech (ntss). We implemented four different architectures to achieve personal VAD, as illustrated by Fig. 2. All four architectures rely on the embedding of the target speaker, which is acquired via the enrollment process.

Figure 2: Four different architectures to implement personal VAD: (a) SC: Run standard VAD and frame-level speaker verification independently, and combine their results. This is used as a baseline for the other approaches. (b) ST: Concatenate the frame-level speaker verification score with the acoustic features to train a personal VAD model. (c) ET: Concatenate the speaker embedding with the acoustic features to train a personal VAD model. (d) SET: Concatenate both the speaker verification score and the speaker embedding with the acoustic features to train a personal VAD model.

Score combination (SC)

Our first approach to implement personal VAD is to simply combine a standard pre-trained speaker verification system and a standard VAD system, as shown in Fig. 2(a). We use this implementation as a baseline for other approaches, since it does not require training any new model.

We denote the frame of input acoustic features at time $t$ as $\mathbf{x}_t \in \mathbb{R}^d$, where $d$ is the dimensionality of the acoustic features. For example, we use 40-dimensional log Mel-filterbank energies as the features. We use the subscript $1\!:\!t$ to denote the subsequence ending at time $t$, i.e. $\mathbf{x}_{1:t} = [\mathbf{x}_1, \cdots, \mathbf{x}_t]$. A standard VAD model and a speaker verification model run independently on the acoustic features. The standard VAD produces unnormalized probabilities of speech (s) and non-speech (ns) for each frame:

$z_t^{\text{s}}, z_t^{\text{ns}} = \mathrm{StandardVAD}(\mathbf{x}_{1:t}),$   (1)

where $z_t^k \in \mathbb{R}$ for $k \in \{\text{s}, \text{ns}\}$. The speaker verification model produces an embedding at each frame:

$\mathbf{e}_t = \mathrm{SpeakerVerification}(\mathbf{x}_{1:t}),$   (2)

and this embedding is verified against the target speaker embedding $\mathbf{e}^{\text{target}}$, which was acquired during the enrollment process:

$s_t = \cos(\mathbf{e}_t, \mathbf{e}^{\text{target}}).$   (3)

To transform the standard VAD speech probability $z_t^{\text{s}}$ into the personal VAD probabilities $z_t^{\text{tss}}$ and $z_t^{\text{ntss}}$, we combine it with the resulting speaker verification cosine similarity score $s_t$, such that:

$z_t^{\text{tss}} = s_t \cdot z_t^{\text{s}}, \quad z_t^{\text{ntss}} = (1 - s_t) \cdot z_t^{\text{s}}.$   (4)

There are two major disadvantages of this architecture. First, it runs a window-based speaker verification model at the frame level without any adaptation, and this inconsistency can cause significant performance degradation. However, training frame-level speaker verification models often does not scale well, due to the difficulty of batching utterances of different lengths. Second, this architecture requires running a speaker verification system at runtime, which can be expensive since speaker verification models are usually much bigger than VAD models.
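As a concrete illustration of the score combination in Eq. (4), here is a minimal NumPy sketch. It assumes the per-frame verification score has already been rescaled into [0, 1], and all array values are made up.

```python
import numpy as np

def combine_scores(z_s: np.ndarray, z_ns: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Score combination (SC), following Eq. (4).

    z_s, z_ns : per-frame speech / non-speech probabilities from the standard VAD.
    s         : per-frame speaker verification score, assumed here to be
                rescaled from cosine similarity into [0, 1].
    Returns per-frame probabilities for (ns, tss, ntss).
    """
    z_tss = s * z_s            # target speaker speech
    z_ntss = (1.0 - s) * z_s   # non-target speaker speech
    return np.stack([z_ns, z_tss, z_ntss], axis=-1)

# Hypothetical frame-level inputs for a 5-frame chunk.
z_s = np.array([0.9, 0.8, 0.1, 0.7, 0.95])
z_ns = 1.0 - z_s
s = np.array([0.9, 0.85, 0.5, 0.2, 0.1])   # verification scores in [0, 1]
print(combine_scores(z_s, z_ns, s))
```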

Score conditioned training (ST)

As shown in Fig. 2(b), our second approach uses the speaker verification model to produce a cosine similarity score for each frame, as explained in Eq. (3), and then concatenates this cosine similarity score with the acoustic features:

$\mathbf{x}'_t = [\mathbf{x}_t; s_t].$   (5)

The concatenated feature vector $\mathbf{x}'_t$ is 41-dimensional, as $\mathbf{x}_t$ represents the 40-dimensional log Mel-filterbank energies. We train a new personal VAD network that takes the concatenated features as input, and outputs the probabilities of the three class labels for each frame:

$z_t^{\text{ns}}, z_t^{\text{tss}}, z_t^{\text{ntss}} = \mathrm{PersonalVAD}(\mathbf{x}'_{1:t}),$   (6)

where $z_t^k \in \mathbb{R}$ for $k \in \{\text{ns}, \text{tss}, \text{ntss}\}$.

This approach still requires running the speaker verification model at runtime. However, since it retrains the personal VAD model based on the speaker verification scores, it is expected to perform better than simply combining the scores of two individually trained systems.

Embedding conditioned training (ET)

As shown in Fig. 2(c), the third approach directly concatenates the target speaker embedding $\mathbf{e}^{\text{target}}$ (acquired in the enrollment process) with the acoustic features:

$\mathbf{x}'_t = [\mathbf{x}_t; \mathbf{e}^{\text{target}}].$   (7)

Since our embedding is 256-dimensional, the concatenated feature vector here is 296-dimensional. We then train a new personal VAD network, which outputs the probabilities of the three classes at the frame level, similar to Eq. (6).

This approach is similar to a knowledge distillation [7] process. The large speaker verification model is pre-trained separately on a large-scale dataset. When we then train the personal VAD model, the target speaker embedding "distills the knowledge" from the large speaker verification model into the small personal VAD model. As a result, this approach does not require running the large speaker verification model at runtime, making it the most lightweight of all the architectures.
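The following sketch illustrates the feature conditioning of Eq. (7): the enrolled d-vector is tiled across frames and concatenated with the log Mel-filterbank features. The function and variable names are assumptions for illustration; passing the optional per-frame scores corresponds to the SET variant described next.

```python
import numpy as np

def condition_features(features: np.ndarray,
                       target_dvector: np.ndarray,
                       frame_scores: np.ndarray = None) -> np.ndarray:
    """Build conditioned inputs for personal VAD.

    features      : (num_frames, 40) log Mel-filterbank energies.
    target_dvector: (256,) enrolled target speaker embedding.
    frame_scores  : optional (num_frames,) verification scores; if given, the
                    result corresponds to SET (297-dim), otherwise ET (296-dim).
    """
    tiled = np.tile(target_dvector, (features.shape[0], 1))        # (T, 256)
    conditioned = np.concatenate([features, tiled], axis=1)        # (T, 296)
    if frame_scores is not None:
        conditioned = np.concatenate(
            [conditioned, frame_scores[:, None]], axis=1)          # (T, 297)
    return conditioned

# Hypothetical inputs: 100 frames of features and an enrolled d-vector.
feats = np.random.randn(100, 40)
dvec = np.random.randn(256)
print(condition_features(feats, dvec).shape)                         # ET:  (100, 296)
print(condition_features(feats, dvec, np.random.rand(100)).shape)    # SET: (100, 297)
```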

Score and embedding conditioned training (SET)

As shown in Fig. 2(d), this approach concatenates both the frame-level speaker verification score and the target speaker embedding with the acoustic features to train a new personal VAD model:

$\mathbf{x}'_t = [\mathbf{x}_t; s_t; \mathbf{e}^{\text{target}}].$   (8)

The concatenated feature vector in this approach is 297-dimensional. This approach makes use of the most information from the speaker verification system. However, it still requires running the speaker verification model at runtime, so it is not a lightweight solution.

2.3 Weighted pairwise loss

With an input frame $\mathbf{x}$ and the corresponding ground truth label $y \in \{\text{ns}, \text{tss}, \text{ntss}\}$, personal VAD can be viewed as a ternary classification problem.2 The network outputs the unnormalized distribution of $\mathbf{x}$ over the three classes, denoted as $\mathbf{z} = [z^{\text{ns}}, z^{\text{tss}}, z^{\text{ntss}}]$. We use $z^k$ to denote the unnormalized probability of the $k$-th class. To train the model, we minimize the cross entropy loss:

$L_{\mathrm{CE}}(y, \mathbf{z}) = -\log \frac{\exp(z^y)}{\sum_{k} \exp(z^k)},$   (9)

where $k \in \{\text{ns}, \text{tss}, \text{ntss}\}$.

However, in personal VAD, our goal is to detect the voice activity of only the target speaker. Audio frames classified as ns or ntss will be discarded in the same way by downstream components. As a result, confusion errors between <ns,ntss> have less impact on system performance than errors between <tss,ntss> and <tss,ns>. Inspired by Tuplemax loss [18], we propose a weighted pairwise loss to model the different tolerance to each class pair. Given $y$ and $\mathbf{z}$, we define the weighted pairwise loss as:

$L_{\mathrm{WPL}}(y, \mathbf{z}) = -\sum_{k \neq y} w_{\langle y,k \rangle} \cdot \log \frac{\exp(z^y)}{\exp(z^y) + \exp(z^k)},$   (10)

where $w_{\langle y,k \rangle}$ is the weight for the class pair $\langle y,k \rangle$. By setting a lower weight for <ns,ntss> errors than for <tss,ntss> and <tss,ns> errors, we encourage the model to be more tolerant of confusion between <ns,ntss> and to focus on distinguishing tss from ns and ntss.
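A minimal NumPy sketch of the two losses for a single frame is shown below. The class ordering and the weight values are illustrative assumptions; the weights simply follow the rule of down-weighting the <ns,ntss> pair relative to the pairs involving tss.

```python
import numpy as np

CLASSES = ["ns", "tss", "ntss"]

def cross_entropy_loss(y: int, z: np.ndarray) -> float:
    """Eq. (9): standard softmax cross entropy over the three classes."""
    return float(-np.log(np.exp(z[y]) / np.sum(np.exp(z))))

def weighted_pairwise_loss(y: int, z: np.ndarray, w: np.ndarray) -> float:
    """Eq. (10): weighted sum of pairwise binary softmax terms.

    w[y, k] is the weight for the class pair <y, k>; the <ns, ntss> pair is
    given a lower weight than the pairs involving tss.
    """
    loss = 0.0
    for k in range(len(z)):
        if k == y:
            continue
        loss += -w[y, k] * np.log(np.exp(z[y]) / (np.exp(z[y]) + np.exp(z[k])))
    return float(loss)

# Illustrative weights: pairs involving tss weighted 1.0, <ns, ntss> weighted lower.
w = np.array([[0.0, 1.0, 0.1],    # ns   vs (ns, tss, ntss)
              [1.0, 0.0, 1.0],    # tss  vs (ns, tss, ntss)
              [0.1, 1.0, 0.0]])   # ntss vs (ns, tss, ntss)

z = np.array([0.2, 1.5, -0.3])    # unnormalized scores for one frame
y = CLASSES.index("tss")          # ground truth: target speaker speech
print(cross_entropy_loss(y, z), weighted_pairwise_loss(y, z, w))
```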

3 Experiments

3.1 Datasets

An ideal dataset to train and evaluate personal VAD would be one in which: (1) each utterance contains natural speaker turns; and (2) enrollment utterances are available for each individual speaker. Unfortunately, to the best of our knowledge, no public dataset satisfies both requirements. Although some datasets for speaker diarization [20] have natural speaker turns, they do not provide enrollment utterances for individual speakers. Conversely, datasets containing enrollment utterances for individual speakers usually do not have natural speaker turns.

To address this limitation, we conducted experiments on an augmented version of the LibriSpeech dataset [14]. To simulate speaker turns, we concatenate single-speaker utterances from different speakers into multi-speaker utterances (see Section 3.2.1). We also noisify the concatenated utterances with reverberant room simulators to mitigate the concatenation artifacts (see Section 3.2.2).

In the LibriSpeech dataset, the training set contains 960 hours of speech, of which 460 hours are "clean" speech and the other 500 hours are "noisy" speech. The testing set also consists of both "clean" and "noisy" speech. In all the experiments, we use the concatenated LibriSpeech training set to train the models. We use both the original LibriSpeech testing set and the concatenated LibriSpeech testing set for evaluation, as described in the following sections. For all the datasets, we use forced alignment to produce the frame-level ground truth labels used in training and evaluation.

Method        | Loss | Without MTR (tss / ns / ntss / mean) | With MTR (tss / ns / ntss / mean) | Network parameters (million)
SC (baseline) | N/A  | 0.886 / 0.970 / 0.872 / 0.900        | 0.777 / 0.908 / 0.768 / 0.801     | 4.88 (SV) + 0.06 (VAD)
ST            | CE   | 0.956 / 0.968 / 0.956 / 0.957        | 0.905 / 0.885 / 0.905 / 0.901     | 4.88 (SV) + 0.06 (PVAD)
ET            | CE   | 0.932 / 0.962 / 0.946 / 0.946        | 0.878 / 0.873 / 0.890 / 0.883     | 0.13 (PVAD)
SET           | CE   | 0.970 / 0.969 / 0.972 / 0.969        | 0.938 / 0.888 / 0.938 / 0.928     | 4.88 (SV) + 0.13 (PVAD)
ET            | WPL  | 0.955 / 0.965 / 0.961 / 0.959        | 0.916 / 0.883 / 0.920 / 0.912     | 0.13 (PVAD)

The baseline system does not require training any new model.

Table 1: Architecture and loss function comparison results. SC: Score combination, the baseline system. ST: Score conditioned training. ET: Embedding conditioned training. SET: Score and embedding conditioned training. CE: Cross entropy loss. WPL: Weighted pairwise loss (with the best-performing weight for the <ns,ntss> pair). We report the Average Precision (AP) for each class, and the mean Average Precision (mAP) over all the classes. Network parameters include the 4.88 million parameters of the speaker verification (SV) model, if it is used during inference.

3.2 Experimental settings

Utterance concatenation

In the training corpora for standard VAD, each utterance usually contains speech from only a single speaker. However, personal VAD aims to find the voice activity of a target speaker in a conversation in which multiple speakers may be engaged. Therefore, we cannot directly use standard VAD training corpora to train personal VAD. To simulate conversational speech, we concatenate utterances from multiple speakers into a longer utterance, and then randomly select one of the speakers as the target speaker of the concatenated utterance.

To generate a concatenated utterance, we draw a random number $n$, indicating the number of utterances used for concatenation, from a uniform distribution:

$n \sim \mathrm{Uniform}(n_{\min}, n_{\max}),$   (11)

where $n_{\min}$ and $n_{\max}$ are the minimal and maximal numbers of utterances used for concatenation. The waveforms of the $n$ randomly selected utterances are concatenated, and one of their speakers is designated as the target speaker of the concatenated utterance. At the same time, we modify the VAD ground truth label of each frame according to the target speaker: "non-speech" frames remain the same, while "speech" frames are modified to either "target speaker speech" or "non-target speaker speech" depending on whether the source utterance comes from the target speaker.

In our experiments, we generated concatenated utterances for the training set and for the testing sets, using the same $n_{\min}$ and $n_{\max}$ for both sets to cover both single-speaker and multi-speaker scenarios.
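The following sketch illustrates this concatenation and relabeling procedure under simplified assumptions (per-frame string labels, in-memory waveforms); the defaults for $n_{\min}$ and $n_{\max}$ are placeholders rather than the values used in our experiments.

```python
import random
import numpy as np

def concatenate_utterances(utterances, n_min=1, n_max=3):
    """Simulate a multi-speaker utterance, roughly following Section 3.2.1.

    utterances: list of dicts with keys
        "speaker": speaker id,
        "wave":    1-D waveform array,
        "labels":  per-frame labels in {"speech", "non-speech"}.
    n_min / n_max are illustrative defaults, not the values used in the paper.
    """
    n = random.randint(n_min, n_max)                 # Eq. (11): n ~ Uniform
    picked = random.sample(utterances, n)
    target = random.choice(picked)["speaker"]        # pick a target speaker

    wave = np.concatenate([u["wave"] for u in picked])
    labels = []
    for u in picked:
        for lab in u["labels"]:
            if lab == "non-speech":
                labels.append("ns")
            elif u["speaker"] == target:
                labels.append("tss")                 # target speaker speech
            else:
                labels.append("ntss")                # non-target speaker speech
    return target, wave, labels
```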

Multistyle training

For both training and evaluation, we apply a data augmentation technique named "multistyle training" (MTR) [10] to our datasets to avoid domain overfitting and to mitigate concatenation artifacts. During MTR, the original (concatenated) source utterance is noisified with multiple randomly selected noise sources, using a randomly selected room configuration. Our noise sources include:

  • 827 recordings of ambient noise in cafes;

  • 786 recordings made in silent environments;

  • 6433 YouTube segments containing background music or noise.

We generated 3 million room configurations using a room simulator to cover different reverberation conditions. The distribution of the signal-to-noise ratio (SNR) of our MTR is shown in Fig. 3.

Figure 3: Histogram of SNR (dB) of our multistyle training.
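As a simplified illustration of the noisification step, the sketch below mixes a single noise source into an utterance at a target SNR; the room simulation and multi-source mixing of the actual MTR pipeline are omitted, and all names and values are illustrative.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target SNR in dB (a simplified MTR step)."""
    # Loop or trim the noise to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Hypothetical usage with a randomly drawn SNR.
speech = np.random.randn(16000)
noise = np.random.randn(8000)
noisy = mix_at_snr(speech, noise, snr_db=np.random.uniform(0, 20))
```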

3.3 Model configuration

The acoustic features are 40-dimensional log Mel-filterbank energies, extracted from frames with 25ms width and 10ms step. For both the standard VAD model and the personal VAD model, we used a 2-layer LSTM network with 64 cells, followed by a fully-connected layer with 64 neurons. We also tried larger networks but did not see performance improvements, possibly due to the limited variety in the training data. We used TensorFlow [1] for training and inference. During training, we used the Adam optimizer [11]. For the models with weighted pairwise loss, we fixed the weights of the pairs involving tss, i.e. $w_{\langle \text{tss},\text{ns} \rangle}$ and $w_{\langle \text{tss},\text{ntss} \rangle}$, and explored different values for $w_{\langle \text{ns},\text{ntss} \rangle}$.
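For reference, a Keras sketch of this network with the 296-dimensional ET input is shown below; the activation functions and the use of Keras are assumptions for illustration, but the parameter count comes out to roughly 0.13 million, consistent with Table 1.

```python
import tensorflow as tf

def build_personal_vad(input_dim: int = 296, num_classes: int = 3) -> tf.keras.Model:
    """2-layer LSTM with 64 cells, a 64-unit fully-connected layer, and a
    3-class frame-level output, as described in Section 3.3."""
    inputs = tf.keras.Input(shape=(None, input_dim))       # (batch, frames, features)
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes)(x)        # unnormalized scores per frame
    return tf.keras.Model(inputs, outputs)

model = build_personal_vad()
model.summary()   # roughly 0.13 million parameters for the ET configuration
```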

To reduce the model size and accelerate runtime inference, we quantized the model parameters to 8-bit integer values following [2]. With this quantization, our model using the ET architecture, which has only around 130 thousand parameters and is the smallest among all architectures (see Table 1), is only about 130 KB in size.
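As a rough stand-in for the quantization scheme of [2], the sketch below applies TensorFlow Lite post-training dynamic-range quantization, which similarly stores weights as 8-bit integers; this is not the paper's exact procedure, and depending on the TensorFlow version, converting LSTM layers may require additional converter settings.

```python
import tensorflow as tf

# Rebuild the personal VAD network from the previous sketch.
inputs = tf.keras.Input(shape=(None, 296))
x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(3)(x)
model = tf.keras.Model(inputs, outputs)

# Post-training quantization: a stand-in for [2], storing weights as 8-bit integers.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("personal_vad_int8.tflite", "wb") as f:
    f.write(tflite_model)
print(f"model size: {len(tflite_model) / 1024:.0f} KB")  # on the order of 130 KB
```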

3.4 Metrics

To evaluate the performance of the proposed method, we computed the Average Precision (AP) [23] for each class and the mean Average Precision (mAP) over all classes. AP and mAP are among the most common metrics for multi-class classification problems. AP summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

$\mathrm{AP} = \sum_{n} (R_n - R_{n-1}) \, P_n,$   (12)

where $R_n$ and $P_n$ are the recall and precision at the $n$-th threshold, respectively. We adopted the micro-mean3 over all classes when computing mAP, which averages the APs over all samples to take class imbalance into account.
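This is the metric implemented by scikit-learn's average_precision_score (see footnote 3). Below is a small sketch with made-up frame-level labels and scores; the arrays and class ordering are purely illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Hypothetical frame-level ground truth (0=ns, 1=tss, 2=ntss) and model scores.
y_true = np.array([0, 1, 1, 2, 0, 1, 2, 2])
y_score = np.random.rand(len(y_true), 3)
y_score = y_score / y_score.sum(axis=1, keepdims=True)

y_true_bin = label_binarize(y_true, classes=[0, 1, 2])

# Per-class AP, e.g. for the tss class (the metric we care most about).
ap_tss = average_precision_score(y_true_bin[:, 1], y_score[:, 1])

# Micro-averaged mAP over all classes, as used in the paper.
map_micro = average_precision_score(y_true_bin, y_score, average="micro")
print(ap_tss, map_micro)
```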

3.5 Results

We conducted three groups of experiments to evaluate the proposed method. First, we compared the four architectures for personal VAD. Following this, we examined the effectiveness of weighted pairwise loss and compared it against conventional cross entropy loss. Finally, we evaluated personal VAD on a standard VAD task, to see if personal VAD can replace standard VAD without performance degradation.

Architecture comparisons

In the first group of experiments, we compared the performance of the four personal VAD architectures described in Fig. 2. We evaluated these systems on the concatenated LibriSpeech testing set. Additionally, to explore the performance of personal VAD on noisy speech, we also applied the MTR data augmentation to the testing set. In personal VAD tasks, the most important metric is the AP for class tss, as downstream processes will only be applied to the speech produced by the target speaker.

We report the evaluation results on the testing set with and without MTR in Table 1. The results show that ST, ET, and SET significantly outperform the baseline SC system in all cases. When applying MTR to the testing set, we observed an even larger performance gain of the proposed methods over the baseline. Among the proposed systems, SET achieved the highest AP for tss, and ST slightly outperforms ET. However, both ST and SET require running the speaker verification model (4.88 million parameters) to compute the cosine similarity score at inference time, which would significantly increase both the number of parameters in the system and the inference computational cost. By contrast, ET obtained an AP of 0.932 (without MTR) / 0.878 (with MTR) for class tss on the testing set with a model of only 0.13 million parameters (roughly 40 times smaller), which is more appropriate for on-device applications.

Loss function comparisons

Figure 4: Mean Average Precision (mAP) of personal VAD (ET) with different values of $w_{\langle \text{ns},\text{ntss} \rangle}$ in the weighted pairwise loss. The weight between ns and ntss is displayed in log scale.

In the second group of experiments, we compared the proposed weighted pairwise loss against the conventional cross entropy loss. Here we only consider the ET architecture, as it is much more lightweight while achieving reasonably good performance. Similarly, we evaluated the systems on the concatenated LibriSpeech testing set with and without MTR.

In Fig. 4, we plot the AP for tss against different values of $w_{\langle \text{ns},\text{ntss} \rangle}$ in the weighted pairwise loss. From the results, we observed that using a smaller value for $w_{\langle \text{ns},\text{ntss} \rangle}$ than for $w_{\langle \text{tss},\text{ns} \rangle}$ and $w_{\langle \text{tss},\text{ntss} \rangle}$ improves the performance, which confirms that confusion errors between <ns,ntss> have less impact on system performance than errors between <tss,ntss> and <tss,ns>.

However, when $w_{\langle \text{ns},\text{ntss} \rangle}$ becomes too small, we observed performance degradation in the curve. This result shows that completely ignoring the difference between ntss and ns is also harmful to system performance. In other words, it is insufficient to simply treat the personal VAD task as a binary classification problem (target speaker speech vs. everything else). The best performance is reached at an intermediate value of $w_{\langle \text{ns},\text{ntss} \rangle}$, with detailed results listed in Table 1.

Personal VAD on standard VAD tasks

If we want to replace a standard VAD component with personal VAD, we also need to guarantee that any performance degradation on the standard speech/non-speech task is minimal. We therefore conducted a final experiment evaluating personal VAD on a standard VAD task. We evaluated two personal VAD models (the ET architecture with cross entropy loss, and the ET architecture with weighted pairwise loss) on the non-concatenated LibriSpeech testing data (so each utterance contains only the target speaker). For comparison purposes, we also implemented a standard VAD model with the same network structure (a 2-layer LSTM network with 64 cells, followed by a fully-connected layer with 64 neurons).

The results are shown in Table 2. The AP for the speech (s) class is very close between the personal VAD models and the standard VAD model, which justifies replacing standard VAD with personal VAD. Additionally, since the personal VAD models and the standard VAD model share the same architecture in this experiment, replacing standard VAD with personal VAD does not increase the model size or the computational cost at inference time.

Method            | Loss | Without MTR (s / ns) | With MTR (s / ns)
Standard VAD      | CE   | 0.992 / 0.975        | 0.975 / 0.918
Personal VAD (ET) | CE   | 0.991 / 0.965        | 0.979 / 0.893
Personal VAD (ET) | WPL  | 0.991 / 0.967        | 0.979 / 0.901

Table 2: Evaluation on a standard VAD task. We report the Average Precision (AP) for speech (s) and non-speech (ns).

4 Conclusions

In this paper, we proposed four different architectures to implement personal VAD, a system that detects the voice activity of a target user in real time. Among the different architectures, a single small network that takes the acoustic features and the enrolled target speaker embedding as inputs achieves near-optimal performance with the smallest runtime computational cost. To model the different tolerance to different types of errors, we proposed a new loss function, the weighted pairwise loss, which outperforms the conventional cross entropy loss. Our experiments also show that personal VAD and standard VAD perform equally well on a standard VAD task. In summary, our findings suggest that, by focusing only on the desired target speaker, personal VAD can reduce the overall computational cost of speech recognition systems operating in noisy environments.

Footnotes

  1. * Equal contribution. Shaojin Ding performed this work as an intern at Google.
  2. Without loss of generality, we ignore the subscript $t$ for the time dimension, and use $\mathbf{x}$ to represent both the original and the concatenated (conditioned) input features in our notations here.
  3. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html

References

  1. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean and M. Devin (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  2. R. Alvarez, R. Prabhavalkar and A. Bakhtin (2016) On the efficient representation and execution of deep acoustic models. arXiv preprint arXiv:1607.04683.
  3. S. Chang, B. Li, T. N. Sainath, G. Simko and C. Parada (2017) Endpoint detection using grid long short-term memory networks for streaming speech recognition. In Interspeech, pp. 3812–3816.
  4. S. Chang, B. Li, G. Simko, T. N. Sainath, A. Tripathi, A. van den Oord and O. Vinyals (2018) Temporal modeling using dilated convolution and gating for voice-activity-detection. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5549–5553.
  5. M. Graciarena, A. Alwan, D. Ellis, H. Franco, L. Ferrer, J. H. Hansen, A. Janin, B. S. Lee, Y. Lei and V. Mitra (2013) All for one: feature combination for highly channel-degraded speech activity detection. In Interspeech, pp. 709–713.
  6. Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu and R. Pang (2019) Streaming end-to-end speech recognition for mobile devices. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6381–6385.
  7. G. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  8. Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen and Y. Wu (2019) Direct speech-to-speech translation with a sequence-to-sequence model. In Interspeech, pp. 1123–1127.
  9. Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno and Y. Wu (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, pp. 4480–4490.
  10. C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. Sainath and M. Bacchiani (2017) Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home. In Interspeech, pp. 379–383.
  11. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  12. C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan and Z. Zhu (2017) Deep Speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304.
  13. A. Nagrani, J. S. Chung and A. Zisserman (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
  14. V. Panayotov, G. Chen, D. Povey and S. Khudanpur (2015) LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
  15. M. Shannon, G. Simko, S. Chang and C. Parada (2017) Improved end-of-query detection for streaming speech recognition. In Interspeech, pp. 1909–1913.
  16. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
  17. S. Thomas, G. Saon, M. Van Segbroeck and S. S. Narayanan (2015) Improvements to the IBM speech activity detection system for the DARPA RATS program. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  18. L. Wan, P. Sridhar, Y. Yu, Q. Wang and I. L. Moreno (2019) Tuplemax loss for language identification. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5976–5980.
  19. L. Wan, Q. Wang, A. Papir and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883.
  20. Q. Wang, C. Downey, L. Wan, P. A. Mansfield and I. L. Moreno (2018) Speaker diarization with LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5239–5243.
  21. Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia and I. L. Moreno (2019) VoiceFilter: targeted voice separation by speaker-conditioned spectrogram masking. In Interspeech, pp. 2728–2732.
  22. A. Zhang, Q. Wang, Z. Zhu, J. Paisley and C. Wang (2019) Fully supervised speaker diarization. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6301–6305.
  23. M. Zhu (2004) Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo 2, pp. 30.