Speaker Diarization with LSTM
Abstract
For many years, i-vector based speaker embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based speaker embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.
Quan Wang^{1} Carlton Downey^{2} Li Wan^{1} Philip Andrew Mansfield^{1} Ignacio Lopez Moreno^{1} 
^{1}Google Inc., USA ^{2}Carnegie Mellon University, USA 
^{1} { quanw, liwan, memes, elnota } @google.com ^{2} cmdowney@cs.cmu.edu 
Index Terms— Speaker diarization, deep learning, speaker embedding, LSTM, spectral clustering
1 Introduction
Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. It answers the question “who spoke when” in a multi-speaker environment. It has a wide variety of applications including multimedia information retrieval, speaker turn analysis, and audio processing. In particular, the speaker boundaries produced by diarization systems have the potential to significantly improve automatic speech recognition (ASR) accuracy.
A typical speaker diarization system consists of four components: (1) Speech segmentation, where the input audio is segmented into short sections that are assumed to contain a single speaker, and the non-speech sections are filtered out; (2) Speaker embedding extraction, where specific features such as MFCCs [1], speaker factors [2], or i-vectors [3, 4, 5] are extracted from the segmented sections; (3) Clustering, where the number of speakers is determined, and the extracted speaker embeddings are clustered into these speakers; and optionally (4) Resegmentation [6], where the clustering results are further refined to produce the final diarization results.
In recent years, neural network based speaker embeddings (d-vectors) have seen widespread use in speaker verification applications [7, 8, 9, 10, 11], often significantly outperforming previously state-of-the-art techniques based on i-vectors. However, most of these applications involve text-dependent speaker verification, where the speaker embeddings are extracted from specific detected keywords [12, 13]. In contrast, speaker diarization requires text-independent embeddings which work on arbitrary speech.
In this paper, we explore a text-independent d-vector based approach to speaker diarization. We leverage the work of [11] to train an LSTM-based text-independent speaker verification model, then combine this model with a recent non-parametric spectral clustering algorithm to obtain a state-of-the-art speaker diarization system.
While several authors have explored using neural network embeddings for diarization tasks, their work has largely focused on using feed-forward DNNs to directly perform diarization. For example, [14] uses DNN embeddings trained with a PLDA-inspired loss. In contrast, our work uses RNNs (specifically LSTMs [15]), which better capture the sequential nature of audio signals, and our generalized end-to-end training architecture directly simulates the enroll-verify runtime logic.
There have been several attempts to apply spectral clustering [16] to the speaker diarization problem [17, 3]. However, to the authors’ knowledge, our work is the first to combine LSTM-based d-vector embeddings with spectral clustering. Furthermore, as part of our spectral clustering algorithm, we present a novel sequence of affinity matrix refinement steps which act to denoise the affinity matrix, and are crucial to the success of our system.
The remainder of this paper is organized as follows: In Sec. 2, we describe how the LSTM-based text-independent speaker verification model presented in [11] can be adapted to featurize raw audio data and prepare it for clustering. In Sec. 3, we describe four different clustering algorithms and discuss the pros and cons of each in the context of speaker diarization, culminating with a modified spectral clustering algorithm. Experimental results and discussions are presented in Sec. 4, and conclusions are in Sec. 5.
2 Diarization with D-Vectors
Wan et al. recently introduced an LSTM-based [15] text-independent speaker embedding network for speaker verification [11]. Their model is trained on fixed-length segments extracted from a large corpus of arbitrary speech. They showed that the d-vector embeddings produced by such networks usually significantly outperform i-vectors in an enrollment-verification two-stage application. We now describe how this model can be modified for purposes of speaker diarization.
The flowchart of our diarization system is provided in Fig. 1. In this system, audio signals are first transformed into frames of width 25ms and step 10ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame as the network input. These frames form overlapping sliding windows of a fixed length, on which we run the LSTM network. The last-frame output of the LSTM is then used as the d-vector representation of this sliding window.
The d-vectors are grouped into speech segments determined by a Voice Activity Detector (VAD). These speech segments are further divided into smaller segments using a maximal segment-length limit (e.g. 400ms in our experiments), which determines the temporal resolution of the diarization results. For each segment, the corresponding d-vectors are first L2-normalized, then averaged to form an embedding of the segment.
The above process serves to reduce arbitrary-length audio input to a sequence of fixed-length embeddings. We can now apply a clustering algorithm to these embeddings in order to determine the number of unique speakers, and assign each part of the audio to a specific speaker.
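As a concrete illustration, the windowing-and-averaging pipeline above can be sketched as follows. This is a minimal sketch, not our implementation: `dvector_fn` is a stand-in for the trained LSTM, and the window, step, and segment lengths are expressed in 10ms frames (240ms windows, 120ms steps, 400ms segments).

```python
import numpy as np

def segment_embeddings(frames, dvector_fn, win=24, step=12, seg_len=40):
    """Reduce frame-level features to one embedding per segment.

    frames:     (num_frames, feat_dim) log-mel features (25ms width, 10ms step).
    dvector_fn: maps a (win, feat_dim) window to a d-vector (stands in for the LSTM).
    win, step:  sliding-window size/step in frames (240ms / 120ms at a 10ms step).
    seg_len:    maximal segment length in frames (400ms at a 10ms step).
    """
    # One d-vector per sliding window (the LSTM's last-frame output).
    dvecs, centers = [], []
    for start in range(0, len(frames) - win + 1, step):
        dvecs.append(dvector_fn(frames[start:start + win]))
        centers.append(start + win // 2)

    # Group windows into fixed-length segments, then L2-normalize and average.
    segments = []
    for seg_start in range(0, len(frames), seg_len):
        window_vecs = [d for d, c in zip(dvecs, centers)
                       if seg_start <= c < seg_start + seg_len]
        if not window_vecs:
            continue
        normed = [v / np.linalg.norm(v) for v in window_vecs]
        segments.append(np.mean(normed, axis=0))
    return np.stack(segments)
```

In practice the VAD boundaries would also cut the stream before segmentation; here the whole input is treated as one speech region for brevity.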
3 Clustering
In this section, we introduce the four clustering algorithms that we integrated into our diarization system. We place particular focus on the spectral offline clustering algorithm, which significantly outperformed the alternative approaches across experiments.
We note that clustering algorithms can be separated into two categories according to their runtime latency:

Online clustering: A speaker label is immediately emitted once a segment is available, without seeing future segments.

Offline clustering: Speaker labels are produced after the embeddings of all segments are available.
Offline clustering algorithms typically outperform online clustering algorithms due to the additional contextual information available in the offline setting. Furthermore, a final resegmentation step can only be applied in the offline setting. Nonetheless, the choice between online and offline clustering depends primarily on the nature of the application in which the system is to be deployed. For example, latency-sensitive applications such as live video analysis typically restrict the system to online clustering algorithms.
3.1 Naive online clustering
This is a prototypical online clustering algorithm, which applies a threshold to the similarities between embeddings of segments. To be consistent with the generalized end-to-end training architecture [11], cosine similarity is used as our similarity metric.
In this clustering algorithm, each cluster is represented by the centroid of all its corresponding embeddings. When a new segment embedding is available, we compute its similarity to the centroid of each existing cluster. If all similarities are smaller than the threshold, we create a new cluster containing only this embedding; otherwise, we add this embedding to the most similar cluster and update its centroid.
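The procedure above can be sketched as follows; a minimal sketch, with a hypothetical threshold value (in our system the threshold is tuned on the Dev set):

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class NaiveOnlineClusterer:
    """Each cluster is its centroid; a global threshold decides new clusters."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.sums = []    # running sum of embeddings per cluster
        self.counts = []  # number of embeddings per cluster

    def add(self, emb):
        """Emit a speaker label immediately for a new segment embedding."""
        emb = np.asarray(emb, dtype=float)
        if self.sums:
            sims = [cosine_sim(emb, s / n) for s, n in zip(self.sums, self.counts)]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                self.sums[best] += emb   # update the centroid
                self.counts[best] += 1
                return best
        # All similarities below threshold: start a new cluster.
        self.sums.append(emb.copy())
        self.counts.append(1)
        return len(self.sums) - 1
```

Note that labels are emitted immediately, one per segment, which is what makes this algorithm suitable for latency-sensitive applications.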
3.2 Links online clustering
“Links” is an algorithm we developed to generalize the naive online clustering algorithm by allowing anisotropic cluster distributions. In this algorithm, each cluster is represented as a mixture of isotropic subclusters.
In other words, let $\mathbf{x}$ be an embedding; a cluster is then modeled as a probability distribution $p(\mathbf{x})$, which is defined as:

$p(\mathbf{x}) = \sum_{k} w_k \, p_k(\mathbf{x})$ (1)

where $p_k(\mathbf{x})$ is the distribution of the $k$-th subcluster within this cluster, and $w_k$ is the weight of this subcluster, determined by the number of embeddings it contains. Cluster membership is determined by comparing the cumulative probability against a global threshold $\tau$:

$\mathbf{x}$ belongs to the cluster if $p(\mathbf{x}) \geq \tau$ (2)

Assume the data in a subcluster are generated by a Gaussian $\mathcal{N}(\theta; 0, \sigma^2)$, where $\theta$ is the angle between an embedding and the subcluster center $\boldsymbol{\mu}$, and $\sigma$ is a fixed value. The likelihood of $\boldsymbol{\mu}$ given the data $\mathbf{x}_1, \dots, \mathbf{x}_n$ is simply:

$\mathcal{L}(\boldsymbol{\mu}) = \prod_{i=1}^{n} \mathcal{N}(\theta(\mathbf{x}_i, \boldsymbol{\mu}); 0, \sigma^2)$ (3)

And the subcluster probability distribution is then estimated as:

$p_k(\mathbf{x}) = \int \mathcal{N}(\theta(\mathbf{x}, \boldsymbol{\mu}); 0, \sigma^2) \, p(\boldsymbol{\mu} \mid \mathbf{x}_1, \dots, \mathbf{x}_n) \, d\boldsymbol{\mu}$ (4)

where the posterior $p(\boldsymbol{\mu} \mid \mathbf{x}_1, \dots, \mathbf{x}_n) \propto \mathcal{L}(\boldsymbol{\mu})$, i.e., the integral assumes the prior distribution of $\boldsymbol{\mu}$ is uniform.
When a new embedding $\mathbf{x}$ is available, we compute the similarities between $\mathbf{x}$ and the centers of all subclusters. Let $s$ be the similarity to the most similar subcluster; we compare $s$ against a global threshold $\tau'$:

If $s \geq \tau'$, we add $\mathbf{x}$ to this subcluster. We then recursively merge two subclusters within the same cluster whenever the similarity between their centers exceeds $\tau'$.

If $s < \tau'$, we create a new subcluster containing only $\mathbf{x}$. Let the cluster containing the most similar subcluster be $C$; we then use Eq. (2) to determine whether this new subcluster should be added to $C$.

Finally, we check the integrity of the clusters: if a cluster can be partitioned into two clusters $C_1$ and $C_2$ such that embeddings in each part are improbable under the other part, i.e. $p_{C_2}(\mathbf{x}) < \tau$ for $\mathbf{x} \in C_1$ and vice versa, then we split this cluster.
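A greatly simplified sketch of this update rule is shown below. It keeps only the subcluster bookkeeping: the merge, integrity-check, and split steps are omitted, the cumulative-probability test of Eq. (2) is replaced by a single similarity threshold, and both threshold values are hypothetical.

```python
import numpy as np

def links_update(subclusters, emb, sub_threshold=0.7, cluster_threshold=0.3):
    """One step of a simplified Links-style update.

    subclusters: list of dicts {"sum": vec, "n": int, "cluster": int}.
    Returns the cluster id assigned to `emb`.
    """
    emb = np.asarray(emb, dtype=float)
    emb = emb / np.linalg.norm(emb)
    if not subclusters:
        subclusters.append({"sum": emb.copy(), "n": 1, "cluster": 0})
        return 0

    centers = [s["sum"] / np.linalg.norm(s["sum"]) for s in subclusters]
    sims = [float(np.dot(emb, c)) for c in centers]
    best = int(np.argmax(sims))

    if sims[best] >= sub_threshold:
        # Close enough: absorb into the most similar subcluster.
        subclusters[best]["sum"] += emb
        subclusters[best]["n"] += 1
        return subclusters[best]["cluster"]

    # Otherwise start a new subcluster; attach it to the nearest
    # cluster only if the similarity still clears the cluster threshold.
    if sims[best] >= cluster_threshold:
        cluster = subclusters[best]["cluster"]
    else:
        cluster = max(s["cluster"] for s in subclusters) + 1
    subclusters.append({"sum": emb.copy(), "n": 1, "cluster": cluster})
    return cluster
```

The two-level structure (subclusters within clusters) is what allows the algorithm to model anisotropic cluster shapes that the naive single-centroid clusterer cannot.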
3.3 K-means offline clustering
As in many diarization systems [18, 3, 19], we integrate the K-means clustering algorithm into our system, using K-means++ for initialization [20]. To determine the number of speakers $k$, we use the “elbow” of the derivatives of conditional Mean Squared Cosine Distances (MSCD)^1 between each embedding and its cluster centroid:

$\tilde{k} = \operatorname*{arg\,max}_{k \geq 2} \, |\mathrm{MSCD}'(k)|$ (5)

^1 We define cosine distance as $d(\mathbf{x}, \mathbf{y}) = (1 - \cos(\mathbf{x}, \mathbf{y})) / 2$.
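The elbow selection can be sketched as follows, assuming K-means has already been run for each candidate number of clusters (so labels and centroids are given); a discrete difference stands in for the derivative of MSCD:

```python
import numpy as np

def cosine_distance(x, y):
    """d(x, y) = (1 - cos(x, y)) / 2, in [0, 1]."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return (1.0 - cos) / 2.0

def mscd(embeddings, labels, centroids):
    """Mean squared cosine distance of each embedding to its centroid."""
    d = [cosine_distance(e, centroids[l]) ** 2
         for e, l in zip(embeddings, labels)]
    return float(np.mean(d))

def elbow_num_speakers(mscd_by_k):
    """Pick k at the 'elbow': the largest drop of MSCD(k) wrt k.

    mscd_by_k: dict {k: MSCD value} for k = 1, 2, ..., K.
    """
    ks = sorted(mscd_by_k)
    # Discrete derivative MSCD(k-1) - MSCD(k); the largest drop marks the elbow.
    drops = {k: mscd_by_k[k_prev] - mscd_by_k[k]
             for k_prev, k in zip(ks, ks[1:])}
    return max(drops, key=drops.get)
```

MSCD decreases monotonically as $k$ grows; the elbow is the last point where adding a cluster still buys a large reduction.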
3.4 Spectral offline clustering
Our spectral clustering algorithm consists of the following steps:

Construct the affinity matrix $A$, where $A_{ij}$ is the cosine similarity between the $i$-th and $j$-th segment embeddings when $i \neq j$, and the diagonal elements are set to the maximal non-diagonal value of their row: $A_{ii} = \max_{j \neq i} A_{ij}$.

Apply the following sequence of refinement operations on the affinity matrix :

Gaussian Blur: blur the matrix with a Gaussian kernel of standard deviation $\sigma$;

Row-wise Thresholding: for each row, set elements smaller than this row's $p$-percentile to 0;

Symmetrization: $Y_{ij} = \max(X_{ij}, X_{ji})$;

Diffusion: $Y = X X^{T}$;

Row-wise Max Normalization: $Y_{ij} = X_{ij} / \max_{k} X_{ik}$.
These refinements act to both smooth and denoise the data in the similarity space, as shown in Fig. 2, and are crucial to the success of the algorithm. The refinements are based on the temporal locality of speech data: contiguous speech segments should have similar embeddings, and hence similar values in the affinity matrix.
We now provide the intuition behind each of these operations. The Gaussian blur acts to smooth the data and reduce the effect of outliers. Row-wise thresholding serves to zero out affinities between embeddings belonging to two different speakers. Symmetrization restores matrix symmetry, which is crucial to the spectral clustering algorithm. The diffusion step draws inspiration from the Diffusion Maps algorithm [21], and serves to sharpen the matrix, resulting in clear boundaries between the sections of the affinity matrix belonging to distinct speakers. Finally, the row-wise max normalization serves to rescale the spectrum of the matrix, ensuring undesirable scale effects do not occur during the subsequent spectral clustering step.
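The five refinement operations can be sketched in NumPy as follows; a minimal sketch with hypothetical values for the blur width σ and the thresholding percentile p, and a naive truncated convolution standing in for a library Gaussian blur:

```python
import numpy as np

def gaussian_blur(X, sigma=1.0, radius=2):
    """Truncated 2D Gaussian blur via direct convolution (edge-padded)."""
    ax = np.arange(-radius, radius + 1)
    k1 = np.exp(-ax**2 / (2 * sigma**2))
    kernel = np.outer(k1, k1)
    kernel /= kernel.sum()
    n = X.shape[0]
    padded = np.pad(X, radius, mode="edge")
    out = np.empty_like(X, dtype=float)
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(
                padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1] * kernel)
    return out

def refine_affinity(A, sigma=1.0, percentile=95):
    """Apply the five refinement operations to an affinity matrix A."""
    X = gaussian_blur(A, sigma=sigma)                        # 1. Gaussian blur
    thresh = np.percentile(X, percentile, axis=1, keepdims=True)
    X = np.where(X < thresh, 0.0, X)                         # 2. Row-wise thresholding
    X = np.maximum(X, X.T)                                   # 3. Symmetrization
    X = X @ X.T                                              # 4. Diffusion
    X = X / X.max(axis=1, keepdims=True)                     # 5. Row-wise max normalization
    return X
```

After the final step every row has maximum 1, which normalizes the scale of the matrix before the eigen-decomposition.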


After all refinement operations have been applied, we perform eigen-decomposition on the refined affinity matrix. Let the eigenvalues, sorted in descending order, be $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. We use the maximal eigen-gap to determine the number of clusters $\tilde{k}$:

$\tilde{k} = \operatorname*{arg\,max}_{1 \leq k \leq n-1} \, \lambda_k / \lambda_{k+1}$ (6)

Let the eigenvectors corresponding to the $\tilde{k}$ largest eigenvalues be $\mathbf{v}_1, \dots, \mathbf{v}_{\tilde{k}}$. We replace the $i$-th segment embedding by the $i$-th components of these eigenvectors: $\mathbf{e}_i = [\mathbf{v}_1(i), \dots, \mathbf{v}_{\tilde{k}}(i)]$. Then we use the same K-means algorithm as in Sec. 3.3 to cluster these new embeddings and produce speaker labels.
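The eigen-gap step can be sketched as follows; a minimal sketch that symmetrizes the refined matrix before the eigen-decomposition (the row-wise normalization leaves it slightly asymmetric) and caps the candidate cluster count with a hypothetical `max_clusters` parameter:

```python
import numpy as np

def spectral_embeddings(refined_A, max_clusters=10):
    """Eigen-gap speaker count + spectral embeddings from a refined affinity matrix."""
    # Symmetrize before eigh (row-wise normalization breaks exact symmetry).
    S = (refined_A + refined_A.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]          # sort eigenpairs descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Number of clusters: maximal eigen-gap ratio lambda_k / lambda_{k+1}.
    upper = min(max_clusters, len(eigvals) - 1)
    ratios = eigvals[:upper] / np.maximum(eigvals[1:upper + 1], 1e-12)
    k = int(np.argmax(ratios)) + 1

    # New embedding for segment i: the i-th components of the top-k eigenvectors.
    return k, eigvecs[:, :k]
```

The returned embeddings would then be clustered with the same K-means procedure described in Sec. 3.3.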
3.5 Discussion
Speech data analysis is an extremely challenging problem domain, and conventional clustering algorithms such as K-means often perform poorly. This is due to a number of unfortunate properties inherent to speech data, including:

Non-Gaussian Distributions: Speech data are often non-Gaussian. In this setting, the centroid of a cluster (central to K-means clustering) is not a sufficient representation.

Cluster Imbalance: In speech data, it is often the case that one speaker speaks often, while other speakers speak rarely. In this setting, K-means may incorrectly split large clusters into several smaller ones.

Hierarchical Structure: Speakers fall into various groups according to gender, age, accent, etc. This structure is problematic because the difference between a male and a female speaker is much larger than the difference between two female speakers, which makes it difficult for K-means to distinguish between clusters corresponding to groups and clusters corresponding to distinct speakers. In practice, this often causes K-means to incorrectly cluster all embeddings corresponding to male speakers into one cluster, and all embeddings corresponding to female speakers into another.
The problems caused by these properties are not limited to K-means clustering, but are endemic to most parametric clustering algorithms. Fortunately, they can be mitigated by employing a non-parametric connection-based clustering algorithm such as spectral clustering.
4 Experiments
Table 1: DER (%) on CALLHOME American English Eval and NIST RT03 English CTS Eval.

Clustering | Embedding |   CALLHOME American English Eval   |    NIST RT03 English CTS Eval
           |           | Confusion    FA   Miss   Total     | Confusion    FA   Miss   Total
-----------|-----------|------------------------------------|-------------------------------
Naive      | i-vector  |     26.41  2.40   3.55   32.36     |     35.35  4.66   2.62   42.63
           | d-vector  |     12.41  1.94   4.51   18.87     |     18.76  4.09   4.45   27.30
Links      | i-vector  |     25.40  2.40   3.55   31.36     |     33.56  4.66   2.62   40.84
           | d-vector  |     11.02  1.94   4.51   17.47     |     18.56  4.09   4.45   27.10
K-means    | i-vector  |     22.86  2.40   3.55   28.81     |     24.38  4.66   2.62   31.66
           | d-vector  |      7.29  1.94   4.51   13.75     |      7.80  4.09   4.45   16.34
Spectral   | i-vector  |     14.59  2.40   3.55   20.54     |     13.84  4.66   2.62   21.12
           | d-vector  |      6.03  1.94   4.51   12.48     |      3.76  4.09   4.45   12.30
4.1 Models
We run experiments with all combinations of the i-vector and d-vector models with the four clustering algorithms discussed in Sec. 3. Both models are trained on an anonymized collection of voice search queries, containing around 36M utterances from 18K speakers.
The i-vector model is trained using 13 PLP coefficients with delta and delta-delta coefficients. The GMM-UBM includes 512 Gaussians, and the total variability matrix includes 100 eigenvectors. The final i-vectors are reduced to 50 dimensions using LDA.
The d-vector model is a 3-layer LSTM network with a final linear layer. Each LSTM layer has 768 nodes, with a projection layer [22] of 256 nodes.
Our Voice Activity Detection (VAD) model is a very small GMM using the same PLP features as the i-vector model. It has only two full-covariance Gaussians: one for speech, and one for non-speech. We found that this simple VAD generalizes better across domains (from voice search queries to telephone speech) for diarization than CLDNN-based [23] VAD models.
4.2 Datasets
We report Diarization Error Rates (DER) on three standard public datasets: (1) CALLHOME American English [24] (LDC97S42 + LDC97T14); (2) 2003 NIST Rich Transcription (LDC2007S10), the English conversational telephone speech (CTS) part; and (3) 2000 NIST Speaker Recognition Evaluation (LDC2001S97), Disk8.
The first two datasets are English-only and relatively small, so we use them to compare the different algorithms.
The third dataset is used by most diarization papers, and is usually directly referred to as “CALLHOME” in literature. It contains 500 utterances distributed across six languages: Arabic, English, German, Japanese, Mandarin, and Spanish.
4.3 Experiment setup
Our diarization evaluation tool is based on the pyannote.metrics library [25].
The CALLHOME American English dataset has a default division of 20 Dev utterances vs. 20 Eval utterances. For NIST RT03 CTS, we randomly divide the 72 utterances into a 14-utterance Dev set and a 58-utterance Eval set. For each diarization system, we tune parameters such as the Voice Activity Detector (VAD) threshold, the LSTM window size/step (Fig. 1), and the clustering parameters on the Dev set, and report the DER on the Eval set.
For NIST RT03 CTS, we only report DERs based on the provided un-partitioned evaluation map (UEM) files. For the other two datasets, as is the standard convention in the literature [2, 3, 4, 6, 14, 26], we tolerate errors of less than 250ms in locating segment boundaries.
As is typical, for each audio file, multiple channels are merged into a single channel [3, 6, 19], and we do not process the parts that are before the first annotation or after the last annotation. Additionally, as is standard in literature, we exclude overlapped speech (multiple speakers speaking at the same time) from our evaluation. For offline clustering algorithms, we constrain the system to produce at least 2 speakers.
4.4 Results
Our experimental results are shown in Tables 1, 2 and 3. We report the total DER together with its three components: False Alarm (FA), Miss, and Confusion. FA and Miss arise mostly from Voice Activity Detection errors, and partly from the aggregation of frame-level i-vectors or window-level d-vectors into segments. The FA and Miss differences between i-vector and d-vector systems are due to their different window sizes/steps and aggregation logic.
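As a toy illustration of how these components compose (this is not the evaluation protocol used in our experiments, which relies on pyannote.metrics with collar and overlap handling), a frame-level DER can be computed from per-frame labels once a hypothesis-to-reference speaker mapping is fixed:

```python
def frame_der(ref, hyp, speaker_map):
    """Toy frame-level DER: ref/hyp are per-frame labels, None = non-speech.

    speaker_map maps hypothesis speaker ids to reference ids; a real
    evaluation finds the optimal mapping (e.g. Hungarian algorithm).
    """
    fa = miss = conf = 0
    ref_speech = sum(1 for r in ref if r is not None)
    for r, h in zip(ref, hyp):
        if r is None and h is not None:
            fa += 1          # false alarm: hypothesized speech where there is none
        elif r is not None and h is None:
            miss += 1        # missed speech
        elif r is not None and h is not None and speaker_map.get(h) != r:
            conf += 1        # speaker confusion
    return {"FA": fa / ref_speech, "Miss": miss / ref_speech,
            "Confusion": conf / ref_speech,
            "Total": (fa + miss + conf) / ref_speech}
```

By construction Total = FA + Miss + Confusion, which is why the Total columns in our tables equal the sum of the three components.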
In Table 1, we can see that d-vector based diarization systems significantly outperform i-vector based systems. For d-vector systems, the optimal sliding window size and step are 240ms and 120ms, respectively.
We also observe that, as expected, offline diarization produces significantly better results than online diarization. Specifically, online diarization predicts the incorrect number of speakers much more frequently than offline diarization. This problem could potentially be mitigated by the addition of a “burn-in” stage before entering the online mode.
In Table 2, we compare our d-vector + spectral clustering system with others’ work on the same dataset. Though our LSTM model is trained entirely on out-of-domain, English-only data, we still achieve state-of-the-art performance on this multilingual dataset. The performance could potentially be further improved by using in-domain training data and adding a final resegmentation step.
Additionally, in Table 3, we follow the same practice as [26] to evaluate our system on a subset of 109 utterances from CALLHOME American English that have 2 speakers (called CH109 in [19]). The number of speakers is fixed at 2 for this evaluation.
Table 2: DER (%) on NIST SRE 2000 CALLHOME (LDC2001S97, Disk-8).

Method                   |   Confusion |  FA | Miss | Total
-------------------------|-------------|-----|------|------
d-vector + spectral      |        12.0 | 2.2 |  4.6 |  18.8
Castaldo et al. [2]      |        13.7 |   — |    — |     —
Shum et al. [3]          |        14.5 |   — |    — |     —
Senoussaoui et al. [4]   |        12.1 |   — |    — |     —
Sell et al. [6] (+VB)    | 13.7 (11.5) |   — |    — |     —
Romero et al. [14] (+VB) |  12.8 (9.9) |   — |    — |     —
Table 3: DER (%) on CH109, the 2-speaker subset of CALLHOME American English.

Method              | Confusion |   FA | Miss | Total
--------------------|-----------|------|------|------
d-vector + spectral |      5.97 | 2.51 | 4.06 | 12.54
Zajíc et al. [26]   |      7.84 |    — |    — |     —
4.5 Discussion
Though we listed DER metrics from different papers in Table 2 and 3, we find that it is difficult to fully align these numbers, an unfortunately common problem in the diarization community. This is due primarily to the large number of moving parts required for a functional diarization pipeline. For example, different teams use different Voice Activity Detection marks (not publicly available), different training datasets, and different Dev sets for parameter tuning.
The evaluation protocols and software also differ from paper to paper. Most teams exclude FA and Miss from their evaluations, and directly refer to Confusion as their DER. However, we observed that a poor VAD with a high Miss rate usually filters out the difficult parts of the speech, and makes the clustering problem much easier. Some papers, such as [19], use the non-standard Speaker Clustering Errors (in frame percentage) as their metric, and also exclude FA and Miss from this error. Additionally, it is unclear how overlapped speech is handled in some papers.
In our experiments, we do our best to ensure the comparisons are as fair as possible, and avoid tuning parameters on Eval sets.
5 Conclusions
In this paper, we built on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combined LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. We conducted experiments on four clustering algorithms combined with both i-vectors and d-vectors, and reported the performance on three standard public datasets: CALLHOME American English, NIST RT03 English CTS, and NIST SRE 2000. In general, we observed that d-vector based systems achieve significantly lower DER than i-vector based systems.
6 Acknowledgements
We would like to thank Dr. Hervé Bredin for the continuous support with the pyannote.metrics library. We would like to thank Dr. Gregory Sell and Prof. Pietro Laface for helping us understand the evaluation datasets. We would like to thank Yash Sheth and Richard Rose for the helpful discussions.
References
 [1] Patrick Kenny, Douglas Reynolds, and Fabio Castaldo, “Diarization of telephone conversations using factor analysis,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, pp. 1059–1070, 2010.
 [2] Fabio Castaldo, Daniele Colibro, Emanuele Dalmasso, Pietro Laface, and Claudio Vair, “Stream-based speaker segmentation using speaker factors and eigenvoices,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 2008, pp. 4133–4136.
 [3] Stephen H Shum, Najim Dehak, Réda Dehak, and James R Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.
 [4] Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis, and Pierre Dumouchel, “A study of the cosine distance-based mean shift for telephone speech diarization,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 1, pp. 217–227, 2014.
 [5] Gregory Sell and Daniel Garcia-Romero, “Speaker diarization with PLDA i-vector scoring and unsupervised calibration,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 413–417.
 [6] Gregory Sell and Daniel GarciaRomero, “Diarization resegmentation in the factor analysis subspace,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4794–4798.
 [7] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4052–4056.
 [8] Yuhsin Chen, Ignacio Lopez-Moreno, Tara N Sainath, Mirkó Visontai, Raziel Alvarez, and Carolina Parada, “Locally-connected and convolutional neural networks for small footprint speaker recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [9] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5115–5119.
 [10] F A Rezaur Rahman Chowdhury, Quan Wang, Li Wan, and Ignacio Lopez Moreno, “Attention-based models for text-dependent speaker verification,” arXiv preprint arXiv:1710.10470, 2017.
 [11] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” arXiv preprint arXiv:1710.10467, 2017.
 [12] Guoguo Chen, Carolina Parada, and Georg Heigold, “Small-footprint keyword spotting using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4087–4091.
 [13] Rohit Prabhavalkar, Raziel Alvarez, Carolina Parada, Preetum Nakkiran, and Tara N Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4704–4708.
 [14] Daniel GarciaRomero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, “Speaker diarization using deep neural network embeddings,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4930–4934.
 [15] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [16] Ulrike Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.
 [17] Huazhong Ning, Ming Liu, Hao Tang, and Thomas S Huang, “A spectral clustering approach to speaker diarization,” in INTERSPEECH, 2006.
 [18] Oshry BenHarush, Ortal BenHarush, Itshak Lapidot, and Hugo Guterman, “Initialization of iterativebased speaker diarization systems for telephone conversations,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 414–425, 2012.
 [19] Dimitrios Dimitriadis and Petr Fousek, “Developing online speaker diarization system,” in INTERSPEECH, 2017.
 [20] David Arthur and Sergei Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
 [21] Ronald R Coifman and Stéphane Lafon, “Diffusion maps,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 5–30, 2006.
 [22] Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [23] Rubén Zazo Candil, Tara N Sainath, Gabor Simko, and Carolina Parada, “Feature learning with raw-waveform CLDNNs for voice activity detection,” 2016.
 [24] A Canavan, D Graff, and G Zipperlen, “CALLHOME American English speech LDC97S42,” LDC Catalog. Philadelphia: Linguistic Data Consortium, 1997.
 [25] Hervé Bredin, “pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems,” in INTERSPEECH, 2017.
 [26] Zbynĕk Zajíc, Marek Hrúz, and Ludĕk Müller, “Speaker diarization using convolutional neural network for statistics accumulation refinement,” in INTERSPEECH, 2017.