End-to-End Neural Speaker Diarization with Permutation-Free Objectives

End-to-End Neural Speaker Diarization with Permutation-Free Objectives


In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduces a permutation-free objective function to directly minimize diarization errors without being suffered from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of the benefit, our model can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved diarization error rate of 12.28%, while a conventional clustering-based system produced diarization error rate of 28.77%. Furthermore, the domain adaptation with real-recorded speech provided 25.6% relative improvement on the CALLHOME dataset. Our source code is available online at https://github.com/hitachi-speech/EEND.


Yusuke Fujita thanks: The first author performed the work while at Center for Language and Speech Processing, Johns Hopkins University as a Visiting Scholar. , Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe \address Hitachi, Ltd. Research & Development Group, Japan
Center for Language and Speech Processing, Johns Hopkins University, USA \email{yusuke.fujita.su, naoyuki.kanda.kn, shota.horiguchi.wk, kenji.nagamatsu.dm}@hitachi.com, shinjiw@ieee.org

Index Terms: end-to-end speaker diarization, permutation-free scheme, overlapping speech, neural network

1 Introduction

Speaker diarization is the process of partitioning speech segments according to the speaker identity. It is an important process for a wide variety of applications such as information retrieval from broadcast news, meetings, and telephone conversations [Tranter2006, Anguera2012]. It also helps automatic speech recognition performance in multi-speaker conversation scenarios in meetings (ICSI [Janin03, etin2006OverlapIM], AMI [Renals2008, Kanda2019ICASSP]) and home environments (CHiME-5 [Barker2018, Du2018, Boeddecker2018, Kanda2018, Kanda2019ICASSP]).

Typical speaker diarization systems are based on extraction and clustering of speaker representations [Meignier2010LIUMSA, Shum2013, Sell2014, Senoussaoui2014, Dimitriadis2017, Romero2017, Maciejewski2018CharacterizingPO, Wang2018LSTM]. The system first extracts speaker representations such as i-vectors [Dehak2011, Shum2013, Sell2014, Maciejewski2018CharacterizingPO], d-vectors[Wan2018, Wang2018LSTM], or x-vectors [Snyder2018, Romero2017]. Then, the speaker representations of short segments are partitioned into speaker clusters. Various clustering algorithms have been adopted, such as Gaussian mixture models [Meignier2010LIUMSA, Shum2013], agglomerative hierarchical clustering [Meignier2010LIUMSA, Sell2014, Romero2017, Maciejewski2018CharacterizingPO], mean shift [Senoussaoui2014], k-means [Dimitriadis2017, Wang2018LSTM], Links [Mansfield2018, Wang2018LSTM], and spectral clustering [Wang2018LSTM]. These clustering-based diarization methods have shown to be effective in various datasets (see the DIHARD challenge 2018 activities, e.g., [Sell2018dihard, Diez2018, Sun2018]).

However, there are two problems in the clustering-based method. Firstly, the clustering-based method implicitly assumes one speaker per segment, so it is difficult to deal with speaker-overlapping speech. Secondly, it cannot be optimized to minimize diarization errors directly because the clustering is performed in an unsupervised manner.

To deal with speaker-overlapping speech, a neural network based source separation model was recently proposed [Neumann2019]. The model separates one speaker’s time-frequency mask in one iteration, and separates another speaker’s mask in another iteration. Utilizing the source separation technique, speaker diarization is realized even in overlapping speech. However, their source separation training objective does not necessarily minimize diarization errors. Aiming at the speaker diarization problem, it will be better to use a diarization error-oriented objective function. Moreover, there is another drawback in their method that real multi-speaker recordings cannot be used for training, because their model requires clean, non-overlapping reference speech for training.

For the optimization based on diarization errors, a fully supervised diarization method has been proposed [Zhang2018]. This method formulates the speaker diarization problem based on a factored probabilistic model, which consists of modules for speaker change, speaker assignment and feature generation. However, in their method, the speaker-change model assumes one speaker for each segment, which hinders the application of the method for speaker-overlapping speech.

In this paper, we propose a novel end-to-end neural network-based speaker diarization model (EEND). In contrast to previous techniques, the EEND can both deal with overlapping speech as well as be trained directly to minimize diarization errors. Given an audio recording with utterances by multiple speakers, our recurrent neural network estimates joint speech activities of all speakers frame-by-frame. This model is categorized as multi-label classification, similar to the well-known method in sound event detection (SED) [Adavanne2017]. Unlike SED, the output speaker labels are ambiguous in the training stage, i.e. not corresponding to any fixed class, which is known in the source separation research field as a permutation problem. To solve the problem, we introduce a permutation-free scheme [Hershey2016, Yu2017] into the training objective function. The model is trained in an end-to-end fashion using the objective function that provides minimal diarization errors.

The EEND has various advantages over the conventional methods. Firstly, the EEND can explicitly handle overlapping speech by simply feeding overlapping speech as input during training and inference. Secondly, the EEND does not require separate modules for speech activity detection, speaker identification, source separation, or clustering. The proposed model integrates their functionality into a single neural network. Thirdly, unlike the source separation model, the EEND does not require clean, non-overlapping speech for training the model with synthetic conversational mixtures. This enables the use of domain adaptation using real overlapping speech conversations.

2 Proposed Method

2.1 Neural probabilistic model of speaker diarization

Speaker diarization is the process of partitioning speech segments according to the speaker identity. In other words, speaker diarization determines “who spoke when.” We formulate the speaker diarization task as a multi-label classification problem. It can be formulated as follows:

Given an observation sequence from an audio signal, estimate the speaker label sequence . Here, is a -dimensional observation feature vector at time index . Speaker label denotes a joint activity for multiple () speakers at time index . For example, and represent an overlap situation of both speakers and being present at time index . Thus, determining is a sufficient condition to determine the speaker diarization information.

The most probable speaker label sequence is estimated among all possible speaker label sequences , as follows:


can be factorized using conditional independence assumption as follows:


Here, we assume the frame-wise posterior is conditioned on all inputs, and each speaker is present independently.

The frame-wise posterior is modeled with bi-directional long short-term memory (BLSTM), as follows:


where is a BLSTM layer which accepts an input sequence and outputs -dimensional hidden activations at time index .111It is a concatenated vector of -dimensional forward and backward LSTM outputs. We use -layer stacked BLSTMs. The frame-wise posteriors is calculated from using a fully-connected layer and the element-wise sigmoid function .

The difficulty on training of the model described above is that the model have to deal with the speaker permutations: changing an order of speakers within a correct label sequence is also regarded as correct. An example of the permutations in a two-speaker case is shown in Fig. 1. In this paper, we call this the label ambiguity. This label ambiguity obstructs the training of the neural network when we just use a standard binary cross entropy loss function.

Figure 1: Two-speaker end-to-end neural speaker diarization (EEND) model trained with the PIT loss and the DPCL loss

To cope with the label ambiguity problem, we introduce two permutation-free loss functions as shown in Fig. 1. The first loss function is the permutation-invariant training (PIT) loss function, which is used for considering all the permutations of ground-truth speaker labels. The second loss function is the Deep Clustering (DPCL) loss function, which is used for encouraging hidden activations of the network to be speaker-discriminative representations. Note that the use of multi-label classification model is similar to the well-known SED method [Adavanne2017]. However, in contrast to the SED, the multi-label classification model used for speaker diarization suffers from the label ambiguity problem. Our contribution is introducing two permutation-free loss functions to cope with the label ambiguity problem.

2.2 Permutation-invariant training loss

The neural network is trained to minimize the error between the output predicted in Eq. 6 and the ground-truth speaker label . Considering that the speaker label has ambiguity of their permutations, we introduce the permutation-free scheme [Hershey2016, Yu2017]. More specifically, we utilize the utterance-level permutation-invariant training (PIT) criterion [Kolbak2017] in the proposed method. We apply the PIT criterion on time sequence of speaker labels instead of time-frequency mask used in [Kolbak2017]. The PIT loss function is written as follows:


where is a set of all the possible permutation of (), and is the -th permutation of the ground-truth speaker label, is the binary cross entropy function between the label and the output.

2.3 Deep Clustering loss

Assuming that the neural network extracts speaker representation in lower layers and then performs segmentation using higher layers, the middle layer activations can be regarded as the speaker representation. Therefore, we introduce a speaker representation learning criterion on the middle layer activations.

Here, the -th layer activations obtained from Eq. 5 are transformed into normalized -dimensional embedding as follows:


where is the element-wise hyperbolic tangent function and is the L2 normalization function. We apply the Deep Clustering (DPCL) loss function [Hershey2016] so that the embeddings are partitioned into speaker-dependent clusters as well as overlapping and non-speech clusters. For example in a two-speaker case, we generate four clusters (Non-speech, Speaker 1, Speaker 2, and Overlapping) as shown in Fig. 1.

DPCL loss function [Hershey2016] is used as follows:


where , and is a matrix for each row represents one-hot vector converted from where those elements are in the power set of speakers. is the Frobenius norm. The loss function encourages the two embeddings at different time indices to be close together if they are in the same cluster and far away if they are in different clusters.

Then we use multi-objective training introducing a mixing parameter :


Thus, we derive end-to-end neural speaker diarization with the above permutation-free objective.

3 Experiments

3.1 Data

This paper mainly conducted our experiments with simulated speech mixtures to verify the effectiveness of the proposed method for controlled overlap situations. Each mixture is simulated by Algorithm 1. Unlike the existing mixture simulation for source separation studies [Hershey2016], we consider a diarization-style mixture simulation: each speech mixture should have dozens of utterances per speaker with reasonable silence intervals between utterances. The silence intervals are controlled by the average interval . Larger values generate speech with less overlap. We show performance for differing overlap ratio controlled by in the result section Sec.3.5.

Input :    // Set of speakers, noises, RIRs and SNRs
  // Set of utterance lists
  // #speakers per mixture
  // Max. and min. #utterances per speaker
  // average interval
Output :   // mixture
2 Sample a set of speakers from
3   // Set of speakers’ signals
4 forall  do
5         // Concatenated signal
6       Sample from   // RIR
7       Sample from for  to  do
8             Sample   // Interval
14 Sample from   // Background noise
15 Sample from   // SNR
16 Determine a mixing scale from and
17 repeat until reach the length of
Algorithm 1 Mixture simulation.

The set of utterances used for the simulation is comprised of Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, Part2), and NIST Speaker Recognition Evaluation datasets (2004, 2005, 2006, 2008). All recordings are telephone speech sampled at 8 kHz. Total number of speakers in these corpora is 6,381. We split them into 5,743 speakers for the training set and 638 speakers for the test set. Since there is no time annotations in these corpora, we extract utterances using speech activity detection (SAD) based on time-delay neural networks and statistics pooling222The SAD model: http://kaldi-asr.org/models/m4. This data preparation and SAD is performed using Kaldi speech recognition toolkit [Povey_ASRU2011].

The set of background noises is from MUSAN corpus [Snyder2015]. We used 37 recordings which are annotated as “background” noises. The set of room impulse responses (RIRs) is the Simulated Room Impulse Response Database used in [Ko2017]. The total number of RIRs is 10,000. The SNR values are sampled from 10, 15, and 20 dBs. We generated two-speaker mixtures for each speaker have 20-40 utterances (). We used differing number of mixtures for the training set, and 500 mixtures for the test set.

3.2 Experimental setup

We extracted 23-dimensional log-Mel-filterbank features with 25 ms frame length and 10 ms frame shift. Each features are concatenated with those from the previous 7 frames and subsequent 7 frames. To deal with a long audio sequence in our BLSTM, we subsampled the concatenated features by a factor of 10.

For our neural network, we used 5-layer BLSTM with 256 hidden units in each layer. For the DPCL loss, we used the second layer of BLSTM outputs to form 256-dimensional embedding. We used the Adam [Kingma2014] optimizer with initial learning rate of . The batch size was 10. The number of training epoch was 20. Our implementation was based on Chainer [chainer_learningsys2015].

Because the output of the neural network is a probability of speech activity for each speaker, a threshold is required to obtain the decision of speech activity for each frame. We set the threshold to 0.5 for the evaluation on simulated speech mixtures. Furthermore, we apply 11-frame median filtering to prevent the production of unreasonably short segments.

3.3 Performance metric

We evaluated the proposed method with diarization error rate (DER) [NISTRT09]. In many prior studies, DER had not included miss or false alarm errors due to using oracle speech/non-speech labels. Overlapping speech segments had also been excluded from the evaluation. For our DER computation we evaluated all errors, including both non-speech and overlapping speech segments, because the proposed method includes both speech activity detection and overlapping speech detection functionality. As is typical, we use a collar tolerance of 250 ms around both the start and end of each segment.

3.4 Baseline system

We compared the proposed method with the two conventional clustering-based systems [Sell2018dihard]. The i-vector system and the x-vector system were created using the Kaldi CALLHOME diarization recipe333https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization. To evaluate non-speech segments, we used speech segments extracted by SAD as described in Sec. 3.1.

3.5 Results

PIT loss DPCL loss DER (%)
- - 41.74
- 25.14
Table 1: Effect of loss functions evaluated on simulated speech generated with . The models are trained using 10,000 mixtures genarated with .
Number of training mixtures DER(%)
10,000 23.79
20,000 14.66
100,000 12.28
Table 2: Effect of the number of training mixtures evaluated on simulated speech generated with . The models are trained with .
i-vector 33.74 25.82 1.05 6.88
x-vector 28.77 25.82 1.05 1.90
EEND (proposed) 12.28 4.47 5.20 2.61
Table 3: Detailed DERs (%) evaluated on simulated speech generated with the . MI, FA and CF denote miss, false alarm and confusion error rates, respectively. The proposed model is trained using 100,000 mixtures generated with .
Evaluation set Simulated mixtures CALLHOME
2 3 5 -
overlap ratio (%) 27.3 19.1 11.1 11.8
i-vector 33.74 30.43 25.96 12.10
x-vector 28.77 24.46 19.78 11.53
EEND 12.28 14.36 19.69 23.07 (31.01)
Table 4: DERs (%) on different overlapping conditions. For the evaluation on simulated mixtures, the proposed model is trained using 100,000 mixtures generated with . For the evaluation on the CALLHOME dataset, the proposed model is trained with 26,712 telephone recordings. The DER without the domain adaptation is shown in the parenthesis.

We evaluated the effect of the proposed loss functions. Without PIT loss, we used binary cross entropy loss with the fixed permutation444We sorted the speaker names in a lexical order to obtain the fixed permutation.. With PIT and DPCL losses, we set the mixing parameter . The results are shown in Table 1. It is observed that PIT loss is essential for training of our neural network. It also demonstrates that DPCL loss helps improve performance.

The comparison with different numbers of training mixtures is shown in Table 2. It is observed that increasing the number of training samples improves the performance. Because our proposed method can be trained with any speech mixture with corresponding time annotations, it is possible to utilize large scale speech corpora for improving robustness of the system.

We compared the proposed method with the baseline systems using the simulated speech mixtures. The results are shown in Table 3. It is observed that miss rate is dominant in the DER of the baseline systems, due to the lack of capability for overlapping speech. In contrast, the proposed method achieved significantly low miss rate. The results indicate that the proposed method successfully detects overlapping segments as well as single-speaker segments and silence segments. Regarding the confusion error rate, the proposed method is better than the i-vector system, while it is worse than the x-vector system. For reducing the confusion errors, it is possible to use data augmentation for learning noise and speaker variations, which is utilized in the x-vector system.

To investigate the robustness to variable conditions, we evaluated different overlap ratio controlled by the average interval . Larger values generate less overlapping speech. The results with different overlapping ratios are shown in Table 4. The baseline systems show better performance on less overlapping speech as expected. However, the proposed method unexpectedly showed degraded performance on less overlapping speech. The result suggests that the network had overfit to the specific overlap ratio: 27.3%. Investigation with various overlap ratio settings of training data is among our future work.

In addition, we evaluated the proposed method on real telephone conversations using the CALLHOME dataset. We split two-speaker recordings from the CALLHOME dataset into two subsets: an adaptation set of 155 recordings and a test set of 148 recordings. Our neural network was trained with a set of 26,172 two-speaker recordings from telephone speech recordings as described in Sec.3.1. The overlap ratio of the training data was 5.8%. Then, it was retrained with the adaptation set. For this retraining, we used the Adam optimizer with initial learning rate of and ran 5 epochs. For the postprocessing, we adjusted the threshold to 0.6 so that the DER of the adaptation set has the minimum value. Table 4 shows the DERs evaluated on the CALLHOME test set. Unfortunately, the proposed method produces worse DER than the baseline systems. This is likely because our training set has very different overlap ratio (5.8%) from the CALLHOME test set (11.8%). To reduce this condition mismatch, we tried domain adaptation. The result showed a significant DER reduction. The relative improvement introduced by domain adaptation was 25.6%. Although the DER of the proposed method was still behind those of the baseline systems, we expect it will be much improved by developing better simulation techniques of training data or just by feeding more real data, as was suggested by the result with simulated data. We will address these directions in our future work.

4 Conclusion

We proposed an end-to-end neural speaker diarization method that is directly optimized with a diarization-error-oriented objective. The experimental results show that the proposed method outperforms conventional clustering-based methods evaluated on simulated speech mixtures. Furthermore, domain adaptation with real speech data achieved a significant DER reduction on the CALLHOME dataset.

5 Acknowledgements

We would like to thank Matthew Maciejewski and Xuankai Chang for their comments that greatly improved the manuscript.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description