Pairwise Discriminative Neural PLDA for Speaker Verification


Abstract

The state-of-the-art approach to speaker verification involves the extraction of discriminative embeddings like x-vectors, followed by a generative back-end based on probabilistic linear discriminant analysis (PLDA). In this paper, we propose a pairwise neural discriminative model for the task of speaker verification which operates on a pair of speaker embeddings, such as x-vectors/i-vectors, and outputs a score that can be considered a scaled log-likelihood ratio. We construct a differentiable cost function which approximates the speaker verification loss, namely the minimum detection cost. The pre-processing steps of linear discriminant analysis (LDA), unit length normalization and within-class covariance normalization are all modeled as layers of a neural model, so that the speaker verification cost function can be back-propagated through these layers during training. We also explore regularization techniques to prevent overfitting, which is a major concern when using discriminative back-end models for verification tasks. The experiments are performed on the NIST SRE 2018 development and evaluation datasets. We observe average relative improvements of 8% on the CMN2 condition and 30% on the VAST condition over the PLDA baseline system.

Shreyas Ramoji, Prashant Krishnan V, Prachi Singh, Sriram Ganapathy

Learning and Extraction of Acoustic Patterns (LEAP) Lab,
Department of Electrical Engineering, Indian Institute of Science, Bengaluru

Index Terms: X-vectors, PLDA, Neural PLDA, Soft Detection Cost, Speaker Verification.

1 Introduction

The earliest successful approach to speaker recognition used Gaussian mixture models (GMMs) estimated from training data, followed by adaptation using the maximum-a-posteriori (MAP) rule [18]. The development of i-vectors as fixed-dimensional front-end features for speaker recognition tasks was introduced in [12, 8]. Recently, neural network embeddings trained on a speaker discrimination task have also been derived as features to replace i-vectors. These features, called x-vectors [21], were shown to perform better than i-vectors for speaker recognition [14].

Following the extraction of x-vectors/i-vectors, several pre-processing steps are employed to transform the embeddings. The common steps include linear discriminant analysis (LDA) [8], unit length normalization [10] and within-class covariance normalization (WCCN) [11]. The transformed vectors are modeled with probabilistic linear discriminant analysis (PLDA) [13]. The PLDA model computes a log-likelihood ratio from a pair of enrollment and test embeddings, which is used to decide whether the given trial is a target or a non-target.

In this paper, we propose a neural back-end model which jointly performs pre-processing and scoring. It operates on pairs of x-vector embeddings (an enrollment and a test x-vector) and outputs a score used to decide between the target and non-target hypotheses. Implementing the model with neural layers allows the entire back-end to be learnt using a speaker verification cost. Conventional cost functions like binary cross entropy tend to overfit the model to the training speakers, leading to poor performance on evaluation sets. To avoid this, we use the NIST SRE normalized detection cost [20] to optimize the neural back-end model. With several experiments on the NIST SRE 2018 development and evaluation datasets, we show that the proposed approach improves significantly over the state-of-the-art x-vector based PLDA system.

The rest of the paper is organized as follows. In Section 2, we highlight relevant prior work on discriminative back-ends for speaker verification. Section 3 describes the front-end configuration used for feature processing and x-vector extraction. Section 4 describes the proposed neural network architecture and its connection with the generative PLDA model. In Section 5, we present a smooth approximation to the NIST SRE detection cost function and discuss regularization methods. This is followed by a discussion of the results in Section 6 and brief concluding remarks in Section 7.

2 Related Prior Work

The common approaches for scoring in speaker verification systems include support vector machines (SVMs) [3], the Gaussian back-end model [15, 1] and probabilistic linear discriminant analysis (PLDA) [13]. Efforts on pairwise generative and discriminative modeling are discussed in [5, 7, 6]. A discriminative version of PLDA with logistic regression and support vector machine (SVM) kernels has also been explored in [2]. In that work, the authors use the functional form of the generative model and pool all trainable parameters into a single long vector. These parameters are then discriminatively trained using the SVM loss function with pairs of input vectors. The discriminative PLDA (DPLDA) is, however, prone to over-fitting on the training speakers, leading to degradation on unseen speakers in SRE evaluations [22]. Regularization of the embedding extractor network using Gaussian back-end scoring has been investigated in [9].

Recently, end-to-end approaches to speaker verification have also been examined. For example, in [19], i-vector extraction with PLDA scoring is cast as a deep neural network architecture and the entire model is trained using a binary cross entropy criterion. The use of triplet loss in end-to-end speaker recognition has shown promise for short utterances [24]. Wan et al. [23] proposed a generalized end-to-end loss that minimizes within-speaker distances to the speaker centroid while maximizing across-speaker distances. In spite of these efforts, most successful systems for SRE evaluations continue to use the generative PLDA back-end model.

In this paper, we argue that the major issue of over-fitting in discriminative back-end systems arises from the choice of model and loss function. In the detection cost metrics ($C_{Min}$ and $C_{Primary}$) for SRE, false-alarm errors are weighted far more heavily than miss errors. Thus, incorporating the SRE evaluation metric directly in the optimization mitigates the over-fitting problem. Further, by training the pre-processing steps along with the scoring module, the model learns representations that are better optimized for the speaker verification task.

3 Speaker Embedding Extractor

In this section, we describe the front-end feature extraction and the x-vector model configuration.

3.1 Training

The x-vector extractor is trained entirely on speech data from the combined VoxCeleb 1 [16] and VoxCeleb 2 [4] corpora. These datasets contain speech extracted from celebrity interview videos available on YouTube, spanning a wide range of ethnicities, accents, professions, and ages. For training the x-vector extractor, we use speech segments from the speakers in VoxCeleb 1 (dev and test) and VoxCeleb 2 (dev).

The x-vector extractor is trained on mel-frequency cepstral coefficients (MFCCs) computed over short overlapping frames with a mel-scale filterbank, following the configuration in [21, 14]. A 5-fold augmentation strategy adds four corrupted copies of each original recording to the training list [21, 14]. The augmentation step substantially increases the number of training segments for the combined VoxCeleb set.

3.2 The x-vector extractor

For x-vector extraction, an extended time-delay neural network (TDNN) with rectified linear unit (ReLU) non-linearities is trained to discriminate among the speakers in the training set [14]. The initial hidden layers operate at the frame level, while the final layers operate at the segment level. A statistics pooling layer between the frame-level and segment-level layers aggregates all frame-level outputs by computing their mean and standard deviation. After training, embeddings are extracted from the affine component of the first segment-level layer. More details regarding the DNN architecture and the training process can be found in [14].

4 Pairwise Discriminative Neural PLDA Back-end

Following the x-vector extraction, the embeddings are centered (mean subtracted), transformed using LDA, and unit length normalized. The PLDA model on the processed x-vector $\boldsymbol{\eta}_r$ of a given recording is

$$\boldsymbol{\eta}_r = \boldsymbol{\Phi}\boldsymbol{\omega} + \boldsymbol{\epsilon}_r \qquad (1)$$

where $\boldsymbol{\eta}_r$ is the x-vector for the given recording, $\boldsymbol{\omega}$ is the latent speaker factor with a prior of $\mathcal{N}(\mathbf{0}, \mathbf{I})$, $\boldsymbol{\Phi}$ characterizes the speaker subspace matrix, and $\boldsymbol{\epsilon}_r$ is the residual, assumed to follow $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$.¹

For scoring, a pair of x-vectors, $\boldsymbol{\eta}_e$ from the enrollment recording and $\boldsymbol{\eta}_t$ from the test recording, is used with the pre-trained PLDA model to compute the log-likelihood ratio score

$$s(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t) = \boldsymbol{\eta}_e^{\top}\mathbf{Q}\,\boldsymbol{\eta}_e + \boldsymbol{\eta}_t^{\top}\mathbf{Q}\,\boldsymbol{\eta}_t + 2\,\boldsymbol{\eta}_e^{\top}\mathbf{P}\,\boldsymbol{\eta}_t + \text{const} \qquad (2)$$

where

$$\mathbf{Q} = \boldsymbol{\Sigma}_{tot}^{-1} - \left(\boldsymbol{\Sigma}_{tot} - \boldsymbol{\Sigma}_{ac}\,\boldsymbol{\Sigma}_{tot}^{-1}\,\boldsymbol{\Sigma}_{ac}\right)^{-1} \qquad (3)$$

$$\mathbf{P} = \boldsymbol{\Sigma}_{tot}^{-1}\,\boldsymbol{\Sigma}_{ac}\left(\boldsymbol{\Sigma}_{tot} - \boldsymbol{\Sigma}_{ac}\,\boldsymbol{\Sigma}_{tot}^{-1}\,\boldsymbol{\Sigma}_{ac}\right)^{-1} \qquad (4)$$

with $\boldsymbol{\Sigma}_{tot} = \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top} + \boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_{ac} = \boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$. In the proposed pairwise discriminative network (Neural PLDA, Fig. 1), we implement the pre-processing step of LDA as the first affine layer, unit length normalization as a non-linear activation, and PLDA centering and diagonalization as a second affine transformation. The final pairwise PLDA scoring of Eq. 2 is implemented as a quadratic layer (Fig. 1). Thus, the Neural PLDA realizes both the pre-processing of the x-vectors and the PLDA scoring as a single neural back-end. The model parameters of the Neural PLDA can be initialized with the baseline system, and these parameters are then learned through backpropagation.
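To make the layer structure concrete, the following is a minimal PyTorch sketch of the Neural PLDA back-end (not the authors' released implementation; see footnote 1 for that). The dimensions `xvec_dim` and `lda_dim` and the parameter initializations are placeholders; in practice, the layers would be initialized from the Kaldi LDA and PLDA matrices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralPLDA(nn.Module):
    def __init__(self, xvec_dim=512, lda_dim=170):
        super().__init__()
        # Centering + LDA modeled as the first affine layer.
        self.lda = nn.Linear(xvec_dim, lda_dim)
        # PLDA centering and diagonalization as a second affine layer.
        self.diag = nn.Linear(lda_dim, lda_dim)
        # Matrices of the quadratic scoring layer (Eq. 2); in practice
        # these would be initialized from the generative PLDA model.
        self.P = nn.Parameter(torch.eye(lda_dim))
        self.Q = nn.Parameter(torch.zeros(lda_dim, lda_dim))

    def transform(self, x):
        x = self.lda(x)
        x = F.normalize(x, p=2, dim=-1)  # unit length normalization
        return self.diag(x)

    def forward(self, x_enroll, x_test):
        e, t = self.transform(x_enroll), self.transform(x_test)
        # Pairwise quadratic score of Eq. 2 (the constant term is
        # absorbed by the learnable decision thresholds).
        return (e * (e @ self.Q)).sum(-1) \
             + (t * (t @ self.Q)).sum(-1) \
             + 2.0 * (e * (t @ self.P)).sum(-1)
```

A batch of trials is then scored simply as `model(x_enroll, x_test)`, where the two tensors hold the enrollment and test x-vectors of each trial.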

Figure 1: Neural PLDA net architecture: the two inputs $\boldsymbol{\eta}_e$ and $\boldsymbol{\eta}_t$ are the enrollment and test x-vectors which constitute a trial.

5 Cost Function and Regularization

To train the Neural PLDA for the task of speaker verification, we require pairs of x-vectors representing the target (same speaker) and non-target (different speakers) hypotheses. We train the model using the trials from previous NIST SRE evaluation sets, along with randomly sampled target and non-target pairs matched by source and gender. The following loss functions can be used in the Neural PLDA.

5.1 Binary Cross Entropy

The standard objective for a two-class classification task is the binary cross entropy,

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[t_i \log \sigma(s_i) + (1 - t_i)\log\big(1 - \sigma(s_i)\big)\Big] \qquad (5)$$

where $s_i$ is the score for the $i$-th trial, $t_i \in \{0, 1\}$ is the binary target for the trial, $\sigma(\cdot)$ is the sigmoid function, and $N$ is the number of trials.

Using this loss alone for training may result in over-fitting. Hence, we add a regularization term that regresses the network scores toward the raw PLDA scores generated by Kaldi. The regularized cross-entropy loss is given as:

$$L = L_{BCE} + \lambda\,\frac{1}{N}\sum_{i=1}^{N}\left(s_i - \hat{s}_i\right)^2 \qquad (6)$$

where $\hat{s}_i$ is the generative PLDA score for the $i$-th trial and $\lambda$ is the regularization weight. The second term discourages the Neural PLDA scores from digressing drastically from the generative PLDA scores.
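As a concrete illustration, here is a minimal PyTorch sketch of the regularized objective of Eqs. 5 and 6. The function name and the default weight `lam` are illustrative choices, not values fixed by the paper.

```python
import torch
import torch.nn.functional as F

def regularized_bce(scores, targets, plda_scores, lam=0.1):
    # Eq. 5: binary cross entropy with the sigmoid applied to raw scores.
    bce = F.binary_cross_entropy_with_logits(scores, targets.float())
    # Eq. 6, second term: keep scores close to the generative PLDA scores.
    reg = torch.mean((scores - plda_scores) ** 2)
    return bce + lam * reg
```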

5.2 Soft Detection Cost

The NIST SRE 2018 normalized detection cost metric [20] is defined as

$$C_{Norm}(\beta, \theta) = P_{Miss}(\theta) + \beta\, P_{FA}(\theta) \qquad (7)$$

where $P_{Miss}$ and $P_{FA}$ are the probabilities of miss and false alarm, computed by applying a detection threshold $\theta$ to the scores:

$$P_{Miss}(\theta) = \frac{\sum_{i=1}^{N} t_i\, \mathbb{1}(s_i < \theta)}{\sum_{i=1}^{N} t_i} \qquad (8)$$

$$P_{FA}(\theta) = \frac{\sum_{i=1}^{N} (1 - t_i)\, \mathbb{1}(s_i \geq \theta)}{\sum_{i=1}^{N} (1 - t_i)} \qquad (9)$$

Here, $\mathbb{1}(\cdot)$ is the indicator function. The normalized detection cost (Eq. 7) is not a smooth function of the parameters due to the step discontinuity induced by the indicator function, and hence it cannot be used as an objective function for a neural network. We propose a differentiable approximation of the normalized detection cost, obtained by approximating the indicator function with a sigmoid:

$$P_{Miss}^{\,(\mathrm{soft})}(\theta) = \frac{\sum_{i=1}^{N} t_i\, \big[1 - \sigma\big(\alpha(s_i - \theta)\big)\big]}{\sum_{i=1}^{N} t_i} \qquad (10)$$

$$P_{FA}^{\,(\mathrm{soft})}(\theta) = \frac{\sum_{i=1}^{N} (1 - t_i)\, \sigma\big(\alpha(s_i - \theta)\big)}{\sum_{i=1}^{N} (1 - t_i)} \qquad (11)$$

By choosing a large enough value for the warping factor $\alpha$, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds.
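A minimal PyTorch sketch of the soft error rates of Eqs. 10 and 11 follows; the steepness value `alpha=20.0` is an arbitrary illustration, not a value prescribed by the paper.

```python
import torch

def soft_error_rates(scores, targets, theta, alpha=20.0):
    """Differentiable P_miss and P_fa at threshold theta (Eqs. 10-11)."""
    tgt = targets.float()
    probs = torch.sigmoid(alpha * (scores - theta))  # soft "score >= theta"
    p_miss = torch.sum(tgt * (1.0 - probs)) / torch.sum(tgt)
    p_fa = torch.sum((1.0 - tgt) * probs) / torch.sum(1.0 - tgt)
    return p_miss, p_fa
```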

The primary cost metric of NIST SRE 2018 for the conversational telephone speech (CTS) condition is given by

$$C_{Primary} = \frac{1}{2}\left[C_{Norm}(\beta_1, \theta_1) + C_{Norm}(\beta_2, \theta_2)\right] \qquad (12)$$

where $\beta_1 = 99$ and $\beta_2 = 199$, corresponding to the two operating points specified in the evaluation plan [20]. We compute the Neural PLDA loss function as the soft version of this cost,

$$L_{soft} = \frac{1}{2}\left[C_{Norm}^{\,(\mathrm{soft})}(\beta_1, \theta_1) + C_{Norm}^{\,(\mathrm{soft})}(\beta_2, \theta_2)\right] \qquad (13)$$

where $\theta_1$ and $\theta_2$ are the thresholds which minimize $L_{soft}$. The minimum detection cost is achieved at the threshold where $C_{Norm}$ is minimized; in other words, it is the best cost achievable through calibration of the scores. We include these thresholds in the set of parameters that the neural network learns through backpropagation. Finally, we compute an affine calibration transform on the scores using the SRE 2018 development set.
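The full training objective can then be sketched as below, with the two thresholds registered as learnable parameters so that backpropagation places them near the minimum-cost operating points. The class name and default values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftDetectionCost(nn.Module):
    """Soft C_Primary of Eqs. 12-13 with learnable thresholds."""
    def __init__(self, betas=(99.0, 199.0), alpha=20.0):
        super().__init__()
        # One learnable decision threshold per operating point.
        self.theta = nn.Parameter(torch.zeros(len(betas)))
        self.betas = betas
        self.alpha = alpha

    def forward(self, scores, targets):
        tgt = targets.float()
        cost = 0.0
        for theta, beta in zip(self.theta, self.betas):
            # Soft P_miss / P_fa at this threshold (Eqs. 10-11).
            probs = torch.sigmoid(self.alpha * (scores - theta))
            p_miss = torch.sum(tgt * (1.0 - probs)) / torch.sum(tgt)
            p_fa = torch.sum((1.0 - tgt) * probs) / torch.sum(1.0 - tgt)
            cost = cost + p_miss + beta * p_fa
        return cost / len(self.betas)
```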

| Systems | Dataset | Dev EER (%) | Dev C_Min | Dev C_Primary | Eval EER (%) | Eval C_Min | Eval C_Primary |
|---|---|---|---|---|---|---|---|
| PLDA Baseline (Kaldi) | CMN2 | 10.02 | 0.583 | 0.600 | 11.50 | 0.642 | 0.675 |
| PLDA Baseline (Kaldi) | VAST | 11.11 | 0.605 | 0.782 | 12.70 | 0.686 | 0.766 |
| DPLDA Baseline [2] | CMN2 | 11.91 | 0.683 | 0.718 | 13.19 | 0.732 | 0.78 |
| DPLDA Baseline [2] | VAST | 11.11 | 0.527 | 0.560 | 14.68 | 0.625 | 0.629 |
| Pairwise GB [17, 6] | CMN2 | 12.57 | 0.606 | 0.62 | 12.63 | 0.712 | 0.73 |
| Pairwise GB [17, 6] | VAST | 11.11 | 0.56 | 0.58 | 14.6 | 0.566 | 0.61 |
| Neural PLDA (BCE Loss, Random Init) | CMN2 | 11.33 | 0.609 | 0.62 | 10.06 | 0.699 | 0.711 |
| Neural PLDA (BCE Loss, Random Init) | VAST | 11.52 | 0.449 | 0.45 | 15.39 | 0.636 | 0.64 |
| Neural PLDA (BCE Loss, Kaldi Init) | CMN2 | 11.04 | 0.564 | 0.58 | 8.97 | 0.603 | 0.726 |
| Neural PLDA (BCE Loss, Kaldi Init) | VAST | 7.41 | 0.416 | 0.527 | 14.60 | 0.578 | 0.627 |
| Neural PLDA (Soft detection cost) | CMN2 | 10.50 | 0.524 | 0.532 | 9.78 | 0.598 | 0.654 |
| Neural PLDA (Soft detection cost) | VAST | 7.41 | 0.370 | 0.38 | 13.65 | 0.525 | 0.585 |
| Neural PLDA (Soft detection cost + 0.1*BCE) | CMN2 | 11.20 | 0.540 | 0.562 | 10.23 | 0.646 | 0.678 |
| Neural PLDA (Soft detection cost + 0.1*BCE) | VAST | 11.11 | 0.374 | 0.389 | 15.12 | 0.550 | 0.573 |
Table 1: Summary of results of various back-end models on CMN2 and VAST datasets reported on the SRE 2018 development and evaluation datasets.

6 Experiments

We perform several experiments with the proposed neural net architecture and compare them with discriminative back-ends previously proposed in the literature, such as the discriminative PLDA [2] and the pairwise Gaussian back-end [5]. We also compare the performance with the baseline system using the Kaldi recipe, which implements scoring based on the generative PLDA model.

For all the pairwise generative/discriminative models, we train the back-end using trials sampled from previous NIST SRE evaluation sets, along with randomly sampled target and non-target pairs matched by source and gender. Several million training trials are drawn from NIST SRE 04-10 as well as the NIST SRE16 trial lists, with additional training data sampled from the Mixer-6 and Switchboard 1 & 2 corpora. The evaluation of the models is performed on the telephone (CMN2) and video (VAST) conditions of the NIST SRE 2018 challenge.

6.1 Kaldi PLDA Baseline

The primary baseline to benchmark our systems is the PLDA back-end implementation in the Kaldi toolkit, which models the average x-vector embedding of each training speaker. The x-vectors are centered, reduced in dimension using LDA, and unit length normalized. The LDA dimension was tuned to achieve the best performance on the SRE 2018 development set. The resulting linear transformations and the Kaldi PLDA matrices are used to initialize the proposed pairwise PLDA network.

6.2 Discriminative PLDA (DPLDA)

In [2], an expanded vector $\varphi(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t)$ representing a trial is computed from the enrollment and test x-vectors using a quadratic kernel:

$$\varphi(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t) = \begin{bmatrix} \mathrm{vec}\big(\boldsymbol{\eta}_e\boldsymbol{\eta}_e^{\top} + \boldsymbol{\eta}_t\boldsymbol{\eta}_t^{\top}\big) \\ \mathrm{vec}\big(\boldsymbol{\eta}_e\boldsymbol{\eta}_t^{\top} + \boldsymbol{\eta}_t\boldsymbol{\eta}_e^{\top}\big) \\ \boldsymbol{\eta}_e + \boldsymbol{\eta}_t \\ 1 \end{bmatrix} \qquad (14)$$

The PLDA log-likelihood ratio score can then be written as the dot product of a weight vector $\mathbf{w}$ and the expanded vector:

$$s = \mathbf{w}^{\top}\varphi(\boldsymbol{\eta}_e, \boldsymbol{\eta}_t) \qquad (15)$$

We implemented DPLDA in PyTorch by expanding the centered, LDA-transformed, and length-normalized x-vectors from the Kaldi baseline. Once the weight vector is trained, test trials are scored by the inner product of the weight vector with the quadratic kernel expansion.
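A minimal PyTorch sketch of this expansion for a single trial follows (a batched implementation would vectorize the outer products); the function names are illustrative, not those of the authors' code.

```python
import torch

def quadratic_expansion(e, t):
    """Eq. 14: expand an (enrollment, test) x-vector pair into one vector."""
    outer_sym = torch.outer(e, e) + torch.outer(t, t)
    outer_cross = torch.outer(e, t) + torch.outer(t, e)
    return torch.cat([outer_sym.flatten(),
                      outer_cross.flatten(),
                      e + t,
                      torch.ones(1)])

def dplda_score(w, e, t):
    """Eq. 15: the trial score is a dot product with the weight vector."""
    return torch.dot(w, quadratic_expansion(e, t))
```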

6.3 Pairwise Gaussian Back-end (GB)

The pairwise Gaussian back-end [17, 6] models the stacked pairs of enrollment and test x-vectors, $\mathbf{x} = [\boldsymbol{\eta}_e^{\top}\ \boldsymbol{\eta}_t^{\top}]^{\top}$. The x-vector pairs from target trials are modeled by a Gaussian distribution with parameters $(\boldsymbol{\mu}_{tar}, \boldsymbol{\Sigma}_{tar})$, while the non-target pairs are modeled by a Gaussian distribution with parameters $(\boldsymbol{\mu}_{non}, \boldsymbol{\Sigma}_{non})$. These parameters are estimated as the sample means and covariance matrices of the target and non-target trials in the training data. The log-likelihood ratio $s(\mathbf{x})$ for a new trial is then obtained as

$$s(\mathbf{x}) = \log \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_{tar}, \boldsymbol{\Sigma}_{tar}\big) - \log \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_{non}, \boldsymbol{\Sigma}_{non}\big)$$

The Gaussian back-end is trained on the same pairs of target and non-target x-vector trials, after centering, LDA, and length normalization.
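A minimal PyTorch sketch of this scoring rule, assuming the Gaussian parameters have already been estimated from the training trials:

```python
import torch
from torch.distributions import MultivariateNormal

def pairwise_gb_llr(e, t, mu_tar, cov_tar, mu_non, cov_non):
    """Score a trial as the difference of target/non-target log densities."""
    x = torch.cat([e, t])  # stacked enrollment/test pair
    llr = MultivariateNormal(mu_tar, cov_tar).log_prob(x) \
        - MultivariateNormal(mu_non, cov_non).log_prob(x)
    return llr
```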

6.4 Neural PLDA

We perform various experiments using the Neural PLDA architecture with different initialization methods and loss functions. We also examine the role of the batch size and the learning rate in the optimization. The optimal parameter choices were based on the SRE 2018 development set.

For both the binary cross entropy (BCE) loss and the soft detection cost, the sigmoid function is applied to the scores at different thresholds. In this work, we parameterize the thresholds and let the network learn the values that minimize the loss.

The soft detection cost function is highly sensitive to small changes in the false alarm probability. Hence, all experiments were conducted with large batch sizes of 4096 or 8192. The learning rate was halved each time the validation loss increased twice in a row.
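A minimal sketch of this schedule, assuming the validation loss is recorded once per epoch in a list; the helper name is hypothetical.

```python
def maybe_halve_lr(optimizer, val_losses):
    """Halve the learning rate after two consecutive validation-loss increases."""
    if len(val_losses) >= 3 and \
       val_losses[-1] > val_losses[-2] > val_losses[-3]:
        for group in optimizer.param_groups:
            group['lr'] *= 0.5
```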

6.5 Discussion of Results

The performance of the various back-end systems is reported in Table 1. The PLDA baseline generalizes well from the development to the evaluation set on both the CMN2 and VAST sources. The discriminative PLDA (DPLDA) performs well on the VAST set, but fails to generalize on the CMN2 conditions. The pairwise GB model also performs better than Kaldi's PLDA baseline on the VAST dataset, in line with what was observed previously in [17].

The Neural PLDA model with random initialization of all parameters performs significantly better than the DPLDA model on the development set, and marginally better on the evaluation set. We hypothesize that this is a result of the network architecture, which has fewer parameters, and hence fewer degrees of freedom, than the DPLDA model, resulting in better generalization. When the parameters are initialized with the Kaldi PLDA back-end parameters, the discriminative training further improves the performance on the development set.

The soft detection cost function helps further reduce $C_{Min}$ and generalizes much better than the cross-entropy loss alone. We observe significant relative improvements in $C_{Min}$ over the PLDA baseline of 10% and 38% on the SRE 2018 development set, for the CMN2 and VAST conditions respectively. On the SRE 2018 evaluation set, the proposed approach yields relative improvements of 7% and 23% for the CMN2 and VAST conditions.

7 Summary and Conclusions

This paper presents a step in the direction of exploring discriminative models for the task of speaker verification. Discriminative models allow the construction of end-to-end systems; however, they tend to overfit to the training data. In our proposed model, we constrain the parameter set to have fewer degrees of freedom in order to achieve better generalization. We also propose a task-specific differentiable loss function which approximates the NIST SRE 2018 detection cost.

It is important to note that, unlike the cross-entropy loss, the NIST SRE detection cost gives significantly more importance to false alarms. We also find that initializing the proposed Neural PLDA model with the generative model parameters allows the model to improve over the baseline system performance.

We observe considerable improvements and better generalization with the proposed approach, which we attribute to the choice of architecture as well as the choice of loss function.

8 Acknowledgements

The authors would like to thank the Ministry of Human Resources Development (MHRD) of India and the Department of Science and Technology (DST) for their support. We would also like to thank Bhargavram Mysore and Anand Mohan for the valuable discussions and their help during the SRE 2018 and 2019 Evaluation.

Footnotes

  1. The implementation of the Neural PLDA Network can be found here: https://github.com/iiscleap/NeuralPlda

References

  1. M. F. BenZeghiba, J. Gauvain and L. Lamel (2009) Language score calibration using adapted gaussian back-end. In Tenth Annual Conference of the International Speech Communication Association, Cited by: §2.
  2. L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matějka and N. Brümmer (2011) Discriminatively trained probabilistic linear discriminant analysis for speaker verification. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4832–4835. Cited by: §2, Table 1, §6.2, §6.
  3. W. M. Campbell, D. E. Sturim and D. A. Reynolds (2006) Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13 (5), pp. 308–311. Cited by: §2.
  4. J. S. Chung, A. Nagrani and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. In Proc. Interspeech 2018, pp. 1086–1090. External Links: Document, Link Cited by: §3.1.
  5. S. Cumani, N. Brümmer, L. Burget, P. Laface, O. Plchot and V. Vasilakakis (2013) Pairwise discriminative speaker verification in the i-vector space. IEEE Transactions on Audio, Speech, and Language Processing 21 (6), pp. 1217–1227. Cited by: §2, §6.
  6. S. Cumani and P. Laface (2014) Generative pairwise models for speaker recognition. In Odyssey, pp. 273–279. Cited by: §2, Table 1, §6.3.
  7. S. Cumani and P. Laface (2014) Large-scale training of pairwise support vector machines for speaker recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22 (11), pp. 1590–1600. Cited by: §2.
  8. N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet (2011) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798. Cited by: §1, §1.
  9. L. Ferrer and M. McLaren (2019) Optimizing a speaker embedding extractor through backend-driven regularization. Proc. Interspeech 2019, pp. 4350–4354. Cited by: §2.
  10. D. Garcia-Romero and C. Y. Espy-Wilson (2011) Analysis of i-vector length normalization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association, Cited by: §1.
  11. A. O. Hatch, S. Kajarekar and A. Stolcke (2006) Within-class covariance normalization for svm-based speaker recognition. In Ninth international conference on spoken language processing, Cited by: §1.
  12. P. Kenny, G. Boulianne, P. Ouellet and P. Dumouchel (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15 (4), pp. 1435–1447. Cited by: §1.
  13. P. Kenny (2010) Bayesian speaker verification with heavy-tailed priors.. In Odyssey, pp. 14–21. Cited by: §1, §2.
  14. M. Mclaren, D. Castán, M. K. Nandwana, L. Ferrer and E. Yilmaz (2018) How to train your speaker embeddings extractor. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, pp. 327–334. Cited by: §1, §3.1, §3.2.
  15. M. McLaren, A. Lawson, Y. Lei and N. Scheffer (2013) Adaptive Gaussian backend for robust language identification.. In INTERSPEECH, pp. 84–88. Cited by: §2.
  16. A. Nagrani, J. S. Chung and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §3.1.
  17. S. Ramoji, A. Mohan, B. Mysore, A. Bhatia, P. Singh, H. Vardhan and S. Ganapathy (2019) The LEAP speaker recognition system for NIST SRE 2018 challenge. In Proc. of ICASSP, pp. 5771–5775. Cited by: Table 1, §6.3, §6.5.
  18. D. A. Reynolds, T. F. Quatieri and R. B. Dunn (2000) Speaker verification using adapted gaussian mixture models. Digital signal processing 10 (1-3), pp. 19–41. Cited by: §1.
  19. J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka and L. Burget (2018) End-to-end DNN based speaker recognition inspired by i-vector and PLDA. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4874–4878. Cited by: §2.
  20. O. Sadjadi (2018) NIST 2018 Speaker Recognition Evaluation Plan. https://www.nist.gov/sites/default/files/documents/2018/08/17/sre18_eval_plan_2018-05-31_v6.pdf Cited by: §1, §5.2.
  21. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1, §3.1.
  22. J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, L. P. García-Perera, F. Richardson and R. Dehak (2019) State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations. Computer Speech & Language, pp. 101026. Cited by: §2.
  23. L. Wan, Q. Wang, A. Papir and I. L. Moreno (2018) Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879–4883. Cited by: §2.
  24. C. Zhang and K. Koishida (2017) End-to-end text-independent speaker verification with triplet loss on short utterances.. In Interspeech, pp. 1487–1491. Cited by: §2.