Text-Independent Speaker Verification Based on Deep Neural Networks and Segmental Dynamic Time Warping
Abstract
In this paper we present a new method for text-independent speaker verification that combines segmental dynamic time warping (SDTW) and the d-vector approach. The d-vectors, generated from a feed-forward deep neural network trained to distinguish between speakers, are used as features to perform alignment and hence calculate the overall distance between the enrolment and test utterances. We present results on the NIST 2008 data set for speaker verification, where the proposed method outperforms the conventional i-vector baseline with PLDA scores and outperforms the d-vector approach with local distances based on cosine and PLDA scores. Also, score combination with the i-vector/PLDA baseline leads to significant gains over both methods.
Mohamed Adel, Mohamed Afify and Akram Gaballah 
Microsoft Advanced Technology Lab, Cairo, Egypt 
Microsoft Corporation, Redmond, WA, USA 
{amoadel, mafify, akrgab}@microsoft.com 
1 Introduction
Speaker verification is the process of confirming whether an input utterance belongs to a claimed speaker. There are many popular approaches to the problem, including the Gaussian mixture model (GMM) [1], the i-vector [2] and, more recently, deep learning [3]. Speaker verification can be further classified into text-dependent and text-independent. In the text-dependent mode, both the enrolment and test utterances have the same text, while in the text-independent case the user can enrol and test with any text.
The d-vector approach [3] was originally proposed for text-dependent speaker verification. The basic idea is to train a deep neural network to learn a mapping from the spectral input to the speaker identity. An intermediate layer (embedding) is then extracted for each input frame. The extracted embedding is averaged over the input utterance and used as a speaker representation, called the d-vector, similar to the i-vector. The d-vector is then used for speaker verification by applying a cosine distance or probabilistic linear discriminant analysis (PLDA) [19].
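The averaging and cosine scoring just described can be illustrated with a minimal sketch. The function names and the random stand-in embeddings are assumptions made for illustration; they are not taken from [3].

```python
import numpy as np

def utterance_dvector(frame_embeddings):
    """Average per-frame embeddings (T x D) into one D-dimensional d-vector."""
    return np.asarray(frame_embeddings, dtype=float).mean(axis=0)

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 2 for opposite ones."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random embeddings stand in for the network output (illustrative only).
rng = np.random.default_rng(0)
enrol = utterance_dvector(rng.normal(size=(300, 128)))
test = utterance_dvector(rng.normal(size=(250, 128)))
print(cosine_distance(enrol, test))
```

A verification decision would compare this distance with a threshold tuned on held-out data.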
Several variants have been proposed to improve on the original d-vector idea or to generalize it to the text-independent scenario. An end-to-end loss, more closely related to speaker verification, is proposed in [4] and applied to train feed-forward and LSTM architectures. This end-to-end loss is generalized in [5] and applied to both text-dependent and text-independent verification. Different attention mechanisms proposed in [20, 6] yield improvements over simple averaging for calculating the d-vector. Deep Speaker [7] uses different architectures, including convolutional networks with residual connections and gated recurrent unit (GRU) networks, and trains them using the triplet loss. The proposed architectures show good results for both text-independent and text-dependent speaker recognition. Interestingly, it is also shown in [7] that a network trained for text-independent verification can be adapted using task-dependent data. This is important because the d-vector does not work very well when the task-dependent data is small [11]. In [12], the intermediate representation from an LSTM, trained either separately or jointly with speech recognition, is used for text-independent speaker verification on Hub5 and Wall Street Journal (WSJ) data. The work of [8] trains a neural network with temporal pooling in an end-to-end fashion for text-independent speaker verification, similar in spirit to [4], and presents results on telephone speech of various durations. A time delay neural network (TDNN) is trained using cross entropy and the resulting embedding is used for PLDA scoring in [9]. The results are presented on NIST speaker verification tasks for various test durations. A major finding is that the d-vector outperforms the conventional i-vector for short-duration segments, while the latter is better for longer durations. That work was recently extended with data augmentation and applied to various data sets in [21].
It is also worth mentioning approaches inspired by i-vector and PLDA where the whole i-vector/PLDA system is formulated and trained as a network [10].
Instead of averaging the d-vectors over the whole utterance, as is typically done in conventional approaches, we propose to keep the sequences of d-vectors of the enrolment and test utterances. We then align the two sequences to obtain an accumulated score for text-independent speaker verification. In [13], dynamic time warping (DTW) [15] is used to find the best alignment and hence the minimum distance between two sequences of d-vectors for text-dependent verification. However, conventional DTW with path constraints will not lead to meaningful alignments in the text-independent case, because the path constraints might be too restrictive to find the potentially non-monotonic alignments. Here we use segmental dynamic time warping (SDTW) to align the two sequences of d-vectors and experiment with both cosine distance and PLDA for measuring the local distance between pairs of d-vectors. We present results on the NIST 2008 speaker verification task.
Segmental DTW was proposed for automatic pattern discovery in speech in [16]. It was then applied to keyword spotting [17] and speaker segmentation [18]. In this article, we combine the d-vector with SDTW to perform text-independent speaker verification. At a high level, SDTW finds multiple partial paths between two utterances and hence can discover parts of the utterances that exhibit certain similarities. We combine the scores of these paths to obtain a similarity score between the two utterances and use it for verification.
2 Segmental Dynamic Time Warping
In this section we briefly describe segmental DTW. In its basic form [15], DTW finds the optimal global alignment and the accumulated distance between two sequences $X$ and $Y$.
Assume $X = x_1, \ldots, x_{N_x}$ and $Y = y_1, \ldots, y_{N_y}$; an alignment $\phi$ is given by
$$\phi = (i_k, j_k), \quad k = 1, \ldots, T \tag{1}$$
where $i_k$ and $j_k$ are indices from the two sequences and $T$ is the alignment length such that $1 \le i_1 \le \cdots \le i_T \le N_x$ and $1 \le j_1 \le \cdots \le j_T \le N_y$. The associated accumulated distance is given by
$$D_\phi(X, Y) = \sum_{k=1}^{T} d(x_{i_k}, y_{j_k}) \tag{2}$$
where $d(\cdot, \cdot)$ is a local distance. In this work we use distance measures based on the cosine similarity and probabilistic linear discriminant analysis (PLDA).
Given a distance measure and a set of constraints, DTW calculates the optimal path and the associated accumulated distance using dynamic programming [15]. The constraints are important to obtain a physically plausible alignment. The so-called adjustment window condition ($|i_k - j_k| \le R$) [15] ensures that the aligned indices of the two sequences are not very far apart. If the two sequences grossly violate the constraints, DTW will most likely fail to find a good alignment.
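As a rough illustration of DTW under an adjustment window, the sketch below computes the accumulated distance of the best global path through a precomputed local-distance matrix. The symmetric step pattern (horizontal, vertical, diagonal) and the name `banded_dtw` are assumptions chosen for brevity; [15] discusses several step patterns.

```python
import numpy as np

def banded_dtw(dist, R):
    """Accumulated distance of the best global path through `dist` (Nx x Ny),
    using only cells with |i - j| <= R. Returns inf if no such path exists."""
    nx, ny = dist.shape
    acc = np.full((nx, ny), np.inf)
    for i in range(nx):
        for j in range(max(0, i - R), min(ny, i + R + 1)):
            if i == 0 and j == 0:
                acc[i, j] = dist[0, 0]
                continue
            # Best predecessor among the vertical, horizontal and diagonal steps.
            best = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = dist[i, j] + best
    return acc[-1, -1]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 1.0, 2.0, 3.0])
local = np.abs(x[:, None] - y[None, :])
print(banded_dtw(local, R=1), banded_dtw(local, R=0))
```

With `R=0` the end point lies outside the window, so no global path exists; this is the failure mode that motivates partial alignments.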
Segmental DTW generalizes DTW by finding a set of partial alignments between two sequences. By allowing partial alignments, SDTW can potentially find well-matched subsequences even if the two sequences are not fully matched. Formally, with a constraint parameter $R$ and utterance lengths $N_x$ and $N_y$ respectively, we obtain multiple alignments by running DTW on regions that start at:
$$\big((2R+1)k + 1,\; 1\big),\ 0 \le k \le \left\lfloor \tfrac{N_x - 1}{2R+1} \right\rfloor; \qquad \big(1,\; (2R+1)k + 1\big),\ 1 \le k \le \left\lfloor \tfrac{N_y - 1}{2R+1} \right\rfloor \tag{3}$$
Each region is limited to a diagonal band depending on $R$ and hence represents a partial alignment of the two sequences. One region is shown in Figure 1, taken from [16]. The width indicated in the figure is that of the region where a partial alignment is calculated; it is determined by the constraint parameter $R$. We refer to these partial alignments as $\phi_r$, where $r = 1, \ldots, N_R$ and $N_R$ is the number of start points in (3). Now, given a length constraint parameter $L$, we find for each local alignment path the fragment of length at least $L$, shown in red in the figure, that has the minimum average distortion. This can be obtained efficiently as outlined in [16] and references therein.
To summarize, given parameters $R$ and $L$, SDTW of two sequences gives a set of fragments of length at least $L$ and their associated scores $D_r$ for $r = 1, \ldots, N_R$. We will show in the next section how to use these partial scores for speaker verification. The best parameters $R$ and $L$ are determined empirically.
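The procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: start points follow the diagonal spacing of $2R+1$ described in [16] (0-indexed here), each path is forced to end where its start diagonal leaves the distance matrix, and the minimum-average fragment is found by brute force rather than by the efficient method referenced in [16]. All function names are hypothetical.

```python
import numpy as np

def start_points(nx, ny, R):
    """0-indexed start cells, spaced 2R+1 apart along the first column and row."""
    step = 2 * R + 1
    starts = [(step * k, 0) for k in range((nx - 1) // step + 1)]
    starts += [(0, step * k) for k in range(1, (ny - 1) // step + 1)]
    return starts

def band_path(dist, i0, j0, R):
    """Local distances along the best path from (i0, j0) to the end of its
    diagonal, staying within R cells of that diagonal."""
    n = min(dist.shape[0] - i0, dist.shape[1] - j0)
    acc = np.full((n, n), np.inf)
    acc[0, 0] = dist[i0, j0]
    for a in range(n):
        for b in range(max(0, a - R), min(n, a + R + 1)):
            if a == 0 and b == 0:
                continue
            prev = min(
                acc[a - 1, b] if a > 0 else np.inf,
                acc[a, b - 1] if b > 0 else np.inf,
                acc[a - 1, b - 1] if a > 0 and b > 0 else np.inf,
            )
            acc[a, b] = dist[i0 + a, j0 + b] + prev
    # Backtrack from the end cell, always moving to the cheapest predecessor.
    a, b = n - 1, n - 1
    costs = [dist[i0 + a, j0 + b]]
    while (a, b) != (0, 0):
        cands = []
        if a > 0 and b > 0:
            cands.append((acc[a - 1, b - 1], a - 1, b - 1))
        if a > 0:
            cands.append((acc[a - 1, b], a - 1, b))
        if b > 0:
            cands.append((acc[a, b - 1], a, b - 1))
        _, a, b = min(cands)
        costs.append(dist[i0 + a, j0 + b])
    return costs[::-1]

def best_fragment(costs, L):
    """Minimum average cost over contiguous sub-paths of length >= L."""
    if len(costs) < L:
        return None
    prefix = np.concatenate(([0.0], np.cumsum(costs)))
    best = np.inf
    for s in range(len(costs) - L + 1):
        for e in range(s + L, len(costs) + 1):
            best = min(best, (prefix[e] - prefix[s]) / (e - s))
    return float(best)

def sdtw_scores(dist, R, L):
    """One fragment score per region whose path is at least L cells long."""
    scores = []
    for i0, j0 in start_points(dist.shape[0], dist.shape[1], R):
        score = best_fragment(band_path(dist, i0, j0, R), L)
        if score is not None:
            scores.append(score)
    return scores
```

Regions whose path is shorter than `L` contribute no score; whether to drop or otherwise handle such regions is a design choice the sketch makes for simplicity.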
3 Speaker Verification Using d-vector and SDTW
In this section we describe how to combine the d-vector and SDTW for text-independent speaker verification. The network architecture and training used to generate d-vectors are described in Section 3.1. We focus on the case of a single enrolment and a single test utterance. The generalization to multiple enrolments and multiple tests is straightforward.
In speaker verification, the distance between the enrolment and test utterances is calculated and compared to a threshold to either accept or reject the claim. In the text-independent case, the enrolment and test utterances have different phonetic content. As discussed above, SDTW between two sequences provides a set of partial alignments and their distances. The distance between the sequences can be obtained by combining the distances of the partial alignments. In preliminary experiments, we tried using the average, the average of the lowest-K and the minimum, with very similar performance. Thus, we report the average in the rest of the work. Once the distance is obtained, it is compared to a threshold for the verification decision.
Motivated by the recent success of deep learning techniques in speaker verification, we apply SDTW at the d-vector level. We first generate a sequence of d-vectors for both the enrolment and test utterances, then apply SDTW to the resulting d-vectors. We summarize the enrolment and verification phases below.
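The three ways of combining partial distances mentioned above can be sketched as below. The function names, the default `k`, and the example threshold are illustrative assumptions, not values from this work.

```python
import numpy as np

def combine(partial_distances, method="average", k=3):
    """Combine the distances of the partial alignments into one distance."""
    d = np.sort(np.asarray(partial_distances, dtype=float))
    if method == "average":
        return float(d.mean())
    if method == "lowest_k":
        return float(d[:k].mean())  # average of the k smallest distances
    if method == "min":
        return float(d[0])
    raise ValueError(f"unknown method: {method}")

def accept(distance, threshold):
    """Accept the identity claim when the distance does not exceed the threshold."""
    return distance <= threshold

partials = [0.4, 0.2, 0.9]
print(combine(partials), combine(partials, "lowest_k", k=2), combine(partials, "min"))
print(accept(combine(partials), threshold=0.6))
```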

Enrolment Phase

Starting from the enrolment utterance, create the sequence of enrolment d-vectors. This is obtained by running a fixed-length window over the enrolment utterance and advancing it by a fixed step. Each window is input to the network to produce the corresponding d-vector. Note that the time index of the enrolment sequence follows the window step and not the frames; the two coincide only when the window is advanced by one frame.


Verification Phase

Starting from the test utterance, create the sequence of test d-vectors in the same way as for enrolment.

Run SDTW, with the constraint parameter $R$ and the length parameter $L$, on the enrolment and test d-vector sequences. This results, as discussed above, in a set of partial paths and their corresponding scores.

Obtain the score of the test utterance by averaging the scores of the partial paths. This score is then compared to a threshold to make the verification decision.
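The windowing step shared by both phases can be sketched as follows. The window of 200 frames and step of 50 frames mirror the training segmentation of Section 3.1 and are assumed here for test time as well; `placeholder_embed` is a stand-in for the trained network and simply averages the frames of a window.

```python
import numpy as np

def window_starts(num_frames, win=200, step=50):
    """Start frames of all full windows of `win` frames, advanced by `step`."""
    return list(range(0, num_frames - win + 1, step))

def dvector_sequence(features, embed, win=200, step=50):
    """Apply `embed` to each window; the result is indexed by window, not frame."""
    starts = window_starts(len(features), win, step)
    return np.stack([embed(features[s:s + win]) for s in starts])

def cosine_distance_matrix(E, T):
    """Pairwise cosine distances between two d-vector sequences."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return 1.0 - E @ T.T

def placeholder_embed(window):
    # Hypothetical stand-in for the network: mean of the frames in the window.
    return window.mean(axis=0)

rng = np.random.default_rng(0)
enrol = dvector_sequence(rng.normal(size=(1000, 66)), placeholder_embed)
test = dvector_sequence(rng.normal(size=(600, 66)), placeholder_embed)
dist = cosine_distance_matrix(enrol, test)
print(enrol.shape, test.shape, dist.shape)
```

The resulting matrix of local distances is what SDTW would operate on; with PLDA scoring, the cosine distance would be replaced by a PLDA-based local distance.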

3.1 D-vector Network Architecture and Training
Any network architecture can be used with the proposed method. In this work we use a simple feed-forward architecture. The input dimension is 1386; as explained below, it consists of a feature vector of dimension 66 and a context window of size 21. This is followed by 5 hidden layers that operate at the frame level, of sizes 2048, 2048, 1024, 1024 and 512 respectively. All layers use ReLU nonlinearity and batch normalization. This is followed by a temporal pooling layer that operates on the input segment. Following temporal pooling is a hidden layer of size 128 that operates at the segment level and also uses ReLU and batch normalization. Finally, there is a softmax output layer trained with cross entropy. The output layer corresponds to the 5000 speakers having the largest number of segments in the training data. The d-vector is extracted from the last hidden layer, after temporal pooling, of size 128. Excluding the softmax layer, the network has about 10M parameters. Dropout with keep parameter 0.75 is used after the second hidden layer, which has about 4M parameters.
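The quoted parameter counts can be checked with simple arithmetic. The sketch assumes that temporal pooling is a parameter-free operation that preserves the 512-dimensional frame-level output (for example, a mean over frames) and ignores batch-normalization parameters; neither detail is stated explicitly above.

```python
# Layer widths from the input up to the 128-dimensional segment-level layer.
# Assumption: pooling keeps the dimension at 512 and has no parameters.
sizes = [1386, 2048, 2048, 1024, 1024, 512, 128]

def dense_params(widths, bias=True):
    """Weights (and optionally biases) of a stack of fully connected layers."""
    return sum(i * o + (o if bias else 0) for i, o in zip(widths, widths[1:]))

total_weights = dense_params(sizes, bias=False)
total_with_bias = dense_params(sizes, bias=True)
second_hidden = 2048 * 2048  # weights of the second hidden layer

print(total_weights, total_with_bias, second_hidden)
```

Under these assumptions the total is roughly 10.8M and the second hidden layer holds roughly 4.2M weights, in line with the approximate figures quoted above.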
Segments of length 200 frames with an advance of 50 frames are extracted from the training data. Data from the 5000 most frequent speakers are kept, with about 3000 segments per speaker. This leads to a total of about 15M segments of size 200, each labelled with the corresponding speaker. We experimented with other window durations and with random window sizes but found the selected size to work best. Training optimizes the cross entropy (CE) criterion. Other criteria such as the triplet loss could be used, but these typically need CE initialization and could be tried in future work. The network is randomly initialized. Mini-batches of 70 segments are randomly formed from the above segments and used to optimize the weights using SGD with momentum. The learning rate is reduced after every sweep through the data to prevent overfitting.
4 Experimental Results
In this section we present experiments to verify the proposed method. We first present the training and testing data, followed by the baseline setup for i-vector/PLDA and d-vector, and then the experimental results.
4.1 Training Data and Testing Setup
The training data consists of about 4000 hours from the English Fisher and the NIST 2004, 2005 and 2006 telephone corpora sampled at 8 kHz. Voice activity detection (VAD) using an energy-based criterion is applied to the data. 22 log filterbank energies (LFB) are extracted together with their first and second derivatives, leading to a 66-dimensional feature vector that is used during training and testing. We use a window of 21 frames, centred around the current frame, leading to a network input size of 1386 (we tried several window sizes and found 21 to work best). The common evaluation condition of NIST 2008 SRE is used for testing. We use both telephone and interview data comprising all conditions C1-C8.
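The construction of the 1386-dimensional network input can be illustrated by splicing each 66-dimensional frame with 10 frames of context on each side. Repeating the first and last frames at the utterance edges is an assumption; the text does not say how edges are handled.

```python
import numpy as np

def splice(features, context=10):
    """Stack each frame with `context` frames on each side (edges repeated)."""
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[t:t + width].reshape(-1) for t in range(num_frames)])

# 22 log filterbank energies plus first and second derivatives: 3 * 22 = 66.
feats = np.zeros((100, 3 * 22))
print(splice(feats).shape)  # (100, 1386)
```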
4.2 Baseline System
For comparison we use an i-vector/PLDA baseline and a d-vector baseline. The configurations of these systems are as follows:

i-vector/PLDA: This follows the system in [2]. Based on our previous experiments, we set the UBM size to 2048 and the i-vector size to 400. We always test the i-vector with PLDA as in [19]. The PLDA dimension is set to 200. After generating i-vectors for the training data, we project them to dimension 200 using LDA, then apply centering, whitening and length normalization. The PLDA is then trained on the transformed data. The same processing is applied to the test data and the PLDA score is used for verification.

d-vector: The baseline d-vector system works as follows. First, a sequence of d-vectors is generated by sliding a window over the test utterance as described above. The resulting sequence of d-vectors is then averaged to yield a single vector representation for the utterance. We use two configurations. The first uses cosine distance for scoring while the second, similar to the i-vector, uses PLDA. The d-vector size is 128 for both configurations.
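The centering, whitening and length-normalization chain applied before PLDA in the i-vector configuration above can be sketched as follows. The LDA projection is omitted because it needs speaker labels, and the eigenvalue floor and function names are assumptions made for illustration.

```python
import numpy as np

def fit_whitener(train):
    """Estimate the mean and a whitening matrix from training vectors (N x D)."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    # Scale each eigenvector by 1/sqrt(eigenvalue); floor avoids division by zero.
    W = vecs / np.sqrt(np.maximum(vals, 1e-10))
    return mean, W

def transform(x, mean, W):
    """Center, whiten, then scale every vector to unit length."""
    y = (x - mean) @ W
    return y / np.linalg.norm(y, axis=1, keepdims=True)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
mean, W = fit_whitener(train)
normalized = transform(train, mean, W)
print(normalized.shape)
```

The statistics are estimated on training data only and then reused unchanged on the test vectors, matching the description that the same processing is applied to the test data.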
4.3 Results
Table 1 shows the equal error rate (EER) averaged over the 8 conditions. The second column presents the results of the following systems: i-vector/PLDA, d-vector with cosine scoring, d-vector/PLDA, and d-vector with cosine scoring and d-vector/PLDA both combined with SDTW. The latter two use a single setting of the SDTW constraint and length parameters. The i-vector/PLDA EER in the second row is a reasonable baseline compared to other results obtained on NIST 2008. Although this can be further optimized using gender-dependent models and phonetically-aware features, adding these can also benefit the d-vector, and hence they are not tried here. The d-vector with cosine scoring shows significantly worse performance than the i-vector, while the d-vector/PLDA is better than the baseline d-vector. The d-vector/PLDA is still worse than the i-vector/PLDA. Results in the literature regarding the benefit of PLDA over cosine scoring are mixed. For example, [7] shows excellent results with only cosine scoring while [9] shows only results with PLDA. We believe that for public corpora, where there are only a few thousand speakers for training and the training data mainly consists of telephone speech while the test data has varying acoustic conditions, the network will not fully learn to normalize the acoustic condition. Hence, PLDA provides desirable normalization on top of the d-vector and can potentially lead to better results, as shown here. In [9], i-vector/PLDA is slightly better than d-vector/PLDA when testing with the full utterance (we test on the core condition, where the average utterance length is around 2 minutes). Finally, we can see decent gains from using SDTW on top of the d-vector with both cosine and PLDA scoring. In particular, the d-vector/PLDA with SDTW performs better than i-vector/PLDA. The third column shows the results of combining i-vector/PLDA, at the score level with weight 0.5, with the other systems. In all cases we see significant gains from the combination.
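For reference, the EER and an equal-weight score combination can be computed along the following lines. This is a generic sketch, not the evaluation tooling used for Table 1: it assumes similarity-oriented scores (higher means more likely the same speaker) on comparable scales, so distances would first need a sign flip and possibly normalization, and it takes the EER at the threshold where false acceptance and false rejection rates are closest.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER for similarity scores (higher = more likely a target)."""
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, non]))
    far = np.array([(non >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(tgt < t).mean() for t in thresholds])   # false rejections
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

def fuse(scores_a, scores_b, weight=0.5):
    """Score-level combination of two systems with the given weight."""
    return weight * np.asarray(scores_a, dtype=float) + (1 - weight) * np.asarray(scores_b, dtype=float)

print(equal_error_rate([2.0, 3.0], [0.0, 1.0]))
print(fuse([1.0, 2.0], [3.0, 4.0]))
```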
Figure 2 shows SDTW results for both cosine and PLDA scoring with varying SDTW parameters. The six curves, in different colors, pair three settings of one SDTW parameter with cosine and PLDA scoring, while the horizontal axis varies the other parameter. It can be observed that PLDA results are significantly better than cosine results for all values of both parameters. Generally speaking, we observe that small values of the constraint parameter $R$ give better performance because we sample at a rather coarse rate of 50 frames. Relatively large values of the length parameter $L$ also tend to give better results, as longer segments carry meaningful speaker information. The figure also shows a fairly stable region for selecting the two parameters.
To gain more insight, we show in Table 2 the results for the individual conditions. These correspond to the second column of Table 1 before averaging. It is clear that the d-vector approach is significantly better than the i-vector for the telephone conditions (C6-C8) and significantly worse for the microphone (interview) conditions (C1-C3). The d-vector is also better for the mixed conditions (telephone/interview) C4 and C5. As the training data for the network and PLDA consist mainly of telephone speech, we might argue that, for the available amount of training data and network architecture, the network is not able to fully normalize for the acoustic condition. PLDA helps with acoustic normalization in almost all conditions. SDTW also shows significant improvement in almost all conditions.
5 Conclusion
We propose segmental DTW to align the d-vectors of the enrolment and test utterances for text-independent speaker verification. Compared to the conventional d-vector, which averages the d-vectors over the whole utterance, alignment can potentially find better matching parts of the enrolment and test utterances and hence reduce bias due to phonetic content. Compared to conventional DTW, segmental DTW can find good partial alignments even if the two utterances are grossly mismatched. The proposed method is tested on the core condition of NIST 2008, where the utterances are relatively long, and shows improvement over the baseline d-vector with and without PLDA scoring. Combining with i-vector/PLDA provides gains in all cases. Future work includes improving the d-vector itself by exploring more sophisticated architectures such as recurrent and convolutional networks and by using other training criteria such as the triplet loss. We also plan to test on other corpora like NIST SRE 2010, SITW and VoxCeleb.
6 Acknowledgement
We thank Ahmed Ewais and Mohamed Yahia for working on an earlier setup for this task and Lana Chafik for help with some experiments.
References
 [1] D. Reynolds, T.F. Quatieri and R.B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, Vol. 10, pp. 19-41, 2000.
 [2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech and Language Processing, Vol. 19, No. 4, May 2011.
 [3] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno and J. Gonzalez-Dominguez, "Deep Neural Networks for Small Footprint Text-Dependent Speaker Verification," In Proc. ICASSP 2014.
 [4] G. Heigold, I. Moreno, S. Bengio and N. Shazeer, "End-to-End Text-Dependent Speaker Verification," In Proc. ICASSP 2016.
 [5] L. Wan, Q. Wang, A. Papir and I. Lopez Moreno, "Generalized End-to-End Loss for Speaker Verification," arXiv preprint arXiv:1710.10467, 2017.
 [6] F. Chowdhury, Q. Wang, I. Lopez Moreno and L. Wan, "Attention-Based Models for Text-Dependent Speaker Verification," arXiv preprint arXiv:1710.10470, 2017.
 [7] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan and Z. Zhu, "Deep Speaker: An End-to-End Neural Speaker Embedding System," arXiv preprint arXiv:1705.02304, 2017.
 [8] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel and S. Khudanpur, "Deep Neural Network Speaker Embeddings for End-to-End Speaker Verification," In Proc. SLT 2016.
 [9] D. Snyder, D. Garcia-Romero, D. Povey and S. Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," In Proc. Interspeech 2017.
 [10] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matejka and L. Burget, "End-to-End DNN Based Speaker Recognition Inspired by i-vector and PLDA," arXiv preprint arXiv:1710.02369, 2017.
 [11] G. Bhattacharya, J. Alam, T. Stafylakis and P. Kenny, "Deep Neural Network Based Text-Dependent Speaker Recognition: Preliminary Results," In Proc. Odyssey 2016.
 [12] Z. Tang, L. Li and D. Wang, "Multi-Task Recurrent Model for Speech and Speaker Recognition," arXiv preprint arXiv:1603.09643, 2016.
 [13] L. Li, Y. Lin, Z. Zhang and D. Wang, "Improved Deep Speaker Feature Learning for Text-Dependent Speaker Verification," arXiv preprint arXiv:1506.08349, 2015.
 [14] V. Gupta, P. Kenny, P. Ouellet and T. Stafylakis, "I-Vector-Based Speaker Adaptation of Deep Neural Networks for French Broadcast Audio Transcription," In Proc. ICASSP 2014.
 [15] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 26, No. 1, February 1978.
 [16] A. Park and J. Glass, "Unsupervised Pattern Discovery in Speech," IEEE Transactions on Audio, Speech and Language Processing, Vol. 16, No. 1, January 2008.
 [17] Y. Zhang and J. Glass, "Unsupervised Spoken Keyword Spotting via Segmental DTW on Gaussian Posteriorgrams," In Proc. ASRU 2009.
 [18] A. Park and J. Glass, "A Novel DTW-Based Distance Measure for Speaker Segmentation," In Proc. SLT 2006.
 [19] P. Kenny, "Bayesian Speaker Verification with Heavy-Tailed Priors," In Proc. Odyssey 2010.
 [20] S. Zhang, Z. Chen, Y. Zhao, J. Li and Y. Gong, "End-to-End Attention Based Text-Dependent Speaker Verification," In Proc. SLT 2016.
 [21] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur, "X-Vectors: Robust DNN Embeddings for Speaker Recognition," In Proc. ICASSP 2018.