Deep Discriminant Analysis for i-vector Based Robust Speaker Recognition
Abstract
Linear Discriminant Analysis (LDA) has been used as a standard post-processing procedure in many state-of-the-art speaker recognition tasks. By maximizing the inter-speaker difference and minimizing the intra-speaker variation, LDA projects i-vectors into a lower-dimensional and more discriminative subspace. In this paper, we propose a neural network based compensation scheme (termed Deep Discriminant Analysis, DDA) for i-vector based speaker recognition, which shares the same spirit as LDA. Optimized against the softmax loss and the center loss simultaneously, the proposed method learns a more compact and discriminative embedding space. In contrast to the Gaussian distribution assumption and the learned linear projection of LDA, the proposed method makes no assumptions about the data and can learn a nonlinear projection function. Experiments are carried out on a short-duration text-independent dataset based on the SRE corpus, and noticeable performance improvements can be observed over the standard LDA and PLDA methods.
Shuai Wang, Zili Huang, Yanmin Qian, Kai Yu 
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering 
SpeechLab, Department of Computer Science and Engineering 
Brain Science and Technology Research Center 
Shanghai Jiao Tong University, Shanghai, China 
The corresponding author is Kai Yu. This work has been supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002102 and the China NSFC projects (No. U1736202 and No. 61603252). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
{feixiang121976, huangziliandy, yanminqian, kai.yu}@sjtu.edu.cn
1 Introduction
Speaker recognition aims to recognize or verify a speaker's identity from a given speech segment. Since its proposal in [1], the i-vector has become the state-of-the-art speaker modeling technique; it is a simple but elegant factor analysis model inspired by the Joint Factor Analysis (JFA) [2] framework. Though some researchers have worked on improving the i-vector model itself [3, 4], more attention has been paid to compensation techniques in the i-vector space [5, 6, 7, 8]. JFA can be regarded as a compensation method in the GMM supervector space, which models the speaker and channel variabilities separately. The i-vector approach simplifies JFA by modeling the speaker- and channel-dependent factors in a single low-dimensional space, leaving the compensation mechanisms to subsequent steps. In real applications, nuisance attributes such as channel and noise can severely degrade system performance, so compensation methods become necessary and have attracted more and more interest.
Linear Discriminant Analysis (LDA) [9, 10] is widely used in pattern recognition tasks [11, 12] to project features onto a lower-dimensional and more discriminative space. The transformation is learned by maximizing the between-class (inter-speaker) difference while minimizing the within-class (intra-speaker) variation. LDA is a simple linear transformation used as a preprocessor in many speaker verification systems to generate reduced-dimensional and channel-compensated embeddings from the original i-vectors, and results on standard datasets such as the Speaker Recognition Evaluation (SRE) corpus show its effectiveness [13]. Despite its effectiveness and popularity, LDA has its limitations. For instance, LDA can provide at most C−1 discriminant features, where C is the number of classes. Moreover, it is a linear projection, which may not be capable of dealing with data that are not linearly separable. Several methods such as weighted LDA [14] and nonparametric discriminant analysis (NDA) [15, 6] have been proposed as substitutes for LDA in speaker verification tasks. NDA redefines the between-class scatter matrix: the expected values that represent the global information about each class are replaced with local sample averages computed based on the nearest neighbors of individual samples. Another popular compensation method in the i-vector space is Probabilistic Linear Discriminant Analysis (PLDA) [16]. It is usually used as a scoring method and combined with other compensation methods, such as LDA. Considering different scenarios, PLDA has several variants such as two-covariance PLDA [17], simplified PLDA [5, 18] and heavy-tailed PLDA [18]. Currently, the i-vector/PLDA system achieves the state-of-the-art performance.
Recently, there have also been attempts to use Deep Learning (DL) techniques for denoising and channel compensation in speaker recognition. This compensation can be performed in the cepstral feature space or in the i-vector space. The authors of [19] used features estimated by a denoising DNN as the input to an i-vector system for channel-robust speaker recognition. The authors of [7] proposed to use an autoencoder to learn a projection which maps noisy i-vectors to denoised ones. To address the short-duration problem of i-vectors [20], a Convolutional Neural Network (CNN) based system was trained in [8] to map the i-vectors extracted from short utterances to the corresponding long-utterance i-vectors.
In this paper, we propose a discriminative neural network (NN) based compensation method in the i-vector space. The proposed NN-based method shares the same spirit as LDA: it is trained to minimize the softmax loss and the center loss [21] simultaneously, where the former forces the transformed embeddings of different classes to stay apart and the latter pulls embeddings from the same class close to their centers. With the joint supervision of the softmax loss and the center loss, the NN produces a projection function similar to LDA, enlarging the between-class difference and reducing the within-class variation. The proposed NN-based compensation method will be referred to as Deep Discriminant Analysis (DDA) in this paper.
The rest of the paper is organized as follows. Section 2 briefly introduces the i-vector framework. Section 3 reviews two conventional compensation methods in the i-vector space, LDA and PLDA. We propose the discriminative neural network based compensation method (DDA) in Section 4, followed by experiments and result analysis in Section 5. Section 6 concludes the paper.
2 i-vector
The Joint Factor Analysis (JFA) framework [2] was proposed as a compensation method in the GMM supervector space; it models speaker and channel factors in separate subspaces. The subsequent i-vector approach simplifies the JFA framework by modeling a single total variability subspace [1]. In the i-vector framework, the speaker- and session-dependent supervector s (derived from the UBM) is modeled as

s = m + Tw + ε, (1)

where m is a speaker- and session-independent supervector, T is a low-rank matrix which captures speaker and session variability, w ~ N(0, I) is a multivariate random variable whose posterior mean is the so-termed i-vector, and ε is the residual noise term that accounts for the variability not captured by T.
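The generative model above can be illustrated with a minimal NumPy sketch; the dimensions D and d below are toy values chosen for illustration, not those of a real system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; a real system derives the supervector from a UBM with
# thousands of mixtures and uses a 400-600 dimensional i-vector space.
D, d = 50, 5

m = rng.normal(size=D)           # speaker- and session-independent supervector
T = rng.normal(size=(D, d))      # low-rank total variability matrix
w = rng.normal(size=d)           # w ~ N(0, I); its posterior mean is the i-vector
eps = 0.1 * rng.normal(size=D)   # residual noise not captured by T

s = m + T @ w + eps              # speaker- and session-dependent supervector
```

In practice the i-vector is not sampled but inferred: the posterior mean of w given the Baum-Welch statistics of an utterance, which is omitted here.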
As shown in Equation 1, the i-vector model is a simple and elegant representation that follows the standard Factor Analysis (FA) scheme. However, since the i-vector contains speaker- and channel-dependent factors in the same subspace, further channel compensation methods such as LDA are often applied to mitigate the impact of nuisance attributes.
3 Conventional Compensation Methods
Speaker recognition systems are vulnerable to noise, channel mismatch and many other factors, and compensation technologies have been heavily researched over the past several decades. In this section, we discuss compensation methods in the i-vector space; two methods, LDA and PLDA, are introduced in detail.
3.1 Linear Discriminant Analysis (LDA)
LDA is widely used in pattern recognition tasks such as image recognition [11] and speaker recognition [13]. LDA calculates a matrix A that projects D-dimensional feature vectors (i-vectors in this paper) into a lower-dimensional and more discriminative subspace of dimension d (d < D). The projection can be represented as

y = Aᵀw, (2)

where y denotes the compensated embedding, w the original i-vector, and A is a rectangular matrix of shape D × d. A is determined by maximizing the Fisher criterion

A* = argmax_A |Aᵀ S_b A| / |Aᵀ S_w A|, (3)

which reduces to the generalized eigenvalue problem

S_b aᵢ = λᵢ S_w aᵢ. (4)
The between-class and within-class covariance matrices, S_b and S_w respectively, can be computed as

S_b = Σ_{s=1}^{S} n_s (w̄_s − w̄)(w̄_s − w̄)ᵀ, (5)

S_w = Σ_{s=1}^{S} Σ_{i=1}^{n_s} (w_{si} − w̄_s)(w_{si} − w̄_s)ᵀ, (6)

where S represents the total number of speakers and N represents the total number of i-vectors from all speakers. w̄ represents the global mean of all i-vectors, whereas w̄_s represents the mean of the i-vectors from the s-th speaker. w_{si} represents the i-th i-vector from the s-th speaker, and n_s is the number of utterances from the s-th speaker.
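The scatter matrices and the eigen-solution can be sketched in a few lines of NumPy. This is a toy illustration with synthetic data; production systems typically add regularization to keep the within-class scatter invertible:

```python
import numpy as np

def lda_projection(ivectors, labels, out_dim):
    """Compute the LDA projection A from labeled i-vectors.

    Columns of A are the eigenvectors of Sw^{-1} Sb with the largest
    eigenvalues, using the scatter matrices of Eqs. (5)-(6)."""
    X = np.asarray(ivectors, dtype=float)
    labels = np.asarray(labels)
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        mean_s = Xs.mean(axis=0)
        diff = (mean_s - mean_all)[:, None]
        Sb += len(Xs) * diff @ diff.T           # between-class scatter
        Sw += (Xs - mean_s).T @ (Xs - mean_s)   # within-class scatter
    # Generalized eigenproblem Sb a = lam Sw a, solved via Sw^{-1} Sb.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:out_dim]]     # shape (D, out_dim)

# Toy usage: 3 synthetic "speakers", 20 samples each, projected to 2 dims.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10)) + np.repeat(rng.normal(size=(3, 10)), 20, axis=0)
A = lda_projection(X, np.repeat([0, 1, 2], 20), out_dim=2)
```

A compensated embedding is then obtained as y = A.T @ w for each i-vector w, as in Equation 2.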
LDA has an analytical solution: the optimized A is a matrix whose columns are the eigenvectors corresponding to the largest eigenvalues of S_w⁻¹S_b. However, despite its simplicity and effectiveness, LDA has several limitations:

- The within- and between-class matrices are formed based on Gaussian assumptions for the samples of each class. If the Gaussian assumption does not hold, LDA cannot learn a projection function effective enough for classification problems.
- LDA suffers from the "small sample size" problem, which leads to the singularity of the within-class scatter matrix S_w. This problem occurs when the number of samples is much smaller than the dimension of the original samples.
- Given the class number C, LDA can provide at most C−1 discriminant features, since the between-class scatter matrix S_b has rank at most C−1. This may not be sufficient for tasks in which the class number is much smaller than the dimension of the input features.
- LDA learns a linear projection function, which may not be adequate for data that are not linearly separable.
To address these limitations of LDA, several approaches have been proposed. For instance, nonparametric discriminant analysis (NDA) was proposed in [15] and applied to robust speaker verification [6, 13]. In NDA, instead of considering only the class centers when computing the between-class scatter matrix, the global information about a class is replaced with local sample averages computed based on the nearest neighbors of individual samples. Researchers have also proposed several approaches to tackle the "small sample size" problem [22, 23].
3.2 Probabilistic Linear Discriminant Analysis
I-vectors with a Probabilistic Linear Discriminant Analysis (PLDA) backend obtain the state-of-the-art performance in speaker verification. Several variants of PLDA have been introduced into the speaker verification task, including the standard PLDA [16], two-covariance PLDA [17], heavy-tailed PLDA [18] and the simplified PLDA [5, 18]. The optimization goal of all variants is to maximize the between-class difference and minimize the within-class variation. PLDA models regard i-vectors as observations from a probabilistic generative model and can be seen as a factor analysis in the i-vector space. In our experimental settings, the variant implemented in Kaldi [24], termed Kaldi-PLDA here, achieves the best performance; it follows the formulation in [25] and is similar to the two-covariance model.
In the Kaldi-PLDA, an i-vector w is assumed to be generated as

w = y + ε, (7)

y ~ N(μ, Φ_b), (8)

ε ~ N(0, Φ_w), (9)

where y is a latent speaker variable shared by all i-vectors of the same speaker, μ is the global mean, and Φ_b and Φ_w are the between-class and within-class covariance matrices, respectively.
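Under this two-covariance formulation, a verification trial can be scored as the log-likelihood ratio between the "same speaker" and "different speakers" hypotheses. The sketch below is illustrative, not the exact Kaldi implementation (which diagonalizes the model for efficiency); the variable names are ours:

```python
import numpy as np

def log_mvn(x, mean, cov):
    """Log-density of a multivariate Gaussian (no SciPy dependency)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    k = len(mean)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def plda_llr(w1, w2, mu, phi_b, phi_w):
    """Same-vs-different speaker log-likelihood ratio under the
    two-covariance model: w = y + eps, y ~ N(mu, phi_b), eps ~ N(0, phi_w)."""
    tot = phi_b + phi_w
    x = np.concatenate([w1, w2])
    m = np.concatenate([mu, mu])
    # Same speaker: the shared latent y induces cross-covariance phi_b.
    cov_same = np.block([[tot, phi_b], [phi_b, tot]])
    # Different speakers: the two i-vectors are independent.
    cov_diff = np.block([[tot, np.zeros_like(tot)], [np.zeros_like(tot), tot]])
    return log_mvn(x, m, cov_same) - log_mvn(x, m, cov_diff)
```

A positive score favors the same-speaker hypothesis; in evaluation the scores are thresholded to produce accept/reject decisions.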
4 Neural Network based Approach
4.1 Center Loss
Neural networks have been investigated extensively in areas such as image recognition, speech recognition [26, 27, 28] and speaker recognition [3, 29, 30]. One of the most popular methods is to treat the neural network as a feature extractor, where the learned features are called "bottleneck features" or "deep features" [31, 32, 33, 34, 35]. For instance, in speaker recognition, researchers proposed to extract feature vectors from the last hidden layer of a well-trained speaker-discriminative DNN. In most work, the DNN is optimized against the softmax loss, which emphasizes discriminating different speakers.
The softmax loss function is defined as

L_s = − Σ_{i=1}^{N} log ( exp(W_{y_i}ᵀx_i + b_{y_i}) / Σ_{j=1}^{C} exp(W_jᵀx_i + b_j) ), (10)

where N is the total number of training samples (i-vectors) and x_i denotes the i-th sample, belonging to the y_i-th class. C is the number of softmax outputs, representing the different classes. W is the projection weight matrix and b is the corresponding bias term.
Center loss [21] is formulated as

L_c = ½ Σ_{i=1}^{N} ‖x_i − c_{y_i}‖₂², (11)

where c_{y_i} represents the center of the y_i-th class (to which the i-th sample belongs) and is updated along with the training procedure. The neural network is trained under the joint supervision of the softmax loss and the center loss, formulated as

L = L_s + λL_c, (12)

where λ is adopted to balance the two loss functions. Intuitively, the softmax loss forces the learned embeddings of different classes to stay apart, while the center loss pulls embeddings from the same class close to their centers. With this joint supervision, the neural network learns a projection function similar to LDA, enlarging the inter-class differences and reducing the intra-class variations.
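The two losses and their combination can be written directly in NumPy. This is a forward-pass sketch only; a real system would use an autograd framework to train against them:

```python
import numpy as np

def softmax_loss(X, y, W, b):
    """Cross-entropy over softmax outputs (Eq. 10)."""
    logits = X @ W + b                              # (N, C)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].sum()

def center_loss(X, y, centers):
    """Half the squared distance of each embedding to its class center (Eq. 11)."""
    return 0.5 * ((X - centers[y]) ** 2).sum()

def joint_loss(X, y, W, b, centers, lam=0.01):
    """L = L_s + lambda * L_c (Eq. 12)."""
    return softmax_loss(X, y, W, b) + lam * center_loss(X, y, centers)
```

Here lam corresponds to the weight λ in Equation 12; its default of 0.01 matches the setting used in our experiments.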
To show the effectiveness of the center loss, following the approach in [21], we also train a toy example on a small speaker audio dataset containing 10 different speakers. A 2-layer neural network is trained, and the dimension of the embedding layer is set to 2 for illustration. As shown in Fig. 1 and Fig. 2 (best viewed in color), with the assistance of the center loss, the within-class variation is greatly reduced. The following experiments show that this property generalizes to scenarios where the validation speakers have no overlap with the training speakers, which is the common condition in speaker verification.
4.2 Deep Discriminant Analysis
Deep Neural Networks (DNNs) have shown extraordinary capability in speech recognition and speaker recognition, and they impose no prior assumption on the input data. By substituting a DNN for the Gaussian Mixture Model (GMM) [26], DNN-HMM systems achieve noticeable performance improvements over traditional GMM-HMM systems, and the same holds for speaker recognition tasks when moving from the GMM-ivector to the DNN-ivector [3]. In this section, a DNN is used to perform channel compensation in the i-vector space. The whole architecture is depicted in Fig. 3. In the training phase, the i-vectors extracted from different speakers are prepared as input, and the DNN is jointly supervised by the softmax loss and the center loss. The last layer before the loss layer is an embedding layer, from which we extract the transformed embeddings. In the compensation stage, the source i-vectors are mapped to their corresponding transformed versions through the trained neural network. Similar to the projection in Equation 2, given the original i-vector w, the compensated lower-dimensional embedding y can be represented as
y = f(w), (13)

where f(·) denotes the nonlinear transformation function learned by the NN from the training data. We term this NN-based compensation method Deep Discriminant Analysis (DDA), in analogy with LDA and NDA.
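At inference time, the compensation mapping f amounts to a single forward pass through the trained network. The sketch below follows the layer sizes used in our experiments (600 → 600 with PReLU, batch normalization, then a 300-dim linear embedding); the parameters here are placeholders that a trained network would supply:

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU with a scalar slope for the negative part."""
    return np.where(x > 0, x, a * x)

def dda_compensate(w, params):
    """Inference-time forward pass of the compensation network f(.):
    600 -> 600 (PReLU) -> 600 (PReLU) -> BatchNorm -> 300-dim embedding."""
    W1, b1, W2, b2, gamma, beta, mean, var, W3, b3 = params
    h = prelu(w @ W1 + b1)                                # input layer
    h = prelu(h @ W2 + b2)                                # hidden layer
    h = gamma * (h - mean) / np.sqrt(var + 1e-5) + beta   # batch norm (inference)
    return h @ W3 + b3                                    # embedding layer
```

The compensated 300-dim embedding replaces the 600-dim i-vector in the subsequent scoring step.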
5 Experiments
5.1 Dataset
We evaluate the performance of the proposed methods on a short-duration dataset generated from the NIST SRE corpus; this short-duration text-independent task is more difficult for speaker verification. The training set consists of selected data from SRE, Switchboard II Phase, and Switchboard Cellular Part. After removing silence frames using an energy-based VAD, the utterances are chopped into short segments (ranging from s). The final training set contains speakers, and each speaker has short utterances. The enrollment set and test set are derived from NIST SRE following a similar procedure. The enrollment set contains speakers (males and females), and each speaker is enrolled with utterances. The test set contains utterances from the models in the enrollment set. The trial list we create contains trials, with positive samples and negative samples on average for each model. No cross-gender trials exist.
5.2 System Details
5.2.1 Baseline Settings
The baseline i-vector system is implemented using the Kaldi toolkit. 20-dimensional MFCC coefficients with their first- and second-order derivatives are extracted from the speech segments (identified with an energy-based VAD). A 25 ms Hamming window with a 10 ms frame shift is adopted in the feature extraction process. The universal background model (UBM) contains 2048 Gaussian mixtures, and the i-vector dimension is set to 600. Three different scoring methods are applied to the length-normalized i-vectors: Cos denotes the cosine similarity of two vectors, Euc denotes the Euclidean distance, and PLDA is the probabilistic backend of Section 3.2. As shown in Table 2, PLDA achieves the best performance on the raw i-vectors with an EER of 4.96%, since PLDA is itself both a compensation and a scoring method. The LDA dimension in Table 2 is set to 300, and LDA obtains significant performance improvements when combined with the Cos or Euc scoring methods. However, no improvement is observed when combining LDA with PLDA.
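The Cos and Euc scoring rules applied to length-normalized i-vectors amount to a few lines of NumPy. Negating the Euclidean distance so that larger scores mean "more similar" is our convention for illustration:

```python
import numpy as np

def length_norm(w):
    """Project an i-vector onto the unit sphere before scoring."""
    return w / np.linalg.norm(w)

def cos_score(w1, w2):
    """Cosine similarity: higher means more likely the same speaker."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def euc_score(w1, w2):
    """Negated Euclidean distance, so higher still means more similar."""
    return -float(np.linalg.norm(w1 - w2))
```

Note that after length normalization the two scores are monotonically related, since ‖w1 − w2‖² = 2 − 2 cos(w1, w2) on the unit sphere; in practice they can still behave differently when applied to compensated embeddings of different scales.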
5.2.2 Neural Network Settings
As shown in Table 1, we adopt a standard feedforward neural network as the compensation model, which contains one input layer, one hidden layer and one embedding layer. PReLU [36] is chosen as the activation function, and a batch normalization layer is added before the embedding layer to stabilize the training procedure. The whole network is trained under the joint supervision of the softmax loss and the center loss, with the value of λ in Equation 12 set to 0.01 (a detailed explanation of this setting can be found in Section 5.3.1). Following the strategy used in [21], besides the λ that balances the impact of the two losses, a different learning rate is used for the center loss parameters: the learning rate for the basic neural network is set to 0.01 and the one for the center loss is set to 0.1. In the training stage, since it is impractical to update the centers with respect to the whole training set, we update the centers per minibatch instead; centers are computed by averaging the embeddings of the corresponding classes (centers of some classes may not be updated in a given batch).
Table 1: Configuration of the compensation network.

Input: source i-vectors of 600 dimensions
Loss: softmax loss + 0.01 × center loss

Linear Layers   | Number of nodes | Nonlinearity
Input Layer     | 600             | PReLU
Hidden Layer    | 600             | PReLU + BatchNorm
Embedding Layer | 300             | None
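The per-minibatch center update described above can be sketched as follows, following the update rule of [21]; the alpha value matches the center-loss learning rate used in our setup, and the function itself is illustrative:

```python
import numpy as np

def update_centers(centers, X, y, alpha=0.1):
    """Minibatch center update following [21]:
    c_j <- c_j - alpha * sum_{i: y_i = j}(c_j - x_i) / (1 + n_j),
    where n_j is the number of batch samples of class j.
    Centers of classes absent from the batch are left untouched."""
    centers = centers.copy()
    for j in np.unique(y):
        Xj = X[y == j]
        delta = (centers[j] - Xj).sum(axis=0) / (1 + len(Xj))
        centers[j] -= alpha * delta
    return centers
```

The 1 + n_j denominator damps the update for classes with few samples in the batch, which keeps the centers from jumping when a class appears only once.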
5.3 Results and Analysis
The proposed neural network based system is evaluated on the dataset described in Section 5.1. As shown in Table 2, compared to LDA, the NN-based DDA obtains larger improvements for the Cos and Euc scoring methods, and the best performance, an EER of 4.69%, is achieved by DDA+Euc, which also outperforms the baseline PLDA system. However, the proposed compensation method is not compatible with PLDA. To better understand the effect of the proposed method, we use t-SNE [37] to visualize the i-vectors and their corresponding DDA-compensated embeddings in Fig. 4 and Fig. 5 (best viewed in color).
Table 2: EER (%) of different compensation methods with three scoring backends.

Methods  | Cos  | Euc  | PLDA
Baseline | 7.29 | 6.04 | 4.96
LDA      | 5.89 | 5.22 | 5.00
DDA      | 4.78 | 4.69 | 7.32
Fig. 4 depicts the distribution of i-vectors from 10 speakers randomly chosen from the test set, while the distribution of the corresponding compensated embeddings is shown in Fig. 5. As shown in the two figures, with the proposed compensation method the distribution of embeddings from the same speaker becomes visibly more compact, indicating that the intra-speaker variation is significantly reduced.
5.3.1 Impact of the loss weight λ
As mentioned above, a weight λ is used to balance the softmax loss and the center loss. A small λ implies a strong supervision signal from the softmax loss, whereas a large λ implies a strong supervision signal from the center loss. As shown in Fig. 6 and Fig. 7, when λ is set to 0.1, the network is effectively not trainable: though the center loss decreases quickly, the softmax loss hardly changes. In this case, the embeddings are trained to be similar to each other and become indistinguishable. As λ is reduced, the softmax loss decreases faster due to its relatively stronger supervision signal. In our experiments, when λ varies from 0.001 to 0.01, the training converges faster, while the compensation performance hardly changes.
5.3.2 Impact of the embedding dimension
In this section, we investigate the impact of the dimension of the projection subspace by varying the embedding layer's dimension. As shown in Table 3, it is interesting that DDA achieves its best performance with 400 dimensions, whereas LDA achieves its best performance with 200 dimensions. Although not listed in Table 3, it should be mentioned that with dimensions of 100 or 500 the EER increases for both compensation methods.
Table 3: EER (%) with different embedding dimensions.

Scoring | Compensation | 200-dim | 300-dim | 400-dim
Cos     | LDA          | 5.53    | 5.89    | 6.28
Cos     | DDA          | 5.31    | 4.78    | 4.67
Euc     | LDA          | 5.22    | 5.22    | 5.40
Euc     | DDA          | 5.08    | 4.69    | 4.51
6 Conclusion
Intra-speaker variability compensation techniques such as LDA have been researched extensively within the state-of-the-art i-vector framework, but LDA has several limitations due to the mismatch between its assumptions and the true distribution of i-vectors. In this paper, we proposed a nonlinear compensation framework based on a discriminative neural network, termed Deep Discriminant Analysis (DDA). The neural network is trained under the joint supervision of the softmax loss and the center loss: the softmax loss forces the learned embeddings of different classes to stay apart, while the center loss pulls embeddings from the same class close to their centers. Experiments show that with the assistance of the proposed compensation method, simple cosine or Euclidean scoring can achieve even better performance than PLDA.
References
 [1] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [2] Patrick Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” CRIM, Montreal,(Report) CRIM06/0813, 2005.
 [3] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.
 [4] Wei Rao, Man-Wai Mak, and Kong-Aik Lee, “Normalization of total variability matrix for i-vector/PLDA speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4180–4184.
 [5] Daniel GarciaRomero and Carol Y EspyWilson, “Analysis of ivector length normalization in speaker recognition systems,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
 [6] Seyed Omid Sadjadi, Jason Pelecanos, and Weizhong Zhu, “Nearest neighbor discriminant analysis for robust speaker recognition,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [7] Shivangi Mahto, Hitoshi Yamamoto, and Takafumi Koshinaka, “I-vector transformation using a novel discriminative denoising autoencoder for noise-robust speaker recognition,” Proc. Interspeech 2017, pp. 3722–3726, 2017.
 [8] Jinxi Guo, Usha Amrutha Nookala, and Abeer Alwan, “CNN-based joint mapping of short and long utterance i-vectors for speaker verification using short utterances,” Proc. Interspeech 2017, pp. 3712–3716, 2017.
 [9] Suresh Balakrishnama and Aravind Ganapathiraju, “Linear discriminant analysis: a brief tutorial,” Institute for Signal and Information Processing, vol. 18, pp. 1–8, 1998.
 [10] Nasser M Nasrabadi, “Pattern recognition and machine learning,” Journal of electronic imaging, vol. 16, no. 4, pp. 049901, 2007.
 [11] Peter N. Belhumeur, João P Hespanha, and David J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 711–720, 1997.
 [12] Reinhold Haeb-Umbach and Hermann Ney, “Linear discriminant analysis for improved large vocabulary continuous speech recognition,” in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 13–16.
 [13] Seyed Omid Sadjadi, Sriram Ganapathy, and Jason W Pelecanos, “The ibm 2016 speaker recognition system,” arXiv preprint arXiv:1602.07291, 2016.
 [14] Ahilan Kanagasundaram, David Dean, Robbie Vogt, Mitchell McLaren, Sridha Sridharan, and Michael Mason, “Weighted LDA techniques for i-vector based speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4781–4784.
 [15] K Fukunaga and JM Mantock, “Nonparametric discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, , no. 6, pp. 671–678, 1983.
 [16] Simon JD Prince and James H Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
 [17] Niko Brümmer and Edward De Villiers, “The speaker partitioning problem.,” in Odyssey, 2010, p. 34.
 [18] Patrick Kenny, “Bayesian speaker verification with heavytailed priors.,” in Odyssey, 2010, p. 14.
 [19] Frederick S Richardson, Douglas A Reynolds, and Brian Nemsick, “Channel compensation for speaker recognition using map adapted plda and denoising dnns,” Tech. Rep., MIT Lincoln Laboratory Lexington United States, 2016.
 [20] Ahilan Kanagasundaram, Robbie Vogt, David B Dean, Sridha Sridharan, and Michael W Mason, “I-vector based speaker recognition on short utterances,” in Proceedings of the 12th Annual Conference of the International Speech Communication Association. International Speech Communication Association (ISCA), 2011, pp. 2341–2344.
 [21] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
 [22] Rui Huang, Qingshan Liu, Hanqing Lu, and Songde Ma, “Solving the small sample size problem of lda,” in Pattern Recognition, 2002. Proceedings. 16th International Conference on. IEEE, 2002, vol. 3, pp. 29–32.
 [23] Alok Sharma and Kuldip K Paliwal, “Linear discriminant analysis for the small sample size problem: an overview,” International Journal of Machine Learning and Cybernetics, vol. 6, no. 3, pp. 443–454, 2015.
 [24] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.
 [25] Sergey Ioffe, “Probabilistic linear discriminant analysis,” in European Conference on Computer Vision. Springer, 2006, pp. 531–542.
 [26] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 [27] George E Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
 [28] Frank Seide, Gang Li, and Dong Yu, “Conversational speech transcription using contextdependent deep neural networks,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
 [29] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE, 2016, pp. 5115–5119.
 [30] David Snyder, Pegah Ghahremani, Daniel Povey, Daniel Garcia-Romero, Yishay Carmiel, and Sanjeev Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 165–170.
 [31] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4460–4464.
 [32] Dong Yu and Michael L Seltzer, “Improved bottleneck features using pre-trained deep neural networks,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
 [33] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
 [34] Yuan Liu, Yanmin Qian, Nanxin Chen, Tianfan Fu, Ya Zhang, and Kai Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 2015.
 [35] Nanxin Chen, Yanmin Qian, and Kai Yu, “Multi-task learning for text-dependent speaker verification,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
 [37] Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.