Deep Discriminant Analysis for i-vector Based Robust Speaker Recognition

Abstract

Linear Discriminant Analysis (LDA) has been used as a standard post-processing procedure in many state-of-the-art speaker recognition systems. By maximizing the inter-speaker difference and minimizing the intra-speaker variation, LDA projects i-vectors to a lower-dimensional and more discriminative subspace. In this paper, we propose a neural network based compensation scheme (termed Deep Discriminant Analysis, DDA) for i-vector based speaker recognition, which shares the same spirit as LDA. Optimized against the softmax loss and the center loss simultaneously, the proposed method learns a more compact and discriminative embedding space. In contrast to the Gaussian distribution assumption on the data and the learned linear projection in LDA, the proposed method imposes no assumptions on the data and can learn a non-linear projection function. Experiments are carried out on a short-duration text-independent dataset based on the SRE corpus, and noticeable performance improvements are observed over the standard LDA and PLDA methods.

Shuai Wang, Zili Huang, Yanmin Qian, Kai Yu
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
SpeechLab, Department of Computer Science and Engineering
Brain Science and Technology Research Center
Shanghai Jiao Tong University, Shanghai, China
The corresponding author is Kai Yu. This work has been supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002102 and the China NSFC projects (No. U1736202 and No. 61603252). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
{feixiang121976, huangziliandy, yanminqian, kai.yu}@sjtu.edu.cn


1 Introduction

Speaker recognition aims to recognize or verify a speaker's identity from a given speech segment. Since it was proposed in [1], the i-vector has become the state-of-the-art speaker modeling technique. It is a simple but elegant factor analysis model, inspired by the Joint Factor Analysis (JFA) [2] framework. Though some researchers have worked on improving the i-vector model itself [3, 4], more attention has been paid to compensation techniques in the i-vector space [5, 6, 7, 8]. JFA can be regarded as a compensation method in the GMM super-vector space, modeling the speaker and channel variabilities separately. The i-vector model simplifies JFA by modeling the speaker- and channel-dependent factors in a single low-dimensional space, leaving the compensation mechanisms to the following steps. In real applications, nuisance attributes such as channel and noise can severely degrade system performance, so compensation methods become necessary and have attracted more and more interest.

Linear Discriminant Analysis (LDA) [9, 10] is widely used in pattern recognition tasks [11, 12] to project features onto a lower-dimensional and more discriminative space. The transformation is learned by maximizing the between-class (inter-speaker) difference and minimizing the within-class (intra-speaker) variation. LDA is a simple linear transformation used as a preprocessor to generate reduced-dimensional and channel-compensated embeddings from the original i-vectors in many speaker verification systems, and results on standard datasets such as the Speaker Recognition Evaluation (SRE) corpus show its effectiveness [13]. Despite its effectiveness and popularity, LDA has its limitations. For instance, LDA can provide at most $C-1$ discriminant features, where $C$ is the number of classes. It is also a linear projection, which may not be capable of dealing with data that is not linearly separable. Several methods such as weighted LDA [14] and nonparametric discriminant analysis (NDA) [15, 6] have been proposed as substitutes for LDA in speaker verification tasks. NDA redefines the between-class scatter matrix: the expected values that represent the global information about each class are replaced with local sample averages computed based on the $k$-NN of individual samples. Another popular compensation method in the i-vector space is Probabilistic Linear Discriminant Analysis (PLDA) [16]. It is usually used as a scoring method and combined with other compensation methods, such as LDA. For different scenarios, PLDA has several variants such as two-covariance PLDA [17], simplified PLDA [5, 18] and heavy-tailed PLDA [18]. Currently, the i-vector/PLDA system achieves state-of-the-art performance.

Recently, there have also been attempts to use Deep Learning (DL) techniques for de-noising and channel compensation in speaker recognition. This compensation can be performed in the cepstral feature space or in the i-vector space. The authors of [19] used features estimated by a denoising DNN as the input to an i-vector system for channel-robust speaker recognition. The authors of [7] proposed using an auto-encoder to learn a projection that maps noisy i-vectors to de-noised ones. To address the short-duration problem of i-vectors [20], a Convolutional Neural Network (CNN) based system was trained in [8] to map i-vectors extracted from short utterances to the corresponding long-utterance i-vectors.

In this paper, we propose a discriminative neural network (NN) based compensation method in the i-vector space. The proposed NN-based method shares the same spirit as LDA: it is trained to minimize the softmax loss and the center loss [21] simultaneously, where the former forces the transformed embeddings from different classes to stay apart and the latter pulls the embeddings of the same class close to their centers. With the joint supervision of the softmax loss and the center loss, the NN produces a projection function similar to LDA, enlarging the between-class difference and reducing the within-class variation. The proposed NN-based compensation method will be referred to as Deep Discriminant Analysis (DDA) in this paper.

The rest of the paper is organized as follows. Section 2 briefly introduces the i-vector framework. Section 3 reviews two conventional compensation methods in the i-vector space, LDA and PLDA. We propose the discriminative neural network based compensation method (DDA) in Section 4, followed by experiments and results analysis in Section 5. Section 6 concludes the paper.

2 i-vector

The Joint Factor Analysis (JFA) framework [2] was proposed as a compensation method in the GMM super-vector space; it models speaker and channel factors in separate subspaces. The subsequent i-vector approach simplifies the JFA framework by modeling a single total variability subspace [1]. In the i-vector framework, the speaker- and session-dependent super-vector $\mathbf{s}$ (derived from the UBM) is modeled as

$\mathbf{s} = \mathbf{m} + \mathbf{T}\mathbf{w} + \boldsymbol{\epsilon}$   (1)

where $\mathbf{m}$ is a speaker- and session-independent super-vector, $\mathbf{T}$ is a low-rank matrix which captures speaker and session variability, $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is a multivariate random variable whose posterior mean is the termed i-vector, and $\boldsymbol{\epsilon}$ is the residual noise term that accounts for the variability not captured by $\mathbf{T}$.

As shown in Equation (1), the i-vector is a simple and elegant representation, which follows the standard Factor Analysis (FA) scheme. However, since the i-vector contains the speaker- and channel-dependent factors in the same subspace, further channel compensation methods such as LDA are often applied to mitigate the impact of nuisance attributes.
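To make the generative view concrete, the following toy sketch samples one super-vector from the model in Equation (1); all dimensions and the noise scale here are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 1024, 60                 # hypothetical super-vector / subspace dimensions
m = rng.normal(size=D)          # speaker- and session-independent super-vector
T = rng.normal(size=(D, R))     # low-rank total variability matrix
w = rng.normal(size=R)          # latent factor, w ~ N(0, I)
eps = 0.1 * rng.normal(size=D)  # residual noise not captured by T
s = m + T @ w + eps             # Equation (1)
```

In practice the i-vector is not a sample of $\mathbf{w}$ but its posterior mean given the utterance statistics; the sketch only illustrates the generative direction of the model.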

3 Conventional Compensation Methods

Speaker recognition systems are vulnerable to noise, channel variation and many other factors, and compensation technologies have been heavily researched during the past several decades. In this section, we discuss compensation methods in the i-vector space; two methods, LDA and PLDA, are specifically introduced.

3.1 Linear Discriminant Analysis (LDA)

LDA is widely used in pattern recognition tasks such as image recognition [11] and speaker recognition [13]. LDA calculates a matrix $\mathbf{A}$ that projects high-dimensional feature vectors (i-vectors in this paper, of dimension $d$) into a lower-dimensional and more discriminative subspace of dimension $d' < d$. The projection can be represented as:

$\mathbf{y} = \mathbf{A}^\top \mathbf{w}$   (2)

where $\mathbf{y}$ denotes the compensated embedding and $\mathbf{A}$ is a rectangular matrix of shape $d \times d'$. $\mathbf{A}$ is determined by

$\mathbf{A}^* = \arg\max_{\mathbf{A}} J(\mathbf{A})$   (3)

$J(\mathbf{A}) = \frac{|\mathbf{A}^\top \mathbf{S}_b \mathbf{A}|}{|\mathbf{A}^\top \mathbf{S}_w \mathbf{A}|}$   (4)

The between-class and within-class covariance matrices, $\mathbf{S}_b$ and $\mathbf{S}_w$ respectively, can be computed as

$\mathbf{S}_b = \frac{1}{N} \sum_{s=1}^{S} n_s (\bar{\mathbf{w}}_s - \bar{\mathbf{w}})(\bar{\mathbf{w}}_s - \bar{\mathbf{w}})^\top$   (5)

$\mathbf{S}_w = \frac{1}{N} \sum_{s=1}^{S} \sum_{i=1}^{n_s} (\mathbf{w}_i^s - \bar{\mathbf{w}}_s)(\mathbf{w}_i^s - \bar{\mathbf{w}}_s)^\top$   (6)

where $S$ represents the total number of speakers and $N$ represents the total number of i-vectors from all speakers. $\bar{\mathbf{w}}$ represents the global mean of all i-vectors, whereas $\bar{\mathbf{w}}_s$ represents the mean of the i-vectors from the $s$-th speaker. $\mathbf{w}_i^s$ represents the $i$-th i-vector from the $s$-th speaker, and $n_s$ is the number of utterances from the $s$-th speaker.

LDA has an analytical solution: the optimal $\mathbf{A}$ is the matrix whose columns are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$.
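As a concrete reference, the following sketch estimates the projection of Equations (3)-(6) with a generalized eigendecomposition; the $n_s$ weighting of $\mathbf{S}_b$ is one common convention, and the $1/N$ factors are omitted since they cancel in the Fisher ratio.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(ivectors, labels, out_dim):
    """ivectors: (N, d) array; labels: (N,) speaker ids; returns A of shape (d, out_dim)."""
    mu = ivectors.mean(axis=0)                # global mean of all i-vectors
    d = ivectors.shape[1]
    S_b, S_w = np.zeros((d, d)), np.zeros((d, d))
    for spk in np.unique(labels):
        X_s = ivectors[labels == spk]
        mu_s = X_s.mean(axis=0)               # per-speaker mean
        diff = (mu_s - mu)[:, None]
        S_b += len(X_s) * (diff @ diff.T)     # between-class scatter, Equation (5)
        S_w += (X_s - mu_s).T @ (X_s - mu_s)  # within-class scatter, Equation (6)
    # generalized eigenproblem S_b a = lambda S_w a; assumes S_w is non-singular
    vals, vecs = eigh(S_b, S_w)               # eigenvalues returned in ascending order
    return vecs[:, ::-1][:, :out_dim]         # columns = leading eigenvectors

# usage: y = lda_projection(X, labels, 300).T @ w  -- the projection of Equation (2)
```

Despite its simplicity and effectiveness, LDA has several limitations: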

  • The within- and between-class scatter matrices are formed based on Gaussian assumptions for the samples of each class. If the Gaussian assumption does not hold, LDA is not able to learn an effective enough projection function for classification problems.

  • LDA suffers from the “small sample size” problem, which leads to the singularity of the within-class scatter matrix $\mathbf{S}_w$. This problem happens when the number of samples is much smaller than the dimension of the original samples.

  • Given the class number $C$, LDA can provide at most $C-1$ discriminant features, since the between-class scatter matrix has a rank of at most $C-1$. This may not be sufficient for tasks in which the class number is much smaller than the dimension of the input features.

  • LDA learns a linear projection function, which may not be adequate for data that is far from linearly separable.

To address these limitations of LDA, several approaches have been proposed. For instance, a nonparametric discriminant analysis (NDA) was proposed in [15] and applied to robust speaker verification [6, 13]. In NDA, instead of only considering the class centers when computing the between-class scatter matrix, the global information about each class is replaced with local sample averages computed based on the $k$-NN of individual samples. Researchers have also proposed several approaches to tackle the “small sample size” problem [22, 23].

3.2 Probabilistic Linear Discriminant Analysis

i-vectors with a Probabilistic Linear Discriminant Analysis (PLDA) back-end obtain state-of-the-art performance in speaker verification. Several variants of PLDA have been introduced to the speaker verification task, including the standard PLDA [16], two-covariance PLDA [17], heavy-tailed PLDA [18] and simplified PLDA [5, 18]. The optimization goal of all variants is to maximize the between-class difference and minimize the within-class variation. PLDA models regard i-vectors as observations from a probabilistic generative model and can be seen as factor analysis in the i-vector space. In our experimental settings, the variant implemented in Kaldi [24] achieves the best performance; we term it Kaldi-PLDA here. It follows the formulation in [25] and is similar to the two-covariance model.

In Kaldi-PLDA, an i-vector $\mathbf{x}$ is assumed to be generated as

$\mathbf{x} = \mathbf{m} + \mathbf{A}\mathbf{u}$   (7)

$\mathbf{u} \sim \mathcal{N}(\mathbf{v}, \mathbf{I})$   (8)

$\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi})$   (9)

where $\mathbf{v}$ represents the class (speaker), and $\mathbf{u}$ represents a sample of that class in the projected space. Kaldi-PLDA is trained using the EM algorithm; training and inference details can be found in [25] or the Kaldi project [24]. In the following sections, Kaldi-PLDA will simply be referred to as PLDA.
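To make the generative story concrete, the sketch below samples one i-vector under this formulation; all sizes and parameter values are hypothetical placeholders, and $\boldsymbol{\Psi}$ is taken diagonal as in [25].

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300
mu = rng.normal(size=dim)                # global mean m
A = rng.normal(size=(dim, dim))          # loading matrix A
Psi = rng.uniform(0.1, 2.0, size=dim)    # diagonal between-class covariance

v = np.sqrt(Psi) * rng.normal(size=dim)  # speaker variable, v ~ N(0, Psi), Equation (9)
u = v + rng.normal(size=dim)             # per-utterance sample, u ~ N(v, I), Equation (8)
x = mu + A @ u                           # observed i-vector, Equation (7)
```

At verification time, PLDA scores a trial with the log-likelihood ratio between the same-speaker and different-speaker hypotheses, which is what makes it a scoring method as well as a compensation model.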

4 Neural Network based Approach

4.1 Center Loss

Neural networks have been investigated extensively in areas such as image recognition, speech recognition [26, 27, 28] and speaker recognition [3, 29, 30]. One of the most popular methods is to treat the neural network as a feature extractor, where the learned features are called “bottleneck features” or “deep features” [31, 32, 33, 34, 35]. For instance, in speaker recognition, researchers have proposed extracting feature vectors from the last hidden layer of a well-trained speaker-discriminative DNN. In most works, the DNN is optimized against the softmax loss, which emphasizes discriminating different speakers.

The softmax loss function is defined as

$L_S = -\sum_{i=1}^{m} \log \frac{e^{\mathbf{W}_{y_i}^\top \mathbf{x}_i + b_{y_i}}}{\sum_{j=1}^{n} e^{\mathbf{W}_j^\top \mathbf{x}_i + b_j}}$   (10)

where $m$ is the total number of training samples (i-vectors) and $\mathbf{x}_i$ denotes the $i$-th sample, belonging to the $y_i$-th class. $n$ is the number of softmax outputs, representing different classes. $\mathbf{W}$ is the projection weight matrix and $\mathbf{b}$ is the corresponding bias term.

Center loss [21] is formulated as

$L_C = \frac{1}{2} \sum_{i=1}^{m} \|\mathbf{x}_i - \mathbf{c}_{y_i}\|_2^2$   (11)

where $\mathbf{c}_{y_i}$ represents the center of the $y_i$-th class (to which the $i$-th sample belongs) and is updated along with the training procedure. The neural network is trained under the joint supervision of the softmax loss and the center loss, formulated as

$L = L_S + \lambda L_C$   (12)

where $\lambda$ is adopted for balancing the two loss functions. Intuitively, the softmax loss forces the learned embeddings of different classes to stay apart, while the center loss pulls the embeddings of the same class close to their centers. With the joint supervision of the softmax loss and the center loss, the neural network learns a projection function similar to LDA, enlarging the inter-class differences and reducing the intra-class variations.
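A minimal PyTorch-style sketch of this joint objective is given below; the class centers are kept as trainable parameters so that they are updated per mini-batch by gradient descent, and the batch average here (rather than the plain sum of Equation (11)) is an implementation convenience. The class and dimension counts are hypothetical.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss of Equation (11): distance of embeddings to their class centers."""

    def __init__(self, num_classes, embed_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, embeddings, labels):
        # squared distance of each embedding to the center of its own class
        diff = embeddings - self.centers[labels]
        return 0.5 * diff.pow(2).sum(dim=1).mean()

# joint objective of Equation (12)
softmax_loss = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes=1000, embed_dim=300)  # hypothetical sizes
lam = 0.01

def joint_loss(logits, embeddings, labels):
    return softmax_loss(logits, labels) + lam * center_loss(embeddings, labels)
```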

To show the effectiveness of the center loss, following the approach in [21], we train a toy example on a small speaker audio dataset containing 10 different speakers. A 2-layer neural network is trained, and the dimension of the embedding layer is set to 2 for illustration. As shown in Fig. 1 and Fig. 2 (best viewed in color), with the assistance of the center loss, the within-class variation is greatly reduced. The following experiments show that this property generalizes to scenarios where the validation speakers have no overlap with the training speakers, which is the common condition in speaker verification.

Figure 1: Embeddings supervised by softmax loss
Figure 2: Embeddings supervised by softmax + center loss

4.2 Deep Discriminant Analysis

Deep Neural Networks (DNNs) have shown extraordinary capability in speech recognition and speaker recognition, and they impose no prior assumption on the input data. By substituting DNNs for Gaussian Mixture Models (GMMs) [26], DNN-HMM systems achieve noticeable performance improvements over traditional GMM-HMM systems, which also holds for speaker recognition tasks when moving from the GMM-i-vector to the DNN-i-vector [3]. In this section, a DNN is used to perform channel compensation in the i-vector space. The whole architecture is depicted in Fig. 3. In the training phase, i-vectors extracted from different speakers are prepared as input, and the DNN is jointly supervised by the softmax loss and the center loss. The last layer before the loss layer is an embedding layer, from which we extract the transformed embeddings. In the compensation stage, the source i-vectors are mapped to their corresponding transformed versions through the trained neural network. Similar to the projection in Equation (2), given the original i-vector $\mathbf{w}$, the compensated lower-dimensional embedding $\mathbf{y}$ can be represented as

$\mathbf{y} = f(\mathbf{w})$   (13)

where $f(\cdot)$ denotes the nonlinear transformation function learned by the NN from the training data. We term this NN-based compensation method Deep Discriminant Analysis (DDA), by analogy with LDA and NDA.

Figure 3: Architecture of DDA

5 Experiments

5.1 Dataset

We evaluate the performance of our proposed methods on a short-duration dataset generated from the NIST SRE corpus; this short-duration text-independent task is more difficult for speaker verification. The training set consists of selected data from the SRE corpora, Switchboard II and Switchboard Cellular. After removing silence frames using an energy-based VAD, the utterances are chopped into short segments. The enrollment set and test set are derived from NIST SRE following a similar procedure, with each enrolled speaker (both male and female models are included) represented by several short utterances. The trial list contains both target and non-target trials for each model, and no cross-gender trial exists.

5.2 System Details

5.2.1 Baseline Settings

The baseline i-vector system is implemented using the Kaldi toolkit. 20-dimensional MFCC coefficients with their first- and second-order derivatives are extracted from the speech segments (identified with an energy-based VAD). A 25 ms Hamming window with a 10 ms frame shift is adopted in the feature extraction process. The universal background model (UBM) contains 2048 Gaussian mixtures and the i-vector dimension is set to 600. Three different scoring methods are applied to the length-normalized i-vectors: Cos denotes the cosine similarity of two vectors, Euc denotes the Euclidean distance, and PLDA denotes the Kaldi-PLDA back-end. As shown in Table 2, PLDA achieves the best performance on the raw input i-vectors with an EER of 4.96%, since PLDA is itself both a compensation and a scoring method. The LDA dimension in Table 2 is set to 300; LDA obtains significant performance improvements when applied before the Cos or Euc scoring methods. However, no improvement is observed when combining LDA and PLDA.
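For reference, a minimal sketch of the two simple scoring functions as used here is given below; length normalization follows the baseline setup, and negating the Euclidean score so that larger values indicate more similar vectors is our assumed convention.

```python
import numpy as np

def length_norm(x):
    return x / np.linalg.norm(x)

def cos_score(enroll, test):
    # cosine similarity of length-normalized i-vectors
    return float(np.dot(length_norm(enroll), length_norm(test)))

def euc_score(enroll, test):
    # negated Euclidean distance, so that larger means more similar
    return -float(np.linalg.norm(length_norm(enroll) - length_norm(test)))
```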

5.2.2 Neural Network Settings

As shown in Table 1, we adopt a standard feed-forward neural network as the compensation model, which contains one input layer, one hidden layer and one embedding layer. PReLU [36] is chosen as the activation function, and a batch normalization layer is added before the embedding layer to stabilize the training procedure. The whole network is trained under the joint supervision of the softmax loss and the center loss, with the value of $\lambda$ in Equation (12) set to 0.01 (a detailed explanation of this setting can be found in Section 5.3.1). Following the strategy used in [21], besides the weight $\lambda$ to balance the impact of the two losses, a different learning rate is used for the center-loss parameters: the learning rate for the basic neural network is set to 0.01 and the one for the center loss is set to 0.1. In the training stage, since it is impractical to update the centers with respect to the whole training set, we update the centers per mini-batch instead; the centers are computed by averaging the embeddings of the corresponding classes (centers of some classes may not be updated in a given batch).

Input            source i-vectors of 600 dimensions
Layer            #Nodes   Nonlinearity
Input Layer      600      PReLU
Hidden Layer     600      PReLU + BatchNorm
Embedding Layer  300      None
Loss             softmax loss + 0.01 * center loss
Table 1: Neural Network Configuration
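One plausible PyTorch reading of this configuration is sketched below, reusing the CenterLoss module from Section 4.1; the speaker count and the exact placement of batch normalization are assumptions, and the two parameter groups implement the separate learning rates described above.

```python
import torch
import torch.nn as nn

class DDA(nn.Module):
    """Sketch of the Table 1 network: a 600-600-300 feed-forward compensator."""

    def __init__(self, in_dim=600, hid_dim=600, emb_dim=300, num_spks=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.PReLU(),   # input layer
            nn.Linear(hid_dim, hid_dim), nn.PReLU(),  # hidden layer
            nn.BatchNorm1d(hid_dim),                  # BatchNorm before the embedding layer
            nn.Linear(hid_dim, emb_dim),              # embedding layer, no nonlinearity
        )
        self.classifier = nn.Linear(emb_dim, num_spks)  # softmax head, training only

    def forward(self, ivec):
        emb = self.net(ivec)              # compensated embedding, Equation (13)
        return emb, self.classifier(emb)

model = DDA()
centers = CenterLoss(num_classes=1000, embed_dim=300)  # as sketched in Section 4.1
optimizer = torch.optim.SGD([
    {"params": model.parameters(), "lr": 0.01},   # network learning rate
    {"params": centers.parameters(), "lr": 0.1},  # larger rate for the centers
])
```

At test time the classifier head is discarded and only the embedding output is kept for scoring.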

5.3 Results and Analysis

The proposed neural network based system is evaluated on the dataset described in Section 5.1. As shown in Table 2, compared to LDA, the NN-based DDA obtains larger improvements with the Cos and Euc scoring methods, and the best performance, an EER of 4.69%, is achieved by DDA+Euc, which also outperforms the baseline PLDA system. However, the proposed compensation method is not compatible with PLDA. To better understand the effect of the proposed method, we use t-SNE [37] to visualize the i-vectors and their corresponding DDA-compensated embeddings in Fig. 4 and Fig. 5 (best viewed in color).

Methods    Cos    Euc    PLDA
Baseline   7.29   6.04   4.96
LDA        5.89   5.22   5.00
DDA        4.78   4.69   7.32
Table 2: EER (%) of different compensation methods

Fig. 4 depicts the distribution of i-vectors from 10 speakers randomly chosen from the test set, while the distribution of the corresponding compensated embeddings is shown in Fig. 5. As the two figures show, with the proposed compensation method the embeddings of the same speaker are distributed much more compactly, which means the intra-speaker variation is significantly reduced.

Figure 4: Visualization of i-vectors
Figure 5: Visualization of compensated embeddings
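The visualization step itself is straightforward; a minimal sketch using scikit-learn's t-SNE is shown below, with random placeholders standing in for the real embeddings and speaker labels.

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(200, 300)  # placeholder for (compensated) embeddings
labels = np.repeat(np.arange(10), 20)   # placeholder speaker ids, 10 speakers

points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
# scatter-plot `points` colored by `labels` to reproduce Figs. 4 and 5
```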

5.3.1 Impact of the loss weight

As mentioned in the sections above, a weight $\lambda$ is used to balance the softmax loss and the center loss. A small $\lambda$ implies a strong supervision signal from the softmax loss, whereas a large $\lambda$ implies a strong supervision signal from the center loss. As shown in Fig. 6 and Fig. 7, when the weight is set to 0.1, the network is effectively not trainable: although the center loss decreases quickly, the softmax loss hardly changes. In this case, the embeddings are trained to be similar to each other and become indistinguishable. As the value of $\lambda$ is reduced, the softmax loss decreases faster owing to its relatively stronger supervision signal. In our experiments, when $\lambda$ varies from 0.001 to 0.01, the training converges faster while the compensation performance hardly changes.

Figure 6: Center Loss with the training epochs
Figure 7: Softmax Loss with the training epochs

5.3.2 Impact of the embedding dimension

In this section, we investigate the impact of the dimension of the projection subspace by varying the embedding layer's dimension. As shown in Table 3, it is interesting to find that DDA achieves its best performance with 400 dimensions, whereas LDA achieves its best performance with 200 dimensions. Although not listed in the table, it should be mentioned that with dimensions of 100 or 500 the EER increases for both compensation methods.

Scoring  Compensation  200dim  300dim  400dim
Cos      LDA           5.53    5.89    6.28
         DDA           5.31    4.78    4.67
Euc      LDA           5.22    5.22    5.40
         DDA           5.08    4.69    4.51
Table 3: EER (%) comparison with different embedding dimensions

6 Conclusion

Intra-speaker variability compensation techniques such as LDA have been researched extensively within the state-of-the-art i-vector framework, but LDA has several limitations due to the mismatch between its assumptions and the true distribution of i-vectors. In this paper, we proposed a non-linear compensation framework based on a discriminative neural network, termed DDA (Deep Discriminant Analysis). The neural network is trained under the joint supervision of the softmax loss and the center loss: the softmax loss forces the learned embeddings of different classes to stay apart, while the center loss pulls the embeddings of the same class close to their centers. Experiments show that with the assistance of the proposed compensation method, simple cosine or Euclidean scoring can achieve even better performance than PLDA.

References

  • [1] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  • [2] Patrick Kenny, “Joint factor analysis of speaker and session variability: Theory and algorithms,” CRIM, Montreal,(Report) CRIM-06/08-13, 2005.
  • [3] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.
  • [4] Wei Rao, Man-Wai Mak, and Kong-Aik Lee, “Normalization of total variability matrix for i-vector/plda speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4180–4184.
  • [5] Daniel Garcia-Romero and Carol Y Espy-Wilson, “Analysis of i-vector length normalization in speaker recognition systems,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
  • [6] Seyed Omid Sadjadi, Jason Pelecanos, and Weizhong Zhu, “Nearest neighbor discriminant analysis for robust speaker recognition,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [7] Shivangi Mahto, Hitoshi Yamamoto, and Takafumi Koshinaka, “I-vector transformation using a novel discriminative denoising autoencoder for noise-robust speaker recognition,” Proc. Interspeech 2017, pp. 3722–3726, 2017.
  • [8] Jinxi Guo, Usha Amrutha Nookala, and Abeer Alwan, “Cnn-based joint mapping of short and long utterance i-vectors for speaker verification using short utterances,” Proc. Interspeech 2017, pp. 3712–3716, 2017.
  • [9] Suresh Balakrishnama and Aravind Ganapathiraju, “Linear discriminant analysis-a brief tutorial,” Institute for Signal and information Processing, vol. 18, pp. 1–8, 1998.
  • [10] Nasser M Nasrabadi, “Pattern recognition and machine learning,” Journal of electronic imaging, vol. 16, no. 4, pp. 049901, 2007.
  • [11] Peter N. Belhumeur, João P Hespanha, and David J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 711–720, 1997.
  • [12] Reinhold Haeb-Umbach and Hermann Ney, “Linear discriminant analysis for improved large vocabulary continuous speech recognition,” in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 13–16.
  • [13] Seyed Omid Sadjadi, Sriram Ganapathy, and Jason W Pelecanos, “The ibm 2016 speaker recognition system,” arXiv preprint arXiv:1602.07291, 2016.
  • [14] Ahilan Kanagasundaram, David Dean, Robbie Vogt, Mitchell McLaren, Sridha Sridharan, and Michael Mason, “Weighted lda techniques for i-vector based speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4781–4784.
  • [15] K Fukunaga and JM Mantock, “Nonparametric discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 671–678, 1983.
  • [16] Simon JD Prince and James H Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE, 2007, pp. 1–8.
  • [17] Niko Brümmer and Edward De Villiers, “The speaker partitioning problem.,” in Odyssey, 2010, p. 34.
  • [18] Patrick Kenny, “Bayesian speaker verification with heavy-tailed priors.,” in Odyssey, 2010, p. 14.
  • [19] Frederick S Richardson, Douglas A Reynolds, and Brian Nemsick, “Channel compensation for speaker recognition using map adapted plda and denoising dnns,” Tech. Rep., MIT Lincoln Laboratory Lexington United States, 2016.
  • [20] Ahilan Kanagasundaram, Robbie Vogt, David B Dean, Sridha Sridharan, and Michael W Mason, “I-vector based speaker recognition on short utterances,” in Proceedings of the 12th Annual Conference of the International Speech Communication Association. International Speech Communication Association (ISCA), 2011, pp. 2341–2344.
  • [21] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
  • [22] Rui Huang, Qingshan Liu, Hanqing Lu, and Songde Ma, “Solving the small sample size problem of lda,” in Pattern Recognition, 2002. Proceedings. 16th International Conference on. IEEE, 2002, vol. 3, pp. 29–32.
  • [23] Alok Sharma and Kuldip K Paliwal, “Linear discriminant analysis for the small sample size problem: an overview,” International Journal of Machine Learning and Cybernetics, vol. 6, no. 3, pp. 443–454, 2015.
  • [24] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.
  • [25] Sergey Ioffe, “Probabilistic linear discriminant analysis,” in European Conference on Computer Vision. Springer, 2006, pp. 531–542.
  • [26] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [27] George E Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2012.
  • [28] Frank Seide, Gang Li, and Dong Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
  • [29] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE, 2016, pp. 5115–5119.
  • [30] David Snyder, Pegah Ghahremani, Daniel Povey, Daniel Garcia-Romero, Yishay Carmiel, and Sanjeev Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 165–170.
  • [31] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4460–4464.
  • [32] Dong Yu and Michael L Seltzer, “Improved bottleneck features using pretrained deep neural networks,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
  • [33] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
  • [34] Yuan Liu, Yanmin Qian, Nanxin Chen, Tianfan Fu, Ya Zhang, and Kai Yu, “Deep feature for text-dependent speaker verification,” Speech Communication, vol. 73, pp. 1–13, 2015.
  • [35] Nanxin Chen, Yanmin Qian, and Kai Yu, “Multi-task learning for text-dependent speaker verification,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [37] Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.