Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training
Abstract
In this article we propose a novel approach for adapting speaker embeddings to new domains based on adversarial training of neural networks. We apply our embeddings to the task of text-independent speaker verification, a challenging, real-world problem in biometric security. We further the development of end-to-end speaker embedding models by combining a novel 1-dimensional, self-attentive residual network, an angular margin loss function and an adversarial training strategy. Our model is able to learn extremely compact, 64-dimensional speaker embeddings that deliver competitive performance on a number of popular datasets using simple cosine distance scoring. On the NIST-SRE 2016 task we are able to beat a strong i-vector baseline, while on the Speakers in the Wild task our model outperforms both i-vector and x-vector baselines, showing an absolute improvement of 2.19% over the latter. Additionally, we show that the integration of adversarial training consistently leads to a significant improvement over an unadapted model.
Gautam Bhattacharya, Jahangir Alam, Patrick Kenny 

McGill University 
Computer Research Institute of Montreal 
Index Terms— Speaker Verification, Adversarial Training, Domain Adaptation, End-to-End
1 Introduction
Text-independent speaker verification systems are binary classifiers that, given two recordings, answer the question:
Are the people speaking in the two recordings the same person?
The answer is typically delivered in the form of a scalar value, or verification score. Verification scores can be formulated as a likelihood ratio, as in the popular i-vector/PLDA approach [1, 2]. An alternative approach is to use simple distance metrics like mean-squared error or cosine distance. Verification models that can be scored in this way typically need to optimize the distance metric itself, i.e. they are optimized end-to-end. While contrastive-loss based end-to-end face verification models have shown state-of-the-art performance [3], their adoption in the speaker verification community has not been widespread due to the difficulties associated with training such models.
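As a concrete illustration of the scoring scheme the article relies on, the following minimal sketch computes a cosine verification score between two embeddings and thresholds it to produce a same/different decision (the threshold value here is purely illustrative):

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (higher = more similar)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def verify(emb_a, emb_b, threshold=0.5):
    """Accept the trial as 'same speaker' when the score exceeds a threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

In practice the operating threshold is chosen on a development set; the point of an end-to-end model is that this raw cosine score is already a usable verification score, with no back-end classifier required.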
State-of-the-art speaker verification systems follow the same recipe as i-vector systems by using an LDA/PLDA classifier, but replace the i-vector extractor with a Deep Neural Network (DNN) feature extractor [4]. The DNN embedding model is trained by minimizing the cross-entropy loss over speakers in the training data. While cross-entropy minimization is simpler than optimizing contrastive losses, the nature of the verification problem makes learning a good DNN embedding model challenging. This is evidenced by the Kaldi x-vector recipe, which we use as one of the baseline systems in this work. The recipe involves extensive data preparation, followed by a multi-GPU training strategy that combines a sophisticated model averaging technique with a natural gradient variant of SGD [4]. Replicating the performance of x-vectors with conventional first-order optimizers is non-trivial [5].
In this article we present Domain Adversarial Neural Speaker Embeddings (DANSE) for text-independent speaker verification. We make the following contributions:

We propose a novel architecture for extracting neural speaker embeddings based on a 1-dimensional residual network and a self-attention model. The model can be trained using a simple data sampling strategy and traditional first-order optimizers.

We show that the DANSE model can be optimized end-to-end to learn extremely compact (64-dimensional) embeddings that deliver competitive speaker verification performance using simple cosine scoring.

Finally, we propose to integrate adversarial training into the process of learning a speaker embedding model, in order to learn domain invariant features. To the best of our knowledge, ours is the first work to propose the use of adversarial training in a verification setting.
Modern speaker verification datasets like NIST-SRE 2016 and Speakers in the Wild (SITW) are challenging because in-domain or target data is not available for training verification systems [6, 7]. This leads to a domain shift between training and test datasets, which in turn degrades performance. Our key insight in this work is that verification performance can be improved significantly by encouraging the speaker embedding model to learn domain invariant features. We achieve this through Domain Adversarial Training (DAT) using the framework of Gradient Reversal [8]. This allows us to learn domain invariant speaker embeddings using a small amount of unlabelled, target domain data. DAT uses a simple reverse gradient method to learn a symmetric feature space, common to source and target data. This idea has been primarily used to adapt classifiers, but in this work we show that the features learned by DAT are also more speaker discriminative in the target domain. Apart from domain robustness, we find that using an appropriate verification loss function in combination with DAT is equally important for the model to show robust performance using a simple cosine scoring strategy.
2 Learning Domain Invariant Speaker Embeddings
2.1 Feature Extractor
The first step in learning discriminative speaker embeddings is to learn a mapping $F: X \rightarrow f$ from a sequence of speech frames $X = \{x_1, \dots, x_T\}$ from a speaker to a $D$-dimensional feature vector $f$. $F$ can be implemented using a variety of neural network architectures [4, 9, 10, 11]. In this work we use a deep residual network (ResNet) as our feature extractor [12]. Motivated by the fact that speech is translation invariant along the time-axis only, we propose to build our model using 1-dimensional convolutional layers. The ResNet architecture allows us to train much deeper networks, and leverage the greater representational capacity afforded by these models. The first convolutional layer utilizes a filter that spans $d$, the dimension of the frequency axis. The residual blocks are followed by an attentive statistics pooling layer (described in the next section) and two fully connected layers. In total the feature extractor consists of 52 layers.
Advantages of ResNet Model: The main advantage of using a residual architecture is that we are able to learn a very deep speaker representation while maintaining a number of parameters comparable to the baseline x-vector model. Our model has 4.8 million parameters compared to 4.4 million, yet our network is over 50 layers deep, while the x-vector network has 7 layers.
Another advantage of the proposed ResNet model is the way incoming audio is processed, which is done at the segment or utterance level. Context information is determined by the size of filter receptive fields, and by operations like pooling and striding. In contrast, the baseline x-vector system processes audio at both the frame and segment level, and context is provided through data splicing. As a result, the ResNet model is able to extract speaker embeddings much faster than the x-vector system.
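As a rough illustration of how kernel sizes and strides determine temporal context, the receptive field of a stack of 1-D layers can be computed with the standard recurrence below. The layer configuration in the comment is hypothetical, not the paper's exact architecture:

```python
def receptive_field(layers):
    """Receptive field (in input frames) of a stack of 1-D conv/pool layers.

    layers: list of (kernel_size, stride) tuples, in input-to-output order.
    Standard recurrence: r_out = r_in + (k - 1) * jump, with jump *= stride
    after each layer, where jump is the spacing of outputs in input frames.
    """
    r, jump = 1, 1
    for k, stride in layers:
        r += (k - 1) * jump
        jump *= stride
    return r

# Example (hypothetical): three kernel-3 conv layers, the middle one strided.
# receptive_field([(3, 1), (3, 2), (3, 1)]) gives the context in frames that
# one output timestep sees before attentive pooling aggregates over all of them.
```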
2.2 SelfAttentive Speaker Statistics
Self-attention models are an active area of research in the speaker verification community [11, 13, 14]. Intuitively, such models allow the network to focus on fragments of speech that are more speaker discriminative. The attention layer computes a scalar score $e_t$ corresponding to each timestep $t$:

$$ e_t = v^T f(W h_t + b) + k \qquad (1) $$

These scores are then normalized, $\alpha_t = \frac{\exp(e_t)}{\sum_{\tau=1}^{T} \exp(e_\tau)}$, to give them a probabilistic interpretation. We use the attention model proposed in [13], which extends attention to the mean as well as the standard deviation:

$$ \tilde{\mu} = \sum_{t=1}^{T} \alpha_t h_t \qquad (2) $$

$$ \tilde{\sigma} = \sqrt{ \sum_{t=1}^{T} \alpha_t \, h_t \odot h_t - \tilde{\mu} \odot \tilde{\mu} } \qquad (3) $$
In this work we apply a self-attention model on convolutional feature maps, as indicated in Fig. 1. The last residual block outputs a tensor of size $B \times N \times T$, where $B$ is the batch size, $N$ is the number of filters and $T$ is time. The input to the attention layer, $h_t$, is an $N$-dimensional vector.
By using a self-attention model, we also equip our network with a more robust framework for processing inputs of arbitrary size than simple global averaging. This allows us to simply forward propagate a recording through the network in order to extract speaker embeddings.
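The pooling step above can be sketched in a few lines of numpy. This is a minimal illustration of attentive statistics pooling in the spirit of eqs. (1)-(3); the hidden size of the attention projection and the tanh nonlinearity are assumptions, not values taken from the paper:

```python
import numpy as np

def attentive_stats_pool(H, W, b, v, k):
    """Attentive statistics pooling over an (N, T) feature map H.

    e_t = v^T tanh(W h_t + b) + k   -> one scalar score per frame
    alpha_t = softmax over time     -> attention weights
    Returns the concatenation of the weighted mean and weighted std (2N-dim).
    """
    proj = np.tanh(W @ H + b[:, None])            # (A, T) hidden projection
    e = v @ proj + k                              # (T,) frame scores
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                          # weights sum to 1 over time
    mu = H @ alpha                                # weighted mean, (N,)
    var = (H ** 2) @ alpha - mu ** 2              # weighted variance
    sigma = np.sqrt(np.clip(var, 1e-12, None))    # weighted std, clipped for safety
    return np.concatenate([mu, sigma])
```

Because the weights are normalized over however many frames arrive, the same layer handles recordings of any length, which is what makes whole-recording forward passes possible.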
2.3 Classifier
The classifier block is arguably the key component of the model, as it is responsible for learning speaker discriminative features. Recently, angular margin loss functions have been proposed as an alternative to contrastive loss functions for verification tasks [15]. The Additive Margin Softmax (AM-Softmax) loss function is one such algorithm with an intuitive interpretation. The loss computes similarity between classes using the cosine, and forces the similarity of the correct class to be greater than that of incorrect classes by a margin $m$:

$$ L_{AMS} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{ e^{\,s\,(\tilde{W}_{y_i}^T \tilde{f}_i - m)} }{ e^{\,s\,(\tilde{W}_{y_i}^T \tilde{f}_i - m)} + \sum_{j \neq y_i} e^{\,s\,\tilde{W}_j^T \tilde{f}_i} } \qquad (4) $$

where $\tilde{W}_j$ and $\tilde{f}_i$ are the normalized weight vector and speaker embedding respectively. The AM-Softmax loss also adds a scale parameter $s$, which helps the model converge faster. We keep $m$ and $s$ fixed for all our experiments.
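Eq. (4) can be implemented compactly in numpy. The values of $m$ and $s$ below are common illustrative defaults, not necessarily the settings used in the paper:

```python
import numpy as np

def am_softmax_loss(embeddings, labels, weights, m=0.2, s=30.0):
    """Additive Margin Softmax loss, eq. (4).

    embeddings: (n, d) raw embeddings; weights: (num_classes, d) class weights;
    labels: (n,) integer speaker labels. m and s are illustrative defaults.
    """
    # L2-normalize both embeddings and class weight vectors
    f = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = f @ W.T                                   # (n, num_classes) cosines
    idx = np.arange(len(labels))
    logits = s * cos
    logits[idx, labels] = s * (cos[idx, labels] - m)  # margin on target class only
    # numerically stable cross-entropy over the scaled logits
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, labels].mean()
```

Note that the margin only penalizes the target class, so increasing $m$ raises the loss of an otherwise correctly classified example, forcing the cosine to the correct class weight to exceed the others by at least $m$.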
2.4 Domain Adversarial Training
So far we have covered the feature extractor and classifier parts of our proposed model. In order to encourage our model to learn a symmetric feature space, we augment our network with a domain discriminator $D$. The discriminator takes features from both the source and target data and outputs the posterior probability that an input feature belongs to the target domain. The full training objective is

$$ E(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda L_d(\theta_f, \theta_d) $$

where $L_y$ is the AM-Softmax loss described in Section 2.3 and $L_d$ is the binary cross-entropy loss of the domain discriminator. The objective of domain adversarial training is to learn parameters that deliver a saddle point of this functional:
$$ (\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d) \qquad (5) $$

$$ \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d) \qquad (6) $$
At the saddle point, the parameters $\theta_d$ of the domain classifier minimize the domain classification loss, while the parameters $\theta_y$ of the speaker classifier minimize the label prediction loss. The feature mapping parameters $\theta_f$ minimize the label prediction loss (so the features are discriminative), while maximizing the domain classification loss (so the features are domain invariant). The parameter $\lambda$ controls the trade-off between the two objectives [8]. A saddle point of (5)-(6) can be found using backpropagation:
$$ \theta_f \leftarrow \theta_f - \mu_f \left( \frac{\partial L_y}{\partial \theta_f} - \lambda \frac{\partial L_d}{\partial \theta_f} \right) \qquad (7) $$

$$ \theta_y \leftarrow \theta_y - \mu_y \frac{\partial L_y}{\partial \theta_y} \qquad (8) $$

$$ \theta_d \leftarrow \theta_d - \mu_d \frac{\partial L_d}{\partial \theta_d} \qquad (9) $$

where $\mu_f$, $\mu_y$ and $\mu_d$ are learning rates.
The negative coefficient $-\lambda$ in eq. (7) induces a reverse gradient that maximizes $L_d$ and makes the features from the source domain similar to those from the target domain. The implementation of the gradient reversal layer is conceptually simple: it acts as the identity transformation during forward propagation, and multiplies the gradient by $-\lambda$ during backpropagation.
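The two behaviours of the layer can be sketched directly. This is a conceptual forward/backward pair, not framework code; in an autograd framework one would typically wrap the same logic in a custom function (e.g. a `torch.autograd.Function` in PyTorch):

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer: identity on the forward pass, multiplies the
    incoming gradient by -lambda on the backward pass."""

    def __init__(self, lam=0.1):
        self.lam = lam  # illustrative value; the paper's lambda is not restated here

    def forward(self, x):
        # identity: features pass through to the domain discriminator unchanged
        return x

    def backward(self, grad_output):
        # reversed, scaled gradient flows back into the feature extractor,
        # pushing it to *maximize* the domain classification loss L_d
        return -self.lam * grad_output
```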
3 Experimental Setup
Training Data: All our systems, the proposed DANSE model as well as the x-vector and i-vector baselines, are trained using data from previous NIST-SRE evaluations (2004-2010) and Switchboard Cellular audio. We also augment our data with noise and reverberation, as in [4]. For speech features we extracted 23-dimensional MFCCs from the training set and applied mean-variance normalization. The baseline i-vector and x-vector systems were trained using the recipes provided with Kaldi. For DANSE model training we filter out speakers with fewer than 5 recordings.
Model: The feature extractor consists of an input convolutional layer followed by 4 residual blocks [3, 4, 6, 3], comprising 48 layers. This is followed by an attentive statistics layer and 2 fully connected layers. The classifier consists of one hidden layer and the AM-Softmax output layer. The domain discriminator consists of 2 hidden layers of 256 units each and a binary cross-entropy (BCE) output layer. We use Exponential Linear Unit (ELU) activations and batch normalization on all layers of the network.
Optimization: We start by pre-training the feature extractor using standard cross-entropy training. Cross-entropy pre-training is carried out using the RMSprop optimizer, with the learning rate annealed twice over the course of training. We use a simple sampling strategy wherein we define one training epoch as sampling (randomly) each recording in the training set 10 times.
For training the full DANSE model we found it beneficial to optimize the feature extractor, classifier and domain discriminator with different optimizers. The classifier is trained using RMSprop, while the domain discriminator and feature extractor are trained using SGD. We used performance on a held-out validation set to determine when to stop training. The gradient reversal scaling coefficient $\lambda$ is kept fixed for all experiments.
Data Sampling: We use an extremely simple approach for sampling data during training. We sample random chunks of audio (3-8 seconds) from each recording in the training set. We sample each recording 10 times to define an epoch. For each minibatch of source data, we randomly sample (with repetition) a minibatch from the unlabelled adaptation data for adversarial training.
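The sampling strategy above can be sketched as follows. The 100 frames-per-second rate and the helper names are assumptions for illustration:

```python
import random

def sample_chunk(num_frames, frame_rate=100, min_sec=3, max_sec=8):
    """Sample a random contiguous 3-8 second chunk from a recording.

    num_frames: recording length in frames (frame_rate frames per second).
    Returns (start, end) frame indices; assumes the recording is long enough.
    """
    chunk_len = random.randint(min_sec * frame_rate, max_sec * frame_rate)
    chunk_len = min(chunk_len, num_frames)
    start = random.randint(0, num_frames - chunk_len)
    return start, start + chunk_len

def epoch_indices(num_recordings, samples_per_recording=10):
    """One epoch = every recording sampled 10 times, in random order."""
    idx = list(range(num_recordings)) * samples_per_recording
    random.shuffle(idx)
    return idx
```

During adversarial training, each source minibatch built this way would be paired with a minibatch drawn (with repetition) from the unlabelled adaptation pool.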
Speaker Verification: At test time we discard the domain discriminator branch of the model, as it is not needed for extracting embeddings. Extraction is done by performing a forward pass on the full recording, and using the 64-dimensional layer as our speaker embeddings. Verification trials are scored using cosine distance. Verification performance is reported in terms of Equal Error Rate (EER).
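For completeness, the EER metric used throughout the results can be computed from a list of trial scores with a simple threshold sweep. This is a straightforward illustrative implementation, not the official NIST scoring tool:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal Error Rate: the operating point where the false-accept rate (FAR)
    and false-reject rate (FRR) are equal.

    scores: verification scores (e.g. cosine); labels: 1 = target (same-speaker)
    trial, 0 = non-target trial. Sweeps every score as a candidate threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    tgt = scores[labels == 1]
    non = scores[labels == 0]
    best_gap, eer = float("inf"), 1.0
    for t in np.sort(scores):
        far = np.mean(non >= t)   # non-targets wrongly accepted
        frr = np.mean(tgt < t)    # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```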
4 Results
NIST-SRE 2016: The 2016 edition of the NIST evaluation presented a new set of challenges compared to previous years. The evaluation data consists of Cantonese and Tagalog speakers. The change in language introduces a shift between the data distributions of the training (source) and evaluation data.
Adaptation Data: NIST also provides 2272 recordings of unlabelled, in-domain, target data for adapting verification systems.
Model  Classifier  Cantonese  Tagalog  Pooled
i-vector  PLDA  9.51  17.61  13.65
x-vector  COSINE  36.44  41.07  38.69
x-vector  LDA/PLDA  7.52  15.96  11.73
x-vector  PLDA  7.99  18.46  13.32
AMS  COSINE  11.44  21.22  16.28
DANSE  COSINE  8.84  17.87  13.29
Table 1 compares the performance of the proposed DANSE model with the baseline i-vector and x-vector systems. DANSE outperforms the i-vector system, showing a 2.6% relative improvement in terms of pooled EER. DANSE performs at the level of x-vectors + PLDA; however, we are unable to match the full x-vector + LDA/PLDA recipe. We also see that DANSE outperforms the unadapted AM-Softmax model by a large margin, indicating the advantage of adversarial training.
Speakers in the Wild (SITW): The SITW database provides a large collection of real-world data with speech from individuals across a wide array of challenging acoustic and environmental conditions. The audio is extracted from open-source video, and while it consists of English speakers (like the training data), there is still a distribution shift due to the difference in the microphones used.
Adaptation Data: We use a small random selection of 3000 recordings from the VoxCeleb dataset [16] as adaptation data. Like SITW, VoxCeleb was also extracted from open-source videos, and hence matches the SITW data more closely than the training data.
Model  i-vector  x-vector  AMS  DANSE
EER  11.47  10.51  9.87  8.32
From Table 2 we see that the DANSE model displays the strongest performance on the SITW dataset, showing a 2.19% absolute improvement over the x-vector baseline. Comparing the performance of our model with and without adversarial adaptation, once again we see a clear advantage for the former, with DANSE outperforming the unadapted AM-Softmax model by 1.5% absolute.
5 Conclusion
In this work we presented a novel framework for learning domain-invariant speaker embeddings. By combining a powerful deep feature extractor, an end-to-end loss function and, most importantly, domain adversarial training, we are able to learn extremely compact speaker embeddings that deliver robust verification performance on challenging evaluation datasets. In future work we will explore other forms of domain adversarial training based on Generative Adversarial Networks [17]. We will also explore metrics beyond simple visualization to gain further insight into the feature transformations induced by adversarial training.
References
 [1] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Frontend factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [2] Patrick Kenny, “Bayesian speaker verification with heavytailed priors,” Proc. Speaker Odyssey 2010, pp. 3588–3592, 2010.
 [3] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
 [4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018.
 [5] Sergey Novoselov, Andrey Shulipa, Ivan Kremnev, Alexandr Kozlov, and Vadim Shchemelinin, “On deep speaker embeddings for text-independent speaker recognition,” Proc. Speaker Odyssey 2018, 2018.
 [6] Seyed Omid Sadjadi, Timothée Kheyrkhah, Audrey Tong, Craig Greenberg, Elliot Singer, Douglas Reynolds, Lisa Mason, and Jaime Hernandez-Cordero, “The 2016 NIST speaker recognition evaluation,” in Proc. Interspeech, 2017, pp. 1353–1357.
 [7] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, “The speakers in the wild (sitw) speaker recognition database.,” in Interspeech, 2016, pp. 818–822.
 [8] Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
 [9] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
 [10] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized endtoend loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
 [11] Gautam Bhattacharya, Md Jahangir Alam, Vishwa Gupta, and Patrick Kenny, “Deeply fused speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3588–3592, 2018.
 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [13] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, “Attentive statistics pooling for deep speaker embedding,” arXiv preprint arXiv:1803.10963, 2018.
 [14] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3573–3577, 2018.
 [15] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
 [16] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” Proc. Interspeech 2017, pp. 358–362, 2017.
 [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.