Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training
In this article we propose a novel approach for adapting speaker embeddings to new domains based on adversarial training of neural networks. We apply our embeddings to the task of text-independent speaker verification, a challenging, real-world problem in biometric security. We further the development of end-to-end speaker embedding models by combining a novel 1-dimensional, self-attentive residual network, an angular margin loss function, and an adversarial training strategy. Our model is able to learn extremely compact, 64-dimensional speaker embeddings that deliver competitive performance on a number of popular datasets using simple cosine distance scoring. On the NIST-SRE 2016 task we are able to beat a strong i-vector baseline, while on the Speakers in the Wild task our model outperforms both i-vector and x-vector baselines, showing an absolute improvement of 2.19% over the latter. Additionally, we show that the integration of adversarial training consistently leads to a significant improvement over an unadapted model.
Gautam Bhattacharya, Jahangir Alam, Patrick Kenny
Computer Research Institute of Montreal
Index Terms— Speaker Verification, Adversarial Training, Domain Adaptation, End-to-End
1 Introduction
Text-independent speaker verification systems are binary classifiers that, given two recordings, answer the question:
Are the people speaking in the two recordings the same person?
The answer is typically delivered in the form of a scalar value or verification score. Verification scores can be formulated as a likelihood ratio, as in the popular i-vector/PLDA approach [1, 2]. An alternative approach is to use simple distance metrics like mean-squared error or cosine distance. Verification models that can be scored like this typically need to optimize the distance metric itself, i.e. they are optimized end-to-end. While end-to-end face verification models based on contrastive losses have shown state-of-the-art performance [3], their adoption in the speaker verification community has not been widespread due to the difficulties associated with training such models.
State-of-the-art speaker verification systems follow the same recipe as i-vector systems by using an LDA/PLDA classifier, but replace the i-vector extractor with a Deep Neural Network (DNN) feature extractor [4]. The DNN embedding model is trained by minimizing the cross-entropy loss over speakers in the training data. While cross-entropy minimization is simpler than optimizing contrastive losses, the nature of the verification problem makes learning a good DNN embedding model challenging. This is evidenced by the Kaldi x-vector recipe, which we use as one of the baseline systems in this work. The recipe involves extensive data preparation, followed by a multi-GPU training strategy that involves a sophisticated model averaging technique combined with a natural gradient variant of SGD. Replicating the performance of x-vectors with conventional first-order optimizers is non-trivial.
In this article we present Domain Adversarial Neural Speaker Embeddings (DANSE) for text-independent speaker verification. We make the following contributions:
- We propose a novel architecture for extracting neural speaker embeddings based on a 1-dimensional residual network and a self-attention model. The model can be trained using a simple data sampling strategy and traditional first-order optimizers.
- We show that the DANSE model can be optimized end-to-end to learn extremely compact (64-dimensional) embeddings that deliver competitive speaker verification performance using simple cosine scoring.
- Finally, we propose to integrate adversarial training into the learning of a speaker embedding model, in order to learn domain-invariant features. To the best of our knowledge, ours is the first work to propose the use of adversarial training in a verification setting.
Modern speaker verification datasets like NIST-SRE 2016 and Speakers in the Wild (SITW) are challenging because in-domain or target data is not available for training verification systems [6, 7]. This leads to a domain shift between training and test datasets, which in turn degrades performance. Our key insight in this work is that verification performance can be improved significantly by encouraging the speaker embedding model to learn domain-invariant features. We achieve this through Domain Adversarial Training (DAT) using the framework of Gradient Reversal [8]. This allows us to learn domain-invariant speaker embeddings using a small amount of unlabelled, target-domain data. DAT uses a simple reverse gradient method to learn a symmetric feature space, common to source and target data. This idea has been primarily used to adapt classifiers, but in this work we show that the features learned by DAT are also more speaker discriminative in the target domain. Apart from domain robustness, we find that using an appropriate verification loss function in combination with DAT is equally important for the model to show robust performance using a simple cosine scoring strategy.
2 Learning Domain Invariant Speaker Embeddings
2.1 Feature Extractor
The first step for learning discriminative speaker embeddings is to learn a mapping $F(X_s; \theta_F) \rightarrow f$ from a sequence of speech frames $X_s$ from speaker $s$ to a D-dimensional feature vector $f$. $F$ can be implemented using a variety of neural network architectures [4, 9, 10, 11]. In this work we use a deep residual network (ResNet) as our feature extractor [12]. Motivated by the fact that speech is translation invariant along the time axis only, we propose to build our model using 1-dimensional convolutional layers. The ResNet architecture allows us to train much deeper networks and leverage the greater representational capacity afforded by these models. The first convolutional layer uses a filter whose size depends on the dimension of the frequency axis. The residual blocks are followed by an attentive statistics pooling layer (described in the next section) and two fully connected layers. In total the feature extractor consists of 52 layers.
Advantages of ResNet Model: The main advantage of using a residual architecture is that we are able to learn a very deep speaker representation while keeping the number of parameters comparable to the baseline x-vector model. Our model has 4.8 million parameters compared to 4.4 million for the x-vector model, yet our network is over 50 layers deep whereas the x-vector network has 7 layers.
Another advantage of the proposed ResNet model is the way incoming audio is processed, which is done at the segment or utterance level. Context information is determined by the size of filter receptive fields, and by operations like pooling and striding. In contrast, the baseline x-vector system processes audio at both the frame and segment level, and context is provided through data splicing. As a result, the ResNet model is able to extract speaker embeddings much faster than the x-vector system.
2.2 Self-Attentive Speaker Statistics
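Self-attention models are an active area of research in the speaker verification community [11, 13, 14]. Intuitively, such models allow the network to focus on fragments of speech that are more speaker discriminative. The attention layer computes a scalar weight $e_t$ corresponding to each time-step $t$:

$$e_t = v^{\top} g(W h_t + b) + k \tag{1}$$

where $h_t$ is the frame-level feature at time $t$, $g(\cdot)$ is a non-linear activation, and $v$, $W$, $b$ and $k$ are learned parameters. These weights are then normalized, $\alpha_t = \exp(e_t) / \sum_{\tau=1}^{T} \exp(e_\tau)$, to give them a probabilistic interpretation. We use the attention model proposed in [13], which extends attention to the mean as well as the standard deviation:

$$\tilde{\mu} = \sum_{t=1}^{T} \alpha_t h_t, \qquad \tilde{\sigma} = \sqrt{\sum_{t=1}^{T} \alpha_t \, h_t \odot h_t - \tilde{\mu} \odot \tilde{\mu}} \tag{2}$$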
In this work we apply a self-attention model on convolutional feature maps, as indicated in Fig. 1. The last residual block outputs a tensor of size $n_B \times n_F \times T$, where $n_B$ is the batch size, $n_F$ is the number of filters and $T$ is time. The input to the attention layer, $h_t$, is an $n_F$-dimensional vector.
By using a self-attention model, we also equip our network with a more robust framework for processing inputs of arbitrary size than simple global averaging. This allows us to simply forward propagate a recording through the network in order to extract speaker embeddings.
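The pooling step described in this section can be illustrated with a short, dependency-free sketch. It assumes the unnormalized attention scores have already been computed by the attention layer; the function name `attentive_stats_pooling` and the toy inputs are illustrative and not taken from the original implementation.

```python
import math

def attentive_stats_pooling(frames, scores):
    """Pool T frame vectors into one [mean, std] vector using attention.

    frames: list of T feature vectors (one per time step), each of length N.
    scores: list of T unnormalized attention scores (assumed precomputed).
    """
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]

    dim = len(frames[0])
    # Attention-weighted first and second moments, one value per feature dimension.
    mean = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    second = [sum(a * h[d] ** 2 for a, h in zip(alphas, frames)) for d in range(dim)]
    # Clamp at zero: rounding can push the variance slightly negative.
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    # Concatenate mean and standard deviation into one utterance-level vector.
    return mean + std, alphas

# Toy example: 3 time steps, 2 feature dimensions, equal scores (uniform attention).
pooled, alphas = attentive_stats_pooling(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.0, 0.0, 0.0]
)
```

Because the weights sum to one regardless of the number of frames, the same function handles recordings of any length, which is the property the text relies on when extracting embeddings from full recordings.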
2.3 Classifier
The classifier block, $C$, is arguably the key component of the model, as it is responsible for learning speaker discriminative features. Recently, angular margin loss functions have been proposed as an alternative to contrastive loss functions for verification tasks [15]. The Additive Margin Softmax (AM-Softmax) loss function is one such algorithm with an intuitive interpretation. The loss computes similarity between classes using cosine, and forces the similarity of the correct class to be greater than that of incorrect classes by a margin $m$:

$$L_{AMS} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{s (W_{y_i}^{\top} f_i - m)}}{e^{s (W_{y_i}^{\top} f_i - m)} + \sum_{j \neq y_i} e^{s W_j^{\top} f_i}} \tag{3}$$

where $W$ and $f$ are the normalized weight vector and speaker embedding respectively. The AM-Softmax loss also adds a scale parameter $s$, which helps the model converge faster. The values of $m$ and $s$ are kept fixed for all our experiments.
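As a rough numerical illustration of the additive margin idea, the sketch below computes the AM-Softmax loss for a single example in plain Python. The function names are hypothetical, and the margin and scale values used in the example are placeholders chosen for readability, not the settings used in the experiments.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def am_softmax_loss(embedding, class_weights, label, margin, scale):
    """AM-Softmax loss for one example.

    embedding: speaker embedding (list of floats).
    class_weights: one weight vector per training speaker.
    label: index of the correct speaker.
    margin, scale: the margin and scale hyper-parameters (placeholders here).
    """
    f = l2_normalize(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [
        sum(w_i * f_i for w_i, f_i in zip(l2_normalize(w), f))
        for w in class_weights
    ]
    # Subtract the margin from the target-class cosine only, then scale all logits.
    logits = [scale * (c - margin if j == label else c) for j, c in enumerate(cosines)]
    # Numerically stable log-sum-exp, then the usual negative log-likelihood.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

weights = [[1.0, 0.0], [0.0, 1.0]]
loss_no_margin = am_softmax_loss([2.0, 0.0], weights, label=0, margin=0.0, scale=1.0)
loss_with_margin = am_softmax_loss([2.0, 0.0], weights, label=0, margin=0.5, scale=1.0)
```

For the same embedding, the loss with a positive margin is larger than the loss without one, so the model has to push the target-class cosine further above the others to reach the same loss value.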
2.4 Domain Adversarial Training
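So far we have covered the feature extractor and classifier parts of our proposed model. In order to encourage our model to learn a symmetric feature space, we augment our network with a domain discriminator $D$. The discriminator takes features from both the source and target data and outputs the posterior probability that an input feature belongs to the target domain. Writing $\theta_F$, $\theta_C$ and $\theta_D$ for the parameters of the feature extractor, classifier and discriminator, the training objective is:

$$E(\theta_F, \theta_C, \theta_D) = L_C(\theta_F, \theta_C) - \lambda L_D(\theta_F, \theta_D) \tag{4}$$

where $L_C$ is the AM-Softmax loss described in the previous section and $L_D$ is the binary cross-entropy loss. The objective of domain adversarial training is to learn parameters that deliver a saddle point of the functional (4):

$$(\hat{\theta}_F, \hat{\theta}_C) = \arg\min_{\theta_F, \theta_C} E(\theta_F, \theta_C, \hat{\theta}_D) \tag{5}$$

$$\hat{\theta}_D = \arg\max_{\theta_D} E(\hat{\theta}_F, \hat{\theta}_C, \theta_D) \tag{6}$$

At the saddle point, the parameters of the domain classifier minimize the domain classification loss, while the parameters of the speaker classifier minimize the label prediction loss. The feature mapping parameters minimize the label prediction loss, so the features are discriminative, while maximizing the domain classification loss, so the features are domain invariant. The parameter $\lambda$ controls the trade-off between the two objectives. A saddle point of (5)-(6) can be found using backpropagation:

$$\theta_F \leftarrow \theta_F - \alpha \left( \frac{\partial L_C}{\partial \theta_F} - \lambda \frac{\partial L_D}{\partial \theta_F} \right) \tag{7}$$

$$\theta_C \leftarrow \theta_C - \beta \frac{\partial L_C}{\partial \theta_C} \tag{8}$$

$$\theta_D \leftarrow \theta_D - \gamma \frac{\partial L_D}{\partial \theta_D} \tag{9}$$

where $\alpha$, $\beta$ and $\gamma$ are learning rates.
The negative coefficient $-\lambda$ in eq. (7) induces a reverse gradient that maximizes $L_D$ and makes the features from the source domain similar to those from the target domain. The implementation of the gradient reversal layer is conceptually simple: it acts as the identity transformation during forward propagation, and multiplies the gradient by $-\lambda$ during backpropagation.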
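The behaviour of the gradient reversal layer can be sketched without any deep learning framework. The class and function below are illustrative only: they operate on plain lists of numbers, and the domain gradient is assumed to be already expressed with respect to the feature-extractor parameters, which by linearity is equivalent to reversing it at the layer itself.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged on their way to the discriminator.
        return list(features)

    def backward(self, grad_from_discriminator):
        # The gradient flowing back to the feature extractor is sign-flipped and scaled.
        return [-self.lam * g for g in grad_from_discriminator]


def feature_extractor_step(theta_f, grad_speaker, grad_domain, grl, lr):
    """One SGD step on the feature-extractor parameters.

    grad_speaker: gradient of the speaker loss w.r.t. the feature-extractor parameters.
    grad_domain:  gradient of the domain loss w.r.t. the same parameters, as it
                  arrives from the discriminator branch (before reversal).
    """
    reversed_grad = grl.backward(grad_domain)
    return [t - lr * (gs + gr) for t, gs, gr in zip(theta_f, grad_speaker, reversed_grad)]


# Toy values, chosen only to make the arithmetic easy to follow.
grl = GradientReversal(lam=0.5)
new_theta = feature_extractor_step(
    theta_f=[1.0, 1.0], grad_speaker=[0.2, 0.0], grad_domain=[2.0, 4.0], grl=grl, lr=0.1
)
```

In this toy step the parameters move down the speaker-loss gradient but up the domain-loss gradient, which is the opposing pull that drives the features towards domain invariance.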
3 Experimental Setup
Training Data: All our systems (the proposed DANSE model as well as the x-vector and i-vector baselines) are trained using data from previous NIST-SRE evaluations (2004-2010) and Switchboard Cellular audio. We also augment our data with noise and reverberation. For speech features, we extract 23-dimensional MFCCs from the training set and apply mean and variance normalization. The baseline i-vector and x-vector systems were trained using the recipes provided with Kaldi. For DANSE model training we filter out speakers with fewer than 5 recordings.
Model: The feature extractor consists of an input convolutional layer followed by 4 residual stages with [3,4,6,3] blocks, which together comprise 48 layers. This is followed by an attentive statistics layer and 2 fully connected layers. The classifier consists of one hidden layer and the AM-Softmax output layer. The domain discriminator consists of 2 hidden layers of 256 units each and the binary cross-entropy (BCE) output layer. We use Exponential Linear Unit (ELU) activations and batch normalization on all layers of the network.
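The layer counts quoted for the feature extractor can be checked with a few lines of arithmetic. The sketch assumes that each residual block contains three convolutional layers (a bottleneck-style block), which is inferred from the stated totals rather than given explicitly, and that the pooling layer is counted as one layer.

```python
def count_layers(blocks_per_stage=(3, 4, 6, 3), convs_per_block=3):
    """Return (residual_layers, total_layers) for the feature extractor.

    convs_per_block=3 is an assumption (bottleneck-style residual blocks).
    """
    residual_layers = sum(blocks_per_stage) * convs_per_block
    # input conv + residual layers + attentive statistics pooling + 2 fully connected
    total_layers = 1 + residual_layers + 1 + 2
    return residual_layers, total_layers

residual_layers, total_layers = count_layers()
print(residual_layers, total_layers)  # prints "48 52"
```

Under these assumptions the 16 residual blocks account for 48 layers, and adding the input convolution, the pooling layer and the two fully connected layers gives the 52 layers mentioned in Section 2.1.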
Optimization: We start by pre-training the feature extractor using standard cross-entropy training. Cross-entropy pre-training is carried out using the RMSprop optimizer, with the learning rate (lr) annealed by a constant factor at two points during training. We use a simple sampling strategy wherein we define one training epoch as sampling (randomly) each recording in the training set 10 times.
For training the full DANSE model we found it beneficial to optimize the feature extractor, classifier and domain discriminator with different optimizers. The classifier is trained using RMSprop, while the domain discriminator and feature extractor are trained using SGD, each with its own learning rate. We used performance on a held-out validation set to determine when to stop training. The gradient reversal scaling coefficient $\lambda$ is held at a single fixed value for all experiments.
Data Sampling: We use an extremely simple approach for sampling data during training. We sample random chunks of audio (3-8 seconds) from each recording in the training set. We sample each recording 10 times to define an epoch. For each mini-batch of source data, we randomly sample (with repetition) a mini-batch from the unlabelled adaptation data for adversarial training.
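A minimal sketch of this sampling scheme is given below. The function names, the dictionary of recording lengths, and the frame rate of 100 frames per second are assumptions made for illustration; only the 3-8 second chunk length, the 10 samples per recording, and sampling the adaptation data with replacement come from the description above.

```python
import random

def sample_chunk(num_frames, frame_rate=100, min_sec=3.0, max_sec=8.0, rng=random):
    """Return (start, end) frame indices of a random 3-8 s chunk of one recording."""
    length = int(rng.uniform(min_sec, max_sec) * frame_rate)
    length = min(length, num_frames)  # recordings shorter than the chunk are used whole
    start = rng.randint(0, num_frames - length)
    return start, start + length

def epoch_samples(recording_lengths, samples_per_recording=10, rng=random):
    """Yield (recording_id, start, end) so that each recording is sampled 10 times."""
    order = [rid for rid in recording_lengths for _ in range(samples_per_recording)]
    rng.shuffle(order)
    for rid in order:
        start, end = sample_chunk(recording_lengths[rid], rng=rng)
        yield rid, start, end

def sample_adaptation_batch(adaptation_ids, batch_size, rng=random):
    """Draw an unlabelled target-domain mini-batch with replacement."""
    return [rng.choice(adaptation_ids) for _ in range(batch_size)]

# Toy usage: two recordings with lengths given in frames.
rng = random.Random(0)
samples = list(epoch_samples({"rec_a": 6000, "rec_b": 200}, rng=rng))
adaptation_batch = sample_adaptation_batch(["tgt_1", "tgt_2"], batch_size=5, rng=rng)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; in a real training loop one adaptation batch would be drawn for every source mini-batch.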
Speaker Verification: At test time we discard the domain discriminator branch of the model, as it is not needed for extracting embeddings. Extraction is done by performing a forward pass on the full recording, and using the output of the 64-dimensional layer as the speaker embedding. Verification trials are scored using cosine distance. Verification performance is reported in terms of Equal Error Rate (EER).
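The scoring and evaluation steps can be sketched as follows. The code computes the cosine similarity between two embeddings (higher means more likely the same speaker) and approximates the EER by sweeping a threshold over the observed scores; the function names and toy scores are illustrative, and standard evaluation tools typically interpolate the error curves rather than using this simple sweep.

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the decision threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss: a same-speaker trial scored below the threshold.
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        # False alarm: a different-speaker trial scored at or above the threshold.
        false_alarm = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer

# Toy trial scores: three same-speaker and three different-speaker trials.
eer = equal_error_rate([0.9, 0.8, 0.3], [0.1, 0.2, 0.7])
```

In this toy example one target trial and one non-target trial fall on the wrong side of the best threshold, so both error rates equal one third.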
4 Results
NIST-SRE 2016: The 2016 edition of the NIST evaluation presented a new set of challenges as compared to previous years. The evaluation data consists of Cantonese and Tagalog speakers. The change in language introduces a shift between the data distributions of the training (source) and evaluation data.
Adaptation Data: NIST also provides 2272 recordings of unlabelled, in-domain, target data for adapting verification systems.
Table 1 compares the performance of the proposed DANSE model with the baseline i-vector and x-vector systems. The DANSE model outperforms the i-vector system, showing a 2.6% relative improvement in terms of the pooled EER. DANSE performs at the level of x-vectors + PLDA; however, we are unable to match the full x-vector + LDA/PLDA recipe. We also see that DANSE outperforms the un-adapted AM-Softmax model by a large margin, indicating the advantage of adversarial training.
Speakers in the Wild (SITW): The SITW database provides a large collection of real-world data with speech from individuals across a wide array of challenging acoustic and environmental conditions. The audio is extracted from open-source video, and while it consists of English speakers (like the training data), there is still a distribution shift due to the difference in the microphones used.
Adaptation Data: We use a small random selection of 3000 recordings from the VoxCeleb dataset [16] as adaptation data. Like SITW, VoxCeleb was also extracted from open-source videos, and hence matches the SITW data more closely than the training data.
From Table 2 we see that the DANSE model displays the strongest performance on the SITW dataset, showing a 2.19% absolute improvement over the x-vector baseline. Comparing the performance of our model with and without adversarial adaptation, we once again see a clear advantage for the former, with DANSE outperforming the un-adapted AM-Softmax model by 1.5%.
5 Conclusion
In this work we presented a novel framework for learning domain-invariant speaker embeddings. By combining a powerful deep feature extractor, an end-to-end loss function and, most importantly, domain adversarial training, we are able to learn extremely compact speaker embeddings that deliver robust verification performance on challenging evaluation datasets. In future work we will explore other forms of domain adversarial training based on Generative Adversarial Networks [17]. We will also explore different metrics beyond simple visualization to gain further insight into the feature transformations being induced through adversarial training.
References
- [1] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
- [2] Patrick Kenny, “Bayesian speaker verification with heavy-tailed priors,” Proc. Speaker Odyssey 2010, pp. 3588–3592, 2010.
- [3] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
- [4] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” Submitted to ICASSP, 2018.
- [5] Sergey Novoselov, Andrey Shulipa, Ivan Kremnev, Alexandr Kozlov, and Vadim Shchemelinin, “On deep speaker embeddings for text-independent speaker recognition,” Proc. Speaker Odyssey 2018, 2018.
- [6] Seyed Omid Sadjadi, Timothée Kheyrkhah, Audrey Tong, Craig Greenberg, Douglas Reynolds, Elliot Singer, Lisa Mason, and Jaime Hernandez-Cordero, “The 2016 nist speaker recognition evaluation,” in Proc. Interspeech, 2017, pp. 1353–1357.
- [7] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, “The speakers in the wild (sitw) speaker recognition database,” in Interspeech, 2016, pp. 818–822.
- [8] Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
- [9] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
- [10] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
- [11] Gautam Bhattacharya, Md Jahangir Alam, Vishwa Gupta, and Patrick Kenny, “Deeply fused speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3588–3592, 2018.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- [13] Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda, “Attentive statistics pooling for deep speaker embedding,” ArXiv e-prints, abs/1803.10963, 2018.
- [14] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Proc. Interspeech 2018, pp. 3573–3577, 2018.
- [15] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
- [16] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” Proc. Interspeech 2017, pp. 358–362, 2017.
- [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.