Improving robustness in speaker identification using a two-stage attention model



In this paper, a novel framework for speaker recognition using a two-stage attention model is proposed. In recent years, the use of deep neural networks, such as the time delay neural network (TDNN), and of attention models has boosted speaker recognition performance. However, speaker recognition in severe acoustic environments remains a challenging task. To build a speaker recognition system that is robust against noise, we employ a two-stage attention model and combine it with a TDNN model. In this framework, the attention mechanism is applied in two spaces: the embedding space and the temporal space. The embedding attention model, built in the embedding space, highlights the importance of each embedding element by weighting it using self-attention. The frame attention model, built in the temporal space, aims to find which frames are significant for speaker recognition. To evaluate the effectiveness and robustness of our approach, we use the TIMIT dataset and test our approach under five kinds of noise and different signal-to-noise ratios (SNRs). In comparison with three strong baselines, CNN, TDNN and TDNN+Attention, the experimental results show that our approach outperforms them in all tested conditions. Even when the noise is Gaussian white noise and the SNR is 0 dB, the correct recognition rate obtained using our approach still reaches 49.1%, better than any of the baselines.


Yanpei Shi, Qiang Huang, Thomas Hain (the first and second authors contributed equally)
Speech and Hearing Research Group
Department of Computer Science, University of Sheffield
{YShi30, qiang.huang, t.hain}


Robust Speaker Recognition, Deep Neural Networks, Attention Mechanism, Time-Delay Neural Network, Two-Stage Attention.

1 Introduction

Speaker recognition aims to identify a person from the characteristics of their voice [21]. For this purpose, the traditional i-vector method based on the GMM-UBM, described in [4], was widely used to extract acoustic features for speaker recognition. However, in practical applications of speaker recognition, input audio signals are often corrupted by different types of background noise [15]. Such noise easily interferes with the extraction of key acoustic features of speakers and thus makes speaker recognition in noisy conditions a challenging task.

In recent years, owing to the rapid development of deep learning technologies, recognizing speaker identities from audio signals using deep neural networks has become an active research area, and different speaker modelling approaches [21, 23, 24] have been proposed. Variani et al. developed the d-vector using multiple fully connected neural network layers [24]. In [23], Snyder et al. developed a framework called the X-vector, which is one of the most popular methods for speaker recognition. It yields good performance using a feed-forward TDNN that computes speaker embeddings from variable-length acoustic segments.

To further tackle the interference caused by background noise, the attention mechanism [8] is used, as it allows the model to allocate weights to different parts of the data and helps to search for salient features. For speaker recognition, several previous studies [32, 18, 27, 22] used attention models. Wang et al. used an attentive X-vector in which a self-attention layer was added before a statistics pooling layer to weight each frame [27, 18, 32]. Rahman et al. jointly used an attention model and K-max pooling to select the most relevant features [22].

In addition to speaker recognition, attention models have also been widely used in natural language processing [1, 13, 25, 30, 9], speech recognition [17, 16, 31, 2], and computer vision [29, 12, 28, 26, 19, 14]. In [1], Bahdanau et al. designed an attention model that allows each time step of the decoder to attend to a different part of the input sentence. Xu et al. used an attention model in a similar way to design an encoder-decoder network for image captioning [29]. In [17], Moritz et al. combined CTC (connectionist temporal classification) and an attention model to improve the performance of end-to-end speech recognition. In [16, 31] and [2], different attention models were designed for speech emotion recognition and phoneme recognition, respectively. To further improve the robustness of the attention model, some previous studies used two attention models within one framework. Luong et al. used global attention and local attention, where global attention attends to the whole input sentence and local attention looks only at a part of it [13]. Li et al. applied global and local attention in image processing to further improve performance [12]. Woo et al. used spatial attention and channel attention to extract salient features from input data [28].

Inspired by these previous studies, we build a two-stage framework consisting of two attention models to improve robustness against background noise. The aim of this work is to use attention not only in the temporal space, by computing a weight for each data frame, but also in the embedding space, by computing a weight for each element of the embedding vector. The two attention models are cascaded: one operates on the data frames and the other on the embeddings. The former is referred to as the frame attention model and the latter as the embedding attention model. The details of our approach and its implementation are presented in the following sections.

The rest of the paper is organized as follows: Section 2 presents the theoretical framework of our approach. Section 3 describes the data we use, the experimental setup, and the baselines used for comparison. We show the obtained results in Section 4, and finally draw a conclusion in Section 5.

2 Model Architecture

Figure 1 shows the architecture of our approach from input to output. It consists of a time delay neural network (TDNN), a two-stage attention model, a statistics pooling layer, and two fully connected layers. The input data is $X = (x_1, \dots, x_T)$, where $T$ represents the sequence length, $D$ represents the dimension of each feature vector, and $x_t$ denotes the $t$-th acoustic feature vector converted from the speech signal. The TDNN works as a frame-level feature extractor and $E = (e_1, \dots, e_{T'})$ denotes its output, where $T'$ is its length (the same as $T$), $d$ is the embedding dimension, and $e_t$ denotes the $t$-th embedding vector [15]. The two-stage attention model includes a frame attention model and an embedding attention model. Its output has the same dimension as $E$.

Figure 1: Architecture of proposed approach, consisting of TDNN and a two-stage attention model

2.1 Time Delay Neural Network (TDNN)

In the TDNN architecture, the initial transforms are learnt on narrow contexts and the deeper layers process the hidden activations from a wider temporal context. Hence the higher layers are able to learn wider temporal relationships. Each layer in a TDNN operates at a different temporal resolution, which increases as we go to higher layers of the network [20]. Furthermore, a TDNN is more efficient than a recurrent neural network (RNN) [6] because of its use of sub-sampling, as described in [20].

In our work, a five-layer TDNN is used as a frame-level feature extractor. The first layer takes a five-frame context (from $t-2$ to $t+2$, where $t \in \{2, 3, \dots, T-2\}$) as input. The second layer takes context frames at $t-2$, $t$ and $t+2$, while the next layer takes context frames at $t-3$, $t$ and $t+3$. The last two layers operate on the current frame $t$ only, but because the previous layers have already taken context frames into account, the last two TDNN layers cover a total context of 15 frames [23].
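The 15-frame figure can be checked by adding up the per-layer offsets. The sketch below assumes the usual x-vector splicing of [23] (offsets -2..+2, then {-2, 0, +2}, then {-3, 0, +3}, then the current frame twice); the names `LAYER_CONTEXTS` and `receptive_field` are illustrative and do not come from the paper.

```python
# Frame offsets spliced by each TDNN layer, relative to the current frame t.
# Assumed layout: the standard x-vector contexts.
LAYER_CONTEXTS = [
    [-2, -1, 0, 1, 2],  # layer 1: five consecutive frames
    [-2, 0, 2],         # layer 2
    [-3, 0, 3],         # layer 3
    [0],                # layer 4: current frame only
    [0],                # layer 5: current frame only
]


def receptive_field(layer_contexts):
    """Return (leftmost offset, rightmost offset, total frames) seen by the top layer."""
    left = sum(min(offsets) for offsets in layer_contexts)
    right = sum(max(offsets) for offsets in layer_contexts)
    return left, right, right - left + 1


left, right, total = receptive_field(LAYER_CONTEXTS)
print(left, right, total)  # -7 7 15
```

Each output frame of the top layer therefore depends on 7 frames on either side of the current one.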

2.2 Two-stage Attention Model

Figure 2: Architecture of the two-stage attention model: embedding attention model (left) and frame attention model (right)

As shown in Figure 2, the two-stage attention model is a cascaded structure in which the embedding attention model is followed by the frame attention model.

The embedding attention model uses a self-attention structure to allocate a weight to each element of the embedding vector. In equation 1, the output of the embedding attention model, $E'$, is the product of the weight $A_E$ and the TDNN output $E$:

$$E' = A_E \odot E \qquad (1)$$

where $A_E$ is defined as:

$$A_E = \mathrm{sigmoid}(F_{\max} + F_{\mathrm{stat}}) \qquad (2)$$

The embedding attention model employs two different pooling mechanisms, max-pooling and statistics-pooling. The output of max-pooling is used to compute $F_{\max}$ after a linear mapping and an activation function. The statistics-pooling outputs are the mean and the standard deviation, which are computed as described in [27]. Their sum is then used to compute $F_{\mathrm{stat}}$ in the same way as $F_{\max}$.

The linear mapping uses 100 hidden units. The weight $A_E$ is finally obtained by applying a sigmoid function to the sum of $F_{\max}$ and $F_{\mathrm{stat}}$.
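As a rough illustration of the embedding attention stage, the following sketch pools over time, maps each pooled vector through a small network, and applies a sigmoid to the sum. Several details are assumptions rather than the authors' specification: the mapping is taken to be a shared linear, ReLU, linear stack (embedding size to 100 hidden units and back, in the spirit of the channel attention of [28]), the weights are random, and the toy sizes are arbitrary.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pooled_stats(frames):
    """Max, mean and standard deviation over time for each embedding element."""
    num_frames = len(frames)
    cols = list(zip(*frames))  # one tuple per embedding element
    mx = [max(c) for c in cols]
    mean = [sum(c) / num_frames for c in cols]
    std = [math.sqrt(sum((v - m) ** 2 for v in c) / num_frames)
           for c, m in zip(cols, mean)]
    return mx, mean, std


def mlp(x, w1, b1, w2):
    """Assumed mapping: linear -> ReLU -> linear (d -> hidden -> d)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def embedding_attention(frames, w1, b1, w2):
    """Return (weights, reweighted frames); one weight in (0, 1) per element."""
    mx, mean, std = pooled_stats(frames)
    from_max = mlp(mx, w1, b1, w2)
    from_stat = mlp([m + s for m, s in zip(mean, std)], w1, b1, w2)
    weights = [sigmoid(a + b) for a, b in zip(from_max, from_stat)]
    reweighted = [[w * v for w, v in zip(weights, frame)] for frame in frames]
    return weights, reweighted


# Toy example: T = 4 frames, d = 6 embedding elements, 100 hidden units.
rng = random.Random(0)
T, d, hidden = 4, 6, 100
frames = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
w1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
b1 = [0.0] * hidden
w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]

weights, reweighted = embedding_attention(frames, w1, b1, w2)
print([round(w, 3) for w in weights])
```

The same weight vector is applied to every frame, so this stage rescales embedding elements without changing the number of frames.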

The frame attention model also uses a self-attention structure. Its input is $E'$, the output of the embedding attention model, and its output is $E'$ reweighted frame by frame (equations 3 and 4) by a normalised score vector that holds one scalar weight per frame. These weights are computed as in [27, 32] (equation 5), using the three learned parameters of the frame attention model.
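A minimal sketch of frame-level attention in the style of [27, 32] is given below. The scoring function (a tanh hidden layer followed by a projection vector), the softmax normalisation, and the parameter names `W`, `b` and `v` are assumptions chosen for illustration; the paper's exact parameterisation may differ.

```python
import math
import random


def frame_attention(frames, W, b, v):
    """Return (alphas, weighted frames) with one normalised weight per frame.

    Assumed scoring: score_t = v . tanh(W e_t + b), followed by a softmax over t.
    """
    scores = []
    for e in frames:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, e)) + bi)
                  for row, bi in zip(W, b)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    weighted = [[a * x for x in e] for a, e in zip(alphas, frames)]
    return alphas, weighted


# Toy example: 5 frames of dimension 4, 8 hidden units, random parameters.
rng = random.Random(1)
num_frames, dim, hidden_units = 5, 4, 8
toy_frames = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_frames)]
W = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden_units)]
b = [0.0] * hidden_units
v = [rng.gauss(0, 0.5) for _ in range(hidden_units)]

alphas, weighted = frame_attention(toy_frames, W, b, v)
print([round(a, 3) for a in alphas], round(sum(alphas), 6))
```

Because the weights sum to one across frames, frames with low scores contribute proportionally less to the statistics pooling that follows.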

3 Experiments

3.1 Data

In this paper, the TIMIT corpus [5] is used to evaluate the proposed approach. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. It includes a 16-bit, 16 kHz speech waveform file for each utterance. There are 6300 utterances in total: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. 70% of the speakers are male and 30% are female. Because two of the utterances from each speaker have the same word transcriptions, they are excluded from our work to reduce possible bias, leaving 8 utterances per speaker. In this paper, the training and test sets are re-split: six utterances from each speaker are randomly selected for training and the remaining two are used for testing. Hence there are 3780 utterances in the training set and 1260 utterances in the test set.
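The set sizes follow directly from the per-speaker split (630 × 6 = 3780 and 630 × 2 = 1260). The sketch below reproduces that arithmetic with a random per-speaker split; the utterance identifiers and the fixed seed are hypothetical and serve only to make the example reproducible.

```python
import random

NUM_SPEAKERS = 630
UTTS_PER_SPEAKER = 8    # after dropping the two shared-transcription utterances
TRAIN_PER_SPEAKER = 6


def split_speaker(utterance_ids, rng):
    """Randomly pick 6 utterances for training; the remaining 2 go to the test set."""
    shuffled = list(utterance_ids)
    rng.shuffle(shuffled)
    return shuffled[:TRAIN_PER_SPEAKER], shuffled[TRAIN_PER_SPEAKER:]


rng = random.Random(0)  # fixed seed, an assumption made only for reproducibility
train, test = [], []
for spk in range(NUM_SPEAKERS):
    # Hypothetical identifiers of the form (speaker index, utterance index).
    utts = [(spk, u) for u in range(UTTS_PER_SPEAKER)]
    tr, te = split_speaker(utts, rng)
    train.extend(tr)
    test.extend(te)

print(len(train), len(test))  # 3780 1260
```

Every speaker appears in both sets, which is what a closed-set identification task requires.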

To evaluate robustness, noise signals are added to the TIMIT data. The noise signals are taken from the QUT-NOISE dataset [3], which contains five common noise scenarios recorded at 10 unique locations. For each noise scenario there are two locations: indoor and outdoor for the cafe scenario; window down and window up while driving for the car scenario; kitchen and living room for the home scenario; and inner-city and outer-city for the street scenario [3]. In this work, four of the scenarios (cafe, car, home and street) are used, and for each noise scenario the locations and utterances are randomly selected. Together with Gaussian white noise, our experiments therefore cover five types of noise: Gaussian white noise, cafe, car, home and street. The SNR values are set to 0 dB, 5 dB, 10 dB and 15 dB.
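The paper does not spell out how noise is mixed at a given SNR. A common recipe, sketched here under that caveat, scales the noise so that the ratio of mean speech power to mean noise power matches the target and then adds the two signals sample by sample. The function names and the synthetic signals are illustrative.

```python
import math
import random


def mean_power(signal):
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone as "speech" and Gaussian white noise.
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
noise = [rng.gauss(0, 1) for _ in range(sample_rate)]

mixed = mix_at_snr(speech, noise, snr_db=5)
residual = [m - s for m, s in zip(mixed, speech)]
measured = 10 * math.log10(mean_power(speech) / mean_power(residual))
print(round(measured, 3))
```

At 0 dB the scaled noise carries the same mean power as the speech, which is the hardest condition tested in the paper.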

Figure 3: Illustration of data segmentation strategy.

3.2 Experiment Setup

In the experiment, as shown in Figure 3, each utterance is segmented into short audio segments using a one-second sliding window with a 50-ms hop. The short audio segments are then converted into 13-dimensional MFCC vectors. In the TDNN, the dimension of the first four layers is 512 and that of the fifth layer is 1500. The dimension of the two fully connected layers is 512, and each layer is followed by a batch normalisation layer [10] and a dropout layer with the dropout rate set to 0.2. The Adam optimiser [11] is used in training, with $\beta_1$ set to 0.95, $\beta_2$ to 0.999, and $\epsilon$ is . The initial learning rate is
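The segmentation step can be sketched as follows, assuming the 16 kHz TIMIT sampling rate, a one-second window and a 50 ms hop. How the authors treat a final partial window is not stated; this sketch simply drops it, and the helper name `segment_bounds` is hypothetical.

```python
SAMPLE_RATE = 16000               # TIMIT sampling rate in Hz
WINDOW = SAMPLE_RATE              # one-second window = 16000 samples
HOP = SAMPLE_RATE * 50 // 1000    # 50 ms hop = 800 samples


def segment_bounds(num_samples, window=WINDOW, hop=HOP):
    """Return (start, end) sample indices of every full window; the tail is dropped."""
    if num_samples < window:
        return []
    count = (num_samples - window) // hop + 1
    return [(i * hop, i * hop + window) for i in range(count)]


bounds = segment_bounds(3 * SAMPLE_RATE)  # a 3-second utterance
print(len(bounds), bounds[0], bounds[-1])  # 41 (0, 16000) (32000, 48000)
```

A three-second utterance thus yields 41 heavily overlapping segments, which is why the number of segments per utterance varies with its length.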

For comparison with the proposed approach, three baselines are built based on methods developed in previous studies [23, 27, 7]. The first baseline ("TDNN") is based on the X-vector architecture, which is now widely used for speaker recognition [23]. It uses the same TDNN as our model, followed by a statistics pooling layer and a fully connected layer. The second baseline ("TDNN + Attention") is similar to our approach, but uses only the frame attention model [27, 32]. The third baseline ("CNN") is based on a one-dimensional convolutional neural network [7]. It applies one-dimensional convolutions along the time axis and can yield good performance for speaker recognition. In the first layer, there are 64 kernels and each kernel size is (, the dimension of MFCC vectors). The stride of this layer is set to 1. In the following two convolutional layers, there are 128 and 256 kernels, respectively. The kernel size in the two layers is and the stride value is 2. A fully connected layer with 512 nodes is then used before the output of the model.

3.3 Evaluation Metric

In this paper, speaker recognition accuracy is computed over the 1260 utterances in the test set. We first compute the score of each segment of an utterance, and then combine the segment scores into a single decision for that utterance.

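Suppose $p_{i,j,k}$ denotes the normalised value of the $k$-th ($1 \le k \le K$) speaker after using Softmax when the input is the $j$-th ($1 \le j \le N_i$) segment of the $i$-th utterance ($1 \le i \le M$), where $K$ is the number of speakers, $M$ is the number of test utterances, and $N_i$ is the number of segments of the $i$-th utterance and is variable.

In equation 6, $S_{i,k}$ denotes the score of the $i$-th utterance if it is assumed to be spoken by the $k$-th speaker. It is computed as the average of the logarithms over all segments of the $i$-th utterance:

$$S_{i,k} = \frac{1}{N_i} \sum_{j=1}^{N_i} \log p_{i,j,k} \qquad (6)$$

where $N_i$ is the number of segments of the $i$-th utterance.

The most likely predicted speaker identity, $\hat{k}_i$, is computed by selecting the speaker whose score obtained using equation 6 is maximum:

$$\hat{k}_i = \arg\max_{1 \le k \le K} S_{i,k} \qquad (7)$$

The speaker recognition accuracy on the test set is computed by:

$$\mathrm{Accuracy} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}(\hat{k}_i = k_i) \qquad (8)$$

where $k_i$ is the ground truth of the corresponding utterance.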
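The scoring procedure can be sketched in a few lines. This version reads "logarithm average" as the mean of the per-segment log posteriors, which is an interpretation rather than something the text states unambiguously; the function names and the toy posteriors are made up for illustration.

```python
import math


def utterance_scores(segment_posteriors):
    """Average the log posterior of each speaker over the segments of one utterance."""
    n = len(segment_posteriors)
    num_speakers = len(segment_posteriors[0])
    # Softmax outputs are strictly positive, so the logarithm is defined.
    return [sum(math.log(seg[k]) for seg in segment_posteriors) / n
            for k in range(num_speakers)]


def predict_speaker(segment_posteriors):
    """Index of the speaker with the highest utterance-level score."""
    scores = utterance_scores(segment_posteriors)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(all_posteriors, truths):
    """Fraction of utterances whose predicted speaker matches the ground truth."""
    hits = sum(predict_speaker(p) == t for p, t in zip(all_posteriors, truths))
    return hits / len(truths)


# Toy data: 2 utterances with 3 and 2 segments, 3 candidate speakers.
utt_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1]]
utt_b = [[0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
print(predict_speaker(utt_a), predict_speaker(utt_b))  # 0 2
print(accuracy([utt_a, utt_b], [0, 1]))                # 0.5
```

Note that the second segment of `utt_a` favours speaker 1 on its own, yet the utterance-level decision is still speaker 0 once all segments are combined.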

4 Results

Table 1 shows the speaker recognition accuracy on clean speech data and on data corrupted with Gaussian white noise. To evaluate the robustness of the proposed approach and compare it with the three baselines, results are shown for SNR values from 0 dB to 15 dB. The attention-based models (TDNN+Attention and TS-Attention) clearly yield better recognition performance than the baselines without an attention model. A possible reason is that the attention model highlights the positive contributions of relevant features while reducing the negative impact of irrelevant ones in the observed acoustic information.

Our approach (TS-Attention) outperforms the three baselines in all SNR conditions. Even when the SNR is 0 dB, our approach still reaches 49.1%, a relative improvement of more than 10% over the TDNN+Attention baseline (44.5%). This is probably because our approach uses an additional embedding attention model compared with that baseline. In the two-stage model, the frame attention model indicates which positions are likely to be significant for speaker recognition, and the embedding attention model indicates which embedding elements are important.
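The relative improvement quoted above can be reproduced from the 0 dB column of Table 1. The helper below is a small illustrative calculation; the function name is not from the paper.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Accuracies at 0 dB Gaussian white noise, taken from Table 1.
print(round(relative_improvement(49.1, 44.5), 1))  # 10.3 (vs TDNN+Attention)
print(round(relative_improvement(49.1, 34.1), 1))  # 44.0 (vs TDNN)
```

In absolute terms the gain over TDNN+Attention at 0 dB is 4.6 percentage points.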

In addition, compared with the other two baselines and our approach, the TDNN baseline yields relatively lower performance when the SNR is 0 dB and 5 dB. Besides not using an attention model, it depends heavily on frame context, which is easily disturbed when the background noise is strong.

Model            0 dB   5 dB   10 dB   15 dB   Clean
CNN              43.2   58.7   66.1    76.5    92.9
TDNN             34.1   55.2   69.6    78.2    94.2
TDNN+Attention   44.5   62.1   75.1    81.6    95.5
TS-Attention     49.1   64.4   78.7    83.7    96.3
Table 1: Speaker recognition accuracy (%) on the TIMIT test set for clean speech and at different SNRs. The background noise is Gaussian white noise.

To further evaluate robustness, as introduced in Section 3.1, the three baselines and the proposed approach are also tested on four other types of noise (cafe, car, home and street). Tables 2, 3, 4 and 5 show the speaker recognition accuracy in the four noise scenarios at different SNRs.

In the four tables, the proposed approach consistently yields better recognition performance than the three baselines across the four noise scenarios and all SNRs. When the background noise scenario is "car", the recognition performance of both the proposed approach and the three baselines is better than in the other noise scenarios at the same SNR. Even when the SNR is 0 dB, the correct recognition rate reaches 81% with the proposed approach in the "car" noise scenario. Conversely, when the background noise is Gaussian white noise, the recognition performance is worse than in the other noise scenarios under the same conditions. This phenomenon might be related to the distribution of the noise in the time and frequency domains. Gaussian white noise is distributed relatively uniformly in both domains, while the noise made by car engines resembles a narrow-band impulse response whose interference often covers limited frequencies and occurs discontinuously.

Model            0 dB   5 dB   10 dB   15 dB   Clean
CNN              71.3   78.2   82.5    84.2    92.9
TDNN             68.5   77.6   82.6    88.4    94.2
TDNN+Attention   73.2   81.1   85.8    89.2    95.5
TS-Attention     76.1   82.2   86.8    90.4    96.3
Table 2: Speaker recognition accuracy (%) on the TIMIT test set for clean speech and cafe noise.

Model            0 dB   5 dB   10 dB   15 dB   Clean
CNN              78.8   81.5   84.7    89.9    92.9
TDNN             75.2   82.8   88.3    91.3    94.2
TDNN+Attention   79.3   85.4   88.2    91.9    95.5
TS-Attention     81.0   86.5   89.0    92.1    96.3
Table 3: Speaker recognition accuracy (%) on the TIMIT test set for clean speech and car noise.

Model            0 dB   5 dB   10 dB   15 dB   Clean
CNN              71.6   78.1   82.9    86.6    92.9
TDNN             66.4   76.4   85.0    88.7    94.2
TDNN+Attention   72.3   81.7   86.9    90.0    95.5
TS-Attention     73.5   82.9   87.9    91.2    96.3
Table 4: Speaker recognition accuracy (%) on the TIMIT test set for clean speech and home noise.

Model            0 dB   5 dB   10 dB   15 dB   Clean
CNN              74.4   80.4   84.4    88.0    92.9
TDNN             70.8   78.4   85.0    89.5    94.2
TDNN+Attention   75.6   82.8   87.5    90.5    95.5
TS-Attention     77.0   84.6   88.6    91.2    96.3
Table 5: Speaker recognition accuracy (%) on the TIMIT test set for clean speech and street noise.

To show how the proposed approach works, we visualize the generated attention weights in Figure 4. The input segment is randomly selected from the test data, and the model was trained on data corrupted by Gaussian white noise at an SNR of 0 dB. In Figure 4, sub-figure (a) is the clean segment and sub-figure (b) is the corresponding noisy segment. Sub-figure (c) shows the weights generated by the embedding attention model over the embedding, whose dimension is 1500. Some weights are quite small and some are close to 1 (the range is bounded by the sigmoid function). This suggests that some embedding elements are more relevant to the target than others. Sub-figure (d) shows the weights generated by the frame attention model over the frames of a one-second audio segment. The weights corresponding to voiced signals differ from one another, reflecting their different contributions to the target, and are much larger than those of unvoiced signals. It is therefore reasonable to conclude that the two attention models help to highlight the relevant features in both the frame domain and the embedding domain, which is probably the key factor behind the better performance compared with the baselines.

Figure 4: The Visualization of attention weights. (a): Clean segment; (b): Noise segment. (c): Embedding attention weights. (d): Frame attention weights. In (a) and (b), X-axis represents the sampling index and Y-axis represents the amplitude of speech signals. In (c) and (d), X-axis represents the index of embedding elements and frames respectively, Y-axis represents the attention weight value.

5 Conclusion and Future Work

In this paper, a two-stage attention model was proposed to tackle speaker recognition in noisy environments. The proposed model, which contains an embedding attention model and a frame attention model, yields better performance than the three baselines (CNN, TDNN and TDNN+Attention) when the speech data is corrupted by five types of noise at different SNRs.

In future work, the approach will be tested on more speaker recognition datasets, and more complex network architectures will be investigated to further improve its robustness and effectiveness.


  • [1] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 (2014).
  • [2] Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. Attention-based models for speech recognition. In Advances in neural information processing systems (2015), pp. 577–585.
  • [3] Dean, D. B., Kanagasundaram, A., Ghaemmaghami, H., Rahman, M. H., and Sridharan, S. The qut-noise-sre protocol for the evaluation of noisy speaker recognition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Interspeech 2015 (2015), International Speech Communication Association, pp. 3456–3460.
  • [4] Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19, 4 (2010), 788–798.
  • [5] Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n 93 (1993).
  • [6] Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (2013), IEEE, pp. 6645–6649.
  • [7] Hsu, W.-N., Zhang, Y., and Glass, J. Learning latent representations for speech generation and transformation. arXiv preprint arXiv:1704.04222 (2017).
  • [8] Hu, D. An introductory survey on attention mechanisms in nlp problems. In Proceedings of SAI Intelligent Systems Conference (2019), Springer, pp. 432–448.
  • [9] Huang, X., et al. Attention-based convolutional neural network for semantic relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (2016), pp. 2526–2536.
  • [10] Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
  • [11] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [12] Li, L., Tang, S., Deng, L., Zhang, Y., and Tian, Q. Image caption with global-local attention. In Thirty-First AAAI Conference on Artificial Intelligence (2017).
  • [13] Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
  • [14] Mejjati, Y. A., Richardt, C., Tompkin, J., Cosker, D., and Kim, K. I. Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems (2018), pp. 3693–3703.
  • [15] Ming, J., Hazen, T. J., Glass, J. R., and Reynolds, D. A. Robust speaker recognition in noisy conditions. IEEE Transactions on Audio, Speech, and Language Processing 15, 5 (2007), 1711–1723.
  • [16] Mirsamadi, S., Barsoum, E., and Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017), IEEE, pp. 2227–2231.
  • [17] Moritz, N., Hori, T., and Le Roux, J. Triggered attention for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), IEEE, pp. 5666–5670.
  • [18] Okabe, K., Koshinaka, T., and Shinoda, K. Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963 (2018).
  • [19] Oktay, O., Schlemper, J., Folgoc, L. L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N. Y., Kainz, B., et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
  • [20] Peddinti, V., Povey, D., and Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association (2015).
  • [21] Poddar, A., Sahidullah, M., and Saha, G. Speaker verification with short utterances: a review of challenges, trends and opportunities. IET Biometrics 7, 2 (2017), 91–101.
  • [22] rahman Chowdhury, F. R., Wang, Q., Moreno, I. L., and Wan, L. Attention-based models for text-dependent speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), IEEE, pp. 5359–5363.
  • [23] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), IEEE, pp. 5329–5333.
  • [24] Variani, E., Lei, X., McDermott, E., Moreno, I. L., and Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), IEEE, pp. 4052–4056.
  • [25] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems (2017), pp. 5998–6008.
  • [26] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 3156–3164.
  • [27] Wang, Q., Okabe, K., Lee, K. A., Yamamoto, H., and Koshinaka, T. Attention mechanism in speaker recognition: What does it learn in deep speaker embedding? In 2018 IEEE Spoken Language Technology Workshop (SLT) (2018), IEEE, pp. 1052–1059.
  • [28] Woo, S., Park, J., Lee, J.-Y., and So Kweon, I. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 3–19.
  • [29] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (2015), pp. 2048–2057.
  • [30] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies (2016), pp. 1480–1489.
  • [31] Zhang, Y., Du, J., Wang, Z., Zhang, J., and Tu, Y. Attention based fully convolutional network for speech emotion recognition. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (2018), IEEE, pp. 1771–1775.
  • [32] Zhu, Y., Ko, T., Snyder, D., Mak, B., and Povey, D. Self-attentive speaker embeddings for text-independent speaker verification. In Interspeech (2018), pp. 3573–3577.