Collaborative Learning for Language and Speaker Recognition


This paper presents a unified model to perform language and speaker recognition simultaneously. The model is based on a multi-task recurrent neural network, where the output of one task is fed as an input to the other, leading to a collaborative learning framework that can improve both language and speaker recognition by sharing information between the tasks. The preliminary experiments presented in this paper demonstrate that the multi-task model outperforms similar task-specific models on both the language and speaker tasks. The improvement in language recognition is especially remarkable, which we believe is due to the speaker normalization effect of using the information from the speaker recognition component.


Lantian Li, Zhiyuan Tang, Dong Wang, Andrew Abel, Yang Feng, Shiyue Zhang
Center for Speech and Language Technologies, Tsinghua University, China
Xi'an Jiaotong-Liverpool University, Suzhou, China
Email: {lilt, tangzy, fengyang, zhangsy}

Index Terms: language recognition, speaker recognition, deep learning, recurrent neural network

1 Introduction

Language recognition (LRE) [1] and speaker recognition (SRE) [2] are two important tasks in speech processing. Traditionally, research in each of these two fields seldom acknowledges the other, although the two share a number of techniques, such as SVM [3], the i-vector model [4, 5, 6, 7], and deep neural models [8, 9, 10, 11, 12, 13, 14, 15, 16]. This lack of overlap can be largely attributed to the intuition that speaker characteristics are language independent in SRE, while dealing with speaker variation is regarded as a basic requirement in LRE. This independent processing of language identities and speaker traits, however, is not the way we human beings process speech signals: it is easy to imagine that our brain recognizes speaker traits and language identities simultaneously, and that success in identifying languages helps discriminate between speakers, and vice versa.

A number of researchers have noticed that language and speaker are two correlated factors. In speaker recognition, it has been confirmed that language mismatch leads to serious performance degradation [17, 18, 19], and some language-aware models have been successfully demonstrated [20]. In language recognition, speaker variation is seen as a major source of corruption and is often normalized in the front-end, e.g., by VTLN [21, 22] or CMLLR [23]. These previous studies suggest that speaker and language are inter-correlated factors and should be modelled in an integrated way.

This paper presents a novel collaborative learning approach which models speaker and language variations in a single neural model architecture. The key idea is to propagate the output of one task to the input of the other, resulting in a multi-task recurrent model. In this way, the two tasks can be learned and inferred simultaneously and collaboratively, as illustrated in Figure 1. It should be noted that collaborative learning is a general framework and the component for each task can be implemented using any model, but in this paper, we have chosen to make use of recurrent neural networks (RNN) due to their great potential and good results in various speech processing tasks, including SRE [9, 24] and LRE [15, 22, 25, 26]. Our experiments on the WSJ English database and a Chinese database of a comparable volume demonstrate that the collaborative training method can improve performance on both tasks, and the performance gains on language recognition are especially remarkable.

In summary, the contributions of this paper are: firstly, we demonstrate that SRE and LRE can be jointly learned by collaborative learning, and that the collaboration benefits both tasks; secondly, we show that the collaborative learning is especially beneficial for language recognition, which is likely to be due to the normalization effect of using the speaker information provided from the speaker recognition component.

Figure 1: Multi-task recurrent model for language and speaker recognition.

The rest of the paper is organized as follows: we first discuss some related work in Section 2, and then present the collaborative learning architecture in Section 3. The experiments are reported in Section 4, and the paper is concluded in Section 5.

2 Related work

This collaborative learning approach was proposed by Tang et al. to address the close relationship between speech and speaker recognition [27]. The idea of multi-task learning for speech signals has been extensively studied, e.g., [28, 29], and more research on multi-task learning can be found in [30]. The key difference between collaborative learning and traditional multi-task learning is that the inter-task knowledge sharing is online, i.e., the results of one task impact the other tasks, and this impact is propagated back through the feedback connections, leading to a collaborative and integrated information processing framework.

The close correlation between speaker traits and language identities is well known to both SRE and LRE researchers. In language recognition, the conventional phonetic approach [31, 32] relies on its component speech recognition system to deal with speaker variation. In the HMM-GMM era, this often relied on various front-end normalization techniques, such as vocal tract length normalization (VTLN) [21, 22] and constrained maximum likelihood linear regression (CMLLR) [23]. In the HMM-DNN era, a DNN model has the natural capability to normalize speaker variation when sufficient training data is available. This capability has been naturally exploited in i-vector based LRE approaches [33, 34]. However, for purely acoustic DNN/RNN methods, e.g., [14, 15], there is limited research into speaker-aware learning for LRE.

For speaker recognition, language is often not a major concern, perhaps due to a widely held assumption that speaker traits are language independent. However, from an engineering perspective, language mismatch has been found to pose a serious problem due to the different patterns of acoustic space in different languages, according to their own phonetic systems [17, 18, 19]. A simple approach is to train a multi-lingual speaker model by data pooling [17, 18], but this approach does not model the correlation between language identities and speaker traits. Another potential approach is to treat language and speaker as two random variables and represent them by a linear Gaussian model [35], but this linear Gaussian assumption is perhaps too strong.

The collaborative learning approach benefits both tasks. For SRE, the language information provided by LRE helps to identify the acoustic units that the recognition should focus on, and for LRE, the speaker information provided by SRE helps to normalize speaker variation. It is important to note that the models for the two tasks are jointly optimized, and that information from both tasks is shared during decoding. This means that collaborative learning is collaborative in both model training and inference.

3 Multi-task RNN and collaborative learning

This section first presents the neural model structure for single tasks, and then extends this to the multi-task recurrent model for collaborative learning.

3.1 Basic single-task model

For the work in this paper, we have chosen a particular RNN, the long short-term memory (LSTM) model [36], to build the baseline single-task systems for SRE and LRE. LSTM has been shown to deliver good performance for both SRE [9] and LRE [15, 22, 25]. In particular, the recurrent LSTM structure proposed in [37] is used here, as shown in Figure 2, and the associated computation is as follows:

i_t = σ(W_ix x_t + W_ir r_{t-1} + W_ic c_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fr r_{t-1} + W_fc c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_cx x_t + W_cr r_{t-1} + b_c)
o_t = σ(W_ox x_t + W_or r_{t-1} + W_oc c_t + b_o)
m_t = o_t ⊙ h(c_t)
r_t = W_rm m_t
p_t = W_pm m_t
y_t = W_yr r_t + W_yp p_t + b_y

Figure 2: Basic recurrent LSTM model for LRE and SRE single-task baselines.

In the above equations, the W terms denote weight matrices and the b terms denote bias vectors. x_t and y_t are the input and output vectors; i_t, f_t, o_t represent the input, forget and output gates respectively; c_t is the cell and m_t is the cell output. r_t and p_t are the two output components derived from m_t, in which r_t is recurrent and used as an input of the next time step, while p_t is not recurrent and contributes to the present output only. σ is the logistic sigmoid function, and g and h are non-linear activation functions, often chosen to be hyperbolic. ⊙ denotes element-wise multiplication.
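As a concrete illustration, one step of this recurrent structure can be sketched in NumPy. This is a minimal reading of the equations from [37], not the actual training code; all weight shapes and dictionary keys here are illustrative, and the peephole weights (W_ic, W_fc, W_oc) are taken as diagonal, i.e., stored as vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev, W, b):
    """One step of the LSTM with recurrent (r) and non-recurrent (p)
    projections. W holds weight matrices (peepholes as vectors), b biases."""
    i = sigmoid(W["ix"] @ x + W["ir"] @ r_prev + W["ic"] * c_prev + b["i"])
    f = sigmoid(W["fx"] @ x + W["fr"] @ r_prev + W["fc"] * c_prev + b["f"])
    c = f * c_prev + i * np.tanh(W["cx"] @ x + W["cr"] @ r_prev + b["c"])
    o = sigmoid(W["ox"] @ x + W["or"] @ r_prev + W["oc"] * c + b["o"])
    m = o * np.tanh(c)        # cell output
    r = W["rm"] @ m           # recurrent projection, fed to the next step
    p = W["pm"] @ m           # non-recurrent projection
    y = W["yr"] @ r + W["yp"] @ p + b["y"]
    return y, r, c
```

Only r and c are carried to the next time step; p contributes to the present output only, matching the description above.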

3.2 Multi-task recurrent model

The basic idea of the multi-task recurrent model, as shown in Figure 1, is to use the output of one task at the current time step as an auxiliary input into the other task at the next step. In this study, we use the recurrent LSTM model to build the LRE and SRE components, and then combine them with a number of inter-task recurrent connections. This results in a multi-task recurrent model, by which LRE and SRE can be trained and inferred in a collaborative way. The complete model structure is shown in Figure 3, where the superscripts l and s denote the LRE and SRE tasks respectively, and the dashed lines represent the inter-task recurrent connections.

A multitude of possible model configurations can be selected. For example, the feedback information can be extracted from the cell c_t or the cell output m_t, or from the output components r_t or p_t; the feedback information can be propagated to the input variable x_t, the input gate i_t, the output gate o_t, the forget gate f_t, or the non-linear function g.

Figure 3: Multi-task recurrent learning for LRE and SRE.

Given the above alternatives, the multi-task recurrent model is rather flexible. The structure shown in Figure 3 is one simple example, where the feedback information is extracted from both the recurrent projection r_t and the non-recurrent projection p_t, and propagated to the non-linear function g. Using this feedback, the computation for LRE (superscript l) is given as follows:

i_t^l = σ(W_ix^l x_t + W_ir^l r_{t-1}^l + W_ic^l c_{t-1}^l + b_i^l)
f_t^l = σ(W_fx^l x_t + W_fr^l r_{t-1}^l + W_fc^l c_{t-1}^l + b_f^l)
c_t^l = f_t^l ⊙ c_{t-1}^l + i_t^l ⊙ g(W_cx^l x_t + W_cr^l r_{t-1}^l + W_cr^{ls} r_{t-1}^s + W_cp^{ls} p_{t-1}^s + b_c^l)
o_t^l = σ(W_ox^l x_t + W_or^l r_{t-1}^l + W_oc^l c_t^l + b_o^l)
m_t^l = o_t^l ⊙ h(c_t^l)
r_t^l = W_rm^l m_t^l
p_t^l = W_pm^l m_t^l
y_t^l = W_yr^l r_t^l + W_yp^l p_t^l + b_y^l

and the computation for SRE (superscript s) is symmetric, with the feedback taken from the LRE component; for example, the cell update becomes:

c_t^s = f_t^s ⊙ c_{t-1}^s + i_t^s ⊙ g(W_cx^s x_t + W_cr^s r_{t-1}^s + W_cr^{sl} r_{t-1}^l + W_cp^{sl} p_{t-1}^l + b_c^s)

with the remaining equations identical in form to the LRE case.
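To make the inter-task connection concrete, the following sketch shows the cell update of one task when the other task's previous projections enter through the non-linear function, as in the configuration of Figure 3. The weight names (e.g., "cr_x", "cp_x") are illustrative, not from the paper, and the peephole terms are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_update_with_feedback(x, r_prev, c_prev, r_other, p_other, W, b):
    """Cell update of one task's LSTM when the feedback from the other task
    (its previous recurrent and non-recurrent projections) is propagated to
    the non-linear cell-input function."""
    i = sigmoid(W["ix"] @ x + W["ir"] @ r_prev + b["i"])
    f = sigmoid(W["fx"] @ x + W["fr"] @ r_prev + b["f"])
    z = (W["cx"] @ x + W["cr"] @ r_prev
         + W["cr_x"] @ r_other + W["cp_x"] @ p_other  # inter-task feedback
         + b["c"])
    return f * c_prev + i * np.tanh(z)
```

The same function serves both components: for the LRE cell, r_other and p_other come from the SRE branch at the previous step, and vice versa.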

3.3 Model training

The model can be trained 'completely', where each training sample is labelled with both speaker and language, or 'incompletely', where only one task label is available. Our previous research has demonstrated that both cases are feasible [27]. In this preliminary study, we have focused on 'complete' training. The natural stochastic gradient descent (NSGD) algorithm [38] is employed to train the model.
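As a toy sketch of the 'complete' training criterion, the objective simply sums the cross-entropy losses of the two tasks when both labels are available (the actual systems are trained with NSGD in Kaldi; this only illustrates the joint objective):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def joint_loss(logits_l, lang_label, logits_s, spk_label):
    """'Complete' training objective: sum of the language and speaker
    cross-entropy losses for one training sample."""
    ce_l = -np.log(softmax(logits_l)[lang_label])
    ce_s = -np.log(softmax(logits_s)[spk_label])
    return ce_l + ce_s
```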

4 Experiments

This section first describes the data profile, then presents the baseline systems, and finally reports the experimental results of our collaborative learning approach.

4.1 Data

Two databases were used to perform the experiment: the WSJ database in English and the CSLT-C300 database in Chinese1. All the utterances in both databases were labelled with both language and speaker identities. The development set involves two subsets: WSJ-E200, which contains speakers ( utterances) selected from WSJ, and CSLT-C200, which contains speakers ( utterances) selected from the CSLT-C300 database. The development set was used to train the i-vector, SVM, and multi-task recurrent models.

The evaluation set contains an English subset WSJ-E110, which contains speakers selected from WSJ, and a Chinese subset CSLT-C100, which contains speakers selected from the CSLT-C300 database. For each speaker in each subset, utterances were used to enrol the speaker and language identities, and the remaining English utterances and Chinese utterances were used for testing. For SRE, the test is pair-wise, leading to target trials and imposter trials in English, plus target trials and imposter trials in Chinese. For LRE, the number of test trials equals the number of test utterances, for English and for Chinese.

4.2 LRE and SRE baselines

Here, we first present the LRE and SRE baselines. For each task, two baseline systems were constructed: one based on i-vectors (still regarded as state-of-the-art), and the other based on LSTM. All experiments were conducted with the Kaldi toolkit [39].

i-vector baseline

For the i-vector baseline, the acoustic features were -dimensional MFCCs. The number of Gaussian components of the universal background model (UBM) was , and the dimension of the i-vectors was . The resulting i-vectors were used to conduct both SRE and LRE with different scoring methods. For SRE, we consider the simple Cosine distance, as well as the popular discriminative models LDA and PLDA; for LRE, we consider Cosine distance and SVM. All the discriminative models were trained on the development set.

The results of the SRE baseline are reported in Table 1, in terms of equal error rate (EER). We tested two scenarios: a Full-length test, which uses the entire enrolment and test utterances, and a Short-length test, which involves only second of speech (sampled from the original data after voice activity detection is applied). In both scenarios, the language of each test is assumed to be known in advance, i.e., the tests on the English and Chinese datasets are independent.
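For reference, EER can be computed from the target and imposter score lists with a simple threshold sweep; this is a generic sketch, not Kaldi's exact implementation:

```python
import numpy as np

def compute_eer(target_scores, imposter_scores):
    """EER: the operating point where the false-alarm rate equals the
    miss rate, found by sweeping the decision threshold over all scores."""
    scores = np.concatenate([target_scores, imposter_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(imposter_scores))])
    order = np.argsort(-scores)          # accept trials from high to low score
    labels = labels[order]
    fn = len(target_scores) - np.cumsum(labels)  # targets rejected (misses)
    fp = np.cumsum(1 - labels)                   # imposters accepted (false alarms)
    miss = fn / len(target_scores)
    fa = fp / len(imposter_scores)
    idx = np.argmin(np.abs(miss - fa))           # point where the two rates cross
    return (miss[idx] + fa[idx]) / 2
```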

Test    System     Dataset    Cosine   LDA    PLDA
Full    i-vector   English    0.88     0.70   0.62
                   Chinese    1.28     0.97   0.84
        r-vector   English    1.25     1.38   3.57
                   Chinese    1.70     1.61   4.93
Short   i-vector   English    7.00     4.01   3.47
                   Chinese    9.12     6.16   5.69
        r-vector   English    3.27     2.70   7.88
                   Chinese    4.77     3.99   8.21
Table 1: SRE baseline results, in EER(%).

LRE here is an identification task, with the purpose of discriminating between two languages (English and Chinese). We therefore use the identification error rate (IDR) [40] to measure performance, defined as the fraction of identification mistakes over the total number of identification trials. For a more thorough comparison, the number of identification errors (IDE) is also reported. The results of the i-vector baseline system are reported in Table 2.
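The two metrics are straightforward to compute from the decisions; a minimal sketch:

```python
def identification_errors(predicted, reference):
    """IDE: the number of wrong language decisions.
    IDR: IDE as a percentage of the total number of identification trials."""
    ide = sum(p != r for p, r in zip(predicted, reference))
    idr = 100.0 * ide / len(reference)
    return idr, ide
```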

Test System IDR(%) IDE
Full i-vector/Cosine 3.43 763
i-vector/SVM 0.01 2
r-vector/Cosine 0.11 25
r-vector/SVM 0.21 47
r-vector/Softmax 0.13 29
Short i-vector/Cosine 10.21 2270
i-vector/SVM 1.40 311
r-vector/Cosine 0.98 218
r-vector/SVM 0.63 139
r-vector/Softmax 0.58 129
Table 2: LRE baseline results.

r-vector baseline

The r-vector baseline is based on the recurrent LSTM structure shown in Figure 2. The SRE and LRE systems use the same configurations: the dimensionality of the cell was set to , and the dimensionality of both the recurrent and non-recurrent projections was set to . For the SRE system, the output corresponds to the speakers in the training set; for LRE, the output corresponds to the two languages to identify. The outputs of both projections were concatenated and averaged over all the frames of an utterance, resulting in a -dimensional 'r-vector' for that utterance. The r-vector derived from the SRE system represents speaker characteristics, and the r-vector derived from the LRE system represents the language identity.
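The r-vector extraction described above (concatenate the two projections per frame, then average over the utterance) and the simple Cosine scoring can be sketched as follows; function names and array shapes are illustrative:

```python
import numpy as np

def extract_r_vector(r_frames, p_frames):
    """r_frames: (T, dim_r) recurrent-projection outputs for T frames;
    p_frames: (T, dim_p) non-recurrent-projection outputs.
    Returns one fixed-size r-vector for the utterance."""
    per_frame = np.concatenate([r_frames, p_frames], axis=1)  # (T, dim_r + dim_p)
    return per_frame.mean(axis=0)

def cosine_score(v1, v2):
    """Cosine similarity between two r-vectors (or i-vectors)."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```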

As in the i-vector baseline, decisions were made based on the distance between r-vectors, measured either by the Cosine distance or by discriminative models. The same discriminative models as in the i-vector baseline were used, except that in the LRE system, the softmax output of the task-specific LSTM can be directly used to identify the language. The results are shown in Table 1 and Table 2 for SRE and LRE, respectively.

The results in Table 1 show that for SRE, the i-vector system with PLDA performs better than the r-vector system in the Full-length test. However, in the Short-length test, the r-vector system is clearly better. This is understandable, as the i-vector model is generative and relies on sufficient data to estimate the data distribution; the LSTM model, in contrast, is discriminative, and speaker information can be extracted from even a single frame. Moreover, the PLDA model works very well for the i-vector system, but rather poorly for the r-vector system. We suspect this could be due to PLDA's unreliable Gaussian assumption on the residual noise. A pair-wise t-test confirms that the performance advantage of the r-vector/LDA system over the i-vector/PLDA system in the Short-length test is statistically significant (p < 1e-5).

The results in Table 2 show a similar trend: the i-vector system (with SVM) works well in the Full-length test, but in the Short-length test, the r-vector system shows much better performance, even with the simple Cosine distance. Again, this can be explained by the fact that the i-vector model is generative, while the r-vector model is discriminative. The advantage of the r-vector model on short utterances has previously been observed for both LRE [15] and SRE [10].

4.3 Collaborative learning

The multi-task recurrent LSTM system, as shown in Figure 3, was constructed by combining the LRE and SRE r-vector systems, augmented with inter-task recurrent connections. Following [27], we selected the output of the recurrent projection layer as the feedback information, and tested several configurations in which the feedback information from one task is propagated into different components of the other task. The results are reported in Tables 3 and 4 for SRE and LRE respectively, where i, f and o denote the input, forget and output gates, and g denotes the non-linear function.

Feedback EER(%)
Input Full Short
Eng. Chs. Eng. Chs.
r-vector Baseline 1.38 1.61 2.70 3.99
1.27 1.43 2.50 3.61
1.38 1.38 2.55 3.52
1.19 1.31 2.48 3.66
1.37 1.48 2.67 3.52
1.32 1.31 2.52 3.69
Table 3: SRE results with collaborative learning.
Feedback IDE
Input Full Short
Cosine SVM Softmax Cosine SVM Softmax
r-vector Baseline 25 47 29 218 139 129
5 2 0 11 6 2
1 0 0 3 1 1
11 2 0 21 8 3
0 0 1 2 2 1
6 2 0 17 10 2
Table 4: LRE results with collaborative learning.

The results show that collaborative learning provides consistent performance improvement on both SRE and LRE, regardless of which component receives the feedback. The results suggest that the output gate is an appropriate component for SRE to receive the feedback, whereas for LRE, the forget gate seems a more suitable choice. However, these observations are based on relatively small databases; more experiments on larger datasets are required to confirm and understand them. Finally, it should be highlighted that collaborative training provides very impressive performance gains for LRE: it significantly improves on the single-task r-vector baseline, and beats the i-vector baseline even on the full-length task. This is likely because the LRE model trained with the limited training data is heavily disturbed by speaker variation, and the speaker information provided by the SRE component plays a valuable role in speaker normalization.

5 Conclusions

This paper proposed a novel collaborative learning architecture that performs speaker and language recognition in a single unified model, based on a multi-task recurrent neural network. Preliminary experiments demonstrated that the proposed approach delivers consistent performance improvements over the single-task baselines for both SRE and LRE. The performance gain on LRE is particularly impressive, which we suggest could be due to the effect of speaker normalization. Future work involves experimenting with larger databases and analyzing the properties of the collaborative mechanism, e.g., trainability, stability and extensibility.


  1. This database was collected by our institute for commercial usage, so we cannot release the wave data, but the Fbanks and MFCCs in the Kaldi format have been published online; the Kaldi recipe to reproduce the results is also available there.


  1. J. Navratil, “Spoken language recognition-a step toward multilinguality in speech processing,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 678–685, 2001.
  2. F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 2004, pp. 430–451, 2004.
  3. W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, “Support vector machines for speaker and language recognition,” Computer Speech & Language, vol. 20, no. 2, pp. 210–229, 2006.
  4. N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
  5. Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in ICASSP.    IEEE, 2014, pp. 1695–1699.
  6. N. Dehak, A.-C. Pedro, D. Reynolds, and R. Dehak, “Language recognition via ivectors and dimensionality reduction,” in Interspeech, 2011, pp. 857–860.
  7. D. Martínez, O. Plchot, L. Burget, O. Glembek, and P. Matejka, “Language recognition in ivectors space,” in Interspeech, 2011, pp. 861–864.
  8. E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in ICASSP, 2014, pp. 357–366.
  9. G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in ICASSP.    IEEE, 2016, pp. 5115–5119.
  10. D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in SLT, 2016.
  11. I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic language identification using deep neural networks,” in ICASSP.    IEEE, 2014, pp. 5337–5341.
  12. A. Lozano-Diez, R. Zazo Candil, J. González Domínguez, D. T. Toledano, and J. Gonzalez-Rodriguez, “An end-to-end approach to language identification in short utterances using convolutional neural networks,” in Interspeech, 2015.
  13. D. Garcia-Romero and A. McCree, “Stacked long-term tdnn for spoken language recognition,” in Interspeech, 2016, pp. 3226–3230.
  14. M. Jin, Y. Song, I. Mcloughlin, L.-R. Dai, and Z.-F. Ye, “LID-senone extraction via deep neural networks for end-to-end language identification,” in Odyssey, 2016.
  15. R. Zazo, A. Lozano-Diez, J. Gonzalez-Dominguez, D. T. Toledano, and J. Gonzalez-Rodriguez, “Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks,” PLoS ONE, vol. 11, no. 1, p. e0146917, 2016.
  16. M. Kotov and M. Nastasenko, “Language identification using time delay neural network d-vector on short utterances,” in Speech and Computer: 18th International Conference, SPECOM 2016, vol. 9811.    Springer, 2016, p. 443.
  17. B. Ma and H. Meng, “English-Chinese bilingual text-independent speaker verification,” in ICASSP.    IEEE, 2004, pp. V–293.
  18. R. Auckenthaler, M. J. Carey, and J. Mason, “Language dependency in text-independent speaker verification,” in ICASSP.    IEEE, 2001, pp. 441–444.
  19. A. Misra and J. H. L. Hansen, “Spoken language mismatch in speaker verification: An investigation with nist-sre and crss bi-ling corpora,” in IEEE Spoken Language Technology Workshop (SLT).    IEEE, 2014, pp. 372–377.
  20. A. Rozi, D. Wang, L. Li, and T. F. Zheng, “Language-aware plda for multilingual speaker recognition,” in O-COCOSDA 2016, 2016.
  21. P. Matejka, L. Burget, P. Schwarz, and J. Cernocky, “Brno university of technology system for nist 2005 language recognition evaluation,” in Speaker and Language Recognition Workshop,IEEE Odyssey 2006.    IEEE, 2006, pp. 1–7.
  22. G. Gelly, J.-L. Gauvain, V. Le, and A. Messaoudi, “A divide-and-conquer approach for language identification based on recurrent neural networks,” in Interspeech, 2016, pp. 3231–3235.
  23. W. Shen and D. Reynolds, “Improved gmm-based language recognition using constrained mllr transforms,” in ICASSP.    IEEE, 2008, pp. 4149–4152.
  24. S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” arXiv preprint arXiv:1701.00562, 2017.
  25. J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, “Automatic language identification using long short-term memory recurrent neural networks.” in Interspeech, 2014, pp. 2155–2159.
  26. C. Salamea, L. F. D’Haro, R. de Córdoba, and R. San-Segundo, “On the use of phone-gram units in recurrent neural networks for language identification,” in Odyssey, 2016, pp. 117–123.
  27. Z. Tang, L. Li, D. Wang, and R. C. Vipperla, “Collaborative joint training with multi-task recurrent model for speech and speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
  28. X. Li and X. Wu, “Modeling speaker variability using long short-term memory networks for speech recognition,” in Interspeech, 2015, pp. 1086–1090.
  29. Y. Qian, T. Tan, and D. Yu, “Neural network based multi-factor aware joint training for robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2231–2240, 2016.
  30. D. Wang and T. F. Zheng, “Transfer learning for speech and language processing,” in APSIPA, 2015, pp. 1225–1237.
  31. L. F. Lamel and J.-L. Gauvain, “Language identification using phone-based acoustic likelihoods,” in ICASSP, vol. 1.    IEEE, 1994, pp. I–293.
  32. M. A. Zissman et al., “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Transactions on speech and audio processing, vol. 4, no. 1, p. 31, 1996.
  33. Y. Song, B. Jiang, Y. Bao, S. Wei, and L.-R. Dai, “I-vector representation based on bottleneck features for language identification,” Electronics Letters, vol. 49, no. 24, pp. 1569–1570, 2013.
  34. Y. Tian, L. He, Y. Liu, and J. Liu, “Investigation of senone-based long-short term memory rnns for spoken language recognition,” Odyssey, pp. 89–93, 2016.
  35. L. Lu, Y. Dong, Z. Xianyu, L. Jiqing, and W. Haila, “The effect of language factors for robust speaker recognition,” in ICASSP.    IEEE, 2009, pp. 4217–4220.
  36. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  37. H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Interspeech, 2014, pp. 338–342.
  38. D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of deep neural networks with natural gradient and parameter averaging,” arXiv preprint arXiv:1410.7455, 2014.
  39. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584.    IEEE Signal Processing Society, 2011.
  40. B. Yin, E. Ambikairajah, and F. Chen, “Hierarchical language identification based on automatic language clustering,” in Interspeech, 2007, pp. 178–181.