Robust End to End Speaker Verification Using EEG

Yan Han, Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik (equal author contribution)

Brain Machine Interface Lab, University of Texas at Austin

Abstract

In this paper we demonstrate that the performance of a speaker verification system can be improved by concatenating electroencephalography (EEG) signal features with acoustic features. We use a state-of-the-art end-to-end deep learning model for performing speaker verification, and we demonstrate our results on noisy speech.

Our results indicate that EEG signals can improve the robustness of speaker verification systems.


Index Terms: Electroencephalography (EEG), Speaker Verification, Deep Learning

1 Introduction

Speaker verification is the process of verifying whether an utterance belongs to a specific speaker, based on that speaker's known utterances. Speaker verification systems are used as authentication systems in many voice-activated technologies, for example the voice match feature of Google Home. In this work we focus on text-independent speaker verification [1, 2]. Deep learning based speaker verification systems [3, 4, 5, 6, 7, 8] have become popular in recent years and have considerably improved verification performance.

Even though deep learning models have improved the state-of-the-art performance of speaker verification systems, their performance degrades in the presence of background noise, and they are prone to attacks in the form of voice mimicking by adversaries interested in breaching the authentication system. We propose to use electroencephalography (EEG) signals to address these challenges. EEG is a non-invasive way of measuring the electrical activity of the human brain. In [9] we demonstrated that EEG features can help automatic speech recognition (ASR) systems overcome the performance loss caused by background noise. Further, prior work [10, 11, 12, 13, 14, 15, 16] shows that EEG patterns are unique to each individual and can be used for biometric identification. Motivated by these prior results, we concatenated acoustic features with EEG features to improve the performance of speaker verification systems operating under very noisy conditions. The use of EEG features to improve the robustness of speaker verification systems is also motivated by the unique robustness to environmental artifacts exhibited by the human auditory cortex [17, 18]. Speaker verification using EEG will also help people with speaking disabilities, such as broken speech, to use voice authentication technologies with better accuracy, thereby improving technology accessibility.
For this work we used the end-to-end speaker verification model described in [4], as it is the current state-of-the-art model for speaker verification. The major contribution of this paper is demonstrating that EEG features can improve the robustness of speaker verification systems.

2 Speaker Verification Model

We used the generalized end-to-end model introduced in [4] for performing text-independent speaker verification. We used Google's TensorFlow deep learning library to implement the model.

The model consists of a single layer of long short-term memory (LSTM) [19] or gated recurrent unit (GRU) [20] cells with 128 hidden units, followed by a dense layer, which is a fully connected linear layer, and a final softmax activation. During the training phase, an utterance of an enrollment candidate s is passed to the LSTM or GRU cell and L2 normalization is applied to the dense layer output to derive the embedding, or d-vector, for that utterance. Similarly, d-vectors are derived for all the utterances of the enrollment candidate s. Since different utterances can have different lengths, we used the dynamic recurrent neural network (RNN) cell of TensorFlow. The speaker model, or centroid, c_s for candidate s is defined as the average of the d-vectors obtained for all the enrollment utterances of s.
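
The following is a minimal sketch of how such a d-vector network could be assembled with TensorFlow's Keras API (the paper only states that TensorFlow was used). The 128 hidden units follow the description above, while the width of the dense projection is an illustrative assumption.

```python
import tensorflow as tf

def build_dvector_model(feature_dim, embedding_dim=64):
    """Sketch of the d-vector network: one LSTM layer with 128 hidden
    units followed by a dense (fully connected linear) layer; the
    L2-normalized dense output is used as the utterance embedding.
    embedding_dim is an illustrative choice, not stated in the paper."""
    inputs = tf.keras.Input(shape=(None, feature_dim))   # variable-length utterance
    x = tf.keras.layers.LSTM(128)(inputs)                # last hidden state
    x = tf.keras.layers.Dense(embedding_dim)(x)          # linear projection
    dvector = tf.keras.layers.Lambda(
        lambda v: tf.math.l2_normalize(v, axis=1))(x)    # L2 normalization
    return tf.keras.Model(inputs, dvector)

# Example: 13 MFCC features concatenated with 30 reduced EEG features per frame.
model = build_dvector_model(feature_dim=43)
```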

Consider a set of N enrollment candidates {s_1, s_2, ..., s_N} and assume each candidate s_j has M utterances. As per our earlier definition, we can build a set of centroids {c_1, c_2, ..., c_N}, where c_j is the centroid for s_j. A two-dimensional similarity matrix with N*M rows and N columns is then constructed, which computes the cosine similarity between every centroid c_k and the d-vector e_ji of every utterance i from every candidate s_j. The cosine similarity contains learnable parameters that act as scaling factors, as explained in [4]. A softmax layer is applied on the similarity matrix. We used the generalized end-to-end softmax loss explained in [4] as the loss function for the model. For any (e_ji, c_k) pair, where e_ji is the d-vector corresponding to an utterance from candidate s_j, the final softmax layer in the model should output 1 if e_ji and c_k are derived from the same candidate (j = k), and 0 otherwise. The batch size was set to one and the gradient descent optimizer was used.
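
As a sketch of the similarity matrix and the softmax variant of the generalized end-to-end loss of [4], the function below assumes a batch tensor of already L2-normalized d-vectors for N speakers with M utterances each. For brevity it does not exclude an utterance from its own centroid, which the exact formulation in [4] does.

```python
import tensorflow as tf

def ge2e_softmax_loss(embeddings, w, b):
    """embeddings: [N, M, D] tensor of L2-normalized d-vectors
    (N speakers, M utterances each). w, b: learnable scalar scale and bias
    acting on the cosine similarity. Simplified sketch: an utterance is not
    excluded from its own centroid."""
    n = tf.shape(embeddings)[0]
    m = tf.shape(embeddings)[1]
    centroids = tf.math.l2_normalize(
        tf.reduce_mean(embeddings, axis=1), axis=1)        # [N, D]
    flat = tf.reshape(embeddings, [n * m, -1])             # [N*M, D]
    cosine = tf.matmul(flat, centroids, transpose_b=True)  # [N*M, N] similarity matrix
    sim = tf.abs(w) * cosine + b                           # learnable scaling
    labels = tf.repeat(tf.range(n), m)                     # true speaker for each row
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=sim))
```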

During test time, along with the enrollment utterances from the test set, evaluation utterances are also passed to the trained model. The similarity is calculated between the centroids derived from the enrollment utterances of the test set and the d-vectors corresponding to the evaluation utterances. Figure 2 shows the training methodology and Figure 3 shows the testing methodology of the verification model. Figure 2 is adapted from [4]. Throughout this paper, the term utterance refers to Mel-frequency cepstral coefficient (MFCC 13) features, EEG features, or the concatenation of MFCC 13 and EEG features, depending on how the speaker verification model was trained.
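
A minimal NumPy sketch of this test-time scoring is shown below: the enrollment d-vectors of a candidate are averaged into a centroid and an evaluation d-vector is scored against it with cosine similarity. The decision threshold here is purely illustrative; in practice the operating point is tied to the EER analysis in Section 6.

```python
import numpy as np

def verification_score(enroll_dvectors, eval_dvector, threshold=0.7):
    """enroll_dvectors: array [num_enroll, D] of L2-normalized d-vectors for
    one candidate; eval_dvector: array [D] for the claimed utterance.
    Returns the cosine similarity score and an accept/reject decision.
    The threshold value is illustrative only."""
    centroid = enroll_dvectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    score = float(np.dot(centroid, eval_dvector) / np.linalg.norm(eval_dvector))
    return score, score >= threshold
```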

Figure 1: EEG channel locations for the cap used in our experiments

3 Design of Experiments for building the database

We built two simultaneous speech-EEG recording databases for this work. For database A, five female and five male subjects took part in the experiment. For database B, five male and three female subjects took part in the experiment. For both databases, all but two subjects were native English speakers. All subjects were UT Austin undergraduate or graduate students in their early twenties.

For data set A, the 10 subjects were asked to speak the first 30 sentences from the USC-TIMIT database [21], and their simultaneous speech and EEG signals were recorded. This data was recorded in the presence of background noise of 40 dB (noise generated by the room air conditioner fan). We then asked each subject to repeat the same experiment two more times, so we had 30 speech-EEG recording examples for each sentence.

For data set B, the 8 subjects were asked to repeat the same experiment, but this time we used background music played from our lab computer to generate background noise of 65 dB. Here we had 24 speech-EEG recording examples for each sentence.

We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 1. We used EEGLAB [22] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

For data set A, we used data from the first 8 subjects for training the model and the remaining two subjects' data for the test set.

For data set B, we used data from the first 6 subjects for training the model and the remaining two subjects' data for the test set.

4 EEG and Speech feature extraction details

EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLAB's [22] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG), from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [9]. In total we extracted 31 (channels) x 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
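
A sketch of this pipeline using SciPy is shown below. The filter settings and the five statistical features follow the description above; the 10 ms hop used to obtain the 100 Hz feature rate and the window length are assumptions, and the EEGLAB ICA artifact removal step is not reproduced here.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch
from scipy.stats import kurtosis

FS = 1000  # EEG sampling rate (Hz)

def preprocess_channel(x):
    """Fourth-order IIR band-pass (0.1-70 Hz) and 60 Hz notch filtering of
    one EEG channel; ICA-based artifact removal (done in EEGLAB in the
    paper) is not reproduced here."""
    b, a = butter(4, [0.1, 70.0], btype="bandpass", fs=FS)
    x = filtfilt(b, a, x)
    b, a = iirnotch(60.0, Q=30.0, fs=FS)
    return filtfilt(b, a, x)

def channel_features(x, win=100, hop=10):
    """Five statistical features per analysis frame (root mean square,
    zero crossing rate, moving window average, kurtosis, power spectral
    entropy). A 10 ms hop gives the 100 Hz feature rate; the window and
    hop lengths are assumptions, not stated in the paper."""
    feats = []
    for start in range(0, len(x) - win + 1, hop):
        f = x[start:start + win]
        rms = np.sqrt(np.mean(f ** 2))
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2.0
        mwa = np.mean(f)
        kur = kurtosis(f)
        p = np.abs(np.fft.rfft(f)) ** 2
        p = p / (p.sum() + 1e-12)
        pse = -np.sum(p * np.log2(p + 1e-12))   # power spectral entropy
        feats.append([rms, zcr, mwa, kur, pse])
    return np.array(feats)   # [num_frames, 5]; 31 channels -> 155 features
```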

The recorded speech signal was sampled at 16 kHz. We extracted 13 Mel-frequency cepstral coefficients (MFCC 13) as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features, to avoid a sequence-to-sequence alignment problem.
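
A sketch of this step is shown below, using librosa as an assumed front end (the paper does not name the MFCC toolkit). The 10 ms hop reproduces the 100 Hz feature rate, and the 25 ms window length is an assumption.

```python
import librosa

def extract_mfcc(wav_path):
    """Extract 13 MFCC features at a 100 Hz frame rate (10 ms hop) so the
    acoustic features align with the EEG feature rate."""
    audio, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=400,       # 25 ms window (assumed)
                                hop_length=160)  # 10 ms hop -> 100 Hz
    return mfcc.T  # [num_frames, 13]
```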

Figure 2: Training method for Speaker verification

5 EEG Feature Dimension Reduction Algorithm Details

After extracting the EEG and acoustic features as explained in the previous section, we used non-linear methods for feature dimension reduction in order to obtain a more compact set of EEG features to combine with the acoustic features. We reduced the 155 EEG features to a dimension of 30 by applying kernel principal component analysis (KPCA) [23]. We plotted the cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 4. We used KPCA with a polynomial kernel of degree 3 [9], implemented with the Python scikit-learn library. The cumulative explained variance plot is not supported by the library for KPCA, since KPCA projects the features into a different feature space; hence, to obtain the explained variance plot we used ordinary PCA, and after identifying the right dimension we used KPCA to perform the dimension reduction.
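
A minimal scikit-learn sketch of this procedure is shown below; the random array stands in for the real 155-dimensional EEG feature matrix, ordinary PCA is used only to inspect the cumulative explained variance, and KPCA with a degree-3 polynomial kernel performs the actual reduction to 30 dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

# Placeholder for the real EEG feature matrix: [num_frames, 155].
eeg_feats = np.random.randn(1000, 155)

# Ordinary PCA is used only to plot cumulative explained variance and
# pick the target dimension (30 in the paper).
pca = PCA().fit(eeg_feats)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# The reduction itself uses kernel PCA with a degree-3 polynomial kernel.
kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
eeg_reduced = kpca.fit_transform(eeg_feats)   # [num_frames, 30]
```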

Figure 3: Testing method for Speaker verification
Figure 4: Explained variance plot

6 Results

As explained earlier, for data set A we used data from the first 8 subjects, or 720 sentence utterances, as the training set and the last two subjects, or 180 sentence utterances, as the test set. For data set B we used data from the first 6 subjects, or 540 sentence utterances, as the training set and the last two subjects, or 180 sentence utterances, as the test set. There are 90 utterances per subject.

Figure 5: Training Loss

The equal error rate (EER) defined in [3] was used as the evaluation metric for the model. Table 1 shows the results obtained on the noisier data set B when we used an LSTM as the RNN cell in the verification model; the results clearly indicate that combining EEG features with acoustic features helps reduce the EER. In the table, results are shown for number of sentences = {3, 10, 15, 20, 30}. Number of sentences = 3 implies that during training, for the first training step, the features (MFCC, EEG, or MFCC+EEG) corresponding to the first 3 sentence utterances are selected randomly from any two subjects in the training set; during the second training step, the features corresponding to the next 3 sentence utterances are selected randomly from any two subjects, and so on until the 90th sentence utterance is selected from the training set. So for this example, 30 training steps constitute one epoch. The model is trained for a sufficient number of epochs to make sure it sees all the utterances present in the training set. In every epoch, any two subjects are randomly chosen.

During test time we have data from two subjects. Number of sentences equal to 3 implies that for the first testing step, the features corresponding to the first 3 sentence utterances from subject 1 and subject 2 are selected as enrollment utterances, the features corresponding to the next 3 sentence utterances from subject 1 and subject 2 are selected as evaluation utterances, and the EER is calculated. For the second testing step, the features used as evaluation utterances in the first step become the new enrollment utterances, the features corresponding to the next 3 sentence utterances from subjects 1 and 2 become the new evaluation utterances, and a new EER is calculated. So for this example, there is a total of 30 testing steps. The EER value of 17% for number of sentences equal to 3, as seen in Table 1, corresponds to the average of all 30 testing-step EER values. Similarly, the model was trained and tested for number of sentences = {10, 15, 20, 30}. As 20 is not a factor of 90, for number of sentences equal to 20 there are five training steps in one epoch and five testing steps, but the last step contains only the features corresponding to the last 10 sentence utterances. Figure 5 shows the training loss convergence of the LSTM model for number of sentences equal to 3 on data set B when trained using MFCC features.
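
The sketch below shows one common way to compute the EER from the similarity scores gathered over the testing steps (the paper follows the EER definition of [3]); labels mark genuine (same-speaker) trials as 1 and impostor trials as 0.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER from similarity scores: the operating point where the false
    acceptance rate equals the false rejection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```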

In [9] we demonstrated that EEG sensors T7 and T8 contributed most to ASR test-time accuracy, so we tried training the model with EEG features from only the T7 and T8 sensors for data sets B and A; the obtained results are shown in Table 2 and Table 5, respectively. We used the GRU model for the T7, T8 training, as a GRU is easier to train than an LSTM on smaller amounts of data.

Table 3 shows the results for the less noisy data set A when we used an LSTM as the RNN cell in the verification model; the results indicate that for two test-time instances, MFCC alone performed better than MFCC+EEG. We tried improving the results for data set A by tuning the hyperparameters of the LSTM model. We added one more layer to the LSTM cell and reduced the number of hidden units to 64 (i.e., a 2-layer LSTM with 64 hidden units in each layer) and observed better results, as shown in Table 4. Now, except for one test-time instance, combining EEG with acoustic features always resulted in reduced EER values, as seen from the table.

Number of Sentences    MFCC (EER %)    MFCC+EEG (EER %)
3                      17              17
10                     26              23
15                     21              16
20                     31              17
30                     19              16

Table 1: EER on test set for Data set B using LSTM based model
Number of Sentences    EEG (EER %)
3                      9
10                     10
15                     12
20                     12
30                     21

Table 2: EER on test set for Data set B using GRU based model with EEG features from only T7 and T8 electrodes
Number of Sentences    MFCC (EER %)    MFCC+EEG (EER %)
3                      16              9
10                     7               14
15                     10              8
20                     6               10
30                     12              11

Table 3: EER on test set for Data set A using LSTM based model
Number of Sentences    MFCC (EER %)    MFCC+EEG (EER %)
3                      21              17
10                     17              14
15                     15              15
20                     15              10
30                     13              17

Table 4: EER on test set for Data set A using 2-layer LSTM (64 units) based model
Number of Sentences    EEG (EER %)
3                      14
10                     12
15                     13
20                     15
30                     17

Table 5: EER on test set for Data set A using GRU based model with EEG features from only T7 and T8 electrodes

7 Conclusions

In this paper we demonstrated the feasibility of using EEG signals to improve the robustness of speaker verification systems operating in very noisy environments. We showed that for the noisier data set (65 dB), the combination of MFCC and EEG features always resulted in a lower EER, and that for the less noisy data set (40 dB), adding EEG features reduced the EER in all but one test-time instance.

Our overall results indicate that EEG features are less affected by background noise and are helpful in improving the robustness of verification systems operating in the presence of high background noise.

We further plan to publish the speech-EEG database used in this work to help advance research in this area. For future work, we will build a much larger speech-EEG database and validate our results on it.

8 Acknowledgements

We would like to thank Kerry Loader of Dell, Austin, TX for donating the GPU used to train the models in this work.

References

  • [1] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 4, p. 101962, 2004.
  • [2] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech communication, vol. 52, no. 1, pp. 12–40, 2010.
  • [3] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).    IEEE, 2016, pp. 5115–5119.
  • [4] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).    IEEE, 2018, pp. 4879–4883.
  • [5] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).    IEEE, 2014, pp. 4052–4056.
  • [6] Y.-h. Chen, I. Lopez-Moreno, T. N. Sainath, M. Visontai, R. Alvarez, and C. Parada, “Locally-connected and convolutional neural networks for small footprint speaker recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [7] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT).    IEEE, 2016, pp. 171–178.
  • [8] S. O. Sadjadi, S. Ganapathy, and J. W. Pelecanos, “The IBM 2016 speaker recognition system,” arXiv preprint arXiv:1602.07291, 2016.
  • [9] G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on.    IEEE, 2019.
  • [10] S. Marcel and J. d. R. Millán, “Person authentication using brainwaves (EEG) and maximum a posteriori model adaptation,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 4, pp. 743–752, 2007.
  • [11] R. Paranjape, J. Mahovsky, L. Benedicenti, and Z. Koles, “The electroencephalogram as a biometric,” in Canadian Conference on Electrical and Computer Engineering 2001. Conference Proceedings (Cat. No. 01TH8555), vol. 2.    IEEE, 2001, pp. 1363–1366.
  • [12] R. Palaniappan and K. Ravi, “A new method to identify individuals using signals from the brain,” in Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint, vol. 3.    IEEE, 2003, pp. 1442–1445.
  • [13] M. Poulos, M. Rangoussi, V. Chrissikopoulos, and A. Evangelou, “Person identification based on parametric processing of the EEG,” in ICECS’99. Proceedings of ICECS’99. 6th IEEE International Conference on Electronics, Circuits and Systems (Cat. No. 99EX357), vol. 1.    IEEE, 1999, pp. 283–286.
  • [14] A. Riera, A. Soria-Frisch, M. Caparrini, C. Grau, and G. Ruffini, “Unobtrusive biometric system based on electroencephalogram analysis,” EURASIP Journal on Advances in Signal Processing, vol. 2008, p. 18, 2008.
  • [15] P. Campisi and D. La Rocca, “Brain waves for automatic biometric-based user recognition,” IEEE transactions on information forensics and security, vol. 9, no. 5, pp. 782–800, 2014.
  • [16] D. La Rocca, P. Campisi, B. Vegso, P. Cserti, G. Kozmann, F. Babiloni, and F. D. V. Fallani, “Human brain distinctiveness based on EEG spectral coherence connectivity,” IEEE transactions on Biomedical Engineering, vol. 61, no. 9, pp. 2406–2412, 2014.
  • [17] X. Yang, K. Wang, and S. A. Shamma, “Auditory representations of acoustic signals,” Tech. Rep., 1991.
  • [18] N. Mesgarani and S. Shamma, “Speech processing with a cortical representation of audio,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on.    IEEE, 2011, pp. 5872–5875.
  • [19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [20] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [21] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
  • [22] A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [23] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel PCA and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.