Robust End-to-End Speaker Verification Using EEG

Abstract

In this paper we demonstrate that the performance of a speaker verification system can be improved by concatenating electroencephalography (EEG) signal features with acoustic features extracted from the speech signal. We use a state-of-the-art end-to-end deep learning model for speaker verification, and we demonstrate our results on noisy speech.

Our results demonstrate that EEG features can improve the robustness of speaker verification systems.

Yan Han, Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik (equal author contribution)
Brain Machine Interface Lab, The University of Texas at Austin

Keywords: Electroencephalography (EEG), Speaker Verification, Deep Learning, Biometrics

1 Introduction

Speaker verification is the process of verifying whether an utterance belongs to a specific speaker, based on that speaker's known utterances. Speaker verification systems are used as authentication systems in many voice-activated technologies, for example in applications like voice match for Google Home. In our work we focus on text-independent speaker verification [1, 2]. Deep learning based speaker verification systems [3, 4, 5, 6, 7, 8] have become popular in recent years and have improved verification performance.

Even though deep learning models have improved the state-of-the-art performance of speaker verification systems, their performance degrades in the presence of background noise, and they are prone to attacks in the form of voice mimicking by adversaries interested in breaching the authentication system. We propose to use electroencephalography (EEG) signals to address these challenges. EEG is a non-invasive way of measuring the electrical activity of the human brain. In [9] the authors demonstrated that EEG features can help automatic speech recognition (ASR) systems overcome the performance loss caused by background noise, and references [10, 11, 12] demonstrate the feasibility of using EEG features for continuous speech recognition. Further, prior work in references [13, 14, 15, 16, 17, 18] shows that the EEG pattern of every individual is unique and can be used for biometric identification. Motivated by these prior results, we concatenated acoustic features with EEG features to improve the performance of speaker verification systems operating under very noisy conditions. Speaker verification using EEG will also help people with speaking disabilities, such as broken speech, to use voice authentication technologies with better accuracy, thereby improving technology accessibility.

For this work we used the end-to-end speaker verification model explained in [4], as it is the current state-of-the-art model for speaker verification. The major contribution of this paper is the demonstration of improved robustness of speaker verification systems using EEG features.

2 Speaker Verification Model

We used the generalized end-to-end model introduced in [4] for performing text-independent speaker verification. We used Google's TensorFlow deep learning library to implement the model.

The model consists of a single layer of long short-term memory (LSTM) [19] or gated recurrent unit (GRU) [20] cells with 128 hidden units, followed by a dense layer, which is a fully connected linear layer, and a softmax activation. During the training phase, an utterance of an enrollment candidate is passed to the LSTM or GRU cell, and L2 normalization is applied to the dense layer output to derive the embedding, or d-vector, for that utterance. D-vectors are derived in the same way for all utterances of the enrollment candidate. Since different utterances can have different lengths, we used the dynamic recurrent neural network (RNN) cell of TensorFlow. The speaker model, or centroid, of an enrollment candidate is defined as the average of the d-vectors obtained from all of that candidate's enrollment utterances.
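A minimal sketch of such a d-vector network (not the authors' released code) is shown below in TensorFlow/Keras: a single LSTM layer with 128 hidden units, a fully connected linear layer, and L2 normalization of its output. The Keras API, the masking of padded frames, the embedding dimension, and the input feature dimension of 43 (MFCC 13 plus 30 dimension-reduced EEG features) are assumptions; in this sketch the softmax is treated as part of the similarity-matrix stage described next.

```python
# Hypothetical sketch of the d-vector network; the layer sizes follow the text,
# everything else (Keras API, masking, feature/embedding dimensions) is assumed.
import tensorflow as tf

def build_dvector_model(feature_dim=43, hidden_units=128, embedding_dim=128):
    inputs = tf.keras.Input(shape=(None, feature_dim))    # variable-length utterance
    x = tf.keras.layers.Masking()(inputs)                 # skip zero-padded frames
    x = tf.keras.layers.LSTM(hidden_units)(x)             # single LSTM (or GRU) layer
    x = tf.keras.layers.Dense(embedding_dim)(x)           # fully connected linear layer
    dvector = tf.keras.layers.Lambda(
        lambda t: tf.math.l2_normalize(t, axis=1))(x)     # L2-normalized embedding
    return tf.keras.Model(inputs, dvector)
```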

Consider a set of N enrollment candidates and assume each candidate has M utterances. As per our earlier definition, we can build a set of N centroids, one per candidate. A two-dimensional similarity matrix is then constructed that contains the cosine similarity between every centroid and the d-vectors corresponding to the utterances of all the candidates. The cosine similarity contains learnable parameters that act as scaling factors, as explained in [4]. A softmax layer is applied to the similarity matrix. We used the generalized end-to-end softmax loss explained in [4] as the loss function for the model. For any (d-vector, centroid) pair, the final softmax layer in the model outputs 1 if both are derived from the same candidate, and 0 otherwise. The batch size was set to one and the gradient descent optimizer was used.
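The sketch below illustrates our reading of the scaled cosine-similarity matrix and the generalized end-to-end softmax loss from [4]. For brevity it does not exclude each utterance from its own centroid as [4] does, and the initial values of the learnable scaling parameters shown in the usage comment are assumptions.

```python
import tensorflow as tf

def ge2e_softmax_loss(embeddings, w, b):
    """embeddings: [N, M, D] tensor of L2-normalized d-vectors
    (N speakers, M utterances each); w, b are learnable scalars."""
    n, m, d = embeddings.shape
    centroids = tf.math.l2_normalize(
        tf.reduce_mean(embeddings, axis=1), axis=1)          # [N, D] speaker models
    flat = tf.reshape(embeddings, [n * m, d])                # [N*M, D]
    cosine = tf.matmul(flat, centroids, transpose_b=True)    # cosine similarity matrix
    similarity = w * cosine + b                              # learnable scaling, as in [4]
    labels = tf.repeat(tf.range(n), m)                       # true speaker of each row
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=similarity))

# usage sketch (initial values are assumptions):
# w, b = tf.Variable(10.0), tf.Variable(-5.0)
# loss = ge2e_softmax_loss(batch_embeddings, w, b)
```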

During test time, along with the enrollment utterances from the test set, evaluation utterances are also passed to the trained model. The similarity is calculated between the centroids derived from the enrollment utterances of the test set and the d-vectors corresponding to the evaluation utterances. Figure 2 shows the training methodology and Figure 3 shows the testing methodology of the verification model; Figure 2 is adapted from [4]. Throughout this paper, by the term utterance we refer to Mel-frequency cepstral coefficients (MFCC 13), EEG features, or the concatenation of MFCC 13 and EEG features, depending on how the speaker verification model was trained.
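A hypothetical sketch of this test-time scoring step follows: enrollment d-vectors from the test set are averaged into a centroid, and each evaluation d-vector is scored against the centroid by cosine similarity. The acceptance threshold shown is purely illustrative and is not taken from the paper.

```python
import numpy as np

def enroll(dvectors):
    """dvectors: [num_enrollment_utterances, D] L2-normalized embeddings of one speaker."""
    centroid = dvectors.mean(axis=0)
    return centroid / np.linalg.norm(centroid)

def verify(eval_dvector, centroid, threshold=0.7):   # threshold is illustrative only
    score = float(np.dot(eval_dvector, centroid))    # cosine similarity of unit vectors
    return score, score >= threshold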

Figure 1: EEG channel locations for the cap used in our experiments

3 Design of Experiments for Building the Database

We built two simultaneous speech and EEG recording databases for this work. For database A, five female and five male subjects took part in the experiment. For database B, five male and three female subjects took part in the experiment. For both databases, all subjects except two were native English speakers. All subjects were UT Austin undergraduate or graduate students in their early twenties.

For data set A, the 10 subjects were asked to speak the first 30 sentences from the USC-TIMIT database [21] while their simultaneous speech and EEG signals were recorded. This data was recorded in the presence of 40 dB background noise (generated by the room air conditioner fan). We then asked each subject to repeat the same experiment two more times, so we had 30 speech-EEG recording examples for each sentence.

For data set B, the 8 subjects were asked to repeat the same experiment, but this time we played background music from our lab computer to generate a background noise level of 65 dB. Here we had 24 speech-EEG recording examples for each sentence.

We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 1. We used EEGLab [22] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

For data set A, we used data from the first 8 subjects for training the model and the remaining two subjects' data for the test set.

For data set B, we used data from the first 6 subjects for training the model and the remaining two subjects' data for the test set.

4 EEG and Speech Feature Extraction Details

EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLab's [22] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts, such as electrocardiography (ECG), electromyography (EMG), and electrooculography (EOG), from the EEG signals. We extracted five statistical features per EEG channel, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [9, 11, 12]. In total we extracted 31 (channels) × 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
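A sketch of this preprocessing and feature extraction pipeline using SciPy is given below. The Butterworth band-pass design, the notch Q factor, and the 100 ms analysis window are assumptions (the text fixes only the filter order, the cut-off frequencies, and the 100 Hz feature rate), and the EEGLab ICA artifact removal step is not reproduced here.

```python
import numpy as np
from scipy import signal, stats

FS = 1000          # EEG sampling rate in Hz
HOP = FS // 100    # 10-sample hop -> 100 Hz feature rate

def preprocess(eeg):
    """eeg: 1-D array holding one EEG channel."""
    sos = signal.butter(4, [0.1, 70], btype="bandpass", fs=FS, output="sos")
    eeg = signal.sosfiltfilt(sos, eeg)                 # 4th-order IIR band-pass
    b, a = signal.iirnotch(60.0, Q=30.0, fs=FS)        # Q factor is an assumption
    return signal.filtfilt(b, a, eeg)                  # remove 60 Hz power-line noise

def frame_features(frame):
    rms = np.sqrt(np.mean(frame ** 2))                  # root mean square
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # zero crossing rate
    mwa = np.mean(frame)                                # moving window average
    kurt = stats.kurtosis(frame)                        # kurtosis
    _, psd = signal.periodogram(frame, fs=FS)
    p = psd / (psd.sum() + 1e-12)
    pse = -np.sum(p * np.log2(p + 1e-12))               # power spectral entropy
    return [rms, zcr, mwa, kurt, pse]

def channel_features(eeg, win=100):                    # 100 ms window (assumption)
    eeg = preprocess(eeg)
    return np.array([frame_features(eeg[i:i + win])
                     for i in range(0, len(eeg) - win + 1, HOP)])
```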

The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC 13) as features for the speech signal. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features, to avoid a sequence-to-sequence alignment problem.
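A hedged sketch of MFCC extraction at a 100 Hz frame rate follows, using librosa (the paper does not name the toolkit). A 160-sample hop at 16 kHz yields one feature vector every 10 ms, matching the EEG feature rate; the 25 ms analysis window is an assumption.

```python
import librosa

def extract_mfcc(wav_path):
    speech, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160)   # 25 ms window, 10 ms hop
    return mfcc.T   # [num_frames, 13], aligned frame-by-frame with the EEG features
```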

Figure 2: Training method for Speaker verification

5 EEG Feature Dimension Reduction Algorithm Details

After extracting EEG and acoustic features as explained in the previous section, we used non-linear methods for feature dimension reduction in order to obtain a set of EEG features that better complements the acoustic features. We reduced the 155 EEG features to a dimension of 30 by applying kernel principal component analysis (KPCA) [23]. We plotted cumulative explained variance versus number of components to identify the right feature dimension, as shown in Figure 4. We used KPCA with a polynomial kernel of degree 3 [9], implemented with the Python scikit-learn library. The cumulative explained variance plot is not supported by the library for KPCA, as KPCA projects features into a different feature space; hence, to obtain the explained variance plot we used ordinary PCA, and after identifying the right dimension we used KPCA to perform the dimension reduction.
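The sketch below illustrates this two-step procedure with scikit-learn: ordinary PCA is used only to obtain the cumulative explained variance curve from which the target dimension is read off, after which KPCA with a degree-3 polynomial kernel performs the actual reduction.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

def reduce_eeg_features(eeg_feats, n_components=30):
    """eeg_feats: [num_frames, 155] raw EEG feature matrix."""
    pca = PCA().fit(eeg_feats)
    cum_var = np.cumsum(pca.explained_variance_ratio_)   # plotted as in Figure 4
    kpca = KernelPCA(n_components=n_components, kernel="poly", degree=3)
    return kpca.fit_transform(eeg_feats), cum_var
```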

Figure 3: Testing method for Speaker verification
Figure 4: Explained variance plot

6 Results

As explained earlier, for data set A we used data from the first 8 subjects, or 720 sentence utterances, as the training set, and the last two subjects, or 180 sentence utterances, as the testing set. For data set B we used data from the first 6 subjects, or 540 sentence utterances, as the training set, and the last two subjects, or 180 sentence utterances, as the testing set. There are 90 utterances per subject.

Figure 5: Training Loss

Equal error rate (EER), defined in [3], was used as the evaluation metric for the model. Table 1 shows the results obtained on the noisier data set B when we used an LSTM as the RNN cell in the verification model; the results clearly indicate that combining EEG features with acoustic features helps reduce the EER. In the table, results are shown for number of sentences = {3, 5, 7, 10, 15, 20, 30}. Number of sentences = 3 implies that during training, for the first training step, features (MFCC, EEG, or MFCC+EEG) corresponding to the first 3 sentence utterances are selected from two randomly chosen subjects in the training set; during the second training step, features corresponding to the next 3 sentence utterances are selected from two randomly chosen subjects, and so on until the 90th sentence utterance is selected from the training set. So for this example, 30 training steps constitute one epoch. The model is trained for a sufficient number of epochs to make sure it sees all the utterances present in the training set. In every epoch, any two subjects are randomly chosen. A sketch of this batch construction is given below.
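The following is a hypothetical sketch of how one epoch of training batches could be formed under our reading of this scheme; the data structure (a per-subject dictionary of per-sentence feature matrices) is an assumption for illustration only.

```python
import random

def epoch_batches(features, num_sentences=3, sentences_per_subject=90):
    """features[subject][i] -> feature matrix (MFCC, EEG, or MFCC+EEG) of sentence i."""
    subjects = list(features.keys())
    for start in range(0, sentences_per_subject, num_sentences):
        spk_a, spk_b = random.sample(subjects, 2)      # two randomly chosen subjects
        end = min(start + num_sentences, sentences_per_subject)
        yield {spk: [features[spk][i] for i in range(start, end)]
               for spk in (spk_a, spk_b)}
```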

Number of Sentences | MFCC (EER %) | MFCC+EEG (EER %)
3                   | 6            | 2
5                   | 8            | 4
7                   | 4            | 3
10                  | 7            | 2
15                  | 9            | 4
20                  | 12           | 6
30                  | 13           | 8

Table 1: EER on the test set for data set B using the LSTM based model

During test time we have data from two subjects. Number of sentences equal to 3 implies that for the first testing step, features corresponding to the first 3 sentence utterances from subject 1 and subject 2 are selected as enrollment utterances, features corresponding to the next 3 sentence utterances from subject 1 and subject 2 are selected as evaluation utterances, and the EER is calculated. For the second testing step, the features used as evaluation utterances in the first step become the new enrollment utterances, features corresponding to the next 3 sentence utterances from subjects 1 and 2 become the new evaluation utterances, and a new EER is calculated. So for this example there will be a total of 30 testing steps. The EER value of 6% or 2% for number of sentences equal to 3, as seen in Table 1, corresponds to the average of all the 30 testing step EER values. Similarly, the model was trained and tested for number of sentences = {5, 7, 10, 15, 20, 30}. As 20 is not a factor of 90, for number of sentences equal to 20 there are five training steps in one epoch and five testing steps, but the last step contains only the features corresponding to the last 10 sentence utterances. Figure 5 shows the training loss convergence of the LSTM model for number of sentences equal to 3 on data set B when trained using MFCC features.
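One common way to approximate the EER from genuine and impostor cosine scores is to sweep a threshold until the false acceptance and false rejection rates cross; the sketch below illustrates this and the averaging over testing steps reported in the tables. The metric itself is defined in [3]; the helper names and the per_step_scores container in the usage comment are hypothetical.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Inputs are NumPy arrays of cosine similarities between evaluation
    d-vectors and centroids (genuine = same speaker, impostor = different)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    eer, gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # false acceptance rate
        frr = np.mean(genuine_scores < t)     # false rejection rate
        if abs(far - frr) < gap:
            gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# reported value = average over all testing steps (per_step_scores is hypothetical):
# mean_eer = np.mean([equal_error_rate(g, i) for g, i in per_step_scores])
```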

Table 2 shows the test time results obtained using the GRU model for data set B, Table 3 shows the test time results obtained using the LSTM model for data set A, and Table 4 shows the test time results obtained using the GRU model for data set A.

Number of Sentences | MFCC (EER %) | MFCC+EEG (EER %)
3                   | 5            | 1
5                   | 6            | 3
7                   | 5            | 1
10                  | 8            | 2
15                  | 11           | 3
20                  | 9            | 4
30                  | 10           | 5

Table 2: EER on the test set for data set B using the GRU based model

Number of Sentences | MFCC (EER %) | MFCC+EEG (EER %)
3                   | 4            | 2
5                   | 5            | 4
7                   | 3            | 3
10                  | 6            | 2
15                  | 7            | 4
20                  | 8            | 6
30                  | 9            | 8

Table 3: EER on the test set for data set A using the LSTM based model

Number of Sentences | MFCC (EER %) | MFCC+EEG (EER %)
3                   | 3            | 1
5                   | 3            | 2
7                   | 3            | 1
10                  | 5            | 2
15                  | 6            | 3
20                  | 5            | 5
30                  | 9            | 6

Table 4: EER on the test set for data set A using the GRU based model

7 Conclusion

In this paper we demonstrated the feasibility of using EEG signals to improve the robustness of speaker verification systems operating in very noisy environments. We showed that for both the noisier data set (65 dB) and the less noisy data set (40 dB), the combination of MFCC and EEG features always resulted in a lower EER than when the verification systems were trained and tested using only MFCC or acoustic features.

Our overall results indicate that EEG features are less affected by background noise and that they help improve the robustness of verification systems operating in the presence of high background noise.

We further plan to publish the speech-EEG database used in this work to help advance research in this area. For future work, we will build a much larger speech-EEG database and validate the results on it.

8 Acknowledgements

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.

References

  • [1] Frédéric Bimbot, Jean-François Bonastre, Corinne Fredouille, Guillaume Gravier, Ivan Magrin-Chagnolleau, Sylvain Meignier, Teva Merlin, Javier Ortega-García, Dijana Petrovska-Delacrétaz, and Douglas A Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 4, pp. 101962, 2004.
  • [2] Tomi Kinnunen and Haizhou Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech communication, vol. 52, no. 1, pp. 12–40, 2010.
  • [3] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5115–5119.
  • [4] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, “Generalized end-to-end loss for speaker verification,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
  • [5] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.
  • [6] Yu-hsin Chen, Ignacio Lopez-Moreno, Tara N Sainath, Mirkó Visontai, Raziel Alvarez, and Carolina Parada, “Locally-connected and convolutional neural networks for small footprint speaker recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [7] Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li, and Yifan Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 171–178.
  • [8] Seyed Omid Sadjadi, Sriram Ganapathy, and Jason W Pelecanos, “The ibm 2016 speaker recognition system,” arXiv preprint arXiv:1602.07291, 2016.
  • [9] Gautam Krishna, Co Tran, Jianguo Yu, and Ahmed Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
  • [10] Gautam Krishna, Yan Han, Co Tran, Mason Carnahan, and Ahmed H Tewfik, “State-of-the-art speech recognition using eeg and towards decoding of speech spectrum from eeg,” arXiv preprint arXiv:1908.05743, 2019.
  • [11] Gautam Krishna, Co Tran, Mason Carnahan, and Ahmed Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
  • [12] Gautam Krishna, Co Tran, Yan Han, Mason Carnahan, and Ahmed H Tewfik, “Speech recognition with no speech or with noisy speech beyond english,” arXiv preprint arXiv:1906.08045, 2019.
  • [13] Sebastien Marcel and José del R Millán, “Person authentication using brainwaves (eeg) and maximum a posteriori model adaptation,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 4, pp. 743–752, 2007.
  • [14] RB Paranjape, J Mahovsky, L Benedicenti, and Z Koles, “The electroencephalogram as a biometric,” in Canadian Conference on Electrical and Computer Engineering 2001. Conference Proceedings (Cat. No. 01TH8555). IEEE, 2001, vol. 2, pp. 1363–1366.
  • [15] M Poulos, M Rangoussi, V Chrissikopoulos, and A Evangelou, “Person identification based on parametric processing of the eeg,” in ICECS’99. Proceedings of ICECS’99. 6th IEEE International Conference on Electronics, Circuits and Systems (Cat. No. 99EX357). IEEE, 1999, vol. 1, pp. 283–286.
  • [16] Alejandro Riera, Aureli Soria-Frisch, Marco Caparrini, Carles Grau, and Giulio Ruffini, “Unobtrusive biometric system based on electroencephalogram analysis,” EURASIP Journal on Advances in Signal Processing, vol. 2008, pp. 18, 2008.
  • [17] Patrizio Campisi and Daria La Rocca, “Brain waves for automatic biometric-based user recognition,” IEEE transactions on information forensics and security, vol. 9, no. 5, pp. 782–800, 2014.
  • [18] Daria La Rocca, Patrizio Campisi, Balazs Vegso, Peter Cserti, György Kozmann, Fabio Babiloni, and F De Vico Fallani, “Human brain distinctiveness based on eeg spectral coherence connectivity,” IEEE transactions on Biomedical Engineering, vol. 61, no. 9, pp. 2406–2412, 2014.
  • [19] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [20] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [21] Shrikanth Narayanan, Asterios Toutios, Vikram Ramanarayanan, Adam Lammert, Jangwon Kim, Sungbok Lee, Krishna Nayak, Yoon-Chul Kim, Yinghua Zhu, Louis Goldstein, et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
  • [22] Arnaud Delorme and Scott Makeig, “Eeglab: an open source toolbox for analysis of single-trial eeg dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [23] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.