Spoken Speech Enhancement using EEG

Abstract

In this paper we demonstrate spoken speech enhancement using electroencephalography (EEG) signals with a generative adversarial network (GAN) based model and a long short-term memory (LSTM) regression based model. Our results demonstrate that EEG features can be used to clean speech recorded in the presence of background noise. We further observed that the GAN based model achieved better EEG based speech enhancement results than the LSTM regression based model. To the best of our knowledge, this is the first time spoken speech enhancement has been demonstrated using EEG features recorded in parallel with the spoken speech.

Gautam Krishna, Yan Han, Co Tran, Mason Carnahan, Ahmed H Tewfik
(Equal author contribution)
Brain Machine Interface Lab, The University of Texas at Austin

Keywords: electroencephalography (EEG), speech enhancement, deep learning

1 Introduction

Speech enhancement is the process of improving the quality of speech that has been degraded by additive noise. Speech enhancement is a critical preprocessing method used to improve the performance of automatic speech recognition (ASR) systems operating in the presence of background noise. Noisy speech is first fed into a speech enhancement system to produce enhanced speech, which is then fed into the ASR model. Speech enhancement systems also play a critical role in improving the quality of speech used in devices like hearing aids and cochlear implants.

In references [1, 2] the authors demonstrated speech enhancement using classical methods. More recently, researchers have applied deep learning methods to speech enhancement [3, 4, 5]. In references [6, 7] the authors demonstrated speech enhancement using generative adversarial networks (GANs) [8].

Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain. In [9] the authors demonstrated that EEG features can be used to overcome the performance loss of ASR systems in the presence of background noise. Though references [10, 11, 9, 12] demonstrated isolated and continuous speech recognition using EEG signals under various experimental conditions, they did not specifically study the speech enhancement problem. In this paper we demonstrate that EEG features can be used to improve the quality of speech recorded in the presence of background noise. We make use of GAN and long short-term memory (LSTM) [13] networks to demonstrate speech enhancement using EEG features. In [14] the authors demonstrated EEG based attention-driven speech enhancement using Wiener filters, where EEG was used to detect auditory attention, whereas in this paper we demonstrate speech enhancement for spoken speech using EEG features, and no auditory attention detection module is required to perform speech enhancement. Our idea is mainly inspired by the results demonstrated in [9], where the authors showed that EEG features are less affected by external background noise. To the best of our knowledge, this is the first time spoken speech enhancement has been demonstrated using EEG features recorded in parallel with the spoken speech.

2 Design of Experiments for building Training and Test Set

For the training set data, five female and five male subjects took part in the experiment. For the test set data, five male and three female subjects took part in the experiment. Except for two subjects, all were native English speakers for both databases. All subjects were UT Austin undergraduate or graduate students in their early twenties.

For the training set, the 10 subjects were asked to speak the first 30 sentences from the USC-TIMIT database [15], and their simultaneous speech and EEG signals were recorded. This data was recorded in the absence of externally created background noise, although a background noise of 40 dB due to the sound of the lab ventilation fan was present. We then asked each subject to repeat the same experiment two more times, so we had 30 speech-EEG recording examples for each sentence. For the sake of simplicity, we neglect this 40 dB noise and consider the training data set as clean.

For the test set, the 8 subjects were asked to repeat the same experiment, but this time background music played from our lab computer was used to generate an external background noise of 65 dB. Here we had 24 speech-EEG recording examples for each sentence. The training and test set experiments had two subjects in common.

We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes, including one electrode used as ground, as shown in Figure 1. We used EEGLab [16] to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.

Figure 1: EEG channel locations for the cap used in our experiments

3 EEG and Speech feature extraction details

We followed the same methodology used by the authors in references [10, 9, 12] for EEG and speech preprocessing. EEG signals were sampled at 1000 Hz, and a fourth order IIR band pass filter with cut off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut off frequency of 60 Hz was used to remove power line noise. EEGLab's [16] Independent Component Analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG) from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [9, 10, 12]. In total we extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
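As a rough illustration of this preprocessing and feature extraction pipeline, the sketch below computes the five statistical features for a single EEG channel at a 100 Hz frame rate. The filter settings follow the text; the frame length (10 samples, i.e. non-overlapping frames), the notch Q factor, and the spectral entropy implementation are our assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt
from scipy.stats import kurtosis

def bandpass(sig, fs=1000.0, lo=0.1, hi=70.0):
    # N=2 yields a fourth order band-pass filter in scipy's convention
    b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, sig)

def notch(sig, fs=1000.0, f0=60.0, q=30.0):
    # 60 Hz notch filter for power line noise (Q factor is our assumption)
    b, a = iirnotch(f0 / (fs / 2), q)
    return filtfilt(b, a, sig)

def spectral_entropy(w):
    psd = np.abs(np.fft.rfft(w)) ** 2
    psd = psd / (psd.sum() + 1e-12)
    return float(-np.sum(psd * np.log2(psd + 1e-12)))

def eeg_channel_features(channel, fs=1000, feat_rate=100):
    """Five statistical features per frame, emitted at feat_rate Hz."""
    hop = fs // feat_rate                    # 10 samples per frame
    frames = []
    for start in range(0, len(channel) - hop + 1, hop):
        w = channel[start:start + hop]
        frames.append([
            np.sqrt(np.mean(w ** 2)),                    # root mean square
            np.mean(np.abs(np.diff(np.sign(w)))) / 2.0,  # zero crossing rate
            np.mean(w),                                  # moving window average
            kurtosis(w),                                 # kurtosis
            spectral_entropy(w),                         # power spectral entropy
        ])
    # (n_frames, 5); stacking 31 channels gives the 155-dimensional feature vector
    return np.asarray(frames)
```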

The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) as features for the speech signal. We extracted 13 MFCC features, and the MFCC features were sampled at 100 Hz, the same as the sampling frequency of the EEG features, to avoid a sequence-to-sequence alignment problem.
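A minimal sketch of this MFCC extraction, assuming librosa is used (the paper does not name the toolkit): a hop length of 160 samples at 16 kHz yields MFCC frames at 100 Hz, matching the EEG feature rate.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13, frame_rate=100):
    y, _ = librosa.load(wav_path, sr=sr)
    hop = sr // frame_rate  # 160 samples -> 100 MFCC frames per second
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T           # (frames, 13), time-aligned with the 100 Hz EEG features
```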

Figure 2: Explained variance plot

4 EEG Feature Dimension Reduction Algorithm Details

After extracting EEG and acoustic features as explained in the previous section, we used non-linear methods for feature dimension reduction in order to obtain a set of EEG features that better represents the acoustic features. We reduced the 155 EEG features to a dimension of 30 by applying Kernel Principal Component Analysis (KPCA) [17]. We plotted cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 2. We used KPCA with a polynomial kernel of degree 3 [9, 10, 12]. We used the Python scikit-learn library to perform KPCA. The library does not provide an explained variance plot for KPCA, since KPCA projects the features into a different feature space; hence, to obtain the explained variance plot we used ordinary PCA, and after identifying the right dimension we used KPCA to perform the dimension reduction.
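The dimension reduction step could look like the sketch below with scikit-learn, where ordinary PCA is only used to inspect the cumulative explained variance (Figure 2) and KernelPCA with a degree-3 polynomial kernel performs the actual 155-to-30 reduction; the 95% variance threshold printed here is our assumption, used only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

def reduce_eeg_dim(X, n_components=30):
    # X: (n_frames, 155) EEG feature matrix pooled over the training data
    pca = PCA().fit(X)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    print("components needed for 95% variance:", int(np.argmax(cum_var >= 0.95)) + 1)

    kpca = KernelPCA(n_components=n_components, kernel="poly", degree=3)
    X_reduced = kpca.fit_transform(X)   # (n_frames, 30)
    return X_reduced, kpca
```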

5 Speech Enhancement models

We used two different types of models for performing speech enhancement using EEG features. We first performed experiments using a simple long short-term memory (LSTM) [13] regression model, and then we performed speech enhancement experiments using a generative adversarial network (GAN) [8] model. In the subsections below we explain the architecture of our models and the experimental setup details. Our GAN model architecture is different from the ones used by the authors in references [6, 7]. We added Gaussian noise with zero mean and standard deviation 10 to the recorded MFCC features from the training set to generate noisy MFCC features. These noisy MFCC features are used during training of the models, as explained in the subsections below. The Gaussian noise was not added to the EEG features from the training set, as our hypothesis was that the effect of background noise on EEG features is negligible [9]. The Gaussian noise was also not added to the test set data, as it was already collected in the presence of externally created background noise.
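Generating the noisy training MFCC features amounts to adding zero-mean Gaussian noise with standard deviation 10, for example as in the sketch below (the fixed random seed is our addition for reproducibility):

```python
import numpy as np

def make_noisy_mfcc(clean_mfcc, std=10.0, seed=0):
    # Zero-mean Gaussian noise with standard deviation 10, added only to the MFCCs
    rng = np.random.default_rng(seed)
    return clean_mfcc + rng.normal(0.0, std, size=clean_mfcc.shape)
```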

5.1 LSTM Regression Model

Our LSTM regression model consists of two LSTM layers with 128 hidden units each, followed by a time distributed dense layer with 13 hidden units. The LSTM regression model architecture is shown in Figure 3. The model was trained for 1000 epochs to observe loss convergence, and the Adam optimizer [18] was used. The batch size was set to 100. Mean squared error (MSE) was used as the loss function.

During training, we concatenate the generated noisy MFCC features (after adding Gaussian noise) and the recorded EEG features from the training set, feed them as a single vector input to the LSTM regression model, and set the corresponding clean MFCC features of dimension 13 from the training set as targets.
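A sketch of this setup in Keras is shown below. The input dimension of 43 assumes the 13 noisy MFCC features are concatenated with the 30 dimension-reduced EEG features per time step; the exact framework and input layout used are not stated in the paper.

```python
from tensorflow.keras import layers, models

def build_lstm_regressor(input_dim=43):   # 13 noisy MFCC + 30 EEG features (assumed)
    model = models.Sequential([
        layers.Input(shape=(None, input_dim)),     # variable-length feature sequences
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.TimeDistributed(layers.Dense(13)),  # clean MFCC targets per time step
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# noisy_inputs: (examples, time, 43), clean_mfcc: (examples, time, 13)
# model = build_lstm_regressor()
# model.fit(noisy_inputs, clean_mfcc, epochs=1000, batch_size=100)
```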

During testing, we concatenate the MFCC and EEG features from the test set and feed them as a single vector input to the trained LSTM regression model, which outputs the corresponding enhanced MFCC features. The Griffin-Lim reconstruction algorithm [19] is used to convert the enhanced MFCC features to speech.
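One way to perform this reconstruction is librosa's MFCC inversion, which internally estimates a mel spectrogram from the MFCCs and runs Griffin-Lim on the recovered magnitude spectrogram; this is a sketch of the idea, not necessarily the exact implementation used in this work.

```python
import librosa

def mfcc_to_waveform(enhanced_mfcc, sr=16000, hop_length=160):
    # enhanced_mfcc: (frames, 13); librosa expects (n_mfcc, frames)
    return librosa.feature.inverse.mfcc_to_audio(
        enhanced_mfcc.T, sr=sr, hop_length=hop_length)
```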

Figure 3: LSTM regression model

5.2 GAN Model

A generative adversarial network (GAN) consists of two networks, namely the generator model and the discriminator model, which are trained simultaneously. The generator model learns to generate data from a latent space, and the discriminator model evaluates whether the data generated by the generator is fake or comes from the true data distribution. The training objective of the generator is to fool the discriminator.

Our main motivation for using a GAN model was that, in the case of a GAN, the loss function is learned during training of the model, instead of using a fixed loss function such as the MSE used in the LSTM regression model.

Our generator model consists of two parallel LSTM layers with 128 hidden units each. The outputs of the two parallel LSTMs are concatenated and fed into another LSTM with 128 hidden units, followed by a time distributed dense layer with 13 hidden units. The architecture of the discriminator model is similar to that of the generator model, but instead of the time distributed dense layer, a dense layer with a single hidden unit and sigmoid activation is used. The last time step output of the preceding LSTM layer is fed into this dense layer.
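A Keras sketch of the two architectures as described above is given below; the framework choice and the 30-dimensional EEG input are our assumptions.

```python
from tensorflow.keras import layers, Model

def build_generator(mfcc_dim=13, eeg_dim=30):
    mfcc_in = layers.Input(shape=(None, mfcc_dim))           # noisy MFCC sequence
    eeg_in = layers.Input(shape=(None, eeg_dim))             # clean EEG feature sequence
    h1 = layers.LSTM(128, return_sequences=True)(mfcc_in)    # parallel LSTM 1
    h2 = layers.LSTM(128, return_sequences=True)(eeg_in)     # parallel LSTM 2
    h = layers.Concatenate()([h1, h2])
    h = layers.LSTM(128, return_sequences=True)(h)
    fake_mfcc = layers.TimeDistributed(layers.Dense(mfcc_dim))(h)
    return Model([mfcc_in, eeg_in], fake_mfcc)

def build_discriminator(mfcc_dim=13, eeg_dim=30):
    mfcc_in = layers.Input(shape=(None, mfcc_dim))
    eeg_in = layers.Input(shape=(None, eeg_dim))
    h1 = layers.LSTM(128, return_sequences=True)(mfcc_in)
    h2 = layers.LSTM(128, return_sequences=True)(eeg_in)
    h = layers.Concatenate()([h1, h2])
    h = layers.LSTM(128)(h)                            # last time step output only
    score = layers.Dense(1, activation="sigmoid")(h)   # real/fake probability
    return Model([mfcc_in, eeg_in], score)
```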

During training, the generator always takes noisy MFCC (obtained after adding Gaussian noise to clean MFCC from the training set) and clean EEG (from the training set) as an input pair and outputs fake MFCC. The generator model architecture is shown in Figure 4. The discriminator can take three possible pairs of inputs during training. Let $D_f$ be the sigmoid output of the discriminator for the (fake MFCC, clean EEG) input pair, $D_r$ be the sigmoid output of the discriminator for the (clean MFCC, clean EEG) input pair, and $D_n$ be the sigmoid output of the discriminator for the (noisy MFCC, clean EEG) input pair. We can then define the loss function of the generator as $\log(1-D_f)$ and the loss function of the discriminator as $-\big(\log D_r + \log(1-D_f) + \log(1-D_n)\big)$ for speech enhancement. The model was trained for 200 epochs using the Adam optimizer. The discriminator model architecture is shown in Figure 5. Input 1 and Input 2 in the figure refer to the three possible input pairs for the discriminator during training. Figures 6 and 7 show the training loss for the generator and discriminator models.
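A possible training step implementing these objectives is sketched below in TensorFlow. The loss expressions mirror the formulas above and treat both the fake and the noisy MFCC pairs as negative examples for the discriminator; this is our reading of the description, not a verified reproduction of the original training code.

```python
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam()
disc_opt = tf.keras.optimizers.Adam()
EPS = 1e-7  # numerical stability inside the logs

@tf.function
def train_step(generator, discriminator, noisy_mfcc, clean_mfcc, clean_eeg):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_mfcc = generator([noisy_mfcc, clean_eeg], training=True)
        d_f = discriminator([fake_mfcc, clean_eeg], training=True)   # fake pair
        d_r = discriminator([clean_mfcc, clean_eeg], training=True)  # real pair
        d_n = discriminator([noisy_mfcc, clean_eeg], training=True)  # noisy pair
        g_loss = tf.reduce_mean(tf.math.log(1.0 - d_f + EPS))
        d_loss = -tf.reduce_mean(tf.math.log(d_r + EPS)
                                 + tf.math.log(1.0 - d_f + EPS)
                                 + tf.math.log(1.0 - d_n + EPS))
    gen_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    disc_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    return g_loss, d_loss
```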

During testing, the trained generator model takes an (MFCC, EEG) input pair from the test set and outputs enhanced MFCC features, and we use the Griffin-Lim reconstruction algorithm to convert the enhanced MFCC features to speech.

Figure 4: Generator in GAN model
Figure 5: Discriminator in GAN model
Figure 6: Generator model training loss
Figure 7: Discriminator model training loss

6 Results

To evaluate the quality of the enhanced speech, we computed two major performance metrics, namely Perceptual Evaluation of Speech Quality (PESQ) [20] and Short-Term Objective Intelligibility (STOI) [21], for the test set speech data and for the corresponding enhanced speech output by the models when the test set data was given as input. We observed that both metrics were higher for the enhanced speech output than for the test set speech data, as shown in Table 1, indicating that the enhanced speech output was of better quality than the corresponding test set speech data.

Since STOI and PESQ calculations require a clean reference audio signal, we computed STOI and PESQ values only for the data of the two subjects who were common to the test and training sets; hence we had a clean reference speech signal only for these two subjects from the training data set. The average STOI and PESQ values over all test and corresponding enhanced utterances of the two subjects are shown in Table 1.

We also computed one more metric, namely signal to noise ratio (SNR), for all the test set speech data from the eight subjects and for the enhanced speech output by the models for all eight subjects' test data. There are multiple definitions of SNR in the literature; in our case we computed SNR as the ratio of the mean to the standard deviation of the speech signal. We observed an average SNR value of -0.42 for the test set speech data, an average SNR of -0.37 for the enhanced speech output by the LSTM regression model, and -0.18 for the enhanced speech output by the GAN model. The enhanced speech output by the models had a higher SNR value than the test set data, indicating that the enhanced speech output was of better quality than the test set speech data.
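As an illustration, these metrics can be computed per utterance with the open-source pesq and pystoi packages (our choice of tooling; the paper does not state which implementations were used) together with the SNR definition from the text:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def evaluate(clean_ref, signal, fs=16000):
    """PESQ and STOI against a clean reference, plus SNR = mean / std of the signal."""
    return {
        "pesq": pesq(fs, clean_ref, signal, "wb"),
        "stoi": stoi(clean_ref, signal, fs, extended=False),
        "snr": float(np.mean(signal) / np.std(signal)),
    }
```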

Model           | Test Set avg PESQ | Enhanced Output avg PESQ | Test Set avg STOI | Enhanced Output avg STOI
LSTM Regression | -0.47             | 1.96                     | -0.01             | -0.0038
GAN             | -0.47             | 2.60                     | -0.01             | 0.005

Table 1: Speech Enhancement Results

7 Conclusion

In this paper we demonstrated the cleaning of noisy spoken speech using EEG features recorded in parallel with the spoken speech. We made use of deep learning models such as GAN and LSTM regression, together with EEG signal processing principles, to derive our results. To the best of our knowledge, this is the first time spoken speech enhancement using EEG features has been demonstrated with deep learning models. We also observed that the GAN based model achieved better EEG based speech enhancement results than the LSTM regression based model. We plan to publish the data sets used in this work to help advance research in this area.

8 Acknowledgement

We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.

References

  • [1] Michael Berouti, Richard Schwartz, and John Makhoul, “Enhancement of speech corrupted by acoustic noise,” in ICASSP’79. IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1979, vol. 4, pp. 208–211.
  • [2] Yariv Ephraim, “Statistical-model-based speech enhancement systems,” Proceedings of the IEEE, vol. 80, no. 10, pp. 1526–1555, 1992.
  • [3] Shahla Parveen and Phil Green, “Speech enhancement with missing data techniques using recurrent neural networks,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2004, vol. 1, pp. I–733.
  • [4] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, “Speech enhancement based on deep denoising autoencoder.,” in Interspeech, 2013, pp. 436–440.
  • [5] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.
  • [6] Santiago Pascual, Joan Serrà, and Antonio Bonafonte, “Towards generalized speech enhancement with generative adversarial networks,” arXiv preprint arXiv:1904.03418, 2019.
  • [7] Santiago Pascual, Antonio Bonafonte, and Joan Serra, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
  • [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [9] Gautam Krishna, Co Tran, Jianguo Yu, and Ahmed Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
  • [10] Gautam Krishna, Co Tran, Mason Carnahan, and Ahmed Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
  • [11] Gautam Krishna, Yan Han, Co Tran, Mason Carnahan, and Ahmed H Tewfik, “State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG,” arXiv preprint arXiv:1908.05743, 2019.
  • [12] Gautam Krishna, Co Tran, Yan Han, Mason Carnahan, and Ahmed H Tewfik, “Speech recognition with no speech or with noisy speech beyond english,” arXiv preprint arXiv:1906.08045, 2019.
  • [13] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [14] Neetha Das, Simon Van Eyndhoven, Tom Francart, and Alexander Bertrand, “Eeg-based attention-driven speech enhancement for noisy speech mixtures using n-fold multi-channel wiener filters,” in 2017 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017, pp. 1660–1664.
  • [15] Shrikanth Narayanan, Asterios Toutios, Vikram Ramanarayanan, Adam Lammert, Jangwon Kim, Sungbok Lee, Krishna Nayak, Yoon-Chul Kim, Yinghua Zhu, Louis Goldstein, et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
  • [16] Arnaud Delorme and Scott Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [17] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch, “Kernel PCA and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.
  • [18] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [19] Daniel Griffin and Jae Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
  • [20] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 749–752.
  • [21] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.