Advancing Speech Recognition With No Speech Or With Noisy Speech

Gautam Krishna, Brain Machine Interface Lab, The University of Texas at Austin, Austin, Texas
Co Tran, Brain Machine Interface Lab, The University of Texas at Austin, Austin, Texas
Mason Carnahan, Brain Machine Interface Lab, The University of Texas at Austin, Austin, Texas
Ahmed Tewfik, Brain Machine Interface Lab, The University of Texas at Austin, Austin, Texas
Abstract

In this paper we demonstrate end-to-end continuous speech recognition (CSR) using electroencephalography (EEG) signals with no speech signal as input. Attention model based automatic speech recognition (ASR) and connectionist temporal classification (CTC) based ASR systems were implemented to perform recognition. We further demonstrate CSR for noisy speech by fusing acoustic features with EEG features.

electroencephalography (EEG), speech recognition, deep learning, CTC, attention, technology accessibility

I Introduction

Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain. In [1] we demonstrated deep learning based automatic speech recognition (ASR) using EEG signals for a limited English vocabulary of four words and five vowels. In this paper we extend our work to a much larger English vocabulary and use state-of-the-art end-to-end continuous speech recognition models to perform recognition. In our prior work we predicted only isolated words and vowels.
ASR systems form the front end or back end of many cutting edge voice activated technologies such as Amazon Alexa, Apple Siri, Windows Cortana and Samsung Bixby. Unfortunately these systems are trained to recognize text only from acoustic features. This limits technology accessibility for people with speaking disabilities and disorders. The research presented in this paper tries to address this issue by investigating speech recognition using only EEG signals with no acoustic input, and also by combining EEG features with traditional acoustic features to perform recognition. We believe the former will help with speech restoration for people who cannot speak at all, and the latter will help people who have speaking disabilities, such as broken or discontinued speech, to use voice activated technologies with a better user experience, thereby helping to improve technology accessibility.
ASR performance degrades in the presence of noisy speech, and in real life situations most speech is noisy. Inspired by the unique robustness to environmental artifacts exhibited by the human auditory cortex [2, 3], we used very noisy speech data for this work and demonstrated lower word error rate (WER) for a smaller corpus using EEG features alone and using the concatenation of EEG and acoustic features.

In [4] the authors decode imagined speech from EEG using synthetic EEG data and a connectionist temporal classification (CTC) network, whereas in our work we use real EEG data recorded along with the acoustics. In [5] the authors perform envisioned speech recognition using a random forest classifier, whereas in our case we use end-to-end state-of-the-art models and perform recognition for noisy speech. In [6] the authors demonstrate speech recognition using electrocorticography (ECoG) signals, which are invasive in nature, whereas in our work we use non-invasive EEG signals.
This work is mainly motivated by the results presented in [1, 7, 8, 4]. In [7] the authors used a classification approach for identifying phonological categories in imagined and silent speech, whereas in our work we use state-of-the-art continuous speech recognition models that predict words or characters at each time step. Similarly, in [8] a neural network based classification approach was used for predicting phonemes.

The major contribution of this paper is the demonstration of end-to-end continuous noisy speech recognition using only EEG features; this paper further validates the concepts introduced in [1] for a much larger English corpus.

II Automatic Speech Recognition System Models

An end-to-end ASR model maps input feature vectors to an output sequence of vectors of posterior probabilities of tokens without using separate acoustic, pronunciation and language models. In this work we implemented two different types of state-of-the-art end-to-end ASR models used for the task of continuous speech recognition, where the input feature vectors can be EEG features or a concatenation of acoustic and EEG features. We used Google's TensorFlow and Keras deep learning libraries to build our ASR models.

II-A Connectionist Temporal Classification (CTC)

The main ideas behind CTC based ASR were first introduced in [9, 10]. In our work we used a single-layer gated recurrent unit (GRU) [11] with 128 hidden units as the encoder for the CTC network. The decoder consists of a combination of a dense (fully connected) layer and a softmax activation. The output at every time step of the GRU layer is fed into the decoder network. The number of time steps of the GRU encoder is equal to the product of the sampling frequency of the input features and the length of the input sequence. Since different speakers have different rates of speech, we used a dynamic recurrent neural network (RNN) cell; there is no fixed value for the number of encoder time steps.
Usually the number of time steps of the encoder ($T$) is greater than the length of the output token sequence for a continuous speech recognition problem. An RNN based CTC network tries to make the length of the output token sequence equal to $T$ by allowing the repetition of output prediction unit tokens and by introducing a special token called the blank token [9] across all the frames. We used the CTC loss function with the Adam optimizer [12], and at inference time we used a CTC beam search decoder.

We now explain the loss function used in our CTC model. Consider a training data set with $N$ training examples $x_1, \dots, x_N$ and the corresponding label set with target vectors $y_1, \dots, y_N$. Consider any training example, label pair $(x_i, y_i)$. Let the number of time steps of the RNN encoder for $(x_i, y_i)$ be $T_i$. In the case of a character based CTC model, the RNN predicts a character at every time step, whereas in a word based CTC model the RNN predicts a word at every time step. For the sake of simplicity, let us assume that the length of the target vector $y_i$ is equal to $T_i$. Let the probability vector output by the RNN at time step $t$ be $p_t$ and let the value of $p_t$ at the position corresponding to the target $y_{i,t}$ be denoted by $p_t(y_{i,t})$. The probability that the model outputs $y_i$ on input $x_i$ is given by $P(y_i|x_i) = \prod_{t=1}^{T_i} p_t(y_{i,t})$. During the training phase we would like to maximize the conditional probability $P(y_i|x_i)$, and we thereby define the loss function as $-\sum_{i=1}^{N} \log P(y_i|x_i)$.

In the case when the length of $y_i$ is less than $T_i$, we extend the target vector $y_i$ by repeating a few of its values and by introducing the blank token, to create a target vector of length $T_i$. Let the set of possible extensions of $y_i$ be denoted by $\mathrm{ext}(y_i)$. For example, when $y_i = (a, b)$ and $T_i = 3$, the possible extensions are $(a, a, b)$, $(a, b, b)$ and $(a, \text{blank}, b)$. We then define $P(y_i|x_i)$ as $\sum_{\hat{y} \in \mathrm{ext}(y_i)} P(\hat{y}|x_i)$.
In our work we used a character based CTC ASR model. CTC assumes the conditional independence constraint that output predictions are independent given the entire input sequence.
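As a concrete illustration of this architecture, a minimal Keras sketch is shown below. It follows the description above (single 128-unit GRU encoder, dense softmax decoder, CTC loss, beam search decoding), but the feature dimension, character set size and decoding beam width are illustrative assumptions rather than the exact settings used in this work.

```python
import tensorflow as tf

NUM_FEATURES = 90   # assumed input feature dimension (e.g. EEG features after dimension reduction)
NUM_CHARS = 30      # assumed character set size, excluding the CTC blank token

# Encoder: single-layer GRU with 128 hidden units over a variable-length input sequence.
inputs = tf.keras.Input(shape=(None, NUM_FEATURES), name="features")
encoded = tf.keras.layers.GRU(128, return_sequences=True)(inputs)

# Decoder: dense (fully connected) layer + softmax over characters plus the blank token.
char_probs = tf.keras.layers.Dense(NUM_CHARS + 1, activation="softmax")(encoded)
model = tf.keras.Model(inputs, char_probs)

def ctc_loss(labels, y_pred, input_length, label_length):
    # CTC loss over the per-frame character probabilities; the model is trained with Adam.
    return tf.keras.backend.ctc_batch_cost(labels, y_pred, input_length, label_length)

# At inference time, beam search decoding over the per-frame probabilities, e.g.:
# decoded, _ = tf.keras.backend.ctc_decode(char_probs_batch, sequence_lengths,
#                                          greedy=False, beam_width=10)
```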

II-B RNN Encoder-Decoder or Attention Model

The RNN encoder-decoder ASR model consists of an RNN encoder and an RNN decoder with an attention mechanism [13, 14, 15]. The number of time steps of the encoder is equal to the product of the sampling frequency of the input features and the length of the input sequence; there is no fixed value for the number of time steps in our case, and we used a dynamic RNN cell. We used a single-layer GRU with 128 hidden units for both the encoder and the decoder. A dense layer followed by a softmax activation is used after the decoder GRU to get the prediction probabilities; the dense layer performs an affine transformation. The number of time steps of the decoder GRU is the same as the number of words present in the sentence for a given training example. The training objective is to maximize the log probability of the ordered conditionals, i.e. $\sum_{t=1}^{L} \log P(y_t | X, y_1, \dots, y_{t-1})$, where $X$ is the input feature vector, the $y_t$'s are the labels for the ordered words present in that training example and $L$ is the length of the output label sentence for that example. Cross entropy was used as the loss function with Adam as the optimizer. We used the teacher forcing algorithm [16] to train the model. At inference time we used a beam search decoder.
We now explain the attention mechanism used in our attention model. Consider any training example, label pair $(x_i, y_i)$. Let the number of time steps of the encoder GRU for that example be $T_i$. The GRU encoder transforms the input features $(x_{i,1}, \dots, x_{i,T_i})$ into hidden output feature vectors $(h_1, \dots, h_{T_i})$. Let the $t$-th word label in $y_i$ (the sentence) be $y_{i,t}$; then to predict $y_{i,t}$ at decoder time step $t$, a context vector $c_t$ is computed and fed into the decoder GRU. $c_t$ is computed as $c_t = \sum_{j=1}^{T_i} \alpha_{t,j} h_j$, where $\alpha_t$ is the attention weight vector satisfying the property $\sum_{j=1}^{T_i} \alpha_{t,j} = 1$.
$\alpha_{t,j}$ can be intuitively seen as a measure of how much attention $y_{i,t}$ must pay to $h_j$, $j \in \{1, \dots, T_i\}$. $\alpha_{t,j}$ is mathematically defined as $\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T_i} \exp(e_{t,k})}$, where $e_{t,j}$ is a score computed from $h_j$ and $s_{t-1}$, the hidden state of the decoder GRU at time step $t-1$.
The way of computing the value of $e_{t,j}$ depends on the type of attention used. In this work we used Bahdanau's additive style attention [14], which defines $e_{t,j}$ as $e_{t,j} = v^{\top} \tanh(W_1 h_j + W_2 s_{t-1})$, where $v$, $W_1$ and $W_2$ are learnable parameters during training of the model.
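A minimal sketch of this additive attention as a Keras layer is given below; it implements the scoring function $v^{\top}\tanh(W_1 h_j + W_2 s_{t-1})$ described above, with the attention dimension (128 units) chosen here as an assumption.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: e_tj = v^T tanh(W1 h_j + W2 s_{t-1})."""

    def __init__(self, units=128):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # applied to encoder outputs h_j
        self.W2 = tf.keras.layers.Dense(units)  # applied to decoder state s_{t-1}
        self.v = tf.keras.layers.Dense(1)

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, dec_units); encoder_outputs: (batch, T, enc_units)
        s = tf.expand_dims(decoder_state, 1)                                # (batch, 1, dec_units)
        scores = self.v(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(s)))  # (batch, T, 1)
        alpha = tf.nn.softmax(scores, axis=1)           # attention weights, sum to 1 over T
        context = tf.reduce_sum(alpha * encoder_outputs, axis=1)            # context vector c_t
        return context, alpha
```

The context vector returned here is what is fed into the decoder GRU at each output time step, and the returned weights are the quantities visualized in Figures 4 and 5.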

Fig. 1: EEG channel locations for the cap used in our experiments

III Design of Experiments for Building the Database

We built two types of simultaneous speech EEG recording databases for this work. For database A, five female and five male subjects took part in the experiment. For database B, five male and three female subjects took part in the experiment. Except for two subjects, all were native English speakers in both databases. All subjects were UT Austin undergraduate or graduate students in their early twenties.
For data set A, the 10 subjects were asked to speak the first 30 sentences from the USC-TIMIT database [17] while their simultaneous speech and EEG signals were recorded. This data was recorded in the presence of a 40 dB background noise (noise generated by the room air conditioner fan). We then asked each subject to repeat the same experiment two more times, so we had 30 speech EEG recording examples for each sentence.
For data set B, the 8 subjects were asked to repeat the same experiment, but this time we used background music played from our lab computer to generate a 65 dB background noise. Here we had 24 speech EEG recording examples for each sentence.
We used Brain Vision EEG recording hardware. Our EEG cap had 32 wet EEG electrodes including one electrode as ground as shown in Figure 1. We used EEGLab [18] to obtain the EEG sensor location mapping. It is based on standard 10-20 EEG sensor placement method for 32 electrodes.
For data set A, we used data from the first 8 subjects for training the model, and data from the remaining two subjects for the validation and test sets respectively.
For data set B, we used data from the first 6 subjects for training the model, and data from the remaining two subjects for the validation and test sets respectively.

IV EEG and Speech Feature Extraction Details

EEG signals were sampled at 1000 Hz and a fourth order IIR band pass filter with cut off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut off frequency of 60 Hz was used to remove power line noise. EEGLAB's [18] independent component analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG), electromyography (EMG) and electrooculography (EOG) from the EEG signals. We extracted five statistical features for EEG, namely root mean square, zero crossing rate, moving window average, kurtosis and power spectral entropy [1]. In total we extracted 31 (channels) × 5 = 155 features for the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
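A sketch of this filtering chain using SciPy is shown below; the notch quality factor is an assumption, since it is not specified above, and the ICA-based artifact removal done in EEGLAB is only noted in a comment.

```python
from scipy.signal import butter, iirnotch, filtfilt

FS = 1000  # EEG sampling rate in Hz

def preprocess_eeg_channel(x):
    """Fourth order IIR band pass (0.1-70 Hz) and 60 Hz notch for one EEG channel."""
    b_bp, a_bp = butter(4, [0.1, 70], btype="bandpass", fs=FS)
    x = filtfilt(b_bp, a_bp, x)

    b_n, a_n = iirnotch(w0=60.0, Q=30.0, fs=FS)  # Q=30 is an assumed quality factor
    x = filtfilt(b_n, a_n, x)

    # ICA-based removal of ECG/EMG/EOG artifacts was performed in EEGLAB (not shown here).
    return x
```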
We used power spectral entropy because it captures the spectral (frequency domain) information and the signal complexity of the EEG. It is also a widely used feature in EEG signal analysis [19]. Similarly, zero crossing rate was chosen as it is a commonly used feature for both speech recognition and bio-signal analysis. The remaining features were chosen to capture time domain statistical information. We performed a large number of experiments to identify this set of features. Initially we used only spectral entropy and zero crossing rate, but we noticed that the performance of the ASR system improved significantly, by 20%, when we added the remaining features.
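The five per-channel statistics can be computed over short sliding windows roughly as follows; the 100-sample window and 10-sample hop are assumptions chosen to give the 100 Hz feature rate mentioned above for 1000 Hz EEG.

```python
import numpy as np
from scipy.stats import kurtosis
from scipy.signal import periodogram

def eeg_window_features(x, fs=1000):
    """Five statistical features for one windowed EEG-channel segment x."""
    rms = np.sqrt(np.mean(x ** 2))                    # root mean square
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)    # zero crossing rate
    mwa = np.mean(x)                                  # moving window average
    kur = kurtosis(x)                                 # kurtosis
    _, psd = periodogram(x, fs=fs)
    p = psd / (np.sum(psd) + 1e-12)
    pse = -np.sum(p * np.log2(p + 1e-12))             # power spectral entropy
    return np.array([rms, zcr, mwa, kur, pse])

def eeg_channel_features(signal, fs=1000, win=100, hop=10):
    """Slide a window over one channel; a 10-sample hop at 1000 Hz gives 100 Hz features."""
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    return np.stack([eeg_window_features(f, fs) for f in frames])  # (num_frames, 5)
```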
The recorded speech signal was sampled at 16 kHz. We extracted Mel-frequency cepstral coefficients (MFCC) as features for the speech signal. We first extracted 13 MFCC features and then computed their first and second order differentials (delta and delta-delta), giving 39 MFCC features in total. The MFCC features were also sampled at 100 Hz, the same as the sampling frequency of the EEG features, so that the two feature streams have the same length and can be aligned frame by frame.
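A sketch of the MFCC extraction using librosa is shown below; the 10 ms hop (160 samples at 16 kHz) is an assumption made to obtain the 100 Hz frame rate mentioned above.

```python
import numpy as np
import librosa

def mfcc_features(wav_path):
    """13 MFCCs plus delta and delta-delta (39 features) at roughly a 100 Hz frame rate."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)  # 160 samples = 10 ms
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2], axis=0).T  # (num_frames, 39)
```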

Fig. 2: Explained variance plot

V EEG Feature Dimension Reduction Algorithm Details

After extracting EEG and acoustic features as explained in the previous section, we used non-linear methods for feature dimension reduction in order to obtain a set of EEG features that better represents the acoustic features. We reduced the 155 EEG features to a dimension of 30 by applying kernel principal component analysis (KPCA) [20]. We plotted the cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 2. We used KPCA with a polynomial kernel of degree 3 [1]. We further computed the delta and delta-delta of those 30 EEG features, giving a final EEG feature dimension of 90 (30 × 3) for both data sets.
When we used the EEG features for ASR without dimension reduction, ASR performance dropped by 40%. The non-linear dimension reduction of the EEG features therefore significantly improved ASR performance.
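A sketch of this step with scikit-learn is shown below; since KernelPCA does not expose an explained variance ratio directly, the cumulative curve here is approximated from the variances of the projected components, which is one reasonable way to produce a plot like Figure 2.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def reduce_eeg_features(eeg_frames, n_components=30):
    """Reduce 155-dim EEG feature frames to 30 dims with a degree-3 polynomial KPCA."""
    kpca = KernelPCA(n_components=n_components, kernel="poly", degree=3)
    reduced = kpca.fit_transform(eeg_frames)  # (num_frames, 30)

    # Approximate cumulative explained variance over the retained components.
    var = np.var(reduced, axis=0)
    cum_var = np.cumsum(var) / np.sum(var)
    return reduced, cum_var
```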

VI Results

The attention model predicts a word and the CTC model predicts a character at every time step; hence we used word error rate (WER) as the performance metric for the attention model and character error rate (CER) for the CTC model, for the different feature sets shown in the tables below.
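For reference, WER can be computed as the word-level Levenshtein (edit) distance between the hypothesis and the reference, normalized by the reference length; CER is the same computation over characters. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)
```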
Tables I and II show the test time results for the attention model on both data sets when trained using EEG features and the concatenation of EEG and acoustic features respectively. As seen from the results, the attention model gave lower WER when trained and tested on a smaller number of sentences. As the vocabulary size increased, the WER also went up. We believe that for the attention model to achieve lower WER on a larger vocabulary, a larger training data set is required, since a large number of weights need to be adapted. Figure 3 shows the training loss convergence of our attention model.
Tables IV and V show the results obtained using the CTC model. The CTC model gave much better results than the attention model, demonstrating higher accuracy and lower error rates. We believe that since the CTC model has fewer parameters than the attention model, it is able to generalize better with a smaller number of training examples. However, the CTC model was trained for 500 epochs, compared to 100 epochs for the attention model, to observe loss convergence, and the batch size was set to one for the CTC model; thus CTC model training was much more time consuming than attention model training. The CTC model was able to predict 10 sentences containing 59 unique words using only EEG features with 11.6% CER on the noisier data set B, as shown in Table IV.
In [1] we demonstrated that features from EEG sensors T7 and T8 contributed most towards ASR performance. Table VI shows the CTC model test time results when we trained the model using EEG features from only the T7 and T8 sensors on the noisier data set B. We observed that for the smaller vocabularies the results were comparable with the results from Table IV, where we used EEG features from all 31 sensors with dimension reduction. Table III shows the results for the attention model when trained with EEG features from sensors T7 and T8 on data set B.
Figures 4 and 5 show visualizations of the attention weights when the attention model was trained and tested using only EEG features on data set B. The plots show the EEG feature importance (attention) distribution across time steps for predicting the first and third sentences, and they indicate that the attention model was not able to attend properly to the EEG features, which might be another reason for the higher WER.

Fig. 3: Training loss convergence for attention model using only EEG features for first 10 sentences from data set A
Fig. 4: Visualization of attention weights for the first sentence
Fig. 5: Visualization of attention weights for the third sentence
Fig. 6: Training loss convergence for CTC model using only EEG features for first 3 sentences from data set B
Number of Sentences | Number of unique words contained | EEG (WER %) | EEG+MFCC (WER %)
3                   | 19                               | 0           | 0
5                   | 29                               | 37.03       | 44.44
7                   | 42                               | 58.9        | 58.9
10                  | 59                               | 63.11       | 66.6
15                  | 84                               | 82.7        | 75.8
20                  | 106                              | 87.3        | 86.4
TABLE I: WER on test set for attention model for Data set A
Number of Sentences | Number of unique words contained | EEG (WER %) | EEG+MFCC (WER %)
3                   | 19                               | 0           | 0
5                   | 29                               | 44.4        | 37.02
7                   | 42                               | 52.8        | 30.7
10                  | 59                               | 68          | 64
15                  | 84                               | 82.7        | 75.8
20                  | 106                              | 86.4        | 88
TABLE II: WER on test set for attention model for Data set B
Number of Sentences | Number of unique words contained | EEG (WER %)
3                   | 19                               | 35.2
5                   | 29                               | 59.2
7                   | 42                               | 71.7
10                  | 59                               | 77.1
TABLE III: WER on test set for attention model for Data set B using EEG features from only T7 and T8 electrodes
Number of Sentences | Number of unique words contained | EEG (CER %) | EEG+MFCC (CER %)
3                   | 19                               | 2.2         | 0
5                   | 29                               | 1           | 0
7                   | 42                               | 1.8         | 0
10                  | 59                               | 11.6        | 9.6
TABLE IV: CER on test set for CTC model for Data set B
Number of Sentences | Number of unique words contained | EEG (CER %) | EEG+MFCC (CER %)
3                   | 19                               | 0.8         | 0
5                   | 29                               | 0.8         | 0
7                   | 42                               | 0.38        | 0
10                  | 59                               | 2.4         | 0
TABLE V: CER on test set for CTC model for Data set A
Number of Sentences | Number of unique words contained | EEG (CER %)
3                   | 19                               | 0.8
5                   | 29                               | 11.6
7                   | 42                               | 18
10                  | 59                               | 22.01
TABLE VI: CER on test set for CTC model for Data set B using EEG features from only T7 and T8 electrodes

VII Conclusion and Future Work

In this paper we demonstrated the feasibility of using EEG features, and the concatenation of EEG and acoustic features, for performing continuous noisy speech recognition. To the best of our knowledge, this is the first time continuous noisy speech recognition has been demonstrated using only EEG features.
The CTC based model demonstrated better performance than the attention based model.
We further plan to publish the speech EEG database used in this work to help advance research in this area.
For future work, we plan to build a much larger speech EEG database and also perform experiments with data collected from subjects with speaking disabilities.
We will also investigate whether it is possible to improve the attention model results by tuning hyperparameters to improve the model's ability to condition on the input, by training with more examples, and by using an external language model during inference.

VIII Acknowledgement

We would like to thank Kerry Loader from Dell, Austin, TX for donating the GPU used to train the models in this work.

References

  • [1] G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on.   IEEE, 2019.
  • [2] X. Yang, K. Wang, and S. A. Shamma, “Auditory representations of acoustic signals,” Tech. Rep., 1991.
  • [3] N. Mesgarani and S. Shamma, “Speech processing with a cortical representation of audio,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on.   IEEE, 2011, pp. 5872–5875.
  • [4] K. Wang, X. Wang, and G. Li, “Simulation experiment of bci based on imagined speech eeg decoding,” arXiv preprint arXiv:1705.07771, 2017.
  • [5] P. Kumar, R. Saini, P. P. Roy, P. K. Sahu, and D. P. Dogra, “Envisioned speech recognition using eeg sensors,” Personal and Ubiquitous Computing, vol. 22, no. 1, pp. 185–199, 2018.
  • [6] N. Ramsey, E. Salari, E. Aarnoutse, M. Vansteensel, M. Bleichner, and Z. Freudenburg, “Decoding spoken phonemes from sensorimotor cortex with high-density ecog grids,” Neuroimage, 2017.
  • [7] S. Zhao and F. Rudzicz, “Classifying phonological categories in imagined and articulated speech,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 992–996.
  • [8] P. Sun and J. Qin, “Neural networks based eeg-speech models,” arXiv preprint arXiv:1612.05369, 2016.
  • [9] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning.   ACM, 2006, pp. 369–376.
  • [10] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
  • [11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [12] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [13] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [14] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
  • [15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [16] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural computation, vol. 1, no. 2, pp. 270–280, 1989.
  • [17] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (tc),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
  • [18] A. Delorme and S. Makeig, “Eeglab: an open source toolbox for analysis of single-trial eeg dynamics including independent component analysis,” Journal of neuroscience methods, vol. 134, no. 1, pp. 9–21, 2004.
  • [19] A. Zhang, B. Yang, and L. Huang, “Feature extraction of eeg signals using power spectral entropy,” in 2008 International Conference on BioMedical Engineering and Informatics.   IEEE, 2008, pp. 435–439.
  • [20] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.