Continuous Silent Speech Recognition using EEG
In this paper we explore continuous silent speech recognition using electroencephalography (EEG) signals. We implemented a connectionist temporal classification (CTC) automatic speech recognition (ASR) model to translate to text the EEG signals recorded while subjects silently read English sentences in their mind, without producing any voice.
Our results demonstrate the feasibility of using EEG signals for performing continuous silent speech recognition. We demonstrate our results for a limited English vocabulary consisting of 30 unique sentences.
A continuous silent speech recognition model tries to decode what a person is reading in their mind. It can be considered close to the mind-reading problem, where thoughts are decoded. Research along this direction can enable people with severe cognitive disabilities to use virtual assistants like Siri, Alexa, or Bixby, thereby improving technology accessibility. It can also enable people with cognitive disabilities to communicate with other people. Continuous silent speech recognition technology can also potentially allow soldiers and scientists to perform covert communication in sensitive working environments. Finally, it can introduce a new form of thought-based communication for able-bodied people.
Electroencephalography (EEG) is a non-invasive way of measuring the electrical activity of the human brain by placing EEG sensors on the scalp of the subject. EEG signals have high temporal resolution even though their spatial resolution is poor. On the other hand, electrocorticography (ECoG) is an invasive way of measuring the electrical activity of the human brain. ECoG signals have temporal resolution similar to EEG signals but better spatial resolution and signal-to-noise ratio (SNR). The major drawback of ECoG is that it is an invasive procedure requiring the subject to undergo brain surgery in order to implant the ECoG electrodes. In this work we use non-invasive EEG signals to decode the thoughts of the subjects, that is, to perform continuous silent speech recognition.
In [1, 2, 3] the authors demonstrated isolated and continuous speech recognition for a limited English vocabulary using EEG signals recorded in parallel while subjects were speaking English sentences out loud and while they were listening to English utterances. The authors in [2, 3, 1] used end-to-end automatic speech recognition (ASR) models, namely connectionist temporal classification (CTC), attention, and transducer models, to translate EEG input features directly to text. In this work we perform continuous silent speech recognition, using a CTC model to map EEG features, recorded while the subjects were reading English sentences in their mind, to text.
Other related work includes a demonstration of envisioned speech recognition using a random forest classifier, and a demonstration of imagined speech recognition from EEG signals using synthetic EEG data and a CTC network; in our work we use real experimental EEG data and a larger vocabulary. Speech recognition using ECoG signals has also been demonstrated. Earlier work used a classification approach to identify phonological categories in imagined and silent speech, whereas in this paper we demonstrate continuous silent speech recognition. EEG-based silent speech recognition has been demonstrated for a vocabulary of five words, but not at the sentence level, where continuous recognition is performed; that work also used a traditional hidden Markov model (HMM), whereas in this work we make use of state-of-the-art deep learning models and demonstrate results for a much larger vocabulary. In [12, 13] the authors perform silent speech recognition, but they did not use EEG neural recordings; in our work we make use of EEG neural recordings, which can lead to future work on mind reading or decoding thoughts. Another closely related work did not use EEG features for decoding; moreover, our approach differs in that they use a convolutional neural network (CNN) to extract features, whereas we extract more interpretable hand-crafted EEG features [1, 2] to train the model.
The major contribution of this work is demonstrating the feasibility of using EEG features to perform continuous silent speech recognition. We believe our results will motivate the research community to improve on them and to develop better state-of-the-art models for continuous silent speech recognition using EEG features.
II Connectionist Temporal Classification (CTC)
The CTC ASR model ideas were first introduced in [14, 4]. The CTC model can perform continuous speech recognition by making the length of the output token sequence equal to the number of time steps of the input features, by allowing repetition of output tokens, and by introducing a special token called the blank token. Thus the CTC model is alignment-free. The CTC ASR model consists of an encoder, a decoder, and the CTC loss function.
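The CTC collapse rule described above (merge consecutive repeated tokens, then drop blanks) can be sketched in a few lines; the blank symbol `"_"` here is a placeholder of our choosing, not a detail specified in the paper:

```python
def ctc_collapse(tokens, blank="_"):
    """Map a frame-level CTC path to an output string:
    first merge consecutive repeated tokens, then drop the blank token."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != blank:  # keep a token only when it changes and is not blank
            out.append(t)
        prev = t
    return "".join(out)

# e.g. the frame-level path "hh_e_ll_llo" collapses to "hello":
# repeats within a run merge, but the blank between the two l-runs keeps them distinct
```

This is why the blank token matters: without it, the doubled "l" in "hello" could never be produced.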
The encoder of our CTC model consists of two layers of gated recurrent units (GRU), with 128 hidden units in the first GRU layer and 64 hidden units in the second. Each GRU layer used dropout regularization with a dropout rate of 0.1. The GRU layers were followed by a temporal convolutional network (TCN) consisting of 32 filters. The decoder of the CTC model consists of a time-distributed dense layer and a softmax activation function. The output of the encoder is fed into the decoder at every time step. The encoder takes the EEG features as input. The number of time steps of the encoder is calculated as the product of the sampling frequency of the input EEG features and the sequence length. There was no fixed value for the number of time steps; we used a dynamic recurrent neural network (RNN) cell.
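A minimal Keras sketch of this encoder-decoder stack follows. The layer sizes (128 and 64 GRU units, 32 filters, dropout 0.1) come from the text; the number of output classes, the kernel size and dilation of the convolution, and the use of a single causal dilated `Conv1D` as a stand-in for the TCN block are our assumptions, not details given in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 30   # KPCA-reduced EEG feature dimension (Section V)
NUM_CLASSES = 28    # assumed: 26 letters + space + CTC blank

def build_encoder_decoder():
    # variable-length time axis, matching the dynamic RNN described in the text
    inputs = layers.Input(shape=(None, NUM_FEATURES))
    x = layers.GRU(128, return_sequences=True, dropout=0.1)(inputs)
    x = layers.GRU(64, return_sequences=True, dropout=0.1)(x)
    # stand-in for the TCN: a causal dilated 1-D convolution with 32 filters
    x = layers.Conv1D(32, kernel_size=3, padding="causal", dilation_rate=2)(x)
    # decoder: time-distributed dense layer with softmax over the character set
    outputs = layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax"))(x)
    return tf.keras.Model(inputs, outputs)
```

The model emits one distribution over characters per input time step, which is exactly the shape the CTC loss expects.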
The CTC model was trained for 130 epochs with a batch size of 32 using the Adam optimizer to minimize the CTC loss function. The mathematical details of the CTC loss function are covered in [4, 14, 3, 2]. We used a character-based model in this work: the model predicts a character at every time step. At inference time a CTC beam search decoder is used in combination with an external 4-gram language model, a combination popularly known as shallow fusion.
We used 80% of the total EEG data as the training set and the remaining data as the test set. The train-test split was done randomly. Figure 1 shows the architecture of our CTC model and Figure 2 shows the CTC training loss convergence. All the scripts were written using the TensorFlow 2.0 and Keras deep learning frameworks.
III Design of Experiments for building the database
Four male subjects in their early to mid twenties took part in the EEG experiment. Three of the four subjects were non-native English speakers and one was a native English speaker. Each subject was asked to read the first 30 English sentences from the USC-TIMIT database in their mind without producing any voice while their EEG signals were recorded. The English sentences were shown to them on a computer screen. Each subject was then asked to repeat the same experiment two more times. The data was recorded in the absence of background noise. There were 90 EEG recordings per subject.
We used Brain Products EEG recording hardware. The EEG cap had 32 wet EEG electrodes, including one ground electrode, as shown in Figure 3. We used EEGLAB to obtain the EEG sensor location mapping, which is based on the standard 10-20 EEG sensor placement method for 32 electrodes.
IV EEG feature extraction details
EEG signals were sampled at 1000 Hz, and a fourth-order IIR band-pass filter with cut-off frequencies of 0.1 Hz and 70 Hz was applied. A notch filter with a cut-off frequency of 60 Hz was used to remove power line noise. EEGLAB's independent component analysis (ICA) toolbox was used to remove other biological signal artifacts such as electrocardiography (ECG) and electrooculography (EOG) from the EEG signals. We extracted five statistical features from the EEG, namely root mean square, zero crossing rate, moving window average, kurtosis, and power spectral entropy [1, 2]. Thus in total we extracted 31 (channels) × 5 = 155 features from the EEG signals. The EEG features were extracted at a sampling frequency of 100 Hz for each EEG channel.
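The five per-window statistics can be sketched as below for a single EEG channel. The non-overlapping 10-sample window (1000 Hz input, 100 Hz feature rate) and the exact definitions of zero crossing rate, kurtosis, and power spectral entropy are our assumptions; the paper only names the features:

```python
import numpy as np

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum of one window."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    psd = psd / psd.sum()
    psd = psd[psd > 0]                      # avoid log(0)
    return float(-(psd * np.log2(psd)).sum())

def window_features(x):
    """The five statistical features named in the text, for one window."""
    rms = np.sqrt(np.mean(x ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)    # zero crossing rate
    mwa = np.mean(x)                                  # moving window average
    m2 = np.mean((x - x.mean()) ** 2)
    kurt = np.mean((x - x.mean()) ** 4) / (m2 ** 2)   # (non-excess) kurtosis
    return np.array([rms, zcr, mwa, kurt, spectral_entropy(x)])

def extract_features(channel, fs=1000, feat_fs=100):
    """Slide a non-overlapping window so features come out at feat_fs Hz."""
    hop = fs // feat_fs                               # 10 samples per frame
    frames = len(channel) // hop
    return np.stack([window_features(channel[i * hop:(i + 1) * hop])
                     for i in range(frames)])
```

Running this over all 31 channels and concatenating gives the 155-dimensional feature frames described above.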
V EEG Feature Dimension Reduction Algorithm Details
We reduced the 155 EEG features to a dimension of 30 by applying kernel principal component analysis (KPCA). We plotted the cumulative explained variance versus the number of components to identify the right feature dimension, as shown in Figure 4. We used KPCA with a polynomial kernel of degree 3 [1, 2].
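With scikit-learn this reduction is a few lines; the random matrix below is only a toy stand-in for the real frames × 155 feature matrix:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# toy stand-in for the real EEG feature matrix: frames x 155 features
X = np.random.default_rng(0).standard_normal((200, 155))

# polynomial kernel of degree 3, reduced to 30 components, as in the text
kpca = KernelPCA(n_components=30, kernel="poly", degree=3)
X_reduced = kpca.fit_transform(X)   # shape: frames x 30
```

In practice the transform would be fit on the training set and applied unchanged to the test set to avoid leakage.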
VI Results
We used word error rate (WER) as the performance metric to evaluate the CTC model at test time. Table 1 shows the results obtained at test time for different test set vocabulary sizes. The average WER is reported in Table 1.
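WER is the word-level Levenshtein (edit) distance between the decoded hypothesis and the reference transcript, normalized by the reference length; a standard implementation is:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions relative to the reference.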
For all test set vocabulary sizes we observed WERs in the 80s. We believe the test-time performance can be improved by training the model with more examples; CTC models are usually trained on larger data sets to obtain state-of-the-art performance at test time.
We observed that our overall results were poor compared to results demonstrated with EEG signals recorded in parallel with spoken speech. Our overall results indicate that continuous silent speech recognition remains a challenging problem, and we hope they will motivate other researchers to develop better decoding models. To the best of our knowledge, this is the first time continuous silent speech recognition has been demonstrated at the sentence level using real experimental EEG features.
VII Conclusion and Future work
In this work we demonstrated the feasibility of using EEG signals to perform continuous silent speech recognition.
Future work will focus on improving our current results by performing EEG source localization, developing better decoding models, and related techniques.
We would like to thank Kerry Loader and Rezwanul Kabir from Dell, Austin, TX for donating the GPU used to train the models in this work.
- G. Krishna, C. Tran, J. Yu, and A. Tewfik, “Speech recognition with no speech or with noisy speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on. IEEE, 2019.
- G. Krishna, C. Tran, M. Carnahan, and A. Tewfik, “Advancing speech recognition with no speech or with noisy speech,” in 2019 27th European Signal Processing Conference (EUSIPCO). IEEE, 2019.
- G. Krishna, Y. Han, C. Tran, M. Carnahan, and A. H. Tewfik, “State-of-the-art speech recognition using EEG and towards decoding of speech spectrum from EEG,” arXiv preprint arXiv:1908.05743, 2019.
- A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764–1772.
- J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
- A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–6649.
- P. Kumar, R. Saini, P. P. Roy, P. K. Sahu, and D. P. Dogra, “Envisioned speech recognition using EEG sensors,” Personal and Ubiquitous Computing, vol. 22, no. 1, pp. 185–199, 2018.
- K. Wang, X. Wang, and G. Li, “Simulation experiment of BCI based on imagined speech EEG decoding,” arXiv preprint arXiv:1705.07771, 2017.
- N. Ramsey, E. Salari, E. Aarnoutse, M. Vansteensel, M. Bleichner, and Z. Freudenburg, “Decoding spoken phonemes from sensorimotor cortex with high-density ECoG grids,” NeuroImage, 2017.
- S. Zhao and F. Rudzicz, “Classifying phonological categories in imagined and articulated speech,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 992–996.
- A. Porbadnigk, M. Wester, J.-P. Calliess, and T. Schultz, “EEG-based speech recognition: impact of temporal effects,” 2009.
- A. Kapur, S. Kapur, and P. Maes, “AlterEgo: A personalized wearable silent speech interface,” in 23rd International Conference on Intelligent User Interfaces, 2018, pp. 43–53.
- E. J. Wadkins, “A continuous silent speech recognition system for AlterEgo, a silent speech interface,” Ph.D. dissertation, Massachusetts Institute of Technology, 2019.
- A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
- J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
- S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 369–375.
- S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim, Y. Zhu, L. Goldstein et al., “Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC),” The Journal of the Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
- A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.
- S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, “Kernel pca and de-noising in feature spaces,” in Advances in neural information processing systems, 1999, pp. 536–542.
- G. Krishna, C. Tran, M. Carnahan, Y. Han, and A. H. Tewfik, “Improving EEG based continuous speech recognition,” arXiv preprint arXiv:1911.11610, 2019.