Glottal Closure Instants Detection From Pathological Acoustic Speech Signal Using Deep Learning
In this paper, we propose a classification-based method for detecting glottal closure instants (GCI) from pathological acoustic speech signals, which has many applications in vocal disorder analysis. To date, GCI for pathological disorders have been extracted from the laryngeal (glottal source) signal recorded with an Electroglottograph, a dedicated device designed to measure vocal fold vibration around the larynx. We have created a pathological dataset consisting of simultaneous recordings of the glottal source and acoustic speech signals of six different disorders from patients with vocal disorders. The GCI locations are manually annotated for disorder analysis and supervised learning. We propose a convolutional neural network based GCI detection method that fuses deep acoustic speech and linear prediction residual features for robust GCI detection. The experimental results show that the proposed method is significantly better than the state-of-the-art GCI detection methods.
Gurunath Reddy M, Tanumay Mandal, Krothapalli Sreenivasa Rao Indian Institute of Technology Kharagpur, India email@example.com
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
Glottal closure instants are the instants of significant excitation of the vocal tract system, which occur at the closure of the vocal folds in every cycle of vocal fold vibration during phonation. For healthy vocal folds, the glottal closures can be approximated by a sequence of strong impulses due to the abrupt closure of the vocal folds, whereas disordered vocal folds produce a weak or smeared excitation due to incomplete closure. Hence, methods developed to detect GCI from speech recorded from healthy vocal folds cannot be directly applied to pathological speech, and there have been significantly fewer attempts to extract GCI directly from pathological speech. An alternative to the acoustic speech signal is the laryngeal (Electroglottograph or EGG) signal, a low-frequency signal recorded by measuring the impedance across the larynx while passing a weak electrical current between two electrodes. The EGG signal thus captures the vocal fold activity free from the vocal tract resonances that are prominent in raw speech. The negative peaks in the derivative of the EGG coincide with the GCI; hence, there are many EGG-based works on detecting GCI from healthy and disordered vocal folds [4, 5, 6, 7, 8, 9, 10]. However, it is not always possible to record the EGG from the larynx: doing so requires a dedicated device and at least basic skills to operate it. In contrast, acoustic speech can be easily recorded with the microphone of a handheld device such as a smartphone, and can be processed on the device to give a preliminary assessment before consulting a doctor, or the processed data can be sent to a pathologist for further analysis. Hence, there is a great need for GCI detection directly from the speech signal.
Studies based on Modal and Pathological Speech. Many algorithms exist for GCI detection from modal or normal speech. Most of the available GCI detection methods rely on a representative signal derived from speech that emphasizes the locations of glottal closure instants. The GCI are then detected from the peaks of the representative signal, either by eliminating spurious instants or by picking genuine GCI using hand-crafted heuristics. Most methods prefer the linear prediction residual (LPR) as the representative signal [11, 12, 13, 14], which is a correlate of the glottal source signal under the source-filter model of speech production. Other methods exploit the impulsive nature of the excitation [16, 17, 18, 19] to derive the representative signal. Recently, Rachel et al. and Wu et al. applied their GCI detection methods, tuned for modal speech, to pathological speech on limited data consisting of one type of disorder: dysphonia. It should be noted that the aforementioned methods are developed for modal/normal speech and depend heavily on the choice of representative signal and on model assumptions to extract the GCI. They also rely on signal processing pipelines that require manual tuning of parameters and hand-crafted heuristics specific to a dataset to reliably detect the GCI from speech. Recently, classification-based data-driven models were proposed for GCI detection from modal speech [22, 23]. These models are trained with hand-crafted features extracted from the speech signal and use voting classifiers to detect the GCI. In this paper, we propose a deep CNN model, which requires no manual parameter tuning and is trained with both raw acoustic speech and LPR, to predict the GCI from pathological speech.
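As a brief illustration of the LPR discussed above, the residual can be written as the error of a linear predictor. This is a sketch of the textbook definition, not a formula taken from the paper; the symbols s[n] (speech), a_k (predictor coefficients), p (LP order), G (gain) and u[n] (excitation) are notation introduced here for illustration.

```latex
\hat{s}[n] = \sum_{k=1}^{p} a_k \, s[n-k], \qquad
e[n] = s[n] - \hat{s}[n]
% Source-filter model: if s[n] = \sum_{k=1}^{p} a_k s[n-k] + G\,u[n],
% then the residual approximates the excitation:
e[n] \approx G \, u[n]
```

Under this model the residual e[n] approximates the excitation, which is why its large-amplitude peaks are commonly used as candidate GCI locations.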
2 Vocal Disorder Speech and EGG Dataset
For this study, we collected simultaneous recordings of EGG and speech signals from patients with pathological vocal fold disorders at B. C. Roy Technology Hospital, Indian Institute of Technology Kharagpur, India. We collected data from 78 patients registered for diagnosis and treatment at the hospital. There were 45 male and 33 female patients, aged 22-72 years and 12-63 years respectively. The vocal fold disorders are categorized into six types: vocal nodule (N), vocal polyp (P), laryngitis (L), thickened vocal folds (T), cancer (C) and vocal fold paralysis (PV). The speech samples are captured using a high-quality microphone. The EGG signals are recorded using a clinical-grade Electroglottograph (procured from TechCadenza, India). The acoustic speech and EGG of the sustained vowels /a/, /e/ and /o/ are each recorded three times from the participating patients. Sample speech and EGG signals for cancer (C), nodule (N), laryngitis (L) and thickened vocal folds (T) are shown in Fig. 1 (due to page constraints, signals are shown for only four disorders).
3 Proposed GCI Detection Method
We propose a classification-based GCI detection method for pathological speech data that trains multi-column (parallel) deep CNN models. The proposed CNN-based GCI detection architecture is shown in Fig. 2.
3.1 Feature Representation, Training and Testing Dataset
To leverage the advantages of both raw speech and the linear prediction residual (LPR), we trained the deep models on the following input representations. 1) The low-pass filtered pathological speech signal (LPF_S). Speech is low-pass filtered since high-frequency components do not contribute to the GCI. 2) The low-pass filtered LPR (LPF_LPR) with LP order 12. The LPR is a noisy residual signal containing unwanted high-frequency content and is therefore low-pass filtered. 3) The positive-clipped low-pass filtered speech (PC_LPF_S) and LPR (PC_LPF_LPR). The signals are positive clipped since most of the GCI information is present in the negative portion of the signal. Low-pass filtering is performed with a zero-phase, sixth-order low-pass Butterworth filter with a cut-off frequency of 1000 Hz. The contemporaneous EGG signal recorded along with the speech is used as a reference to manually mark the GCI locations on the speech signal. Speech signals are downsampled to 16 kHz and switched to negative polarity before labeling. The negative peaks in the derivative of the EGG signal are taken as the reference for placing the GCI markers, after compensating for the delay between the EGG and the corresponding speech signal. The input representation signals are chunked into frames of 16 samples, and each frame is assigned a label for the presence or absence of a GCI. The frame of 16 samples around the GCI (which captures the GCI slope, shape and amplitude), shown in Fig. 3 as a dotted box, is labeled as a GCI frame and the rest as non-GCI frames for training the model. The training and testing data consist of 62 and 18 pathological speech samples respectively. It should be noted that each sample in the dataset is from a unique patient, and more than one patient can have the same type of vocal disorder.
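The positive clipping and frame labeling steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the frames are assumed to be non-overlapping, a frame is assumed to be a GCI frame when it contains a GCI marker, and a trailing partial frame is dropped. Low-pass filtering and LPR computation are assumed to have been applied beforehand.

```python
from typing import List, Sequence, Tuple


def positive_clip(signal: Sequence[float]) -> List[float]:
    """Set positive samples to zero, keeping only the negative portion."""
    return [min(x, 0.0) for x in signal]


def frame_and_label(
    signal: Sequence[float],
    gci_locations: Sequence[int],
    frame_len: int = 16,
) -> Tuple[List[List[float]], List[int]]:
    """Split a signal into frames and label each frame as GCI (1) or non-GCI (0).

    Assumptions (not stated in the paper): frames do not overlap, the
    trailing partial frame is discarded, and a frame is a GCI frame when
    a GCI marker (sample index) falls inside it.
    """
    n_frames = len(signal) // frame_len
    frames = [
        list(signal[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)
    ]
    labels = [0] * n_frames
    for gci in gci_locations:
        idx = gci // frame_len
        if 0 <= idx < n_frames:
            labels[idx] = 1
    return frames, labels
```

For example, a 48-sample signal with markers at samples 5 and 40 yields three frames, of which the first and third are labeled as GCI frames.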
3.2 Multi-Column CNN Model
The proposed multi-column deep CNN model for GCI detection is shown in Fig. 2. The final model consists of three CNN networks, each trained with one of the input representations discussed in Section 3.1. Each CNN network consists of five convolution layers, and each convolution layer is followed by batch normalization. We omit max pooling layers because we want the model to capture the variations of the GCI regions caused by the stochastic nature of the input signal. The resulting feature vector from the CNN layers is densely connected to a sigmoid activation function that predicts the posterior classification probability for each frame. The binary cross entropy loss is optimized with the ADAM optimizer at a learning rate of 0.0001. The model is trained for 30 epochs, until there is no change in the validation loss. The weights are initialized from the He normal distribution.
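For reference, the sigmoid output and the binary cross entropy loss named above can be written out as follows. This is the standard form of both; the symbols (feature vector h_i of frame i, dense weights w and bias b, label y_i, number of training frames N) are notation introduced here, and averaging over N is an assumption.

```latex
p_i = \sigma\!\left(\mathbf{w}^{\top}\mathbf{h}_i + b\right)
    = \frac{1}{1 + e^{-\left(\mathbf{w}^{\top}\mathbf{h}_i + b\right)}},
\qquad
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}
  \Big[\, y_i \log p_i + (1 - y_i)\log\left(1 - p_i\right) \Big]
```

Here y_i = 1 for a GCI frame and y_i = 0 otherwise. In its usual formulation, He normal initialization draws each weight from a zero-mean normal distribution with variance 2/n_in, where n_in is the number of inputs to the unit.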
Initially, we trained a single-column CNN model for each input feature representation discussed in Section 3.1 separately, to evaluate its significance for pathological GCI detection. We denote the models trained with LPF_S, LPF_LPR, PC_LPF_S and PC_LPF_LPR as Model 1, Model 2, Model 3 and Model 4 respectively. The F1-scores of Model 1, Model 2, Model 3 and Model 4 are 86.82, 82.94, 86.02 and 85.52 respectively. The models trained with LPF_S, PC_LPF_S and PC_LPF_LPR thus achieve better F1-scores than Model 2, which is trained with LPF_LPR. Further investigation of the predicted class probabilities from Model 1, Model 3 and Model 4 revealed that Model 3 assigned a slightly high probability to non-GCI frames where secondary excitations are prominent, which results in false alarms. Model 4 assigned very low probability to frames in low-voiced and transition regions, which results in high miss rates. However, Model 4 also assigns very low probability to frames with dominant secondary excitation, as shown in Fig. 4, which helps reduce false alarms. Hence, we trained the joint acoustic-residual model shown in Fig. 2 to reap the benefits of both models. In this joint model, we extract the learned features from Model 3 and Model 4, append them together, and connect them to a densely connected sigmoid activation function to predict the class probability. Since the model trained with the low-pass filtered speech signal LPF_S also gave good results, we combine the posterior probabilities of the joint acoustic-residual model and the model trained with LPF_S, i.e., Model 1, in a maximum likelihood sense to predict the final class probability for classification. Frames that achieve a class probability greater than or equal to 0.1 are classified as GCI frames. The location of the maximum negative peak in a classified frame is taken as the glottal closure instant.
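The final decision step described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and combining the two posteriors "in a maximum likelihood sense" is interpreted here as taking the larger of the two probabilities for each frame, which is an assumption. The 0.1 threshold and the choice of the most negative sample in a GCI frame follow the text.

```python
from typing import List, Sequence


def fuse_posteriors(
    p_joint: Sequence[float], p_model1: Sequence[float]
) -> List[float]:
    """Combine per-frame posteriors of the joint model and Model 1.

    Assumption: the combination is the element-wise maximum.
    """
    if len(p_joint) != len(p_model1):
        raise ValueError("posterior sequences must have the same length")
    return [max(a, b) for a, b in zip(p_joint, p_model1)]


def detect_gci(
    frames: Sequence[Sequence[float]],
    probs: Sequence[float],
    threshold: float = 0.1,
) -> List[int]:
    """Return the sample index of the detected GCI in every GCI frame.

    A frame is a GCI frame when its probability is >= threshold; the GCI
    is placed at the most negative sample of that frame. Frames are
    assumed to be consecutive and non-overlapping.
    """
    gcis = []
    offset = 0  # index of the first sample of the current frame
    for frame, p in zip(frames, probs):
        if p >= threshold:
            local = min(range(len(frame)), key=lambda i: frame[i])
            gcis.append(offset + local)
        offset += len(frame)
    return gcis
```

With this interpretation, a frame is kept whenever either model is sufficiently confident, which favors a low miss rate; a product or average of the posteriors would be a more conservative alternative.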
4 Evaluation, Results and Summary
The proposed method is assessed with established reliability and accuracy measures. Identification Rate (IDR): the percentage of GCI for which exactly one GCI is detected. Miss Rate (MR): the percentage of GCI for which no GCI is detected. False Alarm Rate (FAR): the percentage of GCI for which more than one GCI is detected. Identification Accuracy (IDA): the standard deviation of the timing error between the detected and the reference GCI locations (lower is better). We compared our method with popular state-of-the-art GCI detection methods: SEDREAMS, DYPSA and ZFF. The evaluation results of the proposed method and the other state-of-the-art methods are shown in Table 1. From Table 1, we can observe that the proposed method is significantly better than the other methods at detecting GCI, with a higher IDR and lower miss and false alarm rates. In summary, we proposed a joint acoustic-residual classification-based GCI detection method for pathological speech. The evaluation results show that the proposed method performs significantly better than the other methods, which gives hope for future research on GCI detection and classification of vocal disorders directly from acoustic speech.
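The four measures above can be computed along the following lines. This is a sketch under stated assumptions, not the evaluation code used for Table 1: the function name is hypothetical, each reference GCI is assumed to own the interval bounded by the midpoints to its neighboring reference GCIs, the first and last intervals mirror the distance to their only neighbor, and IDA is computed as a population standard deviation in milliseconds.

```python
from statistics import pstdev
from typing import Dict, Sequence


def evaluate_gci(
    reference: Sequence[int], detected: Sequence[int], fs: int = 16000
) -> Dict[str, float]:
    """Compute IDR, MR, FAR (percent) and IDA (ms) from GCI sample indices."""
    ref = sorted(reference)
    det = sorted(detected)
    n = len(ref)
    if n < 2:
        raise ValueError("need at least two reference GCIs")

    hits = misses = false_alarms = 0
    errors_ms = []
    for i, r in enumerate(ref):
        # Interval owned by this reference GCI (assumed midpoint boundaries).
        left = (ref[i - 1] + r) / 2 if i > 0 else r - (ref[1] - r) / 2
        right = (r + ref[i + 1]) / 2 if i < n - 1 else r + (r - ref[i - 1]) / 2
        inside = [d for d in det if left <= d < right]
        if len(inside) == 1:
            hits += 1
            errors_ms.append((inside[0] - r) * 1000 / fs)
        elif not inside:
            misses += 1
        else:
            false_alarms += 1

    return {
        "IDR": 100 * hits / n,
        "MR": 100 * misses / n,
        "FAR": 100 * false_alarms / n,
        # Standard deviation of the timing error over identified GCIs only.
        "IDA_ms": pstdev(errors_ms) if errors_ms else 0.0,
    }
```

By construction the three rates sum to 100, since every reference GCI is counted as identified, missed, or a false alarm.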
- [1] Lawrence R Rabiner and Ronald W Schafer. Digital processing of speech signals, volume 100. Prentice-Hall, Englewood Cliffs, NJ, 1978.
- [2] Akihito Yamauchi, Hisayuki Yokonishi, Hiroshi Imagawa, Ken-Ichi Sakakibara, Takaharu Nito, Niro Tayama, and Tatsuya Yamasoba. Quantification of vocal fold vibration in various laryngeal disorders using high-speed digital imaging. Journal of Voice, 30(2):205–214, 2016.
- [3] Donald G Childers and Ashok K Krishnamurthy. A critical review of electroglottography. Critical Reviews in Biomedical Engineering, 12(2):131–161, 1985.
- [4] Pranav S Deshpande and M Sabarimalai Manikandan. Effective glottal instant detection and electroglottographic parameter extraction for automated voice pathology assessment. IEEE Journal of Biomedical and Health Informatics, 22(2):398–408, 2018.
- [5] Shaheen N Awan and Jordan A Awan. The effect of gender on measures of electroglottographic contact quotient. Journal of Voice, 27(4):433–440, 2013.
- [6] Mark RP Thomas and Patrick A Naylor. The SIGMA algorithm: A glottal activity detector for electroglottographic signals. IEEE Transactions on Audio, Speech, and Language Processing, 17(8):1557–1566, 2009.
- [7] Martin Rothenberg and James J Mahshie. Monitoring vocal fold abduction through vocal fold contact area. Journal of Speech, Language, and Hearing Research, 31(3):338–351, 1988.
- [8] Vinay Kumar Mittal, B Yegnanarayana, and Peri Bhaskararao. Study of the effects of vocal tract constriction on glottal vibration. The Journal of the Acoustical Society of America, 136(4):1932–1941, 2014.
- [9] Ann-Christine Mecke, Johan Sundberg, Svante Granqvist, and Matthias Echternach. Comparing closed quotient in children singers’ voices as measured by high-speed-imaging, electroglottography, and inverse filtering. The Journal of the Acoustical Society of America, 131(1):435–441, 2012.
- [10] Lijiang Chen, Xia Mao, Pengfei Wei, and Angelo Compare. Speech emotional features extraction based on electroglottograph. Neural Computation, 25(12):3294–3317, 2013.
- [11] Patrick A Naylor, Anastasis Kounoudes, Jon Gudnason, and Mike Brookes. Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. 2007.
- [12] Mark RP Thomas, Jon Gudnason, and Patrick A Naylor. Estimation of glottal closing and opening instants in voiced speech using the YAGA algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):82–91, 2012.
- [13] AP Prathosh, P Sujith, AG Ramakrishnan, and Prasanta Kumar Ghosh. Cumulative impulse strength for epoch extraction. IEEE Signal Processing Letters, 23(4):424–428, 2016.
- [14] Andreas I Koutrouvelis, George P Kafentzis, Nikolay D Gaubitch, and Richard Heusdens. A fast method for high-resolution voiced/unvoiced detection and glottal closure/opening instant estimation of speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(2):316–328, 2016.
- [15] Gunnar Fant. Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, volume 2. Walter de Gruyter, 2012.
- [16] K Sri Rama Murty and B Yegnanarayana. Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8):1602–1613, 2008.
- [17] Thomas Drugman and Thierry Dutoit. Glottal closure and opening instant detection from speech signals. In Tenth Annual Conference of the International Speech Communication Association, 2009.
- [18] Christophe d’Alessandro and Nicolas Sturmel. Glottal closure instant and voice source analysis using time-scale lines of maximum amplitude. Sadhana, 36(5):601–622, 2011.
- [19] Vahid Khanagha, Khalid Daoudi, and Hussein M Yahia. Detection of glottal closure instants based on the microcanonical multiscale formalism. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1941–1950, 2014.
- [20] G Anushiya Rachel, P Vijayalakshmi, and T Nagarajan. Estimation of glottal closure instants from degraded speech using a phase-difference-based algorithm. Computer Speech & Language, 46:136–153, 2017.
- [21] Kebin Wu, David Zhang, and Guangming Lu. GMAT: Glottal closure instants detection based on the multiresolution absolute Teager–Kaiser energy operator. Digital Signal Processing, 69:286–299, 2017.
- [22] Jindřich Matoušek and Daniel Tihelka. Classification-based detection of glottal closure instants from speech signals. Proc. Interspeech 2017, pages 3053–3057, 2017.
- [23] Jindřich Matoušek and Daniel Tihelka. Glottal closure instant detection from speech signal using voting classifier and recursive feature elimination. Proc. Interspeech 2018, pages 2112–2116, 2018.
- [24] Pradeep Rengaswamy, Gurunath Reddy M, K Sreenivasa Rao, and Pallab Dasgupta. A robust non-parametric and filtering based approach for glottal closure instant detection. 2016.
- [25] HE Kubitschek. Normal distribution of cell generation rate. Experimental Cell Research, 26(3):439–450, 1962.