
A Comparative Study of Estimating Articulatory Movements From Phoneme Sequences and Acoustic Features

Abstract

Unlike phoneme sequences, the movements of speech articulators (lips, tongue, jaw, velum) and the resultant acoustic signal are known to encode not only the linguistic message but also para-linguistic information. While several works exist for estimating articulatory movements from acoustic signals, little is known about the extent to which articulatory movements can be predicted from linguistic information alone, i.e., the phoneme sequence. In this work, we estimate articulatory movements from three different input representations: R1) acoustic signal, R2) phoneme sequence, R3) phoneme sequence with timing information. While an attention network is used for estimating articulatory movements in the case of R2, a BLSTM network is used for R1 and R3. Experiments with ten subjects’ acoustic-articulatory data reveal that the estimation techniques achieve average correlation coefficients of 0.85, 0.81, and 0.81 for R1, R2, and R3, respectively. This indicates that the attention network, although it uses only the phoneme sequence (R2) without any timing information, achieves estimation performance similar to that obtained with the rich acoustic signal (R1), suggesting that articulatory motion is primarily driven by the linguistic message. The correlation coefficient further improves to 0.88 when R1 and R3 are used together for estimating articulatory movements.

Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Electrical Engineering, Indian Institute of Science (IISc), Bangalore-560012, India

Keywords: Attention network, BLSTM, electromagnetic articulograph, acoustic-to-articulatory inversion

1 Introduction

In speech production, articulatory movements provide an intermediate representation between neuro-motor planning (high level) and speech acoustics (low level) [6]. Fig. 1 illustrates the top-down process involved in human speech production. Neuro-motor planning in the brain primarily aims to convey linguistic information (to express the thought), which consists of discrete abstract units. This information is passed through motor nerves to activate vocal muscles, resulting in different temporally overlapping gestures of the speech articulators (namely the lips, tongue tip, tongue body, tongue dorsum, velum, and larynx), each of which regulates constriction in a different part of the vocal tract [13, 10]. These articulatory gestures, in turn, modulate the spectrum of the acoustic signal, which results in the speech sound wave. The acoustic features extracted from speech embed para-linguistic information, including speaker identity, age, gender, and emotional state, along with the linguistic content. The para-linguistic information is often encoded in the dynamics of muscle coordination, articulatory timing, and the vocal tract morphology of the speaker. To estimate articulatory movements, in this work we consider phonemes as a representative of the high-level information and Mel-Frequency Cepstral Coefficients (MFCC) as the acoustic features. We also consider a representation that combines timing information with the phoneme sequence; it captures both linguistic and timing information but lacks the para-linguistic information, and can be treated as an intermediate representation between phonemes and MFCCs. From the acoustic features perspective, MFCCs have been shown to carry maximal mutual information with articulatory features [12, 16]. Unlike estimation from MFCCs (frame-to-frame estimation), estimating articulatory movements from discrete phonemes is a top-down approach and very challenging due to the absence of timing information. It is unclear to what extent articulatory movements can be estimated from phonemes alone. In this work, we investigate the accuracy with which articulatory representations can be predicted from the phoneme sequence, and how it differs from prediction using acoustic features and the phoneme sequence with timing information.

Knowledge of articulatory position information, along with the speech acoustics, has been shown to benefit applications such as language learning [24, 7], automatic speech recognition [9, 17], and speech synthesis [15, 18]. A rich literature exists on estimating articulatory movements from acoustic features of speech, a task typically known as acoustic-to-articulatory inversion (AAI). Various approaches have been proposed, including Gaussian Mixture Models (GMM) [26], Hidden Markov Models (HMM) [28], and neural networks [23, 27]. The state-of-the-art performance is achieved by bidirectional long short-term memory (BLSTM) networks [14, 20].

Figure 1: Top-down process involved in human speech production mechanism [6]

On the other hand, there have been only a few attempts at estimating articulatory movements from text or phonological units. These attempts typically deployed techniques from the speech synthesis paradigm, using HMMs [19] and BLSTMs [29]. In fact, these works compared their respective performance with that of AAI. The works in [19, 29] reported a significant drop in performance with phonological features compared to that using acoustic features (AAI model). The main reason for the drop in performance from acoustic features to phonemes could be the limitations of the duration model with HMMs [19, 29].

With advancements in neural network based modelling approaches, it has been shown that attention networks are able to capture the duration model more accurately and achieve state-of-the-art performance in speech synthesis applications [25]. In this work, we deploy the Tacotron [25] based speech synthesis approach for phoneme-to-articulatory movement prediction. Since the Tacotron based model needs a large amount of data to learn the alignments and the phoneme-to-articulatory movement mapping, we deploy generalized model training using ten subjects’ training data followed by a fine-tuning approach for each subject, as done for AAI in [14]. A systematic comparison of articulatory movement prediction using different features reveals that a correlation coefficient of 0.81 between predicted and original articulatory movements is obtained when the phoneme sequence is used for prediction, without any timing information. On the other hand, the acoustic feature based prediction achieves a correlation coefficient of 0.85, indicating that articulatory movements are primarily driven by linguistic information.

2 Data set

In this work, we consider a set of 460 phonetically balanced English sentences from the MOCHA-TIMIT corpus as the stimuli for data collection from a group of 10 subjects comprising 6 males (M1, M2, M3, M4, M5, M6) and 4 females (F1, F2, F3, F4) in the age group of 20-28 years. All subjects are native Indians with proficiency in English and reported no history of speech disorders. All subjects were familiarized with the 460 sentences to obviate any elocution errors during recording. For each sentence, we simultaneously recorded audio using a microphone [8] and articulatory movement data using an electromagnetic articulograph (EMA) AG501 [1]. The EMA AG501 has 24 channels to measure the horizontal, vertical, and lateral displacements and angular orientations of a maximum of 24 sensors. The articulatory movement data were collected at a sampling rate of 250 Hz. The sensors were placed following the guidelines provided in [21]. A schematic diagram of the sensor placement is shown in Fig. 2.

Figure 2: Schematic diagram indicating the placement of EMA sensors [14].

As indicated in Fig. 2, we used 6 sensors glued to different articulators, viz. the Upper Lip (UL), Lower Lip (LL), Jaw, Tongue Tip (TT), Tongue Body (TB), and Tongue Dorsum (TD). Each of these six sensors captured the movement of an articulator in 3D space. Additionally, we glued two sensors behind the ears for head movement correction. In this study, we consider the movements only in the midsagittal plane, indicated by the X and Y directions in Fig. 2. Thus, we have twelve articulatory trajectories, denoted by UL_x, UL_y, LL_x, LL_y, Jaw_x, Jaw_y, TT_x, TT_y, TB_x, TB_y, TD_x, and TD_y. Before placing the sensors, the subjects were made to sit comfortably in the EMA recording setup. The subjects were given sufficient time to get used to speaking naturally with the sensors attached to the different articulators. Manual annotations were performed to remove silence at the start and end of each sentence.

3 Proposed Approach

In this section, we first present the proposed approach for phoneme-to-articulatory movement prediction, then describe a BLSTM based AAI approach, followed by a training scheme to overcome the scarcity of acoustic-articulatory data.

Articulatory movement estimation using attention network: In this work, we deploy the state-of-the-art speech synthesis model, the Tacotron architecture [25], to model duration information for articulatory movement estimation from phonemes. There are three major components in the Tacotron model: encoder, attention, and decoder, as shown in Fig. 3. The encoder takes a discrete sequence of phonemes as input and maps it to hidden states, which act as input to the attention network. The attention network models the time alignment between the encoder and decoder hidden states. The decoder hidden states are utilized to generate the articulatory movement trajectories over time.
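To make the encoder part of this pipeline concrete, the following PyTorch sketch builds a simplified phoneme encoder along the lines described in the next paragraph (40 phoneme symbols, a 512-dimensional embedding, three convolution layers, and a BLSTM). The kernel size and the per-direction BLSTM width are illustrative assumptions, not details taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Simplified sketch of a Tacotron-style encoder for phoneme input.

    Dimensions follow the description in the text (40 symbols, 512-dim
    embedding, 3 conv layers, BLSTM with 512-dim hidden states); the actual
    implementation may differ.
    """
    def __init__(self, n_symbols=40, emb_dim=512, n_conv=3, kernel=5):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(n_conv)
        ])
        # Bidirectional LSTM: 256 units per direction -> 512-dim hidden states
        self.blstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True,
                             bidirectional=True)

    def forward(self, phoneme_ids):            # (batch, T_phn) integer ids
        x = self.embedding(phoneme_ids)         # (batch, T_phn, 512)
        x = x.transpose(1, 2)                   # conv expects (batch, 512, T_phn)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)
        h, _ = self.blstm(x)                    # encoder hidden states
        return h
```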

Each phoneme is represented by a 40-dimensional one-hot vector, which is fed to a 512-dimensional embedding layer, followed by a stack of 3 convolution layers and a BLSTM layer with 512 hidden units; these make up the encoder block of the model as depicted in Fig. 3. The attention network is a location-sensitive attention network which iterates over the previous decoder output ($s_{i-1}$), the previous attention weights ($\alpha_{i-1}$), and all the encoder hidden states ($h$). The attention mechanism is governed by the equations below [5]:

$\alpha_i = \mathrm{Attend}(s_{i-1}, \alpha_{i-1}, h)$   (1)
$e_{i,j} = w^{T} \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$   (2)
$c_i = \sum_{j} \alpha_{i,j} h_j$   (3)

where $\alpha_i$ are the attention weights (alignments), the attention network parameters are the weight matrices $W$, $V$, and $U$, and $w$ and $b$ denote the projection and bias vectors, respectively. In Eq. (2), $f_{i,j}$ is computed by $f_i = F \ast \alpha_{i-1}$, which incorporates location awareness into the attention network [5], where $F$ is a weight matrix and $\alpha_{i-1}$ is the set of previous time-step alignments. In Eq. (2), the attention scores are computed as a function of the encoder outputs ($h$), the previous attention weights ($\alpha_{i-1}$), and the decoder output ($s_{i-1}$); these scores are normalized using a softmax to obtain the attention weights, which are then used to compute a fixed-length context vector $c_i$ as described by Eq. (3).
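As a minimal illustration of Eqs. (1)-(3), the NumPy sketch below computes one attention step for a single decoder query. The shapes and the simplified location-feature computation are assumptions made for clarity; the actual model convolves the previous alignments with learned 1-D filters.

```python
import numpy as np

def attention_step(h, s_prev, alpha_prev, W, V, U, F, w, b):
    """One location-sensitive attention step (the Attend function of Eq. (1)).

    h          : (T, d_h)  encoder hidden states h_j
    s_prev     : (d_s,)    previous decoder state s_{i-1}
    alpha_prev : (T,)      previous attention weights alpha_{i-1}
    W (d_s, d_a), V (d_h, d_a), U (d_f, d_a) : weight matrices
    F (d_f,) : stand-in for the location filters; w (d_a,), b (d_a,) : projection, bias
    """
    # Location features: here each position j just scales F by alpha_{i-1,j};
    # the real model convolves alpha_{i-1} with learned 1-D filters.
    f = alpha_prev[:, None] * F[None, :]                     # (T, d_f)
    # Energies e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)  -- Eq. (2)
    e = np.tanh(s_prev @ W + h @ V + f @ U + b) @ w          # (T,)
    # Softmax normalization to obtain attention weights alpha_{i,j}
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Context vector c_i = sum_j alpha_{i,j} h_j                      -- Eq. (3)
    c = alpha @ h
    return alpha, c
```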

The decoder consists of two uni-directional LSTM layers with 1024 units, followed by a linear projection layer to predict the articulatory movements, as shown in Fig. 3. The decoder computes the final output from the previous decoder output and the attention context vector ($c_i$). Before entering the decoder, the previous output is transformed by a two-layer fully-connected network with 256 units (Pre-Net). The decoder hidden state outputs are further projected using two linear layers, one for articulatory sequence prediction and the other to predict the end of the sequence. The predicted articulatory trajectories are passed through a 5-layer convolutional Post-Net which predicts a residual that is added to the prediction to improve the overall reconstruction. Each layer in the Post-Net consists of 512 filters followed by a batch normalization layer, with tanh activation at the end of every layer except the final one. For end-of-sequence prediction, the decoder LSTM output and the attention context are concatenated, projected down to a scalar, and passed through a sigmoid activation to predict the probability that the output sequence has completed. This “Stop Token” prediction is used during inference to allow the model to determine dynamically when to terminate generation instead of always generating for a fixed duration.
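A simplified PyTorch sketch of one autoregressive decoder step follows, covering the Pre-Net, the two 1024-unit LSTM layers, and the frame and stop-token projections; the Post-Net and the full decoding loop are omitted, and all names and default sizes are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """Single decoder step (illustrative; Post-Net and attention omitted)."""
    def __init__(self, enc_dim=512, art_dim=12, prenet_dim=256, rnn_dim=1024):
        super().__init__()
        # Pre-Net: two fully-connected layers with 256 units
        self.prenet = nn.Sequential(
            nn.Linear(art_dim, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(),
        )
        # Two uni-directional LSTM layers with 1024 units
        self.rnn1 = nn.LSTMCell(prenet_dim + enc_dim, rnn_dim)
        self.rnn2 = nn.LSTMCell(rnn_dim, rnn_dim)
        # Linear projections: articulatory frame and stop-token logit
        self.art_proj = nn.Linear(rnn_dim + enc_dim, art_dim)
        self.stop_proj = nn.Linear(rnn_dim + enc_dim, 1)

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h1, c1 = self.rnn1(x, (h1, c1))
        h2, c2 = self.rnn2(h1, (h2, c2))
        y = torch.cat([h2, context], dim=-1)
        frame = self.art_proj(y)                 # 12-dim articulatory positions
        stop = torch.sigmoid(self.stop_proj(y))  # probability the sequence has ended
        return frame, stop, ((h1, c1), (h2, c2))
```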

Figure 3: Block diagram for phoneme sequence to articulatory movement prediction approach [25].

Articulatory movement estimation using BLSTM: Acoustic features and phoneme features with timing information have a one-to-one correspondence between the input and the output (articulatory position) features, and thereby implicitly encode timing information. Therefore, we do not need to model timing explicitly for articulatory movement prediction in these cases. Deep recurrent neural network architectures, namely BLSTM networks, have been shown to perform well in capturing context and smoothing characteristics, and achieve state-of-the-art AAI performance [14, 20]. Hence, in this work, we deploy a BLSTM network to estimate articulatory movements from acoustic features and from phoneme features with timing information. The input features are fed to three consecutive BLSTM layers with 256 units each, and at the output we use a time-distributed dense layer as a linear regression layer.
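Since the BLSTM experiments were run in Keras (Section 4), a minimal sketch of this architecture could look as follows; the optimizer and the MSE loss (standing in for the RMSE objective mentioned later) are assumptions.

```python
from tensorflow.keras.layers import Input, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.models import Model

def build_blstm_aai(input_dim, n_articulators=12):
    """Three stacked BLSTM layers (256 units each) followed by a
    time-distributed linear regression layer, as described in the text."""
    inp = Input(shape=(None, input_dim))          # e.g. 13 for MFCC, 40 for TPHN
    x = inp
    for _ in range(3):
        x = Bidirectional(LSTM(256, return_sequences=True))(x)
    out = TimeDistributed(Dense(n_articulators, activation='linear'))(x)
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')   # MSE as a proxy for the RMSE objective
    return model
```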

Generalized model and fine-tuning: Neural network models typically demand a large amount of training data to achieve good performance. To overcome the scarcity of articulatory data for training a subject-specific model, we deploy the training approach proposed in [14]. In the first stage, we pool the training data from all subjects and train a generic model to learn the mapping from the input features to the target articulatory trajectories. In the second stage, we fine-tune the generic model weights with respect to the target speaker to learn speaker-specific models.
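In code, the two-stage scheme amounts to training once on pooled data and then continuing training per subject from the generic weights. The sketch below reuses the hypothetical build_blstm_aai helper above; the data arrays, epoch counts, and reduced learning rate are placeholders, not values reported in the paper.

```python
from tensorflow.keras.optimizers import Adam

# Stage 1: generic model trained on data pooled from all 10 subjects
generic = build_blstm_aai(input_dim=13)              # e.g. MFCC input
generic.fit(X_pooled_train, Y_pooled_train,          # placeholder arrays
            validation_data=(X_pooled_val, Y_pooled_val),
            batch_size=25, epochs=50)

# Stage 2: fine-tune a copy of the generic weights for each target subject
fine_tuned = {}
for subj, (X_tr, Y_tr, X_va, Y_va) in subject_data.items():   # placeholder dict
    model = build_blstm_aai(input_dim=13)
    model.set_weights(generic.get_weights())         # start from the generic mapping
    model.compile(optimizer=Adam(learning_rate=1e-4), loss='mse')
    model.fit(X_tr, Y_tr, validation_data=(X_va, Y_va),
              batch_size=25, epochs=20)
    fine_tuned[subj] = model
```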

4 Experimental Setup

Data pre-processing and feature computation: The articulatory data is first low-pass filtered to remove high-frequency noise incurred during the measurement process. The cutoff frequency of the low-pass filter is 25 Hz, which preserves the smoothly varying nature of the articulatory movements [12]. Further, we perform sentence-wise mean removal and variance normalization along each articulatory dimension. The recorded speech is down-sampled from 48 kHz to 16 kHz, following which forced alignment is performed to obtain phonetic transcriptions using Kaldi [22]. The resultant phonetic transcription of the dataset consists of 39 ARPABET symbols. For experiments with phonemes, we use the 39 discrete phonemes plus a start token, encoded as 40-dimensional one-hot vectors, denoted as “PHN”. We also incorporate timing information into the phoneme sequence using the phonetic boundaries obtained from the forced alignment: in this feature set, we replicate the one-hot vector every 10 msec over the entire duration of the corresponding phoneme, which we denote by “TPHN”. For acoustic features, we compute 13-dimensional Mel-frequency cepstral coefficients (MFCC), which have been shown to be optimal for the AAI task [12, 16]. The TPHN feature set carries both linguistic and timing information and hence acts as an intermediate representation between MFCC and PHN. We also experimented with the concatenation of MFCC and TPHN to observe whether there is any benefit in providing linguistic information explicitly along with MFCC. A summary of the input features, the information they are hypothesized to encode, and the models used to estimate articulatory movements is reported in Table 1.
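A rough sketch of this feature preparation is given below, using SciPy for the 25 Hz low-pass filter, simple one-hot replication at a 10 ms frame rate for TPHN, and librosa for 13-dimensional MFCCs. The filter order and the MFCC window settings are assumptions, not values stated in the paper.

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def lowpass_ema(traj, fs=250.0, cutoff=25.0, order=5):
    """Low-pass filter one articulatory trajectory at 25 Hz (order is illustrative)."""
    b, a = butter(order, cutoff / (fs / 2.0), btype='low')
    return filtfilt(b, a, traj)

def make_tphn(phonemes, durations_sec, phone_to_id, n_symbols=40, frame_sec=0.01):
    """Replicate each phoneme's one-hot vector every 10 ms over its duration (TPHN)."""
    frames = []
    for ph, dur in zip(phonemes, durations_sec):
        onehot = np.zeros(n_symbols)
        onehot[phone_to_id[ph]] = 1.0
        frames.extend([onehot] * int(round(dur / frame_sec)))
    return np.stack(frames)                     # (T_frames, 40)

def make_mfcc(wav_path):
    """13-dim MFCCs at a 10 ms hop from 16 kHz audio (window length assumed)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                hop_length=160, n_fft=400)
    return mfcc.T                               # (T_frames, 13)
```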

Input Features | Encoded Information | Model
Phoneme Sequence (PHN) | linguistic | Attention
Time-Aligned Phonemes (TPHN) | linguistic + timing | BLSTM
MFCC | linguistic + para-linguistic + timing | BLSTM
MFCC+TPHN | linguistic + para-linguistic + timing | BLSTM

Table 1: Summary of input features, corresponding encoded information and models used for articulatory movement estimation.

Model training and evaluation procedure: The recorded acoustic-articulatory data from the 460 sentences are divided into train, validation, and test sets for each subject. For BLSTM network training, we zero-pad all sentences in the train and validation sets to a fixed length of 400 frames (4 sec) and use a batch size of 25. We choose the root mean squared error (RMSE) and the correlation coefficient (CC) as evaluation metrics to assess the performance of the articulatory movement estimation techniques. The RMSE and CC between the original and estimated trajectories are computed for each articulator separately [12, 14]. Note that the RMSE is computed on articulatory trajectories which are mean and variance normalized, so the RMSE does not have any units.
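The per-articulator RMSE and CC can be computed as in the short NumPy sketch below (trajectories are assumed to be already mean and variance normalized, as stated above).

```python
import numpy as np

def rmse_and_cc(y_true, y_pred):
    """Per-articulator RMSE and Pearson CC between original and estimated
    trajectories; inputs are (T, 12) arrays of normalized positions."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))           # (12,)
    cc = np.array([np.corrcoef(y_true[:, k], y_pred[:, k])[0, 1]
                   for k in range(y_true.shape[1])])                  # (12,)
    return rmse, cc
```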

For experiments with PHN, we adopt a teacher forcing approach [25] while training the attention network: we pass the previous ground-truth articulatory positions as input to the attention network and decoder, instead of the decoder’s own previous output. During testing, we perform DTW alignment using a Euclidean distance metric between the predicted and the original articulatory trajectories, and then compute the RMSE and CC to assess model performance. We use RMSE as the objective for learning the weights of all the models.
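Because the attention model generates its own output length, the predicted and original trajectories are first time-aligned before scoring. The following NumPy sketch implements a basic DTW alignment with Euclidean frame distances; a library routine (e.g., librosa.sequence.dtw) could be used instead, and the aligned pair can then be scored with the rmse_and_cc sketch above.

```python
import numpy as np

def dtw_align(pred, ref):
    """Align predicted (Tp, 12) and reference (Tr, 12) trajectories with DTW
    using Euclidean frame distances; returns the two time-aligned trajectories."""
    Tp, Tr = len(pred), len(ref)
    dist = np.linalg.norm(pred[:, None, :] - ref[None, :, :], axis=-1)
    cost = np.full((Tp + 1, Tr + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tp + 1):
        for j in range(1, Tr + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
                                                  cost[i, j - 1],
                                                  cost[i - 1, j - 1])
    # Backtrack the optimal warping path
    i, j, path = Tp, Tr, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    idx_p, idx_r = zip(*path)
    return pred[list(idx_p)], ref[list(idx_r)]
```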

Three types of training were performed for all features: subject-dependent, generic, and fine-tuned. In the first case, models were trained for each subject separately, giving 10 models for the 10 subjects, whereas in the second case a single generic model was trained on data pooled from all 10 subjects. In the third case, we trained a model for each subject by fine-tuning the generic model with subject-specific data. All experiments with the BLSTM were performed using Keras [4] with a TensorFlow backend [2]. Experiments with the attention network were performed using the NVIDIA open-source code repository [3].

5 Results and Discussion

In this section, we first present the results of the generalized model and fine-tuning approach for estimating articulatory movements. We then compare the performance across the different features (PHN, TPHN, MFCC, and MFCC+TPHN) in a subject-specific manner.

5.1 Generalized model and fine-tuning

Table 2 reports the performance of the different training approaches across features. Interestingly, we observe that in all cases fine-tuning a model performs better than the subject-dependent model. This implies that pooling the data from all subjects helps in learning a generic mapping across multiple subjects, and fine-tuning with speaker-specific training data improves the speaker-specific mapping. On the other hand, when comparing the generic model with the subject-dependent model, we observe that the TPHN performance drops for the generic model. This could be due to the lack of speaker-specific para-linguistic information in the input features, unlike MFCC, which encodes speaker information and enables the BLSTM to learn mappings for multiple subjects with a single model [14]. Although speaker information is also lacking when PHN is used as the input feature, the performance of the subject-dependent model is worse than that of the generic model; this is primarily due to the scarcity of training data for the complex attention network. The relative improvements from the generic to the fine-tuned model for PHN, TPHN, MFCC, and MFCC+TPHN are 18.53%, 9.3%, 0.5%, and 0.8%, respectively. Unlike MFCC, the greater improvement for PHN could be due to the lack of para-linguistic (speaker) information in the generic model trained on pooled data, which could lead to a coarse duration model across multiple subjects; when fine-tuned with an individual subject’s data, this duration model becomes more subject specific. We also performed experiments with the MNGU0 corpus to compare the Tacotron based PHN model with HMMs [19]. We obtain a CC of 0.78 with the Tacotron model fine-tuned on the MNGU0 corpus, while [19] reports a CC of 0.60 with the HMM based approach.

Training | PHN RMSE | PHN CC | TPHN RMSE | TPHN CC | MFCC RMSE | MFCC CC | MFCC+TPHN RMSE | MFCC+TPHN CC
Subject-Dependent | 2.04 (0.109) | 0.33 (0.031) | 1.243 (0.087) | 0.808 (0.033) | 1.116 (0.095) | 0.844 (0.025) | 1.05 (0.086) | 0.87 (0.024)
Generic | 1.48 (0.098) | 0.68 (0.043) | 1.44 (0.108) | 0.74 (0.046) | 1.107 (0.091) | 0.849 (0.023) | 1.01 (0.083) | 0.877 (0.022)
Fine-Tuned | 1.18 (0.113) | 0.806 (0.039) | 1.239 (0.084) | 0.815 (0.033) | 1.090 (0.088) | 0.854 (0.024) | 0.99 (0.085) | 0.884 (0.021)

Table 2: Performance comparison across different features. Numbers in brackets denote the standard deviation across all test sentences.

Figure 4: CC across all articulators for each speaker

5.2 Comparison of performance across features

We compare the performance of the models with respect to the input features, namely PHN, TPHN, MFCC, and MFCC+TPHN (concatenation of TPHN and MFCC). Fig. 4 illustrates box plots of CC for the different input features, where the x-axis represents the different subjects and the y-axis indicates the CC across all articulators. Comparing the performance of PHN with TPHN, we observe no significant difference in performance for seven subjects, namely M1, M2, M4, M6, F1, F2, and F4; this indicates that the timing information can be recovered from the phoneme sequence and articulatory trajectories using the attention network, which helps it perform similarly to the BLSTM model with TPHN features, where timing information is provided explicitly. We observe that MFCC outperforms both PHN and TPHN features; this could be because articulatory information is maximally preserved when the speech acoustic signal is processed by auditory-like filters such as the mel scale [11, 16]. Experiments reveal that fusing TPHN with MFCC features (MFCC+TPHN) results in the best performance among all the features.

Figure 5: Relating attention weights to transitions in predicted articulatory trajectory in comparison with the corresponding ground truth for each phoneme label

5.3 Illustration of attention weights (α)

To illustrate the attention weights learned for the PHN-to-articulatory movement mapping in the fine-tuned model, we consider an example utterance, “This was easy for us”. In the top subplot of Fig. 5, we plot the ground-truth articulatory trajectories for the lower lip (LL_y) and tongue tip (TT_y), with the corresponding phonetic boundaries indicated by vertical lines. The attention weights are illustrated in the second subplot, where the corresponding phoneme labels are indicated on top of each weight profile. The estimated articulatory trajectories are plotted in the last subplot. Consider the attention weight corresponding to the phoneme ‘z’ in “easy”: when the attention network gives a large weight to the hidden state corresponding to phoneme ‘z’ (indicated by the rectangular box), there is a tongue tip constriction in the predicted trajectories. A similar trend is observed for the attention weight of the phoneme ‘f’ in the word “for”, where we observe a lip closure as a peak in the lower lip vertical (LL_y) movement. We observe that the trends in the original and predicted articulatory trajectories are similar.

6 Conclusion

In this work, we proposed phoneme-to-articulatory movement estimation using attention networks. We experimentally showed that, using the phoneme sequence without any timing information, we achieve estimation performance nearly identical to that obtained when timing information is provided. This implies that attention networks are able to learn the timing information needed to estimate articulatory movements. Experiments performed with different features reveal that MFCC concatenated with TPHN features achieves the best performance in estimating articulatory movements. In the future, we plan to utilize the estimated articulatory movements in speech synthesis and in developing audio-visual speech synthesis systems.

 
Acknowledgements

The authors thank all the subjects for their participation in the data collection. We also thank the Pratiksha Trust and the Department of Science and Technology, Govt. of India for their support of this work.

References

  1. 3D electromagnetic articulograph. Available online: http://www.articulograph.de/, last accessed: 4/2/2020.
  2. M. Abadi et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  3. NVIDIA (2019) Tacotron 2 GitHub repository. https://github.com/NVIDIA/tacotron2
  4. F. Chollet (2015) Keras. GitHub. https://github.com/fchollet/keras
  5. J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho and Y. Bengio (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
  6. P. B. Denes and E. Pinson (1993) The Speech Chain. Macmillan.
  7. U. Desai, C. Yarra and P. K. Ghosh (2018) Concatenative articulatory video synthesis using real-time MRI data for spoken language training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4999–5003.
  8. EM9600 shotgun microphone. Available online: http://www.tbone-mics.com/en/product/information/details/the-tbone-em-9600-richtrohr-mikrofon/, last accessed: 4/2/2020.
  9. J. Frankel, K. Richmond, S. King and P. Taylor (2000) An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces. In Proceedings of the International Conference on Spoken Language Processing, Beijing, China.
  10. P. Gómez-Vilda et al. (2019) Neuromechanical modelling of articulatory movements from surface electromyography and speech formants. International Journal of Neural Systems 29 (02), pp. 1850039.
  11. P. K. Ghosh, L. M. Goldstein and S. S. Narayanan (2011) Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures. The Journal of the Acoustical Society of America 129 (6), pp. 4014–4022.
  12. P. K. Ghosh and S. Narayanan (2010) A generalized smoothness criterion for acoustic-to-articulatory inversion. The Journal of the Acoustical Society of America 128 (4), pp. 2162–2172.
  13. L. Goldstein and C. A. Fowler (2003) Articulatory phonology: a phonology for public language use. Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities, pp. 159–207.
  14. A. Illa and P. K. Ghosh (2018) Low resource acoustic-to-articulatory inversion using bi-directional long short term memory. In Proc. Interspeech, pp. 3122–3126.
  15. A. Illa and P. K. Ghosh (2019) An investigation on speaker specific articulatory synthesis with speaker independent articulatory inversion. In Proc. Interspeech, pp. 121–125.
  16. A. Illa and P. K. Ghosh (2019) Representation learning using convolution neural network for acoustic-to-articulatory inversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5931–5935.
  17. K. Kirchhoff (1999) Robust speech recognition using articulatory information. Ph.D. thesis, University of Bielefeld.
  18. Z. Ling, K. Richmond, J. Yamagishi and R. Wang (2009) Integrating articulatory features into HMM-based parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing 17 (6), pp. 1171–1185.
  19. Z. Ling, K. Richmond and J. Yamagishi (2010) An analysis of HMM-based prediction of articulatory movements. Speech Communication 52 (10), pp. 834–846.
  20. P. Liu, Q. Yu, Z. Wu, S. Kang, H. Meng and L. Cai (2015) A deep recurrent approach for acoustic-to-articulatory inversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4450–4454.
  21. A. K. Pattem, A. Illa and P. K. Ghosh (2018) Optimal sensor placement in electromagnetic articulography recording for speech production study. Computer Speech & Language 47, pp. 157–174.
  22. D. Povey et al. (2011) The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding.
  23. K. Richmond (2006) A trajectory mixture density network for the acoustic-articulatory inversion mapping. In Proceedings of ICSLP, Pittsburgh, pp. 577–580.
  24. C. S, C. Yarra, R. Aggarwal, S. K. Mittal, K. N K, R. K T, A. Singh and P. K. Ghosh (2018) Automatic visual augmentation for concatenation based synthesized articulatory videos from real-time MRI data for spoken language training. In Proc. Interspeech, pp. 3127–3131.
  25. J. Shen et al. (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
  26. T. Toda, A. W. Black and K. Tokuda (2008) Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication 50 (3), pp. 215–227.
  27. Z. Wu, K. Zhao, X. Wu, X. Lan and H. Meng (2015) Acoustic to articulatory mapping with deep neural network. Multimedia Tools and Applications 74 (22), pp. 9889–9907.
  28. L. Zhang and S. Renals (2008) Acoustic-articulatory modeling with the trajectory HMM. IEEE Signal Processing Letters 15, pp. 245–248.
  29. P. Zhu, L. Xie and Y. Chen (2015) Articulatory movement prediction using deep bidirectional long short-term memory based recurrent neural networks and word/phone embeddings. In INTERSPEECH, Dresden, Germany, pp. 2192–2196.