Synthesis of Tongue Motion and Acoustics From
Text using a Multimodal Articulatory Database
We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a 3D model of the tongue surface to an articulatory dataset and training a statistical parametric speech synthesis system directly on the tongue model parameters. We evaluate the model at every step by comparing the spatial coordinates of predicted articulatory movements against the reference data. The results indicate a global mean Euclidean distance of less than 2.8 mm, and our approach can be adapted to add an articulatory modality to conventional TTS applications without the need for extra data.
The sound of human speech is the direct result of production mechanisms in the human vocal tract. Air flows from the lungs through the glottis, whose vocal folds can be set to vibrate, the sound of which is then filtered by the shape of the tongue, lips, and other articulators, generating what we perceive as audible signals such as spoken language. Researchers in phonetics and linguistics have studied these speech production mechanisms for many years, but while the acoustic signal and facial movements can be observed and measured directly, doing the same for partially or fully hidden articulators such as the tongue and glottis is not as straightforward.
Consequently, sensing and imaging techniques have been applied to the challenge of observing speech production mechanisms in vivo, which has greatly improved our understanding of these processes. The corresponding modalities include fluoroscopy [Munhall1995JASA], \acUTI [Stone2005], \acXRMB [Westbury1994], \acEMA [Schoenle1987, Hoole2010], and real-time \acMRI [Niebergall2012, Narayanan2014], among others. Some of these involve health hazards (due to ionizing radiation), and all are more or less invasive, but they produce biosignals which, in combination with simultaneous acoustic recordings, represent multimodal articulatory speech data. These benefits are tempered by the challenges of processing the imaging and/or point-tracking data, which in the field of speech processing has created new opportunities for collaboration with areas such as medical imaging and computer vision.
The biosignals that can be obtained by using such modalities to record spoken language provide opportunities to greatly enhance models of speech by integrating measurements of the underlying production processes directly with the acoustic signal. This leads to more elegant and powerful approaches to speech analysis and synthesis [Ling2009, Ling2010SpeCom, Richmond2015]. However, it must be borne in mind that all of the biosignals produced by the modalities mentioned above represent a sampling of the articulators that is sparse in the temporal domain, the spatial domain, or both.
Depending on the manner in which the data is used for analysis or applications, the resolution may need to be increased, but the missing samples cannot be restored without prior knowledge, typically provided by a statistical model trained on other data.
In this study, we present an approach to multimodal \acTTS synthesis that generates the fully animated, \ac3D surface of the tongue, synchronized with synthetic audio, using data from a single-speaker, articulatory corpus that includes \acEMA motion capture of three tongue fleshpoints [Richmond2011]. The audio and articulatory motion are synthesized using the \acHTS framework [Zen2009], while the surface restoration is performed by means of a multilinear statistical tongue model [Hewer2018CSL] trained on a multi-speaker, volumetric \acMRI dataset [Richmond2012]. The potential application domains of this approach include audiovisual speech synthesis and \acCAPT, among others.
Deriving models suitable for producing speech-related tongue motion is an active field of research. Such models can, for example, help to analyze and understand articulatory data that is very sparse in the spatial domain. Ideally, such tongue models should offer a good compromise between the accuracy of the generated shape and the available \acpDoF for manipulating it. This means that biomechanical models such as those presented by \citetLloyd2012, \citetXu2015, \citetWrench2015, or \citetYu2017 might be too complex for this purpose. While such models aim to simulate the underlying mechanics of the human tongue as closely as possible, and can be used to visualize existing articulatory data, they can be challenging to control efficiently.
Geometric tongue models are less complex than their biomechanical counterparts. Here, we distinguish between generic and statistical tongue models. Generic tongue models are 3D models of the tongue that may be deformed and animated by using standard methods in computer graphics.
Statistical tongue models, on the other hand, are constructed by analyzing the \acpDoF of the tongue shape in recorded articulatory data, such as \acMRI recordings of speech-related vocal tract shapes. Roughly speaking, such an analysis can be carried out in two ways. The first variant investigates shape variations related to the tongue pose that are specific to speech production. Examples of such approaches are the works by \citetEngwall2000, Badin2002, and \citetBadin2006, each of whom examined those variations in \ac3D \acMRI scans from a single speaker. These methods only estimate the \acpDoF that are related to the tongue pose, while shape variations that may describe anatomical differences are missing.
Another class of methods aims at investigating those anatomy and tongue pose related shape variations separately. This paradigm offers several advantages: First, the results give access to tongue models that may be adapted to new speakers. Second, this type of analysis may also provide insight into how anatomical differences affect human articulation. For \ac2D \acMRI, such work was conducted, e.g., by \citetHoole2000 and \citetAnanthakrishnan2010. \citetZheng2003 investigated those variations in a sparse point cloud extracted from 3D \acMRI. Most recently, we performed such an analysis on mesh representations of the tongue that were extracted from 3D \acMRI scans \citepHewer2018CSL.
Such geometrical models have been successfully used in previous work to generate animations from provided articulatory data: \citetKatz2014 presented a real-time visual feedback system that deforms a generic tongue model using \acEMA data. However, due to the generic nature of the model, their approach did not take anatomical differences into account. A statistical model was used in the approach by \citetBadin2008, who used volumetric imaging data of one speaker to derive the tongue model, and \acEMA data of the same speaker to animate it. \citetEngwall2003 followed a similar approach. Our own previous work utilized a multilinear statistical model to visualize \acEMA data, which allowed it to be adapted to different speakers \citepJames2016.
Independently, there is a growing body of work on application-oriented research to combine articulatory data, and features derived from it, with speech technology applications, such as to recover articulatory movements from the acoustic signal (“articulatory inversion mapping”, cf. \citepKing2007, Mitra2011 for examples), provide articulatory control for reactive \acTTS synthesis (e.g., \citepAstrinaki2013, Ling2013), or predict sparse articulatory movements from a symbolic representation (e.g., \citepLing2010SpeCom, Cai2015).
Early studies on animating full 3D tongue surface models using \acEMA data for multimodal speech synthesis, such as those of \citetEngwall2002ICSLP or \citetFagel2004SpeCom, used concatenative \acTTS systems. Other approaches (e.g., \citepBenYoussef2011PhD) for \acHMM based \acTTS with intra-oral animation also rely on acoustic-articulatory inversion mapping. However, to our knowledge, no previous study has presented an end-to-end system to directly synthesize acoustics and the motion of a full \ac3D model of the tongue surface from text using statistical parametric speech synthesis, particularly with a tongue model that can be easily adapted to the anatomy of different speakers.
II-A Multilinear Shape Space Model
In our approach, we utilize a multilinear model to describe different tongue shapes. This is achieved by using the model to create a function
\[
f \colon (s, p) \mapsto M = (V, F)
\]
that maps the parameters $s$ and $p$ to a polygon mesh $M$. Such a mesh consists of a vertex set $V$ that contains positional data and a face set $F$ that uses these vertices to form the collection of surface patches of the represented shape. We note that all generated meshes share the same face set and only differ in the positional data of their vertices. The two parameters of the function describe two distinct sets of features: on the one hand, the speaker parameter $s \in \mathbb{R}^{m}$ determines the anatomical features of the generated tongue; on the other hand, the pose parameter $p \in \mathbb{R}^{n}$ represents the shape properties that are related to articulation.
To compute the multilinear model, we use a database that consists of \acMRI scans of speakers showing their vocal tract configuration for different phonemes. By means of image processing and template matching methods, we extract tongue meshes from the \acMRI data, such that in the end, for each speaker, one mesh is available for each considered phoneme. This processing is described in detail by \citetHewer2016PSA, Hewer2018CSL. We then proceed to derive the \acpDoF of the anatomy- and speech-related variations. To this end, we center the obtained meshes and turn them into feature vectors by serializing the positional data of their vertices. Afterwards, we construct a third-order tensor consisting of these feature vectors, such that the first mode of the tensor corresponds to the speakers, the second to the considered phonemes, and the third to the positional data.
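The assembly of these feature vectors into a data tensor can be sketched as follows. This is an illustrative numpy fragment with toy dimensions; the variable names and mesh sizes are our own and do not reflect the actual corpus layout.

```python
import numpy as np

def build_tensor(meshes):
    """Stack centered, serialized vertex positions into a third-order
    tensor with modes (speakers, phonemes, coordinates).  `meshes` is a
    nested list: meshes[speaker][phoneme] is an (n_vertices, 3) array."""
    feats = np.array([[np.asarray(m, float).ravel() for m in spk]
                      for spk in meshes])
    mean = feats.mean(axis=(0, 1))      # mean mesh as a feature vector
    return feats - mean, mean

# Toy collection: 2 speakers x 3 phonemes, meshes with 4 vertices each.
rng = np.random.default_rng(3)
meshes = [[rng.standard_normal((4, 3)) for _ in range(3)] for _ in range(2)]
A, mean = build_tensor(meshes)          # A has shape (2, 3, 12)
```

Centering with respect to the mean feature vector mirrors the centering step described above, so the tensor captures only deviations from the mean mesh.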
In a final step, we apply \acHOSVD [Tucker1966] to obtain the following tensor decomposition:
\[
\mathcal{A} = \mathcal{C} \times_1 S \times_2 P ,
\]
where the third-order tensor $\mathcal{C}$ represents our multilinear model and $\times_i$ denotes the $i$-th mode multiplication of a tensor with a matrix. The two matrices $S$ and $P$ contain the parameters for reconstructing the original feature vectors: each row of $S$ is a speaker parameter and each row of $P$ a pose parameter. Essentially, each speaker parameter represents a point in the $m$-dimensional speaker subspace and each pose parameter a point in the $n$-dimensional pose subspace; the two subspaces are linked together by the tensor $\mathcal{C}$. We remark that, compared to a \acPCA model, such a multilinear model offers the advantage of capturing anatomical and articulation-related shape variations separately.
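A minimal numpy sketch of such a decomposition, keeping the third (positional) mode uncompressed, is given below; the function names, shapes, and the use of plain SVDs on the mode unfoldings are our own illustrative assumptions.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor along the given mode (mode-n unfolding)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(A):
    """Higher-order SVD of a third-order tensor A (speakers x poses x
    coordinates).  Returns the core tensor C and the factor matrices U1
    (speaker parameters as rows) and U2 (pose parameters as rows); the
    third mode is left uncompressed."""
    U1, _, _ = np.linalg.svd(unfold(A, 0), full_matrices=False)
    U2, _, _ = np.linalg.svd(unfold(A, 1), full_matrices=False)
    # Core: project A onto the subspaces, C = A x_1 U1^T x_2 U2^T.
    C = np.einsum('ijk,ia,jb->abk', A, U1, U2)
    return C, U1, U2

# Reconstruction check: A == C x_1 U1 x_2 U2 (no truncation applied).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 30))     # 4 speakers, 5 poses, 30 coords
C, U1, U2 = hosvd(A)
A_hat = np.einsum('abk,ia,jb->ijk', C, U1, U2)
```

Without truncation the reconstruction is exact; in practice, the subspace dimensions would be truncated to retain only the dominant modes of variation.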
The tensor $\mathcal{C}$ can be used to create new positional data $v$ for provided parameters $s$ and $p$:
\[
v = \bar{v} + \mathcal{C} \times_1 s \times_2 p ,
\]
where $\bar{v}$ is a feature vector consisting of the positional data that corresponds to the mean mesh of the tongue shape collection. This generated information can be utilized to construct a new tongue shape: we reconstruct the vertex set by using the created positional data and combine it with the original face set to obtain our mesh. More details on how the model was derived and evaluated can be found in [Hewer2018CSL].
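Generating positional data from a speaker and a pose parameter then amounts to two mode multiplications plus the mean vector; a toy sketch, with all dimensions invented for illustration:

```python
import numpy as np

def generate_positions(C, s, p, mean):
    """Generate serialized vertex positions from a speaker parameter s
    and a pose parameter p via two mode multiplications with the core
    tensor C, offset by the mean feature vector."""
    return mean + np.einsum('abk,a,b->k', C, s, p)

# Toy example: 2-dim speaker subspace, 3-dim pose subspace, 9 coordinates.
rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3, 9))
mean = rng.standard_normal(9)
s = np.array([0.5, -0.2])
p = np.array([1.0, 0.0, 0.3])
v = generate_positions(C, s, p, mean)
vertices = v.reshape(-1, 3)   # back to (x, y, z) triples for the mesh
```

The reshaped triples would then be combined with the fixed face set to obtain the final mesh.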
In our framework, we use this model to register data of an \acEMA corpus in order to obtain the corresponding parameters, which is done as follows: In a first step, we manually align the \acEMA data to the model space by using a provided reference coil. As we want to register the \acEMA data, we have to decide which coil corresponds to which vertex of the model mesh. This process is done in a semi-supervised way: The parameters are first set to random values and the associated mesh is generated. Next, for each considered coil the nearest vertex on the mesh is found. We then refine these correspondences iteratively by fitting the model to the coils and updating the nearest vertices. In the end, we keep the correspondences that resulted in the smallest average Euclidean distance. Finally, we inspect the result manually and repeat the experiment if the correspondences appear to be wrong. The tongue model mesh is shown in Fig. 1, highlighting the vertices selected to correspond with the three tongue coils in the \acEMA data. With these estimated correspondences, we fit the multilinear model to each considered \acEMA data frame of the corpus by minimizing the energy:
\[
E(s_t, p_t) = E_\mathrm{data}(s_t, p_t) + \lambda_S\, E_\mathrm{speaker}(s_t) + \lambda_P\, E_\mathrm{pose}(p_t)
\]
The data term $E_\mathrm{data}$ measures the distances between the selected vertices of the generated mesh and the corresponding coil positions. The speaker consistency term $E_\mathrm{speaker}$, weighted by $\lambda_S$, generates energy if the current speaker parameter differs from the one of the previous time step. The remaining term, the pose smoothness term $E_\mathrm{pose}$ weighted by $\lambda_P$, fulfills a similar role: it penalizes changes of the pose parameter over time. As a minimizer of this energy is the best compromise between these assumptions, the fitting results will be close to the data and show smooth transitions over time. The degree of smoothness can be controlled by adjusting the weights $\lambda_S$ and $\lambda_P$. As the multilinear model can be used to measure the probability of generated shapes, we can also choose how far the results are allowed to deviate from the model mean: we limit the possible values for each entry of the parameters to an interval $[\mu - k\sigma,\, \mu + k\sigma]$, where $\mu$ is the mean and $\sigma$ the standard deviation of the corresponding entry in the training set of tongue meshes. In order to obtain a minimizer, we use a quasi-Newton solver [Liu1989] that supports limiting the solution to the given intervals.
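A per-frame version of this bounded fitting could be sketched as follows, using scipy's L-BFGS-B as a stand-in for the bounded quasi-Newton solver cited above; the linear placeholder model, the weight names, and all dimensions are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_frame(coil_targets, model, prev_s, prev_p, w_s, w_p,
              bounds_s, bounds_p):
    """Fit speaker and pose parameters to one EMA frame by minimizing
    E = E_data + w_s * ||s - s_prev||^2 + w_p * ||p - p_prev||^2,
    with box constraints on every parameter entry.  `model(s, p)` maps
    parameters to the coil-corresponding vertex positions; here it is a
    placeholder for the multilinear model."""
    ns = len(prev_s)

    def energy(x):
        s, p = x[:ns], x[ns:]
        e_data = np.sum((model(s, p) - coil_targets) ** 2)
        return (e_data + w_s * np.sum((s - prev_s) ** 2)
                       + w_p * np.sum((p - prev_p) ** 2))

    x0 = np.concatenate([prev_s, prev_p])
    res = minimize(energy, x0, method='L-BFGS-B',
                   bounds=list(bounds_s) + list(bounds_p))
    return res.x[:ns], res.x[ns:]

# Toy check with a linear stand-in model and zero targets.
rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
s, p = fit_frame(np.zeros(3), lambda s, p: A @ s + B @ p,
                 prev_s=np.zeros(2), prev_p=np.zeros(2),
                 w_s=1.0, w_p=0.1,
                 bounds_s=[(-2, 2)] * 2, bounds_p=[(-2, 2)] * 2)
```

Sequential application of `fit_frame` over all frames, feeding each result in as `prev_s`/`prev_p`, yields the temporally smooth trajectories described above.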
II-B Multimodal Statistical Parametric Speech Synthesis\acreset
The \acHTS framework first presented by \citetZen2005 is a standard statistical parametric speech synthesis system. The architecture comprises four main parts:
(a) the parametrization of the signal,
(b) the training of the models,
(c) the parameter generation, and
(d) the signal rendering.
The focus of our study impacts the parametrization (a) and the rendering (d) stages. Therefore, we use the standard training stage (b), described in [Zen2005], and the standard parameter generation algorithms (c), described in [Tokuda2000].
The parametrization of the signal can be performed using any suitable signal processing tool, as long as it is kept consistent with the signal rendering. In the standard procedure, this is generally accomplished by coupling STRAIGHT [Kawahara1999] with a \acMLSA filter [Fukada1992]. First, STRAIGHT is used to extract the spectral envelope, the \acF0, and the aperiodicity. Generally, the \acF0 values are transformed into the logarithmic domain, to be more consistent with human hearing. Since the dimensionality of the raw spectral envelope and aperiodicity coefficients is too high for statistical modeling, the \acMLSA filter is used to parametrize these coefficients and to obtain the \acMGC and the \acBAP, respectively.
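The log-domain transformation of the \acF0 stream, together with the voiced/unvoiced decision it implies, might look as follows in simplified form; the sentinel value for unvoiced frames is a common convention, not a detail taken from the paper.

```python
import numpy as np

def f0_to_log(f0):
    """Convert F0 values (Hz) to the log domain.  Unvoiced frames
    (F0 == 0) keep a sentinel value and are flagged separately, so a
    voiced/unvoiced decision can be modeled alongside log-F0."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    lf0 = np.full_like(f0, -1e10)       # sentinel for unvoiced frames
    lf0[voiced] = np.log(f0[voiced])
    return lf0, voiced

lf0, voiced = f0_to_log([0.0, 110.0, 220.0, 0.0])
```

In the log domain, an octave step corresponds to a constant difference of ln 2, which is the sense in which the representation is more consistent with human pitch perception.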
In this study, we propose to consider not only the parametrization of the acoustic signal but also the parametrization of speech articulation. In previous studies [Ling2009, Ling2010IS, Ling2010SpeCom], \acEMA data was used as the articulatory representation. In the present study, we work towards replacing the \acEMA data with the tongue model parameters. Therefore, our goal is to train on the trajectories of the tongue model parameters using the standard \acHTS framework as presented by \citetZen2005. The training models in \acHTS are \acpHMM at the phone level, whose observation distributions are clustered by decision trees. The leaves of the decision trees are \acpGMM, which are used to produce the parameters at the generation stage by applying the algorithm presented by \citetTokuda2000. Fig. 2 presents the details of the modified architecture.
III-A Multilinear Model
As the database for deriving the multilinear model, we used \acMRI data from the Ultrax project \citepRichmond2012 (11 speakers) and combined it with the data of \citetBaker2011 (1 speaker), which was recorded as part of the Ultrax project but released separately. In the end, the resulting tongue mesh collection contained, for each speaker, estimated shapes for the phone set \textipa[i, e, E, a, A, 2, O, o, u, 0, @, s, S]. Accordingly, the mode dimensions of the resulting multilinear model are determined by the speakers and the phone set for the anatomy and the tongue pose, respectively. The tongue mesh we used for the template matching was manually extracted from one \acMRI scan, made symmetric to remove some bias towards the original speaker, and finally remeshed to be more isotropic, which determined its final vertex count, face count, and spatial resolution.
The data used for the experiments in this study is taken from the mngu0 corpus, specifically the “day 1” \acEMA subset [Richmond2011], which contains acoustic recordings, time-aligned phonetic transcriptions, and \acEMA motion capture data (recorded using a Carstens AG500 articulograph).\footnote{From the mngu0 website, http://mngu0.org, we downloaded the following distribution packages: Day1 basic audio data, downsampled (v1.1.0); Day1 basic \acsEMA data, head corrected and unnormalized (v1.1.0); Day1 transcriptions, Festival utterances and ESPS label files (v1.1.1).} We selected the “basic” (as opposed to the “normalized”) release variant of the \acEMA data, because it preserves the silent (i.e., non-speech) intervals, as well as the \ac3D nature and true spatial coordinates of the sensor data (after head motion compensation). The \acEMA coil layout for this data is shown in Fig. 3; the coils are explained in Table I.
In order to manipulate the \acEMA data more flexibly, the files were first converted from the binary \acEST format to a JSON structure. Invalid values (i.e., NaN) were replaced by linear interpolation. No further modification, in particular no smoothing, was applied.
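The NaN repair by linear interpolation can be sketched in a few lines of numpy; the function name is ours.

```python
import numpy as np

def interpolate_nans(channel):
    """Replace NaN samples in a 1-D EMA channel by linear interpolation
    between the nearest valid neighbours.  No additional smoothing is
    applied, matching the processing described above."""
    channel = np.asarray(channel, dtype=float).copy()
    bad = np.isnan(channel)
    idx = np.arange(len(channel))
    channel[bad] = np.interp(idx[bad], idx[~bad], channel[~bad])
    return channel

clean = interpolate_nans([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
```

Note that `np.interp` clamps at the boundaries, so leading or trailing NaN runs would be filled with the first or last valid value rather than extrapolated.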
From the provided acoustic data, signal parameters were extracted using STRAIGHT [Kawahara1999] at a frame rate matching that of the \acEMA data. As we follow the standard \acHTS methodology, we also kept the standard parameter set. Therefore, our signal parameters comprise the \acMGC, the \acBAP, and one coefficient for the \acF0.
From the utterances in the data, a randomly selected subset was held back as a test set; the remaining utterances were used as the training set to build \acHTS synthesis voices. A comparison of the phone distributions in the training and test sets shows a satisfactory match (cf. Fig. 4).
III-C Acoustic Synthesis
As a baseline, we first built a conventional \acTTS system using the acoustic data only. This served mainly to validate our voicebuilding process and ensure that the transcriptions provided, and labels generated from them, along with the acoustic signal parameters, were able to generate audio of sufficient quality. Accordingly, we did not undertake a formal subjective listening test, and instead evaluated this baseline experiment using objective measures only.
We synthesized the utterances in the test set using two conditions. The first condition is the standard synthesis process. This condition allows us to evaluate the duration accuracy. For the second condition, we imposed the acoustic phone durations from the provided transcriptions to allow direct comparison with the natural recordings. For the following experiments, we synthesized both conditions as well. The objective evaluation was conducted based on the following metrics.
For the duration evaluation, we calculated the duration \acRMSE at the phone level between the reference durations and those synthesized using the first condition.
Considering the other coefficients, we compared the synthesis result, achieved using the second condition, to the reference present in the test corpus. As the duration was imposed, the produced utterance and the reference have the same number of frames. To evaluate the \acF0, we used three measures: the \acVUV error rate percentage (6) to check the prediction of the \acF0, the \acF0 \acRMSE (7), and the \acRMSE in cents (8). The latter measure focuses on the frames which are voiced in both the original and the predicted \acF0; furthermore, it is a log-scale measure better adapted to human perception.
Finally, to evaluate the spectral envelope prediction, we computed the \acMCD between the reference \acMGC vectors $mgc$ and the synthesized vectors $\widehat{mgc}$ of dimension $D$:
\[
\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \bigl( mgc_d - \widehat{mgc}_d \bigr)^2 } .
\]
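Simplified implementations of these objective measures might look as follows; the exact frame alignment and \acMGC dimensionality used in the study are not reproduced here, and the exclusion of the energy coefficient in the distortion measure is a common convention rather than a stated detail.

```python
import numpy as np

def vuv_error_rate(voiced_ref, voiced_syn):
    """Percentage of frames with a wrong voiced/unvoiced decision."""
    voiced_ref, voiced_syn = np.asarray(voiced_ref), np.asarray(voiced_syn)
    return 100.0 * np.mean(voiced_ref != voiced_syn)

def f0_rmse_cents(f0_ref, f0_syn):
    """F0 RMSE in cents over frames voiced in both reference and
    synthesis (1200 cents per octave)."""
    f0_ref = np.asarray(f0_ref, float)
    f0_syn = np.asarray(f0_syn, float)
    both = (f0_ref > 0) & (f0_syn > 0)
    cents = 1200.0 * np.log2(f0_syn[both] / f0_ref[both])
    return np.sqrt(np.mean(cents ** 2))

def mcd(mgc_ref, mgc_syn):
    """Mean mel cepstral distortion (dB) between MGC frame sequences,
    excluding the 0th (energy) coefficient."""
    diff = np.asarray(mgc_ref)[:, 1:] - np.asarray(mgc_syn)[:, 1:]
    return np.mean((10.0 / np.log(10.0))
                   * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```

All three functions operate at the frame level, matching the evaluation protocol described here.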
Except for the duration, all parameters were evaluated at the frame level. Based on these measures, we can compare our results to previous studies, such as the one presented by \citetYokomizo2010.
The results of this evaluation are given in Table II and comprise the mean, standard deviation, and confidence interval. Compared to [Yokomizo2010], we achieved slightly better results, notwithstanding the different dataset. We can therefore conclude that our acoustic prediction is consistent with the state of the art in \acHTS.
III-D Combined Acoustic and \acEMA Synthesis
Adopting the paradigm of early multimodal fusion, we combined the acoustic signal parameters with the \ac3D positions of the seven \acEMA coils shown in Table I, adding three spatial coordinates per coil to the parameter vector of each frame. Using the \acHTS framework, we then built another \acTTS system from this multimodal data.
Synthesizing the test set in this way, we obtained, in addition to the audio, synthetic trajectories of predicted \acEMA coil positions. To evaluate the combined acoustic and \acEMA synthesis, we computed the same objective measures as in Section III-C. We also computed the Euclidean distance between the observed and predicted positions of the \acEMA coils. Finally, we computed the \acRMSE between the dynamics of the coil trajectories, expressed in millimeters per frame. The results of this evaluation are given in Table III. We see that the differences in the acoustic measures compared to the acoustic-only synthesis (cf. Table II) are negligible.
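The two articulatory measures can be sketched as follows, assuming observed and predicted coil trajectories are stored as arrays of shape (frames, coils, 3); this layout is our assumption for illustration.

```python
import numpy as np

def coil_euclidean_distance(obs, pred):
    """Frame-wise Euclidean distance (mm) between observed and predicted
    coil positions; obs and pred have shape (frames, coils, 3)."""
    return np.linalg.norm(np.asarray(obs) - np.asarray(pred), axis=-1)

def dynamics_rmse(obs, pred):
    """RMSE between the first-order dynamics (mm/frame) of the observed
    and predicted coil trajectories."""
    d_obs = np.diff(np.asarray(obs, float), axis=0)
    d_pred = np.diff(np.asarray(pred, float), axis=0)
    return np.sqrt(np.mean((d_obs - d_pred) ** 2))

# Toy example: one coil, three frames, constant 1 mm offset.
obs = np.zeros((3, 1, 3))
pred = np.array([[[1.0, 0.0, 0.0]]] * 3)
dist = coil_euclidean_distance(obs, pred)
```

A constant positional offset yields a nonzero Euclidean distance but a zero dynamics error, which is why the two measures are reported separately.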
The comparison between the observed and predicted trajectories for one test utterance is illustrated in Fig. 5. The observed and predicted (synthesized) positions of the three tongue coils are shown in each of the three dimensions in the data, along with the Euclidean distance. Silent intervals and consonants classified as coronal \textipa[t, d, n, l, s, z, S, Z, T, D] and dorsal \textipa[g, k, N], based on the provided phonetic transcription, have been highlighted. This helps visualize the correspondence between gestures of the tongue tip (coil T1) and tongue back (coils T2 and T3) for coronal and dorsal consonants, respectively, and the phonetic units they produce.
Several points merit discussion. First of all, there are large mismatches between the observed and predicted tongue \acEMA coil positions during the silent (pause) intervals at the beginning and end of the utterance. This can be attributed to the fact that the speaker’s wide-ranging tongue movements during non-speech intervals are not distinguished in the provided annotations, but invariably labeled with the same pause symbol. However, there are at least two very distinct shapes for the tongue during such silent intervals, including a “rest” and a “ready” position (just before speech is produced), in addition to other complex movements such as swallowing. In the absence of distinct labels corresponding to these positions and movements, none of this silent variation can be captured by the \acpHMM trained on this data; instead, the tongue coils are unsurprisingly predicted to hover around their global means.
Secondly, there is noticeable oversmoothing and target extrema are not always quite reached. This can typically be attributed to the \acHMM based synthesis technique, despite the integration of global variance. The dynamics, however, are well represented, and the predicted positional trajectories, as well as their derivatives, match the observed reference quite closely.
The lateral axis appears to suffer from a greater amount of prediction error than the other two axes. However, it should be noted that the positional variation along the lateral axis is an order of magnitude smaller than that along the anterior/posterior axis. It must also be borne in mind that nearly all of the speech-related movements occur in the mid-sagittal plane, spanned by the anterior/posterior and inferior/superior axes; variation along the lateral axis corresponds to lateral movements, which are infrequent during speech.\footnote{Incidentally, the “normalized” release variant of the mngu0 \acEMA dataset follows this rationale and consists of flattened, \ac2D data, with all coil positions projected onto the mid-sagittal plane. Having said that, the lateral axis can serve to illustrate the physical coil locations on the tongue in the “day 1” recording session; to wit, the tongue tip coil is actually attached out of plane, a few millimeters to one side.}
The Euclidean distances during speech are in the millimeter range, indicating that the predictions of \acEMA coil positions are accurate to within the precision of the \acEMA measurements themselves. However, there appears to be a certain amount of fluctuation with a more or less regular range and shape. The peaks of this fluctuation appear to correlate with spikes in the RMS channels of the provided \acEMA data, which supports the hypothesis that it is either an artifact of the algorithm which calculates the coil positions and orientations from the raw amplitudes \citepStella2012, or measurement noise in the articulograph itself \citepKroos2012, or, conceivably, a combination of both factors. Of course, the noise in the Euclidean distance analysis is a direct consequence of our decision to refrain from smoothing the provided \acEMA data.\footnote{Perhaps the RMS jitter in the unsmoothed measurements could also be exploited for adaptive \acEMA denoising.}
III-E \acEMA Synthesis
While the combined acoustic and \acEMA synthesis produced satisfactory results, the requirement to train the system on a multimodal dataset such as mngu0 represents a significant drawback; compared to the reasonably wide availability of conventional, acoustic databases designed for speech synthesis, the number of suitable articulatory databases is extremely low. Encouraged by the practical equivalence in the evaluation of the acoustic measures described in Sections III-D and III-C, we therefore considered the question of decoupling the \acEMA synthesis completely from the acoustic data. Accordingly, we used the \acHTS framework to build another \acTTS system trained only on the \acEMA data, without the acoustic parameters.
Under this condition, the evaluation of the duration \acRMSE and of the Euclidean distances between the predicted and observed \acEMA coils, computed as in (7), is given in Table IV. As we can see, the results are nearly identical to those in Table III, which confirms the validity of this approach. Fig. 6 visualizes the comparison between the observed and predicted trajectories for one test utterance.
III-F Tongue-only \acEMA Synthesis
In order to focus on the tongue in the following section, we first needed to investigate how well the tongue coil positions can be predicted in isolation from the remaining \acEMA coils. To this end, we created a modified version of the \acTTS system described in the previous section by including only the tongue coils (T1, T2, and T3) and excluding the rest of the \acEMA data from the training set.
Table V gives the evaluation of the \acEMA synthesis restricted to the three tongue coils. Comparing these results with those in Table IV, we observe that the values are virtually identical, which confirms the validity of this approach. As before, the comparison between the observed and predicted trajectories for one test utterance is shown in Fig. 7. It should be noted that despite the removal of the \acEMA coil on the lower incisor, some residual jaw motion is implicitly retained in the movements of the tongue coils.
III-G Model-based Tongue Motion Synthesis
Having verified that the \acHTS framework can be used to synthesize audio and predict the movements of the three tongue \acEMA coils using separate models trained on the mngu0 database, we finally prepared a new kind of \acTTS system to predict the shape and motion of the entire tongue surface, by integrating the multilinear model into the process.
To this end, we first estimated the anatomical features (cf. Section II-A) of the speaker in the mngu0 dataset as follows: we used the upper incisor coil as a reference and estimated the correspondences between the three tongue coils and the model vertices, chosen as described in Section II-A. During this correspondence optimization, we used a small value of $k$, which limits the admissible values for each entry of the model parameters to the interval $[\mu - k\sigma,\, \mu + k\sigma]$, where $\mu$ is the mean and $\sigma$ the standard deviation of the corresponding model parameter. By using such a small interval, we try to prevent overfitting during this step. Afterwards, we fitted the model to all \acEMA data frames and stored the obtained parameter values. Here, we used large values for the speaker consistency weight and the pose smoothness weight in the fitting energy. Thus, we demanded very smooth transitions in this case and especially penalized changes of the speaker’s anatomy over time. In this step, we used a larger value of $k$ to give the approach some freedom during the fitting. We then averaged all obtained speaker parameters to get an estimate of the considered speaker’s anatomical features.
Next, we again fitted the model to all \acEMA data frames, this time fixing the speaker parameter to the estimated anatomy. We note that this approach causes the multilinear model to behave like a single-speaker \acPCA model. Here, we used a smaller smoothness weight to increase the influence of the data term, while again restricting the parameter interval to encourage more plausible shapes.
We note that the settings for the fitting were selected manually by an expert. Of course, this selection could be further optimized for the \acEMA dataset used by performing a thorough analysis.
The pose parameters resulting from this fitting step were taken as the training data, and we used the \acHTS framework to build a new \acTTS system that predicts the tongue model parameter values directly from the input text.
To evaluate the performance of this system against the reference \acEMA data, we extracted the spatial coordinates of the vertices assigned during the adaptation step (see above) to produce synthetic trajectories that served as a virtual surrogate for predicted \acEMA data.
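Extracting such surrogate \acEMA trajectories from the animated mesh reduces to indexing the vertices chosen during adaptation; a toy sketch with invented shapes:

```python
import numpy as np

def surrogate_ema(vertex_trajectories, coil_vertex_indices):
    """Extract the trajectories of the model vertices assigned to the
    EMA coils during adaptation, yielding synthetic coil trajectories of
    shape (frames, coils, 3).  `vertex_trajectories` has shape
    (frames, vertices, 3); the indices come from the correspondence
    estimation step."""
    return np.asarray(vertex_trajectories)[:, coil_vertex_indices, :]

# Toy mesh animation: 2 frames, 4 vertices, coordinates 0..23.
frames = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
coils = surrogate_ema(frames, [1, 3])
```

The resulting trajectories can then be evaluated against the reference coil positions with the same distance measures used for the \acEMA synthesis systems.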
We evaluated this synthetic \acEMA data against the reference as before; Table VI provides the Euclidean distances between the predicted and observed \acEMA coils, and one test utterance is visualized in Fig. 8. It should be noted that the tongue model itself contains a temporal smoothing term, which ensures that a noisy sequence of input frames does not cause the \ac3D mesh to change shape or position too rapidly; however, this extra smoothing contributes to widespread target undershoot in the comparison. Overall, the results of this evaluation are very promising, and we can confirm that as far as possible, with only three surface points on the tongue, the animation of the full tongue appears to closely match the observed reference.
Finally, in order to compare the three experimental \acTTS systems (trained without acoustic data), we analyzed the distribution of Euclidean distances between each system and the observed reference data over the entire test set; the results are shown in Fig. 9. The distances are slightly greater when the non-tongue \acEMA coils are excluded, and greater still when the \acEMA prediction is replaced by the direct synthesis of tongue model parameters. However, overall, the distances remain in the same range, which indicates that the latter approach performs no worse than synthesis of \acEMA data – while adding the full \ac3D tongue surface to the synthesis process.
In this study, we have presented a new process of synthesizing acoustic speech and synchronized animation of a full \ac3D surface model of the tongue. We used the \acHTS framework with a single-speaker, multimodal articulatory database containing \acEMA motion capture data. First, we demonstrated a conventional, fused multimodal approach, then separated the two modalities while ensuring that the objective evaluation measures remained comparable. Finally, we adapted a multi-linear statistical model of the tongue and integrated it into the \acTTS process, and evaluated its accuracy by comparing the spatial coordinates of vertices on the model surface to the reference \acEMA data from the original speaker’s tongue movements. The results are very encouraging, and we believe that this will enable multimodal \acTTS applications that provide tongue animation with human-like performance.
It should be noted that the acoustic synthesis and predicted phone durations need not come from the same corpus as the one used for training the tongue model parameter synthesis system. Under certain conditions, it would be straightforward to use a different, conventional \acTTS system with speech recordings from a different speaker in combination with this tongue model parameter synthesis, perhaps adapting it in the speaker subspace automatically or by hand, to generate a multimodal \acTTS application with plausible, speech-synchronized tongue motion, without the requirement of having articulatory data available for the target speaker. In this way, it is possible to first synthesize the acoustic speech signal, and to provide the predicted acoustic durations to guide the synthesis of corresponding tongue model parameters, which are then used to render the animation of the \ac3D tongue model in real time.
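The decoupled pipeline described above can be summarized as three steps. The following sketch is purely illustrative: the class interfaces and function names are hypothetical placeholders, not an API from this work.

```python
def synthesize_multimodal(text, acoustic_tts, tongue_tts, tongue_model):
    # 1. Acoustic synthesis yields the speech waveform plus predicted
    #    phone durations (possibly from a different speaker's TTS voice).
    audio, phone_durations = acoustic_tts.synthesize(text)
    # 2. The predicted durations guide the synthesis of tongue model
    #    parameters, keeping the articulation synchronized with the audio.
    pose_params = tongue_tts.synthesize(text, durations=phone_durations)
    # 3. The parameter trajectories drive the 3D tongue mesh frame by frame.
    meshes = [tongue_model.reconstruct(p) for p in pose_params]
    return audio, meshes
```

Because step 2 only consumes text and durations, the tongue model parameter synthesis needs no articulatory data from the target speaker, which is the key to adding an articulatory modality to a conventional \acTTS voice.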
However, there is clearly more work to be done, and in future research, we intend to refine and improve our system and to evaluate it perceptually with human subjects. Such a study could assess intelligibility, for example the contribution of visible tongue movements when audible speech is degraded, noisy, or absent. We also plan to assess the impact on perceived naturalness by integrating the tongue model into a realistic talking avatar (e.g., \citepTaylor2012, Schabus2014) and investigating the importance of naturalistic tongue movements for the overall impression of such avatars in multimodal spoken interaction scenarios with artificial characters. This may also lead us to model distinct non-speech poses for the tongue, such as separate “rest” and “ready” positions.
Regarding the tongue model integration, we plan to further investigate such factors as the impact of reducing the dimensionality of the model subspaces on synthesis performance, optimizing the vertex correspondence with \acEMA data, improving the fitting results by adjusting the weights for the smoothness terms, and exploring speaker adaptation using volumetric data, such as the \acMRI subset of the mngu0 corpus \citepSteiner2012.
This work was funded by the German Research Foundation (DFG) under grants EXC 284 and SFB 1102. The authors would like to thank Korin Richmond, Phil Hoole, and Simon King for creating and releasing the mngu0 database. Studies such as the one described in this paper would not be possible without such high-quality, open databases.