Studying Mutual Phonetic Influence with a Web-Based Spoken Dialogue System



This paper presents a study on mutual speech variation influences in a human-computer setting. The study highlights behavioral patterns in data collected as part of a shadowing experiment, and is performed using a novel end-to-end platform for studying phonetic variation in dialogue. It includes a spoken dialogue system capable of detecting and tracking the state of phonetic features in the user’s speech and adapting accordingly. It provides visual and numeric representations of the changes in real time, offering a high degree of customization, and can be used for simulating or reproducing speech variation scenarios. The replicated experiment presented in this paper along with the analysis of the relationship between the human and non-human interlocutors lays the groundwork for a spoken dialogue system with personalized speaking style, which we expect will improve the naturalness and efficiency of human-computer interaction.

Keywords: Spoken dialogue systems · Phonetic convergence · Human-computer interfaces


1 Introduction

With expanding research on, and growing use of, spoken dialogue systems (SDSs), a main challenge in the development of human-computer interaction (HCI) systems of this kind is making them as close as possible to human-human interaction (HHI) in terms of naturalness, fluency, and efficiency. One aspect of such HHI is the relationship of mutual influences between the interlocutors. Influence here means changes in one interlocutor’s conversational behavior triggered by the behavior of the other interlocutor. We refer to changes that make the interlocutors’ behaviors more similar as convergence. Convergence can occur in different modalities and with respect to various aspects of the conversation, such as eye gaze, gestures, lexical choices, body language, and more. In this paper, we concentrate on phonetic-level influences, i.e., phonetic convergence. More specifically, we examine pronunciation variations over the course of human-computer interactions. As speech is the principal modality used for interacting with SDSs, we believe it is an especially important modality to study in the field of HCI. Simulating and triggering convergence on the phonetic level, as found in HHI, may contribute considerably to the naturalness of dialogues between humans and computers. SDSs with such a personalized speaking style are expected to offer more natural and efficient interactions, and to move one more step away from the interface metaphor [5] toward the human metaphor [3].

The novel system introduced in Section 3 tracks the states of segment-level phonetic features during the dialogue. All of the analyses are automated and run in real time. This not only saves the time and manual work typically needed in convergence studies, but also makes the system more suitable for integration into other applications. In Section 4, we use this newly introduced system with recordings collected as part of a shadowing experiment to examine the relationship of mutual influences between a (simulated) user and the system. Using these recordings, the system provides both visual and numerical evidence of the mutual influences between the interlocutors over the course of the interaction. The system itself will be made freely available under an open-source license.

2 Background and Related Work

Integrating support for changes in the speech signal into computer systems may enhance HCI and provide improved tools for studying convergence in HCI. [18] discusses the advantages of systems that dynamically adapt their speech output to that of the user, and the challenges involved in developing and using these systems.

2.1 Phonetic Convergence

According to [19], phonetic convergence is defined as an increase in segmental and suprasegmental similarity between two interlocutors (e.g., [27]). In contrast to entrainment, we use the term convergence to describe dynamic, mutual, and non-imposing changes. Phonetic convergence has been found to varying extents in conversational settings [13]. There is evidence for phonetic convergence being both an internal mechanism [21] and socially motivated [9]. Previous studies of phonetic convergence in spontaneous dyadic conversations have focused on speech rate [26], timing-related phenomena [23], pitch [8], intensity [12], and perceived attractiveness [16]. Phonetic convergence is often examined in the scope of shadowing experiments, in which participants are asked to produce certain utterances after hearing them produced in some stimuli (e.g., [7]). This is typically done with single target words embedded in a carrier sentence. The experiment showcasing our system in Section 4 uses whole sentences as stimuli, in which the target features are embedded, making it a semi-conversational HCI setting.

2.2 Adaptive Spoken Dialogue Systems

Various studies have investigated entrainment and priming in SDSs, aiming to better understand HCI dynamics and improve task-completion performance. [15], for example, focused on dynamic entrainment and adaptation on the lexical level. Others, like [17], concentrated on word frequency. [20] examined changes in both lexical choice and word frequency. While these studies addressed the changes in experimental, scripted scenarios, the theoretical foundations for studying such changes in spontaneous dialogue exist as well [2]. [6] provide examples of online adaptation for dialogue policies and belief tracking.

It is important to note that while all of the studies mentioned above examine various aspects of dialogue, none of them relates to speech – the primary modality used to interact with SDSs. Studying convergence of speech in an HCI context is made possible by more natural synthesis technology, which gives fine-grained control over parameters of the system’s spoken output. Many systems that deal with adaptation of speech-related features focus on prosodic characteristics like intonation or speech rate. [10] sheds light on acoustic-prosodic entrainment in both HHI and HCI via the use of interactive avatars. [1] found that users’ speech rate can be manipulated using a simulated SDS. Similar results were found when intensity changes in children’s interaction with synthesized text-to-speech (TTS) output were examined [4].

All of the above provide solid ground for further investigation of phonetic convergence in HCI using SDSs.

3 System

Figure 1: An overview of the system architecture. The background colors distinguish client components, server components, and external resources that can be customized.

The system introduced here is an end-to-end, web-based SDS with a focus on phonetic convergence and its analysis over the course of the interaction. Besides placing convergence in the spotlight, it is designed to be flexible and to meet the researcher’s needs by offering a wide range of customizations (see Section 3.2). Its online access via a web browser makes it scalable and simple for the end user to operate. The system’s architecture and functionality are described in Section 3.1, its graphical user interface (GUI) and operation in Section 3.3, and an example of its utilization is demonstrated in Section 4. Ultimately, it offers an experimentation platform for studying phonetic convergence, with emphasis on the following:

Temporal analysis

offering real-time visualization of the interlocutors’ relations with respect to selected phonetic features over the course of the interaction.


Customizability

allowing the user to experiment with different scenarios by configuring parameters and definitions in many of the system’s components.

Online scalability

connecting multiple web clients to a server, allowing users to use the system anywhere without prior installation and configuration, and helping experimenters collect and replay acquired data.

3.1 Architecture

As the system aims to offer a customizable playground for experimenting with and studying phonetic convergence in HCI, a key aspect of its architecture is the separation between client-side, server-side, and external resources (see Figure 1). All of the resources and configuration files needed for designing the interaction are located on the server. Running the client and server on different machines allows users to interact with the system using a web browser alone.

Figure 2: The architecture of the dialogue system component. The additional speech processing (ASP) module (dashed line) between the ASR and TTS modules is responsible for performing the additional speech processing required for analyzing the phonetic changes. Though additional links between the ASP module and other modules (such as NLG) could be made, those are beyond the scope of this work.

As shown in Figure 2, the dialogue system component consists of typical SDS modules, such as natural language understanding (NLU) and a dialogue manager (DM), but also contains an additional speech processing (ASP) module [24]. This module is responsible for processing the audio and extracting the features required by the convergence model. While the NLU component uses merely the transcription provided by the automatic speech recognition (ASR) module, the ASP module analyzes the speech signal itself. More specifically, it tracks occurrences of the defined features and passes their measured values to the convergence model, which, in turn, forwards the tracked feature parameters to the text-to-speech (TTS) synthesis component.
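As a rough illustration of this data flow, the following sketch shows how an ASP-style module might scan recognized segments for tracked phonemes and forward measurements to a convergence model. The class and method names (`ASPModule`, `observe`, the segment dictionaries) are our own invention for illustration, not the system's actual API:

```python
# Hypothetical sketch of the ASP module's data flow: inspect each
# ASR-aligned segment, measure occurrences of tracked features, and
# forward the measurements to the convergence model.

from dataclasses import dataclass


@dataclass
class FeatureMeasurement:
    feature: str   # e.g. a vowel contrast such as [ɛː] vs. [eː]
    value: tuple   # e.g. measured (F1, F2) formant values in Hz
    turn: int      # dialogue turn in which it was observed


class ASPModule:
    def __init__(self, tracked_features, convergence_model):
        # tracked_features: map from feature name to the set of
        # phoneme labels associated with it
        self.tracked = tracked_features
        self.model = convergence_model

    def process(self, segments, turn):
        """Scan segments for tracked phonemes and pass measured
        feature values on to the convergence model."""
        for seg in segments:
            for feature, phonemes in self.tracked.items():
                if seg["phoneme"] in phonemes:
                    m = FeatureMeasurement(feature, seg["formants"], turn)
                    self.model.observe(m)  # model later updates TTS params
```

In this sketch the convergence model only needs to expose an `observe` method; how it aggregates the measurements is covered by the model parameters discussed in Section 3.2.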

3.2 Models and Customizations

The computational model for phonetic convergence used in the system is described in [25]. Different phonetic convergence behavioral patterns that were observed in HHI and HCI experiments can be simulated by combinations of the model’s parameters presented in Table 1. All of the parameters can be modified in the system’s configuration file.

Parameter            Description
allowed range*       allowed value range for new instances
history size         maximum number of exemplars in the pool
update frequency     how often the feature’s value is recalculated
calculation method*  method used to calculate the pool value
convergence rate     weight given to the pool value when recalculating
convergence limit*   the maximum degree of convergence allowed

Table 1: Summary of the computational model’s parameters in their order of application in the convergence pipeline. Parameters marked with an asterisk (*) are defined for each feature independently.

The entire convergence process is based on the tracked phonetic features that are considered “convergeable”, i.e., prone to variation, and is triggered whenever the ASR component detects a segment containing a phoneme associated with one or more of these features. Each feature is defined by a key-value map, in which the parameters from Table 1 are configured. A classifier can be associated with each feature to provide real-time predictions for both the user’s and the system’s realizations of that feature, as demonstrated in Figure 3. With this information available, more meaningful insights can be gained into the dynamics of phonetic changes in the dialogue.
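To make the roles of the Table 1 parameters concrete, here is a minimal sketch of an exemplar-pool update in the spirit of the model in [25]. All names and implementation details are our assumptions for illustration, not the actual implementation; in particular, how the convergence limit is enforced is one plausible reading:

```python
from collections import deque


class FeatureState:
    """Hypothetical sketch of one tracked feature's convergence state,
    driven by the parameters of Table 1."""

    def __init__(self, initial, allowed_range, history_size=20,
                 update_frequency=5, convergence_rate=0.3,
                 convergence_limit=0.8, calculation_method=None):
        self.value = initial                      # system's current realization
        self.initial = initial
        self.lo, self.hi = allowed_range          # allowed range*
        self.pool = deque(maxlen=history_size)    # history size: exemplar pool
        self.update_frequency = update_frequency  # update frequency
        self.rate = convergence_rate              # convergence rate
        self.limit = convergence_limit            # convergence limit*
        # calculation method*: how the pool collapses to a single value
        self.calc = calculation_method or (lambda p: sum(p) / len(p))
        self._since_update = 0

    def observe(self, instance):
        """Add a detected user realization; recalculate periodically."""
        if not (self.lo <= instance <= self.hi):
            return                                # outside allowed range
        self.pool.append(instance)
        self._since_update += 1
        if self._since_update >= self.update_frequency:
            self._since_update = 0
            self._recalculate()

    def _recalculate(self):
        pool_value = self.calc(self.pool)
        # convergence rate: weight of the pool value in the new value
        new = (1 - self.rate) * self.value + self.rate * pool_value
        # convergence limit: cap how far the value may drift from its
        # initial state toward the pool value
        max_shift = self.limit * abs(pool_value - self.initial)
        shift = max(-max_shift, min(max_shift, new - self.initial))
        self.value = self.initial + shift
```

For a one-dimensional feature (e.g. a single formant frequency), each detected user realization would be fed to `observe`, and `value` would be forwarded to the TTS component as the system's current target realization.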

The dialogue domain is specified in an XML-based file. More details on the domain file can be found in [14]. The format of the domain file makes it easy to define new scenarios for the system, such as a task-specific dialogue, general-purpose chat, or an experimental setup.

Speech processing is a central aspect of the system. Different models can be used, e.g., to improve performance, to change the language of the ASR module, or to change the output voice of the TTS module.

3.3 Graphical User Interface

The system’s GUI consists of three main areas:

Figure 3: A screenshot of the plot area showing the states of the feature [ɛː] vs. [eː] (in 2-dimensional formant space) during an interaction. The system’s internal convergence model (orange, bottom right) gradually adapts to the user’s (blue, upper left) detected realizations. A prediction of the feature’s current realization is given for both interlocutors. The annotation box marks the turn in which the system has aggregated enough evidence from the user’s utterances and changes its pronunciation from [ɛː] (its initial state) to [eː] (the user’s preferred variation).

In the chat area, the interaction between the user and the system is shown in a chat-like representation. Each turn’s utterance appears inside a chat bubble with different colors and orientations for the user and the system. The turns are also numbered, to better track the dialogue progress and the analysis shown by the plots in the graph area. It is also possible to replay the utterance of a turn by clicking the “Play” button in its corresponding bubble.

In the interaction area, the user can interact with the system with written or spoken input. Text-based interactions progress through the dialogue (if applicable) and trigger any subsequent domain model, but will not affect convergence-related models, since there is no audio input to process. Spoken input can be provided either by speaking into the microphone or via audio files with pre-recorded speech. The latter option is especially useful for simulating specific user input, or for reproducing a previous experiment, as done in Section 4.

In the graph area, each of the tracked features is visualized in a separate plot, and new data points are added whenever a new instance of the feature is detected. Hovering over a data point in a graph reveals additional information, such as the turn in which it was added, or the realized variant of the feature produced in that turn as predicted by its classifier. These dynamic, interactive plots make it possible to shed light on how the interlocutors influence each other, whether or not they are aware of it, throughout their exchanges. Figure 3 shows such a graph with several accumulated data points.

4 Showcase: Examining Convergence Behaviors

To demonstrate a possible use of the system, we simulated the shadowing experiment detailed in [7], using the system and its analyses to look into types of participant convergence behavior with respect to the features examined in the experiment (see Table 2). This experiment is designed to trigger phonetic convergence by confronting the participants with stimuli in which certain phonetic features are realized in a manner different from their own realizations. The simulation was carried out by building a domain file with the experimental procedure, including the transitions between the experiment’s phases as well as the flow within each phase. This automates the procedure and adapts it to the participant’s pace. Participants were simulated by using their recorded speech from the original experiment, in the same order. Using the system for this purpose results in an automated, reproducible execution, with additional insights like classification of feature realizations and dynamic visualizations in the GUI. The classifiers were trained offline on the data points acquired from analyzing the stimuli. However, the system also supports incremental, online re-training whenever requested by the user, for example every time the convergence model is updated. For the demonstration presented here, a sequential minimal optimization (SMO) [22] implementation of a support vector machine (SVM) classifier was used for training. Each turn’s number and prediction are added as an interactive annotation to the dynamic graph of the relevant features, as shown in Figure 3. Finally, using the system, the experiment is transformed into an automated dialogue scenario, which enhances its HCI nature.

Sentence                                                          Feature
War das Gerät sehr teuer? (Was the device very expensive?)        [ɛː] vs. [eː] in word-medial ä
Ich bin süchtig nach Schokolade. (I am addicted to chocolate.)    [ɪç] vs. [ɪk] in word-final -ig
Wir besuchen euch bald wieder. (We will visit you soon again.)    [n̩] vs. [ən] in word-final -en

Table 2: Examples of stimuli sentences, each containing one target feature.

4.1 Finding Behavioral Patterns

In this section, we focus on the validation for the feature [ɛː] vs. [eː] as a representative example of the system’s phonetic adaptation capability. Although the classified realization is binary ([ɛː] or [eː]), the underlying representation used by the model is gradual. Both of these views on the feature can be seen in the graph area, as shown in Figure 3.

The degree of convergence was examined per utterance in the shadowing phase of the experiment. Three main groups emerged, each with a different behavior: one group of participants showing little to no tendency to converge, a second with varying degrees of convergence, and a third group of participants who were very sensitive to the stimuli’s variation. We refer to these groups as Low, Mid, and High, respectively. The feature’s classification was determined on the fly, so that the prediction for each utterance was made based on the type of the stimulus to which the participant was listening. As Table 3 shows, the Low and High groups are both of significant size, indicating that these two distinct behaviors exist in the data and can be spotted by the system.

In addition, we validated the separation between these behaviors. To this end, we regarded the shadowing phase as an annotation task, where the annotators are the predictors of the user’s and the system’s realizations. Note that complete similarity would mean complete convergence to every stimulus, which cannot reasonably be expected (cf. [7]). The Cohen’s kappa (κ) values⁴ of the Low group are expected to be the lowest, as a lesser degree of convergence was found among these participants. By the same logic, the High group is expected to have the highest agreement, and the Mid group to have values between the two other groups. Indeed, this hypothesis holds: weak agreement was found in the Low group, strong agreement in the High group, and a value close to 0 (indicating no consistent behavior) for the Mid group.
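The agreement measure used here can be sketched as a plain two-rater Cohen's kappa, as in the following illustration (the paper itself used the kappa2 command of the irr R package; this stdlib version is our own simplified stand-in):

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators over the same items. In our
    setting, the 'annotators' are the realization predictions for the
    user's and the system's productions (e.g. 'E:' vs. 'e:')."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # expected chance agreement from each rater's label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / n ** 2
    if expected == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (observed - expected) / (1 - expected)
```

A kappa near 1 corresponds to the strong agreement found for the High group, while a kappa near 0 corresponds to chance-level, inconsistent behavior as in the Mid group.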

5 Conclusion and Future Work

We have introduced a system with an integrated spoken dialogue system (SDS), which can track and analyze mutual influences on the phonetic level during the interaction, based on an internal convergence model. This combines work done in the fields of phonetic convergence and adaptive SDSs, and contributes to the understanding of the power relations between human and computer interlocutors. Many aspects of the system are customizable, which makes it flexible in terms of possible supported scenarios. The system can also run on a separate server, which makes it easier to scale its online use.

To showcase its capabilities, we simulated a replication of a shadowing experiment that examined phonetic convergence with respect to certain segment-level phonetic features. Three main user behaviors were found with respect to the participants’ tendency to change their pronunciation based on the system’s stimulus input. This sheds light on possible relations and dynamics between a user and a system in HCI. Running the experiment in this way not only saved time by automating the annotation and phonetic analysis, but also offered additional insights such as visualization and on-the-fly classification. We believe this shows that phonetic convergence can be studied using our SDS, and that this is one step forward toward personalized, phonetically aware SDSs, which will enable more natural and efficient interaction.

Group  Similarity  Agreement (κ)  Size
Low < ***
Mid *
High ***
All *
Table 3: A summary of the measures for similarity and agreement between the predictor annotations of user and model productions in the shadowing phase.

Future work will pursue two independent directions. Regarding phonetic convergence, supporting more features will make the system more comprehensive and useful for studying a wider range of phenomena. Specifically, adding support for supra-segmental features will enable the replication of experiments similar to, e.g., [11] in the same manner as in Section 4. As for user acceptance, it would be interesting to examine whether users show any preference toward an SDS that converges to their speech on the phonetic level, and whether they would change their speaking style based on the system’s output, forming an interaction with mutual and dynamic convergence similar to HHI. The first research question can be tested by comparing user interaction with a baseline system and with one with convergence capabilities, and evaluating the users’ performance and satisfaction. The second can be investigated by comparing the users’ speech when interacting with either system configuration. Additionally, to test the system’s influence on users’ speech, users can train with an intelligent computer-assisted language learning (CALL) system, such as a computer-assisted pronunciation training (CAPT) system, which will change its learner model based on their input. Metrics such as task completion rate, performance accuracy, and completion time can be used to evaluate how helpful the system is.


Funded by the German Research Foundation (DFG) under grants STE 2363/1 and MO 597/6.


  4. As calculated by the kappa2 command of the irr R package (v0.84).


  1. Bell, L., Gustafson, J., Heldner, M.: Prosodic adaptation in human-computer interaction. In: 15th International Congress of Phonetic Sciences (ICPhS). pp. 2453–2456. Barcelona (2003)
  2. Brennan, S.E.: Lexical entrainment in spontaneous dialog. In: International Symposium on Spoken Dialogue (ISSD). pp. 41–44. Philadelphia, PA, USA (1996)
  3. Carlson, R., Edlund, J., Heldner, M., Hjalmarsson, A., House, D., Skantze, G.: Towards human-like behaviour in spoken dialog systems. In: Swedish Language Technology Conference (SLTC). Gothenburg, Sweden (2006)
  4. Coulston, R., Oviatt, S., Darves, C.: Amplitude convergence in children’s conversational speech with animated personas. In: Interspeech. pp. 2689–2692. Denver, CO, USA (2002)
  5. Edlund, J., Heldner, M., Gustafson, J.: Two faces of spoken dialogue systems. In: Workshop Dialogue on Dialogues: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems. Pittsburgh, PA (2006)
  6. Gašić, M., Breslin, C., Henderson, M., Kim, D., Szummer, M., Thomson, B., Tsiakoulis, P., Young, S.: On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 8367–8371. Vancouver, BC, Canada (2013).
  7. Gessinger, I., Raveh, E., Le Maguer, S., Möbius, B., Steiner, I.: Shadowing synthesized speech – segmental analysis of phonetic convergence. In: Interspeech. pp. 3797–3801. Stockholm, Sweden (2017).
  8. Gessinger, I., Schweitzer, A., Andreeva, B., Raveh, E., Möbius, B., Steiner, I.: Convergence of pitch accents in a shadowing task. In: Speech Prosody. pp. 225–229. Poznań, Poland (2018).
  9. Kim, M., Horton, W.S., Bradlow, A.R.: Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Laboratory Phonology 2(1), 125–156 (2011).
  10. Levitan, R.: Acoustic-prosodic Entrainment in Human-human and Human-computer Dialogue. Ph.D. thesis, Columbia University, New York, NY, USA (2014).
  11. Levitan, R., Beňuš, Š., Gálvez, R.H., Gravano, A., Savoretti, F., Trnka, M., Weise, A., Hirschberg, J.: Implementing acoustic-prosodic entrainment in a conversational avatar. In: Interspeech. pp. 1166–1170. San Francisco, CA, USA (2016).
  12. Levitan, R., Hirschberg, J.: Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions. In: Interspeech. pp. 3081–3084. Florence, Italy (2011)
  13. Lewandowski, N.: Talent in Nonnative Phonetic Convergence. Ph.D. thesis, University of Stuttgart, Stuttgart, Germany (2012).
  14. Lison, P., Kennington, C.: Developing spoken dialogue systems with the OpenDial toolkit. In: Workshop on the Semantics and Pragmatics of Dialogue (SemDial). pp. 194–195. Gothenburg, Sweden (2015)
  15. Lopes, J., Eskenazi, M., Trancoso, I.: Automated two-way entrainment to improve spoken dialog system performance. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 8372–8376. Vancouver, BC, Canada (2013).
  16. Michalsky, J., Schoormann, H.: Pitch convergence as an effect of perceived attractiveness and likability. In: Interspeech. pp. 2253–2256. Stockholm, Sweden (2017).
  17. Nenkova, A., Gravano, A., Hirschberg, J.: High frequency word entrainment in spoken dialogue. In: ACL Human Language Technologies (HLT). pp. 169–172. Columbus, OH, USA (2008)
  18. Oviatt, S., Darves, C., Coulston, R.: Toward adaptive conversational interfaces: Modeling speech convergence with animated personas. ACM Transactions on Computer-Human Interaction 11(3), 300–328 (2004).
  19. Pardo, J.S.: On phonetic convergence during conversational interaction. Journal of the Acoustical Society of America 119(4), 2382–2393 (2006).
  20. Parent, G., Eskenazi, M.: Lexical entrainment of real users in the Let’s Go spoken dialog system. In: Interspeech. pp. 3018–3021. Makuhari, Chiba, Japan (2010)
  21. Pickering, M.J., Garrod, S.: Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27(2), 169–190 (2004).
  22. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Burges, C.J.C., Schölkopf, B., Smola, A.J. (eds.) Advances in Kernel Methods, pp. 185–208. MIT Press (1999)
  23. Putman, W.B., Street, R.L.: The conception and perception of noncontent speech performance: Implications for speech-accommodation theory. International Journal of the Sociology of Language 1984(46), 97–114 (1984).
  24. Raveh, E., Steiner, I.: A phonetic adaptation module for spoken dialogue systems. In: Workshop on the Semantics and Pragmatics of Dialogue (SemDial). pp. 162–163. Saarbrücken, Germany (2017)
  25. Raveh, E., Steiner, I., Möbius, B.: A computational model for phonetically responsive spoken dialogue systems. In: Interspeech. pp. 884–888. Stockholm, Sweden (2017).
  26. Schweitzer, A., Walsh, M.: Exemplar dynamics in phonetic convergence of speech rate. In: Interspeech. pp. 2100–2104. San Francisco, CA, USA (2016).
  27. Walker, A., Campbell-Kibler, K.: Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task. Frontiers in Psychology 6(546), 1–18 (2015).