Non linear time compression of clear and normal speech at high rates
We compare a series of time compression methods applied to normal and clear speech. First, we evaluate a linear (uniform) method applied to these styles as well as to naturally produced fast speech. We found, in line with the literature, that unprocessed fast speech was less intelligible than linearly compressed normal speech. Fast speech was also less intelligible than compressed clear speech, but at the highest rate (three times faster than normal) the advantage of clear over fast speech was lost. To test whether this was due to shorter speech duration, we evaluate, in our second experiment, a range of methods that compress speech and silence at different rates. We found that even when the overall duration of speech and silence is kept the same across styles, compressed normal speech is still more intelligible than compressed clear speech. Compressing silence twice as much as speech improved results further for normal speech at very little additional computational cost.
|Cassia Valentini-Botinhao, Mirjam Wester, Junichi Yamagishi|
|Markus Toman, Michael Pucher, Dietmar Schabus|
|The Centre for Speech Technology Research (CSTR), University of Edinburgh, UK|
|National Institute of Informatics, Japan|
|Telecommunications Research Center Vienna (FTW), Austria|
Achieving fast speech with high levels of intelligibility is an elusive goal [1, 2, 3, 4, 5]. Nevertheless, there are compelling reasons why one would want to achieve it. Examples include speeding through recordings of long meetings to quickly obtain relevant content [6, 7] and providing a speech output interface for blind users of text-to-speech (TTS) systems [8, 9]. Furthermore, understanding what makes speech more intelligible can also bring improvements to hearing aids [10, 1, 2].
During the production of fast speech there is an increasing amount of overlap of articulatory gestures which results in a decrease in intelligibility, as the articulatory targets, important for clear pronunciation, are no longer reached. When producing fast speech, vowels are compressed more than consonants  and both word-level  and sentence-level  stressed syllables are compressed to a lesser degree than unstressed ones. Yet another important aspect of fast speech is the significant reduction in pauses. It is claimed that reducing pauses is possibly the strongest acoustic change when speaking faster , most probably due to the limitations of how much speakers can speed up their articulation rate .
Janse and colleagues [3, 4] have shown that fast speech (approximately times faster than normal speech) is harder to process, in terms of reaction times, and is less preferred than linearly compressed speech. Following the literature, linearity here refers to the fact that the compression rate is the same across the sentence, i.e. vowels, consonants, silence and speech are compressed at the same rate. Furthermore in , Janse and colleagues found that linearly compressed speech is more intelligible and preferred over a non linearly compressed version in which fast speech prosodic patterns were mimicked at a high speaking rate ( times).  reported that linearly speeding up normal rate sentences to a fast rate led to more intelligible sentences. Speeding both the natural and fast speech further (ultra-fast) resulted in comparable levels of intelligibility with the fast speech being rated as more natural. Similarly, the results in  show that linear compression of natural plain speech leads to higher intelligibility rates than natural fast speech.
Janse claims in  that possibly the only non linear aspect of fast speech duration changes that can improve intelligibility at high speaking rates is the removal of pauses but only when rates are relatively high, i.e., non linear compression (compressing pauses more than speech) at high speaking rates (faster than fast speech). Results obtained using the MACH1 algorithm  confirm this. The MACH1 method is based on the acoustics of fast speech with the addition of compressed pauses. At ultra fast speaking rates ( and ) MACH1 improves comprehension and is preferable to linearly compressed speech, however, no advantage was found at a fast speech speaking rate () .  proposed a non uniform time scaling of speech based on the waveform similarity overlap and add (WSOLA) time compression method  where pauses, vowels, phone transitions and consonants are compressed differently (order is from more to less). Authors reported a preference for the non linear compression method over the linear compression method and natural fast speech.
All the above studies show that time compression methods using fast speech do not tend to lead to higher intelligibility scores than when speech at normal speaking rates is compressed. At high rates non linear compression of normal speech shows some promise.
A possible alternative to using fast speech is approaching the problem from the other direction, i.e., by using clear speech as the basis for compression [20, 21]. Clear speech is a speaking style adopted when speaking in difficult communication situations. It is significantly more intelligible than conversational speech (particularly for individuals with some sort of hearing impairment) but at the expense of longer utterance duration . Studies [1, 2] testing linear and non linear time compression of clear speech found that for both compression methods clear speech reproduced at a conversational speaking rate was not more intelligible than conversational speech. However,  did find that non uniform time compression was less deleterious to the intelligibility of clear speech than a linear method, but both types of compressed clear speech were still no more intelligible than unprocessed conversational speech; differences were larger for hearing impaired listeners. Krause and Braida  thought that this might be a problem with the compression method and that the intelligibility advantage of clear speech is not only due to its longer duration. They  investigated whether clear speech produced at high speaking rates could still bring intelligibility benefits over conversational speech and found that it does.  also found that clear speech which was linearly compressed to match casual speech speeds was more intelligible than unmodified casual speech.
In this paper, we are interested in testing whether the clear speech advantage still holds for particularly high speaking rates and for normal hearing individuals, aiming to reproduce such results for the generation of synthetic speech to be used by blind individuals. The questions this paper sets out to answer are: can compressed clear speech be more intelligible than compressed normal speech, and can we improve results by applying a simpler non linear technique that compresses speech and silence regions at different rates?
The remainder of this paper is organised as follows. Section 2 presents the database used in the experiments. Section 3 shows results of linear compression applied to a range of speaking styles at a range of speaking rates. Section 4 describes the evaluation of non linear time compression methods applied to clear and normal speech at the highest rate. This is followed by a discussion and conclusions.
We recorded a Scottish female voice talent reading prompts presented sentence by sentence. The same sentences were read in four styles: normal, fast and two types of clear. Each style was elicited by different instructions. For the normal style we asked the voice talent to speak as she normally would. For the fast style she was asked to read the sentences out loud as fast as she could while still maintaining intelligibility. To create the two types of clear speech she was instructed to speak as if talking to someone with a hearing impairment (clear h) and to a computer (clear c). To illustrate what the database looks like, Fig. 1 shows the waveform and spectrogram in Praat for the same sentence spoken in the fast and clear h styles. Speech samples used in the evaluation can be found at: https://wiki.inf.ed.ac.uk/CSTR/ClearSpeech.
These instructions led to speech with a wide range of timing properties. Table 1 presents, for each speaking style, the syllables per second (SPS), words per minute (WPM), speech duration () and silence duration (). All values are calculated per sentence and averaged across sentences. The values of SPS and WPM were based on a manual annotation of part of the data and consider the whole utterance including pauses, while the other values were calculated automatically by using an energy based speech detection method .
Table 1 shows that the slowest style is clear c with SPS, followed by clear h with , normal with and fast with . The durations of speech and silence indicate how much of these differences is due to longer speech regions and how much to longer silence regions. For this analysis, we take normal as the reference. Silence regions include not only pauses but also phone regions such as the burst that takes place during stops. The rate of increase of speech and silence duration is of and for clear h and and for clear c. The rate of compression of fast speech and silence duration is of and . These values show that the overall durational differences seen in the clear and fast speech style recorded here were in fact due to silence regions being stretched and compressed, respectively.
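The two rate measures reported per sentence can be illustrated with a minimal sketch. It assumes, as stated above, that the whole utterance duration including pauses is used; the function name and the example counts are hypothetical and are not taken from the recorded data.

```python
def speaking_rates(n_syllables: int, n_words: int, utterance_dur_s: float):
    """Return (syllables per second, words per minute) for one sentence.

    The duration is the whole utterance, pauses included (assumption
    taken from the description of the SPS and WPM measures).
    """
    if utterance_dur_s <= 0:
        raise ValueError("utterance duration must be positive")
    sps = n_syllables / utterance_dur_s
    wpm = 60.0 * n_words / utterance_dur_s
    return sps, wpm


# Hypothetical sentence: 12 syllables, 8 words, 4 seconds long.
sps, wpm = speaking_rates(12, 8, 4.0)
```

Per-style values would then be obtained by averaging these per-sentence figures across sentences.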
3 Linear time compression
Fast speech has been reported to be less intelligible than linearly compressed normal speech. Here we explore whether the same holds true when compressing other speaking styles. We evaluate the intelligibility of speech compressed using the waveform similarity overlap and add (WSOLA) time compression method  to illustrate a linear, also referred to as uniform, compression as was evaluated in [18, 22, 9].
In this section, we compress the four different speech styles described in the previous section at four different rates. The rates chosen for this experiment are: fast (the rate that brings each style’s duration to match the duration of the fast speech) as well as , and times faster than normal speech. Intelligibility results for linearly compressed normal and fast speech for the same speaker have previously been reported in .
3.1 Listening experiment
Twenty native English speakers with no self reported hearing impairment participated in this experiment. Each individual listened to eight different sentences for each of the conditions tested and had to type the words they understood sentence by sentence. Prior to the test they undertook a small training session containing one example of each condition.
The results are calculated as the percentage of word errors averaged across listeners, taking into consideration misspellings and word contractions. Word errors are counted as words that did not appear in the transcription, irrespective of their placement as done in . Fig. 2 shows the word errors for each speaking style at each compression rate. Error bars refer to the standard deviation of the error calculated across listeners.
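An order-independent word error score of this kind can be sketched as follows. This is only an illustration under stated assumptions: words are compared as lower-cased whitespace tokens, repeated words are counted as a multiset, and the handling of misspellings and contractions mentioned above is not reproduced. The function name is hypothetical.

```python
from collections import Counter


def word_error_percent(reference: str, transcription: str) -> float:
    """Percentage of reference words missing from the transcription,
    ignoring word order."""
    ref = Counter(reference.lower().split())
    hyp = Counter(transcription.lower().split())
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Counter subtraction keeps only positive counts, i.e. reference
    # words the listener did not type (or typed too few times).
    missing = sum((ref - hyp).values())
    return 100.0 * missing / sum(ref.values())


# Word order does not matter: three of the six reference words are missing.
score = word_error_percent("the cat sat on the mat", "mat the cat")
```

Because placement is ignored, a listener who types the right words in the wrong order is not penalised, which matches the counting rule described above.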
The most intelligible style for all rates was the normal style, leading to only % word errors for the highest speaking rate. The least intelligible style was the computer directed clear speech, which already produced a similar error at the 2x rate and more than % word errors at the 3x rate. For moderate speaking rates (fast and 2x) the most intelligible style, after normal, was the hearing impaired directed clear speech (clear h), with less than % errors for the 2x rate, while for higher rates fast speech becomes more intelligible, with % for the 3x rate where clear h obtained %. At the fast speech rate, linearly compressed normal speech was more intelligible than uncompressed fast speech, which supports the findings in [3, 4]. We found that at this rate linearly compressed clear h was also more intelligible than unprocessed fast speech.
One particularly interesting finding was the fact that, at higher rates, compressed fast speech was more intelligible than compressed clear speech even though fast speech at its own rate was found to be less intelligible. In our view, there are two striking differences between the fast and clear h data: the duration of the silence and speech regions and how well articulatory targets were met. We expect that fast speech is inherently less intelligible than normal and clear speech due to substitutions and deletions that take place when one speaks fast. Therefore, we expect that the fast speech advantage at higher rates is due to highly compressed silence.
4 Non linear time compression
Clear speech is known to be more intelligible than normal speech but in our previous experiment we found no intelligibility gains when compressing clear speech to produce speech at high rates. The fact that fast speech, even though inherently less intelligible, led to better results at the highest rate, could indicate that compressed clear speech is not as intelligible because of the presence of long silences which makes the speech duration of its compressed version much shorter.
For this experiment, we focus on improving results for compressed clear h and possibly normal speech at the highest speaking rate of 3x. To do so we explore a range of non linear compression methods, all of which apply different rates to speech and silence. To identify silence regions we apply a silence detection algorithm based on a fixed energy threshold as done in . For all methods, we calculate the rate per frame according to the characteristics of the current frame (speech or silence) and feed this information to the WSOLA method, which calculates the best match for the next frame to overlap and add.
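The per-frame rate assignment can be sketched as below. This is a simplified illustration, not the detector used in the experiments: the non-overlapping framing, the mean-squared-amplitude energy measure, the threshold value and both function names are assumptions, and the WSOLA step that would consume the rates is not shown.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def per_frame_rates(samples, frame_len, threshold, speech_rate, silence_rate):
    """Label each frame as speech or silence with a fixed energy
    threshold and return the compression rate chosen for it."""
    return [
        speech_rate if energy >= threshold else silence_rate
        for energy in frame_energies(samples, frame_len)
    ]


# Toy signal: silence, a short burst of energy, silence.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
rates = per_frame_rates(samples, frame_len=4, threshold=0.01,
                        speech_rate=3.0, silence_rate=6.0)
```

Since each decision depends only on the current frame, this kind of labelling can run online, without first scanning the whole utterance.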
4.1 Non linear methods
Table 2 presents the acronyms for the conditions we evaluate. L refers to the condition tested in the previous experiment, i.e. compress the whole utterance (speech and silence) with the same rate (the linear or uniform method).
We were interested in exploring whether clear speech was found to be less intelligible at higher rates because the duration of speech was shorter due to the presence of longer silence regions. To test this hypothesis we create condition NL1 where we compress speech to the same rate as in L so that speech duration is the same across styles and silence remains uncompressed. The final utterance duration will therefore be longer for NL1 than for L.
The resulting durations across speaking styles will vary as the duration of silence is different which makes the comparison of NL1 across styles unfair. The fair comparison is condition NL2 applied to clear speech only, where silence is also compressed so that the overall utterance duration is the same across N-NL1 and C-NL2.
Finally, to test the theory that at higher rates pauses harm more than they help, we compress silence Y times more than speech for both styles (N-NL3 and C-NL3). Y was chosen to be two.
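The relation between the speech and silence rates in conditions L, NL1 and NL3 can be illustrated with a short sketch. It rests on two assumptions drawn from the descriptions in this paper: NL1 leaves silence uncompressed, and NL3 keeps the same overall utterance duration as L while compressing silence Y times more than speech. The function names and example durations are hypothetical, and NL2 is omitted because it depends on the duration of another style.

```python
def compression_rates(speech_dur, silence_dur, linear_rate, silence_factor=2.0):
    """Return {condition: (speech_rate, silence_rate)} for L, NL1 and NL3."""
    total = speech_dur + silence_dur
    # NL3: solve speech_dur / r + silence_dur / (Y * r) = total / linear_rate
    # for r, so that the final duration matches the linear condition.
    nl3_speech = linear_rate * (speech_dur + silence_dur / silence_factor) / total
    return {
        "L": (linear_rate, linear_rate),
        "NL1": (linear_rate, 1.0),
        "NL3": (nl3_speech, silence_factor * nl3_speech),
    }


def compressed_duration(speech_dur, silence_dur, rates):
    """Utterance duration after applying (speech_rate, silence_rate)."""
    speech_rate, silence_rate = rates
    return speech_dur / speech_rate + silence_dur / silence_rate


# Hypothetical sentence: 2.0 s of speech, 1.0 s of silence, 3x linear rate.
plan = compression_rates(2.0, 1.0, 3.0)
durations = {name: compressed_duration(2.0, 1.0, r) for name, r in plan.items()}
```

In this toy case NL3 compresses speech less than L does (2.5x instead of 3x) and pays for it by compressing silence at 5x, which is the trade-off the NL3 condition is designed to test.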
4.2 Listening experiment
As in the previous experiment, twenty native English speakers with no self reported hearing impairment transcribed eight sentences for each condition after hearing each sentence only once. One participant was removed from the results as his/her word errors were found to be excessively high compared to the others.
The results are calculated as the percentage of word errors averaged across listeners, in the same way as in the previous experiment. Fig. 3 presents the word error in percentage for the 3x rate.
Similar to what was found in the previous experiment the word errors for linearly compressed clear speech (C-L) % is around twice as high as results with linearly compressed normal speech (N-L) %. This relation remains the same when the final duration of speech and silence is the same across styles: % of N-NL1 and % of C-NL2.
Not compressing pauses improved results for both styles significantly, as NL1 word errors are lower than those of L, but at the expense of a longer utterance duration. Compressing silence twice as much as speech also improves results: for both styles NL3 results are better than L, even though the utterance duration is the same. The difference was found to be significant only for the normal style. All types of non linear methods applied to clear speech brought its intelligibility scores closer to those of normal speech, but never equal to or better than them.
We were interested in finding whether linearly compressed clear speech was found to be less intelligible at higher rates because the duration of speech was shorter due to the presence of longer silence regions. This was however not the case: when we set the duration of speech and silence to be the same for compressed normal speech and clear speech, we found that compressed normal speech was more intelligible. One possible reason is that clear speech had to be compressed considerably more than normal speech, which could have caused more artefacts due to larger phase differences at frame boundaries brought about by the compression. The WSOLA implementation used here  suggests a compression rate of no more than 4.0x and for many sentences clear speech was compressed more than this. Future work will involve reducing such artefacts.
The non linear compression method improved results for both styles. Unfortunately, this was not enough to make compressed clear speech more intelligible than normal speech. The improvement is nonetheless an interesting result, as the overhead of applying energy detection is quite small and requires no further delay; it can be done online, as opposed to methods where all silence regions are first completely removed. We would like to quantify what further intelligibility gains more complicated non linear methods inspired by fast speech acoustics [16, 18] might bring and particularly whether most of the gain they obtain is due to heavy silence compression.
Another point of discussion is why the type of clear speech used in this experiment (read clear speech) was less intelligible after time compression; this may be because it was not recorded in a communicative spontaneous task, where acoustic changes are known to be less extreme .
In this paper we exploit the use of clear speech to increase the intelligibility of speech reproduced at extremely high speaking rates. We first evaluate linearly compressed speech of four styles produced by the same speaker: normal, fast and two types of clear speech, computer directed (clear c) and hearing-impaired directed (clear h). We found that unprocessed fast speech was less intelligible than linearly compressed clear h and normal speech, but at the highest speaking rates clear h was worse than fast. As possibly the only advantage fast speech has over clear speech in terms of intelligibility is its shorter silence regions, we exploit in our second experiment a range of time compression methods that compress speech and silence differently. We found that even when the duration of speech and silence is kept the same across styles, compressed normal speech is still more intelligible. Compressing silence twice as much as speech improved results for both styles at the expense of very little overhead.
This work was partially supported by the BMWF - Sparkling Science project Sprachsynthese von Auditiven Lehrbüchern für Blinde SchülerInnen (SALB) and the EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).
- [1] M. Picheny, N. Durlach, and L. Braida, “Speaking clearly for the hard of hearing III: An attempt to determine the contribution of speaking rate to differences in intelligibility between clear and conversational speech,” J. Speech Hear. Res., vol. 32, no. 3, pp. 600–603, 1989.
- [2] R. Uchanski, S. Choi, L. Braida, C. Reed, and N. Durlach, “Speaking clearly for the hard of hearing IV: Further studies of the role of speaking rate,” J. Speech Hear. Res., vol. 39, no. 3, pp. 494–509, 1996.
- [3] E. Janse, S. Nooteboom, and H. Quené, “Word-level intelligibility of time-compressed speech: prosodic and segmental factors,” Speech Comm., vol. 41, no. 2, pp. 287–301, 2003.
- [4] E. Janse, “Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech,” Speech Comm., vol. 42, no. 2, pp. 155–173, 2004.
- [5] J. Krause and L. Braida, “Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility,” J. Acoust. Soc. Am., vol. 112, no. 5, pp. 2165–2172, 2002.
- [6] B. Arons, “Speechskimmer: a system for interactively skimming recorded speech,” ACM Trans. on Computer-Human Interaction, vol. 4, no. 1, pp. 3–38, 1997.
- [7] S. Tucker and S. Whittaker, “Time is of the essence: an evaluation of temporal compression algorithms,” in Proc. SIGCHI Conf. on Human Factors in Comp. Sys. ACM, 2006, pp. 329–338.
- [8] M. Pucher, D. Schabus, and J. Yamagishi, “Synthesis of fast speech with interpolation of adapted HSMMs and its evaluation by blind and sighted listeners,” in Proc. Interspeech, Chiba, Japan, Sept. 2010, pp. 2186–2189.
- [9] C. Valentini-Botinhao, M. Toman, M. Pucher, D. Schabus, and J. Yamagishi, “Intelligibility analysis of fast synthesized speech,” in Proc. Interspeech, Singapore, Sept. 2014.
- [10] M. Picheny, N. Durlach, and L. Braida, “Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech,” J. Speech Hear. Res., vol. 28, no. 1, pp. 96–103, 1985.
- [11] T. Gay, “Effect of speaking rate on vowel formant movements,” J. Acoust. Soc. Am., vol. 63, no. 1, pp. 223–230, 1978.
- [12] R. Port, “Linguistic timing factors in combination,” J. Acoust. Soc. Am., vol. 69, no. 1, pp. 262–274, 1981.
- [13] F. Goldman-Eisler, Psycholinguistics: Experiments in spontaneous speech. London: Academic Press, 1968.
- [14] R. Greisbach, “Reading aloud at maximal speed,” Speech Comm., vol. 11, no. 4-5, pp. 469–473, 1992.
- [15] D. Moers, P. Wagner, B. Möbius, F. Müllers, and I. Jauk, “Integrating a fast speech corpus in unit selection speech synthesis: Experiments on perception, segmentation, and duration prediction,” in Proc. Speech Prosody, Chicago, USA, May 2010.
- [16] M. Covell, M. Withgott, and M. Slaney, “Mach1: Nonuniform time-scale modification of speech,” in Proc. ICASSP, vol. 1. Seattle, USA: IEEE, May 1998, pp. 349–352.
- [17] L. He and A. Gupta, “Exploring benefits of non-linear time compression,” in Proc. ACM Int. Conf. on Multimedia. Ottawa, Canada: ACM, Sept. 2001, pp. 382–391.
- [18] M. Demol, W. Verhelst, K. Struyve, and P. Verhoeve, “Efficient non-uniform time-scaling of speech with WSOLA,” in Proc. SPECOM, Patras, Greece, Oct. 2005, pp. 163–166.
- [19] W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech,” in Proc. ICASSP, vol. 2, April 1993, pp. 554–557.
- [20] R. Uchanski, “Clear speech,” in The Handbook of Speech Perception, D. B. Pisoni and R. E. Remez, Eds. Blackwell, 2008, pp. 207–235.
- [21] V. Hazan and R. Baker, “Acoustic-phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions,” J. Acoust. Soc. Am., vol. 130, no. 4, pp. 2139–2152, 2011.
- [22] M. Koutsogiannaki, M. Pettinato, C. Mayo, V. Kandia, and Y. Stylianou, “Can modified casual speech reach the intelligibility of clear speech?” in Proc. Interspeech, Portland, USA, 2012.
- [23] M. Cooke, C. Mayo, C. Valentini-Botinhao, Y. Stylianou, B. Sauert, and Y. Tang, “Evaluating the intelligibility benefit of speech modifications in known noise conditions,” Speech Comm., vol. 55, pp. 572–585, 2013.
- [24] L. Rabiner and R. Schafer, Theory and Applications of Digital Speech Processing, 1st ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2010.
- [25] V. Hazan and R. Baker, “Does reading clearly produce the same acoustic-phonetic modifications as spontaneous speech in a clear speaking style?” in Proc. DiSS-LPSS Joint Workshop, 2010, pp. 7–10.