A Cycle-GAN Approach to Model Natural Perturbations in Speech for ASR Applications

Abstract

Naturally introduced perturbations in audio signal, caused by emotional and physical states of the speaker, can significantly degrade the performance of Automatic Speech Recognition (ASR) systems. In this paper, we propose a front-end based on Cycle-Consistent Generative Adversarial Network (CycleGAN) which transforms naturally perturbed speech into normal speech, and hence improves the robustness of an ASR system. The CycleGAN model is trained on non-parallel examples of perturbed and normal speech. Experiments on spontaneous laughter-speech and creaky-speech datasets show that the performance of four different ASR systems improve by using speech obtained from CycleGAN based front-end, as compared to directly using the original perturbed speech. Visualization of the features of the laughter perturbed speech and those generated by the proposed front-end further demonstrates the effectiveness of our approach.

Sri Harsha Dumpala, Imran Sheikh, Rupayan Chakraborty, Sunil Kumar Kopparapu
TCS Research and Innovation - Mumbai, INDIA

Keywords: CycleGAN, laughter-speech, creaky-speech, automatic speech recognition

1 Introduction

The performance of Automatic Speech Recognition (ASR) systems has seen significant jumps with the adoption of deep learning techniques. Recently, ASR systems have been shown to perform on par with human transcribers [xiong2018microsoft]. At the same time, voice assistants such as Siri, Google Assistant and Amazon Alexa have led to the wide use of ASR systems in various day-to-day applications. However, recent studies have shown that adversarial examples, generated either by adding a small amount of noise or by modifying a few bits of the audio signal, can be used to attack ASR systems and make them produce a completely different output [iter2017generating, carlini2018audio], even though the changes in the audio signal cannot be perceived by humans. Similar to these artificial perturbations, natural perturbations in human speech may also have an adverse effect on the performance of ASR systems. Natural perturbations in speech can arise from the psychological and physical state of the speaker. Examples of naturally perturbed speech include expressive speech containing emotions such as laughter, excitement or frustration, and speech produced with different voice qualities such as creakiness or breathiness.

In this paper, we show that the performance of state-of-the-art deep neural network based ASR systems can degrade significantly for speech colored either by emotion or by voice quality. We show that these natural perturbations can be handled by Cycle-consistent Generative Adversarial Networks (CycleGANs) [zhu2017unpaired], a variant of Generative Adversarial Networks (GANs) [goodfellow2014generative] that can learn data distributions across different domains even without a parallel corpus. The generator of our CycleGAN model learns to filter out the natural perturbations in speech and hence can be used as a front-end processor to improve the robustness of ASR to natural perturbations. Interestingly, in the absence of these perturbations in the input speech, the front-end processing does not affect the ASR performance. The main contributions of this work are:

  • An analysis of the performance of state-of-the-art ASR systems on naturally perturbed laughter and creaky-speech.

  • An approach to train a CycleGAN model to obtain a front-end for transforming perturbed speech into normal speech.

  • An analysis of the proposed front-end and its effectiveness in improving performance of state-of-the-art ASR systems.

The rest of the paper is organized as follows. Section 2 provides a brief overview of the related work. Detailed description of our CycleGAN model is given in Section 3. Experiments and results are presented in Section 4 followed by an analysis on the learned transformation in Section 5 and the conclusion in Section 6.

(a) Different blocks used in the generator and discriminator networks.
(b) Block diagram of the generator network. Note: 'c' refers to channels, 'k' to kernel size, 's' to strides and 'T' to the number of frames in the input.
(c) Block diagram of the discriminator network. Note: 'c' refers to channels, 'k' to kernel size and 's' to strides.
Figure 1: Block diagram of our proposed CycleGAN model to transform perturbed speech to normal speech. Note: in the above figure, C refers to a convolutional layer, Gated-C to a gated convolution, I-Gated-C to an instance normalized gated convolution, Res-C to a residual convolution block and SI-Gated-C to a pixel-shuffled I-Gated-C block.

2 Related Work

Previous works have analyzed the effect of emotional speech on ASR and shown significant degradation in the performance of GMM-HMM based ASR systems [athanaselis2005asr, vlasenko2012towards]. They proposed adapting the acoustic and language models of the ASR system to capture the variations exhibited by emotive speech, in order to improve the ASR performance. As opposed to model adaptation, we propose an approach based on transforming emotional speech into normal speech. Recently, emotive-to-neutral speech conversion has been achieved by modeling prosody-based features [raju2016application]. However, this approach requires a parallel corpus (i.e., the same utterance spoken neutrally and with emotion), which is very difficult to collect for spontaneous speech. Similarly, GMM-HMM based systems have been considered for synthesizing creaky-speech [narendra2017generation], but no previous work has considered the conversion of creaky to neutral speech, due to the lack of a parallel corpus of creaky and neutral speech.

We propose a parallel-data-free approach to transform speech perturbed with emotions and voice quality to normal speech, based on CycleGANs [zhu2017unpaired]. CycleGAN was earlier used for voice conversion without parallel-data [kaneko2017parallel]. Compared to [kaneko2017parallel], our approach provides a front-end processor which can add robustness to ASR on utterances perturbed with emotion and voice quality. This paper presents the details of our CycleGAN model, the training loss functions and additional experimental results which further validate the performance of our approach.

3 Perturbed Speech to Normal Speech Transformation with CycleGANs

GANs consist of two different networks, a generator $G$ and a discriminator $D$. The generator $G$ produces fake samples $G(z)$ that resemble a given data distribution $p_{data}(x)$, taking a random sample $z$ drawn from a prior distribution $p_z(z)$ as input, while the discriminator $D$ discriminates the fake samples $G(z)$ from real samples $x$ in the data. Both the generator and the discriminator are trained using an adversarial loss function [goodfellow2014generative]. GANs were initially proposed for the generation of images from arbitrary random noise, and have since achieved impressive results in image generation [denton2015deep], image-to-image translation [isola2017image] and style transfer [johnson2016perceptual]. More recently, unpaired image-to-image translation was successfully learned by a variant of GANs called cycle-consistent adversarial networks (CycleGANs) [zhu2017unpaired, taigman2016unsupervised]. We adopt CycleGANs for the task of non-parallel speech-to-speech emotion conversion.

We use a CycleGAN to model the transformation of perturbed speech features $x \in X$ into normal speech features $y \in Y$. The CycleGAN model architecture considered in this work is motivated by [kaneko2017parallel]. A typical GAN tries to minimize an adversarial loss which measures how far the generated data is from the target data. In the case of perturbed speech to normal speech transformation without parallel utterances, a typical GAN with only the adversarial loss may not be able to preserve the context information in the speech features. The CycleGAN model handles this using a pair of GANs, with a forward generator $G_{X \rightarrow Y}$, an inverse generator $G_{Y \rightarrow X}$ and discriminators $D_Y$ and $D_X$, trained with two adversarial loss functions and an additional cycle consistency loss function.

The first adversarial loss, given as

$\mathcal{L}_{adv}(G_{X \rightarrow Y}, D_Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log(1 - D_Y(G_{X \rightarrow Y}(x)))]$   (1)

corresponds to the forward mapping, i.e., the transformation from perturbed speech to normal speech. The second adversarial loss, given as

$\mathcal{L}_{adv}(G_{Y \rightarrow X}, D_X) = \mathbb{E}_{x \sim P(x)}[\log D_X(x)] + \mathbb{E}_{y \sim P(y)}[\log(1 - D_X(G_{Y \rightarrow X}(y)))]$   (2)

corresponds to the inverse mapping, which transforms the normal speech back to the perturbed speech.

The cycle consistency loss, given as

$\mathcal{L}_{cyc}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) = \mathbb{E}_{x \sim P(x)}[\| G_{Y \rightarrow X}(G_{X \rightarrow Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P(y)}[\| G_{X \rightarrow Y}(G_{Y \rightarrow X}(y)) - y \|_1]$   (3)

helps to preserve the context information by ensuring that perturbed speech can be reconstructed by the cascade of the forward and inverse mapping generators, and that normal speech can be reconstructed by the cascade of the inverse and forward mapping generators.

In addition to the above mentioned losses, we also include the identity-loss function [zhu2017unpaired], given as

$\mathcal{L}_{id}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) = \mathbb{E}_{y \sim P(y)}[\| G_{X \rightarrow Y}(y) - y \|_1] + \mathbb{E}_{x \sim P(x)}[\| G_{Y \rightarrow X}(x) - x \|_1]$   (4)

The identity loss was originally used for color preservation in image-to-image translation, and we found it to be crucial for preserving the linguistic information during the conversion of speech.

The complete loss function $\mathcal{L}_{full}$ of our CycleGAN model is given as

$\mathcal{L}_{full} = \mathcal{L}_{adv}(G_{X \rightarrow Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \rightarrow X}, D_X) + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}$   (5)

The cycle consistency loss is scaled with a trade-off parameter $\lambda_{cyc}$, whereas the identity loss is scaled with a trade-off parameter $\lambda_{id}$.
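To make the interplay of these terms concrete, the following PyTorch sketch computes the generator-side objective of Eq. (5) for one batch, using the least-squares form of the adversarial loss adopted in Section 4.3. The module names G_xy, G_yx, D_x, D_y and the function itself are illustrative placeholders, not our actual implementation; the discriminators are trained with their own least-squares loss separately.

```python
import torch
import torch.nn.functional as F

def cyclegan_losses(G_xy, G_yx, D_x, D_y, x, y, lambda_cyc, lambda_id):
    """Generator-side CycleGAN objective of Eq. (5) for one batch.

    x: perturbed-speech features, y: normal-speech features.
    Least-squares adversarial loss (real -> 1, fake -> 0) is assumed.
    """
    fake_y = G_xy(x)                       # forward mapping X -> Y
    fake_x = G_yx(y)                       # inverse mapping Y -> X

    # Adversarial losses, Eqs. (1) and (2), from the generators' point of view.
    pred_fake_y = D_y(fake_y)
    pred_fake_x = D_x(fake_x)
    adv_xy = F.mse_loss(pred_fake_y, torch.ones_like(pred_fake_y))
    adv_yx = F.mse_loss(pred_fake_x, torch.ones_like(pred_fake_x))

    # Cycle-consistency loss, Eq. (3): each domain must be reconstructable.
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)

    # Identity loss, Eq. (4): mapping a target-domain sample through the
    # generator should leave it (approximately) unchanged.
    idt = F.l1_loss(G_xy(y), y) + F.l1_loss(G_yx(x), x)

    return adv_xy + adv_yx + lambda_cyc * cyc + lambda_id * idt
```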

The generator and discriminator networks in our CycleGAN model consist of different convolutional blocks, as shown in Figure 1(a). Gated convolution (Gated-C) blocks use gated linear units, which have achieved state-of-the-art performance in language and speech modeling [dauphin2017language], as the activation function of the convolutional layers. The instance normalized gated convolution (I-Gated-C) block applies instance normalization, proposed for style transfer in [johnson2016perceptual], after the Gated-C block. Residual convolution (Res-C) blocks are used to stack multiple convolutional layers, enabling a very deep generator network [he2016deep].
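The following PyTorch sketch illustrates one possible realization of the Gated-C and I-Gated-C blocks described above; it is a simplified illustration and the layer sizes are placeholders, not the exact configuration of Figure 1.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated 1-D convolution: output = conv(x) * sigmoid(gate(x))."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, use_norm=False):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride, pad)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size, stride, pad)
        # Instance normalization, applied after gating in the I-Gated-C blocks.
        self.norm = nn.InstanceNorm1d(out_ch, affine=True) if use_norm else None

    def forward(self, x):
        h = self.conv(x) * torch.sigmoid(self.gate(x))   # gated linear unit
        return self.norm(h) if self.norm is not None else h
```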

The generator network consists of a sequence of convolutional blocks, as shown in Figure 1(b): a strided gated convolution block, downsampling I-gated convolution blocks, a stack of residual blocks [he2016deep], upsampling SI-gated convolution blocks, and a final convolution block. All convolution layers in the generator are 1-dimensional, to preserve the temporal structure of speech [kaneko2017sequence]. The discriminator network consists of convolutional blocks as shown in Figure 1(c), with gated linear units as the activation function for all blocks. For the discriminator, we use a patch GAN [ledig2017photo, li2016precomputed], which classifies whether each patch is real or fake (i.e., perturbed or normal speech).
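Since the discriminator is a patch GAN, it outputs a map of real/fake scores (one per patch) rather than a single scalar, and the least-squares adversarial loss of Section 4.3 is averaged over all patches. A minimal, hypothetical sketch of the corresponding discriminator loss is shown below; patch_lsgan_d_loss is an illustrative name, not part of our implementation.

```python
import torch
import torch.nn.functional as F

def patch_lsgan_d_loss(D, real, fake):
    """Least-squares discriminator loss averaged over all patch scores."""
    real_scores = D(real)               # map of per-patch scores for real data
    fake_scores = D(fake.detach())      # per-patch scores for generated data
    loss_real = F.mse_loss(real_scores, torch.ones_like(real_scores))
    loss_fake = F.mse_loss(fake_scores, torch.zeros_like(fake_scores))
    return 0.5 * (loss_real + loss_fake)
```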

4 Experiments and Results

4.1 Dataset

We use two spontaneous speech datasets, namely the AMI meeting corpus [mccowan2005ami] and the Buckeye corpus of conversational speech [pitt2007buckeye], to analyze the effect of natural perturbations. Both datasets provide manual annotations and time-stamps for speech perturbed with emotions and voice-quality. From these datasets, speech from both female and male speakers was used to train gender-dependent CycleGAN models. For each gender and each class (i.e., normal speech, laughter-speech and creaky-speech), a set of short utterances was selected and split into training and test sets. It is to be noted that all these utterances are non-parallel.

                        Google                      IBM                         ASpIRE
                 no FE   FE     FE            no FE  FE     FE            no FE  FE     FE
                        MFBs   MFBs+APs             MFBs   MFBs+APs             MFBs   MFBs+APs
Laughter-  %WER   38.4   30.9   23.5 (14.9)    50.4   49.6   42.4 (8.0)     53.5   45.1   32.5 (21.0)
Speech     %SER   91.8   79.6   75.5 (16.3)    93.1   89.7   89.7 (3.4)     93.1   91.4   89.7 (3.4)
Creaky-    %WER   27.4   22.9   16.4 (11.0)    29.2   24.3   21.3 (7.9)     32.2   30.2   24.3 (7.9)
Speech     %SER   86.1   77.8   63.9 (22.2)    88.9   86.1   86.1 (2.8)     94.4   91.7   83.3 (11.1)
Table 1: ASR performance without front-end (no FE) and with front-end (FE). Numbers in parentheses denote the absolute reduction in error rate relative to no FE.

4.2 Feature Extraction

The WORLD vocoder [morise2016world] is used to extract features from the speech signal. Mel filterbank (MFB) features, the logarithmic fundamental frequency (log F0) and aperiodic components (APs) are extracted over short overlapping analysis windows. The MFBs and APs are modeled by the proposed CycleGAN architecture to convert the features extracted from the input perturbed speech into features corresponding to normal speech. Previous work on speaker conversion [ohtani2006maximum, kaneko2017parallel] used only the spectral features (MFBs). For perturbed speech conversion, however, we found that modeling both the spectral features (MFBs) and the aperiodic components (APs) resulted in better conversion to normal speech than modeling only the spectral features. A logarithm Gaussian normalized transformation [liu2007high] was used to convert the F0 values of the source (perturbed) speech to those corresponding to the target (normal) speech.
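As an illustration, the sketch below uses the pyworld Python bindings of the WORLD vocoder to extract the F0 contour, spectral envelope and aperiodic components, and applies a logarithm Gaussian normalized F0 transformation. The function names, the normalization statistics and the omitted Mel filterbank step are illustrative assumptions, not our exact pipeline.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_world_features(wav_path):
    """Extract F0, spectral envelope and aperiodicity with the WORLD vocoder."""
    x, fs = sf.read(wav_path)
    x = x.astype(np.float64)              # pyworld expects float64 input
    f0, t = pw.harvest(x, fs)             # fundamental frequency contour
    sp = pw.cheaptrick(x, f0, t, fs)      # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)             # aperiodic components (APs)
    return f0, sp, ap, fs

def convert_f0(f0_src, src_stats, tgt_stats):
    """Logarithm Gaussian normalized F0 transformation [liu2007high]."""
    mu_s, sigma_s = src_stats             # mean/std of log F0, source speech
    mu_t, sigma_t = tgt_stats             # mean/std of log F0, target speech
    voiced = f0_src > 0                   # transform only voiced frames
    f0_out = np.zeros_like(f0_src)
    f0_out[voiced] = np.exp(
        (np.log(f0_src[voiced]) - mu_s) / sigma_s * sigma_t + mu_t)
    return f0_out
```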

                         no FE   FE (MFBs)   FE (MFBs+APs)
Laughter speech   %CER    56.5      53.0       41.7 (14.8)
Creaky speech     %CER    33.5      29.8       23.7 (9.8)
Table 2: DeepSpeech model performance without front-end (no FE) and with front-end (FE), in terms of character error rate (%CER). Numbers in parentheses denote the absolute reduction in error rate relative to no FE.

4.3 Training Details

In order to achieve more stable training of the CycleGAN models and to generate higher quality outputs, we used the least-squares function to compute the adversarial loss instead of the commonly used negative log-likelihood objective [mao2017least, zhu2017unpaired]. The CycleGAN models were trained using the Adam optimizer, with separate initial learning rates for the generator and the discriminator, both decayed after every epoch. In all experiments, the cycle consistency loss trade-off parameter $\lambda_{cyc}$ was kept fixed, while the identity-loss trade-off parameter $\lambda_{id}$ was used with a non-zero value only during the initial training epochs.
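A minimal sketch of the corresponding training loop is given below; it reuses the cyclegan_losses sketch from Section 3, and all hyper-parameter values shown are illustrative placeholders rather than the values used in our experiments.

```python
import torch

def train(G_xy, G_yx, D_x, D_y, loader, num_epochs):
    # Placeholder hyper-parameters (illustrative only, not the paper's values).
    lr_g, lr_d, decay = 2e-4, 1e-4, 0.95
    lambda_cyc, lambda_id0, id_epochs = 10.0, 5.0, 20
    opt_g = torch.optim.Adam(
        list(G_xy.parameters()) + list(G_yx.parameters()), lr=lr_g)
    opt_d = torch.optim.Adam(
        list(D_x.parameters()) + list(D_y.parameters()), lr=lr_d)
    for epoch in range(num_epochs):
        # Assumed schedule: identity loss active only for the first epochs.
        lambda_id = lambda_id0 if epoch < id_epochs else 0.0
        for x, y in loader:         # non-parallel perturbed / normal feature batches
            loss_g = cyclegan_losses(G_xy, G_yx, D_x, D_y, x, y,
                                     lambda_cyc, lambda_id)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
            # Discriminator updates with the least-squares patch loss omitted here.
        for opt in (opt_g, opt_d):  # decay both learning rates after every epoch
            for group in opt.param_groups:
                group['lr'] *= decay
```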

In this paper, we trained gender-specific CycleGAN models (using speech collected from multiple speakers of the same gender) to transform perturbed speech to normal speech. Unlike the task of voice conversion, where each model is trained to convert speech between a pair of speakers, our CycleGAN models have to handle variations across multiple speakers (within the same gender) while transforming perturbed speech to normal speech. Moreover, the test set contains speech from speakers unseen during training, in order to assess how well the trained models generalize.

(a) Laughter-speech
    Gnd.       see if anything out there to see you such
    Orig.      [noise] see anything out there to you [laughter]
    MFBs       see if anything out there to you use
    MFBs+APs   see if anything out there to see you
(b) Creaky-speech
    Gnd.       think they should get rid of it
    Orig.      like they should get kid
    MFBs       think they should get rid of
    MFBs+APs   think they should get rid of
Table 3: Outputs of the ASpIRE ASR model when different signals (original and transformed) are provided as input. Note: Gnd. and Orig. refer to the ground truth and the original signal, respectively. MFBs and MFBs+APs refer to signals transformed using CycleGAN models trained on only MFBs and on MFBs+APs, respectively.
(a) Normal Speech
(b) Laughter Perturbed Speech
(c) Transformed (Normal) Speech
Figure 2: t-SNE projection of Mel filterbank output features (Best viewed in color).

4.4 Results

Table 1 presents the performance of the Google cloud ASR1, IBM ASR2 and Kaldi ASR (with ASpIRE models) [peddinti2015jhu, aspirelink], with and without our proposed front-end, when tested with laughter-speech (speech perturbed with emotion) and creaky-speech (speech perturbed with voice-quality). The performance is evaluated in terms of Word Error Rate (%WER) and Sentence Error Rate (%SER); lower values indicate better performance. Table 1 shows that our proposed front-end improves the performance of each of the ASR systems. It can also be observed that modeling both the spectral and aperiodic components (i.e., MFBs+APs) performs better than modeling only MFBs in the proposed front-end. Absolute reductions of 14.9%, 8.0% and 21.0% in WER, and 16.3%, 3.4% and 3.4% in SER, are achieved for the Google ASR, IBM ASR and ASpIRE ASR, respectively, when our proposed front-end (MFBs+APs) is used to convert laughter-speech to normal speech. Similarly, absolute reductions of 11.0%, 7.9% and 7.9% in WER, and 22.2%, 2.8% and 11.1% in SER, are obtained for the Google ASR, IBM ASR and ASpIRE ASR, respectively, when our proposed front-end (MFBs+APs) is used to convert creaky-speech to normal speech.

The ASR performances shown in Table 1 are influenced by the strength of the language model used by the respective ASR systems. To assess ASR performance without the effect of a language model, we also report results with the DeepSpeech model3, which converts speech into a sequence of English characters. Table 2 shows the Character Error Rate (%CER) of the DeepSpeech model with and without the proposed front-end. The DeepSpeech model was trained on LibriSpeech data and did not use a language model for decoding. It can be observed from Table 2 that our proposed front-end gives a significant reduction in the CER of the DeepSpeech model.
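For reference, both the WER of Table 1 and the CER of Table 2 are length-normalized edit distances between the reference transcription and the ASR output; a minimal implementation is sketched below.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(ref, hyp):
    """Word error rate (%): edit distance over words, normalized by reference length."""
    return 100.0 * edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate (%): edit distance over characters."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)
```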

Table 3 shows the speech-to-text outputs of the ASpIRE model when tested with the original (Orig.) perturbed speech (laughter-speech and creaky-speech) and with the normal speech obtained from CycleGAN models trained on MFBs and on MFBs+APs, respectively. For instance, it can be observed from Table 3 that the original laughter-speech signal is transcribed with [noise] and [laughter] tags along with substitutions and deletions (i.e., if, see and such), whereas the transformed signals contain no [noise] or [laughter] (for both MFBs and MFBs+APs) and have fewer deletions (to, use and such for MFBs, and only such deleted for MFBs+APs). Similarly, for creaky-speech as input, the transformed signals are better transcribed (only "it" is deleted for both the MFBs and MFBs+APs models) compared to the original creaky-speech signal ("of it" is deleted and "rid" is substituted with "kid" for the Orig. signal).

5 Analysis of the Learned Front-end Transformation

Figure 2 shows a 2-dimensional t-SNE projection [vanDerMaaten2008] of the Mel filterbank features for (a) normal speech, (b) laughter perturbed speech [dumpala2014analysis] and (c) laughter perturbed speech transformed to normal speech by the proposed front-end. It can be observed that the filterbank features of normal speech and of the transformed (normal) speech are quite similar to each other, and that both differ significantly from the filterbank features of laughter-speech. Additionally, the spread of the filterbank features for laughter-speech is reduced in the 2-dimensional t-SNE space. We hypothesize that this may be due to the reduction in vowel space for laughter-speech [bachorowski2001acoustic].
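A projection of this kind can be obtained with a standard t-SNE implementation; the sketch below uses scikit-learn, where feats_normal, feats_laughter and feats_transformed are assumed to be frame-level Mel filterbank feature matrices (one row per frame).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(feats_normal, feats_laughter, feats_transformed):
    """2-D t-SNE projection of frame-level Mel filterbank features."""
    feats = np.vstack([feats_normal, feats_laughter, feats_transformed])
    proj = TSNE(n_components=2).fit_transform(feats)
    sizes = [len(feats_normal), len(feats_laughter), len(feats_transformed)]
    labels = ['normal', 'laughter', 'transformed (normal)']
    start = 0
    for n, label in zip(sizes, labels):
        plt.scatter(proj[start:start + n, 0], proj[start:start + n, 1],
                    s=2, label=label)
        start += n
    plt.legend()
    plt.show()
```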

Figure 3: Violin plots of the outputs of selected Mel filterbank filters (Best viewed in color).

For a more detailed analysis, Figure 3 shows violin plots [Hintze.Nelson1998] of the outputs of a subset of the Mel filterbank filters, for normal speech, laughter perturbed speech and laughter perturbed speech transformed to normal speech. The outputs of the remaining filters do not show visible differences and hence are not shown. It can be observed from Figure 3 that the feature-value distributions of normal speech and of the transformed (normal) speech are similar and exhibit similar variations. This implies that the front-end is able to (a) capture the distribution of the Mel filterbank outputs of both normal and laughter perturbed speech, and (b) transform laughter perturbed speech into equivalent normal speech.

6 Conclusion

We proposed a novel front-end based on CycleGANs to transform naturally perturbed speech into normal speech. Experiments on spontaneous laughter-speech and creaky-speech utterances show significant improvements in the performance of the Google ASR, the IBM ASR, the Kaldi ASR with ASpIRE models, and a DeepSpeech model. We found that adding aperiodic components to the spectral features gives better performance. Visualization of the laughter-speech features and the transformed speech features gives insight into the transformation performed by our proposed front-end.

References

Footnotes

  1. https://cloud.google.com/speech-to-text/
  2. https://www.ibm.com/watson/services/speech-to-text/
  3. https://github.com/mozilla/DeepSpeech/releases/tag/v0.4.0-alpha.3