Improving GANs for Speech Enhancement

Abstract

Generative adversarial networks (GANs) have recently been shown to be efficient for speech enhancement. Most, if not all, existing speech enhancement GANs (SEGANs) make use of a single generator to perform one-stage enhancement mapping. In this work, we propose two novel SEGAN frameworks, iterated SEGAN (ISEGAN) and deep SEGAN (DSEGAN). In the two proposed frameworks, the GAN architectures are composed of multiple generators that are chained to accomplish a multiple-stage enhancement mapping, gradually refining the noisy input signals in a stage-wise fashion. On the one hand, ISEGAN’s generators share their parameters to learn an iterative enhancement mapping. On the other hand, DSEGAN’s generators share a common architecture but their parameters are independent; as a result, different enhancement mappings are learned at different stages of the network. We empirically demonstrate favorable results obtained by the proposed ISEGAN and DSEGAN frameworks over the vanilla SEGAN. The source code is available at http://github.com/pquochuy/idsegan.

speech enhancement, generative adversarial networks, SEGAN, ISEGAN, DSEGAN

I Introduction

The goal of speech enhancement is to improve the quality and intelligibility of speech which is degraded by background noise [14, 32]. Speech enhancement can serve as a front-end to improve the performance of an automatic speech recognition system [31]. It also plays an important role in applications like communication systems, hearing aids, and cochlear implants, where contaminated speech needs to be enhanced prior to signal amplification to reduce discomfort [32]. Significant progress on this research topic has been made with the involvement of deep learning paradigms. Deep neural networks (DNNs) [2, 11], convolutional neural networks (CNNs) [17, 15], and recurrent neural networks (RNNs) [31, 5] have been exploited either to produce the enhanced signal directly via a regression form [2, 17] or to estimate the contaminating noise which is subtracted from the noisy signal to obtain the enhanced signal [15]. Significant improvements in speech enhancement performance have been reported by these deep-learning based methods compared to more conventional ones, such as Wiener filtering [13], spectral subtraction [3], or minimum mean square error (MMSE) estimation [4, 6].

There exists a class of generative methods relying on generative adversarial networks (GANs) [7] which have recently been demonstrated to be efficient for speech enhancement [18, 19, 9, 22, 12, 21]. When GANs are used for this task, the enhancement mapping is accomplished by the generator G, whereas the discriminator D, by discriminating between real and fake signals, transmits information to G so that G can learn to produce output that resembles the realistic distribution of the clean signals. Using GANs, speech enhancement can be done using either magnitude spectrum input [12] or raw waveform input [18, 19], although the latter is more desirable due to being end-to-end in nature.

Fig. 1: Illustration of the vanilla SEGAN [18], the proposed ISEGAN with two shared generators G, and the proposed DSEGAN with two independent generators G_1 and G_2.

Existing speech enhancement GAN (SEGAN) systems share a common feature – the enhancement mapping is accomplished via a single stage with a single generator [18, 19, 12], which may not be optimal. Here we break the entire enhancement mapping into multiple maps in a divide-and-conquer manner. Each of the “simpler” mappings is realized by a generator and the generators are chained to enhance a noisy input signal gradually one after another to yield an enhanced signal. In this way, a generator is tasked to refine or correct the output produced by its predecessor. We hypothesize that it would be better to carry out multi-stage enhancement mapping rather than a single-stage one as in prior works [18, 19, 12]. We then propose two new SEGAN frameworks, namely iterated SEGAN (ISEGAN) and deep SEGAN (DSEGAN) as illustrated in Figure 1, to validate this hypothesis. Similar to [18, 19], ISEGAN and DSEGAN receive raw audio waveform as input. In the former the generators’ parameters are tied. Sharing of parameters constrains ISEGAN’s generators to perform the same enhancement mapping iteratively. In the latter, the generators have independent parameters, and therefore different mappings are expected at different enhancement stages. We will demonstrate that, out of the proposed ISEGAN, DSEGAN, and the vanilla SEGAN [18], DSEGAN obtains the best results on objective evaluation metrics while ISEGAN performs comparably to the vanilla SEGAN. However, subjective evaluation results show that both ISEGAN and DSEGAN outperform their vanilla SEGAN counterpart.

II Vanilla SEGAN

A GAN [7] is a class of generative models that learns to map a sample z drawn from a prior distribution p_z to a sample belonging to the training data’s distribution p_data. A GAN is composed of two components: a generator G and a discriminator D. G is learned to imitate the real training data distribution and to generate novel samples from that distribution by mapping the characteristics of the data distribution onto the manifold defined by the prior p_z. D is usually a binary classifier, and G and D are trained via adversarial training: D has to classify the samples coming from the real data as real and those generated by G as fake, while G tries to fool D such that D classifies its output as real. The objective function of this adversarial learning process is

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].   (1)

Given a dataset consisting of pairs of raw signals, a clean speech signal x and a noisy speech signal x̃, speech enhancement is to find a mapping that maps the noisy signal x̃ to the clean signal x. Conforming to GAN’s principle, SEGAN proposed in [18] tasks its generator G with the enhancement mapping. Presented with the noisy raw speech signal x̃ together with the latent representation z, G produces the enhanced speech signal x̂ = G(z, x̃). The discriminator D of SEGAN receives a pair of signals as input and classifies the pair (x, x̃) as real and the pair (x̂, x̃) as fake. The objective function of SEGAN is

\min_G \max_D V(D, G) = \mathbb{E}_{x,\tilde{x} \sim p_{\text{data}}(x,\tilde{x})}\big[\log D(x,\tilde{x})\big] + \mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\text{data}}(\tilde{x})}\big[\log\big(1 - D(G(z,\tilde{x}),\tilde{x})\big)\big].   (2)

To improve training stability, SEGAN further employs the least-squares GAN (LSGAN) approach [16] to replace the discriminator D’s cross-entropy loss by a least-squares loss. The least-squares objective functions of D and G are explicitly written as

\min_D \mathcal{L}_{LS}(D) = \tfrac{1}{2}\,\mathbb{E}_{x,\tilde{x} \sim p_{\text{data}}(x,\tilde{x})}\big[(D(x,\tilde{x}) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\text{data}}(\tilde{x})}\big[D(G(z,\tilde{x}),\tilde{x})^2\big],   (3)
\min_G \mathcal{L}_{LS}(G) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\text{data}}(\tilde{x})}\big[(D(G(z,\tilde{x}),\tilde{x}) - 1)^2\big] + \lambda\,\big\lVert G(z,\tilde{x}) - x \big\rVert_1,   (4)

respectively. In (4), the ℓ1 distance between the clean sample x and the generated sample x̂ is included to encourage the generator to produce more fine-grained and realistic results [18, 10, 20]. The influence of the ℓ1-norm term is regulated by the hyper-parameter λ, which was set to 100 in [18].
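To make (3) and (4) concrete, the following is a minimal TensorFlow 2 sketch of the two least-squares losses. It is not the authors’ original implementation; `d_real`, `d_fake`, `g_out`, and `clean` are illustrative tensor names for D’s outputs on real and fake pairs, the generator output, and the clean target.

```python
import tensorflow as tf

LAMBDA = 100.0  # weight of the L1 term, as used in the vanilla SEGAN [18]

def lsgan_d_loss(d_real, d_fake):
    # Eq. (3): push D(x, x_tilde) towards 1 and D(G(z, x_tilde), x_tilde) towards 0.
    return 0.5 * tf.reduce_mean(tf.square(d_real - 1.0)) + \
           0.5 * tf.reduce_mean(tf.square(d_fake))

def lsgan_g_loss(d_fake, g_out, clean, l1_weight=LAMBDA):
    # Eq. (4): fool D while staying close to the clean target in the L1 sense.
    adversarial = 0.5 * tf.reduce_mean(tf.square(d_fake - 1.0))
    l1 = tf.reduce_mean(tf.abs(g_out - clean))
    return adversarial + l1_weight * l1
```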

III Iterated SEGAN and Deep SEGAN

Quan et al. [23] showed that using an additional generator chained to the generator of a GAN leads to better image-reconstruction performance. In light of this, instead of using the single-stage enhancement mapping with one generator as in the vanilla SEGAN [18], we propose to break the mapping into multiple stages by using a chain of N (N > 1) generators G_1, …, G_N, as illustrated in Fig. 1 for the case N = 2. In ISEGAN, the generators share their parameters, i.e. G_1 = G_2 = … = G_N = G, and they can be viewed as a single iterated generator with N iterations. In contrast, DSEGAN’s generators are independent, and they can be viewed as a deep generator of depth N. Both ISEGAN and DSEGAN reduce to the vanilla SEGAN when N = 1.

At the enhancement stage k, 1 ≤ k ≤ N, the generator G_k receives the output x̂_{k-1} of its preceding generator G_{k-1} (the first generator receives the noisy signal x̃) together with the latent representation z_k and is expected to produce a better enhanced signal x̂_k:

\hat{x}_1 = G_1(z_1, \tilde{x}),   (5)
\hat{x}_k = G_k(z_k, \hat{x}_{k-1}), \quad 1 < k \le N.   (6)

The output of the last generator G_N is considered the final enhanced signal, i.e. x̂ ≡ x̂_N, which is expected to be of better quality than all the intermediate enhanced versions. The outputs of the N generators can be interpreted as different checkpoints, and by enforcing the desired ground truth at these checkpoints we encourage the chained generators to produce gradually better enhancement results.
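The chaining in (5) and (6) is the only structural difference between ISEGAN and DSEGAN: the former reuses one generator at every stage, the latter instantiates a new one per stage. The sketch below illustrates this under stated assumptions; `build_generator` is a hypothetical factory returning a Keras model that maps a (waveform, z) pair to an enhanced waveform, and the 8×1024 latent shape is assumed to match the bottleneck described in Section IV-A.

```python
import tensorflow as tf

def build_generator_chain(build_generator, n_stages=2, shared=True):
    # ISEGAN: the same generator object repeated N times (tied parameters).
    # DSEGAN: N independently parameterized generators of identical architecture.
    if shared:
        g = build_generator()
        return [g] * n_stages
    return [build_generator() for _ in range(n_stages)]

def run_chain(generators, noisy, z_shape=(8, 1024)):
    # Multi-stage enhancement mapping of Eqs. (5)-(6); returns every
    # intermediate output [x_hat_1, ..., x_hat_N].
    outputs, current = [], noisy
    for gen in generators:
        z = tf.random.normal(tf.concat([tf.shape(noisy)[:1], z_shape], axis=0))
        current = gen([current, z])      # x_hat_k = G_k(z_k, x_hat_{k-1})
        outputs.append(current)
    return outputs
```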

To encourage the generators in the chain to learn a proper mapping for signal enhancement, the discriminator D classifies the pair (x, x̃) as real while all the pairs (x̂_1, x̃), (x̂_2, x̃), …, (x̂_N, x̃) are classified as fake, as illustrated in Fig. 2. The least-squares objective functions of D and of the generators G_1, …, G_N are

\min_D \mathcal{L}_{LS}(D) = \tfrac{1}{2}\,\mathbb{E}_{x,\tilde{x} \sim p_{\text{data}}(x,\tilde{x})}\big[(D(x,\tilde{x}) - 1)^2\big] + \tfrac{1}{2N}\sum_{k=1}^{N}\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\text{data}}(\tilde{x})}\big[D(\hat{x}_k,\tilde{x})^2\big],   (7)
\min_{G_1,\ldots,G_N} \mathcal{L}_{LS}(G) = \tfrac{1}{2N}\sum_{k=1}^{N}\mathbb{E}_{z \sim p_z(z),\, \tilde{x} \sim p_{\text{data}}(\tilde{x})}\big[(D(\hat{x}_k,\tilde{x}) - 1)^2\big] + \sum_{k=1}^{N}\lambda_k\,\big\lVert \hat{x}_k - x \big\rVert_1.   (8)

Unlike the vanilla SEGAN [18], the discriminator D in ISEGAN and DSEGAN needs to handle imbalanced data, as there are N fake examples for every real example. Therefore, it is necessary to divide the second term in (7) by N to balance the penalization of misclassified real and fake examples. In addition, the first term in (8) is also divided by N to level its magnitude with that of the ℓ1-norm terms [18]. To regulate the enhancement curriculum over the multiple stages, we set the ℓ1 weights {λ_k} so that the weight doubles from one stage to the next, while the last weight λ_N is fixed to 100 as in the case of the vanilla SEGAN [18]. With this curriculum, we expect the enhanced output of a generator to be twice as good as that of its preceding generator in terms of the ℓ1-norm. As a result, the enhancement mapping learned by a generator in the chain does not need to be perfect, as in single-stage enhancement, since its output will be refined by its successor.
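Continuing the illustrative code above, the balanced losses in (7) and (8) could be sketched as follows. The λ_k schedule in the code is an assumption made explicit in the comments: it doubles the weight from one stage to the next and fixes the last one to 100, matching the curriculum described in the previous paragraph.

```python
import tensorflow as tf

def multi_stage_d_loss(d_real, d_fakes):
    # Eq. (7): the fake term is averaged over the N stages so that real and
    # fake examples are penalized on an equal footing.
    n = float(len(d_fakes))
    real_term = 0.5 * tf.reduce_mean(tf.square(d_real - 1.0))
    fake_term = tf.add_n([0.5 * tf.reduce_mean(tf.square(d)) for d in d_fakes]) / n
    return real_term + fake_term

def multi_stage_g_loss(d_fakes, outputs, clean, lambda_last=100.0):
    # Eq. (8): the adversarial term is averaged over the N stages, and each
    # stage's L1 term is weighted by lambda_k. Assumed schedule: the weight
    # doubles towards the last stage, e.g. [50, 100] for N = 2.
    n = len(outputs)
    lambdas = [lambda_last / 2.0 ** (n - 1 - k) for k in range(n)]
    adv = tf.add_n([0.5 * tf.reduce_mean(tf.square(d - 1.0)) for d in d_fakes]) / n
    l1 = tf.add_n([lam * tf.reduce_mean(tf.abs(x_hat - clean))
                   for lam, x_hat in zip(lambdas, outputs)])
    return adv + l1
```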

Fig. 2: Adversarial training of ISEGAN and DSEGAN. The discriminator D is learned to classify the pair (x, x̃) as real (a), and all the pairs (x̂_1, x̃), …, (x̂_N, x̃) as fake (b). The chained generators are learned to fool D so that D classifies the pairs (x̂_1, x̃), …, (x̂_N, x̃) as real (c). Dashed lines represent the flow of gradient backpropagation.

IV Network architecture

IV-A Generators

The architecture of the generators G_k, 1 ≤ k ≤ N, of both ISEGAN and DSEGAN is illustrated in Fig. 3. They make use of a fully-convolutional encoder-decoder [24], similar to that used in the vanilla SEGAN [18]. Each generator receives a segment of raw speech signal of 16384 samples (approximately one second at 16 kHz) as input. The generators’ encoder is composed of 11 one-dimensional strided convolutional layers with a common filter width of 31 and a common stride length of 2, each followed by a parametric rectified linear unit (PReLU) [8]. The number of filters is designed to increase along the encoder’s depth to compensate for the shrinking temporal resolution, resulting in feature maps of size 8192×16, 4096×32, 2048×32, 1024×64, 512×64, 256×128, 128×128, 64×256, 32×256, 16×512, and 8×1024 at the 11 convolutional layers, respectively. At the end of the encoder, the encoding vector is concatenated with the noise sample z, drawn from the normal distribution, and presented to the decoder. The generator decoder mirrors the encoder architecture with the same number of filters and the same filter width (see Fig. 3) to reverse the encoding process by means of deconvolutions (i.e. fractionally-strided transposed convolutions), each again followed by a PReLU. Skip connections link each encoding layer to its corresponding decoding layer to allow the information of the waveform to flow directly into the decoding stage [18].
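For illustration only, a compact Keras sketch of such an encoder-decoder generator is given below (TensorFlow 2 style, not the authors’ TensorFlow 1.x code; the weight initialization, padding details, and output nonlinearity are assumptions of this sketch).

```python
import tensorflow as tf
from tensorflow.keras import layers

ENC_FILTERS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]

def build_generator(seg_len=16384):
    noisy = layers.Input(shape=(seg_len, 1), name="noisy")
    z = layers.Input(shape=(8, 1024), name="z")           # latent sample at the bottleneck

    # Encoder: 11 strided 1-D convolutions (filter width 31, stride 2) + PReLU.
    h, skips = noisy, []
    for n_filters in ENC_FILTERS:
        h = layers.Conv1D(n_filters, 31, strides=2, padding="same")(h)
        h = layers.PReLU(shared_axes=[1])(h)
        skips.append(h)

    # Bottleneck: concatenate the encoding with the noise sample z.
    h = layers.Concatenate(axis=-1)([h, z])

    # Decoder: mirror the encoder with transposed convolutions; each decoding
    # layer is concatenated with the homologous encoder layer (skip connection).
    dec_filters = ENC_FILTERS[-2::-1] + [1]               # [512, 256, ..., 16, 1]
    dec_skips = skips[-2::-1]                             # matching encoder outputs
    for i, n_filters in enumerate(dec_filters):
        h = layers.Conv1DTranspose(n_filters, 31, strides=2, padding="same")(h)
        if i < len(dec_skips):
            h = layers.PReLU(shared_axes=[1])(h)
            h = layers.Concatenate(axis=-1)([h, dec_skips[i]])
    enhanced = layers.Activation("tanh")(h)               # output squashing (assumed)
    return tf.keras.Model([noisy, z], enhanced, name="generator")
```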

Fig. 3: The architecture of a generator in ISEGAN and DSEGAN, which is similar to the vanilla SEGAN’s generator [18].

IV-B Discriminator D

The discriminator D has a similar architecture to the encoder part of the generators described in Section IV-A, except that it has a two-channel input and uses virtual batch-norm [25] before LeakyReLU activation [18]. In addition, D is topped with a one-dimensional convolutional layer with one filter of width one (i.e. a 1×1 convolution) to reduce the last convolutional output from 8×1024 to 8 features before classification takes place with a softmax layer.
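A correspondingly hedged sketch of D is shown below. Ordinary batch normalization stands in for virtual batch-norm, the LeakyReLU slope is an assumption, a single linear output unit replaces the softmax so the least-squares losses above apply directly to the score, and the filter ladder is repeated here to keep the snippet self-contained.

```python
import tensorflow as tf
from tensorflow.keras import layers

ENC_FILTERS = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]  # same ladder as the generator encoder

def build_discriminator(seg_len=16384):
    # Two-channel input: the candidate (clean or enhanced) signal stacked with the noisy signal.
    pair = layers.Input(shape=(seg_len, 2), name="signal_pair")
    h = pair
    for n_filters in ENC_FILTERS:
        h = layers.Conv1D(n_filters, 31, strides=2, padding="same")(h)
        h = layers.BatchNormalization()(h)               # stand-in for virtual batch-norm [25]
        h = layers.LeakyReLU(0.3)(h)                     # slope assumed
    h = layers.Conv1D(1, 1)(h)                           # 1x1 convolution shrinks the feature map
    h = layers.Flatten()(h)
    score = layers.Dense(1)(h)                           # classification score
    return tf.keras.Model(pair, score, name="discriminator")
```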

V Experimental Setup

TABLE I: Results obtained by the proposed ISEGAN and DSEGAN on the objective evaluation metrics (PESQ, CSIG, CBAK, COVL, and SSNR) in comparison with those of the baselines (the Wiener method [14], SEGAN [18], and our SEGAN re-implementation, SEGAN*) and the unprocessed noisy signals. * indicates our re-implementation.

V-A Dataset

To assess the performance of the proposed ISEGAN and DSEGAN and demonstrate their advantages over the vanilla SEGAN, we carried out experiments on the database in [29] on which the vanilla SEGAN was evaluated [18]. The dataset originated from the Voice Bank corpus [30] and consists of data from 30 speakers out of which 28 and 2 speakers were included in the training and test set, respectively.

A total of 40 noisy conditions were created in the training data by combining 10 types of noise (2 artificial and 8 from the Demand database [27]) with 4 signal-to-noise ratios (SNRs) each: 15, 10, 5, and 0 dB. For the test set, 20 noisy conditions were considered, combining 5 types of noise from the Demand database with 4 SNRs each: 17.5, 12.5, 7.5, and 2.5 dB. There are about 10 and 20 different utterances per noisy condition per speaker in the training and test set, respectively. All utterances were downsampled to 16 kHz.

V-B Baseline system

The vanilla SEGAN [18] was used as a baseline for comparison. Besides the performance reported in [18], we re-trained the vanilla SEGAN to ensure a similar experimental setting across systems. In addition, the Wiener method based on a priori SNR estimation [14, 26] was used as a second baseline.

V-C Network parameters

The implementation was based on the TensorFlow framework [1]. The networks were trained for 100 epochs with the RMSprop optimizer [28] and a learning rate of 0.0002. The vanilla SEGAN was trained with a minibatch size of 100, while the minibatch size was reduced to 50 for training ISEGAN and DSEGAN to cope with their larger memory footprints. We experimented with different values of N to investigate the influence of ISEGAN’s number of iterations and DSEGAN’s depth.

As in [18], during training, raw speech segments of length 16384 samples were extracted from the training utterances with 50% overlap. A high-frequency preemphasis filter with coefficient 0.95 was applied to each signal segment before it was presented to the networks [18]. During testing, raw speech segments were extracted from a test utterance without overlap. They were processed by a trained network, deemphasized, and eventually concatenated to produce the enhanced utterance.
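A small NumPy sketch of this segmentation and pre-/de-emphasis pipeline is given below; the first-order filter form is the conventional one, and the details should be treated as assumptions rather than the exact code used in the experiments.

```python
import numpy as np

def preemphasis(wave, coeff=0.95):
    # y[t] = x[t] - coeff * x[t-1], applied before feeding a segment to the network.
    return np.concatenate([wave[:1], wave[1:] - coeff * wave[:-1]])

def deemphasis(wave, coeff=0.95):
    # Inverse first-order filter applied to the network output at test time.
    out = np.zeros_like(wave)
    out[0] = wave[0]
    for t in range(1, len(wave)):
        out[t] = wave[t] + coeff * out[t - 1]
    return out

def extract_segments(wave, seg_len=16384, hop=8192):
    # Training uses 50% overlap (hop = 8192); testing uses hop = seg_len (no overlap).
    n = (len(wave) - seg_len) // hop + 1
    if n <= 0:
        return np.zeros((0, seg_len), dtype=wave.dtype)
    return np.stack([wave[i * hop: i * hop + seg_len] for i in range(n)])
```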

V-D Objective evaluation

Fig. 4: Noisy signals and the enhanced signals of two test utterances produced by the vanilla SEGAN baseline, ISEGAN, and DSEGAN. (a) p257_219.wav and (b) 232_051.wav.

Following [18], we quantified the quality of the enhanced signals produced by the systems under study based on 5 objective evaluation metrics suggested in [14]:

  • PESQ: perceptual evaluation of speech quality.

  • CSIG: mean opinion score (MOS) prediction of the signal distortion attending to the speech signal.

  • CBAK: MOS prediction of the intrusiveness of background noise.

  • COVL: MOS prediction of the overall effect.

  • SSNR: segmental SNR.

The metrics were computed for each system by averaging over all 824 files of the test set. We experimentally found that the performance may vary with different network checkpoints, so we report the mean and standard deviation of each metric over the 5 latest network checkpoints.
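Among these metrics, SSNR can be computed directly from the time-domain signals. The sketch below follows the common definition of segmental SNR; the frame length and the [-10, 35] dB clamping are conventional choices [14] and are assumptions of this sketch, not values taken from our setup.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, floor_db=-10.0, ceil_db=35.0, eps=1e-10):
    # Average per-frame SNR (in dB) between the clean reference and the enhanced
    # signal, with each frame's value clamped to [floor_db, ceil_db].
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        snr = 10.0 * np.log10(np.sum(s ** 2) / (np.sum((s - e) ** 2) + eps) + eps)
        snrs.append(np.clip(snr, floor_db, ceil_db))
    return float(np.mean(snrs)) if snrs else 0.0
```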

The objective evaluation results obtained by the different systems are shown in Table I. First, the results obtained by our re-implementation of the vanilla SEGAN are largely consistent with those reported in the seminal work [18]. As expected, the enhanced signals obtained by the vanilla SEGAN were of better quality than both the noisy ones and those of the Wiener baseline across most of the metrics. Second, ISEGAN overall performs comparably with the vanilla SEGAN: it slightly surpasses the baseline counterpart on PESQ, CBAK, and SSNR for some settings of N, but marginally underperforms on CSIG and COVL. DSEGAN, however, obtains the best results, outperforming both the vanilla SEGAN and ISEGAN across all the metrics, with consistent relative improvements over the baseline on PESQ, CSIG, CBAK, COVL, and SSNR. Third, the results in the table suggest a marginal impact of increasing ISEGAN’s number of iterations and DSEGAN’s depth beyond N = 2, since no significant changes are seen in the metrics.

V-E Subjective evaluation

To validate the objective evaluation, we conducted a small-scale subjective evaluation of four conditions: the noisy signals and the signals enhanced by the vanilla SEGAN, ISEGAN, and DSEGAN (with N set to two). Twenty volunteers aged 18–52 (6 female, 14 male), with self-reported normal hearing, were asked to provide forced binary quality judgments between pairs of 20 randomly presented sentences, balanced in terms of speakers and noise types, i.e. each comparison varied only in the type of enhancement system. Following a familiarization session, tests were run individually using MATLAB, with listeners wearing Philips SHM1900 headphones in a low-noise environment. For each pair of utterances, the one selected as higher quality was rewarded while the lower-quality one received no reward. A preference score was obtained for each system by dividing its accumulated reward by the number of times it occurred in the test. Due to the small sample size, we assessed the statistical significance of the results using a t-test. The results confirm that the signals enhanced by the three SEGAN variants are perceived as higher quality than the noisy signals, and that DSEGAN and ISEGAN together significantly outperform the vanilla SEGAN, whereas the difference between DSEGAN and ISEGAN was not significant in this small test. These results support the objective evaluation, in which DSEGAN performed much better than either the vanilla SEGAN or the noisy input, while also indicating that ISEGAN performs well subjectively.

VI Conclusions

This paper has presented two novel GAN frameworks, ISEGAN and DSEGAN, for speech enhancement. Improving on the vanilla SEGAN, which has a single generator, the ISEGAN and DSEGAN architectures comprise multiple chained generators. They differ in that the generators of the former share their parameters while those of the latter are independent. With multiple generators, ISEGAN and DSEGAN learn multiple mappings, each corresponding to a generator in the chain, to accomplish a multi-stage enhancement process. Unlike the single-stage mapping of the vanilla SEGAN, the multi-stage mapping allows a generator in the chain to refine the enhanced signal output by its predecessor(s) and produce progressively better versions of the enhanced signal. Objective evaluation showed that DSEGAN outperforms the vanilla SEGAN while ISEGAN performs comparably to it, and both proposed systems achieve significantly favourable results over the vanilla SEGAN counterpart in the subjective perceptual test.

References

  1. M. Abadi et al. (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §V-C.
  2. Y. Xu, J. Du, L.-R. Dai and C.-H. Lee (2015) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. on Audio, Speech and Language Processing (TASLP) 23 (1), pp. 7–19. Cited by: §I.
  3. S. Boll (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. on acoustics, speech, and signal processing 27 (2), pp. 113–120. Cited by: §I.
  4. Y. Ephraim and D. Malah (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. on Acoustics, Speech, and Signal Processing 33 (2), pp. 443–445. Cited by: §I.
  5. H. Erdogan, J. R. Hershey, S. Watanabe and J. L. Roux (2015) Phase sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proc. ICASSP, pp. 708–712. Cited by: §I.
  6. T. Gerkmann and R. C. Hendriks (2011) Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. on Audio, Speech, and Language Processing, pp. 1383–1393. Cited by: §I.
  7. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680. Cited by: §I, §II.
  8. K. He, X. Zhang, S. Ren and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proc. ICCV, pp. 1026–1034. Cited by: §IV-A.
  9. T. Higuchi, K. Kinoshita, M. Delcroix and T. Nakatani (2017) Adversarial training for data-driven speech enhancement without parallel corpus. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 40–47. Cited by: §I.
  10. P. Isola, J.-Y. Zhu, T. Zhou and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proc. CVPR, pp. 5967–5976. Cited by: §II.
  11. A. Kumar and D. Florencio (2016) Speech enhancement in multiple-noise conditions using deep neural networks. In Interspeech, pp. 3738–3742. Cited by: §I.
  12. Z. X. Li, L. R. Dai, Y. Song and I. McLoughlin (2018) A conditional generative model for speech enhancement. Circuits, Systems, and Signal Processing 37 (11), pp. 5005–5022. Cited by: §I, §I.
  13. J. Lim and A. Oppenheim (1978) All-pole modeling of degraded speech. IEEE Trans. on Acoustics, Speech, and Signal Processing 26 (3), pp. 197–210. Cited by: §I.
  14. P. C. Loizou (2013) Speech enhancement: theory and practice. 2nd edition, CRC Press, Boca Raton, FL. Cited by: §I, §V-B, §V-D, TABLE I.
  15. N. Mamun, S. Khorram and J. H. L. Hansen (2019) Convolutional neural network-based speech enhancement for cochlear implant recipients. arXiv Preprint arXiv:1907.02526. Cited by: §I.
  16. X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang and S. P. Smolley (2017) Least squares generative adversarial networks. In Proc. ICCV, pp. 2813–2821. Cited by: §II.
  17. S. R. Park and J. Lee (2017) A fully convolutional neural network for speech enhancement. In Proc. Interspeech, Cited by: §I.
  18. S. Pascual, A. Bonafonte and J. Serrà (2017) SEGAN: speech enhancement generative adversarial network. In Proc. Interspeech, pp. 3642–3646. Cited by: Fig. 1, §I, §I, §II, §II, §III, §III, Fig. 3, §IV-A, §IV-B, §V-A, §V-B, §V-C, §V-D, §V-D, TABLE I.
  19. S. Pascual, J. Serrà and A. Bonafonte (2019) Towards generalized speech enhancement with generative adversarial networks. CoRR abs/1904.03418. Cited by: §I, §I.
  20. D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proc. CVPR, pp. 2536–2544. Cited by: §II.
  21. R. Prabhavalkar (2018) Exploring speech enhancement with generative adversarial networks for robust speech recognition. In Proc. ICASSP, pp. 5024–5028. Cited by: §I.
  22. S. Qin and T. Jiang (2018) Improved Wasserstein conditional generative adversarial network speech enhancement. EURASIP Journal on Wireless Communications and Networking 2018 (1), pp. 181. Cited by: §I.
  23. T. M. Quan, T. Nguyen-Duc and W.-K. Jeong (2018) Compressed sensing MRI reconstruction using a generative adversarial network with a cyclic loss. IEEE Trans. on Medical Imaging 37 (6), pp. 1488–1497. Cited by: §III.
  24. A. Radford, L. Metz and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. ICLR, Cited by: §IV-A.
  25. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford and X. Chen (2016) Improved techniques for training GANs. In Proc. NIPS, pp. 2226–2234. Cited by: §IV-B.
  26. P. Scalart and J. V. Filho (1996) Speech enhancement based on a priori signal to noise estimation. In Proc. ICASSP, Vol. 2, pp. 629–632. Cited by: §V-B.
  27. J. Thiemann, N. Ito and E. Vincent (2013) The diverse environments multi-channel acoustic noise database: a database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America 133 (5), pp. 3591–3591. Cited by: §V-A.
  28. T. Tieleman and G. Hinton (2012) Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning. Cited by: §V-C.
  29. C. Valentini-Botinhao, X. Wang, S. Takaki and J. Yamagishi (2016) Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In Proc. 9th ISCA Speech Synthesis Workshop, pp. 146–152. Cited by: §V-A.
  30. C. Veaux, J. Yamagishi and S. King (2013) The voice bank corpus: design, collection and data analysis of a large regional accent speech database. In Proc. 2013 International Conference Oriental COCOSDA, pp. 1–4. Cited by: §V-A.
  31. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey and B. Schuller (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. Proc. Intl. Conf. on Latent Variable Analysis and Signal Separation, pp. 91–99. Cited by: §I.
  32. L.-P. Yang and Q.-J. Fu (2005) Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. Journal of the Acoustical Society of America 117 (3), pp. 1001–1004. Cited by: §I.