Parametric Resynthesis with neural vocoders
Noise suppression systems generally produce output speech with compromised quality. We propose to utilize the high-quality speech generation capability of neural vocoders for noise suppression. We use a neural network to predict clean mel-spectrogram features from noisy speech and then compare two neural vocoders, WaveNet and WaveGlow, for synthesizing clean speech from the predicted mel spectrogram. Both WaveNet and WaveGlow achieve better subjective and objective quality scores than the source separation model Chimera++. Further, both achieve significantly better subjective quality ratings than the oracle Wiener mask. Between the two, WaveNet achieves the better subjective quality scores, although at the cost of much slower waveform generation.
The Graduate Center, CUNY
New York, NY, USA
email@example.com

Michael I Mandel
Computer and Information Science, Brooklyn College
Brooklyn, NY, USA
Index Terms— Speech enhancement, speech synthesis, enhancement-by-synthesis, neural vocoder, WaveNet, WaveGlow
1 Introduction

Traditionally, speech enhancement methods modify noisy speech to make it more like the original clean speech. Such modification of a noisy signal can introduce additional distortions into the speech. These distortions generally arise from two problems: over-suppression of the speech and under-suppression of the noise. In contrast, parametric speech synthesis methods can produce high quality speech from only text or textual information. They first predict an acoustic representation of speech from text and then use a vocoder to generate clean speech from that predicted acoustic representation.
We propose combining speech enhancement and parametric synthesis methods by generating clean acoustic representations from noisy speech and then using a vocoder to synthesize “clean” speech from the acoustic representations. We call such a system parametric resynthesis (PR). The first part of the PR system removes noise and predicts the clean acoustic representation. The second part, the vocoder, generates clean speech from this representation. As we are using a vocoder to resynthesize the output speech, the performance of the system is limited by the vocoder synthesis quality.
In our previous work, we built a PR system with a non-neural vocoder, WORLD. Compared to such non-neural vocoders, neural vocoders like WaveNet synthesize higher quality speech, as shown in the speech synthesis literature [4, 5, 6, 7, 8, 9]. More recent neural vocoders like WaveRNN, Parallel WaveNet, and WaveGlow have been proposed to improve the synthesis speed of WaveNet while maintaining its high quality. Our goal is to utilize a neural vocoder to resynthesize higher quality speech from noisy speech than WORLD allows. We choose WaveNet and WaveGlow for our experiments, as their architectures differ the most from one another.
In this work we build PR systems with two neural vocoders (PR-neural). Comparing PR-neural to other systems, we show that neural vocoders produce both better speech quality and better noise reduction quality in subjective listening tests than our previous model, PR-World. We show that the PR-neural systems perform better than a recently proposed speech enhancement system, Chimera++ , in all quality and intelligibility scores. And we show that PR-neural can achieve higher subjective intelligibility and quality ratings than the oracle Wiener mask. We also discuss end-to-end training strategies for the PR-neural vocoder system.
2 Related Work

Speech synthesis can be divided into two broad categories: concatenative and parametric. Traditionally, concatenative speech synthesis has produced the best quality speech. Concatenative systems stitch together small segments of speech recordings to generate new utterances. We previously proposed speech enhancement systems using concatenative synthesis techniques [13, 14, 15], named “concatenative resynthesis.” Concatenative speech enhancement systems can generate high quality speech with a slight loss in intelligibility, but they are speaker-dependent and generally require a very large dictionary of clean speech.
With the advent of the WaveNet neural vocoder, parametric speech synthesis with WaveNet surpassed concatenative synthesis in speech quality. Hence, here we use WaveNet and WaveNet-like neural vocoders for better quality synthesis. A modified WaveNet model has previously been used as an end-to-end speech enhancement system. That method works in the time domain and models both the speech and the noise present in an observation. Similarly, the SEGAN and Wave-U-Net models are end-to-end source separation models that work in the time domain. Both progressively down-sample the audio signal over multiple layers and then up-sample it to generate speech. SEGAN, which follows a generative adversarial approach, achieves slightly lower PESQ than Wave-U-Net. Compared to the WaveNet denoising model and Wave-U-Net, our proposed model is simpler and noise-independent because it does not model the noise at all, only the clean speech. Moreover, we are able to use the original WaveNet model directly, without modification.
3 Model Overview
Parametric resynthesis consists of two parts, as shown in Figure 1. The first part is a prediction model that predicts the acoustic representation of clean speech from noisy speech. This part of the PR model removes noise from a noisy observation. The second part of the PR model is a vocoder that resynthesizes “clean” speech from these predicted acoustic parameters. Here we choose to compare two neural vocoders, WaveNet and WaveGlow. Both WaveNet and WaveGlow can generate speech conditioned on a log mel-spectrogram, so the log mel-spectrogram is used as the intermediate acoustic parameters.
3.1 Prediction Model
The prediction model takes the noisy mel-spectrogram as input, with the clean mel-spectrogram $m$ from parallel clean speech as its ground truth. An LSTM with multiple layers is used as the core architecture. The model is trained to minimize the mean squared error between the predicted mel-spectrogram $\hat{m}$ and the clean mel-spectrogram,

$L = \frac{1}{T} \sum_{t=1}^{T} \lVert \hat{m}_t - m_t \rVert^2$ (1)

where $t$ indexes the $T$ frames of the utterance.
The Adam optimizer is used as the optimization algorithm for training. At test time, given a noisy mel-spectrogram, a clean mel-spectrogram is predicted.
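As a minimal numerical sketch of this training objective, the following minimizes the frame-wise mel-spectrogram MSE with Adam, as described above. A single linear layer stands in for the paper's multi-layer LSTM, and the data, sizes, and hyperparameters here are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Toy parallel data: "noisy" mel frames and their clean counterparts.
rng = np.random.default_rng(0)
n_mels = 80                        # mel bins, as in the experiments
clean = rng.standard_normal((500, n_mels))
noisy = clean + 0.5 * rng.standard_normal((500, n_mels))

W = np.zeros((n_mels, n_mels))     # frame-wise linear predictor (LSTM stand-in)
m = np.zeros_like(W)               # Adam first moment
v = np.zeros_like(W)               # Adam second moment
lr, b1, b2, eps = 1e-2, 0.9, 0.999, 1e-8

def mse(pred, target):
    return np.mean((pred - target) ** 2)

loss_start = mse(noisy @ W, clean)
for t in range(1, 201):
    pred = noisy @ W
    # Gradient of the squared error w.r.t. W (up to a constant factor).
    grad = 2 * noisy.T @ (pred - clean) / len(noisy)
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)      # Adam bias correction
    v_hat = v / (1 - b2 ** t)
    W -= lr * m_hat / (np.sqrt(v_hat) + eps)
loss_end = mse(noisy @ W, clean)
```

At test time the trained predictor maps each noisy frame to an estimate of the clean frame, exactly as the LSTM does in the full system.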
3.2 Neural Vocoders
Next, conditioned on the predicted mel-spectrogram, a neural vocoder is used to synthesize de-noised speech. We compare two neural vocoders: WaveNet  and WaveGlow . The neural vocoders are trained to generate clean speech from corresponding clean mel-spectrograms.
WaveNet is a speech waveform generation model built with dilated causal convolutional layers. The model is autoregressive, i.e. generation of one speech sample $x_t$ at time step $t$ is conditioned on the samples of all previous time steps, $x_{<t}$. The dilation of the convolutional layers increases by a factor of 2 between subsequent layers and then repeats starting from 1. Gated activations with residual and skip connections are used in WaveNet. It is trained to maximize the likelihood of the clean speech samples. The normalized log mel-spectrogram is used for local conditioning.
The output of WaveNet is modelled as a mixture of logistic components, as described in [9, 8], for high quality synthesis. The output is modelled as a $K$-component logistic mixture: the model predicts a set of values $\theta = \{\pi_i, \mu_i, s_i\}_{i=1}^{K}$, where each component of the distribution has its own mean $\mu_i$ and scale $s_i$, and the components are mixed with probability $\pi_i$. The likelihood of sample $x_t$ is then

$p(x_t \mid m) = \sum_{i=1}^{K} \pi_i \left[ \sigma\left(\frac{x_t + 0.5 - \mu_i}{s_i}\right) - \sigma\left(\frac{x_t - 0.5 - \mu_i}{s_i}\right) \right]$ (2)

where $\sigma(\cdot)$ is the logistic function and $p(x_t \mid m)$ is the probability density of clean speech conditioned on the mel-spectrogram $m$.
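The mixture-of-logistics likelihood can be checked numerically. The sketch below evaluates a discretized mixture of logistics for integer-valued samples; the component parameters are invented for illustration, whereas in the actual system WaveNet predicts them per sample from the mel-spectrogram:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discretized_mol_likelihood(x, pi, mu, s):
    """Probability of an integer-valued sample x under a K-component
    discretized mixture of logistics: each component assigns x the
    logistic CDF mass between x - 0.5 and x + 0.5 (real systems also
    clamp the edges of the sample range)."""
    cdf_plus = sigmoid((x + 0.5 - mu) / s)
    cdf_minus = sigmoid((x - 0.5 - mu) / s)
    return float(np.sum(pi * (cdf_plus - cdf_minus)))

# A toy 3-component mixture (illustrative parameters only).
pi = np.array([0.5, 0.3, 0.2])   # mixture weights, sum to 1
mu = np.array([0.0, 10.0, -5.0]) # component means
s = np.array([2.0, 1.0, 3.0])    # component scales

# The mass over all integer sample values sums to (nearly) 1,
# because the per-component CDF differences telescope.
total = sum(discretized_mol_likelihood(x, pi, mu, s)
            for x in range(-200, 201))
```

The telescoping sum is why the [−0.5, +0.5] bins in (the mixture equation) define a proper distribution over quantized sample values.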
We use a publicly available implementation of WaveNet (https://github.com/r9y9/wavenet_vocoder) with a setup similar to Tacotron 2: 24 layers grouped into 4 dilation cycles, 512 residual channels, 512 gate channels, 256 skip channels, and a mixture-of-logistics output with 10 components. As it is an autoregressive model, its synthesis speed is very slow. The PR system with WaveNet as its vocoder is referred to as PR-WaveNet.
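The dilation pattern of this configuration can be made concrete with a short sketch. Note that the convolution kernel size used here (3) is our assumption, since it is not stated above; the cycle structure itself follows from 24 layers in 4 cycles:

```python
# 24 layers in 4 dilation cycles -> 6 layers per cycle, with dilations
# doubling from 1 to 32 within each cycle, as described for WaveNet.
kernel_size = 3  # assumed; not given in the text
dilations = [2 ** i for i in range(6)] * 4

# Each causal layer with dilation d adds (kernel_size - 1) * d samples
# of left context, so the receptive field of the stack is:
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
```

Under this assumed kernel size, the stack sees roughly 505 past samples per output sample, which illustrates why conditioning on the mel-spectrogram is needed for longer-range structure.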
WaveGlow  is based on the Glow concept  and has faster synthesis than WaveNet. WaveGlow learns an invertible transformation between blocks of eight time domain audio samples and a standard normal distribution conditioned on the log mel spectrogram. It then generates audio by sampling from this Gaussian density.
The invertible transformation is a composition of a sequence of individual invertible transformations $f_k$, known as normalizing flows. Each flow in WaveGlow consists of an invertible 1×1 convolutional layer followed by an affine coupling layer. The affine coupling layer is a neural transformation that predicts a scale $s$ and bias $b$ conditioned on half of the input speech samples and the mel-spectrogram $m$. Let $W_k$ be the learned weight matrix of the $k$-th convolutional layer and $s_j$ be the scale value predicted at the $j$-th affine coupling layer.
For inference, WaveGlow samples $z$ from a spherical Gaussian distribution and applies the inverse transformations $f_k^{-1}$, conditioned on the mel-spectrogram $m$, to get back the speech sample $x$. Because parallel sampling from a Gaussian distribution is trivial, all audio samples are generated in parallel. The model is trained to minimize the negative log-likelihood of the clean speech samples $x$,
$-\log p(x) = \frac{z(x)^\top z(x)}{2\gamma^2} - \sum_{j=1}^{J} \log s_j - \sum_{k=1}^{C} \log \lvert \det W_k \rvert$ (3)

where $J$ is the number of coupling transformations, $C$ is the number of convolutions, and the first term is the negative log-likelihood of the spherical Gaussian with variance $\gamma^2$. Note that WaveGlow refers to this parameter as $\sigma$, but we use $\gamma$ to avoid confusion with the logistic function in (2). We use the official published WaveGlow implementation (https://github.com/NVIDIA/waveglow) with its original setup (12 coupling layers, each consisting of 8 layers of dilated convolution with 512 residual and 256 skip connections). We refer to the PR system with WaveGlow as its vocoder as PR-WaveGlow.
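The affine coupling step and its exact inverse, which is what makes parallel generation possible, can be sketched as follows. The tiny fixed conditioning "network" here is a hypothetical stand-in for WaveGlow's real conditioning network, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
W_nn = rng.standard_normal((4, 6))  # stand-in conditioning weights

def coupling_net(x_a, mel):
    """Stand-in for WaveGlow's conditioning network: predicts a
    log-scale and bias from the untouched half x_a and the mel frame."""
    h = np.concatenate([x_a, mel]) @ W_nn.T
    return h[:2], h[2:]             # log_s, b

def coupling_forward(x, mel):
    # Split the samples; x_a passes through unchanged, x_b is
    # affine-transformed with parameters predicted from (x_a, mel).
    x_a, x_b = x[:2], x[2:]
    log_s, b = coupling_net(x_a, mel)
    return np.concatenate([x_a, np.exp(log_s) * x_b + b])

def coupling_inverse(y, mel):
    # Because y_a == x_a, the same parameters can be recomputed and
    # the affine map undone exactly -- no iteration required.
    y_a, y_b = y[:2], y[2:]
    log_s, b = coupling_net(y_a, mel)
    return np.concatenate([y_a, (y_b - b) * np.exp(-log_s)])

x = rng.standard_normal(4)          # toy block of audio samples
mel = rng.standard_normal(4)        # toy mel conditioning
y = coupling_forward(x, mel)
x_rec = coupling_inverse(y, mel)    # recovers x exactly
```

The log-determinant of this step is simply the sum of the predicted log-scales, which is where the $\sum_j \log s_j$ term in the training loss comes from.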
3.3 Joint Training
Since the neural vocoders are originally trained on clean mel-spectrograms but are tested on predicted mel-spectrograms $\hat{m}$, we can also train both parts of the PR-neural system jointly. The aim of joint training is to compensate for the disparity between the mel-spectrograms predicted by the prediction model and those consumed by the neural vocoder. Both parts of the PR-neural systems are pretrained, then trained jointly to maximize a combined objective: the vocoder likelihood minus the mel-spectrogram squared error. These models are referred to as PR-neural vocoder-Joint. We experiment both with and without fine-tuning these models.
4 Experiments

For our experiments, we use the LJSpeech dataset, to which we add environmental noise from CHiME-3. The LJSpeech dataset contains 13,100 audio clips from a single speaker, varying in length from 1 to 10 seconds, at a sampling rate of 22 kHz. The clean speech was recorded with the microphone of a MacBook Pro in a quiet home environment. CHiME-3 contains four types of environmental noise: street, bus, pedestrian, and cafe. Note that the CHiME-3 noises were recorded at a 16 kHz sampling rate. To mix them with LJSpeech, we synthesized white Gaussian noise in the 8–11 kHz band matched in energy to the 7–8 kHz band of the original recordings. The SNR of the generated noisy speech varies across files, with an average of 1 dB. We use 13,000 noisy files for training, almost 24 hours of data. The test set consists of 24 files, 6 from each noise type, spanning a similar range of SNRs. The mel-spectrograms are computed with a window size of 46.4 ms, a hop size of 11.6 ms, and 80 mel bins. The prediction model has 3 bidirectional LSTM layers with 400 units each and was trained with an initial learning rate of 0.001 for 500 epochs with batch size 64.
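A sketch of the noisy-mixture construction follows: it scales a noise signal so that the mix reaches a target SNR before adding it to clean speech. Toy random signals stand in for LJSpeech and CHiME-3 audio here:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise energy ratio of the
    mixture is snr_db, then add it to the clean speech."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power is p_speech / 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(2)
speech = rng.standard_normal(22050)        # 1 s of "speech" at 22 kHz
noise = 0.3 * rng.standard_normal(22050)   # 1 s of "noise"

noisy = mix_at_snr(speech, noise, snr_db=1.0)
# Verify the achieved SNR of the mixture against the target.
achieved = 10 * np.log10(np.mean(speech ** 2) /
                         np.mean((noisy - speech) ** 2))
```

The same scaling applies per utterance when drawing target SNRs from a range, as done for the training and test mixtures above.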
Both WaveGlow and WaveNet have published pre-trained models on the LJSpeech data. We use these pre-trained models due to limitations in GPU resources (training the WaveGlow model from scratch takes 2 months on a GeForce GTX 1080 Ti GPU). The published WaveGlow model was pre-trained with batch size 12 and weight normalization. The pre-trained WaveNet model was trained with batch size 2 and L2 regularization. The weights of the model parameters are saved as an exponential moving average with a decay of 0.9999 and used for inference, as this is found to provide better quality. PR-WaveNet-Joint is initialized with the pre-trained prediction model and WaveNet, then trained end-to-end with batch size 1 on a GeForce GTX 1080 GPU. PR-WaveGlow-Joint is likewise initialized with the pre-trained prediction and WaveGlow models, then trained with a batch size of 3 on a GeForce GTX 1080 Ti GPU. Because WaveNet synthesizes audio samples sequentially, its synthesis rate is roughly 0.004× realtime. Because WaveGlow synthesis can be done in parallel, it synthesizes audio at a 22 kHz sampling rate far faster.
We compare these two PR-neural models with PR-World, our previously proposed model, in which the WORLD vocoder is used and the intermediate acoustic parameters are the fundamental frequency, spectral envelope, and band aperiodicity used by WORLD. Note that WORLD does not support a 22 kHz sampling rate, so this system generates output at 16 kHz. We also compare all PR models with two speech enhancement systems. The first is the oracle Wiener mask (OWM), which has access to the original clean speech. The second is a recently proposed source separation system, Chimera++, which uses a combination of the deep clustering loss and the mask inference loss to estimate masks. We use our own implementation of Chimera++, which we verified reproduces the reported performance on the dataset of the original publication; it was trained on the same data as the PR systems. In addition to the OWM, we measure the best-case resynthesis quality by evaluating the neural vocoders conditioned on the true clean mel-spectrograms.
Following [16, 17, 18], we compute the composite objective metrics SIG (signal distortion), BAK (background intrusiveness), and OVL (overall quality), as described in [24, 25]. All three measures produce numbers between 1 and 5, with higher meaning better quality. We also report PESQ scores as a combined measure of quality and STOI as a measure of intelligibility. All test files are downsampled to 16 kHz for measuring objective metrics.
We also conducted a listening test to measure the subjective quality and intelligibility of the systems. For the listening test, we chose 12 of the 24 test files, three from each of the four noise types. The listening test follows the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) paradigm. Subjects were presented with 9 anonymized and randomized versions of each file to facilitate direct comparison: 5 PR systems (PR-WaveNet, PR-WaveNet-Joint, PR-WaveGlow, PR-WaveGlow-Joint, PR-World), 2 comparison speech enhancement systems (oracle Wiener mask and Chimera++), and the clean and noisy signals. The PR-World files are sampled at 16 kHz, while the other 8 systems use 22 kHz. Subjects were also provided reference clean and noisy versions of each file. Five subjects took part in the listening test. They were told to rate the speech quality, noise-suppression quality, and overall quality of the speech on a scale from 0 to 100, with 100 being the best.
Subjects were also asked to rate the subjective intelligibility of each utterance on the same scale. Specifically, they were asked to rate a model higher if it was easier to understand what was being said. We used an intelligibility rating because in our previous experiments, asking subjects for transcripts showed that all systems were near ceiling performance. This could also have been a product of presenting different versions of the same underlying speech to the subjects. Intelligibility ratings, while less concrete, do not suffer from these problems. All files are available at http://mr-pc.org/work/waspaa19/
5 Results

Table 1 shows the objective metric comparison of the systems. In terms of objective quality, comparing the neural vocoders synthesizing from clean spectrograms, WaveGlow scores higher than WaveNet overall: WaveNet synthesis has higher SIG quality, but lower BAK and OVL. Comparing the speech enhancement systems, both PR-neural systems outperform Chimera++ in all measures. Compared to the oracle Wiener mask, the PR-neural systems perform slightly worse. On further investigation, we observe that the PR resynthesis files are not perfectly aligned with the clean signal, which significantly hurts the objective scores. Interestingly, joint training decreases performance for both PR-neural systems. When listening to the files, PR-WaveNet-Joint sometimes contains mumbled, unintelligible speech, and PR-WaveGlow-Joint introduces more distortions.
In terms of objective intelligibility, we observe that the clean WaveNet model has lower STOI than WaveGlow. STOI also requires the two speech signals to be exactly time-aligned, which the WaveNet model does not necessarily provide. The PR-neural systems have higher objective intelligibility than Chimera++. With PR-WaveGlow, we observe that when trained jointly, STOI actually goes down from 0.87 to 0.84. We also observe that tuning WaveGlow's inference parameter $\sigma$ (our $\gamma$) affects quality and intelligibility. When a smaller $\gamma$ is used, the synthesis has more speech drop-outs. When a larger $\gamma$ is used, these drop-outs decrease, but the BAK score also decreases. We believe that with a lower $\gamma$, when conditioned on a predicted spectrogram, the PR-WaveGlow system only generates segments of speech it is confident in and mutes the rest.
Figure 2 shows the results of the quality listening test. PR-WaveNet performs best in all three quality scores, followed by PR-WaveNet-Joint, PR-WaveGlow-Joint, and PR-WaveGlow. Both PR-neural systems have much higher quality than the oracle Wiener mask. The next best model is PR-World, followed by Chimera++. PR-World performs comparably to the oracle Wiener mask, but these ratings are lower than we found previously. This is likely due to the use of a 22 kHz sampling rate in the current experiment versus 16 kHz in our previous experiments.
Figure 3 shows the subjective intelligibility ratings. We observe that the noisy signals (both reference and hidden) have reasonably high subjective intelligibility, as humans are good at understanding speech in noise. The OWM has slightly higher subjective intelligibility than PR-WaveGlow. PR-WaveNet has slightly, but not significantly, higher intelligibility, and the clean files have the best intelligibility. The PR-neural-Joint models have lower intelligibility, caused by the speech drop-outs and mumbled speech mentioned above.
6 Discussion of Joint Training
Table 2 shows the results of further investigation into the drop in performance caused by jointly training the PR-neural systems. The PR-neural-Joint models are trained using the vocoder losses. After joint training, both WaveNet and WaveGlow seemed to change the prediction model so as to make the intermediate mel-spectrogram louder. As training continued, this predicted mel-spectrogram did not approach the clean spectrogram, but instead became a very loud version of it, which did not improve performance. When the prediction model was fixed and only the vocoders were fine-tuned, we observed a large drop in performance. In WaveNet this introduced more unintelligible speech, making it smoother but garbled. In WaveGlow this increased speech drop-outs (as can be seen in the reduced STOI scores). Finally, with the neural vocoder fixed, we trained the prediction model to minimize a combination of the mel-spectrogram MSE and the vocoder loss. This provided slight improvements in performance: both PR-WaveNet and PR-WaveGlow improved in intelligibility as well as SIG and OVL.
7 Conclusion

This paper proposes the use of neural vocoders in parametric resynthesis for high quality speech enhancement. We show that using two neural vocoders, WaveGlow and WaveNet, produces better quality enhanced speech than using a traditional vocoder like WORLD. We also show that the PR-neural models outperform the recently proposed Chimera++ mask-based speech enhancement system in all intelligibility and quality scores. Finally, we show that PR-WaveNet achieves significantly better subjective quality scores than the oracle Wiener mask. In future work, we will explore the speaker-dependence of these models.
This material is based upon work supported by the National Science Foundation (NSF) grant IIS-1618061. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
References

-  D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
-  S. Maiti and M. I. Mandel, “Speech denoising by parametric resynthesis,” arXiv preprint arXiv:1904.01537, 2019.
-  M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, Jul. 2016.
-  A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio.” in Proc. ISCA SSW, Sept. 2016, p. 125.
-  W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: 2000-speaker neural text-to-speech,” arXiv preprint arXiv:1710.07654, 2017.
-  A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech, vol. 2017, 2017, pp. 1118–1122.
-  Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint arXiv:1703.10135, 2017.
-  J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017.
-  A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.
-  N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
-  R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” arXiv preprint arXiv:1811.00002, 2018.
-  Z. Wang, J. L. Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proc. ICASSP, Apr. 2018, pp. 686–690.
-  S. Maiti and M. I. Mandel, “Concatenative resynthesis using twin networks,” Proc. Interspeech, pp. 3647–3651, 2017.
-  S. Maiti, J. Ching, and M. Mandel, “Large vocabulary concatenative resynthesis,” in Proc. Interspeech, 2018.
-  A. R. Syed, T. V. Anh, and M. I. Mandel, “Concatenative resynthesis with improved training signals for speech enhancement,” in Proc. Interspeech, 2018.
-  D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” in Proc. ICASSP, 2018, pp. 5069–5073.
-  S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech enhancement generative adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
-  C. Macartney and T. Weyde, “Improved speech enhancement with the Wave-U-Net,” arXiv preprint arXiv:1811.11307, 2018.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” arXiv preprint arXiv:1807.03039, 2018.
-  K. Ito, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
-  J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proc. ASRU, 2015, pp. 504–511.
-  T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Proc. NIPS, 2016, pp. 901–909.
-  Y. Hu and P. C. Loizou, “Subjective comparison of speech enhancement algorithms,” in Proc. ICASSP, vol. 1, May 2006, pp. I–I.
-  Y. Hu and P. C. Loizou, “Evaluation of objective measures for speech enhancement,” in Proc. Interspeech, 2006.
-  C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. ICASSP, 2010, pp. 4214–4217.
-  “Method for the subjective assessment of intermediate quality level of audio systems,” International Telecommunication Union Radiocommunication Standardization Sector (ITU-R), Tech. Rep. BS.1534-3, 2015.