Parametric Resynthesis with neural vocoders


Noise suppression systems generally produce output speech with compromised quality. We propose to utilize the high quality speech generation capability of neural vocoders for noise suppression. We use a neural network to predict clean mel-spectrogram features from noisy speech and then compare two neural vocoders, WaveNet and WaveGlow, for synthesizing clean speech from the predicted mel spectrogram. Both WaveNet and WaveGlow achieve better subjective and objective quality scores than the source separation model Chimera++. Further, WaveNet and WaveGlow also achieve significantly better subjective quality ratings than the oracle Wiener mask. Moreover, we observe that between WaveNet and WaveGlow, WaveNet achieves the best subjective quality scores, although at the cost of much slower waveform generation.


Soumi Maiti
The Graduate Center, CUNY
Computer Science
New York, NY, USA

Michael I Mandel
Brooklyn College
Computer and Information Science
Brooklyn, NY, USA


Speech enhancement, speech synthesis, enhancement-by-synthesis, neural vocoder, WaveNet, WaveGlow

1 Introduction

Traditionally, speech enhancement methods modify noisy speech to make it more like the original clean speech [1]. Such modification of a noisy signal can introduce additional distortions in the speech signal. These distortions generally arise from two problems: over-suppression of the speech and under-suppression of the noise. In contrast, parametric speech synthesis methods can produce high quality speech from only text or textual information. Parametric speech synthesis methods predict an acoustic representation of speech from text and then use a vocoder to generate clean speech from the predicted acoustic representation.

We propose combining speech enhancement and parametric synthesis methods by generating clean acoustic representations from noisy speech and then using a vocoder to synthesize “clean” speech from the acoustic representations. We call such a system parametric resynthesis (PR). The first part of the PR system removes noise and predicts the clean acoustic representation. The second part, the vocoder, generates clean speech from this representation. As we are using a vocoder to resynthesize the output speech, the performance of the system is limited by the vocoder synthesis quality.

In our previous work [2], we built a PR system with a non-neural vocoder, WORLD [3]. Compared to such non-neural vocoders, neural vocoders like WaveNet [4] synthesize higher quality speech, as shown in the speech synthesis literature [4, 5, 6, 7, 8, 9]. More recent neural vocoders like WaveRNN [10], Parallel WaveNet [9], and WaveGlow [11] have been proposed to improve the synthesis speed of WaveNet while maintaining its high quality. Our goal is to utilize a neural vocoder to resynthesize higher quality speech from noisy speech than WORLD allows. We choose WaveNet and WaveGlow for our experiments, as these are the two most different architectures.

In this work we build PR systems with two neural vocoders (PR-neural). Comparing PR-neural to other systems, we show that neural vocoders produce both better speech quality and better noise reduction quality in subjective listening tests than our previous model, PR-World. We show that the PR-neural systems perform better than a recently proposed speech enhancement system, Chimera++ [12], in all quality and intelligibility scores. And we show that PR-neural can achieve higher subjective intelligibility and quality ratings than the oracle Wiener mask. We also discuss end-to-end training strategies for the PR-neural vocoder system.

2 Background

Speech synthesis can be divided into two broad categories, concatenative and parametric speech synthesis. Traditionally, concatenative speech synthesis has produced the best quality speech. Concatenative systems stitch together small segments of speech recordings to generate new utterances. We previously proposed speech enhancement systems using concatenative synthesis techniques [13, 14, 15], named “concatenative resynthesis.” Concatenative speech enhancement systems can generate high quality speech with a slight loss in intelligibility, but they are speaker-dependent and generally require a very large dictionary of clean speech.

With the advent of the WaveNet neural vocoder, parametric speech synthesis with WaveNet surpassed concatenative synthesis in speech quality [4]. Hence, here we use WaveNet and WaveNet-like neural vocoders for better quality synthesis. A modified WaveNet model has previously been used as an end-to-end speech enhancement system [16]. This method works in the time domain and models both the speech and the noise present in an observation. Similarly, the SEGAN [17] and Wave-U-Net [18] models are end-to-end source separation models that work in the time domain. Both SEGAN and Wave-U-Net down-sample the audio signal progressively in multiple layers and then up-sample it to generate speech. SEGAN, which follows a generative adversarial approach, has a slightly lower PESQ than Wave-U-Net. Compared to the WaveNet denoising model of [16] and Wave-U-Net, our proposed model is simpler and noise-independent because it does not model the noise at all, only the clean speech. Moreover, we are able to use the original WaveNet model directly without the modification of [16].

Figure 1: Parametric Resynthesis model

3 Model Overview

Parametric resynthesis consists of two parts, as shown in Figure 1. The first part is a prediction model that predicts the acoustic representation of clean speech from noisy speech. This part of the PR model removes noise from a noisy observation. The second part of the PR model is a vocoder that resynthesizes “clean” speech from these predicted acoustic parameters. Here we choose to compare two neural vocoders, WaveNet and WaveGlow. Both WaveNet and WaveGlow can generate speech conditioned on a log mel-spectrogram, so the log mel-spectrogram is used as the intermediate acoustic parameters.

3.1 Prediction Model

The prediction model uses the noisy mel-spectrogram, $Y(\omega, t)$, as input and the clean mel-spectrogram, $X(\omega, t)$, from parallel clean speech as ground truth. An LSTM [19] with multiple layers is used as the core architecture. The model is trained to minimize the mean squared error between the predicted mel-spectrogram, $\hat{X}(\omega, t)$, and the clean mel-spectrogram,

$$L = \sum_{\omega, t} \left( X(\omega, t) - \hat{X}(\omega, t) \right)^2. \qquad (1)$$

The Adam optimizer is used as the optimization algorithm for training. At test time, given a noisy mel-spectrogram, a clean mel-spectrogram is predicted.
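As a minimal illustration of the training criterion, the sketch below computes the mean squared error between a predicted and a clean mel-spectrogram stored as nested lists. The function name `mel_mse` and the toy values are assumptions made for illustration; the actual system computes this loss inside an LSTM training loop.

```python
def mel_mse(predicted, clean):
    """Mean squared error between two mel-spectrograms.

    Each argument is a list of frames, and each frame is a list of
    log-mel values (80 bins per frame in the setup described here).
    """
    if len(predicted) != len(clean):
        raise ValueError("spectrograms must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, clean_frame in zip(predicted, clean):
        if len(pred_frame) != len(clean_frame):
            raise ValueError("frames must have the same number of bins")
        for p, c in zip(pred_frame, clean_frame):
            total += (p - c) ** 2
            count += 1
    return total / count


# Toy example with 2 frames and 3 bins (hypothetical values).
clean = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
predicted = [[0.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
print(mel_mse(predicted, clean))
```

Two of the six bins differ by 1, so the toy error is 2/6.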

3.2 Neural Vocoders

Next, conditioned on the predicted mel-spectrogram, a neural vocoder is used to synthesize de-noised speech. We compare two neural vocoders: WaveNet [4] and WaveGlow [11]. The neural vocoders are trained to generate clean speech from corresponding clean mel-spectrograms.

3.2.1 WaveNet

WaveNet [4] is a speech waveform generation model built with dilated causal convolutional layers. The model is autoregressive, i.e., generation of one speech sample at time step $t$ ($x_t$) is conditioned on all previous time step samples ($x_1, x_2, \ldots, x_{t-1}$). The dilation of the convolutional layers increases by a factor of 2 between subsequent layers and then repeats starting from 1. Gated activations with residual and skip connections are used in WaveNet. It is trained to maximize the likelihood of the clean speech samples. The normalized log mel-spectrogram is used in local conditioning.

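The output of WaveNet is modelled as a $K$-component mixture of logistics, as described in [9, 8], for high quality synthesis. The model predicts a set of values $\Theta = \{\pi_i, \mu_i, s_i\}_{i=1}^{K}$, where each component of the distribution has its own parameters $\mu_i, s_i$ and the components are mixed with probability $\pi_i$. The likelihood of sample $x_t$ is then

$$P(x_t \mid \Theta, X) = \sum_{i=1}^{K} \pi_i \left[ \sigma\!\left( \frac{\tilde{x}_{ti} + 0.5}{s_i} \right) - \sigma\!\left( \frac{\tilde{x}_{ti} - 0.5}{s_i} \right) \right], \qquad (2)$$

where $\tilde{x}_{ti} = x_t - \mu_i$, $\sigma$ is the logistic function, and $P(x_t \mid \Theta, X)$ is the probability density function of clean speech conditioned on mel-spectrogram $X$.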
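The mixture-of-logistics likelihood described above can be sketched in a few lines. This is an illustrative implementation of the discretized form used in [9, 8], assuming integer-valued samples so that each bin has half-width 0.5; the function names and the toy parameters are hypothetical and are not taken from the WaveNet implementation used here.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def logistic_mixture_likelihood(x, weights, means, scales, half_bin=0.5):
    """Probability of the bin centered at x under a mixture of logistics.

    weights, means, and scales hold one entry per mixture component;
    the weights are assumed to sum to 1.
    """
    total = 0.0
    for pi, mu, s in zip(weights, means, scales):
        centered = x - mu
        upper = sigmoid((centered + half_bin) / s)
        lower = sigmoid((centered - half_bin) / s)
        total += pi * (upper - lower)
    return total


# Toy two-component mixture (hypothetical parameters).
weights, means, scales = [0.3, 0.7], [-2.0, 5.0], [1.0, 2.5]
mass = sum(logistic_mixture_likelihood(x, weights, means, scales)
           for x in range(-200, 201))
print(mass)  # the bin probabilities sum to (almost exactly) 1
```

Because adjacent bins share their edges, the sum over all bins telescopes, which is a convenient sanity check that the bin probabilities form a valid distribution.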

We use a publicly available implementation of WaveNet with a setup similar to Tacotron 2 [8]: 24 layers grouped into 4 dilation cycles, 512 residual channels, 512 gate channels, 256 skip channels, and a mixture-of-logistics output with 10 components. Because the model is autoregressive, synthesis is very slow. The PR system with WaveNet as its vocoder is referred to as PR-WaveNet.
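To make the layer layout concrete, the sketch below lists the dilations implied by 24 layers in 4 cycles, with the dilation doubling within each cycle and restarting from 1. The receptive-field helper takes the convolution kernel size as a parameter because the text does not state it; the value 2 used in the example is an assumption borrowed from the original WaveNet description.

```python
def dilation_schedule(num_layers=24, num_cycles=4):
    """Dilation of each layer: doubles within a cycle, then restarts at 1."""
    if num_layers % num_cycles != 0:
        raise ValueError("num_layers must be divisible by num_cycles")
    layers_per_cycle = num_layers // num_cycles
    return [2 ** (i % layers_per_cycle) for i in range(num_layers)]


def receptive_field(dilations, kernel_size):
    """Number of input samples seen by one output of the dilated stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = dilation_schedule()
print(dilations[:6])                    # one cycle: [1, 2, 4, 8, 16, 32]
print(receptive_field(dilations, 2))    # 253 samples for kernel size 2
```

Under these assumptions each output sample directly sees only a few hundred past samples, so longer-range structure has to come from the mel-spectrogram conditioning.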

3.2.2 WaveGlow

WaveGlow [11] is based on the Glow concept [20] and has faster synthesis than WaveNet. WaveGlow learns an invertible transformation between blocks of eight time domain audio samples and a standard normal distribution conditioned on the log mel spectrogram. It then generates audio by sampling from this Gaussian density.

The invertible transformation is a composition of a sequence of individual invertible transformations ($f_1, f_2, \ldots$), called normalizing flows. Each flow in WaveGlow consists of a convolutional layer followed by an affine coupling layer. The affine coupling layer is a neural transformation that predicts a scale and bias conditioned on the input speech $x$ and mel-spectrogram $X$. Let $W_k$ be the learned weight matrix for the $k$-th convolutional layer and $s_j(x, X)$ be the predicted scale value at the $j$-th affine coupling layer.

For inference, WaveGlow samples $z$ from a zero-mean spherical Gaussian distribution and applies the inverse transformations ($f^{-1}$) conditioned on the mel-spectrogram ($X$) to get back the speech sample $x$. Because parallel sampling from a Gaussian distribution is trivial, all audio samples are generated in parallel. The model is trained to minimize the negative log-likelihood of the clean speech samples $x$,

$$-\ln p(x \mid X) = -\left[ \mathcal{L}(z) + \sum_{j=1}^{J} \ln s_j(x, X) + \sum_{k=1}^{M} \ln \left| \det W_k \right| \right], \qquad (3)$$

where $J$ is the number of coupling transformations, $M$ is the number of convolutions, and $\mathcal{L}(z)$ is the log-likelihood of the spherical Gaussian with variance $\nu^2$; a fixed value of $\nu$ is used in training. Note that WaveGlow refers to this parameter as $\sigma$, but we use $\nu$ to avoid confusion with the logistic function in (2). We use the official published WaveGlow implementation with its original setup (12 coupling layers, each consisting of 8 layers of dilated convolution with 512 residual and 256 skip connections). We refer to the PR system with WaveGlow as its vocoder as PR-WaveGlow.
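The key property of an affine coupling layer is that it can be inverted exactly, and that its log-determinant is simply the sum of the predicted log-scales. The toy sketch below illustrates this on one block of eight samples. The `toy_conditioner` function is a hypothetical stand-in for the network that WaveGlow uses to predict scale and bias; nothing here reproduces the official implementation.

```python
import math


def toy_conditioner(x_a, mel):
    """Hypothetical stand-in for the network predicting log-scale and bias.

    It only sees the untouched half of the block (x_a) and the
    conditioning features (mel), which is what makes the layer invertible.
    """
    context = sum(x_a) + sum(mel)
    log_s = [0.1 * math.tanh(context + i) for i in range(len(x_a))]
    bias = [0.5 * math.sin(context + i) for i in range(len(x_a))]
    return log_s, bias


def coupling_forward(x, mel):
    """Map audio block x towards the latent space; also return log|det|."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, bias = toy_conditioner(x_a, mel)
    y_b = [math.exp(ls) * v + b for ls, v, b in zip(log_s, x_b, bias)]
    return x_a + y_b, sum(log_s)


def coupling_inverse(y, mel):
    """Exactly undo coupling_forward, as done at synthesis time."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, bias = toy_conditioner(y_a, mel)
    x_b = [(v - b) * math.exp(-ls) for ls, v, b in zip(log_s, y_b, bias)]
    return y_a + x_b


x = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 0.2, 0.1]   # one block of eight samples
mel = [0.3, -0.1]                                # toy conditioning features
y, log_det = coupling_forward(x, mel)
x_rec = coupling_inverse(y, mel)
print(max(abs(a - b) for a, b in zip(x, x_rec)))  # reconstruction error
```

The first half of the block passes through unchanged, so the inverse can recompute the same scale and bias without ever inverting the conditioner itself.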

3.3 Joint Training

Since the neural vocoders are originally trained on clean mel-spectrograms but are tested on predicted mel-spectrograms, we can also train both parts of the PR-neural system jointly. The aim of joint training is to compensate for the disparity between the mel-spectrograms predicted by the prediction model and those the neural vocoder was trained to consume. Both parts of the PR-neural systems are pretrained and then trained jointly to maximize a combined objective: the vocoder likelihood minus the mel-spectrogram squared loss. These models are referred to as PR-neural vocoder-Joint, e.g., PR-WaveNet-Joint. We experiment both with and without fine-tuning each component.
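One way to write the joint objective described above is sketched below. The symbols are assumptions introduced for illustration: $\theta_p$ and $\theta_v$ denote the prediction-model and vocoder parameters, $x$ the clean waveform, $X$ and $\hat{X}_{\theta_p}$ the clean and predicted mel-spectrograms, and $\lambda$ a hypothetical weighting between the two terms, which the text does not specify.

```latex
% Hedged sketch of the joint objective; \lambda is a hypothetical weight.
\max_{\theta_p,\, \theta_v} \;
  \ln p_{\theta_v}\!\left( x \mid \hat{X}_{\theta_p} \right)
  \;-\; \lambda \sum_{\omega, t}
  \left( X(\omega, t) - \hat{X}_{\theta_p}(\omega, t) \right)^2
```

Holding $\theta_v$ or $\theta_p$ fixed in this expression gives the partial fine-tuning variants in which only one component is updated.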

4 Experiments

For our experiments, we use the LJSpeech dataset [21], to which we add environmental noise from CHiME-3 [22]. The LJSpeech dataset contains 13100 audio clips from a single speaker, varying in length from 1 to 10 seconds, at a sampling rate of 22 kHz. The clean speech is recorded with the microphone in a MacBook Pro in a quiet home environment. CHiME-3 contains four types of environmental noise: street, bus, pedestrian, and cafe. Note that the CHiME-3 noises were recorded at a 16 kHz sampling rate. To mix them with LJSpeech, we synthesized white Gaussian noise in the 8-11 kHz band matched in energy to the 7-8 kHz band of the original recordings. The SNR of the generated noisy speech varies from  dB to  dB SNR with an average of 1 dB. We use 13000 noisy files for training, almost 24 hours of data. The test set consists of 24 files, 6 from each noise type. The SNR of the test set varies from  dB to  dB. The mel-spectrograms are created with a window size of 46.4 ms, a hop size of 11.6 ms, and 80 mel bins. The prediction model has 3 bidirectional LSTM layers with 400 units each and was trained with an initial learning rate of 0.001 for 500 epochs with a batch size of 64.
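The sketch below illustrates two pieces of arithmetic behind this setup. First, the stated window and hop sizes are consistent with 1024-sample and 256-sample frames, assuming the "22 kHz" rate is the usual 22050 Hz; both the frame lengths and the exact rate are assumptions, not values given in the text. Second, it shows one common way to scale a noise signal so that a mixture reaches a target SNR; the mixing procedure actually used for the dataset may differ.

```python
import math

SAMPLE_RATE_HZ = 22050  # assumption: "22 kHz" means 22.05 kHz


def samples_to_ms(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Duration in milliseconds of a frame of num_samples samples."""
    return 1000.0 * num_samples / sample_rate


def power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(v * v for v in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB."""
    return 10.0 * math.log10(power(speech) / power(noise))


def scale_noise_to_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech plus noise has the target SNR."""
    target_ratio = 10.0 ** (target_snr_db / 10.0)
    gain = math.sqrt(power(speech) / (power(noise) * target_ratio))
    return [gain * v for v in noise]


print(round(samples_to_ms(1024), 1))  # window length in ms
print(round(samples_to_ms(256), 1))   # hop length in ms

# Toy signals (hypothetical): a sinusoid as "speech", a sawtooth as "noise".
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5.0 for i in range(1000)]
scaled = scale_noise_to_snr(speech, noise, 1.0)
noisy = [s + n for s, n in zip(speech, scaled)]
print(snr_db(speech, scaled))  # close to the 1 dB target
```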

Both WaveGlow and WaveNet have published pre-trained models on the LJSpeech data. We use these pre-trained models due to limitations in GPU resources (training the WaveGlow model from scratch takes 2 months on a GeForce GTX 1080 Ti GPU). The published WaveGlow pre-trained model was trained for k iterations (batch size 12) with weight normalization [23]. The pre-trained WaveNet model was trained for k iterations (batch size 2). The model also uses L2-regularization with a weight of . The average weights of the model parameters are saved as an exponential moving average with a decay of 0.9999 and used for inference, as this is found to provide better quality [8]. PR-WaveNet-Joint is initialized with the pre-trained prediction model and WaveNet. It is then trained end-to-end for k iterations with a batch size of 1. Each training iteration takes  s on a GeForce GTX 1080 GPU. PR-WaveGlow-Joint is also initialized with the pre-trained prediction and WaveGlow models. It was then trained for k iterations with a batch size of 3. On a GeForce GTX 1080 Ti GPU, each iteration takes  s. Because WaveNet synthesizes audio samples sequentially, its synthesis rate is samples per second, or 0.004 times realtime. Synthesizing 1 s of audio at 22 kHz takes  s. Because WaveGlow synthesis can be done in parallel, it takes  s to synthesize  s of audio at a 22 kHz sampling rate.
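The exponential moving average of the weights mentioned above can be sketched as follows. This is a generic illustration with hypothetical scalar "weights"; the function name `ema_update` is an assumption, and the actual WaveNet implementation applies the same update to every parameter tensor.

```python
def ema_update(shadow, params, decay=0.9999):
    """Return shadow weights updated as decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Toy usage: the raw weight jumps from 0 to 1 and the average follows slowly.
shadow = [0.0]
for _ in range(10000):
    shadow = ema_update(shadow, [1.0])
print(shadow[0])  # equals 1 - 0.9999**10000 up to rounding, about 0.63
```

With a decay of 0.9999 the average effectively spans on the order of ten thousand iterations, which smooths out the noise from the small batch sizes used in training.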

We compare these two PR-neural models with PR-World, our previously proposed model [2], where the WORLD vocoder is used and the intermediate acoustic parameters are the fundamental frequency, spectral envelope, and band aperiodicity used by WORLD [3]. Note that WORLD does not support 22 kHz sampling rates, so this system generates output at 16 kHz. We also compare all PR models with two speech enhancement systems. The first is the oracle Wiener mask (OWM), which has access to the original clean speech. The second is a recently proposed source separation system called Chimera++ [12], which uses a combination of the deep clustering loss and mask inference loss to estimate masks. We use our own implementation of Chimera++, which we verified achieves the reported performance on the same dataset as the published model. It was trained with the same data as the PR systems. In addition to the OWM, we measure the best-case resynthesis quality by evaluating the neural vocoders conditioned on the true clean mel-spectrograms.
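For readers unfamiliar with the oracle Wiener mask, the sketch below shows the textbook form: for each time-frequency bin, the mask is the ratio of clean speech power to the sum of speech and noise power. This is an assumption about the standard definition, not a description of the exact implementation used in these experiments, and the small `eps` term is added only to avoid division by zero.

```python
def oracle_wiener_mask(speech_power, noise_power, eps=1e-12):
    """Per-bin mask |S|^2 / (|S|^2 + |N|^2) from oracle power spectrograms.

    Both arguments are lists of frames, each a list of per-frequency powers.
    """
    mask = []
    for s_frame, n_frame in zip(speech_power, noise_power):
        mask.append([s / (s + n + eps) for s, n in zip(s_frame, n_frame)])
    return mask


# Toy single-frame example (hypothetical powers).
mask = oracle_wiener_mask([[1.0, 0.0, 4.0]], [[1.0, 2.0, 0.0]])
print(mask)  # roughly [[0.5, 0.0, 1.0]]
```

The mask is called "oracle" because it requires the clean speech and noise separately, which are only available when the mixture is constructed artificially.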

Following [16, 17, 18], we compute three composite objective metrics: SIG (signal distortion), BAK (background intrusiveness), and OVL (overall quality), as described in [24, 25]. All three measures produce numbers between 1 and 5, with higher meaning better quality. We also report PESQ scores as a combined measure of quality and STOI [26] as a measure of intelligibility. All test files are downsampled to 16 kHz for measuring objective metrics.

We also conducted a listening test to measure the subjective quality and intelligibility of the systems. For the listening test, we choose 12 of the 24 test files, with three files from each of the four noise types. The listening test follows the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) paradigm [27]. Subjects were presented with 9 anonymized and randomized versions of each file to facilitate direct comparison: 5 PR systems (PR-WaveNet, PR-WaveNet-Joint, PR-WaveGlow, PR-WaveGlow-Joint, PR-World), 2 comparison speech enhancement systems (oracle Wiener mask and Chimera++), and clean and noisy signals. The PR-World files are sampled at 16 kHz but the other 8 systems used 22 kHz. Subjects were also provided reference clean and noisy versions of each file. Five subjects took part in the listening test. They were told to rate the speech quality, noise-suppression quality, and overall quality of the speech from , with being the best.

Subjects were also asked to rate the subjective intelligibility of each utterance on the same scale. Specifically, they were asked to rate a model higher if it was easier to understand what was being said. We used an intelligibility rating because in our previous experiments asking subjects for transcripts showed that all systems were near ceiling performance. This could also have been a product of presenting different versions of the same underlying speech to the subjects. Intelligibility ratings, while less concrete, do not suffer from these problems.

Footnote 3: All files are available at

5 Results

Model               SIG   BAK   OVL   PESQ   STOI
-------------------------------------------------
Clean               5.0   5.0   5.0   4.50   1.00
WaveGlow            5.0   4.1   5.0   3.81   0.98
WaveNet             4.9   2.8   4.0   3.05   0.94
-------------------------------------------------
Oracle Wiener       4.0   2.4   3.2   2.90   0.91
-------------------------------------------------
PR-WaveGlow         3.9   2.5   3.1   2.58   0.87
PR-WaveNet          3.8   2.2   3.0   2.46   0.87
Chimera++           3.7   2.1   2.8   2.44   0.86
PR-WaveGlow-Joint   3.6   2.5   2.9   2.28   0.84
PR-WaveNet-Joint    3.5   2.1   2.7   2.31   0.83
PR-World            2.8   2.1   2.3   1.53   0.79
Noisy               1.9   1.9   1.7   1.58   0.74
Table 1: Speech enhancement objective metrics: higher is better. Systems in the top section decode from clean speech as upper bounds. Systems in the middle section use oracle information about the clean speech. Systems in the bottom section are not given any oracle knowledge. All systems sorted by SIG.

Table 1 shows the objective metric comparison of the systems. In terms of objective quality, comparing neural vocoders synthesizing from clean speech, we observe that WaveGlow scores higher than WaveNet. WaveNet synthesis has comparably high SIG quality (4.9 versus 5.0), but lower BAK and OVL. Comparing the speech enhancement systems, both PR-neural systems outperform Chimera++ in all measures. Compared to the oracle Wiener mask, the PR-neural systems perform slightly worse. After further investigation, we observe that the PR resynthesis files are not perfectly aligned with the clean signal itself, which affects the objective scores significantly. Interestingly, joint training decreases the performance of both PR-neural systems. When listening to the files, PR-WaveNet-Joint sometimes contains mumbled, unintelligible speech and PR-WaveGlow-Joint introduces more distortions.

In terms of objective intelligibility, we observe that the clean WaveNet model has lower STOI than WaveGlow. The STOI measurement also requires both speech inputs to be exactly time-aligned, which the WaveNet model does not necessarily provide. The PR-neural systems have higher objective intelligibility than Chimera++. With PR-WaveGlow, we observe that joint training actually reduces STOI from 0.87 to 0.84. We also observe that tuning WaveGlow's $\sigma$ parameter, the standard deviation of the Gaussian sampled at inference, has an effect on quality and intelligibility. When a smaller $\sigma$ is used, the synthesis has more speech drop-outs. When a larger $\sigma$ is used, these drop-outs decrease, but the BAK score also decreases. We believe that with a lower $\sigma$, when conditioned on a predicted spectrogram, the PR-WaveGlow system only generates segments of speech it is confident in, and mutes the rest.

Figure 2: Subjective quality: higher is better. Error bars show twice the standard error.

Figure 3: Subjective Intelligibility: higher is better.

Figure 2 shows the result of the quality listening test. PR-WaveNet performs best in all three quality scores, followed by PR-WaveNet-Joint, PR-WaveGlow-Joint, and PR-WaveGlow. Both PR-neural systems have much higher quality than the oracle Wiener mask. The next best model is PR-World, followed by Chimera++. PR-World performs comparably to the oracle Wiener mask, but these ratings are lower than we found in [2]. This is likely due to the use of 22 kHz sampling rates in the current experiment but 16 kHz in our previous experiments.
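The error bars in Figure 2 show twice the standard error of the mean. As a minimal sketch of how such a bar can be computed from a set of listener ratings, consider the following; the ratings are hypothetical values chosen for illustration and are not data from the listening test.

```python
import math
import statistics


def mean_and_error_bar(ratings):
    """Return the mean rating and twice the standard error of the mean."""
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 2.0 * standard_error


# Hypothetical ratings for one system.
ratings = [60.0, 70.0, 80.0, 90.0, 100.0]
mean, bar = mean_and_error_bar(ratings)
print(mean, bar)
```

With only a handful of ratings per condition the standard error shrinks slowly, which is one reason small differences between systems may not be significant.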

Figure 3 shows the subjective intelligibility ratings. We observe that noisy and hidden noisy signals have reasonably high subjective intelligibility, as humans are good at understanding speech in noise. The OWM has slightly higher subjective intelligibility than PR-WaveGlow. PR-WaveNet has slightly but not significantly higher intelligibility, and the clean files have the best intelligibility. The PR-neural-Joint models have lower intelligibility, caused by the speech drop-outs or mumbled speech as mentioned above.

Vocoder    Prediction model fine-tuned   Vocoder fine-tuned   SIG   BAK   OVL   PESQ   STOI
-------------------------------------------------------------------------------------------
WaveNet    no                            no                   3.8   2.2   3.0   2.46   0.87
WaveNet    yes                           no                   3.9   2.2   3.1   2.49   0.88
WaveNet    no                            yes                  3.1   1.9   2.3   2.02   0.78
WaveNet    yes                           yes                  3.5   2.1   2.7   2.29   0.83
WaveGlow   no                            no                   3.9   2.5   3.1   2.58   0.87
WaveGlow   yes                           no                   4.0   2.5   3.2   2.70   0.90
WaveGlow   no                            yes                  3.6   2.5   2.9   2.24   0.82
WaveGlow   yes                           yes                  3.6   2.4   2.9   2.28   0.84
Table 2: Objective metrics for different joint fine-tuning schemes for PR-neural systems components.

6 Discussion of Joint Training

Table 2 shows the results of further investigation of the drop in performance caused by jointly training the PR-neural systems. The PR-neural-Joint models are trained using the vocoder losses. After joint training, both WaveNet and WaveGlow seemed to change the prediction model to make the intermediate clean mel-spectrogram louder. As training continued, this predicted mel-spectrogram did not approach the clean spectrogram, but instead became a very loud version of it, which did not improve performance. When the prediction model was fixed and only the vocoders were fine-tuned, we observed a large drop in performance. In WaveNet this introduced more unintelligible speech, making it smoother but garbled. In WaveGlow this increased speech drop-outs (as can be seen in the reduced STOI scores). Finally, with the neural vocoder fixed, we trained the prediction model to minimize a combination of mel-spectrogram MSE and vocoder loss. This provided slight improvements in performance: both PR-WaveNet and PR-WaveGlow improved in intelligibility scores as well as in SIG and OVL.

7 Conclusion

This paper proposes the use of neural vocoders in parametric resynthesis for high quality speech enhancement. We show that using two neural vocoders, WaveGlow and WaveNet, produces better quality enhanced speech than using a traditional vocoder like WORLD. We also show that PR-neural models outperform the recently proposed Chimera++ mask-based speech enhancement system in all intelligibility and quality scores. Finally, we show that PR-WaveNet achieves significantly better subjective quality scores than the oracle Wiener mask. In the future, we will explore the speaker-dependence of these models.

8 Acknowledgements

This material is based upon work supported by the National Science Foundation (NSF) grant IIS-1618061. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

