# Speech Synthesis Using Generative Adversarial Networks

###### Abstract

In this report, we study the difficulty of generating speech from scratch, either conditionally or unconditionally, using generative adversarial networks (GANs). We explored generating and discriminating audio in different forms with various architectures. Although the generated waveforms and spectrograms looked reasonable, the actual audio sounded far from realistic speech, presumably due to sensitivity to high-frequency artifacts.

## 1 Introduction

There has been much recent research on generating samples from data distributions using GANs, with many examples in image generation. There has also been work [?] [?] on generating audio, but none of it uses a GAN.

In this report, we attempted to generate audio using GANs. We experimented with generating and discriminating multiple audio forms, including raw waveforms, spectrograms, and mel-spectrograms. We also tried feed-forward and recurrent architectures. All approaches generated realistic-looking waveforms and spectrograms; however, we found that the actual generated audio is far from realistic. The experiments and results are described in detail in Section 3. The code is available at https://github.com/BarclayII/audiogan

## 2 Related Works

## 2.1 Generative Adversarial Networks

GANs learn to model the data distribution by generating realistic samples using two competing models [?]. The generator $G$ produces a sample from a noise vector $z$: $\hat{x} = G(z)$, where $z$ is drawn from a known prior, e.g. a standard normal distribution. The discriminator $D$ learns to tell whether an input comes from the real dataset or is generated by $G$. The training process alternates between minimizing $L_D$ and $L_G$ w.r.t. the parameters of $D$ and $G$ respectively:

$$\begin{aligned} L_D &= -E_{x \sim p_{\text{data}}} [ \eta_D \log D(x) + (1 - \eta_D) \log (1 - D(x)) ] - E_{z \sim p_z} \log (1 - D(G(z))) \\ L_G &= -E_{z \sim p_z} [ \eta_G \log D(G(z)) + (1 - \eta_G) \log (1 - D(G(z))) ] \end{aligned}$$

where $\eta_D$ is the hyperparameter for label smoothing (usually 0.9) [?]. The optimal value of $\eta_G$ is a subject of active research; we tried 1, as in a normal GAN, and 0.5, as in a boundary-seeking GAN [?].
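As a minimal sketch of the two objectives above, the following evaluates $L_D$ and $L_G$ on a batch of discriminator outputs. The function names and the example probabilities are hypothetical and not taken from the released code; the expectations are replaced by batch averages.

```python
import math

def discriminator_loss(d_real, d_fake, eta_d=0.9):
    """L_D: label-smoothed cross entropy on real samples plus -log(1 - D(G(z))) on fakes."""
    real_term = -sum(eta_d * math.log(p) + (1.0 - eta_d) * math.log(1.0 - p)
                     for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake, eta_g=1.0):
    """L_G: eta_g = 1 is the normal GAN setting, eta_g = 0.5 the boundary-seeking one."""
    return -sum(eta_g * math.log(p) + (1.0 - eta_g) * math.log(1.0 - p)
                for p in d_fake) / len(d_fake)

d_real = [0.8, 0.7, 0.9]  # hypothetical D(x) on real samples, strictly inside (0, 1)
d_fake = [0.3, 0.4, 0.2]  # hypothetical D(G(z)) on generated samples
loss_d = discriminator_loss(d_real, d_fake)
loss_g_normal = generator_loss(d_fake, eta_g=1.0)
loss_g_boundary = generator_loss(d_fake, eta_g=0.5)
```

With $\eta_G = 0.5$ the generator loss is smallest when $D(G(z)) = 0.5$, i.e. when the discriminator is maximally uncertain, whereas $\eta_G = 1$ keeps rewarding the generator as $D(G(z))$ approaches 1.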

GAN training is widely known to be unstable [?]. Feature matching is a common method to help the generator find an appropriate distribution by additionally minimizing the difference between the average features of real and generated samples on each intermediate discriminator layer [?]:

$$L_G^{fm,(l)} = \| E_{x \sim p_{\text{data}}} f^{(l)}(x) - E_{z \sim p_z} f^{(l)}(G(z)) \|^2$$

where $f^{(l)}$ is the $l$-th intermediate layer activation of $D$. We generalized the idea of feature matching in Section 3.5.4.
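The feature matching loss for one layer can be sketched as follows, with an optional moment order `n` to illustrate the central-moment generalization of Section 3.5.4. This is an illustrative sketch: the function names are hypothetical, activations are plain lists of per-sample feature vectors, and using a squared norm for `n > 1` is an assumption.

```python
def moment_features(batch, n=1):
    """Per-dimension mean (n = 1) or n-th central moment (n >= 2) over a batch."""
    size = len(batch)
    dims = range(len(batch[0]))
    means = [sum(row[j] for row in batch) / size for j in dims]
    if n == 1:
        return means
    return [sum((row[j] - means[j]) ** n for row in batch) / size for j in dims]

def feature_matching_loss(real_feats, fake_feats, n=1):
    """Squared L2 distance between the batch statistics of one discriminator layer."""
    stat_real = moment_features(real_feats, n)
    stat_fake = moment_features(fake_feats, n)
    return sum((a - b) ** 2 for a, b in zip(stat_real, stat_fake))

real_feats = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical f(x) for two real samples
fake_feats = [[0.0, 0.0], [2.0, 2.0]]  # hypothetical f(G(z)) for two generated samples
loss_mean = feature_matching_loss(real_feats, fake_feats, n=1)      # means differ
loss_variance = feature_matching_loss(real_feats, fake_feats, n=2)  # variances agree
```

In this toy batch the means differ while the variances coincide, so only the first-order term contributes to the generator's loss.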

## 2.2 Speech and Audio Synthesis

Several speech synthesis networks have been developed recently. Depending on the method of waveform generation we can categorize the frameworks as follows:

1. Raw waveform synthesizer. VRNN [?] is an RNN based on the variational auto-encoder which has proved capable of modeling audio. It represents audio as a sequence of frames, each of which consists of 200 amplitudes. In our work, we tried the same representation of audio output, but replaced the variational encoder-decoder architecture with a GAN.

WaveNet [?] has succeeded in generating audio directly as raw amplitudes. WaveNet is a conditional model of audio amplitudes, which can be represented as:

$$W_t^G = F(W_{0:t-1}, \text{word})$$

This model can therefore be trained through traditional supervised learning, which makes it much more straightforward to fit to a data distribution with stochastic optimization. However, it must generate samples autoregressively, which is much more computationally expensive because it involves many generation iterations.

2. Spectrogram synthesizer. Tacotron [?] also translates text to audio, but it does so via an intermediate linear-frequency spectrogram representation. Like our GAN objective, Tacotron generates the entire audio representation simultaneously. The difference, however, is that it is also designed as a supervised model. Therefore, Tacotron successfully translates text to audio, but it does so without conditioning on random noise input.

We experimented with both raw waveform generation and spectrogram generation.

## 3 Methods and Experiments

## 3.1 Evaluation

Although kernel Maximum Mean Discrepancy and the 1-NN two-sample test have recently been proposed for quantitative GAN evaluation [?], as of this writing we did not find any prior GAN models that generate speech to compare with. In the scope of our project, we evaluated model performance extrinsically by manual inspection, i.e. by actually listening to the generated audio. Throughout development, we also monitored several intrinsic metrics. The core one was the discriminator's average accuracy on real and generated samples over time. In the conditional case, we also measured the discriminator's ability to distinguish correctly matched audio from mismatched audio over time.

## 3.2 Dataset

We used the Fisher English Corpus [?], a collection of English conversations with transcripts, as the data source. We used the FAVE toolkit [?] to align the speech with individual words, and extracted the audio segment for each word. We then resampled the audio to a sample rate of 8000 Hz. For conditional generation, we set aside 10% of the words as a holdout set.

## 3.3 Balancing Generator and Discriminator

[?] suggested that the discriminator should be trained more frequently than the generator in order to give the generator a sufficiently accurate direction to update. Therefore, after each generator update, we kept updating the discriminator until it classified at least 60% of the real and 60% of the generated samples correctly.
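The balancing rule can be sketched as a small loop. Everything named here is hypothetical: `update_discriminator` stands for one discriminator step that reports its accuracies, and `max_steps` is a safety cap that the text does not mention.

```python
def balance_discriminator(update_discriminator, threshold=0.6, max_steps=100):
    """Update D until it classifies >= threshold of real and of generated samples correctly.

    update_discriminator is a hypothetical callable that performs one D update and
    returns (accuracy_on_real, accuracy_on_generated). Returns the number of updates.
    """
    steps = 0
    while steps < max_steps:
        acc_real, acc_fake = update_discriminator()
        steps += 1
        if acc_real >= threshold and acc_fake >= threshold:
            break
    return steps

# Hypothetical accuracies reported by successive discriminator updates.
accuracy_history = iter([(0.40, 0.70), (0.55, 0.65), (0.62, 0.61), (0.90, 0.90)])
steps_taken = balance_discriminator(lambda: next(accuracy_history))
```

Both accuracies must pass the threshold at the same time, so a discriminator that is strong on real samples but weak on generated ones keeps training.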

## 3.4 Raw Waveform Generation

For raw waveform generation, we limited the maximum number of amplitudes to generate to 24000 (3 seconds at 8000 Hz). For longer audio samples, we took the first 24000 values.

## 3.4.1 Generator

The generator G is an LSTM with 512 units which generates 200 floating-point amplitudes between -1 and 1 at each time step, similar to how VRNN [?] generated speech:

$$z_t \sim N(0, I)$$

$$\begin{aligned} h_t^G &= \mathrm{LSTM}(x_{t-1}, h_{t-1}^G, z_t) \\ x_t &= \tanh(\mathrm{MLP}(h_t^G)) \\ x &= [x_1; x_2; \ldots; x_t] \end{aligned}$$

where $z_t$ is a 100-dimensional vector sampled from a standard normal distribution.

## 3.4.2 Policy Gradient for Sequence Length Control

The generator should also know when to stop generating audio, given a noise vector. At each time step, we additionally predict the probability $s_t$ of stopping generation afterwards with a stopper network $\mathrm{MLP}_s$:

$$s_t = \operatorname{sigmoid}(\mathrm{MLP}_s(h_t^G))$$

At generation, we randomly decide whether to stop at time step $t$ by sampling from a Bernoulli distribution with parameter $s_t$. Since halting generation requires sampling, $\mathrm{MLP}_s$ is not directly trainable via back-propagation. We therefore train $\mathrm{MLP}_s$ with policy gradient, with the reward as the negative of the discriminator loss on this sample, and the baseline as an exponential moving average of historical discriminator losses. The policy gradient algorithm therefore tries to minimize the discriminator loss as well. This is equivalent to minimizing the following loss function w.r.t. $\mathrm{MLP}_s$:

$$L_G^{pg} = -(L_D - b) \left[ \sum_{t=1}^{T(G(z)) - 1} \log (1 - s_t) + \log s_{T(G(z))} \right]$$

where $L_D$ is the discriminator loss described below, $T(G(z))$ is the first time step at which the stopper $\mathrm{MLP}_s$ samples 1, and $b$ is the baseline of the policy gradient algorithm. In our experiment, the coefficient for the exponential moving average is 0.5.
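A minimal sketch of the stopping mechanism and its surrogate loss, assuming the stop probabilities $s_t$ are already available as a list. The function names are hypothetical, the sign follows the formula as printed, and forcing a stop at the last step when none is sampled is an added assumption.

```python
import math
import random

def sample_stop_time(stop_probs, rng):
    """Return the first 1-based step t at which Bernoulli(s_t) samples 1.

    Falls back to the last step if no stop is sampled (an assumption).
    """
    for t, s in enumerate(stop_probs, start=1):
        if rng.random() < s:
            return t
    return len(stop_probs)

def stop_log_prob(stop_probs, stop_time):
    """Log-probability of continuing at steps 1..T-1 and stopping at step T."""
    continued = sum(math.log(1.0 - s) for s in stop_probs[:stop_time - 1])
    return continued + math.log(stop_probs[stop_time - 1])

def policy_gradient_loss(stop_probs, stop_time, disc_loss, baseline):
    """Surrogate loss -(L_D - b) * log-probability, as in the formula above."""
    return -(disc_loss - baseline) * stop_log_prob(stop_probs, stop_time)

def update_baseline(baseline, disc_loss, coefficient=0.5):
    """Exponential moving average of discriminator losses (coefficient 0.5)."""
    return coefficient * baseline + (1.0 - coefficient) * disc_loss

rng = random.Random(0)
stop_probs = [0.1, 0.2, 0.5, 0.9]  # hypothetical s_t values
stop_time = sample_stop_time(stop_probs, rng)
loss_pg = policy_gradient_loss(stop_probs, stop_time, disc_loss=1.0, baseline=0.5)
```

Only the log-probability term depends on the stopper's parameters; the factor $(L_D - b)$ acts as a fixed weight for the sampled stopping time.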

## 3.4.3 Discriminator

We employed a discriminator inspired by [?]. The discriminator D is a bidirectional LSTM with 512 units, which takes in 200 amplitudes at a time, and it outputs a separate score at each time step.

$$\begin{aligned} h_t^{D,f} &= \mathrm{LSTM}(x_t, h_{t-1}^{D,f}) \\ h_t^{D,b} &= \mathrm{LSTM}(x_t, h_{t+1}^{D,b}) \\ D^t(x) &= \mathrm{MLP}(h_t^{D,f}, h_t^{D,b}) \end{aligned}$$

The loss function of $D$ is the average cross entropy over the scores on all time steps:

$$\begin{aligned} L_D = &- E_{x \sim p_{\text{data}}} \left[ \frac{1}{T(x)} \sum_{t=1}^{T(x)} \left[ \eta_D \log D^t(x) + (1 - \eta_D) \log (1 - D^t(x)) \right] \right] \\ &- E_{z \sim p_z} \left[ \frac{1}{T(G(z))} \sum_{t=1}^{T(G(z))} \log (1 - D^t(G(z))) \right] \end{aligned}$$

where $T(x)$ is the number of frames, i.e. groups of 200 amplitudes, in $x$. This form of discriminator is able to focus more on local patterns, while keeping an eye on the global structure.
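To illustrate the per-frame averaging, the sketch below evaluates the cross entropy for one real and one generated clip of different lengths. The function name and the example scores are hypothetical; the expectations over the dataset and the noise prior are left out.

```python
import math

def framewise_discriminator_loss(real_scores, fake_scores, eta_d=0.9):
    """Cross entropy averaged over frames for one real and one generated clip.

    real_scores and fake_scores hold D^t(x) and D^t(G(z)), one probability per
    frame; the two clips may have different numbers of frames.
    """
    real = -sum(eta_d * math.log(s) + (1.0 - eta_d) * math.log(1.0 - s)
                for s in real_scores) / len(real_scores)
    fake = -sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real + fake

loss_short = framewise_discriminator_loss([0.5, 0.5], [0.5])
loss_long = framewise_discriminator_loss([0.5] * 100, [0.5] * 80)
```

Dividing by the number of frames gives short and long clips the same weight, so in this example both calls return the same value.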

## 3.4.4 Conditional Generation

We also tested a variant of our GAN whose generator and discriminator are conditioned on a word, given as a sequence of characters. Each model uses a separate bidirectional LSTM which summarizes the characters into a vector representation, $c_D$ and $c_G$ respectively. $c_G$ is fed into the generator by changing the recurrence for $h_t^G$ in Section 3.4.1 to

$$h_t^G = \mathrm{LSTM}(x_{t-1}, h_{t-1}^G, z_t, c_G)$$

The discriminator is conditioned on $c_D$ through a pairwise ranking loss. First we obtain the summarization vector of the whole audio sequence:

$$h^D = [h_{T(x)}^{D,f}; h_1^{D,b}]$$

We then minimize an additional ranking loss using negative sampling: we uniformly sample another word from the vocabulary, compute its summarization vector $c'_D$, and additionally minimize

$$L_D^{rank} = \max (0, 1 + (h^D)^\top c'_D - (h^D)^\top c_D)$$

## 3.4.5 Results

Figure 1 shows the comparison between conditionally generated raw waveforms and real waveforms. All the generated waveforms look reasonable, as they all have some structured peaks. Notably, the generated sample for the word "poo" has a similar global structure to that of the real sample. Despite this, the actual audio sounded like noise.

## 3.5 Spectrogram Generation

Training the network to generate raw waveforms is particularly hard, not only because GAN training is inherently unstable, but also because (1) we were training a recurrent network over sequences which could be as long as 100 time steps, and (2) policy gradient for length control imposes another layer of instability. More importantly, (3) individual audio amplitudes are not a meaningful representation of sound. Rather, the local relations between audio amplitudes, or more precisely the magnitudes of local frequencies (periods of variability within the amplitudes), are more directly related to the audio output. Therefore, to represent the magnitudes of local frequencies throughout an audio sample, we tried to train the model to generate and discriminate log-scale spectrograms instead.

For each audio clip, we performed a short-time Fourier transform (STFT) with an FFT window size of 2048, a Hann window length of 2048, and a hop size of 512. We then limited the maximum number of spectrogram frames to 24. We computed the real-valued "log-spectrogram" by taking the log-magnitude of every element of the complex-valued spectrogram. Finally, we normalized the result by subtracting the mean and dividing by the standard deviation of all the values.
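The preprocessing above can be sketched with NumPy as follows. This is an illustration, not the released implementation: the function name, the small `eps` inside the logarithm, the zero-padding of very short clips, the absence of frame centering, and per-clip normalization are all assumptions.

```python
import numpy as np

def log_spectrogram(wave, n_fft=2048, hop=512, max_frames=24, eps=1e-8):
    """Normalized log-magnitude STFT with shape (n_fft // 2 + 1, n_frames)."""
    wave = np.asarray(wave, dtype=float)
    window = np.hanning(n_fft)  # symmetric Hann; the exact variant is not stated
    # Zero-pad short clips so that at least one full frame exists (an assumption).
    if len(wave) < n_fft:
        wave = np.pad(wave, (0, n_fft - len(wave)))
    n_frames = min(1 + (len(wave) - n_fft) // hop, max_frames)
    frames = np.stack([wave[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1).T   # complex-valued, (1025, n_frames)
    log_mag = np.log(np.abs(spec) + eps)   # eps avoids log(0); not in the source
    return (log_mag - log_mag.mean()) / log_mag.std()

# Hypothetical 3-second clip at 8000 Hz: a 440 Hz tone.
t = np.arange(24000) / 8000.0
spectrogram = log_spectrogram(np.sin(2.0 * np.pi * 440.0 * t))
```

An FFT size of 2048 yields 1025 frequency bins per frame, which matches the 1025 output channels of the last generator layers in Table 1.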

For evaluation, we reconstructed the waveform from the real-valued spectrogram according to [?]. We chose this process empirically, because it worked well on the real spectrograms.

## 3.5.1 Generator

The generator G for log-spectrograms was a feed-forward network based on 1D convolutions. It first samples a random feature map z with 32 channels and 6 frames, then maps z into a set of 1-dimensional feature maps, then computes the distribution of spectrogram length and individual values of the log-spectrogram from the feature maps:

$$\Pr(l) = \operatorname{softmax}(G_l(G_\phi(z))) \quad x = G_x(G_\phi(z))$$

where $G_\phi$, $G_l$ and $G_x$ are all deep convolutional networks, whose structures are depicted in Table 1. In general, they are mostly composed of the following types of building blocks:

• Convolution Layer: We denote a convolution layer with kernel size 3 and 128 output channels as Conv-128-3.

• Transposed Convolution Layer: Similarly, we denote a transposed convolution layer with kernel size 3, 128 output channels, and stride 2 as ConvT-128-3-2.

• Parallel Convolutions: Similar to the Inception module [?], this module first passes the input into several 1D convolutional layers in parallel, and concatenates the outputs along the feature channel dimension. We denote a module with four 1D convolutions, each with 128 output channels, and with kernel sizes 1, 1, 3, 5, as PC-128-1-1-3-5.

• Bottlenecking Residual Convolutions: This module passes the input through a residual block consisting of a 1D convolutional layer with stride 2, an intermediate convolutional layer with stride 1, and a deconvolutional layer with stride 2. We denote a module with 2048 intermediate channels, 1024 output channels, and kernel size 3 as BR-2048-1024-3.

LeakyReLU with coefficient 0.1 is inserted after every convolution, except the last layer of G x where no non-linearity is introduced.

Conv-1024-3 BR-2048-1024-3

Conv-1-3 Conv-1025-3 BR-2048-1024-3 PC-512-1-3-3-5 BR-2048-1024-3

Conv-1025-3 ConvT-1024-4-2 PC-256-1-1-3-3 BR-2048-1024-3 BR-2048-1024-3 ConvT-1024-4-2

Table 1: Generator structures. The notation is described in Section 3.5.1.

## 3.5.2 Spectrogram Length Control with Gumbel Softmax

Usually, sampling from a categorical distribution is non-differentiable. One way to avoid policy gradient while keeping the whole computation differentiable is to use Straight-Through Gumbel Softmax [?]. Assuming that we have computed the length probabilities $p_i = \Pr(l = i)$, we "sample" a one-hot vector $\hat{p}$ from $p$ (hence determining the length of the spectrogram), and approximate its gradient $\nabla \hat{p}$ with the straight-through estimator.
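As a rough sketch of the forward pass of Straight-Through Gumbel Softmax, the function below draws a one-hot length sample together with its continuous relaxation. The function name and the temperature `tau` are assumptions, since the temperature used in the experiments is not stated. In an automatic differentiation framework, the one-hot vector would be used in the forward pass while gradients flow through the relaxed sample.

```python
import math
import random

def gumbel_softmax_sample(probs, tau, rng):
    """Return (hard one-hot sample, soft relaxed sample) for a categorical distribution."""
    # Gumbel(0, 1) noise via inverse transform sampling of a uniform variable.
    gumbels = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in probs]
    logits = [(math.log(p) + g) / tau for p, g in zip(probs, gumbels)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft

rng = random.Random(0)
length_probs = [0.1, 0.2, 0.7]  # hypothetical Pr(l = i), all strictly positive
hard_sample, soft_sample = gumbel_softmax_sample(length_probs, tau=1.0, rng=rng)
```

Lower temperatures push the soft sample closer to the one-hot vector, at the cost of higher-variance gradients.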

## 3.5.3 Discriminator

The discriminator $D$ performs masked convolution on feature maps. Given the one-hot vector $\hat{p}$ and its gradient $\nabla \hat{p}$, we can compute the mask $m$ to be applied to the feature map, as well as its gradient $\nabla m$, by cumulatively adding up the values:

$$m_i = \sum_{j=i}^{L} \hat{p}_j \quad \nabla m_i = \sum_{j=i}^{L} \nabla \hat{p}_j$$

We insert a LeakyReLU with coefficient 0.1 after every convolution. Before the first convolution and after every LeakyReLU, we broadcast the mask into the same shape as the spectrogram, and apply an element-wise multiplication. The overall structure of the discriminator is shown in Table 2.
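The mask is a cumulative sum taken from each position to the end of the one-hot vector, which can be sketched as below. The function names and the list-based feature map layout (channels by frames) are hypothetical.

```python
def length_mask(one_hot):
    """m_i = sum of one_hot[j] for j >= i: 1 up to the sampled length, 0 afterwards."""
    mask, running = [], 0.0
    for value in reversed(one_hot):
        running += value
        mask.append(running)
    return mask[::-1]

def apply_mask(feature_map, mask):
    """Broadcast the mask over channels and multiply element-wise along the frame axis."""
    return [[v * m for v, m in zip(channel, mask)] for channel in feature_map]

one_hot_length = [0.0, 0.0, 1.0, 0.0, 0.0]  # hypothetical sample: keep three frames
mask = length_mask(one_hot_length)
masked = apply_mask([[5.0, 6.0, 7.0, 8.0, 9.0]], mask)
```

Because the cumulative sum is linear, applying the same function to $\nabla \hat{p}$ gives $\nabla m$.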

Conv-256-3, Conv-1-3, Fully Connected with 1 output unit, Sigmoid

Table 2: Discriminator structure. The notation is the same as in Table 1.

## 3.5.4 Generalized Feature Matching

For spectrogram generation, we generalized the feature matching formulation in [?] to $n$-th central moments, in addition to the vanilla feature matching:

$$L_G^{fm,(l,n)} = \| \mu_{x \sim p_{\text{data}}}^{(n)} ( f^{(l)}(x) ) - \mu_{z \sim p_z}^{(n)} ( f^{(l)}(G(z)) ) \|$$

where $\mu^{(n)}(x) = E[(x - E[x])^n]$ is the $n$-th central moment. In our experiments, we additionally minimized the generalized feature matching loss for every intermediate discriminator layer with $n = 2$ and $n = 4$.

## 3.5.5 Results

Figure 3 shows the comparison between generated and real spectrograms for unconditional generation, where the dataset contains all the utterances of the word "action". It is hard to tell from the waveforms alone which samples are realistic; all three waveforms looked reasonable. However, the generated samples sounded like static, repetitive high-pitched oscillations, or other unnatural sounds. A closer inspection of the spectrograms reveals that the generated and real spectrograms do look different: the real data tends to have peak regions evenly spaced along the frequency axis, while this pattern is much less pronounced in the generated ones.

## 4 Discussion

We saw that evaluating audio quality by looking at waveforms and spectrograms is not reliable. We suspect the reason is that audio, unlike images, is very sensitive to high-frequency signals. Even when the raw waveforms and spectrograms look realistic as images at a coarse level, closer inspection reveals differences in their patterns.

Figure: Comparison between conditionally generated spectrograms and reconstructed waveforms and those from real samples, at iteration 560,000. Subfigures (a), (c), (e), and (g) are for the word "actual", while the rest are for "exploits".