Deep Voice 3: 2000-Speaker Neural Text-to-Speech

1 Introduction

Artificial speech synthesis, also called text-to-speech (TTS), is traditionally done with complex multi-stage hand-engineered pipelines [26]. Recent work on neural TTS has demonstrated impressive results, yielding pipelines with simpler features, fewer components, and higher quality synthesized speech. There is not yet a consensus on the optimal neural network architecture for TTS; however, sequence-to-sequence models [28] have been shown to be quite promising.

In this paper, we propose a novel fully-convolutional architecture for speech synthesis, scale it to very large audio data sets, and address several real-world issues that come up when attempting to deploy an attention-based TTS system. Specifically, we make the following contributions:

  1. We propose a fully-convolutional character-to-spectrogram architecture, which enables fully parallel computation over the elements in a sequence and trains an order of magnitude faster than analogous architectures using recurrent cells [28].

  2. We show that our architecture trains quickly and scales to the LibriSpeech dataset [18], which consists of nearly 820 hours of audio data from 2484 speakers.

  3. We demonstrate that we can generate monotonic attention behavior, avoiding error modes that commonly occur in speech synthesis.

  4. We compare the quality of several waveform synthesis methods for a single speaker, including WORLD [15], Griffin-Lim [12], and WaveNet [17].

  5. We describe the implementation of an inference kernel for Deep Voice 3, which can serve up to ten million queries per day on one single-GPU server.

2 Related Work

Our work builds upon the state-of-the-art in neural speech synthesis and attention-based sequence-to-sequence learning.

Several recent works tackle the problem of synthesizing speech with neural networks, including Deep Voice 1 [2], Deep Voice 2 [3], Tacotron [28], Char2Wav [23], VoiceLoop [25], SampleRNN [14], and WaveNet [17]. Deep Voice 1 & 2 retain the traditional structure of TTS pipelines, separating grapheme-to-phoneme conversion, duration and frequency prediction, and waveform synthesis. In contrast to Deep Voice 1 & 2, Deep Voice 3 employs an attention-based sequence-to-sequence model, yielding a more compact architecture. Like Deep Voice 3, Tacotron and Char2Wav are sequence-to-sequence models for neural TTS. Tacotron is a neural text-to-spectrogram conversion model, used with Griffin-Lim for spectrogram-to-waveform synthesis. Char2Wav predicts the parameters of the WORLD vocoder [15] and uses a SampleRNN conditioned on those parameters for waveform generation. In contrast to Char2Wav and Tacotron, Deep Voice 3 avoids Recurrent Neural Networks (RNNs) 1 to speed up training and alleviates several challenging error modes that attention models fall into. Thus, Deep Voice 3 makes attention-based TTS feasible for a production TTS system with no compromise on accuracy. Finally, WaveNet and SampleRNN are proposed as neural vocoder models for waveform synthesis. It is also worth noting that there are numerous high-quality hand-engineered vocoders in the literature, such as STRAIGHT [13], Vocaine [1], and WORLD [15]. Deep Voice 3 adds no novel vocoder, but it can be integrated with different waveform synthesis methods with slight modifications to its architecture.

Automatic speech recognition (ASR) datasets can be very large in scale, but they are typically recorded under varied conditions and with varying microphones, and so are not as clean as ideal TTS corpora. Our work is not the first to attempt a multi-speaker TTS system on ASR corpora. For example, [30] builds several speaker-adaptive HMM-based TTS systems [29] on various ASR corpora with hundreds of speakers. Nonetheless, to the best of our knowledge, Deep Voice 3 is the first TTS system to scale to thousands of speakers.

Sequence-to-sequence models [24] encode a variable-length input into hidden states, which are then processed by a decoder to generate the target sequence. An attention mechanism allows the decoder to adaptively choose which encoder hidden states to focus on while generating the target sequence [5]. Attention-based sequence-to-sequence models are widely applied in machine translation [5], speech recognition [8], and text summarization [21]. Recent improvements in attention mechanisms relevant to Deep Voice 3 include enforced-monotonic attention during training [19], fully-attentional non-recurrent architectures [27], and convolutional sequence-to-sequence models [10]. Deep Voice 3 demonstrates the utility of monotonic attention during training in TTS, a new domain where monotonicity is expected. Alternatively, we show that with a simple heuristic that enforces monotonicity only during inference, a standard attention mechanism can work just as well or even better. Deep Voice 3 also builds upon the convolutional sequence-to-sequence architecture from [10] by introducing a positional encoding similar to that used in [27], augmented with a rate adjustment to account for the mismatch between input and output domain lengths.

3 Model Architecture

In this section, we present our fully-convolutional sequence-to-sequence architecture for TTS (see Figure 1). Our architecture is capable of converting a variety of textual features (characters, phonemes, stresses) to a variety of acoustic features (mel-band spectrograms, linear-scale log magnitude spectrograms, or a set of vocoder features such as fundamental frequency, spectral envelope, and aperiodicity parameters). These acoustic features can then serve as inputs for audio waveform synthesis models. The Deep Voice 3 architecture consists of three components:

  • Encoder: A fully-convolutional encoder, which converts textual features to an internal learned representation.

  • Decoder: A fully-convolutional causal decoder, which decodes the learned representation with a multi-hop convolutional attention mechanism into a low-dimensional audio representation (mel-band spectrograms) in an autoregressive manner.

  • Converter: A fully-convolutional post-processing network, which predicts final output features (depending on the waveform synthesis method) from the decoder hidden states. Unlike the decoder, the converter is non-causal and can thus depend on future context information.

The overall objective function to be optimized is a linear combination of the losses from the decoder (Section 3.4) and the converter (Section 3.6). The whole model is trained in an end-to-end manner, excluding the vocoder (WORLD, Griffin-Lim, or WaveNet). In the multi-speaker scenario, trainable speaker embeddings as in [3] are used across the encoder, decoder, and converter. Next, we describe each of these components and the data preprocessing in detail. Model hyperparameters are available in Table 4 within Appendix Section 8.

Figure 1: Deep Voice 3 uses residual convolutional layers to encode textual features into per-timestep key and value vectors for an attention-based decoder. The decoder uses these to predict the mel-band log magnitude spectrograms that correspond to the output audio. (Light blue dotted arrows depict the autoregressive synthesis process during inference.) The hidden states of the decoder are then fed to a converter network to predict the acoustic features for waveform synthesis. Please see Appendix Section 6 for more details.

3.1 Text Preprocessing

Text preprocessing is crucial for good performance. Feeding raw text (characters with spacing and punctuation) yields acceptable performance on many utterances; however, some utterances contain mispronunciations of rare words, or skipped and repeated words. We alleviate these issues by normalizing the input text as follows (a minimal sketch follows the list):

  1. We uppercase all characters in the input text.

  2. We remove all intermediate punctuation marks.

  3. We end every utterance with a period or question mark.

  4. We replace spaces between words with special separator characters which indicate the duration of pauses inserted by the speaker between words 2.
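
The first three steps are simple string operations; step 4 additionally needs pause-duration annotations, obtained from manual labels or a forced aligner such as Gentle. Below is a minimal Python sketch of this normalization, not the authors' implementation: the pause_after mapping is a hypothetical stand-in for those annotations, and the separator characters % and / follow the convention in footnote 2.

```python
import re

def normalize_text(text, pause_after=None):
    """Normalize raw input text following steps 1-4 above (a sketch).

    `pause_after` is a hypothetical mapping from word index to pause type
    ('long' or 'short'), e.g. from manual labels or an aligner such as
    Gentle; words without an entry get a plain space separator.
    """
    pause_after = pause_after or {}
    # 1. Uppercase all characters.
    text = text.upper()
    # 2. Remove intermediate punctuation marks (a partial list).
    text = re.sub(r'[,;:()"-]', " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # 3. End every utterance with a period or question mark.
    if not text.endswith((".", "?")):
        text += "."
    # 4. Replace spaces with separator characters encoding pause duration
    #    ('%' = long pause, '/' = short pause, ' ' = normal spacing).
    words = text.split(" ")
    pieces = []
    for i, word in enumerate(words):
        pieces.append(word)
        if i < len(words) - 1:
            sep = {"long": "%", "short": "/"}.get(pause_after.get(i), " ")
            pieces.append(sep)
    return "".join(pieces)

print(normalize_text("Either way, you should shoot very slowly.",
                     pause_after={1: "long", 4: "short"}))
# -> EITHER WAY%YOU SHOULD SHOOT/VERY SLOWLY.
```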

3.2 Joint Representation of Characters and Phonemes

Deployed TTS systems [6] should include a way to modify pronunciations to correct common mistakes (which typically involve proper nouns, foreign words, and domain-specific jargon). A conventional way to do this is to maintain a dictionary mapping words to their phonetic representations and to edit it manually when errors occur.

Our model can directly convert characters (including punctuation and spacing) to acoustic features, and hence learns an implicit grapheme-to-phoneme model. An implicit conversion is difficult to correct when the model makes mistakes. Thus, in addition to character models, we also train phoneme-only models and mixed character-and-phoneme models by explicitly allowing phoneme inputs. These models are identical to character-only models, except for the input layer of the encoder, which takes phoneme and phoneme-stress embeddings instead of, or in addition to, character embeddings.

A phoneme-only model requires a preprocessing step to convert words to their phoneme representations (by using an external phoneme dictionary or a separately trained grapheme-to-phoneme model)3. A mixed character-and-phoneme model requires a similar preprocessing step, except for words not in the phoneme dictionary. These out-of-vocabulary words are input as characters, allowing the model to use its implicitly learned grapheme-to-phoneme model. While training a mixed character-and-phoneme model, every word is replaced with its phoneme representation with some fixed probability (e.g., 0.9) at each training iteration. We find that augmenting the input with phonemes improves pronunciation accuracy and minimizes attention errors, especially when generalizing to utterances longer than those in the training set. More importantly, models that support phoneme representations allow correcting mispronunciations by editing the phoneme dictionary, a highly desirable feature in a real-world production system.
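
As an illustration of the mixed training scheme, here is a hedged Python sketch (not the authors' code): phoneme_dict is assumed to be a CMUDict-style mapping from uppercase words to phoneme lists, and in-vocabulary words are swapped to phonemes with a fixed probability at each iteration.

```python
import random

def mixed_representation(words, phoneme_dict, p_phoneme=0.9, seed=None):
    """Replace each in-vocabulary word with its phoneme sequence with
    probability `p_phoneme`; out-of-vocabulary words stay as characters.

    `phoneme_dict` is assumed to map uppercase words to lists of ARPAbet
    phonemes (e.g. loaded from CMUDict 0.6b).
    """
    rng = random.Random(seed)
    mixed = []
    for word in words:
        phonemes = phoneme_dict.get(word.upper())
        if phonemes is not None and rng.random() < p_phoneme:
            mixed.append(phonemes)            # phoneme tokens
        else:
            mixed.append(list(word.upper()))  # character tokens
    return mixed

toy_dict = {"HELLO": ["HH", "AH0", "L", "OW1"]}
print(mixed_representation(["hello", "Zyzzyva"], toy_dict, seed=0))
# e.g. [['HH', 'AH0', 'L', 'OW1'], ['Z', 'Y', 'Z', 'Z', 'Y', 'V', 'A']]
```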

3.3 Encoder

The encoder network (depicted in Figure 1) begins with an embedding layer, which converts characters or phonemes into trainable vector representations. These embeddings are projected via a fully-connected layer from the embedding dimension to a target dimensionality, followed by a series of convolution blocks (see Figure 2), and then projected back to the embedding dimension to create the attention key vectors h_k. The attention value vectors are computed from the attention key vectors and the text embeddings as h_v = \sqrt{0.5}\,(h_k + h_e), jointly considering the local information in h_e and the longer-term context in h_k. The key vectors are used by each attention block to compute attention weights, whereas the final context vector is computed as a weighted average over the value vectors (see Section 3.5).

Figure 2
Figure 3

The convolution blocks (depicted in Figure 2) used in our encoder and elsewhere in the architecture consist of a convolution, a gated linear unit as the nonlinear activation, a residual connection to the input, and a \sqrt{0.5} scaling factor 4. To preserve the sequence length, inputs are padded with k - 1 timesteps of zeros on the left (for causal convolutions) or (k - 1)/2 timesteps of zeros on the left and on the right (for standard non-causal convolutions), where k is an odd convolution filter width 5. Dropout is applied to the inputs prior to the convolution.
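
A minimal PyTorch sketch of such a block is given below. It is an illustration under the description above (dropout, a weight-normalized 1-D convolution, a gated linear unit, a residual connection, and a \sqrt{0.5} scaling), not the exact implementation; the default dropout rate here is arbitrary.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Gated residual convolution block (a sketch of the block described above).

    `causal=True` pads k-1 zeros on the left only; otherwise (k-1)/2 zeros
    are added on both sides, so the sequence length is preserved.
    """
    def __init__(self, channels, kernel_size=5, dropout=0.05, causal=False):
        super().__init__()
        assert kernel_size % 2 == 1, "odd filter widths keep the padding arithmetic simple"
        self.causal = causal
        self.kernel_size = kernel_size
        self.dropout = dropout
        # Twice the channels: one half gates the other via a sigmoid (GLU).
        self.conv = nn.utils.weight_norm(
            nn.Conv1d(channels, 2 * channels, kernel_size))

    def forward(self, x):                       # x: (batch, channels, time)
        residual = x
        x = F.dropout(x, p=self.dropout, training=self.training)
        if self.causal:
            x = F.pad(x, (self.kernel_size - 1, 0))
        else:
            pad = (self.kernel_size - 1) // 2
            x = F.pad(x, (pad, pad))
        x = self.conv(x)
        x = F.glu(x, dim=1)                     # gated linear unit
        return (x + residual) * math.sqrt(0.5)  # residual connection + scaling

block = ConvBlock(channels=64, kernel_size=5, causal=True)
print(block(torch.randn(2, 64, 100)).shape)     # torch.Size([2, 64, 100])
```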

3.4 Decoder

The decoder (depicted in Figure 1) generates audio in an autoregressive manner by predicting a group of future audio frames given all past audio frames. Since the decoder is autoregressive, it must use exclusively causal convolutions. Audio frames are processed in groups of r frames (the reduction factor; see Table 4) and are represented by a low-dimensional mel-band log-magnitude spectrogram. The choice of r can have a significant impact on performance, as decoding several frames together works better than decoding a single frame at a time, which confirms a similar result from [28].

The decoder network consists of several fully-connected layers with rectified linear unit (ReLU) nonlinearities, a series of attention blocks (described in Section 3.5), and finally fully-connected output layers which predict the next group of audio frames as well as a binary “done” prediction (indicating whether the last frame of the utterance has been synthesized). Dropout is applied before each fully-connected layer prior to the attention blocks, except for the very first one. An L1 loss is computed on the output spectrograms, and a binary cross-entropy loss is computed on the “done” prediction.
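
The two decoder objectives can be sketched as follows (an illustrative snippet; the tensor shapes and the construction of the “done” targets are assumptions). The overall objective is then a linear combination of this loss with the converter loss from Section 3.6.

```python
import torch
import torch.nn.functional as F

def decoder_losses(pred_mel, target_mel, done_logits, done_targets):
    """L1 spectrogram loss plus binary cross-entropy on the "done" flag.

    pred_mel, target_mel: (batch, time, mel_bands)
    done_logits, done_targets: (batch, time); done_targets is 1.0 at the
    final frame of the utterance and 0.0 before it (an assumed convention).
    """
    l1 = F.l1_loss(pred_mel, target_mel)
    done = F.binary_cross_entropy_with_logits(done_logits, done_targets)
    return l1, done

pred = torch.randn(2, 50, 80)
target = torch.randn(2, 50, 80)
done_logits = torch.randn(2, 50)
done_targets = torch.zeros(2, 50)
done_targets[:, -1] = 1.0
print(decoder_losses(pred, target, done_logits, done_targets))
```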

3.5 Attention Block

Figure 4: Positional encodings are added to both key and query vectors, with rates of \omega_{\text{key}} and \omega_{\text{query}}, respectively. Forced monotonicity can be applied at inference by adding a mask of large negative values to the logits. One of two possible attention schemes is used: softmax or the monotonic attention from [19]. During training, attention weights are dropped out.

We use a dot-product attention mechanism (depicted in Figure 4) similar to [27]. The attention mechanism uses a query vector (the hidden state of the decoder) and the per-timestep key vectors from the encoder to compute attention weights, and then outputs a context vector computed as the weighted average of the value vectors.

In addition to the embeddings generated by the encoder and decoder, we add a positional encoding to both the key and the query vectors. These positional encodings are computed as h_p(i) = \sin(\omega_s i / 10000^{k/d}) (for even i) or \cos(\omega_s i / 10000^{k/d}) (for odd i), where i is the timestep index, k is the channel index in the positional encoding, d is the total number of channels in the positional encoding, and \omega_s is the position rate of the encoding. The position rate dictates the average slope of the line in the attention distribution, roughly corresponding to the speed of speech. For a single speaker, \omega_s is set to one for the decoder and fixed for the encoder to the ratio of output timesteps to input timesteps (computed across the entire dataset). For multi-speaker datasets, \omega_s is computed for both the encoder and the decoder from the speaker embedding of each speaker (depicted in Figure 4). As sine and cosine functions form an orthonormal basis, this initialization creates a favorable inductive bias for the model, since the attention distribution due to the positional encodings alone is effectively a straight diagonal line (Fig. ?).
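
A NumPy sketch of these positional encodings follows, based on the formula as reconstructed above; the even/odd split over the timestep index i and the example position rates are assumptions (the initial rates in Table 4 are used only for illustration).

```python
import numpy as np

def positional_encoding(num_timesteps, num_channels, position_rate=1.0):
    """Sinusoidal positional encoding with a position rate omega_s (a sketch).

    Channel k uses frequency position_rate / 10000**(k / d); sine is used for
    even timestep indices and cosine for odd ones, per the text above.
    """
    i = np.arange(num_timesteps)[:, None]          # timestep index
    k = np.arange(num_channels)[None, :]           # channel index
    angles = position_rate * i / np.power(10000.0, k / num_channels)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

key_pe = positional_encoding(120, 128, position_rate=6.3)    # encoder keys (illustrative rate)
query_pe = positional_encoding(400, 128, position_rate=1.0)  # decoder queries
print(key_pe.shape, query_pe.shape)
```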

We initialize the fully-connected layer weights used to compute hidden attention vectors to the same values for the query projection and the key projection. Positional encodings are used in all attention blocks. We use context normalization as in [10]. A fully-connected layer is applied to the context vector to generate the output of the attention block.

When using this attention mechanism, positional encodings greatly improve quality and are key to having a functional convolutional attention mechanism. However, even with the positional encodings, the model may sometimes repeat or skip words. We consider two different mechanisms to alleviate this. The first imposes a constraint during inference that the attention is monotonic: instead of computing the softmax over the entire input, we compute the softmax only over a fixed window starting at the last attended-to position and going forward several timesteps 6. The attended-to position is initially set to zero and subsequently computed as the index of the highest attention weight within the current window. A visualization of the attention distributions produced by this approach is shown in Fig. ?. The second mechanism applies the monotonic attention introduced in [19], which, unlike generic attention with an inference constraint, incorporates monotonicity during training. In practice, both approaches work well for creating a clear, monotonic attention curve, but using monotonic attention during training results in the model frequently mumbling words.
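
The inference-time constraint can be sketched as a masked softmax over a small forward window (window size 3, per footnote 6). This is an illustration only, not the deployed kernel.

```python
import numpy as np

def windowed_attention(logits, last_position, window_size=3):
    """Apply the inference-time monotonicity heuristic to one decoder step.

    logits: (num_encoder_timesteps,) unnormalized attention scores.
    Positions outside [last_position, last_position + window_size) receive a
    large negative value before the softmax; the new attended-to position is
    the index of the highest attention weight within the window.
    """
    masked = np.full_like(logits, -1e9)
    end = min(last_position + window_size, len(logits))
    masked[last_position:end] = logits[last_position:end]
    weights = np.exp(masked - masked.max())
    weights /= weights.sum()
    new_position = int(np.argmax(weights))
    return weights, new_position

logits = np.random.randn(20)
position = 0
for _ in range(5):                       # successive decoder steps
    weights, position = windowed_attention(logits, position)
    print(position, weights.round(2)[:6])
```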

3.6 Converter

The converter network takes as inputs the activations from the last hidden layer of the decoder, applies several non-causal convolution blocks, and then predicts parameters for downstream waveform generation models. Unlike the decoder, the converter is non-causal and non-autoregressive, so it can use future context from the decoder to predict its outputs.

The loss function of the converter network depends on the type of downstream vocoder:

  1. L1 loss on linear-scale (log-magnitude) spectrograms for use with Griffin-Lim,

  2. L1 and cross entropy losses on parameters of WORLD vocoder (see Figure 3),

  3. L1 loss on linear-scale (log-magnitude) spectrograms for use with WaveNet neural vocoder.

For Griffin-Lim audio synthesis, we also find that applying a pre-emphasis and raising the spectrogram to a power before waveform synthesis improves audio quality, as suggested in [28]. For the WORLD vocoder, we predict a boolean value (whether the current frame is voiced or unvoiced), an F0 value (if the frame is voiced), the spectral envelope, and the aperiodicity parameters. We use a cross-entropy loss for the voiced-unvoiced prediction and L1 losses for all other predictions. For the WaveNet vocoder, we use the mel-scale spectrograms from the decoder at inference and feed them as the conditioner for WaveNet, which is trained separately. 7
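
For the Griffin-Lim path, a hedged sketch of the inversion is shown below. It assumes natural-log magnitude spectrograms, reuses the sharpening exponent and FFT parameters from the single-speaker column of Table 4, and assumes a pre-emphasis coefficient of 0.97 (not specified in the text) that is undone after inversion.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def synthesize_griffin_lim(log_mag_spec, n_fft=4096, hop_length=600,
                           win_length=2400, power=1.4, preemphasis=0.97,
                           n_iter=60):
    """Invert a linear-scale log-magnitude spectrogram with Griffin-Lim.

    `power` is the spectrogram sharpening exponent (cf. Table 4) and
    `preemphasis` is an assumed coefficient: the converter is assumed to have
    been trained on pre-emphasized audio, so we de-emphasize after inversion.
    """
    magnitude = np.exp(log_mag_spec) ** power           # undo log, then sharpen
    audio = librosa.griffinlim(magnitude, n_iter=n_iter, hop_length=hop_length,
                               win_length=win_length, n_fft=n_fft)
    audio = lfilter([1.0], [1.0, -preemphasis], audio)  # de-emphasis filter
    return audio
```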

4 Results

In this section, we present several different experiments and metrics that have been useful for the development of a production-quality speech synthesis system. We quantify the performance of our system and compare it to other recently published neural TTS systems.

Data:

For single-speaker synthesis, we use an internal English speech dataset containing approximately 20 hours of data sampled at 48 kHz. For multi-speaker synthesis, we use the VCTK and LibriSpeech datasets. The VCTK dataset consists of audio from 108 speakers, with a total duration of 44 hours. The LibriSpeech dataset consists of audio from 2484 speakers, with a total duration of 820 hours. The sampling rate is 48 kHz for VCTK and 16 kHz for LibriSpeech.

Fast Training:

We compare Deep Voice 3 to Tacotron, a recently published attention-based TTS system. For our system on single-speaker data, the average training iteration time (for batch size 4) is 0.06 seconds using one GPU, as opposed to 0.59 seconds for Tacotron, indicating a ten-fold increase in training speed. In addition, Deep Voice 3 converges after ~500K iterations for all three datasets in our experiment, while Tacotron requires ~2M iterations as suggested in [28]. This significant speedup is due to the fully-convolutional architecture of Deep Voice 3, which heavily exploits the parallelism of a GPU during training.

Attention Error Modes:

Attention-based neural TTS systems can hit several error modes that reduce synthesis quality, including mispronunciations, skipped words, and repeated words. One reason is that the attention-based architecture does not impose a monotonically progressing attention distribution. In order to track the occurrence of these errors, we construct a custom 100-sentence test set (see Appendix Section 10) that includes particularly challenging cases from deployed TTS systems (e.g., dates, acronyms, URLs, repeated words, proper nouns, foreign words, etc.). Attention error counts are listed in Table 1 and indicate that the model with the joint representation of characters and phonemes, trained with the standard attention mechanism but with the monotonic constraint enforced at inference, largely outperforms the other approaches.

Table 1: Attention error counts for single-speaker Deep Voice 3 models on the 100-sentence test set, which is given in Appendix Section 10. “Phonemes & Characters” refers to the model trained using the joint representation of characters and phonemes discussed in Section 3.2. Since the test set contains out-of-vocabulary words, we do not include a phoneme-only model. All models use Griffin-Lim as the vocoder. Mispronunciation, skipping, and repetition errors are each counted at most once per sentence; in other words, every entry in the table is bounded above by 100.

Text Input              Attention     Inference constraint   Repeat   Mispronounce   Skip
Characters-only         Dot-Product   Yes                       3         35           19
Phonemes & Characters   Dot-Product   No                       12         10           15
Phonemes & Characters   Dot-Product   Yes                       1          4            3
Phonemes & Characters   Monotonic     No                        5          9           11

Naturalness:

We demonstrate that the choice of waveform synthesis method matters for naturalness ratings and compare our system to other published neural TTS systems. Results in Table 2 indicate that WaveNet, a neural vocoder, achieves the highest MOS, followed by WORLD and Griffin-Lim. Thus, we show that the most natural waveform synthesis can be done with a neural vocoder, and that basic spectrogram inversion techniques can match advanced vocoders. The WaveNet vocoder sounds more natural, as the WORLD vocoder introduces various noticeable artifacts. Yet, lower inference latency may make the WORLD vocoder preferable: the heavily engineered WaveNet implementation runs at 3X realtime per CPU core [3], while in our testing WORLD runs at up to 40X realtime per CPU core (see the subsection below).

Table 2: Mean Opinion Score (MOS) ratings with 95% confidence intervals using different waveform synthesis methods. We use the crowdMOS toolkit [20]; batches of samples from these models were presented to raters on Mechanical Turk. Since batches contained samples from all models, the experiment naturally induces a comparison between the models.

Model                          Mean Opinion Score (MOS)
Deep Voice 3 (Griffin-Lim)
Deep Voice 3 (WORLD)
Deep Voice 3 (WaveNet)
Tacotron (Griffin-Lim)
Tacotron (WaveNet)
Deep Voice 2 (WaveNet)

Multi-Speaker Synthesis:

To demonstrate that our model can handle multi-speaker speech synthesis effectively, we train our models on the VCTK and LibriSpeech datasets. For LibriSpeech (an ASR dataset), we apply a preprocessing step of standard denoising (using SoX [4]) and splitting long utterances into multiple utterances at pause locations (determined by Gentle [16]), which improves performance. We use Griffin-Lim and WORLD as vocoders for VCTK, and only Griffin-Lim for LibriSpeech due to its efficiency. Results are presented in Table 3. We purposefully include ground-truth samples in the set being evaluated, because the accents in these datasets are likely to be unfamiliar to our North American crowdsourced raters and may thus be rated poorly due to the accent rather than the model quality. Our model with the WORLD vocoder achieves a comparable MOS of 3.44 on VCTK, versus 3.66 from Deep Voice 2, the state-of-the-art multi-speaker neural TTS system using WaveNet as its vocoder along with separately optimized duration and frequency prediction components. We expect further improvement from using WaveNet for multi-speaker synthesis, although it would substantially slow down inference. The MOS on LibriSpeech is lower, which we mainly attribute to the lower quality of the training data caused by varied recording conditions. Lastly, we observe that the learned speaker embeddings lie in a meaningful latent space (see Fig. ? in Appendix Section 9).

Table 3: MOS ratings with 95% confidence intervals for audio clips from neural TTS systems on multi-speaker datasets. To obtain MOS, we again use the crowdMOS toolkit as detailed in Table 2.

Model                          MOS (VCTK)   MOS (LibriSpeech)
Deep Voice 3 (Griffin-Lim)
Deep Voice 3 (WORLD) -
Deep Voice 2 (WaveNet) -
Tacotron (Griffin-Lim) -
Ground truth

Optimizing Inference for Deployment:

In order to deploy a neural TTS system in a cost-effective manner, the system must be able to handle as much traffic as alternative systems on a comparable amount of hardware. To do so, we target a throughput of ten million queries per day, or 116 queries per second (QPS) 8, on a single-GPU server with twenty CPU cores, which we find to be comparable in cost to commercially deployed TTS systems. By implementing custom GPU kernels for the Deep Voice 3 architecture and parallelizing WORLD synthesis across CPUs, we demonstrate that our model can handle ten million queries per day. More details are given in Appendix Section 7.
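
The throughput target follows directly from the query definition in footnote 8 (one query synthesizes a one-second utterance):

```python
queries_per_day = 10_000_000
seconds_per_day = 24 * 60 * 60              # 86,400 seconds
print(queries_per_day / seconds_per_day)    # ~115.7, i.e. roughly 116 QPS
```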

5 Conclusion

We introduce Deep Voice 3, a neural text-to-speech system based on a novel fully-convolutional sequence-to-sequence acoustic model with a position-augmented attention mechanism. We describe common error modes in sequence-to-sequence speech synthesis models and show that Deep Voice 3 successfully avoids them. We show that our model is agnostic to the waveform synthesis method, and adapt it for Griffin-Lim spectrogram inversion, WaveNet, and WORLD vocoder synthesis. We also demonstrate that our architecture is capable of multi-speaker speech synthesis by augmenting it with trainable speaker embeddings, a technique described in Deep Voice 2. Finally, we describe the production-ready Deep Voice 3 system in full, including text normalization and performance characteristics, and demonstrate state-of-the-art quality through extensive MOS evaluations. Future work will involve improving the implicitly learned grapheme-to-phoneme model, jointly training with a neural vocoder, and training on cleaner and larger datasets to model the full variability of human voices and accents from hundreds of thousands of speakers.

6 Detailed Model Architecture of Deep Voice 3

The detailed model architecture is depicted in Figure 5.

Figure 5: Deep Voice 3 uses a deep residual convolutional network to encode text and/or phonemes into per-timestep key and value vectors for an attentional decoder. The decoder uses these to predict the mel-band log magnitude spectrograms that correspond to the output audio. (Light blue dotted arrows depict the autoregressive synthesis process during inference.) The hidden state of the decoder is then fed to a converter network to output linear spectrograms for Griffin-Lim or parameters for WORLD, which can be used to synthesize the final waveform. Weight normalization [22] is applied to all convolution filters and fully-connected layer weight matrices in the model.

7 Optimizing Deep Voice 3 for Deployment

Running inference with a TensorFlow graph turns out to be prohibitively expensive, averaging approximately 1 QPS 9. Instead, we implement custom GPU kernels for Deep Voice 3 inference. Due to the complexity of the model and the large number of output timesteps, launching individual kernels for different operations in the graph (convolutions, matrix multiplications, unary and binary operations, etc.) is impractical: the overhead of launching a CUDA kernel is approximately 50 μs, which, when aggregated across all operations in the model and all output timesteps, limits throughput to approximately 10 QPS. Thus, we implement a single kernel for the entire model, which avoids the overhead of launching many CUDA kernels. Finally, instead of batching computation in the kernel, our kernel operates on a single utterance, and we launch as many concurrent streams as there are Streaming Multiprocessors (SMs) on the GPU. Every kernel is launched with one block, so we expect the GPU to schedule one block per SM, allowing us to scale inference speed linearly with the number of SMs.

On a single P100 GPU with 56 SMs, we achieve an inference speed of 115 QPS, which corresponds to our target ten million queries per day. We parallelize WORLD synthesis across all 20 CPUs on the server, permanently pinning threads to CPUs in order to maximize cache performance. In this setup, GPU inference is the bottleneck, as WORLD synthesis on 20 cores is faster than 115 QPS.

We believe that inference can be made significantly faster through more optimized kernels, smaller models, and fixed-precision arithmetic; we leave these aspects to future work.

8 Model Hyperparameters

All hyperparameters of the models used in this paper are shown in Table 4.

Table 4: Hyperparameters used for best models for the three datasets used in the paper.

Parameter                                    Single-Speaker   VCTK   LibriSpeech
FFT Size 4096 4096 4096
FFT Window Size / Shift 2400 / 600 2400 / 600 1600 / 400
Audio Sample Rate 48000 48000 16000
Reduction Factor 4 4 4
Mel Bands 80 80 80
Sharpening Factor 1.4 1.4 1.4
Character Embedding Dim. 256 256 256
Encoder Layers / Conv. Width / Channels 7 / 5 / 64 7 / 5 / 128 7 / 5 / 256
Decoder Affine Size 128, 256 128, 256 128, 256
Decoder Layers / Conv. Width 4 / 5 6 / 5 8 / 5
Attention Hidden Size 128 256 256
Position Weight / Initial Rate 1.0 / 6.3 0.1 / 7.6 0.1 / 2.6
Converter Layers / Conv. Width / Channels 5 / 5 / 256 6 / 5 / 256 8 / 5 / 256
Dropout Probability 0.95 0.95 0.99
Number of Speakers 1 108 2484
Speaker Embedding Dim. - 16 32
ADAM Learning Rate 0.001 0.0005 0.0005
Anneal Rate / Anneal Interval - 0.98 / 30000 0.95 / 30000
Batch Size 16 16 16
Max Gradient Norm 100 100 50.0
Gradient Clipping Max. Value 5 5 5

9 Latent Space of the Learned Embeddings

Similar to [3], we apply principal component analysis to the learned speaker embeddings and analyze the speakers based on their ground-truth genders. Fig. ? shows the genders of the speakers in the space spanned by the first two principal components. We observe a very clear separation between male and female speakers, suggesting that the low-dimensional speaker embeddings constitute a meaningful latent space.
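
The analysis can be reproduced with standard tools; the sketch below uses random placeholders for the learned embeddings and gender labels, which are not provided with this text.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Hypothetical inputs: one embedding row per speaker (e.g. 16-dimensional
# for VCTK, cf. Table 4) and a parallel array of ground-truth genders.
embeddings = np.random.randn(108, 16)            # placeholder for learned embeddings
genders = np.random.choice(["F", "M"], size=108)  # placeholder gender labels

# Project onto the first two principal components and plot by gender.
components = PCA(n_components=2).fit_transform(embeddings)
for gender, marker in [("F", "o"), ("M", "^")]:
    mask = genders == gender
    plt.scatter(components[mask, 0], components[mask, 1], marker=marker, label=gender)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```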

10 100-Sentence Test Set

The 100 sentences used to quantify the results in Table 1 are listed below (note that the pause separator characters are used as described in footnote 2):


Footnotes

  1. RNNs process each state sequentially and thus make model parallelism very challenging to utilize fully.
  2. We use four different word separators, indicating slurred-together words, standard pronunciation and space characters, a short pause between words, and a long pause between words. The pause durations can be obtained through manual labeling or estimated by a text-audio aligner such as Gentle [16]. For example, the sentence “Either way, you should shoot very slowly,” with a long pause after “way” and a short pause after “shoot”, would be written as “Either way%you should shoot/very slowly%.” with % representing a long pause and / representing a short pause for encoding convenience. Our single-speaker dataset is labeled by hand and our multi-speaker datasets are annotated using Gentle.
  3. In this work, we use CMUDict 0.6b.
  4. The scaling factor ensures that we preserve the input variance early in training. We initialize the convolution filter weights as in [10] to start training with zero-mean and unit-variance activations throughout the entire network.
  5. We restrict to odd convolution widths to simplify the convolution arithmetic.
  6. We use a window size of 3 in all experiments.
  7. Note that this differs from [3] as it used linear-scale log-magnitude spectrograms. We typically observe better performance with a lower dimensional conditioner for WaveNet.
  8. A query is defined as synthesizing the audio for a one second utterance.
  9. The poor TensorFlow performance is due to the overhead of running the graph evaluator over hundreds of nodes and hundreds of timesteps. Using a technology such as XLA with TensorFlow could speed up evaluation but is unlikely to match the performance of a hand-written kernel.

References

  1. Vocaine the vocoder and applications in speech synthesis.
    Yannis Agiomyrgiannakis. In ICASSP, 2015.
  2. Deep Voice: Real-time neural text-to-speech.
    Sercan Ö. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. In ICML, 2017.
  3. Deep Voice 2: Multi-speaker neural text-to-speech.
    Sercan Ö. Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. In NIPS, 2017b.
  4. Sox - sound exchange.
    Chris Bagwell. https://sourceforge.net/p/sox/code/ci/master/tree/
  5. Neural machine translation by jointly learning to align and translate.
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. In ICLR, 2015.
  6. Siri on-device deep learning-guided unit selection text-to-speech system.
    Tim Capes, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn Hunt, Jiangchuan Li, Matthias Neeracher, et al. In Interspeech, 2017.
  7. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. In EMNLP, 2014.
  8. Attention-based models for speech recognition.
    Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. In NIPS, 2015.
  9. Language modeling with gated convolutional networks.
    Yann Dauphin, Angela Fan, Michael Auli, and David Grangier. In ICML, 2017.
  10. Convolutional sequence to sequence learning.
    Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. In ICML, 2017.
  11. Recent advances in Google real-time HMM-driven unit selection synthesizer.
    Xavi Gonzalvo, Siamak Tazari, Chun-an Chan, Markus Becker, Alexander Gutkin, and Hanna Silen. In Interspeech, 2016.
  12. Signal estimation from modified short-time fourier transform.
    Daniel Griffin and Jae Lim. IEEE Transactions on Acoustics, Speech, and Signal Processing
  13. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds.
    Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigne. Speech communication
  14. SampleRNN: An unconditional end-to-end neural audio generation model.
    Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. In ICLR, 2017.
  15. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications.
    Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. IEICE Transactions on Information and Systems
  16. Gentle.
    Robert Ochshorn and Max Hawkins. https://github.com/lowerquality/gentle
  17. WaveNet: A generative model for raw audio.
    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. arXiv:1609.03499
  18. Librispeech: an ASR corpus based on public domain audio books.
    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 5206–5210. IEEE, 2015.
  19. Online and linear-time attention by enforcing monotonic alignments.
    Colin Raffel, Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck. In ICML, 2017.
  20. Crowdmos: An approach for crowdsourcing mean opinion score studies.
    Flávio Ribeiro, Dinei Florêncio, Cha Zhang, and Michael Seltzer. In IEEE ICASSP, 2011.
  21. A neural attention model for abstractive sentence summarization.
    Alexander M Rush, Sumit Chopra, and Jason Weston. In EMNLP, 2015.
  22. Weight normalization: A simple reparameterization to accelerate training of deep neural networks.
    Tim Salimans and Diederik P Kingma. In NIPS, 2016.
  23. Char2wav: End-to-end speech synthesis.
    Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. In ICLR workshop, 2017.
  24. Sequence to sequence learning with neural networks.
    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. In NIPS, 2014.
  25. Voice synthesis for in-the-wild speakers via a phonological loop.
    Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. arXiv:1707.06588
  26. Text-to-Speech Synthesis.
    Paul Taylor.
  27. Attention is all you need.
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. arXiv:1706.03762
  28. Tacotron: Towards end-to-end speech synthesis.
    Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. In Interspeech, 2017.
  29. Robust speaker-adaptive hmm-based text-to-speech synthesis.
    Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, and Steve Renals. IEEE Transactions on Audio, Speech, and Language Processing
  30. Thousands of voices for hmm-based speech synthesis–analysis and application of tts systems built on various asr corpora.
    Junichi Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Yong Guan, Rile Hu, Keiichiro Oura, Yi-Jian Wu, et al. IEEE Transactions on Audio, Speech, and Language Processing