FPUTS : Fully Parallel UFANS-based End-to-End Text-to-Speech System


Abstract

A text-to-speech (TTS) system that can generate high-quality audio with low latency and few errors is required for industrial applications and services. In this paper, we propose a new non-autoregressive, fully parallel end-to-end TTS system. It utilizes a new attention structure together with the recently proposed convolutional structure UFANS. Unlike RNNs, UFANS can capture long-term information in a fully parallel manner. Compared with the most popular end-to-end text-to-speech systems, our system generates audio of equal or better quality with fewer errors and achieves at least a 10x speedup in inference.


Dabiao Ma, Zhiba Su, Wenxuan Wang, Yuhao Lu

AI Lab, Turing Robot co.ltd, Beijing, China

Chinese University of Hong Kong, Shenzhen

{madabiao, suzhiba, luyuhao}@uzoo.cn, wangwenxuan1@link.cuhk.edu.cn

Index Terms: text to speech, acoustic model, UFANS, FPUTS, non-autoregressive, fully parallel

1 Introduction

TTS systems aim to convert text to human-like speech. An end-to-end TTS system is one that can be trained on (text, audio) pairs with minimal human annotation [1]. It has two components: an acoustic model and a vocoder. The acoustic model predicts intermediate acoustic features from text. The vocoder, e.g. Griffin-Lim [2], WORLD [3], or WaveNet [4], synthesizes speech from the generated acoustic features. In industry, the main task of the acoustic model is to map characters or phonemes to acoustic feature frames with few errors and low latency.

Tacotron [1] uses an autoregressive attention [5] structure to predict the alignment, and a combination of Gated Recurrent Units (GRU) [6] and convolutions as encoder and decoder. Deep Voice 3 [7] also uses an autoregressive structure but relies on convolutions to speed up training and inference. DCTTS [8] greatly speeds up the training of the attention module by introducing guided attention, but it is still autoregressive.

These autoregressive attention structures severely limit the inference speed of such systems in the context of parallel computation, and the models also suffer from serious error modes, e.g. repeated words, mispronunciations, or skipped words [7]. A non-autoregressive, fully parallel attention structure that can determine the alignment reliably with fewer errors is needed for industrial applications and services.

In this paper, we propose a novel fully parallel end-to-end acoustic system. Specifically, we make the following contributions:

  • We propose a new non-autoregressive, fully parallel phoneme-to-spectrogram TTS system, which enables fully parallel computation and trains and performs inference an order of magnitude faster than autoregressive TTS systems.

  • We propose a novel non-autoregressive alignment module.

  • We propose a UFANS decoder, which generates better-quality results than a common convolutional decoder.

  • We propose a two-stage training strategy, which improves both the quality and the speed of alignment training.

  • We demonstrate that our TTS system can reduce error modes commonly affecting sequence-to-sequence models.

2 Model Architecture

Our model consists of three parts, see Fig. 5. The encoder converts phonemes into hidden states that are sent to the decoder; the alignment module determines the alignment width of each phoneme, from which the number of frames that attend on that phoneme can be induced; the decoder receives the alignment information and converts the encoder hidden states into acoustic features. See Appendix A for figures of the overall structure.

Figure 1: Model architecture.

2.1 Encoder

The encoder consists of one embedding layer and several dense layers. It encodes phonemes into hidden states. See details in Appendix A.
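As an illustration, a minimal sketch of such an encoder is given below in PyTorch (the paper does not specify a framework; the layer count and sizes here are placeholders rather than the paper's hyperparameters).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embedding layer followed by several dense layers (a sketch)."""
    def __init__(self, n_phonemes, emb_size=256, hidden_size=512, n_dense=3):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, emb_size)
        layers, in_size = [], emb_size
        for _ in range(n_dense):
            layers += [nn.Linear(in_size, hidden_size), nn.ReLU()]
            in_size = hidden_size
        self.dense = nn.Sequential(*layers)

    def forward(self, phoneme_ids):                      # (batch, N) integer phoneme ids
        return self.dense(self.embedding(phoneme_ids))   # (batch, N, hidden_size)
```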

2.2 Alignment Module

Figure 2: Alignment module architecture

The alignment module determines the mapping from phonemes to acoustic features. We discard the autoregressive structure widely used in other alignment modules [7][1] because of its inference latency. Our novel alignment module consists of one embedding layer, one UFANS [9] structure, a trainable position encoding and several matrix multiplications, see Fig. 2.

2.2.1 Fully parallel UFANS structure

Figure 3: UFANS model architecture

UFANS is a modified version of U-Net adapted to the TTS task with the aim of speeding up inference, see Fig. 3. It is fully parallel, has a large receptive field and can combine features at different levels.
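For intuition, the sketch below shows a generic U-Net-style 1D convolutional block in PyTorch. It is not the exact UFANS architecture (see [9] for that); it only illustrates the down-sample/up-sample pattern with skip connections that yields a large receptive field in a fully parallel way.

```python
import torch
import torch.nn as nn

class UNet1DSketch(nn.Module):
    """Down-sample / up-sample 1D conv stack with skip connections (not the exact UFANS)."""
    def __init__(self, channels=256, depth=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=4, stride=2, padding=1)
            for _ in range(depth))
        self.up = nn.ModuleList(
            nn.ConvTranspose1d(channels, channels, kernel_size=4, stride=2, padding=1)
            for _ in range(depth))

    def forward(self, x):                # x: (batch, channels, length), length divisible by 2**depth
        skips = []
        for conv in self.down:           # halve the sequence length at each level
            skips.append(x)
            x = torch.relu(conv(x))
        for deconv in self.up:           # restore the length and add the skip features back
            x = torch.relu(deconv(x)) + skips.pop()
        return x
```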

For each phoneme, we define an 'alignment width' that represents its relationship with the number of acoustic frames. Suppose the number of phonemes in an utterance is $N$; UFANS outputs a sequence of scalars $w_1, w_2, \dots, w_N$, where $w_i$ is the alignment width of the $i$-th phoneme.

Then we relate the alignment widths to the acoustic frame indices. The intuition is that the acoustic frame with index $t = e_i$, where $e_i$ is the absolute alignment position defined below, should be the one that attends most on the $i$-th phoneme. We need a structure that satisfies this intuition.

2.2.2 Trainable position encoding

Positional encoding [10] is a method to embed a sequence of absolute positions into a sequence of vectors. [7][11] use sine and cosine functions of different frequencies and add the positional encoding vectors to the input embeddings. However, both take positional encoding only as a supplement to help the training of the attention module, and the encoding vectors remain constant. We instead propose a trainable position encoding.

We define the absolute alignment position of the $i$-th phoneme as:

$e_i = \sum_{j=1}^{i} w_j$   (1)

Now choose $d$ float numbers log-uniformly from a fixed range $[\omega_{\min}, \omega_{\max}]$ to obtain a sequence of frequencies $\omega_1, \omega_2, \dots, \omega_d$. For the $i$-th phoneme, its positional encoding vector is defined as:

$K_i = \left[\sin(\omega_1 e_i), \dots, \sin(\omega_d e_i), \cos(\omega_1 e_i), \dots, \cos(\omega_d e_i)\right]$   (2)

Concatenating $K_1, \dots, K_N$ together, we get a matrix that represents the positional information of all the phonemes, denoted as 'Key', see Fig. 2:

$K = \left[K_1; K_2; \dots; K_N\right] \in \mathbb{R}^{N \times 2d}$   (3)

Similarly, for the $t$-th frame of the acoustic features, the positional encoding vector is defined as:

$Q_t = \left[\sin(\omega_1 t), \dots, \sin(\omega_d t), \cos(\omega_1 t), \dots, \cos(\omega_d t)\right]$   (4)

Concatenating all such vectors, we get the matrix that represents the positional information of all $T$ acoustic frames, denoted as 'Query', see Fig. 2:

$Q = \left[Q_1; Q_2; \dots; Q_T\right] \in \mathbb{R}^{T \times 2d}$   (5)

Now define the attention matrix as:

$A = Q K^{\top}, \quad A_{t,i} = Q_t \cdot K_i$   (6)

That is, the attention of the $t$-th frame on the $i$-th phoneme is proportional to the inner product of their encoding vectors. This inner product can be rewritten as:

$Q_t \cdot K_i = \sum_{k=1}^{d} \cos\!\left(\omega_k (t - e_i)\right)$   (7)

It is clear that when $t = e_i$, the $t$-th frame is the one that attends most on the $i$-th phoneme. The normalized attention matrix is:

$\hat{A}_{t,i} = \dfrac{\exp(A_{t,i})}{\sum_{j=1}^{N} \exp(A_{t,j})}$   (8)

Now $\hat{A}_{t,i}$ represents how much the $t$-th frame attends on the $i$-th phoneme.

Then we use argmax to build a new, hard attention matrix:

$\tilde{A}_{t,i} = \begin{cases} 1, & i = \arg\max_{j} \hat{A}_{t,j} \\ 0, & \text{otherwise} \end{cases}$   (9)

Now define the number of frames that attend more on the $i$-th phoneme than on any other phoneme to be its attention width $a_i$. From this definition, $\tilde{A}$ encodes the attention widths, $a_i = \sum_{t=1}^{T} \tilde{A}_{t,i}$. The alignment width $w_i$ and the attention width $a_i$ are different but related, see Appendix B.
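Putting equations (1)-(9) together, the following NumPy sketch computes the alignment attention and the attention widths. The frequency range, the softmax normalization assumed in (8) and the indexing conventions are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def alignment_attention(w, T, d=16, w_min=1e-3, w_max=1.0):
    """w: predicted alignment widths (N,); T: number of acoustic frames."""
    e = np.cumsum(w)                                    # absolute alignment positions, eq. (1)
    omega = np.exp(np.linspace(np.log(w_min), np.log(w_max), d))  # log-uniform frequencies

    def pos_enc(x):                                     # eq. (2)/(4): sin/cos encoding
        ang = np.outer(x, omega)                        # (len(x), d)
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

    K = pos_enc(e)                                      # 'Key',   (N, 2d), eq. (3)
    Q = pos_enc(np.arange(T))                           # 'Query', (T, 2d), eq. (5)
    A = Q @ K.T                                         # eq. (6); equals sum_k cos(omega_k (t - e_i))
    A_hat = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # eq. (8), softmax over phonemes
    hard = np.argmax(A_hat, axis=1)                     # eq. (9), hard assignment per frame
    attn_width = np.bincount(hard, minlength=len(w))    # attention width a_i of each phoneme
    return A_hat, attn_width
```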

The sine and cosine positional encoding has two important properties that make it suitable for this task. In brief, the resulting attention function $f(x) = \sum_{k} \cos(\omega_k x)$ has a heavy tail that enables an acoustic frame to receive information from phonemes very far away, and its gradient is insensitive to the distance $|t - e_i|$. See details in Appendix C.

2.3 UFANS Decoder

The decoder receives the alignment information and converts the encoded phoneme information from the encoder into acoustic features. We use UFANS, which has a large receptive field, as our decoder. It generates good-quality acoustic features in a fully parallel manner.

2.4 Loss module

We use an acoustic loss, denoted as $L_{acoustic}$, to evaluate the quality of the generated acoustic features. It is the $L_1$-norm or $L_2$-norm mean error between the predicted acoustic features and the ground-truth features.

2.5 Training Strategy

We propose a two-stage training strategy. The model focuses on alignment learning in stage 1. In stage 2 we fix the alignment module and train the whole system. In order to improve the performance and speed of alignment learning, we use a convolutional decoder and design an alignment loss in stage 1.

2.5.1 Stage 1 :Alignment Learning

Convolutional Decoder: A decoder with a large receptive field helps the system generate good-quality acoustic features, but it greatly disturbs the learning of the alignment, see Appendix D. We therefore replace the UFANS decoder with a convolutional decoder in this stage. The convolutional decoder consists of several convolution layers with gated activation [12], several Dropout [13] operations and one dense layer. Its receptive field is set to be small.
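A minimal sketch of such a gated convolutional decoder is given below; the kernel size, layer count and dropout rate are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class GatedConvDecoder(nn.Module):
    """A few gated 1D convolutions with dropout and a final dense layer (small receptive field)."""
    def __init__(self, in_channels=512, out_dim=80, n_layers=3, kernel_size=3, p_drop=0.1):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_channels, 2 * in_channels, kernel_size, padding=kernel_size // 2)
            for _ in range(n_layers))
        self.dropout = nn.Dropout(p_drop)
        self.out = nn.Linear(in_channels, out_dim)

    def forward(self, x):                          # x: (batch, time, in_channels)
        h = x.transpose(1, 2)                      # -> (batch, in_channels, time)
        for conv in self.convs:
            a, b = conv(self.dropout(h)).chunk(2, dim=1)
            h = torch.tanh(a) * torch.sigmoid(b)   # gated activation [12]
        return self.out(h.transpose(1, 2))         # -> (batch, time, out_dim)
```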
Alignment Loss: We define an alignment loss, denoted as $L_{align}$, based on the fact that the summation of the alignment widths should be equal or close to the number of acoustic frames $T$, see Appendix B. We relax this restriction by using a threshold $\tau$:

$L_{align} = \max\!\left(\left|\sum_{i=1}^{N} w_i - T\right| - \tau,\; 0\right)$   (10)

The final loss is a weighted addition of $L_{acoustic}$ and $L_{align}$:

$L = L_{acoustic} + \lambda\, L_{align}$   (11)
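The two losses can be sketched as follows, assuming the hinge-style relaxation written in (10); the exact form used in the paper may differ.

```python
import torch

def acoustic_loss(pred, target, norm="l1"):
    """Mean L1 or L2 error between predicted and ground-truth acoustic features."""
    diff = pred - target
    return diff.abs().mean() if norm == "l1" else (diff ** 2).mean()

def alignment_loss(widths, n_frames, tau):
    """Penalize the gap between summed alignment widths and the frame count, beyond a threshold."""
    gap = (widths.sum() - n_frames).abs()
    return torch.clamp(gap - tau, min=0.0)

def total_loss(pred, target, widths, n_frames, tau, lam):
    return acoustic_loss(pred, target) + lam * alignment_loss(widths, n_frames, tau)
```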

2.5.2 Stage 2 : Overall Training

After alignment learning in stage 1, we obtain a good-quality alignment module. In the overall training stage 2, we fix the alignment module, use UFANS as the decoder and use the acoustic loss to train the overall end-to-end system.

3 Experiments and results

3.1 Dataset

LJ Speech [14] is a public speech dataset consisting of 13,100 pairs of text and 22,050 Hz audio clips. The clips vary from 1 to 10 seconds and the total length is about 24 hours. Phoneme-based textual features are given. Two kinds of acoustic features are extracted. One is based on the WORLD vocoder and uses mel-frequency cepstral coefficients (MFCCs). The other consists of linear-scale log magnitude spectrograms and mel-band spectrograms that can be fed into the Griffin-Lim algorithm or a trained WaveNet vocoder.

The WORLD vocoder features use 60-dimensional mel-frequency cepstral coefficients, 2-dimensional band aperiodicity, 1-dimensional logarithmic fundamental frequency, their delta and delta-delta dynamic features, and a 1-dimensional voiced/unvoiced flag, 190 dimensions in total. The WORLD-based features use an FFT window size of 2048 and a frame time of 5 ms.

The spectrograms are obtained with an FFT size of 2048 and a hop size of 275. The dimensions of the linear-scale log magnitude spectrograms and the mel-band spectrograms are 1025 and 80, respectively.
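For reference, one plausible way to extract such spectrogram features with librosa is sketched below; the paper does not state its exact preprocessing (e.g. normalization or clipping), so the log compression here is an assumption.

```python
import numpy as np
import librosa

def extract_spectrograms(wav_path, sr=22050, n_fft=2048, hop_length=275, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    mag = np.abs(stft)                                  # (1025, T) linear magnitude
    linear_log = np.log(np.maximum(mag, 1e-5))          # linear-scale log magnitude spectrogram
    mel = librosa.feature.melspectrogram(S=mag ** 2, sr=sr, n_mels=n_mels)
    mel_log = np.log(np.maximum(mel, 1e-5))             # mel-band log spectrogram, (80, T)
    return linear_log, mel_log
```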

3.2 Implementation Details

Tacotron, DCTTS and Deep Voice 3 are used as baseline systems to evaluate our FPUTS system. All details of these systems, such as hyper-parameters, are shown in Appendix E.

3.3 Main Results

3.3.1 Inference speed

The inference speed evaluates the time latency of synthesizing one second of speech, including data transfer from main memory to GPU global memory, GPU computation and data transfer back to main memory. The measurement is performed on a GTX 1080Ti graphics card. As shown in Table 1, our FPUTS model takes full advantage of parallel computation and is significantly faster than the other systems.
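A typical way to measure this kind of end-to-end GPU latency, including the host-device transfers, is sketched below with PyTorch; it is illustrative only, since the paper does not describe its timing code.

```python
import time
import torch

def measure_latency(model, phoneme_ids, n_runs=100):
    """Average end-to-end latency: host->GPU copy, GPU inference, GPU->host copy (milliseconds)."""
    model = model.cuda().eval()
    with torch.no_grad():
        for _ in range(10):                            # warm-up runs
            _ = model(phoneme_ids.cuda()).cpu()
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_runs):
            _ = model(phoneme_ids.cuda()).cpu()        # includes both memory transfers
        torch.cuda.synchronize()
    return (time.time() - start) / n_runs * 1000.0
```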

3.3.2 MOS

Harvard Sentences List 1 and List 2 are used to evaluate the mean opinion score (MOS) of each system. The synthesized audio is evaluated on Amazon Mechanical Turk using the crowdMOS method [15]. The score ranges from 1 (Bad) to 5 (Excellent). As shown in Table 2, our FPUTS is no worse than the other end-to-end systems. The MOS of the WaveNet-based audio is lower than expected because background noise exists in these clips.

3.3.3 Error mode

Attention-based neural TTS systems may run into several error modes that reduce synthesis quality. For example, repetition means repeated pronunciation of one or more phonemes, mispronunciation means wrong pronunciation of one or more phonemes, and skip means one or more phonemes are skipped.

In order to track the occurrence of attention errors, 100 sentences are randomly selected from the Los Angeles Times, the Washington Post and some fairy tales. As shown in Table 3, our FPUTS system is more robust than the other systems.

Table 1: Inference speed comparison

                      Tacotron   DCTTS    Deep Voice 3   FPUTS
Autoregressive        Yes        Yes      Yes            No
Reduction factor      5          4        4              No
Inference speed (ms)  6157       494.3    105.4          9.9

Table 2: MOS results

Method         Training steps (K)   Vocoder       MOS
Tacotron       650                  Griffin-Lim
DCTTS          375/360              Griffin-Lim
Deep Voice 3   650                  Griffin-Lim
FPUTS          250/350              Griffin-Lim
Tacotron       650                  WaveNet
DCTTS          375/360              WaveNet
FPUTS          250/350              WaveNet
FPUTS          230/180              WORLD

Table 3: Error modes comparison

Method         Input                     Repetition   Mispronunciation   Skip
Tacotron       characters                0            5                  4
DCTTS          characters                0            10                 1
Deep Voice 3   characters and phonemes   1            5                  3
FPUTS          phonemes                  1            2                  1

3.4 Alignment Learning Analysis

Alignment learning is an important part of our system and greatly affects the quality of the generated audio. We therefore further discuss the factors that affect alignment quality.

3.4.1 Evaluation of alignment

Two methods are used to evaluate the alignment quality in stage 1. 100 audio clips are randomly selected from the training data, denoted as the original data. Their utterances are fed to our system to generate audio, denoted as the resynthesized version of the original data.

The first method objectively computes the difference between the phoneme durations of the original data and the predicted attention widths, see the example in Appendix I.1. The second method is to subjectively listen to the original data and their resynthesized versions and check whether they have similar phoneme durations, see Appendix G. Fig. 4 is an attention width plot of an utterance selected randomly from the Internet.

Figure 4: Attention width plot of the text: This is the destination for all things related to development at stack overflow.
Phoneme : DH IH S IH Z DH AH D EH S T AH N EY SH AH N F AO R AO L TH IH NG Z R IH L EY T IH D T UW D IH V EH L AH P M AH N T AE T S T AE K OW V ER F L OW .
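Returning to the first (objective) evaluation method, a simple mean-absolute-difference statistic such as the one sketched below can be used; the paper does not name the exact statistic, so this is only an assumed example.

```python
import numpy as np

def duration_error(real_durations, attention_widths):
    """Mean absolute difference between hand-labeled phoneme durations (in frames)
    and the attention widths predicted for the resynthesized utterance."""
    real = np.asarray(real_durations, dtype=float)
    pred = np.asarray(attention_widths, dtype=float)
    assert real.shape == pred.shape, "one value per phoneme is expected"
    return np.abs(real - pred).mean()
```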

3.4.2 Position encoding function and alignment quality

We replace the sine and cosine positional encoding alignment function with a Gaussian function in stage 1. The experimental results show that the model cannot learn a correct alignment with the Gaussian function. See Appendix I.2 for more details, and Appendix C for a theoretical analysis.

3.4.3 Decoder and alignment quality

In order to identify the relationship between the decoder and the alignment quality in stage 1, we replace the simple convolutional decoder with UFANS with 6 down-sampling layers. Experiments show that the computed attention widths are much worse than those obtained with the simple convolutional decoder, and the synthesized audio also suffers from error modes such as repeated and skipped words. The results show that the receptive field of the decoder should be small in the alignment learning stage. More details are shown in Appendix I.3.

4 Discussion and conclusion

In this paper, a new non-autoregressive, fully parallel TTS system is proposed. It fully utilizes the power of parallel computation and achieves at least a 10x inference speedup compared with the most popular end-to-end TTS systems, while generating audio of equal or better quality with fewer errors. Our goal is a lightweight TTS system for deployment that produces good-quality audio with low inference latency and few errors. This paper describes and analyzes every component of FPUTS in detail and compares it with the most popular end-to-end TTS systems. Future work will focus on further reducing errors, producing better-quality audio, and using both characters and phonemes in FPUTS.

References

  • [1] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” CoRR, vol. abs/1703.10135, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135
  • [2] D. W. Griffin, Jae, S. Lim, and S. Member, “Signal estimation from modified short-time fourier transform,” IEEE Trans. Acoustics, Speech and Sig. Proc, pp. 236–243, 1984.
  • [3] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
  • [4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio.” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1609.html
  • [5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv e-prints, vol. abs/1409.0473, Sep. 2014. [Online]. Available: https://arxiv.org/abs/1409.0473
  • [6] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).    Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. [Online]. Available: http://www.aclweb.org/anthology/D14-1179
  • [7] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: 2000-speaker neural text-to-speech,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HJtEm4p6Z
  • [8] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” CoRR, vol. abs/1710.08969, 2017. [Online]. Available: http://arxiv.org/abs/1710.08969
  • [9] D. Ma, Z. Su, Y. Lu, W. Wang, and Z. Li, “Ufans: U-shaped fully-parallel acoustic neural structure for statistical parametric speech synthesis with 20x faster,” arXiv preprint arXiv:1811.12208, Nov 2018.
  • [10] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. Dauphin, “Convolutional sequence to sequence learning,” in ICML, 2017.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
  • [12] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves, “Conditional image generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4790–4798. [Online]. Available: http://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf
  • [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
  • [14] K. Ito, “The lj speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  • [15] F. Protasio Ribeiro, D. Florencio, C. Zhang, and M. Seltzer, “Crowdmos: An approach for crowdsourcing mean opinion score studies,” in ICASSP.    IEEE, May 2011. [Online]. Available: https://www.microsoft.com/en-us/research/publication/crowdmos-an-approach-for-crowdsourcing-mean-opinion-score-studies/

Appendix A Overall system

See Figure 5.

(a) Alignment training stage procedure
(b) Overall training stage procedure. Here the alignment module is inherited from the alignment learning stage and remains fixed during overall training
Figure 5: Detailed overall training procedure

Appendix B Alignment width and Attention Width

For two adjacent absolute alignment positions $e_i$ and $e_{i+1}$, consider the two functions $f(t - e_i)$ and $f(t - e_{i+1})$, where $f(x) = \sum_k \cos(\omega_k x)$ as in Appendix C. The values of the two functions only depend on the relative position of the frame index $t$ to $e_i$ and $e_{i+1}$. From Appendix C, it is known that $f$ decreases as $t$ moves away from the corresponding position (locally, but that is sufficient here). So when $t < \frac{e_i + e_{i+1}}{2}$ we have $f(t - e_i) > f(t - e_{i+1})$, and when $t > \frac{e_i + e_{i+1}}{2}$ we have $f(t - e_i) < f(t - e_{i+1})$. Thus $\frac{e_i + e_{i+1}}{2}$ is the right attention boundary of phoneme $i$; similarly the left attention boundary is $\frac{e_{i-1} + e_i}{2}$. It can be deduced that:

$a_i = \frac{e_i + e_{i+1}}{2} - \frac{e_{i-1} + e_i}{2}$   (12)
$\phantom{a_i} = \frac{e_{i+1} - e_{i-1}}{2}$   (13)
$\phantom{a_i} = \frac{w_i + w_{i+1}}{2}$   (14)

which means the attention width and the alignment width can be linearly transformed to each other. It further follows, up to boundary effects at the two ends of the utterance, that:

$\sum_{i=1}^{N} a_i \approx \sum_{i=1}^{N} w_i$   (15)

Appendix C Properties of sine and cosine positional encoding alignment structure

Besides the sine and cosine positional encoding alignment structure, other attention structures might work, e.g. attention based on a Gaussian function. But experiments show that the Gaussian function is not suitable for this task. The reason is analyzed below.

Let $x = t - e_i$. The sine and cosine alignment attention function of the $i$-th phoneme is $f(x) = \sum_{k=1}^{d} \cos(\omega_k x)$. Also consider a Gaussian attention function $g(x) = \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$. Since the two functions only depend on $x$, it is convenient to set $e_i$ to 0 and compare them directly as functions of $x$.

After normalization:

$\bar{f}(x) = \frac{f(x)}{f(0)} = \frac{1}{d}\sum_{k=1}^{d} \cos(\omega_k x), \qquad \bar{g}(x) = \frac{g(x)}{g(0)} = \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$   (16)

The normalized functions are drawn in Figure 6.

Figure 6: Normalized functions $\bar{f}(x)$ and $\bar{g}(x)$
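A short numeric sketch of this comparison is given below; the frequency range and the Gaussian width are arbitrary choices for illustration.

```python
import numpy as np

d, sigma = 16, 3.0
omega = np.exp(np.linspace(np.log(1e-3), np.log(1.0), d))   # log-spaced frequencies (assumed range)

def f_norm(x):            # normalized sum-of-cosines attention
    return np.cos(np.outer(x, omega)).mean(axis=-1)

def g_norm(x):            # normalized Gaussian attention
    return np.exp(-x ** 2 / (2 * sigma ** 2))

x = np.array([0.0, 5.0, 20.0, 80.0])
print(f_norm(x))          # decays slowly: low frequencies keep distant frames connected
print(g_norm(x))          # decays like exp(-x^2): essentially zero beyond a few sigma
```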


The alignment attention has a much heavier tail than the Gaussian function, and the heavy tail is necessary to learn the alignment. In Figure 7, the $t$-th frame is currently attending mostly on the $i$-th phoneme, but the correct phoneme for the $t$-th frame is the $j$-th phoneme. To learn the alignment, the $t$-th frame should still be able to receive information from the $j$-th phoneme, so the attention on it is not allowed to vanish. If the Gaussian function is used, $g(t - e_j)$ vanishes too fast as $|t - e_j|$ increases. The heavy tail of the sine and cosine positional encoding alignment attention helps acoustic frames receive information from the correct but distant phoneme.

Figure 7: The $t$-th frame is currently attending mostly on the $i$-th phoneme, but the $j$-th phoneme is the correct one. The gradients are scaled by $f'$ or $g'$ before reaching the $i$-th and $j$-th phonemes.


Now keep $t$ fixed, make the alignment position the variable, and consider the two functions $f(t - e)$ and $g(t - e)$ as functions of $e$. Suppose that during training the $t$-th frame receives information from the phonemes and realizes that the $j$-th phoneme is more probable than the $i$-th phoneme to be the correct one. Then the backward information flow (gradient) from the $t$-th frame to $f(t - e_j)$ (alignment attention) or $g(t - e_j)$ (Gaussian) is larger than the flow to $f(t - e_i)$ or $g(t - e_i)$. From the chain rule, the gradient that $e_j$ receives is scaled by $f'(t - e_j)$ when using the alignment attention, or by $g'(t - e_j)$ when using the Gaussian function; likewise, the gradient that $e_i$ receives is scaled by $f'(t - e_i)$ or $g'(t - e_i)$.

Consider the two derivative functions $f'(x)$ and $g'(x)$. The two functions (after normalization) are drawn in Figure 8.

It is obvious that for the Gaussian function, even if the backward flow from the $t$-th frame is large, $g'(t - e_j)$ vanishes quickly as $t$ moves away from $e_j$. Thus the backward flow vanishes quickly and the $j$-th phoneme cannot receive sufficient backward information. For the alignment attention, although $f'$ is oscillating, its magnitude is much less sensitive to the distance $|t - e_j|$. So the gradient does not vanish and the desired relation between the two backward flows still holds.

Figure 8: Normalized $f'(x)$ and $g'(x)$

Appendix D Decoder in alignment learning stage should have small receptive field

In Figure 9, when the decoder has a large receptive field, the $t$-th frame can receive information from the $j$-th phoneme indirectly, through neighboring frames whose attention on the $j$-th phoneme is large; when the decoder has a small receptive field, the $t$-th frame can only receive that information through its own attention on the $j$-th phoneme. From Appendix C, the indirect path is much stronger. Then, even if the alignment of the $t$-th frame is wrong, it can still attend mostly on the $i$-th phoneme while the decoder output looks correct, which greatly disturbs alignment learning.

Figure 9: The green arrow is the forward information flow using a decoder with a large receptive field; the yellow arrow is the forward information flow using a decoder with a much smaller receptive field.

Appendix E Hyperparameters of Tacotron, DCTTS, Deep Voice 3 and FPUTS in overall training stage

Table 5 shows the main hyperparameters of the baseline models, and Table 4 shows the hyperparameters of FPUTS. The 'hidden state size' entries give the sizes of the hidden states. The entry 'GatedConv (linear)' gives the number of layers and the hidden state size of the gated convolutions that are specific to predicting the linear-scale log magnitude spectrograms. Tacotron, DCTTS and Deep Voice 3 take mel-band spectrograms as an intermediate feature whose length is reduced by a reduction factor to speed up inference; FPUTS does not use this trick.

Table 4: Hyperparameters of FPUTS

                                  spectrograms    MFCCs
hidden state size                 512             512
hidden state size                 512             512
UFANS (alignment)                 4/512           4/512
UFANS (decoder)                   6/512           6/512
GatedConv (linear)                3/512           -
threshold (alignment loss)        10.0            20.0
weight (alignment loss)           0.01            0.01
optimizer/lr                      Adam/0.0004     Adam/0.0004
embedding size 256
number of banks (encoder) 16
channel size of Conv1D (encoder) 128
number/size of GRU (encoder) 1/128
number/size of GRU (attention) 1/256
number/size of GRU (decoder) 2/256
number of banks (post) 8
channel size of Conv1D (post) 128
number/size of GRU (post) 1/128
reduction factor 5
(a) Hyperparameters of Tacotron
embedding size 128
hidden state size (Text2Mel) 256
hidden state size (SSRN) 512
reduction factor 4
(b) Hyperparameters of DCTTS
embedding size 256
layers/Conv1D width/Conv1D channels (encoder) 7/5/64
layers/Conv1D width (decoder) 4/5
size of attention (decoder) 256
layers/Conv1D width/Conv1D channels (converter) 4/5/256
reduction factor 4
(c) Hyperparameters of Deep Voice 3
Table 5: Hyperparameters of Tacotron, DCTTS and Deep Voice 3

Appendix F Loss curves of alignment learning

Figure 10 shows the loss curves of $L_{acoustic}$ and $L_{align}$ during the alignment learning stage.

(a) $L_{acoustic}$ of MFCCs
(b) $L_{align}$ of MFCCs with threshold 20
(c) $L_{acoustic}$ of mel-band spectrograms
(d) $L_{align}$ of mel-band spectrograms with threshold 10
Figure 10: Curves of $L_{acoustic}$ and $L_{align}$ in the alignment learning stage

Appendix G Performance evaluation of alignment learning

See Table 6.

                        mismatch   weakly match   match   perfectly match
MFCCs                   0          4              14      82
mel-band spectrograms   0          1              17      82
Table 6: Statistical result of how resynthesized waveforms match the real ones

Appendix H Loss curves with Gaussian function as attention function and loss curves with UFANS as decoder in alignment learning stage

See Figure 11 and Figure 12.

Figure 11: Loss curves with the Gaussian function replacing the alignment attention, using mel-band spectrograms
Figure 12: Loss curves with UFANS as the decoder, using mel-band spectrograms


Appendix I Analysis of resynthesized waveforms after alignment learning stage

Here only results with mel-band spectrograms using the Griffin-Lim algorithm are shown. Results for MFCCs are similar.

Figure 13: The upper plot is the real audio of 'LJ048-0033'; the lower is the resynthesized audio from the alignment learning model.
text : prior to November twenty two nineteen sixty three
phoneme : P R AY ER T UW N OW V EH M B ER T W EH N T IY T UW N AY N T IY N S IH K S T IY TH R IY

I.1 Attention width comparison

The phoneme durations are obtained by hand. Figure 13 shows the labeled phonemes of audio 'LJ048-0033'. Table 7(a) shows the comparison between the phoneme durations of the real audio and the computed attention widths of the resynthesized audio.

Phoneme        P      R      AY     ER     T      UW     N      OW     V      EH     M      B      ER     T      W      EH     N      T
real           5.35   7.28   15.48  13.43  4.96   3.44   3.36   5.44   4.72   7.20   4.56   1.92   7.12   5.36   3.36   3.84   3.20   4.16
resynthesized  3.55   7.97   13.28  11.37  4.88   4.00   6.19   5.27   5.46   6.39   3.56   2.08   6.13   5.69   4.34   3.03   2.50   5.92

Phoneme        IY     T      UW     N      AY     N      T      IY     N      S      IH     K      S      T      IY     TH     R      IY
real           10.80  9.76   9.76   6.80   6.08   6.16   7.28   5.28   5.36   6.56   6.16   4.08   3.52   6.32   9.36   9.76   13.92  12.88
resynthesized  10.89  11.26  9.69   7.72   5.33   6.55   7.30   5.90   5.81   5.43   5.11   4.33   3.57   6.81   10.57  11.54  12.95  12.85

(a) Phoneme durations of real audio and computed attention width of resynthesized audio

Phoneme        P      R      AY     ER     T      UW     N      OW     V      EH     M      B      ER     T      W      EH     N      T
real           5.35   7.28   15.48  13.43  4.96   3.44   3.36   5.44   4.72   7.20   4.56   1.92   7.12   5.36   3.36   3.84   3.20   4.16
resynthesized  6.31   6.03   5.78   6.11   6.59   6.73   6.74   6.76   6.75   6.75   6.77   6.80   6.84   6.82   6.79   6.78   6.77   6.77

Phoneme        IY     T      UW     N      AY     N      T      IY     N      S      IH     K      S      T      IY     TH     R      IY
real           10.80  9.76   9.76   6.80   6.08   6.16   7.28   5.28   5.36   6.56   6.16   4.08   3.52   6.32   9.36   9.76   13.92  12.88
resynthesized  6.78   6.79   6.77   6.75   6.76   6.74   6.72   6.74   6.76   6.80   6.84   6.82   6.79   6.78   6.77   6.81   6.97   7.24

(b) Phoneme durations of real audio and computed attention width of resynthesized audio with Gaussian function replacing alignment attention

Phoneme        P      R      AY     ER     T      UW     N      OW     V      EH     M      B      ER     T      W      EH     N      T
real           5.35   7.28   15.48  13.43  4.96   3.44   3.36   5.44   4.72   7.20   4.56   1.92   7.12   5.36   3.36   3.84   3.20   4.16
resynthesized  4.08   8.09   9.41   8.45   6.90   5.70   5.21   5.71   5.98   5.29   4.87   5.20   5.43   5.14   5.10   5.37   6.42   7.55

Phoneme        IY     T      UW     N      AY     N      T      IY     N      S      IH     K      S      T      IY     TH     R      IY
real           10.80  9.76   9.76   6.80   6.08   6.16   7.28   5.28   5.36   6.56   6.16   4.08   3.52   6.32   9.36   9.76   13.92  12.88
resynthesized  7.14   7.38   8.96   8.20   5.39   5.54   7.31   6.34   5.56   6.42   5.84   4.76   5.62   7.61   8.26   8.17   9.02   8.97

(c) Phoneme durations of real audio and computed attention width of resynthesized audio with UFANS as decoder
Table 7: Phoneme durations of real audio and computed attention width of resynthesized audio

I.2 Attention width with Gaussian function replacing alignment attention

Table 7(b) clearly shows that the model with the Gaussian function is not able to learn the alignment.

I.3 Attention width with UFANS as decoder

Table 7(c) shows the attention widths with UFANS as the decoder. They are clearly much worse than those in Table 7(a).

Appendix J Loss comparison in overall training stage

Figure 14 and Figure 15 show the loss curves during the overall training stage. For DCTTS, the training of Text2Mel and SSRN is separated. Note that DCTTS, Tacotron and Deep Voice 3 are all autoregressive: during training they use the real spectrogram of the previous step to predict the spectrogram of the next step, but during inference predicted spectrograms are used to predict the next spectrograms. FPUTS is non-autoregressive, which means all spectrograms are predicted at the same time during both training and inference. It therefore makes sense that FPUTS has a slightly higher training loss than the other three systems.

(a) mel-band spectrograms, FPUTS
(b) mel-band spectrograms, Tacotron
(c) mel-band spectrograms, DCTTS
(d) mel-band spectrograms, Deep Voice 3
Figure 14: Loss curves of mel-band spectrograms
(a) linear log magnitude spectrograms, FPUTS
(b) linear log magnitude spectrograms, Tacotron
(c) linear log magnitude spectrograms, DCTTS
(d) linear log magnitude spectrograms, Deep Voice 3
Figure 15: Loss curves of linear log spectrograms