FPUTS : Fully Parallel UFANS-based End-to-End Text-to-Speech System
A text-to-speech (TTS) system that generates high-quality audio with low latency and few errors is required for industrial applications and services. In this paper, we propose a new non-autoregressive, fully parallel end-to-end TTS system. It combines a new attention structure with the recently proposed convolutional structure UFANS. Unlike RNNs, UFANS can capture long-term information in a fully parallel manner. Compared with the most popular end-to-end text-to-speech systems, our system generates audio of equal or better quality with fewer errors and achieves at least a 10-times inference speedup.
Dabiao Ma, Zhiba Su, Wenxuan Wang, Yuhao Lu
AI Lab, Turing Robot co.ltd, Beijing, China
Chinese University of Hong Kong, Shenzhen
Index Terms: text to speech, acoustic model, UFANS, FPUTS, non-autoregressive, fully parallel
1 Introduction
TTS systems aim to convert text to human-like speech. An end-to-end TTS system is one that can be trained on (text, audio) pairs with minimal human annotation. It has two components: an acoustic model and a vocoder. The acoustic model predicts intermediate acoustic features from text. The vocoder, e.g. Griffin-Lim, WORLD, or WaveNet, synthesizes speech from the generated acoustic features. In industry, the main task of the acoustic model is to map characters or phonemes to acoustic feature frames with few errors and low latency.
Tacotron uses an autoregressive attention structure to predict the alignment, with a combination of Gated Recurrent Units (GRUs) and convolutions as encoder and decoder. Deep Voice 3 also uses an autoregressive structure, relying on convolutions to speed up training and inference. DCTTS greatly speeds up the training of the attention module by introducing guided attention, but is still autoregressive.
These autoregressive attention structures greatly limit inference speed in the context of parallel computation, and such models also suffer from serious error modes, e.g. repeated words, mispronunciations, or skipped words. A non-autoregressive, fully parallel attention structure that can determine the alignment with fewer errors is needed for industrial applications and services.
In this paper, we propose a novel fully parallel end-to-end acoustic system. Specifically, we make the following contributions:
We propose a new non-autoregressive, fully parallel phoneme-to-spectrogram TTS system that enables fully parallel computation and trains and runs inference an order of magnitude faster than autoregressive TTS systems.
We propose a novel non-autoregressive alignment module.
We propose the UFANS decoder, which generates better-quality results than a common convolutional decoder.
We propose a two-stage training strategy, which improves the effectiveness and speed of alignment training.
We demonstrate that our TTS system can reduce error modes commonly affecting sequence-to-sequence models.
2 Model Architecture
Our model consists of three parts, see Fig. 5. The encoder converts phonemes into hidden states that are sent to the decoder; the alignment module determines the alignment width of each phoneme, from which the number of frames that attend on that phoneme can be induced; the decoder receives the alignment information and converts the encoder hidden states into acoustic features. See Appendix A for figures of the overall structure.
2.1 Encoder
The encoder consists of one embedding layer and several dense layers. It encodes phonemes into hidden states. See details in Appendix A.
2.2 Alignment Module
The alignment module determines the mapping from phonemes to acoustic features. We discard the autoregressive structure widely used in other alignment modules because of its latency. Our novel alignment module consists of one embedding layer, one UFANS structure, a trainable positional encoding and several matrix multiplications, see Fig. 2.
2.2.1 Fully parallel UFANS structure
UFANS is a modified version of U-Net for the TTS task, aimed at speeding up inference, see Fig. 3. It is fully parallel, has a large receptive field and can combine different levels of features.
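To make the U-shaped computation pattern concrete, here is a minimal numpy sketch: mean-pooling downsampling, nearest-neighbor upsampling and additive skip connections stand in for UFANS's learned convolutional layers, so the level count and the identity feature maps are illustrative assumptions, not the actual architecture.

```python
import numpy as np

def ufans_like_pass(x, levels=3):
    """Illustrative U-shaped pass over a (time, channels) array.

    Downsample by mean-pooling, then upsample by repetition and add the
    skip connection from the matching level. Real UFANS layers are learned
    convolutions; here identity maps stand in for them.
    """
    skips = []
    h = x
    for _ in range(levels):                           # contracting path
        skips.append(h)
        t = h.shape[0] // 2
        h = h[: 2 * t].reshape(t, 2, -1).mean(axis=1) # mean-pool by 2
    for _ in range(levels):                           # expanding path
        h = np.repeat(h, 2, axis=0)                   # nearest-neighbor upsample
        skip = skips.pop()
        h = h[: skip.shape[0]] + skip                 # skip connection
    return h

x = np.random.randn(64, 8)
y = ufans_like_pass(x)
```

Every output frame aggregates information from a window that doubles with each level, which is how such a structure obtains a large receptive field while remaining fully parallel over time.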
For each phoneme $p_i$, we define the 'AlignmentWidth' $d_i$, which represents its relationship with the number of frames. Suppose the number of phonemes in an utterance is $N$; UFANS outputs a sequence of scalars: $d_1, d_2, \dots, d_N$.
Then we relate the alignment width to the acoustic frame index $t$. The intuition is that the acoustic frame with index $t = \sum_{j \le i} d_j$ should be the one that attends most on the $i$-th phoneme, and we need a structure that satisfies this intuition.
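A minimal numpy sketch of this mapping, assuming the absolute position of a phoneme is the running sum of the predicted widths (the names `d`, `e` and `T` are our own):

```python
import numpy as np

# Hypothetical alignment widths d_i predicted by the alignment module
d = np.array([3.0, 5.0, 2.0, 4.0])

# Absolute alignment position e_i of each phoneme: running sum of the widths
e = np.cumsum(d)          # e = [3., 8., 10., 14.]

# The frame whose index t is closest to e_i should attend most on phoneme i;
# the widths also imply the total number of acoustic frames
T = int(round(d.sum()))
```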
2.2.2 Trainable position encoding
Positional encoding is a method to embed a sequence of absolute positions into a sequence of vectors. Prior work uses sine and cosine functions of different frequencies and adds the positional encoding vectors to the input embeddings, but there the positional encoding serves only as a supplement to help train the attention module, and the encoding vectors remain constant. We propose a trainable positional encoding.
We define the absolute alignment position $e_i$ of the $i$-th phoneme as:
$$e_i = \sum_{j=1}^{i} d_j$$
Now choose $M$ float numbers log-uniformly from a range $[\omega_{\min}, \omega_{\max}]$ to get a sequence of frequencies $\omega_1, \dots, \omega_M$. For the $i$-th phoneme, the positional encoding vector is defined as:
$$K_i = \left[\sin(\omega_1 e_i), \cos(\omega_1 e_i), \dots, \sin(\omega_M e_i), \cos(\omega_M e_i)\right]$$
Concatenating $K_1, \dots, K_N$ together, we get a matrix that represents the positional information of all the phonemes, denoted 'Key', see Fig. 2:
$$\mathrm{Key} = [K_1; K_2; \dots; K_N]$$
Similarly, for the $t$-th frame of the acoustic feature, the positional encoding vector is defined as:
$$Q_t = \left[\sin(\omega_1 t), \cos(\omega_1 t), \dots, \sin(\omega_M t), \cos(\omega_M t)\right]$$
Concatenating all the vectors $Q_1, \dots, Q_T$, we get the matrix that represents the positional information of all the acoustic frames, denoted 'Query', see Fig. 2:
$$\mathrm{Query} = [Q_1; Q_2; \dots; Q_T]$$
Now define the attention matrix as:
$$A = \mathrm{Query} \cdot \mathrm{Key}^{\mathsf T}$$
That is, the attention of the $t$-th frame on the $i$-th phoneme is proportional to the inner product of their encoding vectors. This inner product can be rewritten as:
$$A[t][i] = \langle Q_t, K_i \rangle = \sum_{m=1}^{M} \cos\big(\omega_m (t - e_i)\big)$$
It is clear that when $t = e_i$, the $t$-th frame is the one that attends most on the $i$-th phoneme. The normalized attention matrix is:
$$\hat A[t][i] = \frac{\exp(A[t][i])}{\sum_{j=1}^{N} \exp(A[t][j])}$$
Now $\hat A[t][i]$ represents how much the $t$-th frame attends on the $i$-th phoneme.
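The Key/Query construction and the attention matrix described above can be sketched in numpy as follows; the number of frequencies, the log-uniform range and the softmax normalization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

d = np.array([3.0, 5.0, 2.0, 4.0])    # hypothetical alignment widths
e = np.cumsum(d)                       # absolute alignment positions e_i
T = int(d.sum())                       # number of acoustic frames

M = 16                                 # number of frequencies (assumed)
# log-uniform frequencies in an assumed range
omega = np.exp(rng.uniform(np.log(1e-2), np.log(1.0), size=M))

def encode(pos):
    """Sine/cosine positional encoding of a vector of positions."""
    ang = np.outer(pos, omega)                        # (len(pos), M)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

Key = encode(e)                        # (num_phonemes, 2M)
Query = encode(np.arange(T))           # (T, 2M)

# Inner-product form: A[t, i] = sum_m cos(omega_m * (t - e_i))
A = Query @ Key.T

# Normalize over phonemes (softmax assumed here)
A_norm = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
```

Note that the frame whose index equals a phoneme's position attains the maximum inner product on that phoneme, matching the intuition above.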
Then we use argmax to build a hard attention matrix:
$$\tilde A[t][i] = \begin{cases} 1 & i = \arg\max_j \hat A[t][j] \\ 0 & \text{otherwise} \end{cases}$$
Now define the attention width $w_i$ of the $i$-th phoneme as the number of frames that attend more on it than on any other phoneme:
$$w_i = \sum_{t=1}^{T} \tilde A[t][i]$$
The alignment width $d_i$ and the attention width $w_i$ are different but related, see Appendix B.
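A small numeric sketch of the argmax step (the frequency range and the widths below are made up): each frame is assigned to the phoneme it attends most on, and the per-phoneme counts are the attention widths, which by construction sum to the number of frames.

```python
import numpy as np

d = np.array([3.0, 5.0, 2.0, 4.0])    # hypothetical alignment widths
e = np.cumsum(d)                       # absolute alignment positions
T = int(d.sum())

omega = np.exp(np.linspace(np.log(1e-2), np.log(1.0), 16))
# A[t, i] = sum_k cos(omega_k * (t - e_i)) -- the inner-product form
diff = np.subtract.outer(np.arange(T), e)             # (T, num_phonemes)
A = np.cos(np.einsum('k,ti->tik', omega, diff)).sum(-1)

# Hard (argmax) attention: one-hot over phonemes per frame
hard = (A.argmax(axis=1)[:, None] == np.arange(len(d))).astype(float)
w = hard.sum(axis=0)     # attention width: frames won by each phoneme
```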
Sine and cosine positional encoding has two properties that make it suitable for this task: the attention function has a heavy tail that enables an acoustic frame to receive information from phonemes very far away, and its gradient is insensitive to the distance term $t - e_i$. See details in Appendix C.
2.3 UFANS Decoder
The decoder receives the alignment information and converts the encoded phoneme information from the encoder into acoustic features. We use UFANS, which has a large receptive field, as our decoder. It generates good-quality acoustic features in a fully parallel manner.
2.4 Loss module
We use an Acoustic Loss, denoted $L_{ac}$, to evaluate the quality of the generated acoustic features. It is the $L_1$- or $L_2$-norm mean error between the predicted acoustic features and the ground-truth features.
2.5 Training Strategy
We propose a two-stage training strategy. The model focuses on alignment learning in stage 1; in stage 2 we fix the alignment module and train the whole system. To improve the performance and speed of alignment learning, we use a convolutional decoder and design an alignment loss.
2.5.1 Stage 1: Alignment Learning
Convolutional Decoder: A decoder with a large receptive field helps the system generate good-quality acoustic features, but it greatly disturbs the learning of the alignment. See Appendix D.
We therefore replace the UFANS decoder with a convolutional decoder consisting of several convolution layers with gated activations, several Dropout operations and one dense layer. The receptive field is set to be small.
Alignment Loss: We define an Alignment Loss, denoted $L_{al}$, based on the fact that the summation of the alignment widths should be equal or close to the frame length $T$ of the acoustic features, see Appendix B. We relax this restriction with a threshold $\tau$:
$$L_{al} = \max\left(\Big|\sum_{i=1}^{N} d_i - T\Big| - \tau,\; 0\right)$$
The final loss is a weighted sum of $L_{ac}$ and $L_{al}$:
$$L = L_{ac} + \lambda L_{al}$$
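The stage-1 objective can be sketched as follows; the threshold value, the weight `lam` and the function names are our own assumptions, not values from the paper:

```python
import numpy as np

def acoustic_loss(pred, target, norm=1):
    """L1- or L2-norm mean error between predicted and true acoustic features."""
    diff = np.abs(pred - target)
    return diff.mean() if norm == 1 else (diff ** 2).mean()

def alignment_loss(d, T, tau=2.0):
    """Hinge on |sum_i d_i - T|: zero inside an assumed threshold tau."""
    return max(abs(d.sum() - T) - tau, 0.0)

def stage1_loss(pred, target, d, T, lam=0.1):
    """Weighted sum of acoustic and alignment losses; lam is an assumed weight."""
    return acoustic_loss(pred, target) + lam * alignment_loss(d, T)
```

The hinge makes the alignment term inactive once the predicted widths sum close enough to the true frame length, so it only steers the overall scale of the alignment.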
2.5.2 Stage 2 : Overall Training
After alignment learning in stage 1, we obtain a good alignment module. In the overall training stage we fix the alignment module, use UFANS as the decoder and train the whole end-to-end system with the Acoustic Loss.
3 Experiments and results
3.1 Dataset
LJ Speech is a public speech dataset consisting of 13,100 pairs of text and 22,050 Hz audio clips. The clips vary from 1 to 10 seconds, and the total length is about 24 hours. Phoneme-based textual features are given. Two kinds of acoustic features are extracted: one is based on the WORLD vocoder and uses mel-frequency cepstral coefficients (MFCCs); the other is linear-scale log-magnitude spectrograms and mel-band spectrograms that can be fed into the Griffin-Lim algorithm or a trained WaveNet vocoder.
The WORLD vocoder feature uses 60-dimensional mel-frequency cepstral coefficients, 2-dimensional band aperiodicity, 1-dimensional logarithmic fundamental frequency, their delta and delta-delta dynamic features, and a 1-dimensional voiced/unvoiced flag, 190 dimensions in total. It uses an FFT window size of 2048 and a frame time of 5 ms.
The spectrograms are obtained with an FFT size of 2048 and a hop size of 275. The dimensions of the linear-scale log-magnitude spectrograms and the mel-band spectrograms are 1025 and 80, respectively.
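A quick arithmetic check of this configuration (not from the paper's code):

```python
# Spectrogram configuration: 22,050 Hz audio, FFT size 2048, hop size 275
n_fft, hop, sr = 2048, 275, 22050

linear_bins = n_fft // 2 + 1        # 1025 linear-scale frequency bins
frame_ms = 1000.0 * hop / sr        # ~12.47 ms per frame
frames_per_sec = sr / hop           # ~80.2 frames per second
```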
3.2 Implementation Details
Tacotron, DCTTS and Deep Voice3 are used as baseline systems to evaluate our FPUTS system. All the details, like hyper-parameters, of those systems are shown in Appendix E.
3.3 Main Results
3.3.1 Inference speed
Inference speed is measured as the latency of synthesizing one second of speech, including data transfer from main memory to GPU global memory, GPU computation and data transfer back to main memory. The measurement is performed on a GTX 1080Ti graphics card. As shown in Table 3, our FPUTS model takes full advantage of parallel computation and is significantly faster than the other systems.
3.3.2 Mean opinion score
Harvard Sentences List 1 and List 2 are used to evaluate the mean opinion score (MOS) of each system. The synthesized audio is evaluated on Amazon Mechanical Turk using the crowdMOS method. Scores range from 1 (Bad) to 5 (Excellent). As shown in Table 3, FPUTS is no worse than the other end-to-end systems. The MOS of the WaveNet-based audio is lower than expected because of background noise in those clips.
3.3.3 Error mode
Attention-based neural TTS systems may run into several error modes that reduce synthesis quality: repetition (repeated pronunciation of one or more phonemes), mispronunciation (wrong pronunciation of one or more phonemes) and word skipping (one or more phonemes are skipped).
To track the occurrence of attention errors, 100 sentences are randomly selected from the Los Angeles Times, the Washington Post and some fairy tales. As shown in Table 3, our FPUTS system is more robust than the other systems.
3.4 Alignment Learning Analysis
Alignment learning is an important part of our system and greatly affects the quality of the generated audio, so we further discuss the factors that affect alignment quality.
3.4.1 Evaluation of alignment
Two methods are used to evaluate the alignment quality in stage 1. 100 audio clips are randomly selected from the training data, denoted the original data. Their utterances are fed to our system to generate audio, denoted the resynthesized version of the original data.
The first method objectively computes the difference between the phoneme durations of the original data and the predicted attention widths, see the example in Appendix I.1. The second method subjectively compares the original data and the resynthesized version by listening, checking whether they have similar phoneme durations, see Appendix G. Fig. 4 is an attention-width plot of an utterance selected randomly from the Internet.
3.4.2 Position encoding function and alignment quality
We replace the sine and cosine positional encoding alignment function with a Gaussian function in stage 1. The experimental results show that the model cannot learn the correct alignment with a Gaussian function. See Appendix I.2 for more details, and Appendix C for a theoretical analysis.
3.4.3 Decoder and alignment quality
To identify the relationship between the decoder and alignment quality in stage 1, we replace the simple convolutional decoder with a UFANS of 6 down-sampling layers. Experiments show that the computed attention widths are much worse than with the simple convolutional decoder, and the synthesized audio also suffers from error modes such as repeated and skipped words. The results show that the receptive field of the decoder should be small in the alignment learning stage. More details are shown in Appendix I.3.
4 Discussion and conclusion
In this paper, a new non-autoregressive, fully parallel TTS system is proposed. It fully utilizes the power of parallel computation and achieves at least a 10-times inference speedup over the most popular end-to-end TTS systems, while generating audio of equal or better quality with fewer errors. Our goal is a lightweight TTS system for deployment that produces good-quality audio with little inference latency and few errors. This paper describes and analyzes every component of FPUTS in detail and compares it with the most popular end-to-end TTS systems. Future work will focus on reducing errors, producing better-quality audio and using both characters and phonemes in FPUTS.
-  Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” CoRR, vol. abs/1703.10135, 2017. [Online]. Available: http://arxiv.org/abs/1703.10135
-  D. W. Griffin, Jae, S. Lim, and S. Member, “Signal estimation from modified short-time fourier transform,” IEEE Trans. Acoustics, Speech and Sig. Proc, pp. 236–243, 1984.
-  M. Morise, F. Yokomori, and K. Ozawa, “World: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
-  A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio.” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1609.html
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv e-prints, vol. abs/1409.0473, Sep. 2014. [Online]. Available: https://arxiv.org/abs/1409.0473
-  K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734. [Online]. Available: http://www.aclweb.org/anthology/D14-1179
-  W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: 2000-speaker neural text-to-speech,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HJtEm4p6Z
-  H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” CoRR, vol. abs/1710.08969, 2017. [Online]. Available: http://arxiv.org/abs/1710.08969
-  D. Ma, Z. Su, Y. Lu, W. Wang, and Z. Li, “Ufans: U-shaped fully-parallel acoustic neural structure for statistical parametric speech synthesis with 20x faster,” arXiv preprint arXiv:1811.12208, Nov 2018.
-  J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. Dauphin, “Convolutional sequence to sequence learning,” in ICML, 2017.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
-  A. van den Oord, N. Kalchbrenner, L. Espeholt, k. kavukcuoglu, O. Vinyals, and A. Graves, “Conditional image generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 4790–4798. [Online]. Available: http://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
-  K. Ito, “The lj speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
-  F. Protasio Ribeiro, D. Florencio, C. Zhang, and M. Seltzer, “Crowdmos: An approach for crowdsourcing mean opinion score studies,” in ICASSP. IEEE, May 2011. [Online]. Available: https://www.microsoft.com/en-us/research/publication/crowdmos-an-approach-for-crowdsourcing-mean-opinion-score-studies/
Appendix A Overall system
See Figure 5.
Appendix B Alignment width and Attention Width
For two adjacent absolute alignment positions $e_i$ and $e_{i+1}$, consider the two functions $f_i(t) = \sum_m \cos(\omega_m (t - e_i))$ and $f_{i+1}(t) = \sum_m \cos(\omega_m (t - e_{i+1}))$. The values of the two functions depend only on the position of $t$ relative to $e_i$ and $e_{i+1}$. From Appendix C, the attention function decreases as $t$ moves away from the phoneme position (locally, but that is sufficient here). So when $t < (e_i + e_{i+1})/2$, $f_i(t) > f_{i+1}(t)$; when $t > (e_i + e_{i+1})/2$, $f_i(t) < f_{i+1}(t)$. Thus $(e_i + e_{i+1})/2$ is the right attention boundary of phoneme $i$; similarly, the left attention boundary is $(e_{i-1} + e_i)/2$. It can be deduced that:
$$w_i = \frac{e_{i+1} - e_{i-1}}{2} = \frac{d_i + d_{i+1}}{2}$$
which means the attention width and the alignment width can be linearly transformed into each other. It is further deduced that:
$$\sum_{i=1}^{N} d_i \approx \sum_{i=1}^{N} w_i = T$$
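This relation can be checked numerically: if each frame is assigned to the phoneme whose absolute position it is closest to (which is what the argmax over the heavy-tailed attention amounts to), the attention widths of interior phonemes equal the average of adjacent alignment widths. The specific widths below are made up, and ties at integer midpoints break toward the earlier phoneme.

```python
import numpy as np

d = np.array([4.0, 6.0, 2.0, 8.0])   # hypothetical alignment widths
e = np.cumsum(d)                      # absolute alignment positions
T = int(d.sum())

t = np.arange(T)
# Each frame attends on the phoneme whose position e_i it is closest to
nearest = np.abs(np.subtract.outer(t, e)).argmin(axis=1)
w = np.bincount(nearest, minlength=len(d)).astype(float)

# Interior phonemes: attention width w_i == (d_i + d_{i+1}) / 2
expected = (d[1:-1] + d[2:]) / 2
```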
Appendix C Properties of sine and cosine positional encoding alignment structure
Besides the sine and cosine positional encoding alignment structure, other attention structures may work, e.g. attention based on a Gaussian function. But experiments show a Gaussian function is not suitable for this task. The reason is revealed below.
Let $x = t - e_i$. The sine and cosine alignment attention function of the $i$-th phoneme is $f(x) = \sum_m \cos(\omega_m x)$. Also consider a Gaussian attention function $g(x) = \exp(-x^2 / 2\sigma^2)$. Since the two functions depend only on $x$, it is convenient to set $e_i$ to 0.
After normalization:
$$\bar f(x) = \frac{f(x)}{f(0)}, \qquad \bar g(x) = \frac{g(x)}{g(0)}$$
The normalized functions are drawn in Figure 6.
The alignment attention has a much heavier tail than the Gaussian function, and a heavy tail is necessary to learn the alignment. In Figure 7, the $t$-th frame currently attends mostly on the $i$-th phoneme, but the correct phoneme for the $t$-th frame is the $j$-th phoneme. To learn the alignment, the $t$-th frame should be able to receive information from the $j$-th phoneme, so the attention on the $j$-th phoneme vanishing to zero is not allowed to happen. If a Gaussian function is used, $g(t - e_j)$ vanishes too fast as $|t - e_j|$ increases. The heavy tail of the sine and cosine positional encoding attention helps acoustic frames receive information from the correct but distant phoneme.
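The tail behavior can be checked numerically; the frequency range, the number of frequencies and the Gaussian width below are illustrative assumptions:

```python
import numpy as np

# Normalized attention kernels: sum-of-cosines (sine/cosine encoding) vs Gaussian
M = 64
omega = np.exp(np.linspace(np.log(1e-2), np.log(1.0), M))  # assumed log-uniform range

def cos_kernel(x):
    # f(x)/f(0): mean of cos(omega_m * x), so the peak value is 1
    return np.cos(np.outer(np.atleast_1d(x), omega)).sum(axis=1) / M

def gauss_kernel(x, sigma=2.0):                             # sigma assumed
    return np.exp(-np.atleast_1d(x) ** 2 / (2 * sigma ** 2))

far = 50.0
tail_cos = abs(cos_kernel(far)[0])      # oscillating but far from zero
tail_gauss = gauss_kernel(far)[0]       # astronomically small
```

At a distance of 50 frames the Gaussian kernel has effectively vanished, while the sum-of-cosines kernel still carries a usable signal, which is the heavy-tail property the argument relies on.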
Now fix $t$, take the phoneme positions as variables, and consider $f(t - e_i)$ and $f(t - e_j)$ (or $g(t - e_i)$ and $g(t - e_j)$). Suppose during training the $t$-th frame receives information from the phonemes and realizes that the $j$-th phoneme is more probable than the $i$-th phoneme to be the correct one. Then the backward information flow (gradient) from the $t$-th frame to $f(t - e_j)$ (alignment attention) or $g(t - e_j)$ (Gaussian) is larger than the flow to $f(t - e_i)$ or $g(t - e_i)$. From the chain rule, the gradient that $e_j$ receives is proportional to $f'(t - e_j)$ using the alignment attention, or $g'(t - e_j)$ using the Gaussian function; the gradient that $e_i$ receives is proportional to $f'(t - e_i)$ or $g'(t - e_i)$.
Consider the two derivative functions $f'(x)$ and $g'(x)$. The two functions (after normalization) are drawn in Figure 8.
For the Gaussian function, $g'(x)$ vanishes quickly as $x$ moves away from 0, so the backward flow vanishes quickly and the $j$-th phoneme cannot receive sufficient backward information. For the alignment attention, though $f'(x)$ is oscillating, it is much less sensitive to the distance $x$: the gradient does not vanish, and the $j$-th phoneme still receives a useful update.
Appendix D Decoder in alignment learning stage should have small receptive field
In Figure 9, the $t$-th frame receives information from a phoneme with a certain attention weight when using a decoder with a large receptive field, and with a smaller effective weight when using a decoder with a small receptive field. With a large receptive field, a frame can also receive a phoneme's information indirectly through neighboring frames. Then even if the frame's direct attention on the wrong phoneme is large, the loss can remain small and the $t$-th frame can keep attending mostly on the wrong phoneme, which greatly disturbs alignment learning.
Appendix E Hyperparameters of Tacotron, DCTTS, Deep Voice 3 and FPUTS in overall training stage
Table 5 shows the main hyperparameters of these models, and Table 4 shows the hyperparameters of FPUTS. One entry gives the size of the hidden state; the 'GatedConv (linear)' entry gives the number of layers and the hidden-state size of the gated convolutions specific to predicting linear-scale log-magnitude spectrograms. Tacotron, DCTTS and Deep Voice 3 take mel-band spectrograms as an intermediate feature whose length is reduced by a reduction factor to speed up inference; FPUTS does not use this trick.
Appendix F Loss curves of alignment learning
Figure 10 shows the loss curves of the acoustic loss and the alignment loss during the alignment learning stage.
Appendix G Performance evaluation of alignment learning
See Table 6.
(Rating categories: mismatch, weakly match, match, perfectly match.)
Appendix H Loss curves with Gaussian function as attention function and loss curves with UFANS as decoder in alignment learning stage
H.1 Attention width with UFANS as decoder
Appendix I Analysis of resynthesized waveforms after alignment learning stage
Here, only results with mel-band spectrograms using the Griffin-Lim algorithm are shown; results for MFCCs are similar.
I.1 Attention width comparison
The phoneme durations are labeled by hand. Figure 13 shows the labeled phonemes of audio 'LJ048-0033'. Table 7(a) compares the phoneme durations of the real audio with the computed attention widths of the resynthesized audio.
I.2 Attention width with a Gaussian function replacing the alignment attention
Table 7(b) clearly shows that the model with a Gaussian function is not able to learn the alignment.
I.3 Attention width with UFANS as decoder
Appendix J Loss comparison in overall training stage
Figures 14 and 15 show the loss curves during the overall training stage. For DCTTS, the training of Text2Mel and SSRN is separated. Note that DCTTS, Tacotron and Deep Voice 3 are all autoregressive: during training they use the real spectrogram of the previous step to predict the spectrogram of the next step, but during inference the predicted spectrograms are used instead. FPUTS is non-autoregressive, meaning all spectrograms are predicted at the same time during both training and inference, so it makes sense that FPUTS has a slightly higher training loss than the other three systems.