Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments
End-to-end text-to-speech (TTS) synthesis is a method that directly converts input text to output acoustic features using a single network. A recent advance of end-to-end TTS is due to a key technique called the attention mechanism, and all successful methods proposed so far have been based on soft attention mechanisms. However, although network structures are becoming increasingly complex, end-to-end TTS systems with soft attention mechanisms may still fail to learn and to predict accurate alignment between the input and output. This may be because the soft attention mechanisms are too flexible. Therefore, we propose an approach with more explicit but natural constraints suitable for speech signals, to make alignment learning and prediction of end-to-end TTS systems more robust. The proposed system, with the constrained alignment scheme borrowed from segment-to-segment neural transduction (SSNT), directly calculates the joint probability of acoustic features and alignment given an input text. The alignment is designed to be hard and monotonically increasing, reflecting the nature of speech, and it is treated as a latent variable and marginalized during training. During prediction, both the alignment and acoustic features can be generated from the probabilistic distributions. The advantages of our approach are that we can simplify many modules required for soft attention and that we can train the end-to-end TTS model using a single likelihood function. As far as we know, our approach is the first end-to-end TTS without a soft attention mechanism.
Yusuke Yasuda, Xin Wang, Junichi Yamagishi
\addressNational Institute of Informatics, Japan
The University of Edinburgh, Edinburgh, UK
SOKENDAI (The Graduate University for Advanced Studies), Japan \firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Index Terms: text-to-speech synthesis, end-to-end, neural network
1 Introduction
End-to-end text-to-speech (TTS) synthesis directly converts an input letter or phoneme sequence to an output acoustic feature sequence. All methods proposed so far have been based on an encoder-decoder sequence-to-sequence model with a soft attention mechanism [1, 2, 3, 4, 5, 6].
The aforementioned frameworks are very promising, and end-to-end TTS systems such as Tacotron 2 can produce synthetic speech with a quality comparable to that of human speech. The use of neural waveform models is one reason, but we believe that improved model architectures are another important reason for the improved performance of end-to-end TTS systems. For example, the transition from Tacotron to Tacotron 2 extended the attention mechanism from additive attention to location-sensitive attention and added a post-net to improve the predicted acoustic features, an additional stop-flag prediction network, non-deterministic output generation using dropout during inference, and a significantly larger model parameter size. However, even if the attention network is well trained, it may still produce unacceptable errors such as skipping input words, repeating the same phrases, and prolonging the same sounds.
Therefore, efforts have been made to improve or constrain soft attention mechanisms for end-to-end TTS systems. Most of them aim to enforce a monotonic alignment structure in order to reduce alignment errors. Such attempts include location-sensitive attention , monotonic attention , and forward attention . In addition, techniques other than the attention mechanism itself can enforce monotonic alignment, such as window masking [4, 10] and penalty loss for off-diagonal attention distribution . Thus, although the quality of synthetic speech generated from end-to-end TTS systems is very high, the model architecture and its objective function are very complex, and many engineering tricks are used.
Can we construct an encoder-decoder end-to-end TTS system without using such complicated networks and engineering tricks? We hypothesize that the main cause of these issues is the soft attention mechanism itself. Therefore, we propose a new end-to-end TTS system that can be optimized based on a likelihood function only. Furthermore, the proposed TTS system uses more explicit but natural constraints for speech signals instead of soft attention mechanisms, which should make alignment learning more robust and efficient. The constrained alignment is conceptually borrowed from segment-to-segment neural transduction (SSNT) [12, 13] and is extended to continuous outputs. (SSNT is similar to the RNN transducer , a well-known method in speech recognition. A notable difference is that SSNT separates the transition probability from the output probability, which is preferable for TTS, a task that predicts continuous outputs.) Because SSNT calculates the joint probability of an acoustic feature sequence and an alignment given the input text, we can compute the likelihood function by marginalizing all possible alignments. The alignments are hard and monotonically increasing by definition.
The proposed framework and likelihood function are similar to those of a hidden Markov model (HMM) because all the possible alignments can be marginalized using a forward-backward algorithm over a trellis consisting of input labels and the output spectrum. However, unlike the conventional HMM or its mixture density network-based HMM , its transition and output probabilities are computed using encoder-decoder autoregressive neural networks like Tacotron. In other words, an input sequence is nonlinearly transformed into an encoded vector, and both the output and transition probabilities are computed in a non-linear autoregressive manner in the SSNT-based TTS.
Our system is still under development, and the quality of its synthetic speech is not yet perfect. However, we present a first evaluation of the current system's performance in this paper. We focus on the description of our system and its evaluation on a standard reading-speech corpus. Its application to a verbal performance corpus is investigated in , which shows the effectiveness of our system on that corpus.
The rest of the paper is structured as follows. In Section 2, we overview end-to-end TTS systems with soft attention. In Section 3, we describe the new encoder-decoder TTS using marginalization of monotonic hard alignments and likelihood-based learning. In Section 4, we show experimental results using the system under development, and we summarize our findings in Section 5.
Table 1: Major end-to-end TTS systems.

| System | Network | Alignment | Decoder output | Post-net output | Waveform synthesis |
|---|---|---|---|---|---|
| VoiceLoop | Memory buffer | GMM | Vocoder | - | WORLD |
| Deep Voice 3 | CNN | Dot-product | Mel | Linear/Vocoder | Griffin-Lim/WORLD/WaveNet |
| Tacotron 2 | RNN | Location-sensitive | Mel | Mel | WaveNet |
2 Overview of end-to-end TTS with soft attention
Table 1 summarizes major end-to-end TTS methods. All the existing systems consist of an encoder and a decoder with attention as basic components. Most systems have a pre-net  at the entrance of the decoder [2, 4, 5, 6]. Although some of the earliest studies [1, 3] do not have a pre-net in the decoder, all later systems adopted it after its introduction.
The choice of target acoustic features is crucial for the end-to-end approach . Some of the earliest work chose vocoder parameters as a target [1, 3]. Vocoder parameters are challenging because reliable feature extraction requires fine analysis conditions, which result in long sequences.  used a mel spectrogram with coarse-grained analysis conditions as the target features to reduce the length gap between the input text and the target acoustic feature sequence. All the other studies followed the same condition [2, 4, 5, 6].
The attention mechanism is the core of those approaches. Many attention mechanisms have been used for TTS. Additive attention  and dot-product attention  belong to the content-based family , which considers input content when aligning input to output features. GMM attention  is a location-based attention  that considers input location only. Location-sensitive attention  extends additive attention by considering the previous alignment, so it has properties of both content-based and location-based attention. The systems using vocoder parameters [1, 3] seem to choose GMM attention; the modes of its attention distributions progress monotonically over input locations, so it is suitable for predicting long sequences such as vocoder parameters. The systems based on CNN or self-attention, which enable parallel training [4, 6], seem to use dot-product attention, which can be combined with positional encoding  to construct sequential relations and an initial monotonic alignment in parallel. The systems based on RNN seem to use additive attention and its extensions [2, 5, 10, 17].
In addition to the decoder, some systems have a post-net, an additional network that predicts acoustic features. The post-net was originally introduced to convert acoustic features into different acoustic features suitable for the adopted waveform synthesis method, for example, from mel spectrograms to linear spectrograms  or from mel spectrograms to vocoder parameters . In recent studies, the role of the post-net is to refine the acoustic features predicted by the decoder and further improve quality [5, 6]. The post-net introduces an additional loss term in the objective function.
Recent methods predict binary stop flags to determine when to stop prediction [5, 4, 6]. Compared with predicting a fixed-length output, the stop flag avoids unnecessary computation. The stop flag introduces an additional loss term in the objective function.
Speech is nondeterministic: an exactly identical utterance cannot be reproduced. To enable this randomness,  enables dropout in the pre-net during inference.
Although the aforementioned techniques have contributed to the advancement of end-to-end TTS, some issues remain unsolved. None of the content-based attention mechanisms or their extensions guarantee forward progress of the alignment from the previous position. This includes forward attention , which extends location-sensitive attention by incorporating a monotonic transition formulation. GMM attention  has a monotonic progress property for the mode position of each Gaussian component. However, it sometimes gives broad, mode-free, or multi-modal distributions that result in muffled speech, possibly because it does not consider the content of the input. Monotonic attention  is the only attention mechanism that guarantees monotonic progress, by switching from soft attention during training to hard attention during inference, but it has not been successful for text-to-speech synthesis .
Learning stop flags seems trivial because it is just a binary classification task in which only one boundary turns the flag from non-stopping to stopping. However, the stop flag tends to overfit because its loss is an order of magnitude smaller than the acoustic feature loss term. To alleviate this problem,  introduced a higher weight in the binary loss term at the boundary where the flag reaches the stop state. The stop flag thus adds extra complexity to implement such a seemingly trivial feature.
Enabling randomness in generated speech through dropout is not widely utilized because dropout is normally disabled during prediction.
3 Proposed SSNT-based TTS
3.1 Model definition and learning of SSNT-based TTS
Let us denote the input letter or phone sequence as $\boldsymbol{x} = (x_1, \ldots, x_I)$, where $x_i$ is the $i$-th letter or phone of the input sequence. We then use $\boldsymbol{y} = (\boldsymbol{y}_1, \ldots, \boldsymbol{y}_T)$ and $\boldsymbol{y}_t$ to denote an output acoustic feature sequence and the acoustic features at time $t$, respectively. Our approach is to model the output acoustic feature sequence given a letter or phone sequence by marginalizing all possible alignments over a trellis consisting of the input and output sequences:

$p(\boldsymbol{y} \mid \boldsymbol{x}) = \sum_{\boldsymbol{z}} p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x}), \quad (1)$
where $\boldsymbol{z} = (z_1, \ldots, z_T)$ represents one of the possible paths of the trellis, and $z_t = i$ means that $\boldsymbol{y}_t$ is aligned to input position $i$. Figure 1 shows the trellis structure of our model. We then use the concept of SSNT [12, 13]. More specifically, we factorize the joint probability of Eq. (1) into an alignment transition probability and an emission probability for acoustic features with a 1st-order Markov assumption on $z_t$:

$p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x}) = \prod_{t=1}^{T} p(z_t \mid z_{t-1}, \boldsymbol{y}_{1:t-1}, \boldsymbol{x})\, p(\boldsymbol{y}_t \mid \boldsymbol{y}_{1:t-1}, z_t, \boldsymbol{x}), \quad (2)$
and we use neural networks to compute the two probabilities.
To constrain the alignment probability to be left-to-right with a self transition, an alignment transition variable $a_t$ with two possible values, $\textsc{Emit}$ and $\textsc{Shift}$, is further introduced for the alignment probability of Eq. (2):

$p(z_t = i \mid z_{t-1} = j, \boldsymbol{y}_{1:t-1}, \boldsymbol{x}) = \begin{cases} p(a_t = \textsc{Emit} \mid z_{t-1} = j, \boldsymbol{y}_{1:t-1}, \boldsymbol{x}) & \text{if } i = j, \\ p(a_t = \textsc{Shift} \mid z_{t-1} = j, \boldsymbol{y}_{1:t-1}, \boldsymbol{x}) & \text{if } i = j + 1, \\ 0 & \text{otherwise.} \end{cases} \quad (3)$
Note that the $\textsc{Emit}$ transition keeps the input position, while the $\textsc{Shift}$ transition proceeds and reads one more input. Please also note that $p(a_t = \textsc{Emit} \mid \cdot) + p(a_t = \textsc{Shift} \mid \cdot) = 1$ and that a neural network predicts only one of them.
Because SSNT was originally proposed for NLP tasks such as abstractive sentence summarization, morphological inflection generation, and machine translation, in which the outputs are words, its emission probability was a discrete distribution. In our case, the output is continuous, and hence we have to define our own emission probability distribution. In this paper, we simply use a multivariate Gaussian distribution as the emission probability of the acoustic features:

$p(\boldsymbol{y}_t \mid \boldsymbol{y}_{1:t-1}, z_t, \boldsymbol{x}) = \mathcal{N}(\boldsymbol{y}_t ; \boldsymbol{\mu}_t, \sigma_t^2 \boldsymbol{I}). \quad (4)$
Note that $\boldsymbol{\mu}_t$ and $\sigma_t$ are predicted by an encoder-decoder network with autoregressive feedback similar to Tacotron.
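For illustration, the Gaussian emission log-density of Eq. (4) can be computed as follows. This is a minimal sketch under our own naming: `sigma` holds per-dimension standard deviations, so the isotropic case of Eq. (4) corresponds to all entries being equal.

```python
import numpy as np

def gaussian_log_prob(y, mu, sigma):
    """Log-density of y under N(mu, diag(sigma**2)).

    The isotropic Gaussian of Eq. (4) is the special case where every
    entry of sigma is the same scalar.
    """
    d = y.shape[-1]
    return -0.5 * (d * np.log(2.0 * np.pi)
                   + 2.0 * np.sum(np.log(sigma))
                   + np.sum(((y - mu) / sigma) ** 2))
```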
Our model can be trained by minimizing the negative log likelihood:

$\mathcal{L} = -\log p(\boldsymbol{y} \mid \boldsymbol{x}) = -\log \alpha(I, T). \quad (5)$
Here $\alpha(i, t)$ is a forward variable of the forward-backward algorithm, and $\alpha(I, T)$ is its value at the final input position $I$ and the final time step $T$. The final forward variable can be calculated recursively:

$\alpha(i, t) = \big[\alpha(i, t-1)\, p(a_t = \textsc{Emit} \mid \cdot) + \alpha(i-1, t-1)\, p(a_t = \textsc{Shift} \mid \cdot)\big]\, p(\boldsymbol{y}_t \mid \boldsymbol{y}_{1:t-1}, z_t = i, \boldsymbol{x}). \quad (6)$
The gradients of the negative log likelihood with respect to the network parameters can be computed using the standard back-propagation algorithm. For more efficient gradient computation, the gradient can also be expressed in terms of the forward and backward variables of the forward-backward algorithm .
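The marginalization over monotonic hard alignments can be sketched as a forward recursion in log space. The following NumPy sketch uses our own naming and takes precomputed per-cell emission log-densities and Shift log-probabilities; it illustrates the dynamic program of Eq. (6), not the authors' actual implementation.

```python
import numpy as np

def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    if m == -np.inf:
        return -np.inf
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def ssnt_log_likelihood(log_emit, log_shift):
    """Forward recursion over the (input position, output frame) trellis.

    log_emit[i, t]:  log p(y_t | ..., z_t = i)          (emission, Eq. 4)
    log_shift[i, t]: log p(a_t = Shift | z_{t-1} = i)   (transition, Eq. 3)
    The Emit (stay) probability is the complement of Shift.
    Returns log p(y | x) = log alpha at the final trellis corner.
    """
    I, T = log_emit.shape
    log_stay = np.log1p(-np.exp(log_shift))       # log(1 - p(Shift))
    alpha = np.full((I, T), -np.inf)
    alpha[0, 0] = log_emit[0, 0]                  # first frame aligned to first input
    for t in range(1, T):
        for i in range(I):
            stay = alpha[i, t - 1] + log_stay[i, t]
            move = alpha[i - 1, t - 1] + log_shift[i - 1, t] if i > 0 else -np.inf
            alpha[i, t] = logsumexp2(stay, move) + log_emit[i, t]
    return alpha[I - 1, T - 1]
```

With log-space accumulation, the product of many small emission densities does not underflow, which matters for the long frame sequences typical of speech.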
3.2 Inference of SSNT-based TTS
During prediction, the alignment variable is incremented by sampling from a Bernoulli distribution whose parameter is obtained by normalizing the two nonzero cases of Eq. (3), namely, $p(a_t = \textsc{Shift} \mid \cdot) \,/\, \big( p(a_t = \textsc{Shift} \mid \cdot) + p(a_t = \textsc{Emit} \mid \cdot) \big)$.
Note that this is different from the transition matrix in HMM synthesis, which models duration with an exponential distribution. Our system models the transition of the alignment as a latent variable, so it does not define a duration distribution as HMM synthesis does.
In the original SSNT, prediction stops when the EOS token is output. In our case, prediction stops when the alignment variable reaches the final input position, causing all of the input sequence to be used ( uses the same stopping criterion).
Acoustic features could be sampled from the Gaussian distribution using the conditional input at the sampled alignment position. However, in this paper, we simply use the mean of the prediction probability distribution as the generated acoustic features. We will investigate the random sampling strategy after we properly estimate a full covariance in Eq. (4).
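The inference procedure above can be sketched as follows. `shift_prob_fn` and `emit_mean_fn` are hypothetical stand-ins for the network's transition and emission heads; the sketch samples the transition, stops once the alignment passes the final input position, and emits the Gaussian mean at each frame.

```python
import numpy as np

def ssnt_generate(shift_prob_fn, emit_mean_fn, num_inputs, max_frames=1000, seed=0):
    """SSNT-style decoding sketch (not the authors' implementation).

    shift_prob_fn(i, frames) -> probability of the Shift transition at input i
    emit_mean_fn(i, frames)  -> mean of the Gaussian emission at input i
    """
    rng = np.random.default_rng(seed)
    i, frames = 0, []
    for _ in range(max_frames):
        # Sample the transition variable: Shift advances the input position.
        if rng.random() < shift_prob_fn(i, frames):
            i += 1
            if i == num_inputs:        # alignment passed the final input: stop
                break
        # Emit the mean of the predicted Gaussian at the aligned input.
        frames.append(emit_mean_fn(i, frames))
    return np.array(frames), i
```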
3.3 Network structure of SSNT-based TTS
Figure 2 shows the detailed network structure of the proposed SSNT-based speech synthesis system. The network consists of an encoder and a decoder. The encoder processes either a letter or phone sequence. We use a CNN-based encoder , which consists of a stack of convolutional layers and a bidirectional LSTM layer .
The decoder processes the acoustic feature sequence in an autoregressive manner. The predicted acoustic features from the previous time step are first processed by the pre-net , which consists of fully connected layers with ReLU activation  and dropout regularization . The pre-net's output is passed through a stack of LSTM layers . The LSTM stack's output is concatenated with the encoder's output and then transformed by a fully connected layer with tanh activation. The output of the tanh nonlinearity is passed to two networks. One is a fully connected layer with sigmoid activation that computes the alignment transition probability of Eq. (3). The other is a linear layer that computes the parameters of the emission probability of Eq. (4).
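As an illustration of the two output heads described above, the following sketch computes the transition probability with a sigmoid head and the emission mean with a linear head from the tanh-projected context. All weight names and shapes here are our own assumptions, not the paper's.

```python
import numpy as np

def decoder_step(lstm_out, enc_vec, params):
    """One SSNT decoder output step after the LSTM stack (illustrative only).

    lstm_out: decoder LSTM stack output; enc_vec: encoder output at the
    currently aligned input position. Returns (Shift probability, emission mean).
    """
    h = np.concatenate([lstm_out, enc_vec])                     # concatenation
    ctx = np.tanh(params["W_ctx"] @ h + params["b_ctx"])        # tanh projection
    shift_p = 1.0 / (1.0 + np.exp(-(params["w_a"] @ ctx + params["b_a"])))  # sigmoid head, Eq. (3)
    mean = params["W_o"] @ ctx + params["b_o"]                  # linear head, Eq. (4)
    return shift_p, mean
```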
4 Experiments
4.1 Experimental conditions
We used the same conditions as in our previous experiment . A Japanese speech corpus from the ATR Ximera dataset  was used. This corpus contains a total of 28,959 utterances from a professional female speaker and is around 46.9 hours in duration. We used manually annotated phoneme labels . The phoneme labels had 58 classes, including silences, pauses, and short pauses. All sentences start and end with a silence symbol. Although Japanese is a pitch-accented language, we did not use accentual type labels in this paper. To train our proposed systems, we trimmed the beginning and ending silence from the utterances, after which the duration of the corpus was 33.5 hours. Fixed-length silences were prepended and appended to the target mel spectrograms. With this data preparation, it is never appropriate for any phoneme symbol to be skipped. The frame size and shift used for the mel spectrogram were 50 ms and 12.5 ms, respectively. We used 27,999 utterances for training, 480 for validation, and 480 for testing.
Phoneme embedding vectors have 256 dimensions. For the encoder, we used the same conditions as . For the decoder, we used two pre-net layers with 256 and 128 dimensions, two LSTM layers with 256 dimensions each, and two fully connected layers for context projection with 256 dimensions each. We applied zoneout regularization  to all LSTM layers with probability 0.1 as in . We set the reduction factor  to 2. The Adam optimizer was used for training . Because the validation loss had plateaued, we stopped training at 510 epochs after checking the quality of the predicted spectrograms on the validation set. Finally, we used µ-law WaveNet for the waveform generation .
4.2 Subjective evaluations
We conducted a listening test to measure the quality of synthetic speech. We chose Japanese Tacotron and self-attention Tacotron without accentual type labels  as baselines in this experiment. All the synthetic speech waveforms were generated using the identical WaveNet model, which was trained using natural mel spectrograms with a 12.5 ms frame shift and 16 kHz sampling frequency.
We recruited 104 native speakers of Japanese as listeners by crowdsourcing. The listeners evaluated 40 audio samples per test set, containing eight randomly selected sentences generated by each of five systems in random order. The systems included natural speech and analysis-by-synthesis in addition to our system and the two aforementioned baseline systems. One listener could evaluate at most ten test sets. Each audio sample was evaluated five times, and we collected a total of 19,200 data points.
Figure 3 shows the results of the subjective evaluation. Among the baseline systems, self-attention Tacotron got 3.13 ± 0.03, and Japanese Tacotron got 3.05 ± 0.03. The scores of the baseline systems are consistent with the previous work . The scores were relatively low because we did not provide any pitch accentual type labels, even though Japanese is a pitch-accented language. Our system got a MOS of 2.33 ± 0.03. Unfortunately, it was rated lower than the baseline systems.
To understand the reasons, we did a simple investigation of the generated audio files and discovered that our model overestimated phoneme durations, especially for pauses within a sentence. Pauses had a much longer duration than other phonemes, but their acoustic features carried less useful information; hence, deciding when the Shift transition should be made would be difficult. Figure 4 shows an example that has an alignment error caused by overestimated pause duration. In fact, for sentences that did not include pauses, the MOS score difference between SSNT and Tacotron was smaller. We also discovered that our method needed longer training time due to the marginalization process. The performance might be improved with sufficiently long training periods.
We designed the alignment structure to be hard and monotonic, which enabled us to avoid some alignment errors that are commonly observed in soft-attention-based approaches. Such alignment errors include muffling, skipping, and repeating. Muffling error is caused by an attention distribution without a sharp mode, skipping error is caused by discontinuous attention, and repeating error is caused by a repeat of backward jumps of the mode of attention. These errors were not observed in our method because they could not happen by definition.
However, we still observed different types of alignment errors in our samples, such as prolongations of vowel duration. We also observed that even when the alignment looked acceptable, that is, when a monotonic increase occurred, wrong phonemes were sometimes generated due to a poorly trained model. When the alignment is not properly estimated, the emission probability can be learned from the wrong acoustic segments. This is a disadvantage of SSNT's hard alignments. However, we also found an advantage of the hard alignment. From our informal listening of generated speech, compared with soft-attention-based methods, speech generated from our method tended to have relatively distinct pronunciation.
5 Conclusions
We proposed a new method for end-to-end TTS without soft attention. In contrast with soft attention based methods, our method has a simpler architecture and ensures monotonic alignment structure by design. Our method represents an alignment variable as a latent variable, and the model can be trained by maximizing the total probability that can be derived by marginalizing the latent alignments. During inference, the alignment variable can be randomly sampled from the learned distribution, and the inference simply stops when the alignment reaches the final input position.
Thanks to the design of hard and monotonic alignment, our proposed method could avoid some alignment errors that are commonly observed in soft-attention-based approaches. Our method also replaced many engineering features of soft-attention-based approaches, including advanced attention mechanisms that enforce a monotonic alignment, a stop-flag prediction network, and nondeterministic inference by dropout, with a single probabilistic approach.
Although our analysis revealed that some generated speech contained another type of alignment error, and although the subjective evaluation showed that the quality of synthetic speech from our system was not yet competitive with soft-attention-based methods on the reading speech corpus, our other research showed the effectiveness of the proposed method on a verbal performance corpus, which is much more challenging data than reading speech .
Our future work includes performance optimization for faster training of SSNT-based TTS. We expect fast training will help to reduce the remaining alignment errors induced by insufficient training time. In addition, we plan to use a more complex probability distribution function for the emission probability of the acoustic features. In this study, we chose the isotropic Gaussian distribution for the emission probability. This was not an ideal choice because the target mel spectrogram had a clear correlation across frequency dimensions. We expect that full covariance modeling will enable random sampling of acoustic features and that a sufficiently complex probability distribution will lead to higher quality generated speech.
This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and by MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, and 18KT0051), Japan. The numerical calculations were carried out on the TSUBAME 3.0 supercomputer at the Tokyo Institute of Technology.
-  J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in Proc. ICLR (Workshop Track), 2017.
-  Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017, pp. 4006–4010.
-  Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “VoiceLoop: Voice fitting and synthesis via a phonological loop,” in Proc. ICLR, 2018.
-  W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: Scaling text-to-speech with convolutional sequence learning,” in Proc. ICLR, 2018.
-  J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018, pp. 4779–4783. [Online]. Available: https://doi.org/10.1109/ICASSP.2018.8461368
-  N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Close to human quality TTS with transformer,” CoRR, vol. abs/1809.08895, 2018. [Online]. Available: http://arxiv.org/abs/1809.08895
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
-  J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. NIPS, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 577–585.
-  A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol. abs/1308.0850, 2013. [Online]. Available: http://arxiv.org/abs/1308.0850
-  J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Forward attention in sequence-to-sequence acoustic modeling for speech synthesis,” in Proc. ICASSP. IEEE, 2018, pp. 4789–4793.
-  H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018. IEEE, 2018, pp. 4784–4788. [Online]. Available: https://doi.org/10.1109/ICASSP.2018.8461829
-  L. Yu, J. Buys, and P. Blunsom, “Online segment to segment neural transduction,” in Proc. EMNLP, J. Su, X. Carreras, and K. Duh, Eds. The Association for Computational Linguistics, 2016, pp. 1307–1316.
-  L. Yu, P. Blunsom, C. Dyer, E. Grefenstette, and T. Kociský, “The neural noisy channel,” in Proc. ICLR, 2017.
-  A. Graves, “Sequence transduction with recurrent neural networks,” CoRR, vol. abs/1211.3711, 2012. [Online]. Available: http://arxiv.org/abs/1211.3711
-  K. Tokuda, K. Hashimoto, K. Oura, and Y. Nankaku, “Temporal modeling in neural network based statistical parametric speech synthesis,” in 9th ISCA Speech Synthesis Workshop, 2016, pp. 106–111. [Online]. Available: http://dx.doi.org/10.21437/SSW.2016-18
-  S. Kato, Y. Yasuda, X. Wang, E. Cooper, S. Takaki, and J. Yamagishi, “Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences,” submitted to SSW10, 2019.
-  Y. Yasuda, X. Wang, S. Takaki, and J. Yamagishi, “Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language,” in Proc. ICASSP, May 2019, pp. 6905–6909.
-  T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, Eds., 2015, pp. 1412–1421. [Online]. Available: http://aclweb.org/anthology/D/D15/D15-1166.pdf
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017, pp. 6000–6010. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need
-  C. Raffel, M. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and linear-time attention by enforcing monotonic alignments,” in Proc. ICML, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 2017, pp. 2837–2846.
-  J. Eisner, “Inside-outside and forward-backward algorithms are just backprop (tutorial paper),” in Proceedings of the Workshop on Structured Prediction for NLP@EMNLP 2016, Austin, TX, USA, November 5, 2016, K. Chang, M. Chang, A. M. Rush, and V. Srikumar, Eds. Association for Computational Linguistics, 2016, pp. 1–17. [Online]. Available: https://doi.org/10.18653/v1/W16-5901
-  M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997. [Online]. Available: https://doi.org/10.1109/78.650093
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proc. ICML, J. Fürnkranz and T. Joachims, Eds., 2010, pp. 807–814.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
-  H. Kawai, T. Toda, J. Yamagishi, T. Hirai, J. Ni, N. Nishizawa, M. Tsuzaki, and K. Tokuda, “XIMERA: A concatenative speech synthesis system with large scale corpora,” IEICE Transactions on Information and System (Japanese Edition), pp. 2688–2698, 2006.
-  H.-T. Luong, X. Wang, J. Yamagishi, and N. Nishizawa, “Investigating accuracy of pitch-accent annotations in neural-network-based speech synthesis and denoising effects,” in Proc. Interspeech, 2018, pp. 37–41.
-  D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in Proc. ICLR, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2014.
-  A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499