Efficiently Trainable Text-to-Speech System Based on
Deep Convolutional Networks With Guided Attention
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), without any recurrent units. Recurrent neural networks (RNNs) have recently become a standard technique for modeling sequential data, and are used in several cutting-edge neural TTS systems. However, training an RNN component often requires a very powerful computer or a very long time, typically several days or weeks. Other recent studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques because of its high parallelizability. The objective of this paper is to show an alternative neural TTS system, based only on CNNs, that alleviates these economic costs of training. In our experiment, the proposed Deep Convolutional TTS was sufficiently trained in one night (15 hours) on an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.
Index Terms— Text-to-speech, deep learning, convolutional neural network, attention, sequence-to-sequence learning.
Text-to-speech (TTS) is becoming increasingly common and is becoming a basic user interface for many systems. To encourage further use of TTS in various systems, it is important to develop a handy, maintainable, extensible TTS component that is accessible to speech non-specialists, enterprising individuals, and small teams who do not have massive computers.
Traditional TTS systems, however, are not necessarily friendly to them, as these systems are typically composed of many domain-specific modules. For example, a typical parametric TTS system is an elaborate integration of many modules, e.g. a text analyzer, an F0 generator, a spectrum generator, a pause estimator, and a vocoder that synthesizes a waveform from these data.
Deep learning can sometimes unite these internal building blocks into a single model that directly connects the input and the output; this type of technique is sometimes called ‘end-to-end’ learning. Although such a technique is sometimes criticized as ‘a black box,’ an end-to-end TTS system named Tacotron [2], which directly estimates a spectrogram from an input text, has recently achieved promising performance, without intensively-engineered parametric models based on domain-specific knowledge. Tacotron, however, has the drawback that it relies on many recurrent units, which are quite costly to train, making it almost infeasible for ordinary labs without luxurious machines to study and extend it further. Indeed, several open clones of Tacotron have been attempted [3, 4, 5, 6], but they have struggled to reproduce speech as clear as that of the original work.
The purpose of this paper is to introduce Deep Convolutional TTS (DCTTS), a novel, handy neural TTS, which is fully convolutional. The architecture is largely similar to Tacotron [2], but is based on a fully convolutional sequence-to-sequence learning model similar to the literature [7]. We show that this handy TTS actually works in a reasonable setting. The contribution of this article is twofold: (1) we propose a fully CNN-based TTS system that can be trained much faster than an RNN-based state-of-the-art neural TTS system, while the sound quality is still acceptable; (2) we also show an idea, which we call ‘guided attention,’ to rapidly train the attention module.
1.1 Related Work
1.1.1 Deep Learning and TTS
Recently, deep learning-based TTS systems have been intensively studied, and some recent studies achieve surprisingly clear results. TTS systems based on deep neural networks include Zen’s work in 2013 [8], studies based on RNNs, e.g. [9, 10, 11, 12], and recently proposed techniques, e.g. WaveNet [13, sec. 3.2], Char2Wav [14], Deep Voice 1&2 [15, 16], and Tacotron [2].
Some of them try to reduce the dependency on hand-engineered internal modules. The most extreme technique in this trend would be Tacotron [2], which depends only on mel and linear spectrograms, and not on any other speech features. Our method is close to Tacotron in the sense that it depends only on these spectral representations of audio signals.
Most of the existing methods above use RNNs, a natural technique for time-series prediction. An exception is WaveNet, which is fully convolutional. Our method is also based only on CNNs, but our usage of CNNs differs from WaveNet’s: WaveNet is a kind of vocoder, or back-end, which synthesizes a waveform from conditioning information given by front-end components. Ours, on the other hand, is rather a front-end (plus most of the back-end processing): we use CNNs to synthesize a spectrogram, from which a simple vocoder can synthesize a waveform.
1.1.2 Sequence-to-Sequence (seq2seq) Learning
Recently, RNNs have been a standard technique to map a sequence to another sequence, especially in the field of natural language processing, e.g. machine translation [17, 18] and dialogue systems [19, 20]. See also [1, sec. 10.4].
RNN-based seq2seq, however, has some disadvantages. One is that a vanilla encoder-decoder model cannot effectively encode an overly long sequence into a fixed-length vector. This problem has been resolved by a mechanism called ‘attention’ [21], which has now become a standard idea in seq2seq learning; see also [1, sec. 12.4.5.1].
Another problem is that an RNN typically takes a long time to train, since it is less suited to parallel computation on GPUs. To overcome this problem, several authors have proposed using CNNs instead of RNNs, e.g. [22, 23, 24, 25, 26]. Some of these studies showed that CNN-based alternatives can be trained much faster, and can even outperform RNN-based techniques.
Gehring et al. [7] recently united these two improvements of seq2seq learning. They proposed a way to use the attention mechanism in a CNN-based seq2seq learning model, and showed that the method is quite effective for machine translation. Our proposed method is indeed based on an idea similar to theirs.
2.1 Basic Knowledge of the Audio Spectrograms
An audio waveform can be mutually converted to a complex spectrogram Z ∈ C^{F′×T′} by the linear maps called STFT and inverse STFT, where F′ and T′ denote the number of frequency bins and temporal bins, respectively. It is common to consider only the magnitude |Z|, since it still carries useful information for many purposes, and since it is almost equivalent to Z in the sense that there exist many techniques to estimate the phase (i.e., to synthesize a waveform) from a magnitude spectrogram, e.g. the famous Griffin & Lim algorithm [27]; see also e.g. [28]. We always use RTISI-LA [29], an online variant of Griffin & Lim, to synthesize a waveform. In this paper, we always normalize STFT spectrograms as (|Z|/max|Z|)^γ, and raise them to the power η/γ when we finally need to synthesize the waveform, where γ and η are pre- and post-emphasis factors.
It is also common to consider a mel spectrogram S, obtained by applying a mel filter-bank to |Z|; this is a standard dimensionality-reduction technique in speech processing. In this paper, we also reduce the temporal dimensionality from T′ to T′/4, by picking up one time frame out of every four, to accelerate the training of Text2Mel shown below. We normalize mel spectrograms in the same way, as (S/max S)^γ.
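The normalization and temporal reduction above can be sketched in NumPy as follows. This is a minimal sketch: the function names are ours, and the default exponents 0.6 and 1.3 are the emphasis factors listed in the experimental setup table.

```python
import numpy as np

def normalize_spectrogram(mag, gamma=0.6):
    # (|Z| / max|Z|)^gamma : pre-emphasis normalization into [0, 1]
    return (mag / mag.max()) ** gamma

def denormalize_spectrogram(norm, eta=1.3, gamma=0.6):
    # raise to eta/gamma (post-emphasis) before waveform synthesis
    return norm ** (eta / gamma)

def downsample_mel(mel, factor=4):
    # keep one time frame out of every `factor` frames (T' -> T'/4)
    return mel[:, ::factor]
```

The slicing step is where the 4x training speed-up of Text2Mel comes from: the decoder only has to predict a quarter of the frames.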
2.2 Notation: Convolution and Highway Activation
In this paper, we denote a 1D convolution layer [30] by a space-saving notation C(o ← i, k, δ)(X), where i is the size of the input channel, o is the size of the output channel, k is the size of the kernel, and δ is the dilation factor; the argument X is a tensor having three dimensions (batch, channel, temporal). The stride of convolution is always 1. Convolution layers are preceded by appropriately-sized zero padding, whose size is determined by a simple arithmetic so that the length of the sequence is kept constant. Let us also denote the 1D deconvolution layer as D(o ← i, k, δ)(X); the stride of deconvolution is always 2 in this paper. We write networks as compositions of such layers. ReLU is the element-wise activation function defined by ReLU(x) = max(x, 0).
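As an illustration of the padding arithmetic above, a naive NumPy sketch of a stride-1, length-preserving 1D convolution with dilation might look as follows. This is our illustration, not the authors' implementation; for a causal layer all padding goes to the past, for a non-causal layer it is split around the center.

```python
import numpy as np

def same_pad(k, delta):
    # total zero padding so that a stride-1 convolution with kernel
    # size k and dilation delta keeps the sequence length constant
    return (k - 1) * delta

def conv1d_same(x, w, delta=1, causal=False):
    # x: (channels_in, T), w: (channels_out, channels_in, k)
    o, i, k = w.shape
    p = same_pad(k, delta)
    if causal:
        xp = np.pad(x, ((0, 0), (p, 0)))            # pad only the past
    else:
        xp = np.pad(x, ((0, 0), (p // 2, p - p // 2)))
    T = x.shape[1]
    y = np.zeros((o, T))
    for t in range(T):
        taps = xp[:, t : t + p + 1 : delta]         # k dilated taps
        y[:, t] = np.einsum('oik,ik->o', w, taps)
    return y
```

With a causal layer, the output at frame t depends only on inputs up to t, which is the property needed for autoregressive synthesis.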
Convolution layers are sometimes followed by a Highway Network [31]-like gated activation, which is advantageous in very deep networks: Highway(X) = σ(H1) ⊙ H2 + (1 − σ(H1)) ⊙ X, where H1 and H2 are properly-sized matrices output by a convolution layer as [H1; H2] = C(2d ← d, k, δ)(X). The operator ⊙ is the element-wise multiplication, and σ is the element-wise sigmoid function. Hereafter, let us denote this gated convolution unit simply by HC.
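A minimal NumPy sketch of this gated activation, assuming H1 and H2 have already been produced by the preceding convolution:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(x, h1, h2):
    # Highway-style gate: sigmoid(h1) mixes the transformed path h2
    # with the untouched input x; a near-closed gate passes x through.
    g = sigmoid(h1)
    return g * h2 + (1.0 - g) * x
```

The residual-like pass-through path is what makes gradients flow easily through many stacked layers.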
3 Proposed Network
Our DCTTS model consists of two networks: (1) Text2Mel, which synthesizes a mel spectrogram from an input text, and (2) the Spectrogram Super-resolution Network (SSRN), which converts a coarse mel spectrogram to the full STFT spectrogram. Fig. 2 shows the overall architecture of the proposed method.
3.1 Text2Mel: Text to Mel Spectrogram Network
We first consider synthesizing a coarse mel spectrogram from a text. This is the main part of the proposed method. This module consists of four submodules: Text Encoder, Audio Encoder, Attention, and Audio Decoder. The network TextEnc first encodes the input sentence, consisting of N characters, into two matrices K, V ∈ R^{d×N}. The network AudioEnc, on the other hand, encodes the coarse mel spectrogram S of previously spoken speech, whose length is T, into a matrix Q ∈ R^{d×T}.
An attention matrix A ∈ R^{N×T}, defined as A = softmax_{n-axis}(Kᵀ Q / √d), evaluates how strongly the n-th character and the t-th time frame are related. A_{nt} ≈ 1 implies that the module is looking at the n-th character at the time frame t, and that it will look at the n-th or (n+1)-th characters, or characters around them, at the subsequent time frame t+1. Whichever it is, let us expect those to be encoded in the n-th column of V. Thus the seed R ∈ R^{d×T}, which is decoded to the subsequent frames, is obtained as R = Att(Q, K, V) := V A.
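Assuming the standard scaled dot-product form for the attention described above, a NumPy sketch (ours) is:

```python
import numpy as np

def attend(Q, K, V):
    # Q: (d, T) audio encoding; K, V: (d, N) text encodings.
    d = K.shape[0]
    logits = K.T @ Q / np.sqrt(d)              # (N, T)
    A = np.exp(logits - logits.max(axis=0, keepdims=True))
    A /= A.sum(axis=0, keepdims=True)          # softmax over characters n
    R = V @ A                                  # (d, T) seed for the decoder
    return R, A
```

Each column of A is a distribution over the N characters, so each column of R is a convex combination of the columns of V.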
The resultant R is concatenated with the encoded audio Q, as R′ = [R; Q], because we found this beneficial in our pilot study. Then, the concatenated matrix R′ is decoded by the Audio Decoder module to synthesize a coarse mel spectrogram, Y = AudioDec(R′).
The result Y is compared with the temporally-shifted ground truth S, by a loss function L(Y|S), and the error is back-propagated to the network parameters. The loss function was the sum of the L1 loss and the binary divergence,
D_bin(Y|S) = E_{ft}[ S_{ft} log(S_{ft}/Y_{ft}) + (1 − S_{ft}) log((1 − S_{ft})/(1 − Y_{ft})) ],
where E_{ft} denotes the average over all frequency-time entries. Since the binary divergence gives a non-vanishing gradient to the network, proportional to Y_{ft} − S_{ft} with respect to the pre-sigmoid output, it is advantageous in gradient-based training. It is easily verified that the spectrogram error is non-negative, L(Y|S) ≥ 0, and that equality holds iff Y = S.
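A NumPy sketch of this loss; the epsilon is our addition for numerical safety, and spectrogram values are assumed to lie in (0, 1) after normalization:

```python
import numpy as np

def spectrogram_loss(Y, S, eps=1e-8):
    # L1 term plus binary (Bernoulli KL-style) divergence between
    # the predicted spectrogram Y and the ground truth S.
    l1 = np.abs(Y - S).mean()
    dbin = (S * np.log((S + eps) / (Y + eps))
            + (1 - S) * np.log((1 - S + eps) / (1 - Y + eps))).mean()
    return l1 + dbin
```

The loss vanishes exactly when Y = S and is positive otherwise, matching the non-negativity claim in the text.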
3.1.1 Details of TextEnc, AudioEnc, and AudioDec
Our networks are fully convolutional and do not depend on any recurrent units. Instead of RNNs, we sometimes take advantage of dilated convolution [32, 13, 24] to take long contextual information into account. The top equation of Fig. 2 shows the content of TextEnc. It consists of a character embedding and stacked 1D non-causal convolutions. A previous work [2] used a heavier RNN-based component named ‘CBHG,’ but we found that this simpler network also works well. AudioEnc and AudioDec, shown in Fig. 2, are composed of 1D causal convolution layers with Highway activations. These convolutions should be causal because the output of AudioDec is fed back to the input of AudioEnc in the synthesis stage.
3.2 Spectrogram Super-resolution Network (SSRN)
We finally synthesize a full spectrogram, from the obtained coarse mel spectrogram, by the spectrogram super-resolution network (SSRN). Upsampling in frequency, from the mel bins to the full STFT bins, is rather straightforward: we can achieve it by increasing the number of channels of the 1D convolutional network. Upsampling in the temporal direction is not done the same way; instead, by twice applying a deconvolution layer of stride 2, we quadruple the length of the sequence, from T to 4T. The bottom equation of Fig. 2 shows SSRN. In this paper, as we do not consider online processing, all convolutions can be non-causal. The loss function is the same as for Text2Mel: the sum of the binary divergence and the L1 distance between the synthesized spectrogram and the ground truth.
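The temporal length arithmetic can be illustrated as follows; the nearest-neighbor repetition here only stands in for a learned stride-2 deconvolution, which doubles the length in the same way:

```python
import numpy as np

def deconv_upsample(x, stride=2):
    # stand-in for a stride-2 transposed convolution: both double the
    # temporal length; a learned layer would also filter the result
    return np.repeat(x, stride, axis=1)

def ssrn_output_length(T):
    # two stride-2 deconvolutions quadruple the length: T -> 2T -> 4T
    return T * 2 * 2
```

This quadrupling exactly undoes the factor-4 temporal reduction applied to the mel spectrograms before Text2Mel training.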
4 Guided Attention
4.1 Guided Attention Loss: Motivation, Method and Effects
In general, an attention module is quite costly to train. Therefore, if some prior knowledge is available, incorporating it into the model may help alleviate the heavy training. We show that the simple measure below is helpful in training the attention module.
In TTS, the possible attention matrices lie in a very small subspace of R^{N×T}. This is because of the rough correspondence between the order of the characters and the order of the audio segments: if one reads a text, it is natural to assume that the text position n progresses nearly linearly with the time t, i.e., n ≈ at, where a ≈ N/T. This is a prominent difference between TTS and other seq2seq tasks such as machine translation, in which the attention module must resolve the word alignment between two languages with very different syntax, e.g. English and Japanese.
Based on this idea, we introduce another constraint on the attention matrix to prompt it to be ‘nearly diagonal’: L_att(A) = E_{nt}[A_{nt} W_{nt}], where W_{nt} = 1 − exp{−(n/N − t/T)² / (2g²)}. In this paper, we set g = 0.2. If A is far from diagonal (e.g., reading the characters in random order), it is strongly penalized by this loss function. This subsidiary loss is optimized simultaneously with the main loss, with equal weight.
Although this measure is based on quite a rough assumption, it improved the training efficiency. In our experiment, when we added the guided attention loss to the objective, the term began decreasing after only ~100 iterations. After 5K iterations, the attention became roughly correct, not only on training data but also on new input texts. Without the guided attention loss, on the other hand, many more iterations were required: learning began only after 10K iterations, and ~50K iterations were needed before the module looked at roughly correct positions, and even then the attention matrix was still vague. Fig. 3 compares the attention matrices trained with and without the guided attention loss.
4.2 Forcibly Incremental Attention in Synthesis Stage
In the synthesis stage, the attention matrix A sometimes fails to look at the correct characters. Typical errors we observed were that (1) it occasionally skipped several characters, and (2) it repeatedly read the same word twice or more. To make the system more robust, we heuristically modify A to be ‘nearly diagonal’ by the following simple rule, which we observed sometimes alleviated such misattentions. Let n_t = argmax_n A_{nt} be the position of the character to be read at the t-th time frame. Comparing the current position n_t with the previous position n_{t−1}, unless their difference lies within the range −1 ≤ n_t − n_{t−1} ≤ 3, the current attention column is forcibly set to a Kronecker delta, A_{nt} = δ_{n, n_{t−1}+1}, to make the attention target incremental.
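The rule can be sketched as follows (ours; the default window bounds lo = −1 and hi = 3 are our reading of the allowed range):

```python
import numpy as np

def force_incremental(A, prev_n, t, lo=-1, hi=3):
    # If the attended character n_t = argmax_n A[n, t] jumps outside
    # the window [prev_n + lo, prev_n + hi], overwrite column t with
    # a one-hot (Kronecker delta) at prev_n + 1.
    n_t = int(A[:, t].argmax())
    if not (lo <= n_t - prev_n <= hi):
        A[:, t] = 0.0
        n_t = min(prev_n + 1, A.shape[0] - 1)
        A[n_t, t] = 1.0
    return n_t
```

Applied column by column during synthesis, this prevents both skipping ahead by many characters and jumping back to reread a word.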
5.1 Experimental Setup
|Sampling rate of audio signals|22050 Hz|
|STFT window function|Hanning|
|STFT window length and shift|1024 (46.4 ms), 256 (11.6 ms)|
|STFT spectrogram size|(depends on audio clip)|
|Mel spectrogram size|(depends on audio clip)|
|Dimensions|128, 256, 512|
|Emphasis factors|(0.6, 1.3)|
|RTISI-LA window and iterations|100, 10|
|Character set|a-z , . ’ - Space, NULL|
|Method|Iterations|Time|MOS (95% CI)|
|Open Tacotron|877K|12 days||
To train the networks, we used the LJ Speech Dataset [33], a public-domain speech dataset consisting of 13K pairs of text and speech, 24 hours in total, without phoneme-level alignment. The recordings are slightly reverberant. We preprocessed the texts by spelling out some abbreviations and numeric expressions, decapitalizing capital letters, and removing the less frequent characters not shown in Table 2, where NULL is a dummy character for zero padding.
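The character-level preprocessing can be sketched as below. This is a minimal sketch: the regex is our reading of the character set listed in Table 2, not the authors' exact preprocessing code.

```python
import re

def normalize_text(text):
    # decapitalize, then drop every character outside the training
    # charset (a-z, comma, period, apostrophe, hyphen, space)
    text = text.lower()
    return re.sub(r"[^a-z,.'\- ]", "", text)
```

Numeric expressions and abbreviations would be spelled out before this step; here they are simply removed.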
We implemented our neural networks using Chainer 2.0 [34]. The models were trained on a household gaming PC equipped with two GPUs. The main memory of the machine was 62 GB, which is much larger than the audio dataset. Both GPUs were NVIDIA GeForce GTX 980 Ti, with 6 GB of memory each.
For simplicity, we trained Text2Mel and SSRN independently and asynchronously, using different GPUs. All network parameters were initialized using He’s Gaussian initializer [35]. Both networks were trained with the ADAM optimizer [36]. When training SSRN, we randomly extracted short sequences for each iteration, to save memory. To reduce the frequency of disk access, we took a snapshot of the parameters only every 5K iterations. Other parameters are shown in Table 2.
As it is not easy for us to reproduce the original results of Tacotron, we instead used a ready-to-use model for comparison, which seemed to produce the most reasonable sound among the open implementations. This model is reported to have been trained on the LJ Dataset for 12 days (877K iterations) on a GTX 1080 Ti, a newer GPU than ours. Note that this iteration count is still much less than that of the original Tacotron, which was trained for more than 2M iterations.
We evaluated mean opinion scores (MOS) for both methods by crowdsourcing on Amazon Mechanical Turk, using the crowdMOS toolkit [37]. We used 20 sentences from Harvard Sentences Lists 1&2, and synthesized the audio using the 5 methods shown in Table 2. The crowdworkers evaluated these 100 clips, rating them from 1 (Bad) to 5 (Excellent); each worker was required to rate at least 10 clips. To obtain more responses of higher quality, we set a few incentives, and the results were statistically processed using the method shown in the literature [37].
5.2 Result and Discussion
In our setting, the training throughput was 3.8 minibatches/s (Text2Mel) and 6.4 minibatches/s (SSRN). This implies that we can apply the updating formulae of Text2Mel 200K times in only 15 hours. Fig. 4 shows an example of the attention, and the synthesized mel and full spectrograms, after 15 hours of training. It shows that the method can look at almost the correct characters, and can synthesize quite clear spectrograms. More samples are available on the author’s web page (https://github.com/tachi-hi/tts_samples).
In our crowdsourcing experiment, 31 subjects evaluated our data. After the automatic screening by the toolkit [37], 560 scores by 6 subjects were selected for the final statistics. Table 2 compares the performance of our proposed method (DCTTS), after 15 hours of training, with that of the open Tacotron, with 95% confidence intervals on the MOS. Although this is not a strict comparison, since the frameworks and machines differ, it can still be concluded that our proposed method is trained quite rapidly to a satisfactory level compared to Tacotron.
6 Summary and Future Work
This paper described a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), as well as a technique to rapidly train the attention module. In our experiment, the proposed Deep Convolutional TTS was sufficiently trained in one night (15 hours) on an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.
Although the audio quality is far from perfect, it may be improved by thoroughly tuning the hyper-parameters, and by applying techniques developed in the deep learning community. We believe this handy method encourages further development of applications based on speech synthesis. This simple neural TTS may be extended by further studies to other versatile purposes, such as emotional/non-linguistic/personalized speech synthesis, singing voice synthesis, music synthesis, etc. In addition, since neural TTS has become this lightweight, studies on more integrated speech systems, e.g. multimodal systems, simultaneous training of TTS+ASR, and speech translation, may have become more feasible. These issues should be worked out in the future.
The authors would like to thank the OSS contributors, the members and contributors of the LibriVox public-domain audiobook project, and @keithito, who created the speech dataset.
-  I. Goodfellow et al., Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
-  Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017, arXiv:1703.10135.
-  A. Barron, “Implementation of Google’s Tacotron in TensorFlow,” 2017, Available at GitHub, https://github.com/barronalex/Tacotron (visited Oct. 2017).
-  K. Park, “A TensorFlow implementation of Tacotron: A fully end-to-end text-to-speech synthesis model,” 2017, Available at GitHub, https://github.com/Kyubyong/tacotron (visited Oct. 2017).
-  K. Ito, “Tacotron speech synthesis implemented in TensorFlow, with samples and a pre-trained model,” 2017, Available at GitHub, https://github.com/keithito/tacotron (visited Oct. 2017).
-  R. Yamamoto, “PyTorch implementation of Tacotron speech synthesis model,” 2017, Available at GitHub, https://github.com/r9y9/tacotron_pytorch (visited Oct. 2017).
-  J. Gehring et al., “Convolutional sequence to sequence learning,” in Proc. ICML, 2017, pp. 1243–1252, arXiv:1705.03122.
-  H. Zen et al., “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013, pp. 7962–7966.
-  Y. Fan et al., “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Proc. Interspeech, 2014, pp. 1964–1968.
-  H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Proc. ICASSP, 2015, pp. 4470–4474.
-  S. Achanta et al., “An investigation of recurrent neural network architectures for statistical parametric speech synthesis.,” in Proc. Interspeech, 2015, pp. 859–863.
-  Z. Wu and S. King, “Investigating gated recurrent networks for speech synthesis,” in Proc. ICASSP, 2016, pp. 5140–5144.
-  A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
-  J. Sotelo et al., “Char2wav: End-to-end speech synthesis,” in Proc. ICLR, 2017.
-  S. Arik et al., “Deep voice: Real-time neural text-to-speech,” in Proc. ICML, 2017, pp. 195–204, arXiv:1702.07825.
-  S. Arik et al., “Deep voice 2: Multi-speaker neural text-to-speech,” in Proc. NIPS, 2017, arXiv:1705.08947.
-  K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proc. EMNLP, 2014, pp. 1724–1734.
-  I. Sutskever et al., “Sequence to sequence learning with neural networks,” in Proc. NIPS, 2014, pp. 3104–3112.
-  O. Vinyals and Q. Le, “A neural conversational model,” in Proc. Deep Learning Workshop, ICML, 2015.
-  I. V. Serban et al., “Building end-to-end dialogue systems using generative hierarchical neural network models.,” in Proc. AAAI, 2016, pp. 3776–3784.
-  D. Bahdanau et al., “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR 2015, arXiv:1409.0473, 2014.
-  Y. Kim, “Convolutional neural networks for sentence classification,” in Proc. EMNLP, 2014, pp. 1746–1752, arXiv:1408.5882.
-  X. Zhang et al., “Character-level convolutional networks for text classification,” in Proc. NIPS, 2015, arXiv:1509.01626.
-  N. Kalchbrenner et al., “Neural machine translation in linear time,” arXiv:1610.10099, 2016.
-  Y. N. Dauphin et al., “Language modeling with gated convolutional networks,” arXiv:1612.08083, 2016.
-  J. Bradbury et al., “Quasi-recurrent neural networks,” in Proc. ICLR 2017, 2016.
-  D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol. 32, no. 2, pp. 236–243, 1984.
-  P. Mowlee et al., Phase-Aware Signal Processing in Speech Communication: Theory and Practice, Wiley, 2016.
-  X. Zhu et al., “Real-time signal estimation from modified short-time Fourier transform magnitude spectra,” IEEE Trans. ASLP, vol. 15, no. 5, 2007.
-  Y. LeCun and Y. Bengio, “The handbook of brain theory and neural networks,” chapter Convolutional Networks for Images, Speech, and Time Series, pp. 255–258. MIT Press, 1998.
-  R. K. Srivastava et al., “Training very deep networks,” in Proc. NIPS, 2015, pp. 2377–2385.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in Proc. ICLR, 2016.
-  K. Ito, “The LJ speech dataset,” 2017, Available at https://keithito.com/LJ-Speech-Dataset/ (visited Sep. 2017).
-  S. Tokui et al., “Chainer: A next-generation open source framework for deep learning,” in Proc. Workshop LearningSys, NIPS, 2015.
-  K. He et al., “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. ICCV, 2015, pp. 1026–1034, arXiv:1502.01852.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR 2015, 2014, arXiv:1412.6980.
-  F. Ribeiro et al., “CrowdMOS: An approach for crowdsourcing mean opinion score studies,” in Proc ICASSP, 2011, pp. 2416–2419.