TTS Skins: Speaker Conversion via ASR

We present a fully convolutional wav-to-wav network for converting between speakers’ voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition (ASR), and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks, and demonstrate the ability to perform multi-voice TTS in those voices, by converting the voice of a TTS robot. We observe no degradation in the quality of the generated voices, in comparison to the reference TTS voice. The modularity of our approach, which separates the target voice generation from the TTS module, enables client-side personalized TTS in a privacy-aware manner.

Adam Polyak, Lior Wolf, Yaniv Taigman

Facebook AI Research

{adampolyak, wolf, yaniv}

Index Terms: speech recognition, human-computer interaction

1 Introduction

Text-To-Speech (TTS) and Automatic Speech Recognition (ASR) are currently two of the most common human-machine interfaces, deployed on popular virtual assistants. Recent advances in neural networks have dramatically increased the naturalness of TTS systems in synthesizing artificial human speech from text, thereby increasing their accessibility to users.

These systems are trained end-to-end, from text to speech, on about 100 hours of a single speaker, usually recorded in a professional studio. Enriching this interface by providing additional speakers can either be done by collecting a similar dataset for each speaker, or through some adaptation process. Currently, most adaptation methods have shown limited success, either by requiring a similar data collection effort or by providing inferior results with respect to the identity of the speaker.

We present a fully convolutional encoder-decoder network that can be plugged into existing TTS systems to provide additional speaker identities (“skins”), requiring a considerably smaller quantity of target speaker recordings and no text. Crucially, our encoder utilizes an ASR network that provides both high-level acoustic perception and speaker-independent features.

Our main contributions are: (i) We demonstrate the utility of off-the-shelf ASR networks as suitable encoders for a voice conversion network. (ii) We present a conditioned WaveNet decoder based on these features. (iii) We propose a multi-speaker TTS solution that has favorable properties with regard to the training data required for the majority of the voices.

2 Method

Figure 1: The architecture of our method. The non-causal pre-trained encoder embeds the source audio signal to speech features to which the speaker’s embedding is concatenated. The joint embedding is used to condition the autoregressive decoder at multiple layers. During training, the decoder’s input is the source audio, instead of the generated one.

The voice conversion method is based on training an autoencoder architecture, conditioned on an embedding of one of the speakers. In our method, the encoder is taken to be a subnetwork of a pre-trained ASR network, specifically the public implementation [1] of Wav2Letter [2]. Since the ASR network is, by design, invariant to the speaker, the method does not require any disentanglement terms or other domain (speaker) confusion terms. This greatly adds to the stability of the method. During training, the speaker embeddings are learned and a softmax reconstruction loss is applied to the output of the shared WaveNet decoder.

A diagram of the architecture is shown in Fig. 1. It includes the pre-trained encoder $E$, the learned speaker embeddings, and a single decoder $D$, which is shared among all speakers.

The encoder is a deep time delay neural network (TDNN), composed of blocks of 1D-convolutional layers. It operates on log-mel filterbank features extracted from a 16 kHz speech signal, using a sliding window of 320 ms with hops of 160 ms. Initially, the mel features sequence is downsampled by , using a single strided convolutional layer. The downsampled input then passes through a series of 1D-convolutional blocks, each with convolutional layers. Each 1D-convolutional layer is a sequence of a convolutional operation, batch normalization, clipped ReLU activation, and dropout. Each block is a series of such layers, with a residual connection adding a projection of the block input to the block output.

The decoder receives as input the sequence that was generated so far (an autoregressive model) and is conditioned on: (i) the latent representation produced by the encoder $E$, and (ii) the desired speaker embedding. The first conditioning signal is obtained by upsampling this encoding in the temporal domain to the original audio rate. After the speaker embedding is concatenated to it, it is applied to each of the WaveNet layers, passing through a convolutional layer that is different for each conditioning location.
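The construction of the conditioning signal can be illustrated with a minimal sketch. It assumes nearest-neighbour repetition for the temporal upsampling (the text does not say which upsampling scheme is used), and the function names and toy sizes are hypothetical. The per-layer convolution applied to the conditioning signal is not shown.

```python
def upsample_nearest(frames, factor):
    """Repeat each encoder frame `factor` times (assumed upsampling scheme)."""
    return [list(frame) for frame in frames for _ in range(factor)]


def build_conditioning(encoder_frames, speaker_embedding, samples_per_frame):
    """Bring encoder features to the audio rate, then append the speaker
    embedding to the feature vector of every time step."""
    upsampled = upsample_nearest(encoder_frames, samples_per_frame)
    return [frame + list(speaker_embedding) for frame in upsampled]


# Toy sizes (assumptions): 2 encoder frames of width 3, a 2-d speaker
# embedding, and 4 audio samples per encoder frame.
encoder_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speaker_embedding = [1.0, -1.0]
conditioning = build_conditioning(encoder_frames, speaker_embedding, 4)
print(len(conditioning), len(conditioning[0]))  # 8 5
```

Every audio-rate time step thus carries both the speaker-independent content features and the same target-speaker vector.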

The WaveNet decoder has four blocks of ten residual layers and a resulting receptive field of 250 ms (4,093 samples), as in [3]. At the top of the WaveNet, there are two fully connected layers and a softmax activation, which provides probabilities for the quantized audio output (256 options) at the next time step. We have developed efficient CUDA kernels that can decode autoregressively in real time (less than one second of processing time for one second of output), thereby enabling efficient conversion. These kernels are loosely based on the NV-WaveNet kernels provided by NVIDIA, and were optimized to utilize NVIDIA’s Volta architecture. Additional modifications are based on the WaveNet variant of [3]: (i) we add terms to ResNet skip connections that are directly computed from the WAV, (ii) we double the kernel capacity to 128 residual channels, and (iii) we add conditioning to the last fully connected layer.
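The quoted receptive field can be checked with a short calculation. The sketch below assumes the usual WaveNet layout, which is not spelled out in the text: kernel size 2, with dilations doubling from 1 to 512 inside each block of ten layers. The function name is hypothetical.

```python
def wavenet_receptive_field(num_blocks, layers_per_block, kernel_size=2):
    """Receptive field, in samples, of stacked dilated causal convolutions.

    Assumes the dilation doubles at every layer and restarts at 1 in each block.
    """
    dilations = [2 ** i for i in range(layers_per_block)] * num_blocks
    # Each layer widens the field by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


SAMPLE_RATE_HZ = 16_000
field = wavenet_receptive_field(num_blocks=4, layers_per_block=10)
print(field)                                    # 4093
print(round(1000 * field / SAMPLE_RATE_HZ, 1))  # 255.8
```

Under these assumptions, four blocks of ten layers give 4,093 samples, or about 256 ms at 16 kHz, in line with the rounded figure of 250 ms.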

2.1 Training and the Losses Used

Since the encoder $E$ is taken off-the-shelf, we only train the decoder $D$. Let $D(\cdot, v_j)$ denote the conditioning of the decoder on a speaker $j$ that is embedded in a vector $v_j$. During training, we fit the vector $v_j$ of each speaker and store these in a Look Up Table (LUT). During training, $k$ autoencoder paths are trained, one per speaker, each of the form $D(E(s^j), v_j)$. The following optimization problem is defined:

$$\min_{\theta, v_1, \dots, v_k} \; \sum_{j=1}^{k} \sum_{s^j} \mathcal{L}\big(D(E(s^j), v_j),\, s^j\big) \tag{1}$$

where $\theta$ are the parameters of the decoder, $s^j$ are the training samples of speaker $j$, and $\mathcal{L}$ is the cross entropy loss, applied to each element of the output and the corresponding element of the target separately. Note that the decoder is an autoregressive model that is conditioned on its own output in the previous frames. During training, the autoregressive model is fed the target output from the previous time-step, instead of the generated output, a practice known as “teacher forcing”.
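The per-element cross entropy and the teacher-forcing input shift can be sketched as follows. This is an illustration only: `toy_decoder` is a hypothetical stand-in for the conditioned WaveNet decoder, and the four quantization levels and the start token are assumptions made to keep the example small.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of `target_index` under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def teacher_forced_loss(predict_logits, target, start_token=0):
    """Mean per-sample cross entropy under teacher forcing.

    At step t the decoder sees the ground-truth samples before t,
    never its own previous predictions.
    """
    decoder_inputs = [start_token] + list(target[:-1])
    total = 0.0
    for t, wanted in enumerate(target):
        logits = predict_logits(decoder_inputs[: t + 1])
        total += cross_entropy(logits, wanted)
    return total / len(target)


def toy_decoder(history):
    """Hypothetical decoder: 4 quantization levels, favours repeating
    the most recent input sample."""
    logits = [0.0] * 4
    logits[history[-1]] = 2.0
    return logits


loss = teacher_forced_loss(toy_decoder, [1, 1, 2, 2])
print(round(loss, 4))
```

Because the inputs come from the target sequence, every time step can be evaluated without waiting for earlier predictions, which is what makes training far cheaper than sample-by-sample generation.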

Once the network is trained, given a new voice sample $s$ to convert (from an existing or a new speaker), we apply the autoencoder pathway of speaker $j$ in order to obtain a conversion to this voice: $D(E(s), v_j)$.

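Following Chen et al. [4], we present a variant of our method, which adds the fundamental frequency ($F_0$) values as additional conditioning to the network. This enables the model to preserve more of the prosody of the original voice sample $s$. In this variant, the optimization problem is defined as:

$$\min_{\theta, v_1, \dots, v_k} \; \sum_{j=1}^{k} \sum_{s^j} \mathcal{L}\big(D(E(s^j), F_0(s^j), v_j),\, s^j\big) \tag{2}$$

where $F_0(s)$ is the fundamental frequency series extracted from input voice sample $s$, and the remaining symbols are as in the first objective (encoder $E$, decoder $D$ with parameters $\theta$, speaker embedding $v_j$, cross entropy loss $\mathcal{L}$). Similarly, voice conversion for speaker $j$ is obtained via $D(E(s), F_0(s), v_j)$.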

2.2 Fitting New Speakers

It is desirable to be able to add voices to the network after training, preferably with little training data and a shorter training procedure. To achieve this, we fine-tune the pre-trained network using the samples of the new speaker. In such a case, we both fit the new embedding $v_{\text{new}}$ and fine-tune the weights of the decoder $D$. Furthermore, instead of using randomly initialized embeddings for the new speakers, as was done in [5], the new speaker's embedding is initialized with the mean of the embeddings $v_1, \dots, v_k$ of all $k$ speakers of the pre-trained network:

$$v_{\text{new}} = \frac{1}{k} \sum_{j=1}^{k} v_j \tag{3}$$

The same loss as in Eq. 2 is used, without reintroducing the training samples.
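The mean initialization of a new speaker's embedding amounts to a few lines. The sketch below uses a plain dictionary as the look-up table; the speaker names and the two-dimensional vectors are made up for illustration.

```python
def mean_embedding(embeddings):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(vec[d] for vec in embeddings) / count for d in range(dim)]


# Hypothetical look-up table with three already-fitted speakers.
speaker_lut = {
    "speaker_a": [0.0, 2.0],
    "speaker_b": [4.0, 6.0],
    "speaker_c": [2.0, 1.0],
}

# The new speaker starts from the centroid of the known embeddings and is
# then refined, together with the decoder weights, during fine-tuning.
speaker_lut["new_speaker"] = mean_embedding(list(speaker_lut.values()))
print(speaker_lut["new_speaker"])  # [2.0, 3.0]
```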

2.3 Application to TTS

By converting the voice of an existing (arbitrary) TTS robot, the method can be applied to adapt the TTS robot to new voices. The new voices are either the voices that the voice conversion network was trained on, or new voices that were fitted afterwards. It is not required to have samples of the underlying TTS robot during training, since the method converts from any voice to the target voices. This allows the flexibility of easily replacing the TTS robot by other robots, which may not be available at the time of training.

Once the voice conversion network is trained, the TTS robot can generate a sample for any text, and the voice conversion network can convert this sample to the target voice. Since the entire process does not require transcribed samples of the target speaker, the method can be trained on unconstrained text. However, only some of the characteristics of the speaker are captured since, for example, the speaking tempo is not modeled according to the target speaker.

The modularity of our approach, which separates the TTS process from the target speaker adaptation (skin), enables the user to train and deploy such skins privately on the client side, without the need to share voice data with cloud-based systems. Generating speech from text can be processed remotely, with the audio output sent over the wire to the client. The client can then convert the received audio using the target skin, which was previously trained and stored locally in a privacy-aware manner.

3 Experiments

To demonstrate our method, we train a single ASR-features-conditioned WaveNet decoder on the LibriSpeech corpus [6]. Specifically, we use the ’train-clean-100’ partition, which is composed of 100 hours of audio spoken by 251 speakers with an equal distribution between both genders. Each speaker has about 25 minutes of speech. We perform two sets of experiments. First, we evaluate our model’s ability to convert input speech to the set of speakers trained on. Second, we explore the model’s ability to adapt to new speakers through comparison with the top performing methods of the recent Voice Conversion Challenge [7] (VCC2018). Our samples can be found at

We use both user-study-based and automatic success metrics: (i) Mean Opinion Scores (MOS), on a scale of 1–5, which were computed using the crowdMOS [8] package. (ii) Mel cepstral distortion (MCD) between synthesized utterances and the reference utterances of the same speaker. Dynamic time warping is used to calculate the distortion for unaligned sequences. (iii) Speaker classification is employed to evaluate the capability of the method to synthesize distinguishable voices, similar to [9, 10, 11].
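For readers unfamiliar with the MCD metric, the following is a minimal sketch of MCD computed over a dynamic-time-warping alignment. It is not the evaluation code used in the experiments: it uses the common definition $\frac{10}{\ln 10}\sqrt{2\sum_d (c_d - c'_d)^2}$ per aligned frame pair, averages over the warping path (one of several conventions), and assumes the caller has already extracted mel-cepstral coefficients and dropped the energy coefficient. Both sequences must be non-empty.

```python
import math

# Standard MCD scaling factor, in dB.
MCD_SCALE = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_distance(a, b):
    """MCD between two mel-cepstral frames of equal length."""
    return MCD_SCALE * math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mcd_with_dtw(reference, synthesized):
    """Mean MCD along the minimum-cost DTW path between two sequences."""
    n, m = len(reference), len(synthesized)
    inf = float("inf")
    # cost[i][j] = (cumulative distance, path length) of the best alignment
    # of the first i reference frames with the first j synthesized frames.
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_total, best_len = min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
            step = frame_distance(reference[i - 1], synthesized[j - 1])
            cost[i][j] = (best_total + step, best_len + 1)
    total, length = cost[n][m]
    return total / length


reference = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same content, slower tempo
print(mcd_with_dtw(reference, stretched))  # 0.0
```

The toy example shows why the warping step matters: a tempo difference alone does not contribute to the distortion once the frames are aligned.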

Automatic speaker identification results are obtained by training a multi-class CNN on the ground-truth training set of multiple speakers and testing it on the generated samples. The network operates on the WORLD vocoder features [12] extracted from the input and employs five batch-normalized convolutional layers of filters with 128 ReLU-activated channels. This is followed by max-pooling, average pooling over time, and two fully-connected layers, ending with a softmax of size 251, corresponding to the number of speakers in the dataset.

Tab. 2 depicts the results of our method based on the success metrics described above. The metrics are evaluated on three settings: (i) conversion from speakers that were seen during training, (ii) conversion from speakers unseen during training - to this end we use LibriSpeech ’test-clean’ partition which includes 40 additional speakers, and (iii) conversion from an off-the-shelf state of the art TTS engine [13]. In all settings, we use the WaveNet autoencoder voice conversion system of [14] as baseline.

As can be seen, our full method improves all metrics over the autoencoder baseline in all three settings. Furthermore, adding F0 conditioning improves both the quality and the identifiability of our method's output, except when converting from the fixed TTS robot, where identification accuracy is slightly lower. Additionally, samples generated from voices seen in the training set are rated higher than samples generated by converting unseen speakers or the TTS voice.

3.1 Evaluation on the Voice Conversion Challenge 2018

The 2018 Voice Conversion Challenge (VCC2018) [7] evaluated 23 voice conversion systems on two tasks. The first is the Hub task, in which participants were asked to convert between speakers with parallel training corpora. The second is the Spoke task, for which participants were given training data for source and target speakers that contained different sets of utterances. Each task included four source speakers and four target speakers. Source speakers of the Hub task were all different from the source speakers in the Spoke task. Both tasks shared the same group of target speakers. The training data for each speaker (either source or target) is composed of 81 samples of recorded speech, totalling 4 to 5 minutes.

In each task, participants were required to transform 35 utterances from each source speaker to each target speaker. As a result, each task requires the generation of $4 \times 4 \times 35 = 560$ samples. Samples are evaluated subjectively for their quality and for their similarity to reference samples recorded by the target speakers. In our experiments, we used MOS for both quality and similarity evaluation. Quality of generated samples was rated on a scale of 1–5. Similarity was rated on a scale of 1–4, which included the amount of confidence, as described in Tab. 1. Similarity was rated by asking whether a generated sample is of the same speaker as the reference sample. Following the challenge, the organizers published the submitted samples, making it a suitable test bed for evaluating our method’s ability to adapt to new speakers based on a limited amount of data, as described above in Sec. 2.2.

Rating  Similarity  Confidence
1       Not same    Absolutely sure
2       Not same    Not sure
3       Same        Not sure
4       Same        Absolutely sure
Table 1: Description of the similarity ratings used in our evaluation of the 2018 Voice Conversion Challenge.
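As an illustration of how ratings on this scale turn into the "mean ± SD" figures reported in the result tables, consider the sketch below. The ratings are made up, the helper names are hypothetical, and the use of the sample standard deviation is an assumption; the actual scores were computed with the crowdMOS package.

```python
import statistics

# The four-point similarity scale of Tab. 1: (decision, confidence).
SIMILARITY_SCALE = {
    1: ("not same", "absolutely sure"),
    2: ("not same", "not sure"),
    3: ("same", "not sure"),
    4: ("same", "absolutely sure"),
}


def summarize(ratings):
    """Mean and sample standard deviation of a list of integer ratings."""
    return statistics.mean(ratings), statistics.stdev(ratings)


def fraction_judged_same(ratings):
    """Share of ratings whose decision is 'same', regardless of confidence."""
    same = sum(1 for r in ratings if SIMILARITY_SCALE[r][0] == "same")
    return same / len(ratings)


ratings = [4, 3, 3, 2, 4, 1]  # made-up listener ratings for one system
mean, sd = summarize(ratings)
print(f"{mean:.2f} ± {sd:.2f}, judged same: {fraction_judged_same(ratings):.0%}")
```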

Tab. 3 shows our results versus the best performing method of VCC 2018, which is known as N10 [15]. In addition, we include the results of N17 [16], which matched the similarity scores of N10 on the Hub task. As can be seen, our system performs on par with the best performing system of VCC 2018. This is despite our model's smaller capacity (128 WaveNet channels versus 256 WaveNet channels in the N10 method).

Data                    Method                MOS          MCD          Identification (%)
Source voice            Ground truth          4.42 ± 0.77  -            98.28
Seen during training    Full method           3.71 ± 1.03  8.49 ± 1.18  98.67
                        w/o F0                3.23 ± 1.13  8.52 ± 1.20  97.56
                        Autoencoder baseline  2.84 ± 1.27  9.39 ± 1.33  33.96
Unseen during training  Full method           3.65 ± 1.02  8.57 ± 1.17  98.51
                        w/o F0                3.23 ± 1.05  8.60 ± 1.23  97.62
                        Autoencoder baseline  2.76 ± 1.27  9.53 ± 1.33  32.04
TTS robot               Unconverted           4.24 ± 0.76  9.63 ± 1.00  -
                        Full method           3.56 ± 1.00  8.00 ± 0.85  95.36
                        w/o F0                3.03 ± 0.98  8.09 ± 0.88  96.27
                        Autoencoder baseline  3.10 ± 1.13  8.79 ± 0.91  41.18
Table 2: Test scores (mean ± SD) for LibriSpeech. For MOS and identification accuracy, higher is better; for MCD, lower is better.
                Hub                        Spoke
Method          MOS          Similarity    MOS          Similarity
GT source       4.54 ± 0.72  1.90 ± 1.02   4.48 ± 0.73  1.92 ± 1.06
GT target       4.43 ± 0.77  3.89 ± 0.39   4.43 ± 0.77  3.89 ± 0.39
N10 [15]        3.95 ± 0.87  3.15 ± 0.86   3.94 ± 0.89  2.92 ± 0.90
N17 [16]        2.92 ± 0.95  2.95 ± 0.91   2.63 ± 0.92  2.68 ± 0.96
Ours            3.90 ± 0.88  3.10 ± 0.88   3.99 ± 0.90  2.82 ± 0.94
Table 3: MOS and similarity scores for the 2018 Voice Conversion Challenge (mean ± SD; higher is better).

4 Related Work

The applicability of a voice conversion technique to a given application depends not only on factors such as the expected quality and the computational complexity, but also on the type of supervision the method requires and the quantity of data needed to fit a target speaker. Unlike many recent methods, our method does not require parallel training data between speakers, and the reference source voice need not be available during training. The amount of data required varies between the phases of training. Since a high capacity WaveNet decoder is used, considerable (unlabeled) data is required for the training voices used in the initial training phase, in which the network and the look up table are trained. Then, when fitting additional voices, much less data is required.

The method of [15] obtains data efficiency by performing the conversion on a shared intermediate vocoder representation. This intermediate representation is estimated from linguistic features and decoded with a WaveNet. The WaveNet decoder is generic (from the vocoder features to waveform). However, during fitting, both the part that maps the linguistic features to the vocoder features and the WaveNet are fitted to each new speaker. In contrast to our system, which uses an off-the-shelf ASR network, the linguistic features in [15] are trained (on a proprietary dataset) together with the subsequent networks.

The topic of data efficiency was recently also studied in the supervised TTS case, where the training set contains both audio and text [4]. Their conclusion, similarly to ours, is that it is beneficial both to optimize the embedding vector of the new speaker and to adapt the weights of the network itself.

Many voice conversion methods in the literature that require parallel data between the speakers, which is a limiting factor by itself, also require the parallel samples to be aligned in time. This requirement can be overcome in an iterative fashion, by first performing a nearest neighbor search of parallels and then refining the conversion module. Erro et al. [17] have shown that repeating the matching process with the gradually improving conversion module leads to progressively better results. A similar iterative process was used by Song et al. [18], in which the transformation takes the form of adapting GMM coefficients. ASR features were used by Xie et al. [19] in order to align multiple speakers in the latent space of the ASR system. Such an alignment relies on the ability of such features to be (mostly) speaker-agnostic. Once the alignment was formed, it was used in order to convert between voices, by concatenating matching fragments, i.e., in a unit-based approach.

In this work, we rely on a trained ASR network in order to obtain features. The ASR features are expected to be mostly orthogonal to the speaker identity. An alternative approach would be to use speaker identification features in order to represent the source and target voices and perform the transformation between the two. I-vectors are a GMM-based representation, which is often used in speaker identification or verification. Kinnunen et al. [20] have aligned the source and target GMMs by comparing the i-vectors of the two speakers, without using transcription or parallel data. Unlike our method, the reference speaker is known at training time, and their method employs an MFCC vocoder, which limits the output’s quality in comparison to WaveNets. Speaker verification features were also used to embed speakers by Jia et al. [21]. In this case, the speaker embedding is based on a neural classifier, and is used within a Tacotron 2 [22] TTS framework, which employs a WaveNet decoder.

While in our method the encoder is pre-trained for the purpose of ASR, other voice conversion methods employ a neural autoencoder. Specifically, many methods are based on variational autoencoders [23]. Hsu et al. [24] employ, similarly to us, one encoder and a parameterized decoder, where the parameters represent the identity. In a follow-up work [25], a WGAN [26] loss term was added in order to improve the naturalness of the resulting audio. In these variational autoencoder works, spectral frames are used, while our WaveNet decoder uses the waveform.

An audio conversion method based on a waveform WaveNet autoencoder was presented in [14]. Although it was trained mostly to convert between musical domains, voice conversion results are also presented. Unlike our method, the encoder is learned and multiple decoders are used, one per speaker. A single decoder that, similarly to ours, is conditioned on the identity of the target speaker was used in [27]. While we employ a look up table that stores continuous speaker embeddings, their work employs a one-hot encoding. More importantly, the approach for obtaining a speaker-agnostic representation is different. In [27], the capacity of the latent space was reduced by employing a discrete representation, while in [14], a domain confusion loss is used. The latter was shown in [14] to produce results of a better quality. In our work, we employ neither approach, and instead rely on the speaker-agnostic nature of the ASR features.

5 Discussion

This work enables customization of text-to-speech engines to an unlimited number of speakers. We show that combining a neural encoder obtained from an automatic speech recognition network with a neural audio decoder achieves promising results. The alternative methods struggle to disentangle the speaker identity at the encoder level, while an ASR-based encoder is trained to be speaker-agnostic.

In addition to enabling multiple voice personas, the new technology can be readily used to create voice effects, e.g., to convert one's voice to a whimsical voice. Being able to generate a large variety of voices is also beneficial to the study of identifying manipulated audio.


  • [1] O. Kuchaiev, B. Ginsburg, I. Gitman, V. Lavrukhin, J. Li, H. Nguyen, C. Case, and P. Micikevicius, “Mixed-precision training for nlp and speech recognition with openseq2seq,” 2018.
  • [2] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” CoRR, vol. abs/1609.03193, 2016.
  • [3] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, “Neural audio synthesis of musical notes with wavenet autoencoders,” arXiv preprint arXiv:1704.01279, 2017.
  • [4] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, C. Gulcehre, A. van den Oord, O. Vinyals, and N. de Freitas, “Sample efficient adaptive text-to-speech,” in International Conference on Learning Representations, 2019.
  • [5] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” arXiv preprint arXiv:1806.04558, 2018.
  • [6] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 5206–5210.
  • [7] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262, 2018.
  • [8] F. P. Ribeiro, D. Florencio, C. Zhang, and M. Seltzer, “CROWDMOS: an approach for crowdsourcing mean opinion score studies,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 05 2011, pp. 2416–2419.
  • [9] S. Arik et al., “Deep voice 2: Multi-speaker neural text-to-speech,” in NIPS, 2017.
  • [10] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop,” in ICLR, 2018.
  • [11] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” ICML, 2018.
  • [12] M. Morise, F. Yokomori, and K. Ozawa, “World: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
  • [13] Google Cloud TTS robot, “en-US-Wavenet-E,” 2018.
  • [14] N. Mor, L. Wolf, A. Polyak, and Y. Taigman, “A universal music translation network,” in International Conference on Learning Representations, 2019.
  • [15] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, “Wavenet vocoder with limited training data for voice conversion,” Interspeech, 2018.
  • [16] Y.-C. Wu, P. L. Tobing, T. Hayashi, K. Kobayashi, and T. Toda, “The nu non-parallel voice conversion system for the voice conversion challenge 2018,” WORLD, vol. 2, no. m3, p. a1, 2018.
  • [17] D. Erro, A. Moreno, and A. Bonafonte, “INCA algorithm for training voice conversion systems from nonparallel corpora,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, 2010.
  • [18] P. Song, Y. Jin, W. Zheng, and L. Zhao, “Text-independent voice conversion using speaker model alignment method from non-parallel speech,” INTERSPEECH, 2014.
  • [19] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in INTERSPEECH, 2016.
  • [20] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector plda: towards unifying speaker verification and transformation,” in ICASSP, 2017.
  • [21] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018, pp. 4485–4495.
  • [22] J. Shen et al., “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions,” in ICASSP, 2017.
  • [23] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in ICLR, 2014.
  • [24] C. Hsu et al., “Voice Conversion from Non-parallel Corpora Using Variational Auto-Encoder,” in APSIPA, 2016.
  • [25] C. Hsu et al., “Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks,” in INTERSPEECH, 2017.
  • [26] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein Generative Adversarial Networks,” in ICML, 2017.
  • [27] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” in NIPS, 2017.