Hierarchical Sequence to Sequence Voice Conversion with Limited Data


We present a voice conversion solution using recurrent sequence to sequence (seq2seq) modeling with DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). The problem consists of converting between voices in a parallel setting, where (source, target) audio pairs are available. Our seq2seq architecture makes use of a hierarchical encoder to summarize input audio frames. On the decoder side, we use an attention based architecture used in recent TTS works. Since there is a dearth of the large multispeaker voice conversion databases needed for training DNNs, we resort to training the network with a large single speaker dataset as an autoencoder. This network is then adapted to the smaller multispeaker datasets available for voice conversion. In contrast with other voice conversion works that use F0, duration and linguistic features, our system uses mel spectrograms as the audio representation. Output mel frames are converted back to audio using a wavenet vocoder.

Praveen Narayanan, Punarjay Chakravarty, Francois Charette, Gint Puskorius

Ford Greenfield Labs, Palo Alto, CA


Index Terms: voice conversion, seq2seq, TTS, ASR, DNNs, attention

1 Introduction

Recently, sequence to sequence models have been adapted with great success to produce realistic sounding speech in TTS systems [1, 2, 3, 4, 5]. Likewise, it has been demonstrated that ASR can be handled effectively by seq2seq architectures. In TTS, the system takes in a text or phoneme sequence and produces a speech representation as output. In ASR, on the other hand, one feeds in an audio representation, and the system performs the task of classifying audio into text or phonemes. In voice conversion, both the input and output sequences are audio representations. The problem is related to both ASR and TTS: as in ASR, the DNN must learn to summarize input audio frames into a hidden context, and as in TTS, it must decode audio frames from the latent context in a temporal, attentive fashion.

In voice conversion, we seek to convert a speech utterance from a source speaker A to make it sound like an utterance from a target speaker B. There are two pertinent scenarios, the first of which is when both the source and target speakers are uttering the same text (the ’parallel’ case), and the second is when the utterances don’t match (the ’non-parallel’ case). We focus on parallel voice conversion in this work with DNNs. While the larger goal of this work is to address the more important problem of non-parallel voice conversion (producing parallel datasets for conversion is not easy), we start with the arguably simpler task of demonstrating how we can achieve this in the parallel scenario using seq2seq models.

While we could approach the voice conversion task by first performing ASR on the source voice and then sending the resulting text to a TTS engine, our approach leads to an end-to-end solution wherein one doesn’t have to train ASR and TTS engines separately. Our approach has a simpler processing pipeline, as it only needs audio recordings (with no accompanying text or need for segmentation), which are converted directly to target representations. A notable aspect of this work is that it gets around the problem of limited parallel data for voice conversion by pretraining on a much larger, single speaker TTS corpus as an autoencoder and then performing transfer learning on the available, diminutive voice conversion datasets. Without this adaptation technique, it becomes difficult to carry out voice conversion effectively without access to larger, expensive to obtain parallel datasets.

We train the system using Maximum Likelihood to minimize the L1 error between the generated and target mel spectrograms.

Figure 1: System Diagram: Our Attention based Encoder-Decoder architecture for Voice Conversion takes in a mel-spectrogram for the source speaker and outputs the mel-spectrogram for the target speaker.

2 Related Work

The traditional pipeline for parallel voice conversion is through use of Gaussian Mixture Models (GMMs) [6, 7, 8] or Deep Neural Networks (DNNs) [9, 10, 11, 12]. After first aligning source and target features using Dynamic Time Warping (DTW)[13], the model is trained so that it learns to produce the target given the source features for each frame. A disadvantage of these methods is that they need aligned, parallel data. Moreover, conversions performed on spectral features disregard dependencies on other controlling factors such as prosody and fundamental frequency, duration and rhythm. Furthermore, transforming features on a frame basis disregards temporal context dependencies. The dependence on fundamental frequency is often handled by performing a linear transformation in the logarithmic domain. A good review article of the topic is found in [14].
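As an illustration of the alignment step that these frame-based methods depend on, the following is a minimal sketch of Dynamic Time Warping between two feature sequences. The function name `dtw_path`, the use of an L1 frame distance, and the representation of frames as plain Python lists are assumptions made for this example and are not taken from the cited works.

```python
def dtw_path(source, target):
    """Align two sequences of feature vectors with dynamic time warping.

    Returns the total alignment cost and a list of (source_idx, target_idx)
    pairs describing which frames are matched to each other.
    """
    def dist(a, b):
        # Assumed frame distance: sum of absolute differences (L1).
        return sum(abs(x - y) for x, y in zip(a, b))

    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i source frames
    # with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


total, path = dtw_path([[0.0], [1.0], [2.0]], [[0.0], [0.0], [1.0], [2.0]])
```

In this toy case the first source frame is matched to two target frames, which is the kind of time stretching that a frame-by-frame conversion model needs resolved before training.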

Non-negative Matrix Factorization (NMF) [15], traditionally used for sound source separation and speech enhancement, has also been used for VC [16]. NMF factorizes a matrix into two non-negative factors, the basis or dictionary matrix and the activation matrix. In the case of VC with parallel training data, dictionaries for speaker 1 and speaker 2 are first constructed separately. Subsequently, given test source data (speaker 1), the previously learnt dictionary for speaker 1 is used to factorize the source voice into a set of source activations, or contributions of speaker 1 dictionary to speaker 1 utterance. The same is done for speaker 2. The activations for the source speaker utterance are then combined with the dictionary atoms of the target speaker utterance to transform speaker 1 utterance into speaker 2. NMF based methods, like GMMs, also require alignment of parallel voice samples using Dynamic Programming, and other pre-processing steps like the Short-term-Fourier-Transform (STFT).

Recent sequence to sequence modeling approaches for voice conversion have largely been inspired by advances in seq2seq practice in NMT, TTS and ASR, in that they involve an encoder-decoder model as the underlying machinery. It is often advantageous to classify the input waveform into text or phonemes, and use that information to inform the decoder model of the content that the input audio representation embodies [17, 18]. Our work is most similar to [17], and we compare and contrast salient aspects of both models. In both models, the overall architecture is a seq2seq model inspired by the ASR work [19] (with a hierarchical encoder stack), and the TTS work Tacotron [1] and derivatives. However, in [17], the encoder outputs are augmented by features extracted with an ASR model, while our approach comprises end-to-end neural networks without need for labeled data. There are also several ancillary components such as the use of additional losses and postprocessing networks. Also, we use a convolutional filter bank and Highway network to extract ’context’ from the input frames as a prelude to processing them in the multilayer hierarchy of encoder RNNs. Nevertheless, we wish to emphasize that a substantive difference between the two works is the training philosophy, in how the data limitation problem is handled. We elaborate on this in the following paragraphs with additional examples from the literature.

Pertinent to our discussion are seq2seq modeling works [20, 21]. In these works, additional loss terms are introduced to encourage the model to learn alignment and to preserve linguistic context. Alignment is maintained by noting that the attention curve is predominantly diagonal (in the voice conversion problem) between source and target, and including in the loss function a diagonal penalty matrix - a term referred to as guided attention in the TTS work [22]. An additional consideration is to prevent the decoder from ’losing’ linguistic context, as would arise when it simply learns to reconstruct the output of the target. This was addressed by using additional neural networks that ensure that the hidden representation produced by the encoder (similar reasoning applies to the decoder) was capable of reconstructing the input, and thereby retained context information. These manifest as additional loss terms - we also glean a similarity to cycle consistency losses [23] - that they call ’context preservation losses’. Also noteworthy is that these approaches use non-recurrent architectures for their seq2seq modeling.
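To make the diagonal penalty concrete, the following is a minimal sketch in the spirit of the guided attention term of [22]: a penalty matrix that is close to zero near the diagonal and grows away from it, multiplied elementwise with the attention matrix. The function names and the width parameter `g=0.2` are assumptions for this illustration and are not claimed to match the settings used in [20, 21].

```python
import math


def guided_attention_penalty(n_source, n_target, g=0.2):
    """Penalty matrix: ~0 on the (normalized) diagonal, approaching 1 far from it."""
    return [
        [1.0 - math.exp(-((n / n_source - t / n_target) ** 2) / (2.0 * g * g))
         for t in range(n_target)]
        for n in range(n_source)
    ]


def guided_attention_loss(attention, penalty):
    """Mean of the attention weights multiplied elementwise by the penalty."""
    total = sum(
        a * w
        for a_row, w_row in zip(attention, penalty)
        for a, w in zip(a_row, w_row)
    )
    return total / (len(attention) * len(attention[0]))


penalty = guided_attention_penalty(4, 4)
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
anti_diagonal = [[1.0 if n + t == 3 else 0.0 for t in range(4)] for n in range(4)]
loss_diag = guided_attention_loss(diagonal, penalty)
loss_anti = guided_attention_loss(anti_diagonal, penalty)
```

A perfectly diagonal alignment incurs no penalty, while an alignment that strays from the diagonal is penalized, which is what nudges the model toward monotonic attention.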

We suspect that the problems that motivated the design of these additional losses have their provenance in the diminutive size of the training corpus (CMU Arctic was used, with a small number of utterances per speaker), which is hardly sufficient to learn a diverse representation with good generalization capabilities. In our work, we arrive at a slightly different way to overcome the data limitation problem as compared with these works, which do so by augmenting data with ASR training [17, 18], and by introducing additional losses [20, 21]. Our solution makes use of transfer learning by first pretraining with a large, single speaker corpus, and then adapting to the smaller, pertinent corpus (CMU Arctic) in question.

Developments in the generative modeling (primarily, Variational Autoencoders [24] and Generative Adversarial Networks [25]) front have led to their use in voice conversion problems. In [26], a learned similarity metric obtained through a GAN discriminator is used to correct oversmoothed speech that results from maximum likelihood training, which imposes a particular form for the loss function (usually the MSE). A conditional VAEGAN [27] setup is used in [28] to implement voice conversion, with conditioning on speakers, together with a Wasserstein GAN discriminator [29] to fix the blurriness issue associated with VAEs. Moreover, an important apparatus that is of use in training non-parallel voice setups consists of Cycle Consistency Losses from the famous CycleGAN [23] work for images. This forms a building block in the papers [30] and [31].

A natural extension to our work is to explore a generative solution to Voice Conversion as in some of the works above, in order to apply our architectural components to non-parallel setups.

Our work is influenced by recent TTS works involving transfer learning and speaker adaptation. The recently published work [32] demonstrates a methodology to adapt a trained network for new speakers with a wavenet. Likewise, in [33], a speaker embedding is extracted using a discriminative network for unseen, new speakers, which is then used to condition a TTS pipeline similar to Tacotron. This philosophy is also used in [34], where schemes are used to learn speaker embeddings extracted separately or trained as part of the model during adaptation. In all these contexts, it is emphasized that the onus is on adapting to small, limited data corpora, thereby circumventing the need to obtain large datasets to train these models from scratch. In our work, we use the same idea to get around the problem of not having enough data to train on in the voice conversion dataset under consideration. However, instead of producing new speaker embeddings, we retrain the model for each new pair, a process that is rapid owing to the small size of the corpus.

An interesting alternative to using recurrent (or autoregressive) seq2seq modeling for TTS or VC is to use differential memory as a way to store speech related information. In the VoiceLoop architecture [35, 36, 37], the input is transformed with a shallow fully connected network into a context, with attention being used to compare with the memory buffer. The memory buffer itself is updated by replacing its first element with a newly computed representation vector. With this approach, which also uses speaker embeddings, the network is able to adapt to new speakers with only a few samples, in addition to having a much reduced network complexity (only shallow fully connected layers are used).

3 Architecture

We use an attention based encoder-decoder network for our voice conversion task. The network architecture borrows heavily from recent developments in TTS [1] and ASR [19]. The system takes in an audio representation (mel-spectrogram) as input, and encodes it into a hidden representation in recurrent fashion. This hidden representation is then processed by an attention based decoder into output mel-spectrograms. In order to convert the mel frames back to audio, we employ a widely used wavenet vocoder implementation available online [38]. In the Tacotron2 [2] work, it was demonstrated that using wavenet as a neural vocoder produced audio samples whose quality was superior to those from the Griffin-Lim procedure used in Tacotron [1].

A system diagram showing the various components of the model is shown in Figure 1. We describe its components in the following subsections.

3.1 Encoder

3.1.1 Prenet

The prenet is a bottleneck layer consisting of full connections with a ReLU nonlinearity and dropout [1, 39]. The purpose of this layer is to enable the model to generalize to unseen input through dropout. Other mechanisms to achieve this effect in sequence models are teacher forcing, scheduled sampling and professor forcing [40, 41, 42]. The prenet processes vectors of 80 dimensions to yield output of the same size. A dropout ratio of 0.5 is used (Table 1).
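A minimal sketch of one prenet layer (full connection, ReLU, dropout) is given below. The function name `prenet_layer`, the use of inverted dropout scaling, and the tiny 2-dimensional weights are assumptions for illustration only; the model itself operates on 80-dimensional frames.

```python
import random


def prenet_layer(x, weights, bias, dropout=0.5, training=True, rng=None):
    """Fully connected layer followed by ReLU and (inverted) dropout."""
    rng = rng or random.Random(0)
    activations = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]
    if not training or dropout == 0.0:
        return activations
    keep = 1.0 - dropout
    # Zero each unit with probability `dropout`; rescale the survivors.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


identity = [[1.0, 0.0], [0.0, 1.0]]
eval_out = prenet_layer([1.0, -1.0], identity, [0.0, 0.0], training=False)
train_out = prenet_layer([1.0, -1.0], identity, [0.0, 0.0], training=True)
```

With dropout disabled the layer reduces to a plain ReLU projection; with dropout enabled each surviving unit is rescaled so that the expected activation is unchanged.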

3.1.2 CBH: Convolutional Banks and Highway layers

Figure 2: The Pre-net and the CBH layers that are used to process the input mel-spectrogram frames. Output tensor sizes at each step of processing are indicated by the side of the unit.

Originally proposed in the context of Neural Machine Translation [43] and later used in [1] where it was named CBHG (Convolutional Banks, Highway and Gated Recurrent Units), this layer served as a processing mechanism to accumulate ’word’ level context when the input is text. For our voice conversion task, the effect is similar, in that neighboring speech frames are filtered so as to abstract the equivalent phoneme level representation, mixed with speaker characteristics and prosodic content. Together with the hierarchical RNN encoder units described later, we could view this assemblage as an implementation of CBHG. The Pre-net and CBH layers, along with the tensor output sizes at each step of the processing are shown in Figure  2 and described below.

Convolutional Filter Banks A bank of 1-D convolutional filters of varying widths is used to capture n-grams of varying width. The input sequence is convolved with each of these filters, and the results from all filters are concatenated together. Each convolution preserves the original length of the sequence through padding chosen according to the filter width, and is followed by BatchNorm and ReLU operations. The filter maps obtained are then stacked in the channel dimension. This is followed by a max-pool operation with stride 1, which maintains the length of the sequence. A 1-D convolutional projection operation then reduces the stacked representation to the original size, followed by a final linear layer that also maintains representation length.
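The length-preserving behaviour of the filter bank and of the stride-1 max-pool can be sketched as follows. This is an illustration only: fixed averaging kernels stand in for the learnt filters, a single input channel is used, BatchNorm and ReLU are omitted, and the function names and the total padding of (width - 1) are assumptions for this example.

```python
def conv1d_same(signal, kernel):
    """1-D correlation with zero padding so the output has len(signal) entries."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left          # total padding is k - 1 for stride 1
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


def conv_bank(signal, max_width):
    """Apply filters of widths 1..max_width and stack the outputs as channels."""
    return [conv1d_same(signal, [1.0 / k] * k) for k in range(1, max_width + 1)]


def maxpool_stride1(channel, width=2):
    """Max-pool with stride 1 over the current and preceding samples."""
    return [
        max(channel[max(0, t - width + 1): t + 1])
        for t in range(len(channel))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
bank = conv_bank(signal, 3)                 # 3 channels, each of length 4
pooled = maxpool_stride1([1.0, 3.0, 2.0, 5.0])
```

Every channel keeps the input length, so the outputs of filters with different widths can be stacked along the channel dimension without any cropping.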

Highway Layers The Highway layer is like a Resnet block with a skip connection that is a shortcut for information flow that skips the intermediate layers, but with learnable weights to determine the extent of the information skip. We use 4 highway layers.
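A single highway layer can be sketched as a gated mixture of a transformed input and the input itself. The sketch below assumes the common formulation with a ReLU transform and a sigmoid gate; the function names and the toy weights are hypothetical and chosen only to show the two extremes of the gate.

```python
import math


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with H = ReLU and T = sigmoid."""
    h = [max(0.0, v) for v in linear(w_h, b_h, x)]
    t = [1.0 / (1.0 + math.exp(-v)) for v in linear(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# A strongly negative gate bias closes the gate: the input is passed through.
carried = highway_layer(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
# A strongly positive gate bias opens the gate: the transformed input is used.
transformed = highway_layer(x, identity, [0.0, 0.0], zeros, [50.0, 50.0])
```

The learnt gate thus interpolates between the skip connection of a residual block and an ordinary nonlinear layer.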

3.1.3 Hierarchical Recurrent Encoder

We design our encoder as a stack of bidirectional layers, reducing the sequence length by a factor of 2 at each reduction layer as the data flows up the stack (Figure 3). This construction was first proposed in [19] in the context of speech recognition with DNNs. The encoder’s task is to summarize audio input to an intermediate hidden representation embodying linguistic content, akin to text. However (and this might be argued as a desirable attribute of DNN processing), we make no explicit attempt to disentangle content (text) and voice characteristics (style) in this case. We assume that the DNN automatically learns to disentangle content and style as part of the training process, and that during the decoding process, the first speaker’s voice characteristics are discarded and the second speaker’s voice is injected into the content.

The reasoning behind using a hierarchical reduction of timesteps is that speech frames are highly inflated, redundant descriptors of linguistic content mixed with speaker and duration information. A single phoneme could thus span several (10) frames. It therefore makes sense to reduce or cluster the speech frames so as to contain more relevant information. By reducing the number of timesteps, we are implicitly performing this clustering operation to distill the pertinent linguistic content at the top of the stack. The reduction in timesteps is also favorable as regards learning attention, the rationale being that as the decoder examines all the frames of the encoder to extract attention parameters, it is useful to aggregate relevant information so that it has a smaller set to work with, which helps in speeding up the computation and in helping the model learn alignment.

In order to reduce the number of input timesteps, we concatenate each pair of neighboring frames, and then pass the concatenated features along to the bidirectional RNN layer above. In our experiments, we use a stack of 2 recurrent reduction layers, resulting in an overall reduction in the number of timesteps by a factor of 4.

The basic unit of the hierarchical recurrent encoder is the bi-directional GRU. This bi-directional Gated Recurrent Unit (shown in the diagram as separate forward and backward units) passes over the input sequence twice, from left to right and from right to left, and concatenates the two passes. Each GRU has 150 hidden units and outputs a 150xT matrix, where T is the input sequence length. After concatenation of the forward and backward outputs, one gets a 300xT matrix. The sequence length itself is reduced to T/2 after GRU1 and to T/4 after GRU2 as a result of accumulating 2 neighbouring frames at each step. GRU0 is a pre-processing recurrent unit that does not have this accumulation and reduction of time steps. The details of the pyramidal encoding, with tensor sizes after each step, are shown in Figure 3.
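The time reduction described above can be sketched with a shape-only illustration. The `placeholder_bigru` function below is a stand-in that merely reproduces the output size of a bidirectional GRU with 150 units per direction; it is not a recurrent network. Dropping a trailing odd frame is also an assumption of this sketch.

```python
def stack_neighbours(frames):
    """Concatenate each pair of neighbouring frames, halving the sequence length."""
    if len(frames) % 2 == 1:
        frames = frames[:-1]      # assumption: drop a trailing odd frame
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]


def placeholder_bigru(frames, hidden=150):
    """Stand-in for a bidirectional GRU: same length, 2 * hidden features per step."""
    return [[sum(f) / len(f)] * (2 * hidden) for f in frames]


def pyramid_encoder(frames, reduction_layers=2):
    out = placeholder_bigru(frames)             # GRU0: no time reduction
    for _ in range(reduction_layers):           # GRU1 and GRU2
        out = placeholder_bigru(stack_neighbours(out))
    return out


mel = [[0.0] * 80 for _ in range(16)]           # T = 16 frames of 80 features
encoded = pyramid_encoder(mel)                  # T/4 = 4 steps of 300 features
```

Each reduction layer receives 600-dimensional inputs (two stacked 300-dimensional steps) and returns 300-dimensional outputs at half the rate, so the decoder attends over a quarter as many encoder steps.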

Figure 3: Hierarchical Bi-directional Recurrent Encoder with an indication of the tensor sizes at each step. The number of hidden units in each GRU is 150. Each pyramidal GRU unit (GRU 1 and 2) decreases the sequence length by 1/2. Left-right and right-left GRU units each output a 150xT matrix, that are concatenated to give a 300xT matrix, with T as input sequence length.

3.2 Attention Decoder

The decoder architecture is inspired by the Tacotron TTS setup [1]. As in the Tacotron work, the decoder has the following components:

  1. Prenet

  2. Attention RNN

  3. Decoder RNNs with residuality

We describe the components in more detail below. However, before doing so, it is useful to have in mind an overall picture of how the data flows through the decoder stack. To that end, we present a brief description of the calculations at a high level.

The decoder’s task is to transform linguistic content from the source speaker to that of the target speaker in a temporal way, conditioned on frames generated previously. The linguistic content is provided by the hierarchical encoder described previously, which condenses the source speaker’s utterances into an intermediate hidden representation embodying linguistic content. The decoder’s task is therefore to ingest this linguistic content, and imbue it with the target speaker’s voice characteristics. In the current setup, the DNN implicitly adds the target speaker’s voice characteristics (i.e. duration, pitch) to the encoder summary. It is designed as a complex stack of unidirectional RNNs trained to emit output spectrogram frames conditioned on all previous frames emitted, together with the encoder’s representation of the context.

The attention modeling ensures that the target’s spectrogram frames are aligned with the appropriate frames of the input. Attention computations are ubiquitous in sequence to sequence modeling. While decoding output sequence frames - this could be in any general sequence modeling task, such as NMT, ASR, or TTS - attention helps to focus on the appropriate frame of the input sequence so that the decoder is able to decide what it should emit in a more precise way. This aspect is especially important when the sequence length becomes large, for the decoder’s task becomes much more difficult in emitting sequential output based on a single, global context that the encoder provides. Moreover, it is seen in experiments that attention modeling is essential for the system to generalize to unseen input. Our experiments seem to be in line with the notion that for the speech model to perform well on unseen data, it is in fact necessary for the model to learn proper alignment.

We now proceed to describe in more detail the components of the decoder.

3.2.1 Prenet

As with the encoder, we transform target data through a set of bottleneck layers (two in total) using dropout. We use dropout in order to regularize the model and prevent overfitting, and hence it is an essential component. We use a stack of two prenet layers (full connections with ReLU non-linearity and a dropout ratio of 0.5, as listed in Table 1).

3.2.2 The Hybrid Content-Location Attention Model

We use attention modeling [44, 45, 2] as a way of focusing the generator on the most relevant section of the input sequence. We have a state sequence output by the encoder (hidden) units at the top of the hierarchical stack: $h = (h_1, \ldots, h_L)$. The sequence output by the decoder units is $s = (s_1, \ldots, s_T)$. The input spectrogram sequence is $x = (x_1, \ldots, x_{T_x})$ and the output spectrogram sequence is $y = (y_1, \ldots, y_T)$. At the $i$th step of the generation process, the recurrent sequence generator (the RNN) generates state $s_i$ by using the $y$’s up to that point, the previous state $s_{i-1}$ and the hidden encoder output $h$. The attention model is used to inform the generator which encoder states are important for the generation of this $s_i$. This is done with an attentional neural network, which learns to produce the attention or alignment vector $\alpha_i$, a vector of normalized importance weights used to weight the hidden encoder states $h_j$. This is then used to produce the context vector $c_i$, which is a weighted sum of the encoder states $h_j$:

$$c_i = \sum_{j=1}^{L} \alpha_{i,j} h_j \qquad (1)$$

The context vector $c_i$, concatenated with the spectrogram output prediction of the previous time step $y_{i-1}$, is used to condition the production of the decoder output $y_i$ for the current time step $i$.

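The attention vector $\alpha_i$ is obtained by softmax normalization (to between 0 and 1), with a temperature parameter, over the scores $e_{i,j}$:

$$\alpha_{i,j} = \frac{\exp(\beta e_{i,j})}{\sum_{k} \exp(\beta e_{i,k})} \qquad (2)$$

where $\beta$ is the softmax temperature that sharpens the attention [45].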
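The normalization of the scores and the computation of the context vector can be sketched as follows. The names `attention_weights`, `context_vector` and `beta` are assumptions for this illustration; `beta` is treated as a multiplier on the scores, so that values above 1 sharpen the distribution, following the sharpening idea of [45].

```python
import math


def attention_weights(scores, beta=1.0):
    """Softmax over the scores with a sharpening multiplier beta."""
    m = max(scores)                                   # subtract max for stability
    exps = [math.exp(beta * (e - m)) for e in scores]
    total = sum(exps)
    return [v / total for v in exps]


def context_vector(weights, encoder_states):
    """Weighted sum of the encoder states, one weight per encoder step."""
    dim = len(encoder_states[0])
    return [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]


scores = [1.0, 2.0, 0.5]
soft = attention_weights(scores, beta=1.0)
sharp = attention_weights(scores, beta=2.0)
context = context_vector([0.5, 0.5], [[1.0, 2.0], [3.0, 4.0]])
```

With a larger `beta`, more of the probability mass concentrates on the highest scoring encoder step, which is the sharpening effect referred to above.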

The scores, or un-normalized attention energies $e_{i,j}$, are the central part of the attention modeling, and are computed for each hidden encoder state $h_j$ separately. There are two ways of calculating these attention energies. Content based attention is dependent on the content, or encoder hidden state:

$$e_{i,j} = \mathrm{score}(s_{i-1}, h_j) \qquad (3)$$

Location based attention is dependent on the location of the previous generator state, or where the attention was previously focused:

$$e_{i,j} = \mathrm{score}(s_{i-1}, \alpha_{i-1}) \qquad (4)$$

This is normally implemented as a 1-D convolutional kernel (with learnt weights) centred around the previous position. We use a hybrid attention model, with both content and location based scoring. Location scoring is done by convolving the previous attention $\alpha_{i-1}$ with a learnt filter bank $F$, giving $f_i = F * \alpha_{i-1}$. This is then combined with content scoring:

$$e_{i,j} = w^{\top} \tanh(W s_{i-1} + V h_j + U f_{i,j}) \qquad (5)$$

where vector $w$ and matrices $W$, $V$, and $U$ are trainable weights, implemented as a feed-forward neural network. We use a form inspired by Luong’s multiplicative attention mechanism [44] to determine the mapping between hidden units and attention energies in equation 5.
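A minimal sketch of hybrid content and location scoring is given below. It assumes the additive (tanh) form of the hybrid attention in [45]; the exact scoring function used in our implementation may differ, and all names (`hybrid_scores`, `location_features`, `filters`) and the toy weights are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def location_features(prev_attention, filters):
    """Convolve the previous attention with each 1-D filter (length preserved)."""
    steps = len(prev_attention)
    features = []
    for j in range(steps):
        row = []
        for kernel in filters:
            half = len(kernel) // 2
            acc = 0.0
            for k, weight in enumerate(kernel):
                idx = j + k - half
                if 0 <= idx < steps:
                    acc += weight * prev_attention[idx]
            row.append(acc)
        features.append(row)
    return features


def hybrid_scores(prev_state, encoder_states, prev_attention, w, W, V, U, filters):
    """Energy for each encoder step from decoder state, content and location."""
    state_term = matvec(W, prev_state)
    loc = location_features(prev_attention, filters)
    scores = []
    for h_j, f_j in zip(encoder_states, loc):
        content_term = matvec(V, h_j)
        location_term = matvec(U, f_j)
        hidden = [
            math.tanh(a + b + c)
            for a, b, c in zip(state_term, content_term, location_term)
        ]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    return scores


zeros = [[0.0, 0.0], [0.0, 0.0]]
scores = hybrid_scores(
    prev_state=[0.3, -0.1],
    encoder_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    prev_attention=[0.0, 1.0, 0.0],
    w=[1.0, 1.0],
    W=zeros,                       # state and content terms switched off
    V=zeros,                       # to isolate the location term
    U=[[1.0], [1.0]],
    filters=[[0.0, 1.0, 0.0]],     # identity filter: f_j = previous attention at j
)
```

With the content and state terms switched off, the energy is highest at the step that was attended to previously, showing how the location term encourages the attention to stay near its previous position.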

Figure 4: The decoder RNN. Att represents the Attention RNN, RNNa and b represent the first and second layers of the decoder RNNs. Red arrows indicate residual connections and purple arrows indicate the generated output being fed back to the attention RNN (along with input) to generate the attention output. Output of the second decoder is transformed to the dimensions of the output spectrogram using a fully connected layer (Project).

3.3 Decoder RNNs with residuality

The attention RNN’s output is now processed by two RNN layers with residuality, before being transformed back to audio frames.

$$d_i^{(1)} = \mathrm{GRU}^{(1)}(a_i, d_{i-1}^{(1)}) + a_i \qquad (6)$$

$$d_i^{(2)} = \mathrm{GRU}^{(2)}(d_i^{(1)}, d_{i-1}^{(2)}) + d_i^{(1)} \qquad (7)$$

This is depicted in equations (6) and (7). Here, the superscripts represent the first and second decoder layers. The second term in each of these equations contains the residual signal from the input. In this case, $a_i$ represents the output from the attention RNN, and $d_i^{(1)}$ and $d_i^{(2)}$ denote the hidden units from the first and second decoder layers. We use the same number of dimensions (600) in all the decoder RNN layers.
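The residual structure of the two decoder layers can be sketched as below, where each layer adds its input to its recurrent output. The `simple_cell` function is a placeholder recurrence and not a GRU, and the variable names are assumptions made for this illustration.

```python
import math


def simple_cell(x, h_prev):
    """Placeholder recurrent cell (not a GRU): tanh of input plus previous state."""
    return [math.tanh(a + b) for a, b in zip(x, h_prev)]


def residual_decoder_step(attention_out, d1_prev, d2_prev):
    """One time step through two recurrent layers with residual connections."""
    # First layer: recurrent output plus its input (the attention RNN output).
    d1 = [u + v for u, v in zip(simple_cell(attention_out, d1_prev), attention_out)]
    # Second layer: recurrent output plus its input (the first layer output).
    d2 = [u + v for u, v in zip(simple_cell(d1, d2_prev), d1)]
    return d1, d2


d1, d2 = residual_decoder_step([1.0], [0.0], [0.0])
```

Because each layer only has to model a correction to its input, the attention output reaches the final projection largely intact, which tends to ease optimization of the stacked layers.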

Finally, the output of the last residual decoder layer is transformed back to the dimensions of the output (80 bins) by passing it through a fully connected layer followed by a ReLU non-linearity.

4 Autoencoder pretraining and transfer learning

Voice conversion with DNNs for parallel data is a difficult undertaking owing to the lack of large multispeaker voice conversion datasets. To get around this problem, we first pretrain our network as an autoencoder with a large single speaker TTS corpus [46], with the source and target voices being the same. After this network is trained (a guideline for this is to check whether the system has learned alignment), we adapt the network to the smaller, multispeaker voice conversion data.

Transfer learning can be seen as a way to mitigate data insufficiency problems in the speech domain. This is particularly pertinent owing to the lack of good quality speech datasets (large corpora with sufficient diversity) that can be obtained inexpensively.

The system is trained using the L1 loss between the generated and target mel spectrograms. The Adam optimizer is used with a learning rate of for the pretraining task, and for the voice adaptation task. The optimizer parameters were and respectively.
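For concreteness, the L1 training objective can be sketched as the mean absolute difference between predicted and target mel frames. The function name `l1_loss` and the assumption that both sequences have the same number of frames (as under teacher forcing) are choices made for this example.

```python
def l1_loss(predicted, target):
    """Mean absolute error over all frames and mel bins."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(predicted, target):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


loss = l1_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 4.0]])
```

During autoencoder pretraining the target is the input spectrogram itself, while during adaptation it is the spectrogram of the paired target-speaker utterance.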

5 Experimental setup

Our experimental procedure consists of two steps, as mentioned in section 4. We first pretrain the network with a large single-speaker corpus in which the source and the target are the same. After this, we allow the network to adapt to the desired source and target data.

5.1 Datasets

For autoencoder pretraining, we use the LJSpeech dataset [46]. This dataset contains short utterances from a single female speaker reading passages from audio books, with a total audio amounting to about hours recorded on a Macbook Pro in a home setting with a sampling rate of . The main task is to perform voice conversion (by adapting the pretrained network trained above) on the much smaller CMU Arctic dataset [47] containing utterances from several speakers. We used the male speakers ”bdl”, ”rms” and the female speakers ”clb” and ”slt” for experiments. The training/test/validation split was , and respectively. Since this corpus has a sampling rate of , we upsample this dataset to , generate audio through the pipeline and then downsample it back to the original sampling rate. This measure was adopted instead of downsampling the large corpus to the target sampling rate because we found that the system was unable to learn at the lower rate of .

Figure 5: Feature extractor, depicted through attention alignment and mel spectrograms produced by training the network to produce ljspeech voices, with source and target being the same.

5.2 Example Conversions

In Figure 5, we present visualizations of source and target spectrograms, conversion and alignment curve for the pretrained autoencoder feature extractor using the large LJSpeech corpus. The alignment curve in this case shows more decoder timesteps than encoder timesteps (by a factor of 4) because of the hierarchical encoding scheme, which reduces the number of timesteps in the encoder.

Figure 6: Voice conversion from male (bdl) to female (slt) voice, depicted through attention alignment and mel spectrograms produced by adapting to small CMU Arctic voice corpus.

In figure 6, we present corresponding visualizations for the transfer learning experiment wherein we convert from male (bdl) to female (slt) voice. Starting with a network whose weights are pretrained with the large LJSpeech corpus as an autoencoder, we allow the network to adapt to the smaller CMU-Arctic dataset, using paired training examples. As can be seen, while the conversion is plausible, the transfer learning spectrogram is somewhat ‘blurry’ owing to the limited amount of data and the use of the L1 loss, which makes the spectrograms appear oversmoothed. While the alignment curve is more or less linear, it has a few ’kinks’ (unlike the ljspeech curve) in keeping with the slight differences that arise in the alignment path as compared with the case where both source and target are the same.

Block     Component            Configuration
Encoder   Prenet               FC-80-ReLU-Dropout(0.5)
          CBH                  -BN-480-ReLU-Dropout(0.1)
                               Maxpool (stride=1) and stack
                               Highway layer stack of 4
          BiGRU0               300 cells (f+b); Dropout(0.1)
          Hierarchical BiGRU   2 layers, 300 cells (f+b)
Decoder   Prenet               FC-256-ReLU-Dropout(0.5)
          Attention GRU        600 cells; Dropout(0.1)
          Residual GRU 1,2     600 cells; Dropout(0.1)
Table 1: Network architecture hyperparameters. -c-BN-ReLU-Dropout(f) denotes a convolution of width , output channels, BatchNorm, ReLU, with a dropout of (=). --Dropout(f)-ReLU-Linear denotes a convolution of width , with output channels, dropout of (=), followed by ReLU and a linear projection to the same size output. Prenet layers are full connections (e.g. FC- would be a linear connection to an output of size ) but with dropout of . All other network components use a dropout of

5.3 Wavenet Implementation

We use a popular open source wavenet implementation [38] available online to recover audio from mel spectrograms. Wavenet is an autoregressive architecture [48] especially designed for audio generation. Related architectures have been used for generative modeling tasks in other domains: ByteNet [49] for text, PixelCNN [50] for images and Video Pixel Net [51] for videos. At a high level, this type of architecture works on a temporal basis (in the sense that there is a certain temporal ordering of data) by stacking dilated convolutions with exponentially growing receptive field sizes. Masking is carried out so as to only allow information from the past. In wavenet, instead of masking, one simply uses all the inputs from the past for the operations, as the data already has an implicit temporal order to it. The architecture also uses gating and skip connections to allow better information flow through the network stack.
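The growth of the receptive field with stacked dilated causal convolutions can be illustrated with a short sketch. The dilation schedule 1, 2, 4, 8, the kernel size of 2 and the all-ones kernel are assumptions chosen for illustration, and gating, skip connections and conditioning are omitted.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of past-and-present samples that influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(signal, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * signal[idx]
        out.append(acc)
    return out


dilations = [1, 2, 4, 8]
# Feed an impulse through the stack to see how far its influence spreads.
response = [1.0] + [0.0] * 19
for d in dilations:
    response = causal_dilated_conv(response, [1.0, 1.0], d)
```

The impulse influences exactly `receptive_field(dilations)` output samples and none before it, showing that doubling the dilation at each layer doubles the span covered while remaining causal.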

A drawback of this type of architecture is that while training is fast, inference is slow owing to the sample level autoregressive nature of the setup, in that every sample generated is conditioned on all previous samples; the upshot being that with raw audio ( samples per second), the calculations become extremely expensive. To alleviate these issues, themes from flow based generative modeling techniques (with some of the ideas originally proposed in order to improve the expressiveness of VAE priors [52, 53] by successively transforming them) were adapted for fast inference during the sampling stage [54, 55].

To use as a vocoder backend, we present the wavenet with mel-spectrograms as conditioning features. These are upsampled (with transpose convolutions) to match the target rate with upsampling layers. This network has layers with channels for residuals, gating and skips respectively. The setup uses mixtures of logistics to model the bit ( bins) raw waveform. This architecture was also used to compare against the WaveGlow implementation in [55]. A more extensive list of hyperparameters is available online [38].

6 Conclusions

In this work, we demonstrated a way to overcome data limitations (an all too common malady in the speech world) by pretraining the network on a large corpus to reconstruct the input voice, so that it learns to extract linguistic features. These features serve as a useful starting point for transfer learning on the limited data corpus. The proposed architecture is slightly elaborate, in that it hierarchically reduces the number of timesteps on the encoder side. The motivation is that the content embedded in the input waveforms, viewed as words or phoneme-like entities, is much smaller than the size of the waveforms (5 words vs 100 audio frames, 10 phonemes vs 100 audio frames, etc.). With this intuition, the hierarchical reduction in timesteps is viewed as a mechanism to extract phoneme-like entities by compressing the content in the input mel spectrogram. Our task is, in a sense, to extract a style independent representation on the encoder side. The decoder then learns to inject the target speaker’s content using exactly the same type of architecture as in the Tacotron works [1, 2]. The output spectrograms are converted back to audio using a wavenet vocoder, yielding plausible conversions, which suggests that the approach is viable.
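The effect of hierarchical timestep reduction on sequence length can be seen in a minimal sketch. It assumes a pyramid that concatenates pairs of adjacent frames at each level, in the spirit of the listener in [19]; the reduction factor of 2, the three levels and the 80 mel bins are assumptions, and the recurrent or convolutional layers that would sit between levels in a real encoder are omitted. This is not presented as the exact encoder used in this work.

```python
def reduce_timesteps(frames, factor=2):
    """Concatenate each group of `factor` consecutive frames into one vector.

    A trailing group with fewer than `factor` frames is dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [v for frame in frames[i:i + factor] for v in frame]
        for i in range(0, usable, factor)
    ]


def pyramid(frames, levels, factor=2):
    """Apply `levels` reductions and record (timesteps, feature size) per level."""
    shapes = [(len(frames), len(frames[0]))]
    for _ in range(levels):
        frames = reduce_timesteps(frames, factor)
        shapes.append((len(frames), len(frames[0])))
    return frames, shapes


mel = [[float(t)] * 80 for t in range(100)]  # 100 frames, 80 mel bins (assumed)
_, shapes = pyramid(mel, levels=3)
print(shapes)  # [(100, 80), (50, 160), (25, 320), (12, 640)]
```

Three levels bring 100 input frames down to 12 encoder states, which is the same order of magnitude as the 10 phonemes mentioned above.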

The system is sensitive to hyperparameters. We noticed that the capacity of the CBHG network is particularly important, and that adding dropout at various places helps in generalizing to the small dataset. However, dropout also leads to ’blurriness’ in the output. Cleaning this up probably requires a postnet, which we have not implemented.

We hope to release code and samples to allow for experimentation.


  • [1] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end to end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
  • [2] J. Shen, R. Pang, R. J. Weiss, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” arXiv preprint arXiv:1712.05884, 2017.
  • [3] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep voice: Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, 2017.
  • [4] S. O. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, and Y. Zhou, “Deep voice 2: Multi-speaker text-to-speech,” arXiv preprint arXiv:1705.08947, 2017.
  • [5] W. Ping, K. Peng, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
  • [6] A. Kain and M. Macon, “Spectral voice conversion for text-to-speech synthesis,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 1. IEEE, 1998, pp. 285–288.
  • [7] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” Trans. Audio, Speech and Lang. Proc., vol. 15, no. 8, pp. 2222–2235, Nov. 2007. [Online]. Available: https://doi.org/10.1109/TASL.2007.907344
  • [8]
  • [9] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, “Voice conversion using artificial neural networks,” in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 3893–3896. [Online]. Available: https://doi.org/10.1109/ICASSP.2009.4960478
  • [10] S. Desai, A. W. Black, and B. Yegnanarayana, “Voice conversion using artificial neural networks,” in IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 5, July 2010.
  • [11] L. Sun, S. Yang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory,” in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ser. ICASSP ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 4869–4873.
  • [12] L. Sun, K. Li, S. Kang, and H. Meng, in IEEE International Conference on Multimedia and Expo, 2016.
  • [13] M. Müller, Information Retrieval for Music and Motion. Springer, 2007.
  • [14] S. H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Commun., vol. 88, no. C, pp. 65–82, Apr. 2017. [Online]. Available: https://doi.org/10.1016/j.specom.2017.01.008
  • [15] Y. Li, M. Sun, H. Van Hamme, X. Zhang, and J. Yang, “Robust hierarchical learning for non-negative matrix factorization with outliers,” IEEE Access, vol. 7, pp. 10 546–10 558, 2019.
  • [16] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 96, no. 10, pp. 1946–1953, 2013.
  • [17] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” arXiv preprint arXiv:1810.06865, 2018.
  • [18] J. Zhang, Z. Ling, Y. Jiang, L. Liu, C. Liang, and L. Dai, “Improving sequence-to-sequence acoustic modeling by adding text-supervision,” CoRR, vol. abs/1811.08111, 2018. [Online]. Available: http://arxiv.org/abs/1811.08111
  • [19] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
  • [20] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “Atts2s-vc: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” arXiv preprint arXiv:1811.04076, 2018.
  • [21] H. Kameoka, K. Tanaka, T. Kaneko, and N. Hojo, “Convs2s-vc fully convolutional sequence-to-sequence voice conversion,” arXiv preprint arXiv:1811.01609, 2018.
  • [22] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” CoRR, vol. abs/1710.08969, 2017. [Online]. Available: http://arxiv.org/abs/1710.08969
  • [23] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” CoRR, vol. abs/1703.10593, 2017. [Online]. Available: http://arxiv.org/abs/1703.10593
  • [24] D. Kingma and M. Welling, “Autoencoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [25] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
  • [26] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in INTERSPEECH, 2017.
  • [27] A. B. L. Larsen, S. K. Sønderby, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” CoRR, vol. abs/1512.09300, 2015. [Online]. Available: http://arxiv.org/abs/1512.09300
  • [28] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” CoRR, vol. abs/1704.00849, 2017. [Online]. Available: http://arxiv.org/abs/1704.00849
  • [29] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  • [30] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
  • [31] H. Kameoka and T. Kaneko, “Stargan-vc: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv preprint arXiv:1806.02169, 2018.
  • [32] Y. Chen, Y. M. Assael, B. Shillingford, D. Budden, S. E. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, Ç. Gülçehre, A. van den Oord, O. Vinyals, and N. de Freitas, “Sample efficient adaptive text-to-speech,” CoRR, vol. abs/1809.10460, 2018. [Online]. Available: http://arxiv.org/abs/1809.10460
  • [33] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez-Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” CoRR, vol. abs/1806.04558, 2018. [Online]. Available: http://arxiv.org/abs/1806.04558
  • [34] S. Ö. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” CoRR, vol. abs/1802.06006, 2018. [Online]. Available: http://arxiv.org/abs/1802.06006
  • [35] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voice synthesis for in-the-wild speakers via a phonological loop,” CoRR, vol. abs/1707.06588, 2017. [Online]. Available: http://arxiv.org/abs/1707.06588
  • [36] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” CoRR, vol. abs/1802.06984, 2018. [Online]. Available: http://arxiv.org/abs/1802.06984
  • [37] E. Nachmani and L. Wolf, “Unsupervised polyglot text to speech,” CoRR, vol. abs/1902.02263, 2019. [Online]. Available: http://arxiv.org/abs/1902.02263
  • [38] R. Yamamoto, “Wavenet vocoder,” 2018. [Online]. Available: https://github.com/r9y9/wavenet_vocoder
  • [39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
  • [40] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Comput., vol. 1, no. 2, pp. 270–280, Jun. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.2.270
  • [41] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” arXiv preprint arXiv:1506.03099, 2015.
  • [42] A. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” arXiv preprint arXiv:1610.09038, 2016.
  • [43] J. Lee, K. Cho, and T. Hofmann, “Fully character-level neural machine translation without explicit segmentation,” arXiv preprint arXiv:1610.03017, 2016.
  • [44] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
  • [45] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention based models for speech recognition,” arXiv preprint arXiv:1506.07503, 2015.
  • [46] K. Ito, “The lj speech dataset,” 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
  • [47] J. Kominek and A. W. Black, “Cmu arctic databases for speech synthesis,” Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA, 2003. [Online]. Available: http://festvox.org/cmuarctic/index.html
  • [48] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
  • [49] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu, “Neural machine translation in linear time,” CoRR, vol. abs/1610.10099, 2016. [Online]. Available: http://arxiv.org/abs/1610.10099
  • [50] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, “Conditional image generation with pixelcnn decoders,” CoRR, vol. abs/1606.05328, 2016. [Online]. Available: http://arxiv.org/abs/1606.05328
  • [51] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” CoRR, vol. abs/1610.00527, 2016. [Online]. Available: http://arxiv.org/abs/1610.00527
  • [52] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.
  • [53] D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with inverse autoregressive flow,” CoRR, vol. abs/1606.04934, 2016. [Online]. Available: http://arxiv.org/abs/1606.04934
  • [54] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel wavenet: Fast high-fidelity speech synthesis,” CoRR, vol. abs/1711.10433, 2017. [Online]. Available: http://arxiv.org/abs/1711.10433
  • [55] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” CoRR, vol. abs/1811.00002, 2018. [Online]. Available: http://arxiv.org/abs/1811.00002