One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization


Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model can only convert the voice to speakers seen in the training data, which narrows the applicable scenarios of VC. In this paper, we propose a novel one-shot VC approach that performs VC with only one example utterance each from the source and target speakers, neither of whom needs to be seen during training. This is achieved by disentangling speaker and content representations with instance normalization (IN). Objective and subjective evaluations show that our model is able to generate a voice similar to that of the target speaker. In addition to the performance measurement, we also demonstrate that this model is able to learn meaningful speaker representations without any supervision.


Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee
College of Electrical Engineering and Computer Science, National Taiwan University
{r06922020,r06942067,hungyilee}

Index Terms: Voice conversion, disentangled representations, generative model.

1 Introduction

VC aims to convert the non-linguistic information of speech signals while keeping the linguistic content unchanged. The non-linguistic information may refer to speaker identity [1, 2, 3], accent or pronunciation [4, 5], to name a few. VC can be useful in downstream tasks such as multi-speaker text-to-speech [6, 7] and expressive speech synthesis [8, 9], as well as in applications such as speech enhancement [10, 11, 12] and pronunciation correction [4]. In this paper, we focus on the problem of speaker identity conversion.

Prior works on VC can be roughly categorized into two types: supervised and unsupervised. Supervised VC has achieved great performance [13, 14, 15, 16]. However, it requires frame-level alignment between source and target utterances. If there is a large gap between the source and target domains, inaccurate alignment may hurt conversion performance. More importantly, collecting parallel data is difficult and time-consuming, which makes supervised VC undesirable when we want the flexibility to adapt it to new domains.

Unsupervised VC has recently become an actively investigated problem because of its efficiency in data collection: it does not require parallel data and instead uses non-parallel data to train the VC system. Some works incorporate an ASR system to perform unsupervised VC [17, 18, 19]. By translating the speech into phoneme posterior sequences and then synthesizing the speech with a target-domain synthesizer, unsupervised VC can be achieved. However, the performance of this kind of approach depends heavily on the accuracy of the ASR system and degrades if the ASR system does not work well. Other works use deep generative models such as VAEs [20] or GANs [21] for unsupervised VC [22, 23, 24, 25, 26]. These works formulate VC as a domain mapping problem, aiming to learn networks that can transfer utterances among different domains. They are able to generate speech of good quality and convert speaker characteristics successfully. However, their major limitation is that they cannot synthesize the voices of speakers who were never seen in the training phase.

Speech signals inherently carry both static information and linguistic information. The static part, such as the speaker and the acoustic condition, is time-independent and barely changes over the whole utterance, while the linguistic part may change dramatically every few frames. Here we assume that an utterance can be factorized into a speaker representation and a content representation. To disentangle the two, our model consists of three components, shown in Fig. 1: a speaker encoder, a content encoder and a decoder. The speaker encoder is trained to encode the speaker information into the speaker representation. The content encoder is trained to encode only the linguistic information into the content representation. The decoder then synthesizes the voice by combining these two representations. We use instance normalization [27] without affine transformation in the content encoder to normalize the channel statistics, which control the global information. In this way, global information such as speaker information is removed from the representation encoded by the content encoder. In addition, adaptive instance normalization [28] is used in the decoder, with the corresponding affine parameters provided by the speaker encoder, so the global information needed in the decoder is controlled by the speaker encoder. With this architecture, our model is encouraged to learn factorized representations. This factorization enables our model to perform one-shot voice conversion as follows: given one utterance from the source speaker and another from the target speaker, we first extract the speaker representation from the target utterance, then extract the content representation from the source utterance, and finally combine them with the decoder to generate the converted result, as in Fig. 1.
It is worth mentioning that our model does not require any speaker labels for the utterances during training, which makes data collection easier. Interestingly, the speaker encoder learns meaningful speaker embeddings even though we do not provide any speaker labels.

In terms of applying factorization techniques to speech, some prior works proposed using adversarial training to remove certain attributes from an utterance [29, 26]. However, training an extra discriminator network requires considerably more computational resources. Adversarial training also suffers from instability, which makes the training process difficult. In our proposed approach, we simply use instance normalization instead of adversarial training to remove the speaker information in an utterance, which substantially reduces the computation and makes the training process easier.

Figure 1: Model overview. $E_s$ is the speaker encoder, $E_c$ is the content encoder, and $D$ is the decoder. IN is an instance normalization layer without affine transformation. AdaIN represents an adaptive instance normalization layer.

Our contribution is three-fold:

  1. Our proposed model is able to do one-shot VC without any supervision.

  2. The efficacy of instance normalization on disentangling representations for VC is verified.

  3. We demonstrate that our model is able to learn meaningful speaker embeddings as a side effect.

2 Proposed Approach

2.1 Variational autoencoder

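Let $x$ be an acoustic feature segment, and $\mathcal{X}$ be the collection of all the acoustic segments in the training data. Let $E_s$ be the speaker encoder, $E_c$ be the content encoder, and $D$ be the decoder. $E_s$ is trained to generate the speaker representation $z_s$, and $E_c$ is trained to generate the content representation $z_c$. We assume that $p(z_c|x)$ is a conditionally independent Gaussian distribution with unit variance as in [30], which means $p(z_c|x) = \mathcal{N}(E_c(x), I)$. The reconstruction loss is given in Eq. 1.

$$L_{rec}(\theta_{E_s}, \theta_{E_c}, \theta_D) = \mathbb{E}_{x \sim p(x),\, z_c \sim p(z_c|x)} \left[ \lVert D(E_s(x), z_c) - x \rVert_1 \right] \tag{1}$$

We uniformly sample an acoustic segment $x$ from $\mathcal{X}$ during training (that is, $p(x)$ in Eq. 1 is a uniform distribution over $\mathcal{X}$). To match the posterior distribution $p(z_c|x)$ to the prior $\mathcal{N}(0, I)$, the KL divergence loss is minimized. Since we assume unit variance, the KL divergence reduces to L2 regularization. The KL divergence term is given in Eq. 2.

$$L_{kl}(\theta_{E_c}) = \mathbb{E}_{x \sim p(x)} \left[ \lVert E_c(x) \rVert_2^2 \right] \tag{2}$$

The objective function for VAE training is to minimize the combination of the two terms, weighted by the hyper-parameters $\lambda_{rec}$ and $\lambda_{kl}$.

$$\min_{\theta_{E_s}, \theta_{E_c}, \theta_D} L(\theta_{E_s}, \theta_{E_c}, \theta_D) = \lambda_{rec} L_{rec} + \lambda_{kl} L_{kl} \tag{3}$$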
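
The training objective described in this section can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the L1 form of the reconstruction term is an assumption, the function names are hypothetical, and the weights and toy vectors are illustrative values only. For a unit-variance posterior, the KL divergence to a standard normal prior is half the squared L2 norm of the posterior mean, and the factor 0.5 can be absorbed into the KL weight.

```python
import random


def l1_reconstruction(x, x_hat):
    """L1 distance between a segment and its reconstruction (assumed norm)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat))


def kl_to_standard_normal(mu):
    """KL(N(mu, I) || N(0, I)) = 0.5 * ||mu||^2 for a unit-variance posterior."""
    return 0.5 * sum(m * m for m in mu)


def sample_content(mu, rng):
    """Draw z_c = mu + eps with eps ~ N(0, I), i.e. a unit-variance posterior."""
    return [m + rng.gauss(0.0, 1.0) for m in mu]


def vae_objective(x, x_hat, mu, lambda_rec, lambda_kl):
    """Weighted sum of the reconstruction term and the KL term."""
    rec = l1_reconstruction(x, x_hat)
    kl = kl_to_standard_normal(mu)
    return lambda_rec * rec + lambda_kl * kl


# Toy numbers only; these weights are illustrative, not the paper's settings.
x = [1.0, 2.0, 3.0]          # an acoustic segment (flattened)
x_hat = [1.5, 2.0, 2.0]      # a hypothetical decoder output
mu = [0.5, -0.5]             # a hypothetical content-encoder output
z_c = sample_content(mu, random.Random(0))
loss = vae_objective(x, x_hat, mu, lambda_rec=1.0, lambda_kl=0.1)
```

In a real model, `x_hat` would be produced by the decoder from the sampled `z_c` and the speaker representation; here it is fixed by hand so that the arithmetic of the two loss terms stays easy to follow.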


2.2 Instance Normalization for Feature Disentanglement

At first glance, it is unclear how the speaker encoder $E_s$ and the content encoder $E_c$ would come to encode speaker and content information respectively, based on the description in Section 2.1. In this paper, we find that simply adding instance normalization (IN) without affine transformation to $E_c$ can remove the speaker information while preserving the content information. A similar idea has been verified to be effective for style transfer in computer vision [28].

The formula of instance normalization (IN) without affine transformation is given in Eq. 5. Here $M$ is the feature map output by the previous convolutional layer, and $M_c$ represents its $c$-th channel, which is a $W$-dimensional array. Each channel is an array instead of a matrix because 1-D convolution is applied rather than 2-D. To apply IN, we first compute the mean $\mu_c$ and standard deviation $\sigma_c$ of the $c$-th channel:

$$\mu_c = \frac{1}{W} \sum_{w=1}^{W} M_c[w], \qquad \sigma_c = \sqrt{\frac{1}{W} \sum_{w=1}^{W} \left( M_c[w] - \mu_c \right)^2 + \epsilon} \tag{4}$$

where $M_c[w]$ is the $w$-th element in $M_c$, and $\epsilon$ in Eq. 4 is simply a small value to avoid numerical instability. Then in IN, each element $M_c[w]$ in the array is normalized into $M'_c[w]$ as below.

$$M'_c[w] = \frac{M_c[w] - \mu_c}{\sigma_c} \tag{5}$$

The normalized $M'_c$ are processed by the subsequent network layers. We use IN layers in the content encoder to prevent it from learning domain information, so that the model is forced to extract speaker information from the speaker encoder and content information from the content encoder.
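
As a rough illustration of why IN without affine transformation discards global, per-channel statistics, the sketch below normalizes each channel of a 1-D feature map in plain Python. The function name, the value of `eps`, and the toy inputs are assumptions made for this example. Two channels that share the same temporal pattern but differ in offset and scale (a stand-in for "global" information) become almost identical after normalization.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize every channel to zero mean and (almost) unit variance.

    feature_map: list of channels, each a list of W floats, as produced by
    a 1-D convolution. eps is an assumed small constant for stability.
    """
    normalized = []
    for channel in feature_map:
        width = len(channel)
        mean = sum(channel) / width
        var = sum((v - mean) ** 2 for v in channel) / width
        std = math.sqrt(var + eps)
        normalized.append([(v - mean) / std for v in channel])
    return normalized


# Same temporal pattern, different global offset and scale.
pattern = [0.0, 1.0, 0.0, -1.0]
map_a = [[2.0 + 1.0 * p for p in pattern]]
map_b = [[-3.0 + 4.0 * p for p in pattern]]

norm_a = instance_norm(map_a)
norm_b = instance_norm(map_b)
max_gap = max(abs(a - b) for a, b in zip(norm_a[0], norm_b[0]))
```

Here `max_gap` is tiny (it is nonzero only because of `eps`), which is the behaviour the content encoder relies on: what survives normalization is the shape of the signal over time, not its channel-wise mean or scale.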

To further encourage the speaker encoder to generate the speaker representation, we provide the speaker information to the decoder through adaptive instance normalization (adaIN) layers [28]. In an adaIN layer, the decoder first normalizes the global information by IN, and the speaker encoder then provides the global information. The formula is given as follows.

$$M'_c[w] = \gamma_c \frac{M_c[w] - \mu_c}{\sigma_c} + \beta_c \tag{6}$$

$\mu_c$ and $\sigma_c$ are the channel mean and standard deviation, computed as in Eq. 4. $\gamma_c$ and $\beta_c$ for each channel are linear transformations of the output of the speaker encoder $E_s$.
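
The adaIN step can be sketched in the same style. This is a minimal plain-Python illustration under stated assumptions: the helper names, the `eps` value, and the toy weights of the linear maps are all hypothetical, and a real decoder would learn those weights. The per-channel scale and shift are produced from the speaker representation by a linear map, and after adaIN each channel has mean close to its shift and standard deviation close to the magnitude of its scale.

```python
import math


def linear(vector, weight, bias):
    """Affine map y = W v + b; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def adaptive_instance_norm(feature_map, gamma, beta, eps=1e-5):
    """Apply IN per channel, then scale by gamma[c] and shift by beta[c]."""
    out = []
    for channel, g, b in zip(feature_map, gamma, beta):
        width = len(channel)
        mean = sum(channel) / width
        std = math.sqrt(sum((v - mean) ** 2 for v in channel) / width + eps)
        out.append([g * (v - mean) / std + b for v in channel])
    return out


# Hypothetical 2-dimensional speaker representation and toy linear layers.
speaker_vec = [1.0, 2.0]
gamma = linear(speaker_vec, weight=[[1.0, 0.5]], bias=[0.0])  # one channel
beta = linear(speaker_vec, weight=[[3.0, 1.0]], bias=[0.0])

decoder_features = [[1.0, 2.0, 3.0, 4.0]]
styled = adaptive_instance_norm(decoder_features, gamma, beta)

out_mean = sum(styled[0]) / len(styled[0])
out_std = math.sqrt(sum((v - out_mean) ** 2 for v in styled[0]) / len(styled[0]))
```

With these toy values `gamma` is `[2.0]` and `beta` is `[5.0]`, so the output channel statistics are set entirely by the speaker vector, regardless of the statistics of `decoder_features`.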

3 Implementation Details

3.1 Architecture

We use Conv1d layers in the encoders and decoder to process all the frequency information at once, as in Fig. 2. The ConvBank layer is used in both the speaker encoder and the content encoder to better capture long-term information [31]. We apply average pooling over time in the speaker encoder so that it learns global information only. Instance normalization layers are used in the content encoder to normalize the global information. A PixelShuffle1d layer [32] is used in the decoder for upsampling. adaIN layers are used to provide global information to the decoder. The speaker representation is first processed by a residual DNN and then transformed by an affine layer before being fed into each adaIN layer.
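
Two of the operations named above are simple enough to sketch in plain Python. The 1-D pixel shuffle shown here is one common analogue of the 2-D operation, trading channels for time resolution; the exact channel ordering used in the authors' layer is an assumption, as are the function names and the upsampling factor of 2. Average pooling over time collapses each channel to a single number, which is why the speaker encoder can only keep global information.

```python
def pixel_shuffle_1d(feature_map, scale=2):
    """Rearrange (C * scale, T) features into (C, T * scale).

    Output channel c at time t * scale + k takes its value from input
    channel c * scale + k at time t (assumed ordering).
    """
    n_in = len(feature_map)
    if n_in % scale != 0:
        raise ValueError("channel count must be divisible by scale")
    width = len(feature_map[0])
    out = []
    for c in range(n_in // scale):
        row = []
        for t in range(width):
            for k in range(scale):
                row.append(feature_map[c * scale + k][t])
        out.append(row)
    return out


def average_pool_over_time(feature_map):
    """Collapse each channel to its mean over the time axis."""
    return [sum(channel) / len(channel) for channel in feature_map]


upsampled = pixel_shuffle_1d([[1, 2], [3, 4]], scale=2)
pooled = average_pool_over_time([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
```

In this toy run, two channels of length 2 become one channel of length 4, and the pooled output has one value per channel no matter how long the input is.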

Figure 2: The architecture of the encoders and decoder.

3.2 Acoustic feature

We use mel-scale spectrograms as the acoustic feature. We first trim the silence, normalize the volume, and convert the audio to 24 kHz. We then apply the STFT to the audio with a 50-millisecond window length, a 12.5-millisecond hop length, and an FFT size of 2048. The magnitudes of the spectrograms are then transformed to 512-bin mel-scale spectrograms, which are normalized by subtracting the mean and dividing by the standard deviation. To convert the mel-scale spectrograms back to waveforms, we apply an approximate inverse linear transformation to recover the linear-scale spectrograms [35]. The phase is reconstructed iteratively by the Griffin-Lim algorithm.
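
The STFT settings translate into sample counts as in the short sketch below. It uses only the figures stated above (24 kHz, 50 ms window, 12.5 ms hop, FFT size 2048); the helper names are hypothetical, and normalizing a flat list of values with the population standard deviation is an assumption, since the text does not say whether the statistics are computed per mel bin or globally.

```python
import math

SAMPLE_RATE = 24000  # Hz, as stated in the text
N_FFT = 2048         # FFT size, as stated in the text


def ms_to_samples(milliseconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(milliseconds * sample_rate / 1000))


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - mean) / std for v in values]


win_length = ms_to_samples(50)        # samples per analysis window
hop_length = ms_to_samples(12.5)      # samples between successive frames
n_linear_bins = N_FFT // 2 + 1        # frequency bins of a one-sided STFT
frames_per_second = SAMPLE_RATE // hop_length

normalized = standardize([1.0, 2.0, 3.0])
```

Because the window (1200 samples) is shorter than the FFT size, each frame is zero-padded to 2048 points before the transform, and the 1025 linear-frequency bins are what the mel filterbank then reduces to 512 bins.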

3.3 Training details

We trained the proposed model with the ADAM optimizer. To prevent the model from over-fitting, we applied dropout to each layer together with weight decay. The two loss terms are weighted by $\lambda_{rec}$ and $\lambda_{kl}$, and training is counted in iterations (mini-batches). Further details may be found in our implementation code:

4 Experiments

We evaluated our model on the CSTR VCTK Corpus [34]. The audio data were produced by 109 English speakers with different accents. We randomly selected the utterances of 20 speakers as our testing set, and the remaining utterances were split into a 90% training set and a 10% validation set. Although we set the segment length to 128 during training, the fully convolutional architecture allows the model to process input of any length at the inference stage. After removing all utterances shorter than 128 frames, the training set contains about 16000 utterances.
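
The segment handling described above might look like the following plain-Python sketch. Only the segment length of 128 frames and the removal of shorter utterances come from the text; the function names are hypothetical, and drawing the crop position uniformly at random is an assumption. Frames are represented by integers here so that the result is easy to inspect.

```python
import random

SEGMENT_LEN = 128  # frames per training segment, as stated in the text


def drop_short(utterances, min_frames=SEGMENT_LEN):
    """Keep only utterances with at least min_frames frames."""
    return [u for u in utterances if len(u) >= min_frames]


def sample_segment(utterance, rng, length=SEGMENT_LEN):
    """Crop a contiguous segment at a uniformly random start position."""
    start = rng.randrange(len(utterance) - length + 1)
    return utterance[start:start + length]


rng = random.Random(0)
utterances = [list(range(n)) for n in (50, 128, 300)]  # toy "utterances"
kept = drop_short(utterances)
segment = sample_segment(kept[1], rng)
```

At inference time no cropping is needed, since a fully convolutional model accepts the whole utterance regardless of its length.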

4.1 Evaluation of disentanglement

To see the effect of the IN layer, we performed an ablation study to verify that it helps the content encoder remove information about speaker characteristics. We trained another network (a 5-layer DNN with 1024 neurons and ReLU activation) to classify speaker identity given the latent representation encoded by the content encoder. We compared the classification accuracy under three settings: "content encoder with IN", "content encoder without IN", and "content encoder without IN while speaker encoder with IN". The results are shown in Table 1. The classification accuracy is clearly lower when IN is applied to the content encoder. However, we also found that the accuracy was not as high as expected even when we did not apply IN to the content encoder. This is probably because the speaker encoder can control the channel statistics of the decoder through adaIN, so the whole model tends to learn speaker information from the speaker encoder rather than from the content encoder. To further confirm this assumption, we tested the classification accuracy under the third setting mentioned above, in which IN is applied to the speaker encoder but not to the content encoder. Because IN makes every channel zero-mean over time, average pooling over time after an IN layer outputs a zero vector, so the speaker encoder could no longer carry the complete speaker information; the whole model therefore tended to "flow" more speaker information through the content encoder, increasing the classification accuracy.

content encoder with IN | content encoder w/o IN | content encoder w/o IN + speaker encoder with IN
Table 1: The accuracy of speaker identity prediction from the content representation. A smaller value means less speaker information in the content representation.

4.2 Speaker embedding visualization

We found that the speaker encoder learned meaningful speaker-related embeddings even though we did not explicitly add any objective or constraint to the encoder [36]. We fed utterances from both seen and unseen (during training) speakers through the speaker encoder and plotted their embeddings in 2D space with t-SNE in Fig. 3. Utterances spoken by different speakers were well separated. We also conducted experiments on classifying speaker identity with these embeddings, using the same setting as in subsection 4.1. The accuracies achieved for seen and unseen speakers indicate that the speaker encoder learned reasonable representations in the embedding space.

Figure 3: Visualization of the speaker embeddings. 'x' marks female speakers and 'o' marks male speakers. Segments are randomly sampled from the validation set and the testing set.

4.3 Objective evaluation

4.3.1 Global variance

To show that our model is able to convert speaker characteristics, we used the global variance (GV) to visualize the spectral distribution. Global variance has been used to check whether a voice conversion result matches the target speaker in terms of variance distribution [37]. We evaluated the global variance at each frequency index for 4 conversion examples: male to male, male to female, female to male, and female to female. The results are shown in Fig. 4; our generated samples did match the target speaker in terms of variance distribution.
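
A per-frequency variance of this kind can be computed as in the sketch below, written in plain Python. It assumes that the variance is taken over time frames for each frequency index and that all frames are pooled together; whether the original evaluation pools frames or averages per-utterance variances is not stated, and the function name and toy data are hypothetical.

```python
def global_variance(frames):
    """Variance over time for each frequency index.

    frames: list of spectrogram frames, each a list with one value per
    frequency index. Returns one (population) variance per index.
    """
    n_frames = len(frames)
    n_bins = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_bins)]
    return [
        sum((f[k] - means[k]) ** 2 for f in frames) / n_frames
        for k in range(n_bins)
    ]


# Toy spectrogram: index 0 fluctuates over time, index 1 is constant.
toy_frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
gv = global_variance(toy_frames)
```

Comparing such a curve for converted utterances with the curve for real utterances of the target speaker gives the kind of match shown in Fig. 4.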

Figure 4: The variance distribution of the converted results and the target speaker's utterances. 100 randomly chosen utterances and converted results are used to calculate the variance.

4.3.2 Spectrograms example

Some examples of spectrogram heatmaps are shown in Fig. 5. We can see that our model is able to transform the fundamental frequency (f0) while keeping the original phonetic content in both male-to-female and female-to-male conversion.

Figure 5: The heatmaps of the spectrogram: the upper left is an utterance spoken by a female speaker. The upper right is the converted result to a male speaker. The lower left is an utterance spoken by a male speaker. The lower right is the converted result to a female speaker.

4.4 Subjective evaluation

Subjective evaluation was performed on the converted voices (male to male, male to female, female to male and female to female, four pairs of speakers in total). The speakers of these four pairs were all unseen during training, so the converted result for each pair was produced by our proposed approach using only one source utterance and one target utterance. We then asked human participants to evaluate the similarity between two utterances on a 4-point scale: same (absolutely sure), same (not sure), different (not sure), and different (absolutely sure). Each pair of utterances consisted of a converted result and either a source speaker utterance or a target speaker utterance. The results are in Fig. 6. Our model is able to generate a voice similar to the target speaker's. The demo can be found at

Figure 6: Similarity test. The left one is the comparison to source speaker’s utterance. The right one is the comparison to target speaker’s utterance.

5 Conclusion

We proposed a novel approach to one-shot unsupervised VC that applies instance normalization to force the model to learn factorized representations. In this way, we can perform VC to unseen speakers with only one utterance. Subjective and objective evaluations showed good results in terms of similarity to target speakers. In addition, the disentanglement experiments and visualization showed that, in our proposed approach, the speaker encoder learns a meaningful embedding space without any supervision.

6 Acknowledgement

This work is supported by Nvidia.

