PPG-based singing voice conversion with adversarial representation learning


Singing voice conversion (SVC) aims to convert the voice of one singer to that of another while keeping the singing content and melody. Building on recent voice conversion work, we propose a novel model that converts songs stably while preserving their naturalness and intonation. We build an end-to-end architecture that takes phonetic posteriorgrams (PPGs) as input and generates mel spectrograms. Specifically, we implement two separate encoders: one encodes PPGs as content, and the other compresses mel spectrograms to supply acoustic and musical information. To improve timbre and melody, we design an adversarial singer confusion module and a mel-regressive representation learning module. Objective and subjective experiments are conducted on our private Chinese singing corpus. Compared with the baselines, our method significantly improves conversion performance in terms of naturalness, melody, and voice similarity. Moreover, our PPG-based method proves robust to noisy sources.


Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen, Zejun Ma
ByteDance AI Lab
{lizhonghao.01,tangbenlai,yinxiang.stephen}@bytedance.com

Keywords: Singing voice conversion, phonetic posteriorgrams, confusion module, representation learning

1 Introduction

Singing is a popular form of entertainment and self-expression. The goal of singing voice conversion (SVC) is to convert the timbre of a source singer to that of a target singer without changing the content and melody. Compared with conventional speech voice conversion, singing voice conversion demands more care with acoustic features. In speech conversion, minor changes to features such as pitch and pauses are acceptable. In singing conversion, however, pitch and pauses carry musical characteristics like melody and rhythm, so they are song-dependent and must be preserved precisely.

Early studies on singing voice conversion generally follow statistical generation architectures [22, 10], which often use a Gaussian mixture model (GMM) with parallel singing data. [19] updates the conversion framework with a generative adversarial network (GAN). However, it still requires the source and target singers to sing the same songs during the training phase.

As parallel singing corpora are rare, several works have tried to remove this requirement. Drawing on advances in voice conversion [18, 17], [15] builds an autoencoder framework to train the conversion model. The autoencoder, consisting of a WaveNet [21] encoder that compresses acoustic information and a WaveNet decoder that recovers the waveform with a speaker embedding table, maps the source waveform to itself. With this powerful network architecture, it achieves competitive results with non-parallel data. To enhance the timbre similarity of the converted audio, this work introduces a domain confusion module [5] that disentangles singer information from the encoder output through an adversarial singer classifier. PitchNet [4] follows the confusion method and adds an extra pitch confusion module to remove pitch information from the encoder, so that it can use F0 values to control pitch contour and melody. Moreover, some novel generation frameworks have been introduced to the SVC task, such as Gaussian mixture variational autoencoders (GMVAEs) [13] and variational autoencoding Wasserstein GANs (VAW-GANs) [12]. Although the autoencoder-based models can produce natural singing voices, redundant noise from the input data may reduce the quality of the generated sounds.

Another way to address the lack of parallel data is to use phonetic posteriorgrams (PPGs) as the model input [20]. PPGs represent frame-level linguistic information as probability distributions over phonemes. They naturally discard acoustic information such as timbre and pitch while retaining speaker-independent content and tempo information. For singing voice conversion, [3] uses a deep bidirectional LSTM (DBLSTM) network to map PPGs to mel-cepstral coefficients (MCEPs) in a specific timbre, building a many-to-one singing voice conversion system with the WORLD vocoder [14]. Recently, [16] extends this line of work to support many-to-many conversion. It uses WaveNet conditioned on various linguistic and acoustic features and presents a non-autoregressive model optimized by several perceptual losses.
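To make the notion of a PPG concrete, the following minimal sketch turns per-frame phoneme scores into per-frame probability distributions with a softmax. The 4-phoneme inventory and the logit values are illustrative assumptions, not outputs of any real ASR model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame of phoneme scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def posteriorgram(frame_logits):
    """Return one phoneme probability distribution per frame (a PPG)."""
    return [softmax(frame) for frame in frame_logits]

# Hypothetical logits: 3 frames over a toy inventory of 4 phonemes.
logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.5, 1.5, 0.2, 0.0],
    [0.0, 0.0, 3.0, 0.0],
]
ppg = posteriorgram(logits)

# The most likely phoneme index in each frame.
best = [row.index(max(row)) for row in ppg]
```

Each row of `ppg` sums to one and carries no explicit pitch or timbre term, which is the property the PPG-based approach relies on.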

Figure 1: The overall architecture of the proposed model. The framed part is the Mel encoder, and the remaining modules constitute our baseline architecture.

In this work, we design a many-to-many SVC model based on the end-to-end framework that is widely used in audio generation tasks [23, 1, 24, 11]. Building on previous SVC works, our model generally follows a PPG-to-Mel pipeline. An additional reference encoder, named the Mel encoder, is added to improve the quality, naturalness, and melody of the conversion outputs. To enhance timbre similarity, an adversarial singer confusion module [5] is applied to disentangle the singer information from the Mel encoder, and a singer lookup table then supplies the singer identity information. Furthermore, we propose a mel-regressive module to capture acoustic representations from the Mel encoder outputs and singer identity embeddings. In the experiments, with all proposed techniques integrated, our model significantly outperforms the baseline systems in naturalness and timbre similarity. An objective evaluation also shows that our model is robust to noise.

2 Method

As illustrated in Fig. 1, the overall structure of the proposed model is a PPG-based end-to-end framework, which takes the source audio as input and outputs the converted audio with the target timbre. The input singing audio is passed through the feature extractors to obtain the input features. The conversion model then maps the features to mel spectrograms. Finally, a WaveRNN [8] neural vocoder synthesizes the waveform from the mel spectrograms in real time with high fidelity.

2.1 Proposed Baseline Architecture

We employ a speaker-independent automatic speech recognition (ASR) model to extract PPGs as linguistic features. Unlike the previous work [3], we train this model on a large amount of singing data. Compared with models trained on speech data, it improves the conversion results dramatically. To our knowledge, this is the first work to use singing data to train the ASR model for the SVC task. The singing ASR (SASR) model is based on DFSMN [25] with CTC loss [6] and has 30 layers.

To map PPGs to mel spectrograms, our fundamental conversion model consists of a linguistic encoder and an acoustic decoder bridged by an attention module. Specifically, we employ a CBHG encoder [23] to encode the frame-level PPG features into the linguistic representation. In addition, the singer identity embedding is selected from the singer lookup table, and the logarithmic F0 sequence is extracted from the source audio. The singer embedding and the F0 sequence are then concatenated to the encoder outputs. The acoustic decoder follows the design in [1]. The GMM attention mechanism [7] is used for its ability to generate very long utterances [2].
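The frame-level concatenation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_decoder_memory`, the toy dimensions, and the convention of mapping unvoiced frames (F0 of 0 Hz) to a log-F0 of 0.0 are hypothetical and are not taken from the paper.

```python
import math

def build_decoder_memory(encoder_out, singer_embedding, f0_hz):
    """Append the singer embedding and the log-F0 value to every encoder frame.

    encoder_out:      list of frames, each a list of floats
    singer_embedding: list of floats, shared by all frames
    f0_hz:            one F0 value per frame, in Hz (0.0 marks an unvoiced frame)
    """
    if len(encoder_out) != len(f0_hz):
        raise ValueError("encoder output and F0 sequence must have the same length")
    memory = []
    for frame, f0 in zip(encoder_out, f0_hz):
        # Assumed convention: unvoiced frames get log-F0 = 0.0.
        log_f0 = math.log(f0) if f0 > 0 else 0.0
        memory.append(list(frame) + list(singer_embedding) + [log_f0])
    return memory

# Toy example: 2 frames, 3 encoder dimensions, 2-dimensional singer embedding.
encoder_out = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
singer_embedding = [1.0, -1.0]
memory = build_decoder_memory(encoder_out, singer_embedding, [220.0, 0.0])
```

Every frame of `memory` has 3 + 2 + 1 = 6 dimensions, so the decoder sees content, singer identity, and pitch at each time step.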

A mel spectrogram regression loss and a stop token prediction loss are used to optimize the conversion model, as in [1]. The loss to be minimized can be described as

$$\mathcal{L}_{con} = \mathrm{MSE}(M, \hat{M}) + \mathrm{CE}(S, \hat{S}),$$

where $M$ represents mel spectrograms, and $S$ represents stop tokens. $(\hat{M}, \hat{S})$ is the output of the acoustic decoder. Mean Square Error (MSE) measures the difference between generated and target mel spectrograms, and binary Cross Entropy (CE) is used for the stop token prediction.
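As a small worked sketch of this loss, the code below sums a mean square error over mel frames and a binary cross entropy over stop tokens. The function names, the clipping constant `eps`, and the toy inputs are assumptions made for illustration; an actual system would compute the same quantities on tensors inside a deep learning framework.

```python
import math

def mse(target, predicted):
    """Mean square error over all bins of all mel frames."""
    total, count = 0.0, 0
    for t_frame, p_frame in zip(target, predicted):
        for t, p in zip(t_frame, p_frame):
            total += (t - p) ** 2
            count += 1
    return total / count

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Binary cross entropy averaged over frames; probabilities are clipped by eps."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(targets)

def conversion_loss(mel, mel_hat, stop, stop_hat):
    """Mel regression loss plus stop token prediction loss."""
    return mse(mel, mel_hat) + binary_cross_entropy(stop, stop_hat)

# Toy example: one frame with two mel bins, and one stop token.
loss = conversion_loss([[0.0, 0.0]], [[1.0, 1.0]], [1.0], [0.5])
```

In this toy case the MSE term is 1 and the cross entropy term is ln 2, so `loss` is about 1.693.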

2.2 Mel Encoder

PPG features ideally remove undesired timbre information from the source voice. However, aspects of the singing style (e.g. intonation, melody, and emotion) are not well covered by these features and are therefore difficult to reconstruct. To preserve these musical characteristics of the source audio, we use an additional encoder, called the Mel encoder, to extract information from the source mel spectrograms.

The structure of the Mel encoder is shown in Fig. 2(a). First, the mel spectrogram extracted from the source audio is fed into a max-pooling layer, followed by six 2-D convolution layers and a bidirectional GRU network. The outputs are concatenated with the encoder outputs described in Section 2.1.

It is worth noting that the dimension of the Mel encoder output is kept very small to suppress the timbre and noise of the source voice. Following the evaluation in [11], we found that 4 units gave the best balance between timbre, sound quality, and musical characteristics of the converted audio.

Figure 2: Our proposed adversarial representation learning encoder. (a) Mel encoder architecture. (b) The singer confusion module. (c) The mel-regressive representation learning module. (b) and (c) are not used in the inference phase.

2.3 Singer Confusion Module

To strengthen the timbre similarity of the output, a singer confusion module is introduced to our model. During the training phase, the encoded embeddings from the Mel encoder are fed into a singer-identity classifier, illustrated in Fig. 2(b). The input embeddings are passed through three 1-D convolutional layers followed by a dense projection and converted to probability distributions over the singer identity classes.

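With $p$ being the probability distribution of the singer identity, $p$ is a one-hot vector. For an embedding sequence with $T$ frames, $\hat{p} = (\hat{p}_1, \dots, \hat{p}_T)$ is the classifier output, where $\hat{p}_t$ is the predicted distribution for frame $t$. The cross entropy values between $p$ and $\hat{p}_t$ are averaged for the classification loss

$$\mathcal{L}_{cls} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{CE}(p, \hat{p}_t).$$
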
Training with the confusion module is a cycle of two steps. First, the classification network is trained to minimize the singer classification loss $\mathcal{L}_{cls}$. Second, the conversion pathway, excluding the classifier, is trained with the loss

$$\mathcal{L}_{gen} = \mathcal{L}_{con} - \lambda \mathcal{L}_{cls},$$

where $\mathcal{L}_{con}$ is the conversion loss of Section 2.1 and $\lambda$ is a weight factor. The entire model forms an adversarial framework: the baseline conversion model with the Mel encoder is the generator, and the singer-identity classifier acts as a discriminator that tries to distinguish singers from the Mel encoder outputs. We also attempted to apply this module to the CBHG encoder, but it led to instability and even collapse during training.
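The two losses of this adversarial cycle can be sketched numerically as below. The sketch assumes the common domain-confusion form in which the generator is rewarded for a high classification loss (conversion loss minus a weighted classification loss); the function names and the weight value 0.1 are hypothetical.

```python
import math

def classification_loss(frame_probs, singer_id, eps=1e-12):
    """Cross entropy between the one-hot singer label and each frame's
    predicted distribution, averaged over frames."""
    total = 0.0
    for probs in frame_probs:
        total -= math.log(max(probs[singer_id], eps))
    return total / len(frame_probs)

def generator_loss(conversion_loss, cls_loss, weight):
    """Step 2 objective: the conversion pathway tries to make the classifier fail."""
    return conversion_loss - weight * cls_loss

# Toy example: 2 frames, 4 singers, a classifier that is completely confused.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 2
cls = classification_loss(uniform, singer_id=2)

# Step 1 would update only the classifier to minimise `cls`.
# Step 2 would update the rest of the model to minimise `gen`.
gen = generator_loss(conversion_loss=1.0, cls_loss=cls, weight=0.1)
```

With a uniform prediction over 4 singers, `cls` equals ln 4, the value the generator pushes toward when the Mel encoder output carries no singer information.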

2.4 Mel-Regressive Representation Learning Module

Although the singer confusion technique improves timbre similarity and stability, the naturalness of the converted audio declines noticeably. This indicates that, besides disentangling singer identities, the confusion module also weakens the articulatory and musical representation of the Mel encoder. To address this issue, we propose a novel representation learning decoder.

As shown in Fig. 2(c), the additional part is a mel-regressive network, which consists of 3 residual activated DNN layers and a projection layer. In the training phase, the output embeddings of the Mel encoder are concatenated with the singer embedding to compensate for the missing identity information and then passed through the regressive network to generate the mel spectrograms $\tilde{M}$. The regression loss is calculated by the MSE function:

$$\mathcal{L}_{mr} = \mathrm{MSE}(M, \tilde{M}),$$

where $M$ denotes the target mel spectrograms.
By training the Mel encoder with the regression loss, its outputs are expected to preserve the acoustic and musical information needed to reconstruct mel spectrograms, except for singer identity information.
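A forward pass through a network of this shape (three residual layers followed by a projection) might look like the sketch below. The ReLU activation, the identity weights, and the 2-dimensional toy input are assumptions chosen to keep the arithmetic easy to follow; the paper does not specify the activation or layer sizes.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]

def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def residual_layer(x, weights, bias):
    """Activated dense layer with a skip connection: x + relu(Wx + b)."""
    hidden = relu(dense(x, weights, bias))
    return [xi + hi for xi, hi in zip(x, hidden)]

def mel_regressor(frame, layers, proj_weights, proj_bias):
    """Residual layers followed by a projection to the mel dimension."""
    hidden = frame
    for weights, bias in layers:
        hidden = residual_layer(hidden, weights, bias)
    return dense(hidden, proj_weights, proj_bias)

# Toy example: the input stands for [Mel encoder output ; singer embedding].
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [(identity, [0.0, 0.0])] * 3
mel_frame = mel_regressor([1.0, -2.0], layers, [[1.0, 1.0]], [0.0])
```

With identity weights the hidden vector evolves as [2, -2], [4, -2], [8, -2], and the single-bin projection returns 6.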

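Integrating all proposed modules, the overall conversion model is trained with an augmented generation loss

$$\mathcal{L}'_{gen} = \mathcal{L}_{con} - \lambda \mathcal{L}_{cls} + \mu \mathcal{L}_{mr},$$

where $\mathcal{L}_{con}$, $\mathcal{L}_{cls}$, and $\mathcal{L}_{mr}$ are the conversion, singer classification, and mel regression losses, $\lambda$ is the weight factor of the singer confusion module, and $\mu$ is the weight factor of the mel-regressive module.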

3 Experiments

3.1 Experimental Setup

Our experiments are conducted on an internal Mandarin Chinese singing corpus. The data set contains 32.7 hours of audio in total from 9 female singers and 7 male singers. On average, each singer has 1000 utterances for training and 10 for validation. For evaluation, we choose one female singer and one male singer as the target timbres. The test set consists of 40 segments from 20 singers who are not in the training set (see footnote 1). All songs are sampled at 24 kHz.

To obtain input features, the SASR model is trained with about 20k hours of singing data and outputs 1467-dimensional PPGs. F0 values are extracted with REAPER (see footnote 2). The configurations for extracting mel spectrograms and training the WaveRNN model follow [11]. The conversion model is trained for 200k steps with a batch size of 32. We use the Adam optimizer [9] and halve the learning rate every 25k steps. For weight factors in Section 2, we set .
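The step-wise learning rate schedule described above can be written as a one-line function. The initial learning rate is not stated in the text, so `base_lr=1e-3` below is purely a placeholder assumption.

```python
def learning_rate(step, base_lr, halve_every=25_000):
    """Halve the learning rate once per `halve_every` training steps."""
    return base_lr * 0.5 ** (step // halve_every)

# Hypothetical base learning rate; the paper does not give this value.
base_lr = 1e-3
schedule = [learning_rate(step, base_lr) for step in (0, 25_000, 50_000, 200_000)]
```

Over 200k steps the rate is halved 8 times, so the final value is the base rate divided by 256.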

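| Target | System | Naturalness | Similarity | NCC |
|---|---|---|---|---|
| Source | GT | 4.43 ± 0.11 | 2.97 ± 0.19 | - |
| Female | GT | 4.41 ± 0.07 | 4.12 ± 0.09 | - |
| Female | BASE1 | 2.79 ± 0.18 | 2.47 ± 0.19 | 0.927 |
| Female | BASE2 | 2.85 ± 0.13 | 2.43 ± 0.16 | 0.930 |
| Female | BASE3 | 2.58 ± 0.08 | 3.55 ± 0.10 | 0.850 |
| Female | Proposed | 3.75 ± 0.18 | 3.57 ± 0.19 | 0.902 |
| Male | GT | 4.49 ± 0.08 | 4.24 ± 0.09 | - |
| Male | BASE1 | 3.10 ± 0.12 | 2.69 ± 0.11 | 0.924 |
| Male | BASE2 | 3.21 ± 0.13 | 2.75 ± 0.11 | 0.926 |
| Male | BASE3 | 2.35 ± 0.09 | 2.96 ± 0.14 | 0.870 |
| Male | Proposed | 3.64 ± 0.15 | 3.42 ± 0.09 | 0.889 |

Table 1: MOS evaluation scores of the baseline models and our proposed model. GT means ground truth.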

3.2 Evaluations

Subjective and Objective Evaluations. We compare our model with three baseline systems: (1) BASE1, the DBLSTM model [3], which was the first to use PPGs for the SVC task. In this work, we use our SASR model for PPG extraction instead. (2) BASE2, which extends BASE1 to multi-singer corpora by adding a singer lookup table. (3) BASE3, our proposed baseline system introduced in Section 2.1.

For subjective evaluations, all converted samples from each system are scored individually by 18 music professionals. Two metrics are used to assess the models: (1) Mean Opinion Score (MOS) of naturalness, an overall assessment of intonation, rhythm, melody, clarity, and expression. (2) MOS of the timbre similarity between the converted samples and the target singing voices. Both metrics use a scale from 1 to 5. Moreover, we use the normalized cross-correlation (NCC) as an objective measure of the pitch match between the prediction and the ground truth.
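One common way to compute a normalized cross-correlation between two pitch contours is sketched below. The exact NCC variant used in the evaluation is not specified in the text, so this zero-lag, non-mean-subtracted form and the toy F0 values are assumptions.

```python
import math

def normalized_cross_correlation(reference, predicted):
    """Zero-lag NCC between two equally long pitch contours.

    Returns 1.0 when the contours are proportional, 0.0 when orthogonal.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("contours must be non-empty and of equal length")
    numerator = sum(r * p for r, p in zip(reference, predicted))
    denominator = math.sqrt(
        sum(r * r for r in reference) * sum(p * p for p in predicted)
    )
    return numerator / denominator if denominator > 0 else 0.0

# Toy F0 contours in Hz: the prediction is slightly off on two frames.
source_f0 = [220.0, 233.1, 246.9, 261.6]
converted_f0 = [220.0, 230.0, 250.0, 261.6]
score = normalized_cross_correlation(source_f0, converted_f0)
```

Because this form is insensitive to a global scale factor, variants that first subtract the mean or compare log-F0 are often preferred when absolute pitch offsets matter.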

The scores are listed in Tab. 1. BASE2 scores higher than BASE1 in naturalness for both targets and in similarity for the male target, which supports our proposal to use multi-singer training data. The DBLSTM models only convert MCEP features and use the pitch values from the source audio, resulting in high accuracy in intonation and melody. BASE3 gets the lowest NCC scores and naturalness MOS compared with BASE1 and BASE2, showing its weakness in maintaining pitch and voice quality. However, in terms of similarity scores, BASE3 outperforms both DBLSTM systems substantially, as it employs a more sophisticated singer-dependent decoder than the DBLSTM network. Moreover, our full framework achieves the highest scores in both subjective metrics and obtains competitive NCC scores. The results show that the additional modules, which supply more musical information from mel spectrograms, improve naturalness and pitch precision.

| Target | System | Naturalness | Similarity | NCC |
|---|---|---|---|---|
| Female | BASE3 | 2.58 ± 0.08 | 3.55 ± 0.10 | 0.850 |
| Female | +ME | 2.81 ± 0.33 | 2.61 ± 0.27 | 0.865 |
| Female | +SC | 2.52 ± 0.10 | 3.52 ± 0.13 | 0.879 |
| Female | +MS | 3.75 ± 0.18 | 3.57 ± 0.19 | 0.901 |
| Male | BASE3 | 2.35 ± 0.09 | 2.96 ± 0.14 | 0.870 |
| Male | +ME | 2.93 ± 0.22 | 2.74 ± 0.13 | 0.895 |
| Male | +SC | 2.41 ± 0.10 | 3.04 ± 0.13 | 0.872 |
| Male | +MS | 3.64 ± 0.15 | 3.42 ± 0.09 | 0.889 |

Table 2: Ablation tests. ME means Mel encoder, SC denotes the singer confusion module, and MS denotes the mel-regressive module. The additions are cumulative.

Ablation. Ablation tests are run to show the contribution of each proposed module. Tab. 2 summarises the evaluation results. As the scores show, the Mel encoder improves the quality of the converted samples but decreases timbre similarity. Adding the adversarial singer confusion module raises the timbre similarity of the converted samples considerably. However, the naturalness MOS falls for both targets, and the NCC score falls for the male target, because of the disentangling process on the Mel encoder. The mel-regressive module removes this adverse effect of the confusion module and further improves overall performance substantially.

| Source | Female target | Male target |
|---|---|---|
| 25.35 | 30.41 | 33.56 |
| 15.30 | 24.41 | 27.38 |
| 8.18 | 22.42 | 19.64 |

Table 3: SNR values of source voices and converted voices.

Noise Robustness. To show the robustness of our method to noise, we use the proposed model to convert source audio with various levels of added white noise. The signal-to-noise ratio (SNR) is calculated to measure the clarity of the source and converted samples. The results in Tab. 3 indicate that the SNR of the converted samples falls only slightly as the added noise increases.
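The noise-injection side of this experiment can be illustrated with the sketch below, which scales Gaussian white noise so that a clean signal reaches a requested SNR. The sine-wave input, the helper names, and the 15 dB target are illustrative assumptions; how the SNR of a converted sample is estimated without a clean reference is not described in the text and is not covered here.

```python
import math
import random

def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(signal) / power(noise))

def add_white_noise(signal, target_snr_db, rng):
    """Return (noisy signal, scaled noise) with the requested SNR."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Scale the noise so that power(signal) / power(noise) matches the target.
    scale = math.sqrt(power(signal) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    noise = [scale * n for n in noise]
    return [s + n for s, n in zip(signal, noise)], noise

rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 5.0 * n / 200.0) for n in range(200)]
noisy, noise = add_white_noise(clean, 15.0, rng)
achieved = snr_db(clean, noise)
```

Because the noise is rescaled using its measured power, `achieved` matches the 15 dB target up to floating-point error.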

4 Conclusion and Future Work

In this work, we present a novel end-to-end model for singing voice conversion. PPGs extracted from a singer-independent SASR model are used as input features, so the model can perform many-to-many singing conversion. The proposed model contains an additional encoder that obtains acoustic and musical information from mel spectrograms, along with a singer confusion module and a mel-regressive representation learning module. Experiments show that our proposed model outperforms the baseline models significantly, generating natural and pitch-accurate singing voices in the target timbre. We also confirm that the proposed system converts noisy sources robustly. In future work, we will continue to improve the quality of the converted audio and attempt to achieve better performance with less data.


  1. Samples can be found at https://lzh1.github.io/singVC
  2. https://github.com/google/REAPER


References

  1. (Entry text missing in the source.)
  2. E. Battenberg, R. J. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon and T. Bagby (2020) Location-relative attention mechanisms for robust long-form speech synthesis. In ICASSP, pp. 6194–6198.
  3. X. Chen, W. Chu, J. Guo and N. Xu (2019) Singing voice conversion with non-parallel data. In MIPR, pp. 292–296.
  4. C. Deng, C. Yu, H. Lu, C. Weng and D. Yu (2020) PitchNet: unsupervised singing voice conversion with pitch adversarial network. In ICASSP, pp. 7749–7753.
  5. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand and V. S. Lempitsky (2016) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, pp. 59:1–59:35.
  6. A. Graves, S. Fernández, F. J. Gomez and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, Vol. 148, pp. 369–376.
  7. A. Graves (2013) Generating sequences with recurrent neural networks. CoRR abs/1308.0850.
  8. N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman and K. Kavukcuoglu (2018) Efficient neural audio synthesis. In ICML, Vol. 80, pp. 2415–2424.
  9. D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR.
  10. K. Kobayashi, T. Toda, G. Neubig, S. Sakti and S. Nakamura (2014) Statistical singing voice conversion with direct waveform modification based on the spectrum differential. In INTERSPEECH, pp. 2514–2518.
  11. W. Li, B. Tang, X. Yin, Y. Zhao, W. Li, K. Wang, H. Huang, Y. Wang and Z. Ma (2020) Improving accent conversion with reference encoder and end-to-end text-to-speech. CoRR abs/2005.09271.
  12. J. Lu, K. Zhou, B. Sisman and H. Li (2020) VAW-GAN for singing voice conversion with non-parallel training data. CoRR abs/2008.03992.
  13. Y. Luo, C. Hsu, K. Agres and D. Herremans (2020) Singing voice conversion with disentangled representations of singer and vocal technique using variational autoencoders. In ICASSP, pp. 3277–3281.
  14. M. Morise, F. Yokomori and K. Ozawa (2016) WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 99-D (7), pp. 1877–1884.
  15. E. Nachmani and L. Wolf (2019) Unsupervised singing voice conversion. In INTERSPEECH, pp. 2583–2587.
  16. A. Polyak, L. Wolf, Y. Adi and Y. Taigman (2020) Unsupervised cross-domain singing voice conversion. CoRR abs/2008.02830.
  17. A. Polyak and L. Wolf (2019) Attention-based WaveNet autoencoder for universal voice conversion. In ICASSP, pp. 6800–6804.
  18. K. Qian, Y. Zhang, S. Chang, X. Yang and M. Hasegawa-Johnson (2019) AutoVC: zero-shot voice style transfer with only autoencoder loss. In ICML, Vol. 97, pp. 5210–5219.
  19. B. Sisman, K. Vijayan, M. Dong and H. Li (2019) SINGAN: singing voice conversion with generative adversarial networks. In APSIPA ASC, pp. 112–118.
  20. L. Sun, K. Li, H. Wang, S. Kang and H. M. Meng (2016) Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In ICME, pp. 1–6.
  21. A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior and K. Kavukcuoglu (2016) WaveNet: A generative model for raw audio. In ISCA, pp. 125.
  22. F. Villavicencio and J. Bonada (2010) Applying voice conversion to concatenative singing-voice synthesis. In INTERSPEECH, pp. 2162–2165.
  23. Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark and R. A. Saurous (2017) Tacotron: towards end-to-end speech synthesis. In INTERSPEECH, pp. 4006–4010.
  24. L. Zhang, C. Yu, H. Lu, C. Weng, C. Zhang, Y. Wu, X. Xie, Z. Li and D. Yu (2020) DurIAN-SC: duration informed attention network based singing voice conversion system. CoRR abs/2008.03009.
  25. S. Zhang, M. Lei, Z. Yan and L. Dai (2018) Deep-FSMN for large vocabulary continuous speech recognition. In ICASSP, pp. 5869–5873.