Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders



We propose a flexible framework that deals with both singer conversion and singers' vocal technique conversion. The proposed model is trained on non-parallel corpora, accommodates many-to-many conversion, and leverages recent advances in variational autoencoders. It employs separate encoders to learn disentangled latent representations of singer identity and vocal technique, with a joint decoder for reconstruction. Conversion is carried out by simple vector arithmetic in the learned latent spaces. Both a quantitative analysis and a visualization of the converted spectrograms show that our model is able to disentangle singer identity and vocal technique and to successfully convert these attributes. To the best of our knowledge, this is the first work to jointly tackle conversion of singer identity and vocal technique based on a deep learning approach.


Yin-Jyun Luo, Chin-Cheng Hsu, Kat Agres, Dorien Herremans1
Singapore University of Technology and Design
University of Southern California, Los Angeles, United States
Institute of High Performance Computing, A*STAR, Singapore
Yong Siew Toh Conservatory of Music, National University of Singapore

Keywords: singing voice conversion, vocal technique, variational autoencoders, disentangled representations

1 Introduction

Singing voice conversion (SVC) broadly refers to tasks that convert an attribute of a singing voice. Converting one singer's voice to that of another without affecting linguistic content has been the focus of SVC research [20, 3, 12, 14, 1]. Converting between different vocal techniques, however, is a worthwhile line of research that has received little attention. Such an approach would allow one to convert a singing voice into a timbre or pitch that was originally infeasible due to physical constraints or lack of singing skills, thereby facilitating applications in entertainment and pedagogy.

Vocal techniques, such as 'breathy' and 'vibrato', enrich the sound and are an integral part of singing. Singers perform different techniques at different points in time so as to create emotional ebbs and flows. Modeling vocal techniques with data-driven models is challenging because of, among other factors, the lack of labeled and balanced data and the intrinsic ambiguity of the techniques.

Figure 1: The proposed framework, fully detailed in Section 3.2. The blue, red and green blocks correspond to the singer encoder, vocal technique encoder, and the joint decoder, respectively.

We propose a framework that deals with the conversion of both singer identity and vocal technique. We augment the model based on the Deep Bi-directional Long Short-Term Memory (DBLSTM) from [18] with latent variables, such that it learns disentangled representations for both singer and vocal technique through Gaussian mixture variational autoencoders (GMVAEs) [2, 8]. Unlike typical SVC models that condition generation of singing voice on an utterance-level singer label [16, 1, 14], our model is conditioned on time-dependent singer/technique variables on a shorter temporal scale, accommodating cases in which the vocal technique varies across time. The proposed model can be trained on non-parallel (i.e., unpaired) corpora, and allows for many-to-many conversion of singer identities and vocal techniques.

We describe our modified implementation of the GMVAE model [7, 13] along with our singer/vocal technique conversion strategy in Section 2. Next, we elaborate on our experimental setup using VocalSet [21], a dataset featuring singing techniques, in Section 3. Finally, we report and discuss the experimental results in Section 4.

2 Method

2.1 Variational Autoencoders

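Our proposed singing voice generation process ($p(X)$) based on VAE includes the generation of a chunk of spectrogram $X$, which is generated from a latent variable $z$. This simple dependency structure allows us to apply variational inference, which optimizes the evidence lower bound (ELBO) of $\log p(X)$:

$$\mathcal{L}(X) = \mathbb{E}_{q(z \mid X)}\left[\log p(X \mid z)\right] - D_{\mathrm{KL}}\left(q(z \mid X) \,\|\, p(z)\right), \tag{1}$$

where we assume $p(z) = \mathcal{N}(0, I)$ and $q(z \mid X) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, in which $\mu$ and $\sigma$ are inferred from $X$ by an encoder, and $p(X \mid z)$ is predicted by a decoder. Reconstructing $X$ from $z$, which is itself inferred from $X$ using variational methods, thus constitutes a VAE.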
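As a rough illustration of the two ingredients a VAE of this kind relies on, the sketch below computes the closed-form KL term of the ELBO and draws a reparameterized sample. It assumes the usual VAE setting of a standard-normal prior and a diagonal-Gaussian posterior; the function names are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Assumes a standard-normal prior and a diagonal-Gaussian posterior.
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Toy usage: a 3-dimensional posterior for one spectrogram chunk.
rng = random.Random(0)
mu, sigma = [0.5, -1.0, 0.0], [1.0, 0.5, 2.0]
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The KL value is zero only when the posterior equals the prior, which is why this term pulls the encoder output toward a simple distribution.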

2.2 Increasing Expressivity using a GMM Prior

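The above assumption on $p(z)$ reflects the preference for a simple distribution of data, which in turn sacrifices model expressivity. Replacing $p(z)$ with a Gaussian mixture model (GMM), an approach known as GMVAE, has proven effective in increasing expressivity and controllability [2, 8, 7, 13]. It has an additional layer of dependency, $y \rightarrow z \rightarrow X$, that enables us to utilize categorical attributes $y$ that may be available in data. The ELBO of a GMVAE then becomes:

$$\mathcal{L}(X) = \mathbb{E}_{q(z \mid X)}\left[\log p(X \mid z)\right] - \mathbb{E}_{q(y \mid X)}\left[D_{\mathrm{KL}}\left(q(z \mid X) \,\|\, p(z \mid y)\right)\right] - D_{\mathrm{KL}}\left(q(y \mid X) \,\|\, p(y)\right), \tag{2}$$

where the prior $p(z) = \sum_{y} p(y)\, p(z \mid y)$ is now multi-modal (GMM) and therefore better suited to modeling data with higher diversity. In addition, introducing $y$ endows the model with direct controllability over attributes and flexibility for generation, as will be elaborated next.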
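To make the GMM-prior regularizer concrete, here is a minimal sketch of the two KL terms that appear in a GMVAE-style ELBO: the expected Gaussian KL to each mixture component, plus the categorical KL between the inferred and prior class distributions. All names are hypothetical, and the diagonal-Gaussian parameterization is an assumption for illustration rather than a statement about the authors' code.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) )."""
    return sum(
        math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


def kl_categorical(q, p):
    """KL(q || p) for two categorical distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def gmvae_regularizer(mu_q, sigma_q, q_y, comp_means, comp_sigmas, p_y):
    """E_{q(y|X)}[KL(q(z|X) || p(z|y))] + KL(q(y|X) || p(y))."""
    expected_kl = sum(
        qy * kl_diag_gaussians(mu_q, sigma_q, mu_k, sigma_k)
        for qy, mu_k, sigma_k in zip(q_y, comp_means, comp_sigmas)
    )
    return expected_kl + kl_categorical(q_y, p_y)


# Toy usage: two mixture components in a 1-D latent space.
comp_means, comp_sigmas = [[0.0], [2.0]], [[1.0], [1.0]]
reg = gmvae_regularizer([1.0], [1.0], [0.5, 0.5], comp_means, comp_sigmas, [0.5, 0.5])
```

In the toy usage the posterior sits halfway between the two components, so each component contributes the same Gaussian KL and the categorical term vanishes.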

2.3 Controlling Singer Identities and Vocal Techniques

Using the proposed GMVAE, the generation process for singing voice is as follows: given a singer $y_s$ and a vocal technique $y_t$ (collectively referred to as attributes), the model first infers latent representations ($z_s$ and $z_t$) of each of the attributes, and then combines these two to generate a spectrogram of the singing voice. Mathematically, the joint probability given the attributes can be factorized as follows:

$$p(X, z_s, z_t \mid y_s, y_t) = p(X \mid z_s, z_t)\, p(z_s \mid y_s)\, p(z_t \mid y_t), \tag{3}$$

where both the conditional distributions $p(z_s \mid y_s)$ and $p(z_t \mid y_t)$ are assumed to be Gaussian with learnable means and diagonal covariances. This GMVAE model now takes into consideration singer and vocal technique and thus can directly control them during the conversion phase.

2.4 Learning an Attribute-discriminative Space

We incorporate two classifiers, one for vocal techniques and the other for singers, to encourage the learned spaces of $z_s$ and $z_t$ to be discriminative. Each classifier learns to predict $y_j$ from the sequence $z_{j,1}, \dots, z_{j,N}$, where $N$ is the number of chunks of a recording and $j \in \{s, t\}$ denotes singers or vocal techniques. The classifier receives a sequence-level representation that is summarized by a simplified attention mechanism [17, 15]:

$$e_{j,n} = a(z_{j,n}), \qquad \alpha_{j,n} = \frac{\exp(e_{j,n})}{\sum_{k=1}^{N} \exp(e_{j,k})}, \qquad c_j = \sum_{n=1}^{N} \alpha_{j,n}\, z_{j,n}, \tag{4}$$

where $a$ is a learnable function and $c_j$ denotes the summarization (and thus the representation) of the input sequence $z_{j,1}, \dots, z_{j,N}$. Without the attention module, $\alpha_{j,n}$ is uniformly distributed, i.e., $\alpha_{j,n} = 1/N$. The auxiliary objective of maximizing $\log p(y_j \mid c_j)$ is thus added to the overall objective.
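The attention-based summarization can be sketched as a softmax over per-chunk scores followed by a weighted sum of the chunk-level latent vectors. In the sketch below, `score_fn` stands in for the learnable scoring function and is a hypothetical placeholder; a constant score reproduces the uniform weighting that applies when no attention module is used.

```python
import math


def attention_pool(z_seq, score_fn):
    """Summarize a sequence of latent vectors with softmax attention weights.

    z_seq:    list of N latent vectors (one per chunk), each a list of floats.
    score_fn: maps one latent vector to a scalar score (placeholder for a
              learnable function).
    Returns (alphas, c): the N attention weights and the pooled vector.
    """
    scores = [score_fn(z) for z in z_seq]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(z_seq[0])
    c = [sum(a * z[d] for a, z in zip(alphas, z_seq)) for d in range(dim)]
    return alphas, c


# Toy usage: a constant score gives uniform weights, so c is the plain mean.
alphas, c = attention_pool([[0.0, 0.0], [2.0, 4.0]], lambda z: 0.0)
```

The same weights can then be reused to decide how strongly each chunk contributes to other terms of the objective.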

2.5 Weighting KLDs

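A singer might be unable to maintain the same level of expressiveness of a technique throughout a recording. Similarly, voice timbre and pitch also vary across a recording even though a singer is asked to perform the same vocal technique. Based on these observations, we may benefit from weighting the KLD terms of (2) with $\alpha_{j,n}$ obtained from Section 2.4. Completing our training objective, we have

$$\mathcal{L} = \sum_{n=1}^{N} \Big\{ \mathbb{E}_{q(z_{s,n} \mid X_n)\, q(z_{t,n} \mid X_n)}\left[\log p(X_n \mid z_{s,n}, z_{t,n})\right] - \sum_{j \in \{s,t\}} \alpha_{j,n} \Big( \mathbb{E}_{q(y_j \mid X_n)}\left[D_{\mathrm{KL}}\left(q(z_{j,n} \mid X_n) \,\|\, p(z_{j,n} \mid y_j)\right)\right] + D_{\mathrm{KL}}\left(q(y_j \mid X_n) \,\|\, p(y_j)\right) \Big) \Big\} + \sum_{j \in \{s,t\}} \lambda_j \log p(y_j \mid c_j), \tag{5}$$

where $\lambda_s$ and $\lambda_t$ are weights for the discriminative objectives.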

2.6 Conversion Strategies

We accommodate the time-varying singing attributes in expressive singing voices by learning latent variables at a shorter temporal scale. Consequently, we use the model to infer the attributes, rather than assigning attributes directly during the conversion phase.2 Generally, conversion is done by adding a conversion vector $v$ to $z$ at the chunk level. We define $v = \mu_{\mathrm{tgt}} - \mu_{\mathrm{src}}$, where $\mu$ represents the mean of a Gaussian mixture component. The source component $\mu_{\mathrm{src}}$ can be determined by either Gaussian likelihood or the auxiliary sequence-level classifier $p(y_j \mid c_j)$, referred to as C-chunk and C-sequence, respectively. Note that the latter computes a common $\mu_{\mathrm{src}}$ (and hence $v$) over all chunks. We report the results from both methods below.
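The chunk-level conversion can be sketched as plain vector arithmetic on the latent code. The sketch assumes that the conversion vector is the difference between the target and source mixture-component means, and that all components share the same fixed variance, so that picking the source by Gaussian likelihood reduces to picking the nearest mean. The helper names are hypothetical.

```python
def nearest_component(z, means):
    """Index of the mixture component whose mean is closest to z.

    With identical fixed variances across components (an assumption here),
    this equals the component with the highest Gaussian likelihood.
    """
    def sq_dist(k):
        return sum((zi - mi) ** 2 for zi, mi in zip(z, means[k]))
    return min(range(len(means)), key=sq_dist)


def convert_chunk(z, means, target, source=None):
    """Shift z by (mean of target component - mean of source component).

    If source is None it is inferred per chunk from z (C-chunk style);
    passing one shared source index for all chunks mimics C-sequence.
    """
    if source is None:
        source = nearest_component(z, means)
    v = [mt - ms for mt, ms in zip(means[target], means[source])]
    return [zi + vi for zi, vi in zip(z, v)]


# Toy usage: two components in a 2-D latent space.
means = [[0.0, 0.0], [4.0, 0.0]]
z_converted = convert_chunk([0.5, 0.1], means, target=1)
```

Because the offset of `z` from its source mean is preserved, chunk-specific variation is carried over to the target region of the latent space.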

3 Experimental Settings

3.1 Dataset

We use VocalSet [21] to evaluate our framework. The subset of audio files we selected covers 20 singers, the 6 most distinguishable vocal techniques (straight, breathy, vibrato, belt, lip trill, and vocal fry), and 5 vowels. Each recording was sung as either a scale or an arpeggio. The subset was divided into a training set of 1,065 recordings and a testing set of 118 (17 of the 1,200 possible combinations are missing). The length of the recordings ranges from 3.5 to 23 seconds. This subset is approximately balanced in the number of instances per class for singers, techniques, and vowels.

We re-sampled the recordings to 22,050 Hz, normalized the waveform w.r.t. the largest magnitude, computed the log-magnitude Mel-spectrogram (MEL) with filter banks, and then rescaled it to [-1, 1]. We further segmented the MEL into fixed-length chunks of frames. A frame shift of 256 samples was used for computing the MELs, so that 43 frames amount to 0.5 seconds.
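The frame arithmetic can be checked with a few lines. The sampling rate and frame shift below come from the text; the chunk length of 43 frames and the choice to drop a trailing partial chunk are assumptions made only for this sketch.

```python
SAMPLE_RATE = 22050  # Hz, from the text
FRAME_SHIFT = 256    # samples between consecutive frames, from the text


def frames_to_seconds(num_frames):
    """Duration covered by num_frames frame shifts."""
    return num_frames * FRAME_SHIFT / SAMPLE_RATE


def split_into_chunks(frames, chunk_len=43):
    """Split a list of frames into non-overlapping chunks of chunk_len frames.

    A trailing partial chunk is dropped (an assumption for illustration).
    """
    last_start = len(frames) - chunk_len
    return [frames[i:i + chunk_len] for i in range(0, last_start + 1, chunk_len)]


# 43 frames * 256 samples / 22050 Hz is just under half a second.
duration = frames_to_seconds(43)
chunks = split_into_chunks(list(range(100)))
```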

3.2 Architecture

Our SVC framework encompasses seven components: a feature extraction network (FEN), two encoders, a decoder, a post-processor, and two sequence-level classifiers. The overall architecture is shown in Fig. 1.

The FEN is composed of a two-layer one-dimensional convolutional neural network (CNN), each layer with 512 filters, followed by two fully-connected (FC) layers with 512 and 256 units, respectively. Batch normalization followed by ReLU is used for every layer. The FEN produces a 256-dimensional bottleneck feature for a given input MEL chunk, which is shared and consumed by both encoders that follow.

Both encoders are parameterized as two-layer Recurrent Neural Networks (RNNs): a BLSTM with 256 hidden units, followed by two FC layers shared across time which predict $\mu$ and $\sigma$. $z$ is then sampled using the reparameterization trick from [11]. The joint decoder has an architecture similar to that of the encoders but in reverse order. At each time step of the output sequence, a CNN with an architecture that is symmetric to the FEN is employed to reconstruct the MEL chunk. Finally, the post-processor, a refinement network consisting of a three-layer one-dimensional CNN with 512 filters, is used to refine the reconstructed MELs.

3.3 Hyperparameters

The mean vectors of the mixture components were all randomly initialized, whereas the variance vectors were kept fixed at a constant value. The number of mixtures is set to 20 for singers and to 6 for vocal techniques. Both are equal to the number of classes they correspond to. We set the batch size to 128, and initialized the model parameters with Xavier initialization [5]. The Adam optimizer [10] was used.

3.4 Evaluation Metrics

We evaluate our model by how well a classifier recognizes the attributes in the converted MELs. The main idea is that the converted MELs should be classified as the target class, while attributes that are not intended to be converted should receive the same prediction as before conversion. The classification results thereby reveal the effects of conversion on the output MELs. Each recording in the test set is first converted to all possible target attributes and then evaluated by three classifiers that recognize singer, singing technique, or vowel. These classifiers have the same architecture as the combination of FEN, RNNs, and the attention module, and are trained independently on unconverted MELs.
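The evaluation logic amounts to two accuracy computations per conversion: one against the target label for the converted attribute, and one against the original label for each attribute that should be preserved. A minimal sketch follows, with hypothetical function names and toy labels.

```python
def accuracy(predictions, references):
    """Percentage of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def evaluate_conversion(pred_converted, target_labels, pred_preserved, source_labels):
    """Score one conversion experiment.

    pred_converted: classifier outputs for the attribute that was converted.
    target_labels:  the conversion targets (what the classifier should say).
    pred_preserved: classifier outputs for an attribute left untouched.
    source_labels:  the original labels of that untouched attribute.
    """
    return {
        "converted": accuracy(pred_converted, target_labels),
        "preserved": accuracy(pred_preserved, source_labels),
    }


# Toy usage with made-up labels (not results from the paper).
scores = evaluate_conversion(
    ["vibrato", "belt", "straight", "belt"],
    ["vibrato", "belt", "breathy", "belt"],
    ["a", "e", "i", "o"],
    ["a", "e", "i", "o"],
)
```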

| Strategy | Model | Singer conversion: *Singer | Singer conversion: Technique | Singer conversion: Vowel | Technique conversion: Singer | Technique conversion: *Technique | Technique conversion: Vowel |
| --- | --- | --- | --- | --- | --- | --- | --- |
| - | M0 | 89.83 / NA | 90.68 / NA | 77.97 / NA | 89.83 / NA | 90.68 / NA | 77.97 / NA |
| C-chunk | M1 | 80.51 / 63.35 | 83.05 / 75.51 | 77.97 / 69.24 | 80.51 / 75.99 | 83.05 / 54.38 | 77.97 / 72.74 |
| C-chunk | M2 | 87.29 / 76.95 | 83.90 / 76.99 | 72.03 / 66.78 | 87.29 / 81.92 | 83.90 / 65.82 | 72.03 / 71.33 |
| C-chunk | M3 | 83.05 / 75.68 | 88.98 / 79.32 | 73.73 / 72.88 | 83.05 / 84.18 | 88.98 / 67.66 | 73.73 / 71.47 |
| C-sequence | M2 | 87.29 / 76.65 | 83.90 / 76.64 | 72.03 / 66.69 | 87.29 / 82.06 | 83.90 / 65.64 | 72.03 / 71.33 |
| C-sequence | M3 | 83.05 / 75.47 | 88.98 / 79.62 | 73.73 / 72.83 | 83.05 / 84.04 | 88.98 / 67.94 | 73.73 / 72.03 |

Table 1: The classification accuracy (%) derived by the three attribute classifiers, given different models. Each cell lists the accuracy before / after conversion. * denotes the converted attributes.
Figure 2: Examples of singer conversion (a) and vocal technique conversion (b), converted by the model M3. The first column refers to source, and the rest correspond to different targets. Targets that are the same as the sources are faded.

4 Results

4.1 Recognizing Attributes from Converted MELs

We compare three variants of the proposed model: M1 denotes the model trained with neither the attention module nor the discriminative objectives ($\lambda_s = \lambda_t = 0$). M2 is similar to M1 but with the discriminative objectives enabled (nonzero $\lambda_s$ and $\lambda_t$). Finally, M3 is the model equipped with the attention module. The performance of M0 serves as the upper bound for classification results using unconverted MELs as input. The results are listed in Table 1. Higher numbers represent better performance in all cases.

We summarize our findings as follows: First, for the singer and technique attributes, the MELs converted by M2 are recognized with higher accuracy than those from M1 in all cases, although vowel accuracy after conversion is slightly lower for M2; this supports our belief that including discriminative objectives helps the model to disentangle certain attributes from the input. Second, results from M2 and M3 are similar, but M2 is better at singer conversion, whereas M3 is slightly better at converting vocal techniques. Third, there is no noticeable difference in performance between the two conversion strategies C-chunk and C-sequence. Fourth, converting vocal techniques is much more challenging than converting singer identities, as the technique accuracy drops from 90.68% to at most 67.94% after conversion, even though there are fewer techniques than singers.

4.2 Visualization

We visualize some examples of converted MELs in Fig. 2. In the upper panel (a), it is clear that the overall pitch level changes when doing cross-gender conversion. For technique conversion (b), we can see that converting the lip trill to, e.g., straight, makes the spectrogram less flattened. On the other hand, we can decorate a straight tone by converting it to a bright and energized vocalization, as can be seen in straight-belt conversion, or to one with periodic frequency modulation, as shown in straight-vibrato conversion. Noticeable effects are also observed when the targets are lip trill and vocal fry.

Despite the change of spectral distribution, the overall pitch contours are retained in all source-target pairs. This hints towards the model’s ability to perform many-to-many singer identity and vocal technique conversion.

Conversion at the chunk level enables us to morph from source to target by gradually increasing the conversion vector from 0 to $v$. As such, we can, e.g., convert a straight tone to gradually express another technique over time. This has not yet been seen in other existing SVC frameworks, and we leave further investigation for future research.
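Gradual morphing can be sketched by scaling the conversion vector with a weight that grows across chunks. The linear ramp used below is an assumption chosen for simplicity; the text does not specify a schedule, and the function name is hypothetical.

```python
def morph_sequence(z_seq, v):
    """Apply a conversion vector with a weight ramping linearly from 0 to 1.

    z_seq: list of chunk-level latent vectors in temporal order.
    v:     conversion vector (same dimensionality as each latent vector).
    The first chunk is left unchanged and the last receives the full vector.
    """
    n = len(z_seq)
    morphed = []
    for i, z in enumerate(z_seq):
        weight = i / (n - 1) if n > 1 else 1.0
        morphed.append([zi + weight * vi for zi, vi in zip(z, v)])
    return morphed


# Toy usage: three chunks starting at the origin, morphed toward v.
morphed = morph_sequence([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [2.0, 4.0])
```

Other schedules, such as a sigmoid ramp or a ramp confined to part of the recording, would fit the same interface.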

5 Related Work

Recent advances in deep learning have brought great success to VC [19, 6, 9]; SVC, on the other hand, has not benefited from them until recently. SINGAN [16], an SVC framework based on deep generative models, has been proposed to map acoustic features of a source singer to those of a target singer; however, the model is restricted to temporally-aligned singing recordings. In contrast, [1] proposed to incorporate automatic speech recognition into SVC so that the model is trainable from non-parallel corpora. The model, however, only allows for converting a handful of source singers to a single target. Recently, an encoder-decoder model was proposed which incorporates a domain confusion network [4] to learn singer-agnostic features [14].

We distinguish ourselves by jointly modeling singer and technique with a principled probabilistic generative model, and conditioning the generation of singing voice on time-dependent latent variables of the aforementioned attributes. To the best of our knowledge, this is the first study on jointly modeling/converting singer identity and vocal technique with a single deep learning model.

6 Conclusion and Future Work

We have proposed a flexible framework based on GMVAEs to tackle non-parallel, many-to-many SVC for singer identity and vocal technique. Audio samples are available online. Analyzing the temporal dynamics of the latent variables, as well as accommodating the dependency between singer identity and vocal technique variables, will be the focus of our future work.


  1. This work is supported by a SINGA provided by the A*STAR, under reference number SING-2018-01-1270.
  2. Empirically, we found training a model conditioned on an utterance-level technique label did not work well, and we resorted to the proposed method.


  1. X. Chen, W. Chu, J. Guo and N. Xu (2019) Singing voice conversion with non-parallel data. In IEEE Conf. on Multimedia Information Processing and Retrieval, pp. 292–296.
  2. N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C.-H. Lee, H. Salimbeni, K. Arulkumaran and M. Shanahan (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  3. H. Doi, T. Toda, T. Nakano, M. Goto and S. Nakamura (2012) Singing voice conversion method based on many-to-many eigenvoice conversion and training data generation using a singing-to-singing synthesis system. In APSIPA, pp. 1–6.
  4. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
  5. X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256.
  6. C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao and H.-M. Wang (2016) Voice conversion from non-parallel corpora using variational auto-encoder. In APSIPA, pp. 1–6.
  7. W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen and R. Pang (2019) Hierarchical generative modeling for controllable speech synthesis. In ICLR.
  8. Z. Jiang, Y. Zheng, H. Tan, B. Tang and H. Zhou (2017) Variational deep embedding: an unsupervised and generative approach to clustering. In Int. Joint Conf. on Artificial Intelligence.
  9. H. Kameoka, T. Kaneko, K. Tanaka and N. Hojo (2018) StarGAN-VC: non-parallel many-to-many voice conversion using star generative adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 266–273.
  10. D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR.
  11. D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In ICLR.
  12. K. Kobayashi, T. Toda, G. Neubig, S. Sakti and S. Nakamura (2014) Statistical singing voice conversion with direct waveform modification based on the spectrum differential. In INTERSPEECH.
  13. Y.-J. Luo, K. Agres and D. Herremans (2019) Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders. In ISMIR.
  14. E. Nachmani and L. Wolf (2019) Unsupervised singing voice conversion. arXiv preprint.
  15. C. Raffel and D. P. Ellis (2016) Feed-forward networks with attention can solve some long-term memory problems. In ICLR, workshop track.
  16. B. Sisman, K. Vijayan, M. Dong and H. Li (2019) SINGAN: singing voice conversion with generative adversarial networks. In APSIPA.
  17. S. K. Sønderby, C. K. Sønderby, H. Nielsen and O. Winther (2015) Convolutional LSTM networks for subcellular localization of proteins. In Int. Conf. on Algorithms for Computational Biology, pp. 68–80.
  18. L. Sun, S. Kang, K. Li and H. Meng (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In ICASSP, pp. 4869–4873.
  19. L. Sun, S. Kang, K. Li and H. Meng (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In ICASSP, pp. 4869–4873.
  20. F. Villavicencio and J. Bonada (2010) Applying voice conversion to concatenative singing-voice synthesis. In INTERSPEECH.
  21. J. Wilkins, P. Seetharaman, A. Wahl and B. Pardo (2018) VocalSet: A singing voice dataset. In ISMIR, pp. 468–474.