StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation, Japan

Abstract

This paper proposes a method that allows for non-parallel many-to-many voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN. Our method, which we term StarGAN-VC, is remarkable in that it (1) requires neither parallel utterances, transcriptions, nor time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network, (3) is able to generate signals of converted speech quickly enough to allow for real-time implementations and (4) requires only several minutes of training examples to generate reasonably realistic-sounding speech. Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.


Index Terms—  Voice conversion (VC), non-parallel VC, many-to-many VC, generative adversarial networks (GANs), CycleGAN-VC, StarGAN-VC

1 Introduction

Voice conversion (VC) is a technique to convert para/non-linguistic information of a given utterance while preserving linguistic information. This technique can be applied to various tasks such as speaker-identity modification for text-to-speech (TTS) systems [1], speaking assistance [2, 3], speech enhancement [4, 5, 6], and pronunciation conversion [7].

One successful VC framework involves statistical methods based on Gaussian mixture models (GMMs) [8, 9, 10]. Recently, neural network (NN)-based frameworks based on feed-forward deep NNs [11, 12, 13], recurrent NNs [14], and generative adversarial nets (GANs) [7], as well as an exemplar-based framework based on non-negative matrix factorization (NMF) [15, 16], have also proved successful. Many conventional VC methods, including those mentioned above, require accurately aligned parallel data of source and target speech. However, in general scenarios it is not always possible to collect parallel utterances. Even if we could collect such data, we would typically need to perform time alignment procedures, which become relatively difficult when there is a large acoustic gap between source and target speech. Since many frameworks are sensitive to the misalignment found in parallel data, careful pre-screening and manual correction may be required to make these frameworks work reliably. To bypass these restrictions, this paper is concerned with developing a non-parallel VC method, which requires neither parallel utterances, transcriptions, nor time alignment procedures.

In general, the quality and conversion effect obtained with non-parallel methods are limited compared with methods that use parallel data, owing to the disadvantage related to the training condition. Thus, developing non-parallel methods that achieve audio quality and conversion effect as high as those of parallel methods can be very challenging. Recently, some attempts have been made to develop non-parallel methods [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]. For example, a method using automatic speech recognition (ASR) was proposed in [24]. The idea is to convert input speech under the restriction that the posterior state probability of the acoustic model of an ASR system is preserved, so that the transcription of the converted speech remains consistent with that of the input speech. Since the performance of this method depends heavily on the quality of the acoustic model of ASR, it can fail to work if the ASR does not function reliably. A method using i-vectors [30], known as a feature for speaker verification, was recently proposed in [25]. Conceptually, the idea is to shift the acoustic features of input speech towards target speech in the i-vector space, so that the converted speech is likely to be recognized as the target speaker by a speaker recognizer. While this method is also free from parallel data, one limitation is that it is specialized only for speaker identity conversion tasks.

Recently, a framework based on conditional variational autoencoders (CVAEs) [31, 32] was proposed in [22, 29]. As the name implies, variational autoencoders (VAEs) are a probabilistic counterpart of autoencoders (AEs), consisting of encoder and decoder networks. CVAEs [32] are an extended version of VAEs where the encoder and decoder networks can take an auxiliary variable $c$ as an additional input. By using acoustic features as the training examples and the associated attribute labels as $c$, the networks learn how to convert the attribute of source speech to a target attribute according to the attribute label fed into the decoder. This CVAE-based VC approach is notable in that it is completely free from parallel data and works even with unaligned corpora. However, one well-known problem as regards VAEs is that outputs from the decoder tend to be oversmoothed. For VC applications, this can be problematic since it usually results in buzzy-sounding, poor-quality speech.

One powerful framework that can potentially overcome the weakness of VAEs involves GANs [33]. GANs offer a general framework for training a generator network in such a way that it can deceive a real/fake discriminator network. While they have been found to be effective for image generation, they have also been applied with notable success to various speech processing tasks in recent years [34, 35, 36, 37, 7, 38]. We previously proposed a non-parallel VC method using a GAN variant called the cycle-consistent GAN (CycleGAN) [26], which was originally proposed as a method for translating images using unpaired training examples [39, 40, 41]. This method, which we call CycleGAN-VC, aims to learn a mapping $G$ that converts acoustic features from one attribute domain $X$ into another attribute domain $Y$, its inverse mapping $F$, and a discriminator $D$, whose role is to distinguish acoustic features of converted speech from those of real speech, through a training loss combining an adversarial loss and a cycle consistency loss. Although this method was shown to work reasonably well, one major limitation is that it only learns one-to-one mappings. For many VC application scenarios, it is desirable to obtain many-to-many mappings. One naive way of applying CycleGAN to many-to-many VC tasks would be to train different $G$ and $F$ pairs for all pairs of attribute domains. However, this may be ineffective since all attribute domains are common in the sense that they represent speech, and so there must be common latent features that can be shared across different domains. In practice, the number of parameters will increase quadratically with the number of attribute domains, making parameter training challenging particularly when the amount of training examples in each domain is limited.

A common limitation of CVAE-VC and CycleGAN-VC is that the attribute of the input speech must be known at test time. With CVAE-VC, the source attribute label must be fed into the encoder of the trained CVAE, and with CycleGAN-VC, the source attribute domain at training and test times must be the same.

To overcome the shortcomings and limitations of CVAE-VC [22] and CycleGAN-VC [26], this paper proposes a non-parallel many-to-many VC method using a recently proposed GAN variant called StarGAN [42], which offers the advantages of CVAE-VC and CycleGAN-VC concurrently. Unlike CycleGAN-VC and as with CVAE-VC, our method, which we call StarGAN-VC, is capable of simultaneously learning many-to-many mappings using a single encoder-decoder type generator network, where the attributes of the generator outputs are controlled by an auxiliary input $c$. Unlike CVAE-VC and as with CycleGAN-VC, StarGAN-VC uses an adversarial loss for generator training to encourage the generator outputs to become indistinguishable from real speech and to ensure that the mappings between each pair of attribute domains preserve linguistic information. It is also noteworthy that unlike CVAE-VC and CycleGAN-VC, StarGAN-VC does not require any information about the attribute of the input speech at test time.

The VAE-GAN framework [43] is perhaps another natural way of overcoming the weakness of VAEs. A non-parallel VC method based on this framework was already proposed in [23]. With this approach, an adversarial loss derived using a GAN discriminator is incorporated into the training loss to encourage the decoder outputs of a CVAE to be indistinguishable from real speech features. Although the concept is similar to our StarGAN-VC approach, we will show in Section 4 that our approach outperforms this approach in terms of both the audio quality and conversion effect.

Another related approach worth noting is the vector quantized VAE (VQ-VAE) approach [27], which has shown impressive performance on non-parallel VC tasks. This approach is particularly notable in that it offers a novel way of overcoming the weakness of VAEs by using the WaveNet model [44], a sample-by-sample neural signal generator, to devise both the encoder and decoder of a discrete counterpart of CVAEs. The original WaveNet model is an autoregressive model, which predicts the distribution of each sample conditioned on the samples generated so far. While a faster version [45] has recently been proposed, the original model typically requires a huge computational cost to generate a stream of samples, which can make real-time implementation difficult. The model is also known to require a huge amount of training examples to be able to generate natural-sounding speech. By contrast, our method is noteworthy in that it is able to generate signals quickly enough to allow for a real-time implementation and requires only several minutes of training examples to generate reasonably realistic-sounding speech.

The remainder of this paper is organized as follows. We briefly review the formulation of CycleGAN-VC in Section 2, present the idea of StarGAN-VC in Section 3 and show experimental results in Section 4.

2 CycleGAN Voice Conversion

Since the present method is an extension of CycleGAN-VC, which we proposed previously [26], we start by briefly reviewing its formulation.

Let $\mathbf{x} \in \mathbb{R}^{Q \times N}$ and $\mathbf{y} \in \mathbb{R}^{Q \times M}$ be acoustic feature sequences of speech belonging to attribute domains $X$ and $Y$, respectively, where $Q$ is the feature dimension and $N$ and $M$ are the lengths of the sequences. The aim of CycleGAN-VC is to learn a mapping $G$ that converts the attribute of $\mathbf{x}$ into $Y$ and a mapping $F$ that does the opposite. Now, we introduce discriminators $D_X$ and $D_Y$, whose roles are to predict whether or not their inputs are acoustic features of real speech belonging to $X$ and $Y$, and define

$\mathcal{L}_{\mathrm{adv}}^{D_Y}(D_Y) = -\mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log D_Y(\mathbf{y})] - \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log (1 - D_Y(G(\mathbf{x})))],$   (1)
$\mathcal{L}_{\mathrm{adv}}^{G}(G) = \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log (1 - D_Y(G(\mathbf{x})))],$   (2)
$\mathcal{L}_{\mathrm{adv}}^{D_X}(D_X) = -\mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log D_X(\mathbf{x})] - \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log (1 - D_X(F(\mathbf{y})))],$   (3)
$\mathcal{L}_{\mathrm{adv}}^{F}(F) = \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log (1 - D_X(F(\mathbf{y})))]$   (4)

as the adversarial losses for $D_Y$, $G$, $D_X$ and $F$, respectively. $\mathcal{L}_{\mathrm{adv}}^{D_Y}(D_Y)$ and $\mathcal{L}_{\mathrm{adv}}^{D_X}(D_X)$ measure how indistinguishable $G(\mathbf{x})$ and $F(\mathbf{y})$ are from acoustic features of real speech belonging to $Y$ and $X$. Since the goal of $D_Y$ and $D_X$ is to correctly distinguish the converted feature sequences obtained via $G$ and $F$ from real speech feature sequences, $D_Y$ and $D_X$ attempt to minimize these losses in order not to be fooled by $G$ and $F$. Conversely, since one of the goals of $G$ and $F$ is to generate realistic-sounding speech that is indistinguishable from real speech, $G$ and $F$ attempt to maximize these losses, or equivalently minimize $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ and $\mathcal{L}_{\mathrm{adv}}^{F}(F)$, to fool $D_Y$ and $D_X$. It can be shown that the output distributions of $G$ and $F$ trained in this way will match the empirical distributions $p_Y(\mathbf{y})$ and $p_X(\mathbf{x})$. Note that since $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ and $\mathcal{L}_{\mathrm{adv}}^{F}(F)$ are minimized when $D_Y(G(\mathbf{x})) = 1$ and $D_X(F(\mathbf{y})) = 1$, we can also use $-\mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\log D_Y(G(\mathbf{x}))]$ and $-\mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\log D_X(F(\mathbf{y}))]$ as the adversarial losses for $G$ and $F$.

As mentioned in Section 1, training $G$ and $F$ using only the adversarial losses does not guarantee that $G$ or $F$ will preserve the linguistic information of input speech, since there are infinitely many mappings that will induce the same output distributions. To further regularize these mappings, we introduce a cycle consistency loss

$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\|F(G(\mathbf{x})) - \mathbf{x}\|_1] + \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\|G(F(\mathbf{y})) - \mathbf{y}\|_1]$   (5)

to encourage $F(G(\mathbf{x})) \simeq \mathbf{x}$ and $G(F(\mathbf{y})) \simeq \mathbf{y}$. With the same motivation, we also consider an identity mapping loss

$\mathcal{L}_{\mathrm{id}}(G, F) = \mathbb{E}_{\mathbf{x} \sim p_X(\mathbf{x})}[\|F(\mathbf{x}) - \mathbf{x}\|_1] + \mathbb{E}_{\mathbf{y} \sim p_Y(\mathbf{y})}[\|G(\mathbf{y}) - \mathbf{y}\|_1]$   (6)

to ensure that inputs to $G$ and $F$ are kept unchanged when the inputs already belong to $Y$ and $X$. The full objectives of CycleGAN-VC to be minimized with respect to $G$, $F$ and $D_X$, $D_Y$ are thus given as

$\mathcal{I}_{G,F}(G, F) = \mathcal{L}_{\mathrm{adv}}^{G}(G) + \mathcal{L}_{\mathrm{adv}}^{F}(F) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}}(G, F) + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}(G, F),$   (7)
$\mathcal{I}_{D}(D_X, D_Y) = \mathcal{L}_{\mathrm{adv}}^{D_X}(D_X) + \mathcal{L}_{\mathrm{adv}}^{D_Y}(D_Y),$   (8)

where $\lambda_{\mathrm{cyc}} \ge 0$ and $\lambda_{\mathrm{id}} \ge 0$ are regularization parameters, which weigh the importance of the cycle consistency loss and the identity mapping loss relative to the adversarial losses.
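
To make the structure of Eqs. (1)–(8) concrete, the following is a minimal PyTorch sketch that computes the CycleGAN-VC objectives for one mini-batch. The module interfaces (G, F, D_X, D_Y as callables returning feature tensors and probabilities) and the loss weights are assumptions made purely for illustration, not the authors' implementation; the generator adversarial terms use the alternative form noted above.

```python
import torch
import torch.nn.functional as F_nn


def cyclegan_vc_losses(G, F, D_X, D_Y, x, y, lambda_cyc=10.0, lambda_id=5.0):
    """CycleGAN-VC objectives of Eqs. (1)-(8) for one mini-batch.

    x, y: acoustic feature sequences from domains X and Y.
    G: X -> Y mapping, F: Y -> X mapping.
    D_X, D_Y: discriminators returning probabilities in (0, 1).
    lambda_cyc, lambda_id: placeholder weights for the cycle-consistency
    and identity-mapping losses (not necessarily the values used in the paper).
    """
    eps = 1e-8
    fake_y, fake_x = G(x), F(y)

    # Eqs. (1) and (3): adversarial losses minimized by the discriminators.
    loss_d = (-torch.log(D_Y(y) + eps).mean()
              - torch.log(1.0 - D_Y(fake_y.detach()) + eps).mean()
              - torch.log(D_X(x) + eps).mean()
              - torch.log(1.0 - D_X(fake_x.detach()) + eps).mean())

    # Eqs. (2) and (4), written in the alternative "non-saturating" form.
    loss_adv_g = (-torch.log(D_Y(fake_y) + eps).mean()
                  - torch.log(D_X(fake_x) + eps).mean())

    # Eq. (5): cycle-consistency loss; Eq. (6): identity-mapping loss.
    loss_cyc = F_nn.l1_loss(F(fake_y), x) + F_nn.l1_loss(G(fake_x), y)
    loss_id = F_nn.l1_loss(F(x), x) + F_nn.l1_loss(G(y), y)

    # Eq. (7): full generator objective; the discriminator objective is Eq. (8).
    loss_g = loss_adv_g + lambda_cyc * loss_cyc + lambda_id * loss_id
    return loss_g, loss_d
```

In an actual training loop, loss_d would be minimized with respect to the parameters of D_X and D_Y and loss_g with respect to those of G and F, in alternating steps.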

Fig. 1: Concept of CycleGAN training.
Fig. 2: Concept of StarGAN training.

3 StarGAN Voice Conversion

While CycleGAN-VC allows the generation of natural-sounding speech when a sufficient amount of training examples is available, one limitation is that it only learns one-to-one mappings. Here, we propose using StarGAN [42] to develop a method that allows for non-parallel many-to-many VC. We term the present method StarGAN-VC.

3.1 Training objectives

Let be a generator that takes an acoustic feature sequence with an arbitrary attribute and a target attribute label as the inputs and generates an acoustic feature sequence . We assume that a speech attribute consists of one or more categories, each consisting of multiple classes. We thus represent as a concatenation of one-hot vectors, each of which is filled with 1 at the index of a class in a certain category and with 0 everywhere else. For example, if we consider speaker identities as the only attribute category, will be represented as a single one-hot vector, where each element is associated with a different speaker. One of the goals of StarGAN-VC is to make as realistic as real speech features and belong to attribute . To realize this, we introduce a real/fake discriminator as with CycleGAN and a domain classifier , whose role is to predict to which classes an input belongs. is designed to produce a probability that an input is a real speech feature whereas is designed to produce class probabilities of .
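
As a concrete illustration of this attribute representation, the short sketch below builds $c$ for a hypothetical setup; the category sizes and class indices are invented for the example.

```python
import numpy as np


def attribute_label(class_indices, category_sizes):
    """Concatenate one one-hot vector per attribute category to form c."""
    parts = []
    for idx, size in zip(class_indices, category_sizes):
        one_hot = np.zeros(size, dtype=np.float32)
        one_hot[idx] = 1.0   # 1 at the index of the selected class, 0 elsewhere
        parts.append(one_hot)
    return np.concatenate(parts)


# Speaker identity as the only attribute category (four speakers):
# c is a single one-hot vector, here selecting speaker #2.
c = attribute_label([2], [4])   # -> array([0., 0., 1., 0.], dtype=float32)
```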

Adversarial Loss: First, we define

$\mathcal{L}_{\mathrm{adv}}^{D}(D) = -\mathbb{E}_{c \sim p(c),\, \mathbf{y} \sim p(\mathbf{y}|c)}[\log D(\mathbf{y}, c)] - \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, c \sim p(c)}[\log (1 - D(G(\mathbf{x}, c), c))],$   (9)
$\mathcal{L}_{\mathrm{adv}}^{G}(G) = -\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, c \sim p(c)}[\log D(G(\mathbf{x}, c), c)]$   (10)

as adversarial losses for the discriminator $D$ and the generator $G$, respectively, where $\mathbf{y} \sim p(\mathbf{y}|c)$ denotes a training example of an acoustic feature sequence of real speech with attribute $c$ and $\mathbf{x} \sim p(\mathbf{x})$ denotes one with an arbitrary attribute. $\mathcal{L}_{\mathrm{adv}}^{D}(D)$ takes a small value when $D$ correctly classifies $G(\mathbf{x}, c)$ and $\mathbf{y}$ as fake and real speech features, whereas $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ takes a small value when $G$ successfully deceives $D$ so that $G(\mathbf{x}, c)$ is misclassified as real speech features by $D$. Thus, we would like to minimize $\mathcal{L}_{\mathrm{adv}}^{D}(D)$ with respect to $D$ and minimize $\mathcal{L}_{\mathrm{adv}}^{G}(G)$ with respect to $G$.

Domain Classification Loss: Next, we define

$\mathcal{L}_{\mathrm{cls}}^{C}(C) = -\mathbb{E}_{c \sim p(c),\, \mathbf{y} \sim p(\mathbf{y}|c)}[\log p_C(c|\mathbf{y})],$   (11)
$\mathcal{L}_{\mathrm{cls}}^{G}(G) = -\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}),\, c \sim p(c)}[\log p_C(c|G(\mathbf{x}, c))]$   (12)

as domain classification losses for the classifier $C$ and the generator $G$. $\mathcal{L}_{\mathrm{cls}}^{C}(C)$ and $\mathcal{L}_{\mathrm{cls}}^{G}(G)$ take small values when $C$ correctly classifies $\mathbf{y} \sim p(\mathbf{y}|c)$ and $G(\mathbf{x}, c)$ as belonging to attribute $c$. Thus, we would like to minimize $\mathcal{L}_{\mathrm{cls}}^{C}(C)$ with respect to $C$ and $\mathcal{L}_{\mathrm{cls}}^{G}(G)$ with respect to $G$.

Cycle Consistency Loss: Training $G$, $D$ and $C$ using only the losses presented above does not guarantee that $G$ will preserve the linguistic information of input speech. To encourage $G(\mathbf{x}, c)$ to be a bijection, we introduce a cycle consistency loss to be minimized

$\mathcal{L}_{\mathrm{cyc}}(G) = \mathbb{E}_{c' \sim p(c),\, \mathbf{x} \sim p(\mathbf{x}|c'),\, c \sim p(c)}[\|G(G(\mathbf{x}, c), c') - \mathbf{x}\|_\rho],$   (13)

where $\mathbf{x} \sim p(\mathbf{x}|c')$ denotes a training example of an acoustic feature sequence of real speech with attribute $c'$ and $\rho$ is a positive constant. We also consider an identity mapping loss

$\mathcal{L}_{\mathrm{id}}(G) = \mathbb{E}_{c' \sim p(c),\, \mathbf{x} \sim p(\mathbf{x}|c')}[\|G(\mathbf{x}, c') - \mathbf{x}\|_\rho]$   (14)

to ensure that an input to $G$ will be kept unchanged when the input already belongs to the target attribute $c'$.

To summarize, the full objectives of StarGAN-VC to be minimized with respect to $G$, $D$ and $C$ are given as

$\mathcal{I}_{G}(G) = \mathcal{L}_{\mathrm{adv}}^{G}(G) + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}^{G}(G) + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}}(G) + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}(G),$   (15)
$\mathcal{I}_{D}(D) = \mathcal{L}_{\mathrm{adv}}^{D}(D),$   (16)
$\mathcal{I}_{C}(C) = \mathcal{L}_{\mathrm{cls}}^{C}(C),$   (17)

respectively, where $\lambda_{\mathrm{cls}} \ge 0$, $\lambda_{\mathrm{cyc}} \ge 0$ and $\lambda_{\mathrm{id}} \ge 0$ are regularization parameters, which weigh the importance of the domain classification loss, the cycle consistency loss and the identity mapping loss relative to the adversarial losses.
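
To make the roles of Eqs. (9)–(17) explicit, here is a minimal PyTorch sketch of the three objectives for one mini-batch, assuming a single speaker category and hypothetical modules G(x, c), D(x, c) and C(x) with the interfaces described above (D returning a probability, C returning class logits). The interfaces, weights and the choice $\rho = 1$ (an L1 norm) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F


def stargan_vc_losses(G, D, C, x, c_src, y, c_tgt,
                      lambda_cls=1.0, lambda_cyc=10.0, lambda_id=5.0):
    """StarGAN-VC objectives of Eqs. (9)-(17) for one mini-batch.

    x: features of real speech with (source) attribute label c_src.
    y: features of real speech with (target) attribute label c_tgt.
    c_src, c_tgt: one-hot attribute labels of shape (batch, n_classes).
    G(x, c) -> converted features, D(x, c) -> probability, C(x) -> class logits.
    """
    eps = 1e-8
    c_tgt_idx = c_tgt.argmax(dim=1)
    fake_y = G(x, c_tgt)

    # Eq. (9): adversarial loss for the real/fake discriminator D.
    loss_d = (-torch.log(D(y, c_tgt) + eps).mean()
              - torch.log(1.0 - D(fake_y.detach(), c_tgt) + eps).mean())

    # Eq. (10): adversarial loss for the generator G.
    loss_adv_g = -torch.log(D(fake_y, c_tgt) + eps).mean()

    # Eq. (11): domain classification loss for C (trained on real features).
    loss_c = F.cross_entropy(C(y), c_tgt_idx)

    # Eq. (12): domain classification loss for G (converted features should
    # be classified as the target attribute).
    loss_cls_g = F.cross_entropy(C(fake_y), c_tgt_idx)

    # Eq. (13): cycle-consistency loss; Eq. (14): identity-mapping loss (rho = 1).
    loss_cyc = F.l1_loss(G(fake_y, c_src), x)
    loss_id = F.l1_loss(G(x, c_src), x)

    # Eqs. (15)-(17): full objectives for G, D and C, respectively.
    loss_g = (loss_adv_g + lambda_cls * loss_cls_g
              + lambda_cyc * loss_cyc + lambda_id * loss_id)
    return loss_g, loss_d, loss_c
```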

3.2 Conversion process

As an acoustic feature vector, we use mel-cepstral coefficients computed from a spectral envelope obtained using WORLD [46]. After training $G$, we can convert the acoustic feature sequence $\mathbf{x}$ of an input utterance with

$\hat{\mathbf{y}} = G(\mathbf{x}, c),$   (18)

where $c$ denotes the target attribute label. A naïve way of obtaining a time-domain signal is to simply use $\hat{\mathbf{y}}$ to reconstruct a signal with a vocoder. Instead of directly using $\hat{\mathbf{y}}$, we can also use the reconstructed feature sequence

$\hat{\mathbf{x}} = G(\mathbf{x}, c')$   (19)

to obtain a time-domain signal if the attribute $c'$ of the input speech is known. By using $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$, we can obtain a sequence of spectral gain functions. Once we obtain the spectral gain functions, we can reconstruct a time-domain signal by multiplying the spectral gain function with the spectral envelope of the input speech frame-by-frame and resynthesizing a signal using a vocoder.
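
The sketch below illustrates this conversion procedure with the WORLD vocoder, assuming the pyworld and pysptk Python bindings and a hypothetical convert_mcc wrapper around the trained generator; the analysis settings and all names outside those two libraries are assumptions, and F0 conversion (see Section 4) is omitted here.

```python
import numpy as np
import pyworld
import pysptk

FS, FRAME_PERIOD, MCC_ORDER, ALPHA = 22050, 5.0, 36, 0.455  # illustrative settings


def convert_utterance(wav, convert_mcc):
    """Convert one utterance following Section 3.2.

    convert_mcc(mcc) is a hypothetical wrapper around the trained generator that
    returns (y_hat, x_hat) = (G(x, c), G(x, c')) computed on the MCC sequence x.
    """
    wav = wav.astype(np.float64)
    # WORLD analysis: F0, spectral envelope and aperiodicity.
    f0, t = pyworld.harvest(wav, FS, frame_period=FRAME_PERIOD)
    sp = pyworld.cheaptrick(wav, f0, t, FS)
    ap = pyworld.d4c(wav, f0, t, FS)

    # Mel-cepstral coefficients from the spectral envelope, then Eqs. (18)-(19).
    mcc = pysptk.sp2mc(sp, order=MCC_ORDER, alpha=ALPHA)
    mcc_conv, mcc_rec = convert_mcc(mcc)

    # Spectral gain = envelope(y_hat) / envelope(x_hat), multiplied frame-by-frame
    # with the spectral envelope of the input speech.
    fftlen = (sp.shape[1] - 1) * 2
    sp_conv = pysptk.mc2sp(np.asarray(mcc_conv, dtype=np.float64), alpha=ALPHA, fftlen=fftlen)
    sp_rec = pysptk.mc2sp(np.asarray(mcc_rec, dtype=np.float64), alpha=ALPHA, fftlen=fftlen)
    sp_out = sp * (sp_conv / np.maximum(sp_rec, 1e-16))

    # Resynthesis with a vocoder (F0 conversion is handled separately).
    return pyworld.synthesize(f0, sp_out, ap, FS, frame_period=FRAME_PERIOD)
```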

3.3 Network architectures

One of the key features of our approach, including [7, 26], is that we consider a generator $G$ that takes an acoustic feature sequence instead of a single-frame acoustic feature as an input and outputs an acoustic feature sequence of the same length. This allows us to obtain conversion rules that capture time dependencies. While RNN-based architectures are a natural choice for modeling time series data, we use a convolutional neural network (CNN)-based architecture to design $G$, as detailed below. $G$ consists of encoder and decoder networks, where only the decoder network takes the auxiliary input $c$. We also design $D$ and $C$ to take acoustic feature sequences as inputs and generate sequences of probabilities.

Generator: Here, we treat an acoustic feature sequence as an image of size $Q \times N$ with one channel and use 2D CNNs to construct $G$, as they are suitable for parallel computations. Specifically, we use the gated CNN [47], which was originally introduced to model word sequences for language modeling and was shown to outperform long short-term memory (LSTM) language models trained in a similar setting. We previously applied gated CNN architectures to voice conversion [7, 26] and audio source separation [48], and their effectiveness has already been confirmed. In the encoder part, the output $\mathbf{h}_l$ of the $l$-th hidden layer is described as a linear projection modulated by an output gate

$\mathbf{h}_l = (\mathbf{W}_l * \mathbf{h}_{l-1} + \mathbf{b}_l) \odot \sigma(\mathbf{V}_l * \mathbf{h}_{l-1} + \mathbf{c}_l),$   (20)

where $\mathbf{W}_l$, $\mathbf{b}_l$, $\mathbf{V}_l$ and $\mathbf{c}_l$ are the generator network parameters to be trained, $*$ denotes convolution, $\odot$ denotes elementwise multiplication and $\sigma$ denotes the elementwise sigmoid function. Similar to LSTMs, the output gate $\sigma(\mathbf{V}_l * \mathbf{h}_{l-1} + \mathbf{c}_l)$ multiplies each element of $\mathbf{W}_l * \mathbf{h}_{l-1} + \mathbf{b}_l$ and controls what information should be propagated through the hierarchy of layers. This gating mechanism is called a gated linear unit (GLU). In the decoder part, the output $\mathbf{h}_l$ of the $l$-th hidden layer is given by

$\mathbf{h}_l = (\mathbf{W}_l * \mathbf{h}'_{l-1} + \mathbf{b}_l) \odot \sigma(\mathbf{V}_l * \mathbf{h}'_{l-1} + \mathbf{c}_l),$   (21)
$\mathbf{h}'_{l-1} = [\mathbf{h}_{l-1}; \mathbf{c}'_{l-1}],$   (22)

where $[\mathbf{h}_{l-1}; \mathbf{c}'_{l-1}]$ means the concatenation of $\mathbf{h}_{l-1}$ and $\mathbf{c}'_{l-1}$ along the channel dimension, and $\mathbf{c}'_{l-1}$ is a 3D array obtained by tiling copies of $c$ along the feature and time dimensions of $\mathbf{h}_{l-1}$. The input to the first layer of $G$ is $\mathbf{h}_0 = \mathbf{x}$ and the output of the final layer is given as a regular linear projection

$\hat{\mathbf{y}} = \mathbf{W}_L * \mathbf{h}'_{L-1} + \mathbf{b}_L,$   (23)
$\mathbf{h}'_{L-1} = [\mathbf{h}_{L-1}; \mathbf{c}'_{L-1}].$   (24)

It should be noted that the entire architecture is fully convolutional with no fully-connected layers, which allows us to take an entire sequence of arbitrary length as an input and convert the entire sequence at once.
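
A minimal PyTorch sketch of one such gated convolution layer, implementing Eqs. (20)–(22): two parallel convolutions combined through a sigmoid gate, with the attribute label tiled over the feature and time axes and concatenated along the channel axis on the decoder side. Layer sizes are placeholders, and this is not the exact configuration shown in Fig. 3.

```python
import torch
import torch.nn as nn


class GatedConv2d(nn.Module):
    """One gated 2D convolution layer: h_l = (W * h' + b) ⊙ sigmoid(V * h' + c)."""

    def __init__(self, in_ch, out_ch, attr_dim=0, kernel_size=(3, 3), stride=(1, 1)):
        super().__init__()
        padding = tuple(k // 2 for k in kernel_size)
        self.attr_dim = attr_dim
        self.conv = nn.Conv2d(in_ch + attr_dim, out_ch, kernel_size, stride, padding)
        self.gate = nn.Conv2d(in_ch + attr_dim, out_ch, kernel_size, stride, padding)

    def forward(self, h, c=None):
        # h: (batch, channels, feature_dim, time); c: (batch, attr_dim) one-hot label.
        if self.attr_dim > 0:
            # Eq. (22): tile c over the feature and time axes and concatenate
            # along the channel dimension.
            c_map = c[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
            h = torch.cat([h, c_map], dim=1)
        # Eqs. (20)/(21): linear projection modulated by a sigmoid output gate (GLU).
        return self.conv(h) * torch.sigmoid(self.gate(h))
```

An encoder layer corresponds to attr_dim=0 (Eq. (20)); a decoder layer is built with attr_dim equal to the length of $c$ and receives the label at every layer (Eqs. (21)–(22)).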

Real/Fake Discriminator: We leverage the idea of PatchGANs [49] to devise the real/fake discriminator $D$, which classifies whether local segments of an input feature sequence are real or fake. More specifically, we devise $D$ using a gated CNN, which takes an acoustic feature sequence $\mathbf{y}$ and an attribute label $c$ as inputs and produces a sequence of probabilities that measures how likely each segment of $\mathbf{y}$ is to be real speech features of attribute $c$. The output of the $l$-th layer of $D$ is given in the same way as (21) and (22), and the final output $D(\mathbf{y}, c)$ is given by the product of all these probabilities. See Section 4 for more details.

Domain Classifier: We also devise the domain classifier $C$ using a gated CNN, which takes an acoustic feature sequence $\mathbf{y}$ and produces a sequence of class probability distributions that measures how likely each segment of $\mathbf{y}$ is to belong to attribute $c$. The output of the $l$-th layer of $C$ is given in the same way as (20), and the final output $p_C(c|\mathbf{y})$ is given by the product of all these distributions. See Section 4 for more details.
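
As a small illustration of the product pooling shared by $D$ and $C$, the helper below combines a sequence of per-segment probabilities into a single utterance-level score; the tensor shape is an assumption.

```python
import torch


def product_pooling(segment_probs, eps=1e-8):
    """Combine per-segment probabilities of shape (batch, n_segments) into one
    value per utterance by taking their product (computed in the log domain
    for numerical stability)."""
    return torch.exp(torch.log(segment_probs + eps).sum(dim=1))
```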

Fig. 3: Network architectures of the generator $G$, the real/fake discriminator $D$ and the domain classifier $C$. Here, the inputs and outputs of $G$, $D$ and $C$ are interpreted as images, where “h”, “w” and “c” denote the height, width and channel number, respectively. “Conv”, “Batch norm”, “GLU”, “Deconv”, “Sigmoid”, “Softmax” and “Product” denote convolution, batch normalization, gated linear unit, transposed convolution, sigmoid, softmax, and product pooling layers, respectively. “k”, “c” and “s” denote the kernel size, output channel number and stride size of a convolution layer, respectively. Note that all the networks are fully convolutional with no fully connected layers, thus allowing inputs to have arbitrary sizes.

4 Subjective evaluation

Fig. 4: Examples of acoustic feature sequences of (a) source speech, (c) converted speech obtained with the baseline method and (d) converted speech obtained with the proposed method, along with an acoustic feature sequence of (b) the target speaker uttering the same sentence.
Fig. 5: Results of the AB test for sound quality and the ABX test for speaker similarity.

To confirm the performance of the proposed method, we conducted subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task. We used the Voice Conversion Challenge (VCC) 2018 dataset [50], which consists of recordings of six female and six male US English speakers. We used a subset of speakers for training and evaluation. In particular, we selected two female speakers, ‘VCC2SF1’ and ‘VCC2SF2’, and two male speakers, ‘VCC2SM1’ and ‘VCC2SM2’. Thus, $c$ is represented as a four-dimensional one-hot vector and there were twelve different combinations of source and target speakers in total. The audio files for each speaker were manually segmented into 116 short sentences (about 7 minutes), of which 81 and 35 sentences (about 5 and 2 minutes) were provided as training and evaluation sets, respectively. All the speech signals were sampled at 22050 Hz. For each utterance, a spectral envelope, a logarithmic fundamental frequency (log $F_0$), and aperiodicities (APs) were extracted every 5 ms using the WORLD analyzer [46]. 36 mel-cepstral coefficients (MCCs) were then extracted from each spectral envelope. The log $F_0$ contours were converted using the logarithm Gaussian normalized transformation described in [51]. Aperiodicities were used directly without modification. The details of the network configuration are shown in Fig. 3. The signals of converted speech were obtained using the method described in Section 3.2.
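
The log $F_0$ conversion mentioned above is a simple statistics-matching operation in the log domain; a minimal sketch, assuming the per-speaker means and standard deviations of log $F_0$ have been computed over voiced frames of the training set:

```python
import numpy as np


def convert_logf0(f0, mean_src, std_src, mean_tgt, std_tgt):
    """Logarithm Gaussian normalized F0 transformation [51]: match the source
    speaker's log-F0 statistics to those of the target speaker (voiced frames only)."""
    f0_out = np.zeros_like(f0)
    voiced = f0 > 0
    f0_out[voiced] = np.exp(
        (np.log(f0[voiced]) - mean_src) / std_src * std_tgt + mean_tgt)
    return f0_out
```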

We chose the VAE-GAN-based approach [23] as a baseline for comparison in our experiments. Although we would have liked to replicate the implementation of this method exactly, we made our own design choices owing to missing details of the network configuration and hyperparameters. We conducted an AB test to compare the sound quality of the converted speech samples and an ABX test to compare the similarity to the target speaker of the converted speech samples, where “A” and “B” were converted speech samples obtained with the proposed and baseline methods and “X” was a real speech sample of the target speaker. In these listening tests, “A” and “B” were presented in random order to eliminate bias due to the order of the stimuli. Eight listeners participated in our listening tests. For the AB test for sound quality, each listener was presented with 20 utterance pairs {“A”, “B”}, and for the ABX test for speaker similarity, each listener was presented with 24 utterance triplets {“A”, “B”, “X”}. Each listener was then asked to select “A”, “B” or “fair” for each utterance. The results are shown in Fig. 5. As the results show, the proposed method significantly outperformed the baseline method in terms of both sound quality and speaker similarity. Fig. 4 shows an example of the MCC sequences of source, reconstructed, and converted speech. Audio samples are available at http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc/.

5 Conclusion

This paper proposed a method that allows for non-parallel many-to-many VC by using a novel GAN variant called StarGAN. Our method, which we term StarGAN-VC, is noteworthy in that it (1) requires neither parallel utterances, transcriptions, nor time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different voice attribute domains using a single generator network, (3) is able to generate signals of converted speech quickly enough to allow for real-time implementations and (4) requires only several minutes of training examples to generate reasonably realistic-sounding speech. Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a baseline method based on a VAE-GAN approach.

References

  • [1] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, 1998, pp. 285–288.
  • [2] A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
  • [3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Commun., vol. 54, no. 1, pp. 134–146, 2012.
  • [4] Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
  • [5] O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 965–973, 2010.
  • [6] T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
  • [7] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in Proc. Interspeech, 2017, pp. 1283–1287.
  • [8] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech and Audio Process., vol. 6, no. 2, pp. 131–142, 1998.
  • [9] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
  • [10] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, “Voice conversion using partial least squares regression,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 912–921, 2010.
  • [11] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954–964, 2010.
  • [12] S. H. Mohammadi and A. Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in Proc. SLT, 2014, pp. 19–23.
  • [13] Y. Saito, S. Takamichi, and H. Saruwatari, “Voice conversion using input-to-output highway networks,” IEICE Transactions on Information and Systems, vol. E100-D, no. 8, pp. 1925–1928, 2017.
  • [14] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in Proc. ICASSP, 2015, pp. 4869–4873.
  • [15] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Trans. Inf. Syst., vol. E96-A, no. 10, pp. 1946–1953, 2013.
  • [16] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1506–1521, 2014.
  • [17] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 12, pp. 1859–1872, 2014.
  • [18] T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion based on speaker-dependent restricted Boltzmann machines,” IEICE Trans. Inf. Syst., vol. 97, no. 6, pp. 1403–1410, 2014.
  • [19] T. Nakashika, T. Takiguchi, and Y. Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted boltzmann machines for voice conversion,” in Proc. Interspeech, 2014, pp. 2278–2282.
  • [20] T. Nakashika, T. Takiguchi, and Y. Ariki, “Parallel-data-free, many-to-many voice conversion using an adaptive restricted boltzmann machine,” in Proc. MLSP, 2015.
  • [21] M. Blaauw and J. Bonada, “Modeling and transforming speech using variational autoencoders,” in Proc. Interspeech, 2016, pp. 1770–1774.
  • [22] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, 2016, pp. 1–6.
  • [23] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017, pp. 3364–3368.
  • [24] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. Interspeech, 2016, pp. 287–291.
  • [25] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, 2017, pp. 5535–5539.
  • [26] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” eprint arXiv:1711.11293, Nov. 2017.
  • [27] A. van den Oord and O. Vinyals, “Neural discrete representation learning,” in Adv. Neural Information Processing Systems (NIPS), 2017, pp. 6309–6318.
  • [28] T. Hashimoto, H. Uchida, D. Saito, and N. Minematsu, “Parallel-data-free many-to-many voice conversion based on dnn integrated with eigenspace using a non-parallel speech corpus,” in Proc. Interspeech, 2017.
  • [29] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, 2018, pp. 5274–5278.
  • [30] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 4, pp. 788–798, 2011.
  • [31] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.
  • [32] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Adv. Neural Information Processing Systems (NIPS), 2014, pp. 3581–3589.
  • [33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
  • [34] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino, “Generative adversarial network-based postfilter for statistical parametric speech synthesis,” in Proc. ICASSP, 2017, pp. 4910–4914.
  • [35] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 1, pp. 84–96, Jan. 2018.
  • [36] S. Pascual, A. Bonafonte, and J. Serrá, “SEGAN: Speech enhancement generative adversarial network,” eprint arXiv:1703.09452, Mar. 2017.
  • [37] T. Kaneko, S. Takaki, H. Kameoka, and J. Yamagishi, “Generative adversarial network-based postfilter for STFT spectrograms,” in Proc. Interspeech, 2017, pp. 3389–3393.
  • [38] K. Oyamada, H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and H. Ando, “Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms,” eprint arXiv:1804.02181, Apr. 2018.
  • [39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.
  • [40] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. ICML, 2017, pp. 1857–1865.
  • [41] Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. ICCV, 2017, pp. 2849–2857.
  • [42] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” eprint arXiv:1711.09020, Nov. 2017.
  • [43] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” eprint arXiv:1512.09300, Dec. 2015.
  • [44] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” eprint arXiv:1609.03499, Sept. 2016.
  • [45] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. G., S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” eprint arXiv:1711.10433, Nov. 2017.
  • [46] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016.
  • [47] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017, pp. 933–941.
  • [48] L. Li and H. Kameoka, “Deep clustering with gated convolutional networks,” in Proc. ICASSP, 2018, pp. 16–20.
  • [49] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [50] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” eprint arXiv:1804.04262, Apr. 2018.
  • [51] K. Liu, J. Zhang, and Y. Yan, “High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for mandarin,” in Proc. FSKD, 2007, pp. 410–414.