# Crossmodal Voice Conversion

## Abstract

Humans are able to imagine a person’s voice from the person’s appearance and imagine the person’s appearance from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face image and generate a face image that matches the voice of the input speech by leveraging the correlation between faces and voices. We propose a model, consisting of a speech converter, a face encoder/decoder and a voice encoder. We use the latent code of an input face image encoded by the face encoder as the auxiliary input into the speech converter and train the speech converter so that the original latent code can be recovered from the generated speech by the voice encoder. We also train the face decoder along with the face encoder to ensure that the latent code will contain sufficient information to reconstruct the input face image. We confirmed experimentally that a speech converter trained in this way was able to convert input speech into a voice that matched an input face image and that the voice encoder and face decoder can be used to generate a face image that matches the voice of the input speech.

Hirokazu Kameoka, Kou Tanaka, Aarón Valero Puche, Yasunori Ohishi, Takuhiro Kaneko

NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation

hirokazu.kameoka.uh@hco.ntt.co.jp

Index Terms: crossmodal audio/visual generation, voice conversion, face image generation, deep generative models

## 1 Introduction

Humans are able to imagine a person’s voice solely from that person’s appearance and imagine the person’s appearance solely from his/her voice. Although such predictions are not always accurate, the fact that we can sense a mismatch between voice and appearance suggests that a certain correlation exists between voices and appearance. In fact, recent studies by Smith et al. [1] have revealed that the information provided by faces and voices is so similar that people can match novel faces and voices of the same sex, ethnicity, and age-group at a level significantly above chance. Here, an interesting question is whether it is technically possible to predict the voice of a person only from an image of his/her face and predict a person’s face only from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face image and that can generate a face image that matches the voice of the input speech by learning and leveraging the underlying correlation between faces and voices.

Several attempts have recently been made to tackle the tasks of crossmodal audio/image processing, including voice/face recognition [2] and audio/image generation [3, 4, 5]. The former task involves detecting which of two given face images is that of the speaker, given only an audio clip of someone speaking. Hence, this task differs from ours in that it does not involve audio/image generation. The latter task involves generating sounds from images/videos. The methods presented in [3, 4, 5] are designed to predict very short sound clips (e.g., 0.5 to 2 seconds long) such as the sounds made by musical instruments, dogs, and babies crying, and are unsuited to generating longer audio clips with richer variations in time such as speech utterances. By contrast, our task is crossmodal voice conversion (VC), namely converting given speech utterances where the target voice characteristics are determined by visual inputs.

VC is a technique for converting the voice characteristics of an input utterance such as the perceived identity of a speaker while preserving linguistic information. Potential applications of VC techniques include speaker-identity modification, speaking aids, speech enhancement, and pronunciation conversion. Typically, many conventional VC methods utilize accurately aligned parallel utterances of source and target speech to train acoustic models for feature mapping [6, 7, 8]. Recently, some attempts have also been made to develop non-parallel VC methods [9, 10, 11, 12, 13, 14, 15, 16], which require no parallel utterances, transcriptions, or time alignment procedures. One approach to non-parallel VC involves a framework based on conditional variational autoencoders (CVAEs) [11, 12, 13, 14]. As the name implies, variational autoencoders (VAEs) [17] are a probabilistic counterpart of autoencoders, consisting of encoder and decoder networks. CVAEs [18] are an extended version of VAEs where the encoder and decoder networks can additionally take an auxiliary input. By using acoustic features as the training examples and the associated attribute (e.g., speaker identity) labels as the auxiliary input, the networks are able to learn how to convert an attribute of source speech to a target attribute according to the attribute label fed into the decoder. As a different approach, in [15] we proposed a method using a variant of a generative adversarial network (GAN) [19] called a cycle-consistent GAN (CycleGAN) [20, 21, 22]. Although this method was shown to work reasonably well, one major limitation is that it is designed to learn only mappings between a pair of domains. To overcome this limitation, we subsequently proposed in [16] a method incorporating an extension of CycleGAN called StarGAN [23]. 
This method is capable of simultaneously learning mappings between multiple domains using a single generator network where the attributes of the generator outputs are controlled by an auxiliary input. StarGAN uses an auxiliary classifier to train the generator so that the attributes of the generator outputs are correctly predicted by the classifier. We further proposed a method based on a concept that combined StarGAN and CVAE, called an auxiliary classifier VAE (ACVAE) [14]. An ACVAE employs a generator with a CVAE structure and uses an auxiliary classifier to train the generator in the same way as StarGAN. Training the generator in this way can be interpreted as increasing the lower bound of the mutual information between the auxiliary input and the generator output.

In this paper, we propose extending the idea behind the ACVAE to build a model for crossmodal VC. Specifically, we use the latent code of an auxiliary face image input encoded by a face encoder as the auxiliary input into the speech generator and use a voice encoder to train the generator so that the original latent code can be recovered from the generated speech using the voice encoder. We also train a face decoder along with the face encoder to ensure that the latent code will contain sufficient information to reconstruct the input face image. In this way, the speech generator is expected to learn how to convert input speech into a voice characteristic that matches an auxiliary face image input and the voice encoder and the face decoder can be used to generate a face image that matches the voice characteristic of input speech.

## 2 Method

### 2.1 Variational Autoencoder (VAE)

Our model employs VAEs [17, 18] as building blocks. Here, we briefly introduce the principle behind VAEs.

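VAEs are stochastic neural network models consisting of encoder and decoder networks. The encoder aims to encode given data $\mathbf{x}$ into a (typically) lower dimensional latent representation $\mathbf{z}$ whereas the decoder aims to recover the data $\mathbf{x}$ from the latent representation $\mathbf{z}$. The decoder is modeled as a neural network (decoder network) that produces a set of parameters for a conditional distribution $p_\theta(\mathbf{x}|\mathbf{z})$ where $\theta$ denotes the network parameters. To obtain an encoder using $p_\theta(\mathbf{x}|\mathbf{z})$, we must compute the posterior $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p_\theta(\mathbf{x})$. However, computing the exact posterior is usually difficult since $p_\theta(\mathbf{x})$ involves an intractable integral over $\mathbf{z}$. The idea of VAEs is to sidestep the direct computation of this posterior by introducing another neural network (encoder network) for approximating the exact posterior $p_\theta(\mathbf{z}|\mathbf{x})$. As with the decoder network, the encoder network generates a set of parameters for the conditional distribution $q_\phi(\mathbf{z}|\mathbf{x})$ where $\phi$ denotes the network parameters. The goal of VAEs is to learn the parameters of the encoder and decoder networks so that the encoder distribution $q_\phi(\mathbf{z}|\mathbf{x})$ becomes consistent with the posterior $p_\theta(\mathbf{z}|\mathbf{x}) \propto p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})$. We can show that the Kullback-Leibler (KL) divergence between $q_\phi(\mathbf{z}|\mathbf{x})$ and $p_\theta(\mathbf{z}|\mathbf{x})$ is given as

$$
\mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})] = \log p_\theta(\mathbf{x}) - \big\{ \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})] \big\}. \tag{1}
$$

Here, it should be noted that since $\mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})] \ge 0$, $\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})]$ is shown to be a lower bound for $\log p_\theta(\mathbf{x})$. Given training examples,

$$
\mathcal{J}(\phi,\theta) = \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{D}}(\mathbf{x})}\big[ \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})] \big] \tag{2}
$$

can be used as the training criterion to be maximized with respect to $\theta$ and $\phi$, where $\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{D}}(\mathbf{x})}[\cdot]$ denotes the sample mean over the training examples. Obviously, $\mathcal{J}(\phi,\theta)$ is maximized when the exact posterior is obtained, i.e., when $q_\phi(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{z}|\mathbf{x})$.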
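For reference, the decomposition in (1) follows from Bayes' rule in a few lines. The sketch below writes $\mathbf{x}$ for the data, $\mathbf{z}$ for the latent variable, and $\theta$, $\phi$ for the decoder and encoder parameters. It is the standard VAE derivation and is not specific to the present model.

```latex
\begin{align*}
\mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})]
&= \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}
   \big[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{z}|\mathbf{x})\big] \\
&= \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}
   \big[\log q_\phi(\mathbf{z}|\mathbf{x}) - \log p_\theta(\mathbf{x}|\mathbf{z})
        - \log p(\mathbf{z})\big] + \log p_\theta(\mathbf{x}) \\
&= \log p_\theta(\mathbf{x})
   - \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]
   + \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})].
\end{align*}
```

The second line substitutes $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p_\theta(\mathbf{x})$ and uses the fact that $\log p_\theta(\mathbf{x})$ does not depend on $\mathbf{z}$.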

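One typical way of modeling $q_\phi(\mathbf{z}|\mathbf{x})$, $p_\theta(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z})$ is to assume Gaussian distributions

$$
q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z} \,|\, \boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))), \tag{3}
$$

$$
p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x} \,|\, \boldsymbol{\mu}_\theta(\mathbf{z}), \mathrm{diag}(\boldsymbol{\sigma}^2_\theta(\mathbf{z}))), \tag{4}
$$

$$
p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \,|\, \mathbf{0}, \mathbf{I}), \tag{5}
$$

where $\boldsymbol{\mu}_\phi(\mathbf{x})$ and $\boldsymbol{\sigma}^2_\phi(\mathbf{x})$ are the outputs of an encoder network with parameter $\phi$, and $\boldsymbol{\mu}_\theta(\mathbf{z})$ and $\boldsymbol{\sigma}^2_\theta(\mathbf{z})$ are the outputs of a decoder network with parameter $\theta$. The first term of (2) can be interpreted as an autoencoder reconstruction error. Here, it should be noted that to compute this term, we must compute the expectation with respect to $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. Since this expectation cannot be expressed in an analytical form, one way of computing it involves using a Monte Carlo approximation. However, simply sampling $\mathbf{z}$ from $q_\phi(\mathbf{z}|\mathbf{x})$ does not work, since once $\mathbf{z}$ is sampled, it is no longer a function of $\phi$ and so it becomes impossible to evaluate the gradient of $\mathcal{J}(\phi,\theta)$ with respect to $\phi$. Fortunately, by using a reparameterization $\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon} \,|\, \mathbf{0}, \mathbf{I})$, sampling $\mathbf{z}$ from $q_\phi(\mathbf{z}|\mathbf{x})$ can be replaced by sampling $\boldsymbol{\epsilon}$ from a distribution that is independent of $\phi$. This allows us to compute the gradient of the first term of $\mathcal{J}(\phi,\theta)$ with respect to $\phi$ by using a Monte Carlo approximation of the expectation $\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x})}[\cdot]$. The second term is given as the negative KL divergence between $q_\phi(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \,|\, \mathbf{0}, \mathbf{I})$. This term can be interpreted as a regularization term that forces each element of the encoder output to be uncorrelated and normally distributed. It should be noted that when $q_\phi(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z})$ are Gaussians, this term can be expressed as a function of $\phi$.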
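The two ingredients just described, the reparameterized sample and the closed-form KL term for a diagonal Gaussian against a standard normal prior, can be sketched in NumPy as follows. This is a minimal illustration: the function names, the use of a log-variance parameterization, and the toy values are assumptions made here, and no gradients are computed.

```python
import numpy as np

def reparameterize(mu, log_var, rng, n_samples=1):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps only, so z stays a deterministic
    function of the encoder outputs (mu, log_var).
    """
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])                 # illustrative encoder mean
log_var = np.log(np.array([0.25, 4.0]))    # illustrative encoder log-variance
samples = reparameterize(mu, log_var, rng, n_samples=20000)
kl = kl_to_standard_normal(mu, log_var)
```

In an automatic-differentiation framework the same expression for `z` lets gradients flow back to `mu` and `log_var`, which is the point of the trick.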

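Conditional VAEs (CVAEs) [18] are an extended version of VAEs with the only difference being that the encoder and decoder networks can take an auxiliary input $\mathbf{c}$. With CVAEs, (3) and (4) are replaced with

$$
q_\phi(\mathbf{z}|\mathbf{x},\mathbf{c}) = \mathcal{N}(\mathbf{z} \,|\, \boldsymbol{\mu}_\phi(\mathbf{x},\mathbf{c}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x},\mathbf{c}))), \tag{6}
$$

$$
p_\theta(\mathbf{x}|\mathbf{z},\mathbf{c}) = \mathcal{N}(\mathbf{x} \,|\, \boldsymbol{\mu}_\theta(\mathbf{z},\mathbf{c}), \mathrm{diag}(\boldsymbol{\sigma}^2_\theta(\mathbf{z},\mathbf{c}))), \tag{7}
$$

and the training criterion to be maximized becomes

$$
\mathcal{J}(\phi,\theta) = \mathbb{E}_{(\mathbf{x},\mathbf{c})\sim p_{\mathrm{D}}(\mathbf{x},\mathbf{c})}\big[ \mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x},\mathbf{c})}[\log p_\theta(\mathbf{x}|\mathbf{z},\mathbf{c})] - \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x},\mathbf{c}) \,\|\, p(\mathbf{z})] \big], \tag{8}
$$

where $\mathbb{E}_{(\mathbf{x},\mathbf{c})\sim p_{\mathrm{D}}(\mathbf{x},\mathbf{c})}[\cdot]$ denotes the sample mean over the training examples.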

### 2.2 Proposed model

We use $\mathbf{x}$ and $\mathbf{y}$ to denote the acoustic feature vector sequence of a speech utterance and the face image of the corresponding speaker. Now, we combine two VAEs to model the joint distribution of $\mathbf{x}$ and $\mathbf{y}$. The encoder for speech (hereafter, the utterance encoder) aims to encode $\mathbf{x}$ into a time-dependent latent variable sequence $\mathbf{z}$ whereas the decoder (hereafter, the utterance decoder) aims to reconstruct $\mathbf{x}$ from $\mathbf{z}$ using an auxiliary input $\mathbf{c}$. Ideally, we would like $\mathbf{z}$ to capture only the linguistic information contained in $\mathbf{x}$ and $\mathbf{c}$ to contain information about the target voice characteristics. Hence, we expect that the encoder and decoder work as acoustic models for speech recognition and speech synthesis so that they can be used to convert the voice of an input utterance according to the auxiliary input $\mathbf{c}$. We use the time-independent latent code $\mathbf{c}$ of an image $\mathbf{y}$ encoded by the encoder for face images (hereafter, the face encoder) as the auxiliary input into the utterance decoder. The decoder for face images (hereafter, the face decoder) is designed to reconstruct $\mathbf{y}$ from $\mathbf{c}$. Fig. 1 shows the assumed graphical model for the joint distribution $p(\mathbf{x},\mathbf{y})$.

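Our model can be formally described as follows. The utterance/face decoders and the utterance/face encoders are represented as the conditional distributions $p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c})$, $p_{\theta_{\mathrm{vis}}}(\mathbf{y}|\mathbf{c})$, $q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x})$ and $q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})$, expressed using NNs with parameters $\theta_{\mathrm{aud}}$, $\theta_{\mathrm{vis}}$, $\phi_{\mathrm{aud}}$ and $\phi_{\mathrm{vis}}$, respectively. Our aim is to approximate the exact posterior $p(\mathbf{z},\mathbf{c}|\mathbf{x},\mathbf{y})$ by $q(\mathbf{z},\mathbf{c}|\mathbf{x},\mathbf{y}) = q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x})\, q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})$. The KL divergence between these distributions is given as

$$
\begin{aligned}
\mathrm{KL}[q(\mathbf{z},\mathbf{c}|\mathbf{x},\mathbf{y}) \,\|\, p(\mathbf{z},\mathbf{c}|\mathbf{x},\mathbf{y})] ={}& \log p(\mathbf{x},\mathbf{y}) - \mathbb{E}_{\mathbf{z}\sim q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x}),\, \mathbf{c}\sim q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})}[\log p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c})] \\
& - \mathbb{E}_{\mathbf{c}\sim q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})}[\log p_{\theta_{\mathrm{vis}}}(\mathbf{y}|\mathbf{c})] + \mathrm{KL}[q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})] + \mathrm{KL}[q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y}) \,\|\, p(\mathbf{c})].
\end{aligned} \tag{9}
$$

Hence, given the training examples of speech and face pairs $(\mathbf{x},\mathbf{y})$, we can use

$$
\begin{aligned}
\mathcal{J}(\theta_{\mathrm{aud}},\theta_{\mathrm{vis}},\phi_{\mathrm{aud}},\phi_{\mathrm{vis}}) ={}& \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim p_{\mathrm{D}}(\mathbf{x},\mathbf{y})}\big[ \mathbb{E}_{\mathbf{z}\sim q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x}),\, \mathbf{c}\sim q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})}[\log p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c})] \big] \\
& + \mathbb{E}_{\mathbf{y}\sim p_{\mathrm{D}}(\mathbf{y})}\big[ \mathbb{E}_{\mathbf{c}\sim q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})}[\log p_{\theta_{\mathrm{vis}}}(\mathbf{y}|\mathbf{c})] - \mathrm{KL}[q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y}) \,\|\, p(\mathbf{c})] \big] \\
& - \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{D}}(\mathbf{x})}\big[ \mathrm{KL}[q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})] \big]
\end{aligned} \tag{10}
$$

as the training criterion to be maximized with respect to $\theta_{\mathrm{aud}}$, $\theta_{\mathrm{vis}}$, $\phi_{\mathrm{aud}}$, and $\phi_{\mathrm{vis}}$, where $\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim p_{\mathrm{D}}(\mathbf{x},\mathbf{y})}[\cdot]$, $\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{D}}(\mathbf{x})}[\cdot]$ and $\mathbb{E}_{\mathbf{y}\sim p_{\mathrm{D}}(\mathbf{y})}[\cdot]$ denote the sample means over the training examples. We assume the encoder/decoder distributions for $\mathbf{x}$ and $\mathbf{y}$ to be Gaussian distributions:

$$
q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z} \,|\, \boldsymbol{\mu}_{\phi_{\mathrm{aud}}}(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_{\phi_{\mathrm{aud}}}(\mathbf{x}))), \tag{11}
$$

$$
p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c}) = \mathcal{N}(\mathbf{x} \,|\, \boldsymbol{\mu}_{\theta_{\mathrm{aud}}}(\mathbf{z},\mathbf{c}), \mathrm{diag}(\boldsymbol{\sigma}^2_{\theta_{\mathrm{aud}}}(\mathbf{z},\mathbf{c}))), \tag{12}
$$

$$
q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y}) = \mathcal{N}(\mathbf{c} \,|\, \boldsymbol{\mu}_{\phi_{\mathrm{vis}}}(\mathbf{y}), \mathrm{diag}(\boldsymbol{\sigma}^2_{\phi_{\mathrm{vis}}}(\mathbf{y}))), \tag{13}
$$

$$
p_{\theta_{\mathrm{vis}}}(\mathbf{y}|\mathbf{c}) = \mathcal{N}(\mathbf{y} \,|\, \boldsymbol{\mu}_{\theta_{\mathrm{vis}}}(\mathbf{c}), \mathrm{diag}(\boldsymbol{\sigma}^2_{\theta_{\mathrm{vis}}}(\mathbf{c}))), \tag{14}
$$

where $\boldsymbol{\mu}_{\phi_{\mathrm{aud}}}(\mathbf{x})$ and $\boldsymbol{\sigma}^2_{\phi_{\mathrm{aud}}}(\mathbf{x})$ are the outputs of the utterance encoder network, $\boldsymbol{\mu}_{\theta_{\mathrm{aud}}}(\mathbf{z},\mathbf{c})$ and $\boldsymbol{\sigma}^2_{\theta_{\mathrm{aud}}}(\mathbf{z},\mathbf{c})$ are the outputs of the utterance decoder network, $\boldsymbol{\mu}_{\phi_{\mathrm{vis}}}(\mathbf{y})$ and $\boldsymbol{\sigma}^2_{\phi_{\mathrm{vis}}}(\mathbf{y})$ are the outputs of the face encoder network, and $\boldsymbol{\mu}_{\theta_{\mathrm{vis}}}(\mathbf{c})$ and $\boldsymbol{\sigma}^2_{\theta_{\mathrm{vis}}}(\mathbf{c})$ are the outputs of the face decoder network. We further assume $p(\mathbf{z})$ and $p(\mathbf{c})$ to be standard Gaussian distributions, namely $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \,|\, \mathbf{0}, \mathbf{I})$ and $p(\mathbf{c}) = \mathcal{N}(\mathbf{c} \,|\, \mathbf{0}, \mathbf{I})$. It should be noted that we can use the same reparameterization trick as in 2.1 to compute the gradients of $\mathcal{J}$ with respect to $\phi_{\mathrm{aud}}$ and $\phi_{\mathrm{vis}}$.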

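Since there are no explicit restrictions on the manner in which the utterance decoder may use the auxiliary input $\mathbf{c}$, we introduce an information-theoretic regularization term to encourage the utterance decoder output to be correlated with $\mathbf{c}$ as far as possible. The mutual information for $\mathbf{c}$ and $\mathbf{x} \sim p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c})$ conditioned on $\mathbf{z}$ can be written as

$$
I(\mathbf{c}, \mathbf{x} \,|\, \mathbf{z}) = \mathbb{E}_{\mathbf{c}\sim p(\mathbf{c}),\, \mathbf{x}\sim p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c}),\, \mathbf{c}'\sim p(\mathbf{c}|\mathbf{x})}[\log p(\mathbf{c}'|\mathbf{x})] + H(\mathbf{c}), \tag{15}
$$

where $H(\mathbf{c})$ represents the entropy of $\mathbf{c}$, which can be considered a constant term. In practice, $I(\mathbf{c}, \mathbf{x} \,|\, \mathbf{z})$ is hard to optimize directly since it requires access to the posterior $p(\mathbf{c}|\mathbf{x})$. Fortunately, we can obtain the lower bound of the first term of $I(\mathbf{c}, \mathbf{x} \,|\, \mathbf{z})$ by introducing an auxiliary distribution $r(\mathbf{c}|\mathbf{x})$:

$$
\begin{aligned}
\mathbb{E}_{\mathbf{c}\sim p(\mathbf{c}),\, \mathbf{x}\sim p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c}),\, \mathbf{c}'\sim p(\mathbf{c}|\mathbf{x})}[\log p(\mathbf{c}'|\mathbf{x})]
&= \mathbb{E}_{\mathbf{c}\sim p(\mathbf{c}),\, \mathbf{x}\sim p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c}),\, \mathbf{c}'\sim p(\mathbf{c}|\mathbf{x})}\Big[\log \frac{r(\mathbf{c}'|\mathbf{x})\, p(\mathbf{c}'|\mathbf{x})}{r(\mathbf{c}'|\mathbf{x})}\Big] \\
&\ge \mathbb{E}_{\mathbf{c}\sim p(\mathbf{c}),\, \mathbf{x}\sim p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c}),\, \mathbf{c}'\sim p(\mathbf{c}|\mathbf{x})}[\log r(\mathbf{c}'|\mathbf{x})] \\
&= \mathbb{E}_{\mathbf{c}\sim p(\mathbf{c}),\, \mathbf{x}\sim p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c})}[\log r(\mathbf{c}|\mathbf{x})].
\end{aligned} \tag{16}
$$

This technique of lower bounding mutual information is called variational information maximization [24]. The equality holds in (16) when $r(\mathbf{c}|\mathbf{x}) = p(\mathbf{c}|\mathbf{x})$. Hence, maximizing the lower bound (16) with respect to $r(\mathbf{c}|\mathbf{x})$ corresponds to approximating $p(\mathbf{c}|\mathbf{x})$ by $r(\mathbf{c}|\mathbf{x})$ as well as approximating $I(\mathbf{c}, \mathbf{x} \,|\, \mathbf{z})$ by this lower bound. We can therefore indirectly increase $I(\mathbf{c}, \mathbf{x} \,|\, \mathbf{z})$ by increasing the lower bound alternately with respect to $p_{\theta_{\mathrm{aud}}}(\mathbf{x}|\mathbf{z},\mathbf{c})$ and $r(\mathbf{c}|\mathbf{x})$. One way to do this involves expressing $r(\mathbf{c}|\mathbf{x})$ using an NN and training it along with all other networks. Let us use the notation $r_\psi(\mathbf{c}|\mathbf{x})$ to indicate $r(\mathbf{c}|\mathbf{x})$ expressed using an NN with parameter $\psi$. The role of $r_\psi(\mathbf{c}|\mathbf{x})$ (hereafter, the voice encoder) is to recover time-independent information about the voice characteristics of $\mathbf{x}$. For example, we can assume $r_\psi(\mathbf{c}|\mathbf{x})$ to be a Gaussian distribution

$$
r_\psi(\mathbf{c}|\mathbf{x}) = \mathcal{N}(\mathbf{c} \,|\, \boldsymbol{\mu}_\psi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\psi(\mathbf{x}))), \tag{17}
$$

where $\boldsymbol{\mu}_\psi(\mathbf{x})$ and $\boldsymbol{\sigma}^2_\psi(\mathbf{x})$ are the outputs of the voice encoder network. Under this assumption, (16) becomes a negative weighted squared error between $\mathbf{c}$ and $\boldsymbol{\mu}_\psi(\mathbf{x})$. Thus, maximizing (16) corresponds to forcing the outputs of the face and voice encoders to be as consistent as possible. Hence, the regularization term that we would like to maximize with respect to $\phi_{\mathrm{vis}}$, $\theta_{\mathrm{aud}}$, and $\psi$ becomes

$$
\mathcal{R} = \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{D}}(\mathbf{x}),\, \mathbf{y}\sim p_{\mathrm{D}}(\mathbf{y})}\big[ \mathbb{E}_{\mathbf{z}\sim q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x}),\, \mathbf{c}\sim q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y}),\, \hat{\mathbf{x}}\sim p_{\theta_{\mathrm{aud}}}(\hat{\mathbf{x}}|\mathbf{z},\mathbf{c})}[\log r_\psi(\mathbf{c}|\hat{\mathbf{x}})] \big], \tag{18}
$$

where $\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{D}}(\mathbf{x})}[\cdot]$ and $\mathbb{E}_{\mathbf{y}\sim p_{\mathrm{D}}(\mathbf{y})}[\cdot]$ denote the sample means over the training examples. Here, it should be noted that to compute $\mathcal{R}$, we must sample $\mathbf{z}$ from $q_{\phi_{\mathrm{aud}}}(\mathbf{z}|\mathbf{x})$, $\mathbf{c}$ from $q_{\phi_{\mathrm{vis}}}(\mathbf{c}|\mathbf{y})$ and $\hat{\mathbf{x}}$ from $p_{\theta_{\mathrm{aud}}}(\hat{\mathbf{x}}|\mathbf{z},\mathbf{c})$. Fortunately, we can use the same reparameterization trick as in 2.1 to compute the gradients of $\mathcal{R}$ with respect to $\phi_{\mathrm{vis}}$, $\theta_{\mathrm{aud}}$, and $\psi$.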
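The statement that a Gaussian voice encoder turns the lower bound into a negative weighted squared error can be checked numerically. The sketch below assumes a diagonal Gaussian with mean `mu` and variance `var` standing in for the voice encoder outputs and a vector `c` standing in for the face-encoder code. All names and values are illustrative. Note that the remaining additive term depends on the predicted variance but not on `c`.

```python
import numpy as np

def gaussian_log_density(c, mu, var):
    """log N(c | mu, diag(var)), summed over dimensions."""
    return -0.5 * np.sum((c - mu) ** 2 / var + np.log(var) + np.log(2.0 * np.pi), axis=-1)

def weighted_squared_error(c, mu, var):
    """Squared error between c and mu, weighted by the inverse variance."""
    return np.sum((c - mu) ** 2 / var, axis=-1)

c = np.array([0.2, -0.4, 1.0])     # illustrative latent code from the face encoder
mu = np.array([0.0, -0.5, 0.7])    # illustrative voice-encoder mean
var = np.array([0.5, 1.0, 2.0])    # illustrative voice-encoder variance

# Term that does not depend on c (it does depend on the predicted variance).
offset = -0.5 * np.sum(np.log(var) + np.log(2.0 * np.pi))

log_r = gaussian_log_density(c, mu, var)
reconstructed = -0.5 * weighted_squared_error(c, mu, var) + offset
```

Because the squared-error term is the only part involving `c`, the log-density is largest when the voice-encoder mean coincides with the face-encoder code.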

Overall, the training criterion to be maximized becomes

$$
\mathcal{J} + \mathcal{R}. \tag{19}
$$

Fig. 2 shows the overview of the proposed model.

### 2.3 Generation processes

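Given the acoustic feature sequence $\mathbf{x}$ of input speech and a target face image $\mathbf{y}$, $\mathbf{x}$ can be converted via

$$
\hat{\mathbf{x}} = \boldsymbol{\mu}_{\theta_{\mathrm{aud}}}\big(\boldsymbol{\mu}_{\phi_{\mathrm{aud}}}(\mathbf{x}),\, \boldsymbol{\mu}_{\phi_{\mathrm{vis}}}(\mathbf{y})\big). \tag{20}
$$

A time-domain signal can then be generated using an appropriate vocoder. We can also generate a face image corresponding to the input speech $\mathbf{x}$ via

$$
\hat{\mathbf{y}} = \boldsymbol{\mu}_{\theta_{\mathrm{vis}}}\big(\boldsymbol{\mu}_{\psi}(\mathbf{x})\big). \tag{21}
$$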
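The data flow of the two generation processes can be sketched with stand-in networks. In the toy code below, every network is replaced by a random linear map that returns only the mean of its output distribution. The weight matrices, latent sizes, and function names are hypothetical, and averaging the voice encoder output over time before face decoding is an assumption made here for illustration rather than a detail given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, N = 36, 100        # feature dimension and number of frames (toy values)
D_Z, D_C = 8, 4       # latent sizes (hypothetical)
H, W = 32, 32         # face image size

# Hypothetical stand-ins for trained networks: random linear maps.
W_utt_enc = rng.standard_normal((D_Z, Q)) * 0.1
W_utt_dec = rng.standard_normal((Q, D_Z + D_C)) * 0.1
W_face_enc = rng.standard_normal((D_C, H * W)) * 0.1
W_face_dec = rng.standard_normal((H * W, D_C)) * 0.1
W_voice_enc = rng.standard_normal((D_C, Q)) * 0.1

def utterance_encoder_mean(x):        # (Q, N) -> (D_Z, N)
    return W_utt_enc @ x

def face_encoder_mean(y):             # (H, W) -> (D_C,)
    return W_face_enc @ y.reshape(-1)

def utterance_decoder_mean(z, c):     # (D_Z, N), (D_C,) -> (Q, N)
    c_seq = np.repeat(c[:, None], z.shape[1], axis=1)  # same code at every frame
    return W_utt_dec @ np.concatenate([z, c_seq], axis=0)

def voice_encoder_mean(x):            # (Q, N) -> (D_C, N)
    return W_voice_enc @ x

def face_decoder_mean(c):             # (D_C,) -> (H, W)
    return (W_face_dec @ c).reshape(H, W)

def convert_voice(x, y):
    """Speech conversion: encode the utterance, condition the decoder on the face code."""
    return utterance_decoder_mean(utterance_encoder_mean(x), face_encoder_mean(y))

def generate_face(x):
    """Face generation: voice encoder followed by face decoder."""
    c = voice_encoder_mean(x).mean(axis=1)  # assumption: average frame-wise codes over time
    return face_decoder_mean(c)

x = rng.standard_normal((Q, N))   # toy acoustic feature sequence
y = rng.random((H, W))            # toy face image
x_converted = convert_voice(x, y)
y_generated = generate_face(x)
```

The converted features keep the shape of the input utterance, so they can be passed to a vocoder in place of the original features.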

### 2.4 Network architectures

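Utterance encoder/decoder: As detailed in Fig. 3, the utterance encoder/decoder networks are designed using fully convolutional architectures with gated linear units (GLUs) [25]. The output of the GLU block used in the present model is defined as $\mathrm{GLU}(\mathbf{X}) = \mathsf{B}_1(\mathsf{L}_1(\mathbf{X})) \otimes \sigma(\mathsf{B}_2(\mathsf{L}_2(\mathbf{X})))$ where $\mathbf{X}$ is the layer input, $\mathsf{L}_1$ and $\mathsf{L}_2$ denote convolution layers, $\mathsf{B}_1$ and $\mathsf{B}_2$ denote batch normalization layers, and $\sigma$ denotes a sigmoid gate function. We used 2D convolutions to design the convolution layers in the encoder and decoder, where $\mathbf{x}$ is treated as an image of size $Q \times N$ with 1 channel, $Q$ being the feature dimension and $N$ the number of frames.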
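A GLU block of this form can be sketched in NumPy as follows. To keep the example short, the convolution layers are replaced by pointwise (1×1) convolutions and batch normalization has no learned scale or shift. Both simplifications, along with the function names and tensor sizes, are assumptions for illustration and do not reflect the kernel sizes used in Fig. 3.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def batch_norm(h, eps=1e-5):
    """Normalize each channel over batch, height and width (no learned scale/shift)."""
    mean = h.mean(axis=(0, 2, 3), keepdims=True)
    var = h.var(axis=(0, 2, 3), keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def conv1x1(x, weight):
    """Pointwise convolution: x is (batch, in_ch, H, W), weight is (out_ch, in_ch)."""
    return np.einsum("oi,bihw->bohw", weight, x)

def glu_block(x, w_linear, w_gate):
    """Linear branch multiplied elementwise by a sigmoid gate branch."""
    linear = batch_norm(conv1x1(x, w_linear))
    gate = sigmoid(batch_norm(conv1x1(x, w_gate)))
    return linear * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1, 36, 50))   # batch of 2, 1 channel, 36 features x 50 frames
w_linear = rng.standard_normal((8, 1))
w_gate = rng.standard_normal((8, 1))
out = glu_block(x, w_linear, w_gate)
```

Since the gate lies in (0, 1), each output element is a damped copy of the corresponding element of the linear branch.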

Face encoder/decoder: The face encoder/decoder networks are designed using architectures inspired by those introduced in [26] for conditional image generation.

Voice encoder: As with the utterance encoder/decoder, the voice encoder is designed using a fully convolutional architecture with GLUs. As shown in Fig. 3, the voice encoder is designed to produce a time sequence of the means (and variances) of latent vectors. Here, we expect each of these latent vectors to represent information about the voice characteristics of input speech within a different time region, which must be time-independent. One way of implementing (17) would be to add a pooling layer after the final layer so that the network produces the time average of the latent vectors. However, rather than the time average of these values, we would want each of these values to be as close to $\mathbf{c}$ as possible. Hence, here we choose to implement (17) by treating $\mathbf{c}$ as a broadcast version of the latent code generated from the face encoder so that the $\mathbf{c}$ and $\boldsymbol{\mu}_\psi(\mathbf{x})$ arrays have compatible shapes.
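The difference between the pooling option and the broadcast option can be seen on a small example. The sketch below compares the two squared-error terms only (unit variances are assumed), with a deliberately contrived voice-encoder output whose frame-wise means oscillate around the face code. The function names and values are illustrative.

```python
import numpy as np

def framewise_squared_error(c, mu_seq):
    """Mean over frames of ||c - mu_t||^2, with c broadcast along the time axis."""
    c_broadcast = c[:, None]            # (D,) -> (D, 1), broadcasts against (D, T)
    return np.mean(np.sum((c_broadcast - mu_seq) ** 2, axis=0))

def pooled_squared_error(c, mu_seq):
    """||c - mean_t(mu_t)||^2, i.e. the error after time-average pooling."""
    return np.sum((c - mu_seq.mean(axis=1)) ** 2)

c = np.zeros(3)                                               # face-encoder code (D = 3)
mu_seq = np.tile(np.array([1.0, -1.0, 1.0, -1.0]), (3, 1))    # frame-wise means, shape (3, 4)

framewise = framewise_squared_error(c, mu_seq)   # penalizes every frame that deviates
pooled = pooled_squared_error(c, mu_seq)         # deviations cancel in the average
```

With pooling, the oscillating sequence incurs no penalty even though no single frame matches the code, whereas the broadcast form penalizes each frame individually.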

## 3 Experiments

To evaluate the proposed method, we created a virtual dataset consisting of speech and face pairs by combining the Voice Conversion Challenge 2018 (VCC2018) [27] and Large-scale CelebFaces Attributes (CelebA) [28] datasets. First, we divided the speech data in the VCC2018 dataset and the face image data in the CelebA dataset into training and test sets. For each set, we segmented the speech and face image data according to gender (male/female) and age (young/aged) attributes. We then treated each pair, which consisted of a speech signal and a face image randomly selected from groups with the same attributes, as virtually paired data. This indicates that the correlation between each speech and face image data pair was artificial. However, despite this, we believe that testing with this dataset can still provide a useful insight into the ability of the present method to capture and leverage the underlying correlation to convert speech or to generate images in a crossmodal manner.
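The virtual pairing procedure can be sketched as follows. The record format (an identifier followed by gender and age labels), the file names, and the function name are hypothetical. The sketch simply draws, for each utterance, a random face from the group with the same attributes.

```python
import random
from collections import defaultdict

def make_virtual_pairs(utterances, faces, seed=0):
    """Pair each utterance with a random face sharing its (gender, age) attributes.

    utterances, faces: lists of (item_id, gender, age) tuples.
    Returns a list of (utterance_id, face_id) pairs.
    """
    rng = random.Random(seed)
    faces_by_group = defaultdict(list)
    for face_id, gender, age in faces:
        faces_by_group[(gender, age)].append(face_id)

    pairs = []
    for utt_id, gender, age in utterances:
        candidates = faces_by_group.get((gender, age), [])
        if candidates:                      # skip utterances with no matching face group
            pairs.append((utt_id, rng.choice(candidates)))
    return pairs

# Hypothetical toy records
utterances = [("utt_001.wav", "female", "young"), ("utt_002.wav", "male", "aged")]
faces = [
    ("face_a.jpg", "female", "young"),
    ("face_b.jpg", "male", "aged"),
    ("face_c.jpg", "female", "young"),
]
pairs = make_virtual_pairs(utterances, faces)
```

Running the pairing separately on the training and test splits keeps the two sets disjoint, in line with the split described above.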

All the face images were downsampled to 32×32 pixels and all the speech signals were sampled at 22,050 Hz. For each utterance, a spectral envelope, a logarithmic fundamental frequency (log $F_0$), and aperiodicities (APs) were extracted every 5 ms using the WORLD analyzer [29, 30]. 36 mel-cepstral coefficients (MCCs) were then extracted from each spectral envelope using the Speech Processing Toolkit (SPTK) [31]. The aperiodicities were used directly without modification. The signals of the converted speech were obtained from the converted acoustic feature sequences using the WORLD synthesizer.

We implemented two methods as baselines for comparison, both of which assume the availability of the gender and age attribute labels assigned to each data sample. One is a naive method that simply adjusts the mean and variance of the feature vectors of the input speech for each feature dimension so that they match those of the training examples with the same attributes as the input speech. We refer to this method as “Baseline1”. The other is a two-stage method, which performs face attribute detection followed by attribute-conditioned VC. For the face attribute detector, we used the same architecture as the face encoder shown in Fig. 3 with the only difference being that we added a softmax layer after the final layer so that the network produced the probabilities of the input face image being “male” and “young”. We trained this network using gender/age attribute labels. For the attribute-conditioned VC, we used the ACVAE-VC [14], also trained using gender/age attribute labels. We refer to this method as “Baseline2”.
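The mean and variance adjustment used by Baseline1 can be sketched as a per-dimension affine transform. In the sketch below, `target_mean` and `target_std` stand for statistics computed beforehand from the training examples of the relevant attribute group. The function name, the array layout (frames by feature dimensions), and the toy values are assumptions made for illustration.

```python
import numpy as np

def match_mean_variance(features, target_mean, target_std, eps=1e-8):
    """Shift and scale each feature dimension to match target statistics.

    features: array of shape (n_frames, n_dims) for one utterance.
    target_mean, target_std: arrays of shape (n_dims,).
    """
    src_mean = features.mean(axis=0)
    src_std = features.std(axis=0)
    normalized = (features - src_mean) / (src_std + eps)
    return normalized * target_std + target_mean

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 36)) * 3.0 + 1.0   # toy MCC sequence
target_mean = np.linspace(-1.0, 1.0, 36)                # toy group statistics
target_std = np.linspace(0.5, 2.0, 36)
converted = match_mean_variance(features, target_mean, target_std)
```

The transform changes only global statistics per dimension, which is why it serves as a naive reference point.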

We conducted ABX tests to compare how well the voice of speech generated by each of the methods matched the face image input, where “A” and “B” were converted speech samples obtained with the proposed and baseline methods and “X” was the face image used for the auxiliary input. In these listening tests, “A” and “B” were presented in random order to eliminate bias in the order of stimuli. Eleven listeners participated in our listening tests. Each listener was presented with “A”, “B”, and “X” for 30 utterances and was asked to select “A”, “B” or “fair” by evaluating which of the two matched “X” better. The results are shown in Fig. 4. As the results reveal, the proposed method significantly outperformed Baseline1 and performed comparably to Baseline2. It is particularly noteworthy that the performance of the proposed method was comparable to that of Baseline2 even though the baseline methods had the advantage of using the attribute labels. Audio examples are provided at [32].

Fig. 5 shows several examples of the face images predicted by the proposed method from female and male speech. As can be seen from these examples, the gender and age of the predicted face images are reasonably consistent with those of the input speech, demonstrating an interesting effect of the proposed method.

## 4 Conclusions

This paper described the first attempt to solve the crossmodal VC problem by introducing an extension of our previously proposed non-parallel VC method called ACVAE-VC. Through experiments using a virtual dataset combining the VCC2018 and CelebA datasets, we confirmed that our method could convert input speech into a voice that matches an auxiliary face image input and generate a face image that matches input speech reasonably well. We are also interested in developing a crossmodal text-to-speech system, where the task is to synthesize speech from text with voice characteristics determined by an auxiliary face image input.

Acknowledgements: We thank Mr. Ken Shirakawa (Kyoto University) for his help in annotating the virtual corpus during his summer internship at NTT. This work was supported by JSPS KAKENHI 17H01763.

### References

1. H. M. J. Smith, A. K. Dunn, T. Baguley, and P. C. Stacey, “Concordant cues in faces and voices: Testing the backup signal hypothesis,” Evolutionary Psychology, vol. 14, no. 1, pp. 1–10, 2016.
2. A. Nagrani, S. Albanie, and A. Zisserman, “Seeing voices and hearing faces: Cross-modal biometric matching,” arXiv:1804.00326 [cs.CV], 2018.
3. L. Chen, S. Srivastava, Z. Duan, and C. Xu, “Deep cross-modal audio-visual generation,” arXiv:1704.08292 [cs.CV], 2017.
4. Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg, “Visual to sound: Generating natural sound for videos in the wild,” arXiv:1712.01393 [cs.CV], 2018.
5. W.-L. Hao, Z. Zhang, and H. Guan, “CMCGAN: A uniform framework for cross-modal visual-audio mutual generation,” in Proc. AAAI, 2018.
6. Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. SAP, vol. 6, no. 2, pp. 131–142, 1998.
7. T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. ASLP, vol. 15, no. 8, pp. 2222–2235, 2007.
8. K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in Proc. Odyssey, 2018, pp. 203–210.
9. F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. Interspeech, 2016, pp. 287–291.
10. T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, 2017, pp. 5535–5539.
11. C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, 2016.
12. C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017, pp. 3364–3368.
13. Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, 2018, pp. 5274–5278.
14. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” arXiv:1808.05092 [stat.ML], Aug. 2018.
15. T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv:1711.11293 [stat.ML], Nov. 2017.
16. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv:1806.02169 [cs.SD], Jun. 2018.
17. D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.
18. D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Adv. NIPS, 2014, pp. 3581–3589.
19. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. NIPS, 2014, pp. 2672–2680.
20. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.
21. T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. ICML, 2017, pp. 1857–1865.
22. Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. ICCV, 2017, pp. 2849–2857.
23. Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv:1711.09020 [cs.CV], Nov. 2017.
24. D. Barber and F. V. Agakov, “The IM algorithm: A variational approach to information maximization,” in Proc. NIPS, 2003.
25. Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017, pp. 933–941.
26. X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2Image: Conditional image generation from visual attributes,” in Proc. ECCV, 2016.
27. J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv:1804.04262 [eess.AS], Apr. 2018.
28. Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. ICCV, 2015.
29. M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016.
30. https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder.
31. https://github.com/r9y9/pysptk.
32. http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/crossmodal-vc/.