Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional style to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers and languages.
Kun Zhou
Information Systems Technology and Design, Singapore University of Technology and Design
emotional voice conversion, speech emotion recognition (SER), emotional speech dataset
1 Introduction
Speech conveys information with words and also through its prosody. Speech prosody can affect the syntactic and semantic interpretation of an utterance (linguistic prosody), and also displays one’s emotional state (emotional prosody). Emotional prosody reflects the intent, mood and temperament of the speaker and plays an important role in daily communication. Emotional voice conversion is a voice conversion (VC) technique, which aims to transfer the emotional style of an utterance from one to another. Emotional voice conversion enables various applications such as expressive text-to-speech (TTS) and conversational agents.
Emotional voice conversion and speech voice conversion differ in many ways. Speech voice conversion aims to change the speaker identity, whereas emotional voice conversion focuses on the emotional state transfer. Traditional VC research includes modelling spectral mapping with statistical methods such as Gaussian mixture model (GMM), partial least square regression and sparse representation. Recent deep learning approaches such as deep neural network (DNN), recurrent neural network (RNN) and generative adversarial network (GAN) have advanced the state-of-the-art. We note that these frameworks have motivated the studies in emotional voice conversion.
Early studies on emotional voice conversion handle both spectrum and prosody conversion with GMM and sparse representation. Recent deep learning methods, such as deep belief network (DBN), deep bi-directional long short-term memory (DBLSTM), sequence-to-sequence and rule-based models, have shown their effectiveness for emotion conversion. We note that these frameworks rely on parallel training data. However, such data is not widely available in real-life applications.
There have been studies on deep learning approaches for emotional voice conversion that do not require parallel training data, such as cycle-consistent adversarial network (CycleGAN)-based [35, 25] and autoencoder-based frameworks [7, 36]. However, they are typically designed for a fixed set of conversion pairs. In this paper, we propose a novel technique, referred to as emotional style transfer (EST), that learns to transfer an emotional style to any input utterance. The emotional style is given to the network as a control condition. Therefore, the network supports one-to-many emotion conversion, including conversion to unseen emotions at run-time.
The auto-encoder is a suitable computational model that allows for the control of output generation through the latent variables. Recent studies on disentangling and recomposing the emotional elements in speech with VAW-GAN represent one of the successful attempts in emotional voice conversion. However, emotional prosody is the result of the interplay of multiple signal attributes, hence it is not easy to define emotional prosody by a simple labeling scheme. In TTS studies, there are techniques to learn a latent prosody embedding, i.e. style token, from the training data, in order to predict or transfer the speech prosody [33, 29]. These studies motivate us to investigate the use of deep emotional features that reflect the global speech style and describe the emotional prosody in a continuous space.
In this paper, we propose an emotional style transfer framework based on VAW-GAN. We use deep emotional features as a condition to enable the encoder-decoder training for seen and unseen emotion generation. Furthermore, we release an EVC dataset that consists of multi-speaker and multi-lingual emotional speech. We aim to address the lack of open-source emotional speech data in the voice conversion research community. This dataset can also be applied to other speech synthesis tasks, such as cross-lingual voice conversion and emotional TTS.
The main contributions of this paper include: 1) we propose to build a one-to-many emotional style transfer framework that does not require parallel training data; 2) we propose to use an SER model that is pre-trained on a publicly available large speech corpus to describe the emotional style; 3) we propose to disentangle and recompose the emotional elements through deep emotional features; and 4) we release a multi-lingual and multi-speaker emotional speech corpus, denoted as ESD, that can be used for various speech synthesis tasks. To the best of our knowledge, this paper is the first reported study on emotional style transfer for unseen emotion.
This paper is organized as follows: In Section 2, we motivate our study and analyze the deep emotional features. In Section 3, we introduce the proposed one-to-many EST framework. In Section 4, we report the experiments. Section 5 concludes the study.
2 Analysis of Deep Emotional Features
The computational analysis of emotion has been the focus of SER. Recent advances in deep learning have led to a shift from traditional human-crafted representations of acoustic features to the deep features automatically learnt by neural networks [24, 13]. Deep features are data-driven, less dependent on human knowledge and more suitable for emotional style transfer.
Emotional prosody is prominently exhibited in emotional speech databases, which can be characterized by discrete categories, such as Ekman’s six basic emotions [6], and continuous representation, such as Russell’s circumplex model [23]. Recent studies seek to characterize emotions over a continuum rather than a finite set of discrete categorical labels, with the feature representation learnt by deep neural networks in a continuous space. Following the findings in speech analysis, studies have used emotional prosody modelling to improve the expressiveness of speech synthesis systems [8, 32, 14]. For example, an emotion recognizer has been used to generate the style embedding for speech style transfer. Um et al. [32] apply the style embedding to a Tacotron system with the aim of controlling the intensity of emotional expressiveness. These successful attempts suggest that deep emotional features serve as an excellent prosody descriptor, which motivates this study.
We are interested in the use of deep emotional features for voice conversion, to describe emotional prosody in a continuous space. The idea is to use deep emotional features of a reference speech to transfer its emotional style to an output target speech. To motivate the idea, we visualize the deep emotional features of 4 speakers (2 male and 2 female) using the t-SNE algorithm [18] in a two-dimensional plane, as shown in Fig. 1. It is observed that deep emotional features form clear emotion groups in terms of feature distributions. Fig. 1 suggests that we may use deep emotional features as a style embedding to encode an emotion class. Encouraged by this observation, we propose a one-to-many emotional style transfer framework through deep emotional features.
3 One-to-many Emotional Style Transfer
We propose a novel one-to-many emotional style transfer framework that is based on VAW-GAN [11], with its decoder conditioned on deep emotional features. The proposed one-to-many EST framework is referred to as DeepEST for short. We next discuss DeepEST in three stages: 1) emotion descriptor training, 2) encoder-decoder training with VAW-GAN, and 3) run-time conversion. In Stage I, we train an auxiliary SER network to serve as the emotion descriptor for input utterances. In Stage II, the proposed encoder-decoder training is implemented to learn the disentanglement and recomposition of the emotional elements. In Stage III, DeepEST takes an input utterance and target deep emotional features to generate the utterance with the target emotional style.
3.1 Stage I: Emotion Descriptor Training
Emotional prosody is complex with multiple acoustic attributes which makes it difficult to model. There have been studies to label emotion manually into discrete categories, such as one-hot emotion label [3, 36]. As emotional prosody naturally spreads over a continuum that is hard to force-fit into a few categories, we propose to use deep emotional features that are learned from large animated and emotive speech data.
We propose to use an SER model as an emotion descriptor, which extracts deep emotional features $\Phi$ from the input utterance $X$, or $\Phi = \mathrm{SER}(X)$. The SER architecture is the same as that in [5], which includes: 1) a three-dimensional (3-D) CNN layer; 2) a BLSTM layer; 3) an attention layer; and 4) a fully connected (FC) layer. The 3-D CNN first projects the input Mel-spectrum with its delta and delta-delta features into a fixed-size latent representation that preserves the effective emotional information while reducing the influence of emotion-irrelevant factors. The following BLSTM and attention layers then summarize the temporal information from the previous layer and produce discriminative utterance-level features for emotion prediction, as visualized in Fig. 1.
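The three-channel SER input described above (static Mel-spectrum plus delta and delta-delta features) can be sketched as follows. This is a minimal illustration rather than the authors' feature pipeline: the helper names `delta` and `stack_static_delta` are hypothetical, and the central-difference formula with edge replication is an assumed stand-in for whatever regression window the SER front end actually uses.

```python
from typing import List

Frame = List[float]  # one Mel-spectrum frame (one value per Mel bin)


def delta(frames: List[Frame]) -> List[Frame]:
    """First-order time difference of a frame sequence.

    Assumption: a simple central difference, with the first and last
    frames repeated at the edges.
    """
    n = len(frames)
    out: List[Frame] = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def stack_static_delta(frames: List[Frame]) -> List[List[Frame]]:
    """Return [static, delta, delta-delta], i.e. channels x time x Mel bins."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [frames, d1, d2]


# Toy example: 3 frames with 2 Mel bins each.
mel = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
channels = stack_static_delta(mel)
```

The resulting channel-by-time-by-frequency layout is the kind of input a 3-D CNN front end would consume; the actual channel ordering in the SER model is not specified in the text.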
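Table 1: MCD values for Zero Effort, the VAW-GAN-EVC baseline and DeepEST.
| Conversion | Zero Effort | VAW-GAN-EVC | DeepEST | Zero Effort | VAW-GAN-EVC | DeepEST |
| neutral-to-angry | 6.686 | 4.437 | 4.079 (unseen) | 6.351 | 3.774 | 4.041 (unseen) |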
3.2 Stage II: Encoder-Decoder Training with VAW-GAN
Encoder-decoder structures have been used to effectively learn disentangled representations in previous studies [36, 11]. We propose an encoder-decoder training procedure as shown in Fig. 2, where the encoder ($E_\phi$) learns to disentangle the emotional elements from the input features and generate a latent representation $z$. The resulting representation is assumed to contain phonetic and speaker information but to be emotion-independent. Then the decoder/generator ($G_\theta$) learns to reconstruct the input features from the emotion-independent representation and other controllable emotion-related attributes.
In practice, we use the WORLD vocoder [20] to extract spectral features (SP) and fundamental frequency (F0) from the waveform. The encoder ($E_\phi$) with parameter set $\phi$ is exposed to input spectral frames $x$ with different emotion types and learns an emotion-independent representation $z$: $z = E_\phi(x)$. The latent representation extracted from the source spectrum still contains source information, and the conversion performance can suffer from this flaw. Therefore, the decoder/generator ($G_\theta$) with parameter set $\theta$ takes the emotion-independent representation $z$ together with the emotion-related features to recompose the emotional elements of the spectrum: the deep emotional features $\Phi$ from Stage I, which reflect the global emotional variance of the input utterance, and the corresponding $F_0$, which contains source pitch information. The reconstructed feature $\bar{x}$ can be formulated as:

$$\bar{x} = G_\theta(z, \Phi, F_0) = G_\theta(E_\phi(x), \Phi, F_0) \tag{1}$$
We then train a generative model for the spectrum through adversarial training: the discriminator ($D_\psi$) with parameter set $\psi$ tries to maximize the loss between the real features $x$ and the reconstructed features $\bar{x}$, while the generator ($G_\theta$) tries to minimize it. The parameter sets $\phi$, $\theta$ and $\psi$ are optimized through this min-max game, which allows us to generate high-quality speech samples.
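The adversarial part of this min-max game can be illustrated with the standard Wasserstein critic objective, in which the discriminator widens the gap between its mean score on real features and its mean score on reconstructed features, while the generator narrows it. The sketch below is an assumption-laden illustration, not the paper's training code: the function names are hypothetical, the scores are plain floats standing in for discriminator outputs, and the VAE terms (reconstruction and KL) and any Lipschitz constraint such as weight clipping are omitted.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def wasserstein_gap(real_scores: Sequence[float],
                    fake_scores: Sequence[float]) -> float:
    """Estimated Wasserstein term: E[D(real)] - E[D(reconstructed)]."""
    return mean(real_scores) - mean(fake_scores)


def discriminator_loss(real_scores: Sequence[float],
                       fake_scores: Sequence[float]) -> float:
    """The discriminator maximizes the gap, i.e. minimizes its negative."""
    return -wasserstein_gap(real_scores, fake_scores)


def generator_loss(fake_scores: Sequence[float]) -> float:
    """The generator minimizes the gap; only the fake term depends on it."""
    return -mean(fake_scores)


# Toy discriminator outputs for a batch of real and reconstructed frames.
real = [1.0, 3.0]
fake = [0.0, 1.0]
gap = wasserstein_gap(real, fake)
```

In a full VAW-GAN setup this adversarial term would be combined with the variational auto-encoder objective; the relative weighting is not given in this section.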
3.3 Stage III: Run-time Conversion
During run-time conversion, we have a source utterance expressed in neutral emotion, and we would like to convert it to a target emotion following the emotional style of the reference utterances. Suppose that we have a set of reference utterances belonging to one emotion category. We first use the pre-trained SER to generate the deep emotional features $\hat{\Phi}$ for all the reference utterances, i.e. all the utterances of our dataset with the same reference emotion. We then concatenate $\hat{\Phi}$ with the converted F0 ($\hat{F}_0$) and the emotion-independent representation $z$ from the source utterance to compose a latent representation for the target utterance SP. The converted SP $\hat{x}$ can be formulated as:

$$\hat{x} = G_\theta(z, \hat{\Phi}, \hat{F}_0) \tag{2}$$
Finally, the converted speech with the target emotion is synthesised by the WORLD vocoder from the converted spectral features and the converted F0.
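The run-time composition of the decoder input can be sketched as a per-frame concatenation. The example below is illustrative only: `compose_decoder_inputs` is a hypothetical helper, the dimensions (128 for the latent representation, 256 for the deep emotional feature, 1 for F0) are taken from the experimental setup, and broadcasting one utterance-level emotion embedding to every frame is an assumption about how the conditioning is applied.

```python
from typing import List, Sequence


def compose_decoder_inputs(z_frames: Sequence[Sequence[float]],
                           emotion_embedding: Sequence[float],
                           f0_contour: Sequence[float]) -> List[List[float]]:
    """Concatenate, for each frame, the emotion-independent latent vector,
    the utterance-level emotion embedding and the frame's F0 value."""
    if len(z_frames) != len(f0_contour):
        raise ValueError("latent frames and F0 contour must have equal length")
    return [list(z) + list(emotion_embedding) + [f0]
            for z, f0 in zip(z_frames, f0_contour)]


# Toy example with the dimensions reported in the experimental setup.
z = [[0.0] * 128, [0.0] * 128]   # two frames of latent representation
phi_ref = [1.0] * 256            # reference emotion embedding
f0_converted = [100.0, 0.0]      # converted F0 per frame (0.0 marks an unvoiced frame here)
decoder_in = compose_decoder_inputs(z, phi_ref, f0_converted)
```

Each composed frame has 128 + 256 + 1 = 385 values. Whether F0 is fed in linear or logarithmic scale is not stated in this section.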
4 Experiments
4.1 Emotional Speech Dataset (ESD)
With this paper, we introduce and publicly release a new multi-lingual and multi-speaker emotional speech dataset that can be used for various speech synthesis and voice conversion tasks.
To the best of our knowledge, this is the first parallel voice conversion dataset that provides emotion labels in a multi-lingual and multi-speaker setup. As future work, we will report an in-depth investigation of this dataset for cross-lingual and mono-lingual emotional voice conversion applications.
4.2 Experimental Setup
We conduct objective and subjective evaluations to assess the performance of our proposed DeepEST model for seen and unseen emotional style transfer. We use four English speakers (2 male and 2 female) from the ESD dataset. For each speaker, we conduct seen emotion conversion from neutral to happy (N2H: neutral-to-happy) and neutral to sad (N2S: neutral-to-sad). We choose angry as the unseen emotion, and conduct experiments from neutral to angry (N2A: neutral-to-angry) to assess the performance of our proposed model for unseen emotional style transfer.
For each speaker, we split the 350 utterances in the ESD dataset into a training set (330 utterances) and a test set (20 utterances). During training, we build one universal model that takes 330 utterances from each of the neutral, happy and sad emotion states. To obtain the deep emotional features, we train the SER [8, 14] on a subset of IEMOCAP [3] with four emotion types (happy, angry, sad and neutral). At run-time, we evaluate our framework with 20 utterances from both seen (happy, sad) and unseen (angry) emotion states. We obtain the reference emotion style by using all the utterances in the ESD dataset, as formulated in Eq. (2). For each emotion conversion, we generate the deep emotional features from the SER module by calculating the mean of the features that are in the same emotion category as the emotion reference.
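The averaging step described above can be sketched as an element-wise mean over the deep emotional features of all reference utterances in one emotion category. This is a minimal illustration: `mean_embedding` is a hypothetical helper name, and the toy vectors are far shorter than the 256-dimensional features used in the experiments.

```python
from typing import List, Sequence


def mean_embedding(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of utterance-level emotion embeddings."""
    if not embeddings:
        raise ValueError("at least one reference embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


# Toy example: two reference utterances of the same emotion, 2-dimensional features.
angry_refs = [[1.0, 2.0], [3.0, 6.0]]
angry_style = mean_embedding(angry_refs)
```

Using a single averaged vector per emotion gives one fixed style condition for each target emotion at run-time, including an emotion that was never seen by the encoder-decoder during training.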
As the baseline framework, we adapt the state-of-the-art VAW-GAN-EVC [36], which can perform conversion from one emotional state to another. We note that for each emotion conversion pair, we need to train a new VAW-GAN-EVC model, as it is not capable of performing one-to-many conversion. In contrast, DeepEST provides more flexible manipulation and generation of the output emotion, as it can perform emotional conversion from one emotional state to many and generate unseen emotions, which will be investigated in the following sections.
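Table 2: MOS for the reference speech, the VAW-GAN-EVC baseline and DeepEST.
| Reference | 4.80 ± 0.14 | 4.69 ± 0.23 | 4.78 ± 0.14 |
| VAW-GAN-EVC | 2.93 ± 0.27 | 3.10 ± 0.27 | 2.93 ± 0.27 |
| DeepEST | 3.01 ± 0.26 | 3.16 ± 0.26 | 3.04 ± 0.29 |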
In DeepEST, 513-dimensional SP, F0 and APs are extracted every 5 ms with an FFT length of 1024 using the WORLD vocoder. The frame length is 25 ms with a frame shift of 5 ms. We normalize every input frame of SP to unit sum and then re-scale it to the logarithmic domain. The encoder is a 5-layer 1D CNN with a kernel size of 7 and a stride of 3, followed by an FC layer. Its output channels are 16, 32, 64, 128 and 256. The latent representation is 128-dimensional and assumed to have a Gaussian distribution. The deep emotional feature is 256-dimensional, and is concatenated with the 128-dimensional latent representation and the 1-dimensional F0 contour to form the input to the decoder. The decoder is a 4-layer 1D CNN with kernel sizes of 9, 7, 7 and 1025 and strides of 3, 3, 3 and 1. Its output channels are 32, 16, 8 and 1. The discriminator is a 3-layer 1D CNN with kernel sizes of 7, 7 and 115 and strides of 3, 3 and 3, followed by an FC layer. The networks are trained using RMSProp with a learning rate of 1e-5. The batch size is 256, and training runs for 45 epochs.
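The SP normalization step (unit sum, then logarithm) can be sketched per frame as follows. This is an illustrative sketch under stated assumptions: `normalize_sp_frame` is a hypothetical helper, the natural logarithm is assumed because the base is not given, and the small `floor` value that guards against log(0) is an addition that does not appear in the text.

```python
import math
from typing import List, Sequence


def normalize_sp_frame(frame: Sequence[float], floor: float = 1e-10) -> List[float]:
    """Scale a spectral frame so its values sum to one, then take the log.

    `floor` (an assumed safeguard) keeps zero-valued bins from producing -inf.
    """
    total = sum(frame)
    if total <= 0.0:
        raise ValueError("spectral frame must have a positive sum")
    return [math.log(max(value / total, floor)) for value in frame]


# Toy example: a 3-bin frame instead of the 513 bins used in the experiments.
log_frame = normalize_sp_frame([1.0, 1.0, 2.0])
```

Exponentiating the output recovers a frame that sums to one, which makes the per-frame energy independent of the spectral shape seen by the encoder.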
4.3 Objective Evaluation
We conduct objective evaluation to assess the performance of our proposed model. We calculate Mel-cepstral distortion (MCD) [28, 26] to measure the spectral distortion between the converted and reference Mel-spectrum for two male and two female speakers for three emotion combinations.
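For readers unfamiliar with the metric, a common form of MCD between two time-aligned Mel-cepstral sequences is sketched below. The function names are hypothetical, and several details are assumptions not stated in this section: the frames are taken to be already aligned (for example by dynamic time warping), the energy coefficient is taken to be excluded by the caller, and the result is expressed in dB using the usual 10/ln(10) scaling.

```python
import math
from typing import Sequence


def mcd_frame(converted: Sequence[float], reference: Sequence[float]) -> float:
    """Mel-cepstral distortion (dB) between two Mel-cepstral coefficient vectors."""
    if len(converted) != len(reference):
        raise ValueError("coefficient vectors must have the same length")
    squared = sum((c - r) ** 2 for c, r in zip(converted, reference))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


def mcd(converted_frames: Sequence[Sequence[float]],
        reference_frames: Sequence[Sequence[float]]) -> float:
    """Average frame-level MCD over a pair of time-aligned sequences."""
    if len(converted_frames) != len(reference_frames) or not converted_frames:
        raise ValueError("sequences must be non-empty and time-aligned")
    total = sum(mcd_frame(c, r) for c, r in zip(converted_frames, reference_frames))
    return total / len(converted_frames)


# Toy example: identical frames give zero distortion.
same = mcd([[0.1, 0.2], [0.3, 0.4]], [[0.1, 0.2], [0.3, 0.4]])
unit = mcd_frame([1.0], [0.0])
```

A lower MCD indicates a smaller spectral distance between the converted and reference speech.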
As shown in Table 1, the proposed DeepEST outperforms the baseline framework VAW-GAN-EVC for all the seen emotions (N2H and N2S). We note that for the unseen emotion (angry), DeepEST still achieves results comparable to those of the VAW-GAN-EVC baseline, which is trained with angry speech samples. These encouraging observations validate the effectiveness of our proposed DeepEST for seen and unseen emotion transfer. Last but not least, three VAW-GAN-EVC models must be trained to cover all conversion pairs, whereas the proposed DeepEST performs all the emotion mapping pairs within one model, which we believe is remarkable.
4.4 Subjective Evaluation
We further conduct three listening experiments to assess the proposed DeepEST in terms of speech quality and emotion similarity. 15 subjects participated in all the experiments and each listened to 108 converted utterances in total.
We first report the mean opinion score (MOS) of the reference speech samples, the baseline VAW-GAN-EVC and the proposed DeepEST. As shown in Table 2, DeepEST achieves better results than the baseline for both seen and unseen emotion combinations, which we believe is remarkable. Secondly, we conduct an AB preference test to further evaluate the speech quality, where the subjects are asked to choose the speech samples with higher speech quality. As shown in Fig. 3, we observe that the proposed framework DeepEST consistently outperforms the baseline framework VAW-GAN-EVC for all the emotion combinations, which is also consistent with the MOS results. These observations validate the effectiveness of our proposed model in terms of voice quality.
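Each entry in Table 2 pairs a mean score with a second, smaller value. Assuming that second value is a 95% confidence half-width (the text does not say how it was computed), a MOS entry could be derived from raw listener ratings as sketched below. The helper name `mos_with_ci` and the normal-approximation multiplier 1.96 are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, confidence half-width).

    Assumption: normal approximation with the sample standard deviation;
    z = 1.96 corresponds to a 95% interval.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("at least two ratings are required")
    mean_score = sum(ratings) / n
    variance = sum((r - mean_score) ** 2 for r in ratings) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean_score, half_width


# Toy example: three listener ratings on a 1-to-5 scale.
score, half = mos_with_ci([3.0, 4.0, 5.0])
```

With only 15 listeners, overlapping intervals between two systems suggest that small MOS differences should be read with caution.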
We further conduct an XAB emotion similarity test to assess the emotion conversion performance, where subjects are asked to choose the speech samples that sound closer to the reference in terms of emotional expression. Consistent with previous experiments, the baseline is one-to-one conversion and all emotions are seen emotions. As shown in Fig. 4, DeepEST outperforms the baseline for neutral-to-angry conversion. We observe that the baseline achieves better performance for neutral-to-happy conversion, which we attribute to the poor performance of SER on happy (29.95%), compared with 84.32% on sad and 70.47% on angry. For the unseen emotion (angry), DeepEST outperforms the baseline in terms of emotion similarity, which is very encouraging. This result validates the effectiveness of DeepEST for unseen emotion transfer. As future work, we will improve the SER performance for all emotional states, and report an in-depth investigation.
5 Conclusion
In this paper, we propose a one-to-many emotional style transfer framework based on VAW-GAN without the need for parallel data. We propose to leverage deep emotional features from SER to describe emotional prosody in a continuous space. By conditioning the decoder with controllable attributes such as deep emotional features and F0 values, we achieve competitive results for both seen and unseen emotions over the baseline framework, which validates the effectiveness of our proposed framework. Moreover, we also introduce a new emotional speech dataset, ESD, that can be used in speech synthesis and voice conversion.
- Codes & speech samples: https://kunzhou9646.github.io/controllable-evc/
- Emotional Speech Dataset (ESD): https://github.com/HLTSingapore/Emotional-Speech-Data
References
- [1] (2014) Exemplar-based emotional voice conversion using non-negative matrix factorization. In APSIPA ASC.
- [2] (1960) Emotion and personality.
- [3] (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335.
- [4] (2014) Voice conversion using deep neural networks with layer-wise generative training. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), pp. 1859–1872.
- [5] (2018) 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters 25 (10), pp. 1440–1444.
- [6] (1992) An argument for basic emotions. Cognition & Emotion.
- [7] (2019) Nonparallel emotional speech conversion. Proc. Interspeech 2019, pp. 2858–2862.
- [8] (2020) Interactive text-to-speech via semi-supervised style transfer learning. arXiv preprint arXiv:2002.06758.
- [9] (2010) Voice conversion using partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing.
- [10] (2004) Pragmatics and intonation. The Handbook of Pragmatics, pp. 515–537.
- [11] (2017) Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849.
- [12] (2019) DNN-based emotion recognition based on bottleneck acoustic features and lexical features. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6720–6724.
- [13] (2020) Deep representation learning in speech processing: challenges, recent advances, and future trends. arXiv preprint arXiv:2001.00378.
- [14] (2020) Expressive TTS training with frame and style reconstruction loss. arXiv preprint arXiv:2008.01490.
- [15] (2020) Teacher-student training for robust Tacotron-based TTS. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6274–6278.
- [16] (2016) Emotional voice conversion using deep neural networks with MCC and F0 features. In 2016 IEEE/ACIS 15th ICIS.
- [17] (2017) Adapting and controlling DNN-based speech synthesis using input codes. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4905–4909.
- [18] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- [19] (2016) Deep bidirectional LSTM modeling of timbre and prosody for emotional voice conversion. Interspeech 2016, pp. 2453–2457.
- [20] (2016) WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems 99 (7), pp. 1877–1884.
- [21] (2014) High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion. In Fifteenth Annual Conference of the International Speech Communication Association.
- [22] (2019) Sequence-to-sequence modelling of F0 for speech emotion conversion. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6830–6834.
- [23] (1980) A circumplex model of affect. Journal of Personality and Social Psychology 39 (6), pp. 1161.
- [24] (2020) A review on five recent and near-future developments in computational processing of emotion in the human voice. Emotion Review, pp. 1754073919898526.
- [25] (2020) Non-parallel emotion conversion using a deep-generative hybrid network and an adversarial pair discriminator. arXiv preprint arXiv:2007.12932.
- [26] (2020) An overview of voice conversion and its challenges: from statistical modeling to deep learning. arXiv preprint arXiv:2008.03648.
- [27] (2019) On the study of generative adversarial networks for cross-lingual voice conversion. IEEE ASRU.
- [28] (2019) Group sparse representation with WaveNet vocoder adaptation for spectrum and prosody conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (6), pp. 1085–1097.
- [29] (2018) Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, pp. 4693–4702.
- [30] (2006) Prosody conversion from neutral speech to emotional speech. IEEE Transactions on Audio, Speech, and Language Processing.
- [31] (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing 15 (8), pp. 2222–2235.
- [32] (2020) Emotional speech synthesis with rich and granularized control. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7254–7258.
- [33] (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International Conference on Machine Learning, pp. 5180–5189.
- [34] (2018) Voice conversion for emotional speech: rule-based synthesis with degree of emotion controllable in dimensional space. Speech Communication.
- [35] (2020) Transforming spectrum and prosody for emotional voice conversion with non-parallel training data. In Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, pp. 230–237.
- [36] (2020) Converting anyone’s emotion: towards speaker-independent emotional voice conversion. arXiv preprint arXiv:2005.07025.