
Probing the Information Encoded in X-Vectors

Abstract

Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about the utterance (duration and augmentation type), and compare these with the information encoded by i-vectors across a varying number of dimensions. We also study the effect of data augmentation during extractor training on the information captured by x-vectors. Experiments on the RedDots data set show that x-vectors capture spoken content and channel-related information, while performing well on speaker verification tasks.

Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

Center for Language and Speech Processing & Human Language Technology Center of Excellence,
The Johns Hopkins University, Baltimore, MD 21218, USA.
draj@cs.jhu.edu, {david.ryan.snyder, dpovey}@gmail.com, khudanpur@jhu.edu

This work was partially supported by NSF CRI Grant No. 1513128 and DARPA LORELEI Contract No. HR0011-15-2-0024.

Keywords: x-vector, i-vector, speaker embedding, text-dependent speaker verification, RedDots

1 Introduction

In the last few years, speaker embeddings, especially those extracted from deep neural networks (DNNs) discriminatively trained to classify speakers, have become very popular for tasks like speaker identification and verification. In several studies, DNNs have been used to replace the universal background models (UBMs) [lei2014novel], to obtain bottleneck features in addition to traditional features [mclaren2015advances], and for text-dependent [variani2014deep] and text-independent [snyder2016deep] speaker verification.

In particular, x-vectors [snyder2017deep] have been shown to obtain state-of-the-art performance on text-independent speaker verification. In this paper, we investigate whether an x-vector embedding, which is trained solely to predict the speaker label, also contains incidental information about the transcription, channel, or meta-information about the utterance. Toward this objective, we design simple classification and regression based probing tasks which examine the embedding for these properties. It was shown in [snyder2018x] that x-vector systems effectively exploit data augmentation strategies for improved performance on speaker recognition. Our investigation using x-vector extractors trained with and without augmentation sheds some light on the possible reason behind this improvement. Furthermore, previous work has shown that i-vectors, though developed for speaker recognition, can improve automatic speech recognition (ASR), because they capture speaker and channel characteristics [saon2013speaker]. Our probing task results suggest that x-vectors also capture similar information and hence motivate their use for speaker adaptation in ASR.

Wang et al. [wang2017does] have previously conducted similar investigations for i-vectors [dehak2010front] and d-vectors [variani2014deep]. However, their probing tasks were designed on the RSR2015 data set [larcher2012rsr2015], while we use the RedDots data set [lee2015reddots] (freely available for academic research from https://sites.google.com/site/thereddotsproject/home). Similar studies on x-vectors have not been done before to the best of our knowledge. In natural language processing (NLP), probing tasks for embeddings have recently gained attention [conneau2018you] with sentence encoders such as BERT [devlin2018bert], which are pretrained on language modeling, but achieve state-of-the-art performance across several other tasks.

Since x-vectors are trained in a text-agnostic fashion by predicting the speaker label given the input utterance features, they perform well in text-independent speaker verification tasks. However, they may also be incorporated into a text-dependent system, if used in conjunction with a keyword spotting component. We demonstrate this by presenting results for text-dependent speaker verification using x-vectors trained in a text-independent manner and compare this method with an i-vector based system. These experiments are conducted on the RedDots data set.

The remainder of this paper is organized as follows. We first describe the theory and our implementation of i-vector and x-vector speaker embeddings in Section 2, followed by a description of the RedDots data set, the probing tasks, and the classifiers for the probing tasks in Section 3. In Section 4, we analyze the results of the probing tasks for i-vector and x-vector embeddings of different dimensions. Finally, we present results for our x-vector based system on text-dependent speaker verification in Section 5, and conclude in Section 6.

2 Speaker embeddings

In this section, we describe the speaker embeddings, i.e., i-vectors and x-vectors. Both these systems were built using the Kaldi speech recognition toolkit [povey2011kaldi], and trained on the dev portion of VoxCeleb2 [Chung18voxceleb], which consists of 1,092,009 utterances from 5,994 distinct speakers.

2.1 i-vector

First proposed in [dehak2010front], the i-vector framework assumes that the speaker- and session-dependent supervector of Gaussian mean vectors $\mathbf{M}$ may be modeled as

$$\mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w} \qquad (1)$$

where $\mathbf{m}$ is the speaker- and session-independent supervector obtained from a Gaussian mixture model (GMM) based universal background model (UBM), $\mathbf{T}$ is a low-rank total variability matrix that captures both speaker and session variability, and the i-vector $\mathbf{w}$ is the posterior mean of the latent variable.
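For completeness, a standard closed form for this posterior mean (our addition, following the usual total-variability formulation rather than anything stated in this paper) is

$$\mathbf{w} = \left(\mathbf{I} + \mathbf{T}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{N}(u)\,\mathbf{T}\right)^{-1}\mathbf{T}^{\top}\boldsymbol{\Sigma}^{-1}\tilde{\mathbf{F}}(u),$$

where $\mathbf{N}(u)$ and $\tilde{\mathbf{F}}(u)$ are the zeroth-order and centered first-order Baum-Welch statistics of utterance $u$ accumulated over the UBM components, and $\boldsymbol{\Sigma}$ is the block-diagonal matrix of UBM covariances.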

Our implementation of the traditional GMM-UBM based i-vector system is similar to that described in [snyder2015time]. The system is trained on 30-dimensional MFCC features with a frame length of 25 ms that are mean-normalized over a sliding window of up to 3 seconds. An energy-based speech activity detection (SAD) system selects features corresponding to speech frames. The UBM is a 2048-component full-covariance GMM.
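As a rough illustration of this front-end (not the Kaldi pipeline actually used here), the sketch below extracts 30-dimensional MFCCs with 25 ms frames and 10 ms hops, applies sliding-window mean normalization over up to 3 seconds, and keeps frames selected by a simple energy threshold. The librosa-based implementation and the threshold value are our own assumptions.

```python
import numpy as np
import librosa
from scipy.ndimage import uniform_filter1d

def extract_features(wav_path, sr=16000, n_mfcc=30):
    """30-dim MFCCs, 25 ms frames / 10 ms hop, sliding-window CMN, energy SAD."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms frames (400 samples) with 10 ms hop (160 samples) at 16 kHz.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160).T  # (frames, 30)
    # Mean-normalize over a sliding window of up to 3 s (300 frames).
    mfcc = mfcc - uniform_filter1d(mfcc, size=300, axis=0, mode="nearest")
    # Simple energy SAD: keep frames whose log-energy exceeds a heuristic
    # threshold (a stand-in for Kaldi's energy-based SAD).
    frames = librosa.util.frame(y, frame_length=400, hop_length=160)
    log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
    voiced = log_energy > (log_energy.mean() - 0.5 * log_energy.std())
    n = min(len(voiced), mfcc.shape[0])
    return mfcc[:n][voiced[:n]]
```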

2.2 x-vector

Our x-vector system is similar to that implemented in [snyder2017deep], and is described in detail there (an example recipe is available in the main branch of Kaldi at https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2). The features are 30-dimensional filterbanks with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 seconds. The same energy-based SAD as in the i-vector system filters out non-speech frames.

Layer | Layer context | Total context | Input x Output
frame1 | [t-2, t+2] | 5 | 150 x 512
frame2 | {t-2, t, t+2} | 9 | 1536 x 512
frame3 | {t-3, t, t+3} | 15 | 1536 x 512
frame4 | {t} | 15 | 512 x 512
frame5 | {t} | 15 | 512 x 1500
stats_pooling | [0, T) | T | 1500T x 3000
segment6 | {0} | T | 3000 x D
segment7 | {0} | T | D x 512
softmax | {0} | T | 512 x N
Table 1: The DNN architecture for the x-vector system. Embeddings are extracted at layer segment6, before the nonlinearity. Here, T, D, and N denote the number of utterance frames, the required embedding dimensionality, and the number of speakers, respectively.

Table 1 describes the DNN architecture for the x-vector system. The first 5 layers, called frame layers, operate at the context-dependent frame level. For example, the input for frame2 is the spliced output of frame1 at frames t-2, t, and t+2. Since the context for frame1 extends from t-2 to t+2, the total context for frame2 is 9.

At the statistics pooling layer, the 1500-dimensional frame5 outputs from all frames are aggregated by computing their mean and standard deviation. The subsequent layers operate on this pooled vector, which represents the entire segment, and are named segment layers. Since we extract embeddings at segment6, the output dimension at this layer is set to the required embedding dimension, say D. The final softmax layer has N output nodes, where N is the number of speakers in the training data. All nonlinearities are rectified linear units (ReLUs).
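A minimal PyTorch sketch of this architecture is shown below. The layer sizes follow the standard Kaldi recipe cited above and should be read as assumptions rather than the exact configuration used to train our extractors; batch normalization and other training details are omitted. The frame layers are implemented as dilated 1-D convolutions, and the pooling layer concatenates the mean and standard deviation over time.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=30, embed_dim=512, num_speakers=5994):
        super().__init__()
        # frame1..frame5: time-delay layers realized as dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),  # [t-2, t+2]
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),       # {t-2, t, t+2}
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),       # {t-3, t, t+3}
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment6 = nn.Linear(3000, embed_dim)   # embeddings are read here
        self.segment7 = nn.Linear(embed_dim, 512)
        self.softmax_layer = nn.Linear(512, num_speakers)

    def forward(self, feats):                        # feats: (batch, frames, feat_dim)
        h = self.frame_layers(feats.transpose(1, 2))            # (batch, 1500, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # (batch, 3000)
        embedding = self.segment6(stats)             # extracted before the nonlinearity
        out = self.softmax_layer(torch.relu(self.segment7(torch.relu(embedding))))
        return out, embedding
```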

3 Probing tasks

The underlying assumption for using probing tasks is that if information pertaining to a property is encoded by the embedding, we can train a classifier to predict this property, and the performance of the classifier is proportional to the amount of information that the embedding encodes [wang2017does]. In our experiments, we focus on properties related to the speaker, the utterance transcription, and meta information about the utterance.

3.1 Data set

We use the RedDots data set [lee2015reddots] with some modification for our probing tasks. Our primary reason for using this data is that it contains both common and unique utterances across a variety of speakers. We use the enrollment utterances corresponding to the common, unique, and free-choice pass-phrase portions of the data set. This gives us a total of 2484 utterances, comprising 460 unique transcriptions. We further apply augmentation on this data by adding music, speech, and noise using the MUSAN data set [snyder2015musan], which consists of over 900 noises, 42 hours of music from various genres, and 60 hours of speech from twelve languages. This increases the total number of utterances to 9936. The final distribution of the frequency of unique utterances is given in Table 2.

# unique utterances | Frequency | Total
8 | 456 | 3648
2 | 444 | 888
450 | 12 | 5400
Table 2: Distribution of the frequency of unique utterances. The high-frequency utterances correspond to the common pass-phrases. Each of the lower-frequency utterances is spoken by 3 different speakers, and 4-fold augmentation gives a total of 12.

3.2 Probing Tasks

We investigate the speaker embeddings using 8 different tasks that are designed to probe the speaker, transcription, and utterance-level meta information encoded in the embedding.


  1. Session identification: This task probes the ability of the embedding to predict the session, which is a more fine-grained recognition task than speaker identification. Since different sessions for the same speaker may have different acoustic characteristics owing to channel effects or varying background conditions, we conjecture that embeddings which predict the session well may encode some channel information. The data set comprises 828 different sessions in total.

  2. Speaker gender: This task investigates whether the embedding can distinguish between genders, i.e., a binary classification task. Although it is inherently an easier task than speaker identification, our data is imbalanced in the gender labels (49 male and 13 female speakers).

  3. Speaking rate: We augment all 9936 utterances by 3-way speed perturbation with rates 0.5, 1.0, and 1.5. A multi-class classifier with 3 classes is then trained on 80% of this data and evaluated on the remaining 20%. We report the overall accuracy, which indicates whether the speaker embeddings encode information about the speaking rate.

  4. Transcription: This task explores the ability of the speaker embedding to predict the exact transcription given the utterance. Although there are 460 different utterances, we select only the 100 most common ones for the prediction task. Still, there remains some imbalance in this set as well, since the top 10 most frequent utterances are the common phrases and occur much more frequently than the other utterances.

  5. Word recognition: This task probes whether the embeddings capture information about the words in an utterance. We select the 50 most frequent words in the vocabulary and build a classifier for each word that predicts, given the embedding as input, whether the utterance contains that word. Finally, we compute, for each utterance, the fraction of these 50 words labeled correctly, and then take the mean across all utterances (a sketch of this metric appears after this list).

  6. Phoneme recognition: This task investigates whether the embeddings encode phone-level information in an utterance. For this, we select only the 24 consonant phonemes of English, since vowel phonemes are present in most utterances (regardless of length). Furthermore, since most consonant phonemes would still be present in most utterances, we only say that an utterance contains a phoneme if it contains at least a minimum number of occurrences of that phoneme (this threshold is not arbitrary; it was tuned to obtain a sufficiently balanced training set). Similar to the word recognition task, we build a binary classifier for each such phoneme and finally compute the mean accuracy across all utterances. We use the g2p library (https://github.com/Kyubyong/g2p) for converting graphemes to phonemes.

  7. Utterance length: In this task, we examine whether the speaker embedding can predict the duration of the utterance. Since the duration is continuous, we model this as a regression task, as opposed to the other probing tasks. The performance of the embedding is reported in terms of $1 - \mathrm{RMSE}/\sigma$, where RMSE is the root mean squared error of the predicted durations and $\sigma$ is the standard deviation of utterance lengths. This quantity can be interpreted as the percentage of variance explained by the embedding.

  8. Augmentation type: Finally, this task investigates whether the type of augmentation (music, speech, noise, or no augmentation) is captured in the speaker embedding. We train a four-class classifier and report the accuracy of recognition.
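To make the word recognition metric concrete, the following sketch trains one binary classifier per frequent word and averages, over utterances, the fraction of these words whose presence or absence is predicted correctly. It substitutes a scikit-learn MLP for the probing classifier of Section 3.3, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def word_recognition_score(X_train, X_test, words_train, words_test, top_words):
    """X_*: (n_utts, dim) embedding matrices; words_*: list of word sets per utterance.
    Returns the mean, over test utterances, of the fraction of `top_words`
    whose presence/absence was predicted correctly."""
    correct = np.zeros(len(X_test))
    for word in top_words:                       # e.g. the 50 most frequent words
        y_train = np.array([word in ws for ws in words_train], dtype=int)
        y_test = np.array([word in ws for ws in words_test], dtype=int)
        clf = MLPClassifier(hidden_layer_sizes=(500,), max_iter=200)
        clf.fit(X_train, y_train)
        correct += (clf.predict(X_test) == y_test)
    return float(np.mean(correct / len(top_words)))
```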

Figure 1: Performance of speaker embeddings in speaker-related tasks: (a) session identification, (b) speaker gender, and (c) speaking rate recognition, across several dimensions.

3.3 Classifiers

Since our objective is to evaluate the embeddings, we use a very simple classifier: a multilayer perceptron (MLP) with a single hidden layer and ReLU activations. The hidden layer size is fixed at 500 for all probing tasks. This system is implemented in Python using the PyTorch deep learning framework [paszke2017automatic]. We use a cross-entropy loss for the classification tasks and a mean squared error loss for the regression task. Adam [kingma2014adam] is used as the optimizer in all probing tasks, with a learning rate of 0.001. Additionally, for the speaker gender task, we weight the labels in inverse proportion to their sample sizes in the loss function, to account for the class imbalance. Unless stated otherwise, we train our classifiers on 90% of the augmented RedDots data set and test on the remaining 10%.
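A minimal sketch of this probing classifier, under the stated settings (single 500-unit hidden layer, ReLU, Adam at learning rate 0.001, cross-entropy with optional class weights), might look as follows. Data loading is omitted and the names are our own.

```python
import torch
import torch.nn as nn

class ProbeMLP(nn.Module):
    """Single-hidden-layer MLP used to probe a fixed speaker embedding."""
    def __init__(self, embed_dim, num_classes, hidden=500):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, x):
        return self.net(x)

def train_probe(model, loader, class_weights=None, epochs=20, lr=1e-3):
    # Class weights (inverse label frequency) are used for the gender task.
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for embeddings, labels in loader:   # embeddings are precomputed and frozen
            optimizer.zero_grad()
            loss = criterion(model(embeddings), labels)
            loss.backward()
            optimizer.step()
    return model
```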

4 Results and discussion

We now present the results of the probing tasks described in Section 3. For each task, we experiment with embeddings of dimensions 128, 256, 512, and 768. For each dimension, we have an i-vector and an x-vector extractor trained on unaugmented VoxCeleb2 data. Additionally, we investigate the effect of data augmentation on the information encoded by x-vectors. We apply an augmentation strategy based on [snyder2018x], which doubles the amount of training data: we extend the clean VoxCeleb data with a second copy that has been randomly augmented, either with noises from MUSAN [snyder2015musan] or by convolution with simulated room impulse responses (RIRs) [ko2017study]. These systems are labeled x-vector_aug in the figures.
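As a rough sketch of this kind of augmentation (the SNR choices and function names are placeholders, not the exact recipe settings), a clean waveform can be mixed with a MUSAN noise at a target SNR or convolved with a room impulse response:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve with a (simulated) room impulse response and trim to length."""
    rir = rir / (np.max(np.abs(rir)) + 1e-12)
    return np.convolve(speech, rir)[: len(speech)]
```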

4.1 Session identification

In speaker recognition, it is desirable for embeddings to be invariant to channel or session characteristics. In contrast, capturing these sources of variability is beneficial for ASR, where acoustic models use embeddings (usually i-vectors) to adapt to both speaker and channel characteristics. Here we measure the amount of session information retained in the embeddings by examining their performance on a session identification task (from 828 possible sessions). In Figure 1(a), we see that while x-vectors significantly outperform i-vectors at low dimensions, the two achieve a similar accuracy (∼45%) at high dimensions. We observed that most of the errors were due to the embeddings attributing a session to another session from the same speaker, and analysis revealed that in these cases the sessions had similar acoustic characteristics and transcriptions. We believe that this ability of x-vectors to capture speaker and channel characteristics makes them suitable for speaker adaptation in ASR systems.

4.2 Speaker gender

Both embeddings achieve high accuracy in recognizing gender. In Figure 1(b) we see that, without training data augmentation, i-vectors clearly outperform x-vectors on this task. On further analysis, we found that while the recall for female-labeled utterances was ∼1 for i-vectors, it was slightly lower for x-vectors, which leads to the difference in overall performance. However, once the x-vector extractor is trained with augmentation, its ability to discriminate between speakers increases, and its gender identification accuracy subsequently reaches close to 99%.

Figure 2: Results for probing tasks related to the text content of the utterance: (a) transcription, and (b) word recognition.
Dim | i-vector (P / R / F) | x-vector (P / R / F)
128 | 0.54 / 0.61 / 0.57 | 0.43 / 0.56 / 0.48
256 | 0.38 / 0.53 / 0.44 | 0.54 / 0.67 / 0.60
512 | 0.64 / 0.74 / 0.69 | 0.77 / 0.85 / 0.80
768 | 0.66 / 0.76 / 0.70 | 0.77 / 0.86 / 0.81
Table 3: Performance of the embeddings in predicting the 10 common-phrase utterances. P, R, and F denote precision, recall, and F-score, respectively.

4.3 Speaking rate

Since we vary the speaking rate by factors of 0.5 and 1.5, the speaker characteristics change, so this is in effect a type of speaker recognition task in itself. As such, it is not surprising to find that, in Figure 1(c), the embeddings achieve high accuracy in predicting the speaking rate, with x-vectors (both augmented and unaugmented) reaching close to 100% accuracy across all dimensions.

4.4 Transcription

Figure 2(a) shows the accuracy of the embeddings in predicting the utterance transcription, given the speaker embedding as input. Since several of these utterances are spoken by multiple speakers, it seems very plausible that, to achieve high recognition performance, the embedding must incidentally capture some transcription information in addition to the speaker information. In Table 3, we additionally report performance metrics (precision, recall, and F-score) for the 10 common-phrase utterances (interestingly, the best performance across the board was for the utterance "Ok Google"). Although the x-vector DNN is trained on a large amount of unconstrained speech to predict only the speaker label, we find that x-vectors (with or without augmentation) are still not wholly invariant to lexical content, and they do retain information about what was spoken.

Figure 3: Word recognition performance of speaker embeddings for the top 50 most frequent words in the RedDots data set.
Figure 4: Phoneme recognition performance of the speaker embeddings for all consonant phonemes. Marker sizes denote the relative counts of the phonemes in the training set. The most recognizable phonemes for both systems are found to be Z, S, and N.

4.5 Word recognition

Figure 5: Results for probing tasks related to utterance meta information: (a) utterance length and (b) augmentation type.

From Figure 2(b), we find that i-vectors outperform x-vectors in the word recognition task, in terms of the average percentage of the top 50 words labeled correctly across all utterances. To further investigate this result, we plot the precision and recall metrics for all 50 words in Figure 3. It may be observed that x-vectors are good at recognizing words such as google, lawyers, and password, but fare poorly on more common words such as and, he, and in. While i-vectors show a similar trend, the difference is less pronounced. Augmentation (not shown) reduces the word-level recognition rate on average, but it increases recognition of keywords (most of which occur in the common-phrase utterances), which would explain the results in Figure 2(a) and Table 3.

4.6 Phoneme recognition

We found that the same metric as used for word recognition was not very informative in this case, since all the embeddings correctly classified around 33-34% of the 24 consonant phonemes. Instead, we plot the precision and recall metrics for each phoneme in Figure 4. As expected, phonemes with higher counts have better recognition rates. Nasals and plosives show similar trends across both i-vectors and x-vectors, but recognition of fricatives and approximants shows high variability.

Figure 6: Performance of speaker embeddings for text-dependent speaker verification on the male_part_01 subset of the RedDots data set.

4.7 Utterance length

In this section, we investigate whether information about the utterance length is retained in the speaker embeddings. From Figure 5(a), it can be seen that unaugmented x-vectors capture more of the variance in utterance length than i-vectors. We also found that for the majority of utterances, the predicted length is off by at most 0.5 seconds. Furthermore, most incorrect predictions are for very long utterances (10 or more seconds). This may be because the x-vector extractors were trained on chunks of 2 to 4 seconds, or because these utterances are outliers in the data set. Augmentation tends to decrease this quantity for x-vectors, because it improves the embedding's robustness to different sources of variability.

4.8 Augmentation type

Figure 5(b) shows the performance of the speaker embeddings in predicting the augmentation type (i.e., clean, babble, music, or noise), and it is evident that, without training augmentation, x-vectors capture more information than i-vectors in this regard. Further investigation showed that recognition was highest for utterances augmented with babble, which can be attributed to the fact that adding babble creates overlapping speech, whose acoustics differ markedly from those of noisy speech.

Without augmentation, x-vectors appear to capture more non-speech information than i-vectors. However, after training with augmented data, x-vectors become less sensitive to speech perturbed by noise, music, or reverberation. Training with augmented data trades this noise-related information for additional speaker information. As a result of the embeddings becoming more robust, their performance on this probing task decreases.

5 Text-dependent speaker verification

In this task, the objective is to verify a speaker based on their known utterances. In this section, we present results for the i-vector and x-vector based systems described earlier on the RedDots data set. The data set has 62 speakers (49 male and 13 female) from 21 countries. The total number of sessions in the current release is 572 (473 male and 99 female sessions). Partitions I, II, and III correspond to the common, unique, and free-choice pass-phrase portions of the data set. We report results only for the male_part_01 subset (all other subsets followed similar trends). The scores are obtained individually for the target_wrong, imposter_correct, and imposter_wrong cases, following the approach in [zeinali2016vector].

5.1 Speaker verification model

We use the speaker embedding models of dimensions 128, 256, and 512 for our experiments. In Section 2, we described the methods for obtaining these embeddings. Once we have the embeddings, we use a probabilistic linear discriminant analysis (PLDA) model as the backend [prince2007probabilistic]. Similar to [snyder2017deep], we center the embeddings and reduce their dimensionality using LDA (to 100 for the 128- and 256-dimensional embeddings, and to 200 for the 512-dimensional embeddings). After dimensionality reduction, the embeddings are length-normalized, and pairs of embeddings are compared using PLDA, which is trained on a combination of the Mixer 6, NIST SRE 2008, and SRE 2010 data sets [chodroff2016new, martin2009nist, martin2010nist].
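The pre-processing part of this backend can be sketched with scikit-learn; the PLDA scoring itself is done with standard toolkit implementations and is not reproduced here. The dimensions follow the text, and the function and variable names are our own assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def preprocess_embeddings(X_train, y_train, X_eval, lda_dim=100):
    """Center, reduce dimensionality with LDA, and length-normalize embeddings
    before they are scored with PLDA."""
    mean = X_train.mean(axis=0)                      # centering statistics
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(X_train - mean, y_train)                 # y_train: speaker labels
    Z = lda.transform(X_eval - mean)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # length normalization
    return Z
```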

5.2 Results

Figure 6 shows the equal error rates (EERs) obtained with the three systems for the different trial types. Since all systems are trained on text-independent speaker recognition data sets, their performance on target_wrong trials is worse than on the imposter_correct and imposter_wrong trials. In general, x-vectors achieve better scores than the i-vector system, and training augmentation further improves this performance by making the embeddings robust to non-speech information. Finally, increasing the embedding dimension improves recognition performance across all systems.

6 Conclusion

We investigated the information encoded in x-vectors, compared to i-vectors, using several probing tasks. We found that in addition to speaker-related information such as session and gender, x-vectors trained on unaugmented data also capture information about the lexical content (in particular, keywords present in the utterance) and meta information such as utterance length and augmentation type, even in low dimensions. When augmentation is used for extractor training, some of this information is traded for better speaker recognition performance. This was confirmed by our experiments on text-dependent speaker verification on the RedDots data set, for augmented and unaugmented x-vector systems trained in a text-independent manner. Our results motivate a line of inquiry into using x-vectors for speaker adaptation in automatic speech recognition (ASR) systems, since they capture speaker and channel characteristics well.

References
