Speaker Recognition using SincNet and X-Vector Fusion

Abstract

In this paper we propose a novel approach to speaker recognition that fuses two recently introduced deep neural networks (DNNs), namely SincNet and X-Vector. The idea behind applying SincNet filters to the raw speech waveform is to extract more distinguishing frequency-related features in the initial convolution layers of the CNN architecture. X-Vectors are used because this embedding is an efficient method to produce fixed-dimension features from variable-length speech utterances, something that is challenging for plain CNN techniques, making the system efficient both in terms of speed and accuracy. Our approach takes the best of both worlds by combining the X-Vector in the later layers while using SincNet filters in the initial layers of our deep model, which allows the network to learn better embeddings and converge more quickly. Previous works use either X-Vectors or SincNet filters, or some modification of them; here we introduce a novel fusion architecture that combines both techniques to gather more information about the speech signal and hence obtain better results. We use the VoxCeleb1 dataset for both training and testing.

Keywords:
Speaker Recognition · Deep Neural Networks · SincNet · X-vector · VoxCeleb1 · Fusion Model.

1 Introduction

Speaker Recognition and Automatic Speech Recognition (ASR) are two actively researched fields in computer science. Speaker recognition has applications in areas such as biometric authentication, forensics, security and speech recognition, which has contributed to steady interest in this discipline [1]. The conventional method of speaker identification involves classifying features extracted from speech, such as the Mel Frequency Cepstral Coefficients (MFCC) [19].

With the advent of i-vectors [4], speaker verification has become faster and more efficient compared to the preceding model based on higher-dimensional Gaussian Mixture Model (GMM) supervectors. i-vectors are low-dimensional vectors that are rich in information and represent distinguishing features of the speaker. They are used to generate a verification score between the embeddings of two speakers; this score indicates whether the two utterances come from the same speaker or from different speakers. Previous experiments have shown that i-vectors perform better with Probabilistic Linear Discriminant Analysis (PLDA) [7]. Work has also been carried out on training i-vectors with different techniques in order to obtain better embeddings and, therefore, better results [5, 12].

Currently, researchers are moving towards Deep Neural Networks (DNNs) to obtain speaker embeddings. A DNN can be directly optimized to distinguish between speakers, and DNNs have shown promising results in comparison to statistical methods such as i-vectors [6, 16]. X-vectors are seen as an improvement over the i-vector system (which is also a fixed-dimension vector system) because they are more robust and have yielded better results [18]. The X-vector extraction methodology employs a Time-Delay Neural Network (TDNN) to learn features from variable-length audio samples and convert them to fixed-dimension vectors. This architecture can broadly be broken down, in order of occurrence, into three units: the frame-level layers, the statistics pooling layer and the segment-level layers. X-vectors can then be used with classifiers of any kind to carry out recognition tasks.

SincNet [14] is a deep neural network with band-pass filters embedded in its first layer for extracting features from the audio sample; the features are then fed into DNN-based classifiers. We use SincNet filters rather than feeding the raw audio waveform directly into DNN-based classifiers, since the latter approach suffers from long convergence times and poorer results. The SincNet filters are band-pass filters derived from parameterized sinc functions. This provides a compact and efficient way to obtain a filter bank that can be customized and tuned for a given application.

Our novel architecture combines the strengths of both methodologies, SincNet and X-Vector, by extracting features using both techniques and feeding them to fully connected dense layers which act as a classifier. The organization of this paper is as follows: related work is described in section 2, the proposed fusion architecture is presented in section 3, the experimental setup and the results are discussed in sections 4 and 5 respectively, and the final conclusions are drawn in section 6.

2 Related Works

The i-vector [4] feature extraction method was the state of the art for speaker recognition tasks for a long time. Classifiers such as Probabilistic Linear Discriminant Analysis (PLDA) and Gauss-PLDA [3, 7] are fed with the extracted features and carry out classification based on them. Despite its strong performance, there is still room for improvement in terms of accuracy.

With the advent of various deep learning techniques in multiple domains for feature extraction, work has also been carried out to extract features from audio signals using deep learning architectures [9]. These architectures can range from using deep learning just for feature extraction to using neural networks as classifiers as well. The deep learning methods give better results when compared to older techniques based on feature engineering [20].

The most commonly used deep learning method for feature extraction is based on the Convolutional Neural Network (CNN) architecture [16]. The CNN has been a preferred choice for researchers as it has given good quality results in tasks such as image recognition. Initially the CNN was fed with spectrograms in order to extract features from the audio signals [2, 13, 15, 16, 22]. Although spectrogram-based CNN methods gave good results, this approach has several drawbacks [21]. Firstly, the spectrogram is a temporal representation of the data, unlike images, which are spatio-temporal representations. Secondly, a frame in a spectrogram can result from the superposition of multiple waves and not just a single wave, which means that the same frame can be obtained from different waves superposed in different ways. Also, even though spectrograms retain more information than standard hand-crafted features, their design still requires careful tuning of hyper-parameters such as the duration, overlap, and typology of the frame window.

The above drawbacks also inspired researchers to directly input raw waveforms into CNNs [8] so that no information is lost. This methodology works, but it results in slower convergence since the network has to process the complete frequency band of the audio signal.

SincNet [14] and X-vector [18] are amongst the most recent deep learning methods for speech signal classification. Both of them have proved to be more robust than methods which have preceded them.

3 Proposed Fusion Model

We propose a novel fusion of SincNet [14] and X-Vector [18] embeddings, which enables us to take temporal features into account. This is important for any audio-based recognition task, since the signal at any time t is affected by the samples preceding and succeeding it.

3.1 X-Vector Embedding

We have used a pre-trained X-vector system trained on the VoxCeleb1 [11] dataset, the same dataset used in our experiments. The pre-trained X-vector system is publicly available in the Kaldi toolkit [17]. Table 1 shows the architecture of this X-vector feature extractor: frame-level TDNN layers operate on the variable-length input, the statistics pooling layer aggregates the frame-level activations over time, and the segment-level layers produce the fixed-dimension embedding that is fed to the classifier.

Layer | Layer Context | Total Context | Input x Output
frame 1 | [t-2, t+2] | 5 | 120x512
frame 2 | {t-2, t, t+2} | 9 | 1536x512
frame 3 | {t-3, t, t+3} | 15 | 1536x512
frame 4 | {t} | 15 | 512x512
frame 5 | {t} | 15 | 512x1500
stats pooling | [0, T) | T | 1500Tx3000
segment 6 | {0} | T | 3000x512
segment 7 | {0} | T | 512x512
softmax | {0} | T | 512xN
Table 1: The X-vector DNN architecture [18].
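
To make the flow of Table 1 concrete, the following is a minimal PyTorch sketch of an x-vector-style TDNN. It follows the layer sizes listed in the table, but the use of dilated 1-D convolutions to realize the temporal contexts, the 24-dimensional input features and all class and variable names are our own illustrative assumptions, not the Kaldi implementation used in our experiments.

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    """Illustrative x-vector-style TDNN following the layer sizes in Table 1."""
    def __init__(self, feat_dim=24, num_speakers=1251):
        super().__init__()
        # Frame-level layers: 1-D convolutions over time; kernel size and dilation
        # approximate the contexts [t-2, t+2], {t-2, t, t+2}, {t-3, t, t+3}.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers operate on the pooled statistics (mean + std).
        self.segment6 = nn.Linear(3000, 512)   # the x-vector is taken from here
        self.segment7 = nn.Linear(512, 512)
        self.output = nn.Linear(512, num_speakers)

    def forward(self, x):
        # x: (batch, feat_dim, num_frames), with a variable number of frames
        h = self.frame_layers(x)
        # Statistics pooling: concatenate mean and standard deviation over time,
        # turning a variable-length sequence into a fixed 3000-dim vector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.segment6(stats)            # 512-dim embedding (x-vector)
        return self.output(torch.relu(self.segment7(torch.relu(xvec))))

# Usage with a fixed-length dummy input (variable lengths are handled upstream):
model = XVectorTDNN()
out = model(torch.randn(2, 24, 300))           # (2, 1251) speaker logits
```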

3.2 SincNet Architecture

The SincNet architecture implements a set of band-pass filters in its initial layer, which learn directly from the raw audio signal. The SincNet filters are implemented using the mathematical operations stated below [14].

The first layer of a standard CNN performs a convolution between the input waveform $x[n]$ and a set of Finite Impulse Response (FIR) filters $h[n]$ used as convolution filters:

$$y[n] = x[n] * h[n] = \sum_{l=0}^{L-1} x[l]\, h[n-l] \quad (1)$$

In SincNet, $h[n]$ is replaced by a predefined function $g$ that depends on only a few learnable parameters $\theta$:

$$y[n] = x[n] * g[n, \theta] \quad (2)$$

In the frequency domain, $g$ is defined as a band-pass filter given by the difference of two low-pass filters, where $\mathrm{rect}(\cdot)$ is the rectangular function in the magnitude frequency domain and $f_1$ and $f_2$ are the learnable cut-off frequencies:

$$G(f, f_1, f_2) = \mathrm{rect}\!\left(\frac{f}{2f_2}\right) - \mathrm{rect}\!\left(\frac{f}{2f_1}\right) \quad (3)$$

Returning to the time domain, the function $g$ is defined as

$$g[n, f_1, f_2] = 2f_2\,\mathrm{sinc}(2\pi f_2 n) - 2f_1\,\mathrm{sinc}(2\pi f_1 n), \quad (4)$$

where $\mathrm{sinc}(x) = \sin(x)/x$. We have to ensure that $f_1 \geq 0$ and $f_2 \geq f_1$; therefore, equation (4) is actually fed with the following parameters:

$$f_1^{\mathrm{abs}} = |f_1| \quad (5)$$

$$f_2^{\mathrm{abs}} = f_1^{\mathrm{abs}} + |f_2 - f_1| \quad (6)$$

Windowing is performed by multiplying the truncated function $g$ with a window function $w[n]$:

$$g_w[n, f_1, f_2] = g[n, f_1, f_2] \cdot w[n] \quad (7)$$

The windowing function $w[n]$ is a Hamming window, given by

$$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{L}\right) \quad (8)$$

Once the filters are applied on the raw audio, we get features. These features can now be fed into any classifier. The vanilla SincNet architecture can be seen in Fig. 1. It takes raw audio signals as input, applies SincNet filters and then feeds it into a CNN model which is used as a classifier.

Figure 1: The SincNet architecture, as presented in the original paper [14].
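
The following is a small PyTorch sketch of equations (3)-(8): a single band-pass kernel is built from two learnable cut-off frequencies and smoothed with a Hamming window before being convolved with the raw waveform. It is an illustration of the filter construction only; the kernel length, sampling rate and function names are assumptions, and this is not the reference SincNet code.

```python
import math
import torch

def sinc(x):
    """sinc(x) = sin(x) / x, with the removable singularity at x = 0 handled."""
    return torch.where(x == 0, torch.ones_like(x), torch.sin(x) / x)

def sincnet_kernel(f1, f2, kernel_size=251, sample_rate=16000):
    """Build one band-pass FIR kernel g_w[n, f1, f2] following Eqs. (4)-(8).

    f1 and f2 are the learnable cut-off frequencies (in Hz); illustrative sketch.
    """
    # Eqs. (5)-(6): force 0 <= f1_abs <= f2_abs regardless of the raw parameters.
    f1_abs = abs(f1)
    f2_abs = f1_abs + abs(f2 - f1)
    # Normalised frequencies and a symmetric time axis n = -(L-1)/2 ... (L-1)/2.
    n = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    f1_n, f2_n = f1_abs / sample_rate, f2_abs / sample_rate
    # Eq. (4): the difference of two sinc low-pass filters gives a band-pass filter.
    g = 2 * f2_n * sinc(2 * math.pi * f2_n * n) - 2 * f1_n * sinc(2 * math.pi * f1_n * n)
    # Eqs. (7)-(8): smooth the truncated filter with a Hamming window.
    m = torch.arange(kernel_size, dtype=torch.float32)
    w = 0.54 - 0.46 * torch.cos(2 * math.pi * m / kernel_size)
    return g * w

# Example: a band-pass kernel between 300 Hz and 3400 Hz applied to raw audio.
kernel = sincnet_kernel(300.0, 3400.0).view(1, 1, -1)
waveform = torch.randn(1, 1, 16000)                 # 1 second of 16 kHz audio
features = torch.nn.functional.conv1d(waveform, kernel)
```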

3.3 Fusion Model

Our proposed fusion model fuses the pre-trained X-Vector with features extracted from the trained SincNet model. The concatenated features are fed to a fully connected dense layer, which is followed by two more fully connected dense layers, as shown in Fig. 2. The idea behind using SincNet filters on the raw waveform is to extract more distinguishing features in the initial convolution layers of the CNN architecture. X-Vectors are used because this embedding is an efficient method to produce fixed-dimension features from variable-length speech utterances, something that is challenging for plain CNN techniques, making the system efficient both in terms of speed and accuracy.

4 Experimental Setup

4.1 Dataset

We carried out our experiments on the publicly available VoxCeleb1 dataset [11]. VoxCeleb is an audio-visual dataset consisting of short clips of human speech extracted from interview videos uploaded to YouTube; we use only the raw audio files in our experiments. The VoxCeleb1 dataset contains roughly 21k video recordings from 1,251 celebrity speakers; its distribution is summarised in Table 2.

 | Dev | Test
Number of speakers | 1,251 | 1,251
Number of videos | 21,245 | 1,251
Number of utterances | 145,265 | 8,251
Table 2: VoxCeleb1 dataset distribution [11].

4.2 Model Architecture

In order to obtain comparative results, we experimented with the original SincNet architecture, an X-vector based system, and our proposed architecture. The original SincNet architecture uses SincNet filters for feature extraction, whereas our architecture makes use of both SincNet and X-vector features. The classifier used in all cases consists of several fully connected layers (a DNN classifier) with a softmax output layer. The models can be categorized as:

  1. SincNet based feature extractor and DNN classifier.

  2. X-Vector embedding and DNN classifier.

  3. X-Vector and SincNet based feature extractor and DNN classifier (Proposed).

Our proposed fusion model fuses the pre-trained X-vector with features extracted from the trained SincNet model, as shown in Fig. 2. The output obtained after the convolution step is flattened and fed into a dense layer, which is then concatenated with the X-vector embedding. The X-vector combined with this dense layer constitutes fully connected layer 1 (FC1), which is further connected to FC2 and FC3. All the dense layers use Leaky ReLU activation, and a softmax layer is used as the output layer.

Figure 2: The proposed fusion model.
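
As an illustration of the classifier head in Fig. 2, the sketch below concatenates the flattened SincNet/CNN features with the 512-dimensional x-vector and passes the result through FC1, FC2 and FC3 with Leaky ReLU activations and a softmax output, following one reading of the description above. The hidden-layer widths and the SincNet feature dimension are not specified in the paper and are assumed here purely for illustration.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative classifier head fusing SincNet features with an x-vector.

    The SincNet/CNN front-end output dimension (2048 here) and the hidden
    widths are assumptions; the x-vector is 512-dimensional as in Table 1.
    """
    def __init__(self, sinc_dim=2048, xvec_dim=512, num_speakers=1251):
        super().__init__()
        self.fc1 = nn.Linear(sinc_dim + xvec_dim, 1024)   # FC1 on the fused vector
        self.fc2 = nn.Linear(1024, 1024)                   # FC2
        self.fc3 = nn.Linear(1024, 512)                    # FC3
        self.out = nn.Linear(512, num_speakers)            # softmax output layer
        self.act = nn.LeakyReLU()

    def forward(self, sincnet_features, xvector):
        # Concatenate the flattened SincNet-CNN features with the x-vector.
        fused = torch.cat([sincnet_features, xvector], dim=1)
        h = self.act(self.fc1(fused))
        h = self.act(self.fc2(h))
        h = self.act(self.fc3(h))
        return torch.log_softmax(self.out(h), dim=1)       # use with NLLLoss

# Usage with dummy tensors (batch of 4 utterances):
head = FusionHead()
log_probs = head(torch.randn(4, 2048), torch.randn(4, 512))
```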

5 Experimental Results

The experiments were carried out in Python 3 on an Nvidia GeForce GTX 1080 Ti GPU equipped with 16GB of memory. The software implementation of our code is made available online on GitHub [10]. The tests were carried out on the VoxCeleb1 dataset and the results are summarised in Table 3. We calculated the Equal Error Rate (EER) for all the experiments; the lower the value of the EER, the better the model. The results were in line with our expectations: our system performed the best out of all the architectures tested.
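
For reference, the EER is the operating point at which the false-acceptance rate equals the false-rejection rate. The snippet below is a generic sketch of how it can be computed from verification scores using scikit-learn's ROC utilities; it is not tied to our specific evaluation scripts, and the toy labels and scores are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Compute the EER given binary trial labels (1 = same speaker) and scores.

    The EER is the point where the false-acceptance rate (FAR) equals the
    false-rejection rate (FRR); lower is better.
    """
    far, tpr, _ = roc_curve(labels, scores)    # far = false positive rate
    frr = 1.0 - tpr                            # false rejection rate
    idx = np.nanargmin(np.abs(far - frr))      # point where the two rates cross
    return 0.5 * (far[idx] + frr[idx])

# Example with toy verification trials:
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.6, 0.55])
print(f"EER = {100 * equal_error_rate(labels, scores):.2f}%")
```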

We obtained an EER of 8.2 for the pure SincNet-based architecture, an EER of 5.8 for the X-vector-based architecture, and the best EER of 3.56 using our fused SincNet and X-vector embedding based architecture. The proposed framework's EER of 3.56 is an improvement over the previous best EER of 4.16 obtained using X-vectors on the SITW core [18]. Fig. 3 compares the EER of the various architectures as the number of epochs increases. The proposed fusion model exhibits a consistently low EER over all epochs.

Architecture Used | Training Dataset | EER (On Test Data)
SincNet | VoxCeleb1 | 8.2
X-Vector | VoxCeleb1 | 5.8
Proposed | VoxCeleb1 | 3.56
Table 3: Experimental results.
Figure 3: Comparison of EER score of various architectures over epochs.

6 Conclusions

In this paper we propose a novel fusion model involving two successful deep architectures for speaker recognition: SincNet and X-Vector. The features extracted from the two sources are fused by concatenation and learnt using fully connected dense layers. We achieved a relative improvement in EER of 14.5% over the current state of the art (an EER of 3.56 against the previous best of 4.16) on the VoxCeleb1 dataset. The fusion model also shows a significant improvement over vanilla SincNet, which yielded an EER of 8.2. Further improvements over this architecture could be obtained by combining the VoxCeleb2 dataset with the VoxCeleb1 dataset and by applying noise removal techniques before feeding the audio into the network.

Footnotes

  1. email: {mayank_bt2k16,divyanshu_bt2k16}@dtu.ac.in, seba_406@yahoo.in

References

  1. H. Beigi (2011) Speaker recognition. Springer.
  2. J. S. Chung, A. Nagrani and A. Zisserman (2018) Voxceleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622.
  3. S. Cumani, O. Plchot and P. Laface (2013) Probabilistic linear discriminant analysis of i-vector posterior distributions. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7644–7648.
  4. N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet (2010) Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), pp. 788–798.
  5. S. H. Ghalehjegh and R. C. Rose (2015) Deep bottleneck features for i-vector based text-independent speaker verification. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 555–560.
  6. H. Huang and K. C. Sim (2015) An investigation of augmenting speaker representations to improve speaker normalisation for dnn-based speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4610–4613.
  7. P. Kenny (2010) Bayesian speaker verification with heavy-tailed priors. In Odyssey, Vol. 14.
  8. J. Lee, T. Kim, J. Park and J. Nam (2017) Raw waveform-based audio classification using sample-level cnn architectures. arXiv preprint arXiv:1712.00866.
  9. C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan and Z. Zhu (2017) Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304.
  10. D. Mayank Tripathi. ICAISC-speaker-identification-system. GitHub repository.
  11. A. Nagrani, J. S. Chung and A. Zisserman (2017) Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
  12. O. Novotnỳ, O. Plchot, O. Glembek, L. Burget and P. Matějka (2019) Discriminatively re-trained i-vector extractor for speaker recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6031–6035.
  13. D. Palaz and R. Collobert (2015) Analysis of cnn-based speech recognition system using raw speech as input. Technical report, Idiap.
  14. M. Ravanelli and Y. Bengio (2018) Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028.
  15. T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson and O. Vinyals (2015) Learning the speech front-end with raw waveform cldnns. In Sixteenth Annual Conference of the International Speech Communication Association.
  16. D. Snyder, D. Garcia-Romero, D. Povey and S. Khudanpur (2017) Deep neural network embeddings for text-independent speaker verification. In Interspeech, pp. 999–1003.
  17. D. Snyder, D. Garcia-Romero and D. Povey. VoxCeleb models. Accessed: 2019-11-30.
  18. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur (2018) X-vectors: robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
  19. S. Susan and S. Sharma (2012) A fuzzy nearest neighbor classifier for speaker identification. In 2012 Fourth International Conference on Computational Intelligence and Communication Networks, pp. 842–845.
  20. E. Variani, X. Lei, E. McDermott, I. L. Moreno and J. Gonzalez-Dominguez (2014) Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056.
  21. L. Wyse (2017) Audio spectrogram representations for processing with convolutional neural networks. arXiv preprint arXiv:1706.09559.
  22. C. Zhang, K. Koishida and J. H. Hansen (2018) Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26 (9), pp. 1633–1644.