# Histogram Transform-based Speaker Identification

###### Abstract

A novel text-independent speaker identification (SI) method is proposed. This method uses the Mel-frequency cepstral coefficients (MFCCs) and the dynamic information among adjacent frames as feature sets to capture a speaker’s characteristics. To utilize the dynamic information, we design super-MFCCs features by cascading three neighboring MFCC frames together. The probability density function (PDF) of these super-MFCCs features is estimated by the recently proposed histogram transform (HT) method, which generates more training data by random transforms to realize the histogram PDF estimation and alleviates the discontinuity problem that commonly occurs in multivariate histogram computation. Compared to conventional PDF estimation methods, such as Gaussian mixture models, the HT model shows promising improvement in SI performance.

## I Introduction

Speaker identification is a biometric task that has been intensively studied in the past decades [1, 2, 3, 4, 5, 6]. Given an input speech, the task is to determine the unknown speaker’s identity by selecting one speaker from the whole set of speakers registered in the system [4, 7].

The first step is feature extraction. In this step, the original speech signals are transformed into feature vectors that represent speaker-specific properties. To this end, many features have been considered, *e.g.*, the Mel-frequency cepstral coefficients (MFCCs) [8] and the line spectral frequencies (LSFs) [2]. Among them, MFCCs are widely used in speech processing tasks, *e.g.*, language identification [9], speech emotion classification [10], and speaker identification [11]. In general, these static features are supplemented by their corresponding velocity and acceleration coefficients to capture dynamic information. Recently, some researchers have used the static features directly to build the system. In [2, 3, 12, 13, 14], LSFs are used directly in super-Dirichlet mixture models, and in [15, 16, 17, 18, 19], static MFCCs are used in deep learning models. In this paper, we also adopt the static MFCC features and, moreover, group several neighboring frames together to create a super MFCCs feature that expresses the speaker’s characteristics [20, 21, 22, 23].

The second step is model training. As the extracted features can describe the unique characteristic of an individual speaker, this allows us to classify each speaker by their voices using probabilistic models [24]. Separate models should be trained for each speaker, in order to describe the statistical properties of the extracted features.

The third step is identification. In this part the feature vectors extracted from the unknown person’s speech are compared against the models trained in the second step to make the final decision by using the maximum likelihood method.

The effectiveness of a speaker identification system is mainly determined by the design of the statistical model in the second step. Mixture model based methods are widely employed, *e.g.*, the Dirichlet mixture model (DMM) [2, 25, 26], the beta mixture model (BMM) [27], the von-Mises Fisher mixture model [28, 29], and the Gaussian mixture model (GMM) [30, 31, 32, 33]. All of these are parametric models, where the aim of training is to optimize the model parameters.

In addition to the mixture model based approaches, nonparametric approaches, which can adapt closely to particular features of the training data, are also widely used [34, 35, 36, 37, 38, 39]. One of the most popular nonparametric approaches is histogram probability estimation. By partitioning the training feature space into discrete intervals (bins), we obtain the probability estimate by counting the number of training samples that fall into each bin. With sufficient training data and an appropriate bin width, good performance can be obtained [40, 41]. However, the probability estimated by the histogram method, especially by multivariate histograms, has large discontinuities [42]. This is because the number of bins grows geometrically with the dimensionality of the features. When the dimensionality is high, we cannot obtain enough training data to cover all the bins in the space. Recently, a histogram transform (HT) model was proposed to overcome this problem [42]. The HT model alleviates the discontinuity problem by averaging multiple multivariate histograms. This method has been successfully applied in several applications, such as image segmentation [42, 43] and speaker identification [44]. In this paper, we use this method to build speaker identification models.
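To make the geometric growth of the bin count concrete, the following minimal sketch counts the cells of a regular grid histogram and bounds the fraction of cells that a finite training set can fill. The function names and the example numbers (10 bins per axis, 100,000 frames) are illustrative assumptions, not values taken from the experiments.

```python
def num_bins(bins_per_dim: int, dim: int) -> int:
    """Number of cells in a regular grid with bins_per_dim bins along each axis."""
    return bins_per_dim ** dim


def max_fill_rate(num_samples: int, bins_per_dim: int, dim: int) -> float:
    """Upper bound on the fraction of non-empty cells (each sample fills at most one)."""
    return min(1.0, num_samples / num_bins(bins_per_dim, dim))


if __name__ == "__main__":
    # Hypothetical setting: 10 bins per axis and 100,000 training frames.
    for dim in (1, 2, 4, 8, 16):
        print(dim, num_bins(10, dim), max_fill_rate(100_000, 10, dim))
```

Even under these generous assumptions, the bound on the filling rate collapses once the dimensionality exceeds a handful of axes, which is the situation the HT model is meant to address.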

In the experimental part, we compare the performance of the HT model with the GMM model [45, 46, 47]. The identification decision was made by choosing the maximal log-likelihood of a test speech against all the trained speaker models. Experimental results show that the HT model is able to reach higher accuracy than the GMM model. This paper is organized as follows: The way to generate the super MFCCs features is described in Section II. We describe the HT model in Section III. The experimental results and analysis are presented in Section IV. Conclusions and some further work are given in Section V.

## II Feature Extraction

In speech processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a speech signal, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency [8, 48, 49]. MFCCs are the coefficients that collectively compose an MFC. They are derived from a type of cepstral representation of the audio. For the speech segment (frame) at time $t$, we can extract a $K$-dimensional MFCC vector as

$$\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_K(t)]^{\mathrm{T}}. \tag{1}$$

In order to exploit the dynamic information useful for speaker recognition, the traditional method is to construct a super feature vector containing the MFCCs, the velocity of the MFCCs ($\Delta\mathbf{x}(t)$), and the acceleration of the MFCCs ($\Delta\Delta\mathbf{x}(t)$) [50]. The super frame is then defined as

$$\mathbf{x}_{\Delta}(t) = [\mathbf{x}(t)^{\mathrm{T}}, \Delta\mathbf{x}(t)^{\mathrm{T}}, \Delta\Delta\mathbf{x}(t)^{\mathrm{T}}]^{\mathrm{T}}. \tag{2}$$

Inspired by the idea proposed and used in [2, 15, 28, 29], we represent the dynamic information of the MFCCs in a new way. We build the super frame by utilizing two neighbors of the current frame, one from the past and the other from the following frames. Let $\tau$ denote the time interval between two adjacent frames; the super MFCCs frame is then created by grouping the current frame and its two neighbors as

$$\mathbf{x}_{\mathrm{sup}}(t) = [\mathbf{x}(t-\tau)^{\mathrm{T}}, \mathbf{x}(t)^{\mathrm{T}}, \mathbf{x}(t+\tau)^{\mathrm{T}}]^{\mathrm{T}}, \tag{3}$$

where $\tau$ is an integer.
The conventionally used feature $\mathbf{x}_{\Delta}(t)$ contains processed information from the neighbor frames. The super MFCCs frame $\mathbf{x}_{\mathrm{sup}}(t)$ actually includes the raw information contained in the neighbor frames. In principle, $\mathbf{x}_{\mathrm{sup}}(t)$ should carry at least the same information as that represented by $\mathbf{x}_{\Delta}(t)$. This motivates us to use the “raw” data here.
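The grouping of the current frame with one past and one future neighbor can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name `stack_super_frames`, the parameter name `tau`, and the choice to drop the frames at both ends that lack a neighbor are all hypothetical and not specified in the text.

```python
import numpy as np


def stack_super_frames(mfcc: np.ndarray, tau: int = 1) -> np.ndarray:
    """Stack each frame with its neighbors tau frames before and after it.

    mfcc has shape (num_frames, num_coeffs). The result has shape
    (num_frames - 2 * tau, 3 * num_coeffs); edge frames without both
    neighbors are dropped (an assumption of this sketch).
    """
    if tau < 1:
        raise ValueError("tau must be a positive integer")
    num_frames = mfcc.shape[0]
    if num_frames <= 2 * tau:
        raise ValueError("not enough frames for the requested interval")
    past = mfcc[: num_frames - 2 * tau]
    current = mfcc[tau : num_frames - tau]
    future = mfcc[2 * tau :]
    return np.hstack([past, current, future])


if __name__ == "__main__":
    toy = np.arange(10).reshape(5, 2)  # 5 frames, 2 coefficients each
    print(stack_super_frames(toy, tau=1))
```

All three blocks of the stacked vector are raw cepstral coefficients, so every component has the same scale and meaning, unlike a vector that mixes static, velocity, and acceleration terms.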

## III Training of the HT Models

Theoretically, the non-parametric probabilistic models, such as histogram based models, are driven by training data directly and can simulate any complicated probability density function (PDF). In practice, the original histogram methods, especially the multivariate histograms-based methods, are rarely used due to the fact that the learned PDF has large discontinuities over the boundaries of the bins.

Fig. 1(a) shows the negative logarithm of the PDF estimated for two randomly selected dimensions of the features using the original histogram method. The 16-dimensional MFCC vectors are extracted from wide-band speech in the TIMIT dataset. The feature space is segmented into bins, of which only a fraction have been filled; many zones yield zero density (black color).

In order to get a smooth PDF, parametric probabilistic models, such as mixture models, are usually employed. In these models, a combination of simple smooth functions is used to estimate the actual PDF. If the functional form and the number of mixture components are chosen appropriately, the mixture models can fit the real probability distribution well. However, when the actual PDF is overly complex, a combination of several simple functions cannot represent the true PDF properly.

Recently, an HT method was proposed in [42, 51]. The HT method applies a group of random affine transforms to the training data and computes the average histogram to estimate the PDF. As illustrated in Fig. 1(b), after one transform, some points fall in zones where the original histogram has zero density, so more bins are filled. The PDFs estimated by the average histogram of 10 and 50 transforms are shown in Figs. 1(c) and 1(d), respectively. It is observed that the PDFs have been smoothed, the filling rates increase, and the discontinuity has been overcome.

The HT model is based on histogram methods and therefore has the advantage of strong adaptability. Meanwhile, the transformation overcomes the disadvantage of discontinuity. A parametric probability density function is adopted in this model as a prior, so some merits of parametric models are also found in this method.

The affine function in the HT model is defined as

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}, \tag{4}$$

where $\mathbf{x}$ is a training sample vector of size $D \times 1$, $\mathbf{A}$ is a $D \times D$ matrix, and $\mathbf{b}$ is a $D \times 1$ vector. After $H$ random transforms, one training dataset of $N$ samples is mapped to $H$ training datasets. Using the average histogram of these datasets to learn the PDF can partly solve the discontinuity problem [42]. Based on the transforms induced by the above affine function, the probability of an input feature vector $\mathbf{x}$ in the HT method is defined as

(5) |

The first item of (5) is a priori probability of finding a test sample in a zone where the histograms yield zero density,

is defined as
and is defined as a multivariate Gaussian distribution,

(6) |

(7) |

The second item in (5) describes the average histogram probability and is the histogram probability of the input data in the -th transform. Following the method introduced in [42], through adjusting the scale factor of the matrix , the bin width on the transformed space can be chosen as . Set

(8) |

(9) |

where function means changing the components of the transformed vector to the nearest integer, then the histogram probability of input data in the -th transform is defined as

(10) |

In (10), is the D-dimensional volume of the histogram bins in the input space, as

(11) |

stands for the indicator function, defined as

(12) |

The selection of the transform parameters and should take the following rules. Since the bin width on the transformed space is , we draw from the uniform distribution over the hypercube .

The matrix can be expressed as the product of a unit rotation matrix and a diagonal scaling matrix . The random unit rotation matrix can be generated by making QR decomposition on a standard normal random matrix [52].
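The QR-based construction can be sketched as follows. This is a minimal illustration, assuming NumPy's `numpy.linalg.qr` and the sign correction on the diagonal of the triangular factor discussed in [52]; the function name `random_orthogonal` is hypothetical. Note that the result is a random orthogonal matrix whose determinant may be $+1$ or $-1$; the sketch does not force a proper rotation.

```python
import numpy as np


def random_orthogonal(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Draw a random orthogonal matrix via QR of a standard normal matrix."""
    z = rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(z)
    # Multiply each column of Q by the sign of the matching diagonal entry
    # of R, so that the factorization (and hence Q) is uniquely defined.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


if __name__ == "__main__":
    q = random_orthogonal(4, np.random.default_rng(0))
    print(np.allclose(q.T @ q, np.eye(4)))
```

Because an orthogonal matrix preserves distances, all of the bin-width control is left to the diagonal scaling matrix.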

The diagonal elements of the scaling matrix can be generated using the Jeffreys prior for the scale parameters [53], *i.e.*, by drawing from the uniform distribution over a certain interval of real numbers, where

(13) |

(14) |

In order to make the bin width on the transformed space equal to , according to the multivariate histograms theory [54], should be set as

(15) |

and are tunable parameters. In this paper we empirically choose and .
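The averaged-histogram part of the HT model can be sketched as below. This is a simplified illustration, not the authors' implementation: it omits the Gaussian prior term, assumes a unit bin width in the transformed space, draws the bin widths log-uniformly from a hypothetical interval `width_range`, and composes the transform as an orthogonal matrix times a diagonal scaling. The class name `AveragedHistogram` and all parameter names and default values are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def random_orthogonal(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


class AveragedHistogram:
    """Average of histograms built on randomly transformed copies of the data."""

    def __init__(self, num_transforms=10, width_range=(0.5, 2.0), seed=0):
        self.num_transforms = num_transforms
        self.width_range = width_range  # hypothetical bin-width interval
        self.rng = np.random.default_rng(seed)
        self.transforms = []

    @staticmethod
    def _bin_keys(x, a, b):
        # Affine map followed by rounding to the nearest integer grid point.
        cells = np.rint(x @ a.T + b).astype(np.int64)
        return [tuple(row) for row in cells.tolist()]

    def fit(self, x):
        num_samples, dim = x.shape
        lo, hi = np.log(self.width_range[0]), np.log(self.width_range[1])
        self.transforms = []
        for _ in range(self.num_transforms):
            widths = np.exp(self.rng.uniform(lo, hi, size=dim))  # log-uniform
            a = random_orthogonal(dim, self.rng) * (1.0 / widths)  # Q @ diag(1/w)
            b = self.rng.uniform(0.0, 1.0, size=dim)
            counts = Counter(self._bin_keys(x, a, b))
            # A unit cell in the transformed space has volume 1/|det A|
            # in the input space, so density = count * |det A| / N.
            scale = abs(np.linalg.det(a)) / num_samples
            self.transforms.append((a, b, counts, scale))
        return self

    def pdf(self, x):
        total = np.zeros(x.shape[0])
        for a, b, counts, scale in self.transforms:
            keys = self._bin_keys(x, a, b)
            total += scale * np.array([counts.get(k, 0) for k in keys], dtype=float)
        return total / len(self.transforms)


if __name__ == "__main__":
    data = np.random.default_rng(1).standard_normal((500, 2))
    model = AveragedHistogram(num_transforms=5, seed=2).fit(data)
    print(model.pdf(data[:3]))
```

Storing only the occupied cells in a dictionary keeps the memory cost proportional to the number of training samples rather than to the number of cells in the grid. A point that falls in an empty cell under every transform receives zero density here, which is the case the prior term in (5) is designed to cover.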

## IV Experimental Results and Discussions

To verify the proposed HT model-based SI method, we evaluated the speaker identification performance on the TIMIT database [55]. The TIMIT database contains male and female speakers from different regions, and each speaker spoke ten sentences. During each round of evaluation, we randomly selected a subset of speakers from the database.

The speech was segmented into frames with a ms duration and a ms step size. The silence frames were removed. For each frame, a Hanning window was used to reduce the high frequency components. Since the speech clips are wide band data, -dimensional MFCCs were extracted from each frame. In order to compare the differences between the traditional and the super frame proposed in this paper, the MFCCs and the corresponding velocity and acceleration features were calculated according to the methods described in Section II. Finally, we obtained two sets of super frames, each contains -dimensional features.
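The framing and windowing step can be sketched as follows. This is a minimal illustration using NumPy's `numpy.hanning`; the function name `frame_signal` is hypothetical, and the 25 ms duration and 10 ms step in the usage example are placeholder values chosen for illustration, not the settings used in the experiments. Silence removal and MFCC computation are not shown.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_ms, step_ms):
    """Cut a 1-D signal into overlapping frames and apply a Hanning window."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if signal.size < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (signal.size - frame_len) // step
    # Row i holds the sample indices of frame i.
    index = np.arange(frame_len)[None, :] + step * np.arange(num_frames)[:, None]
    return signal[index] * np.hanning(frame_len)


if __name__ == "__main__":
    one_second = np.ones(16000)  # placeholder signal at 16 kHz
    frames = frame_signal(one_second, 16000, frame_ms=25, step_ms=10)
    print(frames.shape)
```

Trailing samples that do not fill a complete frame are discarded in this sketch; padding the last frame would be an equally reasonable choice.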

In the training phase, seven sentences were randomly selected from each speaker as the training data and the remaining three sentences were used for testing. From each test sentence we randomly extracted several segments of consecutive frames as test sets. We trained HT models using the conventional frames (2) and the super MFCCs frames (3), respectively. Each test segment was evaluated against every trained model, and the log-likelihood was calculated as

$$L(\mathcal{X} \mid \mathcal{X}_k) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \mathcal{X}_k), \tag{16}$$

where $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$ is the input segment set including $T$ feature frames, $\mathbf{x}_t$ denotes the $t$-th frame, and $\mathcal{X}_k$ stands for the training set of the $k$-th person. The trained model that yielded the largest log-likelihood value was considered to have the same statistical properties as the test feature set, and therefore we assigned the identity of this trained model to the test segment. We evaluated several values of the number of transforms with a fixed frame interval, and several values of the frame number $T$ in each test set, which correspond to different test utterance durations. The identification score is the number of correctly identified test sets divided by the total number of test sets. We ran the evaluation for multiple rounds, and the average scores for the different parameters and methods are shown in Fig. 2.
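The maximum log-likelihood decision can be sketched as follows. This is a minimal illustration in which each speaker model is represented by any callable that returns per-frame densities; the function name `identify_speaker` and the small `floor` that guards against taking the logarithm of zero are assumptions of the sketch, not part of the described method.

```python
import numpy as np


def identify_speaker(segment, speaker_pdfs, floor=1e-300):
    """Return the speaker whose model gives the largest summed log-likelihood.

    segment has shape (num_frames, dim); speaker_pdfs maps a speaker name
    to a callable that returns one density value per frame.
    """
    scores = {}
    for name, pdf in speaker_pdfs.items():
        densities = np.maximum(pdf(segment), floor)  # avoid log(0)
        scores[name] = float(np.sum(np.log(densities)))
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    def unit_gaussian(mean):
        # Toy 1-D Gaussian density used only to exercise the decision rule.
        return lambda x: np.exp(-0.5 * (x[:, 0] - mean) ** 2) / np.sqrt(2 * np.pi)

    segment = np.array([[0.1], [-0.2], [0.05]])
    print(identify_speaker(segment, {"a": unit_gaussian(0.0), "b": unit_gaussian(5.0)}))
```

Summing per-frame log-likelihoods treats the frames of a segment as independent, which is also why longer segments tend to give more reliable decisions.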

The performance of the HT model with the conventional frames (2) and the super MFCCs frames (3) is shown in Fig. 2(a). It is observed that the HT model with the super MFCCs frames reaches a higher identification accuracy. This indicates that the proposed features are more suitable for the HT model than the conventional ones. As introduced in Section III, the data transform matrix is generated according to a single parameter, so a feature in which all components have similar attributes fits the HT model better.

The results also show that the number of transforms affects the final score. Increasing the number of transforms improves the identification accuracy up to a point, beyond which the accuracy decreases. This indicates that too many transforms make the estimated PDF over-smooth and reduce the speaker-specific information. For example, when the speech duration is longer, we have a sufficient number of feature frames to describe the speaker’s characteristics, and the error caused by one frame can be compensated by the average over the other frames. Hence, we want to increase the “specificity” of each frame, which means we want a “cliffy” PDF curve, and a smaller number of transforms is required in this case. However, when fewer feature frames are available, the requirement for smoothness becomes higher. Thus, a larger number of transforms should be employed in order to obtain a smooth PDF curve.

We also trained and tested the data sets with GMMs with different numbers of components. The results are shown in Fig. 2(b). The conventional frames (2) give better results in the GMM. This means that these features are more suitable for the GMM, which also verifies the well-known strategy utilized in SI tasks. Based on the above facts, the super MFCCs frames (3) perform better in the HT model and the conventional frames (2) perform better in the GMM. When the test segments are relatively long (*e.g.*, more than 50 frames), the HT method with super MFCCs frames achieves lower error rates than the GMM method with conventional frames.

The boxplots in Fig. 3 compare the precision and stability of the HT method with super MFCCs frames and the GMM method with conventional frames (with the number of components set to 64). We can observe that, when $T = 50$, the HT model’s identification accuracy is slightly lower than that of the traditional GMM, whereas for longer test utterances the HT method obtains more accurate and stable results.

In order to check the statistical significance of the improvement, we compared the identification results of these two models using Student’s $t$-test. The null hypothesis is that the identification results from the two models follow independent normal distributions with equal means and equal but unknown variances. The $p$-values for different $T$ are shown in Table I. When $T = 50$, the $p$-value is much larger than 0.05, which means that the null hypothesis cannot be rejected. It can be inferred that, when $T = 50$, the GMM and the HT model have similar identification performance, although the GMM achieves a slightly higher average identification accuracy over the evaluation rounds. When $T$ is 100 or larger, the $p$-values are smaller than 0.05, which indicates that the improvement obtained by the HT model over the GMM is statistically significant.

Table I: $p$-values for different $T$.

| $T$ | 50 | 100 | 150 | 200 |
|---|---|---|---|---|
| $p$-value | 0.1748 | 0.0030 | 0.0158 | 0.0193 |
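The test statistic behind these $p$-values can be sketched as follows. This is a minimal illustration of the two-sample Student's $t$ statistic with a pooled variance, assuming two arrays of per-round accuracies; the function name `pooled_t_statistic` and the sample numbers are hypothetical, and the sketch stops at the statistic and its degrees of freedom rather than computing the $p$-value.

```python
import numpy as np


def pooled_t_statistic(a, b):
    """Two-sample t statistic under the equal-variance assumption.

    Returns the statistic and the degrees of freedom (n_a + n_b - 2).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    std_err = np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (a.mean() - b.mean()) / std_err, dof


if __name__ == "__main__":
    # Made-up accuracies for two methods over three rounds.
    print(pooled_t_statistic([0.91, 0.93, 0.92], [0.88, 0.90, 0.89]))
```

The two-sided $p$-value follows from the Student's $t$ distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=True)` computes the same statistic together with the $p$-value.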

Through the above experiments, we conclude that the HT model performs better than the conventionally used GMM in precision and stability, and that the HT model can fit complicated probability distributions better. This encourages us to use the HT model to improve other GMM-based speech processing systems, *e.g.*, speech recognition systems based on the GMM+HMM model.

## V Conclusions and Further Work

A speaker identification (SI) method based on the histogram transform (HT) model was proposed in this paper. The proposed method used the mel-frequency cepstral coefficients (MFCCs) and the dynamic information among adjacent frames as features. The identification accuracies were improved by using synthesized features generated through the random transform method. By selecting a reasonable number of transforms, more training features were generated to estimate the histogram. The experimental results show that, compared with the traditional GMM, the HT model makes promising improvements for SI tasks.

In the future, we can try other features, *e.g.*, the line spectral frequencies (LSFs), in the HT model. Other distributions, *e.g.*, the Dirichlet distribution or the beta distribution, can be used to replace the Gaussian distribution as the prior distribution to estimate the probability of the zero zones of the histogram. Recently, some studies showed that the fusion of several different systems effectively improves SI performance [56]. Therefore, it is also worthwhile to consider fusing the HT model with the state-of-the-art i-vector based method.

## References

- [1] S. Nakagawa, L. Wang, and S. Ohtsuka, “Speaker identification and verification by combining MFCC and phase information,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1085–1095, 2012.
- [2] Z. Ma and A. Leijon, “Super-Dirichlet mixture models using differential line spectral frequencies for text-independent speaker identification,” in INTERSPEECH, pp. 2360–2363, Aug 2011.
- [3] Z. Ma, A. Leijon, and W. B. Kleijn, “Vector quantization of LSF parameters with a mixture of Dirichlet distributions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1777–1790, 2013.
- [4] Y. Hu, D. Wu, and A. Nucci, “Fuzzy-clustering-based decision tree approach for large population speaker identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 762–774, 2013.
- [5] X. Ma, J. Zhang, Y. Zhang, and Z. Ma, “Data scheme-based wireless channel modeling method: motivation, principle and performance,” Journal of Communications and Information Networks, vol. 2, pp. 41–51, Sep 2017.
- [6] Z. Ma, “Bayesian estimation of the dirichlet distribution with expectation propagation,” in Proceedings of European Signal Processing Conference, 2012.
- [7] Z. Ma and A. Leijon, “Expectation propagation for estimating the parameters of the beta distribution,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010.
- [8] M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block based transformation in mfcc computation for speaker recognition,” Speech Communication, vol. 54, no. 4, pp. 543–565, 2012.
- [9] U. Bhattacharjee and K. Sarmah, “Language identification system using MFCC and prosodic features,” in International Conference on Intelligent Systems and Signal Processing (ISSP), pp. 194–197, IEEE, 2013.
- [10] Z. M. Dan and F. S. Monica, “A study about MFCC relevance in emotion classification for SRoL database,” in International Symposium on Electrical and Electronics Engineering (ISEEE), pp. 1–4, IEEE, 2013.
- [11] T. Ganchev, N. Fakotakis, and G. Kokkinakis, “Comparative evaluation of various MFCC implementations on the speaker verification task,” in Proceedings of the SPECOM, vol. 1, pp. 191–194, 2005.
- [12] Z. Ma and A. Leijon, “Modeling speech line spectral frequencies with dirichlet mixture models,” in Proceedings of INTERSPEECH, 2010.
- [13] Z. Ma and A. Leijon, “Pdf-optimized lsf vector quantization based on beta mixture models,” in Proceedings of INTERSPEECH, 2010.
- [14] Z. Ma and A. Leijon, “Human skin color detection in rgb space with bayesian estimation of beta mixture models,” in Proceedings of European Signal Processing Conference, 2010.
- [15] P. Zhou, L. Dai, Q. Liu, and H. Jiang, “Combining information from multi-stream features using deep neural network in speech recognition,” in Signal Processing (ICSP), 2012 IEEE 11th International Conference on, vol. 1, pp. 557–561, IEEE, 2012.
- [16] Z. Ma and A. Leijon, “Human audio-visual consonant recognition analyzed with three bimodal integration models,” in Proceedings of INTERSPEECH, 2009.
- [17] Z. Ma and A. Leijon, “A probabilistic principal component analysis based hidden markov model for audio-visual speech recognition,” in Proceedings of IEEE Asilomar Conference on Signals, Systems, and Computers, 2008.
- [18] Z. Ma, A. Leijon, Z.-H. Tan, and S. Gao, “Predictive distribution of the dirichlet mixture model by local variational inference,” Journal of Signal Processing Systems, vol. 74, pp. 359–374, Mar 2014.
- [19] Z. Ma, H. Li, Q. Sun, C. Wang, A. Yan, and F. Starfelt, “Statistical analysis of energy consumption patterns on the heat demand of buildings in district heating systems,” Energy and Buildings, vol. 85, pp. 464–472, Dec. 2014.
- [20] Q. Sun, H. Li, Z. Ma, C. Wang, J. Campillo, Q. Zhang, F. Wallin, and J. Guo, “A comprehensive review of smart energy meters in intelligent energy networks,” IEEE Internet of Things Journal, vol. 3, pp. 464–479, Aug 2016.
- [21] Z. Wang, Y. Qi, J. Liu, and Z. Ma, “User intention understanding from scratch,” in IEEE International Workshop on Sensing, Processing and Learning for Intelligent Machines, 2016.
- [22] P. Xu, K. Li, Z. Ma, Y.-Z. Song, L. Wang, and J. Guo, “Cross-modal subspace learning for sketch-based image retrieval: A comparative study,” in Proceedings of IEEE International Conference on Network Infrastructure and Digital Content, 2016.
- [23] P. Xu, Q. Yin, Y. Huang, Y.-Z. Song, Z. Ma, L. Wang, T. Xiang, W. B. Kleijn, and J. Guo, “Cross-modal subspace learning for fine-grained sketch-based image retrieval,” NEUROCOMPUTING, vol. 278, pp. 75–86, Feb. 2018.
- [24] M. A. Pathak and B. Raj, “Privacy-preserving speaker verification and identification using Gaussian mixture models,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 397–406, 2013.
- [25] Z. Ma, S. Chatterjee, W. B. Kleijn, and J. Guo, “Dirichlet mixture modeling to estimate an empirical lower bound for LSF quantization,” Signal Processing, vol. 104, pp. 291–295, 2014.
- [26] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian estimation of Dirichlet mixture model with variational inference,” Pattern Recognition, vol. 47, no. 9, pp. 3143–3157, 2014.
- [27] Z. Ma and A. Leijon, “Bayesian estimation of beta mixture models with variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160–2173, 2011.
- [28] J. Taghia, Z. Ma, and A. Leijon, “On von-Mises Fisher mixture model in text-independent speaker identification,” in INTERSPEECH, pp. 2499–2503, 2013.
- [29] J. Taghia, Z. Ma, and A. Leijon, “Bayesian estimation of the von-Mises Fisher mixture model with variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1701–1715, 2014.
- [30] D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, no. 1, pp. 91–108, 1995.
- [31] S. Nakagawa, W. Zhang, and M. Takahashi, “Text-independent speaker recognition by combining speaker-specific GMM with speaker adapted syllable-based hmm,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. I–81, IEEE, 2004.
- [32] Z. Ma, R. Martin, J. Guo, and H. Zhang, “Nonlinear estimation of missing lsf parameters by a mixture of dirichlet distributions,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 2014.
- [33] Z. Ma, J. Taghia, W. B. Kleijn, A. Leijon, and J. Guo, “Line spectral frequencies modeling by a mixture of von Mises-Fisher distributions,” Signal Processing, vol. 114, pp. 219–224, Sept. 2015.
- [34] J.-N. Hwang, S.-R. Lay, and A. Lippman, “Nonparametric multivariate density estimation: a comparative study,” IEEE Transactions on Signal Processing, vol. 42, no. 10, pp. 2795–2810, 1994.
- [35] W. K. Härdle, M. Müller, S. Sperlich, and A. Werwatz, Nonparametric and semiparametric models. Springer Science & Business Media, 2012.
- [36] Z. Ma, Z.-T. Tan, and J. Guo, “Feature selection for neutral vector in EEG signal classification,” NEUROCOMPUTING, vol. 174, pp. 937–945, 2016.
- [37] Z. Ma and A. E. Teschendorff, “A variational Bayes beta mixture model for feature selection in DNA methylation studies,” Journal of Bioinformatics and Computational Biology, vol. 11, no. 4, 2013.
- [38] Z. Ma, A. E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian matrix factorization for bounded support data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 876–89, 2015.
- [39] Z. Ma, J. Xie, H. Li, Q. Sun, Z. Si, J. Zhang, and J. Guo, “The role of data analysis in the development of intelligent energy networks,” IEEE Network, vol. 31, no. 5, pp. 88–95, 2017.
- [40] W. N. Venables and B. D. Ripley, Modern applied statistics with S-PLUS. Springer Science & Business Media, 2013.
- [41] P. K. Rana, Z. Ma, J. Taghia, and M. Flierl, “Multiview depth map enhancement by variational bayes inference estimation of dirichlet mixture models,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
- [42] E. López-Rubio, “A histogram transform for probability density function estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 644–656, 2014.
- [43] Z. Ma, J. H. Xue, A. Leijon, Z. H. Tan, Z. Yang, and J. Guo, “Decorrelation of neutral vector variables: Theory and applications,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, pp. 129–143, Jan 2018.
- [44] H. Yu, Z. Ma, M. Li, and J. Guo, “Histogram transform model using MFCC features for text-independent speaker identification,” in 48th Asilomar Conference on Signals, Systems and Computers, pp. 500–504, Nov 2014.
- [45] H. Yu, A. Sarkar, D. A. L. Thomsen, Z.-H. Tan, Z. Ma, and J. Guo, “Effect of multi-condition training and speech enhancement methods on spoofing detection,” in IEEE International Workshop on Sensing, Processing and Learning for Intelligent Machines, 2016.
- [46] H. Yu, Z. H. Tan, Z. Ma, R. Martin, and J. Guo, “Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–12, 2018.
- [47] H. Yu, Z. H. Tan, Y. Zhang, Z. Ma, and J. Guo, “DNN filter bank cepstral coefficients for spoofing detection,” IEEE Access, vol. 5, pp. 4779–4787, 2017.
- [48] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic multiview depth image enhancement using variational inference,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, pp. 435–448, April 2015.
- [49] H. Zhou, N. Zhang, D. Huang, Z. Ma, W. Hu, and J. Guo, “Activation force-based air pollution tracing,” in Proceedings of IEEE International Conference on Network Infrastructure and Digital Content, 2016.
- [50] J. Benesty, M. M. Sondhi, and Y. Huang, Springer handbook of speech processing. Springer Science & Business Media, 2007.
- [51] Z. Si, H. Yu, and Z. Ma, “Learning deep features for DNA methylation data analysis,” IEEE ACCESS, vol. 4, pp. 2732–2737, June 2016.
- [52] F. Mezzadri, “How to generate random matrices from the classical compact groups,” arXiv preprint math-ph/0609050, 2006.
- [53] H. Jeffreys, “An invariant form for the prior probability in estimation problems,” in Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 186, pp. 453–461, The Royal Society, 1946.
- [54] D. W. Scott, Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
- [55] “DARPA-TIMIT,” Acoustic-phonetic continuous speech corpus, NIST Speech Disc 1.1-1, 1990.
- [56] O. Plchot, S. Matsoukas, P. Matejka, N. Dehak, J. Z. Ma, S. Cumani, O. Glembek, H. Hermansky, S. H. R. Mallidi, N. Mesgarani, et al., “Developing a speaker identification system for the DARPA RATS project,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6768–6772, 2013.