Statistical Speech Model Description with VMF Mixture Model
Efficient quantization of the linear predictive coding (LPC) parameters plays a key role in parametric speech coding. The line spectral frequency (LSF) representation of the LPC parameters is widely used in speech model quantization. In practical implementations of vector quantization (VQ), probability density function (PDF)-optimized VQ has been shown to be more efficient than VQ based on training data. In this paper, we represent the LSF parameters in a unit vector form, which has directional characteristics. The underlying distribution of this unit vector variable is modeled by a von Mises-Fisher mixture model (VMM). Using high rate theory, an optimal inter-component bit allocation strategy is proposed and the distortion-rate (D-R) relation is derived for the VMM-based VQ (VVQ). Experimental results show that the VVQ outperforms our recently introduced Dirichlet mixture model-based VQ (DVQ) and the conventional Gaussian mixture model-based VQ (GVQ).
I Introduction
Quantization of the linear predictive coding (LPC) model is ubiquitous in speech coding [1, 2, 3, 4, 5, 6]. The line spectral frequency (LSF) representation of the LPC model is the one most commonly used in quantization [8, 1] because of its relatively uniform spectral sensitivity. Efficient quantization methods for the LSF parameters have been studied intensively in the literature (see, e.g., [8, 10, 11, 12]). Among these methods, the probability density function (PDF)-optimized vector quantization (VQ) scheme has been shown to be superior to those based on training data [10, 11, 13, 14, 15, 16]. In PDF-optimized VQ, the underlying distribution of the LSF parameters is described by a statistical parametric model, e.g., a Gaussian mixture model (GMM) [10, 14, 17]. Once this model is obtained, the codebook can either be trained on a sufficient amount of data (theoretically infinitely large) generated from the model or be calculated theoretically. PDF-optimized VQ thus prevents the codebook from overfitting to the training data, and hence the performance of VQ can be significantly improved [10, 11].
Statistical modeling plays an important role in PDF-optimized VQ. Hence, several studies have sought an effective model that explicitly captures the statistical properties of the LSF parameters or of their transformations. A frequently used method is the GMM-based VQ (GVQ), which models the distribution of the LSF parameters with a GMM [10, 11]. Recognizing the bounded support of the LSF parameters (all of them lie in the interval $(0, \pi)$), Lindblom and Samuelsson [18, 19, 20, 21] proposed a bounded GVQ scheme by truncating and renormalizing the standard Gaussian distribution. In other work, the LSF parameters were linearly scaled into the interval $(0, 1)$ and a beta mixture model (BMM)-based VQ scheme was introduced, which takes into account the bounded support of the LSF parameters. As the LSF parameters are also strictly ordered, a Dirichlet mixture model (DMM)-based VQ (DVQ) scheme was recently presented to explicitly utilize both the boundedness and the ordering [13, 12]. In the DVQ scheme, the LSF parameters were transformed linearly to the ΔLSF parameters. Modeling the underlying distribution of the ΔLSF parameters with a DMM yields a better distortion-rate (D-R) relation than modeling the LSF parameters with a GMM [11, 5, 22, 23] or a BMM. Hence, the practical quantization performance was also improved significantly. Previous studies thus suggest that transforming the LSF parameters into another form and applying a suitable statistical model to describe the resulting distribution can benefit practical quantization [14, 13, 12].
In this letter, we study the high rate D-R performance of the LSF parameters by using the recently proposed square-root LSF (SRLSF) representation [?, 24]. This representation is obtained by taking the positive square root of the linearly transformed LSF (ΔLSF) parameters. By concatenating a redundant element to the end of the SRLSF parameter vector, a unit vector that contains only positive elements is obtained. Geometrically, this unit vector has directional characteristics and is distributed on the hypersphere centered at the origin. For such a unit vector, the von Mises-Fisher (vMF) distribution is an ideal and widely used statistical model [25, 26, 27, 28]. One application domain of the vMF distribution is information retrieval, where the cosine similarity is an effective similarity measure for analyzing text documents. Other application domains are bioinformatics (e.g., [29, ?, 24, 30, 31]) and collaborative filtering, in which the Pearson correlation coefficient serves as the similarity measure. More recently, Taghia et al. [?, 24] proposed a text-independent speaker identification system based on modeling the underlying distribution of the SRLSF parameters by a mixture of vMF distributions. Here, we model the underlying distribution of the SRLSF parameters by a VMM and propose a VMM-based VQ (VVQ) scheme. According to high rate quantization theory [33, 34], the D-R relation can be derived analytically for a single vMF distribution with constrained entropy. Based on the high rate theory, an optimal inter-component bit allocation strategy is proposed. Finally, the D-R performance of the overall VVQ is derived. Compared with the recently presented DVQ and the conventionally used GVQ, the VVQ shows a convincing improvement. Hence, it potentially permits better practical quantization performance.
The remainder of this letter is organized as follows. In Section II, different representations of the LSF parameters are introduced. We briefly review the vMF distribution and the corresponding parameter estimation methods in Section III. A PDF-optimized VQ based on the VMM is proposed in Section IV, and the experimental results are shown in Section V. Finally, we draw some conclusions and discuss future work in Sections VI and VII.
II LSF, ΔLSF, and SRLSF
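The LSF parameters are widely used in speech coding due to their advantages over other representations (such as LARs and ASRCs). The LSF parameters with dimensionality $K$ are defined as
$$\mathbf{s} = [s_1, s_2, \ldots, s_K]^{\mathrm{T}}, \quad 0 < s_1 < s_2 < \cdots < s_K < \pi.$$
By recognizing that the LSF parameters lie in the interval $(0, \pi)$ and are strictly ordered, we proposed a particular representation of the LSF parameters, called ΔLSF, for the purpose of LSF quantization. The ΔLSF parameters are obtained from the LSF parameters by a linear transformation. Taking the positive square root of each ΔLSF parameter gives the SRLSF representation.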
In [?, 24], we modeled the underlying distribution of the SRLSF parameters by a $(K+1)$-variate VMM and proposed a text-independent speaker identification system based on the SRLSF representation, which showed competitive performance compared to the benchmark approach.
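The chain of representations described in this section can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation. It assumes that ΔLSF consists of the successive LSF differences normalized by π, and that the gap between the last LSF and π serves as the redundant final element, so that the element-wise square roots form a unit vector. The function names are hypothetical.

```python
import math

def lsf_to_delta_lsf(lsf):
    """Map ordered LSFs in (0, pi) to normalized differences that sum to 1.

    Assumption: the last entry (pi - s_K) / pi is the redundant element.
    """
    if lsf[0] <= 0.0 or lsf[-1] >= math.pi:
        raise ValueError("LSFs must lie inside (0, pi)")
    if any(b <= a for a, b in zip(lsf, lsf[1:])):
        raise ValueError("LSFs must be strictly increasing")
    edges = [0.0] + list(lsf) + [math.pi]
    return [(hi - lo) / math.pi for lo, hi in zip(edges, edges[1:])]

def delta_lsf_to_srlsf(delta):
    """Element-wise positive square root; the result has unit Euclidean norm."""
    return [math.sqrt(d) for d in delta]

lsf = [0.3, 0.7, 1.2, 1.9, 2.6]          # toy 5-dimensional LSF vector
srlsf = delta_lsf_to_srlsf(lsf_to_delta_lsf(lsf))
norm = math.sqrt(sum(v * v for v in srlsf))
```

Because the normalized differences sum to one, `norm` equals one up to rounding, which is what places the vector on the unit hypersphere.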
II-B Distortion Transformation
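Denote the PDFs of the ΔLSF vector $\Delta\mathbf{s}$ and the SRLSF vector $\mathbf{x}$ as $f_{\Delta\mathbf{s}}(\Delta\mathbf{s})$ and $f_{\mathbf{x}}(\mathbf{x})$, respectively. Assuming that the $K$-dimensional SRLSF space is divided into cells by the optimal lattice quantizer, the overall quantization distortion (using the squared error as the criterion) for $\mathbf{x}$ can be written as [33, 40, 41, 42]
$$D_{\mathbf{x}} = \frac{1}{K}\,\mathrm{E}\left[\boldsymbol{\epsilon}^{\mathrm{T}}\boldsymbol{\epsilon}\right],$$
where $\boldsymbol{\epsilon}$ denotes the quantization error and all the cells are of identical shape according to Gersho's conjecture. The mapping from the ΔLSF space to the SRLSF space changes the distortion per cell in the ΔLSF domain at $\Delta\mathbf{s}$ as $\boldsymbol{\epsilon}^{\mathrm{T}}\mathbf{J}^{\mathrm{T}}\mathbf{J}\boldsymbol{\epsilon}$, where $\mathbf{J}$ is the Jacobian matrix
$$\mathbf{J} = \frac{\partial \Delta\mathbf{s}}{\partial \mathbf{x}} = \mathrm{diag}(2x_1, \ldots, 2x_K),$$
which follows from $\Delta s_k = x_k^2$.
Then the overall quantization distortion transformation between $\Delta\mathbf{s}$ and $\mathbf{x}$ can be denoted as
$$D_{\Delta\mathbf{s}} = \frac{1}{K}\,\mathrm{E}\left[\boldsymbol{\epsilon}^{\mathrm{T}}\mathbf{J}^{\mathrm{T}}\mathbf{J}\boldsymbol{\epsilon}\right] = \frac{1}{K}\,\mathrm{tr}\left(\mathrm{E}\left[\mathbf{J}^{\mathrm{T}}\mathbf{J}\right] D_{\mathbf{x}}\mathbf{I}\right) = \frac{4 D_{\mathbf{x}}}{K}\sum_{k=1}^{K}\mathrm{E}\left[\Delta s_k\right],$$
where $\mathbf{I}$ is the identity matrix, the quantization noise is white in the optimal lattices, the expectation $\mathrm{E}[\Delta s_k]$ is taken with respect to the marginal distribution of $\Delta s_k$, and we assumed that the quantization noise is independent of $\mathbf{x}$ (and, therefore, independent of $\Delta\mathbf{s}$ as well). According to the neutrality of the Dirichlet variable $\Delta\mathbf{s}$, the marginal distribution of $\Delta s_k$ is beta distributed. Therefore, the mean value of $\Delta s_k$ with respect to its marginal distribution can be calculated explicitly. In our previous work, the measurement transformation between the LSF space and the ΔLSF space was presented in [13, 43]. With these transformation methods, we can therefore compare the high rate D-R performance in all three spaces fairly, with consistent measurements.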
III Statistical Model for SRLSF Parameters
The vMF distribution and its corresponding VMM are widely used in modeling the underlying distribution of unit vectors [29, ?, 24, 44, 45]. Therefore, we apply the VMM as the statistical model for the SRLSF parameters.
III-A Von Mises-Fisher Mixture Model
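Let $\mathbf{x}$ denote a $(K+1)$-dimensional vector satisfying $\|\mathbf{x}\|_2 = 1$. Then, the $(K+1)$-dimensional unit random vector $\mathbf{x}$ on the $K$-dimensional unit hypersphere $\mathbb{S}^{K}$ is said to have a $(K+1)$-variate vMF distribution if its PDF is given by
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \kappa) = c_{K+1}(\kappa)\, e^{\kappa \boldsymbol{\mu}^{\mathrm{T}} \mathbf{x}},$$
where $\|\boldsymbol{\mu}\|_2 = 1$, $\kappa \geq 0$, and $K \geq 1$. The normalizing constant is given by
$$c_{K+1}(\kappa) = \frac{\kappa^{(K-1)/2}}{(2\pi)^{(K+1)/2}\, I_{(K-1)/2}(\kappa)},$$
where $I_{\nu}(\cdot)$ represents the modified Bessel function of the first kind of order $\nu$. The density function is characterized by the mean direction $\boldsymbol{\mu}$ and the concentration parameter $\kappa$.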
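With $I$ mixture components, the likelihood function of the VMM with i.i.d. observations $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ is
$$f(\mathbf{X} \mid \boldsymbol{\pi}, \mathbf{M}, \boldsymbol{\kappa}) = \prod_{n=1}^{N} \sum_{i=1}^{I} \pi_i\, f(\mathbf{x}_n \mid \boldsymbol{\mu}_i, \kappa_i),$$
where $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_I]^{\mathrm{T}}$ ($\pi_i > 0$, $\sum_{i=1}^{I} \pi_i = 1$) are the weights, $\mathbf{M} = [\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_I]$ are the mean directions, and $\boldsymbol{\kappa} = [\kappa_1, \ldots, \kappa_I]^{\mathrm{T}}$ are the concentration parameters.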
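The vMF density and the mixture likelihood can be evaluated numerically as sketched below. This is an illustrative sketch under stated assumptions, not code from the paper: it uses the standard vMF form with concentration `kappa > 0`, evaluates the modified Bessel function from its power series (which is adequate for moderate concentrations but slow for very large ones), and works in the log domain to avoid overflow. All function names are hypothetical.

```python
import math

def log_bessel_i(order, x):
    """ln I_order(x) for x > 0, from the power series summed in the log domain."""
    log_half_x = math.log(x / 2.0)
    logs, peak, m = [], -math.inf, 0
    while True:
        term = (2 * m + order) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + order + 1)
        logs.append(term)
        peak = max(peak, term)
        if term < peak - 40.0:  # the series is unimodal, so the tail is negligible
            break
        m += 1
    return peak + math.log(sum(math.exp(t - peak) for t in logs))

def vmf_log_pdf(x, mu, kappa):
    """Log-density of a vMF distribution at the unit vector x."""
    dim = len(x)
    order = dim / 2.0 - 1.0
    log_norm = (order * math.log(kappa) - (dim / 2.0) * math.log(2.0 * math.pi)
                - log_bessel_i(order, kappa))
    return log_norm + kappa * sum(a * b for a, b in zip(mu, x))

def vmm_log_likelihood(data, weights, means, kappas):
    """Log-likelihood of i.i.d. unit vectors under a vMF mixture (log-sum-exp per sample)."""
    total = 0.0
    for x in data:
        logs = [math.log(w) + vmf_log_pdf(x, m, k)
                for w, m, k in zip(weights, means, kappas)]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total
```

In three dimensions the normalizing constant reduces to κ / (4π sinh κ), which gives a convenient closed-form check of the series evaluation.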
III-B Parameter Estimation
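Let $\mathbf{Z} = \{z_1, \ldots, z_N\}$ be the corresponding set of hidden random variables, where $z_n = i$ means $\mathbf{x}_n$ is sampled from the $i$th vMF component. Given $\mathbf{X}$, $\mathbf{Z}$, and the model parameters $\Theta = \{\boldsymbol{\pi}, \mathbf{M}, \boldsymbol{\kappa}\}$, the complete log-likelihood writes
$$\ln f(\mathbf{X}, \mathbf{Z} \mid \Theta) = \sum_{n=1}^{N} \left[\ln \pi_{z_n} + \ln f(\mathbf{x}_n \mid \boldsymbol{\mu}_{z_n}, \kappa_{z_n})\right].$$
As obtaining the maximum-likelihood (ML) estimates directly is not tractable, an efficient expectation-maximization (EM) approach has been developed that provides the ML estimates of the model parameters [29, 47, 48]. It alternates between an E-step, which computes the posterior probability of each component for every observation, and an M-step, which updates the weights, the mean directions, and the concentration parameters.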
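One EM iteration for a vMF mixture can be sketched as follows. This is a hedged illustration, not the exact update rules of the paper: the concentration update uses the closed-form approximation κ ≈ r̄(d − r̄²)/(1 − r̄²) from Banerjee et al., which the source may or may not use, and the function names are hypothetical. The sketch assumes that each component keeps some spread (r̄ < 1).

```python
import math

def log_bessel_i(order, x):
    """ln I_order(x) for x > 0, from the power series summed in the log domain."""
    log_half_x = math.log(x / 2.0)
    logs, peak, m = [], -math.inf, 0
    while True:
        term = (2 * m + order) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + order + 1)
        logs.append(term)
        peak = max(peak, term)
        if term < peak - 40.0:  # unimodal series, tail is negligible
            break
        m += 1
    return peak + math.log(sum(math.exp(t - peak) for t in logs))

def vmf_log_pdf(x, mu, kappa):
    """Log-density of a vMF distribution at the unit vector x."""
    dim = len(x)
    order = dim / 2.0 - 1.0
    log_norm = (order * math.log(kappa) - (dim / 2.0) * math.log(2.0 * math.pi)
                - log_bessel_i(order, kappa))
    return log_norm + kappa * sum(a * b for a, b in zip(mu, x))

def em_step(data, weights, means, kappas):
    """One EM iteration; returns updated (weights, means, kappas)."""
    dim, num = len(data[0]), len(data)
    # E-step: posterior component probabilities (responsibilities).
    resp = []
    for x in data:
        logs = [math.log(w) + vmf_log_pdf(x, m, k)
                for w, m, k in zip(weights, means, kappas)]
        peak = max(logs)
        unnorm = [math.exp(v - peak) for v in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    # M-step: weights, mean directions, and approximate concentrations.
    new_weights, new_means, new_kappas = [], [], []
    for i in range(len(weights)):
        mass = sum(r[i] for r in resp)
        r_vec = [sum(r[i] * x[d] for r, x in zip(resp, data)) for d in range(dim)]
        r_norm = math.sqrt(sum(v * v for v in r_vec))
        r_bar = r_norm / mass  # mean resultant length, in (0, 1)
        new_weights.append(mass / num)
        new_means.append([v / r_norm for v in r_vec])
        new_kappas.append(r_bar * (dim - r_bar ** 2) / (1.0 - r_bar ** 2))
    return new_weights, new_means, new_kappas
```

In practice the step would be repeated until the log-likelihood stops improving, with several random initializations to reduce the risk of poor local optima.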
IV PDF-optimized Vector Quantization
In designing practical quantizers, one challenging problem is that when the amount of training data is not sufficiently large, the obtained codebook may overfit the training set and perform worse on the real data as a whole. The PDF-optimized VQ can overcome this problem either by generating a sufficiently large amount of training data from the obtained PDF or by calculating the optimal codebook explicitly from the obtained PDF [10, 11]. Thus, with the trained VMM, we can design a PDF-optimized VQ.
IV-A Distortion-Rate Relation with Constrained Entropy
With the high rate assumption, the quantization performance is analytically tractable. Since coding at a finite rate is the motivation for using quantizers, a constraint must be imposed on the VQ design. Generally speaking, there are two commonly used cases, namely constrained resolution (CR) and constrained entropy (CE). In the CR case, the number of index levels is fixed; this case is widely applied in communication systems. The CE case, on the other hand, imposes the constraint on the average bit rate. It is less restrictive than the CR case and yields lower average bit rates. As the computational capabilities of hardware increase, it becomes more attractive to exploit the advantages inherent in the CE case.
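Assuming that the PDF of the variable $\mathbf{x}$ is $f(\mathbf{x})$, the D-R relation in the CE case, on a per dimension basis, writes
$$D(R) = C(r, K)\, e^{-\frac{r}{K}\left[R - h(\mathbf{x})\right]}, \tag{14}$$
where $h(\mathbf{x}) = -\int f(\mathbf{x}) \ln f(\mathbf{x})\, d\mathbf{x}$ is the differential entropy of $\mathbf{x}$, $R$ is the average rate for quantization (both measured in nats), and $C(r, K)$ is a constant that depends on the distortion type $r$ (e.g., $r = 2$ means the Euclidean distortion) and the variable's dimension (degrees of freedom) $K$.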
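The exponential form of the high rate D-R relation can be made concrete with a small numerical sketch. The code below assumes the common constrained-entropy form D(R) = C exp(−(r/K)(R − h)), with the rate and the differential entropy in nats. The constant `coeff` stands in for C(r, K), whose value depends on the lattice and is not given here, so it is left as a hypothetical input.

```python
import math

def ce_distortion(rate, diff_entropy, dof, coeff=1.0, r=2):
    """Per-dimension high rate distortion in the constrained-entropy case.

    rate, diff_entropy: total rate and differential entropy, both in nats.
    dof: degrees of freedom of the quantized variable.
    coeff: placeholder for the constant C(r, dof); an assumed input.
    """
    return coeff * math.exp(-(r / dof) * (rate - diff_entropy))

# Adding (dof / r) * ln 2 nats of rate halves the distortion.
d_low = ce_distortion(rate=30.0, diff_entropy=-20.0, dof=16)
d_high = ce_distortion(rate=30.0 + 16 / 2 * math.log(2.0), diff_entropy=-20.0, dof=16)
ratio = d_high / d_low
```

The same relation shows why a smaller differential entropy helps: lowering `diff_entropy` by some amount has the same effect on the distortion as raising the rate by that amount.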
IV-B Optimal Inter-component Bit Allocation
When applying a mixture model based quantizer, we model the PDF as a weighted sum of mixture components and design a quantizer for each component. The total rate is divided into two parts, one for identifying the index of the mixture component and the other for quantizing within the mixture components. Given $I$ mixture components, part of the total rate is spent on identifying the index. The remaining rate $R_q$ is used for quantizing the mixture components. Therefore, an optimal inter-component bit allocation strategy is required so that the designed quantizer achieves the smallest mean distortion at a given $R_q$.
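In the CE case, the objective is to minimize the mean distortion
$$D = \sum_{i=1}^{I} \pi_i D_i(R_i), \tag{15}$$
where $R_i$ is the rate assigned to component $i$ and satisfies $\sum_{i=1}^{I} \pi_i R_i = R_q$, with $R_q$ the rate available for quantizing the mixture components. To reach the optimal mean distortion, each component should have its best CE performance. This indicates that the distortion for each mixture component writes
$$D_i(R_i) = C(r, K)\, e^{-\frac{r}{K}\left[R_i - h_i(\mathbf{x})\right]}.$$
The differential entropy for component $i$ in a VMM is
$$h_i(\mathbf{x}) = -\ln c_{K+1}(\kappa_i) - \kappa_i \frac{I_{(K+1)/2}(\kappa_i)}{I_{(K-1)/2}(\kappa_i)},$$
where we used the fact that $\mathrm{E}\left[\boldsymbol{\mu}_i^{\mathrm{T}}\mathbf{x}\right] = I_{(K+1)/2}(\kappa_i) / I_{(K-1)/2}(\kappa_i)$.
The constrained optimization problem in (15) can be solved by the method of Lagrange multipliers. Setting the derivative of the Lagrangian with respect to each $R_i$ to zero shows that $R_i - h_i(\mathbf{x})$ must take the same value for all components, so the rate assigned to the $i$th mixture component is
$$R_i = R_q + h_i(\mathbf{x}) - \sum_{j=1}^{I} \pi_j h_j(\mathbf{x}).$$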
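The allocation rule that results from the Lagrangian argument can be sketched numerically. The code assumes the exponential per-component D-R form, under which the optimum equalizes the gap between rate and differential entropy across components. The function name and the toy numbers are hypothetical, and all quantities are in nats.

```python
def allocate_rates(weights, entropies, rate_q):
    """Rates per component such that the weighted mean rate equals rate_q.

    weights: mixture weights summing to one.
    entropies: differential entropy of each component (nats).
    rate_q: rate left for quantization after signalling the component index.
    """
    mean_entropy = sum(w * h for w, h in zip(weights, entropies))
    return [rate_q + h - mean_entropy for h in entropies]

weights = [0.2, 0.5, 0.3]
entropies = [-30.0, -25.0, -28.0]
rates = allocate_rates(weights, entropies, rate_q=40.0)
mean_rate = sum(w * r for w, r in zip(weights, rates))
gaps = [r - h for r, h in zip(rates, entropies)]
```

Components with larger differential entropy receive more rate. At low rates the rule can return negative values for very concentrated components, which is one sign that the high rate assumption no longer holds.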
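IV-C Distortion-Rate Relation by VMM
In the CE case and with optimal inter-component bit allocation, the distortions contributed by all the mixture components are identical to each other, because $R_i - h_i(\mathbf{x}) = R_q - \sum_{j=1}^{I} \pi_j h_j(\mathbf{x})$ for every component, where the weighted entropy sum is a constant which only depends on the trained model. Then the D-R relation is
$$D = C(r, K)\, e^{-\frac{r}{K}\left[R_q - \sum_{i=1}^{I} \pi_i h_i(\mathbf{x})\right]}.$$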
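Putting the pieces together, the overall D-R value of a trained vMF mixture can be sketched as below. This is an assumption-laden illustration: it uses the standard vMF entropy −ln c(κ) − κ·A(κ) with A(κ) the ratio of modified Bessel functions, takes the degrees of freedom as the vector dimension minus one, and treats the constant `coeff` and the quantization rate `rate_q` as given inputs. The function names are hypothetical.

```python
import math

def log_bessel_i(order, x):
    """ln I_order(x) for x > 0, from the power series summed in the log domain."""
    log_half_x = math.log(x / 2.0)
    logs, peak, m = [], -math.inf, 0
    while True:
        term = (2 * m + order) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + order + 1)
        logs.append(term)
        peak = max(peak, term)
        if term < peak - 40.0:  # unimodal series, tail is negligible
            break
        m += 1
    return peak + math.log(sum(math.exp(t - peak) for t in logs))

def vmf_entropy(dim, kappa):
    """Differential entropy (nats) of a vMF distribution on unit vectors of length dim."""
    order = dim / 2.0 - 1.0
    log_i_low = log_bessel_i(order, kappa)
    log_i_high = log_bessel_i(order + 1.0, kappa)
    log_norm = order * math.log(kappa) - (dim / 2.0) * math.log(2.0 * math.pi) - log_i_low
    mean_projection = math.exp(log_i_high - log_i_low)  # E[mu^T x]
    return -log_norm - kappa * mean_projection

def vmm_distortion(rate_q, weights, kappas, dim, coeff=1.0, r=2):
    """High rate distortion of the mixture quantizer with optimal bit allocation."""
    mean_entropy = sum(w * vmf_entropy(dim, k) for w, k in zip(weights, kappas))
    dof = dim - 1  # assumed degrees of freedom of a unit vector
    return coeff * math.exp(-(r / dof) * (rate_q - mean_entropy))
```

For example, `vmm_distortion(30.0, [0.4, 0.6], [50.0, 80.0], dim=11)` evaluates the relation for a two-component mixture on 11-dimensional unit vectors. More concentrated components have lower entropy and therefore lower distortion at the same rate.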
V Experimental Results and Discussion
The proposed inter-component bit allocation strategy optimizes the D-R relation of VVQ. To demonstrate the D-R performance, we compared it with our recently presented DVQ and the widely used GVQ [11, 23]. The TIMIT database with wideband speech (sampled at 16 kHz) was used. We extracted -dimensional LPC parameters and transformed them to LSF parameters, ΔLSF parameters, and SRLSF parameters, respectively. With window length equal to milliseconds and step size equal to milliseconds, approximately LSF vectors (the same amount for ΔLSF and SRLSF as well) were obtained from the training partition. GMM, DMM, and VMM were trained on the corresponding vectors and the D-R relations were calculated, respectively. The mean values of rounds of simulations are reported. Figure 1 shows the D-R performance comparisons. It can be observed that VVQ leads to smaller distortion than GVQ and DVQ at all tested rates. We believe this is due to the efficient modeling of the SRLSF parameters. Furthermore, better D-R performance can be obtained with more mixture components. Therefore, VVQ potentially permits superior practical VQ performance.
VI Conclusion
A novel PDF-optimized VQ for LSF parameter quantization was proposed. The LSF parameters were transformed to the square-root LSF domain, and the underlying distribution was modeled by a von Mises-Fisher mixture model (VMM). Following high rate quantization theory in the constrained entropy case, an optimal inter-component bit allocation strategy was proposed based on the VMM. The mean distortion of the VMM based vector quantizer (VVQ) was minimized at a given rate, which yields the D-R relation. Compared to our recently proposed Dirichlet mixture model based VQ and the conventionally used Gaussian mixture model based VQ, the proposed VVQ performs better over a wide range of bit rates.
VII Future Work
In future work, we need to implement a practical scheme to carry out the VQ. One possible solution is to design an efficient quantizer for the von Mises-Fisher (vMF) source. Another possible solution is to decorrelate the vMF vector variable into a set of scalar variables, each of which has an explicit PDF representation. Then we can replace the VQ with a set of independent scalar quantizers. This approach is similar to the Dirichlet source decorrelation used in the Dirichlet mixture model based VQ.
Appendix A Discussion about the Inconsistency of Likelihood Comparison and D-R Comparison
This section is only for discussion and will not appear in the final submission.
As we observed before, the likelihood obtained by the DMM is higher than the likelihood obtained by the VMM. If we calculate the differential entropy of the trained PDF empirically as
$$h(\mathbf{x}) \approx -\frac{1}{N}\sum_{n=1}^{N} \ln f(\mathbf{x}_n),$$
a higher likelihood leads to a smaller differential entropy. According to (14), this indicates better D-R performance. However, in our manuscript, VMM performs better than GMM when the mixture quantizer strategy is applied.
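The empirical entropy estimate referred to here is the negative average log-likelihood of the data under the trained PDF. The sketch below illustrates it on a toy case where the exact answer is known (a standard Gaussian). The sample size, the seed, and the function names are illustrative assumptions, not part of the paper's experiments.

```python
import math
import random

def empirical_entropy(samples, log_pdf):
    """Estimate the differential entropy (nats) as the negative mean log-density."""
    return -sum(log_pdf(x) for x in samples) / len(samples)

def std_normal_log_pdf(x):
    """Log-density of the standard Gaussian, used here as the 'trained' model."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(20000)]
estimate = empirical_entropy(samples, std_normal_log_pdf)
exact = 0.5 * math.log(2.0 * math.pi * math.e)  # entropy of N(0, 1)
```

Since the estimate is just the negated per-sample log-likelihood, a model with a higher likelihood on the same data automatically gets a smaller estimated entropy.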
Why would this happen?
Thereafter, we have
This inequality indicates that the D-R performance calculated in the CE case (with the mixture quantizer, (18)) is, in general, not identical to the D-R performance calculated with the whole PDF ((14)). The inequality in (A) vanishes when all the components have the same weights. The equality (27) holds if there is no overlap among the mixture components or if no mixture modeling is used ($I = 1$).
The inequality above introduces a systematic gap (a loss in D-R performance). This gap depends on the training and on the distribution assumption. Therefore, a smaller differential entropy for the whole PDF ($h(\mathbf{x})$) can only guarantee better D-R performance if the mixture quantizer strategy is not used. With a mixture quantizer, it cannot guarantee better D-R performance.
- K. K. Paliwal and W. B. Kleijn, Speech Coding and Synthesis. Amsterdam, The Netherlands: Elsevier, 1995, ch. Quantization of LPC parameters, pp. 433–466.
- W. B. Kleijn, T. Backstrom, and P. Alku, “On line spectral frequencies,” IEEE Signal Processing Letters, vol. 10, no. 3, pp. 75–77, 2003.
- X. Ma, J. Zhang, Y. Zhang, and Z. Ma, “Data scheme-based wireless channel modeling method: motivation, principle and performance,” Journal of Communications and Information Networks, vol. 2, no. 3, pp. 41–51, Sep 2017.
- Z. Ma, “Bayesian estimation of the Dirichlet distribution with expectation propagation,” in Proceedings of European Signal Processing Conference, 2012.
- Z. Ma, S. Chatterjee, W. B. Kleijn, and G. J., “Dirichlet mixture modeling to estimate an empirical lower bound for LSF quantization,” Signal Processing, vol. 104, pp. 291–295, Nov. 2014.
- Z. Ma, Y. Lai, J. Taghia, and J. Guo, “Insights into the convergence of extended variational inference for non-Gaussian statistical models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, under review.
- F. Itakura, “Line spectrum representation of linear predictive coefficients of speech signals,” Journal of the Acoustical Society of America, vol. 57, p. 535, 1975.
- K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 bits/frame,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 3–14, Jan. 1993.
- J. Li, N. Chaddha, and R. M. Gray, “Asymptotic performance of vector quantizers with a perceptual distortion measure,” IEEE Transactions on Information Theory, vol. 45, pp. 1082–1091, May 1999.
- P. Hedelin and J. Skoglund, “Vector quantization based on Gaussian mixture models,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 385–401, Jul. 2000.
- A. D. Subramaniam and B. D. Rao, “PDF optimized parametric vector quantization of speech line spectral frequencies,” IEEE Transactions on Speech and Audio Processing, vol. 11, pp. 130–142, Mar 2003.
- Z. Ma, A. Leijon, and W. B. Kleijn, “Vector quantization of LSF parameters with a mixture of Dirichlet distributions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1777–1790, Sept 2013.
- Z. Ma and A. Leijon, “Expectation propagation for estimating the parameters of the beta distribution,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010.
- Z. Ma and A. Leijon, “Modeling speech line spectral frequencies with Dirichlet mixture models,” in Proceedings of INTERSPEECH, 2010.
- Z. Ma and A. Leijon, “Bayesian estimation of beta mixture models with variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160–73, 2011.
- Z. Ma and A. Leijon, “Modeling speech line spectral frequencies with Dirichlet mixture models,” in Proceedings of INTERSPEECH, 2010, pp. 2370–2373.
- Z. Ma and A. Leijon, “PDF-optimized LSF vector quantization based on beta mixture models,” in Proceedings of INTERSPEECH, 2010.
- J. Lindblom and J. Samuelsson, “Bounded support Gaussian mixture modeling of speech spectra,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 1, pp. 88–99, Jan. 2003.
- Z. Ma and A. Leijon, “Human skin color detection in RGB space with Bayesian estimation of beta mixture models,” in Proceedings of European Signal Processing Conference, 2010.
- Z. Ma and A. Leijon, “Human audio-visual consonant recognition analyzed with three bimodal integration models,” in Proceedings of INTERSPEECH, 2009.
- Z. Ma and A. Leijon, “A probabilistic principal component analysis based hidden Markov model for audio-visual speech recognition,” in Proceedings of IEEE Asilomar Conference on Signals, Systems, and Computers, 2008.
- Z. Ma, H. Li, Q. Sun, C. Wang, A. Yan, and F. Starfelt, “Statistical analysis of energy consumption patterns on the heat demand of buildings in district heating systems,” Energy and Buildings, vol. 85, pp. 464–472, Dec. 2014.
- S. Chatterjee and T. V. Sreenivas, “Low complexity wideband LSF quantization using GMM of uncorrelated Gaussian mixtures,” in 16th European Signal Processing Conference (EUSIPCO), 2008.
- J. Taghia, Z. Ma, and A. Leijon, “On von-Mises Fisher mixture model in text-independent speaker identification,” in Proceedings of INTERSPEECH, 2013.
- K. V. Mardia and P. E. Jupp, Directional Statistics. John Wiley and Sons, 2000.
- Z. Ma, R. Martin, J. Guo, and H. Zhang, “Nonlinear estimation of missing LSF parameters by a mixture of Dirichlet distributions,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, 2014.
- Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian estimation of Dirichlet mixture model with variational inference,” Pattern Recognition, vol. 47, no. 9, pp. 3143–3157, 2014.
- Z. Ma, Z.-T. Tan, and J. Guo, “Feature selection for neutral vector in EEG signal classification,” Neurocomputing, vol. 174, pp. 937–945, 2016.
- A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, “Clustering on the unit hypersphere using von Mises-Fisher distributions,” Journal of Machine Learning Research, vol. 6, pp. 1345–1382, 2005.
- Z. Ma and A. E. Teschendorff, “A variational Bayes beta mixture model for feature selection in DNA methylation studies,” Journal of Bioinformatics and Computational Biology, vol. 11, no. 4, 2013.
- Z. Ma, A. E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian matrix factorization for bounded support data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 876–89, 2015.
- B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item based collaborative filtering recommendation algorithms,” in Proc. 10th International Conference on the World Wide Web, 2001, pp. 285–295.
- W. B. Kleijn, A basis for source coding, 2010, KTH lecture notes.
- Z. Ma, J. Xie, H. Li, Q. Sun, Z. Si, J. Zhang, and J. Guo, “The role of data analysis in the development of intelligent energy networks,” IEEE Network, vol. 31, no. 5, pp. 88–95, 2017.
- Z. Ma, J. H. Xue, A. Leijon, Z. H. Tan, Z. Yang, and J. Guo, “Decorrelation of neutral vector variables: Theory and applications,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 129–143, Jan 2018.
- P. K. Rana, Z. Ma, J. Taghia, and M. Flierl, “Multiview depth map enhancement by variational Bayes inference estimation of Dirichlet mixture models,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
- P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic multiview depth image enhancement using variational inference,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 3, pp. 435–448, April 2015.
- Q. Sun, H. Li, Z. Ma, C. Wang, J. Campillo, Q. Zhang, F. Wallin, and J. Guo, “A comprehensive review of smart energy meters in intelligent energy networks,” IEEE Internet of Things Journal, vol. 3, no. 4, pp. 464–479, Aug 2016.
- Z. Wang, Y. Qi, J. Liu, and Z. Ma, “User intention understanding from scratch,” in IEEE International Workshop on Sensing, Processing and Learning for Intelligent Machines, 2016.
- P. Xu, K. Li, Z. Ma, Y.-Z. Song, L. Wang, and J. Guo, “Cross-modal subspace learning for sketch-based image retrieval: A comparative study,” in Proceedings of IEEE International Conference on Network Infrastructure and Digital Content, 2016.
- P. Xu, Q. Yin, Y. Huang, Y.-Z. Song, Z. Ma, L. Wang, T. Xiang, W. B. Kleijn, and J. Guo, “Cross-modal subspace learning for fine-grained sketch-based image retrieval,” Neurocomputing, vol. 278, pp. 75–86, Feb. 2018.
- H. Yu, Z. Ma, M. Li, and J. Guo, “Histogram transform model using MFCC features for text-independent speaker identification,” in Proceedings of IEEE Asilomar Conference on Signals, Systems, and Computers, 2014.
- H. Yu, A. Sarkar, D. A. L. Thomsen, Z.-H. Tan, Z. Ma, and J. Guo, “Effect of multi-condition training and speech enhancement methods on spoofing detection,” in IEEE International Workshop on Sensing, Processing and Learning for Intelligent Machines, 2016.
- H. Yu, Z. H. Tan, Y. Zhang, Z. Ma, and J. Guo, “DNN filter bank cepstral coefficients for spoofing detection,” IEEE Access, vol. 5, pp. 4779–4787, 2017.
- H. Yu, Z. H. Tan, Z. Ma, R. Martin, and J. Guo, “Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–12, 2018.
- M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York: Dover Publications, 1965.
- S. Sra, “A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of $I_s(x)$,” Computational Statistics, vol. 27, no. 1, pp. 177–190, 2012.
- H. Zhou, N. Zhang, D. Huang, Z. Ma, W. Hu, and J. Guo, “Activation force-based air pollution tracing,” in Proceedings of IEEE International Conference on Network Infrastructure and Digital Content, 2016.
- DARPA-TIMIT, “Acoustic-phonetic continuous speech corpus,” NIST Speech Disc 1.1-1, 1990.
- J. Hamkins and K. Zeger, “Gaussian source coding with spherical codes,” IEEE Transactions on Information Theory, vol. 48, no. 11, pp. 2980–2989, 2002.