Classification of EEG Signals Based on Non-Gaussian Neutral Vectors
Abstract
In the design of brain-computer interface systems, classification of Electroencephalogram (EEG) signals is an essential and challenging task. Recently, since the marginalized discrete wavelet transform (mDWT) representations can reveal features related to the transient nature of EEG signals, the mDWT coefficients have been frequently used in EEG signal classification. In our previous work, we proposed a super-Dirichlet distribution-based classifier, which utilized the non-negative and sum-to-one properties of the mDWT coefficients. The proposed classifier performed better than the state-of-the-art support vector machine-based classifier. In this paper, we further study the neutrality of the mDWT coefficients. Assuming the mDWT coefficient vector to be a neutral vector, we transform it nonlinearly into a set of independent scalar coefficients. A feature selection strategy is then proposed in the transformed feature domain. Experimental results show that the feature selection strategy helps improve the classification accuracy.
keywords:
Neutral vector, neutrality, nonlinear decorrelation, Dirichlet variable, super-Dirichlet distribution, beta distribution, EEG classification
1 Introduction
A brain-computer interface (BCI) connects persons suffering from neuromuscular diseases with computers by analyzing recorded brain signals. With a well-designed BCI system, persons with neuromuscular disease can communicate with computers, enabling them to get assistance from machines. As a non-invasively acquired signal, the Electroencephalogram (EEG) signal is the most studied and applied one in the design of a BCI system Lotte2007 (); Chiang2012 (). While a person is imagining a kind of action, the electrical activity along the scalp is recorded in the EEG signal. EEG signals show different patterns for different actions. Hence, the type of imagined action can be estimated by analyzing the EEG signals. Appropriate classification of EEG signals plays an essential role in a BCI system Prasad2011 ().
Various types of features have been extracted from EEG signals for the purpose of classification, such as the autoregressive (AR) parameters Penny2000 (), the multivariate AR parameters Chiang2012 (), Fourier transform-based features Veluvolu2012 (); Wang2013 (), and the marginalized discrete wavelet transform (mDWT) coefficients Subasi2007 (); Farina2007 (); Ma2012 (). The DWT coefficients represent the signal by projecting it onto a set of wavelet spaces. The wavelet transform applied to the EEG signal can reveal features related to the transient nature of the signal, in which the time-scale regions are defined Subasi2007 (). In order to make the DWT coefficients insensitive to time alignment, the marginalized DWT (mDWT) coefficients are usually used as the feature for the task of EEG signal classification Prasad2011 (); Subasi2007 (); Farina2007 (). In this paper, we focus our study of EEG classification performance on the mDWT features only. A widely applied method, among others, is to design a classifier based on the support vector machine (SVM) Subasi2007 (); Farina2007 (); Chang2011 (); 08 (); 09 (). Generally speaking, the SVM-based classifier is not sensitive to the curse of dimensionality. It is also not sensitive to overtraining when proper parameters are chosen Prasad2011 (). Moreover, it can easily be implemented for binary classification and extended to the multi-class case. By involving a kernel function (e.g., a Gaussian kernel), the performance of the SVM-based classifier can be further improved.
In EEG signal classification, the SVM-based classifier has been demonstrated to be a successful tool Subasi2010 (); Prasad2011 (). Nevertheless, the SVM-based method does not exploit the non-negativity and the sum-to-one nature of the mDWT coefficients Ma2012 (). In order to capture these properties, we applied the Dirichlet distribution to model the mDWT coefficients' underlying distribution. For the mDWT coefficients from several mutually independent channels, it is natural to apply the so-called super-Dirichlet distribution Ma2011b (). In Ma2012 (), we designed a super-Dirichlet distribution-based classifier to classify the EEG signals with the mDWT representation.
It is well-known that the Dirichlet variable is a neutral vector Connor1969 (); James1980 (). For a vector $\mathbf{x} = [x_1, \ldots, x_K]^{\mathrm{T}}$, the element $x_1$ is neutral if $x_1$ is independent of the normalized remaining vector $[x_2, \ldots, x_K]^{\mathrm{T}}/(1-x_1)$. If all the elements in $\mathbf{x}$ are neutral, then $\mathbf{x}$ is defined as a completely neutral vector Connor1969 (); Hankin2010 (). The idea of neutrality was introduced by Connor and Mosimann Connor1969 () to describe constrained variables with the property mentioned above; it was originally developed for biological applications. The elements of a neutral vector are highly negatively correlated. As all the elements in a neutral vector have bounded support and are non-negative, the neutral vector cannot be described efficiently by a Gaussian distribution Ma2013 (). Thus, the conventional principal component analysis (PCA) method Bishop2006 () cannot be applied for optimal decorrelation.
The purpose of dimension reduction is to remove redundant dimensions and thus improve the corresponding performance Bishop2006 (); Saeys2007 (); Kwak2002 (); He2011 (); Zhu2010 (). We apply the proposed feature selection method to EEG signal classification tasks. The mDWT coefficients from each recording channel are assumed to be Dirichlet distributed Ma2012 (); Ma2014 () and are decorrelated into a set of mutually independent scalars that are beta distributed. By retaining the most relevant features, we design a multivariate beta distribution-based classifier for EEG signals. Experimental results demonstrate that the proposed method performs better than both the state-of-the-art SVM-based classifier Prasad2011 () and our previously proposed super-Dirichlet distribution-based classifier Ma2012 ().
2 Electroencephalogram Signal Analysis
An EEG signal represents the brain's electrical activity over a short period of time and is recorded from multiple electrodes placed on the scalp; therefore, the EEG signals are obtained from multiple channels. When a classifier trained on data from the first day is used to classify data from the following days, it is very challenging to achieve good performance. The EEG signals we use in this paper are obtained from the BCI competition III BCI (). The training data and the test data were recorded from the same subject and with the same task, but on two different days with about one week in between. This way of recording data takes the time-varying property of EEG signals into account.
2.1 Data Description
During the EEG signal recording, a subject had to perform imagined movements of either the left small finger or the tongue BCI (). Thus we have two classes of EEG signals and the task is binary classification. The electrical brain activity was picked up during these trials using an ECoG platinum electrode grid which was placed on the contralateral (right) motor cortex. In total, channels of EEG signals were obtained. For each channel, several trials of the imaginary brain activity were recorded. In total, trials were recorded as the labeled training set and trials were recorded as the labeled test set. In both the training set and test set, the data are evenly recorded for each imaginary movement.
2.2 Feature Extraction
For each trial in the training set, channel data of length samples were provided. Each channel's data was band-pass filtered in the Hz range as
(1) 
After the DWT, we obtained a set of coefficients , where is the index of the decomposition level, is the index of the coefficient at each level, and is the length of the data from each channel. In order to make the DWT representation insensitive to time alignment, the DWT coefficients were marginalized into the so-called mDWT coefficients, defined as Farina2007 ()
(2) 
where and denote the high-band and low-band coefficients at the last decomposition level, respectively. The normalized coefficients were cascaded into an mDWT vector. In our case, the DWT was carried out at level with the Daubechies wavelet. Comparative work applying different wavelets can be found in, e.g., Gandhi2011 (). With such settings, the total dimensionality of the mDWT vector is five. For each trial in the training set, we have mDWT vectors. The same procedure was also applied to the trials in the test set.
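The marginalization step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's exact implementation: the paper used a Daubechies wavelet via Matlab's wavedec, while a Haar wavelet is used here to keep the sketch self-contained, and the helper names (`haar_dwt_level`, `mdwt`) are hypothetical.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar DWT: return (approximation, detail)."""
    x = x[: len(x) // 2 * 2]               # truncate to even length
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-band (approximation)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-band (detail)
    return a, d

def mdwt(x, levels=4):
    """Marginalized DWT: sum |coefficients| per band, normalize to sum one.

    Bands: the detail coefficients of levels 1..levels plus the final
    approximation, giving a (levels + 1)-dimensional mDWT vector.
    """
    bands = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_dwt_level(a)
        bands.append(np.sum(np.abs(d)))
    bands.append(np.sum(np.abs(a)))        # low-band of the last level
    m = np.array(bands)
    return m / m.sum()                      # non-negative, sums to one

rng = np.random.default_rng(0)
m = mdwt(rng.standard_normal(256), levels=4)
print(m, m.sum())
```

With `levels=4` the vector has five entries, matching the dimensionality stated above, and by construction it is non-negative and sums to one.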
2.3 Channel Selection
As mentioned above, the EEG signals were recorded independently from channels located at different positions over the scalp. However, it is unclear which channels (i.e., recording positions) are more relevant to the imaginary task than the rest Lal2004 (), and the signals recorded from irrelevant channels can act as noise in the classification task Prasad2011 (). Thus, selecting the most relevant channels should improve the classification accuracy. Since our study is a binary classification task, we use two criteria, the Fisher ratio (FR) Malina1981 (); Chae2012 () and the generalization error estimation (GEE) Lal2004 (), to select the best channels.
Fisher Ratio
In binary classification, the FR indicates how strongly a channel correlates with the class labels. For a channel $i$, the Fisher ratio, assuming equal prior probability for each class, is defined as Malina1981 ()

(3) $\mathrm{FR}_i(\mathbf{w}) = \dfrac{\left(\mathbf{w}^{\mathrm{T}}(\mathbf{m}_{i,1} - \mathbf{m}_{i,2})\right)^2}{\mathbf{w}^{\mathrm{T}}\left(\boldsymbol{\Sigma}_{i,1} + \boldsymbol{\Sigma}_{i,2}\right)\mathbf{w}},$

where $\mathbf{m}_{i,c}$ and $\boldsymbol{\Sigma}_{i,c}$ are the mean and the covariance matrix of class $c$ in channel $i$, respectively. $\mathbf{w}$ is a vector with the same size as $\mathbf{m}_{i,c}$; it represents the feature space coordinate axes. The channels with larger FRs are preferable for classification. The FRs were calculated based on the training set. Table 1 lists the FRs corresponding to the recording channels.
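The per-channel ranking can be sketched as below, using the standard two-class Fisher ratio $(\mathbf{w}^{\mathrm{T}}(\mathbf{m}_1-\mathbf{m}_2))^2 / (\mathbf{w}^{\mathrm{T}}(\boldsymbol{\Sigma}_1+\boldsymbol{\Sigma}_2)\mathbf{w})$. The function name and the toy data are hypothetical; by default the projection direction is taken along the mean difference, which is one common choice rather than necessarily the paper's.

```python
import numpy as np

def fisher_ratio(X1, X2, w=None):
    """Fisher ratio of two classes of feature vectors (rows = trials).

    FR = (w^T (m1 - m2))^2 / (w^T (S1 + S2) w); by default the
    direction w is taken along the mean difference.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    if w is None:
        w = m1 - m2
    num = (w @ (m1 - m2)) ** 2
    den = w @ (S1 + S2) @ w
    return num / den

# Toy check: well-separated classes give a larger FR than overlapping ones.
rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, size=(200, 5))
B_far = rng.normal(3.0, 1.0, size=(200, 5))
B_near = rng.normal(0.2, 1.0, size=(200, 5))
print(fisher_ratio(A, B_far), fisher_ratio(A, B_near))
```

Channels whose two classes are well separated receive a large FR and are ranked first.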
Generalization Error Estimation
[Table 1: Fisher ratio (FR) and classification rate (CR, in %) for each recording channel.]
To select channels, the performance of a channel can also be estimated by the generalization error with k-fold cross-validation. In the BCI competition III database, the data have already been split into a training set and a test set, with no overlap between the two. Evaluating the classification rate (CR) on the training set is therefore sufficient for estimating the channel performance. For each channel, we train an SVM-based classifier with the labeled training set. With the obtained classifier, we test the performance on the labeled training set itself. The higher the CR, the more preferable the channel. The CRs are also listed in Table 1.
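The channel ranking by training-set classification rate can be sketched as follows. The paper trains an SVM per channel; to keep the sketch self-contained, a nearest-class-mean rule is substituted here, and the function names and toy data are hypothetical.

```python
import numpy as np

def training_cr(X, y):
    """Classification rate on the training set with a nearest-mean rule
    (a stand-in for the per-channel SVM used in the paper)."""
    means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    pred = [min(means, key=lambda c: np.linalg.norm(x - means[c])) for x in X]
    return np.mean(np.array(pred) == y)

def rank_channels(features, y):
    """features: dict channel -> (trials x dims) array. Rank by training CR."""
    crs = {ch: training_cr(X, y) for ch, X in features.items()}
    return sorted(crs, key=crs.get, reverse=True), crs

# Toy data: channel 0 is informative, channel 1 is pure noise.
rng = np.random.default_rng(2)
y = np.repeat([0, 1], 100)
feats = {
    0: np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(2, 1, (100, 5))]),
    1: rng.normal(0, 1, (200, 5)),
}
order, crs = rank_channels(feats, y)
print(order, crs)
```

The informative channel obtains a high CR and ranks ahead of the noise channel.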
3 EEG Classification via Feature Selection
The channel selection methods mentioned in the above section motivate us to combine different channels to obtain better classification results. As described in Ma2012 (), for each imagined trial we cascade the EEG signals from the top channels to create a super-vector. The classification task is carried out based on such super-vectors.
3.1 Super-Dirichlet Modeling
According to (2), the mDWT vector extracted from each channel contains elements which are non-negative and whose sum is one. Hence, it is natural to model the underlying distribution of the mDWT vector by the Dirichlet distribution. For more than one channel, we apply the super-Dirichlet distribution Ma2011b () to describe the super-vector's distribution. For a super-vector $\mathbf{x} = [\mathbf{x}_1^{\mathrm{T}}, \ldots, \mathbf{x}_L^{\mathrm{T}}]^{\mathrm{T}}$ cascaded from the top $L$ channels, the probability density function (PDF) of the super-Dirichlet distribution is defined as

(4) $f(\mathbf{x}; \boldsymbol{\alpha}) = \prod_{l=1}^{L} \dfrac{\Gamma\!\left(\sum_{k=1}^{K_l+1} \alpha_{l,k}\right)}{\prod_{k=1}^{K_l+1} \Gamma(\alpha_{l,k})} \prod_{k=1}^{K_l+1} x_{l,k}^{\alpha_{l,k}-1},$

where $\Gamma(\cdot)$ is the gamma function, $L$ is the number of subvectors (i.e., the number of selected channels) in the super-vector, and $K_l$ is the degrees of freedom of the $l$th subvector (in our case, $K_l = 4$). $\alpha_{l,k}$ is the parameter corresponding to $x_{l,k}$, which denotes the $k$th element of the $l$th subvector $\mathbf{x}_l$. The PDF of the super-Dirichlet distribution is actually a multiplication of several Dirichlet PDFs. The parameter estimation methods for the super-Dirichlet distribution can be found in Ma2012 ().
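Since the super-Dirichlet PDF is a product of per-channel Dirichlet PDFs, its log-density is simply a sum of Dirichlet log-densities. A minimal sketch, with hypothetical function names and toy parameter values:

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(x, alpha):
    """Log-PDF of the Dirichlet distribution."""
    x, alpha = np.asarray(x, dtype=float), np.asarray(alpha, dtype=float)
    return (lgamma(alpha.sum()) - sum(lgamma(a) for a in alpha)
            + np.sum((alpha - 1.0) * np.log(x)))

def super_dirichlet_logpdf(subvectors, alphas):
    """Log-PDF of a super-Dirichlet variable: the super-vector cascades
    independent Dirichlet sub-vectors (one per channel), so the PDF is
    the product of the per-channel Dirichlet PDFs."""
    return sum(dirichlet_logpdf(x, a) for x, a in zip(subvectors, alphas))

# Two channels, each a 5-dimensional mDWT vector summing to one.
u = [np.array([0.1, 0.2, 0.3, 0.25, 0.15]),
     np.array([0.3, 0.1, 0.2, 0.2, 0.2])]
alphas = [np.array([2.0, 3.0, 4.0, 3.0, 2.0]),
          np.array([3.0, 2.0, 2.0, 2.0, 3.0])]
print(super_dirichlet_logpdf(u, alphas))
```

Working in the log domain avoids numerical underflow when many channels are multiplied together.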
3.2 Nonlinear Decorrelation of Neutral Vector
Neutral Vector
Assume we have a random vector variable $\mathbf{x} = [x_1, \ldots, x_K]^{\mathrm{T}}$, where $x_k \geq 0$ and $\sum_{k=1}^{K} x_k = 1$. The element $x_1$ is neutral if $x_1$ is independent of the normalized remaining vector $[x_2, \ldots, x_K]^{\mathrm{T}}/(1-x_1)$. If all the elements in $\mathbf{x}$ are neutral, then $\mathbf{x}$ is defined as a completely neutral vector Connor1969 (); Hankin2010 (). A neutral vector with $K$ elements has $K-1$ degrees of freedom. According to the above definition, a neutral vector conveys a particular type of independence among its elements, even though the element variables themselves are mutually negatively correlated.
Decorrelation via Parallel Nonlinear Transformation
In most signal processing applications, the transformations applied are linear, or nonlinear according to some kernel functions. Even though we could apply PCA directly to the neutral random vector variable, this linear transformation can only decorrelate the data; it cannot guarantee independence if the data are not Gaussian. Furthermore, PCA does not exploit the neutrality Ma2011 (). Therefore, PCA is not optimal for decorrelating a neutral vector. By considering the neutrality, we apply a nonlinear invertible transformation in this paper, which decorrelates the vector variable into a set of mutually independent variables. In contrast to PCA, this transformation does not require any statistical information (e.g., the covariance matrix) about the observed vector set. Thus, it avoids the eigenvalue analysis required by PCA and the computational cost is reduced.
As each element in $\mathbf{x}$ is neutral, the first element $x_1$ is independent of the remaining normalized elements, and the remaining normalized elements in turn build a new neutral vector. Based on this fact, the parallel nonlinear transformation (PNT) scheme described in Algorithm 1 can be applied to nonlinearly decorrelate $\mathbf{x}$ into a vector of mutually independent variables. A discussion of the independence is presented in Ma2013 (). The nonlinear transformation scheme proposed above is invertible via iterative multiplications. Algorithm 1 illustrates the PNT procedure for a $K$-dimensional neutral vector.
Distribution of the Decorrelated Elements
The Dirichlet variable is a completely neutral vector Frigyik2010 (). Assuming $\mathbf{x}$ is a Dirichlet variable with parameters $\boldsymbol{\alpha}$, we apply the above proposed PNT algorithm to decorrelate $\mathbf{x}$ and obtain $\mathbf{u}$. Moreover, all the elements in $\mathbf{u}$ are not only decorrelated but also mutually independent. With the permutation property, the aggregation property, and the neutrality Ma2013 (), each element in the obtained vector $\mathbf{u}$ is beta distributed. The algorithm for calculating the parameters of the resulting beta distributions is described in Algorithm 2. For the example, we have
(5) 
where
(6) 
To illustrate the decorrelation effect of the PNT scheme on the Dirichlet variable, we generated vectors from a Dirichlet distribution. The sample correlation coefficient for each original element pair was also evaluated. Table  shows the sample correlation coefficients before and after transformation with PNT. The coefficients are very small after transformation; hence, the correlation between each element pair has vanished.
3.3 Selection of Relevant Features
Feature selection is an important problem in EEG signal classification Peng2005 (); Prasad2011 (); Lawhern2013 (). In Section 2.3, the FR and the GEE were applied to select the most relevant channels. However, within each channel, it is unknown which dimensions are more relevant to the class labels than others. Another difficulty for feature selection within each channel is that the features in different dimensions are highly negatively correlated. The decorrelation strategy introduced above can transform the negatively correlated Dirichlet vector variable into a set of mutually independent scalar variables. Thus, we can directly select features without considering the correlations among them.
Typically, two criteria can be used for feature selection: the variance of the data Bishop2006 (); He2011 () and the differential entropy of the data Kwak2002 (); Zhu2010 (). The variance reflects how far a set of data is spread out. The differential entropy is a measure of the average uncertainty of a random variable under a continuous probability distribution. In general, the dimensions with larger variance/differential entropy are preferred in classification, as they can better describe the divergence among the data. With the assumption that the source data is Dirichlet distributed, the transformed vector contains a set of scalar variables which are beta distributed. For a beta distribution $\mathrm{Beta}(x; \alpha, \beta)$, the variance of $x$ is computed as

(7) $\mathrm{Var}[x] = \dfrac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)},$

and the differential entropy of $x$ is calculated as

(8) $h(x) = \ln \mathrm{B}(\alpha, \beta) - (\alpha-1)\psi(\alpha) - (\beta-1)\psi(\beta) + (\alpha+\beta-2)\psi(\alpha+\beta),$

where $\mathrm{B}(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the beta function and $\psi(\cdot)$ is the digamma function defined as $\psi(x) = \mathrm{d}\ln\Gamma(x)/\mathrm{d}x$.
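Both selection criteria have closed forms for the beta distribution: $\mathrm{Var}[x] = \alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ and $h(x) = \ln\mathrm{B}(\alpha,\beta) - (\alpha-1)\psi(\alpha) - (\beta-1)\psi(\beta) + (\alpha+\beta-2)\psi(\alpha+\beta)$. A NumPy-only sketch, with a small hand-rolled digamma (recurrence plus asymptotic series) so no extra libraries are assumed, and a Monte-Carlo check of the variance formula:

```python
import numpy as np
from math import lgamma, log

def digamma(x):
    """Digamma psi(x) = d/dx ln Gamma(x), via the recurrence
    psi(x) = psi(x+1) - 1/x and an asymptotic series for large x."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

def beta_entropy(a, b):
    """Differential entropy of Beta(a, b)."""
    ln_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (ln_B - (a - 1) * digamma(a) - (b - 1) * digamma(b)
            + (a + b - 2) * digamma(a + b))

# Monte-Carlo check of the variance formula for Beta(2, 5).
rng = np.random.default_rng(4)
s = rng.beta(2.0, 5.0, size=200000)
print(beta_variance(2.0, 5.0), s.var(), beta_entropy(2.0, 5.0))
```

As a sanity check, Beta(1, 1) is the uniform distribution, whose differential entropy is zero.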
In the following, we use both of the above-mentioned criteria to select the dimensions with the largest variances or differential entropies.
3.4 Multivariate Beta Distribution-based MAP Classifier
According to the above procedure, a set of selected dimensions is obtained. As the data in each dimension is assumed to be beta distributed and the dimensions are mutually independent, we can model the underlying distribution of the selected vector variable $\tilde{\mathbf{u}}$, taken from one recording channel, by a multivariate beta distribution (mvBeta) as

(9) $f(\tilde{\mathbf{u}}) = \prod_{k=1}^{\tilde{K}} \mathrm{Beta}(\tilde{u}_k; a_k, b_k),$

where $\tilde{K}$ is the number of selected dimensions.
Similarly, for the recordings from the top $L$ channels, there are $L\tilde{K}$ dimensions selected in total. Therefore, these dimensions are modeled jointly as

(10) $f(\tilde{\mathbf{u}}_1, \ldots, \tilde{\mathbf{u}}_L) = \prod_{l=1}^{L} \prod_{k=1}^{\tilde{K}} \mathrm{Beta}(\tilde{u}_{l,k}; a_{l,k}, b_{l,k}),$

where $\tilde{u}_{l,k}$ denotes the $k$th selected dimension from the $l$th channel.
The BCI competition III data contains two classes, with label index $c \in \{1, 2\}$. Since the parameters in the beta distributions are known according to Algorithm 2, a class-dependent mvBeta distribution can be obtained for each class. In the test procedure, we create a maximum a posteriori (MAP) classifier with the above obtained models. In each recording channel, for the vector from a test trial, we first transform it with Algorithm 1 and then select the dimensions via the dimensions' variance/entropy. Finally, a decision based on the selected features from the top $L$ recording channels is made as

(11) $\hat{c} = \arg\max_{c \in \{1,2\}} \; p(c) \prod_{l=1}^{L} f(\tilde{\mathbf{u}}_l \mid c),$

where $p(c)$ is the prior probability of class $c$ and $f(\tilde{\mathbf{u}}_l \mid c)$ is the class-conditional mvBeta PDF of the selected features from channel $l$.
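The MAP rule with a multivariate beta model can be sketched as follows. Because the selected features are mutually independent, each class-conditional log-likelihood is a sum of beta log-PDFs, and the prior enters additively in the log domain. The function names, parameters, and toy values are hypothetical.

```python
import numpy as np
from math import lgamma, log

def beta_logpdf(u, a, b):
    """Log-PDF of Beta(a, b) at u in (0, 1)."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * log(u) + (b - 1) * log(1 - u))

def map_classify(u, params, priors):
    """MAP decision with a multivariate beta model.

    u      : 1-D array of selected (decorrelated, independent) features
    params : dict class -> list of (a, b) pairs, one per feature
    priors : dict class -> prior probability
    """
    scores = {c: log(priors[c]) + sum(beta_logpdf(x, a, b)
                                      for x, (a, b) in zip(u, params[c]))
              for c in params}
    return max(scores, key=scores.get)

# Toy two-class example: class 0 expects (low, high), class 1 (high, low).
params = {0: [(2.0, 5.0), (5.0, 2.0)], 1: [(5.0, 2.0), (2.0, 5.0)]}
priors = {0: 0.5, 1: 0.5}
print(map_classify(np.array([0.2, 0.8]), params, priors))
```

With multiple channels, the per-channel log-likelihoods are simply added before taking the argmax.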
4 Experimental Results and Discussions
We evaluated the performance of the proposed feature selection strategy with the mvBeta distribution-based classifier on the BCI competition III database and compared it with the SVM-based classifier, the recently proposed super-Dirichlet mixture model (sDMM)-based method, and the PCA-based classifier. The DWT was calculated using the Matlab wavedec function with the decomposition level equal to , followed by the marginalization described in (2). According to Tab. 1, the best channels were selected based on their FR or CR ranks.
Classifier settings and implementations:

- The mvBeta-based classifier was implemented according to the description in Section 3.4. Feature selection was carried out within each channel.
- LIBSVM Chang2011 () was used to implement the SVM-based classifier, which used a Gaussian kernel function and a soft margin parameter. No feature selection was applied for the SVM-based classifier.
- The sDMM-based classifier was implemented based on the method described in Ma2012 (). Neither a decorrelation strategy nor feature selection was applied for the sDMM.
- The PCA-based classifier was implemented with the standard PCA method. Within each channel, PCA was applied to decorrelate the data and features were selected according to their variances. A Gaussian mixture model was applied to model the distribution of the selected features.
All the above-mentioned classifiers were trained and evaluated based on the mDWT coefficients collected from the best channels.
4.1 Classification Accuracy without Feature Selection
In order to validate the nonlinear decorrelation strategy, we first evaluated the mvBeta distribution-based classifier without feature selection. In this case, the proposed classifier should perform the same as the one used in Ma2012 (), as no information is added or lost during the nonlinear transformation. As expected, the experimental results show identical performance to that reported in Ma2012 (), where the sDMM-based classifier was employed. The highest classification accuracy is the same for both cases.
4.2 Classification Accuracy with Feature Selection
The total dimension of the mDWT vector is five for each recording channel, which corresponds to four degrees of freedom. Hence, after decorrelation (both with PNT and with PCA), the obtained vectors are four-dimensional. In order to evaluate the mvBeta distribution-based classifier with the feature selection strategy proposed in Sec. 3.3, we evaluated different numbers of retained dimensions.
It can be observed that for the FR case (Fig. 1(a), 1(c), and 1(e)), the best performance of the mvBeta distribution-based classifier with feature selection is reached with fewer channels. This classification rate is the same as that obtained by the sDMM/mvBeta distribution (without feature selection)-based classifiers; the only difference is that the best performance occurs with more channels in the latter classifiers. For the PCA-based classifier, the best performance appears at a different setting. When investigating the CR case (Fig. 1(b), 1(d), and 1(f)), it can be observed that the mvBeta distribution-based classifier performs better than the sDMM/mvBeta distribution (without feature selection)-based classifiers, and the highest classification rate is reached at several settings. This fact supports our motivation that removing redundant features can improve the classification performance. Retaining too few dimensions does not work well, because too many dimensions have been removed and key information is lost. For the PCA-based classifier, feature selection does not help in improving the classification accuracy.
[Table 2: Best performance, mean accuracy, and standard deviation of the mvBeta/sDMM-, mvBeta-, PCA-, and SVM-based classifiers, under both Fisher-ratio and classification-rate channel selection.]
4.3 Discussion
In general, the nonlinear decorrelation strategy for the neutral vector works well in EEG signal classification, both with and without feature selection. This verifies the effectiveness of the nonlinear decorrelation strategy.
When compared with the SVM-based classifier Prasad2011 (), the recently proposed sDMM-based classifier Ma2012 (), and the PCA-based classifier, the feature selection strategy proposed in this paper indeed improves the classification results. A summary of the comparisons is listed in Tab. 2.
For the FR case, the mvBeta distribution-based classifier (with feature selection) and the sDMM-based classifier have the same highest accuracy. However, the latter needs to involve more channels while the former obtains the same classification rate with fewer channels, which indicates that the latter method has higher complexity. Compared with the best PCA-based classifier, the mvBeta distribution-based classifier improves the classification rate, and the mean accuracy is improved as well. For the CR case, the mvBeta distribution-based classifier outperforms both the sDMM-based classifier and the PCA-based classifier. Similar to the FR case, the mvBeta distribution-based classifier requires fewer channels. Moreover, when comparing the mean classification rate and the standard deviation, the mvBeta distribution-based classifier is more reliable and stable than all the other methods.
To further test the statistical significance of the classification accuracies, we also applied Student's t-test to analyze the results. The p-values of the null hypothesis that the two compared methods perform similarly are listed in Tab. 3. All the p-values are far smaller than the significance level and, therefore, the null hypotheses are rejected. This means that the proposed mvBeta distribution-based method indeed improves the classification accuracy.
[Table 3: p-values of Student's t-test for the null hypotheses that the mvBeta-based classifier performs similarly to the SVM- and PCA-based classifiers, under both Fisher-ratio and classification-rate channel selection.]
5 Conclusions and future work
In order to optimally remove the correlation among the feature dimensions and thus improve the classification accuracy, a parallel nonlinear transformation strategy was applied to decorrelate the negatively correlated neutral vector. Specifically, when the neutral vector is Dirichlet distributed, the obtained decorrelated scalar variables are mutually independent and each of them is beta distributed. After decorrelation, we applied the variance and the differential entropy as criteria in feature selection. The proposed feature selection strategy with nonlinear transformation has been employed in EEG signal classification. Experimental results demonstrate that the classifier based on the selected features performs better and is more stable than the SVM-based classifier, the recently proposed sDMM-based classifier, and the PCA-based classifier.
There are many possible ways to improve the classification accuracy in future work. In the current work, feature selection is conducted for each channel independently. If a proper feature selection strategy is applied jointly across the best channels, further improvement of the classification accuracy can be expected. Moreover, there exist other features, e.g., Fourier features, that can be used for EEG classification. Although the Fourier features do not naturally fit the Dirichlet assumption, a proper normalization strategy can be applied to make the features neutral. Since Fourier features are more intuitive, classification accuracy improvement with normalized neutral Fourier features can also be expected.
6 Acknowledgements
The authors would like to thank the reviewers for their fruitful suggestions. Also, the authors would like to thank Dr. JingHao Xue for his kind discussions and suggestions.
This work was partly supported by the National Natural Science Foundation of China (NSFC) under grant No. and No. , the Scientific Research Foundation for Returned Scholars, Ministry of Education of China, Chinese program of Advanced Intelligence and Network Service under grant No. B, and EU FP IRSES MobileCloud Project (Grant No. ).
Footnotes
 A super-Dirichlet variable is obtained by cascading several Dirichlet variables.
 It is also suggested in other literature that frequency characteristics can be found in even higher frequency bands Leuthardt2004 (). We use this band-pass range, as suggested in Prasad2011 () and Ma2012 (), purely to keep the feature extraction settings consistent with previous work.
 The definition in Farina2007 () was unclear about processing the lowband data obtained at the last decomposition level. We use a different expression here to make it clearer.
 We have tried both the variance and the differential entropy criteria. For the BCI competition III dataset used in this paper, these two criteria yield exactly the same order of features.
References
 F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain-computer interfaces,” Journal of Neural Engineering, vol. 4, no. 2, p. R1, 2007.
 J. Chiang, Z. Wang, and M. McKeown, “A generalized multivariate autoregressive (GmAR)-based approach for EEG source connectivity analysis,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 453–465, Jan. 2012.
 K. C. Veluvolu, Y. Wang, and S. S. Kavuri, “Adaptive estimation of EEG rhythms for optimal band identification in BCI,” Journal of Neuroscience Methods, vol. 203, pp. 163–173, 2012.
 Y. Wang, K. C. Veluvolu, and M. Lee, “Time-frequency analysis of band-limited EEG with BMFLC and Kalman filter for BCI applications,” Journal of NeuroEngineering and Rehabilitation, vol. 10, 2013.
 S. Prasad, Z.-H. Tan, R. Prasad, A. F. Cabrera, Y. Gu, and K. Dremstrup, “Feature selection strategy for classification of single-trial EEG elicited by motor imagery,” in International Symposium on Wireless Personal Multimedia Communications (WPMC), Oct. 2011, pp. 1–4.
 W. D. Penny, S. J. Roberts, E. A. Curran, and M. J. Stokes, “EEG-based communication: A pattern recognition approach,” IEEE Transactions on Rehabilitation Engineering, vol. 8, no. 2, pp. 214–215, Jun. 2000.
 A. Subasi, “EEG signal classification using wavelet feature extraction and a mixture of expert model,” Expert Systems with Applications, vol. 32, no. 4, pp. 1084–1093, 2007.
 D. Farina, O. F. Nascimento, M. F. Lucas, and C. Doncarli, “Optimization of wavelets for classification of movement-related cortical potentials generated by variation of force-related parameters,” Journal of Neuroscience Methods, vol. 162, pp. 357–363, 2007.
 Z. Ma, Z.-H. Tan, and S. Prasad, “EEG signal classification with super-Dirichlet mixture model,” in Proceedings of IEEE Statistical Signal Processing Workshop, Aug. 2012, pp. 440–443.
 Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian estimation of Dirichlet mixture model with variational inference,” Pattern Recognition, vol. 47, no. 9, pp. 3143–3157, 2014.
 C.C. Chang and C.J. Lin, “LIBSVM: A library for support vector machines,” ACM Transaction on Intelligent System Technology, vol. 2, no. 3, pp. 27:1–27:27, May 2011.
 Z. Ma and A. Leijon, “Bayesian estimation of beta mixture models with variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160–2173, 2011.
 A. Subasi and M. I. Gursoy, “EEG signal classification using PCA, ICA, LDA and support vector machines,” Expert Systems with Applications, vol. 37, no. 12, pp. 8659–8666, 2010.
 J. Taghia, Z. Ma, and A. Leijon, “Bayesian estimation of the von Mises-Fisher mixture model with variational inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1701–1715, Sept. 2014.
 Z. Ma, A. Leijon, and W. B. Kleijn, “Vector quantization of LSF parameters with a mixture of Dirichlet distributions,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1777–1790, Sept. 2013.
 Z. Ma and A. Leijon, “Super-Dirichlet mixture models using differential line spectral frequencies for text-independent speaker identification,” in Proceedings of INTERSPEECH, 2011, pp. 2349–2352.
 R. J. Connor and J. E. Mosimann, “Concepts of independence for proportions with a generalization of the Dirichlet distribution,” Journal of the American Statistical Association, vol. 64, no. 325, pp. 194–206, 1969.
 Z. Ma and A. E. Teschendorff, “A variational Bayes beta mixture model for feature selection in DNA methylation studies,” Journal of Bioinformatics and Computational Biology, vol. 11, no. 4, 2013.
 P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic multiview depth image enhancement using variational inference,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 3, pp. 435–448, April 2015.
 I. R. James and J. E. Mosimann, “A new characterization of the Dirichlet distribution through neutrality,” The Annals of Statistics, vol. 8, no. 1, pp. 183–189, 1980.
 Z. Ma and A. Leijon, “PDF-optimized LSF vector quantization based on beta mixture models,” in Proceedings of INTERSPEECH, 2010.
 R. K. S. Hankin, “A generalization of the Dirichlet distribution,” Journal of Statistical Software, vol. 33, no. 11, pp. 1–18, 2010.
 Z. Ma, “Bayesian estimation of the Dirichlet distribution with expectation propagation,” in Proceedings of European Signal Processing Conference, 2012.
 C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection techniques in bioinformatics,” Bioinformatics, vol. 23, pp. 2507–2517, 2007.
 Z. Ma, A. Leijon, Z.-H. Tan, and S. Gao, “Predictive distribution of the Dirichlet mixture model by local variational inference,” Journal of Signal Processing Systems, vol. 74, no. 3, pp. 359–374, Mar. 2014.
 Z. Ma, S. Chatterjee, W. Kleijn, and J. Guo, “Dirichlet mixture modeling to estimate an empirical lower bound for LSF quantization,” Signal Processing, vol. 104, no. 11, pp. 291–295, Nov. 2014.
 Z. Ma, H. Li, Q. Sun, C. Wang, A. Yan, and F. Starfelt, “Statistical analysis of energy consumption patterns on the heat demand of buildings in district heating systems,” Energy and Buildings, vol. 85, pp. 464–472, Dec. 2014.
 Z. Ma, J. Taghia, W. B. Kleijn, A. Leijon, and J. Guo, “Line spectral frequencies modeling by a mixture of von Mises–Fisher distributions,” Signal Processing, vol. 114, pp. 219–224, Sept. 2015.
 J. Taghia and A. Leijon, “Variational inference for Watson mixture model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1886–1900, 2015.
 Z. Ma and A. Leijon, “Human skin color detection in RGB space with Bayesian estimation of beta mixture models,” in Proceedings of European Signal Processing Conference, 2010.
 Z. Ma and A. Leijon, “Human audio-visual consonant recognition analyzed with three bimodal integration models,” in Proceedings of INTERSPEECH, 2009.
 H. Yu, Z. Ma, M. Li, and J. Guo, “Histogram transform model using MFCC features for text-independent speaker identification,” in Proceedings of IEEE Asilomar Conference on Signals, Systems, and Computers, 2014.
 P. K. Rana, Z. Ma, J. Taghia, and M. Flierl, “Multiview depth map enhancement by variational Bayes inference estimation of Dirichlet mixture models,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
 Z. Ma, R. Martin, J. Guo, and H. Zhang, “Nonlinear estimation of missing ΔLSF parameters by a mixture of Dirichlet distributions,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2014.
 Z. Ma and A. Leijon, “A probabilistic principal component analysis based hidden Markov model for audio-visual speech recognition,” in Proceedings of IEEE Asilomar Conference on Signals, Systems, and Computers, 2008.
 ——, “Expectation propagation for estimating the parameters of the beta distribution,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010.
 N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, Dec. 2002.
 X. He, M. Ji, C. Zhang, and H. Bao, “A variance minimization criterion to feature selection using Laplacian regularization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 2013–2025, Oct. 2011.
 S. Zhu, D. Wang, K. Yu, T. Li, and Y. Gong, “Feature selection for gene expression using model-based entropy,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, pp. 25–36, Jan. 2010.
 Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian Estimation of Dirichlet Mixture Model with Variational Inference,” Pattern Recognition, vol. 47, no. 9, pp. 3143–3157, Sep 2014.
 “BCI competition III,” http://www.bbci.de/competition/iii.
 E. C. Leuthardt, G. Schalk, J. R. Wolpaw, J. G. Ojemann, and D. W. Moran, “A brain-computer interface using electrocorticographic signals in humans,” Journal of Neural Engineering, no. 1, pp. 63–71, 2004.
 Z. Ma, A. E. Teschendorff, H. Yu, J. Taghia, and J. Guo, “Comparisons of non-Gaussian statistical models in DNA methylation analysis,” International Journal of Molecular Sciences, vol. 15, pp. 10835–10854, 2014.
 K. Laurila, B. Oster, C. Andersen, P. Lamy, T. Orntoft, O. Yli-Harja, and C. Wiuf, “A beta-mixture model for dimensionality reduction, sample classification and analysis,” BMC Bioinformatics, 2011.
 T. Gandhi, B. K. Panigrahi, and S. Anand, “A comparative study of wavelet families for EEG signal classification,” Neurocomputing, vol. 74, no. 17, pp. 3051–3057, 2011.
 T. N. Lal, M. Schroder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, and B. Scholkopf, “Support vector channel selection in BCI,” IEEE Transactions on Biomedical Engineering, vol. 51, no. 6, pp. 1003–1010, Jun. 2004.
 W. Malina, “On an extended Fisher criterion for feature selection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-3, no. 5, pp. 611–614, Sep. 1981.
 Z. Ma, A. E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian matrix factorization for bounded support data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 4, pp. 876–889, 2015.
 Y. Chae, J. Jeong, and S. Jo, “Toward brain-actuated humanoid robots: asynchronous direct control using an EEG-based BCI,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1131–1144, Oct. 2012.
 Z. Ma, “Non-Gaussian statistical models and their applications,” Ph.D. dissertation, KTH Royal Institute of Technology, 2011.
 B. A. Frigyik, A. Kapila, and M. R. Gupta, “Introduction to the Dirichlet distribution and related processes,” Department of Electrical Engineering, University of Washington, Tech. Rep., 2010.
 H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
 V. Lawhern, W. Hairston, and K. Robbins, “Optimal feature selection for artifact classification in EEG time series,” in Foundations of Augmented Cognition, ser. Lecture Notes in Computer Science, D. Schmorrow and C. Fidopiastis, Eds. Springer Berlin Heidelberg, 2013, vol. 8027, pp. 326–334.