Linear classifier design under heteroscedasticity in Linear Discriminant Analysis
Under normality and homoscedasticity assumptions, Linear Discriminant Analysis (LDA) is known to be optimal in terms of minimising the Bayes error for binary classification. In the heteroscedastic case, LDA is not guaranteed to minimise this error. Assuming heteroscedasticity, we derive a linear classifier, the Gaussian Linear Discriminant (GLD), that directly minimises the Bayes error for binary classification. In addition, we propose a local neighbourhood search (LNS) algorithm to obtain a more robust classifier if the data is known to have a non-normal distribution. We evaluate the proposed classifiers on two artificial and ten real-world datasets that cut across a wide range of application areas, including handwriting recognition, medical diagnosis and remote sensing, and compare our algorithms against existing LDA approaches and other linear classifiers. The GLD is shown to outperform the original LDA procedure in terms of classification accuracy under heteroscedasticity. While it compares favourably with other existing heteroscedastic LDA approaches, the GLD requires substantially lower training time on some datasets. Our comparison with the support vector machine (SVM) also shows that the GLD, together with the LNS, requires substantially lower training time to achieve an equivalent classification accuracy on some of the datasets. Thus, our algorithms can provide a cheap and reliable option for classification in many expert systems.
In many applications one encounters the need to classify a given object into one of a number of distinct groups or classes based on a set of features known as the feature vector. A typical example is the task of classifying a machine part into one of a number of health states. Other applications that involve classification include face detection, object recognition, medical diagnosis, credit card fraud prediction and machine fault diagnosis.
A common treatment of such classification problems is to model the conditional density functions of the feature vector [?]. The most likely class to which a feature vector belongs can then be chosen as the class that maximises the a posteriori probability given the feature vector. This is known as the maximum a posteriori (MAP) decision rule.
Let $c$ be the number of classes, $C_i$ the $i$th class, $\mathbf{x} \in \mathbb{R}^d$ a feature vector, and $\mathcal{X}_i$ the set of training samples belonging to the $i$th class $C_i$. The MAP decision rule for the classification task is then to choose the most likely class of $\mathbf{x}$, given as $\hat{C} = \arg\max_{i} P(C_i \,|\, \mathbf{x})$.
We assume for the moment that there are only two classes, $C_1$ and $C_2$, i.e. binary classification (we consider multi-class classification in a later section). Then, using Bayes' rule, the two posterior probabilities can be expressed as:
$P(C_i \,|\, \mathbf{x}) = \dfrac{p(\mathbf{x} \,|\, C_i)\, P(C_i)}{p(\mathbf{x})}, \quad i = 1, 2.$
It is often the case that the prior probabilities $P(C_1)$ and $P(C_2)$ are known, or else they may be estimated from the relative frequencies of $C_1$ and $C_2$ in the training set. Let these priors be given by $\pi_1$ and $\pi_2$ for classes $C_1$ and $C_2$ respectively. Then, the likelihood ratio, defined as $\lambda(\mathbf{x}) = \dfrac{p(\mathbf{x}\,|\,C_1)}{p(\mathbf{x}\,|\,C_2)},$
is compared against a threshold $\tau = \pi_2/\pi_1$, so that one decides on class $C_1$ if $\lambda(\mathbf{x}) \ge \tau$ and class $C_2$ otherwise.
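As a concrete sketch, the likelihood-ratio test can be implemented as follows; the Gaussian class-conditional parameters and priors here are purely illustrative, not taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative class-conditional Gaussians and priors (hypothetical values).
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
cov = [np.eye(2), np.eye(2)]
pi1, pi2 = 0.5, 0.5

def classify(x):
    """Decide class 1 if the likelihood ratio p(x|C1)/p(x|C2)
    meets or exceeds the threshold pi2/pi1 (the MAP rule)."""
    ratio = (multivariate_normal.pdf(x, mu[0], cov[0])
             / multivariate_normal.pdf(x, mu[1], cov[1]))
    return 1 if ratio >= pi2 / pi1 else 2
```

With equal priors this reduces to choosing whichever class-conditional density is larger at $\mathbf{x}$.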
Linear Discriminant Analysis (LDA) proceeds from here with two basic assumptions [?]:
The conditional densities $p(\mathbf{x}\,|\,C_1)$ and $p(\mathbf{x}\,|\,C_2)$ are multivariate normal distributions.
The two classes have equal covariance matrices, an assumption known as homoscedasticity.
Let $\boldsymbol{\mu}_1$, $\boldsymbol{\Sigma}_1$ be the mean and covariance matrix of $C_1$, and $\boldsymbol{\mu}_2$, $\boldsymbol{\Sigma}_2$ the mean and covariance of $C_2$ respectively. Then, for $i \in \{1, 2\}$,
$p(\mathbf{x}\,|\,C_i) = \dfrac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right),$
where $d$ is the dimensionality of the feature space $\mathbb{R}^d$. Given the above definitions of the conditional probabilities, one may obtain a log-likelihood ratio given as:
$\ln \lambda(\mathbf{x}) = \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}_2^{-1} (\mathbf{x}-\boldsymbol{\mu}_2) - \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_1^{-1} (\mathbf{x}-\boldsymbol{\mu}_1) + \tfrac{1}{2}\ln\dfrac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|},$
which is then compared against $\ln\tau$, so that $C_1$ is chosen if $\ln\lambda(\mathbf{x}) \ge \ln\tau$, and $C_2$ otherwise. Thus, the decision rule for classifying a vector $\mathbf{x}$ under class $C_1$ can be rewritten as $\ln\lambda(\mathbf{x}) \ge \ln(\pi_2/\pi_1)$.
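The quadratic log-likelihood ratio above can be sketched numerically as follows (a minimal illustration with hypothetical inputs):

```python
import numpy as np

def quadratic_llr(x, mu1, S1, mu2, S2):
    """Log-likelihood ratio ln[p(x|C1)/p(x|C2)] for Gaussian class
    conditionals; compared against ln(pi2/pi1) to decide the class."""
    def log_gauss(x, mu, S):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)      # stable log-determinant
        return -0.5 * (logdet + d @ np.linalg.solve(S, d))
    return log_gauss(x, mu1, S1) - log_gauss(x, mu2, S2)
```

When $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ the quadratic terms in $\mathbf{x}$ cancel, which is exactly the linearisation discussed next.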
In general, this result is a quadratic discriminant. However, a linear classifier is often desired for the following reasons:
A linear classifier is robust against noise since it tends not to overfit [?].
A linear classifier has relatively shorter training and testing times [?].
Many linear classifiers allow for a transformation of the original feature space into a higher dimensional feature space using the kernel trick for better classification in the case of a non-linear decision boundary [?].
By calling on the assumption of homoscedasticity, i.e. $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$, the original quadratic discriminant given by (6) for classifying a given vector $\mathbf{x}$ decomposes into the linear decision rule $\mathbf{w}^T\mathbf{x} \ge w_0$ for class $C_1$.
Here, $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is the vector of weights and $w_0$ is the threshold. This linear classifier is also known as Fisher's Linear Discriminant. If only the weight vector $\mathbf{w}$ is required for dimensionality reduction, $\mathbf{w}$ may be obtained by maximising Fisher's criterion [?], given by:
$J(\mathbf{w}) = \dfrac{\left(\mathbf{w}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\right)^2}{\mathbf{w}^T(n_1\boldsymbol{\Sigma}_1 + n_2\boldsymbol{\Sigma}_2)\,\mathbf{w}},$
where $n_1$ and $n_2$ are the cardinalities of $C_1$ and $C_2$ respectively.
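A minimal sketch of Fisher's Linear Discriminant under the homoscedasticity assumption, estimating a pooled covariance from two sample matrices (the equal-priors threshold used here is an illustrative choice):

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Fisher's Linear Discriminant from two sample matrices (rows = samples),
    using the pooled within-class covariance (homoscedasticity assumption)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled (maximum-likelihood) within-class covariance.
    S = (n1 * np.cov(X1, rowvar=False, bias=True)
         + n2 * np.cov(X2, rowvar=False, bias=True)) / (n1 + n2)
    w = np.linalg.solve(S, mu1 - mu2)   # weight vector, Sigma^{-1}(mu1 - mu2)
    w0 = w @ (mu1 + mu2) / 2.0          # threshold, assuming equal priors
    return w, w0
```

A point is assigned to class 1 when `w @ x >= w0`.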
LDA is the optimal Bayes classifier for binary classification if the normality and homoscedasticity assumptions hold [?] [?]. It demands only the computation of the dot product between $\mathbf{w}$ and $\mathbf{x}$, which is a relatively computationally inexpensive operation.
As a supervised learning algorithm, LDA is performed either for dimensionality reduction (usually followed by classification) [?], [?], or directly for the purpose of statistical classification [?] [?]. LDA has been applied to several problems such as medical diagnosis e.g. [?], face and object recognition e.g. [?] and credit card fraud prediction e.g. [?]. The widespread use of LDA in these areas is not because the datasets necessarily satisfy the normality and homoscedasticity assumptions, but mainly due to the robustness of LDA against noise, being a linear model [?]. Since the linear Support Vector Machine (SVM) can be quite expensive to train, especially for a large number of samples $n$ or a high dimensionality $d$, LDA is often relied upon [?].
Yet, practical implementation of LDA is not without problems. Of note is the small sample size (SSS) problem that LDA faces with high-dimensional data and a much smaller number of training samples [?]. When $n < d$, the scatter matrix is not invertible, as it is not full-rank. Since the decision rule as given by (7) requires the computation of the inverse of the scatter matrix, its singularity makes the solution infeasible. In works by, for example, [?], this problem is overcome by taking the Moore-Penrose pseudo-inverse of the scatter matrix, rather than the ordinary matrix inverse. [?] use a gradient descent approach where one starts from an initial solution and moves in the negative direction of the gradient of Fisher's criterion (8); this method avoids the computation of an inverse altogether. Another approach to solving the SSS problem involves adding a scalar multiple of the identity matrix to the scatter matrix to make the resulting matrix non-singular, a method known as regularised discriminant analysis [?].
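The pseudo-inverse remedy for the small sample size problem can be illustrated as follows (toy sizes chosen purely for the sketch):

```python
import numpy as np

# Toy small-sample-size (SSS) setting with hypothetical sizes: fewer
# samples than dimensions, so the scatter matrix is rank-deficient.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))        # 5 samples, 10 features
S = np.cov(X, rowvar=False, bias=True)  # 10x10 scatter, rank at most 4

# The ordinary inverse does not exist, but the Moore-Penrose
# pseudo-inverse does, and can replace it in the decision rule.
S_pinv = np.linalg.pinv(S)
```

The pseudo-inverse coincides with the ordinary inverse whenever the scatter matrix is full-rank, so it is a safe drop-in replacement.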
However, for a given dataset that does not satisfy the homoscedasticity or normality assumption, one would expect that modifications to the original LDA procedure accounting for these violations would yield an improved performance. One such modification, in the case of a non-normal distribution, is the mixture discriminant analysis [?] in which a non-normal distribution is modelled as a mixture of Gaussians. However, the parameters of the mixture components or even the number of mixture components, are usually not known a priori. Other non-parametric approaches to LDA that remove the normality assumption involve using local neighbourhood structures [?] to construct a similarity matrix instead of the scatter matrix used in LDA. However, these approaches aim at linear dimensionality reduction, rather than linear classification. Another modification, in the case of a non-linear decision boundary between and , is the Kernel Fisher Discriminant (KFD) [?]. KFD maps the original feature space into some other space (usually higher dimensional) via the kernel trick [?]. While the main utility of the kernel is to guarantee linear separability in the transformed space, the kernel may also be employed to transform non-normal data into one that is near-normal.
Our proposed method differs from the above approaches in that we primarily consider violation of the homoscedasticity assumption, and do not address the SSS problem. We seek to provide a linear approximation to the quadratic boundary given by (6) under heteroscedasticity without any kernel transformation; we note that several heteroscedastic LDA approaches have been proposed to this effect. Nevertheless, for reasons which we highlight in the next section, our contributions in this paper are stated explicitly as follows:
We propose a novel linear classifier, which we term the Gaussian Linear Discriminant (GLD), that directly minimises the Bayes error under heteroscedasticity via an efficient optimisation procedure. This is presented in Section 3.
We propose a local neighbourhood search method to provide a more robust classifier if the data has a non-normal distribution (Section 4).
Under the heteroscedasticity assumption, many LDA approaches have been proposed among which we mention [?],[?]. As it is known that Fisher’s criterion (whose maximisation is equivalent to the LDA derivation described in the Introduction section) only takes into account the difference in the projected class means, existing heteroscedastic LDA approaches tend to obtain a generalisation on Fisher’s criterion. In the work of [?], for instance, a directed distance matrix (DDM) known as the Chernoff distance, which takes into account the difference in covariance matrices between the two classes as well as the projected class means, is maximised instead of Fisher’s criterion (8). The same idea employing the Chernoff criterion is used by [?]. A wider class of Bregman divergences including the Bhattacharya distance [?] and the Kullback-Leibler divergence [?] have also been used for heteroscedastic LDA, as Fisher’s criterion can be considered a special case of these measures when the covariance matrices of the classes are equal.
However, most of these approaches aim at linear dimensionality reduction, which involves finding a linear transformation that maps the original data into one of reduced dimensionality while maximising the discriminatory information between the classes. Our focus in this paper is not on dimensionality reduction, but on obtaining a Bayes optimal linear classifier for binary classification assuming that the covariance matrices are not equal. As far as we know, the closest work to ours in this regard is the work of [?].
Obtaining the Bayes optimal linear classifier involves minimising the probability of misclassification as given by:
$P(\text{error}) = \pi_1 P(\mathbf{w}^T\mathbf{x} < w_0 \,|\, C_1) + \pi_2 P(\mathbf{w}^T\mathbf{x} \ge w_0 \,|\, C_2),$
where $\mathbf{w}$ and $w_0$ define the linear decision rule. Unfortunately, there is no closed-form solution to the minimisation of (9) [?]. Thus, an iterative procedure is inevitable in order to obtain the Bayes optimal linear classifier.
In the work of [?], for example, the iterative procedure described is to solve for $\mathbf{w}$ and $w_0$ as given by (10), by obtaining the optimal values of the two free parameters via systematic trial and error. We denote this heteroscedastic LDA procedure by R-HLD-2, for the reason that its two parameters are chosen at random.
[?] makes the observation that if the weight vector $\mathbf{w}$ and the threshold $w_0$ are both multiplied by the same positive scalar, the decision boundary remains unchanged. Therefore, by an appropriate scaling of (10), $\mathbf{w}$ and $w_0$ can be put in a form involving only a single free parameter:
Still, the optimal value of this parameter has to be chosen by systematic trial and error. We denote this heteroscedastic LDA approach by R-HLD-1, for the reason that only one parameter is chosen at random. As we show in the next section, this parameter is unbounded. The difficulty faced by this approach is therefore that the parameter has to be chosen from an unbounded interval, so that the probability of finding its optimal value for a given dataset is low without extensive trial and error to limit the search to some finite interval.
To avoid the unguided trial and error procedure in [?], [?] and [?] propose a theoretical approach described below:
Vary the parameter $t$ from $0$ to $1$ in small step increments $\Delta t$.
For each $t$, evaluate the corresponding weight vector $\mathbf{w}$.
Evaluate the corresponding threshold $w_0$.
Compute the probability of misclassification as given by (9).
Choose the $\mathbf{w}$ and $w_0$ that minimise the probability of misclassification.
We refer to this procedure as C-HLD, for the reason that the parameter $t$ is constrained to the interval $[0, 1]$.
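The grid search above can be sketched as follows. We assume the convex-combination form of the weight vector popularised by Anderson and Bahadur, and, since the paper's closed-form threshold expression is not reproduced here, pick the threshold for each candidate by a one-dimensional numerical search:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def gaussian_bayes_error(w, w0, mu1, S1, mu2, S2, p1, p2):
    """Misclassification probability of the rule 'class 1 if w.x >= w0'
    when each class projection is Gaussian."""
    m1, s1 = w @ mu1, np.sqrt(w @ S1 @ w)
    m2, s2 = w @ mu2, np.sqrt(w @ S2 @ w)
    return p1 * norm.cdf((w0 - m1) / s1) + p2 * norm.sf((w0 - m2) / s2)

def c_hld(mu1, S1, mu2, S2, p1=0.5, p2=0.5, dt=0.01):
    """Grid search over t in (0, 1): a sketch of the constrained procedure,
    assuming the Anderson-Bahadur convex-combination form for w."""
    best = None
    for t in np.arange(dt, 1.0, dt):
        w = np.linalg.solve(t * S1 + (1 - t) * S2, mu1 - mu2)
        res = minimize_scalar(
            lambda w0: gaussian_bayes_error(w, w0, mu1, S1, mu2, S2, p1, p2))
        if best is None or res.fun < best[0]:
            best = (res.fun, w, res.x)
    return best  # (error, w, w0)
```

Each grid point costs a matrix inversion, which is exactly the expense the paper's step-size discussion refers to.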
However, we highlight two main problems with the above C-HLD procedure:
There is no obvious choice of the step size $\Delta t$. Too small a value of $\Delta t$ will demand too many matrix inversions in Step 2, as there will be too many values of $t$. On the other hand, if $\Delta t$ is too large, the optimal $t$ may not be refined enough, and the obtained solution may not be optimal: a small change in $t$ produces a change in $\mathbf{w}$ that can affect the classification accuracy.
The solution obtained this way is only locally optimal, as $t$ is bounded in the interval $[0, 1]$. As we show in the next section, the underlying parameter is actually unbounded. When there is a class imbalance [?], the optimal solution may lie outside this interval, which can lead to poor classification accuracy.
Our proposed algorithm, which is described in the next section, unlike the trial-and-error approach of [?], has a principled optimisation procedure, and unlike [?] neither encounters the problem of choosing an inappropriate step size nor restricts the solution to the interval $[0, 1]$. Consequently, our proposed algorithm achieves a far lower training time than the C-HLD, R-HLD-1 and R-HLD-2, for roughly the same classification accuracy.
3 Gaussian Linear Discriminant
Let $\mathbf{w}$ be a vector of weights and $w_0$ a threshold, such that class $C_1$ is decided if $\mathbf{w}^T\mathbf{x} \ge w_0$, and $C_2$ otherwise.
Since $\mathbf{x}$ is assumed to have a multivariate normal distribution in classes $C_1$ and $C_2$, the projection $\mathbf{w}^T\mathbf{x}$ is univariate normal in each class, with mean $m_1 = \mathbf{w}^T\boldsymbol{\mu}_1$ and variance $s_1^2 = \mathbf{w}^T\boldsymbol{\Sigma}_1\mathbf{w}$ for class $C_1$, and mean $m_2 = \mathbf{w}^T\boldsymbol{\mu}_2$ and variance $s_2^2 = \mathbf{w}^T\boldsymbol{\Sigma}_2\mathbf{w}$ for class $C_2$.
With reference to the Bayes error of (9), the individual misclassification probabilities (with $m_i = \mathbf{w}^T\boldsymbol{\mu}_i$ and $s_i^2 = \mathbf{w}^T\boldsymbol{\Sigma}_i\mathbf{w}$) can be expressed as $P(\mathbf{w}^T\mathbf{x} < w_0 \,|\, C_1) = Q\!\left(\frac{m_1 - w_0}{s_1}\right)$ and $P(\mathbf{w}^T\mathbf{x} \ge w_0 \,|\, C_2) = Q\!\left(\frac{w_0 - m_2}{s_2}\right)$, where $Q(\cdot)$ is the Q-function, i.e. the tail probability of the standard normal distribution. Therefore, the Bayes error to be minimised may be rewritten as:
$e(\mathbf{w}, w_0) = \pi_1 Q\!\left(\frac{m_1 - w_0}{s_1}\right) + \pi_2 Q\!\left(\frac{w_0 - m_2}{s_2}\right).$
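The Bayes error above can be evaluated directly; the Q-function $Q(x) = 1 - \Phi(x)$ is available in SciPy as `norm.sf`. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def bayes_error(w, w0, mu1, S1, mu2, S2, p1, p2):
    """Probability of misclassification for the linear rule
    'choose class 1 if w.x >= w0', with Gaussian class projections.
    Q(x) = 1 - Phi(x) is norm.sf."""
    m1, s1 = w @ mu1, np.sqrt(w @ S1 @ w)   # projected mean/std, class 1
    m2, s2 = w @ mu2, np.sqrt(w @ S2 @ w)   # projected mean/std, class 2
    # Class 1 is misclassified when w.x < w0; class 2 when w.x >= w0.
    return p1 * norm.sf((m1 - w0) / s1) + p2 * norm.sf((w0 - m2) / s2)
```

This is the objective that the GLD minimises over $(\mathbf{w}, w_0)$.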
Our aim is to find a local minimum of $e(\mathbf{w}, w_0)$. A necessary condition is for the gradient of $e$ with respect to $\mathbf{w}$ and $w_0$ to be zero, i.e.,
From (9), it can be shown that:
From (20), however, we obtain the following:
It can similarly be shown from (9) that,
Again, from (20),
Now, equating the gradient to zero, the following set of equations is obtained:
Substituting (29) into (28) yields:
Then the vector can be given by:
It will be noted, however, that (31) is still in terms of $w_0$, so that an explicit representation of $w_0$ in terms of $\mathbf{w}$ is needed from (29) to substitute into (31). This is where our approach most significantly differs from [?]. Solving for $w_0$ from (29) results in the following quadratic:
which can be simplified to:
where the constant term is given as before. If the leading coefficient is defined and not equal to zero, i.e. $s_1 \ne s_2$ (as holds for heteroscedastic LDA, where $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$), (33) can be shown to have the following solutions:
Nevertheless, since there are two solutions for $w_0$ in (34), a choice has to be made as to which of them is substituted into (31). To eliminate one of the solutions, we consider the second-order partial derivative of $e$ with respect to $w_0$, evaluated at $w_0$ as given by (34), and determine under what condition it is greater than or equal to zero. This is a second-order necessary condition for $w_0$ to be a local minimum. From (27), it can be shown that:
We denote this second-order derivative by . We then consider all possibilities of and (which are the variables in (35) that depend on ) under three cases, and analyse the sign of in each.
and : then is trivially non-positive.
and : then is trivially non-negative.
and or and : then is non-negative if and only if
It will be noted that the right-hand side of the inequality (37) is identically zero, as can be seen from (32). Therefore, the condition under which is greater than or equal to zero is when:
Note also that Case 2 necessarily satisfies (38) so that we consider (38) as the general inequality for the non-negativity of for all cases, and thus for to be a local minimum.
Now, when one considers the two solutions of in (35), only the solution given by:
satisfies the inequality of (38), i.e., only this choice of $w_0$ corresponds to a local minimum. The proof of this is given in the appendix.
We may then substitute this expression for $w_0$ into (31), so that (31) is in terms of $\mathbf{w}$ only. Even so, $\mathbf{w}$ has to be solved for iteratively, because (31) has no closed-form solution: the projected statistics are themselves functions of $\mathbf{w}$. As the iterative procedure requires an initial choice of $\mathbf{w}$, we use Fisher's weight vector,
$\mathbf{w}_F = (n_1\boldsymbol{\Sigma}_1 + n_2\boldsymbol{\Sigma}_2)^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),$
as our initial solution, where $n_1$ and $n_2$ are again the cardinalities of $C_1$ and $C_2$. After a number of such iterative updates, the optimal $w_0$ is then solved for from (39). This algorithm, known as the Gaussian Linear Discriminant (GLD), is described in detail in Algorithm 1.
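As an illustration of the underlying objective only (not the closed-form fixed-point updates of Algorithm 1), one can minimise the projected Bayes error numerically from Fisher's starting point with a generic optimiser:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def gld_like(mu1, S1, mu2, S2, p1=0.5, p2=0.5):
    """Sketch of the GLD idea: minimise the Gaussian Bayes error over
    (w, w0), starting from a Fisher-style solution. Algorithm 1 in the
    paper uses closed-form iterative updates rather than this optimiser."""
    def err(z):
        w, w0 = z[:-1], z[-1]
        s1 = np.sqrt(w @ S1 @ w)
        s2 = np.sqrt(w @ S2 @ w)
        return (p1 * norm.sf((w @ mu1 - w0) / s1)
                + p2 * norm.sf((w0 - w @ mu2) / s2))
    w_f = np.linalg.solve(p1 * S1 + p2 * S2, mu1 - mu2)  # Fisher-style start
    z0 = np.append(w_f, w_f @ (mu1 + mu2) / 2.0)
    res = minimize(err, z0, method="Nelder-Mead")
    return res.x[:-1], res.x[-1], res.fun
```

In the homoscedastic case this recovers the LDA solution, since Fisher's starting point is already optimal there.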
Note that by multiplying both the $\mathbf{w}$ of (31) and $w_0$ proportionally by a scalar that is non-negative due to (38), so that the discrimination criterion given by (15) is not changed, the GLD may be viewed in terms of the optimal solution of (11), where
which is unbounded given the inequality of (38). However, unlike [?], this parameter is not chosen by systematic trial and error, and unlike [?], it is not varied between fixed limits at small step increments. Instead, since it is a function of $\mathbf{w}$ and $w_0$, our algorithm may be interpreted as obtaining increasingly refined values of the parameter by improving upon $\mathbf{w}$ and $w_0$, starting from Fisher's solution, as described in Algorithm 1.
The GLD algorithm may be terminated under any of the following conditions:
When the change in the objective function $e$ remains within a certain tolerance for a number of consecutive iterations.
When the change in the norm of $\mathbf{w}$ remains within a certain tolerance for a number of consecutive iterations.
When the gradient of $e$ as given by (21) remains within a certain tolerance for a number of consecutive iterations.
After a fixed maximum number of iterations, if convergence is slow.
At the end of the algorithm, the final solution may be chosen either as the solution to which the iterations converge, or the solution corresponding to the minimum found in the iterative updates.
Suppose now that there are $c > 2$ classes in the dataset. The classification problem may then be reduced to a number of binary classification problems. The two main approaches usually taken for this reduction are the One-vs-All (OvA) and One-vs-One (OvO) strategies [?].
In OvA, one trains a classifier to discriminate between one class and all other classes, so that there are $c$ different classifiers. An unknown vector $\mathbf{x}$ is then tested on all $c$ classifiers, and the class corresponding to the classifier with the highest discriminant score is chosen. With respect to the proposed GLD algorithm, however, this is an ill-suited approach: the collection of all other classes on one side of the discriminant will not necessarily have a normal distribution, and could in fact be multimodal if the class means are well separated. Since our algorithm is built on strong normality assumptions about the data on each side of the discriminant, the GLD, as formulated, is expected to perform poorly.
In OvO, a classifier is trained to discriminate between every pair of classes in the dataset, ignoring the other classes, so that there are $c(c-1)/2$ unique classifiers. Again, an unknown vector $\mathbf{x}$ is tested on all classifiers. The predicted classes of all the classifiers are then tallied, and the class that occurs most frequently is chosen; this is equivalent to a majority-vote decision. In many cases, however, there is no clear-cut winner, as more than one class may receive the highest number of votes. In such a case, the most likely class is often chosen randomly from among those most frequently occurring classes. The GLD provides a more appropriate means of breaking such ties, by making use of the minimised Bayes error of each classifier. Specifically, one may instead use a weighted voting system, in which the vote of every pairwise classifier is weighted by the complement of its minimised Bayes error, since this error provides an appropriate measure of the uncertainty associated with each classifier's output. The decision rule then reduces to choosing the class with the maximum weighted vote.
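The weighted OvO vote can be sketched as follows; the tuple layout and the $(1 - e)$ weighting are illustrative assumptions:

```python
def ovo_weighted_vote(x, classifiers):
    """Weighted One-vs-One voting. `classifiers` is a list of tuples
    (class_a, class_b, decide, bayes_err): decide(x) returns the winning
    class of the pair, and its vote is weighted by (1 - bayes_err)."""
    votes = {}
    for a, b, decide, e in classifiers:
        winner = decide(x)
        votes[winner] = votes.get(winner, 0.0) + (1.0 - e)
    return max(votes, key=votes.get)
```

A three-way tie under plain counting (each class winning one pairwise contest) is broken in favour of the classifier with the smallest Bayes error.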
Note that even though the GLD minimises the Bayes error for each classifier, the overall Bayes error for a multiclass problem may not be minimised by using multiple binary classifiers.
So far, the fundamental assumption used to derive the GLD is that the data in each class has a normal distribution. Thus, for an unknown non-normal distribution, the linear classifier we have obtained does not minimise the Bayes error for that distribution. We argue, however, that if this unknown distribution is nearly normal [?], then a more robust linear classifier may be found in some neighbourhood of the GLD. For this reason, we use a local neighbourhood search algorithm to explore the region around the GLD solution and obtain the classifier that minimises the number of misclassifications on the training dataset. We do this by perturbing each element of the optimal $\mathbf{w}$ obtained from the GLD procedure by a small amount. After every perturbation, the resulting classifier is evaluated on the training dataset. This procedure is repeated as described in Algorithm 2 until the stopping criterion is satisfied.
The algorithm is terminated after a certain maximum number of iterations is reached. Additionally, one may terminate early if, after a predefined number of iterations, there is no improvement in the minimum number of misclassifications found on the training dataset.
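A minimal sketch of the perturbation search; the parameter names and the random perturbation schedule here are illustrative, and Algorithm 2 specifies the exact procedure:

```python
import numpy as np

def local_neighbourhood_search(w, w0, X, y, delta=0.1,
                               max_iter=1000, patience=100, seed=0):
    """Perturb each element of the (w, w0) solution by a small fraction
    of its magnitude, keeping any perturbation that reduces the number
    of training misclassifications (a sketch of the LNS idea)."""
    rng = np.random.default_rng(seed)
    z = np.append(w, w0)

    def n_errors(z):
        pred = (X @ z[:-1] >= z[-1])   # class 1 iff w.x >= w0
        return int(np.sum(pred != y))

    best, stall = n_errors(z), 0
    for _ in range(max_iter):
        i = rng.integers(len(z))
        cand = z.copy()
        cand[i] += rng.choice([-1.0, 1.0]) * delta * abs(cand[i])
        e = n_errors(cand)
        if e < best:
            z, best, stall = cand, e, 0
        else:
            stall += 1
            if stall >= patience:      # early termination on no improvement
                break
    return z[:-1], z[-1], best
```

Since every candidate requires a full pass over the training data, a good starting point such as the GLD keeps the number of useful iterations small.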
We validate our proposed algorithms on two artificial datasets denoted by D1 and D2, as well as on ten real-world datasets taken from the University of California, Irvine (UCI) Machine Learning Repository. These datasets are shown in Table 1, and cut across a wide range of applications including handwriting recognition, medical diagnosis, remote sensing and spam filtering. D1 and D2 are normally distributed with different covariance matrices between the two classes; for each of them, we generate samples for classes $C_1$ and $C_2$ from the corresponding Gaussian parameters.
The above Gaussian parameters are slightly modified from the two class data used by [?] and [?] in order to make the sample means less separated.
Table 1 lists the datasets; among them are Image Segmentation (Statlog), labelled (g), and Wine Quality (White), labelled (i).
For each dataset in Table 1, we perform $k$-fold cross-validation over a number of independent trials. On each training dataset, we evaluate the minimum Bayes error achievable by our proposed algorithm, averaged over all folds and trials. If there are more than two classes, we use OvO, and calculate the mean Bayes error over all discriminants. As we are interested only in linear classification, we compare the performance of the GLD with the original LDA as well as the heteroscedastic LDA procedures of [?], [?] and [?] described in Section 2, in terms of the Bayes error (9). For brevity, we denote these three heteroscedastic LDA algorithms by the annotations earlier introduced: C-HLD, R-HLD-1 and R-HLD-2 respectively. These results are shown in Table 2.
Moreover, for each of the test datasets, we evaluate the average classification accuracy for each of LDA, C-HLD, R-HLD-1, R-HLD-2, GLD and GLD with local neighbourhood search (LNS). We also compare the performance of these LDA approaches to the SVM. These results are shown in Table 3, while the average training times of the algorithms are shown in Table 4.
We estimate the prior probabilities from the relative frequencies of the data in each class of the dataset. The stopping criterion for the GLD is as follows: we stop if the norm of the gradient falls at or below a fixed tolerance, or else we terminate the algorithm after a fixed maximum number of iterations and choose the solution corresponding to the minimum Bayes error found. Also, for the LNS procedure, we perturb each vector element by a small fraction of its absolute value, and run for a maximum of 1000 iterations, terminating prematurely if there is no improvement after a predefined number of consecutive iterations. We use a fixed step size for the C-HLD algorithm, and run a large number of trials for R-HLD-1 and R-HLD-2. All the parameters used in the experiments are optimised via cross-validation. Note that if the sample covariance matrix is singular, we use the Moore-Penrose pseudo-inverse.
6 Results and Discussion
For real-world datasets, the covariance matrices of the classes are rarely equal, so the homoscedasticity assumption of LDA does not hold. Our results in Table 2 confirm that LDA does not minimise the Bayes error under heteroscedasticity, as none of the datasets used has equal covariance matrices. With the exception of datasets (d), (j) and (l), where LDA achieves a Bayes error equal to that of the other heteroscedastic LDA approaches, LDA is outperformed by the GLD on all remaining datasets in terms of minimising the Bayes error. It will be noted that the other three heteroscedastic LDA approaches achieve a performance comparable to the GLD on all the datasets in terms of the Bayes error. However, R-HLD-1 and R-HLD-2 require a large number of trials in order to obtain their optimal parameters, while C-HLD requires a small step size, which translates into many candidate solutions. Consequently, the training times of these algorithms far exceed that of the GLD, as can be seen in Table 4; the gain in training time of the GLD over C-HLD, R-HLD-1 and R-HLD-2 is largest on datasets (g) and (l). Moreover, since C-HLD, R-HLD-1 and R-HLD-2 all require matrix inversions, performing a matrix inversion for each trial can be a computationally demanding task, especially for high-dimensional data with large covariance matrices. Instead, since the GLD follows a principled optimisation procedure, the number of matrix inversions required is far lower; on the high-dimensional dataset (f), for example, the GLD requires many times less time to train than the other heteroscedastic LDA approaches.
It is conceivable that the minimisation of the Bayes error would translate into a good performance in terms of the classification accuracy, if the normality assumption of LDA holds. For this reason, it can be seen in Table 3 that the GLD achieves the best classification accuracy on datasets (a) and (b), which are generated from known normal distributions. Thus, the proposed GLD algorithm is particularly suited for applications with datasets that tend to be normally distributed in each class e.g. in machine fault diagnosis, or accelerometer-based human activity recognition [?], as it also requires far less training time than the existing heteroscedastic LDA approaches.
However, for datasets (c) through to (l), the classes do not have any known normal distribution. Therefore, minimising the Bayes error under the normality assumption does not necessarily result in the classifier with the best classification accuracy, even when the difference in covariance matrices has been accounted for. It is therefore not surprising that LDA achieves higher classification accuracy than C-HLD, R-HLD-1, R-HLD-2 and the GLD on datasets (c) and (h), as can be seen in Table 3. However, by searching around the neighbourhood of the GLD, the Local Neighbourhood Search (LNS) algorithm is able to account for the non-normality and obtain a more robust classifier. Thus, the GLD, together with the LNS procedure, achieves higher classification accuracy than all the other LDA approaches on all the real-world datasets (i.e. (c) to (l)), with the exception of dataset (f), on which the GLD alone shows superior classification accuracy.
While the SVM outperforms the LDA approaches on half of the datasets, its training time can be rather long for large datasets. For instance, on the large dataset (d), the SVM takes hours to train, whereas the GLD with LNS, which achieves the best classification accuracy on this dataset, takes only seconds, representing a saving of several orders of magnitude in computational time. Similar patterns can be seen on other datasets such as (i), where the GLD with LNS achieves superior classification accuracy with a far shorter training time than the SVM. This suggests that for such large datasets, the GLD with Local Neighbourhood Search is a low-complexity alternative to the SVM.
We note, however, two weaknesses of our proposed algorithms. The GLD procedure, as described in Algorithm 1, may converge to a saddle point instead of a local minimum. Even if it converges to a local minimum, there is no guarantee that this is the globally optimal solution, because the objective function is known to be non-convex [?]. Also, since the Local Neighbourhood Search involves evaluating the misclassification rate on the training set for every perturbation, the procedure does not scale well with large amounts of training data. It is therefore important to have a good initial solution such as the GLD, so that an early termination may be performed if there is no improvement after some number of iterations.
In this paper, we have presented the Gaussian Linear Discriminant (GLD), a novel and computationally efficient method for obtaining a linear discriminant for heteroscedastic Linear Discriminant Analysis (LDA) for the purpose of binary classification. Our algorithm minimises the Bayes error via an iterative optimisation procedure that uses Fisher's Linear Discriminant as the initial solution. Moreover, the GLD does not require any parameter adjustments. We have also proposed a local neighbourhood search method by which a more robust linear classifier may be obtained for non-normal distributions. Our experimental results on two artificial and ten real-world datasets show that when the covariance matrices of the classes are unequal, LDA is unable to minimise the Bayes error. Thus, under heteroscedasticity, our proposed algorithm achieves superior classification accuracy to LDA for normally distributed classes. While the proposed GLD algorithm compares favourably with other heteroscedastic LDA approaches, it requires far less training time. Moreover, the GLD, together with the LNS, has been shown to be particularly robust, comparing favourably with the SVM but requiring far less training time on our datasets. Thus, for expert systems like machine fault diagnosis or human activity monitoring that require linear classification, the proposed algorithms provide a low-complexity, high-accuracy solution.
While this work has focused on linear classification, on-going work is focused on modifying the GLD procedure for the purpose of linear dimensionality reduction. Moreover, it is of particular interest to us to derive the Bayes error for some known non-normal distributions. An alternative is to obtain a kernel that implicitly transforms data of a known non-normal distribution into a feature space where the classes are normally distributed. Finally, like all local search algorithms, the performance and complexity of the LNS procedure depends on the choice of the initial solution. Therefore, further work exploring the use of initial solutions (including the heteroscedastic LDA approaches discussed) other than the GLD for the LNS procedure is being done.
Suppose that satisfies (38), then
This implies that since is a positive scalar.
Consider now given as:
In order for (38) to be satisfied, it can be shown, similar to (A.5), that
which can be simplified to give . Since this conclusion is false, only satisfies (26).