Multigroup discrimination based on weighted local projections

Thomas Ortner (thomas.ortner@tuwien.ac.at), Irene Hoffmann, Peter Filzmoser, Maia Zaharieva, Christian Breiteneder, Sarka Brodinova
Institute of Statistics and Mathematical Methods in Economics
Institute of Software Technology and Interactive Systems
TU Wien
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010 and by the K-project DEXHELPP through COMET - Competence Centers for Excellent Technologies, supported by BMVIT, BMWFW and the province of Vienna. The COMET program is administered by the FFG.
Abstract

A novel approach for supervised classification analysis of high-dimensional and flat data (more variables than observations) is proposed. We use the information about the class membership of the observations to determine groups of observations which locally describe the group structure. By projecting the data onto the subspaces spanned by those groups, local projections are defined, based on the projection concepts from Ortner et al. (2017a) and Ortner et al. (2017b). For each local projection, a linear discriminant analysis (LDA) model is computed, using the information within the projection space as well as the distance to the projection space. The models provide information about the quality of separation for each class combination. Based on this information, weights are defined for aggregating the LDA-based posterior probabilities of each subspace into a new overall probability. The same weights are used for classifying new observations.

In addition to the proposed methodology, implemented in the R-package lop, a method for visualizing the connectivity of groups in high-dimensional spaces is proposed on the basis of the posterior probabilities. A thorough evaluation is performed on three different real-world datasets, underlining the strengths of local-projection-based classification and of the provided visualization methodology.

1 Introduction

Supervised classification methods are widely used in research and industry, including tasks like tumor classification, speech recognition, or the classification of food quality. Observations are gathered from distinct groups and for each observation the group membership is known. Decision boundaries are then estimated in the sample space, such that a new observation can be assigned to one of the groups. The aim of discrimination methods is to find classification boundaries, which result in low misclassification rates for new observations, i.e. new observations are assigned to the correct class with high accuracy.

Linear discriminant analysis (LDA) is a popular tool for classification. It estimates linear decision boundaries by maximizing the ratio of between-group to within-group variance and assumes an equal covariance structure for all groups. LDA often gives surprisingly good results in low-dimensional settings; however, it cannot be directly applied if the number of variables exceeds the number of observations, since then the within-group covariance estimate becomes singular and its inverse cannot be calculated. With restrictions on the covariance estimation the problem of singularity can be mended, but asymptotically (with increasing number of variables) the performance of LDA is not better than random guessing (Bickel and Levina, 2004; Shao et al., 2011).

In many classification tasks, the underlying data has a flat structure, i.e. there are more variables than observations. Therefore, a great variety of alternative classification methods and extensions of LDA have been developed to overcome its limitations. Several proposed approaches consider a projection of the data onto a lower-dimensional subspace (Barker and Rayens, 2003; Chen et al., 2013) or reduce the dimensionality by model-based variable selection (Witten and Tibshirani, 2011). Other methods are not based on covariance estimation and are therefore not restricted to low-dimensional (non-flat) settings, e.g. k-nearest neighbour (KNN) classification, support vector machines (SVM) or random forests (RF). Nevertheless, the noise accumulation due to a large number of variables which are not informative for the class separation affects these methods as well. The concept of combining projections with classification methods has also been explored previously (e.g. Lee et al., 2005; Caragea et al., 2001), where the focus is on exploratory classification for finding suitable projections for visualization.

We propose a new approach for supervised classification based on a series of projections onto low-dimensional subspaces, referred to as local projections. In each subspace, we calculate an LDA model. The posterior probabilities of each LDA model are aggregated (weighted by a class-specific quality measure of the projection space) to obtain a final classification. The idea of aggregating posterior probabilities has been used in the context of random forests by Bosch et al. (2007), who take the average over the posterior probabilities from all trees.

The remainder of the paper is structured as follows. Section 2 presents the proposed method. First, local projections based on the k-nearest class neighbors of an observation and distances within and to the projection space are introduced in Section 2.1, resulting in the local discrimination space where an LDA model is estimated. Next, in Section 2.2, we introduce weights used for aggregating the posterior probabilities from the individual LDA models, leading to a final classification rule. In Section 2.3, the range of the tuning parameter $k$, associated with the dimensionality of the local discrimination space, is discussed and a strategy to select this tuning parameter is presented. Section 3 introduces a way of visualizing the data structure and the degree of separation. In Section 4, three real-world datasets are used to evaluate the performance of our approach in comparison to other related and popular classification methods. The datasets cover settings with only 25 and up to almost 10,000 variables, including multigroup and binary classification problems, and a dataset where subgroups are known to exist. The effect of imbalanced group sizes in the training data is investigated and results are visualized by the techniques introduced in the previous section. Section 5 concludes the paper.

2 Methodology

Let $X = (x_1, \ldots, x_n)^\top$ denote a data matrix of $n$ observations in a $p$-dimensional space, $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$. We further assume the presence of $G$ classes, where the class memberships of the observations are stored in a categorical vector $c = (c_1, \ldots, c_n)^\top$ with $c_i = g$ iff $x_i$ comes from group $g$, for $g \in \{1, \ldots, G\}$. The number of observations in group $g$ is denoted by $n_g$, with $\sum_{g=1}^G n_g = n$. We assume that the observations have been drawn from $G$ different continuous probability distributions.

For our methodology, it is important that each space spanned by a subset of $k$ observations has a dimension of at least $k-1$ and that there are no ties present in our data. These requirements automatically imply high-dimensional data spaces as the area of application. Both assumptions can be met by a preprocessing step removing duplicate or linearly dependent observations. Note that these restrictions only apply to the training data but not to new observations.

Previous research (Ortner et al., 2017a, b) shows the effectiveness of using series of projections to overcome the limitations caused by a flat data structure. In this section, the local discrimination method is introduced, which allows the number of variables $p$ to exceed the number of observations $n$. The idea is as follows. For a fixed observation $x_i$, its $k$ nearest class neighbours are identified, called the core of $x_i$, which are used to define a $(k-1)$-dimensional hyperplane, the core space. The Euclidean distance to this hyperplane, called the orthogonal distance, is calculated for each observation. The hyperplane and the orthogonal distance together define a $k$-dimensional subspace, the local discrimination space, where an LDA model is estimated. This procedure is performed for each observation, resulting in $n$ LDA models. To assign a class membership to an observation, its posterior probabilities from all $n$ models are aggregated.

2.1 Local discrimination space

Let $d^{(k)}_{g}(x_i)$ denote the $k$th-smallest distance from $x_i$ to any observation from class $g$, for $g \in \{1, \ldots, G\}$. According to Ortner et al. (2017a) and Ortner et al. (2017b), we define the core of $x_i$ as the set of its k-nearest class neighbors,

$\mathrm{core}(x_i) = \{ j \in \{1, \ldots, n\} : \; c_j = c_i, \; d(x_i, x_j) \leq d^{(k)}_{c_i}(x_i) \},$   (1)

where $d(x_i, x_j)$ denotes the Euclidean distance between $x_i$ and $x_j$, and the elements of $\mathrm{core}(x_i)$ are the indices of the core observations within $X$. In contrast to Ortner et al. (2017b), we use all k-nearest class neighbours, as we can use the group membership in order to guarantee a clean core, i.e. no observations from other groups within the core.
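
For illustration, a minimal R sketch of this step is given below; the helper name core_indices and its arguments are hypothetical and do not correspond to the interface of the lop package.

```r
# Indices of the k-nearest class neighbours (the core) of observation x_i.
# X: numeric data matrix (n x p), cl: vector of class labels, i: row index, k: core size.
core_indices <- function(X, cl, i, k) {
  same_class <- which(cl == cl[i])                     # candidates from the class of x_i
  d <- sqrt(colSums((t(X[same_class, , drop = FALSE]) - X[i, ])^2))  # Euclidean distances to x_i
  same_class[order(d)[seq_len(k)]]                     # the k smallest distances define the core
}
```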

Any of the $n$ available cores can be used to unambiguously define an affine subspace spanned by the core observations. In order to determine the projection onto this subspace, we center and scale the data with respect to the core observations, using the location estimator $m_i = (m_{i1}, \ldots, m_{ip})^\top$ and the scale estimator $s_i = (s_{i1}, \ldots, s_{ip})^\top$ with

$m_{il} = \frac{1}{k} \sum_{j \in \mathrm{core}(x_i)} x_{jl},$   (2)

$s_{il}^2 = \frac{1}{k-1} \sum_{j \in \mathrm{core}(x_i)} (x_{jl} - m_{il})^2, \qquad l = 1, \ldots, p,$   (3)

where $s_{il}^2$ denotes the sample variance of the $l$th variable over the core observations. In what follows, we denote by $Z_i$ the data matrix of centered and scaled observations based on the location and scale estimators $m_i$ and $s_i$ of the core of $x_i$. A projection onto the subspace spanned by the core of $x_i$ is defined by the matrix $V_i$ from the singular value decomposition (SVD) $Z_i^{\mathrm{core}} = U_i D_i V_i^\top$ of the centered and scaled core observations $Z_i^{\mathrm{core}}$ (the rows of $Z_i$ belonging to $\mathrm{core}(x_i)$). Since the core of $x_i$ consists of $k$ linearly independent observations, $D_i$ is a diagonal matrix with $k-1$ non-zero singular values on the diagonal after centering.
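
The centering, scaling, and SVD step can be sketched in R as follows, continuing the hypothetical helper from above; the core is assumed to contain no constant variables, otherwise the scaling would need to be adjusted.

```r
# Centre and scale all observations with the core's location and spread, then
# obtain an orthonormal basis V_i of the core space via SVD of the core rows.
core <- core_indices(X, cl, i, k)
m_i  <- colMeans(X[core, , drop = FALSE])        # location estimate from the core, Eq. (2)
s_i  <- apply(X[core, , drop = FALSE], 2, sd)    # scale estimate (sample sd), Eq. (3)
Z    <- scale(X, center = m_i, scale = s_i)      # centred and scaled data matrix Z_i
sv   <- svd(Z[core, , drop = FALSE], nu = 0)     # SVD of the centred and scaled core observations
V_i  <- sv$v[, sv$d > 1e-10, drop = FALSE]       # columns spanning the core space (k - 1 of them)
```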

Since the assumption of no ties being present in the data and of each core consisting of linearly independent observations may appear to be a strong limitation, an adjustment of the definitions can help to avoid a preprocessing step. If we interpret the core of $x_i$ as a set of observations where iteratively the observation from the same class closest to $x_i$ is added until a $(k-1)$-dimensional subspace is spanned, we only need to guarantee the existence of such cores, which is a much weaker assumption.

Given the projection matrix $V_i$ from the decomposition, a representation of the data in the core space is defined by down-projecting the centered and scaled data matrix, $T_i = Z_i V_i$. The core representation $T_i$ consists of $k-1$ orthogonal variables, while the $(p-k+1)$-dimensional complement of $V_i$ defines the orthogonal complement of the core space. In contrast to commonly used procedures of first reducing the dimensionality using PCA and then performing a discrimination method like LDA, we acknowledge the fact that the last principal components might contain an important part of the information, as exploited by modern outlier detection algorithms (e.g. Hubert et al., 2005; Kriegel et al., 2012). Since the reduction of dimensionality remains vital, we aggregate the information from the orthogonal complement by considering the Euclidean distance to the core space,

$\mathrm{OD}_i(x_j) = \| z_{ij} - V_i t_{ij} \|,$   (4)

where $z_{ij}$ denotes the centered and scaled observation $x_j$ (the $j$th row of $Z_i$, written as a column vector) and $t_{ij} = V_i^\top z_{ij}$ denotes the core representation of $x_j$ given $\mathrm{core}(x_i)$.

The combination of the core representation and the orthogonal distance in one matrix, $[T_i, \mathrm{OD}_i]$, provides a $k$-dimensional representation of all $n$ observations of $X$. This $k$-dimensional space is the local discrimination space. The reduction of the sample space to the local discrimination space results in a good description of the neighbourhood of an observation and, through the orthogonal distances, also captures grouping structure which is not described in the core space.
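
A sketch of how the core representation and the orthogonal distance are combined into the local discrimination space, using the objects Z and V_i from the previous sketch:

```r
# Core representation (scores) and orthogonal distances for all n observations,
# combined into the k-dimensional local discrimination space.
T_i <- Z %*% V_i                                 # coordinates within the core space
OD  <- sqrt(rowSums((Z - T_i %*% t(V_i))^2))     # Euclidean distance to the core space, Eq. (4)
LDS <- cbind(T_i, OD = OD)                       # local discrimination space (n x k)
```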

An LDA model is estimated in the local discrimination space, excluding the observations from the core of $x_i$. It is necessary to exclude the core observations because they have very specific properties in the local discrimination space, which would distort the within-group covariance estimation. In Ortner et al. (2017a) it is shown that

$\mathrm{OD}_i(x_j) = 0 \quad \text{for all } j \in \mathrm{core}(x_i),$   (5)

$\mathrm{SD}_i(x_j) = \mathrm{const} \quad \text{for all } j \in \mathrm{core}(x_i),$   (6)

where $\mathrm{SD}$ represents the score distance, defined as the Euclidean distance within the core space. These properties hold because for the core observations the full information is located in the core space, so their orthogonal distances are zero. The scaling applied to the data, based on the covariance estimation of the core observations, leads to constant score distances for $j \in \mathrm{core}(x_i)$. Hence the core observations must not be included in the computation of the LDA model. The model estimated on the remaining $n-k$ observations in the local discrimination space is denoted by $M_i$.
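
Fitting the model $M_i$ can be sketched with the lda function of the MASS package, dropping the core observations before estimation; the object names continue the sketches above.

```r
# Fit the LDA model of the i-th local projection, excluding the core observations,
# and evaluate its posterior probabilities for all observations.
library(MASS)
keep   <- setdiff(seq_len(nrow(LDS)), core)
fit_i  <- lda(LDS[keep, ], grouping = cl[keep])  # model M_i in the local discrimination space
post_i <- predict(fit_i, LDS)$posterior          # posterior probabilities P_i(g | x), cf. Eq. (7)
# (MASS additionally weights the Gaussian densities by class priors estimated from the data.)
```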

For the model $M_i$, the posterior probability of group $g$ given an observation $x$ is defined by

$P_i(g \mid x) = \dfrac{\hat f_{ig}(x)}{\sum_{h=1}^{G} \hat f_{ih}(x)},$   (7)

where $\hat f_{ig}$ denotes the estimated density of a multivariate normal distribution with the group mean of class $g$ as center and the pooled within-group covariance matrix as covariance estimate, both computed in the local discrimination space of $M_i$.

2.2 Weighting/aggregating local projections

We now have a set of $n$ local discrimination spaces and their respective LDA models $M_1, \ldots, M_n$. In order to obtain an overall classification rule for a new observation $x$, we need to aggregate the available models from the $n$ core spaces. We accomplish such an aggregation by using the posterior probabilities defined in Equation (7). First we consider the mean over all posterior probabilities of $x$ belonging to group $g$, for $g = 1, \ldots, G$,

$\bar P_k(g \mid x) = \frac{1}{n} \sum_{i=1}^{n} P_i(g \mid x),$   (8)

and we define the aggregated posterior probability of $x$ belonging to group $g$, for $g = 1, \ldots, G$, as

$P_k(g \mid x) = \dfrac{\bar P_k(g \mid x)}{\sum_{h=1}^{G} \bar P_k(h \mid x)}.$   (9)

These new aggregated posterior probabilities are based on a fixed number $k$ describing the number of core observations, as indicated by the index of $P_k$.

The posterior probabilities of the LDA models, $P_i(g \mid x_j)$, compared to the true class memberships of the observations, reflect the quality of separation in the respective local projection. We distinguish between two quality measures. Let $a_{ig}$ denote the mean posterior probability of belonging to class $g$ over all observations actually coming from class $g$, with respect to the model $M_i$, i.e.

$a_{ig} = \frac{1}{n_g} \sum_{j : c_j = g} P_i(g \mid x_j),$   (10)

and let $b_{ig}$ denote the mean posterior probability of non-class-$g$ observations being classified as class-$g$ observations given the model $M_i$, i.e.

$b_{ig} = \frac{1}{n - n_g} \sum_{j : c_j \neq g} P_i(g \mid x_j).$   (11)

Based on $a_{ig}$ and $b_{ig}$, we define weights $w_{ig}$ representing the quality of each local projection $i$ for each group $g$,

(12)

Based on these quality measures $w_{ig}$, we redefine the overall posterior probabilities from Equation (9) by weighting each projection for each class with the respective weight. Note that these weights are class-specific and, therefore, a class-individual standardization of the weights is required. In our notation, we remove the subscript $k$ from Equation (8) and Equation (9), which represent constant weights of $1/n$ for each local projection, resulting in:

$\bar P(g \mid x) = \dfrac{\sum_{i=1}^{n} w_{ig} \, P_i(g \mid x)}{\sum_{i=1}^{n} w_{ig}},$   (13)

$P(g \mid x) = \dfrac{\bar P(g \mid x)}{\sum_{h=1}^{G} \bar P(h \mid x)}.$   (14)

Analogously to classical LDA, we use these posterior probabilities to assign an observation $x$ to the class $g$ with the largest $P(g \mid x)$. This decision rule defines the local discrimination model.
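
The aggregation step can be sketched in R as follows. The lists of posterior matrices and the weight formula marked below are illustrative assumptions (the paper's Equation (12) is not reproduced here); only the weighted, class-wise standardised averaging of Equations (13) and (14) is shown.

```r
# Aggregate the posterior probabilities of all n local models with class-specific weights.
# train_post: list of n matrices (training posteriors, n x G) used to judge each model;
# new_post:   list of n matrices (posteriors of the new observations, m x G);
# cl:         class labels of the training observations.
aggregate_lp <- function(train_post, new_post, cl) {
  classes <- sort(unique(cl))
  W <- t(sapply(train_post, function(P)
    sapply(seq_along(classes), function(g) {
      in_g <- cl == classes[g]
      a <- mean(P[in_g, g])       # quality measure a_ig, Eq. (10)
      b <- mean(P[!in_g, g])      # quality measure b_ig, Eq. (11)
      a * (1 - b)                 # placeholder weight; the paper's Eq. (12) may differ
    })))                          # W: n x G matrix of weights w_ig
  agg <- Reduce(`+`, Map(function(P, w_row) sweep(P, 2, w_row, `*`),
                         new_post, asplit(W, 1)))   # sum_i w_ig * P_i(g | x)
  agg <- sweep(agg, 2, colSums(W), `/`)             # class-individual standardisation, Eq. (13)
  sweep(agg, 1, rowSums(agg), `/`)                  # renormalise over classes, Eq. (14)
}
```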

2.3 The choice of $k$

The computation of LDA models in the full-dimensional space, given more variables than observations, requires data preprocessing including dimension reduction (e.g. Barker and Rayens, 2003; Chen et al., 2013) or the simultaneous performance of model estimation and variable selection (e.g. Witten and Tibshirani, 2011; Hoffmann et al., 2016). The concept of local projections allows us to compute an LDA model for each local projection due to the low-dimensional core space. The parameter determining the dimensionality is the number $k$ of the $k$-nearest class neighbours. It is important to properly tune $k$ since it defines the degree of locality of each projection. Smaller values of $k$ are better able to describe a lower-dimensional manifold on which groups might be located, but they increase the risk of not properly describing the local data structure.

The number of classes $G$ as well as the group sizes $n_g$, for $g = 1, \ldots, G$, provide a first limitation for the range of $k$. In order to compute an LDA model with $G$ classes, a dimensionality of at least $G-1$ is required. Therefore,

$k \geq G - 1$   (15)

provides a lower boundary for $k$.

To identify an upper boundary for $k$, two properties of the core observations must be taken into account. Due to the specific properties of the core observations stated in Equations (5) and (6), they are not included in the computation of the LDA model, which is therefore estimated from the remaining $n-k$ observations in the $k$-dimensional local discrimination space. Hence, an upper boundary for $k$ is given by

$n - k - G \geq k, \quad \text{i.e.} \quad k \leq \frac{n - G}{2},$   (16)

to guarantee a non-singular covariance estimation.

to guarantee a non-singular covariance estimation. It is useful to further reduce the upper boundary of in order to allow for a reasonable covariance estimation. Here we take three times more observations than variables leading to the limitation

(17)

With these restrictions on $k$, LDA models in the core spaces can be computed, but for the evaluation of the models further limitations are necessary. To be able to evaluate the LDA models, we depend on the posterior probabilities of observations from each class in order to determine the risks of misclassification. Since a core consists of observations from the same class only and the core observations are excluded from the LDA model, the size of the smallest class needs to exceed $k$:

$k < \min_{g = 1, \ldots, G} n_g.$   (18)

Due to the identified restrictions, we optimize $k$ within the following interval:

$G - 1 \;\leq\; k \;\leq\; \min\left( \left\lfloor \tfrac{n}{4} \right\rfloor, \; \min_{g} n_g - 1 \right).$   (19)

For a given $k$, the misclassification rate of the local discrimination model is calculated as the number of misclassified observations (again excluding the core observations) divided by $n$, the total number of observations. The tuning parameter $k$ is chosen from within the interval described in Equation (19) such that the misclassification rate is minimized.
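
A possible tuning loop, sketched under the assumption that fit_lp and predict_lp denote training and prediction routines for the local discrimination model (both names are hypothetical and not the interface of the lop package):

```r
# Choose k by minimising the misclassification rate on the training data
# over the admissible interval of Equation (19).
tune_k <- function(X, cl) {
  G     <- length(unique(cl))
  k_min <- G - 1
  k_max <- min(floor(nrow(X) / 4), min(table(cl)) - 1)
  ks    <- k_min:k_max
  err   <- sapply(ks, function(k) {
    pred <- predict_lp(fit_lp(X, cl, k), X)    # aggregated class assignments for the training data
    mean(pred != cl)                           # misclassification rate (the text additionally excludes core observations)
  })
  ks[which.min(err)]                           # k with minimal misclassification rate
}
```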

3 Visualization of the discrimination

In linear discriminant analysis, the projection space is used for the visualization of the discrimination (e.g. Hair et al., 1998). The Mahalanobis distances of the observations to the class centers correspond to the posterior probabilities of the observations for the respective classes. This approach is not feasible for local discrimination since each LDA model refers to a different subspace, and the aggregated posterior probabilities do not refer to one specific low-dimensional space in which they could be visualized.

We therefore focus on visualizing the aggregated posterior probabilities and follow an approach for compositional data using ternary diagrams. We present the visualization technique on the four-group Olitos dataset which is used as a benchmark dataset for robust, high-dimensional data analysis. The dataset is publicly available in the R-package rrcovHD and was originally described by Armanino et al. (1989).

Hron and Filzmoser (2013) used ternary diagrams to visualize the outcome of (three-group) fuzzy clustering results, which can be interpreted in the same way as posterior probabilities of discrimination models. The limiting aspect of ternary diagrams is that only three compositional parts can be displayed. Therefore, we select two classes, use their respective posterior probabilities, and take as third part the sum of the posterior probabilities of all remaining classes. This new three-part composition is visualized in Figure 1.
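
The three-part composition underlying such a ternary diagram can be computed from the matrix of aggregated posterior probabilities as sketched below; the function name is illustrative, and the actual plotting can be done with any ternary-diagram routine (e.g. ternaryDiag from the robCompositions package, if available).

```r
# Collapse a G-column matrix of aggregated posterior probabilities into a
# three-part composition: class g1, class g2, and the sum of all remaining classes.
ternary_composition <- function(post, g1, g2) {
  rest <- rowSums(post[, -c(g1, g2), drop = FALSE])
  comp <- cbind(post[, g1], post[, g2], rest)
  colnames(comp) <- c(paste0("class ", c(g1, g2)), "other")
  comp / rowSums(comp)                         # closed composition, each row sums to one
}
```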

Figure 1: The aggregated posterior probabilities of two classes (class 1 and class 2) compared to the remaining classes is visualized as a ternary diagram. The dashed lines represent classification rules. Observations located in the white areas can be assigned to the respective group, while the grey area represents an uncertain area, where no reliable statement can be made. The grey dashed lines refer to posterior probabilities of the selected classes.

Figure 1 shows the proposed representation for a sample of the Olitos dataset. The focus of this representation is the evaluation of the separation between the two selected classes, class 1 and class 2. The gray dashed lines and the numbers on the left side show the posterior probability for an observation to belong to group 1. The two white areas at the bottom, separated by a vertical dashed line, represent the classification rule for the separation between class 1 and class 2. Observations in the left area are assigned to group 1 and those in the right area to group 2. In the bottom right area we can identify one outlier from the blue class and one from a further class which are wrongly assigned to group 2. Besides these two false classifications, additional information can be gained from the diagram.

First, the grey area represents the region where no statement about the classification can be made with certainty. Two observations are highlighted there. For the first one, although it is located in the uncertainty area, we can still tell that it will be misclassified, since its posterior probability for class 2 is larger than for class 1. This decision is indicated by the vertical dashed line within the uncertainty area. However, from this figure it is not possible to say whether it will be assigned to class 2 or to one of the other classes. The same holds for the second highlighted observation. Its posterior probability for class 1 is close to 0.4 and for class 2 close to 0.15. Therefore, the posterior probabilities for classes 3 and 4 sum up to approximately 0.45. Depending on the class-specific allocation, the maximal posterior probability for classes 3 and 4 varies between 0.225 and 0.45, and the largest posterior probability for this observation can originate from class 1, 3 or 4.

Second, the white classification area at the top of the triangle contains those observations which with certainty will not be assigned to class 1 or class 2. We note a minor risk of misclassifying observations in the direction of one of the selected classes. Note that the size of the uncertainty area, and therefore the size of the third classification area, highly depends on the number of groups that are aggregated. In a three-group case, all posterior probabilities can be visualized and no area of uncertainty exists, as shown in Figure 2(a). The remaining panels of Figure 2 show the impact of an increasing number of groups on the area of uncertainty.

Figure 2 (panels (a)–(e)): The effect of the number of aggregated groups is visualized. Panel (a) refers to a three-class case, (b) to a four-class case, (c) to a five-class case, (d) to a six-class case, and (e) to a ten-class case.

Finally, the positioning of the observations in the ternary diagram provides some insight into the connections between the groups. The red observations in Figure 1 are mostly aligned along the axis between the two selected classes. Observations from the aggregated group are also mostly aligned between these two vertices, while the black observations split up between one of the selected classes and the aggregated group. Therefore, the red class is strongly connected to the other selected class but has no connection to the aggregated classes. Observations which strongly deviate from the typical direction of their class should therefore be candidates for further investigation in the context of outlier analysis.

Figure 3: A set of ternary diagrams is used to visualize the classification performance for each possible combination of classes. While the color coding remains constant, the labels switch to emphasize the currently selected classes, indicated at the bottom and left of the diagrams.

Since the representation in Figure 1 uses a three-part composition, which cannot illustrate the overall discrimination result, we propose to use a combination of ternary diagrams in the form of a scatterplot matrix, as presented in Figure 3. Each possible combination of two classes plus one aggregated group is presented as described for Figure 1. In order to align the groups and increase readability, the diagrams have been rotated accordingly.

Besides providing information on the quality of discrimination and the risks of misclassification, we can derive an overall picture of the group connectivity. We already remarked on the location of the red class. The positioning of the green and blue observations in the second and third column of the matrix reveals that the green group has stronger ties to the black group, while the blue observations are equally drawn to their direct neighbors. Such insight into connectivity provides a feeling for the location of groups in high-dimensional spaces, which is in general a non-trivial task limited by the human spatial sense.

4 Evaluation

In order to evaluate the performance of our proposed local discrimination approach, abbreviated as LP for Local Projections, we use three real-world datasets which have previously been used as benchmark datasets for high-dimensional data analysis. Based on those datasets, we compare LP with well-established classification methods from the fields of computer science and statistics. While the visualization introduced in Section 3 provides interesting insights into each dataset and is provided as well, we focus on comparing the methods based on the misclassification rate.

For each dataset, we split the available observations into a training and a test dataset. The same training dataset is used by each method to estimate the discrimination model, and the same test dataset is used to evaluate the performance of all models by reporting their misclassification rates.

The employed datasets consist of groups of different numbers of observations. Since the outcome of a method can be strongly affected by the specific choice of the training and test set, we resample the observations 50 times per dataset, creating a series of training and test datasets resulting in a series of misclassification rates. The overall performance is then measured based on the median misclassification rate as well as on the deviation from the median misclassification rate.
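
The repeated evaluation scheme can be sketched in R as follows; fit and pred are placeholders for the training and prediction functions of any of the compared classifiers, and the training fraction is dataset-specific.

```r
# Repeated stratified splitting into training and test data, returning the
# misclassification rate of a given classifier for each repetition.
evaluate <- function(X, cl, fit, pred, train_fraction = 0.8, reps = 50) {
  sapply(seq_len(reps), function(r) {
    train <- unlist(lapply(split(seq_along(cl), cl), function(idx)
      sample(idx, round(train_fraction * length(idx)))))   # stratified sample per class
    model <- fit(X[train, , drop = FALSE], cl[train])
    mean(pred(model, X[-train, , drop = FALSE]) != cl[-train])
  })
}
# The median and the spread of the 50 misclassification rates summarise the performance.
```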

4.1 Compared methods

The selection of classification methods is based on their popularity, their importance for our setups, and their relevance for our proposed approach. The most important aspect is the applicability to the evaluated datasets; the crucial factor is the flat data structure (more variables than observations), especially in class-specific subsets of the overall dataset. In order to cover related classification methods, we include linear discriminant analysis (LDA), as this is the classification method internally used for each local projection. We further include statistical advancements of LDA which try to deal with disadvantageous properties of our datasets of interest, namely penalized LDA and partial least squares for discriminant analysis. The most closely related method from the field of computer science is KNN classification, as our local projections are based on a k-nearest-neighbor estimation. The last methods included in the evaluation are support vector machines and random forests, covering the most commonly used classification approaches from the field of computer science.

For Linear Discriminant Analysis (LDA), it is assumed that the covariance structure is the same for each class and has an elliptical shape. Under this assumption, the optimal decision boundaries to separate the groups are linear. The separation of the classes is achieved by taking orthogonal directions which maximize the ratio of between-group to within-group variance. In this at most $(G-1)$-dimensional space, the Euclidean distance to the group centers is used to assign an observation to the group with the closest center.

For the calculations, the lda function from the R-package MASS is used. This implementation can be applied to data with more variables than observations by performing a singular value decomposition and reducing the dimensionality to the rank of the data.

Penalized LDA (PLDA), introduced by Witten and Tibshirani (2011), is a regularized version of Fisher's linear discriminant analysis. A penalty on the discriminant vectors favours zero entries, which leads to variable selection. The influence of the penalty is controlled by the sparsity parameter $\lambda$: larger values of $\lambda$ lead to fewer variables in the model.

The sparsity parameter $\lambda$ is selected from a grid of 10 candidate values by 10-fold cross-validation on the training data, using the minimum mean misclassification rate as selection criterion. The number of discriminant vectors is set to $G-1$. The functions for cross-validation and model estimation are provided in the R-package penalizedLDA (Witten, 2011).

Partial least squares for discriminant analysis (PLSDA) was theoretically established by Barker and Rayens (2003), where its relationship to LDA and its application to flat data were discussed. PLSDA first performs a projection onto latent variables which takes the grouping information into account. Then LDA is performed in the reduced space.

For the evaluation the R-package DiscriMiner (Sanchez, 2013) is used, which provides code for the selection of the number of components by leave-one-out cross-validation.

Support Vector Machines (SVMs) are a popular machine learning method for classification. The margins between the groups of the training data are maximized in a data space induced by the selected kernel. While a variety of kernels is available (e.g. linear, polynomial, sigmoid), we limit the optimization procedure to the radial basis kernel, which is suggested as the standard configuration.

We use an R-interface to libsvm (Chang and Lin, 2011) included in the R-package e1071. The internal optimization of the SVM is based on cross-validation on the training dataset over a range of values for the cost parameter and for the kernel parameter $\gamma$. For multi-class classification, libsvm internally trains $G(G-1)/2$ binary 'one-against-one' classifiers based on a sparse data representation matrix.

Random Forest (RF) is an ensemble-based learning method commonly used for classification and regression tasks. It builds a forest of decision trees using bootstrap samples of the training data and random feature selection for each tree. The final prediction is made as an average or majority vote of the predictions of the ensemble of all trees.

The RF implementation in the R-package randomForest uses Breiman's random forest algorithm (Breiman, 2001) for multigroup classification. In order to optimize the classification model, we use the internal optimization procedure, starting with $\sqrt{p}$ randomly sampled variables as candidates for splits and increasing this number by a factor of 1.5 in each optimization step.

In KNN classification (KNN), the class memberships of the $k$ nearest neighbors of an observation, based on Euclidean distances, are used to determine the class of the respective observation. For $k = 1$, the class of the nearest neighbor is used; for $k > 1$, the class with the highest frequency among the neighbors is used. In the case of ties, a random decision is made. We use cross-validation in order to optimize $k$ individually for each sampled dataset.
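
For reference, illustrative calls to the compared methods are sketched below, using the R packages named above; x_train, y_train and x_test are assumed to hold the training data, training labels and test data, and the tuning grids are placeholders rather than the values used in the paper.

```r
library(MASS); library(penalizedLDA); library(DiscriMiner)
library(e1071); library(randomForest); library(class)

fit_lda  <- lda(x_train, grouping = y_train)                       # classical LDA
cv_plda  <- PenalizedLDA.cv(x_train, as.numeric(y_train),
                            lambdas = 10^seq(-3, 1, length = 10))  # illustrative lambda grid
fit_plda <- PenalizedLDA(x_train, as.numeric(y_train),
                         lambda = cv_plda$bestlambda, K = cv_plda$bestK)
fit_pls  <- plsDA(x_train, y_train, autosel = TRUE)                # component selection by LOO CV
fit_svm  <- tune.svm(x_train, y_train, gamma = 10^(-3:1),
                     cost = 10^(0:3))$best.model                   # radial basis kernel by default
fit_rf   <- randomForest(x_train, y_train)                         # mtry can be tuned with tuneRF()
pred_knn <- knn(x_train, x_test, cl = y_train, k = 5)              # k chosen by cross-validation in the text
```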

4.2 Olive oil

The first dataset in our experiments consists of 120 samples with 25 chemical composition variables (fatty acids, sterols, triterpenic alcohols) of olive oils from Tuscany, Italy, and was first introduced by Armanino et al. (1989). The dataset is publicly available in the R-package rrcovHD (Todorov, 2014), where it is used as a reference dataset for robust high-dimensional data analysis.

The olive oils are separated into four classes of 50, 25, 34 and 11 observations. In order to have enough training observations from each group available, we use 80% of the observations for the training dataset and the remaining 20% as test observations. We repeatedly create such an evaluation setup 50 times. Hence, each training dataset consists of 96 observations, which yields the only setup where more observations than variables are available. Therefore, classical LDA is expected to perform fairly well. Note that the smallest number of training observations per class is still much smaller than the overall number of variables. Therefore, class-specific covariance estimation, as performed in quadratic discriminant analysis (Friedman, 1989), cannot be applied in this setup or on any of our other considered datasets.

LP and LDA perform exactly the same, as can be seen in Figure 4. PLDA slightly outperforms LP, while PLSDA, SVM, RF, and especially KNN are outperformed. In most cases all variables are included in the PLDA model, but only a subset of variables contributes to each discriminant vector. This variable selection leads to a slight improvement over LDA and LP.

Figure 4: The performance in terms of false classification rates of all considered classification methods for 50 repetitions of the Olitos dataset is visualized by boxplots.

4.3 Arcene

The second real-world dataset is part of the NIPS (Neural Information Processing Systems) 2003 feature selection challenge (Guyon et al., 2007). The task is to distinguish between cancer and non-cancer patterns from mass-spectrometric data with almost 10,000 variables. We therefore deal with a two-class separation problem with continuous variables. The data was obtained from two different sources, the National Cancer Institute (NCI) and the Eastern Virginia Medical School (EVMS). The observations represent patients with ovarian or prostate cancer and healthy or control patients. Very small and very large masses have been removed from the spectrometric data in order to compress the data. In addition, a preprocessing step including baseline removal, smoothing and scaling was performed. All these details are described in Guyon et al. (2007).

The initial setup contained 100 training and 100 validation observations, consisting of a total of 112 non-cancer samples and 88 cancer samples. In order to again create an imbalanced scenario with a non-equal ratio of observations, we merge both groups and resample 22 cancer training observations and 84 non-cancer training observations. The remaining observations are used as test observations. This procedure is repeated 50 times, as for the other datasets.

Figure 5: The performance in terms of false classification rates of all considered classification methods for 50 repetitions of the Arcene dataset is visualized by boxplots.

The performance on the Arcene dataset is evaluated in terms of boxplots in Figure 5. The classification for this dataset and the designed setup is more challenging than for the other real-world datasets. LP performs well in comparison to the other evaluated approaches, being outperformed only by PLSDA. The false classification rate of 80% for SVM might be misleading as it appears worse than random classification: all observations from the non-cancer samples are classified as cancer samples. This could be improved by strategies like oversampling or by changing the majority class assignment to a weighted class assignment. For LP it is not necessary to make adjustments for group imbalance.

4.4 Melon

Our final dataset consists of measurements of three types of melons based on spectral analyses of 256 frequencies. The fruits pertain to three different melon cultivars with group sizes of 490, 106 and 499, but additional subgroups are known to be present due to changes in the illumination system during cultivation. The dataset is regularly used as a benchmark dataset for high-dimensional and robust data analysis methods (e.g. Hubert and Van Driessen, 2004). Especially the subgroups usually affect non-robust analysis methods. Figure 6 provides some insight into the structure of the dataset.

Figure 6: Visualization of the first three principal components of one sample of training observations of the Melon dataset. We see one subgroup of the green class, visible in the first principal components, and a strongly overlapping structure for the remaining observations.

We repeatedly sample 25% of the observations from each group as training observations, using the remaining 75% for testing the model performance. The smallest training class therefore consists of 26 observations, leading to a complex classification problem. The performance of the compared methods is presented in Figure 7. LP handles the challenges of the Melon dataset best and significantly outperforms all compared methods. PLDA in particular results in a high false classification rate, which is presumably related to the subgroups and outliers affecting the variable selection.

Figure 7: The performance in terms of false classification rates of all considered classification methods for 50 repetitions of the Melon dataset is visualized by boxplots.

One problem with the visualization of LDA models is that in high-dimensional spaces with more variables than observations, the training observations will almost always be well-separated. Therefore, in a situation where we do not have enough observations to validate the model based on additional observations, a visualization of the discrimination space does not provide much insight into the risks of misclassification of this model. These challenges are visualized in Figure 8(a) and Figure 8(b).

Figure 8 (panels (a) and (b)): Plot (a) shows the training observations in the LDA projection space for one repetition of the Melon evaluation. Plot (b) shows the same projection for the respective test observations.

We see a perfect separation in Figure 8(a) and have no indication of any risk of misclassification. The risk of misclassification can be evaluated using the aggregated posterior probabilities of LP, defined in Equation (14), which provide the advantage that each of the classification models is located in a low-dimensional space. Figure 9 provides the visualization of the same data setup as used in Figure 8. The risk of confusing observations between two of the classes becomes evident in Figure 9(a), and the realization of this risk becomes evident in Figure 9(b). Note that this visualization can be adapted and used for posterior probabilities computed through cross-validation by any arbitrary classification method.

Figure 9 (panels (a) and (b)): The same data setup as in Figure 8 is used. Plot (a) shows the proposed visualization of aggregated posterior probabilities from local projections for the training observations. Plot (b) visualizes the same aggregation for the respective test observations.

A further experiment is carried out with the Melon dataset: while in the previous experiment the training datasets contained 25% of the observations from the original groups (with sizes 490, 106 and 499), we now investigate the effect of modifying the group sizes to be very imbalanced. We investigate six different scenarios with varying group sizes but the same overall sample size of 250 (see Table 1). Figure 10 shows the mean misclassification rate over 50 repetitions. Scenario 1 and scenario 6, with the most extreme differences in the group sizes, lead to the worst results for several methods. The LDA models are very stable, but most of the time they are outperformed by LP. LP is only slightly affected by scenario 1; otherwise it leads to similar results across the different settings, outperforming all other methods. Note that the classification methods could be tuned in order to cope with imbalanced groups. For example, for random forests there are different strategies to adjust the group assignments if the group sizes are very different from each other (e.g. Khoshgoftaar et al., 2007; Khalilia et al., 2011). However, according to the results shown in Figure 10, the performance of LP is very stable even in the case of imbalanced groups.

Scenario 1 2 3 4 5 6
Class 1 25 50 75 100 125 150
Class 2 75 75 75 75 75 75
Class 3 150 125 100 75 50 25
Table 1: Group sizes for simulation scenarios for the Melon dataset. We vary the numbers of observations per group in order to simulate highly imbalanced group sizes.
Figure 10: The performance of the evaluated classification methods for highly imbalanced training datasets. The six scenarios refer to those described in Table 1. Setups 1 and 6 in particular cause problems for most approaches, while LP proves to be largely robust towards imbalanced group sizes.

5 Conclusions & Outlook

We proposed a methodology for supervised classification combining aspects from the field of computer science and from the field of statistics. We use the concept of local projections to compute a set of linear discriminant models taking the information within each projection space and the distance to the projection spaces into account. The LDA models are then aggregated based on the projection-based degree of separation. As shown in Ortner et al. (2017a), local projections can help to identify group structure in high-dimensional spaces. Therefore, this way of computing aggregated probabilities for class-membership allows the utilization of LDA for high-dimensional spaces while exploiting the advantages of identifying group structure by local projections.

Additionally, a novel visualization based on ternary diagrams has been proposed which reveals links between the groups in high-dimensional space. The visualization makes use of the posterior probabilities computed from the local projections and therefore allows conclusions to be drawn about the uncertainty of the class assignment, supported by gray areas in the plot marking uncertain assignments.

The conducted evaluations of the performance of LP in comparison to related supervised classification methods (LDA, PLDA, PLSDA, SVM, RF and KNN) on three different real-world datasets demonstrated the advantages of LP in various settings: two- and multi-group classification tasks, a higher number of observations than variables and vice versa, inhomogeneous groups caused by outliers, and imbalanced group sizes. The only tuning parameter required for LP is the number $k$ of nearest neighbors, for which lower and upper boundaries have been proposed.

While we utilize linear discriminant analysis performed on the projection space of each local projection, there is no reason to limit ourselves to LDA. Depending on the data setup, other methods can be preferred over LDA and still benefit from the local projection based aggregation. A general combination of classification approaches with local projections is still to be evaluated in future work.

References

  • Armanino et al. (1989) Armanino, C., Leardi, R., Lanteri, S., and Modi, G. (1989). Chemometric analysis of Tuscan olive oils. Chemometrics and Intelligent Laboratory Systems, 5(4):343–354.
  • Barker and Rayens (2003) Barker, M. and Rayens, W. (2003). Partial least squares for discrimination. Journal of chemometrics, 17(3):166–173.
  • Bickel and Levina (2004) Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, pages 989–1010.
  • Bosch et al. (2007) Bosch, A., Zisserman, A., and Munoz, X. (2007). Image classification using random forests and ferns. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
  • Caragea et al. (2001) Caragea, D., Cook, D., and Honavar, V. G. (2001). Gaining insights into support vector machine pattern classifiers using projection-based tour methods. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 251–256. ACM.
  • Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27.
  • Chen et al. (2013) Chen, L., Wang, Y., Liu, N., Lin, D., Weng, C., Zhang, J., Zhu, L., Chen, W., Chen, R., and Feng, S. (2013). Near-infrared confocal micro-Raman spectroscopy combined with PCA-LDA multivariate analysis for detection of esophageal cancer. Laser Physics, 23(6):065601.
  • Friedman (1989) Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American statistical association, 84(405):165–175.
  • Guyon et al. (2007) Guyon, I., Li, J., Mader, T., Pletscher, P. A., Schneider, G., and Uhr, M. (2007). Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters, 28(12):1438–1444.
  • Hair et al. (1998) Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., Tatham, R. L., et al. (1998). Multivariate data analysis, volume 5. Prentice hall Upper Saddle River, NJ.
  • Hoffmann et al. (2016) Hoffmann, I., Filzmoser, P., Serneels, S., and Varmuza, K. (2016). Sparse and robust PLS for binary classification. Journal of Chemometrics, 30(4):153–162.
  • Hron and Filzmoser (2013) Hron, K. and Filzmoser, P. (2013). Robust diagnostics of fuzzy clustering results using the compositional approach. Synergies of Soft Computing and Statistics for Intelligent Data Analysis, pages 245–253.
  • Hubert et al. (2005) Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005). ROBPCA: A new approach to robust principal component analysis. Technometrics, 47(1):64–79.
  • Hubert and Van Driessen (2004) Hubert, M. and Van Driessen, K. (2004). Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 45(2):301–320.
  • Khalilia et al. (2011) Khalilia, M., Chakraborty, S., and Popescu, M. (2011). Predicting disease risks from highly imbalanced data using random forest. BMC medical informatics and decision making, 11(1):51.
  • Khoshgoftaar et al. (2007) Khoshgoftaar, T. M., Golawala, M., and Van Hulse, J. (2007). An empirical study of learning from imbalanced data using random forest. In Tools with Artificial Intelligence, 2007. ICTAI 2007. 19th IEEE International Conference on, volume 2, pages 310–317. IEEE.
  • Kriegel et al. (2012) Kriegel, H.-P., Kroger, P., Schubert, E., and Zimek, A. (2012). Outlier detection in arbitrarily oriented subspaces. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 379–388. IEEE.
  • Lee et al. (2005) Lee, E.-K., Cook, D., Klinke, S., and Lumley, T. (2005). Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics, 14(4):831–846.
  • Ortner et al. (2017a) Ortner, T., Filzmoser, P., Zaharieva, M., Breiteneder, C., and Brodinova, S. (2017a). Guided projections for analysing the structure of high-dimensional data. arXiv preprint arXiv:1702.06790.
  • Ortner et al. (2017b) Ortner, T., Filzmoser, P., Zaharieva, M., Breiteneder, C., and Brodinova, S. (2017b). Guided projections for analysing the structure of high-dimensional data. arXiv preprint arXiv:1702.06790.
  • Sanchez (2013) Sanchez, G. (2013). DiscriMiner: Tools of the Trade for Discriminant Analysis. R package version 0.1-29.
  • Shao et al. (2011) Shao, J., Wang, Y., Deng, X., Wang, S., et al. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39(2):1241–1265.
  • Todorov (2014) Todorov, V. (2014). rrcovHD: Robust multivariate methods for high dimensional data. R package version 0.2-3. Available at https://CRAN.R-project.org/package=rrcovHD.
  • Witten (2011) Witten, D. (2011). penalizedLDA: Penalized classification using Fisher’s linear discriminant. R package version 1.0.
  • Witten and Tibshirani (2011) Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753–772.