Classification with many classes: challenges and pluses
The objective of the paper is to study the accuracy of multi-class classification in a high-dimensional setting, where the number of classes is also large (“large , large , small ” model). While this problem arises in many practical applications and many techniques have recently been developed for its solution, to the best of our knowledge nobody has provided a rigorous theoretical analysis of this important setup. The purpose of the present paper is to fill this gap.
We consider one of the most common settings, classification of high-dimensional normal vectors where, unlike standard assumptions, the number of classes could be large. We derive non-asymptotic conditions on effects of significant features, and the lower and upper bounds for distances between classes required for successful feature selection and classification with a given accuracy. Furthermore, we study an asymptotic setup where the number of classes is growing with the dimension of the feature space while the number of samples per class is possibly limited. We discover an interesting and, at first glance, somewhat counter-intuitive phenomenon that a large number of classes may be a “blessing” rather than a “curse” since, in certain settings, the precision of classification can improve as the number of classes grows. This is due to more accurate feature selection since even weaker significant features, which are not sufficiently strong to be manifested in a coarse classification, can nevertheless have a strong impact when the number of classes is large. We supplement our theoretical investigation by a simulation study and a real data example where we again observe the above phenomenon.
Keywords: Feature selection; high-dimensionality; misclassification error; multi-class classification; sparsity.
Classification has been studied in many contexts. In the era of “big data” one is usually interested in classifying objects that are described by a large number of features and belong to many different groups. For example, the large hand-labeled ImageNet dataset http://www.image-net.org/ contains 10,000,000 labeled images depicting more than 10,000 object categories where each image, on the average, is represented by pixels (see Russakovsky et al., 2015 for description and discussion of this data set). The challenge of handling high-dimensional data is known as the “large small ” type of problem, which means that the dimensionality of the parameter space by far exceeds the sample size . It is well known that solving problems of this type requires rigorous model selection. In fact, the results of Bickel and Levina (2004), Fan and Fan (2008), Shao et al. (2011) demonstrate that even for the standard case of two classes, classification of high-dimensional normal vectors without feature selection is as bad as pure random guessing. However, while analysis of high-dimensional data (“big data”) has become ubiquitous, to the best of our knowledge, there are no theoretical studies that examine the effect of a large number of classes on classification accuracy. The objective of the present paper is to fill this gap.
At first glance, the problem of successful classification when the number of classes is large seems close to impossible. On the other hand, humans have no difficulty in distinguishing between thousands of objects, and the accuracy of state-of-the-art computer vision techniques is approaching human accuracy. In fact, in some settings, the accuracy of classification improves when the number of classes grows. How is this possible? One of the reasons why multi-class classification succeeds is that selection of appropriate features from a large sparse -dimensional vector becomes easier when the number of classes is growing since even weaker significant features that are not sufficiently strong to be manifested in a coarse classification with a small number of classes may nevertheless have a strong impact as the number of classes grows. Simulation studies in Davis, Pensky and Crampton (2011) and Parrish and Gupta (2012) support such a claim. Arias-Castro, Candès and Plan (2011) reported on a similar occurrence for testing in the sparse ANOVA model. Our paper establishes a firm theoretical foundation under the above phenomenon and confirms it via simulation studies and a real data example.
Although there exists an enormous amount of literature on classification, most of the existing theoretical results have been obtained for the binary classification () (see Boucheron, Bousquet and Lugosi, 2005 and references therein for a comprehensive survey). In particular, binary classification of high-dimensional sparse Gaussian vectors was considered in Bickel and Levina (2004), Fan and Fan (2008), Donoho and Jin (2009 ab), Ingster, Pouet and Tsybakov (2009) and Shao et al. (2011) among others.
In the meantime, a significant amount of effort has been spent on designing methods for the multi-class classification in statistical and machine learning literature. We can mention here techniques designed to adjust pairwise classification to multi-class setting (Escalera et al., 2011; Hill and Doucet, 2007; Jain and Kapoor, 2009), adjustment of the support vector machine technique to the case of several classes (Crammer and Singer, 2001; Lee, Lin and Wahba, 2004) as well as a variety of approaches to expand the linear regression and the neural networks techniques to accommodate the multi-category setup (see, e.g., Gupta, Bengio and Weston, 2014). Tewari and Bartlett (2007) and Pan, Wang and Li (2016) generalized theoretical results for binary classification to the case of multi-class classification and established consistency of the proposed classification procedures. However, all above-mentioned investigations considered only the “small , large , small ” setup, where the number of classes was assumed to be fixed.
This paper is probably the first attempt to rigorously investigate “large , large , small ” classification and the impact of the number of classes on the accuracy of feature selection and classification. In particular, we explore the somewhat counter-intuitive phenomenon, where the large number of classes may become a “blessing” rather than a “curse” for successful classification as more significant features may be revealed. For this purpose, we consider a well-known problem of multi-class classification of high-dimensional normal vectors. We assume that only a subset of truly significant features really contribute to separation between classes (sparsity). For this reason, we carry out feature selection and, following a standard scheme, assign the new observed vector to the closest class w.r.t. the scaled Mahalanobis distance in the space of the selected significant features. Our paper considers a realistic scenario where the number of classes as well as the number of features is large while the number of observations per class is possibly limited (“large , large , small ” model). We do not fix the total number of observations since in the real world the experience of each new class comes with its own, usually finite, set of observations.
We start with a non-asymptotic setting and derive the conditions on effects of significant features, and the lower and upper bounds for the distances between classes required for successful feature selection and classification with a given accuracy. All the results are obtained with explicit constants and remain valid for any combination of parameters. Our finite sample study is followed by an asymptotic analysis for a large number of features , where, unlike previous works, the number of classes may grow with while the number of samples per class may grow or stay fixed. Our findings indicate that having a larger number of classes aids the feature selection and, hence, can improve classification accuracy. On the other hand, a larger number of classes requires a larger number of significant features for their separation, which automatically leads to a “large ” setting. Nevertheless, due to increasing point isolation in high-dimensional spaces (see e.g. Giraud, 2015, Section 1.2.1), those separation conditions become attainable when is large.
We ought to point out that our paper does not propose a novel methodology for feature selection or classification. Rather, it studies one of the most popular Gaussian settings and adapts to the case of a large number of classes a standard general scheme, where feature selection is implemented by a thresholding technique with a properly chosen threshold and classification is carried out on the basis of the minimal Mahalanobis distance (we consider both the known and the unknown covariance matrix scenarios). The reason for this choice is that such a general scheme for classification and feature selection in this setting is widely used (see, e.g., Fan and Fan (2008), Shao et al. (2011) and Pan, Wang and Li (2016) for similar approaches that differ mostly by selections of thresholds and distances). Moreover, the setup is simple enough for derivations of conditions required for successful classification with a specified precision when the number of classes is large. Therefore, in our simulation study we do not compare these simple and well-known techniques with state-of-the-art classification methodologies but instead investigate how these popular procedures perform when is large and both the number of classes and the number of significant features are growing. In particular, simulations support our finding that classification precision can improve when is increasing. The real data example confirms that the phenomenon above is not due to an artificial construction and is possible in a real-life setting.
The rest of the paper is organized as follows. In Section 2 we present the feature selection and multi-class classification procedures and derive the non-asymptotic bounds for their accuracy. An asymptotic analysis is considered in Section 3. Section 4 discusses adaptation of the procedure in the case of the unknown covariance matrix. In Section 5 we illustrate the performance of the proposed approach on simulated and real-data examples. Some concluding remarks are summarized in Section 6. All the proofs are given in the Appendix.
2 Feature selection and classification procedure
2.1 Notation and preliminaries
Consider the problem of multi-class classification of -dimensional normal vectors with classes:
where is the vector of mean effects of features in the -th class and with the common non-singular covariance matrix . To clarify the proposed approach we assume meanwhile that is known and discuss the situation with the unknown in Section 4.
In what follows, we study a realistic scenario where the number of classes as well as the number of features is large while the number of observations per class is possibly limited (“large , large , small ” model). We do not fix the total number of observations since in the real world the experience of each new class comes with its own, usually finite, set of observations.
After averaging over repeated observations within each class, model (1) yields
The objective is to assign a new observed feature vector to one of the classes. Denote
where evidently .
Since , we assign to the class with the nearest centroid w.r.t. the scaled Mahalanobis distance:
It is well-known (see, e.g., Bickel and Levina, 2004, Fan and Fan, 2008 and Shao et al., 2011) that the performance of classification procedures worsens as the number of features grows (curse of dimensionality). Hence, dimensionality reduction by feature selection prior to classification is crucial for large values of .
Re-write (2) in terms of the one-way multivariate analysis of variance (MANOVA) model as follows:
where , is the vector of mean main effects of features and is the mean interaction effect of -th feature with -th class, with the standard identifiability conditions for each .
The impact of -th feature on classification depends on its variability between the different classes characterized by the interactions in the model (5). The larger the interactions, the stronger the impact of the feature. A natural global measure of a feature’s contribution to classification is then . Note that a feature may have a strong main effect while its contribution to classification nevertheless remains weak if it does not vary significantly between classes, that is, if is small. The main goal of feature selection is to identify a sparse subset of significant features for further use in classification.
2.2 Oracle classification
First, we consider an ideal situation where there is an oracle that provides the list of truly significant features with . In this case, we would obviously use only those features for classification, thus, reducing the dimensionality of the problem. Define indicator variables , and let and be, respectively, the numbers of significant and non-significant features. Without loss of generality, we can always order features in such a way that those significant features are the first ones. The classification procedure (4) then becomes
where are the truncated versions of and respectively: and , and is the corresponding upper left sub-matrix of .
for some .
Let a new observation from the class be assigned to the -th class according to classification rule (6). Then, the misclassification error is
Condition (7) requires that classes be sufficiently separated from each other (in terms of Mahalanobis distance) to achieve the required classification accuracy. In fact, the requirements in (7) are also essentially necessary. Theorem 2 below, which is a direct consequence of Fano’s lemma for the lower bound of misclassification error (see, e.g., Ibragimov and Hasminskii, 1981, Section 7.1), implies that the first term in the RHS of (7) is unavoidable for successful classification and cannot be significantly improved (in the minimax sense) even in the idealized case, where the class centers are known:
Consider the model (1). Let a new observation be from one of classes. If
for some , then
where is the probability evaluated under the assumption that belongs to the -th class, and the infimum is taken over all classification rules .
The second term in the RHS of (7) appears due to replacing the unknown -dimensional class centers ’s by the corresponding within-class sample means ’s in (6). Indeed, straightforward extension of the results of Theorem 1 of Fan and Fan (2008) for a general yields that, unless for all pairs , for some , the curse of dimensionality affects the accumulated error in estimating high-dimensional ’s and yields classification performance nearly the same as random guessing.
2.3 Feature selection procedure
Consider now classification setup in the MANOVA model (5) with a more realistic scenario, where a set of significant features is unknown and should be identified from the data.
Following our previous arguments, a -th feature is not significant (irrelevant) for classification if it has zero interaction effects with all classes, that is, if or, equivalently, . Then, for each we need to test the null hypothesis . An obvious test statistic is then
where and . Under the null, , while under the alternative , where is the non-central chi-square distribution with the non-centrality parameter . Note that unless is diagonal, ’s are correlated.
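As an illustration only, the per-feature test statistic can be sketched in Python as the between-class sum of squares scaled by the known feature variance. The function names, the use of the size-weighted grand mean, and the user-supplied `threshold` argument are our assumptions for this sketch; the threshold of the paper is the one defined in (12) and is not reproduced here.

```python
import numpy as np


def feature_statistics(samples, variances):
    """Between-class statistic for every feature.

    samples   : list of arrays, one per class, each of shape (n_k, p)
    variances : length-p array of (assumed known) feature variances
    Returns a length-p array with sum_k n_k * (class mean - grand mean)^2 / variance.
    """
    sizes = np.array([s.shape[0] for s in samples], dtype=float)
    class_means = np.vstack([s.mean(axis=0) for s in samples])      # shape (K, p)
    # Size-weighted grand mean (an assumption of this sketch).
    grand_mean = (sizes[:, None] * class_means).sum(axis=0) / sizes.sum()
    between = (sizes[:, None] * (class_means - grand_mean) ** 2).sum(axis=0)
    return between / np.asarray(variances, dtype=float)


def select_features(samples, variances, threshold):
    """Indices of features whose statistic exceeds a user-supplied threshold."""
    return np.flatnonzero(feature_statistics(samples, variances) > threshold)
```

With the weighted grand mean, the statistic of a feature without class effects follows a central chi-square distribution with one degree of freedom fewer than the number of classes, in line with the null distribution described above.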
For a given , define a threshold
and select the -th feature as significant (reject ) if
The following theorem shows that under certain conditions on the minimal required effect for significant features, the proposed feature selection procedure correctly identifies the true (unknown) subset of significant features with probability at least :
The condition (15) on the total minimal effect for significant features can be re-formulated in terms of their average effect per class:
Thus, as the number of classes increases, even significant features with weaker effects within each class become manifested and contribute to classification. Effect of a certain feature that remains latent and unnoticed in coarse classification with a small number of classes may be expressed in a finer classification.
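A rough numerical heuristic, not the condition of the theorem, may help to see why weaker per-class effects become detectable. Assume (our simplification) equal class sizes and an effect of equal magnitude in every class. Then the non-centrality parameter grows linearly in the number of classes, whereas the standard deviation of the central chi-square null distribution grows only like the square root of its degrees of freedom. The function name and arguments below are hypothetical.

```python
import math


def mean_shift_in_null_sds(num_classes, per_class_size, effect, sigma=1.0):
    """Shift of the statistic's mean under the alternative, in null standard deviations.

    Assumes equal class sizes and an effect of magnitude `effect` in every class,
    so the non-centrality is num_classes * per_class_size * effect^2 / sigma^2.
    The null distribution is chi-square with (num_classes - 1) degrees of freedom,
    whose standard deviation is sqrt(2 * (num_classes - 1)).
    """
    noncentrality = num_classes * per_class_size * effect ** 2 / sigma ** 2
    null_sd = math.sqrt(2.0 * (num_classes - 1))
    return noncentrality / null_sd
```

For a fixed per-class effect this ratio increases roughly like the square root of the number of classes. The actual threshold also depends on the number of features and the target error level, which this sketch ignores.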
2.4 Classification rule and misclassification error
Consider now the classification rule (6), where the unknown true are replaced by following the proposed feature selection procedure. Let be the number of features declared significant and . Again, order the features in such a way that those features selected as significant are the first ones. Thus, the resulting classification rule can then be presented as follows:
where the truncated vectors are defined now as , and is the corresponding upper left sub-matrix of .
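The resulting rule can be sketched as follows: restrict the new observation, the class sample means and the covariance matrix to the selected features, and pick the class with the smallest scaled Mahalanobis distance. The scaling factor n_k/(n_k + 1) used below is our assumption, motivated by the variance of the difference between a new observation and a sample mean of n_k observations; the function and argument names are hypothetical.

```python
import numpy as np


def classify(y, class_means, class_sizes, cov, selected):
    """Assign y to the class with the smallest scaled Mahalanobis distance.

    y           : length-p new observation
    class_means : (K, p) array of within-class sample means
    class_sizes : length-K array of class sample sizes
    cov         : (p, p) covariance matrix (known or estimated)
    selected    : indices of the features declared significant
    """
    idx = np.asarray(selected, dtype=int)
    sub_cov = cov[np.ix_(idx, idx)]                    # sub-matrix for selected features
    diffs = class_means[:, idx] - y[idx]               # shape (K, m)
    solved = np.linalg.solve(sub_cov, diffs.T).T       # rows are inverse(sub_cov) @ diff
    dist = np.einsum("km,km->k", diffs, solved)        # squared Mahalanobis distances
    sizes = np.asarray(class_sizes, dtype=float)
    scaled = dist * sizes / (sizes + 1.0)              # assumed scaling n_k / (n_k + 1)
    return int(np.argmin(scaled))
```

Solving a linear system with the sub-matrix avoids forming its explicit inverse, which is usually preferable numerically.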
3 Asymptotic analysis
Conditions (7) and (15) (or (16)) of Theorems 1 and 2, respectively, provide the non-asymptotic lower bounds on the minimal distance between different classes and the minimal effect of significant features required for the perfect feature selection and classification error bounded above by . In order to gain better understanding of these conditions, we consider an asymptotic setup.
Standard asymptotics considered in classification literature assume that the number of features and the sample sizes increase whereas the number of classes is fixed (see, e.g., Fan and Fan, 2008; Shao et al., 2011 for and Pan, Wang and Li, 2016, for a general but fixed ). On the contrary, our study is motivated by the case where the number of classes may also be large (“large , large , small ”).
Recall that is the total sample size and let the number of features . Pan, Wang and Li, 2016, assume that all eigenvalues of the covariance matrix of significant features are finite and bounded away from zero, i.e., there exist absolute constants and such that
Assume that the sample sizes within classes also grow with and, for simplicity of exposition, are of the same asymptotic order, that is, , where and means . In such an asymptotic setup, , while . Though the results in the previous section allow one to study various other settings with unequal class sizes, the asymptotic analysis of a vast variety of such possible scenarios is beyond the scope of this paper.
Consider now the condition (7) of Theorems 1 and 4 on the minimal separation Mahalanobis distance between any two class centers as tends to infinity, while , the number of significant features and the number of classes may increase with , and may depend on and . Thus, (7) yields:
Depending on , the condition (20) implies two possible asymptotic regimes for :
For sparse regime (), the required minimal between-class distance grows slowly as and from Theorem 2 it immediately follows that this is the lowest possible rate for successful classification:
Let and as . Let a new observation be from one of classes. If
where arbitrarily slow as , then
where is the probability evaluated under the assumption that belongs to the -th class, and the infimum is taken over all classification rules .
For the dense regime, the number of significant features is large enough for the accumulated error of estimating -dimensional by ’s to become dominant (see Section 2.2), and the classes should therefore be much more strongly separated to deal with the curse of dimensionality.
It is natural that for successful classification the between-class distances should grow with . Note, however, that unless the number of classes increases exponentially with , the growth rate of is and the corresponding average per-feature distances still tend to zero.
and the threshold in (12) for feature selection can be presented as
To gain some insight on the minimal required effect for a significant feature to contribute to classification as the number of classes increases, assume for simplicity that each significant feature has equal effects on each class, that is, in (5) vary only in signs: , . Since implies that is large, so that , condition (22) yields as :
Since is decreasing with for a given value of , the required minimal level for in the RHS of (23) decreases as grows and, therefore, more significant features become manifested in classification for larger number of classes. Thus, while it might be hard to perform coarse classification with a set of weak features, their impacts grow as one considers finer and finer separation between objects (see also the corresponding remarks at the end of Section 2.3).
Although in this section our goal was to explore the case when , calculations above remain valid for a fixed value of (commonly, ). In particular, if is fixed and , conditions (20) and (23) are of the form and and are similar to those of Fan & Fan (2008, Theorem 1 and Theorem 3). See also the results of Donoho and Jin (2009 a,b) and Ingster, Pouet & Tsybakov (2009) for closely related setups.
4 Unknown covariance matrix
So far the covariance matrix was assumed to be known. In practice, however, it should usually be estimated from the data. The standard MLE estimator
and the similar unbiased pooled estimator commonly used in MANOVA behave poorly for high-dimensional data. However, under the sparsity assumption, the proposed classification procedure requires estimating only the variances in the feature selection procedure (11) and the inverse of the upper left sub-matrix of in classification rule (17). Thus, when , a low-dimensional matrix may still be a good estimator of the true sub-matrix and (under some additional mild conditions) may be used instead of the latter in (17).
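A minimal sketch of the pooled MLE of a common covariance matrix, optionally restricted to a subset of features, is given below. We assume the standard form: the sum of within-class scatter matrices divided by the total sample size. The function name and the `selected` argument are hypothetical.

```python
import numpy as np


def pooled_covariance_mle(samples, selected=None):
    """Pooled MLE of the common covariance matrix.

    samples  : list of arrays, one per class, each of shape (n_k, p)
    selected : optional list of feature indices; if given, only the
               corresponding sub-matrix is estimated
    """
    if selected is not None:
        samples = [s[:, selected] for s in samples]
    total = sum(s.shape[0] for s in samples)
    dim = samples[0].shape[1]
    scatter = np.zeros((dim, dim))
    for s in samples:
        centered = s - s.mean(axis=0)          # center within each class
        scatter += centered.T @ centered
    return scatter / total                     # MLE divides by the total sample size
```

Dividing by the total sample size minus the number of classes instead would give the unbiased pooled estimator mentioned above.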
where is the threshold (12) used for the case of known variances and
The following theorem shows that under slightly stronger conditions on the minimal required effect for significant features, the above feature selection procedure with estimated still controls the probability of correct identification of the true subset of significant features.
Consider now the classification procedure (17). In what follows we assume that is non-singular. Let be the corresponding upper left sub-matrix of , i.e.
where are the corresponding -dimensional truncated versions of .
Assign the -th class by replacing the true (unknown) in (17) by :
Then the following version of Theorem 4 holds:
for some and is an absolute constant specified in the proof. Denote
Theorem 6 shows that for a sparse setup the proposed classification procedure can still be used when the covariance matrix is unknown and estimated from the data.
In this section we demonstrate the performance of the proposed feature selection and classification procedure on simulated and real-data examples. Its main goal is to illustrate the phenomenon of improving the accuracy as the number of classes grows discussed in the previous sections. Simulated examples follow the settings presented in Pan, Wang and Li (2016).
5.1 Simulation study
We generated the class means as i.i.d. normal vectors , where is a diagonal matrix with for indices and for others. Since the vectors generated in this manner do not necessarily satisfy our assumptions, in order to reduce an impact of a particular choice of vectors , we generated replications of the class means. Furthermore, following the model (2), for each replication of class means we generated sets of training samples , where are i.i.d. . Finally, for each of sets of training samples, we drew a test set of new vectors from randomly chosen classes as i.i.d. normal vectors .
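The data-generating scheme can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: we assume that the class means have standard deviation `tau` on the first `num_signif` coordinates and are exactly zero elsewhere, and that all classes have the same training size; all names and parameter values are hypothetical.

```python
import numpy as np


def simulate_data(num_classes, dim, num_signif, per_class, num_test, tau, cov, rng):
    """Generate class means, training samples and a labeled test set.

    Class means are N(0, tau^2) on the first num_signif coordinates and
    zero on the remaining ones (an assumption of this sketch).
    """
    means = np.zeros((num_classes, dim))
    means[:, :num_signif] = tau * rng.standard_normal((num_classes, num_signif))
    chol = np.linalg.cholesky(cov)             # cov = chol @ chol.T
    train = [
        means[k] + rng.standard_normal((per_class, dim)) @ chol.T
        for k in range(num_classes)
    ]
    labels = rng.integers(0, num_classes, size=num_test)   # randomly chosen classes
    test = means[labels] + rng.standard_normal((num_test, dim)) @ chol.T
    return means, train, test, labels
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the replications reproducible.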
We used the same three choices for covariance matrix as in Pan, Wang and Li (2016). In Example 1 features were independent, i.e. . In Example 2 we used the autoregressive covariance structure with , while in Example 3 we set implying equal variances and all covariances equal to (compound symmetric structure). We carried out simulations with both the true covariance matrix and its MLE given by (24). Since the performances of feature selection and classification procedures in both cases were similar, in what follows we present only the results obtained with .
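The three covariance structures can be constructed as in the sketch below. We assume unit variances, the usual first-order autoregressive form with entries `rho ** |i - j|`, and a compound symmetric matrix with all off-diagonal entries equal to `rho`; the parameter name `rho` and the function names are hypothetical, and the parameter values used in the study are not reproduced here.

```python
import numpy as np


def identity_cov(dim):
    """Independent features with unit variances."""
    return np.eye(dim)


def autoregressive_cov(dim, rho):
    """AR(1)-type structure: entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def compound_symmetric_cov(dim, rho):
    """Unit variances and all covariances equal to rho."""
    return (1.0 - rho) * np.eye(dim) + rho * np.ones((dim, dim))
```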
For each training sample we first carried out the feature selection procedure described above with the threshold defined in (25) and . Subsequently, we used the selected features for classifying vectors from the corresponding test set according to the rule (30). In the case when it delivered a non-unique solution, we chose one of the suggested solutions at random.
In all simulations we used and . Note that classification precision depends on the variance ratio that may be viewed as a signal-to-noise ratio. For this reason, we studied performance of feature selection and classification for various combinations of , and . In particular, we used , and several values of depending on .
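Table 1: Proportions of false negative features (over the total number of significant features) for Example 1 and Example 2.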
The results of simulations indicate that for such a data generating model (somewhat different from that analyzed in the paper), the threshold in (25) (as well as in (12) for the known variances) might be too high, especially for small values of . The latter led to an over-conservative feature selection procedure. Thus, in all simulations the feature selection procedure did not detect false positive features. The information on the proportions of false negative features (over the total number of significant features) for several combinations of , and over training samples is summarized in Table 1 for Example 1 and Example 2 (the results for Example 3 were similar and we omit their presentation to save space). In particular, Table 1 clearly shows that for small values of and small , due to the over-conservative feature selection procedure, almost no significant features were detected and the resulting classification is then essentially reduced to a pure random guess. However, for any the detection rate improves as grows. The improvement rate is very fast for . Thus, for the vast majority of significant features were detected in spite of a high level of noise. As we have mentioned, this improves the classification precision since weaker significant features that remained latent in coarse classification become active and may have a strong impact with increasing .
For each combination of , and we calculated the corresponding average misclassification errors: see Figures 1–3 for Examples 1–3, respectively. Figures 1–3 show similar behavior for all three examples. For any and the misclassification error tends to zero as increases. The decay is faster for larger – the more significant features, the easier the classification. The figures also demonstrate another interesting phenomenon: for moderate and large , the larger , the faster the decay. As we have argued, this is due to the fact that the impact of weaker significant features becomes stronger with increasing . For small (strong noise), misclassification errors are higher for a larger number of classes . This is naturally explained by the failure of the feature selection procedure to detect significant features in this case (see comments above), so that the resulting classification is similar to a random guess with a misclassification error (see Figures 1–3). However, as increases, even the first few detected significant features strongly improve classification precision.
5.2 Real-data example
We applied feature selection techniques discussed above to a dataset of communication signals recorded from South American knife fishes of the genus Gymnotus. These nocturnally active freshwater fishes generate pulsed electrostatic fields from electric organ discharges (EODs). The three-dimensional electrostatic EOD fields of Gymnotus can be summarized by two-dimensional head-to-tail waveforms recorded from underwater electrodes placed in front of and behind a fish. EOD waveforms vary among species and are used by genus Gymnotus in order to recognize its own kind for more productive mating and other purposes.
The data set consists of 512-dimensional vectors of the Symmlet-4 discrete wavelet transform coefficients of signals obtained from eight genetically distinct species of Gymnotus (G. arapaima (G1), G. coatesi (G2), G. coropinae (G3), G. curupira (G4), G. jonasi (G5), G. mamiraua (G6), G. obscurus (G7), G. varzea (G8)) at various stages of their development. In particular, species were divided into six ontogenetic categories: postlarval (J0), small juvenile (J1), large juvenile (J2), immature adult (IA), mature male (M) and mature female (F). The EODs were recorded from 42 of 48 possible combinations of eight species and six categories. There are 677 samples from 42 classes with sizes varying from 3 to 69. The complete description of the data can be found in Crampton et al. (2011).
As is evident from Crampton, Lovejoy and Waddell (2011), there is no expectation that these groups should all be mutually separable: there is considerable overlap between developmental stages of the same species as well as among juveniles of different species. For this reason, we reduced the number of classes to include only those species/categories that might be potentially separated. In particular, we ran our feature selection and classification procedure with the data sets comprised of 10 to 16 classes listed in the order they appear: G2-M, G4-M, G5-M, G1-F, G2-F, G5-F, G7-F, G8-F, G2-J1, G4-J1, G2-F, G1-J1, G7-IA, G1-F, G6-M, G7-J1.
We split the respective data sets into training and test parts. For this purpose, in each class we chose at random at most 1/3 of the total number of observations for validation leaving the rest of the data as training samples. Using those training samples, we carried out feature selection and subsequent classification of vectors in the test part of the data set. We repeated the process 100 times for various splits and recorded the average misclassification errors and their standard errors for each of the cases (). Table 2 presents results of the study: the average sample sizes of train () and test () sets for each , the average number of selected significant features () and average misclassification error with the corresponding standard errors.
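One way to implement such a split is sketched below. We read "at most 1/3" as holding out the integer part of one third of each class, which is our interpretation; the function name, the `test_divisor` argument and the label encoding are hypothetical.

```python
import numpy as np


def split_by_class(labels, rng, test_divisor=3):
    """Randomly hold out at most 1/test_divisor of each class for testing.

    labels : length-n array of class labels
    rng    : numpy random Generator
    Returns sorted index arrays (train_idx, test_idx).
    """
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == cls))
        n_test = members.size // test_divisor          # integer part, so never more than 1/3
        test_idx.extend(members[:n_test].tolist())
        train_idx.extend(members[n_test:].tolist())
    return np.sort(np.array(train_idx, dtype=int)), np.sort(np.array(test_idx, dtype=int))
```

Repeating the call with the same generator produces the different random splits over which the errors are averaged.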
The table shows that when one starts with 10 well separated classes the misclassification error initially grows when increases from 10 to 13. However, at there is a strong jump in the number of detected features and the misclassification error again starts to decrease when grows from 13 to 15 due to better feature selection. For the misclassification error grows again with due to poor separation of juvenile Gymnotus EOD waveform shapes.
6 Concluding remarks
The paper considers multi-class classification of high-dimensional normal vectors, where the number of classes may also be large. This is a first attempt to rigorously study the “large , large , small ” classification problem.
We propose a consistent feature selection procedure and derive the misclassification error of the resulting classification procedure. In particular, our results indicate an interesting phenomenon that the precision of classification can improve as the number of classes grows. This is, at first glance, a completely counter-intuitive conclusion and has not been observed so far due to the shortage of literature on multi-class classification. It is explained by the fact that even weaker significant features, which might be undetected for smaller , strongly contribute to successful classification when the number of classes is large. We believe that the results of the paper motivate further investigation of “large , large , small ” classification in other, more complicated setups.
Felix Abramovich was supported by the Israel Science Foundation (ISF), grant ISF-820/13. Marianna Pensky was partially supported by National Science Foundation (NSF), grants DMS-1407475 and DMS-1712977. The authors would like to thank Vladimir Koltchinskii for valuable remarks and Will Crampton for providing the data set used for the real data example.
We start from recalling two lemmas of Birgé (2001) that will be used further in the proofs.
Lemma 1 (Lemma 8.1 of Birgé, 2001).
Let , . Then, for any
Lemma 2 (Lemma 8.2 of Birgé, 2001).
Let be a random variable such that
where and are positive constants. Then
Proof of Theorem 1. Note that
For a given define a -dimensional random vector , where the vectors and are defined in (6). A straightforward calculus yields
Consider a random variable . Since is a symmetric positive-definite matrix and is symmetric, they can be simultaneously diagonalized, that is, there exists a matrix , such that and , where is a diagonal matrix of the eigenvalues of . Then, from the known results on the distribution of quadratic forms of normal variables (e.g., Imhof, 1961), can be represented as a weighted sum of independent (generally) non-central chi-square variables as
where is such that with given by (37). By a straightforward matrix calculus, obtain
and, therefore, all eigenvalues of matrix are of the forms