Random-projection ensemble classification
We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.
Keywords: Aggregation; Classification; High-dimensional; Random projection
Supervised classification concerns the task of assigning an object (or a number of objects) to one of two or more groups, based on a sample of labelled training data. The problem was first studied in generality in the famous work of Fisher (1936), where he introduced some of the ideas of Linear Discriminant Analysis (LDA), and applied them to his Iris data set. Nowadays, classification problems arise in a plethora of applications, including spam filtering, fraud detection, medical diagnoses, market research, natural language processing and many others.
In fact, LDA is still widely used today, and underpins many other modern classifiers; see, for example, Friedman (1989) and Tibshirani et al. (2002). Alternative techniques include support vector machines (Cortes and Vapnik, 1995), tree classifiers and random forests (Breiman et al., 1984; Breiman, 2001), kernel methods (Hall and Kang, 2005) and nearest neighbour classifiers (Fix and Hodges, 1951). More substantial overviews and in-depth discussion of these techniques, and others, can be found in Devroye, Györfi and Lugosi (1996) and Hastie et al. (2009).
An increasing number of modern classification problems are high-dimensional, in the sense that the dimension $p$ of the feature vectors may be comparable to or even greater than the number of training data points, $n$. In such settings, classical methods such as those mentioned in the previous paragraph tend to perform poorly (Bickel and Levina, 2004), and may even be intractable; for example, this is the case for LDA, where the problems are caused by the fact that the sample covariance matrix is not invertible when $p \geq n$.
Many methods proposed to overcome such problems assume that the optimal decision boundary between the classes is linear, e.g. Friedman (1989) and Hastie et al. (1995). Another common approach assumes that only a small subset of features are relevant for classification. Examples of works that impose such a sparsity condition include Fan and Fan (2008), where it is also assumed that the features are independent, as well as Tibshirani et al. (2003), where soft thresholding is used to obtain a sparse boundary. More recently, Witten and Tibshirani (2011) and Fan, Feng and Tong (2012) both solve an optimisation problem similar to Fisher’s linear discriminant, with the addition of an $\ell_1$ penalty term to encourage sparsity.
In this paper we attempt to avoid the curse of dimensionality by projecting the feature vectors at random into a lower-dimensional space. The use of random projections in high-dimensional statistical problems is motivated by the celebrated Johnson–Lindenstrauss Lemma (e.g. Dasgupta and Gupta, 2002). This lemma states that, given $x_1, \ldots, x_n \in \mathbb{R}^p$, $\epsilon \in (0,1)$ and $d > 8 \log n / \epsilon^2$, there exists a linear map $f : \mathbb{R}^p \to \mathbb{R}^d$ such that
$$(1-\epsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1+\epsilon)\|x_i - x_j\|^2$$
for all $i, j = 1, \ldots, n$. In fact, the function $f$ that nearly preserves the pairwise distances can be found in randomised polynomial time using random projections distributed according to Haar measure, as described in Section 3 below. It is interesting to note that the lower bound on $d$ in the Johnson–Lindenstrauss lemma does not depend on $p$; this lower bound is optimal up to constant factors (Larsen and Nelson, 2016). As a result, random projections have been used successfully as a computational time saver: when $p$ is large compared to $\log n$, one may project the data at random into a lower-dimensional space and run the statistical procedure on the projected data, potentially making great computational savings, while achieving comparable or even improved statistical performance. As one example of the above strategy, Durrant and Kabán (2013) obtained Vapnik–Chervonenkis type bounds on the generalisation error of a linear classifier trained on a single random projection of the data. See also Dasgupta (1999), Ailon and Chazelle (2006) and McWilliams et al. (2014) for other instances.
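To make the distance-preservation property concrete, the following is a minimal numerical sketch and not part of the original analysis. It assumes NumPy is available, draws a single projection with orthonormal rows from a Gaussian matrix via the singular value decomposition, and rescales by $\sqrt{p/d}$ so that squared distances are preserved on average. The function names, the seed and the values of `n`, `p` and `d` are illustrative choices only.

```python
import numpy as np

def haar_projection(d, p, rng):
    """Return a d x p matrix with orthonormal rows, built from a Gaussian matrix."""
    q = rng.standard_normal((p, d))
    # Left singular vectors of a p x d Gaussian matrix: p x d, orthonormal columns.
    u, _, _ = np.linalg.svd(q, full_matrices=False)
    return u.T

def squared_distance_ratios(x, a):
    """Ratios (p/d) * ||A x_i - A x_j||^2 / ||x_i - x_j||^2 over all pairs i < j."""
    d, p = a.shape
    z = x @ a.T                                # projected points, n x d
    i, j = np.triu_indices(x.shape[0], k=1)    # all pairs with i < j
    original = np.sum((x[i] - x[j]) ** 2, axis=1)
    projected = np.sum((z[i] - z[j]) ** 2, axis=1)
    return (p / d) * projected / original

rng = np.random.default_rng(0)
n, p, d = 20, 1000, 200                        # illustrative sizes
x = rng.standard_normal((n, p))
a = haar_projection(d, p, rng)
ratios = squared_distance_ratios(x, a)
print(ratios.min(), ratios.max())              # both should be close to 1
```

Ratios near 1 for every pair correspond to a small distortion $\epsilon$ in the display above; the spread of the ratios shrinks as `d` grows.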
Other works have sought to reap the benefits of aggregating over many random projections. For instance, Marzetta, Tucci and Simon (2011) considered estimating a population inverse covariance (precision) matrix using $B^{-1} \sum_{b=1}^{B} A_b^T (A_b \hat{\Sigma} A_b^T)^{-1} A_b$, where $\hat{\Sigma}$ denotes the sample covariance matrix and $A_1, \ldots, A_B$ are random projections from $\mathbb{R}^p$ to $\mathbb{R}^d$. Lopes, Jacob and Wainwright (2011) used this estimate when testing for a difference between two Gaussian population means in high dimensions, while Durrant and Kabán (2015) applied the same technique in Fisher’s linear discriminant for a high-dimensional classification problem.
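As a rough illustration of this averaged estimator, the sketch below computes the average of $A^T (A \hat{\Sigma} A^T)^{-1} A$ over several random projections. It assumes NumPy, and it assumes that the projections have orthonormal rows and are drawn via the SVD of a Gaussian matrix; the cited works may sample their projections differently. The function name and all sizes are hypothetical.

```python
import numpy as np

def rp_precision_estimate(x, d, num_projections, rng):
    """Average of A^T (A S A^T)^{-1} A over random d x p projections A,
    where S is the sample covariance of the rows of x. Requires d < n."""
    n, p = x.shape
    sigma_hat = np.cov(x, rowvar=False)            # p x p sample covariance
    omega = np.zeros((p, p))
    for _ in range(num_projections):
        q = rng.standard_normal((p, d))
        u, _, _ = np.linalg.svd(q, full_matrices=False)
        a = u.T                                    # d x p, orthonormal rows
        omega += a.T @ np.linalg.inv(a @ sigma_hat @ a.T) @ a
    return omega / num_projections

rng = np.random.default_rng(1)
x = rng.standard_normal((30, 50))                  # n = 30 < p = 50
omega = rp_precision_estimate(x, d=5, num_projections=20, rng=rng)
print(omega.shape)
```

Although the $p \times p$ sample covariance is singular here, each projected $d \times d$ covariance is invertible, which is what makes the average well defined.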
Our proposed methodology for high-dimensional classification has some similarities with the techniques described above, in the sense that we consider many random projections of the data, but is also closely related to bagging (Breiman, 1996), since the ultimate assignment of each test point is made by aggregation and a vote. Bagging has proved to be an effective tool for improving unstable classifiers. Indeed, a bagged version of the (generally inconsistent) $1$-nearest neighbour classifier is universally consistent as long as the resample size is carefully chosen; see Hall and Samworth (2005). For a general theoretical analysis of majority voting approaches, see also Lopes (2016). Bagging has also been shown to be particularly effective in high-dimensional problems such as variable selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013). Another approach related to ours is that of Blaser and Fryzlewicz (2015), who consider ensembles of random rotations, as opposed to projections.
One of the basic but fundamental observations that underpins our proposal is the fact that aggregating the classifications of all random projections is not always sensible, since many of these projections will typically destroy the class structure in the data; see the top row of Figure 1. For this reason, we advocate partitioning the projections into disjoint groups, and within each group we retain only the projection yielding the smallest estimate of the test error. The attraction of this strategy is illustrated in the bottom row of Figure 1, where we see a much clearer partition of the classes. Another key feature of our proposal is the realisation that a simple majority vote of the classifications based on the retained projections can be highly suboptimal; instead, we argue that the voting threshold should be chosen in a data-driven fashion in an attempt to minimise the test error of the infinite-simulation version of our random projection ensemble classifier. In fact, this estimate of the optimal threshold turns out to be remarkably effective in practice; see Section 5.2 for further details. We emphasise that our methodology can be used in conjunction with any base classifier, though we particularly have in mind classifiers designed for use in low-dimensional settings. The random projection ensemble classifier can therefore be regarded as a general technique for either extending the applicability of an existing classifier to high dimensions, or improving its performance. The methodology is implemented in an R package RPEnsemble (Cannings and Samworth, 2016).
Our theoretical results are divided into three parts. In the first, we consider a generic base classifier and a generic method for generating the random projections into $\mathbb{R}^d$ and quantify the difference between the test error of the random projection ensemble classifier and its infinite-simulation counterpart as the number of projections increases. We then consider selecting random projections from non-overlapping groups by initially drawing them according to Haar measure, and then within each group retaining the projection that minimises an estimate of the test error. Under a condition implied by the widely-used sufficient dimension reduction assumption (Li, 1991; Cook, 1998; Lee et al., 2013), we can then control the difference between the test error of the random projection classifier and the Bayes risk as a function of terms that depend on the performance of the base classifier based on projected data and our method for estimating the test error, as well as a term that becomes negligible as the number of projections increases. The final part of our theory gives risk bounds for the first two of these terms for specific choices of base classifier, namely Fisher’s linear discriminant and the $k$-nearest neighbour classifier. The key point here is that these bounds only depend on $d$, the sample size $n$ and the number of projections, and not on the original data dimension $p$.
The remainder of the paper is organised as follows. Our methodology and general theory are developed in Sections 2 and 3. Specific choices of base classifier as well as a general sample splitting strategy are discussed in Section 4, while Section 5 is devoted to a consideration of the practical issues of computational complexity, choice of voting threshold, projected dimension and the number of projections used. In Section 6 we present results from an extensive empirical analysis on both simulated and real data where we compare the performance of the random projection ensemble classifier with several popular techniques for high-dimensional classification. The outcomes are very encouraging, and suggest that the random projection ensemble classifier has excellent finite-sample performance in a variety of different high-dimensional classification settings. We conclude with a discussion of various extensions and open problems. Proofs are given in the Appendix and the supplementary material Cannings and Samworth (2017), which appears below the reference list.
Finally in this section, we introduce the following general notation used throughout the paper. For a sufficiently smooth real-valued function $g$ defined on a neighbourhood of $t \in \mathbb{R}$, let $\dot{g}(t)$ and $\ddot{g}(t)$ denote its first and second derivatives at $t$, and let $\lfloor t \rfloor$ and $\llbracket t \rrbracket := t - \lfloor t \rfloor$ denote the integer and fractional part of $t$ respectively.
2 A generic random projection ensemble classifier
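We start by describing our setting and defining the relevant notation. Suppose that the pair $(X, Y)$ takes values in $\mathbb{R}^p \times \{0,1\}$, with joint distribution $P$, characterised by $\pi_1 := \mathbb{P}(Y = 1)$, and $P_r$, the conditional distribution of $X \mid Y = r$, for $r = 0, 1$. For convenience, we let $\pi_0 := \mathbb{P}(Y = 0) = 1 - \pi_1$. In the alternative characterisation of $P$, we let $P_X$ denote the marginal distribution of $X$ and write $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$ for the regression function. Recall that a classifier on $\mathbb{R}^p$ is a Borel measurable function $C : \mathbb{R}^p \to \{0,1\}$, with the interpretation that we assign a point $x \in \mathbb{R}^p$ to class $C(x)$. We let $\mathcal{C}_p$ denote the set of all such classifiers.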
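The test error of a classifier $C$ is
$$R(C) := \int_{\mathbb{R}^p \times \{0,1\}} \mathbb{1}_{\{C(x) \neq y\}} \, dP(x, y),$$
and is minimised by the Bayes classifier
$$C^{\mathrm{Bayes}}(x) := \begin{cases} 1 & \text{if } \eta(x) \geq 1/2; \\ 0 & \text{otherwise} \end{cases}$$
(e.g. Devroye, Györfi and Lugosi, 1996, p. 10). Its risk is $R(C^{\mathrm{Bayes}}) = \mathbb{E}[\min\{\eta(X), 1 - \eta(X)\}]$.
Footnote: We define $R(C)$ through an integral rather than $R(C) := \mathbb{P}\{C(X) \neq Y\}$ to make it clear that when $C$ is random (depending on training data or random projections), it should be conditioned on when computing $R(C)$.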
Of course, we cannot use the Bayes classifier in practice, since $\eta$ is unknown. Nevertheless, we often have access to a sample of training data that we can use to construct an approximation to the Bayes classifier. Throughout this section and Section 3, it is convenient to consider the training sample $\mathcal{T}_n := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ to be fixed points in $\mathbb{R}^p \times \{0,1\}$. Our methodology will be applied to a base classifier $C_n = C_{n, \mathcal{T}_{n,d}}$, which we assume can be constructed from an arbitrary training sample $\mathcal{T}_{n,d}$ of size $n$ in $\mathbb{R}^d \times \{0,1\}$; thus $C_n$ is a measurable function from $(\mathbb{R}^d \times \{0,1\})^n$ to $\mathcal{C}_d$.
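Now assume that $d \leq p$. We say a matrix $A \in \mathbb{R}^{d \times p}$ is a projection if $A A^T = I_{d \times d}$, the $d$-dimensional identity matrix. Let $\mathcal{A} = \mathcal{A}_{d \times p} := \{A \in \mathbb{R}^{d \times p} : A A^T = I_{d \times d}\}$ be the set of all such matrices. Given a projection $A \in \mathcal{A}$, define projected data $z_i^A := A x_i$ and $y_i^A := y_i$ for $i = 1, \ldots, n$, and let $\mathcal{T}_n^A := \{(z_1^A, y_1^A), \ldots, (z_n^A, y_n^A)\}$. The projected data base classifier corresponding to $C_n$ is $C_n^A : \mathbb{R}^p \to \{0,1\}$, given by
$$C_n^A(x) = C_{n, \mathcal{T}_n^A}^A(x) := C_{n, \mathcal{T}_n^A}(A x).$$
Note that although $C_n^A$ is a classifier on $\mathbb{R}^p$, the value of $C_n^A(x)$ only depends on $x$ through its $d$-dimensional projection $A x$.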
We now define a generic ensemble classifier based on random projections. For $B_1 \in \mathbb{N}$, let $A_1, \ldots, A_{B_1}$ denote independent and identically distributed projections in $\mathcal{A}$, independent of $(X, Y)$. The distribution on $\mathcal{A}$ is left unspecified at this stage, and in fact our proposed method ultimately involves choosing this distribution depending on $\mathcal{T}_n$.
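For $\alpha \in (0,1)$, the random projection ensemble classifier is defined to be
$$C_n^{\mathrm{RP}}(x) := \begin{cases} 1 & \text{if } \nu_n(x) \geq \alpha; \\ 0 & \text{otherwise,} \end{cases}$$
where $\nu_n(x) = \nu_n^{(B_1)}(x) := \frac{1}{B_1} \sum_{b_1 = 1}^{B_1} \mathbb{1}_{\{C_n^{A_{b_1}}(x) = 1\}}$ is the proportion of the $B_1$ projected data base classifiers that assign $x$ to class 1.
We emphasise again here the additional flexibility afforded by not pre-specifying the voting threshold $\alpha$ to be $1/2$. Our analysis of the random projection ensemble classifier will require some further definitions. Let
$$\hat{\mu}_n(x) := \mathbf{E}\{C_n^{A_1}(x)\} = \mathbf{P}\{C_n^{A_1}(x) = 1\}.$$
Footnote: In order to distinguish between different sources of randomness, we will write $\mathbf{P}$ and $\mathbf{E}$ for the probability and expectation, respectively, taken over the randomness from the projections $A_1, \ldots, A_{B_1}$. If the training data is random, then we condition on $\mathcal{T}_n$ when computing $\mathbf{P}$ and $\mathbf{E}$.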
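The voting rule can be sketched in a few lines. The example below is illustrative only: it assumes NumPy and that the 0/1 outputs of the $B_1$ projected data base classifiers have already been collected in an array; the name `rp_ensemble_vote` and the toy votes are hypothetical.

```python
import numpy as np

def rp_ensemble_vote(votes, alpha):
    """votes: (B1, m) array of 0/1 base-classifier outputs for m test points.
    Assign class 1 when the fraction of class-1 votes is at least alpha."""
    nu = np.mean(votes, axis=0)        # vote fraction for each test point
    return (nu >= alpha).astype(int)

# Four projections, three test points; vote fractions are 0.75, 0.25 and 0.5.
votes = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 0, 1],
                  [1, 1, 0]])
print(rp_ensemble_vote(votes, alpha=0.5))
print(rp_ensemble_vote(votes, alpha=0.6))
```

The third test point changes class between the two calls, which shows how the final assignment depends on the choice of $\alpha$ rather than only on a simple majority.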
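For $r \in \{0, 1\}$, define distribution functions $G_{n,r} : [0,1] \to [0,1]$ by $G_{n,r}(t) := P_r(\{x \in \mathbb{R}^p : \hat{\mu}_n(x) \leq t\})$. Note that since $G_{n,r}$ is non-decreasing it is differentiable almost everywhere; in fact, however, the following assumption will be convenient:
Assumption 1. $G_{n,0}$ and $G_{n,1}$ are twice differentiable at $\alpha$.
The first derivatives of $G_{n,0}$ and $G_{n,1}$, when they exist, are denoted as $g_{n,0}$ and $g_{n,1}$ respectively; under assumption 1, these derivatives are well-defined in a neighbourhood of $\alpha$. Our first main result below gives an asymptotic expansion for the expected test error of our generic random projection ensemble classifier as the number of projections increases. In particular, we show that this expected test error can be well approximated by the test error of the infinite-simulation random projection classifier
$$C_n^{\mathrm{RP}*}(x) := \begin{cases} 1 & \text{if } \hat{\mu}_n(x) \geq \alpha; \\ 0 & \text{otherwise.} \end{cases}$$
Note that provided $G_{n,0}$ and $G_{n,1}$ are continuous at $\alpha$, we have
$$R(C_n^{\mathrm{RP}*}) = \pi_1 G_{n,1}(\alpha) + \pi_0 \{1 - G_{n,0}(\alpha)\}.$$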
Theorem 1. Assume assumption 1. Then
as , where
The proof of Theorem 1 in the Appendix is lengthy, and involves a one-term Edgeworth approximation to the distribution function of a standardised Binomial random variable. One of the technical challenges is to show that the error in this approximation holds uniformly in the binomial proportion. Related techniques can also be used to show that under assumption 1; see Proposition 4 in the supplementary material. Very recently, Lopes (2016) has obtained similar results to this and to Theorem 1 in the context of majority vote classification, with stronger assumptions on the relevant distributions and on the form of the voting scheme. In Figure 2, we plot the average error (plus/minus two standard deviations) of the random projection ensemble classifier in one numerical example, as we vary ; this reveals that the Monte Carlo error stabilises rapidly, in agreement with what Lopes (2016) observed for a random forest classifier.
Our next result controls the test excess risk, i.e. the difference between the expected test error and the Bayes risk, of the random projection classifier in terms of the expected test excess risk of the classifier based on a single random projection. An attractive feature of this result is its generality: no assumptions are placed on the configuration of the training data $\mathcal{T}_n$, the distribution $P$ of the test point $(X, Y)$ or on the distribution of the individual projections.
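Theorem 2. For each $B_1 \in \mathbb{N} \cup \{\infty\}$, we have
$$\mathbf{E}\{R(C_n^{\mathrm{RP}})\} - R(C^{\mathrm{Bayes}}) \leq \frac{1}{\min(\alpha, 1 - \alpha)} \bigl[\mathbf{E}\{R(C_n^{A_1})\} - R(C^{\mathrm{Bayes}})\bigr].$$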
where we write to emphasise the dependence on the voting threshold . In this case, by definition of and then applying Theorem 2,
3 Choosing good random projections
In this section, we study a special case of the generic random projection ensemble classifier introduced in Section 2, where we propose a screening method for choosing the random projections. Let $R_n^A$ be an estimator of $R(C_n^A)$, based on $\{(z_1^A, y_1^A), \ldots, (z_n^A, y_n^A)\}$, that takes values in the set $\{0, 1/n, \ldots, 1\}$. Examples of such estimators include the training error and leave-one-out estimator; we discuss these choices in greater detail in Section 4. For $B_1, B_2 \in \mathbb{N}$, let $\{A_{b_1, b_2} : b_1 = 1, \ldots, B_1, \ b_2 = 1, \ldots, B_2\}$ denote independent projections, independent of $(X, Y)$, distributed according to Haar measure on $\mathcal{A}$. One way to simulate from Haar measure on the set $\mathcal{A}$ is to first generate a matrix $Q \in \mathbb{R}^{d \times p}$, where each entry is drawn independently from a standard normal distribution, and then take $A^T$ to be the matrix of left singular vectors in the singular value decomposition of $Q^T$ (see, for example, Chikuse, 2003, Theorem 1.5.4). For $b_1 = 1, \ldots, B_1$, let
$$b_2^*(b_1) := \operatorname*{sargmin}_{b_2 \in \{1, \ldots, B_2\}} R_n^{A_{b_1, b_2}},$$
where $\operatorname{sargmin}$ denotes the smallest index where the minimum is attained in the case of a tie. We now set $A_{b_1} := A_{b_1, b_2^*(b_1)}$, and consider the random projection ensemble classifier from Section 2 constructed using the independent projections $A_1, \ldots, A_{B_1}$.
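Let
$$R_n^* := \min_{A \in \mathcal{A}} R_n^A$$
denote the optimal test error estimate over all projections. The minimum is attained here, since $R_n^A$ takes only finitely many values. We assume the following:
Assumption 2. There exists $\beta \in (0, 1]$ such that
$$\mathbf{P}\bigl(R_n^{A_{1,1}} \leq R_n^* + |\epsilon_n|\bigr) \geq \beta,$$
where $\epsilon_n = \epsilon_n^{(B_2)} := \mathbf{E}\{R(C_n^{A_1}) - R_n^{A_1}\}$.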
The quantity $\epsilon_n$, which depends on $B_2$ because $A_1$ is selected from $B_2$ independent random projections, can be interpreted as a measure of overfitting. Assumption 2 asks that there is a positive probability that $R_n^{A_{1,1}}$ is within $|\epsilon_n|$ of its minimum value $R_n^*$. The intuition here is that spending more computational time choosing a projection by increasing $B_2$ is potentially futile: one may find a projection with a lower error estimate, but the chosen projection will not necessarily result in a classifier with a lower test error. Under this condition, the following result controls the test excess risk of our random projection ensemble classifier in terms of the test excess risk of a classifier based on $d$-dimensional data, as well as a term that reflects our ability to estimate the test error of classifiers based on projected data and a term that depends on the number of projections.
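Theorem 3. Assume assumption 2. Then, for each $B_1, B_2 \in \mathbb{N}$, and every $A \in \mathcal{A}$,
$$\mathbf{E}\{R(C_n^{\mathrm{RP}})\} - R(C^{\mathrm{Bayes}}) \leq \frac{R(C_n^A) - R(C^{\mathrm{Bayes}})}{\min(\alpha, 1 - \alpha)} + \frac{2|\epsilon_n| - \epsilon_n^A}{\min(\alpha, 1 - \alpha)} + \frac{(1 - \beta)^{B_2}}{\min(\alpha, 1 - \alpha)},$$
where $\epsilon_n^A := R(C_n^A) - R_n^A$.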
Regarding the bound in Theorem 3 as a sum of three terms, we see that the final one can be seen as the price we have to pay for the fact that we do not have access to an infinite sample of random projections. This term can be made negligible by choosing $B_2$ to be sufficiently large, though the value of $B_2$ required to ensure it is below a prescribed level may depend on the training data. It should also be noted that $\epsilon_n$ in the second term may increase with $B_2$, which reflects the fact mentioned previously that this quantity is a measure of overfitting. The behaviour of the first two terms depends on the choice of base classifier, and our aim is to show that under certain conditions, these terms can be bounded (in expectation over the training data) by expressions that do not depend on $p$.
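The screening procedure described at the start of this section (draw projections from Haar measure, then retain the best projection in each group) can be sketched as follows. This is a minimal illustration assuming NumPy, not the RPEnsemble implementation. The names `haar_projection`, `select_projections` and `centroid_training_error` are hypothetical, and the nearest-centroid training error merely stands in for whichever test error estimate is paired with the chosen base classifier.

```python
import numpy as np

def haar_projection(d, p, rng):
    """d x p projection with orthonormal rows: transpose of the left singular
    vectors of a p x d matrix with independent standard normal entries."""
    q = rng.standard_normal((p, d))
    u, _, _ = np.linalg.svd(q, full_matrices=False)
    return u.T

def centroid_training_error(z, y):
    """Training error of a nearest-centroid rule on projected data
    (hypothetical stand-in for the test error estimate)."""
    m0, m1 = z[y == 0].mean(axis=0), z[y == 1].mean(axis=0)
    closer_to_1 = np.sum((z - m1) ** 2, axis=1) < np.sum((z - m0) ** 2, axis=1)
    return np.mean(closer_to_1.astype(int) != y)

def select_projections(x, y, d, b1, b2, error_estimate, rng):
    """For each of b1 groups, draw b2 projections and keep the one with the
    smallest estimated error; np.argmin returns the first index on ties."""
    p = x.shape[1]
    selected = []
    for _ in range(b1):
        group = [haar_projection(d, p, rng) for _ in range(b2)]
        errors = [error_estimate(x @ a.T, y) for a in group]
        selected.append(group[int(np.argmin(errors))])
    return selected

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
x = rng.standard_normal((40, 30))
x[y == 1, 0] += 3.0                    # classes differ in the first coordinate
selected = select_projections(x, y, d=2, b1=3, b2=4,
                              error_estimate=centroid_training_error, rng=rng)
print(len(selected), selected[0].shape)
```

Increasing `b2` makes it more likely that a group contains a projection with a small error estimate, at the cost of the overfitting effect captured by $\epsilon_n$.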
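To this end, define the regression function on $\mathbb{R}^d$ induced by the projection $A \in \mathcal{A}$ to be $\eta^A(z) := \mathbb{P}(Y = 1 \mid A X = z)$. The corresponding induced Bayes classifier, which is the optimal classifier knowing only the distribution of $(A X, Y)$, is given by
$$C^{A\text{-Bayes}}(z) := \begin{cases} 1 & \text{if } \eta^A(z) \geq 1/2; \\ 0 & \text{otherwise.} \end{cases}$$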
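In order to give a condition under which there exists a projection $A \in \mathcal{A}$ for which $R(C_n^A)$ is close to the Bayes risk, we will invoke an additional assumption on the form of the Bayes classifier:
Assumption 3. There exists a projection $A^* \in \mathcal{A}$ such that
$$P_X\bigl(\{x \in \mathbb{R}^p : \eta(x) \geq 1/2\} \,\triangle\, \{x \in \mathbb{R}^p : \eta^{A^*}(A^* x) \geq 1/2\}\bigr) = 0,$$
where $B \,\triangle\, C := (B \cap C^c) \cup (B^c \cap C)$ denotes the symmetric difference of two sets $B$ and $C$.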
Assumption 3 requires that the set of points $x \in \mathbb{R}^p$ assigned by the Bayes classifier to class 1 can be expressed as a function of a $d$-dimensional projection of $x$. Note that if the Bayes decision boundary is a hyperplane, then assumption 3 holds with $d = 1$. Moreover, Proposition 1 below shows that, in fact, assumption 3 holds under the sufficient dimension reduction condition, which states that $Y$ is conditionally independent of $X$ given $A^* X$; see Cook (1998) for many statistical settings where such an assumption is natural.
Proposition 1. If $Y$ is conditionally independent of $X$ given $A^* X$, then assumption 3 holds.
The following result confirms that under assumption 3, and for a sensible choice of base classifier, we can hope for $R(C_n^{A^*})$ to be close to the Bayes risk.
Proposition 2. Assume assumption 3. Then $R(C^{A^*\text{-Bayes}}) = R(C^{\mathrm{Bayes}})$.
We are therefore now in a position to study the first two terms in the bound in Theorem 3 in more detail for specific choices of base classifier.
4 Possible choices of the base classifier
In this section, we change our previous perspective and regard the training data $\mathcal{T}_n := \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ as independent random pairs with distribution $P$, so our earlier statements are interpreted conditionally on $\mathcal{T}_n$. For $A \in \mathcal{A}$, we write our projected data as $\mathcal{T}_n^A := \{(Z_1^A, Y_1^A), \ldots, (Z_n^A, Y_n^A)\}$, where $Z_i^A := A X_i$ and $Y_i^A := Y_i$. We also write $\mathbb{P}$ and $\mathbb{E}$ to refer to probabilities and expectations over all random quantities. We consider particular choices of base classifier, and study the first two terms in the bound in Theorem 3.
4.1 Linear Discriminant Analysis
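Linear Discriminant Analysis (LDA), introduced by Fisher (1936), is arguably the simplest classification technique. Recall that in the special case where $X \mid Y = r \sim N_p(\mu_r, \Sigma)$, we have
$$\mathrm{sgn}\{\eta(x) - 1/2\} = \mathrm{sgn}\Bigl\{\log\frac{\pi_1}{\pi_0} + \Bigl(x - \frac{\mu_1 + \mu_0}{2}\Bigr)^T \Sigma^{-1} (\mu_1 - \mu_0)\Bigr\},$$
so assumption 3 holds with $d = 1$ and $A^* = (\mu_1 - \mu_0)^T \Sigma^{-1} / \|\Sigma^{-1}(\mu_1 - \mu_0)\|$, a $1 \times p$ matrix. In LDA, $\pi_r$, $\mu_r$ and $\Sigma$ are estimated by their sample versions, using a pooled estimate of $\Sigma$. Although LDA cannot be applied directly when $p \geq n$ since the sample covariance matrix is singular, we can still use it as the base classifier for a random projection ensemble, provided that $d < n$. Indeed, noting that for any $A \in \mathcal{A}$, we have $A X \mid Y = r \sim N_d(\mu_r^A, \Sigma^A)$, where $\mu_r^A := A \mu_r$ and $\Sigma^A := A \Sigma A^T$, we can define
$$C_n^A(x) = C_n^{A\text{-LDA}}(x) := \begin{cases} 1 & \text{if } \log\dfrac{\hat{\pi}_1}{\hat{\pi}_0} + \Bigl(A x - \dfrac{\hat{\mu}_1^A + \hat{\mu}_0^A}{2}\Bigr)^T \hat{\Omega}^A (\hat{\mu}_1^A - \hat{\mu}_0^A) \geq 0; \\ 0 & \text{otherwise.} \end{cases}$$
Here, $\hat{\pi}_r := n_r / n$, where $n_r := \sum_{i=1}^n \mathbb{1}_{\{Y_i = r\}}$, $\hat{\mu}_r^A := n_r^{-1} \sum_{i=1}^n A X_i \mathbb{1}_{\{Y_i = r\}}$,
$$\hat{\Sigma}^A := \frac{1}{n - 2} \sum_{i=1}^n \sum_{r=0}^1 (A X_i - \hat{\mu}_r^A)(A X_i - \hat{\mu}_r^A)^T \mathbb{1}_{\{Y_i = r\}}$$
and $\hat{\Omega}^A := (\hat{\Sigma}^A)^{-1}$.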
Write for the standard normal distribution function. Under the normal model specified above, the test error of the LDA classifier can be written as
where and .
Efron (1975) studied the excess risk of the LDA classifier in an asymptotic regime in which is fixed as diverges. Specialising his results for simplicity to the case where , he showed that using the LDA classifier ( ‣ 4.1) with yields
as , where .
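It remains to control the errors $\epsilon_n$ and $\epsilon_n^{A^*}$ in Theorem 3. For the LDA classifier, we consider the training error estimator
$$R_n^A := \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{C_n^{A\text{-LDA}}(X_i) \neq Y_i\}}.$$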
Devroye and Wagner (1976) provided a Vapnik–Chervonenkis bound for under no assumptions on the underlying data generating mechanism: for every and ,
see also Devroye et al. (1996, Theorem 23.1). We can then conclude that
The more difficult term to deal with is
In this case, the bound ( ‣ 4.1) cannot be applied directly, because are no longer independent conditional on ; indeed is selected from so as to minimise an estimate of test error, which depends on the training data. Nevertheless, since are independent of , we still have that
We can therefore conclude by almost the same argument as that leading to (4.1) that
Note that none of the bounds ( ‣ 4.1), (4.1) and ( ‣ 4.1) depend on the original data dimension . Moreover, ( ‣ 4.1), together with Theorem 3, reveals a trade-off in the choice of when using LDA as the base classifier. Choosing to be large gives us a good chance of finding a projection with a small estimate of test error, but we may incur a small overfitting penalty as reflected by ( ‣ 4.1).
Finally, we remark that an alternative method of fitting linear classifiers is via empirical risk minimisation. In this context, Durrant and Kabán (2013, Theorem 3.1) give high probability bounds on the test error of a linear empirical risk minimisation classifier based on a single random projection, where the bounds depend on what those authors refer to as the ‘flipping probability’, namely the probability that the class assignment of a point based on the projected data differs from the assignment in the ambient space. In principle, these bounds could be combined with our Theorem 2, though the resulting expressions would depend on probabilistic bounds on flipping probabilities.
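For concreteness, the projected LDA base classifier and the training error estimator used in this subsection can be sketched as below. This is an illustrative implementation under stated assumptions: it uses NumPy, labels in $\{0, 1\}$, a pooled covariance estimate with divisor $n - 2$, and hypothetical function names. The projection in the usage example simply keeps the first $d$ coordinates, which is a valid projection but is chosen here only so that the class signal is retained.

```python
import numpy as np

def fit_projected_lda(x, y, a):
    """Fit LDA on the projected data {A x_i}; return a classifier on R^p."""
    z = x @ a.T
    n = len(y)
    z0, z1 = z[y == 0], z[y == 1]
    pi0, pi1 = len(z0) / n, len(z1) / n
    mu0, mu1 = z0.mean(axis=0), z1.mean(axis=0)
    # Pooled covariance estimate in the projected space (divisor n - 2).
    pooled = ((z0 - mu0).T @ (z0 - mu0) + (z1 - mu1).T @ (z1 - mu1)) / (n - 2)
    w = np.linalg.inv(pooled) @ (mu1 - mu0)
    c = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ w

    def classify(x_new):
        # Class 1 when the linear discriminant is non-negative.
        return ((x_new @ a.T) @ w + c >= 0).astype(int)

    return classify

def training_error(classify, x, y):
    """Proportion of training points that the fitted classifier misclassifies."""
    return np.mean(classify(x) != y)

rng = np.random.default_rng(0)
n, p, d = 40, 100, 5                   # p > n, so LDA fails in the original space
y = np.repeat([0, 1], 20)
x = rng.standard_normal((n, p))
x[y == 1, 0] += 10.0                   # class means differ in the first coordinate
a = np.eye(p)[:d]                      # projection onto the first d coordinates
classify = fit_projected_lda(x, y, a)
err = training_error(classify, x, y)
print(err)
```

The fit only ever inverts a $d \times d$ matrix, which is why the construction remains well defined when $p \geq n$.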
4.2 Quadratic Discriminant Analysis
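Quadratic Discriminant Analysis (QDA) is designed to handle situations where the class-conditional covariance matrices are unequal. Recall that when $X \mid Y = r \sim N_p(\mu_r, \Sigma_r)$, and $\pi_r := \mathbb{P}(Y = r)$, for $r = 0, 1$, the Bayes decision boundary is given by $\{x \in \mathbb{R}^p : \Delta(x; \pi_0, \mu_0, \mu_1, \Sigma_0, \Sigma_1) = 0\}$, where
$$\Delta(x; \pi_0, \mu_0, \mu_1, \Sigma_0, \Sigma_1) := \log\frac{\pi_1}{\pi_0} - \frac{1}{2} \log\frac{\det \Sigma_1}{\det \Sigma_0} - \frac{1}{2} x^T (\Sigma_1^{-1} - \Sigma_0^{-1}) x + x^T (\Sigma_1^{-1} \mu_1 - \Sigma_0^{-1} \mu_0) - \frac{1}{2} \mu_1^T \Sigma_1^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma_0^{-1} \mu_0.$$
In QDA, $\pi_r$, $\mu_r$ and $\Sigma_r$ are estimated by their sample versions. If $p \geq \min(n_0, n_1)$, where we recall that $n_r := \sum_{i=1}^n \mathbb{1}_{\{Y_i = r\}}$, then at least one of the sample covariance matrix estimates is singular, and QDA cannot be used directly. Nevertheless, we can still choose $d < \min(n_0, n_1)$ and use QDA as the base classifier in a random projection ensemble. Specifically, we can set
$$C_n^A(x) = C_n^{A\text{-QDA}}(x) := \begin{cases} 1 & \text{if } \Delta(A x; \hat{\pi}_0, \hat{\mu}_0^A, \hat{\mu}_1^A, \hat{\Sigma}_0^A, \hat{\Sigma}_1^A) \geq 0; \\ 0 & \text{otherwise,} \end{cases}$$
where $\hat{\pi}_r$ and $\hat{\mu}_r^A$ were defined in Section 4.1, and where
$$\hat{\Sigma}_r^A := \frac{1}{n_r - 1} \sum_{i : Y_i = r} (A X_i - \hat{\mu}_r^A)(A X_i - \hat{\mu}_r^A)^T$$
for $r = 0, 1$. Unfortunately, analogous theory to that presented in Section 4.1 does not appear to exist for the QDA classifier; unlike for LDA, the risk does not have a closed form (note that $\Sigma_1^{-1} - \Sigma_0^{-1}$ is non-definite in general). Nevertheless, we found in our simulations presented in Section 6 that the QDA random projection ensemble classifier can perform very well in practice. In this case, we estimate the test errors using the leave-one-out method given by
$$R_n^A := \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{C_{n,i}^A(X_i) \neq Y_i\}},$$
where $C_{n,i}^A$ denotes the classifier $C_n^A$, trained without the $i$th pair, i.e. based on $\mathcal{T}_n^A \setminus \{(Z_i^A, Y_i^A)\}$. For a method like QDA that involves estimating more parameters than LDA, we found that the leave-one-out estimator was less susceptible to overfitting than the training error estimator.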
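The leave-one-out estimator can be sketched generically as follows, here paired with a simple QDA fit on already-projected data. This is a minimal illustration assuming NumPy and labels in $\{0, 1\}$; the names `fit_qda` and `leave_one_out_error` are hypothetical, and the refit-from-scratch loop favours clarity over speed.

```python
import numpy as np

def fit_qda(z, y):
    """Fit QDA on projected data z (n x d) with 0/1 labels y."""
    params = []
    for r in (0, 1):
        zr = z[y == r]
        mu = zr.mean(axis=0)
        cov = np.atleast_2d(np.cov(zr, rowvar=False))   # divisor n_r - 1
        params.append((len(zr) / len(y), mu,
                       np.linalg.inv(cov), np.linalg.slogdet(cov)[1]))

    def score(z_new, r):
        # Log prior plus Gaussian log density, up to a constant shared by both classes.
        pi, mu, prec, logdet = params[r]
        diff = z_new - mu
        quad = np.einsum("ij,jk,ik->i", diff, prec, diff)
        return np.log(pi) - 0.5 * logdet - 0.5 * quad

    def classify(z_new):
        return (score(z_new, 1) >= score(z_new, 0)).astype(int)

    return classify

def leave_one_out_error(fit, z, y):
    """Refit without the i-th pair, classify the held-out point, average the mistakes."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        classify = fit(z[keep], y[keep])
        mistakes += int(classify(z[i:i + 1])[0] != y[i])
    return mistakes / n

rng = np.random.default_rng(0)
z0 = rng.standard_normal((20, 2))                 # class 0: unit variance
z1 = 8.0 + 2.0 * rng.standard_normal((20, 2))     # class 1: shifted, larger variance
z = np.vstack([z0, z1])
y = np.repeat([0, 1], 20)
err = leave_one_out_error(fit_qda, z, y)
print(err)
```

Because each held-out point is excluded from the fit that classifies it, the estimate takes values in $\{0, 1/n, \ldots, 1\}$ and is less optimistic than the training error.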
4.3 The $k$-nearest neighbour classifier
The $k$-nearest neighbour classifier ($k$nn), first proposed by Fix and Hodges (1951), is a nonparametric method that classifies the test point $x$ according to a majority vote over the classes of the $k$ nearest training data points to $x$. The enormous popularity of the $k$nn classifier can be attributed partly to its simplicity and intuitive appeal; however, it also has the attractive property of being universally consistent: for every fixed distribution $P$, as long as $k = k_n \to \infty$ and $k_n / n \to 0$, the risk of the $k$nn classifier converges to the Bayes risk (Devroye et al., 1996, Theorem 6.4).
Hall, Park and Samworth (2008) studied the rate of convergence of the excess risk of the $k$-nearest neighbour classifier under regularity conditions that require, inter alia, that $p$ is fixed and that the class-conditional densities have two continuous derivatives in a neighbourhood of the $(p-1)$-dimensional manifold on which they cross. In such settings, the optimal choice of $k$, in terms of minimising the excess risk, is $O(n^{4/(p+4)})$, and the rate of convergence of the excess risk with this choice is $O(n^{-4/(p+4)})$. Thus, in common with other nonparametric methods, there is a ‘curse of dimensionality’ effect that means the classifier typically performs poorly in high dimensions. Samworth (2012) found the optimal way of assigning decreasing weights to increasingly distant neighbours, and quantified the asymptotic improvement in risk over the unweighted version, but the rate of convergence remains the same.
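We can use the $k$nn classifier as the base classifier for a random projection ensemble as follows: given $z \in \mathbb{R}^d$, let $(Z_{(1)}^A, Y_{(1)}^A), \ldots, (Z_{(n)}^A, Y_{(n)}^A)$ be a re-ordering of the training data such that $\|Z_{(1)}^A - z\| \leq \cdots \leq \|Z_{(n)}^A - z\|$, with ties split at random. Now define
$$C_n^A(x) = C_n^{A\text{-}k\mathrm{nn}}(x) := \begin{cases} 1 & \text{if } S_n^A(A x) \geq 1/2; \\ 0 & \text{otherwise,} \end{cases}$$
where $S_n^A(z) := k^{-1} \sum_{i=1}^k \mathbb{1}_{\{Y_{(i)}^A = 1\}}$. The theory described in the previous paragraph can be applied to show that, under regularity conditions, $\mathbb{E}\{R(C_n^{A^*})\} - R(C^{\mathrm{Bayes}}) = O(n^{-4/(d+4)})$.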
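A minimal sketch of this projected $k$nn base classifier is given below. It assumes NumPy and labels in $\{0, 1\}$, and the function name is hypothetical. As a simplification, distance ties are broken by index order through a stable sort rather than at random.

```python
import numpy as np

def projected_knn_classify(x_train, y_train, a, x_new, k):
    """k-nearest-neighbour vote computed in the projected space R^d."""
    z_train = x_train @ a.T
    z_new = x_new @ a.T
    # Distances between each new point and each training point, shape (m, n).
    dists = np.linalg.norm(z_new[:, None, :] - z_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    s = y_train[nearest].mean(axis=1)      # fraction of class-1 neighbours
    return (s >= 0.5).astype(int)

# Toy example with p = 3 and d = 1: only the first coordinate carries signal.
x_train = np.array([[0.0, 5.0, 1.0], [1.0, -3.0, 2.0], [-1.0, 7.0, 0.0],
                    [10.0, 4.0, -2.0], [11.0, -6.0, 3.0], [9.0, 0.0, 1.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
a = np.array([[1.0, 0.0, 0.0]])            # projection onto the first coordinate
x_new = np.array([[0.1, 100.0, -100.0], [9.9, -50.0, 50.0]])
print(projected_knn_classify(x_train, y_train, a, x_new, k=3))
```

In this toy example the large values in the second and third coordinates of the new points are discarded by the projection, so the neighbours are determined by the informative coordinate alone.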
Once again, a natural estimate of the test error in this case is the leave-one-out estimator defined in ( ‣ 4.2), where we use the same value of on the leave-one-out datasets as for the base classifier, and where distance ties are split in the same way as for the base classifier. For this estimator, Devroye and Wagner (1979, Theorem 4) showed that for every ,
see also Devroye et al. (1996, Chapter 24). It follows that
Arguing very similarly to Section 4.1, we can deduce that under no conditions on the data generating mechanism, and choosing ,
We have therefore again bounded the expectations of the first two terms on the right-hand side of ( ‣ 3) by quantities that do not depend on .
4.4 A general strategy using sample splitting
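In Sections 4.1, 4.2 and 4.3, we focused on specific choices of the base classifier to derive bounds on the expected values of the first two terms in the bound in Theorem 3. The aim of this section, on the other hand, is to provide similar guarantees that can be applied for any choice of base classifier in conjunction with a sample splitting strategy. The idea is to split the sample $\mathcal{T}_n$ into $\mathcal{T}_{n,1}$ and $\mathcal{T}_{n,2}$, say, where $|\mathcal{T}_{n,1}| =: n^{(1)}$ and $|\mathcal{T}_{n,2}| =: n^{(2)}$. To estimate the test error of $C_{n^{(1)}}^A$, the projected data base classifier trained on $\mathcal{T}_{n,1}^A := \{(Z_i^A, Y_i^A) : (X_i, Y_i) \in \mathcal{T}_{n,1}\}$, we use
$$R_{n^{(1)}, n^{(2)}}^A := \frac{1}{n^{(2)}} \sum_{(X_i, Y_i) \in \mathcal{T}_{n,2}} \mathbb{1}_{\{C_{n^{(1)}}^A(X_i) \neq Y_i\}};$$
in other words, we construct the classifier based on the projected data from $\mathcal{T}_{n,1}$, and count the proportion of points in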