Theoretical analysis of cross-validation for estimating the risk of the $k$-Nearest Neighbor classifier
The present work aims at deriving theoretical guarantees on the behavior of some cross-validation procedures applied to the $k$-nearest neighbors (kNN) rule in the context of binary classification. Here we focus on the leave-$p$-out cross-validation (LpO) used to assess the performance of the kNN classifier. Remarkably, this LpO estimator can be efficiently computed in this context using closed-form formulas derived by Celisse and Mary-Huard (2011).
We describe a general strategy to derive moment and exponential concentration inequalities for the LpO estimator applied to the kNN classifier. Such results are obtained first by exploiting the connection between the LpO estimator and U-statistics, and second by making intensive use of the generalized Efron-Stein inequality applied to the L1O estimator. Another important contribution is a set of new quantifications of the discrepancy between the LpO estimator and the classification error/risk of the kNN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
Keywords: Classification, Cross-validation, Risk estimation
The $k$-nearest neighbor (kNN) algorithm (Fix and Hodges, 1951) in binary classification is a popular prediction algorithm based on the idea that the predicted value at a new point is obtained by a majority vote among the $k$ nearest labeled neighbors of this point. Although quite simple, the kNN classifier has been successfully applied to many difficult classification tasks (Li et al., 2004; Simard et al., 1998; Scheirer and Slaney, 2003). Efficient implementations have also been developed to deal with large datasets (Indyk and Motwani, 1998; Andoni and Indyk, 2006).
The theoretical performance of the kNN classifier has already been extensively investigated. In the context of binary classification, preliminary theoretical results date back to Cover and Hart (1967); Cover (1968); Györfi (1981). More recently, Psaltis et al. (1994); Kulkarni and Posner (1995) derived an asymptotic equivalent to the performance of the 1NN classification rule, further extended to kNN by Snapp and Venkatesh (1998). Hall et al. (2008) also derived asymptotic expansions of the risk of the kNN classifier assuming either a Poisson or a binomial model for the training points, which relates this risk to the parameter $k$. In contrast to the aforementioned results, the work by Chaudhuri and Dasgupta (2014) focuses on the finite-sample framework. They typically provide upper bounds with high probability on the risk of the kNN classifier, where the bounds are not distribution-free. Alternatively, in the regression setting, Kulkarni and Posner (1995) provide a finite-sample bound on the performance of 1NN that has been further generalized to the kNN rule ($k \geq 1$) by Biau et al. (2010a), where a bagged version of the kNN rule is also analyzed and then applied to functional data (Biau et al., 2010b). We refer interested readers to Biau and Devroye (2016) for an almost thorough presentation of known results on the kNN algorithm in various contexts.
In numerous (if not all) practical applications, computing the cross-validation (CV) estimator (Stone, 1974, 1982) has been among the most popular strategies to evaluate the performance of the kNN classifier (Devroye et al., 1996, Section 24.3). All CV procedures share a common principle, which consists in splitting a sample of $n$ points into two disjoint subsets called training and test sets, with respective cardinalities $n-p$ and $p$, for any $1 \leq p \leq n-1$. The training set data serve to compute a classifier, while its performance is evaluated from the left-out data of the test set. For a complete and comprehensive review of cross-validation procedures, we refer the interested reader to Arlot and Celisse (2010).
In the present work, we focus on the leave-$p$-out (LpO) cross-validation. Among CV procedures, it belongs to exhaustive strategies since it considers (and averages over) all the $\binom{n}{p}$ possible such splittings of $\{1, \ldots, n\}$ into training and test sets. Usually the induced computation time of the LpO is prohibitive, which gives rise to its surrogate called $V$-fold cross-validation (V-FCV) with $V \approx n/p$ (Geisser, 1975). However, Steele (2009); Celisse and Mary-Huard (2011) recently derived closed-form formulas respectively for the bootstrap and the LpO procedures applied to the kNN classification rule. Such formulas allow one to efficiently compute the LpO estimator. Moreover, since the V-FCV estimator suffers a larger variance than the LpO one (Celisse and Robin, 2008; Arlot and Celisse, 2010), LpO (with $p \approx n/V$) strictly improves upon V-FCV in the present context.
Although favored in practice for assessing the risk of the kNN classifier, the use of CV comes with very few theoretical guarantees regarding its performance. Moreover, probably for technical reasons, most existing results apply to Hold-out and leave-one-out (L1O), that is LpO with $p = 1$ (Kearns and Ron, 1999). In this paper we rather consider the general LpO procedure (for $1 \leq p \leq n-1$) used to estimate the risk (alternatively the classification error rate) of the kNN classifier. Our main purpose is then to provide distribution-free theoretical guarantees on the behavior of LpO with respect to influential parameters such as $p$, $n$, and $k$. For instance we aim at answering questions such as: “Is there any regime of $p = p(n)$ (with $p(n)$ some function of $n$) where the LpO estimator is a consistent estimate of the risk of the kNN classifier?”, or “Is it possible to describe the convergence rate of the LpO estimator with respect to $p/n$?”
The main contribution of the present work is two-fold: we describe a new general strategy to derive moment and exponential concentration inequalities for the LpO estimator applied to the kNN binary classifier, and these inequalities serve to derive the convergence rate of the LpO estimator towards the risk of the kNN classifier.
This new strategy relies on several steps. First, exploiting the connection between the LpO estimator and U-statistics (Koroljuk and Borovskich, 1994) and the Rosenthal inequality (Ibragimov and Sharakhmetov, 2002), we prove that upper bounding the polynomial moments of the centered LpO estimator reduces to deriving such bounds for the simpler L1O estimator. Second, we derive new upper bounds on the moments of the L1O estimator using the generalized Efron-Stein inequality (Boucheron et al., 2005, 2013, Theorem 15.5). Third, combining the two previous steps provides some insight on the interplay between $p/n$ and $k$ in the concentration rates measured in terms of moments. This finally results in new exponential concentration inequalities for the LpO estimator that apply whatever the value of the ratio $p/n \in (0,1)$. In particular, while the upper bounds increase with $1 \leq p \leq n/2+1$, it is no longer the case if $p > n/2+1$. We also provide several lower bounds suggesting our upper bounds cannot be improved in some sense in a distribution-free setting.
The remainder of the paper is organized as follows. The connection between the LpO estimator and U-statistics is clarified in Section 2, where we also recall the closed-form formula of the LpO estimator (Celisse and Mary-Huard, 2011) applied to the kNN classifier. Order-$q$ moments ($q \geq 2$) of the LpO estimator are then upper bounded in terms of those of the L1O estimator. This step can be applied to any classification algorithm. Section 3 then specifies the previous upper bounds in the case of the kNN classifier, which leads to the main Theorem 3.2 characterizing the concentration behavior of the LpO estimator with respect to $p$, $n$, and $k$ in terms of polynomial moments. Deriving exponential concentration inequalities for the LpO estimator is the main concern of Section 4, where we highlight the strength of our strategy by comparing our main inequalities with concentration inequalities derived with less sophisticated tools. Finally, Section 5 exploits the previous results to bound the gap between the LpO estimator and the classification error of the kNN classifier. The optimality of these upper bounds is first proved in our distribution-free framework by establishing several new lower bounds matching the upper ones in some specific settings. Second, empirical experiments are also reported which support the above conclusions.
2 U-statistics and LpO estimator
2.1 Statistical framework
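We tackle the binary classification problem where the goal is to predict the unknown label $Y \in \{0,1\}$ of an observation $X \in \mathcal{X} \subset \mathbb{R}^d$. The random variable $(X,Y)$ has an unknown joint distribution $P_{(X,Y)}$ defined by $P_{(X,Y)}(B) = \mathbb{P}[(X,Y) \in B]$ for any Borelian set $B \subset \mathcal{X} \times \{0,1\}$, where $\mathbb{P}$ denotes a probability distribution. In what follows no particular distributional assumption is made regarding $X$. To predict the label, one aims at building a classifier $\hat{f}: \mathcal{X} \to \{0,1\}$ on the basis of a set of random variables $\mathcal{D}_n = \{Z_1, \ldots, Z_n\}$ called the training sample, where $Z_i = (X_i, Y_i)$, $1 \leq i \leq n$, represent $n$ copies of $(X,Y)$ drawn independently from $P_{(X,Y)}$. In settings where no confusion is possible, we will replace $\mathcal{D}_n$ by $\mathcal{D}$.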
Any strategy to build such a classifier is called a classification algorithm or classification rule, and can be formally defined as a function $\mathcal{A}: \cup_{n \geq 1} \{\mathcal{X} \times \{0,1\}\}^n \to \mathcal{F}$ that maps a training sample $\mathcal{D}_n$ onto the corresponding classifier $\mathcal{A}(\mathcal{D}_n; \cdot) = \hat{f} \in \mathcal{F}$, where $\mathcal{F}$ is the set of all measurable functions from $\mathcal{X}$ to $\{0,1\}$. Numerous classification rules have been considered in the literature and it is beyond the scope of the present paper to review all of them (see Devroye et al. (1996) for many instances). Here we focus on the $k$-nearest neighbor rule (kNN) initially proposed by Fix and Hodges (1951) and further studied for instance by Devroye and Wagner (1977); Rogers and Wagner (1978).
The kNN algorithm
For $1 \leq k \leq n$, the kNN rule, denoted by $\mathcal{A}_k$, consists in classifying any new observation $x$ using a majority vote decision rule based on the labels of the $k$ points $X_{(1)}(x), \ldots, X_{(k)}(x)$ closest to $x$ among the training sample $X_1, \ldots, X_n$. In what follows these $k$ nearest neighbors are chosen according to the distance associated with the usual Euclidean norm in $\mathbb{R}^d$. Note that other adaptive metrics have also been considered in the literature (see for instance Hastie et al., 2001, Chap. 14). Such examples are beyond the scope of the present work; that is, our reference distance does not depend on the training sample at hand. Let us also emphasize that possible ties are broken by using the smallest index among ties, which is one possible choice for the Stone lemma to hold true (Biau and Devroye, 2016, Lemma 10.6, p.125).
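Formally, given $V_k(x) = \{1 \leq i \leq n : X_i \in \{X_{(1)}(x), \ldots, X_{(k)}(x)\}\}$, the set of indices of the $k$ nearest neighbors of $x$ among $X_1, \ldots, X_n$, the kNN classification rule is defined by
$$\mathcal{A}_k(\mathcal{D}_n; x) = \begin{cases} 1 & \text{if } \frac{1}{k}\sum_{i \in V_k(x)} Y_i = \frac{1}{k}\sum_{i=1}^{k} Y_{(i)}(x) > 0.5, \\ 0 & \text{if } \frac{1}{k}\sum_{i=1}^{k} Y_{(i)}(x) < 0.5, \\ \mathcal{B}(0.5) & \text{otherwise}, \end{cases} \tag{2.1}$$
where $Y_{(i)}(x)$ is the label of the $i$-th nearest neighbor of $x$ for $1 \leq i \leq k$, and $\mathcal{B}(0.5)$ denotes a Bernoulli random variable with parameter 1/2.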
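To make the decision rule concrete, here is a minimal Python sketch of the majority vote with the smallest-index convention for distance ties. The function name `knn_predict`, the list-of-tuples data layout and the optional `rng` argument are illustrative assumptions, not notation from the paper; the coin flip used for vote ties stands in for the Bernoulli variable with parameter 1/2.

```python
import math
import random


def knn_predict(train_x, train_y, x, k, rng=None):
    """Majority vote among the k training points closest to x (Euclidean distance).

    train_x: list of coordinate tuples, train_y: list of 0/1 labels.
    """
    # Sorting on (distance, index) breaks distance ties with the smallest index.
    order = sorted(range(len(train_x)),
                   key=lambda i: (math.dist(train_x[i], x), i))
    votes = sum(train_y[i] for i in order[:k])
    if 2 * votes > k:
        return 1
    if 2 * votes < k:
        return 0
    # Vote tie (only possible when k is even): fair coin flip.
    return (rng or random).randint(0, 1)
```

With an odd $k$ the vote cannot tie, so the rule is then a deterministic function of the training sample.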
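For a given sample $\mathcal{D}_n$, the performance of any classifier $\hat{f} = \hat{f}(\mathcal{D}_n; \cdot)$ (respectively of any classification algorithm $\mathcal{A}$) is assessed by the classification error $L(\hat{f})$ (respectively the risk $R(\hat{f})$) defined by
$$L(\hat{f}) = \mathbb{P}\big(\hat{f}(X) \neq Y \mid \mathcal{D}_n\big) \quad \text{and} \quad R(\hat{f}) = \mathbb{E}\big[\mathbb{P}\big(\hat{f}(X) \neq Y \mid \mathcal{D}_n\big)\big].$$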
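In this paper we focus on the estimation of $L(\hat{f})$ (and its expectation $R(\hat{f})$) by use of the Leave-$p$-Out (LpO) cross-validation for $1 \leq p \leq n-1$ (Zhang, 1993; Celisse and Robin, 2008). LpO successively considers all possible splits of $\mathcal{D}_n$ into a training set of cardinality $n-p$ and a test set of cardinality $p$. Denoting by $\mathcal{E}_{n-p}$ the set of all possible subsets of $\{1, \ldots, n\}$ with cardinality $n-p$, any $e \in \mathcal{E}_{n-p}$ defines a split of $\mathcal{D}_n$ into a training sample $\mathcal{D}^e = \{Z_i \mid i \in e\}$ and a test sample $\mathcal{D}^{\bar{e}}$, where $\bar{e} = \{1, \ldots, n\} \setminus e$. For a given classification algorithm $\mathcal{A}$, the final LpO estimator of the performance of $\mathcal{A}(\mathcal{D}_n; \cdot)$ is the average (over all possible splits) of the classification error estimated on each test set, that is
$$\hat{R}_p(\mathcal{A}, \mathcal{D}_n) = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_{n-p}} \left( \frac{1}{p} \sum_{i \in \bar{e}} \mathbb{1}_{\{\mathcal{A}(\mathcal{D}^e; X_i) \neq Y_i\}} \right), \tag{2.2}$$
where $\mathcal{A}(\mathcal{D}^e; \cdot)$ is the classifier built from $\mathcal{D}^e$. We refer the reader to Arlot and Celisse (2010) for a detailed description of LpO and other cross-validation procedures. In the sequel, the lengthy notation $\hat{R}_p(\mathcal{A}, \mathcal{D}_n)$ is replaced by $\hat{R}_{p,n}$ in settings where no confusion can arise about the algorithm $\mathcal{A}$ or the training sample $\mathcal{D}_n$, and by $\hat{R}_p(\mathcal{D}_n)$ if the training sample has to be kept in mind.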
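The definition of the leave-$p$-out estimator can be transcribed literally by enumerating every split, as in the brute-force Python sketch below. It is only meant for tiny samples, since the number of splits grows combinatorially, and it is not the closed-form computation of Celisse and Mary-Huard (2011). The names `knn_predict` and `lpo_estimate` are hypothetical, and $k$ is assumed odd so that no random vote tie occurs.

```python
from itertools import combinations
from math import comb, dist


def knn_predict(train_x, train_y, x, k):
    """kNN vote for odd k; distance ties go to the smallest index."""
    order = sorted(range(len(train_x)),
                   key=lambda i: (dist(train_x[i], x), i))
    return int(2 * sum(train_y[i] for i in order[:k]) > k)


def lpo_estimate(xs, ys, k, p):
    """Average test error over all splits with p held-out points."""
    n = len(xs)
    total = 0.0
    for test in combinations(range(n), p):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        errors = sum(knn_predict(train_x, train_y, xs[i], k) != ys[i]
                     for i in test)
        total += errors / p
    return total / comb(n, p)
```

For example, `lpo_estimate(xs, ys, k=1, p=1)` is the usual leave-one-out error of the 1NN rule.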
Exact LpO for the kNN classification algorithm
Usually, due to its seemingly prohibitive computational cost, LpO is not applied except with $p = 1$, where it reduces to the well-known leave-one-out. However, contrary to this widespread idea, Celisse and Robin (2008); Celisse (2008, 2014) proved that the LpO estimator can be efficiently computed by deriving closed-form formulas in several statistical frameworks. The kNN classification rule is another instance for which efficiently computing the LpO estimator is possible, with a time complexity linear in $p$, as previously established by Celisse and Mary-Huard (2011). Let us briefly recall the main steps leading to the closed-form formula.
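From Eq. (2.2) the LpO estimator can be expressed as a sum (over the $n$ observations of the complete sample) of probabilities:
$$\hat{R}_{p,n} = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_{n-p}} \frac{1}{p} \sum_{i \notin e} \mathbb{1}_{\{\mathcal{A}_k(\mathcal{D}^e; X_i) \neq Y_i\}} = \frac{1}{p} \sum_{i=1}^{n} \mathbb{P}_e\big[\mathcal{A}_k(\mathcal{D}^e; X_i) \neq Y_i \mid i \notin e\big]\, \mathbb{P}_e[i \notin e].$$
Here $\mathbb{P}_e$ means that the integration is computed with respect to the random variable $e \in \mathcal{E}_{n-p}$, which follows the uniform distribution over the $\binom{n}{p}$ possible subsets with cardinality $n-p$ in $\{1, \ldots, n\}$. For instance $\mathbb{P}_e[i \notin e] = p/n$ since it is the proportion of subsamples with cardinality $n-p$ which do not contain a given prescribed index $i$, which equals $\binom{n-1}{n-p} / \binom{n}{p}$. (See also Lemma D.4 for further examples of such calculations.)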
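The elementary probability used above, namely that a prescribed index is left out of a uniformly drawn training subset of cardinality $n-p$ with probability $p/n$, can be checked by exhaustive enumeration. The short Python sketch below does so with exact rational arithmetic; the function name `prob_index_left_out` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def prob_index_left_out(n, p, i=0):
    """Exact proportion of size-(n-p) subsets of {0,...,n-1} that miss index i."""
    subsets = list(combinations(range(n), n - p))
    missing = sum(1 for e in subsets if i not in e)
    return Fraction(missing, len(subsets))


# The enumeration agrees with both closed forms for a small example.
n, p = 6, 2
assert prob_index_left_out(n, p) == Fraction(p, n)
assert prob_index_left_out(n, p) == Fraction(comb(n - 1, n - p), comb(n, p))
```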
For any $X_i$, let $X_{(1)}, X_{(2)}, \ldots, X_{(n-1)}$ be the ordered sequence of neighbors of $X_i$. This list depends on $X_i$, i.e. $X_{(1)}$ should be noted $X_{(i,1)}$, but this dependency is skipped here for the sake of readability.
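The key in the derivation is to condition with respect to the random variable $R_k^i$ which denotes the rank (in the whole sample $\mathcal{D}_n$) of the $k$-th neighbor of $X_i$ in $\mathcal{D}^e$, that is $R_k^i = j$ means that $X_{(j)}$ is the $k$-th neighbor of $X_i$ in $\mathcal{D}^e$. Then
$$\mathbb{P}_e\big[\mathcal{A}_k(\mathcal{D}^e; X_i) \neq Y_i \mid i \notin e\big] = \sum_{j=k}^{k+p-1} \mathbb{P}_e\big[\mathcal{A}_k(\mathcal{D}^e; X_i) \neq Y_i \mid R_k^i = j,\ i \notin e\big]\, \mathbb{P}_e\big[R_k^i = j \mid i \notin e\big],$$
where the sum involves $p$ terms since only $X_{(k)}, \ldots, X_{(k+p-1)}$ are candidates for being the $k$-th neighbor of $X_i$ in at least one training subset $e$.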
Observe that the resulting probabilities can be easily computed (see Lemma D.4):
with , , and , where denotes the hypergeometric distribution and is the number of 1’s among the nearest neighbors of in .
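As an illustration of why these probabilities are of hypergeometric type, the Python sketch below computes the law of the rank (in the whole sample) of the $k$-th neighbor retained in a random training subset, once by enumeration and once by a direct counting argument: the neighbor of rank $j$ must be kept, together with exactly $k-1$ of the $j-1$ closer ones. This counting formula is our own elementary derivation, not the parametrization of Lemma D.4, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def rank_law_bruteforce(n, p, k):
    """Enumerate training subsets of the n-1 neighbor ranks (size n-p)."""
    subsets = list(combinations(range(1, n), n - p))
    counts = {}
    for e in subsets:
        j = e[k - 1]  # e is sorted, so its k-th entry is the k-th neighbor's rank
        counts[j] = counts.get(j, 0) + 1
    return {j: Fraction(c, len(subsets)) for j, c in counts.items()}


def rank_law_counting(n, p, k):
    """Keep rank j, k-1 of the j-1 closer ranks, and n-p-k of the farther ones."""
    total = comb(n - 1, n - p)
    return {j: Fraction(comb(j - 1, k - 1) * comb(n - 1 - j, n - p - k), total)
            for j in range(k, k + p)}
```

Both functions return a distribution supported on the $p$ ranks $k, \ldots, k+p-1$, in line with the number of terms in the sum above.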
The computational cost of LpO for the kNN classifier is the same as that of L1O for the $(k+p-1)$NN classifier whatever $p$, that is $O(pn)$. This contrasts with the usual prohibitive computational complexity seemingly suffered by LpO.
2.2 U-statistics: General bounds on LpO moments
The purpose of the present section is to describe a general strategy for deriving new upper bounds on the polynomial moments of the LpO estimator. As a first step of this strategy, we establish the connection between the LpO risk estimator and U-statistics. Second, we exploit this connection to derive new upper bounds on the order-$q$ moments of the LpO estimator for $q \geq 1$. Note that these upper bounds, which relate moments of the LpO estimator to those of the L1O estimator, hold true with any classifier.
Let us start by introducing U-statistics and recalling some of their basic properties that will serve our purposes. For a thorough presentation, we refer to the books by Serfling (1980); Koroljuk and Borovskich (1994). The first step is the definition of a U-statistic of order $m$ as an average over all $m$-tuples of distinct indices in $\{1, \ldots, n\}$.
Definition 2.1 (Koroljuk and Borovskich (1994)).
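Let $h: \mathcal{X}^m \to \mathbb{R}$ (or $\mathbb{R}^k$) denote any Borelian function where $m \geq 1$ is an integer. Let us further assume $h$ is a symmetric function of its arguments. Then any function $U_n: \mathcal{X}^n \to \mathbb{R}$ such that
$$U_n(x_1, \ldots, x_n) = \binom{n}{m}^{-1} \sum_{1 \leq i_1 < \ldots < i_m \leq n} h\big(x_{i_1}, \ldots, x_{i_m}\big),$$
where $m \leq n$, is a U-statistic of order $m$ and kernel $h$.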
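A classical instance of this definition may help fix ideas: with the symmetric kernel $h(x, y) = (x-y)^2/2$ of order 2, the U-statistic is the unbiased sample variance. The Python sketch below evaluates a U-statistic by direct enumeration of the index subsets; the names `u_statistic` and `half_squared_diff` are illustrative choices.

```python
from itertools import combinations
from math import comb


def u_statistic(kernel, xs, m):
    """Average of a symmetric kernel over all size-m subsets of the sample."""
    n = len(xs)
    total = sum(kernel(*(xs[i] for i in idx))
                for idx in combinations(range(n), m))
    return total / comb(n, m)


def half_squared_diff(x, y):
    """Order-2 kernel whose U-statistic is the unbiased sample variance."""
    return (x - y) ** 2 / 2
```

For instance, `u_statistic(half_squared_diff, data, 2)` agrees with `statistics.variance(data)` up to rounding.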
Before clarifying the connection between LpO and U-statistics, let us introduce the main property of U-statistics our strategy relies on. It consists in representing any U-statistic as an average, over all permutations, of sums of independent variables.
Proposition 2.1 (Eq. (5.5) in Hoeffding (1963)).
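With the notation of Definition 2.1, let us define $W: \mathcal{X}^n \to \mathbb{R}$ by
$$W(x_1, \ldots, x_n) = \frac{1}{r} \sum_{j=1}^{r} h\big(x_{(j-1)m+1}, \ldots, x_{jm}\big),$$
where $r = \lfloor n/m \rfloor$ denotes the integer part of $n/m$. Then
$$U_n(x_1, \ldots, x_n) = \frac{1}{n!} \sum_{\sigma} W\big(x_{\sigma(1)}, \ldots, x_{\sigma(n)}\big),$$
where $\sum_{\sigma}$ denotes the summation over all permutations $\sigma$ of $\{1, \ldots, n\}$.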
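This representation can be verified numerically on a very small sample by averaging the block statistic over all permutations and comparing with the direct definition. The Python sketch below does this by brute force (the cost is factorial in the sample size, so it is purely illustrative); the function names are hypothetical.

```python
from itertools import combinations, permutations
from math import comb, factorial


def u_statistic(kernel, xs, m):
    """Direct definition: average of the kernel over all size-m subsets."""
    n = len(xs)
    return sum(kernel(*(xs[i] for i in idx))
               for idx in combinations(range(n), m)) / comb(n, m)


def block_average(kernel, xs, m):
    """Average of the kernel over the floor(n/m) consecutive disjoint blocks."""
    r = len(xs) // m
    return sum(kernel(*xs[j * m:(j + 1) * m]) for j in range(r)) / r


def u_via_permutations(kernel, xs, m):
    """Average of the block statistic over all permutations of the sample."""
    n = len(xs)
    total = sum(block_average(kernel, [xs[s] for s in sigma], m)
                for sigma in permutations(range(n)))
    return total / factorial(n)
```

For a fixed permutation, the blocks involve disjoint sets of observations, which is what makes each block average a sum of independent variables when the sample is i.i.d.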
We are now in position to state the key remark of the paper. All the developments exposed in the following result from this connection between the LpO estimator defined by Eq. (2.2) and U-statistics.
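Theorem 2.1.
For any classification rule $\mathcal{A}$ and any $1 \leq p \leq n-1$ such that the following quantities are well defined, the LpO estimator $\hat{R}_{p,n}$ is a U-statistic of order $m = n-p+1$ with kernel $h_m$ defined by
$$h_m(Z_1, \ldots, Z_m) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{\{\mathcal{A}(\mathcal{D}_m^{(i)}; X_i) \neq Y_i\}},$$
where $\mathcal{D}_m^{(i)}$ denotes the sample $\mathcal{D}_m = (Z_1, \ldots, Z_m)$ with $Z_i$ withdrawn.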
Proof of Theorem 2.1.
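From Eq. (2.2), the LpO estimator of the performance of any classification algorithm $\mathcal{A}$ computed from $\mathcal{D}_n$ satisfies
$$\hat{R}_{p,n} = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_{n-p}} \frac{1}{p} \sum_{i \in \bar{e}} \Big( \sum_{v \in \mathcal{E}_{n-p+1}} \mathbb{1}_{\{v = e \cup \{i\}\}} \Big) \mathbb{1}_{\{\mathcal{A}(\mathcal{D}^{v \setminus \{i\}}; X_i) \neq Y_i\}},$$
where $\mathcal{E}_{n-p+1}$ is the set of all subsets of $\{1, \ldots, n\}$ with cardinality $n-p+1$, since there is a unique set of indices $v$ with cardinality $n-p+1$ such that $v = e \cup \{i\}$. Then
$$\hat{R}_{p,n} = \frac{1}{p} \binom{n}{p}^{-1} \sum_{v \in \mathcal{E}_{n-p+1}} \sum_{i=1}^{n} \Big( \sum_{e \in \mathcal{E}_{n-p}} \mathbb{1}_{\{v = e \cup \{i\}\}} \mathbb{1}_{\{i \in \bar{e}\}} \Big) \mathbb{1}_{\{\mathcal{A}(\mathcal{D}^{v \setminus \{i\}}; X_i) \neq Y_i\}}.$$
Furthermore for $v$ and $i$ fixed, $\sum_{e \in \mathcal{E}_{n-p}} \mathbb{1}_{\{v = e \cup \{i\}\}} \mathbb{1}_{\{i \in \bar{e}\}} = \mathbb{1}_{\{i \in v\}}$ since there is a unique set of indices $e$ such that $e = v \setminus \{i\}$. One gets
$$\hat{R}_{p,n} = \frac{1}{p} \binom{n}{p}^{-1} \sum_{v \in \mathcal{E}_{n-p+1}} \sum_{i \in v} \mathbb{1}_{\{\mathcal{A}(\mathcal{D}^{v \setminus \{i\}}; X_i) \neq Y_i\}} = \binom{n}{n-p+1}^{-1} \sum_{v \in \mathcal{E}_{n-p+1}} \frac{1}{n-p+1} \sum_{i \in v} \mathbb{1}_{\{\mathcal{A}(\mathcal{D}^{v \setminus \{i\}}; X_i) \neq Y_i\}},$$
by noticing that $p \binom{n}{p} = (n-p+1) \binom{n}{n-p+1}$.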
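The kernel $h_m$ is a deterministic and symmetric function of its arguments that only depends on $m$. Let us also notice that $h_m(Z_1, \ldots, Z_m)$ reduces to the L1O estimator of the risk of the classification rule $\mathcal{A}$ computed from $Z_1, \ldots, Z_m$, that is
$$h_m(Z_1, \ldots, Z_m) = \hat{R}_1(\mathcal{A}, \mathcal{D}_m) = \hat{R}_{1,n-p+1}.$$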
In the context of testing whether two binary classifiers have different error rates, this fact has already been pointed out by Fuchs et al. (2013).
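The identity between the leave-$p$-out estimator and the U-statistic whose kernel is the leave-one-out estimator on $n-p+1$ points can be checked numerically on a toy sample. The Python sketch below compares the two brute-force computations for the kNN rule; all function names are hypothetical, $k$ is assumed odd (so that the rule is deterministic), and the enumeration is only feasible for very small $n$.

```python
from itertools import combinations
from math import comb, dist


def knn_predict(train_x, train_y, x, k):
    """kNN vote for odd k; distance ties go to the smallest index."""
    order = sorted(range(len(train_x)),
                   key=lambda i: (dist(train_x[i], x), i))
    return int(2 * sum(train_y[i] for i in order[:k]) > k)


def l1o(xs, ys, k):
    """Leave-one-out error of kNN on the sample (xs, ys)."""
    n = len(xs)
    errors = sum(
        knn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i], k) != ys[i]
        for i in range(n))
    return errors / n


def lpo(xs, ys, k, p):
    """Leave-p-out error of kNN, averaging over all test sets of size p."""
    n = len(xs)
    total = 0.0
    for test in combinations(range(n), p):
        train = [i for i in range(n) if i not in test]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        total += sum(knn_predict(tx, ty, xs[i], k) != ys[i] for i in test) / p
    return total / comb(n, p)


def lpo_as_u_statistic(xs, ys, k, p):
    """Average of the leave-one-out error over all subsamples of size n-p+1."""
    n = len(xs)
    m = n - p + 1
    total = sum(l1o([xs[i] for i in v], [ys[i] for i in v], k)
                for v in combinations(range(n), m))
    return total / comb(n, m)
```

On any small labeled sample (lists of coordinate tuples and 0/1 labels), `lpo(xs, ys, k, p)` and `lpo_as_u_statistic(xs, ys, k, p)` should coincide up to floating-point rounding.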
We now derive a general upper bound on the $q$-th moment ($q \geq 1$) of the LpO estimator that holds true for any classification rule as long as the following quantities remain meaningful.
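Theorem 2.2.
For any classification rule $\mathcal{A}$, let $\mathcal{A}(\mathcal{D}_n; \cdot)$ and $\mathcal{A}(\mathcal{D}_m; \cdot)$ be the corresponding classifiers built from respectively $\mathcal{D}_n$ and $\mathcal{D}_m$, where $m = n-p+1$. Then for every $1 \leq p \leq n-1$ such that the following quantities are well defined, and any $q \geq 1$,
$$\mathbb{E}\Big[\big|\hat{R}_{p,n} - \mathbb{E}[\hat{R}_{p,n}]\big|^q\Big] \leq \mathbb{E}\Big[\big|\hat{R}_{1,m} - \mathbb{E}[\hat{R}_{1,m}]\big|^q\Big]. \tag{2.5}$$
Furthermore as long as $p > n/2+1$, one also gets, for $q = 2$,
$$\mathbb{E}\Big[\big|\hat{R}_{p,n} - \mathbb{E}[\hat{R}_{p,n}]\big|^2\Big] \leq \frac{\mathbb{E}\Big[\big|\hat{R}_{1,m} - \mathbb{E}[\hat{R}_{1,m}]\big|^2\Big]}{\lfloor n/m \rfloor}, \tag{2.6}$$
and, for every $q > 2$, a bound of Rosenthal type on the same moment, which involves a numeric constant $\gamma > 0$ and the optimal constant $B(q, \gamma)$ defined in the Rosenthal inequality (Proposition D.2).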
The proof is given in Appendix A.1. Eq. (2.5) and Eq. (2.6) straightforwardly result from the Jensen inequality applied to the average over all permutations provided in Proposition 2.1. If $p > n/2+1$, the integer part $\lfloor n/m \rfloor$ becomes larger than 1 and Eq. (2.6) becomes better than Eq. (2.5) for $q = 2$. As a consequence of our strategy of proof, the right-hand side of Eq. (2.6) is equal to the classical upper bound on the variance of U-statistics, which suggests it cannot be improved without adding further assumptions.
Unlike the above ones, Eq. (2.2) is derived from the Rosenthal inequality, which enables us to upper bound the $L^q$ norm of a sum of independent and identically distributed centered random variables in terms of their individual $L^q$ norms and variances. Let us remark that, for $q = 2$, both terms of the right-hand side of Eq. (2.2) are of the same order as Eq. (2.6) up to constants. Furthermore using the Rosenthal inequality allows us to take advantage of the integer part $\lfloor n/m \rfloor$ when $p > n/2+1$, unlike what we get by using Eq. (2.5) for $q > 2$. In particular it provides a new understanding of the behavior of the LpO estimator when $p/n \to 1$, as highlighted later by Proposition 4.2.
3 New bounds on LpO moments for the kNN classifier
Since Theorem 2.2 expresses the moments of the LpO estimator in terms of those of the L1O estimator computed from $\mathcal{D}_m$ (with $m = n-p+1$), the next step consists in focusing on the L1O moments. Deriving upper bounds on the moments of the L1O is achieved using a generalization of the well-known Efron-Stein inequality (see Theorem D.1 for Efron-Stein’s inequality and Theorem 15.5 in Boucheron et al. (2013) for its generalization). For the sake of completeness, we first recall a corollary of this generalization that is proved in Section D.1.4 (see Corollary D.1).
Let denote independent random variables and , where is any Borelian function. With independent copies of the s, there exists a universal constant such that for any ,
Then applying Proposition 3.1 with $Z = \hat{R}_1(\mathcal{A}_k, \mathcal{D}_m)$ (L1O estimator computed from $\mathcal{D}_m$ with $m = n-p+1$) leads to the following Theorem 3.1, which finally allows us to control the order-$q$ moments of the L1O estimator applied to the kNN classifier.
For every , let () denote the NN classifier learnt from and be the corresponding LO estimator given by Eq. (2.2). Then
for every ,
Its proof (detailed in Section A.2) involves the use of Stone’s lemma (Lemma D.5), which upper bounds, for a given $X_i$, the number of points in the sample having $X_i$ among their $k$ nearest neighbors by $k\gamma_d$. The dependence of our upper bounds with respect to $\gamma_d$ (see the explicit constants) induces their strong deterioration as the dimension $d$ grows since $\gamma_d \approx 4.8^d - 1$. Therefore the larger the dimension $d$, the larger the required sample size $n$ for the upper bound to be small (at least smaller than 1). Note also that the tie-breaking strategy (based on the smallest index) is chosen so that Stone’s lemma holds true.
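The counting statement behind Stone’s lemma can be illustrated in dimension one, where an elementary argument (at most $k$ points on each side) shows that a given point belongs to the $k$ nearest neighbors of at most $2k$ other points when all points are distinct. The Python sketch below counts these occurrences on simulated data; the bound $2k$ is this one-dimensional count and is used here as an assumption for illustration, not the constant of Lemma D.5, and the function name is hypothetical.

```python
import random


def knn_in_degrees(points, k):
    """For each point, count how many other points have it among their k nearest neighbors."""
    n = len(points)
    in_degree = [0] * n
    for j in range(n):
        # Neighbors of point j, nearest first; distance ties go to the smallest index.
        others = sorted((i for i in range(n) if i != j),
                        key=lambda i: (abs(points[i] - points[j]), i))
        for i in others[:k]:
            in_degree[i] += 1
    return in_degree


rng = random.Random(0)
sample = [rng.random() for _ in range(200)]
for k in (1, 3, 5):
    degrees = knn_in_degrees(sample, k)
    # Each of the 200 points contributes exactly k neighbor relations.
    assert sum(degrees) == 200 * k
    # One-dimensional bound: at most k admissible points on each side.
    assert max(degrees) <= 2 * k
```

In higher dimension the same count is controlled by $k\gamma_d$, which is where the dependence on the dimension enters the bounds.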
In Eq. (3.1), the easier case enables to exploit exact calculations (rather than upper bounds) of the variance of the L1O. Further noticing (risk of the NN classifier learnt from ), the resulting rate is a strict improvement upon the usual upper bound in which is derived from using the sub-Gaussian exponential concentration inequality provided by Theorem 24.4 in Devroye et al. (1996).
By contrast the larger in Eq. (3.2) comes from the difficulty to derive a tight upper bound with for the expectation of , where (resp. ) denotes the sample where has been (resp. and have been) removed.
We are now in position to state the main result of this section. It follows from the combination of Theorem 2.2 (connecting moments of the LpO estimator to those of the L1O) and Theorem 3.1 (providing an upper bound on the order-$q$ moments of the L1O).
for every ,
with and , where denotes the constant arising from Stone’s lemma (Lemma D.5). Furthermore in the particular setting where , then
for every ,
The straightforward proof is detailed in Section A.3. Let us start by noticing that both upper bounds in Eq. (3.3) and (3.4) deteriorate as grows. This is no longer the case for Eq. (3.5) and (3.6), which are specifically designed to cover the setup where , that is where is no longer equal to 1. Therefore unlike Eq. (3.3) and (3.4), these last two inequalities are particularly relevant in the setup where , as , which has been investigated in different frameworks by Shao (1993); Yang (2006, 2007); Celisse (2014). Eq. (3.5) and (3.6) lead to respective convergence rates at worse (for ) and (for ). In particular this last rate becomes approximately equal to as gets large.
One can also emphasize that, as a U-statistic of fixed order , the LO estimator has a known Gaussian limiting distribution, that is (see Theorem A, Section 5.5.1 Serfling, 1980)
where , with . Therefore the upper bound given by Eq. (3.5) is non-improvable in some sense with respect to the interplay between and since one recovers the right magnitude for the variance term as long as is assumed to be constant.
Finally Eq. (3.6) has been derived using a specific version of the Rosenthal inequality (Ibragimov and Sharakhmetov, 2002) stated with the optimal constant and involving a “balancing factor”. In particular this balancing factor has allowed us to optimize the relative weight of the two terms between brackets in Eq. (3.6). This leads us to claim that the dependence of the upper bound with respect to cannot be improved with this line of proof. However we cannot conclude that the term in cannot be improved using other technical arguments.
4 Exponential concentration inequalities
This section provides exponential concentration inequalities for the LpO estimator applied to the kNN classifier. Our main results heavily rely on the moment inequalities previously derived in Section 3, that is Theorem 3.2. In order to emphasize the gain allowed by this strategy of proof, we start this section by successively proving two exponential inequalities obtained with less sophisticated tools. We then discuss the strengths and weaknesses of each of them to justify the additional refinements we introduce step by step along the section.
A first exponential concentration inequality for can be derived by use of the bounded difference inequality following the line of proof of Devroye et al. (1996, Theorem 24.4) originally developed for the L1O estimator.
The proof is given in Appendix B.1.
The upper bound of Eq. (4.1) strongly exploits the facts that: (i) for $X_j$ to be one of the $k$ nearest neighbors of $X_i$ in at least one subsample $\mathcal{D}^e$, it requires $X_j$ to be one of the $k+p-1$ nearest neighbors of $X_i$ in the complete sample, and (ii) the number of points for which $X_j$ may be one of the $k+p-1$ nearest neighbors cannot be larger than $(k+p-1)\gamma_d$ by Stone’s Lemma (see Lemma D.5).
This reasoning results in a rough upper bound since the denominator in the exponent exhibits a factor in which $k$ and $p$ play the same role. The reason is that we do not distinguish between points for which $X_j$ is among or above the $k$ nearest neighbors of $X_i$ in the whole sample, although these two setups lead to strongly different probabilities of being among the $k$ nearest neighbors in the training resample. Consequently the dependence of the convergence rate on $k$ and $p$ in Proposition 4.1 can be improved, as confirmed by forthcoming Theorems 4.1 and 4.2.
Based on the previous comments, a sharper quantification of the influence of each neighbor among the $k+p-1$ ones leads to the next result.
The proof is given in Section B.2.
Let us remark that unlike Proposition 4.1, taking into account the rank of each neighbor in the whole sample enables to considerably reduce the weight of (compared to that of ) in the denominator of the exponent. In particular, one observes that letting as (with assumed to be fixed for instance) makes the influence of the factor asymptotically negligible. This would allow to recover (up to numeric constants) a similar upper bound to that of Devroye et al. (1996, Theorem 24.4), achieved with .
However the upper bound of Theorem 4.1 does not reflect the right dependencies with respect to $k$ and $p$ compared with what has been proved for polynomial moments in Theorem 3.2. The upper bound seems to strictly deteriorate as $p$ increases, which contrasts with the upper bounds derived for $p > n/2+1$ in Theorem 3.2. This drawback is overcome by the following result, which is our main contribution in the present section.
For every such that , let denote the LO estimator of the classification error of the NN classifier defined by (2.1). Then for every ,
where with defined in Theorem 3.1.
The proof has been postponed to Appendix B.3. It involves different arguments for the two inequalities (4.2) and (4.3) depending on the range of values of $p$. Firstly for $p \leq n/2+1$, a simple argument is applied to derive Ineq. (4.2) from the two corresponding moment inequalities of Theorem 3.2 characterizing the sub-Gaussian behavior of the LpO estimator in terms of its even moments (see Lemma D.2). Secondly for $p > n/2+1$, we rather exploit (i) the appropriate upper bounds on the moments of the LpO estimator given by Theorem 3.2, and (ii) a dedicated Proposition D.1 which provides exponential concentration inequalities from general moment upper bounds.
In accordance with the conclusions drawn about Theorem 3.2, the upper bound of Eq. (4.2) increases as $p$ grows, unlike that of Eq. (4.3) which improves as $p$ increases. In particular the best concentration rate in Eq. (4.3) is achieved as $p/n \to 1$, whereas Eq. (4.2) turns out to be useless in that setting. Let us also notice that Eq. (4.2) remains strictly better than Theorem 4.1 as long as $p/n \to \delta \in [0,1)$, as $n \to +\infty$. Note also that the constants are the same as in Theorem 3.1. Therefore the same comments regarding their dependence with respect to the dimension $d$ apply here.
In order to facilitate the interpretation of the last Ineq. (4.3), we also derive the following proposition (proved in Appendix B.3) which focuses on the description of each deviation term in the particular case where $p > n/2+1$.
The present inequality is very similar to the well-known Bernstein inequality (Boucheron et al., 2013, Theorem 2.10), except that the second deviation term is of order $t^{3/2}$ instead of $t$ (for the Bernstein inequality).
With respect to , the first deviation term is of order , which is the same as with the Bernstein inequality. The second deviation term is of a somewhat different order, that is , as compared with the usual in the Bernstein inequality. Note that we almost recover this rate by choosing for instance , which leads to . Therefore varying allows to interpolate between the and the rates.
Note also that the dependence of the first (sub-Gaussian) deviation term with respect to is only , which improves upon the usual resulting from Ineq. (4.2) in Theorem 4.2 for instance. However this remains certainly too large for being optimal even if this question remains widely open at this stage in the literature.
More generally, one strength of our approach is its versatility. Indeed the two above deviation terms directly result from the two upper bounds on the moments of the L1O stated in Theorem 3.1. Therefore any improvement of the latter upper bounds would immediately enhance the present concentration inequality (without changing the proof).
5 Assessing the gap between LpO and classification error
5.1 Upper bounds
First, we derive new upper bounds on different measures of the discrepancy between the LpO estimator and the classification error or the risk of the kNN classifier. These bounds on the LpO estimator are completely new for $p > 1$, some of them being extensions of former ones specifically derived for the L1O estimator applied to the kNN classifier.
By contrast with the results in the previous sections, a new restriction on arises in Theorem 5.1, that is . It is the consequence of using Lemma D.6 in the above proof to quantify how different two classifiers respectively computed from the same and points can be. Indeed this lemma, which provides an upper bound on the stability of the NN classifier previously proved by Devroye and Wagner (1979b), only remains meaningful as long as .