# An adaptive nearest neighbor rule for classification

Akshay Balsubramani (abalsubr@stanford.edu), Sanjoy Dasgupta (dasgupta@eng.ucsd.edu), Yoav Freund (yfreund@eng.ucsd.edu), Shay Moran (shaym@princeton.edu)

###### Abstract

We introduce a variant of the $k$-nearest neighbor classifier in which $k$ is chosen adaptively for each query, rather than supplied as a parameter. The choice of $k$ depends on properties of each neighborhood, and therefore may vary significantly between different points. (For example, the algorithm will use larger $k$ for predicting the labels of points in noisy regions.)

We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, $k$-NN with an optimal choice of $k$. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the “advantage”, which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.

Preprint. Under review.

## 1 Introduction

We introduce an adaptive nearest neighbor classification rule. Given a training set with labels in $\{-1,+1\}$, its prediction at a query point $x$ is based on the training points closest to $x$, rather like the $k$-nearest neighbor rule. However, the value of $k$ that it uses can vary from query to query. Specifically, if there are $n$ training points, then for any query $x$, the smallest $k$ is sought for which the $k$ points closest to $x$ have labels whose average is either greater than $+\Delta(n,k)$, in which case the prediction is $+1$, or less than $-\Delta(n,k)$, in which case the prediction is $-1$; and if no such $k$ exists, then “?” (“don’t know”) is returned. Here, $\Delta(n,k)$ corresponds to a confidence interval for the average label in the region around the query.

We study this rule in the standard statistical framework in which all data are i.i.d. draws from some unknown underlying distribution $P$ on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the data space and $\mathcal{Y}$ is the label space. We take $\mathcal{X}$ to be a separable metric space, with distance function $d:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, and we take $\mathcal{Y}=\{-1,+1\}$. We can decompose $P$ into the marginal distribution $\mu$ on $\mathcal{X}$ and the conditional expectation of the label at each point $x$: if $(X,Y)$ represents a random draw from $P$, define $\eta(x)=\mathbb{E}(Y\mid X=x)$. In this terminology, the Bayes-optimal classifier is the rule $g^*:\mathcal{X}\to\{-1,+1\}$ given by

$$g^*(x)=\begin{cases}\operatorname{sign}(\eta(x)) & \text{if }\eta(x)\neq 0\\ \text{either }-1\text{ or }+1 & \text{if }\eta(x)=0\end{cases}\tag{1}$$

and its error rate is the Bayes risk, $R^*=\frac{1}{2}\,\mathbb{E}_{X\sim\mu}\big[1-|\eta(X)|\big]$. A variety of nonparametric classification schemes are known to have error rates that converge asymptotically to $R^*$. These include $k$-nearest neighbor (henceforth, $k$-NN) rules [FH51] in which $k$ grows with the number of training points $n$ according to a suitable schedule $(k_n)$, under certain technical conditions on the metric measure space $(\mathcal{X},d,\mu)$.

In this paper, we are interested in consistency as well as rates of convergence. In particular, we find that the adaptive nearest neighbor rule is also asymptotically consistent (under the same technical conditions) while converging at a rate that is about as good as, and sometimes significantly better than, that of $k$-NN under any schedule $(k_n)$.

Intuitively, one of the advantages of $k$-NN over nonparametric classifiers that use a fixed bandwidth or radius, such as Parzen window or kernel density estimators, is that $k$-NN automatically adapts to variation in the marginal distribution $\mu$: in regions with large $\mu$, the $k$ nearest neighbors lie close to the query point, while in regions with small $\mu$, the $k$ nearest neighbors can be further afield. The adaptive NN rule that we propose goes further: it also adapts to variation in $\eta$. In certain regions of the input space, where $\eta$ is close to $0$, an accurate prediction would need large $k$. In other regions, where $\eta$ is near $-1$ or $1$, a small $k$ would suffice, and in fact, a larger $k$ might be detrimental because neighboring regions might be labeled differently. See Figure 1 for one such example. A $k$-NN classifier is forced to pick a single value of $k$ that trades off between these two contingencies. Our adaptive NN rule, however, can pick the right $k$ in each neighborhood separately.

Our estimator allows us to give rates of convergence that are tighter and more transparent than those customarily obtained in nonparametric statistics. Specifically, for any point $x$ in the instance space $\mathcal{X}$, we define a notion of the advantage at $x$, denoted $\mathrm{adv}(x)$, which is rather like a local margin. We show that the prediction at $x$ is very likely to be correct once the number of training points exceeds $\tilde{O}(1/\mathrm{adv}(x))$. Universal consistency follows by establishing that almost all points have positive advantage.

### 1.1 Relation to other work in nonparametric estimation

For linear separators and many other parametric families of classifiers, it is possible to give rates of convergence that hold without any assumptions on the input distribution $\mu$ or the conditional expectation function $\eta$. This is not true of nonparametric estimation: although any target function can in principle be captured, the number of samples needed to achieve a specific level of accuracy will inevitably depend upon aspects of this function such as how fast it changes [DGL96, chapter 7]. As a result, nonparametric statistical theory has focused on (1) asymptotic consistency, ideally without assumptions, and (2) rates of convergence under a variety of smoothness assumptions.

Asymptotic consistency has been studied in great detail for the $k$-NN classifier, when $k$ is allowed to grow with the number of data points $n$. The risk of the classifier, denoted $R_n$, is its error rate on the underlying distribution $P$; this is a random variable that depends upon the set of training points seen. Cover and Hart [CH67] showed that in general metric spaces, under the assumption that every $x$ in the support of $\mu$ is either a continuity point of $\eta$ or has $\mu(\{x\})>0$, the expected risk $\mathbb{E}R_n$ converges to the Bayes-optimal risk $R^*$, as long as $k\to\infty$ and $k/n\to 0$. For points in finite-dimensional Euclidean space, a series of results starting with Stone [Sto77] established consistency without any assumptions on $\mu$ or $\eta$, and showed that $R_n\to R^*$ almost surely [DGKL94]. More recent work has extended these universal consistency results (that is, consistency without assumptions on $\eta$) to arbitrary metric measure spaces that satisfy a certain differentiation condition [CG06, CD14].

Rates of convergence have been obtained for $k$-nearest neighbor classification under various smoothness conditions including Hölder conditions on $\eta$ [KP95, Gyö81] and “Tsybakov margin” conditions [MT99, AT07, CD14]. Such assumptions have become customary in nonparametric statistics, but they leave a lot to be desired. First, they are uncheckable: it is not possible to empirically determine the smoothness given samples. Second, they view the underlying distribution through the tiny window of two or three parameters, obscuring almost all the remaining structure of the distribution that also influences the rate of convergence. Finally, because nonparametric estimation is local, there is the intriguing possibility of getting different rates of convergence in different regions of the input space: a possibility that is immediately defeated by reducing the entire space to two smoothness constants.

The first two of these issues are partially addressed by recent work of [CD14], who analyze the finite sample risk of $k$-NN classification without any assumptions on $P$. Their bounds involve terms that measure the probability mass of the input space in a carefully defined region around the decision boundary, and are shown to be “instance-optimal”: that is, optimal for the specific distribution $P$, rather than minimax-optimal for some very large class containing $P$. However, the expressions for the risk are somewhat hard to parse, in large part because of the interaction between $n$ and $k$.

In the present paper, we obtain finite-sample rates of convergence that are “instance-optimal” not just for the specific distribution but also for the specific query point. This is achieved by defining a margin, or advantage, at every point in the input space, and giving bounds (Theorem 1) entirely in terms of this quantity. For parametric classification, it has become common to define a notion of margin that controls generalization. In the nonparametric setting, it makes sense that the margin would in fact be a function $\mathcal{X}\to\mathbb{R}$, and would yield different generalization error bounds in different regions of space. Our adaptive nearest neighbor classifier allows us to realize this vision in a fairly elementary manner.

#### Organization.

Proofs are relegated to the appendices.

We begin by formally defining the setup and notation in Section 2. Then, a formal description of the adaptive $k$-NN algorithm is given in Section 3. In Sections 4 and 5 and Appendix A, we state and prove consistency and generalization bounds for this classifier, and compare them with previous bounds in the $k$-NN literature. These bounds exploit a general VC-based uniform convergence statement which is presented and proved in a self-contained manner in Appendix B.

## 2 Setup

Take the instance space to be a separable metric space $(\mathcal{X},d)$ and the label space to be $\mathcal{Y}=\{-1,+1\}$. All data are assumed to be drawn i.i.d. from a fixed unknown distribution $P$ over $\mathcal{X}\times\mathcal{Y}$.

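Let $\mu$ denote the marginal distribution on $\mathcal{X}$: if $(X,Y)$ is a random draw from $P$, then

$$\mu(S)=\Pr(X\in S)$$

for any measurable set $S\subseteq\mathcal{X}$. For any $x\in\mathcal{X}$, the conditional expectation, or bias, of $Y$ given $x$, is

$$\eta(x)=\mathbb{E}(Y\mid X=x)\in[-1,1].$$

Similarly, for any measurable set $S$ with $\mu(S)>0$, the conditional expectation of $Y$ given $X\in S$ is

$$\eta(S)=\mathbb{E}(Y\mid X\in S)=\frac{1}{\mu(S)}\int_S\eta(x)\,d\mu(x).$$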

The risk of a classifier $g:\mathcal{X}\to\{-1,+1,?\}$ is the probability that it is incorrect on pairs $(X,Y)\sim P$,

$$R(g)=P(\{(x,y):g(x)\neq y\}).\tag{2}$$

The Bayes-optimal classifier $g^*$, as given in (1), depends only on $\eta$, but its risk $R^*$ depends on $\mu$. For a classifier $g_n$ based on $n$ training points from $P$, we will be interested in whether $R(g_n)$ converges to $R^*$, and the rate at which this convergence occurs.
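To make the definitions of risk and Bayes risk concrete, the following minimal sketch evaluates both on a finite instance space. It is an illustration only: the three-point distribution and the function names (`risk`, `bayes_classifier`, `bayes_risk`) are assumptions introduced here, not part of the paper. It uses the fact that, with labels in $\{-1,+1\}$, $\Pr(Y=+1\mid X=x)=(1+\eta(x))/2$.

```python
from typing import Callable, Sequence


def risk(mu: Sequence[float], eta: Sequence[float], g: Callable[[int], object]) -> float:
    """R(g) = P(g(X) != Y) on the finite instance space {0, ..., m-1}.

    mu[i] is the marginal mass of point i and eta[i] = E[Y | X = i] in [-1, 1].
    """
    total = 0.0
    for i, (mass, bias) in enumerate(zip(mu, eta)):
        p_plus = (1.0 + bias) / 2.0  # P(Y = +1 | X = i)
        pred = g(i)
        if pred == 1:
            err = 1.0 - p_plus
        elif pred == -1:
            err = p_plus
        else:  # any other output, such as "?", never equals the label
            err = 1.0
        total += mass * err
    return total


def bayes_classifier(eta: Sequence[float]) -> Callable[[int], int]:
    """g*(i) = sign(eta[i]); ties at eta[i] = 0 are broken towards +1."""
    return lambda i: 1 if eta[i] >= 0 else -1


def bayes_risk(mu: Sequence[float], eta: Sequence[float]) -> float:
    """R* = (1/2) * E[1 - |eta(X)|]."""
    return sum(mass * (1.0 - abs(bias)) / 2.0 for mass, bias in zip(mu, eta))


# Hypothetical three-point distribution, used only for illustration.
mu = [0.2, 0.3, 0.5]
eta = [0.8, 0.4, -1.0]
g_star = bayes_classifier(eta)
print(risk(mu, eta, g_star), bayes_risk(mu, eta))
```

On this toy distribution the two quantities agree, as they should: the Bayes classifier errs only on the two noisy points.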

The algorithm and analysis in this paper depend heavily on the probability masses and biases of balls in $\mathcal{X}$. For $x\in\mathcal{X}$ and $r\ge 0$, let $B(x,r)$ denote the closed ball of radius $r$ centered at $x$,

$$B(x,r)=\{z\in\mathcal{X}:d(x,z)\le r\}.$$

For $0\le p\le 1$, let $r_p(x)$ be the smallest radius $r$ such that $B(x,r)$ has probability mass at least $p$, that is,

$$r_p(x)=\inf\{r\ge 0:\mu(B(x,r))\ge p\}.\tag{3}$$

It follows that $\mu(B(x,r_p(x)))\ge p$.

The support of the marginal distribution $\mu$ plays an important role in convergence proofs and is formally defined as

$$\mathrm{supp}(\mu)=\{x\in\mathcal{X}:\mu(B(x,r))>0\text{ for all }r>0\}.$$

It is a well-known consequence of the separability of $\mathcal{X}$ that $\mu(\mathrm{supp}(\mu))=1$ [CH67].

## 3 The adaptive k-nearest neighbor algorithm

The algorithm is given a labeled training set $(x_1,y_1),\ldots,(x_n,y_n)\in\mathcal{X}\times\mathcal{Y}$. Based on these points, it is able to compute empirical estimates of the probabilities and biases of different balls.

For any set $S\subseteq\mathcal{X}$, we define its empirical count and probability mass as

$$\#_n(S)=|\{i:x_i\in S\}|,\qquad \mu_n(S)=\frac{\#_n(S)}{n}.\tag{4}$$

If this is non-zero, we take the empirical bias to be

$$\eta_n(S)=\frac{\sum_{i:x_i\in S}y_i}{\#_n(S)}.\tag{5}$$

The adaptive $k$-NN algorithm (AKNN) is shown in Figure 2. It makes a prediction at $x$ by growing a ball around $x$ until the ball has significant bias, and then choosing the corresponding label. In some cases, a ball of sufficient bias may never be obtained, in which event “?” is returned. In what follows, let $g_n:\mathcal{X}\to\{-1,+1,?\}$ denote the AKNN classifier.

Later, we will also discuss a variant of this algorithm in which a modified confidence interval,

$$\Delta(n,k,\delta)=c_1\sqrt{\frac{d_0\log n+\log(1/\delta)}{k}}\tag{7}$$

is used, where $d_0$ is the VC dimension of the family of balls in $(\mathcal{X},d)$.

## 4 Pointwise advantage and rates of convergence

We now provide finite-sample rates of convergence for the adaptive nearest neighbor rule. For simplicity, we give convergence rates that are specific to any query point $x$ and that depend on a suitable notion of the “margin” of distribution $P$ around $x$.

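Pick any $p,\gamma>0$. Recalling definition (3), we say a point $x\in\mathrm{supp}(\mu)$ is $(p,\gamma)$-salient if the following holds for either $s=+1$ or $s=-1$:

• $s\,\eta(x)>0$, and $s\,\eta(B(x,r))>0$ for all $r\in[0,r_p(x))$, and $s\,\eta(B(x,r_p(x)))\ge\gamma$.

In words, this means that $g^*(x)=s$ (recall that $g^*$ is the Bayes classifier), that the biases of all balls of radius at most $r_p(x)$ around $x$ have the same sign as $s$, and that the bias of the ball of radius $r_p(x)$ has margin at least $\gamma$. A point $x$ can satisfy this definition for a variety of pairs $(p,\gamma)$. The advantage of $x$ is taken to be the largest value of $p\gamma^2$ over all such pairs:

$$\mathrm{adv}(x)=\begin{cases}\sup\{p\gamma^2: x\text{ is }(p,\gamma)\text{-salient}\} & \text{if }\eta(x)\neq 0\\ 0 & \text{if }\eta(x)=0\end{cases}\tag{8}$$

We will see (Lemma 3) that under a mild condition on the underlying metric measure space, almost all $x$ with $\eta(x)\neq 0$ have a positive advantage.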
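For intuition, the advantage can be computed exactly when the space is finite. The sketch below is an illustration under assumptions introduced here (a finite space in which the query $x$ itself has positive mass, and the hypothetical function name `advantage`). It relies on the observation that on a finite space only finitely many distinct balls are centered at $x$, so the supremum in (8) is the largest value of $\mu(B)\,\eta(B)^2$ over the balls $B$ centered at $x$ for which $B$ and every smaller ball have bias of the same sign as $\eta(x)$.

```python
from typing import Sequence


def advantage(dist_to_x: Sequence[float], mu: Sequence[float], eta: Sequence[float]) -> float:
    """adv(x) on a finite metric measure space.

    dist_to_x[i] is d(x, z_i), mu[i] the mass of z_i, eta[i] its bias. The query
    x must itself be one of the points (distance 0) and carry positive mass.
    """
    order = sorted(range(len(mu)), key=lambda i: dist_to_x[i])
    first = order[0]
    if dist_to_x[first] != 0 or mu[first] <= 0:
        raise ValueError("x must be a point of positive mass at distance 0")
    if eta[first] == 0:
        return 0.0
    s = 1.0 if eta[first] > 0 else -1.0

    best, mass, weighted = 0.0, 0.0, 0.0
    for pos, i in enumerate(order):
        mass += mu[i]
        weighted += mu[i] * eta[i]
        # Only evaluate at the end of a run of ties: that is a full closed ball.
        is_last = pos + 1 == len(order)
        if not is_last and dist_to_x[order[pos + 1]] == dist_to_x[i]:
            continue
        ball_bias = weighted / mass
        if s * ball_bias <= 0:
            break  # sign flipped, so no larger radius can be salient
        best = max(best, mass * ball_bias ** 2)
    return best


# Three points on a line at distances 0, 1, 2 from the query (hypothetical data).
print(advantage([0.0, 1.0, 2.0], [0.2, 0.3, 0.5], [0.8, 0.4, -1.0]))
```

Here the ball of radius 1 has mass $0.5$ and bias $0.56$, giving $p\gamma^2=0.1568$, which beats the ball of radius 0 ($0.2\cdot 0.8^2=0.128$); the ball of radius 2 has negative bias, so growing further is not allowed.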

The following theorem shows that for every point $x$, if the sample size $n$ satisfies $n\gtrsim 1/\mathrm{adv}(x)$, then the label of $x$ is likely to be $g^*(x)$, where $g^*$ is the Bayes optimal classifier. This provides a pointwise convergence of $g_n(x)$ to $g^*(x)$ with a rate which is sensitive to the “local geometry” of $x$.

###### Theorem 1 (Pointwise convergence rate).

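There is an absolute constant $C>0$ for which the following holds. Let $0<\delta<1$ denote the confidence parameter in the AKNN algorithm (Figure 2), and suppose the algorithm is used to define a classifier $g_n$ based on $n$ training points chosen i.i.d. from $P$. Then, for every point $x\in\mathrm{supp}(\mu)$, if

$$n\ge\frac{C}{\mathrm{adv}(x)}\max\left(\log\frac{1}{\mathrm{adv}(x)},\ \log\frac{1}{\delta}\right)$$

then with probability at least $1-\delta$ we have that $g_n(x)=g^*(x)$.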

If we further assume that the family of all balls in the space has finite VC dimension $d_0$ then we can strengthen Theorem 1 so that the guarantee holds with high probability simultaneously for all $x$. This is achieved by a modified version of the algorithm that uses confidence interval (7) instead of (6).

###### Theorem 2 (Uniform convergence rate).

Suppose that the set of balls in $(\mathcal{X},d)$ has finite VC dimension $d_0$, and that the algorithm of Figure 2 is used with confidence interval (7) instead of (6). Then, with probability at least $1-\delta$, the resulting classifier $g_n$ satisfies the following: for every point $x\in\mathrm{supp}(\mu)$, if

then $g_n(x)=g^*(x)$.

A key step towards proving Theorems 1 and 2 is to identify the subset of $\mathcal{X}$ that is likely to be correctly classified for a given number of training points $n$. This follows the rough outline of [CD14], which gave rates of convergence for $k$-nearest neighbor, but there are two notable differences. First, we will see that the likely-correct sets obtained in that earlier work (for $k$-NN) are subsets of those we obtain for the new adaptive nearest neighbor procedure. Second, the proof for our setting is considerably more streamlined; for instance, there is no need to devise tie-breaking strategies for deciding the identities of the $k$ nearest neighbors.

### 4.2 A comparison with k-nearest neighbor

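For $a\ge 0$, let $\mathcal{X}_a$ denote all points with advantage greater than $a$:

$$\mathcal{X}_a=\{x\in\mathrm{supp}(\mu):\mathrm{adv}(x)>a\}.\tag{9}$$

In particular, $\mathcal{X}_0$ consists of all points with positive advantage.

By Theorem 1, points in $\mathcal{X}_a$ are likely to be correctly classified when the number of training points is $\tilde{\Omega}(1/a)$, where the $\tilde{\Omega}(\cdot)$ notation ignores logarithmic terms. In contrast, the work of [CD14] showed that with $n$ training points, the $k$-NN classifier is likely to correctly classify the following set of points:

$$\begin{aligned}\mathcal{X}'_{n,k}={}&\{x\in\mathrm{supp}(\mu):\eta(x)>0,\ \eta(B(x,r))\ge k^{-1/2}\text{ for all }0\le r\le r_{k/n}(x)\}\\ &\cup\{x\in\mathrm{supp}(\mu):\eta(x)<0,\ \eta(B(x,r))\le -k^{-1/2}\text{ for all }0\le r\le r_{k/n}(x)\}.\end{aligned}$$

Such points are $(k/n,k^{-1/2})$-salient and thus have advantage at least $(k/n)\cdot(k^{-1/2})^2=1/n$. In fact,

$$\bigcup_{1\le k\le n}\mathcal{X}'_{n,k}\subseteq\mathcal{X}_{1/n}.$$

In this sense, the adaptive nearest neighbor procedure is able to perform roughly as well as all choices of $k$ simultaneously (logarithmic factors prevent this from being a precise statement).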

## 5 Universal consistency

In this section we study the convergence of $R(g_n)$ to the Bayes risk $R^*$ as the number of points $n$ grows. An estimator is described as universally consistent in a metric measure space $(\mathcal{X},d,\mu)$ if it has this desired limiting behavior for all conditional expectation functions $\eta$.

Earlier work [CD14] has established the universal consistency of $k$-nearest neighbor (for $k/n\to 0$ and $k/\log n\to\infty$) in any metric measure space that satisfies the Lebesgue differentiation condition: that is, for any bounded measurable $f:\mathcal{X}\to\mathbb{R}$ and for almost all ($\mu$-a.e.) $x\in\mathcal{X}$,

$$\lim_{r\downarrow 0}\frac{1}{\mu(B(x,r))}\int_{B(x,r)}f\,d\mu=f(x).\tag{10}$$

This is known to hold, for instance, in any finite-dimensional normed space or any doubling metric space [Hei01, Chapter 1].

We will now see that this same condition implies the universal consistency of the adaptive nearest neighbor rule. To begin with, it implies that almost every point has a positive advantage.

###### Lemma 3.

Suppose metric measure space $(\mathcal{X},d,\mu)$ satisfies condition (10). Then, for any conditional expectation $\eta$, the set of points

$$\{x\in\mathcal{X}:\eta(x)\neq 0,\ \mathrm{adv}(x)=0\}$$

has zero $\mu$-measure.

###### Proof.

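Let $\mathcal{X}'\subseteq\mathcal{X}$ consist of all points $x\in\mathrm{supp}(\mu)$ for which condition (10) holds true with $f=\eta$, that is, $\lim_{r\downarrow 0}\eta(B(x,r))=\eta(x)$. Since $\mu(\mathrm{supp}(\mu))=1$, it follows that $\mu(\mathcal{X}')=1$.

Pick any $x\in\mathcal{X}'$ with $\eta(x)\neq 0$; without loss of generality, $\eta(x)>0$. By (10), there exists $r_o>0$ such that

$$\eta(B(x,r))\ge\eta(x)/2\quad\text{for all }0\le r\le r_o.$$

Thus $x$ is $(p,\gamma)$-salient for $p=\mu(B(x,r_o))>0$ and $\gamma=\eta(x)/2$, and has positive advantage. ∎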

Universal consistency follows as a consequence; the proof details are deferred to  Appendix A.

###### Theorem 4 (Universal consistency).

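Suppose the metric measure space $(\mathcal{X},d,\mu)$ satisfies condition (10). Let $(\delta_n)$ be a sequence in $(0,1)$ with (1) $\sum_n\delta_n<\infty$ and (2) $\lim_{n\to\infty}\log(1/\delta_n)/n=0$. Let the classifier $g_{n,\delta_n}$ be the result of applying the AKNN procedure (Figure 2) with $n$ points chosen i.i.d. from $P$ and with confidence parameter $\delta_n$. Letting $R_n=R(g_{n,\delta_n})$ denote the risk of $g_{n,\delta_n}$, we have $R_n\to R^*$ almost surely.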
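As a worked illustration, reading the two conditions on the confidence sequence as summability of $\delta_n$ and $\log(1/\delta_n)/n\to 0$ (which is how the proof in Appendix A.4 uses them), one admissible choice is $\delta_n=n^{-2}$. This particular schedule is an example introduced here, not one taken from the paper; $c_3$ and $a_n$ are the constant and advantage sequence from Appendix A.4.

```latex
\delta_n = n^{-2}: \qquad
\sum_{n \ge 1} \delta_n = \frac{\pi^2}{6} < \infty,
\qquad
\frac{\log(1/\delta_n)}{n} = \frac{2\log n}{n} \xrightarrow[n \to \infty]{} 0,
\qquad
a_n = \frac{c_3}{n}\max\!\Big(2\log n,\ \log\frac{1}{\delta_n}\Big) = \frac{2 c_3 \log n}{n} \to 0 .
```

By contrast, a constant sequence $\delta_n=\delta$ fails the summability requirement, and a sequence such as $\delta_n=e^{-n}$ fails the second one.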

## 6 Experiments

We performed a few experiments using real-world datasets from computer vision and genomics (see Appendix C). These were conducted with some practical alterations to the algorithm of Fig. 2.

Multiclass extension: Suppose the set of possible labels is . We replace the binary rule “find the smallest such that " with the rule: “find the smallest such that for some , where ."
Parametrization: We replace Equation (6) with $\Delta=A/\sqrt{k}$, where $A$ is a confidence parameter corresponding to the theory’s $\delta$ (given $n$).
Resolving multilabel predictions: Our algorithm can output answers that are not a single label. The output can be “?”, which indicates that no label has sufficient evidence. It can also be a subset of that contains more than one element, indicating that more than one label has significant evidence. In some situations, using subsets of the labels is more informative. However, when we want to compare head-to-head with -NN, we need to output a single label. We use a heuristic to predict with a single label on any : the label for which is largest.

We briefly discuss our main conclusions from the experiments, with further details deferred to Appendix C.

AKNN is comparable to the best $k$-NN rule. In Section 4.2 we prove that AKNN compares favorably to $k$-NN with any fixed $k$. We demonstrate this in practice in different situations. With simulated independent label noise on the MNIST dataset (Fig. 3), a small value of $k$ is optimal for noiseless data, but performs very poorly when the noise level is high. On the other hand, AKNN adapts to the local noise level automatically, as demonstrated without adding noise on the more challenging notMNIST and single-cell genomics data (Fig. 4, 5, 6).

Varying the confidence parameter controls abstaining. The parameter $A$ controls how conservative the algorithm is in deciding to abstain, instead of incurring error by predicting. $A=0$ represents the most aggressive setting, in which the algorithm never abstains, essentially predicting according to a $1$-NN rule. Higher settings of $A$ cause the algorithm to abstain on some of these predicted points, for which there is no sufficiently small neighborhood with a sufficiently significant label bias (Fig. 7).

Adaptively chosen neighborhood sizes reflect local confidence. The number of neighbors chosen by AKNN is a local quantity that gives a practical pointwise measure of the confidence associated with label predictions. Small neighborhoods are chosen when one label is measured as significant nearly as soon as statistically possible; by definition of the AKNN stopping rule, this is not true where large neighborhoods are necessary. In our experiments, performance on points with significantly higher neighborhood sizes dropped monotonically, with the majority of the dataset having performance significantly exceeding the best -NN rule over a range of settings of (Fig. 4, 6; Appendix C).

## References

• [AT07] J.-Y. Audibert and A.B. Tsybakov. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608–633, 2007.
• [BBL05] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323–375, 2005.
• [C18] Tabula Muris Consortium et al. Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562(7727):367, 2018.
• [CD10] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343–351, 2010.
• [CD14] K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
• [CG06] F. Cerou and A. Guyader. Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10:340–355, 2006.
• [CH67] T. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
• [DCL11] Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577–586. ACM, 2011.
• [DGKL94] L. Devroye, L. Györfi, A. Krzyzak, and G. Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, 22:1371–1385, 1994.
• [DGL96] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
• [Dud79] R.M. Dudley. Balls in $\mathbb{R}^k$ do not cut all subsets of $k+2$ points. Advances in Mathematics, 31(3):306–308, 1979.
• [FH51] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, Texas, Project 21-49-004, Report 4, Contract AD41(128)-31, 1951.
• [Gyö81] L. Györfi. The rate of convergence of $k_n$-NN regression estimates and classification rules. IEEE Transactions on Information Theory, 27(3):362–364, 1981.
• [Hei01] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, 2001.
• [KP95] S. Kulkarni and S. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.
• [MNI]
• [Mou18] Mouse cell atlas dataset. Accessed: 2019-05-02.
• [MT99] E. Mammen and A.B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
• [not11] notmnist dataset. Accessed: 2019-05-02.
• [RS98] Martin Raab and Angelika Steger. "balls into bins" - A simple and tight analysis. In Randomization and Approximation Techniques in Computer Science, Second International Workshop, RANDOM’98, Barcelona, Spain, October 8-10, 1998, Proceedings, pages 159–170, 1998.
• [Sto77] C. Stone. Consistent nonparametric regression. Annals of Statistics, 5:595–645, 1977.
• [VC71] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.

## Appendix A Analysis and proofs

The first step in establishing advantage-dependent rates of convergence is to bound the accuracy of empirical estimates of probability mass and bias. This is achieved by a careful choice of large deviation bounds.

### A.1 Large deviation bounds

Suppose we draw $n$ points $(x_1,y_1),\ldots,(x_n,y_n)$ from $P$. If $n$ is reasonably large, we would expect the empirical mass $\mu_n(S)$ of any set $S\subseteq\mathcal{X}$, as defined in (4), to be close to its probability mass under $\mu$. The following lemma, from [CD10], quantifies one particular aspect of this.

###### Lemma 5 ([CD10], Lemma 7).

There is a universal constant $c_o$ such that the following holds. Let $\mathcal{B}$ be any class of measurable subsets of $\mathcal{X}$ of VC dimension $d_0$. Pick any $0<\delta<1$. Then with probability at least $1-\delta^2/2$ over the choice of $(x_1,y_1),\ldots,(x_n,y_n)$, for all $B\in\mathcal{B}$ and for any integer $k$, we have

$$\mu(B)\ge\frac{k}{n}+\frac{c_o}{n}\max\left(k,\ d_0\log\frac{n}{\delta}\right)\ \Longrightarrow\ \mu_n(B)\ge\frac{k}{n}.$$

Likewise, we would expect the empirical bias $\eta_n(S)$ of a set $S\subseteq\mathcal{X}$, as defined in (5), to be close to its true bias $\eta(S)$. The latter is defined whenever $\mu(S)>0$.

###### Lemma 6.

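There is a universal constant $c_1$ for which the following holds. Let $\mathcal{C}$ be a class of subsets of $\mathcal{X}$ with VC dimension $d_0$. Pick any $0<\delta<1$. Then with probability at least $1-\delta^2/2$ over the choice of $(x_1,y_1),\ldots,(x_n,y_n)$, for all $C\in\mathcal{C}$,

$$|\eta_n(C)-\eta(C)|\le\Delta(n,\#_n(C),\delta)$$

where $\#_n(C)=|\{i:x_i\in C\}|$ is the number of points in $C$ and

$$\Delta(n,k,\delta)=c_1\sqrt{\frac{d_0\log n+\log(1/\delta)}{k}}.\tag{11}$$

Lemma 6 is a special case of a uniform convergence bound for conditional probabilities (Theorem 8) that we present and prove in Appendix B. (Indeed, Lemma 6 follows from Theorem 8 by taking $\mathcal{A}=\{\mathcal{X}\times\{+1\}\}$ and $\mathcal{B}=\{C\times\mathcal{Y}:C\in\mathcal{C}\}$, since $\eta(C)=2\Pr(Y=+1\mid X\in C)-1$.)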

### A.2 Proof of Theorem 1

###### Theorem (Theorem 1 restatement).

There is an absolute constant $C>0$ for which the following holds. Let $0<\delta<1$ denote the confidence parameter in the AKNN algorithm (Figure 2), and suppose the algorithm is used to define a classifier $g_n$ based on $n$ training points chosen i.i.d. from $P$. Then, for every point $x\in\mathrm{supp}(\mu)$, if

$$n\ge\frac{C}{\mathrm{adv}(x)}\max\left(\log\frac{1}{\mathrm{adv}(x)},\ \log\frac{1}{\delta}\right)$$

then with probability at least $1-\delta$ we have that $g_n(x)=g^*(x)$.

###### Proof.

Define , where and are the constants from Lemmas 5 and 6, and take .

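Suppose $\eta(x)>0$; the negative case is symmetric. The set of all balls centered at $x$ is easily seen to have VC dimension $1$. By Lemmas 5 and 6, we have that with probability at least $1-\delta^2$, the following two properties hold for all balls $B$ centered at $x$:

1. For any integer $k$, we have $\#_n(B)\ge k$ whenever $\mu(B)\ge\frac{k}{n}+\frac{c_o}{n}\max\left(k,\ \log\frac{n}{\delta}\right)$.

2. $|\eta_n(B)-\eta(B)|\le\Delta(n,\#_n(B),\delta)$.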

Assume henceforth that these hold.

By the definition of advantage, point is -salient for some with . The lower bound on in the theorem statement implies that

$$\gamma\ge 2c_2\sqrt{\frac{\log n+\log(1/\delta)}{np}},\tag{12}$$

or equivalently that .

Set . By (12) we have and thus . As a result, , and by property 1, the ball has . This means, in turn, that by property 2,

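$$\begin{aligned}\eta_n(B)&\ge\eta(B)-\Delta(n,k,\delta)\ \ge\ \gamma-c_1\sqrt{\frac{\log(n/\delta)}{k}}\\ &\ge 2c_2\sqrt{\frac{\log(n/\delta)}{np}}-c_1\sqrt{\frac{\log(n/\delta)}{k}}\ \ge\ 2c_1\sqrt{\frac{\log(n/\delta)}{k}}-c_1\sqrt{\frac{\log(n/\delta)}{k}}\\ &=c_1\sqrt{\frac{\log(n/\delta)}{k}}\ \ge\ \Delta(n,\#_n(B),\delta).\end{aligned}$$

Thus ball $B$ would trigger a prediction of $+1$.

At the same time, for any ball $B'=B(x,r)$ with $r\le r_p(x)$,

$$\eta_n(B')\ge\eta(B')-\Delta(n,\#_n(B'),\delta)>-\Delta(n,\#_n(B'),\delta)$$

and thus no such ball will trigger a prediction of $-1$. Therefore, the prediction at $x$ must be $+1$. ∎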

### A.3 Proof of Theorem 2

This proof follows much the same outline as that of Theorem 1. A crucial difference is that uniform large deviation bounds (Lemmas 5 and 6) are applied to the class of all balls in $\mathcal{X}$, which is assumed to have finite VC dimension $d_0$. (This assumption is motivated by finite-dimensional Euclidean space $\mathbb{R}^d$, where it holds with $d_0=d+1$ [Dud79].) In contrast, the proof of Theorem 1 only applies these bounds to the class of balls centered at a specific point, which has VC dimension at most 1 in any metric space.

### A.4 Proof of Theorem 4

Recall from (9) that $\mathcal{X}_a$ denotes the set of points with advantage greater than $a$.

###### Lemma 7.

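Pick any $0<\delta<1$ as a confidence parameter for the AKNN estimator of Figure 2. Fix any $a>0$. If the number of training points $n$ satisfies

$$n\ge\frac{c_3}{a}\max\left(\log\frac{c_3}{a},\ \log\frac{1}{\delta}\right),$$

then with probability at least $1-\delta$, the resulting classifier $g_n$ has risk

$$R(g_n)-R^*\le\delta+\mu(\mathcal{X}_0\setminus\mathcal{X}_a).$$

###### Proof.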

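From Theorem 1, we have that for any $x\in\mathcal{X}_a$,

$$\Pr\nolimits_n(g_n(x)\neq g^*(x))\le\delta^2,$$

where $\Pr_n$ denotes probability over the choice of training points. Thus, for $X\sim\mu$,

$$\mathbb{E}_n\mathbb{E}_X\mathbf{1}(g_n(X)\neq g^*(X)\mid X\in\mathcal{X}_a)\le\delta^2,$$

and by Markov’s inequality,

$$\Pr\nolimits_n\big[\Pr\nolimits_X(g_n(X)\neq g^*(X)\mid X\in\mathcal{X}_a)\ge\delta\big]\le\delta.$$

Thus, with probability at least $1-\delta$ over the training set,

$$\Pr\nolimits_X(g_n(X)\neq g^*(X)\mid X\in\mathcal{X}_a)\le\delta.$$

On points with $\eta(x)=0$, both $g_n$ and the Bayes-optimal $g^*$ incur the same risk. Thus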

where we invoke Lemma 3 for the last step. ∎

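We now complete the proof of Theorem 4. Given the sequence of confidence parameters $(\delta_n)$, define a sequence of advantage values $(a_n)$ by

$$a_n=\frac{c_3}{n}\max\left(2\log n,\ \log\frac{1}{\delta_n}\right).$$

The conditions on $(\delta_n)$ imply $a_n\to 0$.

Pick any $\epsilon>0$. By the conditions on $(\delta_n)$, we can pick $N$ so that $\sum_{n\ge N}\delta_n\le\epsilon$. Let $\omega$ denote a realization of an infinite training sequence from $P$. By Lemma 7 and a union bound over $n\ge N$,

$$\Pr\big(\omega:\exists\,n\ge N\text{ s.t. }R(g_n(\omega))-R^*>\delta_n+\mu(\mathcal{X}_0\setminus\mathcal{X}_{a_n})\big)\le\sum_{n\ge N}\delta_n\le\epsilon.$$

Thus, with probability at least $1-\epsilon$ over the training sequence $\omega$, we have that for all $n\ge N$,

$$R(g_n(\omega))-R^*\le\delta_n+\mu(\mathcal{X}_0\setminus\mathcal{X}_{a_n}),$$

whereupon $R(g_n(\omega))\to R^*$ (since $\delta_n\to 0$ and $a_n\to 0$). Since this holds for any $\epsilon>0$, the theorem follows.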

## Appendix B Uniform Convergence of Empirical Conditional Measures

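### B.1 Formal Statement

Let $P$ be a distribution over $X$, and let $\mathcal{A},\mathcal{B}\subseteq 2^X$ be two collections of events. Consider $n$ independent samples from $P$, denoted by $x_1,\ldots,x_n$. We would like to estimate $P(A|B)$ simultaneously for all $A\in\mathcal{A}$, $B\in\mathcal{B}$. It is natural to consider the empirical estimates:

$$P_n(A|B)=\frac{\sum_i\mathbf{1}[x_i\in A\cap B]}{\sum_i\mathbf{1}[x_i\in B]}.$$

We study when (and to what extent) these estimates provide a good approximation. Note that the case where $\mathcal{B}=\{X\}$ (i.e., in which one estimates $P(A)$ using $P_n(A)$ simultaneously for all $A\in\mathcal{A}$) is handled by the classical VC theory. Throughout this section we assume that both $\mathcal{A},\mathcal{B}$ have a finite VC-dimension, and we let $d_0$ denote an upper bound on both $\mathrm{VC}(\mathcal{A})$ and $\mathrm{VC}(\mathcal{B})$.

To demonstrate the kinds of statements we would like to derive, consider the case where each of $\mathcal{A},\mathcal{B}$ contains only one event: $\mathcal{A}=\{A\}$ and $\mathcal{B}=\{B\}$, and set $\#_n(B)=\sum_i\mathbf{1}[x_i\in B]$. A Chernoff bound implies that conditioned on the event that $\#_n(B)>0$, the following holds with probability at least $1-\delta$:

$$|P(A|B)-P_n(A|B)|\le\sqrt{\frac{2\log(1/\delta)}{\#_n(B)}}.\tag{13}$$

To derive it, use that conditioned on $x_i\in B$, the event $x_i\in A$ has probability $P(A|B)$, and therefore the random variable “$\#_n(B)\cdot P_n(A|B)$” has a binomial distribution with parameters $\#_n(B)$ and $P(A|B)$.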
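The single-event case can be sketched in a few lines. The snippet below is an illustration only: it computes the empirical conditional estimate together with the data-dependent width $\sqrt{2\log(1/\delta)/\#_n(B)}$ from Equation 13. The function name `conditional_estimate` and the toy events are assumptions introduced here.

```python
import math
from typing import Callable, Iterable, Tuple


def conditional_estimate(
    samples: Iterable[object],
    in_A: Callable[[object], bool],
    in_B: Callable[[object], bool],
    conf: float,
) -> Tuple[float, float, int]:
    """Return (P_n(A|B), width, #_n(B)) with width = sqrt(2 log(1/conf) / #_n(B))."""
    samples = list(samples)
    count_B = sum(1 for x in samples if in_B(x))
    if count_B == 0:
        raise ValueError("no sample falls in B; the estimate is undefined")
    count_AB = sum(1 for x in samples if in_B(x) and in_A(x))
    width = math.sqrt(2.0 * math.log(1.0 / conf) / count_B)
    return count_AB / count_B, width, count_B


# Toy events on the integers 0..9: A = "even", B = "less than 6".
estimate, width, count_B = conditional_estimate(
    range(10), lambda x: x % 2 == 0, lambda x: x < 6, conf=0.05
)
print(estimate, width, count_B)
```

The width shrinks as more samples land in $B$, which is the data-dependent behavior the rest of this appendix aims to obtain uniformly over $\mathcal{A}$ and $\mathcal{B}$.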

Note that the bound on the error in Equation 13 depends on and therefore is data-dependent. We stress that this is the type of statement we want: the more samples belong to , the more certain we are with the empirical estimate. Thus, we would want to prove a statement as follows:

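With probability at least $1-\delta$,

$$(\forall A\in\mathcal{A})(\forall B\in\mathcal{B}):\ |P(A|B)-P_n(A|B)|\le O\left(\sqrt{\frac{d_0\log(1/\delta)}{\#_n(B)}}\right),$$

where $\#_n(B)=\sum_{i=1}^n\mathbf{1}[x_i\in B]$.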

The above statement is, unfortunately, false. As an example, consider the probability space defined by drawing $i\in[n]$ uniformly, and then coloring $i$ by a color in $\{\pm 1\}$ uniformly. For each $i$ let $B_i$ denote the event that $i$ was drawn, and let $A$ denote the event that the drawn color was $+1$ (formally, $B_i=\{(i,+1),(i,-1)\}$ and $A=\{(i,+1):i\in[n]\}$). One can verify that the VC dimension of $\mathcal{B}=\{B_i:i\in[n]\}$ and of $\mathcal{A}=\{A\}$ is at most $1$. The above statement fails in this setting: indeed, one can verify that if we draw $n$ samples from this space then with a constant probability there will be some $i$ such that:

• $i$ always gets the same color (say $+1$), and

• $i$ is sampled at least $\Omega(\log n/\log\log n)$ times (this follows from analyzing the maximal bin in a uniform assignment of $n$ balls into $n$ bins [RS98]).

Therefore, with constant probability we get that

$$P_n(A|B_i)=1,\qquad P(A|B_i)=1/2,$$

and so the error of the empirical estimate is $1/2$, which is not upper bounded by $O\big(\sqrt{d_0\log(1/\delta)/\#_n(B_i)}\big)$ once $n$ is large.
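The counterexample is easy to simulate. The sketch below is an illustration under assumptions introduced here (the function name `largest_monochromatic_bin` and the fixed seed are hypothetical): it draws $n$ samples (index, color) and reports the largest number of times any index was drawn while always receiving the same color. For such an index the empirical conditional probability is 0 or 1, while the true value is $1/2$.

```python
import random
from collections import defaultdict


def largest_monochromatic_bin(n: int, seed: int = 0) -> int:
    """Draw n samples (i, color) with i uniform in {0, ..., n-1} and color
    uniform in {-1, +1}; return the largest sample count among indices whose
    samples all received the same color (0 if there is no such index)."""
    rng = random.Random(seed)
    colors_seen = defaultdict(list)
    for _ in range(n):
        i = rng.randrange(n)
        colors_seen[i].append(rng.choice((-1, 1)))
    return max(
        (len(colors) for colors in colors_seen.values() if len(set(colors)) == 1),
        default=0,
    )


for n in (100, 10_000, 1_000_000):
    print(n, largest_monochromatic_bin(n))
```

Any index drawn exactly once is trivially monochromatic; the point of the counterexample is that, as $n$ grows, some index is drawn many times and still sees a single color, so the width $\sqrt{d_0\log(1/\delta)/\#_n(B_i)}$ shrinks while the error stays at $1/2$.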

We prove the following (slightly weaker) variant:

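###### Theorem 8 (UCECM).

Let $P$ be a probability distribution over $X$, and let $\mathcal{A},\mathcal{B}$ be two families of measurable subsets of $X$ such that $\mathrm{VC}(\mathcal{A}),\mathrm{VC}(\mathcal{B})\le d_0$. Let $n\in\mathbb{N}$, and let $x_1,\ldots,x_n$ be $n$ i.i.d. samples from $P$. Then, the following event occurs with probability at least $1-\delta$:

$$(\forall A\in\mathcal{A})(\forall B\in\mathcal{B}):\ |P(A|B)-P_n(A|B)|\le\sqrt{\frac{k_o}{\#_n(B)}},$$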

where , and555Note that the above inequality makes sense also when , by identifying as , and using the convention that and that . .

#### Discussion.

Theorem 8 can be combined with Lemma 5 to yield a bound on the minimal $n$ for which $P_n(A|B)$ is a non-trivial approximation of $P(A|B)$. Indeed, Lemma 5 implies that if $n$ is large enough so that $P(B)=\tilde{\Omega}(d_0/n)$, then the empirical estimate is a decent approximation. In the context of the adaptive nearest neighbor classifier, this means that the empirical biases provide meaningful estimates of the true biases for balls whose measure is $\tilde{\Omega}(d_0/n)$. This resembles the learning rate in realizable settings.

We remark that a weaker statement than Theorem 8 can be derived as a corollary of the classical uniform convergence result [VC71]. Indeed, since the VC dimension of $\{A\cap B:A\in\mathcal{A},B\in\mathcal{B}\}$ is $O(d_0)$, it follows that

$$P_n(A|B)\approx\frac{P(A\cap B)\pm\sqrt{d_0/n}}{P(B)\pm\sqrt{d_0/n}}.$$

However, this bound guarantees non-trivial estimates only once $P(B)$ is roughly $\sqrt{d_0/n}$. This is similar to the learning rate in agnostic (i.e., non-realizable) settings.

Another major advantage of the uniform convergence bound in Theorem 8 is that it is data-dependent: if many points from the sample belong to $B$ (i.e., $\#_n(B)$ is large), then we get better guarantees on the approximation of $P(A|B)$ by $P_n(A|B)$ for all $A\in\mathcal{A}$.

### B.2 Proof of Theorem 8

As noted above, the standard uniform convergence bound for VC classes cannot yield the bound in Theorem 8. Instead, we use a variant of it due to [BBL05] which concerns relative deviations (see [BBL05]: Theorem 5.1 and the discussion before Corollary 5.2). In order to state the theorem, we need the following notation: Let $\mathcal{C}$ be a family of subsets of $X$. We denote by $S_{\mathcal{C}}$ the growth function of $\mathcal{C}$, which is defined by:

$$S_{\mathcal{C}}(n)=\max\{|\mathcal{C}|_R|:R\subseteq X,\ |R|=n\},$$

where $\mathcal{C}|_R=\{C\cap R:C\in\mathcal{C}\}$ is the projection of $\mathcal{C}$ to $R$.

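###### Theorem 9 ([BBL05]).

Let $\mathcal{C}$ be a family of subsets of $X$ and let $P$ be a distribution over $X$. Then, the following holds with probability $1-\delta$:

$$(\forall C\in\mathcal{C}):\ |P(C)-P_n(C)|\le 2\sqrt{P_n(C)\,\frac{\log S_{\mathcal{C}}(2n)+\log(4/\delta)}{n}}+4\,\frac{\log S_{\mathcal{C}}(2n)+\log(4/\delta)}{n}.$$

Set $\mathcal{C}=\mathcal{B}\cup\{A\cap B:A\in\mathcal{A},B\in\mathcal{B}\}$. We prove Theorem 8 by applying Theorem 9 on $\mathcal{C}$; to this end we first upper bound $S_{\mathcal{C}}(n)$. Let $\mathcal{D}=\{A\cap B:A\in\mathcal{A},B\in\mathcal{B}\}$, so that $\mathcal{C}=\mathcal{B}\cup\mathcal{D}$. Then:

$$S_{\mathcal{C}}(n)\le S_{\mathcal{B}}(n)+S_{\mathcal{D}}(n)\le S_{\mathcal{B}}(n)+S_{\mathcal{A}}(n)S_{\mathcal{B}}(n)\le 2S_{\mathcal{A}}(n)S_{\mathcal{B}}(n)\le 2\binom{n}{\le d_0}^2\le 2(2n)^{2d_0},$$

where the second inequality follows since $S_{\mathcal{D}}(n)\le S_{\mathcal{A}}(n)S_{\mathcal{B}}(n)$, the second to last inequality follows from the Sauer-Shelah-Perles Lemma, and the last inequality follows since $\binom{n}{\le d_0}\le(2n)^{d_0}$. Therefore, applying Theorem 9 on $\mathcal{C}$ yields that with probability $1-\delta$ the following event holds:

$$(\forall C\in\mathcal{C}):\ |P(C)-P_n(C)|\le 4\sqrt{P_n(C)\,\frac{d_0\log 8n+\log(4/\delta)}{n}}+8\,\frac{d_0\log 8n+\log(4/\delta)}{n}.\tag{14}$$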
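The last two steps of the growth-function chain, $2\binom{n}{\le d_0}^2\le 2(2n)^{2d_0}$, can be checked numerically. This is a small sanity-check sketch, not part of the proof; the helper names `sauer_bound` and `growth_bound_chain` are introduced here for illustration, and $\binom{n}{\le d}$ denotes $\sum_{i\le d}\binom{n}{i}$.

```python
from math import comb
from typing import Tuple


def sauer_bound(n: int, d: int) -> int:
    """Number of subsets of an n-element set of size at most d."""
    return sum(comb(n, i) for i in range(d + 1))


def growth_bound_chain(n: int, d0: int) -> Tuple[int, int]:
    """Return (2 * C(n, <=d0)^2, 2 * (2n)^(2 d0)), the last two terms of the chain."""
    return 2 * sauer_bound(n, d0) ** 2, 2 * (2 * n) ** (2 * d0)


for n, d0 in [(10, 2), (100, 3), (1000, 5)]:
    lhs, rhs = growth_bound_chain(n, d0)
    print(n, d0, lhs, rhs, lhs <= rhs)
```

For instance, with $n=10$ and $d_0=2$ the left side is $2\cdot 56^2=6272$ and the right side is $2\cdot 20^4=320000$.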

For the remainder of the proof we assume that the event in Equation 14 holds and argue that it implies the conclusion in Theorem 8. Let $A\in\mathcal{A}$, $B\in\mathcal{B}$, and let $k=\#_n(B)$ denote the number of data points in $B$. We want to show that

$$|P(A|B)-P_n(A|B)|\le\sqrt{\frac{k_o}{k}},\tag{15}$$

where $k_o$ is as in Theorem 8. Let $j$ denote the number of data points in $A\cap B$. We establish Equation 15 by showing that

$$P(A|B)\le P_n(A|B)+\sqrt{\frac{k_o}{k}}\qquad\text{and}\qquad P(A|B)\ge P_n(A|B)-\sqrt{\frac{k_o}{k}}.$$

In the following calculation it will be convenient to denote $D_n=\frac{d_0\log 8n+\log(4/\delta)}{n}$. By Equation 14 we get:

 P(A|B) =P(A∩B)P(B) ≤Pn(A∩B)+4√Pn(A∩B)Dn+8DnPn(B)−4√Pn(B)Dn−8Dn =Pn(A∩B)Pn(B)+4√Pn(A∩B)Pn(B)DnPn(B)+8DnPn(B)1−4√DnPn(B)−8DnPn(B)s=Pn(A|B)1+4√Dj+8Dj1−4√Dk−