# Nearest-Neighbor Sample Compression: Efficiency, Consistency, Infinite Dimensions

Aryeh Kontorovich Ben-Gurion University of the Negev    Sivan Sabato    Roi Weiss Weizmann Institute of Science
###### Abstract

We examine the Bayes-consistency of a recently proposed 1-nearest-neighbor-based multiclass learning algorithm. This algorithm is derived from sample compression bounds and enjoys the statistical advantages of tight, fully empirical generalization bounds, as well as the algorithmic advantages of a faster runtime and memory savings. We prove that this algorithm is strongly Bayes-consistent in metric spaces with finite doubling dimension; this is the first consistency result for an efficient nearest-neighbor sample compression scheme. Rather surprisingly, we discover that this algorithm continues to be Bayes-consistent even in a certain infinite-dimensional setting, in which the basic measure-theoretic conditions on which classic consistency proofs hinge are violated. This is all the more surprising, since it is known that $k$-NN is not Bayes-consistent in this setting. We pose several challenging open problems for future research.

## 1 Introduction

This paper deals with Nearest-Neighbor (NN) learning algorithms in metric spaces. Initiated by Fix and Hodges in 1951 [16], this seemingly naive learning paradigm remains competitive against more sophisticated methods [8, 45] and, in its celebrated $k$-NN version, has been placed on a solid theoretical foundation [11, 43, 13, 46].

Although the classic 1-NN is well known to be inconsistent in general, in recent years a series of papers has presented variations on the theme of a regularized 1-NN classifier, as an alternative to the Bayes-consistent $k$-NN. Gottlieb et al. [18] showed that approximate nearest neighbor search can act as a regularizer, actually improving generalization performance rather than just injecting noise. In a follow-up work, [26] showed that applying Structural Risk Minimization to (essentially) the margin-regularized data-dependent bound in [18] yields a strongly Bayes-consistent 1-NN classifier. A further development has seen margin-based regularization analyzed through the lens of sample compression: a near-optimal nearest neighbor condensing algorithm was presented [20] and later extended to cover semimetric spaces [21]; an activized version also appeared [25]. As detailed in [26], margin-regularized 1-NN methods enjoy a number of statistical and computational advantages over the traditional $k$-NN classifier. Salient among these are explicit data-dependent generalization bounds, and considerable runtime and memory savings. Sample compression affords additional advantages, in the form of tighter generalization bounds and increased efficiency in time and space.

In this work we study the Bayes-consistency of a compression-based 1-NN multiclass learning algorithm, in both finite-dimensional and infinite-dimensional metric spaces. The algorithm is essentially the passive component of the active learner proposed by Kontorovich, Sabato, and Urner in [25], and we refer to it in the sequel as KSU; for completeness, we present it here in full (Alg. 1). We show that in finite-dimensional metric spaces, KSU is both computationally efficient and Bayes-consistent. This is the first compression-based multiclass 1-NN algorithm proven to possess both of these properties. We further exhibit a surprising phenomenon in infinite-dimensional spaces, where we construct a distribution for which KSU is Bayes-consistent while $k$-NN is not.

#### Main results.

Our main contributions consist of analyzing the performance of KSU in finite and infinite dimensional settings, and comparing it to the classical $k$-NN learner. Our key findings are summarized below.

• In Theorem 2, we show that KSU is computationally efficient and strongly Bayes-consistent in metric spaces with a finite doubling dimension. This is the first (strong or otherwise) Bayes-consistency result for an efficient sample compression scheme for a multiclass (or even binary) 1-NN algorithm. (An efficient sample compression algorithm was given in [20] for the binary case, but no Bayes-consistency guarantee is known for it.) This result should be contrasted with the one in [26], where margin-based regularization was employed, but not compression; the proof techniques from [26] do not carry over to the compression-based scheme. Instead, novel arguments are required, as we discuss below. The new sample compression technique provides a Bayes-consistency proof for multiple (even countably many) labels; this is contrasted with the multiclass 1-NN algorithm in [27], which is not compression-based, and requires solving a minimum vertex cover problem, thereby imposing a 2-approximation factor whenever there are more than two labels.

• In Theorem 4, we make the surprising discovery that KSU continues to be Bayes-consistent in a certain infinite-dimensional setting, even though this setting violates the basic measure-theoretic conditions on which classic consistency proofs hinge, including Theorem 2. This is all the more surprising, since it is known that $k$-NN is not Bayes-consistent for this construction [9]. We are currently unaware of any separable metric probability space on which KSU fails to be Bayes-consistent; this is posed as an intriguing open problem. (Cérou and Guyader [9] gave a simple example of a nonseparable metric on which all known nearest-neighbor methods, including $k$-NN and KSU, obviously fail.)

Our results indicate that in finite dimensions, an efficient, compression-based, Bayes-consistent multiclass 1-NN algorithm exists, and hence can be offered as an alternative to $k$-NN, which is well known to be Bayes-consistent in finite dimensions [12, 40]. In contrast, in infinite dimensions, our results show that the condition characterizing the Bayes-consistency of $k$-NN does not extend to all NN algorithms. It is an open problem to characterize the necessary and sufficient conditions for the existence of a Bayes-consistent NN-based algorithm in infinite dimensions.

#### Related work.

Following the pioneering work of [11] on nearest-neighbor classification, it was shown by [13, 46, 14] that the $k$-NN classifier is strongly Bayes-consistent in $\mathbb{R}^d$. These results made extensive use of the Euclidean structure of $\mathbb{R}^d$, but in [40] a weak Bayes-consistency result was shown for metric spaces with a bounded diameter and a bounded doubling dimension, under additional distributional smoothness assumptions. More recently, some of the classic results on $k$-NN risk decay rates were refined by [10] in an analysis that captures the interplay between the metric and the sampling distribution. The worst-case rates have an exponential dependence on the dimension (i.e., the so-called curse of dimensionality), and Pestov [32, 33] examines this phenomenon closely under various distributional and structural assumptions.

Consistency of NN-type algorithms in more general (and in particular infinite-dimensional) metric spaces was discussed in [1, 5, 6, 9, 29]. In [1, 9], characterizations of Bayes-consistency were given in terms of Besicovitch-type conditions (see Eq. (3)). In [1], a generalized “moving window” classification rule is used and additional regularity conditions on the regression function are imposed. The filtering technique (i.e., taking the first $d$ coordinates in some basis representation) was shown to be universally consistent in [5]. However, that algorithm suffers from the cost of cross-validating over both the dimension $d$ and the number of neighbors $k$. Also, the technique is only applicable in Hilbert spaces (as opposed to more general metric spaces) and provides only asymptotic consistency, without finite-sample bounds such as those provided by KSU. The insight of [5] is extended to the more general Banach spaces in [6] under various regularity assumptions.

None of the aforementioned generalization results for NN-based techniques are in the form of fully empirical, explicitly computable sample-dependent error bounds. Rather, they are stated in terms of the unknown Bayes-optimal rate, and some involve additional parameters quantifying the well-behavedness of the unknown distribution (see [26] for a detailed discussion). As such, these guarantees do not enable a practitioner to compute a numerical generalization error estimate for a given training sample, much less allow for a data-dependent selection of $k$, which must be tuned via cross-validation. The asymptotic expansions in [42, 36, 23, 39] likewise do not provide a computable finite-sample bound. The quest for such bounds was a key motivation behind the series of works [18, 27, 20], of which KSU [25] is the latest development.

The work of Devroye et al. [14, Theorem 21.2] has implications for 1-NN classifiers in $\mathbb{R}^d$ that are defined based on data-dependent majority-vote partitions of the space. It is shown that under some conditions, a fixed mapping from each sample size to a data-dependent partition rule induces a strongly Bayes-consistent algorithm. This result requires the partition rule to have a bounded VC dimension, and since this rule must be fixed in advance, the algorithm is not fully adaptive. Theorem 19.3 ibid. proves weak consistency for an inefficient compression-based algorithm, which selects among all the possible compression sets of a certain size, and maintains a certain rate of compression relative to the sample size. The generalizing power of sample compression was independently discovered by [30], and later elaborated upon by [22]. In the context of NN classification, [14] lists various condensing heuristics (which have no known performance guarantees) and leaves open the algorithmic question of how to minimize the empirical risk over all subsets of a given size.

The first compression-based 1-NN algorithm with provable optimality guarantees was given in [20]; it was based on constructing $\gamma$-nets in spaces with a finite doubling dimension. The compression size of this construction was shown to be nearly unimprovable by an efficient algorithm unless P=NP. With $\gamma$-nets as its algorithmic engine, KSU inherits this near-optimality. The compression-based 1-NN paradigm was later extended to semimetrics in [21], where it was shown to survive violations of the triangle inequality, while the hierarchy-based search methods that have become standard for metric spaces (such as [4, 18] and related approaches) all break down.

It was shown in [26] that a margin-regularized 1-NN learner (essentially, the one proposed in [18], which, unlike [20], did not involve sample compression) becomes strongly Bayes-consistent when the margin is chosen optimally in an explicitly prescribed sample-dependent fashion. The margin-based technique developed in [18] for the binary case was extended to multiclass in [27]. Since the algorithm relied on computing a minimum vertex cover, it was not possible to make it both computationally efficient and Bayes-consistent when the number of labels exceeds two. An additional improvement over [27] is that the generalization bounds presented there had an explicit (logarithmic) dependence on the number of labels, while our compression scheme extends seamlessly to countable label spaces.

#### Paper outline.

After fixing the notation and setup in Sec. 2, in Sec. 3 we present KSU, the compression-based 1-NN algorithm we analyze in this work. Sec. 4 discusses our main contributions regarding KSU, together with some open problems. High-level proof sketches are given in Sec. 5 for the finite-dimensional case, and Sec. 6 for the infinite-dimensional case. Full detailed proofs are found in the appendices.

## 2 Setting and Notation

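Our instance space is the metric space $(\mathcal{X}, \rho)$, where $\mathcal{X}$ is the instance domain and $\rho$ is the metric. (See Appendix A for relevant background on metric measure spaces.) We consider a countable label space $\mathcal{Y}$. The unknown sampling distribution is a probability measure $\bar{\mu}$ over $\mathcal{X} \times \mathcal{Y}$, with marginal $\mu$ over $\mathcal{X}$. Denote by $(X, Y) \sim \bar{\mu}$ a pair drawn according to $\bar{\mu}$. The generalization error of a classifier $h : \mathcal{X} \to \mathcal{Y}$ is given by $\mathrm{err}_{\bar{\mu}}(h) := \mathbb{P}_{\bar{\mu}}(Y \neq h(X))$, and its empirical error with respect to a labeled set $S' \subseteq \mathcal{X} \times \mathcal{Y}$ is given by $\widehat{\mathrm{err}}(h, S') := \frac{1}{|S'|} \sum_{(x,y) \in S'} \mathbb{1}[y \neq h(x)]$. The optimal Bayes risk of $\bar{\mu}$ is $R^*_{\bar{\mu}} := \inf_h \mathrm{err}_{\bar{\mu}}(h)$, where the infimum is taken over all measurable classifiers $h : \mathcal{X} \to \mathcal{Y}$. We say that $\bar{\mu}$ is realizable when $R^*_{\bar{\mu}} = 0$. We omit the overline in $\bar{\mu}$ in the sequel when there is no ambiguity.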

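For a finite labeled set $S \subseteq \mathcal{X} \times \mathcal{Y}$ and any $x \in \mathcal{X}$, let $X_{\mathrm{nn}}(x, S)$ be the nearest neighbor of $x$ with respect to $S$ and let $Y_{\mathrm{nn}}(x, S)$ be the nearest neighbor label of $x$ with respect to $S$: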

$$(X_{\mathrm{nn}}(x,S),\, Y_{\mathrm{nn}}(x,S)) := \operatorname*{argmin}_{(x',y') \in S} \rho(x, x'),$$

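where ties are broken arbitrarily. The 1-NN classifier induced by $S$ is denoted by $h_S(x) := Y_{\mathrm{nn}}(x, S)$. The set of points in $S$, denoted by $\mathbf{X} = \{X_1, \ldots, X_{|S|}\} \subseteq \mathcal{X}$, induces a Voronoi partition of $\mathcal{X}$, $\mathcal{V}(\mathbf{X}) := \{V_1(\mathbf{X}), \ldots, V_{|S|}(\mathbf{X})\}$, where each Voronoi cell is $V_i(\mathbf{X}) := \{x \in \mathcal{X} : \operatorname*{argmin}_{j} \rho(x, X_j) = i\}$. By definition, $\forall x \in V_i(\mathbf{X})$, $h_S(x) = Y_i$.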
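The definitions above can be made concrete with a minimal Python sketch. The function names (`nearest_neighbor`, `one_nn_classifier`, `empirical_error`) and the default Euclidean metric are assumptions chosen for illustration; ties are broken by taking the first minimizer, which is one admissible way of breaking them arbitrarily.

```python
import math


def euclidean(x, y):
    """Default metric rho, assumed here for illustration only."""
    return math.dist(x, y)


def nearest_neighbor(x, S, rho=euclidean):
    """Return the pair (X_nn(x, S), Y_nn(x, S)); ties go to the first minimizer."""
    return min(S, key=lambda pair: rho(x, pair[0]))


def one_nn_classifier(S, rho=euclidean):
    """Return the 1-NN classifier h_S induced by the labeled set S."""
    def h(x):
        return nearest_neighbor(x, S, rho)[1]
    return h


def empirical_error(h, S):
    """Fraction of labeled points in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)
```

Each point of `S` implicitly owns the Voronoi cell of inputs for which it is the returned minimizer, so `h` is constant on every cell.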

A 1-NN algorithm is a mapping from an i.i.d. labeled sample $S_n \sim \bar{\mu}^n$ to a labeled set $S'_n \subseteq \mathcal{X} \times \mathcal{Y}$, yielding the 1-NN classifier $h_{S'_n}$. While the classic 1-NN algorithm sets $S'_n := S_n$, in this work we study a compression-based algorithm which sets $S'_n$ adaptively, as discussed further below.

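A 1-NN algorithm is strongly Bayes-consistent on $\bar{\mu}$ if $\mathrm{err}(h_{S'_n})$ converges to $R^*$ almost surely, that is $\mathbb{P}[\lim_{n \to \infty} \mathrm{err}(h_{S'_n}) = R^*] = 1$. An algorithm is weakly Bayes-consistent on $\bar{\mu}$ if $\mathrm{err}(h_{S'_n})$ converges to $R^*$ in expectation, $\lim_{n \to \infty} \mathbb{E}[\mathrm{err}(h_{S'_n})] = R^*$. Obviously, the former implies the latter. We say that an algorithm is Bayes-consistent on a metric space if it is Bayes-consistent on all distributions in the metric space.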

A convenient property that is used when studying the Bayes-consistency of algorithms in metric spaces is the doubling dimension. Denote the open ball of radius $r$ around $x$ by $B_r(x) := \{x' \in \mathcal{X} : \rho(x, x') < r\}$ and let $\bar{B}_r(x)$ denote the corresponding closed ball. The doubling dimension of a metric space $(\mathcal{X}, \rho)$ is defined as follows. Let $n$ be the smallest number such that every ball in $\mathcal{X}$ can be covered by $n$ balls of half its radius, where all balls are centered at points of $\mathcal{X}$. Formally,

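$$n := \min\Big\{n \in \mathbb{N} : \forall x \in \mathcal{X},\ r > 0,\ \exists x_1, \ldots, x_n \in \mathcal{X} \text{ s.t. } B_r(x) \subseteq \bigcup_{i=1}^{n} B_{r/2}(x_i)\Big\}.$$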

Then the doubling dimension of $\mathcal{X}$ is defined by $\mathrm{ddim}(\mathcal{X}) := \log_2 n$.

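For an integer $n$, let $[n] := \{1, \ldots, n\}$. Denote the set of all index vectors of length $d$ by $I_{n,d} := [n]^d$. Given a labeled set $S_n = (X_i, Y_i)_{i \in [n]}$ and any $\mathbf{i} = \{i_1, \ldots, i_d\} \in I_{n,d}$, denote the sub-sample of $S_n$ indexed by $\mathbf{i}$ by $S_n(\mathbf{i}) := \{(X_{i_1}, Y_{i_1}), \ldots, (X_{i_d}, Y_{i_d})\}$. Similarly, for a vector $\mathbf{Y}' = \{Y'_1, \ldots, Y'_d\} \in \mathcal{Y}^d$, denote by $S_n(\mathbf{i}, \mathbf{Y}') := \{(X_{i_1}, Y'_1), \ldots, (X_{i_d}, Y'_d)\}$, namely the sub-sample of $S_n$ as determined by $\mathbf{i}$ where the labels are replaced with $\mathbf{Y}'$. Lastly, for , we denote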

## 3 1-NN majority-based compression

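In this work we consider the 1-NN majority-based compression algorithm proposed in [25], which we refer to as KSU. This algorithm is based on constructing $\gamma$-nets at different scales; for $\gamma > 0$ and $A \subseteq \mathcal{X}$, a set $\mathbf{X} \subseteq A$ is said to be a $\gamma$-net of $A$ if (i) every $a \in A$ has some $x \in \mathbf{X}$ within distance $\gamma$ (covering) and (ii) all distinct $x, x' \in \mathbf{X}$ are separated by distance $\gamma$ (packing). (For technical reasons, having to do with the construction in Sec. 6, we depart slightly from the standard definition of a $\gamma$-net $\mathbf{X} \subseteq A$: relative to the classic definition, the inequality relations in (i) and (ii) are exchanged for their strict or non-strict counterparts.)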
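A $\gamma$-net of a finite point set can be built greedily in a single pass. The sketch below is an illustration only, not the construction of [20]: the function name `greedy_gamma_net` is hypothetical, and the strictness convention (centers pairwise at distance at least $\gamma$, every point at distance strictly less than $\gamma$ from some center) is an assumption that may differ from the convention adopted in this paper.

```python
def greedy_gamma_net(points, gamma, rho):
    """Return the indices of a greedily built gamma-net of `points`.

    Assumed convention: a point becomes a center iff it is at distance
    >= gamma from every center chosen so far. Hence centers are pairwise
    >= gamma apart (packing) and every skipped point is at distance
    < gamma from some center (covering).
    """
    net = []
    for i, p in enumerate(points):
        if all(rho(p, points[j]) >= gamma for j in net):
            net.append(i)
    return net


# Usage: five points on the real line at scale gamma = 1.
line_points = [0.0, 0.4, 1.0, 1.3, 2.5]
line_net = greedy_gamma_net(line_points, 1.0, lambda x, y: abs(x - y))
```

The pass costs at most one distance evaluation per (point, center) pair, so it is quadratic in the number of points in the worst case.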

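The algorithm (see Alg. 1) operates as follows. Given an input sample $S_n$, whose set of points is denoted $\mathbf{X}_n = \{X_1, \ldots, X_n\}$, KSU considers all possible scales $\gamma > 0$. For each such scale it constructs a $\gamma$-net of $\mathbf{X}_n$. Denote this $\gamma$-net by $\mathbf{X}(\gamma) := \{X_{i_1}, \ldots, X_{i_m}\}$, where $m \equiv m(\gamma)$ denotes its size and $\mathbf{i} \equiv \mathbf{i}(\gamma) := \{i_1, \ldots, i_m\} \in I_{n,m}$ denotes the indices selected from $S_n$ for this $\gamma$-net. For every such $\gamma$-net, the algorithm attaches the labels $\mathbf{Y}' \equiv \mathbf{Y}'(\gamma) \in \mathcal{Y}^m$, which are the empirical majority-vote labels in the respective Voronoi cells in the partition $\mathcal{V}(\mathbf{X}(\gamma)) = \{V_1, \ldots, V_m\}$. Formally, for $i \in [m]$,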

$$Y'_i \in \operatorname*{argmax}_{y \in \mathcal{Y}} \big|\{j \in [n] \mid X_j \in V_i,\ Y_j = y\}\big|, \tag{1}$$

where ties are broken arbitrarily. This procedure creates a labeled set $S'_n(\gamma) := S_n(\mathbf{i}(\gamma), \mathbf{Y}'(\gamma))$ for every relevant $\gamma$. The algorithm then selects a single $\gamma$, denoted $\gamma^* \equiv \gamma^*_n$, and outputs $h_{S'_n(\gamma^*)}$. The scale $\gamma^*$ is selected so as to minimize a generalization error bound, which upper bounds $\mathrm{err}(h_{S'_n(\gamma)})$ with high probability. This error bound, denoted $Q$ in the algorithm, can be derived using a compression-based analysis, as described below.

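We say that a mapping $S_n \mapsto S'_n$ is a compression scheme if there is a function $\mathcal{C}$, from sub-samples to subsets of $\mathcal{X} \times \mathcal{Y}$, such that for every $S_n$ there exists an $m$ and a sequence $\mathbf{i} \in I_{n,m}$ such that $S'_n = \mathcal{C}(S_n(\mathbf{i}))$. Given a compression scheme $S_n \mapsto S'_n$ and a matching function $\mathcal{C}$, we say that a specific $S'_n$ is an $(\alpha, m)$-compression of a given $S_n$ if $S'_n = \mathcal{C}(S_n(\mathbf{i}))$ for some $\mathbf{i} \in I_{n,m}$ and $\widehat{\mathrm{err}}(h_{S'_n}, S_n) \le \alpha$. The generalization power of compression was recognized by [17] and [22]. Specifically, it was shown in [21, Theorem 8] that if the mapping $S_n \mapsto S'_n$ is a compression scheme, then with probability at least $1 - \delta$, for any $S'_n$ which is an $(\alpha, m)$-compression of $S_n \sim \bar{\mu}^n$, we have (omitting the constants, explicitly provided therein, which do not affect our analysis)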

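$$\mathrm{err}(h_{S'_n}) \le \frac{n}{n-m}\,\alpha + O\!\left(\frac{m \log(n) + \log(1/\delta)}{n-m}\right) + O\!\left(\sqrt{\frac{\frac{nm}{n-m}\,\alpha \log(n) + \log(1/\delta)}{n-m}}\right). \tag{2}$$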

Defining $Q(n, \alpha, m, \delta)$ as the RHS of Eq. (2) provides KSU with a compression bound. The following proposition shows that KSU is a compression scheme, which enables us to use Eq. (2) with the appropriate substitution. (In [25] the analysis was based on compression with side information, and does not extend to infinite $\mathcal{Y}$.)
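The procedure described in this section can be sketched end to end in Python. This is a hedged illustration rather than Alg. 1 itself: all function names are hypothetical, the candidate scales are taken to be the nonzero pairwise distances of the sample, the net is built greedily, and `compression_bound` evaluates the right-hand side of Eq. (2) with every $O(\cdot)$ constant set to 1, which is an assumption (the actual constants are given in [21]). The bound is evaluated at compression size $2m$, matching the abbreviation $Q_n(\alpha, m) := Q(n, \alpha, 2m, \delta_n)$ used in Sec. 5.

```python
import math
from collections import Counter
from itertools import combinations


def compression_bound(n, alpha, m, delta):
    """RHS of Eq. (2) with all O(.) constants set to 1 (an assumption)."""
    if m >= n:
        return math.inf  # Eq. (2) is only meaningful when n - m > 0
    r = n - m
    linear = (m * math.log(n) + math.log(1 / delta)) / r
    root = math.sqrt(((n * m / r) * alpha * math.log(n) + math.log(1 / delta)) / r)
    return (n / r) * alpha + linear + root


def greedy_net(X, gamma, rho):
    """Indices of a greedy gamma-net (centers pairwise >= gamma apart)."""
    net = []
    for i, x in enumerate(X):
        if all(rho(x, X[j]) >= gamma for j in net):
            net.append(i)
    return net


def one_nn(S, rho):
    """1-NN classifier induced by the labeled set S."""
    return lambda x: min(S, key=lambda pair: rho(x, pair[0]))[1]


def majority_labels(X, Y, net, rho):
    """Empirical majority-vote label of each Voronoi cell, as in Eq. (1)."""
    votes = [Counter() for _ in net]
    for x, y in zip(X, Y):
        cell = min(range(len(net)), key=lambda c: rho(x, X[net[c]]))
        votes[cell][y] += 1
    return [v.most_common(1)[0][0] for v in votes]


def ksu(X, Y, rho, delta, Q=compression_bound):
    """Return (bound, gamma, S_prime) minimizing Q over the candidate scales."""
    n = len(X)
    scales = sorted({rho(X[i], X[j]) for i, j in combinations(range(n), 2)} - {0})
    if not scales:
        raise ValueError("need at least two distinct points")
    best = None
    for gamma in scales:
        net = greedy_net(X, gamma, rho)
        S_prime = [(X[i], y) for i, y in zip(net, majority_labels(X, Y, net, rho))]
        h = one_nn(S_prime, rho)
        alpha = sum(h(x) != y for x, y in zip(X, Y)) / n  # empirical error
        bound = Q(n, alpha, 2 * len(net), delta)  # compression size 2m
        if best is None or bound < best[0]:
            best = (bound, gamma, S_prime)
    return best
```

On two well-separated, cleanly labeled clusters, the sketch keeps a single representative per cluster, since any finer net pays a larger size penalty without reducing the empirical error.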

###### Proposition 1.

The mapping $S_n \mapsto S'_n$ defined by Alg. 1 is a compression scheme whose output $S'_n$ is a $(\widehat{\mathrm{err}}(h_{S'_n}, S_n),\, 2|S'_n|)$-compression of $S_n$.

###### Proof.

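Define the function $\mathcal{C}$ so that, given a sub-sample of $2m$ labeled points, it outputs the $m$ labeled points obtained by pairing the instance of the $i$-th point with the label of the $(m+i)$-th point, for $i \in [m]$. Observe that for all $S_n$, the labeled set $S'_n(\gamma)$ is the image under $\mathcal{C}$ of the sub-sample of $S_n$ indexed by $\mathbf{i}(\gamma)$ followed by $\mathbf{j}(\gamma)$, where $\mathbf{i}(\gamma)$ is the $\gamma$-net index set as defined above, and $\mathbf{j}(\gamma)$ is some index vector such that $Y'_i = Y_{j_i}$ for every $i \in [m(\gamma)]$. Since $Y'_i$ is an empirical majority vote, clearly such a $\mathbf{j}(\gamma)$ exists. Under this scheme, the output of this algorithm is a $(\widehat{\mathrm{err}}(h_{S'_n}, S_n),\, 2|S'_n|)$-compression. ∎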

KSU is efficient, for any countable $\mathcal{Y}$. Indeed, Alg. 1 has a naive runtime complexity of $O(n^4)$, since $O(n^2)$ values of $\gamma$ are considered and a $\gamma$-net is constructed for each one in time $O(n^2)$ (see [20, Algorithm 1]). Improved runtimes can be obtained, e.g., using the methods in [28, 18]. In this work we focus on the Bayes-consistency of KSU, rather than on optimizing its computational complexity. Our Bayes-consistency results below hold for KSU, whenever the generalization bound $Q(n, \alpha, m, \delta)$ satisfies the following properties:

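1. For any integer $n$ and $\delta \in (0,1)$, with probability $1 - \delta$ over the i.i.d. random sample $S_n \sim \bar{\mu}^n$, for all $\alpha \in [0,1]$ and $m \in [n]$: If $S'_n$ is an $(\alpha, m)$-compression of $S_n$, then $\mathrm{err}(h_{S'_n}) \le Q(n, \alpha, m, \delta)$.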

2. $Q(n, \alpha, m, \delta)$ is monotonically increasing in $\alpha$ and in $m$.

3. There is a sequence $\{\delta_n\}_{n=1}^{\infty}$, $\delta_n \in (0,1)$, such that $\sum_{n=1}^{\infty} \delta_n < \infty$ and for all $m$,

$$\lim_{n \to \infty} \sup_{\alpha \in [0,1]} \big(Q(n, \alpha, m, \delta_n) - \alpha\big) = 0.$$

The compression bound in Eq. (2) clearly satisfies these properties. Note that Property 3 is satisfied by Eq. (2) using any convergent series $\sum_{n=1}^{\infty} \delta_n < \infty$ such that $\delta_n = e^{-o(n)}$; in particular, the decay of $\delta_n$ cannot be too rapid.
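The claim about Property 3 can be checked directly from Eq. (2). The short derivation below is a sketch with the $O(\cdot)$ constants suppressed as in Eq. (2); the sequences $\delta_n = n^{-2}$ and $\delta_n = e^{-n}$ are examples chosen here for illustration.

```latex
% Fix m. Since Q is the right-hand side of Eq. (2) and 0 <= alpha <= 1:
\begin{align*}
Q(n,\alpha,m,\delta_n) - \alpha
  &= \frac{m}{n-m}\,\alpha
   + O\!\left(\frac{m\log n + \log(1/\delta_n)}{n-m}\right)
   + O\!\left(\sqrt{\frac{\frac{nm}{n-m}\,\alpha\log n + \log(1/\delta_n)}{n-m}}\right) \\
  &\le \frac{m}{n-m}
   + O\!\left(\frac{m\log n + \log(1/\delta_n)}{n-m}\right)
   + O\!\left(\sqrt{\frac{\frac{nm}{n-m}\log n + \log(1/\delta_n)}{n-m}}\right).
\end{align*}
% The left-hand side is nonnegative, and for fixed m the upper bound tends to 0
% as n -> infinity exactly when \log(1/\delta_n) = o(n).
% Admissible example: \delta_n = n^{-2}, since \sum_n n^{-2} < \infty and \log(1/\delta_n) = 2\log n = o(n).
% Too rapid a decay: \delta_n = e^{-n} is summable, but \log(1/\delta_n)/(n-m) \to 1.
```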

## 4 Main results

In this section we describe our main results. The proofs appear in subsequent sections. First, we show that KSU is Bayes-consistent if the instance space has a finite doubling dimension. This contrasts with classical 1-NN, which is only Bayes-consistent if the distribution is realizable.

###### Theorem 2.

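Let $(\mathcal{X}, \rho)$ be a metric space with a finite doubling dimension. Let $Q$ be a generalization bound that satisfies Properties 1-3, and let $\delta_n$ be as stipulated by Property 3 for $Q$. If the input confidence $\delta$ for input size $n$ is set to $\delta_n$, then the 1-NN classifier $h_{S'_n(\gamma^*_n)}$ calculated by KSU is strongly Bayes-consistent on $(\mathcal{X}, \rho)$: $\mathbb{P}[\lim_{n \to \infty} \mathrm{err}(h_{S'_n(\gamma^*_n)}) = R^*] = 1$.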

The proof, provided in Sec. 5, closely follows the line of reasoning in [26], where the strong Bayes-consistency of an adaptive margin-regularized 1-NN algorithm was proved, but with several crucial differences. In particular, the generalization bounds used by KSU are purely compression-based, as opposed to the Rademacher-based generalization bounds used in [26]. The former can be much tighter in practice and guarantee Bayes-consistency of KSU even for countably many labels. This however requires novel technical arguments, which are discussed in detail in Appendix B.1. Moreover, since the compression-based bounds do not explicitly depend on $\mathrm{ddim}$, they can be used even when $\mathrm{ddim}$ is infinite, as we do in Theorem 4 below. To underscore the subtle nature of Bayes-consistency, we note that the proof technique given here does not carry over to an earlier algorithm, suggested in [20, Theorem 4], which also uses $\gamma$-nets. It is an open question whether the latter is Bayes-consistent.

Next, we study Bayes-consistency of KSU in infinite dimensions (i.e., with $\mathrm{ddim} = \infty$), in particular in a setting where $k$-NN was shown by [9] not to be Bayes-consistent. Indeed, a straightforward application of [9, Lemma A.1] yields the following result.

###### Theorem 3 (Cérou and Guyader [9]).

There exists an infinite-dimensional separable metric space and a realizable distribution $\bar{\mu}$ over $\mathcal{X} \times \{0, 1\}$ such that no $k_n$-NN learner satisfying $k_n / n \to 0$ when $n \to \infty$ is Bayes-consistent under $\bar{\mu}$. In particular, this holds for any space and realizable distribution $\bar{\mu}$ that satisfy the following condition: The set $C$ of points labeled $1$ by $\bar{\mu}$ satisfies

$$\mu(C) > 0 \quad \text{and} \quad \forall x \in C,\ \lim_{r \to 0} \frac{\mu(C \cap \bar{B}_r(x))}{\mu(\bar{B}_r(x))} = 0. \tag{3}$$

Since $\mu(C) > 0$, Eq. (3) constitutes a violation of the Besicovitch covering property. In doubling spaces, the Besicovitch covering theorem precludes such a violation [15]. In contrast, as [34, 35] show, in infinite-dimensional spaces this violation can in fact occur. Moreover, this is not an isolated pathology, as this property is shared by Gaussian Hilbert spaces [44].

At first sight, Eq. (3) might appear to thwart any 1-NN algorithm applied to such a distribution. However, the following result shows that this is not the case: KSU is Bayes-consistent on a distribution with this property.

###### Theorem 4.

There is a metric space equipped with a realizable distribution for which KSU is weakly Bayes-consistent, while any $k$-NN classifier necessarily is not.

The proof relies on a classic construction of Preiss [34] which satisfies Eq. (3). We show that the structure of the construction, combined with the packing and covering properties of $\gamma$-nets, implies that the majority-vote classifier induced by any $\gamma$-net with a sufficiently small $\gamma$ approaches the Bayes error. To contrast with Theorem 4, we next show that on the same construction, not all majority-vote Voronoi partitions succeed. Indeed, if the packing property of $\gamma$-nets is relaxed, partition sequences obstructing Bayes-consistency exist.

###### Theorem 5.

For the example constructed in Theorem 4, there exists a sequence of Voronoi partitions with a vanishing diameter such that the induced true majority-vote classifiers are not Bayes consistent.

The above result also stands in contrast to [14, Theorem 21.2], showing that, unlike in finite dimensions, the partitions’ vanishing diameter is insufficient to establish consistency when $\mathrm{ddim} = \infty$. We conclude the main results by posing intriguing open problems.

#### Open problem 1.

Does there exist a metric probability space on which some $k$-NN algorithm is consistent while KSU is not? Does there exist any separable metric space on which KSU fails?

#### Open problem 2.

Cérou and Guyader [9] distill a certain Besicovitch condition which is necessary and sufficient for $k$-NN to be Bayes-consistent in a metric space. Our Theorem 4 shows that the Besicovitch condition is not necessary for KSU to be Bayes-consistent. Is it sufficient? What is a necessary condition?

## 5 Bayes-consistency of KSU in finite dimensions

In this section we give a high-level proof of Theorem 2, showing that KSU is strongly Bayes-consistent in finite-dimensional metric spaces. A fully detailed proof is given in Appendix B.

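Recall the optimal empirical error $\alpha^*_n \equiv \alpha(\gamma^*_n)$ and the optimal compression size $m^*_n \equiv m(\gamma^*_n)$ as computed by KSU. As shown in Proposition 1, the sub-sample $S'_n(\gamma^*_n)$ is an $(\alpha^*_n, 2m^*_n)$-compression of $S_n$. Abbreviate the compression-based generalization bound used in KSU by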

$$Q_n(\alpha, m) := Q(n, \alpha, 2m, \delta_n).$$

To show Bayes-consistency, we start with a standard decomposition of the excess error over the optimal Bayes risk into two terms:

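$$\mathrm{err}(h_{S'_n(\gamma^*_n)}) - R^* = \big(\mathrm{err}(h_{S'_n(\gamma^*_n)}) - Q_n(\alpha^*_n, m^*_n)\big) + \big(Q_n(\alpha^*_n, m^*_n) - R^*\big) =: T_I(n) + T_{II}(n),$$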

and show that each term decays to zero with probability one. For the first term, Property 1 for $Q$, together with the Borel-Cantelli lemma, readily implies $\limsup_{n \to \infty} T_I(n) \le 0$ with probability one. The main challenge is showing that $\limsup_{n \to \infty} T_{II}(n) \le 0$ with probability one. We do so in several stages:

1. Loosely speaking, we first show (Lemma 10) that the Bayes error $R^*$ can be well approximated using 1-NN classifiers defined by the true (as opposed to empirical) majority-vote labels over fine partitions of $\mathcal{X}$. In particular, this holds for any partition induced by a $\gamma$-net of $\mathcal{X}$ with a sufficiently small $\gamma > 0$. This approximation guarantee relies on the fact that in finite-dimensional spaces, the class of continuous functions with compact support is dense in $L^1(\mu)$ (Lemma 9).

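2. Fix $\tilde{\gamma} > 0$ sufficiently small such that any true majority-vote classifier induced by a $\tilde{\gamma}$-net has a true error close to $R^*$, as guaranteed by stage 1. Since for bounded subsets of finite-dimensional spaces the size of any $\tilde{\gamma}$-net is finite, the empirical error of any majority-vote $\tilde{\gamma}$-net almost surely converges to its true majority-vote error as the sample size $n \to \infty$. Let $n(\tilde{\gamma})$ be sufficiently large such that $Q_n(\alpha(\tilde{\gamma}), m(\tilde{\gamma}))$ as computed by KSU for a sample of size $n \ge n(\tilde{\gamma})$ is a reliable estimate for the true error of $h_{S'_n(\tilde{\gamma})}$.

3. Let $\tilde{\gamma}$ and $n(\tilde{\gamma})$ be as in stage 2. Given a sample of size $n \ge n(\tilde{\gamma})$, recall that KSU selects an optimal $\gamma^*$ such that $Q_n(\alpha(\gamma), m(\gamma))$ is minimized over all $\gamma > 0$. For margins $\gamma$ much smaller than $\tilde{\gamma}$, which are prone to over-fitting, $Q_n(\alpha(\gamma), m(\gamma))$ is not a reliable estimate for $\mathrm{err}(h_{S'_n(\gamma)})$ since compression may not yet have taken place for samples of size $n$. Nevertheless, these margins are discarded by KSU due to the penalty term in $Q$. On the other hand, for $\gamma$-nets with margin $\gamma$ much larger than $\tilde{\gamma}$, which are prone to under-fitting, the true error is well estimated by $Q_n(\alpha(\gamma), m(\gamma))$. It follows that KSU selects $\gamma^*_n \approx \tilde{\gamma}$ and $Q_n(\alpha^*_n, m^*_n) \approx R^*$, implying $\limsup_{n \to \infty} T_{II}(n) \le 0$ with probability one.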
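The Borel-Cantelli argument for the first term $T_I(n)$ can be spelled out as follows. This is a sketch of the standard step, written in the notation of the decomposition above and using the confidence sequence $\delta_n$ of Property 3.

```latex
\begin{align*}
\Pr\big[T_I(n) > 0\big]
  &= \Pr\Big[\mathrm{err}\big(h_{S'_n(\gamma^*_n)}\big) > Q_n(\alpha^*_n, m^*_n)\Big] \le \delta_n
  && \text{(Property 1 and Proposition 1)} \\
\sum_{n=1}^{\infty} \Pr\big[T_I(n) > 0\big]
  &\le \sum_{n=1}^{\infty} \delta_n < \infty
  && \text{(Property 3)} \\
\Pr\big[T_I(n) > 0 \text{ for infinitely many } n\big] &= 0
  && \text{(Borel-Cantelli)} \\
\limsup_{n\to\infty} T_I(n) &\le 0 \quad \text{almost surely.}
\end{align*}
```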

As one can see, the assumption that $\mathcal{X}$ is finite-dimensional plays a major role in the proof. A simple argument shows that the family of continuous functions with compact support is no longer dense in $L^1(\mu)$ in infinite-dimensional spaces. In addition, $\gamma$-nets of bounded subsets in infinite-dimensional spaces need no longer be finite.

## 6 On Bayes-consistency of NN algorithms in infinite dimensions

In this section we study the Bayes-consistency properties of 1-NN algorithms on a classic infinite-dimensional construction of Preiss [34], which we describe below in detail. This construction was first introduced as a concrete example showing that in infinite-dimensional spaces the Besicovitch covering theorem [15] can be strongly violated, as manifested in Eq. (3).

###### Example 1 (Preiss’s construction).

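The construction (see Figure 1) defines an infinite-dimensional metric space $(\mathcal{X}, \rho)$ and a realizable measure $\bar{\mu}$ over $\mathcal{X} \times \mathcal{Y}$ with the binary label set $\mathcal{Y} = \{0, 1\}$. It relies on two sequences: a sequence of natural numbers $(N_k)_{k \in \mathbb{N}}$ and a sequence of positive numbers $(a_k)_{k \in \mathbb{N}}$. The two sequences should satisfy the following: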

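$$\sum_{k=1}^{\infty} a_k N_1 \cdots N_k = 1; \qquad \lim_{k \to \infty} a_k N_1 \cdots N_{k+1} = \infty; \qquad \text{and} \qquad \lim_{k \to \infty} N_k = \infty. \tag{4}$$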

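These properties are satisfied, for instance, by setting $N_k := k!$ and $a_k := 2^{-k} / \prod_{i \in [k]} N_i$. Let $Z_0$ be the set of all finite sequences $(z_1, \ldots, z_k)$, $k \in \mathbb{N}$, of natural numbers such that $z_i \le N_i$, and let $Z_\infty$ be the set of all infinite sequences $(z_1, z_2, \ldots)$ of natural numbers such that $z_i \le N_i$.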
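The three conditions in Eq. (4) can be verified numerically for a concrete pair of sequences. The sketch below assumes the choice $N_k = k!$ and $a_k = 2^{-k} / (N_1 \cdots N_k)$, used here as an illustrative instance; the helper names are hypothetical, and exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from math import factorial


def N(k):
    """Assumed instance: N_k = k!."""
    return factorial(k)


def prod_N(k):
    """The product N_1 * N_2 * ... * N_k."""
    out = 1
    for i in range(1, k + 1):
        out *= N(i)
    return out


def a(k):
    """Assumed instance: a_k = 2^{-k} / (N_1 * ... * N_k)."""
    return Fraction(1, 2 ** k * prod_N(k))


def partial_sum(K):
    """Partial sum of the first condition; equals 1 - 2^{-K} for this instance."""
    return sum(a(k) * prod_N(k) for k in range(1, K + 1))


def second_term(k):
    """a_k * N_1 * ... * N_{k+1}; equals (k+1)! / 2^k for this instance."""
    return a(k) * prod_N(k + 1)
```

For this instance the partial sums approach 1 geometrically, while `second_term(k)` grows by a factor of $(k+2)/2$ at each step and therefore diverges, as the second condition requires.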

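Define the example space $\mathcal{X} := Z_0 \cup Z_\infty$ and denote $\gamma_k := 2^{-k}$, where $\gamma_\infty := 0$. The metric $\rho$ over $\mathcal{X}$ is defined as follows: for $x, y \in \mathcal{X}$, denote by $x \wedge y$ their longest common prefix. Then,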

$$\rho(x, y) = \big(\gamma_{|x \wedge y|} - \gamma_{|x|}\big) + \big(\gamma_{|x \wedge y|} - \gamma_{|y|}\big).$$

It can be shown (see [34]) that $\rho$ is a metric; in fact, it embeds isometrically into the square norm metric of a Hilbert space.
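The metric can be evaluated directly on finite sequences. The sketch below is an illustration under two assumptions: the scale sequence is taken to be $\gamma_k = 2^{-k}$, and only elements of $Z_0$ (finite tuples) are represented, since infinite sequences cannot be stored explicitly. The function names are hypothetical.

```python
def gamma(k):
    """Assumed scale sequence gamma_k = 2^{-k}."""
    return 2.0 ** (-k)


def common_prefix_len(x, y):
    """Length of the longest common prefix x ^ y of two finite sequences."""
    n = 0
    for u, v in zip(x, y):
        if u != v:
            break
        n += 1
    return n


def rho(x, y):
    """rho(x, y) = (gamma_{|x^y|} - gamma_{|x|}) + (gamma_{|x^y|} - gamma_{|y|})."""
    p = common_prefix_len(x, y)
    return (gamma(p) - gamma(len(x))) + (gamma(p) - gamma(len(y)))
```

For example, the sequences `(1, 2, 3)` and `(1, 2, 4)` share a prefix of length 2, so their distance is $2(\gamma_2 - \gamma_3) = 0.25$. Viewed on the prefix tree, $\rho$ is the path length between two nodes when the edge from depth $k$ to depth $k+1$ has weight $\gamma_k - \gamma_{k+1}$.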

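To define $\mu$, the marginal measure over $\mathcal{X}$, let $\nu_\infty$ be the uniform product distribution measure over $Z_\infty$, that is: for all $i \in \mathbb{N}$, each $z_i$ in the sequence $z = (z_i)_{i \in \mathbb{N}} \in Z_\infty$ is independently drawn from a uniform distribution over $[N_i]$. Let $\nu_0$ be an atomic measure on $Z_0$ such that for all $z \in Z_0$, $\nu_0(z) = a_{|z|}$. Clearly, the first condition in Eq. (4) implies $\nu_0(Z_0) = 1$. Define the marginal probability measure $\mu$ over $\mathcal{X}$ by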

$$\forall A \subseteq Z_0 \cup Z_\infty, \qquad \mu(A) := \alpha\, \nu_\infty(A) + (1 - \alpha)\, \nu_0(A).$$

In words, an infinite sequence is drawn with probability $\alpha$ (and all such sequences are equally likely), or else a finite sequence is drawn (and all finite sequences of the same length are equally likely). Define the realizable distribution $\bar{\mu}$ over $\mathcal{X} \times \mathcal{Y}$ by setting the marginal over $\mathcal{X}$ to $\mu$, and by setting the label of $z \in Z_\infty$ to be $1$ with probability $1$ and the label of $z \in Z_0$ to be $0$ with probability $1$.

As shown in [34], this construction satisfies Eq. (3) with $C = Z_\infty$ and $\mu(C) = \alpha > 0$. It follows from Theorem 3 that no $k$-NN algorithm is Bayes-consistent on it. In contrast, the following theorem shows that KSU is weakly Bayes-consistent on this distribution. Theorem 4 immediately follows from this result.

###### Theorem 6.

Assume $(\mathcal{X}, \rho)$, $\mathcal{Y}$ and $\bar{\mu}$ as in Example 1. KSU is weakly Bayes-consistent on $\bar{\mu}$.

The proof, provided in Appendix C, first characterizes the Voronoi cells for which the true majority-vote yields a significant error for the cell (Lemma 15). In finite-dimensional spaces, the total measure of all such “bad” cells can be made arbitrarily close to zero by taking $\gamma$ to be sufficiently small, as shown in Lemma 10 in the proof of Theorem 2. However, it is not immediately clear whether this can be achieved for the infinite-dimensional construction above.

Indeed, we expect such bad cells, due to the unintuitive property that for any $x \in Z_\infty$, we have $\mu(\bar{B}_\gamma(x) \cap Z_\infty) / \mu(\bar{B}_\gamma(x)) \to 0$ when $\gamma \to 0$, and yet $\mu(Z_\infty) = \alpha > 0$. Thus, if for example a significant portion of the set $Z_\infty$ (whose label is $1$) is covered by Voronoi cells of the form $V = \bar{B}_\gamma(x)$ with $x \in Z_\infty$, then for all sufficiently small $\gamma$, each one of these cells will have a true majority-vote $0$. Thus a significant portion of $Z_\infty$ would be misclassified. However, we show that by the structure of the construction, combined with the packing and covering properties of $\gamma$-nets, in any $\gamma$-net the total measure of all these “bad” cells goes to $0$ when $\gamma \to 0$, thus yielding a consistent classifier.

Lastly, the following theorem shows that on the same construction above, when the Voronoi partitions are allowed to violate the packing property of -nets, Bayes-consistency does not necessarily hold. Theorem 5 immediately follows from the following result.

###### Theorem 7.

Assume $(\mathcal{X}, \rho)$, $\mathcal{Y}$ and $\bar{\mu}$ as in Example 1. There exists a sequence of Voronoi partitions of $\mathcal{X}$ with a vanishing diameter such that the sequence of true majority-vote classifiers induced by these partitions is not Bayes-consistent.

The proof, provided in Appendix D, constructs a sequence of Voronoi partitions, where each partition has all of its impure Voronoi cells (those with both $0$ and $1$ labels) being bad. In this case, $Z_\infty$ is incorrectly classified by the induced classifiers, yielding a significant error. Thus, in infinite-dimensional metric spaces, the shape of the Voronoi cells plays a fundamental role in the consistency of the partition.

#### Acknowledgments.

We thank Frédéric Cérou for the numerous fruitful discussions and helpful feedback on an earlier draft. Aryeh Kontorovich was supported in part by the Israel Science Foundation (grant No. 755/15), Paypal and IBM. Sivan Sabato was supported in part by the Israel Science Foundation (grant No. 555/15).

## References

• [1] Christophe Abraham, Gérard Biau, and Benoît Cadre. On the kernel rule for function classification. Ann. Inst. Statist. Math., 58(3):619–633, 2006.
• [2] Daniel Berend and Aryeh Kontorovich. The missing mass problem. Statistics & Probability Letters, 82(6):1102–1110, 2012.
• [3] Daniel Berend and Aryeh Kontorovich. On the concentration of the missing mass. Electronic Communications in Probability, 18(3):1–7, 2013.
• [4] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 97–104, New York, NY, USA, 2006. ACM.
• [5] Gérard Biau, Florentina Bunea, and Marten H. Wegkamp. Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory, 51(6):2163–2172, 2005.
• [6] Gérard Biau, Frédéric Cérou, and Arnaud Guyader. Rates of convergence of the functional $k$-nearest neighbor estimate. IEEE Trans. Inform. Theory, 56(4):2034–2040, 2010.
• [7] V. I. Bogachev. Measure theory. Vol. I, II. Springer-Verlag, Berlin, 2007.
• [8] Oren Boiman, Eli Shechtman, and Michal Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
• [9] Frédéric Cérou and Arnaud Guyader. Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10:340–355, 2006.
• [10] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In NIPS, 2014.
• [11] Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
• [12] Luc Devroye. On the inequality of Cover and Hart in nearest neighbor discrimination. IEEE Trans. Pattern Anal. Mach. Intell., 3(1):75–78, 1981.
• [13] Luc Devroye and László Györfi. Nonparametric density estimation: the $L_1$ view. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York, 1985.
• [14] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
• [15] Herbert Federer. Geometric measure theory. Die Grundlehren der mathematischen Wissenschaften, Band 153. Springer-Verlag New York Inc., New York, 1969.
• [16] Evelyn Fix and J. L. Hodges, Jr. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247, 1989.
• [17] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine learning, 21(3):269–304, 1995.
• [18] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014.
• [19] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105–118, 2016.
• [20] Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Near-optimal sample compression for nearest neighbors. In Neural Information Processing Systems (NIPS), 2014.
• [21] Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearly optimal classification for semimetrics (extended abstract AISTATS 2016). Journal of Machine Learning Research, 2017.
• [22] Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1):55–76, 2005.
• [23] Peter Hall and Kee-Hoon Kang. Bandwidth choice for nonparametric classification. Ann. Statist., 33(1):284–306, 2005.
• [24] Olav Kallenberg. Foundations of modern probability. Second edition. Probability and its Applications. Springer-Verlag, 2002.
• [25] Aryeh Kontorovich, Sivan Sabato, and Ruth Urner. Active nearest-neighbor learning in metric spaces. In Advances in Neural Information Processing Systems, pages 856–864, 2016.
• [26] Aryeh Kontorovich and Roi Weiss. A Bayes consistent 1-NN classifier. In Artificial Intelligence and Statistics (AISTATS 2015), 2014.
• [27] Aryeh Kontorovich and Roi Weiss. Maximum margin multiclass nearest neighbors. In International Conference on Machine Learning (ICML 2014), 2014.
• [28] Robert Krauthgamer and James R. Lee. Navigating nets: Simple algorithms for proximity search. In 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 791–801, January 2004.
• [29] Sanjeev R. Kulkarni and Steven E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Trans. Inform. Theory, 41(4):1028–1039, 1995.
• [30] Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. unpublished, 1986.
• [31] James R. Munkres. Topology: a first course. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1975.
• [32] Vladimir Pestov. On the geometry of similarity search: dimensionality curse and concentration of measure. Inform. Process. Lett., 73(1-2):47–51, 2000.
• [33] Vladimir Pestov. Is the $k$-NN classifier in high dimensions affected by the curse of dimensionality? Comput. Math. Appl., 65(10):1427–1437, 2013.
• [34] David Preiss. Invalid Vitali theorems. Abstracta. 7th Winter School on Abstract Analysis, pages 58–60, 1979.
• [35] David Preiss. Gaussian measures and the density theorem. Comment. Math. Univ. Carolin., 22(1):181–193, 1981.
• [36] Demetri Psaltis, Robert R. Snapp, and Santosh S. Venkatesh. On the finite sample performance of the nearest neighbor classifier. IEEE Transactions on Information Theory, 40(3):820–837, 1994.
• [37] Walter Rudin. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition, 1976. International Series in Pure and Applied Mathematics.
• [38] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1987.
• [39] Richard J. Samworth. Optimal weighted nearest neighbour classifiers. Ann. Statist., 40(5):2733–2763, 2012.
• [40] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
• [41] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
• [42] Robert R. Snapp and Santosh S. Venkatesh. Asymptotic expansions of the nearest neighbor risk. Ann. Statist., 26(3):850–878, 1998.
• [43] Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.
• [44] Jaroslav Tišer. Vitali covering theorem in Hilbert space. Trans. Amer. Math. Soc., 355(8):3277–3289, 2003.
• [45] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
• [46] Lin Cheng Zhao. Exponential bounds of mean error for the nearest neighbor estimates of regression functions. J. Multivariate Anal., 21(1):168–178, 1987.

## Appendix A Background on metric measure spaces

Here we provide some general background on metric measure spaces. Our metric space $(\mathcal{X},\rho)$ is doubling, but in this section finite diameter is not assumed. We recall some standard definitions. A topological space is Hausdorff if every two distinct points have disjoint neighborhoods. It is a standard (and obvious) fact that all metric spaces are Hausdorff.

A metric space $\mathcal{X}$ is complete if every Cauchy sequence converges to a point in $\mathcal{X}$. Every metric space may be completed by (essentially) adjoining to it the limits of all of its Cauchy sequences [37, Exercise 3.24]; moreover, the completion is unique up to isometry [31, Section 43, Exercise 10]. We implicitly assume throughout the paper that $\mathcal{X}$ is complete. Closed subsets of complete metric spaces are also complete metric spaces under the inherited metric.

A topological space is locally compact if every point has a compact neighborhood. It is a standard and easy fact that complete doubling spaces are locally compact. Indeed, consider any $x\in\mathcal{X}$ and the open $r$-ball about $x$, denoted $B_r(x)$. We must show that $\overline{B_r(x)}$ (the closure of $B_r(x)$) is compact. To this end, it suffices to show that $\overline{B_r(x)}$ is totally bounded (that is, has a finite $\epsilon$-covering number for each $\epsilon>0$), since in complete metric spaces, a set is compact iff it is closed and totally bounded [31, Theorem 45.1]. Total boundedness follows immediately from the doubling property. The latter posits a constant $k$ and some $x_1,\ldots,x_k\in\mathcal{X}$ such that $B_r(x)\subseteq\bigcup_{i=1}^{k}B_{r/2}(x_i)$. Then certainly $\overline{B_r(x)}\subseteq\bigcup_{i=1}^{k}\overline{B_{r/2}(x_i)}$. We now apply the doubling property recursively to each of the $B_{r/2}(x_i)$, until the radius of the covering balls becomes smaller than $\epsilon$.
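The recursive covering argument can be made quantitative. As a sketch, write $k$ for the doubling constant, $r$ for the radius of the ball being covered, and $N(\epsilon)$ for the number of $\epsilon$-balls produced by the recursion (these names are assumed here for illustration):

```latex
% After j rounds of halving, B_r(x) is covered by at most k^j balls of radius r/2^j.
% Taking j large enough that r/2^j < \epsilon yields a finite \epsilon-cover.
\[
  j(\epsilon) := \left\lfloor \log_2 (r/\epsilon) \right\rfloor + 1
  \quad\Longrightarrow\quad
  \frac{r}{2^{j(\epsilon)}} < \epsilon ,
  \qquad
  N(\epsilon) \;\le\; k^{\,j(\epsilon)} \;\le\; k \left( \frac{r}{\epsilon} \right)^{\log_2 k}
  \qquad (0 < \epsilon \le r).
\]
```

Since the closure of a ball of radius $r/2^{j(\epsilon)}$ is contained in the open $\epsilon$-ball with the same center, the same finite family of centers also covers the closure of the original ball.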

We now recall some standard facts from measure theory. Any topology on $\mathcal{X}$ (and in particular, the one induced by the metric $\rho$) induces the Borel $\sigma$-algebra $\mathcal{B}$. A Borel probability measure is a function $\mu:\mathcal{B}\to[0,1]$ that is countably additive and normalized by $\mu(\mathcal{X})=1$. The latter is complete if for all $A\subseteq B\in\mathcal{B}$ for which $\mu(B)=0$, the set $A$ is also measurable, with $\mu(A)=0$. Any Borel $\sigma$-algebra may be completed by defining the measure of any subset of a measure-zero set to be zero [38, Theorem 1.36]. We implicitly assume throughout the paper that $(\mathcal{X},\mathcal{A},\mu)$ is a complete measure space, where $\mathcal{A}$ contains all of the Borel sets.

The measure $\mu$ is said to be outer regular if it can be approximated from above by open sets: for every measurable set $E$, we have

$$\mu(E)=\inf\{\mu(V) : E\subseteq V,\ V \text{ open}\}.$$

The corresponding notion of inner regularity is approximability from below by compact sets: for every measurable set $E$,

$$\mu(E)=\sup\{\mu(K) : K\subseteq E,\ K \text{ compact}\}.$$

The measure $\mu$ is regular if it is both inner and outer regular. Any probability measure defined on the Borel $\sigma$-algebra of a metric space is regular [24, Lemma 1.19]. (Dropping the “metric” or “probability” assumptions opens the door to various exotic pathologies [7, Chapter 7], [38, Exercise 2.17].)

Finally, we have the following technical result, adapted from [38, Theorem 3.14] to our setting:

###### Theorem 8.

Let $(\mathcal{X},\rho)$ be a complete doubling metric space equipped with a complete probability measure $\mu$, such that all Borel sets are $\mu$-measurable. Then $C_c(\mathcal{X})$ (the collection of continuous functions with compact support) is dense in $L^1(\mu)$.

## Appendix B Bayes-consistency proof of KSU in finite dimensions

In this section we prove Theorem 2 in full detail. Let $(\mathcal{X},\rho)$ be a metric space with finite doubling dimension. Given a sample $S_n$, we denote by $\alpha^*_n$ the optimal empirical error and by $m^*_n$ the optimal compression size, as computed by KSU. As shown in Sec. 3, the labeled set $S'_n(\gamma^*_n)$ computed by KSU is an $(\alpha^*_n, 2m^*_n)$-compression of the sample $S_n$. For brevity we denote

$$Q_n(\alpha,m) := Q(n,\alpha,2m,\delta_n).$$

To prove Theorem 2 we decompose the excess error over the Bayes error $R^*$ into two terms:

$$\mathrm{err}(h_{S'_n(\gamma^*_n)}) - R^* = \big(\mathrm{err}(h_{S'_n(\gamma^*_n)}) - Q_n(\alpha^*_n,m^*_n)\big) + \big(Q_n(\alpha^*_n,m^*_n) - R^*\big) =: T_I(n) + T_{II}(n),$$

and show that each term decays to zero with probability one.

For the first term, $T_I(n)$, Property 1 of the generalization bound $Q$ gives, for any $n$,

$$\mathbb{P}_{S_n}\big[\mathrm{err}(h_{S'_n(\gamma^*_n)}) - Q_n(\alpha^*_n,m^*_n) > 0\big] \le \delta_n. \tag{5}$$

Since $\sum_{n}\delta_n<\infty$, the Borel-Cantelli lemma implies $\limsup_{n\to\infty} T_I(n)\le 0$ with probability $1$.
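For completeness, the Borel-Cantelli step can be spelled out as follows. This is a sketch that assumes the confidence parameters $\delta_n$ are summable, which is the hypothesis the lemma requires; the event name $A_n$ is introduced here only for illustration.

```latex
% A_n is the event that the generalization bound fails at sample size n, as in (5).
\[
  A_n := \{\, T_I(n) > 0 \,\}, \qquad
  \sum_{n=1}^{\infty} \mathbb{P}[A_n] \;\le\; \sum_{n=1}^{\infty} \delta_n \;<\; \infty
  \quad\Longrightarrow\quad
  \mathbb{P}\Big[ \limsup_{n\to\infty} A_n \Big] = 0 .
\]
% Hence, almost surely, only finitely many A_n occur: T_I(n) \le 0 for all large n.
```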

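The main challenge is to prove that $\limsup_{n\to\infty} T_{II}(n)\le 0$ with probability one. We begin by showing that the Bayes error $R^*$ can be approached using classifiers defined by the true majority-vote labeling over fine partitions of $\mathcal{X}$. Formally, let $\mathcal{V}$ be a finite partition of $\mathcal{X}$, and define the function $I_{\mathcal{V}}:\mathcal{X}\to\mathcal{V}$ such that $I_{\mathcal{V}}(x)$ is the unique $V\in\mathcal{V}$ for which $x\in V$. For any measurable set $E\subseteq\mathcal{X}$ define the true majority-vote label $y^*(E)$ by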

$$y^*(E)=\operatorname*{argmax}_{y\in\mathcal{Y}} \mathbb{P}_{\bar\mu}(Y=y \mid X\in E), \tag{6}$$

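where ties are broken lexicographically. To ensure that $y^*(E)$ is always well-defined, when $\mu(E)=0$ we arbitrarily define it to be the lexicographically first $y\in\mathcal{Y}$. Given $\mathcal{V}$ and a measurable set $M\subseteq\mathcal{X}$, consider the true majority-vote classifier $h^*_{\mathcal{V},M}:\mathcal{X}\to\mathcal{Y}$ given by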

$$h^*_{\mathcal{V},M}(x)=y^*(I_{\mathcal{V}}(x)\cap M). \tag{7}$$

Note that if $x\notin M$, this classifier attaches a label to $x$ based on the true majority vote in a set that does not contain $x$. To bound the error of $h^*_{\mathcal{V},M}$ for any conditional distribution of labels, we use the fact that on doubling metric spaces, continuous functions are dense in $L^1(\mu)$.
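To make the construction in (6) and (7) concrete, the following minimal sketch evaluates the true majority-vote classifier on a toy finite space. The function names, the dictionary encoding of the joint distribution, and the numbers are all hypothetical and chosen only for illustration; the paper's setting is a general metric space.

```python
from collections import defaultdict


def majority_label(joint, cell):
    """Majority-vote label over the point set `cell`, in the spirit of (6).

    `joint` maps (x, y) -> probability mass. Ties, and cells of zero mass,
    resolve to the lexicographically first label.
    """
    labels = sorted({y for (_, y) in joint})
    mass = defaultdict(float)
    for (x, y), p in joint.items():
        if x in cell:
            mass[y] += p
    # Maximizing joint mass within the cell is the same as maximizing the
    # conditional probability; max() returns the first maximizer in sorted order.
    return max(labels, key=lambda y: mass[y])


def majority_vote_classifier(joint, partition, M):
    """Return x -> y*(I_V(x) ∩ M) for a finite partition, as in (7)."""
    def h(x):
        cell = next(V for V in partition if x in V)  # the cell I_V(x)
        return majority_label(joint, cell & M)
    return h


def error(joint, h):
    """Probability that h(X) != Y under the toy joint distribution."""
    points = {x for (x, _) in joint}
    return 1.0 - sum(joint.get((x, h(x)), 0.0) for x in points)


# Hypothetical toy distribution on points {0, 1, 2, 3} and labels {"a", "b"}.
joint = {
    (0, "a"): 0.30, (0, "b"): 0.10,
    (1, "a"): 0.05, (1, "b"): 0.15,
    (2, "a"): 0.05, (2, "b"): 0.25,
    (3, "a"): 0.02, (3, "b"): 0.08,
}
partition = [frozenset({0, 1}), frozenset({2, 3})]
M = frozenset({0, 1, 2})
h = majority_vote_classifier(joint, partition, M)
```

In this toy instance the point 3 lies outside `M`, so `h(3)` is decided by the majority vote over `{2}` alone, which is the situation described in the preceding note.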

###### Lemma 9.

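For every probability measure $\mu$ on a doubling metric space $\mathcal{X}$, the set of continuous functions $f:\mathcal{X}\to\mathbb{R}$ with compact support is dense in $L^1(\mu)$. Namely, for any $\epsilon>0$ and $f\in L^1(\mu)$ there is a continuous function $g$ with compact support such that $\int_{\mathcal{X}}|f(x)-g(x)|\,d\mu(x)<\epsilon$.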

###### Proof.

This is stated as Theorem 8 in Appendix A. ∎

We have the following uniform approximation bound for the error of classifiers of the form (7), essentially extending the approximation analysis done in the proof of [14, Theorem 21.2] for the special case $|\mathcal{Y}|=2$ and $\mathcal{X}=\mathbb{R}^d$ to the more general multi-class problem in doubling metric spaces.

###### Lemma 10.

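Let $\bar\mu$ be a probability measure on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a doubling metric space. For any $\nu>0$, there exists a diameter such that for any finite measurable partition $\mathcal{V}$ of $\mathcal{X}$ and any measurable set $M\subseteq\mathcal{X}$ satisfying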

• ,

the true majority-vote classifier defined in (7) satisfies .

###### Proof.

Let $\eta_y:\mathcal{X}\to[0,1]$ be the conditional probability function for label $y\in\mathcal{Y}$,

$$\eta_y(x)=\mathbb{P}_{\bar\mu}(Y=y \mid X=x).$$

Define $\tilde\eta_y$ as $\eta_y$’s conditional expectation function with respect to $(\mathcal{V},M)$,

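$$\tilde\eta_y(x)=\mathbb{P}_{\bar\mu}(Y=y \mid X\in I_{\mathcal{V}}(x)\cap M)=\frac{\int_{I_{\mathcal{V}}(x)\cap M}\eta_y(z)\,d\mu(z)}{\mu(I_{\mathcal{V}}(x)\cap M)}.$$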

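(And, say, $\tilde\eta_y(x)$ is the indicator that $y$ is the lexicographically first label for $\mu(I_{\mathcal{V}}(x)\cap M)=0$.) Note that the functions $\tilde\eta_y$ are piecewise constant on the cells of the restricted partition $\{V\cap M : V\in\mathcal{V}\}$. By definition, the Bayes classifier $h^*$ and the true majority-vote classifier $h^*_{\mathcal{V},M}$ satisfy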

$$h^*(x)=\operatorname*{argmax}_{y\in\mathcal{Y}}\eta_y(x), \qquad h^*_{\mathcal{V},M}(x)=\operatorname*{argmax}_{y\in\mathcal{Y}}\tilde\eta_y(x).$$
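The cell-averaging that defines the smoothed conditional probabilities can be illustrated on a finite toy space. This is only a sketch: the names `cell_average` and `eta_tilde`, the numeric values, and the convention of returning `0.0` on a zero-mass cell are assumptions made here for illustration.

```python
def cell_average(eta, mu, cell):
    """mu-weighted average of eta over `cell`.

    Returns 0.0 on a zero-mass cell (a placeholder convention for this sketch).
    """
    mass = sum(mu[z] for z in cell)
    if mass == 0:
        return 0.0
    return sum(eta[z] * mu[z] for z in cell) / mass


# Hypothetical marginal mu and conditional probability eta for one fixed label.
mu = {0: 0.4, 1: 0.2, 2: 0.3, 3: 0.1}
eta = {0: 0.75, 1: 0.25, 2: 0.5, 3: 0.2}
partition = [{0, 1}, {2, 3}]
M = {0, 1, 2}


def eta_tilde(x):
    """Average of eta over the restricted cell I_V(x) ∩ M."""
    cell = next(V for V in partition if x in V)
    return cell_average(eta, mu, cell & M)
```

Here `eta_tilde` takes one value on each cell of the partition, even at the point 3, which lies outside `M` and inherits the average over `{2}`.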

It follows that

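$$\begin{aligned}
\mathbb{P}_{\bar\mu}(h^*_{\mathcal{V},M}(X)\neq Y \mid X=x)-\mathbb{P}_{\bar\mu}(h^*(X)\neq Y \mid X=x)
&= \eta_{h^*(x)}(x)-\eta_{h^*_{\mathcal{V},M}(x)}(x) \\
&= \max_{y\in\mathcal{Y}}\eta_y(x)-\max_{y\in\mathcal{Y}}\tilde\eta_y(x) \\
&\le \max_{y\in\mathcal{Y}}|\eta_y(x)-\tilde\eta_y(x)|.
\end{aligned}$$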

By condition (i) in the lemma statement, $\mu(\mathcal{X}\setminus M)\le\nu$. Thus,

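$$\begin{aligned}
\mathrm{err}(h^*_{\mathcal{V},M})-R^*
&= \mathbb{P}_{\bar\mu}(h^*_{\mathcal{V},M}(X)\neq Y)-\mathbb{P}_{\bar\mu}(h^*(X)\neq Y) \\
&\le \mu(\mathcal{X}\setminus M)+\int_{M}\max_{y\in\mathcal{Y}}|\eta_y(x)-\tilde\eta_y(x)|\,d\mu(x) \\
&\le \nu+\sum_{y\in\mathcal{Y}}\int_{M}|\eta_y(x)-\tilde\eta_y(x)|\,d\mu(x).
\end{aligned}$$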

Let be a finite set of labels such that . Then

$$\mathrm{err}(h^*_{\mathcal{V},M})-R^*\le 2\nu+\sum_{y\in\mathcal{Y}'}\int_{M}|\eta_y(x)-\tilde\eta_y(x)|\,d\mu(x).$$