Nearest-Neighbor Sample Compression:
Efficiency, Consistency, Infinite Dimensions
Abstract
We examine the Bayes-consistency of a recently proposed 1-nearest-neighbor-based multiclass learning algorithm. This algorithm is derived from sample compression bounds and enjoys the statistical advantages of tight, fully empirical generalization bounds, as well as the algorithmic advantages of a faster runtime and memory savings. We prove that this algorithm is strongly Bayes-consistent in metric spaces with finite doubling dimension — the first consistency result for an efficient nearest-neighbor sample compression scheme. Rather surprisingly, we discover that this algorithm continues to be Bayes-consistent even in a certain infinite-dimensional setting, in which the basic measure-theoretic conditions on which classic consistency proofs hinge are violated. This is all the more surprising, since it is known that k-NN is not Bayes-consistent in this setting. We pose several challenging open problems for future research.
1 Introduction
This paper deals with Nearest-Neighbor (NN) learning algorithms in metric spaces. Initiated by Fix and Hodges in 1951 [16], this seemingly naive learning paradigm remains competitive against more sophisticated methods [8, 45] and, in its celebrated k-NN version, has been placed on a solid theoretical foundation [11, 43, 13, 46].
Although the classic 1-NN is well known to be inconsistent in general, in recent years a series of papers has presented variations on the theme of a regularized 1-NN classifier, as an alternative to the Bayes-consistent k-NN. Gottlieb et al. [18] showed that approximate nearest-neighbor search can act as a regularizer, actually improving generalization performance rather than just injecting noise. In a follow-up work, [26] showed that applying Structural Risk Minimization to (essentially) the margin-regularized data-dependent bound in [18] yields a strongly Bayes-consistent 1-NN classifier. A further development has seen margin-based regularization analyzed through the lens of sample compression: a near-optimal nearest-neighbor condensing algorithm was presented in [20] and later extended to cover semimetric spaces [21]; an activized version also appeared [25]. As detailed in [26], margin-regularized 1-NN methods enjoy a number of statistical and computational advantages over the traditional k-NN classifier. Salient among these are explicit data-dependent generalization bounds, and considerable runtime and memory savings. Sample compression affords additional advantages, in the form of tighter generalization bounds and increased efficiency in time and space.
In this work we study the Bayes-consistency of a compression-based 1-NN multiclass learning algorithm, in both finite-dimensional and infinite-dimensional metric spaces. The algorithm is essentially the passive component of the active learner proposed by Kontorovich, Sabato, and Urner in [25], and we refer to it in the sequel as KSU; for completeness, we present it here in full (Alg. 1). We show that in finite-dimensional metric spaces, KSU is both computationally efficient and Bayes-consistent. This is the first compression-based multiclass 1-NN algorithm proven to possess both of these properties. We further exhibit a surprising phenomenon in infinite-dimensional spaces, where we construct a distribution for which KSU is Bayes-consistent while k-NN is not.
Main results.
Our main contributions consist of analyzing the performance of KSU in finite- and infinite-dimensional settings, and comparing it to the classical k-NN learner. Our key findings are summarized below.

In Theorem 2, we show that KSU is computationally efficient and strongly Bayes-consistent in metric spaces with a finite doubling dimension. This is the first (strong or otherwise) Bayes-consistency result for an efficient sample compression scheme for a multiclass (or even binary) 1-NN algorithm.¹ This result should be contrasted with the one in [26], where margin-based regularization was employed, but not compression; the proof techniques from [26] do not carry over to the compression-based scheme. Instead, novel arguments are required, as we discuss below. The new sample compression technique provides a Bayes-consistency proof for multiple (even countably many) labels; this is contrasted with the multiclass 1-NN algorithm in [27], which is not compression-based, and requires solving a minimum vertex cover problem, thereby imposing an approximation factor whenever there are more than two labels.

¹ An efficient sample compression algorithm was given in [20] for the binary case, but no Bayes-consistency guarantee is known for it.

In Theorem 4, we make the surprising discovery that KSU continues to be Bayes-consistent in a certain infinite-dimensional setting, even though this setting violates the basic measure-theoretic conditions on which classic consistency proofs (including that of Theorem 2) hinge. This is all the more surprising, since it is known that k-NN is not Bayes-consistent for this construction [9]. We are currently unaware of any separable² metric probability space on which KSU fails to be Bayes-consistent; this is posed as an intriguing open problem.

² Cérou and Guyader [9] gave a simple example of a non-separable metric on which all known nearest-neighbor methods, including k-NN and KSU, obviously fail.
Our results indicate that in finite dimensions, an efficient, compression-based, Bayes-consistent multiclass 1-NN algorithm exists, and hence can be offered as an alternative to k-NN, which is well known to be Bayes-consistent in finite dimensions [12, 40]. In contrast, in infinite dimensions, our results show that the condition characterizing the Bayes-consistency of k-NN does not extend to all NN-based algorithms. It is an open problem to characterize the necessary and sufficient conditions for the existence of a Bayes-consistent NN-based algorithm in infinite dimensions.
Related work.
Following the pioneering work of [11] on nearest-neighbor classification, it was shown by [13, 46, 14] that the k-NN classifier is strongly Bayes-consistent in ℝ^d. These results made extensive use of the Euclidean structure of ℝ^d, but in [40] a weak Bayes-consistency result was shown for metric spaces with a bounded diameter and a bounded doubling dimension, under additional distributional smoothness assumptions. More recently, some of the classic results on k-NN risk decay rates were refined by [10] in an analysis that captures the interplay between the metric and the sampling distribution. The worst-case rates have an exponential dependence on the dimension (i.e., the so-called curse of dimensionality), and Pestov [32, 33] examines this phenomenon closely under various distributional and structural assumptions.
Consistency of NN-type algorithms in more general (and in particular infinite-dimensional) metric spaces was discussed in [1, 5, 6, 9, 29]. In [1, 9], characterizations of Bayes-consistency were given in terms of Besicovitch-type conditions (see Eq. (3)). In [1], a generalized "moving window" classification rule is used and additional regularity conditions on the regression function are imposed. The filtering technique (i.e., taking the first d coordinates in some basis representation) was shown to be universally consistent in [5]. However, that algorithm suffers from the cost of cross-validating over both the dimension d and the number of neighbors k. Also, the technique is only applicable in Hilbert spaces (as opposed to more general metric spaces) and provides only asymptotic consistency, without finite-sample bounds such as those provided by KSU. The insight of [5] is extended to the more general Banach spaces in [6] under various regularity assumptions.
None of the aforementioned generalization results for NN-based techniques are in the form of fully empirical, explicitly computable sample-dependent error bounds. Rather, they are stated in terms of the unknown Bayes-optimal rate, and some involve additional parameters quantifying the well-behavedness of the unknown distribution (see [26] for a detailed discussion). As such, these guarantees do not enable a practitioner to compute a numerical generalization error estimate for a given training sample, much less allow for a data-dependent selection of k, which must instead be tuned via cross-validation. The asymptotic expansions in [42, 36, 23, 39] likewise do not provide a computable finite-sample bound. The quest for such bounds was a key motivation behind the series of works [18, 27, 20], of which KSU [25] is the latest development.
The work of Devroye et al. [14, Theorem 21.2] has implications for NN classifiers in ℝ^d that are defined based on data-dependent majority-vote partitions of the space. It is shown that under some conditions, a fixed mapping from each sample size to a data-dependent partition rule induces a strongly Bayes-consistent algorithm. This result requires the partition rule to have a bounded VC dimension, and since this rule must be fixed in advance, the algorithm is not fully adaptive. Theorem 19.3 ibid. proves weak consistency for an inefficient compression-based algorithm, which selects among all possible compression sets of a certain size, and maintains a certain rate of compression relative to the sample size. The generalizing power of sample compression was independently discovered by [30], and later elaborated upon by [22]. In the context of NN classification, [14] lists various condensing heuristics (which have no known performance guarantees) and leaves open the algorithmic question of how to minimize the empirical risk over all subsets of a given size.
The first compression-based 1-NN algorithm with provable optimality guarantees was given in [20]; it was based on constructing nets in spaces with a finite doubling dimension. The compression size of this construction was shown to be nearly unimprovable by an efficient algorithm unless P = NP. With nets as its algorithmic engine, KSU inherits this near-optimality. The compression-based NN paradigm was later extended to semimetrics in [21], where it was shown to survive violations of the triangle inequality, while the hierarchy-based search methods that have become standard for metric spaces (such as [4, 18] and related approaches) all break down.
It was shown in [26] that a margin-regularized 1-NN learner (essentially, the one proposed in [18], which, unlike [20], did not involve sample compression) becomes strongly Bayes-consistent when the margin is chosen optimally in an explicitly prescribed sample-dependent fashion. The margin-based technique developed in [18] for the binary case was extended to multiclass in [27]. Since the algorithm relied on computing a minimum vertex cover, it was not possible to make it both computationally efficient and Bayes-consistent when the number of labels exceeds two. An additional improvement over [27] is that the generalization bounds presented there had an explicit (logarithmic) dependence on the number of labels, while our compression scheme extends seamlessly to countable label spaces.
Paper outline.
After fixing the notation and setup in Sec. 2, in Sec. 3 we present KSU, the compression-based 1-NN algorithm we analyze in this work. Sec. 4 discusses our main contributions regarding KSU, together with some open problems. High-level proof sketches are given in Sec. 5 for the finite-dimensional case and in Sec. 6 for the infinite-dimensional case. Full, detailed proofs are found in the appendices.
2 Setting and Notation
Our instance space is the metric space (X, ρ), where X is the instance domain and ρ is the metric. (See Appendix A for relevant background on metric measure spaces.) We consider a countable label space Y. The unknown sampling distribution is a probability measure μ̄ over X × Y, with marginal μ over X. Denote by (X, Y) a pair drawn according to μ̄. The generalization error of a classifier h : X → Y is given by err(h) := P[h(X) ≠ Y], and its empirical error with respect to a labeled set S_n = ((X_1, Y_1), …, (X_n, Y_n)) is given by êrr(h, S_n) := (1/n) Σ_{i∈[n]} 1[h(X_i) ≠ Y_i]. The optimal Bayes risk of μ̄ is err*(μ̄) := inf_h err(h), where the infimum is taken over all measurable classifiers h. We say that μ̄ is realizable when err*(μ̄) = 0. We omit the overline in μ̄ in the sequel when there is no ambiguity.
For a finite labeled set S ⊆ X × Y and any x ∈ X, let X_nn(x, S) be the nearest neighbor of x with respect to S and let Y_nn(x, S) be the nearest-neighbor label of x with respect to S:
where ties are broken arbitrarily. The 1-NN classifier induced by S is denoted by h_S. The set of points in S, denoted by X(S), induces a Voronoi partition of X, where each Voronoi cell consists of the points of X whose nearest neighbor in X(S) is the corresponding stored point. By definition, h_S is constant on each Voronoi cell.
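The nearest-neighbor rule and its Voronoi-cell view can be sketched in a few lines (an illustrative brute-force implementation; `nn_classify` and the toy metric are ours, not part of the paper):

```python
from math import dist  # Euclidean distance as a stand-in for a general metric rho

def nn_classify(S, x, rho=dist):
    """Return the label of x's nearest neighbor in the labeled set S.

    S is a list of (point, label) pairs; ties are broken by list order,
    mirroring the arbitrary tie-breaking in the text. The region of points
    x for which a given stored point wins is exactly its Voronoi cell, so
    the induced classifier is constant on each cell.
    """
    _, nearest_label = min(S, key=lambda pl: rho(pl[0], x))
    return nearest_label

# two stored points induce two Voronoi half-planes
S = [((0.0, 0.0), "a"), ((1.0, 0.0), "b")]
assert nn_classify(S, (0.2, 0.1)) == "a"
assert nn_classify(S, (0.9, -0.1)) == "b"
```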
A 1-NN algorithm is a mapping from an i.i.d. labeled sample S_n to a labeled set S ⊆ X × Y, yielding the 1-NN classifier h_S. While the classic 1-NN algorithm simply sets S = S_n, in this work we study a compression-based algorithm which selects S adaptively, as discussed further below.
A 1-NN algorithm is strongly Bayes-consistent on μ̄ if its true error converges to the Bayes risk err*(μ̄) almost surely. An algorithm is weakly Bayes-consistent on μ̄ if its true error converges to err*(μ̄) in expectation. Obviously, the former implies the latter. We say that an algorithm is Bayes-consistent on a metric space if it is Bayes-consistent on all distributions over that metric space.
A convenient property that is used when studying the Bayes-consistency of algorithms in metric spaces is the doubling dimension. Denote the open ball of radius r around x by B(x, r) and let B̄(x, r) denote the corresponding closed ball. The doubling dimension of a metric space (X, ρ) is defined as follows. Let λ be the smallest number such that every ball in X can be covered by λ balls of half its radius, where all balls are centered at points of X.
Then the doubling dimension of X is defined by ddim(X) := log₂ λ.
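To make the definition concrete, the following sketch upper-bounds the doubling constant witnessed by a single ball of a finite metric space, by greedily covering it with half-radius balls (an illustration with names of our choosing; greedy covering only upper-bounds the true minimum λ):

```python
def greedy_half_radius_cover(points, center, r, rho):
    """Greedily cover B(center, r) (restricted to the finite point set) with
    balls of radius r/2 centered at points; the number of balls returned is
    an upper bound on the minimum cover size, hence on the doubling constant
    witnessed by this particular ball."""
    to_cover = {p for p in points if rho(p, center) < r}
    centers = []
    while to_cover:
        # pick the center whose half-radius ball covers the most remaining points
        best = max(points, key=lambda c: sum(1 for p in to_cover if rho(p, c) < r / 2))
        centers.append(best)
        to_cover = {p for p in to_cover if rho(p, best) >= r / 2}
    return centers

# On the real line (doubling dimension 1), a radius-8 ball around 8 is
# covered by a handful of radius-4 balls:
pts = list(range(16))
rho = lambda a, b: abs(a - b)
cover = greedy_half_radius_cover(pts, 8, 8, rho)
assert len(cover) <= 4
```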
For an integer k, let [k] := {1, …, k}. Denote the set of all index vectors of length k over the sample indices by I_k := [n]^k. Given a labeled set S_n and any i = (i_1, …, i_k) ∈ I_k, denote the subsample of S_n indexed by i by S_i. Similarly, for a vector of labels ȳ ∈ Y^k, denote by S_{i,ȳ} the subsample of S_n as determined by i, where the labels are replaced with those in ȳ. Lastly, for an index vector j ∈ I_k, we denote by Y_j := (Y_{j_1}, …, Y_{j_k}) the corresponding vector of sample labels.
3 1-NN majority-based compression
In this work we consider the 1-NN majority-based compression algorithm proposed in [25], which we refer to as KSU. This algorithm is based on constructing γ-nets at different scales: for γ > 0 and A ⊆ X, a set N ⊆ A is said to be a γ-net of A if every point of A is within distance γ of N and the points of N are pairwise at least γ apart.³

³ For technical reasons, having to do with the construction in Sec. 6, we depart slightly from the standard definition of a γ-net. The classic definition requires (i) that every point of A be within distance at most γ of N and (ii) that distinct net points be at distance greater than γ; in our definition, the strict and non-strict inequalities in (i) and (ii) are exchanged.
The algorithm (see Alg. 1) operates as follows. Given an input sample S_n, whose set of points is denoted X(S_n), KSU considers all possible scales γ > 0. For each such scale it constructs a γ-net of X(S_n). Denote this net by X_{i(γ)}, where k(γ) denotes its size and i(γ) denotes the indices selected from [n] for this net. For every such net, the algorithm attaches the labels ȳ(γ), which are the empirical majority-vote labels in the respective Voronoi cells of the induced partition. Formally, for j ∈ [k(γ)],
ȳ_j(γ) := argmax_{y ∈ Y} |{ l ∈ [n] : X_l is in the j-th Voronoi cell and Y_l = y }|,    (1)
where ties are broken arbitrarily. This procedure creates a labeled set for every relevant scale γ. The algorithm then selects a single scale, denoted γ*, and outputs the corresponding labeled net. The scale is selected so as to minimize a generalization error bound, which upper-bounds the true error with high probability and can be derived using a compression-based analysis, as described below.
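The whole selection loop can be condensed into a short sketch (our own illustrative code, not the authors' implementation; the penalty term below is a stand-in for the compression-based bound discussed next, and the geometric grid of scales replaces the full set of interpoint distances):

```python
import math
from collections import Counter

def greedy_gamma_net(points, gamma, rho):
    """Greedy gamma-net: every point lies within gamma of a net point,
    and net points are pairwise more than gamma apart."""
    net = []
    for p in points:
        if all(rho(p, q) > gamma for q in net):
            net.append(p)
    return net

def ksu_sketch(sample, rho, delta=0.05):
    """For each candidate scale gamma: build a gamma-net, attach empirical
    majority-vote labels to the net points' Voronoi cells, and score the
    induced 1-NN classifier by empirical error + penalty. Return the
    labeled net minimizing the score."""
    n = len(sample)
    points = [x for x, _ in sample]
    dmax = max(rho(p, q) for p in points for q in points)
    scales = [dmax / 2 ** i for i in range(1, 8)]  # geometric grid for brevity
    best_score, best_net = None, None
    for gamma in scales:
        net = greedy_gamma_net(points, gamma, rho)
        votes = [Counter() for _ in net]
        for x, y in sample:  # empirical majority vote within each Voronoi cell
            j = min(range(len(net)), key=lambda i: rho(x, net[i]))
            votes[j][y] += 1
        labeled = [(net[j], votes[j].most_common(1)[0][0]) for j in range(len(net))]
        err = sum(1 for x, y in sample
                  if min(labeled, key=lambda cl: rho(x, cl[0]))[1] != y) / n
        # stand-in penalty, increasing in the compression size |net|
        penalty = math.sqrt((len(net) * math.log(n + 1) + math.log(1 / delta)) / n)
        if best_score is None or err + penalty < best_score:
            best_score, best_net = err + penalty, labeled
    return best_net

# usage: two well-separated clusters on the line are compressed to two points
rho_line = lambda a, b: abs(a - b)
sample = ([(i / 10, "a") for i in range(10)]
          + [(5 + i / 10, "b") for i in range(10)])
net = ksu_sketch(sample, rho_line)
```

At coarse scales the net is tiny and the penalty small; at fine scales the net approaches the full sample and the penalty dominates, so the minimizing scale trades the two off exactly as in the text.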
We say that a mapping S_n ↦ ψ(S_n) admits a compression scheme if there is a reconstruction function φ, from subsamples to subsets of X × Y, such that for every S_n there exist index vectors i and j such that ψ(S_n) = φ(S_{i, Y_j}). Given a compression scheme and a matching function φ, we say that a specific ψ(S_n) is an m-compression of a given S_n if ψ(S_n) = φ(S_{i, Y_j}) for some i and j with at most m indices in total. The generalization power of compression was recognized by [17] and [22]. Specifically, it was shown in [21, Theorem 8] that if the mapping admits a compression scheme, then with probability at least 1 − δ, for any output which is an m-compression of S_n, we have (omitting the constants, explicitly provided therein, which do not affect our analysis)
err(ψ(S_n)) ≤ êrr(ψ(S_n), S_n) + O( √( êrr(ψ(S_n), S_n) · (m log n + log(1/δ)) / n ) + (m log n + log(1/δ)) / n ),    (2)

where m denotes the compression size.
Defining the bound used by KSU as the RHS of Eq. (2) provides KSU with a compression bound. The following proposition shows that the output of KSU admits a compression scheme, which enables us to use Eq. (2) with the appropriate substitution.⁴

⁴ In [25] the analysis was based on compression with side information, and does not extend to infinite label spaces.
Proposition 1.
The mapping defined by Alg. 1 admits a compression scheme whose output is a 2k(γ*)-compression of S_n, where k(γ*) is the size of the selected net.
Proof.
Define the function φ by φ(S) := S, and observe that for all S_n we have ψ(S_n) = φ(S_{i(γ*), Y_j}), where i(γ*) is the net index vector as defined above, and j is some index vector such that, for every coordinate t, Y_{j_t} equals the majority-vote label attached to the t-th net point. Since each attached label is an empirical majority vote over labels occurring in the sample, such a j clearly exists. Under this scheme, the output of this algorithm is a 2k(γ*)-compression. ∎
KSU is efficient, for any countable label space Y. Indeed, Alg. 1 has a naive polynomial runtime complexity, since O(n²) candidate values of γ (the distinct interpoint distances) are considered and a γ-net is constructed for each one in polynomial time (see [20, Algorithm 1]). Improved runtimes can be obtained, e.g., using the methods in [28, 18]. In this work we focus on the Bayes-consistency of KSU, rather than on optimizing its computational complexity. Our Bayes-consistency results below hold for KSU whenever the generalization bound satisfies the following properties:

For any integer n and any δ ∈ (0, 1), with probability at least 1 − δ over the i.i.d. random sample S_n, for all candidate outputs and compression sizes m ≤ n: if the output is an m-compression of S_n, then its true error is at most the bound evaluated at its empirical error, m, n, and δ.

The bound is monotonically increasing in the empirical error and in the compression size.

There is a sequence (δ_n)_{n∈ℕ} such that Σ_n δ_n < ∞ and, for every fixed compression size, the penalty term of the bound computed with confidence δ_n vanishes as n → ∞.
The compression bound in Eq. (2) clearly satisfies these properties. Note that Property 3 is satisfied by Eq. (2) using any convergent series Σ_n δ_n < ∞ such that log(1/δ_n) = o(n); in particular, the decay of δ_n cannot be too rapid.
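Since Eq. (2) is stated only up to constants, the following sketch uses a plausible stand-in bound (the name `Q` and its exact constants are our choice) merely to illustrate Properties 2 and 3: monotonicity in the empirical error and compression size, and a summable confidence sequence δ_n whose log(1/δ_n) grows slowly enough:

```python
import math

def Q(n, emp_err, m, delta):
    """Stand-in for the compression bound of Eq. (2): empirical error plus a
    deviation term driven by the compression size m and confidence delta.
    Only its qualitative behavior (Properties 2-3) matters here."""
    dev = (m * math.log(n) + math.log(1 / delta)) / (n - m)
    return emp_err + math.sqrt(emp_err * dev) + dev

# Property 2: monotonically increasing in emp_err and in m.
assert Q(1000, 0.2, 10, 0.05) < Q(1000, 0.3, 10, 0.05)
assert Q(1000, 0.2, 10, 0.05) < Q(1000, 0.2, 20, 0.05)

# Property 3: delta_n = delta / n**2 is summable (so Borel-Cantelli applies),
# while log(1/delta_n) grows only logarithmically, so the penalty still
# vanishes for a fixed compression size m as n grows.
delta = 0.05
partial_sum = sum(delta / n ** 2 for n in range(1, 10 ** 5))
assert partial_sum < delta * math.pi ** 2 / 6
assert Q(10 ** 6, 0.0, 10, delta / 10 ** 12) < Q(10 ** 3, 0.0, 10, delta / 10 ** 6)
```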
4 Main results
In this section we describe our main results. The proofs appear in subsequent sections. First, we show that KSU is Bayes-consistent if the instance space has a finite doubling dimension. This contrasts with classical 1-NN, which is Bayes-consistent only if the distribution is realizable.
Theorem 2.
Let (X, ρ) be a metric space with a finite doubling dimension, let the generalization bound used by KSU satisfy Properties 1–3, and let (δ_n) be as stipulated by Property 3. If the input confidence for input size n is set to δ_n, then the 1-NN classifier calculated by KSU is strongly Bayes-consistent on (X, ρ): its true error converges to the Bayes risk almost surely.
The proof, provided in Sec. 5, closely follows the line of reasoning in [26], where the strong Bayes-consistency of an adaptive margin-regularized 1-NN algorithm was proved, but with several crucial differences. In particular, the generalization bounds used by KSU are purely compression-based, as opposed to the Rademacher-based generalization bounds used in [26]. The former can be much tighter in practice and guarantee Bayes-consistency of KSU even for countably many labels. This, however, requires novel technical arguments, which are discussed in detail in Appendix B.1. Moreover, since the compression-based bounds do not explicitly depend on the number of labels, they can be used even when the label space is infinite, as we do in Theorem 4 below. To underscore the subtle nature of Bayes-consistency, we note that the proof technique given here does not carry over to an earlier algorithm, suggested in [20, Theorem 4], which also uses nets. It is an open question whether the latter is Bayes-consistent.
Next, we study the Bayes-consistency of KSU in infinite dimensions (i.e., when the doubling dimension is infinite) — in particular, in a setting where k-NN was shown by [9] not to be Bayes-consistent. Indeed, a straightforward application of [9, Lemma A.1] yields the following result.
Theorem 3 (Cérou and Guyader [9]).
There exists an infinite-dimensional separable metric space X and a realizable distribution μ̄ over X × Y such that no k-NN learner with k_n → ∞ and k_n/n → 0 as n → ∞ is Bayes-consistent under μ̄. In particular, this holds for any space and realizable distribution that satisfy the following condition: the set A of points labeled 1 satisfies μ(A) > 0 and
lim_{r→0⁺} μ(B(x, r) ∩ A) / μ(B(x, r)) = 0    for μ-almost every x ∈ A.    (3)
Since μ(A) > 0, Eq. (3) constitutes a violation of the Besicovitch covering property. In doubling spaces, the Besicovitch covering theorem precludes such a violation [15]. In contrast, as [34, 35] show, in infinite-dimensional spaces this violation can in fact occur. Moreover, this is not an isolated pathology, as this property is shared by Gaussian Hilbert spaces [44].
At first sight, Eq. (3) might appear to thwart any 1-NN algorithm applied to such a distribution. However, the following result shows that this is not the case: KSU is Bayes-consistent on a distribution with this property.
Theorem 4.
There is a metric space equipped with a realizable distribution for which KSU is weakly Bayes-consistent, while any k-NN classifier necessarily is not.
The proof relies on a classic construction of Preiss [34] which satisfies Eq. (3). We show that the structure of the construction, combined with the packing and covering properties of γ-nets, implies that the majority-vote classifier induced by any γ-net with a sufficiently small γ approaches the Bayes error. To contrast with Theorem 4, we next show that on the same construction, not all majority-vote Voronoi partitions succeed. Indeed, if the packing property of γ-nets is relaxed, partition sequences obstructing Bayes-consistency exist.
Theorem 5.
For the example constructed in Theorem 4, there exists a sequence of Voronoi partitions with a vanishing diameter such that the induced true majority-vote classifiers are not Bayes-consistent.
The above result also stands in contrast to [14, Theorem 21.2], showing that, unlike in finite dimensions, the partitions' vanishing diameter is insufficient to establish consistency in infinite-dimensional spaces. We conclude the main results by posing intriguing open problems.
Open problem 1.
Does there exist a metric probability space on which some NN algorithm is consistent while KSU is not? Does there exist any separable metric space on which KSU fails?
Open problem 2.
5 Bayes-consistency of KSU in finite dimensions
In this section we give a high-level proof sketch of Theorem 2, showing that KSU is strongly Bayes-consistent in finite-dimensional metric spaces. A fully detailed proof is given in Appendix B.
Recall the optimal empirical error and the optimal compression size as computed by KSU. As shown in Proposition 1, the selected subsample is a compression of the input sample. In what follows, "the bound" refers to the compression-based generalization bound minimized by KSU.
To show Bayes-consistency, we start with a standard decomposition of the excess error over the optimal Bayes risk into two terms: the gap between the true error of the selected classifier and the minimized bound, and the gap between that bound and the Bayes risk. We then show that each term decays to zero with probability one. For the first term, Property 1 applied with δ = δ_n, together with the Borel–Cantelli lemma, readily implies that the true error eventually does not exceed the bound, with probability one. The main challenge is showing that the minimized bound converges to the Bayes risk with probability one. We do so in several stages:

Loosely speaking, we first show (Lemma 10) that the Bayes error can be well approximated using 1-NN classifiers defined by the true (as opposed to empirical) majority-vote labels over fine partitions of X. In particular, this holds for any partition induced by a γ-net of X with a sufficiently small γ. This approximation guarantee relies on the fact that in finite-dimensional spaces, the class of continuous functions with compact support is dense in L¹(μ) (Lemma 9).

Fix γ sufficiently small such that any true majority-vote classifier induced by a γ-net has a true error close to the Bayes risk, as guaranteed by stage 1. Since for bounded subsets of finite-dimensional spaces the size of any γ-net is finite, the empirical error of any majority-vote γ-net almost surely converges to its true majority-vote error as the sample size grows. Let n be sufficiently large such that the bound computed by KSU for a sample of size n is a reliable estimate of the true error of the γ-net classifier.

Let γ and n be as in stage 2. Given a sample of size n, recall that KSU selects an optimal scale γ* such that the bound is minimized over all candidate scales. For margins much smaller than γ, which are prone to overfitting, the empirical error is not a reliable estimate of the true error, since compression may not yet have taken place for samples of size n. Nevertheless, these margins are discarded by KSU due to the penalty term in the bound. On the other hand, for nets with margin on the order of γ or larger, which are prone to underfitting, the true error is well estimated by the bound. It follows that KSU selects a scale whose bound is no larger than that of γ, implying that the minimized bound converges to the Bayes risk with probability one.
As one can see, the assumption that X is finite-dimensional plays a major role in the proof. A simple argument shows that the family of continuous functions with compact support is no longer dense in L¹(μ) in infinite-dimensional spaces. In addition, γ-nets of bounded subsets in infinite-dimensional spaces need no longer be finite.
6 On Bayes-consistency of NN algorithms in infinite dimensions
In this section we study the Bayes-consistency properties of 1-NN algorithms on a classic infinite-dimensional construction of Preiss [34], which we describe below in detail. This construction was first introduced as a concrete example showing that in infinite-dimensional spaces the Besicovitch covering theorem [15] can be strongly violated, as manifested in Eq. (3).
Example 1 (Preiss’s construction).
The construction (see Figure 1) defines an infinite-dimensional metric space X and a realizable measure μ̄ over X × Y with the binary label set {0, 1}. It relies on two sequences: a sequence of natural numbers (N_k)_{k∈ℕ} and a sequence of positive reals (r_k)_{k∈ℕ}. The two sequences should satisfy the following:
(4) 
These properties are satisfied, for instance, by a suitable pair of sequences: a rapidly growing (N_k) together with a rapidly decaying (r_k). Let F be the set of all finite sequences x of natural numbers such that x_k ≤ N_k for every coordinate k, and let E be the set of all infinite sequences of natural numbers such that x_k ≤ N_k for every k.
Define the example space X := F ∪ E, and for x ∈ X denote by |x| the length of x, where |x| = ∞ for x ∈ E. The metric ρ over X is defined as follows: for x, y ∈ X, denote by x ∧ y their longest common prefix. Then, for x ≠ y, ρ(x, y)² := Σ_{|x∧y| < k ≤ |x|} r_k² + Σ_{|x∧y| < k ≤ |y|} r_k², and ρ(x, x) := 0.
It can be shown (see [34]) that ρ is indeed a metric; in fact, (X, ρ) embeds isometrically into a Hilbert space.
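The metric can be sketched as follows, under the assumption (consistent with the isometric embedding just mentioned) that each sequence x is mapped to the Hilbert-space point Σ_{k≤|x|} r_k e_{x_{1:k}}, so that coordinates indexed by shared prefixes cancel; the weights `r` below are illustrative, not the paper's exact sequence:

```python
import math

# illustrative weights; the paper's exact sequence (r_k) is not reproduced
# here (index 0 is a placeholder so that r[k] matches coordinate k)
r = [0.0] + [2.0 ** -k for k in range(1, 32)]

def rho(x, y):
    """Distance between two finite sequences under the assumed embedding
    x -> sum_{k <= |x|} r_k * e_{x[:k]}: the squared distance adds r_k**2
    for every coordinate beyond the longest common prefix of x and y."""
    m = 0  # length of the longest common prefix x ^ y
    while m < min(len(x), len(y)) and x[m] == y[m]:
        m += 1
    return math.sqrt(sum(r[k] ** 2 for k in range(m + 1, len(x) + 1))
                     + sum(r[k] ** 2 for k in range(m + 1, len(y) + 1)))

assert rho((1, 2, 3), (1, 2, 3)) == 0.0
# a longer common prefix means a smaller distance
assert rho((1, 2, 3), (1, 2, 4)) < rho((1, 2, 3), (1, 5, 4))
```

Because the distance is inherited from a Hilbert-space embedding, symmetry and the triangle inequality hold automatically in this sketch.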
To define the marginal measure μ over X, let μ_E be the uniform product distribution over E: for x ∈ E, each coordinate x_k in the sequence is drawn independently and uniformly from {1, …, N_k}. Let μ_F be an atomic measure on F assigning equal mass to all finite sequences of the same length; the first condition in Eq. (4) ensures that μ_F is finite. The marginal probability measure μ over X is a normalized mixture of μ_E and μ_F.
In words, an infinite sequence is drawn with some fixed positive probability (and all such sequences are equally likely), or else a finite sequence is drawn (and all finite sequences of the same length are equally likely). Define the realizable distribution μ̄ over X × {0, 1} by setting the marginal over X to μ, and by setting the label of every x ∈ E to 1 and the label of every x ∈ F to 0.
As shown in [34], this construction satisfies Eq. (3) with A = E. It follows from Theorem 3 that no k-NN algorithm is Bayes-consistent on it. In contrast, the following theorem shows that KSU is weakly Bayes-consistent on this distribution. Theorem 4 immediately follows from this result.
Theorem 6.
Assume X and μ̄ as in Example 1. KSU is weakly Bayes-consistent on μ̄.
The proof, provided in Appendix C, first characterizes the Voronoi cells for which the true majority vote yields a significant error within the cell (Lemma 15). In finite-dimensional spaces, the total measure of all such "bad" cells can be made arbitrarily close to zero by taking γ to be sufficiently small, as shown in Lemma 10 in the proof of Theorem 2. However, it is not immediately clear whether this can be achieved for the infinite-dimensional construction above.
Indeed, we expect such bad cells to exist, due to the unintuitive property that, for any point of E, the measure of every sufficiently small ball around it is dominated by the atoms of F, and yet μ(E) > 0. Thus, if for example a significant portion of the set E (whose label is 1) is covered by Voronoi cells centered at points of F, then for all sufficiently small γ, each one of these cells will have a true majority vote of 0; a significant portion of E would then be misclassified. However, we show that, by the structure of the construction, combined with the packing and covering properties of γ-nets, in any γ-net the total measure of all these "bad" cells goes to 0 as γ → 0, thus yielding a consistent classifier.
Lastly, the following theorem shows that on the same construction, when the Voronoi partitions are allowed to violate the packing property of γ-nets, Bayes-consistency does not necessarily hold. Theorem 5 immediately follows from the following result.
Theorem 7.
Assume X, ρ, and μ̄ as in Example 1. There exists a sequence of Voronoi partitions of X with diameters tending to zero such that the sequence of true majority-vote classifiers induced by these partitions is not Bayes-consistent: their true errors do not converge to the Bayes risk.
The proof, provided in Appendix D, constructs a sequence of Voronoi partitions where each partition has all of its impure Voronoi cells (those containing both 0- and 1-labeled points) being bad. In this case, a significant portion of E is incorrectly classified, yielding a significant error. Thus, in infinite-dimensional metric spaces, the shape of the Voronoi cells plays a fundamental role in the consistency of the partition.
Acknowledgments.
We thank Frédéric Cérou for the numerous fruitful discussions and helpful feedback on an earlier draft. Aryeh Kontorovich was supported in part by the Israel Science Foundation (grant No. 755/15), Paypal and IBM. Sivan Sabato was supported in part by the Israel Science Foundation (grant No. 555/15).
References
 [1] Christophe Abraham, Gérard Biau, and Benoît Cadre. On the kernel rule for function classification. Ann. Inst. Statist. Math., 58(3):619–633, 2006.
 [2] Daniel Berend and Aryeh Kontorovich. The missing mass problem. Statistics & Probability Letters, 82(6):1102–1110, 2012.
 [3] Daniel Berend and Aryeh Kontorovich. On the concentration of the missing mass. Electronic Communications in Probability, 18(3):1–7, 2013.
 [4] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 97–104, New York, NY, USA, 2006. ACM.
 [5] Gérard Biau, Florentina Bunea, and Marten H. Wegkamp. Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory, 51(6):2163–2172, 2005.
 [6] Gérard Biau, Frédéric Cérou, and Arnaud Guyader. Rates of convergence of the functional nearest neighbor estimate. IEEE Trans. Inform. Theory, 56(4):2034–2040, 2010.
 [7] V. I. Bogachev. Measure theory. Vol. I, II. SpringerVerlag, Berlin, 2007.
 [8] Oren Boiman, Eli Shechtman, and Michal Irani. In defense of nearestneighbor based image classification. In CVPR, 2008.
 [9] Frédéric Cérou and Arnaud Guyader. Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10:340–355, 2006.
 [10] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In NIPS, 2014.
 [11] Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
 [12] Luc Devroye. On the inequality of Cover and Hart in nearest neighbor discrimination. IEEE Trans. Pattern Anal. Mach. Intell., 3(1):75–78, 1981.
 [13] Luc Devroye and László Györfi. Nonparametric density estimation: the L1 view. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York, 1985.
 [14] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.
 [15] Herbert Federer. Geometric measure theory. Die Grundlehren der mathematischen Wissenschaften, Band 153. SpringerVerlag New York Inc., New York, 1969.
 [16] Evelyn Fix and J. L. Hodges, Jr. Discriminatory analysis. Nonparametric discrimination: consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247, 1989.
 [17] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the VapnikChervonenkis dimension. Machine learning, 21(3):269–304, 1995.
 [18] LeeAd Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014.
 [19] LeeAd Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105–118, 2016.
 [20] LeeAd Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearoptimal sample compression for nearest neighbors. In Neural Information Processing Systems (NIPS), 2014.
 [21] LeeAd Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearly optimal classification for semimetrics (extended abstract AISTATS 2016). Journal of Machine Learning Research, 2017.
 [22] Thore Graepel, Ralf Herbrich, and John ShaweTaylor. PACBayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1):55–76, 2005.
[23] Peter Hall and Kee-Hoon Kang. Bandwidth choice for nonparametric classification. Ann. Statist., 33(1):284–306, 2005.
[24] Olav Kallenberg. Foundations of modern probability. Second edition. Probability and its Applications. Springer-Verlag, 2002.
[25] Aryeh Kontorovich, Sivan Sabato, and Ruth Urner. Active nearest-neighbor learning in metric spaces. In Advances in Neural Information Processing Systems, pages 856–864, 2016.
[26] Aryeh Kontorovich and Roi Weiss. A Bayes consistent 1-NN classifier. In Artificial Intelligence and Statistics (AISTATS 2015), 2015.
 [27] Aryeh Kontorovich and Roi Weiss. Maximum margin multiclass nearest neighbors. In International Conference on Machine Learning (ICML 2014), 2014.
[28] Robert Krauthgamer and James R. Lee. Navigating nets: Simple algorithms for proximity search. In 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 791–801, January 2004.
 [29] Sanjeev R. Kulkarni and Steven E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Trans. Inform. Theory, 41(4):1028–1039, 1995.
 [30] Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. unpublished, 1986.
[31] James R. Munkres. Topology: a first course. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1975.
[32] Vladimir Pestov. On the geometry of similarity search: dimensionality curse and concentration of measure. Inform. Process. Lett., 73(1–2):47–51, 2000.
 [33] Vladimir Pestov. Is the NN classifier in high dimensions affected by the curse of dimensionality? Comput. Math. Appl., 65(10):1427–1437, 2013.
 [34] David Preiss. Invalid Vitali theorems. Abstracta. 7th Winter School on Abstract Analysis, pages 58–60, 1979.
 [35] David Preiss. Gaussian measures and the density theorem. Comment. Math. Univ. Carolin., 22(1):181–193, 1981.
 [36] Demetri Psaltis, Robert R. Snapp, and Santosh S. Venkatesh. On the finite sample performance of the nearest neighbor classifier. IEEE Transactions on Information Theory, 40(3):820–837, 1994.
[37] Walter Rudin. Principles of mathematical analysis. McGraw-Hill Book Co., New York, third edition, 1976. International Series in Pure and Applied Mathematics.
[38] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1987.
[39] Richard J. Samworth. Optimal weighted nearest neighbour classifiers. Ann. Statist., 40(5):2733–2763, 2012.
[40] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[41] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
 [42] Robert R. Snapp and Santosh S. Venkatesh. Asymptotic expansions of the nearest neighbor risk. Ann. Statist., 26(3):850–878, 1998.
 [43] Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.
 [44] Jaroslav Tišer. Vitali covering theorem in Hilbert space. Trans. Amer. Math. Soc., 355(8):3277–3289, 2003.
 [45] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
 [46] Lin Cheng Zhao. Exponential bounds of mean error for the nearest neighbor estimates of regression functions. J. Multivariate Anal., 21(1):168–178, 1987.
Appendix A Background on metric measure spaces
Here we provide some relevant general background on metric measure spaces. Our metric space $(\mathcal{X},\rho)$ is doubling, but in this section finite diameter is not assumed. We recall some standard definitions. A topological space is Hausdorff if every two distinct points have disjoint neighborhoods. It is a standard (and obvious) fact that all metric spaces are Hausdorff.
A metric space $(\mathcal{X},\rho)$ is complete if every Cauchy sequence converges to a point in $\mathcal{X}$. Every metric space may be completed by (essentially) adjoining to it the limits of all of its Cauchy sequences [37, Exercise 3.24]; moreover, the completion is unique up to isometry [31, Section 43, Exercise 10]. We implicitly assume throughout the paper that $(\mathcal{X},\rho)$ is complete. Closed subsets of complete metric spaces are also complete metric spaces under the inherited metric.
A topological space is locally compact if every point has a compact neighborhood. It is a standard and easy fact that complete doubling spaces are locally compact. Indeed, consider any $x\in\mathcal{X}$, any $r>0$, and the open ball about $x$, $B_r(x)=\{x'\in\mathcal{X}:\rho(x,x')<r\}$. We must show that the closure $\bar B_r(x)$ of $B_r(x)$ is compact. To this end, it suffices to show that $B_r(x)$ is totally bounded (that is, has a finite $\varepsilon$-covering number for each $\varepsilon>0$), since in complete metric spaces, a set is compact iff it is closed and totally bounded [31, Theorem 45.1]. Total boundedness follows immediately from the doubling property. The latter posits a constant $\lambda$ and some $x_1,\ldots,x_\lambda\in\mathcal{X}$ such that $B_r(x)\subseteq\bigcup_{i=1}^{\lambda}B_{r/2}(x_i)$. Then certainly $\bar B_r(x)\subseteq\bigcup_{i=1}^{\lambda}\bar B_{r/2}(x_i)$. We now apply the doubling property recursively to each of the $B_{r/2}(x_i)$, until the radius of the covering balls becomes smaller than $\varepsilon$.
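To make the covering-number idea concrete, here is a small Python sketch (our own illustration; all names and parameters are assumptions, not from the paper) that greedily extracts a finite $\varepsilon$-net of a planar point set, witnessing total boundedness in the same spirit as the recursive doubling argument above:

```python
import math
import random

def greedy_eps_net(points, eps, dist):
    """Greedily pick centers so every point lies within eps of some center.

    This produces a finite eps-covering of the point set, the property that
    the recursive doubling argument guarantees for balls in a doubling space.
    """
    centers = []
    for p in points:
        # Keep p as a new center only if no existing center already covers it.
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# Points in the unit square under the Euclidean metric (a doubling space).
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(500)]
d = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])

net = greedy_eps_net(pts, 0.2, d)
# Every point is within eps of some center, so the net is an eps-cover.
assert all(min(d(p, c) for c in net) <= 0.2 for p in pts)
```

By construction the chosen centers are pairwise more than $\varepsilon$ apart, so their number is also bounded in terms of the doubling constant.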
We now recall some standard facts from measure theory. Any topology on $\mathcal{X}$ (and in particular, the one induced by the metric $\rho$) induces the Borel $\sigma$-algebra $\mathcal{B}$. A Borel probability measure is a function $\mu:\mathcal{B}\to[0,1]$ that is countably additive and normalized by $\mu(\mathcal{X})=1$. The latter is complete if for all $A\subseteq A'\in\mathcal{B}$ for which $\mu(A')=0$, we also have $A\in\mathcal{B}$ (and hence $\mu(A)=0$). Any Borel measure may be completed by defining the measure of any subset of a measure-zero set to be zero [38, Theorem 1.36]. We implicitly assume throughout the paper that $(\mathcal{X},\mathcal{F},\mu)$ is a complete measure space, where the $\sigma$-algebra $\mathcal{F}$ contains all of the Borel sets.
The measure $\mu$ is said to be outer regular if it can be approximated from above by open sets: for every $A\in\mathcal{F}$, we have
$$\mu(A) = \inf\{\mu(U) : A\subseteq U,\ U\ \text{open}\}.$$
A corresponding inner regularity corresponds to approximability from below by compact sets: for every $A\in\mathcal{F}$,
$$\mu(A) = \sup\{\mu(K) : K\subseteq A,\ K\ \text{compact}\}.$$
The measure $\mu$ is regular if it is both inner and outer regular. Any probability measure defined on the Borel $\sigma$-algebra of a metric space is regular [24, Lemma 1.19]. (Dropping the “metric” or “probability” assumptions opens the door to various exotic pathologies [7, Chapter 7], [38, Exercise 2.17].)
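As a small numerical illustration of outer regularity (the base space, the measure, and all names are our own toy choices, not the paper's), the uniform measure of a closed interval is approached from above by the measures of shrinking open supersets:

```python
# Outer regularity for the uniform measure on [0, 3]: mu([1, 2]) is the
# infimum of mu over open supersets (1 - eps, 2 + eps).

def uniform_measure(lo, hi):
    """Normalized length of the interval (lo, hi) clipped to the base space [0, 3]."""
    lo, hi = max(lo, 0.0), min(hi, 3.0)
    return max(hi - lo, 0.0) / 3.0

target = uniform_measure(1.0, 2.0)  # mu([1, 2]) = 1/3
outer = [uniform_measure(1.0 - e, 2.0 + e) for e in (0.5, 0.1, 0.001)]

assert all(o >= target for o in outer)   # open supersets can only overestimate
assert min(outer) - target < 1e-2        # ...but come arbitrarily close
```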
Finally, we have the following technical result, adapted from [38, Theorem 3.14] to our setting:
Theorem 8.
Let $(\mathcal{X},\rho)$ be a complete doubling metric space equipped with a complete probability measure $\mu$, such that all Borel sets are measurable. Then $C_c(\mathcal{X})$ (the collection of continuous functions with compact support) is dense in $L_1(\mu)$.
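To make the density claim of Theorem 8 concrete, the following toy sketch (our own illustration; the measure, step sizes, and function names are assumptions, not from the paper) approximates the indicator of an interval, a generic $L_1$ function, by continuous compactly supported trapezoids and checks that the $L_1$ error shrinks with the ramp width:

```python
# Indicator of [0, 1] approximated in L1 (Lebesgue measure on [-1, 2])
# by continuous trapezoid functions with ramps of width delta.

def indicator(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def trapezoid(x, delta):
    """Continuous, supported on [-delta, 1 + delta], equal to 1 on [0, 1]."""
    if x < -delta or x > 1.0 + delta:
        return 0.0
    if 0.0 <= x <= 1.0:
        return 1.0
    return 1.0 - min(abs(x), abs(x - 1.0)) / delta  # linear ramps at each end

def l1_distance(f, g, lo, hi, n=60000):
    # Midpoint Riemann sum for the L1 norm of f - g on [lo, hi].
    h = (hi - lo) / n
    return sum(abs(f(lo + (i + 0.5) * h) - g(lo + (i + 0.5) * h))
               for i in range(n)) * h

# The exact L1 error equals the area of the two ramp triangles: delta in total.
for delta in (0.1, 0.01):
    err = l1_distance(indicator, lambda x: trapezoid(x, delta), -1.0, 2.0)
    assert err <= delta + 1e-3
```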
Appendix B Bayes-consistency proof of KSU in finite dimensions
In this section we prove Theorem 2 in full detail. Let $(\mathcal{X},\rho)$ be a metric space with doubling dimension $d$. Given a sample $S_n$, we write $\hat\varepsilon_n$ for the optimal empirical error and $\hat m_n$ for the optimal compression size as computed by KSU. As shown in Sec. 3, the labeled set computed by KSU is an $(\hat\varepsilon_n,\hat m_n)$-compression of the sample $S_n$. For brevity, we denote by $\hat h_n$ the classifier computed by KSU and by $Q_n$ the corresponding value of the generalization bound.
To prove Theorem 2 we decompose the excess error over the Bayes error $R^*$ into two terms:
$$R(\hat h_n) - R^* = \big(R(\hat h_n) - Q_n\big) + \big(Q_n - R^*\big),$$
and show that each term decays to zero with probability one.
For the first term, $R(\hat h_n) - Q_n$, from Property 1 of the generalization bound $Q$, we have that for any $n$,
$$\Pr\big(R(\hat h_n) > Q_n\big) \le \frac{1}{n^2}. \qquad (5)$$
Since $\sum_{n\ge 1} n^{-2} < \infty$, the Borel–Cantelli lemma implies that $R(\hat h_n) \le Q_n$ for all but finitely many $n$ with probability $1$.
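The Borel–Cantelli step uses only that the per-round failure probabilities are summable. A quick numerical check (the particular sequence $1/n^2$ is an assumption of this sketch, as are all names):

```python
from math import pi

# With failure probabilities delta_n = 1/n^2, the partial sums are bounded
# by pi^2/6, so the expected number of "bad" rounds is finite and only
# finitely many occur almost surely (Borel-Cantelli).
partial = 0.0
bounds = []
for n in range(1, 100001):
    partial += 1.0 / n**2
    if n in (10, 1000, 100000):
        bounds.append(partial)

assert all(b < pi**2 / 6 for b in bounds)  # partial sums stay below the limit
assert pi**2 / 6 - bounds[-1] < 1e-4       # and converge to it (tail ~ 1/N)
```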
The main challenge is to prove that $Q_n - R^* \to 0$ with probability one. We begin by showing that the Bayes error $R^*$ can be approached using classifiers defined by the true majority-vote labeling over fine partitions of $\mathcal{X}$. Formally, let $\mathcal{P}$ be a finite partition of $\mathcal{X}$, and define the function $\pi:\mathcal{X}\to\mathcal{P}$ such that $\pi(x)$ is the unique $P\in\mathcal{P}$ for which $x\in P$. For any measurable set $B\subseteq\mathcal{X}$ define the true majority-vote label by
$$y^*(B) \in \mathop{\mathrm{argmax}}_{y\in\mathcal{Y}} \Pr(Y = y \mid X \in B), \qquad (6)$$
where ties are broken lexicographically. To ensure that $y^*$ is always well-defined, when $\mu(B)=0$ we arbitrarily define it to be the lexicographically first $y\in\mathcal{Y}$. Given $\mathcal{P}$ and a measurable set $A\subseteq\mathcal{X}$, consider the true majority-vote classifier $h_{\mathcal{P},A}:\mathcal{X}\to\mathcal{Y}$ given by
$$h_{\mathcal{P},A}(x) = y^*\big(\pi(x)\cap A\big). \qquad (7)$$
Note that if $x\notin A$, this classifier attaches a label to $x$ based on the true majority vote in a set that does not contain $x$. To bound the error of $h_{\mathcal{P},A}$ for any conditional distribution of labels, we use the fact that on doubling metric spaces, continuous functions are dense in $L_1(\mu)$.
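As a toy finite-sample analogue of the majority-vote construction (the grid partition and all names are our own illustration, not the paper's): labels are aggregated per partition cell, with lexicographic tie-breaking as in (6), and the resulting classifier is exact once the partition refines the class boundary:

```python
from collections import Counter

def cell(x, width):
    """Index of the regular-grid partition cell of the given width containing x."""
    return int(x // width)

def majority_vote_labels(samples, width):
    """For each grid cell, the most frequent label among the samples in it,
    with ties broken lexicographically (smallest label wins)."""
    votes = {}
    for x, y in samples:
        votes.setdefault(cell(x, width), Counter())[y] += 1
    return {c: min(lbl for lbl, cnt in ctr.items() if cnt == max(ctr.values()))
            for c, ctr in votes.items()}

# Labels are 'a' on [0, 1) and 'b' on [1, 2); with cell width 0.5 the
# partition refines the boundary, so majority vote recovers the labeling.
data = [(i / 100.0, 'a' if i < 100 else 'b') for i in range(200)]
labels = majority_vote_labels(data, 0.5)
assert [labels[c] for c in sorted(labels)] == ['a', 'a', 'b', 'b']
```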
Lemma 9.
For every probability measure $\mu$ on a doubling metric space $(\mathcal{X},\rho)$, the set of continuous functions with compact support is dense in $L_1(\mu)$. Namely, for any $f\in L_1(\mu)$ and any $\varepsilon>0$ there is a continuous function $g$ with compact support such that $\|f-g\|_{L_1(\mu)} \le \varepsilon$.
We have the following uniform approximation bound for the error of classifiers of the form (7), essentially extending the approximation analysis done in the proof of [14, Theorem 21.2] for the special case $\mathcal{X}=\mathbb{R}^d$ and $\mathcal{Y}=\{0,1\}$ to the more general multiclass problem in doubling metric spaces.
Lemma 10.
Let $\mu$ be a probability measure on $(\mathcal{X},\rho)$, where $(\mathcal{X},\rho)$ is a doubling metric space. For any $\varepsilon>0$, there exists a diameter $\Delta>0$ such that for any finite measurable partition $\mathcal{P}$ of $\mathcal{X}$ and any measurable set $A\subseteq\mathcal{X}$ satisfying
(i) $\mu(\mathcal{X}\setminus A) \le \Delta$,
(ii) $\mathrm{diam}(P) \le \Delta$ for every $P\in\mathcal{P}$,
the true majority-vote classifier $h_{\mathcal{P},A}$ defined in (7) satisfies $R(h_{\mathcal{P},A}) \le R^* + \varepsilon$.
Proof.
Let $\eta_y:\mathcal{X}\to[0,1]$ be the conditional probability function for label $y\in\mathcal{Y}$,
$$\eta_y(x) = \Pr(Y = y \mid X = x).$$
Define $\bar\eta_y$ as $\eta_y$'s conditional expectation function with respect to $\pi$ and $A$,
$$\bar\eta_y(x) = \mathbb{E}\big[\eta_y(X) \mid X \in \pi(x)\cap A\big].$$
(And, say, $\bar\eta_y(x) = 0$ for $x$ with $\mu(\pi(x)\cap A) = 0$.) Note that the functions $\bar\eta_y$ are piecewise constant on the cells of the restricted partition $\{P\cap A : P\in\mathcal{P}\}$. By definition, the Bayes classifier $h^*$ and the true majority-vote classifier $h_{\mathcal{P},A}$ satisfy
$$h^*(x) \in \mathop{\mathrm{argmax}}_{y\in\mathcal{Y}} \eta_y(x) \qquad\text{and}\qquad h_{\mathcal{P},A}(x) \in \mathop{\mathrm{argmax}}_{y\in\mathcal{Y}} \bar\eta_y(x).$$
It follows that
$$R(h_{\mathcal{P},A}) - R^* \le \sum_{y\in\mathcal{Y}} \mathbb{E}\,\big|\eta_y(X) - \bar\eta_y(X)\big|.$$
By condition (i) in the lemma statement, $\mu(\mathcal{X}\setminus A) \le \Delta$. Thus,
$$\sum_{y\in\mathcal{Y}} \mathbb{E}\,\big|\eta_y(X) - \bar\eta_y(X)\big| \le \sum_{y\in\mathcal{Y}} \mathbb{E}\Big[\big|\eta_y(X) - \bar\eta_y(X)\big|\,\mathbb{1}\{X\in A\}\Big] + 2\Delta.$$
Let $\mathcal{Y}'\subseteq\mathcal{Y}$ be a finite set of labels such that $\Pr(Y \notin \mathcal{Y}') \le \Delta$. Then