Sample Compression for Real-Valued Learners
Abstract
We give an algorithmically efficient version of the learner-to-compression scheme conversion in Moran and Yehudayoff (2016). In extending this technique to real-valued hypotheses, we also obtain an efficient regression-to-bounded sample compression converter. To our knowledge, this is the first general compressed regression result (regardless of efficiency or boundedness) guaranteeing uniform approximate reconstruction. Along the way, we develop a generic procedure for constructing weak real-valued learners out of abstract regressors; this may be of independent interest. In particular, this result sheds new light on an open question of H. Simon (1997). We show applications to two regression problems: learning Lipschitz and bounded-variation functions.
Steve Hanneke, Princeton, NJ, steve.hanneke@gmail.com; Aryeh Kontorovich, Ben-Gurion University, karyeh@cs.bgu.ac.il; Menachem Sadigurschi, Ben-Gurion University, sadigurs@post.bgu.ac.il
1 Introduction
Sample compression is a natural learning strategy, whereby the learner seeks to retain a small subset of the training examples, which (if successful) may then be decoded as a hypothesis with low empirical error. Overfitting is controlled by the size of this learner-selected “compression set”. Part of a more general Occam learning paradigm, such results are commonly summarized by “compression implies learning”. A fundamental question, posed by Littlestone and Warmuth (1986), concerns the reverse implication: Can every learner be converted into a sample compression scheme? Or, in a more quantitative formulation: Does every VC class admit a constant-size sample compression scheme? A series of partial results (Floyd, 1989; Helmbold et al., 1992; Floyd and Warmuth, 1995; Ben-David and Litman, 1998; Kuzmin and Warmuth, 2007; Rubinstein et al., 2009; Rubinstein and Rubinstein, 2012; Chernikov and Simon, 2013; Livni and Simon, 2013; Moran et al., 2017) culminated in Moran and Yehudayoff (2016), which resolved the latter question.^1

^1 The refined conjecture of Littlestone and Warmuth (1986), that any concept class with VC-dimension $d$ admits a compression scheme of size $O(d)$, remains open.
Moran and Yehudayoff’s solution involved a clever use of von Neumann’s minimax theorem, which allows one to make the leap from the existence of a weak learner uniformly over all distributions on examples to the existence of a distribution on weak hypotheses under which they achieve a certain performance simultaneously over all of the examples. Although their paper can be understood without any knowledge of boosting, Moran and Yehudayoff note the well-known connection between boosting and compression. Indeed, boosting may be used to obtain a constructive proof of the minimax theorem (Freund and Schapire, 1996, 1999), and this connection is what motivated us to seek an efficient algorithm implementing Moran and Yehudayoff’s existence proof. Having obtained an efficient conversion procedure from consistent PAC learners to bounded-size sample compression schemes, we turned our attention to the case of real-valued hypotheses. It turned out that a virtually identical boosting framework could be made to work for this case as well, although a novel analysis was required.
Our contribution.
Our point of departure is the simple but powerful observation (Schapire and Freund, 2012) that many boosting algorithms (e.g., AdaBoost, $\alpha$-Boost) are capable of outputting a family of hypotheses such that not only does their (weighted) majority vote yield a sample-consistent classifier, but in fact a supermajority does as well. This fact implies that after boosting, we can subsample a constant (i.e., independent of sample size $m$) number of classifiers and thereby efficiently recover the sample compression bounds of Moran and Yehudayoff (2016).
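As a toy illustration of this subsampling step (hypothetical code, not the paper's procedure): if every example is classified correctly by a clear supermajority of the ensemble, then the majority vote of a small random subsample of the classifiers remains correct on every example with high probability.

```python
import random

def majority(votes):
    # votes: list of +1/-1 ballots; returns the majority label (+1 on ties)
    return 1 if sum(votes) >= 0 else -1

def subsample_majority(ensemble_votes, k, rng):
    """ensemble_votes[j] lists one +1/-1 vote per classifier for example j.
    Returns the majority predictions of k classifiers drawn with replacement."""
    n = len(ensemble_votes[0])
    idx = [rng.randrange(n) for _ in range(k)]
    return [majority([votes[i] for i in idx]) for votes in ensemble_votes]

rng = random.Random(0)
# 200 examples, 300 classifiers; each example has a 3/4 supermajority of correct (+1) votes
votes = [[1] * 225 + [-1] * 75 for _ in range(200)]
for v in votes:
    rng.shuffle(v)
preds = subsample_majority(votes, k=101, rng=rng)
```

With a 3/4 supermajority and 101 sampled classifiers, a Chernoff-style argument makes an incorrect subsample majority on any single example exceedingly unlikely.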
Our chief technical contribution, however, is in the real-valued case. As we discuss below, extending the boosting framework from classification to regression presents a host of technical challenges, and there is currently no off-the-shelf general-purpose analogue of AdaBoost for real-valued hypotheses. One of our insights is to impose distinct error metrics on the weak and strong learners: a “stronger” one on the latter and a “weaker” one on the former. This allows us to achieve two goals simultaneously:

We give apparently the first generic construction for our weak learner, demonstrating that the object is natural and abundantly available. This is in contrast with many previous proposed weak regressors, whose stringent or exotic definitions made them unwieldy to construct or verify as such. The construction is novel and may be of independent interest.

We show that the output of a certain real-valued boosting algorithm may be sparsified so as to yield a constant-size sample compression analogue of the Moran and Yehudayoff result for classification. This gives the first general constant-size sample compression scheme having uniform approximation guarantees on the data.
2 Definitions and notation
We will write $[k] := \{1, \ldots, k\}$. An instance space is an abstract set $\mathcal{X}$. For a concept class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$, we say that $\mathcal{H}$ shatters a set $\{x_1, \ldots, x_k\} \subseteq \mathcal{X}$ if
$$\{(h(x_1), \ldots, h(x_k)) : h \in \mathcal{H}\} = \{0,1\}^k.$$
The VC-dimension $d = d(\mathcal{H})$ is the size of the largest shattered set (or $\infty$ if $\mathcal{H}$ shatters sets of arbitrary size) (Vapnik and Červonenkis, 1971). When the roles of $\mathcal{X}$ and $\mathcal{H}$ are exchanged (that is, an $x \in \mathcal{X}$ acts on $h \in \mathcal{H}$ via $x(h) = h(x)$), we refer to $\mathcal{X}$ as the dual class of $\mathcal{H}$. Its VC-dimension is then $d^* = d^*(\mathcal{H})$, referred to as the dual VC-dimension. Assouad (1983) showed that $d^* < 2^{d+1}$.
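For finite classes restricted to a finite set of points, shattering can be checked by brute force; the snippet below is purely illustrative (the names are ours, not the paper's).

```python
def shatters(H, xs):
    """Check whether the finite class H (callables X -> {0,1}) shatters xs:
    the restriction of H to xs must realize all 2^|xs| binary patterns."""
    patterns = {tuple(h(x) for x in xs) for h in H}
    return len(patterns) == 2 ** len(xs)

# threshold classifiers x -> 1[x >= t]; this class has VC-dimension 1
H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.5, 1.5, 2.5)]
```

Thresholds shatter any single point but never two (the pattern "left point positive, right point negative" is unrealizable).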
For $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $\gamma > 0$, we say that $\mathcal{F}$ $\gamma$-shatters a set $\{x_1, \ldots, x_k\} \subseteq \mathcal{X}$ if the set $\{(f(x_1), \ldots, f(x_k)) : f \in \mathcal{F}\}$ contains the translated cube $\prod_{i=1}^{k} \{r_i - \gamma, r_i + \gamma\}$ for some $r \in \mathbb{R}^k$. The $\gamma$-fat-shattering dimension $d_\gamma = \operatorname{fat}_\gamma(\mathcal{F})$ is the size of the largest $\gamma$-shattered set (possibly $\infty$) (Alon et al., 1997). Again, the roles of $\mathcal{X}$ and $\mathcal{F}$ may be switched, in which case $\mathcal{X}$ becomes the dual class of $\mathcal{F}$. Its $\gamma$-fat-shattering dimension is then $d^*_\gamma$, and Assouad’s argument shows that $d^*_\gamma < 2^{d_\gamma + 1}$.
A sample compression scheme $(\kappa, \rho)$ for a hypothesis class $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$ is defined as follows. A $k$-compression function $\kappa$ maps sequences $((x_1, y_1), \ldots, (x_m, y_m))$ to elements in $\mathcal{K} = \bigcup_{\ell \le k'} (\mathcal{X} \times \mathcal{Y})^{\ell} \times \bigcup_{\ell \le k''} \{0,1\}^{\ell}$, where $k' + k'' \le k$. A reconstruction is a function $\rho : \mathcal{K} \to \mathcal{Y}^{\mathcal{X}}$. We say that $(\kappa, \rho)$ is a $k$-size sample compression scheme for $\mathcal{F}$ if $\kappa$ is a $k$-compression and, for all $f \in \mathcal{F}$ and all $S = ((x_1, f(x_1)), \ldots, (x_m, f(x_m)))$, the function $\hat f := \rho(\kappa(S))$ satisfies $\hat f(x_i) = f(x_i)$ for all $i \in [m]$.
For real-valued functions, we say it is a uniformly $\varepsilon$-approximate compression scheme if
$$\max_{1 \le i \le m} \big| \hat f(x_i) - f(x_i) \big| \le \varepsilon.$$
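As a concrete illustration of the definition, threshold classifiers on the real line admit a size-one sample compression scheme. The sketch below uses hypothetical helper names and assumes the sample contains at least one positive label.

```python
def compress(sample):
    """kappa: keep only the leftmost positively labeled point."""
    return [min(((x, y) for x, y in sample if y == 1), key=lambda p: p[0])]

def reconstruct(compressed):
    """rho: decode the kept point back into a threshold classifier."""
    (x0, _), = compressed
    return lambda x: 1 if x >= x0 else 0

sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h_hat = reconstruct(compress(sample))
```

The reconstructed classifier agrees with the labels on the entire sample, exactly as the definition requires.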
3 Main results
Throughout the paper, we implicitly assume that all hypothesis classes are admissible in the sense of satisfying mild measure-theoretic conditions, such as those specified in Dudley (1984, Section 10.3.1) or Pollard (1984, Appendix C). We begin with an algorithmically efficient version of the learner-to-compression scheme conversion in Moran and Yehudayoff (2016):
Theorem 1 (Efficient compression for classification).
Let $\mathcal{H}$ be a concept class over some instance space $\mathcal{X}$ with VC-dimension $d$, dual VC-dimension $d^*$, and suppose that $\mathcal{A}$ is a (proper, consistent) PAC-learner for $\mathcal{H}$: for all $\varepsilon, \delta \in (0, 1/2)$ and all distributions $D$ over $\mathcal{X}$, if $\mathcal{A}$ receives $m(\varepsilon, \delta)$ points drawn iid from $D$ and labeled with some $h \in \mathcal{H}$, then $\mathcal{A}$ outputs an $\hat h \in \mathcal{H}$ such that, with probability at least $1 - \delta$, $\Pr_{x \sim D}\big( \hat h(x) \ne h(x) \big) \le \varepsilon$.
For every such $\mathcal{A}$, there is a randomized sample compression scheme for $\mathcal{H}$ of size $O(k \log k)$, where $k = O(d \, d^*)$. Furthermore, on a sample of any size $n$, the compression set may be computed in expected time
where $T_{\mathcal{A}}(n)$ is the runtime of $\mathcal{A}$ to compute $\hat h$ on a sample of size $n$, $T_{\hat h}(n)$ is the runtime required to evaluate $\hat h$ on a single $x \in \mathcal{X}$, and $c$ is a universal constant.
Although for our purposes the existence of a finite, distribution-free sample complexity $m(\varepsilon, \delta)$ is more important than its concrete form, we may take $m(\varepsilon, \delta) = O\!\left( \frac{d}{\varepsilon} \log \frac{1}{\varepsilon} + \frac{1}{\varepsilon} \log \frac{1}{\delta} \right)$ (Vapnik and Chervonenkis, 1974; Blumer et al., 1989), known to bound the sample complexity of empirical risk minimization; indeed, this loses no generality, as there is a well-known efficient reduction from empirical risk minimization to any proper learner having a polynomial sample complexity (Pitt and Valiant, 1988; Haussler et al., 1991). We allow the evaluation time of $\hat h$ to depend on the size of the training sample in order to account for nonparametric learners, such as nearest-neighbor classifiers. A naive implementation of the Moran and Yehudayoff (2016) existence proof yields a runtime of order $n^{c d^*}$ (for a universal constant $c$), which can be doubly exponential when $d^* = 2^{\Theta(d)}$; this is without taking into account the cost of computing the minimax distribution on the game matrix.
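For concreteness, the sample complexity bound above can be instantiated numerically; the constant `c` below is a placeholder, since the theory specifies the bound only up to a universal constant.

```python
import math

def erm_sample_complexity(d, eps, delta, c=1.0):
    """m(eps, delta) = ceil(c * (d * log(1/eps) + log(1/delta)) / eps).
    The leading constant c is a placeholder, not from the paper."""
    return math.ceil(c * (d * math.log(1 / eps) + math.log(1 / delta)) / eps)
```

As expected, the required sample size grows as the accuracy parameter shrinks.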
Next, we extend the result in Theorem 1 from classification to regression:
Theorem 2 (Efficient compression for regression).
Let $\mathcal{F} \subseteq [0,1]^{\mathcal{X}}$ be a function class with $\gamma$-fat-shattering dimension $d_\gamma$, dual $\gamma$-fat-shattering dimension $d^*_\gamma$, and suppose that $\mathcal{A}$ is an ERM (i.e., proper, consistent) learner for $\mathcal{F}$: for all $f^* \in \mathcal{F}$ and all distributions $D$ over $\mathcal{X}$, if $\mathcal{A}$ receives points drawn iid from $D$ and labeled with $f^*$, then $\mathcal{A}$ outputs an $\hat f \in \mathcal{F}$ such that $\hat f(x_i) = f^*(x_i)$ on every training point. For every such $\mathcal{A}$, there is a randomized uniformly $\varepsilon$-approximate sample compression scheme for $\mathcal{F}$ of size $O(k \log k)$, where $k$ depends only on $\varepsilon$, $d_{c\varepsilon}$, and $d^*_{c\varepsilon}$ for a universal constant $c$. Furthermore, on a sample of any size $n$, the compression set may be computed in expected time
where $T_{\mathcal{A}}(n)$ is the runtime of $\mathcal{A}$ to compute $\hat f$ on a sample of size $n$, $T_{\hat f}(n)$ is the runtime required to evaluate $\hat f$ on a single $x \in \mathcal{X}$, and $c$ is a universal constant.
A key component in the above result is our construction of a generic weak learner.
Definition 3.
For $\varepsilon, \gamma > 0$, we say that $f : \mathcal{X} \to \mathbb{R}$ is an $(\varepsilon, \gamma)$-weak hypothesis (with respect to distribution $D$ and target $f^*$) if
$$\Pr_{x \sim D}\big( |f(x) - f^*(x)| > \gamma \big) < \frac{1}{2} - \varepsilon.$$
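An empirical, weighted-sample analogue of this definition is straightforward to state in code; the function below is an illustrative sketch (the names are ours), not part of the paper's algorithms.

```python
def is_weak_hypothesis(f, f_star, xs, weights, eps, gamma):
    """Empirical analogue of Definition 3: the weighted fraction of points on
    which f misses the target by more than gamma must be below 1/2 - eps."""
    total = sum(weights)
    bad = sum(w for x, w in zip(xs, weights) if abs(f(x) - f_star(x)) > gamma)
    return bad / total < 0.5 - eps

f_star = lambda x: 0.5 * x
f_good = lambda x: 0.5 * x + (0.5 if x in (0, 1) else 0.0)  # off by 0.5 on two points
xs = list(range(10))
w = [1.0] * 10
```

Here `f_good` misses the target by more than `gamma = 0.2` on only 2 of 10 equally weighted points, so it qualifies as a weak hypothesis for small `eps` but not for large `eps`.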
Theorem 4 (Generic weak learner).
Let $\mathcal{F}$ be a function class with $\gamma$-fat-shattering dimension $d_\gamma$. For some universal numerical constants $c_1, c_2, c_3 \in (0, \infty)$, for any $\varepsilon, \delta \in (0, 1)$ and $\gamma > 0$, any $f^* \in \mathcal{F}$, and any distribution $D$, letting $x_1, \ldots, x_m$ be drawn iid from $D$, where
with probability at least $1 - \delta$, every $f \in \mathcal{F}$ with $\max_{i \in [m]} |f(x_i) - f^*(x_i)| = 0$ is an $(\varepsilon, \gamma)$-weak hypothesis with respect to $D$ and $f^*$.
In fact, our results would also allow us to use any hypothesis with $\max_{i \in [m]} |f(x_i) - f^*(x_i)|$ bounded below $\gamma$: for instance, bounded by $\gamma/2$. This can then also be plugged into the construction of the compression scheme, and this criterion can be used in place of consistency in Theorem 2.
4 Related work
It appears that generalization bounds based on sample compression were independently discovered by Littlestone and Warmuth (1986) and Devroye et al. (1996) and further elaborated upon by Graepel et al. (2005); see Floyd and Warmuth (1995) for background and discussion. A more general kind of Occam learning was discussed in Blumer et al. (1989). Computational lower bounds on sample compression were obtained in Gottlieb et al. (2014), and some communication-based lower bounds were given in Kane et al. (2017).
Beginning with Freund and Schapire (1997)’s AdaBoost.R algorithm, there have been numerous attempts to extend AdaBoost to the real-valued case (Bertoni et al., 1997; Drucker, 1997; Avnimelech and Intrator, 1999; Karakoulas and Shawe-Taylor, 2000; Duffy and Helmbold, 2002; Kégl, 2003; Nock and Nielsen, 2007), along with various theoretical and heuristic constructions of particular weak regressors (Mason et al., 1999; Friedman, 2001; Mannor and Meir, 2002); see also the survey of Mendes-Moreira et al. (2012).
Duffy and Helmbold (2002, Remark 2.1) spell out a central technical challenge: no boosting algorithm can “always force the base regressor to output a useful function by simply modifying the distribution over the sample”. This is because, unlike a binary classifier, which localizes errors on specific examples, a real-valued hypothesis can spread its error evenly over the entire sample, and it will not be affected by reweighting. The $(\varepsilon, \gamma)$-weak learner, which has appeared, among other works, in Anthony et al. (1996); Simon (1997); Avnimelech and Intrator (1999); Kégl (2003), gets around this difficulty, but provable general constructions of such learners have been lacking. Likewise, the heart of our sample compression engine, MedBoost, has been widely used since Freund and Schapire (1997) in various guises. Our Theorem 4 supplies the remaining piece of the puzzle: any sample-consistent regressor applied to some random sample of bounded size yields an $(\varepsilon, \gamma)$-weak hypothesis. The closest analogue we were able to find was Anthony et al. (1996, Theorem 3), which is nontrivial only for function classes with finite pseudo-dimension and is inapplicable, e.g., to classes of Lipschitz or bounded-variation functions.
The literature on general sample compression schemes for real-valued functions is quite sparse. There are well-known narrowly tailored results on specifying functions, or approximate versions of functions, using a finite number of points, such as the classical fact that a polynomial of degree $p$ can be perfectly recovered from $p + 1$ points. To our knowledge, the only general result on sample compression for real-valued functions (applicable to all learnable function classes) is Theorem 4.3 of David, Moran, and Yehudayoff (2016). They propose a general technique to convert any learning algorithm achieving an arbitrary sample complexity $M(\varepsilon, \delta)$ into a compression scheme of size $\tilde{O}(M(\varepsilon, \delta))$, which may approach $\infty$ as $\varepsilon \to 0$. However, their notion of compression scheme is significantly weaker than ours: namely, they allow $\hat f := \rho(\kappa(S))$ to satisfy merely $\frac{1}{m} \sum_{i=1}^{m} |\hat f(x_i) - f(x_i)| \le \varepsilon$, rather than our uniform approximation requirement $\max_{i \in [m]} |\hat f(x_i) - f(x_i)| \le \varepsilon$. In particular, in the special case of a family of binary-valued functions, their notion of sample compression does not recover the usual notion of sample compression schemes for classification, whereas our uniform approximate compression notion does recover it as a special case. We therefore consider our notion to be a more fitting generalization of the definition of sample compression to the real-valued case.
5 Boosting Real-Valued Functions
As mentioned above, the notion of a weak learner for learning real-valued functions must be formulated carefully. The naïve thought that we could take any learner guaranteeing, say, absolute loss at most $\frac{1}{2} - \gamma$ is known to not be strong enough to enable boosting to $\varepsilon$ loss. On the other hand, if we make the requirement too strong, such as in Freund and Schapire (1997) for AdaBoost.R, then the sample complexity of weak learning will be so high that weak learners cannot be expected to exist for large classes of functions. Our Definition 3, which has been proposed independently by Simon (1997) and Kégl (2003), appears to yield the appropriate notion of weak learner for boosting real-valued functions.
In the context of boosting for real-valued functions, the notion of an $(\varepsilon, \gamma)$-weak hypothesis plays a role analogous to the usual notion of a weak hypothesis in boosting for classification. Specifically, the following boosting algorithm, MedBoost, was proposed by Kégl (2003). As it will be convenient for our later results, we express its output as a sequence of functions and weights; the boosting guarantee from Kégl (2003) applies to the weighted quantiles (and in particular, the weighted median) of these function values.
Here we define the weighted median as
$$\mathrm{Median}(y_1, \ldots, y_T; \alpha_1, \ldots, \alpha_T) = \min\Bigg\{ y_j : \frac{\sum_{t=1}^{T} \alpha_t \mathbb{1}[y_j < y_t]}{\sum_{t=1}^{T} \alpha_t} < \frac{1}{2} \Bigg\}.$$
Also define the weighted quantiles, for $\xi \in [0, \frac{1}{2}]$, as
$$Q_{\xi}^{+}(y; \alpha) = \min\Bigg\{ y_j : \frac{\sum_{t=1}^{T} \alpha_t \mathbb{1}[y_j < y_t]}{\sum_{t=1}^{T} \alpha_t} < \frac{1}{2} - \xi \Bigg\}, \qquad Q_{\xi}^{-}(y; \alpha) = \max\Bigg\{ y_j : \frac{\sum_{t=1}^{T} \alpha_t \mathbb{1}[y_j > y_t]}{\sum_{t=1}^{T} \alpha_t} < \frac{1}{2} - \xi \Bigg\},$$
and abbreviate $Q_{\xi}^{+}(x) = Q_{\xi}^{+}(h_1(x), \ldots, h_T(x); \alpha_1, \ldots, \alpha_T)$ and $Q_{\xi}^{-}(x) = Q_{\xi}^{-}(h_1(x), \ldots, h_T(x); \alpha_1, \ldots, \alpha_T)$ for $h_t$ and $\alpha_t$ the functions and coefficient values returned by MedBoost.
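The weighted median and quantiles can be computed directly; the sketch below uses one common tie-breaking convention (the smallest value whose cumulative weight reaches the required fraction), which may differ from the paper's convention in edge cases.

```python
def weighted_quantile(values, weights, q):
    """Smallest value v with cumulative weight of {values <= v} >= q * total."""
    total = sum(weights)
    cum = 0.0
    for v, w in sorted(zip(values, weights)):
        cum += w
        if cum >= q * total:
            return v
    return max(values)

def weighted_median(values, weights):
    return weighted_quantile(values, weights, 0.5)
```

Note that a heavy weight can pull the weighted median toward an extreme value even when most of the distinct values lie elsewhere.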
Then Kégl (2003) proves the following result.
Lemma 5.
(Kégl (2003)) For a training set of size $m$, the return values of MedBoost satisfy
We note that, in the special case of binary classification, MedBoost is closely related to the well-known AdaBoost algorithm (Freund and Schapire, 1997), and the above results correspond to a standard margin-based analysis of Schapire et al. (1998). For our purposes, we will need the following immediate corollary of this, which follows from plugging in the stated values of $\alpha_t$ and using the weak learning assumption, which implies $\alpha_t > 0$ for all $t$.
Corollary 6.
For $T = O\!\left( \frac{\log m}{\varepsilon^2} \right)$, every $i \in [m]$ has
$$\max\Big\{ \big| Q_{\varepsilon/2}^{+}(x_i) - y_i \big|, \big| Q_{\varepsilon/2}^{-}(x_i) - y_i \big| \Big\} \le \gamma.$$
6 The Sample Complexity of Learning Real-Valued Functions
This section reveals our intention in choosing this notion of weak hypothesis, rather than using, say, an $\varepsilon$-good strong learner under absolute loss. In addition to being a strong enough notion for boosting to work, we show here that it is also a weak enough notion for the sample complexity of weak learning to be of reasonable size: namely, a size quantified by the fat-shattering dimension. This result is also relevant to an open question posed by Simon (1997), who proved a lower bound on the sample complexity of finding an $(\varepsilon, \gamma)$-weak hypothesis, expressed in terms of a related complexity measure, and asked whether a related upper bound might also hold. We establish a general upper bound here, witnessing the same dependence on the parameters $\varepsilon$ and $\delta$ as observed in Simon’s lower bound (up to a log factor), aside from a difference in the key complexity measure appearing in the bounds.
Let $\hat P_{2m}$ denote the empirical measure induced by $2m$ iid $D$-distributed random variables (the $m$ data points and $m$ ghost points), and define $\mathcal{N}(\gamma)$ as the covering numbers of $\mathcal{F}$ at scale $\gamma$ under the $L_1(\hat P_{2m})$ pseudometric.
Theorem 7.
Fix any $f^* \in \mathcal{F}$, $\gamma > 0$, and $\varepsilon, \delta \in (0, 1)$. For $x_1, \ldots, x_m$ iid $D$-distributed, with $m \ge \frac{c}{\varepsilon} \left( \log \mathbb{E}\big[ \mathcal{N}(\gamma/8) \big] + \log \frac{1}{\delta} \right)$ for a universal constant $c$, with probability at least $1 - \delta$, every $f \in \mathcal{F}$ with $\max_{i \in [m]} |f(x_i) - f^*(x_i)| \le \gamma/2$ satisfies $\Pr_{x \sim D}\big( |f(x) - f^*(x)| > \gamma \big) \le \varepsilon$.
Proof.
This proof roughly follows the usual symmetrization argument for uniform convergence (Vapnik and Červonenkis, 1971; Haussler, 1992), with a few important modifications to account for this maximum-based criterion. If $\mathbb{E}[\mathcal{N}(\gamma/8)]$ is infinite, then the result is trivial, so let us suppose it is finite for the remainder of the proof. Similarly, if $m < \frac{8}{\varepsilon}$, then the claimed probability bound is vacuous and hence the claim trivially holds, so let us suppose $m \ge \frac{8}{\varepsilon}$ for the remainder of the proof. Without loss of generality, suppose $f^* = 0$ everywhere and every $f \in \mathcal{F}$ is nonnegative (otherwise subtract $f^*$ from every $f$ and redefine each $f$ as the absolute value of the difference; note that this transformation does not increase the value of $\mathcal{N}(\gamma/8)$, since applying this transformation to the functions in any original cover yields a cover of the transformed class).
Let $x'_1, \ldots, x'_m$ be iid $D$-distributed ghost points, independent of the data. Denote by $\hat P$ the empirical measure induced by $x_1, \ldots, x_m$, and by $\hat P'$ the empirical measure induced by $x'_1, \ldots, x'_m$. We have
Denote by $A$ the event that there exists $f \in \mathcal{F}$ satisfying $\max_{i \in [m]} f(x_i) \le \gamma/2$ and $\Pr_{x \sim D}(f(x) > \gamma) > \varepsilon$, and on this event let $\tilde f$ denote such an $f$ (chosen solely based on $x_1, \ldots, x_m$); when $A$ fails to hold, take $\tilde f$ to be some arbitrary fixed element of $\mathcal{F}$. Then the expression on the right hand side above is at least as large as
and noting that the event $A$ is independent of $x'_1, \ldots, x'_m$, this equals
(1) 
Then note that for any $f$ with $\Pr_{x \sim D}(f(x) > \gamma) > \varepsilon$, a Chernoff bound implies
$$\Pr\big( \hat P'(f > \gamma) \ge \varepsilon / 2 \big) \ge \frac{1}{2},$$
where we have used the assumption that $m \ge \frac{8}{\varepsilon}$ here. In particular, this implies that the expression in (1) is no smaller than $\frac{1}{2} \Pr(A)$. Altogether, we have established that
(2) 
Now let $\sigma_1, \ldots, \sigma_m$ be independent random variables (also independent of the data), each uniform on $\{0, 1\}$; for each $i$, let $(\tilde x_i, \tilde x'_i) = (x_i, x'_i)$ if $\sigma_i = 0$ and $(\tilde x_i, \tilde x'_i) = (x'_i, x_i)$ if $\sigma_i = 1$. Also denote by $\tilde P$ the empirical measure induced by $\tilde x_1, \ldots, \tilde x_m$, and by $\tilde P'$ the empirical measure induced by $\tilde x'_1, \ldots, \tilde x'_m$. By exchangeability of the pairs $(x_i, x'_i)$, the right hand side of (2) is equal to
Now let $\hat{\mathcal{F}}$ be a minimal subset of $\mathcal{F}$ forming a $(\gamma/8)$-cover of $\mathcal{F}$ under the $L_1(\hat P_{2m})$ pseudometric. The size of $\hat{\mathcal{F}}$ is at most $\mathcal{N}(\gamma/8)$, which is finite almost surely (since we have assumed above that its expectation is finite). Then note that (denoting by $\hat f$ the element of $\hat{\mathcal{F}}$ closest to $\tilde f$ in $L_1(\hat P_{2m})$) the above expression is at most
(3) 
Then note that for any $g \in \hat{\mathcal{F}}$, we have almost surely
where the last inequality is by a Chernoff bound, which (as noted by Hoeffding (1963)) remains valid even when sampling without replacement. Together with (2) and (3), this establishes the claim.
∎
Lemma 8.
There exist universal numerical constants $c_1, c_2 \in (0, \infty)$ such that, for all $\gamma \in (0, 1)$,
where $d_{c_2 \gamma}$ is the $c_2\gamma$-fat-shattering dimension of $\mathcal{F}$.
Proof.
Mendelson and Vershynin (2003, Theorem 1) establishes that the covering number of $\mathcal{F}$ at scale $\gamma$ under the $L_2(\hat P_{2m})$ pseudometric is at most
(4) 
for some universal numerical constants $c_1, c_2 \in (0, \infty)$. Then note that for any $f, g \in \mathcal{F}$, Markov’s and Jensen’s inequalities imply $\| f - g \|_{L_1(\hat P_{2m})} \le \| f - g \|_{L_2(\hat P_{2m})}$. Thus, any $\gamma$-cover of $\mathcal{F}$ under $L_2(\hat P_{2m})$ is also a $\gamma$-cover of $\mathcal{F}$ under $L_1(\hat P_{2m})$, and therefore (4) is also a bound on $\mathcal{N}(\gamma)$. ∎
Combining the above two results yields the following theorem.
Theorem 9.
For some universal numerical constants $c_1, c_2, c_3 \in (0, \infty)$, for any $\varepsilon, \delta \in (0, 1)$ and $\gamma > 0$, letting $x_1, \ldots, x_m$ be iid $D$-distributed, where
with probability at least $1 - \delta$, every $f \in \mathcal{F}$ with $\max_{i \in [m]} |f(x_i) - f^*(x_i)| \le \gamma/2$ satisfies $\Pr_{x \sim D}\big( |f(x) - f^*(x)| > \gamma \big) \le \varepsilon$.
In particular, Theorem 4 follows immediately from this result, since any sample-consistent $f \in \mathcal{F}$ trivially satisfies $\max_{i \in [m]} |f(x_i) - f^*(x_i)| = 0 \le \gamma/2$.
To discuss tightness of Theorem 9, we note that Simon (1997) proved a sample complexity lower bound for the same criterion of
where the complexity measure appearing in the bound is a quantity somewhat smaller than the fat-shattering dimension, essentially representing a fat Natarajan dimension. Thus, aside from the differences in the complexity measure (and a logarithmic factor), we establish an upper bound of a similar form to Simon’s lower bound.
7 From Boosting to Compression
Generally, our strategy for converting the boosting algorithm MedBoost into a sample compression scheme of smaller size follows that of Moran and Yehudayoff for binary classification, based on arguing that, because the ensemble makes its predictions with a margin (corresponding to the results on quantiles in Corollary 6), it is possible to recover the same proximity guarantees for the predictions while using only a smaller subset of the functions from the original ensemble. Specifically, we use the following general sparsification strategy.
For $\alpha_1, \ldots, \alpha_T \ge 0$ with $\sum_{t=1}^{T} \alpha_t > 0$, denote by $\mathrm{Cat}(\alpha_1, \ldots, \alpha_T)$ the categorical distribution: i.e., the discrete probability distribution on $\{1, \ldots, T\}$ with probability mass $\alpha_t / \sum_{t'=1}^{T} \alpha_{t'}$ on $t$.
For any values $y_1, \ldots, y_k \in \mathbb{R}$, denote the (unweighted) median $\mathrm{Med}(y_1, \ldots, y_k) = \mathrm{Median}(y_1, \ldots, y_k; 1, \ldots, 1)$.
Our intention in discussing the above algorithm is to argue that, for a sufficiently large choice of $k$, the above procedure returns functions $f_1, \ldots, f_k$ such that
We analyze this strategy separately for binary classification and realvalued functions, since the argument in the binary case is much simpler (and demonstrates more directly the connection to the original argument of Moran and Yehudayoff), and also because we arrive at a tighter result for binary functions than for realvalued functions.
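The Sparsify procedure just described can be sketched as follows; the signature, parameter names, and the `max_tries` safeguard are ours, not the paper's.

```python
import random

def unweighted_median(vals):
    s = sorted(vals)
    return s[len(s) // 2]  # upper median for even lengths

def sparsify(funcs, alphas, xs, targets, gamma, k, rng, max_tries=1000):
    """Repeatedly draw k functions iid from the categorical distribution
    induced by alphas; accept once the pointwise (unweighted) median of the
    drawn functions is within gamma of the target on every sample point."""
    for _ in range(max_tries):
        chosen = rng.choices(funcs, weights=alphas, k=k)
        if all(abs(unweighted_median([g(x) for g in chosen]) - y) <= gamma
               for x, y in zip(xs, targets)):
            return chosen
    return None

rng = random.Random(0)
# five functions that match the target, one that is far off but lightly weighted
funcs = [lambda x: 0.0] * 5 + [lambda x: 1.0]
alphas = [1.0] * 5 + [0.2]
xs = [0.0, 1.0, 2.0]
targets = [0.0, 0.0, 0.0]
subset = sparsify(funcs, alphas, xs, targets, gamma=0.1, k=5, rng=rng)
```

Because the bad function carries little weight, a majority of each draw lands on good functions, and the median of the subsample stays near the target.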
7.1 Binary Classification
We begin with a simple observation about binary classification (i.e., where the functions in $\mathcal{F}$ all map into $\{0, 1\}$). The technique here is quite simple, and follows a similar line of reasoning to the original argument of Moran and Yehudayoff. The argument for real-valued functions below will diverge from this argument in several important ways, but the high-level ideas remain the same.
The compression function is essentially the one introduced by Moran and Yehudayoff, except applied to the $k$ classifiers produced by the above Sparsify procedure, rather than a set of functions selected by a minimax distribution over all classifiers produced by samples of a fixed size. The weak hypotheses in MedBoost for binary classification can be obtained using samples of size $O(d)$. Thus, if the Sparsify procedure is successful in finding $k$ such classifiers whose median predictions are within $\gamma$ of the target values for all $i \in [m]$, then we may encode these classifiers as a compression set, consisting of the set of samples used to train these classifiers, together with $O(k \log k)$ extra bits to encode the order of the samples.^2 To obtain Theorem 1, it then suffices to argue that $k = O(d^*)$ is a sufficient value. The proof follows.

^2 In fact, fewer bits would suffice if the weak learner is permutation-invariant in its data set.
Proof of Theorem 1.
Recall that $d^*$ bounds the VC dimension of the class of sets $\big\{ \{ h : h(x) = 1 \} : x \in \mathcal{X} \big\}$. Thus, for the $k$ iid hypotheses sampled in Sparsify, by the VC uniform convergence inequality of Vapnik and Červonenkis (1971), with probability at least $1/2$ we get that
In particular, if we choose $k$, $\varepsilon$, and $\gamma$ appropriately, then Corollary 6 implies that every $i \in [m]$ has a supermajority of the ensemble weight on classifiers with $h(x_i) = y_i$, so that the above event would imply every $i \in [m]$ has $\mathrm{Med}(f_1(x_i), \ldots, f_k(x_i)) = y_i$. Note that the Sparsify algorithm need only try this sampling a constant number of times in expectation to find such a set of functions. Combined with the description above (from Moran and Yehudayoff, 2016) of how to encode this collection of functions as a sample compression set plus side information, this completes the construction of the sample compression scheme. ∎
7.2 Real-Valued Functions
Next we turn to the general case of real-valued functions (where the functions in $\mathcal{F}$ may generally map into $[0, 1]$). We have the following result, which says that the Sparsify procedure can reduce the ensemble from one with $T$ functions in it down to one with a number of functions independent of the sample size $m$.
Theorem 10.
Choosing
suffices for the Sparsify procedure to return functions $f_1, \ldots, f_k$ with
Proof.
Recall from Corollary 6 that MedBoost returns functions $h_1, \ldots, h_T$ and coefficients $\alpha_1, \ldots, \alpha_T$ such that, for all $i \in [m]$,
where $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the training data set.
We use this property to sparsify $h_1, \ldots, h_T$ from $T$ down to $k$ elements, where $k$ will depend on $\gamma$, $\varepsilon$, and the dual fat-shattering dimension of $\mathcal{F}$ (actually, just of the ensemble $\{h_1, \ldots, h_T\}$), but not on the sample size $m$.
Letting $p_t = \alpha_t / \sum_{t'=1}^{T} \alpha_{t'}$ for each $t \in [T]$, we will sample $k$ hypotheses $f_1, \ldots, f_k$ with each $f_j = h_{t_j}$, where $t_1, \ldots, t_k$ are drawn iid from $\mathrm{Cat}(\alpha_1, \ldots, \alpha_T)$, as in Sparsify. Define the function $\hat f(x) = \mathrm{Med}(f_1(x), \ldots, f_k(x))$. We claim that for any fixed $i \in [m]$, with high probability
(5) 
Indeed, partition the indices $[T]$ into the disjoint sets
Then the only way (5) can fail is if half or more of the $k$ sampled indices fall into the first set, or if half or more fall into the second. Since the sampling distribution puts mass less than $\frac{1}{2} - \frac{\varepsilon}{2}$ on each of the two sets, Chernoff’s bound puts an upper estimate of $e^{-k \varepsilon^2 / 2}$ on either event. Hence,