# A New Lower Bound for Agnostic Learning with Sample Compression Schemes

Steve Hanneke
steve.hanneke@gmail.com

Aryeh Kontorovich
Ben-Gurion University
karyeh@bgu.ac.il

###### Abstract

We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.

## 1 Introduction

Compression-based arguments provide some of the simplest and tightest generalization bounds in the literature. These are known as Occam learning in the most general setting (Blumer et al., 1989), and the special case of sample compression (Littlestone and Warmuth, 1986; Devroye et al., 1996; Graepel et al., 2005; Floyd and Warmuth, 1995) has received a fair amount of recent attention (Moran and Yehudayoff, 2016; David et al., 2016; Zhivotovskiy, 2017; Hanneke et al., 2018).

As the present paper deals with lower bounds, we stress up-front that these are statistical lower bounds (rather than, say, computational (Gottlieb et al., 2014) or communication-based (Kane et al., 2017)). In the realizable case, Littlestone and Warmuth (1986) and Floyd and Warmuth (1995) showed that a $k$-compression scheme on a sample of size $n$ achieves an expected generalization error bound of order

$$\frac{k \log(n/k)}{n}.$$

As the compression size is a rough analogue of the VC-dimension, one is immediately led to inquire into the necessity of the $\log(n/k)$ factor. While known to be removable from the realizable VC bound (Haussler et al., 1994; Hanneke, 2016), the $\log(n/k)$ factor in the compression bound above turns out to be tight (Floyd and Warmuth, 1995). On the other hand, turning to the agnostic case, the corresponding compression result from Graepel et al. (2005) implies an upper bound on the expected excess generalization error of a certain $k$-compression scheme on a sample of size $n$ of order

$$\sqrt{\frac{k \log(n/k)}{n}}.$$

Here again, the agnostic VC analogue of this bound (Anthony and Bartlett, 1999, Theorem 4.10) might suggest that the $\log(n/k)$ factor is superfluous. Though it is a simpler matter to give a $\sqrt{k/n}$ lower bound, it proves significantly more challenging to determine whether the factor of $\log(n/k)$ is required for this general bound. As our main result in this work (Section 2), we prove that this factor in the agnostic compression bound cannot be removed. We also prove an analogous lower bound for order-dependent compression schemes (Section 3), where the factor becomes $\log(n)$, which again is tight.

## 2 Order-Independent Compression Schemes

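Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is any nonempty set and $\mathcal{Y} = \{0,1\}$, and suppose $\mathcal{X}$ is equipped with a $\sigma$-algebra defining the measurable sets. An agnostic sample compression scheme is specified by a size $k \in \mathbb{N}$ and a reconstruction function $\rho$, which maps any (multi)set $S \subseteq \mathcal{Z}$ with $|S| \leq k$ to a measurable function $h : \mathcal{X} \to \mathcal{Y}$. For any $n \in \mathbb{N}$ and any sequence $z_1,\ldots,z_n \in \mathcal{Z}$, define

$$\mathcal{H}_{k,\rho}(z_1,\ldots,z_n) = \left\{ \rho(\{z_{i_1},\ldots,z_{i_{k'}}\}) : k' \leq k,\ 1 \leq i_1 < \cdots < i_{k'} \leq n \right\}.$$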
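To make the definition concrete, the following is a minimal sketch that enumerates the set of reconstructions for a small sample. The names `hypothesis_set` and `toy_rho` are hypothetical, and the toy reconstruction function (a threshold classifier encoded by its threshold) is an assumption chosen only for illustration; it is not the construction used in the proofs below.

```python
from itertools import combinations

def hypothesis_set(sample, k, rho):
    """Distinct outputs of rho over all sub-multisets of the sample of size <= k.

    Each output is assumed to be a hashable encoding of a classifier.
    """
    outputs = set()
    for size in range(k + 1):
        # Choosing distinct index sets lets a repeated point appear
        # as many times as it occurs in the sample.
        for idx in combinations(range(len(sample)), size):
            # Sorting removes any dependence on order (rho takes a multiset).
            outputs.add(rho(tuple(sorted(sample[i] for i in idx))))
    return outputs

def toy_rho(points):
    """Hypothetical reconstruction: the classifier x -> 1[x >= threshold],
    encoded by its threshold (smallest positively labeled x, else +infinity)."""
    positives = [x for (x, y) in points if y == 1]
    return min(positives) if positives else float("inf")

sample = [(0.2, 0), (0.5, 1), (0.9, 1)]
reconstructions = hypothesis_set(sample, 1, toy_rho)
```

With size $k = 1$ this toy scheme can output the constant-zero classifier (threshold $+\infty$) or a threshold at either positively labeled point.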

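Now for any probability measure $P$ on $\mathcal{Z}$ and any $n \in \mathbb{N}$, let $Z_{[n]} = \{Z_1,\ldots,Z_n\} = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ be independent $P$-distributed random variables, and for any classifier $h : \mathcal{X} \to \mathcal{Y}$, define the error rate $R(h;P) = P((x,y) : h(x) \neq y)$ of $h$, and define the empirical error rate $\hat{R}(h;Z_{[n]}) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[h(X_i) \neq Y_i]$ of $h$.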

Now there are essentially two types of results for agnostic compression schemes in the literature: namely, uniform convergence rates and agnostic learning excess risk guarantees. We begin with the first of these. For any fixed agnostic sample compression scheme $(k,\rho)$, denote

$$\mathcal{E}_{\mathrm{uc}}(n,k,\rho,P) = \mathbb{E} \sup_{h \in \mathcal{H}_{k,\rho}(Z_{[n]})} \left| \hat{R}(h;Z_{[n]}) - R(h;P) \right|.$$

Then, for any $n, k \in \mathbb{N}$, define

$$\mathcal{E}_{\mathrm{uc}}(n,k) = \sup_{P,\rho} \mathcal{E}_{\mathrm{uc}}(n,k,\rho,P),$$

where $P$ ranges over all probability measures on $\mathcal{Z}$, and $\rho$ ranges over all reconstruction functions (for the given size $k$). For results on uniform convergence for agnostic compression schemes, this is the object of primary interest to this work.

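It is known (essentially from the arguments of Graepel et al. (2005, Theorem 2)) that for any $n, k \in \mathbb{N}$ with $k$ at most a constant fraction of $n$,

$$\mathcal{E}_{\mathrm{uc}}(n,k) \lesssim \sqrt{\frac{k \log(n/k)}{n}}.$$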

This upper bound is similar in form to the original bound of Vapnik and Chervonenkis (1971) for uniform convergence rates for VC classes of VC dimension $k$. However, that bound was later refined to the form $\sqrt{k/n}$, removing the factor $\log(n/k)$ (a detailed account of the intermediate steps leading to this seminal result is presented in Anthony and Bartlett (1999); significant milestones include Pollard (1982); Koltchinskii (1981); Talagrand (1994); Haussler (1995)). It is therefore natural to wonder whether this same refinement might be achieved by size-$k$ agnostic sample compression schemes. To our knowledge, this question has not previously been addressed in the literature.

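The other type of results of interest for agnostic compression schemes are agnostic learning excess risk guarantees. Specifically, a compression function $\kappa$ is a mapping from any sequence $z_1,\ldots,z_n$ in $\mathcal{Z}$ to an unordered sub(multi)set $\kappa(z_1,\ldots,z_n)$ of size at most $k$. (An element in $\kappa(z_1,\ldots,z_n)$ may repeat up to as many times as it occurs in the sequence $z_1,\ldots,z_n$, so that $\kappa$ effectively corresponds to picking a set of up to $k$ distinct indices in $\{1,\ldots,n\}$ to include the corresponding points.) Then, denoting $\hat{h}_n = \rho(\kappa(Z_{[n]}))$, define

$$\mathcal{E}_{\mathrm{ag}}(n,k,\rho,\kappa,P) = \mathbb{E}\left[ R(\hat{h}_n;P) - \min_{h \in \mathcal{H}_{k,\rho}(Z_{[n]})} R(h;P) \right]$$

and then define

$$\mathcal{E}_{\mathrm{ag}}(n,k) = \sup_{\rho} \inf_{\kappa} \sup_{P} \mathcal{E}_{\mathrm{ag}}(n,k,\rho,\kappa,P),$$

where again $P$ ranges over all probability measures on $\mathcal{Z}$ and $\rho$ ranges over all reconstruction functions (for the given size $k$), and where $\kappa$ ranges over all compression functions (for the given size $k$).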

By a standard argument, if we specify $\kappa$ so as to always minimize the empirical error rate $\hat{R}(\rho(\kappa(Z_{[n]}));Z_{[n]})$, then the excess error rate can be bounded by twice the uniform convergence bound, which immediately implies

$$\mathcal{E}_{\mathrm{ag}}(n,k) \leq 2\,\mathcal{E}_{\mathrm{uc}}(n,k). \tag{1}$$
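The standard argument behind (1) can be sketched as follows. Here $\hat{h}_n$ is taken to be an empirical risk minimizer over $\mathcal{H}_{k,\rho}(Z_{[n]})$, and $h^\star$ denotes a minimizer of $R(\cdot;P)$ over the same set; the symbol $h^\star$ is introduced only for this illustration.

$$
\begin{aligned}
R(\hat{h}_n;P) - R(h^\star;P)
&= \big[ R(\hat{h}_n;P) - \hat{R}(\hat{h}_n;Z_{[n]}) \big]
 + \big[ \hat{R}(\hat{h}_n;Z_{[n]}) - \hat{R}(h^\star;Z_{[n]}) \big]
 + \big[ \hat{R}(h^\star;Z_{[n]}) - R(h^\star;P) \big] \\
&\leq 2 \sup_{h \in \mathcal{H}_{k,\rho}(Z_{[n]})} \big| \hat{R}(h;Z_{[n]}) - R(h;P) \big| .
\end{aligned}
$$

The middle bracket is at most zero because $\hat{h}_n$ minimizes the empirical error rate over the set, and each of the other two brackets is at most the supremum. Taking expectations, and then the supremum over $P$ and $\rho$, gives (1).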

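An immediate implication from above is then that any $n, k \in \mathbb{N}$ with $k$ at most a constant fraction of $n$ has

$$\mathcal{E}_{\mathrm{ag}}(n,k) \lesssim \sqrt{\frac{k \log(n/k)}{n}}.$$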

Here again, this bound is of the same form originally proven by Vapnik and Chervonenkis (1971) for empirical risk minimization in classes of VC dimension $k$, which was later refined to a sharp bound of order $\sqrt{k/n}$ (Anthony and Bartlett, 1999, Theorem 4.10). As such, it is again natural to ask whether the $\log(n/k)$ factor in the above bound for agnostic sample compression can be reduced to a constant, or is in fact necessary. Our main contribution in this work is a construction showing that this log factor is indeed necessary, as stated in the following results. In all of the results below, $c$ represents a numerical constant, whose value must be set sufficiently large (as discussed in the proofs) for the results to hold.

###### Theorem 1.

For any $n, k \in \mathbb{N}$ with $n \geq ck$,

$$\mathcal{E}_{\mathrm{ag}}(n,k) \gtrsim \sqrt{\frac{k \log(n/k)}{n}}.$$

By the relation (1) discussed above, between uniform convergence and agnostic learning by empirical risk minimization over $\mathcal{H}_{k,\rho}(Z_{[n]})$, this also has the following immediate implication.

###### Theorem 2.

For any $n, k \in \mathbb{N}$ with $n \geq ck$,

$$\mathcal{E}_{\mathrm{uc}}(n,k) \gtrsim \sqrt{\frac{k \log(n/k)}{n}}.$$

Together with the known upper bounds mentioned above, this provides a tight characterization of the worst-case rate of uniform convergence for agnostic sample compression schemes.

###### Corollary 3.

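For any $n, k \in \mathbb{N}$ with $n \geq ck$,

$$\mathcal{E}_{\mathrm{ag}}(n,k) \asymp \sqrt{\frac{k \log(n/k)}{n}}$$

and

$$\mathcal{E}_{\mathrm{uc}}(n,k) \asymp \sqrt{\frac{k \log(n/k)}{n}}.$$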
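As a rough numerical illustration of the gap between the two rates, the sketch below evaluates both expressions with all constant factors dropped. The function names are hypothetical and the numbers indicate only orders of magnitude.

```python
import math

def compression_rate(n, k):
    """sqrt(k * log(n/k) / n): the rate for size-k agnostic compression schemes."""
    return math.sqrt(k * math.log(n / k) / n)

def vc_rate(n, k):
    """sqrt(k / n): the rate for agnostic learning with VC dimension k."""
    return math.sqrt(k / n)

# The ratio of the two rates is sqrt(log(n/k)), which grows without bound in n/k.
ratio = compression_rate(10_000, 10) / vc_rate(10_000, 10)
```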

We now present the proof of Theorem 1.

###### Proof of Theorem 1.

Fix any $n, k \in \mathbb{N}$ with $n \geq ck$ for a sufficiently large numerical constant $c$ (discussed below), denote $m = n/k$, and let $x_0,\ldots,x_{km-1}$ denote any $km$ distinct elements of $\mathcal{X}$. For simplicity, suppose $m$, $\log_2(m)$, and $m/\log_2(m)$ are all integers (the argument easily extends to the general case by introducing floor functions, with only the numerical constants changing in the final result). The essential strategy behind our construction is to create an embedded instance of a construction for proving the lower bound for agnostic learning in VC classes, where here the VC dimension of the embedded scenario will be $k \log_2(m)$. The construction of this embedded scenario is our starting point. From there we also need to argue that there is a function contained in $\mathcal{H}_{k,\rho}(Z_{[n]})$ with risk not too much larger than the best classifier in the embedded VC class, which allows us to extend the lower bound argument for the embedded VC class to compression schemes. For any $i \in \{0,\ldots,m-1\}$ and $r \in \{0,\ldots,\log_2(m)-1\}$, let $b_r(i)$ denote bit $r$ of $i$ in the binary representation of $i$: that is, $i = \sum_{r=0}^{\log_2(m)-1} b_r(i) 2^r$, with $b_r(i) \in \{0,1\}$.

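We construct the reconstruction function $\rho$ based on $k$ “blocks”, each with $m/\log_2(m)$ “sub-blocks”. Specifically, for each $t \in \{1,\ldots,k\}$, define a block $B_t = \{(t-1)m,\ldots,tm-1\}$, and for each $s \in \{1,\ldots,m/\log_2(m)\}$, define a sub-block

$$B_{ts} = \{(t-1)m + (s-1)\log_2(m),\ldots,(t-1)m + s\log_2(m) - 1\}.$$

Then for any $t \in \{1,\ldots,k\}$ and $i \in B_t$, define $h_{t,i}$ as any function satisfying the property that, for $j = (t-1)m + (s-1)\log_2(m) + r$ (for any $s \in \{1,\ldots,m/\log_2(m)\}$ and $r \in \{0,\ldots,\log_2(m)-1\}$),

$$h_{t,i}(x_j) = b_r(i - (t-1)m).$$

Thus, the subsequence of points corresponding to the indices within each sub-block $B_{ts}$ has $h_{t,i}$ values corresponding to the bits of the integer $i - (t-1)m$, and this repeats identically for every sub-block $B_{ts}$ in the block $B_t$.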
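The block structure can be sketched in a few lines. The code below lists the labels that $h_{t,i}$ assigns to the points of block $B_t$ in increasing index order; the function names are hypothetical, and $m$ is assumed to be a power of two with $\log_2(m)$ dividing $m$, as in the simplifying assumption above.

```python
def bit(i, r):
    """b_r(i): bit r of the nonnegative integer i (bit 0 is least significant)."""
    return (i >> r) & 1

def block_labels(t, i, m):
    """Labels h_{t,i}(x_j) for j running over block B_t in increasing order."""
    L = m.bit_length() - 1                 # log2(m) when m is a power of two
    assert (1 << L) == m and m % L == 0, "need m a power of two, log2(m) | m"
    assert (t - 1) * m <= i < t * m, "i must lie in block B_t"
    offset = i - (t - 1) * m               # the integer whose bits are encoded
    labels = []
    for s in range(1, m // L + 1):         # sub-blocks B_{t1}, B_{t2}, ...
        for r in range(L):                 # position r within the sub-block
            labels.append(bit(offset, r))
    return labels

# Block t = 2 with m = 16 covers indices 16..31; i = 21 has offset 5 = 0b0101.
pattern = block_labels(2, 21, 16)
```

Each of the four sub-blocks carries the same four bits of the offset, least significant bit first.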

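Now we construct a reconstruction function $\rho$ that outputs functions which correspond to some such function $h_{t,i}$ within each block $B_t$, but potentially using a different bit pattern $i_t$ for each $t$. Formally, for any $i_1,\ldots,i_k$ with $i_t \in B_t$ (for each $t \in \{1,\ldots,k\}$), and any $y_1,\ldots,y_k \in \mathcal{Y}$, define $\rho(\{(x_{i_1},y_1),\ldots,(x_{i_k},y_k)\}) = \tilde{h}_{i_1,\ldots,i_k}$, where $\tilde{h}_{i_1,\ldots,i_k}$ is any function satisfying the property that each $t \in \{1,\ldots,k\}$ and $j \in B_t$ has $\tilde{h}_{i_1,\ldots,i_k}(x_j) = h_{t,i_t}(x_j)$: that is, the points in the compression set are interpreted by the compression scheme as encoding the desired label sequence for sub-blocks $B_{ts}$ in the bits of $i_t - (t-1)m$. For our purposes, $\tilde{h}_{i_1,\ldots,i_k}(x)$ may be defined arbitrarily for $x \notin \{x_0,\ldots,x_{km-1}\}$. Note that $\rho$ is invariant to the $y_t$ values, so for brevity we will drop the $y_t$ arguments and simply write $\rho(\{x_{i_1},\ldots,x_{i_k}\})$ (this is often referred to as an unlabeled compression scheme in the literature). For completeness, $\rho(S)$ should also be defined for sets $S$ of size at most $k$ that do not have exactly one element $x_i$ with $i \in B_t$ for every $t$; for our purposes, let us suppose that in these cases, for every $t$ for which $S$ contains some $x_i$ with $i \in B_t$, let $i_t$ be the smallest such index $i$, and for every $t$ for which $S$ contains no $x_i$ with $i \in B_t$, let $i_t = (t-1)m$; then define $\rho(S) = \tilde{h}_{i_1,\ldots,i_k}$. In this way, $\rho(S)$ is defined for all $S$ with $|S| \leq k$.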

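Now define a family of distributions $P^{(\sigma)}$, $\sigma = \{\sigma_{t,r}\}$, with $\sigma_{t,r} \in \{-1,1\}$ for $t \in \{1,\ldots,k\}$ and $r \in \{0,\ldots,\log_2(m)-1\}$, as follows. Every $P^{(\sigma)}$ has marginal on $\mathcal{X}$ uniform on $\{x_0,\ldots,x_{km-1}\}$, and for each $j = (t-1)m + (s-1)\log_2(m) + r$ (for $t \in \{1,\ldots,k\}$, $s \in \{1,\ldots,m/\log_2(m)\}$, and $r \in \{0,\ldots,\log_2(m)-1\}$) set $P^{(\sigma)}(Y = 1 \mid X = x_j) = \frac{1}{2} + \frac{\epsilon}{2}\sigma_{t,r}$, where

$$\epsilon = \sqrt{\frac{k \log_2(m)}{n}}.$$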

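Now let us suppose $\sigma$ is chosen randomly, with independent $\sigma_{t,r} \sim \mathrm{Uniform}(\{-1,1\})$. Then (since max $\geq$ average) note that choosing $\rho$ as constructed above results in

$$\mathcal{E}_{\mathrm{ag}}(n,k) \geq \mathbb{E}\left[ \inf_{\kappa} \mathcal{E}_{\mathrm{ag}}(n,k,\rho,\kappa,P^{(\sigma)}) \right],$$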

so that it suffices to study the expectation on the right hand side.

As mentioned, the purpose of this construction is to create an embedded instance of a scenario that witnesses the lower bound for agnostic learning in VC classes, where the VC dimension of the embedded scenario here is $k \log_2(m)$. Specifically, in our construction, for any $t \in \{1,\ldots,k\}$ and $r \in \{0,\ldots,\log_2(m)-1\}$, denoting by

$$C_{t,r} = \{(t-1)m + (s-1)\log_2(m) + r : s \in \{1,\ldots,m/\log_2(m)\}\},$$

the locations $x_j$, $j \in C_{t,r}$, together essentially represent a single location in the embedded problem: that is, their $h_{t,i}$ values are bound together, as are their $P^{(\sigma)}(Y = 1 \mid X = x_j)$ values. However, this itself is not sufficient to supply a lower bound, since the constructed scenario exists only in the complete space of possible reconstructions $\{\tilde{h}_{i_1,\ldots,i_k} : i_t \in B_t \text{ for each } t\}$, and it is entirely possible that $\mathcal{H}_{k,\rho}(Z_{[n]})$ is a strict subset of this space: that is, the smallest error rate achievable in $\mathcal{H}_{k,\rho}(Z_{[n]})$ can conceivably be significantly larger than the smallest error rate achievable in the embedded VC class, so that compression schemes in this scenario do not automatically inherit the lower bounds for the constructed VC class. To account for this, we will study a decomposition of the construction into subproblems, corresponding to the blocks $B_t$ in the construction, and we will argue that within these subproblems there remains in $\mathcal{H}_{k,\rho}(Z_{[n]})$ a function with optimal predictions on most of the points, and then stitch these functions together to argue that there do exist functions in $\mathcal{H}_{k,\rho}(Z_{[n]})$ having near-optimal error rates relative to the best in the complete space.

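Specifically, fix any $t \in \{1,\ldots,k\}$ and let $P^{(\sigma)}_t$ denote the conditional distribution of $(X,Y) \sim P^{(\sigma)}$ given $\sigma$ and the event that $X \in \{x_i : i \in B_t\}$. Also denote by $I(x_i) = i$ the index of the point $x_i$, by $\hat{h} = \rho(\kappa(Z_{[n]}))$ the classifier produced for a given compression function $\kappa$, by $h^*_t$ a classifier of minimal error rate $R(\cdot;P^{(\sigma)}_t)$, and

$$\mathcal{H}_t(Z_{[n]}) = \{h_{t,i} : i \in B_t,\ x_i \in \{x_{(t-1)m}, X_1,\ldots,X_n\}\}.$$

These correspond to the classifications of block $B_t$ realizable by classifiers in $\mathcal{H}_{k,\rho}(Z_{[n]})$ (where the addition of the point $x_{(t-1)m}$ to the data set is due to our specification of $\rho(S)$ for sets $S$ that contain no elements $x_i$ with $i \in B_t$, so that classifying block $B_t$ according to $h_{t,(t-1)m}$ is always possible). There are now two components at this stage in the argument: first, that any compression function $\kappa$ results in $\hat{h}$ with $\mathbb{E}[R(\hat{h};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t)] \geq \frac{\epsilon}{8e^4}$, and second, that $\mathbb{E}[\min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t)] \leq \frac{\epsilon}{16e^4}$.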

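For the first part, note that for any $r \in \{0,\ldots,\log_2(m)-1\}$, for any $j \in C_{t,r}$, $h^*_t(x_j) = \frac{\sigma_{t,r}+1}{2}$. Furthermore, for any compression function $\kappa$, note that any $\hat{h}$ that $\rho$ is capable of producing has $\hat{h}(x_j) = \hat{h}(x_{j'})$ for every $j, j' \in C_{t,r}$. In particular, if we let $\hat{i}_t$ be the index in $B_t$ with $b_r(\hat{i}_t - (t-1)m) = \hat{h}(x_j)$ for every $r$ and every $j \in C_{t,r}$, then $\hat{h}$ and $h_{t,\hat{i}_t}$ agree on every element of $\{x_j : j \in B_t\}$. This also implies

$$R(\hat{h};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) = R(h_{t,\hat{i}_t};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) = \frac{1}{\log_2(m)} \sum_{r=0}^{\log_2(m)-1} \epsilon\, \mathbb{I}\left[ b_r(\hat{i}_t - (t-1)m) \neq \frac{\sigma_{t,r}+1}{2} \right].$$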

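Therefore, denoting by $n_{t,r}$ the number of indices $i \leq n$ with $I(X_i) \in C_{t,r}$, we have

$$\mathbb{E}\left[ R(\hat{h};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) \right] = \frac{\epsilon}{\log_2(m)} \sum_{r=0}^{\log_2(m)-1} \mathbb{E}\left[ \mathbb{P}\left( b_r(\hat{i}_t - (t-1)m) \neq \frac{\sigma_{t,r}+1}{2} \,\middle|\, n_{t,r} \right) \right].$$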

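For any given $r$, enumerate the random variables $X_i$ with $I(X_i) \in C_{t,r}$ as $X_{i(r,1)},\ldots,X_{i(r,n_{t,r})}$, and note that given $n_{t,r}$, the values $Y_{i(r,1)},\ldots,Y_{i(r,n_{t,r})}$ are a sufficient statistic for $\sigma_{t,r}$ (see Definition 2.4 of Schervish (1995)), and therefore (see Theorem 3.18 of Schervish (1995)) there exists a (randomized) decision rule $\hat{f}_{t,r}$ depending only on these variables and independent random bits such that

$$\mathbb{P}\left( b_r(\hat{i}_t - (t-1)m) \neq \frac{\sigma_{t,r}+1}{2} \,\middle|\, n_{t,r} \right) = \mathbb{P}\left( \hat{f}_{t,r}(Y_{i(r,1)},\ldots,Y_{i(r,n_{t,r})}) \neq \frac{\sigma_{t,r}+1}{2} \,\middle|\, n_{t,r} \right).$$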

Furthermore, by Lemma 5.1 of Anthony and Bartlett (1999) (whose lower bound relied on Slud’s lemma; the analysis has since been tightened to yield asymptotically optimal lower bounds (Kontorovich and Pinelis, 2016)), we have

$$\mathbb{P}\left( \hat{f}_{t,r}(Y_{i(r,1)},\ldots,Y_{i(r,n_{t,r})}) \neq \frac{\sigma_{t,r}+1}{2} \,\middle|\, n_{t,r} \right) > \frac{1}{8e} \exp\left\{ -(8/3)\, n_{t,r}\, \epsilon^2 \right\}.$$

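Altogether, and combined with Jensen’s inequality (using $\mathbb{E}[n_{t,r}] = \frac{n}{k \log_2(m)}$), we have that

$$\mathbb{E}\left[ R(\hat{h};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) \right] \geq \frac{\epsilon}{8e \log_2(m)} \sum_{r=0}^{\log_2(m)-1} \exp\left\{ -(8/3) \frac{n}{k \log_2(m)} \epsilon^2 \right\} \geq \frac{\epsilon}{8e \log_2(m)} \sum_{r=0}^{\log_2(m)-1} e^{-(8/3)} \geq \frac{\epsilon}{8e^4}.$$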
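The Jensen step can be checked numerically on a small instance. The sketch below assumes, as the construction suggests, that the count of samples landing in one set $C_{t,r}$ is $\mathrm{Binomial}(n, \frac{1}{k\log_2(m)})$; the function names and the parameter `L` (standing for $\log_2(m)$) are hypothetical.

```python
import math

def expected_exponential(n, p, a):
    """E[exp(-a * N)] for N ~ Binomial(n, p), via the binomial closed form."""
    return (1 - p + p * math.exp(-a)) ** n

def jensen_check(n, k, L):
    """Return (E[exp(-(8/3) N eps^2)], exp(-(8/3) E[N] eps^2))."""
    eps_sq = k * L / n              # epsilon^2 = k * log2(m) / n
    p = 1 / (k * L)                 # chance that one sample lands in C_{t,r}
    lhs = expected_exponential(n, p, (8 / 3) * eps_sq)
    rhs = math.exp(-(8 / 3) * n * p * eps_sq)
    return lhs, rhs

# k = 2 blocks of m = 16 points each, so n = km = 32 and L = log2(16) = 4.
lhs, rhs = jensen_check(32, 2, 4)
```

Since $x \mapsto e^{-ax}$ is convex, the first value is at least the second, and the second equals $e^{-8/3}$ because $\epsilon^2$ was chosen to cancel the expected count.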

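Now for the second part, for any $\sigma$, denote by $i^*_t$ the index in $B_t$ such that $b_r(i^*_t - (t-1)m) = \frac{\sigma_{t,r}+1}{2}$ for every $r \in \{0,\ldots,\log_2(m)-1\}$. Note that an $i$ for which $h_{t,i}$ has minimal $R(h_{t,i};P^{(\sigma)}_t)$ among all $i \in B_t \cap \{I(X_1),\ldots,I(X_n),(t-1)m\}$ can equivalently be defined as an $i$ with the minimal number of bits of $i - (t-1)m$ differing from those of $i^*_t - (t-1)m$ among all such $i$, and furthermore, for such an $i$,

$$R(h_{t,i};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) = \frac{\epsilon}{\log_2(m)} \sum_{j=0}^{\log_2(m)-1} \mathbb{I}\left[ b_j(i - (t-1)m) \neq b_j(i^*_t - (t-1)m) \right].$$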

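For any $i \in B_t$, denote

$$\Delta_t(i) = \sum_{j=0}^{\log_2(m)-1} \mathbb{I}\left[ b_j(i - (t-1)m) \neq b_j(i^*_t - (t-1)m) \right].$$

Thus, it suffices to establish the stated upper bound for the quantity

$$\frac{\epsilon}{\log_2(m)} \mathbb{E}\left[ \min_{i \in B_t \cap \{I(X_1),\ldots,I(X_n),(t-1)m\}} \Delta_t(i) \right].$$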

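Now consider a random variable $X \sim P^{(\sigma)}_t(\cdot \times \mathcal{Y})$: that is, $X$ has distribution the same as the marginal of $P^{(\sigma)}_t$ on $\mathcal{X}$. Then note that the conditional distribution of $\Delta_t(I(X))$ given $\sigma$ is $\mathrm{Binomial}(\log_2(m), \frac{1}{2})$. Let $q = 16e^4$, and suppose the numerical constant $c$ is sufficiently large so that $\frac{1}{2q}\log_2(m) \geq 1$. Then we have

$$
\begin{aligned}
\mathbb{P}\left( \Delta_t(I(X)) \leq \frac{1}{2q}\log_2(m) \,\middle|\, \sigma \right)
&= \sum_{\ell=0}^{\lfloor (1/2q)\log_2(m) \rfloor} \binom{\log_2(m)}{\ell} \frac{1}{m} \\
&\geq \frac{1}{m} \left( \frac{\log_2(m)}{\lfloor (1/2q)\log_2(m) \rfloor} \right)^{\lfloor (1/2q)\log_2(m) \rfloor}
\geq \frac{1}{m} (4q)^{(1/2q)\log_2(m)} = m^{(1/2q)\log_2(4q) - 1}.
\end{aligned}
$$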
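Two of the steps in this display are easy to sanity-check on small values: the first equality (a uniformly random index has a binomially distributed number of bits differing from $i^*_t$) and the final equality (rewriting the power of $4q$ as a power of $m$). The sketch below uses hypothetical function names and small illustrative parameters; it does not check the intermediate inequalities.

```python
import math

def hamming_ball_probability(m, i_star, radius):
    """P(Delta <= radius) by brute force, for an index uniform on {0, ..., m-1}."""
    hits = sum(1 for i in range(m) if bin(i ^ i_star).count("1") <= radius)
    return hits / m

def binomial_sum(m, radius):
    """The same probability as a sum of binomial coefficients divided by m."""
    L = m.bit_length() - 1          # log2(m) when m is a power of two
    return sum(math.comb(L, ell) for ell in range(radius + 1)) / m

def product_form(m, q):
    """(1/m) * (4q)^{(1/2q) log2(m)}."""
    return (4 * q) ** (math.log2(m) / (2 * q)) / m

def power_form(m, q):
    """m^{(1/2q) log2(4q) - 1}."""
    return m ** (math.log2(4 * q) / (2 * q) - 1)
```

For example, with $m = 16$ exactly $1 + 4 = 5$ indices lie within Hamming distance one of any fixed index, so both probability functions return $5/16$ at radius one.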

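Thus, by independence of the samples $X_1,\ldots,X_n$, denoting by $n_t$ the number of indices $i \leq n$ with $I(X_i) \in B_t$, we have

$$
\begin{aligned}
&\mathbb{P}\left( \min_{i \in B_t \cap \{I(X_1),\ldots,I(X_n),(t-1)m\}} \Delta_t(i) > \frac{1}{2q}\log_2(m) \,\middle|\, \sigma, n_t \right) \\
&\quad\leq \mathbb{P}\left( \forall i \in B_t \cap \{I(X_1),\ldots,I(X_n)\},\ \Delta_t(i) > \frac{1}{2q}\log_2(m) \,\middle|\, \sigma, n_t \right)
= \mathbb{P}\left( \Delta_t(I(X)) > \frac{1}{2q}\log_2(m) \,\middle|\, \sigma \right)^{n_t} \\
&\quad\leq \left( 1 - m^{(1/2q)\log_2(4q) - 1} \right)^{n_t} \leq \exp\left\{ -m^{(1/2q)\log_2(4q) - 1}\, n_t \right\}.
\end{aligned}
$$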

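Altogether, by the law of total expectation, and using the fact that $\Delta_t(i) \leq \log_2(m)$ and $\epsilon \leq 1$, we have established that

$$\mathbb{E}\left[ \min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) \right] \leq \frac{\epsilon}{2q} + \mathbb{E}\left[ \exp\left\{ -m^{(1/2q)\log_2(4q) - 1}\, n_t \right\} \right].$$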

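Since $n_t$ is a $\mathrm{Binomial}(n, \frac{1}{k})$ random variable, the rightmost term evaluates to the moment generating function of this distribution: that is,

$$
\begin{aligned}
\mathbb{E}\left[ \exp\left\{ -m^{(1/2q)\log_2(4q) - 1}\, n_t \right\} \right]
&= \left( 1 - \frac{1}{k} + \frac{1}{k} \exp\left\{ -m^{(1/2q)\log_2(4q) - 1} \right\} \right)^{n} \\
&\leq \max\left\{ 2\left( 1 - \frac{1}{k} \right)^{n},\ 2\left( \frac{1}{k} \right)^{n} \exp\left\{ -m^{(1/2q)\log_2(4q) - 1}\, n \right\} \right\} \\
&\leq \max\left\{ 2e^{-n/k},\ 2\exp\left\{ -m^{(1/2q)\log_2(4q)} \right\} \right\} \\
&= \max\left\{ 2e^{-n/k},\ 2\left( \exp\left\{ -\tfrac{1}{2q}\log_2(4q)\, m^{(1/2q)\log_2(4q)} \right\} \right)^{\frac{2q}{\log_2(4q)}} \right\} \\
&\leq \max\left\{ 2e^{-n/k},\ 2\left( \frac{2q}{\log_2(4q)} \right)^{\frac{2q}{\log_2(4q)}} \frac{1}{m} \right\}.
\end{aligned}
$$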
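The first equality in this display is the standard closed form for the moment generating function of a binomial random variable. The sketch below compares it against a direct sum over the probability mass function; the function names and parameter values are hypothetical, and only this first equality is checked.

```python
import math

def binomial_mgf_by_sum(n, p, a):
    """E[exp(-a * N)] for N ~ Binomial(n, p), summed directly over the pmf."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j) * math.exp(-a * j)
        for j in range(n + 1)
    )

def binomial_mgf_closed_form(n, p, a):
    """(1 - p + p * exp(-a))^n, the form used in the display with p = 1/k."""
    return (1 - p + p * math.exp(-a)) ** n

# Illustrative values: n = 20 samples, k = 4 blocks (p = 1/4), a = 0.3.
direct = binomial_mgf_by_sum(20, 0.25, 0.3)
closed = binomial_mgf_closed_form(20, 0.25, 0.3)
```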

Both of these terms shrink strictly faster than the above specification of $\epsilon$ as a function of $n/k$, and therefore, for a sufficiently large choice of the numerical constant $c$, both of these terms are smaller than $\frac{\epsilon}{32e^4}$. We conclude that

$$\mathbb{E}\left[ \min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) \right] \leq \frac{\epsilon}{16e^4},$$

as claimed.

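Together, these two components imply that

$$
\begin{aligned}
\mathbb{E}\left[ R(\hat{h};P^{(\sigma)}_t) - \min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t) \right]
&= \mathbb{E}\left[ R(\hat{h};P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) \right] - \mathbb{E}\left[ \min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t) - R(h^*_t;P^{(\sigma)}_t) \right] \\
&\geq \frac{\epsilon}{16e^4}.
\end{aligned}
$$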

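Finally, it is time to combine these results for the individual blocks into a global statement about $\mathcal{H}_{k,\rho}(Z_{[n]})$. In particular, note that any $h$ has $R(h;P^{(\sigma)}) = \frac{1}{k}\sum_{t=1}^{k} R(h;P^{(\sigma)}_t)$. Also note that any $h$ that $\rho$ is capable of producing from arguments that are subsets of $Z_{[n]}$ can be represented as $\tilde{h}_{i_1,\ldots,i_k}$ for some $i_1,\ldots,i_k$ where every $t \in \{1,\ldots,k\}$ has $i_t \in B_t$ and $x_{i_t} \in \{X_1,\ldots,X_n\} \cup \{x_{(t-1)m}\}$ (where the addition of the $x_{(t-1)m}$ covers the case that the set does not include any $x_i$ with $i \in B_t$, as we defined that case above). Furthermore, every function $\tilde{h}_{i_1,\ldots,i_k}$ with $i_t$ values satisfying these conditions can be realized by $\rho$ using an argument that is a subset of $Z_{[n]}$ of size at most $k$: namely, the set $\{x_{i_t} : t \leq k\} \cap \{X_1,\ldots,X_n\}$. Therefore,

$$
\begin{aligned}
\min_{h \in \mathcal{H}_{k,\rho}(Z_{[n]})} R(h;P^{(\sigma)})
&= \min_{\substack{(i_1,\ldots,i_k) \in B_1 \times \cdots \times B_k : \\ \{x_{i_1},\ldots,x_{i_k}\} \subseteq \{X_1,\ldots,X_n\} \cup \{x_{(t-1)m} : t \leq k\}}} R(\tilde{h}_{i_1,\ldots,i_k};P^{(\sigma)}) \\
&= \min_{\substack{(i_1,\ldots,i_k) \in B_1 \times \cdots \times B_k : \\ \{x_{i_1},\ldots,x_{i_k}\} \subseteq \{X_1,\ldots,X_n\} \cup \{x_{(t-1)m} : t \leq k\}}} \frac{1}{k} \sum_{t=1}^{k} R(h_{t,i_t};P^{(\sigma)}_t) \\
&= \frac{1}{k} \sum_{t=1}^{k} \min_{i_t \in B_t : x_{i_t} \in \{X_1,\ldots,X_n,x_{(t-1)m}\}} R(h_{t,i_t};P^{(\sigma)}_t)
= \frac{1}{k} \sum_{t=1}^{k} \min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t).
\end{aligned}
$$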

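Thus, for any compression function $\kappa$, denoting $\hat{h} = \rho(\kappa(Z_{[n]}))$,

$$\mathbb{E}\left[ R(\hat{h};P^{(\sigma)}) - \min_{h \in \mathcal{H}_{k,\rho}(Z_{[n]})} R(h;P^{(\sigma)}) \right] \geq \frac{1}{k} \sum_{t=1}^{k} \mathbb{E}\left[ R(\hat{h};P^{(\sigma)}_t) - \min_{h \in \mathcal{H}_t(Z_{[n]})} R(h;P^{(\sigma)}_t) \right] \geq \frac{1}{16e^4}\epsilon \gtrsim \sqrt{\frac{k \log(n/k)}{n}}.$$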

## 3 Order-Dependent Compression Schemes

The above construction shows that the well-known upper bound for agnostic compression schemes is sometimes tight. Note that, in the definition of agnostic compression schemes, we required that the reconstruction function take as input a (multi)set. This type of compression scheme is often referred to as being permutation invariant, since the compression set argument is unordered (or equivalently does not depend on the order of elements in its argument).

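We can also show a related result for the case of order-dependent compression schemes. An order-dependent agnostic sample compression scheme is specified by a size $k \in \mathbb{N}$ and an order-dependent reconstruction function $\rho$, which maps any ordered sequence $S \in \mathcal{Z}^{k'}$ with $0 \leq k' \leq k$ to a measurable function $h : \mathcal{X} \to \mathcal{Y}$. For any $n \in \mathbb{N}$ and any sequence $z_1,\ldots,z_n \in \mathcal{Z}$, define

$$\mathcal{H}_{k,\rho}(z_1,\ldots,z_n) = \left\{ \rho((z_{i_1},\ldots,z_{i_{k'}})) : k' \leq k,\ i_1,\ldots,i_{k'} \in \{1,\ldots,n\} \right\}.$$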

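Now for any probability measure $P$ on $\mathcal{Z}$ and any $n \in \mathbb{N}$, continuing the notation from above, for any fixed order-dependent agnostic sample compression scheme $(k,\rho)$, as above denote

$$\mathcal{E}^{o}_{\mathrm{uc}}(n,k,\rho,P) = \mathbb{E} \sup_{h \in \mathcal{H}_{k,\rho}(Z_1,\ldots,Z_n)} \left| \hat{R}(h;Z_{[n]}) - R(h;P) \right|,$$

and for any $n, k \in \mathbb{N}$, define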