
# Refined Error Bounds for Several Learning Algorithms

Steve Hanneke (steve.hanneke@gmail.com)
###### Abstract

This article studies the achievable guarantees on the error rates of certain learning algorithms, with particular focus on refining logarithmic factors. Many of the results are based on a general technique for obtaining bounds on the error rates of sample-consistent classifiers with monotonic error regions, in the realizable case. We prove bounds of this type expressed in terms of either the VC dimension or the sample compression size. This general technique also enables us to derive several new bounds on the error rates of general sample-consistent learning algorithms, as well as refined bounds on the label complexity of the CAL active learning algorithm. Additionally, we establish a simple necessary and sufficient condition for the existence of a distribution-free bound on the error rates of all sample-consistent learning rules, converging at a rate inversely proportional to the sample size. We also study learning in the presence of classification noise, deriving a new excess error rate guarantee for general VC classes under Tsybakov’s noise condition, and establishing a simple and general necessary and sufficient condition for the minimax excess risk under bounded noise to converge at a rate inversely proportional to the sample size.


Editor: John Shawe-Taylor

Keywords: sample complexity, PAC learning, statistical learning theory, active learning, minimax analysis

## 1 Introduction

Supervised machine learning is a classic topic, in which a learning rule is tasked with producing a classifier that mimics the classifications that would be assigned by an expert for a given task. To achieve this, the learner is given access to a collection of examples (assumed to be i.i.d.) labeled with the correct classifications. One of the major theoretical questions of interest in learning theory is: How many examples are necessary and sufficient for a given learning rule to achieve low classification error rate? This quantity is known as the sample complexity, and varies depending on how small the desired classification error rate is, the type of classifier we are attempting to learn, and various other factors. Equivalently, the question is: How small of an error rate can we guarantee a given learning rule will achieve, for a given number of labeled training examples?

A particularly simple setting for supervised learning is the realizable case, in which it is assumed that, within a given set of classifiers, there resides some classifier that is always correct. The optimal sample complexity of learning in the realizable case has recently been completely resolved, up to constant factors, in a sibling paper to the present article (Hanneke, 2016). However, there remains the important task of identifying interesting general families of algorithms achieving this optimal sample complexity. For instance, the best known general upper bounds for the general family of empirical risk minimization algorithms differ from the optimal sample complexity by a logarithmic factor, and it is known that there exist spaces for which this is unavoidable (Auer and Ortner, 2007). This same logarithmic factor gap appears in the analysis of several other learning methods as well. The present article focuses on this logarithmic factor, arguing that for certain types of learning rules, it can be entirely removed in some cases, and for others it can be somewhat refined. The technique leading to these results is rooted in an idea introduced in the author’s doctoral dissertation (Hanneke, 2009). By further exploring this technique, we also obtain new results for the related problem of active learning. We also derive interesting new results for learning with classification noise, where again the focus is on a logarithmic factor gap between upper and lower bounds.

### 1.1 Basic Notation

Before further discussing the results, we first introduce some essential notation. Let $\mathcal{X}$ be any nonempty set, called the instance space, equipped with a $\sigma$-algebra $\mathcal{B}_{\mathcal{X}}$ defining the measurable sets; for simplicity, we will suppose the sets discussed below are all measurable. Let $\mathcal{Y} = \{-1, +1\}$ be the label space. A classifier is any measurable function $h : \mathcal{X} \to \mathcal{Y}$. Following Vapnik and Chervonenkis (1971), define the VC dimension of a set $\mathcal{A}$ of subsets of $\mathcal{X}$, denoted $\mathrm{vc}(\mathcal{A})$, as the maximum cardinality $|S|$ over subsets $S \subseteq \mathcal{X}$ such that $\{A \cap S : A \in \mathcal{A}\} = 2^{S}$ (the power set of $S$); if no such maximum cardinality exists, define $\mathrm{vc}(\mathcal{A}) = \infty$. For any set $\mathcal{H}$ of classifiers, denote by $\mathrm{vc}(\mathcal{H}) = \mathrm{vc}(\{\{x : h(x) = +1\} : h \in \mathcal{H}\})$ the VC dimension of $\mathcal{H}$. Throughout, we fix a set $\mathbb{C}$ of classifiers, known as the concept space, and abbreviate $d = \mathrm{vc}(\mathbb{C})$. To focus on nontrivial cases, throughout we suppose $|\mathbb{C}| \geq 3$, which implies $d \geq 1$. We will also generally suppose $d < \infty$ (though some of the results would still hold without this restriction).
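To make the definition concrete, the VC dimension of a small finite class can be computed by brute force, checking shattering directly from the definition. The following sketch is our own illustration (the function names and the threshold example are not from the paper):

```python
from itertools import combinations

def vc_dimension(instance_space, concept_class):
    """Brute-force VC dimension of a finite concept class.

    concept_class: iterable of frozensets, each the set of points labeled +1.
    Returns the largest |S| such that S is shattered, i.e., every subset of S
    is realized as C & S for some concept C.
    """
    concepts = [frozenset(c) for c in concept_class]
    points = list(instance_space)
    d = 0
    for k in range(1, len(points) + 1):
        shattered_k = False
        for S in combinations(points, k):
            S = frozenset(S)
            patterns = {c & S for c in concepts}
            if len(patterns) == 2 ** k:  # all 2^k labelings realized on S
                shattered_k = True
                break
        if shattered_k:
            d = k
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return d

# Threshold classifiers {x : x >= t} on {1,...,5} (t = 6 gives the empty set).
thresholds = [frozenset(range(t, 6)) for t in range(1, 7)]
assert vc_dimension(range(1, 6), thresholds) == 1  # no pair of points is shattered
```

Thresholds cannot shatter any two points $a < b$, since every threshold containing $a$ also contains $b$, so the computation returns 1.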

For any $m \in \mathbb{N}$, any $\mathcal{L} \in (\mathcal{X} \times \mathcal{Y})^{m}$, and any classifier $h$, define $\mathrm{er}_{\mathcal{L}}(h) = \frac{1}{m} \sum_{(x,y) \in \mathcal{L}} \mathbb{1}[h(x) \neq y]$. For completeness, also define $\mathrm{er}_{\{\}}(h) = 0$. Also, for any set $\mathcal{H}$ of classifiers, denote $\mathcal{H}[\mathcal{L}] = \{h \in \mathcal{H} : \mathrm{er}_{\mathcal{L}}(h) = 0\}$, referred to as the set of classifiers in $\mathcal{H}$ consistent with $\mathcal{L}$; for completeness, also define $\mathcal{H}[\{\}] = \mathcal{H}$. Fix an arbitrary probability measure $\mathcal{P}$ on $\mathcal{X}$ (called the data distribution), and a classifier $f^{\star} \in \mathbb{C}$ (called the target function). For any classifier $h$, denote $\mathrm{er}(h) = \mathcal{P}(x : h(x) \neq f^{\star}(x))$, the error rate of $h$. Let $X_1, X_2, \ldots$ be independent $\mathcal{P}$-distributed random variables. We generally denote $\mathcal{L}_m = \{(X_1, f^{\star}(X_1)), \ldots, (X_m, f^{\star}(X_m))\}$, and $V_m = \mathbb{C}[\mathcal{L}_m]$ (called the version space). The general setting in which we are interested in producing a classifier $\hat{h}$ with small $\mathrm{er}(\hat{h})$, given access to the data $\mathcal{L}_m$, is a special case of supervised learning known as the realizable case (in contrast to settings where the observed labeling might not be realizable by any classifier in $\mathbb{C}$, due to label noise or model misspecification, as discussed in Section 6).

We adopt a few convenient notational conventions. For any $x \geq 0$, denote $\mathrm{Log}(x) = \max\{\ln(x), 1\}$; also denote $\mathrm{Log}_2(x) = \max\{\log_2(x), 1\}$. We adopt a shorthand notation for sequences, so that for a sequence $z_1, z_2, \ldots$, we denote $z_{1:t} = (z_1, \ldots, z_t)$. For any $\mathbb{R}$-valued functions $f, g$, we write $f \lesssim g$ or $g \gtrsim f$ if there exists a finite numerical constant $c > 0$ such that $f \leq c\, g$ for all arguments. For any $x \in \mathbb{R}$, denote by $\lceil x \rceil$ the smallest integer not smaller than $x$, and by $\lfloor x \rfloor$ the largest integer not larger than $x$. For $a, b \in \mathbb{R}$, denote $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We also adopt the conventions that $x/0 = \infty$ for $x > 0$, and $0 \cdot \infty = 0$. It will also be convenient to use the notation $z_{1:0} = ()$ for the empty sequence, with $\{z_{1:0}\} = \emptyset$ the corresponding set. Throughout, we also make the usual implicit assumption that all quantities required to be measurable in the proofs and lemmas from the literature are indeed measurable. See, for instance, van der Vaart and Wellner (1996, 2011), for discussions of conditions on $\mathbb{C}$ that typically suffice for this.

### 1.2 Background and Summary of the Main Results

This work concerns the study of the error rates achieved by various learning rules: that is, mappings from the data set $\mathcal{L}_m$ to a classifier $\hat{h}_m$; for simplicity, we sometimes refer to $\hat{h}_m$ itself as a learning rule, leaving the dependence on $\mathcal{L}_m$ implicit. There has been a substantial amount of work on bounding the error rates of various learning rules in the realizable case. Perhaps the most basic and natural type of learning rule in this setting is the family of consistent learning rules: that is, those that choose $\hat{h}_m \in V_m$. There is a general upper bound for all consistent learning rules $\hat{h}_m$, due to Vapnik and Chervonenkis (1974); Blumer, Ehrenfeucht, Haussler, and Warmuth (1989), stating that with probability at least $1 - \delta$,

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(d\,\mathrm{Log}\!\left(\frac{m}{d}\right) + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right). \tag{1}$$

This is complemented by a general lower bound of Ehrenfeucht, Haussler, Kearns, and Valiant (1989), which states that for any learning rule (consistent or otherwise), there exists a choice of $\mathcal{P}$ and $f^{\star}$ such that, with probability greater than $\delta$,

$$\mathrm{er}(\hat{h}_m) \gtrsim \frac{1}{m}\left(d + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right). \tag{2}$$

Resolving the logarithmic factor gap between (2) and (1) has been a challenging subject of study for decades now, with many interesting contributions resolving special cases and proposing sometimes-better upper bounds (e.g., Haussler, Littlestone, and Warmuth, 1994; Giné and Koltchinskii, 2006; Auer and Ortner, 2007; Long, 2003). It is known that the lower bound is sometimes not achieved by certain consistent learning rules (Auer and Ortner, 2007). The question of whether the lower bound (2) can always be achieved by some algorithm remained open for a number of years (Ehrenfeucht, Haussler, Kearns, and Valiant, 1989; Warmuth, 2004), but has recently been resolved in a sibling paper to the present article (Hanneke, 2016). That work proposes a learning rule based on a majority vote of classifiers consistent with carefully-constructed subsamples of the data, and proves that with probability at least $1 - \delta$,

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(d + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right).$$

However, several avenues for investigation remain open, including identifying interesting general families of learning rules able to achieve this optimal bound under general conditions on $\mathbb{C}$. In particular, it remains an open problem to determine necessary and sufficient conditions on $\mathbb{C}$ for the entire family of consistent learning rules to achieve the above optimal error bound.
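To make the family of consistent learning rules concrete, here is a minimal sketch (our own illustration, with hypothetical names) over a finite concept class: the rule simply returns some element of the version space, which the realizable-case assumption guarantees is nonempty.

```python
def version_space(concepts, data):
    """Classifiers (sets of +1 points) consistent with every labeled example."""
    return [c for c in concepts
            if all((x in c) == (y == +1) for x, y in data)]

def consistent_learner(concepts, data):
    """A deterministic consistent learning rule: return the first element
    of the version space (nonempty in the realizable case)."""
    return version_space(concepts, data)[0]

# Example: thresholds {x : x >= t} on {1,...,5}; target threshold t = 3.
thresholds = [frozenset(range(t, 6)) for t in range(1, 7)]
data = [(x, +1 if x >= 3 else -1) for x in (1, 4, 2, 5)]
h = consistent_learner(thresholds, data)
```

Any tie-breaking policy over the version space yields a consistent rule; the bounds discussed here apply to all of them simultaneously.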

The work of Giné and Koltchinskii (2006) includes a bound that refines the logarithmic factor in (1) in certain scenarios. Specifically, it states that, for any consistent learning rule $\hat{h}_m$, with probability at least $1 - \delta$,

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(d\,\mathrm{Log}\!\left(\theta\!\left(\frac{d}{m}\right)\right) + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right), \tag{3}$$

where $\theta(\cdot)$ is the disagreement coefficient (defined below in Section 4). The doctoral dissertation of Hanneke (2009) contains a simple and direct proof of this bound, based on an argument which splits the data set in two parts, and considers the second part as containing a subsequence sampled from the conditional distribution given the region of disagreement of the version space induced by the first part of the data. Many of the results in the present work are based on variations of this argument, including a variety of interesting new bounds on the error rates achieved by certain families of learning rules.

As one of the cornerstones of this work, we find that a variant of this argument for consistent learning rules with monotonic error regions leads to an upper bound that matches the lower bound (2) up to constant factors. For such monotonic consistent learning rules to exist, we would need a very special kind of concept space. However, they do exist in some important cases. In particular, in the special case of learning intersection-closed concept spaces, the Closure algorithm (Natarajan, 1987; Auer and Ortner, 2004, 2007) can be shown to satisfy this monotonicity property. Thus, this result immediately implies that, with probability at least $1 - \delta$, the Closure algorithm achieves

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(d + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right),$$

which was an open problem of Auer and Ortner (2004, 2007); this fact was recently also obtained by Darnstädt (2015), via a related direct argument. We also discuss a variant of this result for monotone learning rules expressible as compression schemes, where we remove a logarithmic factor present in a result of Littlestone and Warmuth (1986) and Floyd and Warmuth (1995), so that for $\hat{h}_m$ based on a compression scheme of size $n$, which has monotonic error regions (and is permutation-invariant), with probability at least $1 - \delta$,

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(n + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right).$$
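As an illustration of the Closure algorithm in a familiar intersection-closed case (our own sketch, not code from the paper): for axis-aligned rectangles, the closure of the positive examples is their bounding box, and the hypothesis only grows as positives accumulate, so its error region shrinks monotonically.

```python
def closure_rectangle(data):
    """Closure algorithm for axis-aligned rectangles in the plane (an
    intersection-closed class): return the smallest rectangle containing
    every positive example, as (x_min, x_max, y_min, y_max)."""
    pos = [x for x, y in data if y == +1]
    if not pos:
        return None  # smallest consistent hypothesis: label everything -1
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    if rect is None:
        return -1
    x_min, x_max, y_min, y_max = rect
    return +1 if (x_min <= point[0] <= x_max and
                  y_min <= point[1] <= y_max) else -1

# The returned rectangle is always contained in the target rectangle in the
# realizable case, so adding data can only move it toward the target.
data = [((1, 1), +1), ((3, 2), +1), ((0, 0), -1), ((2, 1.5), +1)]
rect = closure_rectangle(data)
```

Since the hypothesis is the smallest concept containing the positives, it never mislabels a negative point in the realizable case, and its error region only shrinks as $m$ grows, which is exactly the monotonicity property exploited above.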

This argument also has implications for active learning. In many active learning algorithms, the region of disagreement of the version space induced by $m$ samples, $\mathrm{DIS}(V_m)$, plays an important role. In particular, the label complexity of the CAL active learning algorithm (Cohn, Atlas, and Ladner, 1994) is largely determined by the rate at which $\mathcal{P}(\mathrm{DIS}(V_m))$ decreases, so that any bound on this quantity can be directly converted into a bound on the label complexity of CAL (Hanneke, 2011, 2009, 2014; El-Yaniv and Wiener, 2012). Wiener, Hanneke, and El-Yaniv (2015) have argued that the region $\mathrm{DIS}(V_m)$ can be described as a compression scheme, where the size of the compression scheme, denoted $\hat{n}_m$, is known as the version space compression set size (Definition 6 below). By further observing that $\mathrm{DIS}(V_m)$ is monotonic in $m$, applying our general argument yields the fact that, with probability at least $1 - \delta$, letting $\hat{n}_{1:m} = \max_{1 \leq t \leq m} \hat{n}_t$,

$$\mathcal{P}\!\left(\mathrm{DIS}(V_m)\right) \lesssim \frac{1}{m}\left(\hat{n}_{1:m} + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right), \tag{4}$$

which is typically an improvement over the best previously-known general bound by a logarithmic factor.
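A minimal sketch of CAL for a finite concept class (our own illustration; the paper treats CAL abstractly) makes the role of the region of disagreement explicit: a label is requested only when the current version space disagrees on the point.

```python
def in_dis(V, x):
    """x is in the region of disagreement DIS(V) iff classifiers in V
    (sets of +1 points) disagree on its label."""
    return len({x in c for c in V}) == 2

def cal(concepts, stream, oracle):
    """CAL (Cohn, Atlas, and Ladner, 1994), sketched for a finite class:
    process unlabeled points in sequence, request a label only inside the
    region of disagreement, and keep only consistent concepts."""
    V = list(concepts)
    queries = 0
    for x in stream:
        if in_dis(V, x):
            y = oracle(x)  # label request
            queries += 1
            V = [c for c in V if (x in c) == (y == +1)]
        # points outside DIS(V) are labeled by consensus; V is unchanged
    return V, queries

# Thresholds {x : x >= t} on {1,...,5}, with target threshold t = 3.
thresholds = [frozenset(range(t, 6)) for t in range(1, 7)]
V, queries = cal(thresholds, [1, 2, 3, 4, 5],
                 lambda x: +1 if x >= 3 else -1)
```

The number of queries is governed by how quickly $\mathcal{P}(\mathrm{DIS}(V_m))$ shrinks, which is why bounds such as (4) translate directly into label complexity bounds for CAL.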

In studying the distribution-free minimax label complexity of active learning, Hanneke and Yang (2015) found that a simple combinatorial quantity $\mathfrak{s}$, which they term the star number, is of fundamental importance. Specifically (see also Definition 9), $\mathfrak{s}$ is the largest number $s$ of distinct points $x_1, \ldots, x_s$ such that there exist classifiers $h_0, h_1, \ldots, h_s \in \mathbb{C}$ with $\mathrm{DIS}(\{h_0, h_i\}) \cap \{x_1, \ldots, x_s\} = \{x_i\}$ for each $i \in \{1, \ldots, s\}$, or else $\mathfrak{s} = \infty$ if no such largest $s$ exists. Interestingly, the work of Hanneke and Yang (2015) also establishes that the largest possible value of $\hat{n}_{1:m}$ (over $m$ and the data set) is exactly $\mathfrak{s}$. Thus, (4) also implies a data-independent and distribution-free bound: with probability at least $1 - \delta$,

$$\mathcal{P}\!\left(\mathrm{DIS}(V_m)\right) \lesssim \frac{1}{m}\left(\mathfrak{s} + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right).$$
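For intuition, the star number of a small finite class can be computed by exhaustive search directly from the definition. This sketch is our own illustration (for threshold classifiers the star number is 2, since a fixed $h_0$ can have singleton disagreements with at most two of any three points):

```python
from itertools import combinations

def star_number(points, concepts):
    """Brute-force star number (Hanneke and Yang, 2015) of a finite class:
    the largest s for which there are points x1,...,xs and classifiers
    h0,h1,...,hs with each hi disagreeing with h0 on exactly xi among the xj."""
    concepts = [frozenset(c) for c in concepts]
    best = 0
    for s in range(1, len(points) + 1):
        for S in combinations(points, s):
            for h0 in concepts:
                if all(any({x for x in S if (x in h) != (x in h0)} == {xi}
                           for h in concepts)
                       for xi in S):
                    best = max(best, s)
    return best

# Thresholds {x : x >= t} on {1,...,5}.
thresholds = [frozenset(range(t, 6)) for t in range(1, 7)]
```

Two thresholds disagree exactly on the points lying between them, so for any center $h_0$ only the two adjacent "gaps" can produce singleton disagreements, giving star number 2.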

Now one interesting observation at this point is that the direct proof of (3) from Hanneke (2009) involves a step in which $\mathcal{P}\!\left(\mathrm{DIS}(V_{\lfloor m/2 \rfloor})\right)$ is relaxed to a bound in terms of the disagreement coefficient $\theta$. If we instead use (4) in this step, we arrive at a new bound on the error rates of all consistent learning rules $\hat{h}_m$: with probability at least $1 - \delta$,

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(d\,\mathrm{Log}\!\left(\frac{\hat{n}_{1:m}}{d}\right) + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right). \tag{5}$$

Since Hanneke and Yang (2015) have shown that the maximum possible value of $\hat{n}_{1:m}$ (over $m$, $\mathcal{P}$, and $f^{\star}$) is also exactly the star number $\mathfrak{s}$, while the disagreement coefficient $\theta$ likewise has $\mathfrak{s}$ as its maximum possible value, we see that the bound in (5) sometimes reflects an improvement over (3). It further implies a new data-independent and distribution-free bound for any consistent learning rule $\hat{h}_m$: with probability at least $1 - \delta$,

$$\mathrm{er}(\hat{h}_m) \lesssim \frac{1}{m}\left(d\,\mathrm{Log}\!\left(\frac{\min\{\mathfrak{s}, m\}}{d}\right) + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right).$$

Interestingly, we are able to complement this with a lower bound in Section 5.1. Though not quite matching the above in terms of its joint dependence on $d$ and $\mathfrak{s}$ (and necessarily so), this lower bound does provide the interesting observation that $\mathfrak{s} < \infty$ is necessary and sufficient for there to exist a distribution-free bound on the error rates of all consistent learning rules converging at a rate $\frac{1}{m}$, and otherwise (when $\mathfrak{s} = \infty$) the best such bound is $\frac{\mathrm{Log}(m)}{m}$.

Continuing with the investigation of general consistent learning rules, we also find a variant of the argument of Hanneke (2009) that refines (3) in a different way: namely, replacing $\theta$ with a quantity based on considering a well-chosen subregion of the region of disagreement, as studied by Balcan, Broder, and Zhang (2007); Zhang and Chaudhuri (2014). Specifically, in the context of active learning, Zhang and Chaudhuri (2014) have proposed a general quantity (Definition 15 below), which is never larger than $\theta$, and is sometimes significantly smaller. By adapting our general argument to replace the region of disagreement with this well-chosen subregion, we derive a corresponding bound, in terms of this quantity, holding for all consistent learning rules $\hat{h}_m$ with probability at least $1 - \delta$.

In particular, as a special case of this general result, we recover the theorem of Balcan and Long (2013) that all consistent learning rules have optimal sample complexity (up to constant factors) for the problem of learning homogeneous linear separators under isotropic log-concave distributions, as the quantity in question is bounded by a finite numerical constant in this case. In Section 6, we also extend this result to the problem of learning with classification noise, where there is also a logarithmic factor gap between the known general-case upper and lower bounds. In this context, we derive a new general upper bound under the Bernstein class condition (a generalization of Tsybakov's noise condition), expressed in terms of a related quantity, which applies to a particular learning rule. This sometimes reflects an improvement over the best previous general upper bounds (Massart and Nédélec, 2006; Giné and Koltchinskii, 2006; Hanneke and Yang, 2012), and again recovers a result of Balcan and Long (2013) for homogeneous linear separators under isotropic log-concave distributions, as a special case.

For many of these results, we also state bounds on the expected error rate $\mathbb{E}\!\left[\mathrm{er}(\hat{h}_m)\right]$. In this case, the optimal distribution-free bound is known to be within a constant factor of $\frac{d}{m}$ (Haussler, Littlestone, and Warmuth, 1994; Li, Long, and Srinivasan, 2001), and this rate is achieved by the one-inclusion graph prediction algorithm of Haussler, Littlestone, and Warmuth (1994), as well as the majority voting method of Hanneke (2016). However, there remain interesting questions about whether other algorithms achieve this optimal performance, or require an extra logarithmic factor. Again we find that monotone consistent learning rules indeed achieve this optimal $\frac{d}{m}$ rate (up to constant factors), while a distribution-free bound on $\mathbb{E}\!\left[\mathrm{er}(\hat{h}_m)\right]$ with $\frac{1}{m}$ dependence on $m$ is achieved by all consistent learning rules if and only if $\mathfrak{s} < \infty$, and otherwise the best such bound has $\frac{\mathrm{Log}(m)}{m}$ dependence on $m$.

As a final interesting result, in the context of learning with classification noise, under the bounded noise assumption (Massart and Nédélec, 2006), we find that the condition $\mathfrak{s} < \infty$ is actually necessary and sufficient for the minimax optimal excess error rate to decrease at a rate $\frac{1}{m}$, and otherwise (if $\mathfrak{s} = \infty$) it decreases at a rate $\frac{\mathrm{Log}(m)}{m}$. This result generalizes several special-case analyses from the literature (Massart and Nédélec, 2006; Raginsky and Rakhlin, 2011). Note that the "necessity" part of this statement is significantly stronger than the above result for consistent learning rules in the realizable case, since this result applies to the best error guarantee achievable by any learning rule.

## 2 Bounds for Consistent Monotone Learning

In order to state our results for monotonic learning rules in an abstract form, we introduce the following notation. Let $\mathcal{Z}$ denote any space, equipped with a $\sigma$-algebra defining the measurable subsets. For any collection $\mathcal{A}$ of measurable subsets of $\mathcal{Z}$, a consistent monotone rule is any sequence of functions $\psi_t : \mathcal{Z}^t \to \mathcal{A}$, $t \in \mathbb{N}$, such that, $\forall t \in \mathbb{N}$ and $\forall z_1, z_2, \ldots \in \mathcal{Z}$, $\psi_{t+1}(z_{1:t+1}) \subseteq \psi_t(z_{1:t})$, and $\{z_1, \ldots, z_t\} \cap \psi_t(z_{1:t}) = \emptyset$. We begin with the following very simple result, the proof of which will also serve to introduce, in its simplest form, the core technique underlying many of the results presented in later sections below.

###### Theorem 1

Let $\mathcal{A}$ be a collection of measurable subsets of $\mathcal{Z}$, and let $\psi_t$ (for $t \in \mathbb{N}$) be any consistent monotone rule. Fix any $m \in \mathbb{N}$, any $\delta \in (0,1)$, and any probability measure $\mathcal{P}$ on $\mathcal{Z}$. Letting $Z_1, \ldots, Z_m$ be independent $\mathcal{P}$-distributed random variables, and denoting $A_m = \psi_m(Z_{1:m})$, with probability at least $1 - \delta$,

$$\mathcal{P}(A_m) \leq \frac{4}{m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln\!\left(\frac{4}{\delta}\right)\right). \tag{6}$$

Furthermore,

$$\mathbb{E}\!\left[\mathcal{P}(A_m)\right] \leq \frac{68\left(\mathrm{vc}(\mathcal{A}) + 1\right)}{m}. \tag{7}$$

The overall structure of this proof is based on an argument of Hanneke (2009). The most-significant novel element here is the use of monotonicity to further refine a logarithmic factor. The proof relies on the following classic result. Results of this type are originally due to Vapnik and Chervonenkis (1974); the version stated here features slightly better constant factors, due to Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).

###### Lemma 2

For any collection $\mathcal{A}$ of measurable subsets of $\mathcal{Z}$, any $\delta \in (0,1)$, any $m \in \mathbb{N}$, and any probability measure $\mathcal{P}$ on $\mathcal{Z}$, letting $Z_1, \ldots, Z_m$ be independent $\mathcal{P}$-distributed random variables, with probability at least $1 - \delta$, every $A \in \mathcal{A}$ with $A \cap \{Z_1, \ldots, Z_m\} = \emptyset$ satisfies

$$\mathcal{P}(A) \leq \frac{2}{m}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}_2\!\left(\frac{2em}{\mathrm{vc}(\mathcal{A})}\right) + \mathrm{Log}_2\!\left(\frac{2}{\delta}\right)\right).$$

We are now ready for the proof of Theorem 1.

Proof of Theorem 1  Fix any probability measure $\mathcal{P}$, let $Z_1, Z_2, \ldots$ be independent $\mathcal{P}$-distributed random variables, and for each $t \in \mathbb{N}$ denote $A_t = \psi_t(Z_{1:t})$. We begin with the inequality in (6). The proof proceeds by induction on $m$. If $m \leq 100$, then since $\mathrm{Log}_2\!\left(\frac{2em}{\mathrm{vc}(\mathcal{A})}\right) \leq \mathrm{Log}_2(200e) \leq 2 \cdot 17$ and $\mathrm{Log}_2\!\left(\frac{2}{\delta}\right) \leq 8\ln\!\left(\frac{4}{\delta}\right)$, and since the definition of a consistent monotone rule implies $A_m \cap \{Z_1, \ldots, Z_m\} = \emptyset$, the stated bound follows immediately from Lemma 2 for any $\delta \in (0,1)$. Now, as an inductive hypothesis, fix any integer $m \geq 101$ such that, $\forall m' \in \{1, \ldots, m-1\}$, $\forall \delta \in (0,1)$, with probability at least $1 - \delta$,

$$\mathcal{P}(A_{m'}) \leq \frac{4}{m'}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln\!\left(\frac{4}{\delta}\right)\right).$$

Now fix any $\delta \in (0,1)$ and define

$$N = \left|\left\{Z_{\lfloor m/2 \rfloor + 1}, \ldots, Z_m\right\} \cap A_{\lfloor m/2 \rfloor}\right|,$$

and enumerate the elements of $\{Z_{\lfloor m/2 \rfloor + 1}, \ldots, Z_m\} \cap A_{\lfloor m/2 \rfloor}$ as $\hat{Z}_1, \ldots, \hat{Z}_N$ (retaining their original order).

Note that $N$ is conditionally $\mathrm{Binomial}\!\left(\lceil m/2 \rceil, \mathcal{P}(A_{\lfloor m/2 \rfloor})\right)$-distributed given $Z_{1:\lfloor m/2 \rfloor}$. In particular, with probability one, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) = 0$, then $\mathcal{P}(A_m) = 0$ as well. Otherwise, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) > 0$, then note that $\hat{Z}_1, \ldots, \hat{Z}_N$ are conditionally independent and $\mathcal{P}(\cdot \,|\, A_{\lfloor m/2 \rfloor})$-distributed given $Z_{1:\lfloor m/2 \rfloor}$ and $N$. Thus, since $A_m \cap \{\hat{Z}_1, \ldots, \hat{Z}_N\} = \emptyset$, applying Lemma 2 (under the conditional distribution given $Z_{1:\lfloor m/2 \rfloor}$ and $N$), combined with the law of total probability, we have that on an event of probability at least $1 - \delta/2$, if $N \geq 1$, then

$$\mathcal{P}\!\left(A_m \,\middle|\, A_{\lfloor m/2 \rfloor}\right) \leq \frac{2}{N}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}_2\!\left(\frac{2eN}{\mathrm{vc}(\mathcal{A})}\right) + \mathrm{Log}_2\!\left(\frac{4}{\delta}\right)\right).$$

Additionally, again since $N$ is conditionally $\mathrm{Binomial}\!\left(\lceil m/2 \rceil, \mathcal{P}(A_{\lfloor m/2 \rfloor})\right)$-distributed given $Z_{1:\lfloor m/2 \rfloor}$, applying a Chernoff bound (under the conditional distribution given $Z_{1:\lfloor m/2 \rfloor}$), combined with the law of total probability, we obtain that on an event of probability at least $1 - \delta/4$, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{4}{\delta}\right)$, then

$$N \geq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right)\lceil m/2 \rceil / 2 \geq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right) m / 4.$$

In particular, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{4}{\delta}\right)$, then $\mathcal{P}(A_{\lfloor m/2 \rfloor})\, m/4 \geq 4\ln\!\left(\frac{4}{\delta}\right) \geq 4$, so that if this occurs with $N \geq \mathcal{P}(A_{\lfloor m/2 \rfloor})\, m/4$, then we have $N \geq 1$. Noting that $\mathrm{Log}_2(x) \leq \frac{1}{\ln(2)}\mathrm{Log}(x)$, then by monotonicity of $x \mapsto \frac{2}{x\ln(2)}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}\!\left(\frac{2ex}{\mathrm{vc}(\mathcal{A})}\right) + \ln\!\left(\frac{4}{\delta}\right)\right)$ for $x > 0$, we have that on the above events, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{4}{\delta}\right)$, then

$$\mathcal{P}\!\left(A_m \,\middle|\, A_{\lfloor m/2 \rfloor}\right) \leq \frac{8}{\mathcal{P}(A_{\lfloor m/2 \rfloor})\, m \ln(2)}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}\!\left(\frac{e\,\mathcal{P}(A_{\lfloor m/2 \rfloor})\, m}{2\,\mathrm{vc}(\mathcal{A})}\right) + \ln\!\left(\frac{4}{\delta}\right)\right).$$

The monotonicity property of $\psi$ implies $A_m \subseteq A_{\lfloor m/2 \rfloor}$. Together with monotonicity of probability measures, this implies $\mathcal{P}(A_m) \leq \mathcal{P}(A_{\lfloor m/2 \rfloor})$. It also implies that, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) > 0$, then $\mathcal{P}(A_m) = \mathcal{P}(A_{\lfloor m/2 \rfloor})\,\mathcal{P}(A_m \,|\, A_{\lfloor m/2 \rfloor})$. Thus, on the above events, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{4}{\delta}\right)$, then

$$\mathcal{P}(A_m) \leq \frac{8}{m\ln(2)}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}\!\left(\frac{e\,\mathcal{P}(A_{\lfloor m/2 \rfloor})\, m}{2\,\mathrm{vc}(\mathcal{A})}\right) + \ln\!\left(\frac{4}{\delta}\right)\right).$$

The inductive hypothesis implies that, on an event of probability at least $1 - \delta/4$,

$$\mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right) \leq \frac{4}{\lfloor m/2 \rfloor}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln\!\left(\frac{16}{\delta}\right)\right).$$

Since $m \geq 101$, we have $\lfloor m/2 \rfloor \geq \frac{199\,m}{402}$, so that the above implies

$$\mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right) \leq \frac{4 \cdot 402}{199\, m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln\!\left(\frac{16}{\delta}\right)\right).$$

Thus, on the intersection of the above three events, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{4}{\delta}\right)$, then

$$\mathcal{P}(A_m) \leq \frac{8}{m\ln(2)}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}\!\left(\frac{2 \cdot 402\, e}{199}\left(17 + \frac{4}{\mathrm{vc}(\mathcal{A})}\ln\!\left(\frac{16}{\delta}\right)\right)\right) + \ln\!\left(\frac{4}{\delta}\right)\right).$$

Lemma 20 in Appendix A allows us to simplify the logarithmic term here, revealing that the right hand side is at most

$$\frac{8}{m\ln(2)}\left(\mathrm{vc}(\mathcal{A})\,\mathrm{Log}\!\left(\frac{2 \cdot 402\, e}{199}\left(17 + 4\ln(4) + 4\ln\!\left(\frac{4}{e}\right)\right)\right) + \left(1 + \ln(4e)\right)\ln\!\left(\frac{4}{\delta}\right)\right) \leq \frac{4}{m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln\!\left(\frac{4}{\delta}\right)\right).$$

Since $\mathcal{P}(A_m) \leq \mathcal{P}(A_{\lfloor m/2 \rfloor})$, we have that, on these events, regardless of whether or not $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{4}{\delta}\right)$, we have

$$\mathcal{P}(A_m) \leq \frac{4}{m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln\!\left(\frac{4}{\delta}\right)\right).$$

Noting that, by the union bound, the intersection of the above three events has probability at least $1 - \delta$, this extends the inductive hypothesis to the sample size $m$. By the principle of induction, this completes the proof of the first claim in Theorem 1.

For the bound on the expectation in (7), we note that, letting $\varepsilon_m = \frac{4}{m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln(4)\right)$, by setting the bound in (6) equal to a value $\varepsilon$ and solving for $\delta$, the value of which is in $(0,1)$ for any $\varepsilon > \varepsilon_m$, the result just established can be restated as: $\forall \varepsilon > \varepsilon_m$,

$$\mathbb{P}\!\left(\mathcal{P}(A_m) > \varepsilon\right) \leq 4\exp\left\{\frac{17}{4}\,\mathrm{vc}(\mathcal{A}) - \frac{\varepsilon m}{16}\right\}.$$

Furthermore, for any $\varepsilon \leq \varepsilon_m$, we of course still have $\mathbb{P}\!\left(\mathcal{P}(A_m) > \varepsilon\right) \leq 1$. Therefore, we have that

$$\mathbb{E}\!\left[\mathcal{P}(A_m)\right] = \int_0^{\infty} \mathbb{P}\!\left(\mathcal{P}(A_m) > \varepsilon\right) d\varepsilon \leq \varepsilon_m + \int_{\varepsilon_m}^{\infty} 4\exp\left\{\frac{17}{4}\,\mathrm{vc}(\mathcal{A}) - \frac{\varepsilon m}{16}\right\} d\varepsilon = \varepsilon_m + \frac{4 \cdot 16}{m}\exp\left\{\frac{17}{4}\,\mathrm{vc}(\mathcal{A}) - \frac{\varepsilon_m m}{16}\right\} = \frac{4}{m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln(4)\right) + \frac{16}{m} = \frac{4}{m}\left(17\,\mathrm{vc}(\mathcal{A}) + 4\ln(4e)\right) \leq \frac{68\,\mathrm{vc}(\mathcal{A}) + 39}{m} \leq \frac{68\left(\mathrm{vc}(\mathcal{A}) + 1\right)}{m}.$$

We can also state a variant of Theorem 1 applicable to sample compression schemes, which will in fact be more useful for our purposes below. To state this result, we first introduce the following additional terminology. For any $k \in \mathbb{N}$, we say that a function $\phi : \mathcal{Z}^k \to \mathcal{A}$ is permutation-invariant if every $(z_1, \ldots, z_k) \in \mathcal{Z}^k$ and every bijection $\sigma : \{1, \ldots, k\} \to \{1, \ldots, k\}$ satisfy $\phi(z_{\sigma(1)}, \ldots, z_{\sigma(k)}) = \phi(z_1, \ldots, z_k)$. For any $n \in \mathbb{N} \cup \{0\}$, a consistent monotone sample compression rule of size $n$ is a consistent monotone rule $\psi_t$ with the additional properties that, $\forall t \in \mathbb{N}$, $\psi_t$ is permutation-invariant, and $\forall (z_1, \ldots, z_t) \in \mathcal{Z}^t$, $\exists n_t(z_{[t]}) \in \{0, \ldots, \min\{n, t\}\}$ such that

$$\psi_t(z_1, \ldots, z_t) = \phi_{t, n_t(z_{[t]})}\!\left(z_{i_{t,1}(z_{[t]})}, \ldots, z_{i_{t, n_t(z_{[t]})}(z_{[t]})}\right),$$

where $\phi_{t,k}$ is a permutation-invariant function for each $t$ and each $k \leq \min\{n, t\}$, and $i_{t,1}, \ldots, i_{t, n_t(z_{[t]})}$ are functions of $z_{[t]}$ such that the indices $i_{t,1}(z_{[t]}), \ldots, i_{t, n_t(z_{[t]})}(z_{[t]}) \in \{1, \ldots, t\}$ are all distinct. In words, the element of $\mathcal{A}$ mapped to by $\psi_t$ depends only on the unordered (multi)set $z_{[t]} = \{z_1, \ldots, z_t\}$, and can be specified by an unordered subset of $z_{[t]}$ of size at most $n$. Following the terminology from the literature on sample compression schemes, we refer to the collection of functions $\{(i_{t,1}, \ldots, i_{t,n_t}) : t \in \mathbb{N}\}$ as the compression function of $\psi$, and to the collection of permutation-invariant functions $\{\phi_{t,k}\}$ as the reconstruction function of $\psi$.

This kind of rule $\psi$ is a type of sample compression scheme (see Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995), though certainly not all permutation-invariant compression schemes yield consistent monotone rules. Below, we find that consistent monotone sample compression rules of a quantifiable size arise naturally in the analysis of certain learning algorithms (namely, the Closure algorithm and the CAL active learning algorithm).
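For intuition, the Closure rule for axis-aligned rectangles also illustrates a compression scheme of size 4 of this kind (our own sketch, with hypothetical names): the hypothesis, and hence its error region, is determined by at most four extreme positive examples, via a permutation-invariant reconstruction.

```python
def bounding_box(points):
    """Smallest axis-aligned rectangle containing the given points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), max(xs), min(ys), max(ys))

def compress(data):
    """Compression function: at most 4 positive examples (the extremes)
    suffice to determine the Closure hypothesis."""
    pos = [x for x, y in data if y == +1]
    if not pos:
        return []
    keep = {min(pos, key=lambda p: p[0]), max(pos, key=lambda p: p[0]),
            min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])}
    return [(p, +1) for p in keep]

def reconstruct(subsample):
    """Permutation-invariant reconstruction: rebuild the hypothesis from
    the compressed subsample alone (order of examples is irrelevant)."""
    pos = [x for x, y in subsample if y == +1]
    return bounding_box(pos) if pos else None

data = [((1, 1), +1), ((3, 2), +1), ((2, 1.5), +1), ((0, 0), -1)]
assert reconstruct(compress(data)) == bounding_box(
    [x for x, y in data if y == +1])
```

Since the reconstruction depends only on the unordered compressed subset, and the resulting error region shrinks monotonically with the data, this matches the abstract definition above with $n = 4$.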

With the above terminology in hand, we can now state our second abstract result.

###### Theorem 3

Fix any $n \in \mathbb{N} \cup \{0\}$, let $\mathcal{A}$ be a collection of measurable subsets of $\mathcal{Z}$, and let $\psi_t$ (for $t \in \mathbb{N}$) be any consistent monotone sample compression rule of size $n$. Fix any $m \in \mathbb{N}$, $\delta \in (0,1)$, and any probability measure $\mathcal{P}$ on $\mathcal{Z}$. Letting $Z_1, \ldots, Z_m$ be independent $\mathcal{P}$-distributed random variables, and denoting $A_m = \psi_m(Z_{1:m})$, with probability at least $1 - \delta$,

$$\mathcal{P}(A_m) \leq \frac{1}{m}\left(21\,n + 16\ln\!\left(\frac{3}{\delta}\right)\right). \tag{8}$$

Furthermore,

$$\mathbb{E}\!\left[\mathcal{P}(A_m)\right] \leq \frac{21\,n + 34}{m}. \tag{9}$$

The proof of Theorem 3 relies on the following classic result due to Littlestone and Warmuth (1986) and Floyd and Warmuth (1995) (see also Herbrich, 2002; Wiener, Hanneke, and El-Yaniv, 2015, for a clear and direct proof).

###### Lemma 4

Fix any collection $\mathcal{A}$ of measurable subsets of $\mathcal{Z}$, any $m, n \in \mathbb{N}$ with $n < m$, and any permutation-invariant functions $\phi_k : \mathcal{Z}^k \to \mathcal{A}$, $k \in \{0, 1, \ldots, n\}$. For any probability measure $\mathcal{P}$ on $\mathcal{Z}$, letting $Z_1, \ldots, Z_m$ be independent $\mathcal{P}$-distributed random variables, for any $\delta \in (0,1)$, with probability at least $1 - \delta$, every $k \in \{0, 1, \ldots, n\}$ and every distinct $i_1, \ldots, i_k \in \{1, \ldots, m\}$ with $\phi_k(Z_{i_1}, \ldots, Z_{i_k}) \cap \{Z_1, \ldots, Z_m\} = \emptyset$ satisfy

$$\mathcal{P}\!\left(\phi_k(Z_{i_1}, \ldots, Z_{i_k})\right) \leq \frac{1}{m - n}\left(n\,\mathrm{Log}\!\left(\frac{em}{n}\right) + \mathrm{Log}\!\left(\frac{1}{\delta}\right)\right).$$

With this lemma in hand, we are ready for the proof of Theorem 3.

Proof of Theorem 3  The proof follows analogously to that of Theorem 1, but with several additional complications due to the form of Lemma 4 being somewhat different from that of Lemma 2. Let $\{(i_{t,1}, \ldots, i_{t,n_t}) : t \in \mathbb{N}\}$ and $\{\phi_{t,k}\}$ be the compression function and reconstruction function of $\psi$, respectively. For convenience, also denote $\psi_0(()) = \mathcal{Z}$, and note that this extends the monotonicity property of $\psi$ to $t = 0$. Fix any probability measure $\mathcal{P}$, let $Z_1, Z_2, \ldots$ be independent $\mathcal{P}$-distributed random variables, and for each $t \in \mathbb{N} \cup \{0\}$ denote $A_t = \psi_t(Z_{1:t})$.

We begin with the inequality in (8). The special case of $n = 0$ is directly implied by Lemma 4, so for the remainder of the proof of (8), we suppose $n \geq 1$. The proof proceeds by induction on $m$. Since $\mathcal{P}(A_{m'}) \leq 1$ for all $m'$, and since $\frac{21\,n}{m} \geq 1$ whenever $m \leq 21\,n$, the stated bound is trivially satisfied for all $m \leq 21\,n$. Now, as an inductive hypothesis, fix any integer $m > 21\,n$ such that, $\forall m' \in \{1, \ldots, m-1\}$, $\forall \delta \in (0,1)$, with probability at least $1 - \delta$,

$$\mathcal{P}(A_{m'}) \leq \frac{1}{m'}\left(21\,n + 16\ln\!\left(\frac{3}{\delta}\right)\right).$$

Fix any $\delta \in (0,1)$ and define

$$N = \left|\left\{Z_{\lfloor m/2 \rfloor + 1}, \ldots, Z_m\right\} \cap A_{\lfloor m/2 \rfloor}\right|,$$

and enumerate the elements of $\{Z_{\lfloor m/2 \rfloor + 1}, \ldots, Z_m\} \cap A_{\lfloor m/2 \rfloor}$ as $\hat{Z}_1, \ldots, \hat{Z}_N$. Also enumerate the elements of $\{Z_{\lfloor m/2 \rfloor + 1}, \ldots, Z_m\} \setminus A_{\lfloor m/2 \rfloor}$ as $\hat{Z}'_1, \ldots, \hat{Z}'_{\lceil m/2 \rceil - N}$. Now note that, by the monotonicity property of $\psi$, we have $A_m \subseteq A_{\lfloor m/2 \rfloor}$. Furthermore, by permutation-invariance of $\psi_m$, we have that

$$A_m = \psi_m\!\left(\hat{Z}_1, \ldots, \hat{Z}_N, Z_1, \ldots, Z_{\lfloor m/2 \rfloor}, \hat{Z}'_1, \ldots, \hat{Z}'_{\lceil m/2 \rceil - N}\right).$$

Combined with the monotonicity property of $\psi$, this implies that $A_m \subseteq \psi_N(\hat{Z}_1, \ldots, \hat{Z}_N)$. Altogether, we have that

$$A_m \subseteq A_{\lfloor m/2 \rfloor} \cap \psi_N\!\left(\hat{Z}_1, \ldots, \hat{Z}_N\right). \tag{10}$$

Note that $N$ is conditionally $\mathrm{Binomial}\!\left(\lceil m/2 \rceil, \mathcal{P}(A_{\lfloor m/2 \rfloor})\right)$-distributed given $Z_{1:\lfloor m/2 \rfloor}$. In particular, with probability one, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) = 0$, then $\mathcal{P}(A_m) = 0$ as well. Otherwise, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) > 0$, then note that $\hat{Z}_1, \ldots, \hat{Z}_N$ are conditionally independent and $\mathcal{P}(\cdot \,|\, A_{\lfloor m/2 \rfloor})$-distributed given $Z_{1:\lfloor m/2 \rfloor}$ and $N$. Since $\psi$ is a consistent monotone rule, we have that $\psi_N(\hat{Z}_1, \ldots, \hat{Z}_N) \cap \{\hat{Z}_1, \ldots, \hat{Z}_N\} = \emptyset$. We also have, by the definition of a consistent monotone sample compression rule of size $n$, that $\psi_N(\hat{Z}_1, \ldots, \hat{Z}_N)$ is specified by a permutation-invariant function of at most $n$ distinct elements of $\{\hat{Z}_1, \ldots, \hat{Z}_N\}$. Thus, applying Lemma 4 (under the conditional distribution given $Z_{1:\lfloor m/2 \rfloor}$ and $N$), combined with the law of total probability, we have that on an event of probability at least $1 - \delta/3$, if $N > n$, then

$$\mathcal{P}\!\left(\psi_N(\hat{Z}_1, \ldots, \hat{Z}_N) \,\middle|\, A_{\lfloor m/2 \rfloor}\right) \leq \frac{1}{N - n}\left(n\ln\!\left(\frac{eN}{n}\right) + \ln\!\left(\frac{3}{\delta}\right)\right).$$

Combined with (10) and monotonicity of measures, this implies that on this event, if $N > n$, then

$$\mathcal{P}(A_m) \leq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor} \cap \psi_N(\hat{Z}_1, \ldots, \hat{Z}_N)\right) = \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right)\mathcal{P}\!\left(A_{\lfloor m/2 \rfloor} \cap \psi_N(\hat{Z}_1, \ldots, \hat{Z}_N) \,\middle|\, A_{\lfloor m/2 \rfloor}\right) \leq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right)\frac{1}{N - n}\left(n\ln\!\left(\frac{eN}{n}\right) + \ln\!\left(\frac{3}{\delta}\right)\right).$$

Additionally, again since $N$ is conditionally $\mathrm{Binomial}\!\left(\lceil m/2 \rceil, \mathcal{P}(A_{\lfloor m/2 \rfloor})\right)$-distributed given $Z_{1:\lfloor m/2 \rfloor}$, applying a Chernoff bound (under the conditional distribution given $Z_{1:\lfloor m/2 \rfloor}$), combined with the law of total probability, we obtain that on an event of probability at least $1 - \delta/3$, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{16}{m}\ln\!\left(\frac{3}{\delta}\right)$, then

$$N \geq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right)\lceil m/2 \rceil / 2 \geq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right) m / 4.$$

Also note that if $N > n$, then (10) and monotonicity of probability measures imply $\mathcal{P}(A_m) \leq \mathcal{P}(A_{\lfloor m/2 \rfloor})$ as well. In particular, if the above inequality for $N$ occurs with $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{4}{m}\left(5n + 4\ln\!\left(\frac{3}{\delta}\right)\right)$, then we have $N \geq 5n + 4\ln\!\left(\frac{3}{\delta}\right) > n$, so that $N - n \geq \frac{4}{5}N$. Thus, by monotonicity of $x \mapsto \frac{1}{x - n}\left(n\ln\!\left(\frac{ex}{n}\right) + \ln\!\left(\frac{3}{\delta}\right)\right)$ for $x > n$, we have that on the above events, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{4}{m}\left(5n + 4\ln\!\left(\frac{3}{\delta}\right)\right)$, then

$$\mathcal{P}(A_m) \leq \mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right)\frac{1}{\mathcal{P}(A_{\lfloor m/2 \rfloor})\frac{m}{4} - n}\left(n\ln\!\left(\frac{e\,\mathcal{P}(A_{\lfloor m/2 \rfloor})\,m}{4n}\right) + \ln\!\left(\frac{3}{\delta}\right)\right) \leq \frac{5}{m}\left(n\ln\!\left(\frac{e\,\mathcal{P}(A_{\lfloor m/2 \rfloor})\,m}{4n}\right) + \ln\!\left(\frac{3}{\delta}\right)\right).$$

The inductive hypothesis implies that, on an event of probability at least $1 - \delta/3$,

$$\mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right) \leq \frac{1}{\lfloor m/2 \rfloor}\left(21\,n + 16\ln\!\left(\frac{9}{\delta}\right)\right).$$

Since $m > 21\,n \geq 21$, we have $\lfloor m/2 \rfloor \geq \frac{37\,m}{78}$, so that the above implies

$$\mathcal{P}\!\left(A_{\lfloor m/2 \rfloor}\right) \leq \frac{78}{37\, m}\left(21\,n + 16\ln\!\left(\frac{9}{\delta}\right)\right).$$

Thus, on the intersection of the above three events, if $\mathcal{P}(A_{\lfloor m/2 \rfloor}) \geq \frac{4}{m}\left(5n + 4\ln\!\left(\frac{3}{\delta}\right)\right)$, then

 P(Am) <5m(nLog(78e4⋅37(21+16nln(9δ)))+ln(3δ)) ≤5m(nLog(78⋅2037⋅11(21⋅11e16⋅5+11e5ln(3)+11