# Plug-in Approach to Active Learning

Stanislav Minsker (sminsker@math.gatech.edu), Georgia Institute of Technology

###### Abstract

We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by the proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight.

Partially supported by ARC Fellowship, NSF Grants DMS-0906880 and CCF-0808863. Mailing address: 686 Cherry Street, School of Mathematics, Atlanta, GA 30332-0160.

Keywords: active learning, selective sampling, model selection, classification, confidence bands

## 1 Introduction

Let $(S,\mathcal{B})$ be a measurable space and let $(X,Y)\in S\times\{-1,1\}$ be a random couple with unknown distribution $P$. The marginal distribution of the design variable $X$ will be denoted by $\Pi$. Let $\eta(x):=\mathbb{E}(Y|X=x)$ be the regression function. The goal of binary classification is to predict the label $Y$ based on the observation $X$. Prediction is based on a classifier - a measurable function $f:S\mapsto\{-1,1\}$. The quality of a classifier is measured in terms of its generalization error, $R(f)=\Pr(Y\neq f(X))$. In practice, the distribution $P$ remains unknown but the learning algorithm has access to the training data - the i.i.d. sample $(X_i,Y_i),\ i=1,\dots,n$ from $P$. It often happens that the cost of obtaining the training data is associated with labeling the observations $X_i$, while the pool of observations itself is almost unlimited. This suggests measuring the performance of a learning algorithm in terms of its label complexity, the number of labels $Y_i$ required to obtain a classifier with the desired accuracy. Active learning theory is mainly devoted to the design and analysis of algorithms that can take advantage of this modified framework. Most of these procedures can be characterized by the following property: at each step $k$, the observation $X_k$ is sampled from a distribution $\hat\Pi_k$ that depends on the previously obtained $(X_i,Y_i),\ i\le k-1$ (while passive learners obtain all available training data at the same time). $\hat\Pi_k$ is designed to be supported on a set where classification is difficult and requires more labeled data to be collected. The situation when active learners outperform passive algorithms might occur when the so-called Tsybakov's low noise assumption is satisfied: there exist constants $B,\gamma>0$ such that

$$\forall\, t>0,\quad \Pi\left(x:\ |\eta(x)|\le t\right)\le Bt^{\gamma}. \tag{1.1}$$

This assumption provides a convenient way to characterize the noise level of the problem and will play a crucial role in our investigation.
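To make the low noise condition (1.1) concrete, here is a minimal Monte Carlo sketch. It assumes a uniform marginal on $[0,1]$ and the hypothetical regression function $\eta(x)=2x-1$ (both chosen only for illustration, not taken from the paper); in that case $\Pi(|\eta|\le t)=t$ for $t\le 1$, so (1.1) holds with $B=1$, $\gamma=1$.

```python
import numpy as np

def low_noise_mass(eta_values, t):
    """Empirical estimate of Pi(|eta(X)| <= t) from values eta(X_i)."""
    return float(np.mean(np.abs(eta_values) <= t))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200_000)  # assumed marginal: uniform on [0, 1]
eta = 2.0 * x - 1.0                       # hypothetical regression function

for t in (0.05, 0.1, 0.2, 0.4):
    # each printed mass should be close to B * t**gamma with B = gamma = 1
    print(t, low_noise_mass(eta, t))
```

A larger $\gamma$ corresponds to less mass near the decision boundary $\{\eta=0\}$, which is the regime where concentrating labels near the boundary is expected to help most.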
The topic of active learning is widely present in the literature; see Balcan et al. [3], Hanneke [7], Castro and Nowak [4] for a review. It was discovered that in some cases the generalization error of the resulting classifier can converge to zero exponentially fast with respect to its label complexity (while the best rate for passive learning is usually polynomial with respect to the cardinality of the training data set). However, available algorithms that adapt to the unknown parameters of the problem ($\gamma$ in Tsybakov's low noise assumption, regularity of the decision boundary) involve empirical risk minimization with binary loss, along with other computationally hard problems; see Balcan et al. [2], Hanneke [7]. On the other hand, the algorithms that can be effectively implemented, as in Castro and Nowak [4], are not adaptive.
The majority of the previous work in the field was done under standard complexity assumptions on the set of possible classifiers (such as polynomial growth of the covering numbers). Castro and Nowak [4] derived their results under regularity conditions on the decision boundary and a noise assumption which is slightly more restrictive than (1.1). Essentially, they proved that if the decision boundary is a graph of the Hölder smooth function (see section 2 for definitions) and the noise assumption is satisfied with $\gamma>0$, then the minimax lower bound for the expected excess risk of the active classifier is of order and the upper bound is where $N$ is the label budget. However, the construction of the classifier that achieves the upper bound assumes $\beta$ and $\gamma$ to be known.
In this paper, we consider the problem of active learning under classical nonparametric assumptions on the regression function - namely, we assume that it belongs to a certain Hölder class and satisfies the low noise condition (1.1) with some positive $\gamma$. In this case, the work of Audibert and Tsybakov [1] showed that plug-in classifiers can attain optimal rates in the passive learning framework, namely, that the expected excess risk of a classifier $\hat g=\mathrm{sign}\,\hat\eta$ is bounded above by $C\cdot N^{-\frac{\beta(1+\gamma)}{2\beta+d}}$ (which is the optimal rate), where $\hat\eta$ is the local polynomial estimator of the regression function and $N$ is the size of the training data set. We were able to partially extend this claim to the case of active learning: first, we obtain minimax lower bounds for the excess risk of an active classifier in terms of its label complexity. Second, we propose a new algorithm that is based on plug-in classifiers, attains almost optimal rates over a broad class of distributions and possesses adaptivity with respect to $\beta,\gamma$ (within a certain range of these parameters).
The paper is organized as follows: the next section introduces the remaining notation and specifies the main assumptions made throughout the paper. This is followed by a qualitative description of our learning algorithm. The second part of the work contains the statements and proofs of our main results - minimax upper and lower bounds for the excess risk.

## 2 Preliminaries

Our active learning framework is governed by the following rules:

1. Observations are sampled sequentially: $X_k$ is sampled from the modified distribution $\hat\Pi_k$ that depends on $(X_1,Y_1),\dots,(X_{k-1},Y_{k-1})$.

2. $Y_k$ is sampled from the conditional distribution of $Y$ given $X=X_k$. Labels are conditionally independent given the feature vectors $X_i,\ i\le n$.

Usually, the distribution $\hat\Pi_k$ is supported on a set where classification is difficult.
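Given the probability measure $Q$ on $S\times\{-1,1\}$, we denote the integral with respect to this measure by $Qg:=\int g\,dQ$. Let $\mathcal F$ be a class of bounded, measurable functions. The risk and the excess risk of $f\in\mathcal F$ with respect to the measure $Q$ are defined by

$$R_Q(f):=Q\,I_{y\neq\mathrm{sign}\,f(x)},\qquad \mathcal E_Q(f):=R_Q(f)-\inf_{g\in\mathcal F}R_Q(g),$$

where $I_{\mathcal A}$ is the indicator of the event $\mathcal A$. We will omit the subindex $Q$ when the underlying measure is clear from the context. Recall that we denoted the distribution of $(X,Y)$ by $P$. The minimal possible risk with respect to $P$ is

$$R^*=\inf_{g:S\mapsto[-1,1]}\Pr\left(Y\neq\mathrm{sign}\,g(X)\right),$$

where the infimum is taken over all measurable functions. It is well known that it is attained for any $g$ such that $\mathrm{sign}\,g(x)=\mathrm{sign}\,\eta(x)$ $\Pi$ - a.s. Given $g\in\mathcal F$, $A\in\mathcal B$ and $\delta>0$, define

$$\mathcal F_{\infty,A}(g;\delta):=\left\{f\in\mathcal F:\ \|f-g\|_{\infty,A}\le\delta\right\},$$

where $\|f-g\|_{\infty,A}:=\sup_{x\in A}|f(x)-g(x)|$. For $A\in\mathcal B$, define the function class

$$\mathcal F|_A:=\left\{f|_A,\ f\in\mathcal F\right\}$$

where $f|_A$ is the restriction of $f$ to $A$. From now on, we restrict our attention to the case $S=[0,1]^d$. Let $K,\beta>0$.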

###### Definition 2.1.

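We say that $g:\mathbb R^d\mapsto\mathbb R$ belongs to $\Sigma(\beta,K,[0,1]^d)$, the $(\beta,K,[0,1]^d)$ - Hölder class of functions, if $g$ is $\lfloor\beta\rfloor$ times continuously differentiable and for all $x,x_1\in[0,1]^d$ satisfies

$$|g(x_1)-T_x(x_1)|\le K\|x-x_1\|_\infty^{\beta},$$

where $T_x$ is the Taylor polynomial of degree $\lfloor\beta\rfloor$ of $g$ at the point $x$.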

###### Definition 2.2.

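$\mathcal P(\beta,\gamma)$ is the class of probability distributions on $[0,1]^d\times\{-1,+1\}$ with the following properties:

1. $\forall\,t>0,\ \Pi(x:\ |\eta(x)|\le t)\le Bt^{\gamma}$;

2. $\eta(x)\in\Sigma(\beta,K,[0,1]^d)$.

We do not mention the dependence of $\mathcal P(\beta,\gamma)$ on the fixed constants $B,K$ explicitly, but this should not cause any uncertainty.
Finally, let us define $\mathcal P_U(\beta,\gamma)$ and $\mathcal P_U^*(\beta,\gamma)$, the subclasses of $\mathcal P(\beta,\gamma)$, by imposing two additional assumptions. Along with the formal descriptions of these assumptions, we shall try to provide some motivation behind them. The first deals with the marginal $\Pi$. For an integer $M\ge1$, let

$$G_M:=\left\{\left(\frac{k_1}{M},\dots,\frac{k_d}{M}\right),\ k_i=1\dots M,\ i=1\dots d\right\}$$

be the regular grid on the unit cube $[0,1]^d$ with mesh size $M^{-1}$. It naturally defines a partition into a set of $M^d$ open cubes $R_i,\ i=1\dots M^d$ with edges of length $M^{-1}$ and vertices in $G_M$. Below, we consider the nested sequence of grids $\{G_{2^m},\ m\ge1\}$ and corresponding dyadic partitions of the unit cube.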

###### Definition 2.3.

We will say that $\Pi$ is $(u_1,u_2)$-regular with respect to $\{G_{2^m}\}$ if for any $m\ge1$, any element $R_i$ of the partition induced by $G_{2^m}$ such that $R_i\cap\mathrm{supp}(\Pi)\neq\emptyset$, we have

$$u_1\cdot2^{-dm}\le\Pi(R_i)\le u_2\cdot2^{-dm}, \tag{2.1}$$

where $0<u_1\le u_2<\infty$.

###### Assumption 1.

$\Pi$ is $(u_1,u_2)$ - regular.

In particular, $(u_1,u_2)$-regularity holds for the distribution with a density $p$ on $[0,1]^d$ such that $0<u_1\le p(x)\le u_2<\infty$.
Let us mention that our definition of regularity is of rather technical nature; for most of the paper, the reader might think of $\Pi$ as being uniform on $[0,1]^d$ (however, we need a slightly more complicated marginal to construct the minimax lower bounds for the excess risk). It is known that estimation of the regression function in sup-norm is sensitive to the geometry of the design distribution, mainly because the quality of estimation depends on the local amount of data at every point; conditions similar to our assumption 1 were used in the previous works where this problem appeared, e.g., the strong density assumption in Audibert and Tsybakov [1] and assumption D in Gaïffas [5].
Another useful characteristic of a $(u_1,u_2)$ - regular distribution $\Pi$ is that this property is stable with respect to restrictions of $\Pi$ to certain subsets of its support. This fact fits the active learning framework particularly well.

###### Definition 2.4.

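We say that $P$ belongs to $\mathcal P_U(\beta,\gamma)$ if $P\in\mathcal P(\beta,\gamma)$ and assumption 1 is satisfied for some $u_1,u_2$.

The second assumption is crucial in the derivation of the upper bounds. The space of piecewise-constant functions which is used to construct the estimators of $\eta(x)$ is defined via

$$\mathcal F_m=\left\{\sum_{i=1}^{2^{dm}}\lambda_iI_{R_i}(\cdot):\ |\lambda_i|\le1,\ i=1\dots2^{dm}\right\},$$

where $\{R_i\}_{i=1}^{2^{dm}}$ forms the dyadic partition of the unit cube. Note that $\mathcal F_m$ can be viewed as a $\|\cdot\|_\infty$-unit ball in the linear span of the first $2^{dm}$ Haar basis functions in $[0,1]^d$. Moreover, $\{\mathcal F_m,\ m\ge1\}$ is a nested family, which is a desirable property for the model selection procedures. By $\bar\eta_m(x)$ we denote the $L_2(\Pi)$ - projection of the regression function onto $\mathcal F_m$.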
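As an illustration of the projection onto piecewise-constant functions, the sketch below works in dimension $d=1$ and assumes a uniform marginal, in which case the $L_2$ projection onto the span of dyadic indicators reduces to cell averages. The function name `dyadic_projection` and the test function $\eta(x)=2x-1$ are hypothetical choices made for this example.

```python
import numpy as np

def dyadic_projection(eta, m, grid_per_cell=64):
    """Cell averages of eta over the 2**m dyadic intervals of [0, 1].

    Under a uniform marginal these averages are the coefficients lambda_i
    of the L2 projection onto piecewise-constant functions (d = 1).
    `eta` must accept NumPy arrays."""
    n_cells = 2 ** m
    # midpoints of a fine sub-grid inside every cell (midpoint rule)
    offsets = (np.arange(grid_per_cell) + 0.5) / grid_per_cell
    left = np.arange(n_cells) / n_cells
    pts = left[:, None] + offsets[None, :] / n_cells
    return eta(pts).mean(axis=1)

eta = lambda x: 2.0 * x - 1.0          # hypothetical regression function
coef = dyadic_projection(eta, m=2)     # four cells of length 1/4
print(coef)
```

Since $|\eta|\le1$, the averages automatically satisfy the constraint $|\lambda_i|\le1$ from the definition of the class.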
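We will say that the set $A\subseteq[0,1]^d$ approximates the decision boundary $\{x:\ \eta(x)=0\}$ if there exists $t>0$ such that

$$\{x:\ |\eta(x)|\le t\}_\Pi\subseteq A_\Pi\subseteq\{x:\ |\eta(x)|\le3t\}_\Pi, \tag{2.2}$$

where for any set $A$ we define $A_\Pi:=A\cap\mathrm{supp}(\Pi)$. The most important example we have in mind is the following: let $\hat\eta$ be some estimator of $\eta$ with $\|\hat\eta-\eta\|_{\infty,\mathrm{supp}(\Pi)}\le t$ and define the $2t$ - band around $\hat\eta$ by

$$\hat{\mathcal F}=\left\{f:\ \hat\eta(x)-2t\le f(x)\le\hat\eta(x)+2t\ \ \forall x\in[0,1]^d\right\}$$

Take $A=\{x:\ \exists f_1,f_2\in\hat{\mathcal F},\ \mathrm{sign}\,f_1(x)\neq\mathrm{sign}\,f_2(x)\}$, then it is easy to see that $A$ satisfies (2.2): on $A$ we have $|\hat\eta(x)|\le2t$, hence $|\eta(x)|\le3t$, while $|\eta(x)|\le t$ implies $|\hat\eta(x)|\le2t$. Modified design distributions used by our algorithm are supported on the sets with similar structure.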
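The sandwich property (2.2) for a band-induced set can be checked numerically. The following sketch assumes $d=1$, a hypothetical regression function $\eta(x)=2x-1$, and a made-up estimator whose sup-norm error is below $t$; the set where a band of half-width $2t$ around the estimator contains both signs is taken to be $\{|\hat\eta|\le2t\}$.

```python
import numpy as np

def active_set_mask(eta_hat, t):
    """Points where the band [eta_hat - 2t, eta_hat + 2t] contains zero,
    i.e. where |eta_hat(x)| <= 2t."""
    return np.abs(eta_hat) <= 2.0 * t

x = np.linspace(0.0, 1.0, 1001)
eta = 2.0 * x - 1.0                          # hypothetical true regression function
t = 0.1
eta_hat = eta + 0.9 * t * np.sin(40.0 * x)   # hypothetical estimator, sup-error < t

A = active_set_mask(eta_hat, t)
inner = np.abs(eta) <= t                     # {|eta| <= t}
outer = np.abs(eta) <= 3.0 * t               # {|eta| <= 3t}

print(np.all(A[inner]), np.all(outer[A]))    # inner is inside A, A is inside outer
```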
Let be the sigma-algebra generated by and .

###### Assumption 2.

There exists such that for all , satisfying (2.2) and such that the following holds true:

$$\int_{[0,1]^d}(\eta-\bar\eta_m)^2\,\Pi(dx\,|\,x\in A_\Pi)\ \ge\ B_2\,\|\eta-\bar\eta_m\|^2_{\infty,A_\Pi}$$

The appearance of assumption 2 is motivated by the structure of our learning algorithm - namely, it is based on adaptive confidence bands for the regression function. Nonparametric confidence bands are a big topic in the statistical literature, and a review of this subject is not our goal. We just mention that it is impossible to construct adaptive confidence bands of optimal size over the whole . Low [11], Hoffmann and Nickl [8] discuss the subject in detail. However, it is possible to construct adaptive $L_2$ - confidence balls (see an example following Theorem 6.1 in Koltchinskii [10]). For functions satisfying assumption 2, this fact allows one to obtain confidence bands of the desired size. In particular,

1. functions that are differentiable, with gradient being bounded away from 0 in the vicinity of decision boundary;

2. Lipschitz continuous functions that are convex in the vicinity of decision boundary

satisfy assumption 2. For precise statements, see Propositions A.1, A.2 in Appendix A. A different approach to adaptive confidence bands in the case of one-dimensional density estimation is presented in Giné and Nickl [6]. Finally, we define $\mathcal P_U^*(\beta,\gamma)$:

###### Definition 2.5.

We say that $P$ belongs to $\mathcal P_U^*(\beta,\gamma)$ if $P\in\mathcal P_U(\beta,\gamma)$ and assumption 2 is satisfied for some $B_2>0$.

### 2.1 Learning algorithm

Now we give a brief description of the algorithm, since several definitions appear naturally in this context. First, let us emphasize that the marginal distribution $\Pi$ is assumed to be known to the learner. This is not a restriction, since we are not limited in the use of unlabeled data and $\Pi$ can be estimated to any desired accuracy. Our construction is based on so-called plug-in classifiers of the form $\hat f(\cdot)=\mathrm{sign}\,\hat\eta(\cdot)$, where $\hat\eta$ is a piecewise-constant estimator of the regression function. As we have already mentioned above, it was shown in Audibert and Tsybakov [1] that in the passive learning framework plug-in classifiers attain the optimal rate for the excess risk of order $N^{-\frac{\beta(1+\gamma)}{2\beta+d}}$, with $\hat\eta$ being the local polynomial estimator.

Our active learning algorithm iteratively improves the classifier by constructing shrinking confidence bands for the regression function. On every step $k$, the piecewise-constant estimator $\hat\eta_k$ is obtained via the model selection procedure which allows adaptation to the unknown smoothness (for Hölder exponent $\beta\le1$). The estimator $\hat\eta_k$ is further used to construct a confidence band $\hat{\mathcal F}_k$ for $\eta(x)$. The active set associated with $\hat{\mathcal F}_k$ is defined as

$$\hat A_k=A(\hat{\mathcal F}_k):=\left\{x\in\mathrm{supp}(\Pi):\ \exists f_1,f_2\in\hat{\mathcal F}_k,\ \mathrm{sign}\,f_1(x)\neq\mathrm{sign}\,f_2(x)\right\}$$

Clearly, this is the set where the confidence band crosses zero level and where classification is potentially difficult. $\hat A_k$ serves as a support of the modified distribution: on the next step, a label is requested only for observations $X\in\hat A_k$, forcing the labeled data to concentrate in the domain where higher precision is needed. This allows one to obtain a tighter confidence band for the regression function restricted to the active set. Since $\hat A_k$ approaches the decision boundary, its size is controlled by the low noise assumption. The algorithm does not require a priori knowledge of the noise and regularity parameters, being adaptive for $\gamma>0,\ \beta\le1$.
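The loop described above can be mimicked by a deliberately simplified sketch. It is not the algorithm analyzed in the paper: it assumes $d=1$, a uniform marginal, a fixed resolution level instead of model selection, and a heuristic Hoeffding-type band width. All names (`active_learning_sketch`, `labels_per_round`, the confidence level 0.05) are hypothetical.

```python
import numpy as np

def active_learning_sketch(eta, m=5, rounds=4, labels_per_round=2000, seed=0):
    """Toy band-shrinking loop on 2**m dyadic cells of [0, 1]."""
    rng = np.random.default_rng(seed)
    n_cells = 2 ** m
    active = np.ones(n_cells, dtype=bool)   # start with every cell active
    estimate = np.zeros(n_cells)            # cell-wise estimate of eta
    for _ in range(rounds):
        cells = np.flatnonzero(active)
        if cells.size == 0:
            break
        # design points are drawn uniformly from the union of active cells
        idx = rng.choice(cells, size=labels_per_round)
        x = (idx + rng.uniform(size=labels_per_round)) / n_cells
        # labels in {-1, +1} with E[Y | X = x] = eta(x)
        y = np.where(rng.uniform(size=labels_per_round) < (1.0 + eta(x)) / 2.0, 1.0, -1.0)
        for c in cells:
            in_cell = idx == c
            n_c = int(in_cell.sum())
            if n_c == 0:
                continue
            estimate[c] = y[in_cell].mean()
            # heuristic half-width of the band (an assumption of this sketch)
            width = np.sqrt(2.0 * np.log(2.0 * n_cells * rounds / 0.05) / n_c)
            # a cell stays active only while its band still contains zero
            active[c] = abs(estimate[c]) <= width
    return np.sign(estimate), active

labels, active = active_learning_sketch(lambda x: 2.0 * x - 1.0)
print(int(active.sum()), "of", active.size, "cells remain active")
```

In this toy run the cells far from the point $x=1/2$ are resolved early, and later rounds spend the whole label budget on the remaining cells near the decision boundary, which is the qualitative behaviour the description above aims for.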

Further details are given in section 3.2.

### 2.2 Comparison inequalities

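Before proceeding to the main results, let us recall the well-known connections between the binary risk and the $\|\cdot\|_\infty$, $\|\cdot\|_{L_2(\Pi)}$ - norm risks:

###### Proposition 2.1.

Under the low noise assumption,

$$R_P(f)-R^*\le D_1\left\|(f-\eta)I\{\mathrm{sign}\,f\neq\mathrm{sign}\,\eta\}\right\|_\infty^{1+\gamma}; \tag{2.3}$$

$$R_P(f)-R^*\le D_2\left\|(f-\eta)I\{\mathrm{sign}\,f\neq\mathrm{sign}\,\eta\}\right\|_{L_2(\Pi)}^{\frac{2(1+\gamma)}{2+\gamma}}; \tag{2.4}$$

$$R_P(f)-R^*\ge D_3\,\Pi\left(\mathrm{sign}\,f\neq\mathrm{sign}\,\eta\right)^{\frac{1+\gamma}{\gamma}}. \tag{2.5}$$

###### Proof.

For (2.3) and (2.4), see Audibert and Tsybakov [1], lemmas 5.1, 5.2 respectively, and for (2.5), see Koltchinskii [10], lemma 5.2. ∎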

## 3 Main results

The question we address below is: what are the best possible rates that can be achieved by active algorithms in our framework and how these rates can be attained.

### 3.1 Minimax lower bounds for the excess risk

The goal of this section is to prove that for $P\in\mathcal P_U(\beta,\gamma)$ no active learner can output a classifier with expected excess risk converging to zero faster than $N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$. Our result builds upon the minimax bounds of Audibert and Tsybakov [1], Castro and Nowak [4].
Remark. The theorem below is proved for a smaller class $\mathcal P_U^*(\beta,\gamma)$, which implies the result for $\mathcal P_U(\beta,\gamma)$.
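To compare the lower-bound exponent of this section with the passive rate, the sketch below evaluates both. The passive exponent $\beta(1+\gamma)/(2\beta+d)$ is the plug-in rate of Audibert and Tsybakov [1] as recalled here; the function names are hypothetical.

```python
def passive_exponent(beta, gamma, d):
    """Exponent a in the passive plug-in rate N**(-a)."""
    return beta * (1 + gamma) / (2 * beta + d)

def active_exponent(beta, gamma, d):
    """Exponent a in the active lower bound N**(-a) of Theorem 3.1."""
    return beta * (1 + gamma) / (2 * beta + d - beta * gamma)

for beta, gamma, d in [(1.0, 1.0, 2), (0.5, 1.0, 1), (1.0, 0.5, 3)]:
    print(beta, gamma, d, passive_exponent(beta, gamma, d), active_exponent(beta, gamma, d))
```

For every $\gamma>0$ the active exponent is strictly larger, and the gain grows with the product $\beta\gamma$, in line with the role of the low noise assumption.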

###### Theorem 3.1.

Let $\beta,\gamma,d$ be such that $\beta\gamma\le d$. Then there exists $C>0$ such that for all $N$ large enough and for any active classifier $\hat f_N$ we have

$$\sup_{P\in\mathcal P_U^*(\beta,\gamma)}\mathbb E R_P(\hat f_N)-R^*\ \ge\ CN^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$$

###### Proof.

We proceed by constructing the appropriate family of classifiers $f_\sigma(x)=\mathrm{sign}\,\eta_\sigma(x)$, in a way similar to Theorem 3.5 in Audibert and Tsybakov [1], and then apply Theorem 2.5 from Tsybakov [13]. We present it below for the reader's convenience.

###### Theorem 3.2.

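Let $\Sigma$ be a class of models, $d:\Sigma\times\Sigma\mapsto\mathbb R$ - the pseudometric and $\{P_f,\ f\in\Sigma\}$ - a collection of probability measures associated with $\Sigma$. Assume there exists a subset $\{f_0,\dots,f_M\}$ of $\Sigma$ such that

1. $d(f_i,f_j)\ge2s>0$ for every $0\le i<j\le M$;

2. $P_{f_j}\ll P_{f_0}$ for every $1\le j\le M$;

3. $\frac1M\sum_{j=1}^MKL(P_{f_j}\|P_{f_0})\le\alpha\log M$, $0<\alpha<\frac18$.

Then

$$\inf_{\hat f}\sup_{f\in\Sigma}P_f\left(d(\hat f,f)\ge s\right)\ \ge\ \frac{\sqrt M}{1+\sqrt M}\left(1-2\alpha-\sqrt{\frac{2\alpha}{\log M}}\right)$$

where the infimum is taken over all possible estimators of $f$ based on a sample from $P_f$ and $KL(\cdot\|\cdot)$ is the Kullback-Leibler divergence.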

Going back to the proof, let and

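$$G_q:=\left\{\left(\frac{2k_1-1}{2q},\dots,\frac{2k_d-1}{2q}\right),\ k_i=1\dots q,\ i=1\dots d\right\}$$

be the grid on $[0,1]^d$. For $x\in[0,1]^d$, let

$$n_q(x)=\mathrm{argmin}\left\{\|x-x_k\|_2:\ x_k\in G_q\right\}$$

If $n_q(x)$ is not unique, we choose the one with the smallest $\|\cdot\|_2$ norm. The unit cube is partitioned with respect to $G_q$ as follows: $x_1,x_2$ belong to the same subset if $n_q(x_1)=n_q(x_2)$. Let $\succ$ be some order on the elements of $G_q$ such that $x\succ y$ implies $\|x\|_2\ge\|y\|_2$. Assume that the elements of the partition are enumerated with respect to the order of their centers induced by $\succ$: $[0,1]^d=\bigcup_{i=1}^{q^d}R_i$. Fix $1\le m\le q^d$ and let

$$S:=\bigcup_{i=1}^mR_i$$

Note that the partition is ordered in such a way that there always exists $k$ with

$$B_+\left(0,\frac kq\right)\subseteq S\subseteq B_+\left(0,\frac{k+3\sqrt d}{q}\right), \tag{3.1}$$

where $B_+(0,R):=\{x\in\mathbb R^d_+:\ \|x\|_2\le R\}$. In other words, (3.1) means that the difference between the radii of the inscribed and circumscribed spherical sectors of $S$ is of order $\sqrt d\,q^{-1}$.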
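Let $v>r_1>r_2$ be three integers satisfying

$$2^{-v}<2^{-r_1}<2^{-r_1}\sqrt d<2^{-r_2}\sqrt d<2^{-1} \tag{3.2}$$

Define $u(x)$ by

$$u(x):=\frac{\int_x^\infty U(t)\,dt}{\int_{2^{-v}}^{1/2}U(t)\,dt} \tag{3.3}$$

where

$$U(t):=\begin{cases}\exp\left(-\dfrac{1}{(1/2-t)(t-2^{-v})}\right), & t\in\left(2^{-v},\frac12\right)\\[2mm] 0 & \text{else.}\end{cases}$$

Note that $u(x)$ is an infinitely differentiable function such that $u(x)=1$ for $x\in[0,2^{-v}]$ and $u(x)=0$ for $x\ge\frac12$. Finally, for $x\in\mathbb R^d$ let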
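The smooth cut-off defined in (3.3) can be evaluated numerically. The sketch below uses the trapezoidal rule on a fixed grid; the grid size and the value $v=3$ are arbitrary choices made for this example.

```python
import numpy as np

def U(t, v):
    """Bump supported on (2**-v, 1/2), as in the display above (array input)."""
    t = np.asarray(t, dtype=float)
    lo, hi = 2.0 ** (-v), 0.5
    out = np.zeros_like(t)
    inside = (t > lo) & (t < hi)
    out[inside] = np.exp(-1.0 / ((hi - t[inside]) * (t[inside] - lo)))
    return out

def u(x, v, n=20001):
    """u(x) = int_x^inf U / int_{2^-v}^{1/2} U, computed by the trapezoidal rule."""
    lo, hi = 2.0 ** (-v), 0.5
    grid = np.linspace(lo, hi, n)
    vals = U(grid, v)
    # cumulative integral of U from lo up to every grid point
    steps = (vals[1:] + vals[:-1]) / 2.0 * np.diff(grid)
    cum = np.concatenate(([0.0], np.cumsum(steps)))
    total = cum[-1]
    x_clipped = np.clip(x, lo, hi)   # U vanishes outside (lo, hi)
    return (total - np.interp(x_clipped, grid, cum)) / total

print(u(0.0, 3), u(0.3, 3), u(0.5, 3))
```

The function equals 1 to the left of $2^{-v}$, equals 0 to the right of $1/2$, and decreases monotonically in between, which is the behaviour used when gluing the "bumps" together smoothly.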

$$\Phi(x):=C\,u(\|x\|_2)$$

where is chosen such that .
Let and

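$$A_0:=\left\{\bigcup_iR_i:\ R_i\cap B_+\left(0,\,r_S+q^{-\frac{\beta\gamma}{d}}\right)=\emptyset\right\}$$

Note that

$$r_S\le c\,\frac{m^{1/d}}{q}, \tag{3.4}$$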

since .
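Define $\mathcal H=\{P_\sigma:\ \sigma\in\{-1,1\}^m\}$ to be the hypercube of probability distributions on $[0,1]^d\times\{-1,+1\}$. The marginal distribution $\Pi$ of $X$ is independent of $\sigma$: define its density $p$ by

$$p(x)=\begin{cases}\dfrac{2^{d(r_1-1)}}{2^{d(r_1-r_2)}-1}, & x\in B_\infty\left(z,\frac{2^{-r_2}}{q}\right)\setminus B_\infty\left(z,\frac{2^{-r_1}}{q}\right),\ z\in G_q\cap S,\\[2mm] c_0, & x\in A_0,\\ 0 & \text{else.}\end{cases}$$

where , (note that ) and $r_1,r_2$ are defined in (3.2). In particular, $\Pi$ satisfies assumption 1 since it is supported on the union of dyadic cubes and has a density that is bounded from above and below on its support.

Let

$$\Psi(x):=u\left(1/2-q^{\frac{\beta\gamma}{d}}\,\mathrm{dist}_2\left(x,B_+(0,r_S)\right)\right),$$

where $u(\cdot)$ is defined in (3.3) and $\mathrm{dist}_2(x,A):=\inf\{\|x-y\|_2,\ y\in A\}$.
Finally, the regression function $\eta_\sigma(x)$ is defined via

$$\eta_\sigma(x):=\begin{cases}\sigma_i\,q^{-\beta}\,\Phi\left(q[x-n_q(x)]\right), & x\in R_i,\ 1\le i\le m\\[2mm] \dfrac{1}{C_{L,\beta}\sqrt d}\,\mathrm{dist}_2\left(x,B_+(0,r_S)\right)^{\frac d\gamma}\cdot\Psi(x), & x\in[0,1]^d\setminus S.\end{cases}$$

The graph of $\eta_\sigma$ is a surface consisting of small "bumps" spread around $S$ and tending away from 0 monotonically with respect to $\mathrm{dist}_2(\cdot,B_+(0,r_S))$ on $[0,1]^d\setminus S$. Clearly, $\eta_\sigma(x)$ satisfies the smoothness requirement, since for $x\in[0,1]^d$ outside of $B_+(0,r_S)$

$$\mathrm{dist}_2\left(x,B_+(0,r_S)\right)=\|x\|_2-r_S$$

and $\frac d\gamma\ge\beta$ by assumption. (Footnote: $\Psi$ can be replaced by 1 unless $\beta\gamma=d$ and $\beta$ is an integer, in which case extra smoothness at the boundary of $B_+(0,r_S)$, provided by $\Psi$, is necessary.) Let us check that it also satisfies the low noise condition. Since $|\eta_\sigma|\ge Cq^{-\beta}$ on the support of $\Pi$, it is enough to consider $t=Czq^{-\beta}$ for $z>1$:

$$\begin{aligned}\Pi\left(|\eta_\sigma(x)|\le Czq^{-\beta}\right)&\le mq^{-d}+\Pi\left(\mathrm{dist}_2\left(x,B_+(0,r_S)\right)\le Cz^{\gamma/d}q^{-\frac{\beta\gamma}{d}}\right)\\&\le mq^{-d}+C_2\left(r_S+Cz^{\gamma/d}q^{-\frac{\beta\gamma}{d}}\right)^d\\&\le mq^{-d}+C_3mq^{-d}+C_4z^\gamma q^{-\beta\gamma}\\&\le\hat Ct^\gamma,\end{aligned}$$

if $mq^{-d}=O(q^{-\beta\gamma})$. Here, the first inequality follows from considering $\eta_\sigma$ on $S$ and $[0,1]^d\setminus S$ separately, and the second inequality follows from (3.4) and direct computation of the sphere volume.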
Finally, satisfies assumption 2 with some since on

 0

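The next step in the proof is to choose the subset of $\mathcal H$ which is "well-separated": this can be done due to the following fact (see Tsybakov [13], Lemma 2.9):

###### Proposition 3.1 (Gilbert-Varshamov).

For $m\ge8$, there exists

$$\{\sigma_0,\dots,\sigma_M\}\subset\{-1,1\}^m$$

such that $\sigma_0=(1,1,\dots,1)$, $\rho(\sigma_i,\sigma_j)\ge\frac m8$ for all $0\le i<j\le M$ and $M\ge2^{m/8}$, where $\rho$ stands for the Hamming distance.

Let $\mathcal H':=\{P_{\sigma_0},\dots,P_{\sigma_M}\}$ be chosen such that $\{\sigma_0,\dots,\sigma_M\}$ satisfies the proposition above. Next, following the proof of Theorems 1 and 3 in Castro and Nowak [4], we note that

$$KL(P_{\sigma,N}\|P_{\sigma_0,N})\le8N\max_{x\in[0,1]^d}\left(\eta_\sigma(x)-\eta_{\sigma_0}(x)\right)^2\le32\,C_{L,\beta}^2\,Nq^{-2\beta}, \tag{3.5}$$

where $P_{\sigma,N}$ is the joint distribution of $(X_i,Y_i)_{i=1}^N$ under the hypothesis that the distribution of the couple $(X,Y)$ is $P_\sigma$. Let us briefly sketch the derivation of (3.5); see also the proof of Theorem 1 in Castro and Nowak [4]. Denote

$$\bar X_k:=(X_1,\dots,X_k),\qquad\bar Y_k:=(Y_1,\dots,Y_k)$$

Then $dP_{\sigma,N}$ admits the following factorization:

$$dP_{\sigma,N}(\bar X_N,\bar Y_N)=\prod_{i=1}^NP_\sigma(Y_i|X_i)\,dP(X_i|\bar X_{i-1},\bar Y_{i-1}),$$

where $dP(X_i|\bar X_{i-1},\bar Y_{i-1})$ does not depend on $\sigma$ but only on the active learning algorithm. As a consequence,

$$\begin{aligned}KL(P_{\sigma,N}\|P_{\sigma_0,N})&=\mathbb E_{P_{\sigma,N}}\log\frac{dP_{\sigma,N}(\bar X_N,\bar Y_N)}{dP_{\sigma_0,N}(\bar X_N,\bar Y_N)}=\mathbb E_{P_{\sigma,N}}\log\frac{\prod_{i=1}^NP_\sigma(Y_i|X_i)}{\prod_{i=1}^NP_{\sigma_0}(Y_i|X_i)}\\&=\sum_{i=1}^N\mathbb E_{P_{\sigma,N}}\left[\mathbb E_{P_\sigma}\left(\log\frac{P_\sigma(Y_i|X_i)}{P_{\sigma_0}(Y_i|X_i)}\,\Big|\,X_i\right)\right]\\&\le8N\max_{x\in[0,1]^d}\left(\eta_\sigma(x)-\eta_{\sigma_0}(x)\right)^2,\end{aligned}$$

where the last inequality follows from Lemma 1, Castro and Nowak [4]. Also, note that we have $\max_{x\in[0,1]^d}$ in our bounds rather than the average over $x$ that would appear in the passive learning framework.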
It remains to choose in appropriate way: set and where are such that and which is possible for big enough. In particular, . Together with the bound (3.5), this gives

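$$\frac1M\sum_{\sigma\in\mathcal H'}KL(P_\sigma\|P_{\sigma_0})\le32\,C_u^2\,Nq^{-2\beta}$$

so that the conditions of Theorem 3.2 are satisfied. Setting

$$f_\sigma(x):=\mathrm{sign}\,\eta_\sigma(x),$$

we finally have

$$d(f_{\sigma_1},f_{\sigma_2}):=\Pi\left(\mathrm{sign}\,\eta_{\sigma_1}(x)\neq\mathrm{sign}\,\eta_{\sigma_2}(x)\right)\ge\frac{m}{8q^d}\ge C_4N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}},$$

where the lower bound just follows by construction of our hypotheses. Since $R_P(\hat f_N)-R^*\ge D_3\,\Pi\left(\hat f_N\neq\mathrm{sign}\,\eta_P\right)^{\frac{1+\gamma}{\gamma}}$ under the low noise assumption (see (2.5)), we conclude that

$$\begin{aligned}&\inf_{\hat f_N}\sup_{P\in\mathcal P_U^*(\beta,\gamma)}\Pr\left(R_P(\hat f_N)-R^*\ge C_4N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\right)\\&\qquad\ge\inf_{\hat f_N}\sup_{P\in\mathcal P_U^*(\beta,\gamma)}\Pr\left(\Pi\left(\hat f_N(x)\neq\mathrm{sign}\,\eta_P(x)\right)\ge\frac{C_4}{2}N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}}\right)\ge\tau>0.\end{aligned}$$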
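The exponent in the lower bound can be traced back to a balance between the Kullback-Leibler budget and the separation of the hypotheses. The following is a heuristic sketch with all constants suppressed; the scalings $m\asymp q^{d-\beta\gamma}$ and $q\asymp N^{\frac{1}{2\beta+d-\beta\gamma}}$ are inferred from the displayed bounds and should be read as assumptions of the sketch rather than as the exact choices made in the proof.

$$\begin{aligned}&\text{KL budget (Theorem 3.2 with (3.5)):}&&Nq^{-2\beta}\lesssim\log M\asymp m,\\&\text{low noise condition:}&&mq^{-d}\lesssim q^{-\beta\gamma}\ \Longrightarrow\ m\asymp q^{d-\beta\gamma},\\&\text{balance:}&&Nq^{-2\beta}\asymp q^{d-\beta\gamma}\ \Longrightarrow\ q\asymp N^{\frac{1}{2\beta+d-\beta\gamma}},\\&\text{separation:}&&\frac{m}{q^d}\asymp q^{-\beta\gamma}\asymp N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}},\\&\text{excess risk via (2.5):}&&\left(N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}}\right)^{\frac{1+\gamma}{\gamma}}=N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}.\end{aligned}$$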

### 3.2 Upper bounds for the excess risk

Below, we present a new active learning algorithm which is computationally tractable, adaptive with respect to $\beta,\gamma$ (in a certain range of these parameters) and can be applied in the nonparametric setting. We show that the classifier constructed by the algorithm attains the rates of Theorem 3.1, up to a polylogarithmic factor, if $0<\beta\le1$ and $\beta\gamma\le d$ (the last condition covers the most interesting case when the regression function hits or crosses the decision boundary in the interior of the support of $\Pi$; for a detailed statement about the connection between the behavior of the regression function near the decision boundary and the parameters $\beta,\gamma$, see Proposition 3.4 in Audibert and Tsybakov [1]). The problem of adaptation to higher order of smoothness ($\beta>1$) is still awaiting its complete solution; we address these questions below in our final remarks.
For the purpose of this section, the regularity assumption reads as follows: there exists $B_1>0$ such that

$$|\eta(x_1)-\eta(x_2)|\le B_1\|x_1-x_2\|_\infty^{\beta} \tag{3.6}$$

Since we want to be able to construct non-asymptotic confidence bands, some estimates on the size of the constants in (3.6) and assumption 2 are needed. Below, we will additionally assume that

$$B_1\le\log N,\qquad B_2\ge\log^{-1}N,$$

where $N$ is the label budget. This can be replaced by any known bounds on $B_1,B_2$.
Let with . Define

$$\hat\Pi_A(dx):=\Pi(dx\,|\,x\in A_\Pi)$$

and . Next, we introduce a simple estimator of the regression function on the set . Given the resolution level and an iid sample with , let

$$\hat\eta_{m,A}(x):=\sum_{i:\,R_i\cap A_\Pi\neq\emptyset}\frac{\sum_{j=1}^NY_jI_{R_i}(X_j)}{N\cdot\hat\Pi_A(R_i)}\,I_{R_i}(x) \tag{3.7}$$

Since we assumed that the marginal $\Pi$ is known, the estimator is well-defined. The following proposition provides the information about concentration of $\hat\eta_{m,A}$ around its mean:
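A minimal sketch of an estimator of the form (3.7) in dimension $d=1$ is given below. It assumes the cell probabilities under the restricted design distribution are supplied by the caller (they are known because the marginal is known); the function name `eta_hat` and the tiny data set are hypothetical.

```python
import numpy as np

def eta_hat(x_query, X, Y, m, cell_prob):
    """Piecewise-constant estimator in the spirit of (3.7), for d = 1.

    X, Y      : labeled sample, X drawn from the restricted design distribution
    m         : resolution level (2**m dyadic cells on [0, 1])
    cell_prob : known probabilities of the cells under the design distribution;
                cells with zero probability are treated as lying outside A."""
    n_cells = 2 ** m
    N = len(X)
    cell_prob = np.asarray(cell_prob, dtype=float)

    def cell_of(z):
        # index of the dyadic cell containing z (the point 1.0 goes to the last cell)
        return np.minimum((np.asarray(z) * n_cells).astype(int), n_cells - 1)

    # sum_j Y_j * I_{R_i}(X_j) for every cell i
    sums = np.bincount(cell_of(X), weights=Y, minlength=n_cells)
    values = np.zeros(n_cells)
    ok = cell_prob > 0
    values[ok] = sums[ok] / (N * cell_prob[ok])
    return values[cell_of(x_query)]

X = np.array([0.1, 0.2, 0.6, 0.9])
Y = np.array([1.0, -1.0, 1.0, 1.0])
print(eta_hat(np.array([0.25, 0.75]), X, Y, m=1, cell_prob=[0.5, 0.5]))
```

Note that the normalization uses the known cell probability rather than the empirical count of points in the cell, which is what makes the numerator a sum of independent bounded terms.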

###### Proposition 3.2.

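For all $t>0$,

$$\Pr\left(\max_{x\in A_\Pi}|\hat\eta_{m,A}(x)-\bar\eta_m(x)|\ge t\sqrt{\frac{2^{dm}\Pi(A)}{u_1N}}\right)\le2^{dm}\exp\left(-\frac{t^2}{2\left(1+\frac t3\sqrt{2^{dm}\Pi(A)/u_1N}\right)}\right),$$

###### Proof.

This is a straightforward application of Bernstein's inequality to the random variables

$$S_N^i:=\sum_{j=1}^NY_jI_{R_i}(X_j),\qquad i\in\{i:\ R_i\cap A_\Pi\neq\emptyset\},$$

and the union bound: indeed, note that $\mathbb E\left(Y\,I_{R_i}(X)\right)^2=\hat\Pi_A(R_i)$, so that

$$\Pr\left(\left|S_N^i-N\int_{R_i}\eta\,d\hat\Pi_A\right|\ge tN\hat\Pi_A(R_i)\right)\le2\exp\left(-\frac{N\hat\Pi_A(R_i)\,t^2}{2+2t/3}\right),$$

and the rest follows by simple algebra using that $\hat\Pi_A(R_i)\ge\frac{u_12^{-dm}}{\Pi(A)}$ by the $(u_1,u_2)$-regularity of $\Pi$. ∎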
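The algebra behind the last step can be sketched as follows. The sketch assumes the lower bound $N\hat\Pi_A(R_i)\ge u_1N2^{-dm}/\Pi(A)$ for every cell entering the estimator (our reading of how regularity is used); the shorthand $n_i$ and $s$ is introduced only for this illustration.

$$\begin{aligned}&n_i:=N\hat\Pi_A(R_i)\ \ge\ \frac{u_1N}{2^{dm}\Pi(A)},\qquad s:=t\sqrt{\frac{2^{dm}\Pi(A)}{u_1N}}\quad\Longrightarrow\quad n_is^2\ge t^2,\\&\Pr\left(\left|S_N^i-N\int_{R_i}\eta\,d\hat\Pi_A\right|\ge s\,n_i\right)\le2\exp\left(-\frac{n_is^2}{2+2s/3}\right)\le2\exp\left(-\frac{t^2}{2\left(1+s/3\right)}\right).\end{aligned}$$

Dividing the deviation inside the probability by $n_i$ turns it into the deviation of the estimator from its mean on the cell $R_i$, and a union bound over the cells that intersect $A_\Pi$ produces the factor in front of the exponent.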

Given a sequence of hypotheses classes $\{\mathcal G_m,\ m\ge1\}$, define the index set

$$\mathcal J(N):=\left\{m\in\mathbb N:\ 1\le\dim\mathcal G_m\le\frac{N}{\log^2N}\right\} \tag{3.8}$$

Here $\mathcal J(N)$ is the set of possible "resolution levels" of an estimator based on $N$ classified observations (the upper bound corresponds to the fact that we want the estimator to be consistent). When talking about model selection procedures below, we will implicitly assume that the model index is chosen from the corresponding set $\mathcal J(N)$. The role of $\mathcal G_m$ will be played by $\mathcal F_m|_A$ for an appropriately chosen set $A$. We are now ready to present the active learning algorithm followed by its detailed analysis (see Table 1).
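For a feel of how many resolution levels the index set (3.8) admits, the sketch below enumerates them under two assumptions made only for this example: the dimension of the $m$-th class is taken to be $2^{dm}$ (the number of dyadic cells of the whole cube), and the logarithm is natural.

```python
import math

def resolution_levels(N, d):
    """Levels m >= 1 with 2**(d*m) <= N / log(N)**2.

    Assumes dim G_m = 2**(d*m) and a natural logarithm; both are
    illustrative assumptions, not statements from the paper."""
    bound = N / math.log(N) ** 2
    levels = []
    m = 1
    while 2 ** (d * m) <= bound:
        levels.append(m)
        m += 1
    return levels

print(resolution_levels(10_000, 1))   # admissible levels in dimension 1
print(resolution_levels(10_000, 2))   # fewer levels are admissible in dimension 2
```

The admissible range shrinks quickly with the dimension, reflecting the fact that each additional level multiplies the number of cells by $2^d$.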