Active Learning for Binary Classification with Abstention

Shubhanshu Shekhar
shshekha@eng.ucsd.edu
Tara Javidi
tjavidi@eng.ucsd.edu
Abstract

We construct and analyze active learning algorithms for the problem of binary classification with abstention. We consider three abstention settings: fixed-cost and two variants of bounded-rate abstention, and for each of them propose an active learning algorithm. All the proposed algorithms can work in the most commonly used active learning models, i.e., membership-query, pool-based, and stream-based sampling. We obtain upper-bounds on the excess risk of our algorithms in a general non-parametric framework, and establish their minimax near-optimality by deriving matching lower-bounds. Since our algorithms rely on the knowledge of some smoothness parameters of the regression function, we then describe a new strategy to adapt to these unknown parameters in a data-driven manner. Since the worst case computational complexity of our proposed algorithms increases exponentially with the dimension of the input space, we conclude the paper with a computationally efficient variant of our algorithm whose computational complexity has a polynomial dependence over a smaller but rich class of learning problems.

1 Introduction

We consider the problem of binary classification in which the learner has an additional provision of abstaining from declaring a label. This problem models several practical scenarios in which it is preferable to withhold a decision, perhaps at the cost of some additional experimentation, instead of making an incorrect decision and incurring much higher costs. A canonical application of this problem is in automated medical diagnostic systems (Rubegni et al., 2002), where classifiers that defer to a human expert on uncertain inputs are more desirable than classifiers that always make a decision. Other key applications include dialog systems and the detection of harmful content on the web.

Several existing works in the literature, such as Castro and Nowak (2008); Dasgupta (2006), have demonstrated the benefits of active learning (under certain conditions) in standard binary classification. However, in the case of classification with abstention, the design of active learning algorithms and their comparison with their passive counterparts have largely been unexplored. In this paper, we aim to fill this gap in the literature. More specifically, we design active learning algorithms for classification with abstention in three different settings. Setting 1 is the fixed-cost setting, in which every usage of the abstain option results in a known cost $\lambda$. Setting 2 is the bounded-rate setting with a “known” input marginal ($P_X$). This provides a smooth transition from Setting 1 to Setting 3, and allows us to demonstrate the key algorithmic changes in this transition. Setting 3 is the bounded-rate setting with an “unknown” marginal ($P_X$). Here, the algorithm has the option to request additional unlabelled samples, so long as the number of such samples grows only polynomially with the label budget $n$. The fixed-cost setting is suitable for problems where a precise cost can be assigned to additional experimentation due to using the abstain option. In applications such as medical diagnostics, where the bottleneck is the processing speed of the human expert (Pietraszek, 2005), the bounded-rate framework is more natural.

Prior Work: Chow (1957) studied the problem of passive learning with abstention and derived the Bayes optimal classifier for both fixed-cost and bounded-rate settings (under certain continuity assumptions). Chow (1970) further analyzed the trade-off between error rate and rejection rate. Recently, a collection of papers has revisited this problem in the fixed-cost setting. Herbei and Wegkamp (2006) obtained convergence rates for classifiers in a non-parametric framework similar to our paper. Bartlett and Wegkamp (2008) and Yuan (2010) studied convex surrogate loss functions for this problem and obtained bounds on the excess risk of empirical risk minimization based classifiers. Wegkamp (2007) and Wegkamp and Yuan (2011) studied an $\ell_1$-regularized version of this problem. Cortes et al. (2016) introduced a new framework which involved learning a pair of functions and proposed and analyzed convex surrogate loss functions. The problem of binary classification with a bounded rate of abstention has also been studied, albeit less extensively. Pietraszek (2005) proposed a method to construct abstaining classifiers using ROC analysis. Denis and Hebiri (2015) re-derived the Bayes optimal classifier for the bounded-rate setting under the same assumptions as Chow (1957). They further proposed a general plug-in strategy for constructing abstaining classifiers in a semi-supervised setting, and obtained an upper bound on the excess risk.

Contributions: For each of the three abstention settings mentioned earlier, we propose an algorithm that can work with three common active learning models (Settles, 2009, § 2): membership query, pool-based, and stream-based. After describing the algorithms, we obtain upper-bounds on their excess risk in a general non-parametric framework with mild assumptions on the joint distribution of input features and labels (Section 3). The obtained rates compare favorably with the existing results in the passive setting, thus characterizing the gains associated with active learning (see Section 7 for a discussion). Since our proposed algorithms require knowledge of certain smoothness parameters, in Section 4 we propose a new adaptive scheme that adjusts to the unknown smoothness terms in a data-driven manner. In Section 5, we derive lower-bounds on the excess risk for both fixed-cost and bounded-rate settings to establish the minimax near-optimality of our algorithms. Finally, we conclude in Section 6 by describing a computationally feasible version of our algorithm for a restricted but rich class of problems.

2 Preliminaries

Let denote the input space and denote the set of labels to be assigned to points in . We assume (to simplify the presentation; our work can be readily extended to any compact metric space with finite metric dimension) that and is the Euclidean metric on , i.e., for all . A binary classification problem is completely specified by $P_{XY}$, i.e., the joint distribution of the input-label random variables. Equivalently, it can also be represented in terms of the marginal over the input space, $P_X$, and the regression function . A (randomized) abstaining classifier is defined as a mapping , where , the symbol $\Delta$ represents the option of the classifier to abstain from declaring a label, and represents the set of probability distributions on . Such a classifier comprises three functions , for , satisfying , for each . A classifier is called deterministic if the functions take values in the set . Every deterministic classifier partitions the input set into three disjoint sets .

Two common abstention models considered in the literature are:

• Fixed Cost, in which the abstain option can be employed with a fixed cost of . In this setting, the classification risk is defined as , and the classification problem is stated as

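$$\min_{g} \; R_\lambda(g) = \mathbb{E}\left[l_\lambda(g, X, Y)\right] = P_{XY}\big(g(X) \neq Y,\; g(X) \neq \Delta\big) + \lambda\, P_X\big(g(X) = \Delta\big). \tag{1}$$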

The Bayes optimal classifier is defined as , , or , depending on whether , , or is the smallest.
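The threshold rule above can be sketched in a few lines. This is only an illustration: it assumes that the regression value at the query point, here called `eta_x` (our name for the conditional probability of label 1), is known, which is never the case for the learner, and it uses `None` as a stand-in for the abstain symbol Δ. The three costs compared are the conditional risks of predicting 1, predicting 0, and abstaining under the risk in (1).

```python
from typing import Optional


def bayes_fixed_cost(eta_x: float, lam: float) -> Optional[int]:
    """Return the cheapest action at a point: 1, 0, or None (abstain).

    eta_x is an assumed-known value of P(Y = 1 | X = x); lam is the
    abstention cost. Both names are illustrative, not the paper's notation.
    """
    costs = {
        1: 1.0 - eta_x,  # probability of error when predicting 1
        0: eta_x,        # probability of error when predicting 0
        None: lam,       # fixed price of abstaining
    }
    return min(costs, key=costs.get)
```

For a cost below 1/2 the rule abstains exactly when the regression value lies between the cost and one minus the cost, which is the band that the active learning algorithms try to localize.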

• Bounded-Rate, in which the classifier can abstain up to a fraction of the input samples. In this setting, we define the misclassification risk of a classifier as , and state the classification problem as

$$\min_{g} \; R(g), \quad \text{subject to } P_X\big(g(X) = \Delta\big) \leq \delta. \tag{2}$$

The Bayes optimal classifier for (2) is in general a randomized classifier. However, under some continuity assumptions on the joint distribution , it is again of a threshold type, , , or , depending on whether , , or is minimum, where .
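An empirical analogue of this threshold rule may help fix ideas. The sketch below is not the paper's algorithm: it assumes the regression values `etas` are known on a finite sample drawn from the input marginal, and it simply abstains on the allowed fraction of sample points at which the best label is most error-prone. The function name and the use of `None` for the abstain symbol are our own choices.

```python
import math
from typing import List, Optional


def bounded_rate_rule(etas: List[float], delta: float) -> List[Optional[int]]:
    """Abstain (None) on at most a delta fraction of the sample points."""
    budget = math.floor(delta * len(etas))
    # Conditional error of the better of the two labels at each point.
    err = [min(e, 1.0 - e) for e in etas]
    # Indices ordered from the most to the least uncertain point.
    order = sorted(range(len(etas)), key=lambda j: err[j], reverse=True)
    abstain = set(order[:budget])
    return [None if j in abstain else int(etas[j] > 0.5)
            for j in range(len(etas))]
```

The implicit threshold here is the error level of the last abstained point; in the population version it depends on the whole distribution of the regression function, which is why it is not available in advance.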

The main difference between (1) and (2) is that in the fixed cost setting, the threshold levels are known beforehand, while in the bounded rate of abstention setting, the mapping is not known, and in general is quite complex. In order to construct a classifier that satisfies the constraint in (2), we need some information about the marginal . Accordingly, we consider two variants of the bounded-rate setting: (i) the marginal is completely known to the learner, and (ii) is not known, and the learner can request a limited number (polynomial in query budget ) of unlabelled samples to estimate the measure of any set of interest.

Active learning models:

For every abstention model mentioned above, we propose active learning algorithms that can work in three commonly used active learning settings (Settles, 2009, § 2): (i) membership query synthesis, (ii) pool-based, and (iii) stream-based. Membership query synthesis requires the strongest query model, in which the learner can request labels at any point of the input space. A slightly weaker version of this model is the pool-based setting, in which the learner is provided with a pool of unlabelled samples and must request labels of a subset of the pool. Finally, in the stream-based setting, the learner receives a stream of samples and must decide whether to request a label or discard the sample.

2.1 Definitions and Assumptions

To construct our classifier, we will require a hierarchical sequence of partitions of the input space, called the tree of partitions (Bubeck et al., 2011; Munos et al., 2014).

Definition 1.

A sequence of subsets of are said to form a tree of partitions of , if they satisfy the following properties: (i) and we denote the elements of by , for , (ii) for every , we denote by , the cell associated with , which is defined as , where ties are broken in an arbitrary but deterministic manner, and (iii) there exist constants and such that for all and , we have , where is the open ball in centered at with radius .

Remark 1.

For the metric space considered in our paper, i.e.,  and being the Euclidean metric, the cells are dimensional rectangles. Thus, a suitable choice of parameter values for our algorithms are , , and .
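For intuition, a one-dimensional instance of such a tree can be written down directly. The sketch below assumes the unit interval as input space and dyadic cells; the child indexing (2i−1 and 2i at the next depth) follows the refinement step of Algorithm 1, while the 1-based index range and the helper names are assumptions made for illustration.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (depth h, index i), assuming 1 <= i <= 2**h


def bounds(cell: Cell) -> Tuple[float, float]:
    """End points of the dyadic interval associated with a cell."""
    h, i = cell
    return (i - 1) / 2 ** h, i / 2 ** h


def center(cell: Cell) -> float:
    """Representative point of the cell (its midpoint)."""
    lo, hi = bounds(cell)
    return (lo + hi) / 2


def children(cell: Cell) -> List[Cell]:
    """The two cells at the next depth that split this cell in half."""
    h, i = cell
    return [(h + 1, 2 * i - 1), (h + 1, 2 * i)]
```

Each refinement halves the cell diameter, so the diameter at depth h decays geometrically, which is the kind of behaviour required by property (iii) of Definition 1.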

Next, we define the dimensionality of the region of the input space at which the regression function is close to some threshold value .

Definition 2.

For a function and a threshold , we define the near- dimension associated with and the regression function as

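$$D_\lambda(\zeta) := \inf\left\{a \geq 0 \;\middle|\; \exists\, C > 0 :\; M\big(\mathcal{X}_\lambda(\zeta(r)),\, r\big) \leq C r^{-a},\; \forall r > 0\right\}, \tag{3}$$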

where and is the packing number of .

The above definition is motivated by similar definitions used in the bandit literature such as the near-optimality dimension of Bubeck et al. (2011) and the zooming dimension of Kleinberg et al. (2013). For the case of considered in this paper, the term must be no greater than , i.e., . This is because , for all , and there exists a constant , such that , for all .

Remark 2.

We will use an instance of near- dimension for stating our results defined as , where and , for .

Assumptions:

We now state the assumptions required for the analysis of our classifiers:

(MA)

The joint distribution of the input-label pair satisfies the margin assumption with parameters and , for in the set , which means that for any , we have , for .

(HÖ)

The regression function is Hölder continuous with parameters and , i.e., for all , we have .

(DE)

For the values of in the same set as in (MA), we define the detectability assumption with parameters and as , for any .

The (MA) and (HÖ) assumptions are quite standard in the nonparametric learning literature (Herbei and Wegkamp, 2006; Minsker, 2012). The (DE) assumption, which is only required in the bounded-rate setting, has also been employed in several prior works such as Castro and Nowak (2008); Tong (2013). A detailed discussion of these assumptions is presented in Appendix A.1.

3 Active Learning Algorithms

We consider three settings for the problem of binary classification with abstention in this paper. For each setting, we propose an active learning algorithm and prove an upper-bound on its excess risk.

The algorithm for Setting 1 provides us with the general template which is also followed in the other two settings with some additional complexity. Because of this, we describe the specifics of the algorithm for Setting 1 in the main text, and relegate the details of the algorithmic as well as analytic modifications required for Settings 2 and 3 to the appendix. Throughout this paper, we will refer to the algorithm for Setting $i$ as Algorithm $i$, for $i = 1$, $2$, and $3$.

3.1 Setting 1: Abstention with the fixed cost λ∈(0,1/2)

In this section, we first provide an outline of our active learning algorithm for this setting (Algorithm 1). We then describe the steps of this algorithm and present an upper-bound on the excess risk of the classifier constructed by the algorithm. We report the pseudo-code of the algorithm and the proofs in Appendices B.1 and B.3.

Outline of Algorithm 1.

At any time , the algorithm maintains a set of active points , such that the cells associated with the points in partition the whole , i.e., . The set is further divided into classified active points, , unclassified active points, , and discarded points, . The classified points are those at which the value of has been estimated sufficiently well so that we do not need to evaluate them further. The unclassified points require further evaluation and perhaps refinement before making a decision. The discarded points are those for which we do not have sufficiently many unlabelled samples in their cells (in the stream-based and pool-based settings). For every active point, the algorithm computes high probability upper and lower bounds on the maximum and minimum values in the cell associated with the point. The difference of these upper and lower bounds can be considered as a surrogate for the uncertainty in the value in a cell. In every round, the algorithm selects a candidate point from the unclassified set that has the largest value of this uncertainty. Having chosen the candidate point, the algorithm either refines the cell or asks for a label at that point.

Steps of Algorithm 1.

The algorithm proceeds in the following steps:

1. For , initialize , , , , , and .

2. For , for every , we calculate and , which are an upper-bound on the maximum value and a lower-bound on the minimum value of the regression function in , respectively. We define , where . Here is the empirical estimate of in the cell , is the number of times the cell has been queried by the algorithm up to time , represents the confidence interval length at (see Lemma 3 in Appendix B.3), and is an upper-bound on the maximum variation of the regression function in a cell at level of the tree of partitions. The term is defined in a similar manner using instead of and using . We add all points to the set , if they satisfy any one of these three conditions, (a) , (b) , or (c) .

3. The set of unclassified active points, , are those points in for which is nonempty.

4. We select a candidate point from according to the rule , where we define the index .

5. Once a candidate point is selected, we take one of the following two actions:

1. Refine. If the uncertainty in the regression function value at , denoted by , is smaller than the upper-bound on the function variation in the cell , denoted by , and if , then we perform the following operations:

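$$\mathcal{X}_t \leftarrow \left(\mathcal{X}_t \setminus \{x_{h_t, i_t}\}\right) \cup \{x_{h_t+1,\, 2i_t-1},\; x_{h_t+1,\, 2i_t}\},$$
$$u_t(x_{h_t+1,\, 2i_t-1}) = u_t(x_{h_t, i_t}), \quad l_t(x_{h_t+1,\, 2i_t-1}) = l_t(x_{h_t, i_t}),$$
$$u_t(x_{h_t+1,\, 2i_t}) = u_t(x_{h_t, i_t}), \quad l_t(x_{h_t+1,\, 2i_t}) = l_t(x_{h_t, i_t}).$$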
2. Request a Label. Otherwise, for each active learning model, we proceed as follows:

• In the membership query model, we request the label at any point in the cell associated with .

• In the pool-based model, we request the label if there is an unlabelled sample remaining in the cell . Otherwise, we remove from , add it to , and return to Step 2.

• In the stream-based model, we discard the samples until a point in the cell arrives. If samples have been discarded, we remove from , add it to , and return to Step 2 without requesting a label.

6. Let denote the time at which the ’th query is made and the algorithm halts. Then, we define the final estimate of the regression function as , where

$$\pi_{t_n}(x) := \left\{x_{h,i} \in \mathcal{X}_{t_n} \;\middle|\; d(x, x_{h,i}) \leq d(x, x_{h',i'}),\; \forall x_{h',i'} \in \mathcal{X}_{t_n}\right\}, \tag{4}$$

and define the discarded region of the input space as .

7. Finally, the classifier returned by the algorithm is defined as

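$$\hat{g}(x) = \begin{cases} 1 & \text{if } u_{t_n}(\pi_{t_n}(x)) > 1 - \lambda \text{ or } x \in \tilde{\mathcal{X}}_n, \\ 0 & \text{if } l_{t_n}(\pi_{t_n}(x)) < \lambda \text{ and } x \notin \tilde{\mathcal{X}}_n, \\ \Delta & \text{otherwise.} \end{cases} \tag{5}$$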

Note that the classifier (5) arbitrarily assigns label $1$ to the points in the discarded region $\tilde{\mathcal{X}}_n$.
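The steps above can be illustrated with a heavily simplified sketch. It is not the authors' pseudo-code (which is in Appendix B.1): it works on the unit interval with dyadic cells, covers only the membership query model (so the discarded region is empty), uses the gap between upper and lower bounds as the selection index, a Hoeffding-style confidence radius, and a variation bound of the form L·2^(−hβ). All of these choices, the classification conditions in `is_classified`, and the names `run_algorithm`, `predict`, `h_max`, and `conf` are assumptions made for illustration.

```python
import math
import random


def run_algorithm(eta, lam, n, L=1.0, beta=1.0, h_max=12, conf=0.05, seed=0):
    """Cell-refinement loop on [0, 1] with label queries at cell centers."""
    rng = random.Random(seed)
    # cell (h, i) -> [label sum, query count, upper bound u, lower bound l]
    cells = {(0, 1): [0.0, 0, 1.0, 0.0]}
    queries = 0

    def variation(h):  # assumed bound on the variation of eta in a depth-h cell
        return L * 2.0 ** (-h * beta)

    def radius(k):  # assumed Hoeffding-style confidence radius after k labels
        return 1.0 if k == 0 else math.sqrt(math.log(2.0 * n / conf) / (2.0 * k))

    def is_classified(u, l):
        # the band [l, u] sits below lam, above 1 - lam, or strictly in between
        return u < lam or l > 1.0 - lam or (l > lam and u < 1.0 - lam)

    while queries < n:
        open_cells = [c for c, s in cells.items() if not is_classified(s[2], s[3])]
        if not open_cells:
            break
        # candidate: unclassified cell with the largest uncertainty u - l
        h, i = max(open_cells, key=lambda c: cells[c][2] - cells[c][3])
        s = cells[(h, i)]
        if radius(s[1]) < variation(h) and h < h_max:
            # Refine: replace the cell by its two children, which inherit u and l
            del cells[(h, i)]
            cells[(h + 1, 2 * i - 1)] = [0.0, 0, s[2], s[3]]
            cells[(h + 1, 2 * i)] = [0.0, 0, s[2], s[3]]
        else:
            # Request a label at the cell center (membership query model)
            x = (i - 0.5) / 2 ** h
            s[0] += 1.0 if rng.random() < eta(x) else 0.0
            s[1] += 1
            queries += 1
            mean = s[0] / s[1]
            s[2] = min(s[2], mean + radius(s[1]) + variation(h))
            s[3] = max(s[3], mean - radius(s[1]) - variation(h))
    return cells


def predict(cells, x, lam):
    """Classifier shaped like Eq. (5); None stands for the abstain symbol."""
    for (h, i), (_, _, u, l) in cells.items():
        if (i - 1) / 2 ** h <= x <= i / 2 ** h:
            if u > 1.0 - lam:
                return 1
            return 0 if l < lam else None
    raise ValueError("x must lie in [0, 1]")
```

A typical call would be `cells = run_algorithm(lambda x: x, lam=0.3, n=10000)` followed by `predict(cells, x, 0.3)` for query points `x`. In such a run the label budget is spent mostly on cells whose confidence band still straddles one of the two thresholds, which is the behaviour that the near-λ dimension in the analysis is meant to capture.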

Remark 3.

Algorithm 1 (and as we will see later Algorithms 2 and 3) assumes the knowledge of parameters , , , and . As described in Remark 1, it is straightforward to select the parameters and , but the smoothness parameters and are often not known to the algorithm. We address this in Section 4 by designing an algorithm that adapts to the smoothness parameters.

In the membership query model, the discarded set remains empty since the learner can always obtain a labelled sample from any cell. We begin with a result that shows that even in the other two models, the probability mass of the discarded region is small under some mild assumptions.

Lemma 1.

Assume that in the pool-based model, the pool size is greater than and in the stream-based model, the term is set to . Then, we have .

This lemma (proved in Appendix B.2) implies that in the pool-based and stream-based models, with high probability, the misclassification risk of can be upper-bounded by . Lemma 1 is quite important because it implies that under some mild conditions, the analysis of the pool-based and stream-based models reduces to the analysis of the membership query model with an additional cost that can be upper bounded by .

We now prove an upper-bound on the excess risk of the classifier (see Appendix B.3 for the proof).

Theorem 1.

Suppose that the assumptions (MA) and (HÖ) hold, and let be the dimension term defined in Remark 2. Then, for large enough , with probability at least , for the classifier defined by (5) and for any , we have

$$R_\lambda(\hat{g}) - R_\lambda(g^*_\lambda) = \tilde{O}\left(n^{-\beta(\alpha_0+1)/(2\beta+a)}\right), \tag{6}$$

where the hidden constant depends on the parameters , , , , , , and .

The above result improves upon the convergence rate of the plug-in scheme of Herbei and Wegkamp (2006) in the passive setting mirroring the benefits of active learning in the standard binary classification problems. See Section 7 and Appendix H for further discussion.
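To get a feel for the rate in (6), the exponent can be evaluated for a few parameter values. The helper below simply computes β(α₀+1)/(2β+a) and inverts the rate to estimate how many labels are needed for a target excess risk. It ignores logarithmic factors and takes the hidden constant to be 1, which is an assumption made purely for illustration; the function names are ours.

```python
def rate_exponent(beta: float, alpha0: float, a: float) -> float:
    """Exponent e such that the bound in Eq. (6) scales as n ** (-e)."""
    return beta * (alpha0 + 1.0) / (2.0 * beta + a)


def labels_for_excess_risk(eps: float, beta: float, alpha0: float, a: float) -> float:
    """Solve n ** (-e) = eps for n, with the hidden constant assumed to be 1."""
    return eps ** (-1.0 / rate_exponent(beta, alpha0, a))
```

The exponent decreases as the dimension term `a` grows, so problem instances in which the regression function is close to the thresholds only on a low-dimensional region admit faster rates under this bound.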

3.2 Setting 2: Bounded-rate setting with known P_X

This setting provides an intermediate step between the fixed-cost and bounded-rate settings. The key difference between the algorithms for this and the fixed-cost setting lies in the rule used for updating the set of unclassified points. Since in this case the threshold is not known, we need to use the current estimate of the regression function to obtain upper and lower bounds on the true threshold, and then use these bounds to decide which parts of the input space have to be further explored. We report the details of the algorithm in Appendix C.1, its pseudo-code in Appendix C.2, and the statement and proof of its excess risk bound (Theorem 3) in Appendix C.3.

3.3 Setting 3: Bounded-rate setting with unlabelled samples

Finally, we consider the general bounded-rate abstention model in the semi-supervised setting. In this case, the algorithm should request unlabelled samples and use them both to construct the estimates of the appropriate threshold values and to obtain better empirical estimates of the measure of a set. Unlike Algorithm 2, in Algorithm 3 we have to construct estimates of the threshold using empirical measure , and furthermore, based on the error in the estimate of , we also need a strategy of updating by requesting more unlabelled samples. We report the details of Algorithm 3 in Appendix D.1, its pseudo-code in Appendix D.2, and the statement and proof of its excess risk bound (Theorem 4) in Appendix D.3. We note that the excess risk bound for Algorithm 3 is minimax (near)-optimal under the same assumptions as in Algorithms 1 and 2. However, in order to exploit easier problem instances in which is much smaller than , we require an additional (DE) assumption (see Section 7 for a detailed discussion).

4 Adaptivity to Smoothness Parameters

All the active learning algorithms discussed in Section 3 assume the knowledge of the Hölder smoothness parameters and . We now present a simple strategy to achieve adaptivity to these parameters. To simplify the presentation, we only consider the problem in the fixed-cost setting with the membership query model. Extension to the other settings and models can be done in the same manner. The parameters are required by Algorithm 1 at two junctures: 1) to define the index for selecting a candidate point, and 2) to decide when to refine a cell. In our proposed adaptive scheme, we address these issues as follows:

• Instead of selecting one candidate point in each step, we select one point from each level from the current set of active points. This is similar to the approach used in the SOO algorithm (Munos, 2011) for global optimization. Since the maximum depth of the tree is , this modification only results in an additional factor in the excess risk.

• To decide when to refine, we need to estimate the variation of in a cell from samples. We make an additional assumption, (QU), that the pair has quality (see Appendix E for the definition). This assumption has been used in prior works on adaptive global optimization (Slivkins, 2011; Bull et al., 2015). We then proceed by proposing a local variant of Lepski’s technique (Lepski et al., 1997) to construct the required estimate of the variation of , combined with an appropriate stopping rule.

With these two modifications and the additional quality assumption (QU), we can achieve the rate , with , thus, matching the performance of Algorithm 1. The details of the adaptive scheme and the proof of convergence rate are provided in Appendix E.
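The first modification, selecting one candidate per level rather than a single global candidate, can be sketched as follows. The dictionary `uncertainty` is a hypothetical stand-in for whatever per-cell uncertainty measure the adaptive scheme maintains (for instance, the gap between the upper and lower bounds); the function name is also ours.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (depth h, index i)


def select_per_level(uncertainty: Dict[Cell, float]) -> List[Cell]:
    """Return, for each depth present, the active cell with the largest uncertainty."""
    best: Dict[int, Cell] = {}
    for cell, value in uncertainty.items():
        h = cell[0]
        if h not in best or value > uncertainty[best[h]]:
            best[h] = cell
    return [best[h] for h in sorted(best)]
```

Because one cell is picked at every depth, no smoothness-dependent index is needed to compare cells across depths; the price is that each round may use as many queries as there are levels in the tree.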

Remark 4.

We note that there are other adaptive schemes for active learning, such as Minsker (2012); Locatelli et al. (2017), that can also be applied to the problem studied in this paper. Our proposed adaptive scheme provides an alternative to these existing methods. Furthermore, our scheme can also be applied to classification problems with implicit similarity information, similar to Slivkins (2011), as well as to problems with spatially inhomogeneous regression functions.

5 Lower Bounds

We now derive minimax lower-bounds on the expected excess risk in the fixed-cost setting and for the membership query model. Since this is the strongest active learning query model, the obtained lower-bounds are also true for the other two models. The proof follows the general outline for obtaining lower bounds described in existing works, such as Audibert and Tsybakov (2007); Minsker (2012), reducing the estimation problem to that of an appropriate multiple hypothesis testing problem, and applying Theorem 2.5 of Tsybakov (2009). The novel elements of our proof are the construction of an appropriate class of regression functions (see Appendix F) and the comparison inequality presented in Lemma 2.

We begin by presenting a lemma that provides a lower-bound on the excess risk of an abstaining classifier in terms of the probability of the mismatch between the abstaining regions of the given classifier and the Bayes optimal classifier. The proof of Lemma 2 is given in Appendix F.

Lemma 2.

In the fixed-cost abstention setting with cost of abstention equal to , let represent any abstaining classifier and represent the Bayes optimal one. Then, we have

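$$R_\lambda(g) - R_\lambda(g^*_\lambda) \geq c\, P_X\big((G^*_\lambda \setminus G_\lambda) \cup (G_\lambda \setminus G^*_\lambda)\big)^{\frac{1+\alpha_0}{\alpha_0}}, \tag{7}$$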

where is a constant, and is the parameter used in the assumptions of Section 2.1.

Lemma 2 aids our lower-bound proof in several ways: 1) the RHS of (7) motivates our construction of hard problem instances, in which it is difficult to distinguish between the ‘abstain’ and ‘not-abstain’ options, 2) the RHS of (7) also suggests a natural definition of pseudo-metric (see Theorem 5 in Appendix F.2), and 3) it allows us to convert the lower-bound on the hypothesis testing problem to that on the excess risk. We now state the main result of this section (see Appendix F for the proof).

Theorem 2.

Let be any active learning algorithm and be the abstaining classifier learned by with label queries in the fixed-cost abstention setting, with cost . Let represent the class of joint distributions satisfying the margin assumption (MA) with exponent , whose regression function is Hölder continuous with and . Then, we have

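$$\inf_{\mathcal{A}} \; \sup_{P_{XY} \in \mathcal{P}(L, \beta, \alpha_0)} \mathbb{E}\left[R_\lambda(\hat{g}_n) - R_\lambda(g^*_\lambda)\right] \geq C\, n^{-\beta(1+\alpha_0)/(2\beta+D)}.$$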

Finally, by exploiting the relation between the Bayes optimal classifier in the fixed-cost and bounded-rate of abstention settings, we can obtain the following lower-bound on the expected excess risk in the bounded-rate of abstention setting.

Corollary 1.

For the bounded-rate of abstention setting, we have the following lower-bound:

The proof of this statement is given in Appendix F.

6 Computationally Feasible Algorithms

The lower bound obtained in the previous section implies that in the worst case, to ensure an excess risk smaller than , any algorithm will require label requests (in both the fixed-cost and bounded-rate settings). This means that the worst-case computational complexity of any algorithm will have an exponential dependence on the dimension. The above discussion suggests that to obtain computationally tractable algorithms, we need to restrict the hypothesis class. We consider the class of learning problems where the regression function is a generalized linear map given by where is a monotonic invertible Hölder continuous function. This class of problems (henceforth denoted by ), though much smaller than considered in previous sections, contains standard problem instances such as linear classifiers and logistic regression. Furthermore, by using appropriate feature maps, the class can model very complex decision boundaries.

Due to the special structure of the regression function, the learning problem (for Setting 1) then reduces to estimating the optimal hyperplane , and the value . Here we can employ the dimension coupling technique of Chen et al. (2017), which implies that the dimensional problem can be reduced to two-dimensional problems. Furthermore, as we show in Proposition 2 (stated and proved in Appendix G), for an a modified version of Algorithm 1 can estimate the term for continuously differentiable with accuracy using a number of labelled samples that has a polynomial dependence on the dimension .
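To see why the problem reduces to a hyperplane and a scalar value, consider the logistic regression instance mentioned above. The sketch below assumes that the link function is the logistic sigmoid and that the weight vector `w` is already known (in practice it must be estimated); the fixed-cost rule then becomes a band around the hyperplane whose half-width is the logit of one minus the cost. The names `logit` and `glm_abstain` are ours.

```python
import math
from typing import Optional, Sequence


def logit(p: float) -> float:
    """Inverse of the logistic sigmoid."""
    return math.log(p / (1.0 - p))


def glm_abstain(w: Sequence[float], x: Sequence[float], lam: float) -> Optional[int]:
    """Fixed-cost rule when eta(x) = sigmoid(<w, x>); None means abstain."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    if score > logit(1.0 - lam):
        return 1  # eta(x) > 1 - lam
    if score < logit(lam):
        return 0  # eta(x) < lam
    return None   # eta(x) lies between lam and 1 - lam
```

Since the link is monotone, comparing the regression value with the two thresholds is equivalent to comparing the linear score with two scalars, so only the direction of the hyperplane and these scalars need to be learned.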

7 Discussion

Improved Convergence Rates (active over passive learning).

The convergence rates on the excess risk obtained by our active learning algorithms improve upon those in the literature obtained in the passive case. More specifically, the excess risk in the passive case for the fixed-cost (Herbei and Wegkamp, 2006) and bounded-rate (Denis and Hebiri, 2015) settings is (using the estimators of Audibert and Tsybakov 2007). In contrast, all our algorithms achieve an excess risk of , for . Thus, even for the worst case of , our algorithms achieve faster convergence in both abstention settings. Moreover, under the additional assumption that admits a density w.r.t. the Lebesgue measure, such that , for all , the convergence rates in the passive case for both abstention settings improve by getting rid of the term in the exponent. The performance of our algorithms also improves further with this additional assumption, and we can show that (see Appendix H.1 for details).

Necessity of the Detectability (DE) Assumption.

In Setting 3, the size of the unclassified region, , depends on two terms: 1) the error in the estimate of the regression function , and 2) the error due to using the empirical measure . The (DE) assumption ensures that for sufficiently accurate empirical estimates of the marginal , we can control the size of the unclassified region in terms of the errors in the estimate of the regression function (similar to Settings 1 and 2). Appendix H.2 gives a situation in which, without (DE), Algorithm 3 has to explore a much larger region of the input space than Algorithm 2 does in Setting 2. Since there exist problem instances for which , we note that (DE) is not needed to match the worst-case performance of Algorithm 2. However, it is required in order to exploit the easy problem instances with low values of .

8 Conclusions and Future Work

In this paper, we proposed and analyzed active learning algorithms for three settings of the problem of binary classification with abstention. The first setting considers the problem of classification with a fixed cost of abstention, while the other settings consider two variants of classification with a bounded abstention rate. We obtained upper bounds on the excess risk of all the algorithms and demonstrated their minimax (near)-optimality by deriving lower bounds. As all our algorithms relied on the knowledge of smoothness parameters, we then proposed a general strategy to adapt to these parameters in a data-driven way. A novel aspect of our adaptive strategy is that it can also work for more general learning problems with an implicit distance measure on the input space. Finally, we also presented a computationally efficient version of our algorithms for a small but rich class of problems.

In Section 6, we discussed an efficient version of our algorithms in the realizable case when the Bayes optimal classifier is a halfspace. An important topic of ongoing research is to extend ideas presented in this paper to the agnostic case, and design general computationally feasible active learning strategies for learning classifiers with abstention.

References

• Audibert and Tsybakov (2007) Audibert, J.-Y. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. The Annals of statistics, 35(2):608–633.
• Bartlett and Wegkamp (2008) Bartlett, P. and Wegkamp, M. (2008). Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823–1840.
• Bousquet et al. (2003) Bousquet, O., Boucheron, S., and Lugosi, G. (2003). Introduction to statistical learning theory. In Summer School on Machine Learning, pages 169–207. Springer.
• Bubeck et al. (2011) Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2011). X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695.
• Bull et al. (2015) Bull, A. D. et al. (2015). Adaptive-treed bandits. Bernoulli, 21(4):2289–2307.
• Castro and Nowak (2008) Castro, R. M. and Nowak, R. D. (2008). Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353.
• Cavalier (1997) Cavalier, L. (1997). Nonparametric estimation of regression level sets. Statistics A Journal of Theoretical and Applied Statistics, 29(2):131–160.
• Chen et al. (2017) Chen, L., Hassani, H., and Karbasi, A. (2017). Near-optimal active learning of halfspaces via query synthesis in the noisy setting. In Thirty-First AAAI Conference on Artificial Intelligence.
• Chow (1970) Chow, C. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41–46.
• Chow (1957) Chow, C.-K. (1957). An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers, (4):247–254.
• Cortes et al. (2016) Cortes, C., DeSalvo, G., and Mohri, M. (2016). Learning with rejection. In International Conference on Algorithmic Learning Theory, pages 67–82.
• Dasgupta (2006) Dasgupta, S. (2006). Coarse sample complexity bounds for active learning. In Advances in neural information processing systems, pages 235–242.
• Denis and Hebiri (2015) Denis, C. and Hebiri, M. (2015). Consistency of plug-in confidence sets for classification in semi-supervised learning. arXiv preprint arXiv:1507.07235.
• Herbei and Wegkamp (2006) Herbei, R. and Wegkamp, M. (2006). Classification with reject option. Canadian Journal of Statistics, 34(4):709–721.
• Karp and Kleinberg (2007) Karp, R. M. and Kleinberg, R. (2007). Noisy binary search and its applications. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 881–890. Society for Industrial and Applied Mathematics.
• Kleinberg et al. (2013) Kleinberg, R., Slivkins, A., and Upfal, E. (2013). Bandits and experts in metric spaces. arXiv preprint arXiv:1312.1277.
• Lepski et al. (1997) Lepski, O. V., Spokoiny, V. G., et al. (1997). Optimal pointwise adaptive methods in nonparametric estimation. The Annals of Statistics, 25(6):2512–2546.
• Locatelli et al. (2017) Locatelli, A., Carpentier, A., and Kpotufe, S. (2017). Adaptivity to noise parameters in nonparametric active learning. arXiv preprint arXiv:1703.05841.
• Minsker (2012) Minsker, S. (2012). Plug-in approach to active learning. Journal of Machine Learning Research, 13(Jan):67–90.
• Munos (2011) Munos, R. (2011). Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, pages 783–791.
• Munos et al. (2014) Munos, R. et al. (2014). From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129.
• Pietraszek (2005) Pietraszek, T. (2005). Optimizing abstaining classifiers using ROC analysis. In Proceedings of the 22nd International Conference on Machine Learning, pages 665–672. ACM.
• Rigollet and Tong (2011) Rigollet, P. and Tong, X. (2011). Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12:2831–2855.
• Rubegni et al. (2002) Rubegni, P., Cevenini, G., Burroni, M., Perotti, R., Dell’Eva, G., Sbano, P., Miracco, C., Luzi, P., Tosi, P., Barbini, P., et al. (2002). Automated diagnosis of pigmented skin lesions. International Journal of Cancer, 101(6):576–580.
• Settles (2009) Settles, B. (2009). Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
• Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.
• Slivkins (2011) Slivkins, A. (2011). Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, pages 1602–1610.
• Tong (2013) Tong, X. (2013). A plug-in approach to Neyman-Pearson classification. Journal of Machine Learning Research, 14(1):3011–3040.
• Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition.
• Tsybakov et al. (1997) Tsybakov, A. B. et al. (1997). On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948–969.
• Wegkamp (2007) Wegkamp, M. (2007). Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155–168.
• Wegkamp and Yuan (2011) Wegkamp, M. and Yuan, M. (2011). Support vector machines with a reject option. Bernoulli, 17(4):1368–1385.
• Yuan (2010) Yuan, M. and Wegkamp, M. (2010). Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111–130.

Appendix A Details from Section 1 and Section 2

A.1 Discussion on Assumptions

The margin assumption (MA) controls the amount of measure assigned to the regions of the input space in which the regression function takes values in the vicinity of the threshold values. The assumption (MA), which is a modification of Tsybakov's margin condition for binary classification (Bousquet et al., 2003, Definition 7), has been employed in several existing works in the classification with abstention literature (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Yuan, 2010).

The Hölder continuity assumption ensures that points which are close to each other have similar distributions on the label set. For simplicity, we restrict our attention to the case of $\beta \le 1$, so that it suffices to consider piecewise constant estimators. For Hölder functions with $\beta > 1$, our algorithms can be suitably modified by replacing the piecewise constant estimators with local polynomial estimators (Tsybakov, 2009, § 1.6).
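The role of piecewise constant estimators can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: it assumes a hypothetical one-dimensional regression function `eta` on [0, 1] that is Hölder continuous with constant `L` and exponent `beta` at most 1, and checks that replacing it on each dyadic cell by its value at the cell center incurs an error of at most `L * (width / 2) ** beta`. The function names and the specific choice of `eta` are assumptions made for illustration only.

```python
def eta(x: float, L: float = 0.4, beta: float = 0.5) -> float:
    """Hypothetical regression function: Hölder with constant L, exponent beta <= 1."""
    return 0.3 + L * abs(x - 0.5) ** beta


def piecewise_constant_error(depth: int, L: float = 0.4, beta: float = 0.5,
                             grid: int = 200) -> tuple:
    """Return (worst observed error, Hölder bound) for 2**depth equal cells on [0, 1].

    Each cell is represented by the value of eta at its center, so every point
    in the cell is within width / 2 of the representative point.
    """
    cells = 2 ** depth
    width = 1.0 / cells
    worst = 0.0
    for i in range(cells):
        center = (i + 0.5) * width
        for k in range(grid + 1):
            x = i * width + k * width / grid  # sweep the cell on a fine grid
            worst = max(worst, abs(eta(x, L, beta) - eta(center, L, beta)))
    return worst, L * (width / 2) ** beta


if __name__ == "__main__":
    for depth in (1, 3, 5, 7):
        worst, bound = piecewise_constant_error(depth)
        print(f"depth={depth}: worst error={worst:.4f}, Hölder bound={bound:.4f}")
```

The bound shrinks geometrically with the depth, which is why refining cells is enough to control the approximation error in this regime; smoother functions would need local polynomial fits to exploit the extra regularity.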

The detectability assumption (DE) is a converse of the (MA) assumption. It provides a lower bound on the amount of measure in the regions of the input space in which the regression function takes values close to the thresholds. We note that our proposed algorithms achieve the minimax optimal rates without this assumption. However, this assumption is required by our algorithm in the most general problem setting (Theorem 4) for exploiting easier problem instances. Assumptions similar to (DE) have been used in various prior works in the nonparametric learning and estimation literature (Castro and Nowak, 2008; Tong, 2013; Rigollet and Tong, 2011; Cavalier, 1997; Tsybakov et al., 1997). We discuss the necessity of this assumption in Section 7 and in Appendix H.

Appendix B Pseudo-code and Proofs of the Algorithm from Section 3.1

B.1 Pseudo-code of Algorithm 1

In this section, we report the pseudo-code of Algorithm 1, which was outlined and described in Section 3.1. This is our active learning algorithm for the fixed-cost setting, in which abstention incurs a fixed cost. As mentioned earlier, our proposed algorithm can work in the three commonly used active learning frameworks, namely the membership-query, pool-based, and stream-based models. The only difference lies in the way the algorithm interacts with the labelling oracle, and this is captured by the REQUEST_LABEL subroutine given in Appendix B.1.1.
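To make the difference between the three query models concrete, the following is a minimal Python sketch of what a label-request routine might look like in each case. It is not the paper's REQUEST_LABEL pseudo-code: the one-dimensional `Cell` type, the function names, the choice of the cell midpoint as the query point, and the `oracle` and `draw` callables are all hypothetical. The discard behaviour (returning `None`) follows the rules described in the proof of Lemma 1: in the pool-based model a cell is discarded when it has no unused pool sample left, and in the stream-based model when a fixed number of consecutive draws all miss the cell.

```python
from typing import Callable, List, Optional, Set, Tuple

Cell = Tuple[float, float]  # hypothetical 1-D cell [lo, hi)


def in_cell(x: float, cell: Cell) -> bool:
    return cell[0] <= x < cell[1]


def request_label_membership(cell: Cell,
                             oracle: Callable[[float], int]) -> Tuple[float, int]:
    """Membership-query model: the learner may query any point (here, the midpoint)."""
    x = 0.5 * (cell[0] + cell[1])
    return x, oracle(x)


def request_label_pool(cell: Cell, pool: List[float], used: Set[int],
                       oracle: Callable[[float], int]) -> Optional[Tuple[float, int]]:
    """Pool-based model: query an unused pool sample lying in the cell, if any."""
    for idx, x in enumerate(pool):
        if idx not in used and in_cell(x, cell):
            used.add(idx)
            return x, oracle(x)
    return None  # no fresh sample in the cell: the cell is discarded


def request_label_stream(cell: Cell, draw: Callable[[], float], max_draws: int,
                         oracle: Callable[[float], int]) -> Optional[Tuple[float, int]]:
    """Stream-based model: draw samples until one falls in the cell."""
    for _ in range(max_draws):
        x = draw()
        if in_cell(x, cell):
            return x, oracle(x)
    return None  # max_draws consecutive misses: the cell is discarded
```

The rest of the algorithm can treat the three routines interchangeably, as long as it handles a `None` return by discarding the cell.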

B.2 Proof of Lemma 1

We begin with the proof of Lemma 1, which shows that with probability at least $1 - 1/n$, the measure of the (random) set $\tilde{\mathcal{X}}_n$ is no larger than $1/n$.

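Suppose the discarded region consists of $T$ components, i.e., $\tilde{\mathcal{X}}_n = \bigcup_{x_{h,i} \in \mathcal{X}^{(d)}_{t_n}} \mathcal{X}_{h,i}$ with $|\mathcal{X}^{(d)}_{t_n}| = T$. Since the algorithm refines cells only up to a maximum depth, and the total number of cells at that depth is at most $n$, we can trivially upper bound the number of discarded cells/points by $n$, i.e., $T \le n$.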

Stream-based setting.

In this case, a cell $\mathcal{X}_{h,i}$ is discarded if, after $N_n$ consecutive draws from $P_X$, none of the samples fall in $\mathcal{X}_{h,i}$. We proceed as follows:

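\begin{align*}
P\big(P_X(\tilde{\mathcal{X}}_n) > 1/n\big)
&= P\Big(\sum_{x_{h,i} \in \mathcal{X}^{(d)}_{t_n}} P_X(\mathcal{X}_{h,i}) > 1/n\Big) \\
&\stackrel{(a)}{\le} P\big(\exists\, x_{h,i} \in \mathcal{X}^{(d)}_{t_n} : P_X(\mathcal{X}_{h,i}) > 1/(nT)\big) \\
&\stackrel{(b)}{\le} \sum_{x_{h,i} \in \mathcal{X}^{(d)}_{t_n}} P\big(P_X(\mathcal{X}_{h,i}) > 1/(nT);\; x_{h,i} \in \mathcal{X}^{(d)}_{t_n}\big)
\stackrel{(c)}{\le} T\Big(1 - \frac{1}{nT}\Big)^{N_n} \\
&\stackrel{(d)}{\le} n\Big(1 - \frac{1}{n^2}\Big)^{N_n}
\le \exp\Big(-\frac{N_n}{n^2} + \log(n)\Big)
\stackrel{(e)}{=} \frac{1}{n}.
\end{align*}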

In the above display,
(a) follows from the pigeonhole principle,
(b) follows from an application of union bound,
(c) follows from the rule used for discarding cells in the stream-based setting,
(d) follows from the fact that $T \le n$, and
(e) follows from the choice of $N_n$.
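The last two steps of the display can be checked numerically. The sketch below is illustrative only: it assumes the choice $N_n = \lceil 2 n^2 \log n \rceil$, which is what step (e) requires for the exponential to equal $1/n$ but is an inference here rather than a value quoted from the text, and the function name is hypothetical. It evaluates the quantity after step (d), the exponential upper bound obtained from $1 - x \le e^{-x}$, and the target $1/n$.

```python
import math


def stream_discard_bound(n: int) -> tuple:
    """Evaluate the terms after step (d) for an assumed N_n = ceil(2 n^2 log n)."""
    N_n = math.ceil(2 * n ** 2 * math.log(n))  # assumed number of consecutive draws
    after_step_d = n * (1.0 - 1.0 / n ** 2) ** N_n
    exp_bound = math.exp(-N_n / n ** 2 + math.log(n))
    return N_n, after_step_d, exp_bound


if __name__ == "__main__":
    for n in (2, 10, 100):
        N_n, after_step_d, exp_bound = stream_discard_bound(n)
        print(f"n={n}: N_n={N_n}, n(1-1/n^2)^N_n={after_step_d:.3e}, "
              f"exp bound={exp_bound:.3e}, 1/n={1.0 / n:.3e}")
```

Because of the ceiling, the exponential bound is at most $1/n$ rather than exactly equal to it; the ordering of the three printed quantities is what matters.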

Pool-based setting.

Let denote the pool of unlabelled samples available to the learner, and for any we introduce the notation to represent the number of samples lying in the cell . Recall that a cell is discarded if the number of unique unlabelled samples in the cell is smaller than the number of label requests in the cell, which can be trivially upper bounded by , the total budget. Thus, introducing the terms and , we get the following (for any realization of ):

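\begin{align*}
P_X(\tilde{\mathcal{X}}_n)
\le P_X\Big(\bigcup_{x_{h,i} \in C_1} \mathcal{X}_{h,i}\Big)
\le n\Big(\frac{1}{n^2}\Big) + P_X\Big(\bigcup_{x_{h,i} \in C_2} \mathcal{X}_{h,i}\Big),
\end{align*}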

where, in the first term after the second inequality above, we use the fact that the total number of discarded cells, up to the maximum depth, cannot be larger than $n$.

Now, we claim that to complete the proof, it suffices to show that for any such that , we have . This is because , and , and combined with the previous statement it implies that is an empty set with probability at least .

Consider any cell such that . For points in define the random variable . Suppose . Then we have the following:

 P(Mh,i

In the above display:
(a) follows from the fact that ,
(b) follows from the fact that ,
(c) follows from an application of the Chernoff inequality for the lower tail of a Binomial random variable,
(d) follows from the fact that and .

Lemma