We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) It provably works for any classifier representation and classification problem including those with severe noise. 2) It is efficiently implementable with an ERM oracle. 3) It is more aggressive than all previous approaches satisfying 1 and 2. To do this we create an algorithm based on a newly defined optimization problem and analyze it. We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths and weaknesses in different settings.
Efficient and Parsimonious Agnostic Active Learning
Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, Robert E. Schapire
Microsoft Research, New York, NY; Department of Computer Science, Columbia University, New York, NY
How can you best learn a classifier given a label budget?
Active learning approaches are known to yield exponential improvements over supervised learning under strong assumptions (Cohn et al., 1994). Under much weaker assumptions, streaming-based agnostic active learning (Balcan et al., 2006; Beygelzimer et al., 2009, 2010; Dasgupta et al., 2007; Zhang and Chaudhuri, 2014) is particularly appealing since it is known to work for any classifier representation and any label noise distribution with an i.i.d. data source. (Footnote 1: See the monograph of Hanneke (2014) for an overview of the existing literature, including alternative settings where additional assumptions are placed on the data source (e.g., separability) as is common in other works (Dasgupta, 2005; Balcan et al., 2007; Balcan and Long, 2013).) Here, a learning algorithm decides for each unlabeled example in sequence whether or not to request a label, never revisiting this decision. Restated then: What is the best possible active learning algorithm which works for any classifier representation, any label noise distribution, and is computationally tractable?
Computational tractability is a critical concern, because most known algorithms for this setting (e.g., Balcan et al., 2006; Koltchinskii, 2010; Zhang and Chaudhuri, 2014) require explicit enumeration of classifiers, implying exponentially worse computational complexity compared to typical supervised learning algorithms. Active learning algorithms based on empirical risk minimization (ERM) oracles (Beygelzimer et al., 2009, 2010; Hsu, 2010) can overcome this intractability by using passive classification algorithms as the oracle to achieve a computationally acceptable solution.
Achieving generality, robustness, and acceptable computation has a cost. For the above methods (Beygelzimer et al., 2009, 2010; Hsu, 2010), a label is requested on nearly every unlabeled example where two empirically good classifiers disagree. This results in a poor label complexity, well short of information-theoretic limits (Castro and Nowak, 2008) even for general robust solutions (Zhang and Chaudhuri, 2014). Until now.
In Section 3, we design a new algorithm Active Cover (AC) for constructing query probability functions that minimize the probability of querying inside the disagreement region—the set of points where good classifiers disagree—and never query otherwise. This requires a new algorithm that maintains a parsimonious cover of the set of empirically good classifiers. The cover is a result of solving an optimization problem (in Section 5) specifying the properties of a desirable query probability function. The cover size provides a practical knob between computation and label complexity, as demonstrated by the complexity analysis we present in Section 5.
In Section 4, we provide our main results which demonstrate that AC effectively maintains a set of good classifiers, achieves good generalization error, and has a label complexity bound tighter than previous approaches. The label complexity bound depends on the disagreement coefficient (Hanneke, 2009), which does not completely capture the advantage of the algorithm. In Section 4.2.2, we provide an example of a hard active learning problem where AC is substantially superior to previous tractable approaches. Together, these results show that AC is better and sometimes substantially better in theory. The key aspects in the proof of our generalization results are presented in Section 7, with more technical details and label complexity analysis presented in the appendix.
Do agnostic active learning algorithms work in practice? No previous works have addressed this question empirically. Doing so is important because analysis cannot reveal the degree to which existing classification algorithms effectively provide an ERM oracle. We conduct an extensive study in Section 6 by simulating the interaction of the active learning algorithm with a streaming supervised dataset. Results on a wide array of datasets show that agnostic active learning typically outperforms passive learning, and the magnitude of improvement depends on how carefully the active learning hyper-parameters are chosen.
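Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ be a set of binary classifiers, which we assume is finite for simplicity. (Footnote 2: The assumption that $\mathcal{H}$ is finite can be relaxed to VC-classes using standard arguments.) Let $\mathbb{E}_X[\cdot]$ denote expectation with respect to $X \sim \mathcal{D}_{\mathcal{X}}$, the marginal of $\mathcal{D}$ over $\mathcal{X}$. The expected error of a classifier $h \in \mathcal{H}$ is $\mathrm{err}(h) := \Pr_{(X,Y) \sim \mathcal{D}}(h(X) \neq Y)$, and the error minimizer is denoted by $h^* := \arg\min_{h \in \mathcal{H}} \mathrm{err}(h)$. The (importance weighted) empirical error of $h \in \mathcal{H}$ on a multiset $S$ of importance weighted and labeled examples drawn from $\mathcal{X} \times \{\pm 1\} \times \mathbb{R}_+$ is $\mathrm{err}(h, S) := \sum_{(x,y,w) \in S} w \cdot \mathbb{1}(h(x) \neq y) / |S|$. The disagreement region for a subset of classifiers $A \subseteq \mathcal{H}$ is $\mathrm{DIS}(A) := \{x \in \mathcal{X} \mid \exists h, h' \in A \text{ such that } h(x) \neq h'(x)\}$. The regret of a classifier $h$ relative to another $h'$ is $\mathrm{reg}(h, h') := \mathrm{err}(h) - \mathrm{err}(h')$, and the analogous empirical regret on $S$ is $\mathrm{reg}(h, h', S) := \mathrm{err}(h, S) - \mathrm{err}(h', S)$. When the second classifier in (empirical) regret is omitted, it is taken to be the (empirical) error minimizer in $\mathcal{H}$.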
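The quantities defined in this paragraph can be made concrete for a small finite class. The following is a minimal sketch, not the paper's implementation: the one-dimensional threshold classifiers, the helper names, and the toy sample are all hypothetical and chosen only for illustration. It computes the importance-weighted empirical error (normalized by the sample size, including zero-weight examples), the empirical regret, and membership in the disagreement region.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], int]   # maps a point to a label in {-1, +1}
Example = Tuple[float, int, float]    # (x, y, importance weight w)


def empirical_error(h: Classifier, sample: Sequence[Example]) -> float:
    """Importance-weighted error: sum of w * 1[h(x) != y], divided by |S|."""
    if not sample:
        return 0.0
    return sum(w for x, y, w in sample if h(x) != y) / len(sample)


def empirical_regret(h: Classifier, h_ref: Classifier,
                     sample: Sequence[Example]) -> float:
    """Empirical regret of h relative to h_ref on the same sample."""
    return empirical_error(h, sample) - empirical_error(h_ref, sample)


def in_disagreement_region(x: float, classifiers: Sequence[Classifier]) -> bool:
    """True if at least two classifiers in the set predict differently at x."""
    return len({h(x) for h in classifiers}) > 1


def threshold(t: float) -> Classifier:
    """Hypothetical classifier: predicts +1 at or above the threshold t."""
    return lambda x: 1 if x >= t else -1


# Toy finite class and toy importance-weighted sample (illustrative only).
H: List[Classifier] = [threshold(t) for t in (0.2, 0.4, 0.6)]
S: List[Example] = [(0.1, -1, 1.0), (0.5, 1, 2.0), (0.9, 1, 1.0), (0.3, -1, 0.0)]

errors = [empirical_error(h, S) for h in H]
regret_last_vs_middle = empirical_regret(H[2], H[1], S)
```

The last example in `S` has weight zero, mimicking an unqueried point: it contributes nothing to the numerator but still counts toward the normalization.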
A streaming-based active learner receives i.i.d. labeled examples $(X_1, Y_1), (X_2, Y_2), \ldots$ from $\mathcal{D}$ one at a time; each label $Y_i$ is hidden unless the learner decides on the spot to query it. The goal is to produce a classifier $h \in \mathcal{H}$ with low error $\mathrm{err}(h)$, while querying as few labels as possible.
In the IWAL framework (Beygelzimer et al., 2009), a decision whether or not to query a label is made randomly: the learner picks a probability $p \in [0, 1]$, and queries the label with that probability. Whenever $p > 0$, an unbiased error estimate can be produced using inverse probability weighting (Horvitz and Thompson, 1952). Specifically, for any classifier $h$, an unbiased estimator $E$ of $\mathrm{err}(h)$ based on $(X, Y)$ and $p$ is as follows: if $Y$ is queried, then $E = \mathbb{1}(h(X) \neq Y)/p$; else, $E = 0$. It is easy to check that $\mathbb{E}(E) = \mathrm{err}(h)$. Thus, when the label is queried, we produce the importance weighted labeled example $(X, Y, 1/p)$. (Footnote 3: If the label is not queried, we produce an ignored example of weight zero; its only purpose is to maintain the correct count of querying opportunities. This ensures that $1/|S|$ is the correct normalization in $\mathrm{err}(h, S)$.)
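The unbiasedness claim can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the paper: the error rate `0.3`, the query probability `0.25`, and the function names are hypothetical. Each round draws whether the classifier errs, flips a coin with the query probability, and records the inverse-probability-weighted loss if the label is queried and zero otherwise.

```python
import random


def iw_estimate(h_wrong: bool, p: float, rng: random.Random) -> float:
    """One-round estimate: 1[h(X) != Y] / p if the label is queried, else 0."""
    queried = rng.random() < p
    if not queried:
        return 0.0  # ignored example of weight zero
    return 1.0 / p if h_wrong else 0.0


def simulate(err: float, p: float, n: int, seed: int = 0) -> float:
    """Average the one-round estimates over n i.i.d. rounds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        h_wrong = rng.random() < err  # classifier errs with probability err
        total += iw_estimate(h_wrong, p, rng)
    return total / n  # normalize by the number of querying opportunities


# Hypothetical values: true error 0.3, constant query probability 0.25.
est = simulate(err=0.3, p=0.25, n=200_000)
print(f"importance-weighted estimate: {est:.4f} (true error 0.3)")
```

Although only about a quarter of the labels are queried, the average stays close to the true error, at the price of a variance that grows as the query probability shrinks.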
Our new algorithm, shown in Algorithm 1, breaks the example stream into epochs. The algorithm admits any epoch schedule so long as the epoch lengths satisfy . For technical reasons, we always query the first 3 labels to kick-start the algorithm. At the start of epoch , AC computes a query probability function which will be used for sampling the data points to query during the epoch. This is done by maintaining a few objects of interest during each epoch:
In step 1, we compute the best classifier on the sample that we have collected so far. Note that the sample consists of the queried, true labels on some examples and predicted labels on the others.
A radius is computed in step 2 based on the desired level of concentration we want the various empirical quantities to satisfy.
The set in step 3 consists of all the hypotheses which are good according to our sample , with the notion of good being measured as empirical regret being at most .
Within the epoch, determines the probability of querying an example in the disagreement region for this set of “good” classifiers; examples outside this region are not queried but given labels predicted by . Consequently, the sample is not unbiased unlike some of the predecessors of our work. The various constants in Algorithm 1 must satisfy:
The algorithm as stated takes an arbitrary epoch schedule subject to $\tau_{m+1} \le 2\tau_m$. Two natural extremes are unit-length epochs, $\tau_m = m$, and doubling epochs, $\tau_{m+1} = 2\tau_m$. The main difference comes in the number of times (op) is solved, which is a substantial computational consideration. Unless otherwise stated, we assume the doubling epoch schedule so that the query probability and ERM classifier are recomputed only $O(\log n)$ times.
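To illustrate the computational gap between the two extreme schedules, the sketch below lists the epoch end points for a stream of a given length and counts how often the query probability function would be recomputed. It is a hedged illustration: the function names are hypothetical, and starting the first epoch after 3 examples is an assumption chosen to match the three kick-start queries mentioned earlier.

```python
from typing import List


def doubling_epochs(n: int, first: int = 3) -> List[int]:
    """Epoch end points that double until the stream length n is reached."""
    ends = []
    tau = first
    while tau < n:
        ends.append(tau)
        tau *= 2
    ends.append(n)  # the final epoch is truncated at the stream length
    return ends


def unit_epochs(n: int, first: int = 3) -> List[int]:
    """Epoch end points when every epoch after the first has length one."""
    return list(range(first, n + 1))


n = 100
doubling = doubling_epochs(n)
unit = unit_epochs(n)
print("doubling schedule:", doubling, "->", len(doubling), "solves")
print("unit-length schedule:", len(unit), "solves")
```

For a stream of 100 examples the doubling schedule needs 7 recomputations while the unit-length schedule needs 98, which is the logarithmic-versus-linear difference the text refers to.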
Optimization problem (op) to obtain :
AC computes $P_m$ as the solution to the optimization problem (op). In essence, the problem encodes the properties of a query probability function that are essential to ensure good generalization, while maintaining a low label complexity. As we will discuss later, some of the previous works can be seen as specific ways of constructing feasible solutions to this optimization problem. The objective function of (op) encourages small query probabilities in order to minimize the label complexity. It might appear odd that we do not use the more obvious choice of objective, $\mathbb{E}_X[P(X)]$; however, our choice simultaneously encourages low query probabilities and provides a barrier for the constraint $P(X) \le 1$, an important algorithmic aspect as we will discuss in Section 5.
The constraints (6) in (op) bound the variance in our importance-weighted regret estimates for every . This is key to ensuring good generalization as we will later use Bernstein-style bounds which rely on our random variables having a small variance. Let us examine these constraints in more detail. The LHS of the constraints measures the variance in our empirical regret estimates for , measured only on the examples in the disagreement region . This is because the importance weights in the form of are only applied to these examples; outside this region we use the predicted labels with an importance weight of 1. The RHS of the constraint consists of three terms. The first term ensures the feasibility of the problem, as for will always satisfy the constraints. The second empirical regret term makes the constraints easy to satisfy for bad hypotheses–this is crucial to rule out large label complexities in case there are bad hypotheses that disagree very often with . A benefit of this is easily seen when , which might have a terrible regret, but would force a near-constant query probability on the disagreement region if . Finally, the third term will be on the same order as the second one for hypotheses in , and is only included to capture the allowed level of slack in our constraints which will be exploited for the efficient implementation in Section 5.
Of course, variance alone is not adequate to ensure concentration, and we also require the random variables of interest to be appropriately bounded. This is ensured through the constraints (6), which impose a minimum query probability on the disagreement region. Outside the disagreement region, we use the predicted label with an importance weight of 1, so that our estimates will always be bounded (albeit biased) in this region. Note that this optimization problem is written with respect to the marginal distribution of the data points $\mathcal{D}_{\mathcal{X}}$, meaning that we might have infinitely many of the latter constraints. In Section 5, we describe how to solve this optimization problem efficiently, using access only to unlabeled examples drawn from $\mathcal{D}_{\mathcal{X}}$.
Finally we verify that the choices for according to some of the previous methods are indeed feasible in (op). This is most easily seen for Oracular CAL (Hsu, 2010) which queries with probability 1 if and 0 otherwise. Since (4) in the variance constraints (6), the choice for is feasible for (op), and consequently Oracular CAL always queries more often than the optimal distribution at each epoch. A similar argument can also be made for the IWAL method (Beygelzimer et al., 2010), which also queries in the disagreement region with probability 1, and hence suffers from the same sub-optimality compared to our choice.
4 Generalization and Label Complexity
We now present guarantees on the generalization error and label complexity of Algorithm 1 assuming a solver for , which we provide in the next section.
4.1 Generalization guarantees
Our first theorem provides a bound on generalization error. Define
Essentially is a population counterpart of the quantity used in Algorithm 1, and crucially relies on , the true error of restricted to the disagreement region instead of the empirical error of the ERM at epoch . This quantity captures the inherent noisiness of the problem, and modulates the transition between to type error bounds as we see next.
Pick any such that . Then recalling that , we have for all epochs , with probability at least
Since we use , the bound (9) implies that for all epochs . This also maintains that all the predicted labels used by our algorithm are identical to those of , since no disagreement amongst classifiers in was observed on those examples. This observation will be critical to our proofs, where we will exploit the fact that using labels predicted by instead of observed labels on certain examples only introduces a bias in favor of , thereby ensuring that we never mistakenly drop the optimal classifier from our version space .
The bound (8) shows that every hypothesis in has a small regret to . Since the ERM classifier is always in , this yields our main generalization error bound on the classifier output by Algorithm 1. Additionally, it also clarifies the definition of the sets as the set of good classifiers: these are classifiers which have small population regret relative to indeed. In the worst case, if is a constant, then the overall regret bound is . The actual rates implied by the theorem, however depend on the properties of the distribution and below we illustrate this with two corollaries. We start with a simple specialization to the realizable setting.
Corollary 1 (Realizable case).
Under the conditions of Theorem 1, suppose further that . Then and hence for all hypotheses .
In words, the corollary demonstrates a rate after seeing unlabeled examples in the realizable setting. Of course the use of in defining allows us to retain the fast rates even when makes some errors but they do not fall in the disagreement region of good classifiers. One intuitive condition that controls the errors within the disagreement region is the low-noise condition of Tsybakov (2004), which asserts that there exist constants and such that
Under this assumption, the extreme corresponds to the worst-case setting while corresponds to having a zero error on disagreement set of the classifiers with regret at most . Under this assumption, we get the following corollary of Theorem 1.
Corollary 2 (Tsybakov noise).
The proof of this result is deferred to Appendix E. It is worth noting that the rates obtained here are known to be unimprovable even for passive learning under the Tsybakov noise condition (Castro and Nowak, 2008). (Footnote 5: in our statement of the low-noise condition (10) corresponds to in the results of Castro and Nowak (2008).) Consequently, there is no loss of statistical efficiency in using our active learning approach. The result is easily extended for other values of by using the worst-case bound until the first epoch when drops below and then applying our analysis above from onwards. We leave this development to the reader.
4.2 Label complexity
Generalization alone does not convey the entire quality of an active learning algorithm, since a trivial algorithm that always queries with probability 1 matches the generalization guarantees of passive learning. In this section, we show that our algorithm achieves the aforementioned generalization guarantees while having a small label complexity in favorable situations. We begin with a worst-case result in the agnostic setting, and then describe a specific example which demonstrates some key differences of our approach from its predecessors.
4.2.1 Disagreement-based label complexity bounds
In order to quantify the extent of gains over passive learning, we measure the hardness of our problem using the disagreement coefficient (Hanneke, 2014), which is defined as
Intuitively, given a set of classifiers and a data distribution , an active learning problem is easy if good classifiers disagree on only a small fraction of the examples, so that the active learning algorithm can increasingly restrict attention only to this set. With this definition, we have the following result for the label complexity of Algorithm 1.
The proof is in Appendix D. The dominant first term of the label complexity bound is linear in the number of unlabeled examples, but can be quite small if is small, or if —it is indeed 0 in the realizable setting. We illustrate this aspect of the theorem with a corollary for the realizable setting.
Corollary 3 (Realizable case).
In words, we attain a logarithmic label complexity in the realizable setting, so long as the disagreement coefficient is bounded. We contrast this with the label complexity of IWAL (Beygelzimer et al., 2010), which grows as independent of . This leads to an exponential difference in the label complexities of the two methods in low-noise problems. A much closer comparison is with respect to the Oracular CAL algorithm (Hsu, 2010), which does have a dependence on in the second term, but has a worse dependence on the disagreement coefficient .
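To give a feel for what a bounded disagreement coefficient means, the sketch below estimates, on a finite sample, the ratio between the mass of the disagreement region of a small ball around the best classifier and the radius of that ball. It is only an illustration under stated assumptions: it uses the standard form of the disagreement coefficient (a supremum of this ratio over radii), the hypothesis class is a hypothetical grid of one-dimensional thresholds under a uniform marginal, and all names are invented for the example. For this class the ratio is known to be about 2 at every radius.

```python
import random
from typing import Callable, List


def threshold(t: float) -> Callable[[float], int]:
    """Hypothetical classifier: predicts +1 at or above the threshold t."""
    return lambda x: 1 if x >= t else -1


def disagreement_ratio(xs: List[float], thresholds: List[float],
                       t_star: float, r: float) -> float:
    """Empirical mass of DIS(ball of radius r around h*) divided by r."""
    h_star = threshold(t_star)
    star_labels = [h_star(x) for x in xs]
    ball = []  # label vectors of classifiers within distance r of h*
    for t in thresholds:
        h = threshold(t)
        labels = [h(x) for x in xs]
        dist = sum(a != b for a, b in zip(labels, star_labels)) / len(xs)
        if dist <= r:
            ball.append(labels)
    # A point is in the disagreement region if the ball is not unanimous on it.
    dis = sum(1 for i in range(len(xs)) if len({lab[i] for lab in ball}) > 1)
    return dis / len(xs) / r


rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]    # uniform marginal on [0, 1]
grid = [i / 200 for i in range(201)]        # finite grid of thresholds
ratio = disagreement_ratio(xs, grid, t_star=0.5, r=0.1)
print(f"estimated ratio at r = 0.1: {ratio:.2f}")
```

A ratio that stays bounded as the radius shrinks means the region where good classifiers disagree shrinks as fast as the set of good classifiers does, which is the regime where the logarithmic label complexity above applies.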
Just like Corollary 2, we can also obtain improved bounds on label complexity under the Tsybakov noise condition.
Corollary 4 (Tsybakov noise).
The proof of this result is deferred to Appendix E. The label complexity obtained above is indeed optimal in terms of the dependence on , the number of unlabeled examples, matching known information-theoretic rates of Castro and Nowak (2008) when the disagreement coefficient is bounded. This can be seen since the regret from Corollary 2 falls as a function of the number of queries at a rate of after epochs, where is the number of label queries. This is indeed optimal according to the lower bounds of Castro and Nowak (2008), after recalling that in their results. Once again, the corollary highlights our improvements on top of IWAL, which does not attain this optimal label complexity.
These results, while strong, still do not completely capture the performance of our method. The proofs of these results are entirely based on the fact that we do not query outside the disagreement region, a property shared by the previous Oracular CAL algorithm (Hsu, 2010). We improve upon that result only because we use more refined error bounds to define the disagreement region. However, such analysis completely ignores the fact that we construct a rather non-trivial query probability function on the disagreement region, as opposed to using any constant probability of querying over this entire region. This gives our algorithm the ability to query much more rarely even over the disagreement region, if the queries do not provide much information regarding the optimal hypothesis $h^*$. The next section illustrates an example where this gain can be quantified.
4.2.2 Improved label complexity for a hard problem instance
We now present an example where the label complexity of Algorithm 1 is significantly smaller than both IWAL and Oracular CAL by virtue of rarely querying in the disagreement region. The example considers a distribution and a classifier space with the following structure: (i) for most examples a single good classifier predicts differently from the remaining classifiers (ii) on a few examples half the classifiers predict one way and half the other. In the first case, little advantage is gained from a label because it provides evidence against only a single classifier. Active Cover queries over the disagreement region with a probability close to in case (i) and probability in case (ii), while others query with probability everywhere implying times more queries.
Concretely, we consider the following binary classification problem. Let denote the finite classifier space (defined later), and distinguish some . Let denote the uniform distribution on . The data distribution and the classifiers are defined jointly:
With probability ,
With probability ,
Indeed, is the best classifier because , while . This problem is hard because only a small fraction of examples contain information about . Ideally we want to focus label queries on those informative examples while skipping the uninformative ones. However, algorithms like IWAL, or more generally, active learning algorithms that determine label query probabilities based on error differences between a pair of classifiers, query frequently on the uninformative examples. Let denote the error difference between two different classifiers and . Let be a random variable such that for the case and for the case. Then it is easy to see that
Therefore, IWAL queries all the time on uninformative examples ().
Now let us consider the label complexity of Algorithm 1 on this problem. Let us focus on the query probability inside the region, and fix it to some constant . Let us also allow a query probability of 1 on the region. Then the left hand side in the constraint (6) for any classifier is at most , since and disagree only on those points in the region where one of them is picked as the disagreeing classifier in the random draw. On the other hand, the RHS of the constraints is at least , which is at least as long as is small enough and is large enough for empirical error to be close to true error. Consequently, assuming that , we find that any satisfies the constraints. Of course we also have that , which is in this case since is a constant. Consequently, for large enough is feasible and hence optimal for the population . Since we find an approximately optimal solution based on Theorem 4, the label complexity at epoch is . Summing things up, it can then be checked easily that we make queries over examples, a factor of smaller than baselines such as IWAL and Oracular CAL on this example.
5 Efficient implementation
In Algorithm 1, the computation of is an ERM operation, which can be performed efficiently whenever an efficient passive learner is available. However, several other hurdles remain. Testing for in the algorithm, as well as finding a solution to are considerably more challenging. The epoch schedule helps, but is still solved times, necessitating an extremely efficient solver.
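Starting with the first issue, we follow Dasgupta et al. (2007), who cleverly observed that membership of a point in the disagreement region can be efficiently determined using a single call to an ERM oracle. Specifically, to apply their method, we use the oracle to find the classifier with the smallest empirical error among those that disagree with the current ERM classifier on the point in question. (Footnote 6: We only have access to an unconstrained oracle, but that is adequate to solve the problem with one constraint. See Appendix F of Karampatziakis and Langford (2011) for details.) It can then be argued that the point lies in the disagreement region if and only if the easily measured empirical regret of this classifier is at most the threshold defining the set of good classifiers.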
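The membership test described here can be sketched for a small finite class, where a brute-force search stands in for the ERM oracle. This is an illustrative sketch only: the `erm` helper with an optional constraint, the threshold classifiers, the toy sample, and the regret threshold `0.25` are hypothetical, and a practical implementation would call a real learner instead of enumerating classifiers.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Classifier = Callable[[float], int]
Example = Tuple[float, int, float]  # (x, y, importance weight w)


def empirical_error(h: Classifier, sample: Sequence[Example]) -> float:
    """Importance-weighted empirical error, normalized by the sample size."""
    return sum(w for x, y, w in sample if h(x) != y) / len(sample)


def erm(classifiers: Sequence[Classifier], sample: Sequence[Example],
        constraint: Optional[Callable[[Classifier], bool]] = None
        ) -> Optional[Classifier]:
    """Brute-force stand-in for an ERM oracle, optionally with one constraint."""
    candidates = [h for h in classifiers if constraint is None or constraint(h)]
    if not candidates:
        return None
    return min(candidates, key=lambda h: empirical_error(h, sample))


def in_dis_of_good_set(x: float, classifiers: Sequence[Classifier],
                       sample: Sequence[Example], max_regret: float) -> bool:
    """Is x in the disagreement region of {h : empirical regret <= max_regret}?"""
    h_erm = erm(classifiers, sample)
    # Best classifier forced to disagree with the ERM classifier at x.
    h_alt = erm(classifiers, sample, constraint=lambda h: h(x) != h_erm(x))
    if h_alt is None:
        return False  # every classifier agrees with the ERM classifier at x
    regret = empirical_error(h_alt, sample) - empirical_error(h_erm, sample)
    return regret <= max_regret


def threshold(t: float) -> Classifier:
    return lambda x: 1 if x >= t else -1


H: List[Classifier] = [threshold(t) for t in (0.2, 0.4, 0.6, 0.8)]
S: List[Example] = [(0.1, -1, 1.0), (0.3, -1, 1.0), (0.5, 1, 1.0),
                    (0.7, 1, 1.0), (0.9, 1, 1.0)]

print(in_dis_of_good_set(0.5, H, S, max_regret=0.25))
print(in_dis_of_good_set(0.7, H, S, max_regret=0.25))
```

Because labels are binary and the ERM classifier always belongs to the set of good classifiers, two good classifiers disagree at a point exactly when some good classifier disagrees with the ERM classifier there, so checking the single best disagreeing classifier is enough.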
Solving efficiently is a much bigger challenge because, as an optimization problem, it is enormous: There is one variable for every point , one constraint (6) for each classifier and bound constraints (6) on for every . This leads to infinitely many variables and constraints, with an ERM oracle being the only computational primitive available. Another difficulty is that is defined in terms of the true expectation with respect to the example distribution , which is unavailable.
In the following we first demonstrate how to efficiently solve assuming access to the true expectation , and then discuss a relaxation that uses expectation over samples. For the ease of exposition, we recall the shorthand from earlier.
5.1 Solving (op) with the true expectation
The main challenge here is that the optimization variable is of infinite dimension. We deal with this difficulty using Lagrange duality, which leads to a dual representation of in terms of a set of classifiers found through successive calls to an ERM oracle. As will become clear shortly, each of these classifiers corresponds to the most violated variance constraint (6) under some intermediate query probability function. Thus at a high level, our strategy is to expand the set of classifiers for representing until the amount of constraint violation gets reduced to an acceptable level.
We start by eliminating the bound constraints using barrier functions. Notice that the objective is already a barrier at . To enforce the lower bound (6), we modify the objective to
We solve the problem in the dual where we have a large but finite number of optimization variables, and efficiently maximize the dual using coordinate ascent with access to an ERM oracle over . Let denote the Lagrange multiplier for the constraint (6) for classifier . Then for any , we can minimize the Lagrangian
over each primal variable yielding the solution.
To see this, pick any satisfying for all and consider the difference in the Lagrangians evaluated at and :
The first term is non-negative because . For the second term, notice that
and that the minimum function value is exactly . Hence the second term is also non-negative.
Clearly, for all , so all the bound constraints (6) in are satisfied if we choose . Plugging the solution into the Lagrangian, we obtain the dual problem of maximizing the dual objective
over . The constant is equal to where . An algorithm to approximately solve this problem is presented in Algorithm 2. The algorithm takes a parameter specifying the degree to which all of the constraints (6) are to be approximated. Since is concave, the rescaling step can be solved using a straightforward numerical line search. The main implementation challenge is in finding the most violated constraint (Step 3). Fortunately, this step can be reduced to a single call to an ERM oracle. To see this, note that the constraint violation on classifier can be written as
The first term of the right-hand expression is the risk (classification error) of in predicting samples labeled according to with importance weights of if and 0 otherwise; note that these weights may be positive or negative. The second term is simply the scaled risk of with respect to the actual labels. The last two terms do not depend on . Thus, given access to (or samples approximating it, discussed shortly), the most violated constraint can be found by solving an ERM problem defined on the labeled samples in and samples drawn from labeled by , with appropriate importance weights detailed in Appendix F.1.
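The rescaling step mentioned above relies only on a one-dimensional search over a concave function. As a hedged illustration, the sketch below uses ternary search, one of several suitable line-search methods; it is not taken from the paper. The search interval and the two toy objectives are hypothetical stand-ins for the dual objective evaluated along a rescaled multiplier vector.

```python
from typing import Callable


def maximize_concave(f: Callable[[float], float], lo: float, hi: float,
                     tol: float = 1e-9) -> float:
    """Ternary search for a maximizer of a concave function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1   # a maximizer lies in [m1, hi]
        else:
            hi = m2   # a maximizer lies in [lo, m2]
    return (lo + hi) / 2.0


# Toy concave objective with an interior maximizer at s = 0.3 (illustrative).
s_interior = maximize_concave(lambda s: -(s - 0.3) ** 2, 0.0, 1.0)

# Toy concave objective that is increasing, so the maximizer is the endpoint.
s_boundary = maximize_concave(lambda s: s, 0.0, 1.0)

print(f"interior maximizer: {s_interior:.6f}")
print(f"boundary maximizer: {s_boundary:.6f}")
```

Each iteration discards a third of the interval using two function evaluations, so the cost of the rescaling step is logarithmic in the desired precision.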
When all primal constraints are approximately satisfied, the algorithm stops. Consequently, we can execute each step of Algorithm 2 with one call to an appropriately defined ERM oracle, and approximate primal feasibility is guaranteed when the algorithm stops. More specifically, we can prove the following guarantee on the convergence of the algorithm.
When run on the -th epoch, Algorithm 2 has the following guarantees.
It halts in at most iterations.
The solution it outputs has bounded norm: .
That is, we find a solution with small constraint violation to ensure generalization, and a small objective value to be label efficient. If is set to , an amount of constraint violation tolerable in our analysis, the number of iterations in Theorem 3 varies between and as the varies between a constant and . The theorem is proved in Appendix F.2.
5.2 Solving (op) with expectation over samples
So far we considered solving defined on the unlabeled data distribution , which is not available in practice. A simple and natural substitute for is an i.i.d. sample drawn from it. Here we show that solving a properly-defined sample variant of (op) leads to a solution to the original with similar guarantees as in Theorem 3.
More specifically, we define the following sample variant of . Let be a large sample drawn i.i.d. from , and be the same as except with all population expectations replaced by empirical expectations taken with respect to . Now for any , define to be the same as except that the variance constraints (6) are relaxed by an additive slack of .
Every time Active Cover needs to solve (Step 5 of Algorithm 1), it draws a fresh unlabeled i.i.d. sample of size from , which can be done easily in a streaming setting by collecting the next examples. It then applies Algorithm 2 to solve with accuracy parameter . Note that this is different from solving with accuracy parameter . We establish the following convergence guarantees.
Let be an i.i.d. sample of size from . When run on the -th epoch for solving with accuracy parameter , Algorithm 2 satisfies the following.
It halts in at most iterations, where .
The solution it outputs has bounded norm: .
The proof is in Appendix F.3. Intuitively, the optimal solution to is also feasible in since satisfying the population constraints leads to approximate satisfaction of sample constraints. Since our solution is approximately optimal for (this is essentially due to Theorem 3), this means that the sample objective at is not much larger than . We now use a concentration argument to show that this guarantee holds also for the population objective with slightly worse constants. The approximate constraint satisfaction in follows by a similar concentration argument. Our proofs use standard concentration inequalities along with Rademacher complexity to provide uniform guarantees for all vectors with bounded norm.
The first two statements, finite convergence and boundedness of , are identical to Theorem 3 except is replaced by . When is set properly, i.e., to be , the number of unlabeled examples in the third statement varies between and as the varies between a constant and . The third statement shows that with enough unlabeled examples, we can get a query probability function almost as good as the solution to the population problem .