Online Learning with Abstention
Abstract
We present an extensive study of a key problem in online learning where the learner can opt to abstain from making a prediction, at a certain cost. In the adversarial setting, we show how existing online algorithms and guarantees can be adapted to this problem. In the stochastic setting, we first point out a bias problem that limits the straightforward extension of algorithms such as ucbn to this context. Next, we give a new algorithm, ucbgt, that exploits historical data and time-varying feedback graphs. We show that this algorithm benefits from more favorable regret guarantees than a natural extension of ucbn. We further report the results of a series of experiments demonstrating that ucbgt largely outperforms that extension of ucbn, as well as other standard baselines.
1 Introduction
We consider an online learning scenario, prevalent in many applications, where the learner is granted the option of abstaining from making a prediction, at a certain cost. For example, in the classification setting, at each round, the learner can choose to make a prediction and incur a standard zero-one misclassification cost, or elect to abstain, in which case she incurs an abstention cost, typically less than one. Abstention can thus represent an attractive option to avoid a higher cost of misclassification. Note, however, that when the learner abstains, she does not receive the true label (correct class), which results in a loss of information.
This scenario of online learning with abstention is relevant to many real-life problems. As an example, consider the scenario where a doctor can choose to make a diagnosis based on the current information available about a patient, or abstain and request further laboratory tests, which can represent both a time delay and a financial cost. In this case, the abstention cost is usually substantially lower than that of a wrong diagnosis. The online model is appropriate since it captures the gradual experience a doctor gains by testing, examining and following new patients.
Another instance of this problem appears in the design of spoken-dialog applications such as those in modern personal assistants. Each time the user asks a question, the assistant can either offer a direct response to the question, at the risk of providing an inaccurate response, or choose to say “I am sorry, I do not understand?”, which results in a longer and thereby more costly dialog requesting the user to reformulate his question. Similar online learning problems arise in the context of self-driving cars where, at each instant, the assistant must determine whether to continue steering the car or return the control to the driver. Online learning with abstention also naturally models many problems arising in electronic commerce platforms such as an Ad Exchange, an online platform set up by a publisher where several advertisers bid in order to compete for an ad slot, the abstention cost being the opportunity loss of not bidding for a specific ad slot.
In the batch setting, the problem of learning with abstention has been studied in a number of publications, starting with (Chow, 1957, 1970). Its theoretical aspects have been analyzed by several authors in the last decade. El-Yaniv and Wiener (2010, 2011) studied the trade-off between the coverage and accuracy of classifiers. Bartlett and Wegkamp (2008) introduced a loss function that explicitly includes an abstention cost and gave a consistency analysis of a surrogate loss that they used to derive an algorithm. More recently, Cortes et al. (2016b, a) presented a comprehensive study of the problem, including an analysis of the properties of a corresponding abstention (or rejection) loss with a series of theoretical guarantees and algorithmic results both for learning with kernel-based hypotheses and for boosting.
This paper presents an extensive study of the problem of online learning with abstention, in both the adversarial and the stochastic settings. We consider the common scenario of prediction with expert advice (Littlestone and Warmuth, 1994) and adopt the same general abstention loss function as in (Cortes et al., 2016b), with each expert formed by a pair made of a predictor and an abstention function.
A key aspect of the problem we investigate, which makes it distinct from both batch learning with abstention, where labels are known for all training points, and standard online learning (in the full information setting) is the following: if the algorithm abstains from making a prediction for the input point received at a given round, the true label of that point is not revealed. As a result, the loss of the experts that would have instead made a prediction on that point cannot be determined at that round. Thus, we are dealing with an online learning scenario with partial feedback. If the algorithm chooses to predict, then the true label is revealed and the losses of all experts, including abstaining ones, are known. But, if the algorithm elects to abstain, then only the losses of the abstaining experts are known, all of them being equal to the same abstention cost.
As we shall see, our learning problem can be cast as a specific instance of online learning with a feedback graph, a framework introduced by Mannor and Shamir (2011) and later extensively analyzed by several authors (Caron et al., 2012; Alon et al., 2013, 2014, 2015; Kocák et al., 2014; Neu, 2015; Cohen et al., 2016). In our context, the feedback graph varies over time, a scenario for which most of the existing algorithms and analyses (specifically, in the stochastic setting) do not readily apply. Our setting is distinct from the KWIK (knows what it knows) framework of Li et al. (2008) and its later extensions, though there are some connections, as discussed in Appendix A.
Our contribution can be summarized as follows. In Section 3, we analyze an adversarial setting both in the case of a finite family of experts and that of an infinite family. We show that the problem of learning with abstention can be cast as that of online learning with a time-varying feedback graph tailored to the problem. In the finite case, we show how ideas from (Alon et al., 2014, 2015) can be extended and combined with this time-varying feedback graph to devise an algorithm, exp3abs, that benefits from favorable guarantees. In turn, exp3abs is used as a subroutine for the infinite case where we show how a surrogate loss function can be carefully designed for the abstention loss, while maintaining the same partial observability. We use the structure of this loss function to extend ContextualExp3 (Cesa-Bianchi et al., 2017) to the abstention scenario and prove regret guarantees for its performance.
In Section 4, we shift our attention to the stochastic setting. Stochastic bandits with a fixed feedback graph have been previously studied by Caron et al. (2012) and Cohen et al. (2016). We first show that an immediate extension of these algorithms to the time-varying graphs in the abstention scenario faces a technical bias problem in the estimation of the expert losses. Next, we characterize a set of feedback graphs that can circumvent this bias problem in the general setting of online learning with feedback graphs. We further design a new algorithm, ucbgt, whose feedback graph is estimated based on past observations. We prove that the algorithm admits more favorable regret guarantees than the ucbn algorithm (Caron et al., 2012). Finally, in Section 5 we report the results of several experiments with both artificial and real-world datasets demonstrating that ucbgt in practice significantly outperforms an unbiased, but limited, extension of ucbn, as well as a standard bandit baseline, like UCB (Auer et al., 2002a).
2 Learning Problem
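Let $\mathcal{X}$ denote the input space (e.g., $\mathcal{X}$ is a bounded subset of $\mathbb{R}^d$). We denote by $\mathcal{H}$ a family of predictors $h\colon \mathcal{X} \to \mathbb{R}$, and consider the familiar binary classification problem where the loss $\ell(y, h(x))$ of $h \in \mathcal{H}$ on a labeled pair $(x, y) \in \mathcal{X} \times \{\pm 1\}$ is defined by either the 0/1-loss $1_{y h(x) \le 0}$, or some Lipschitz variant thereof (see Section 3). In all cases, we assume $\ell \in [0, 1]$. We also denote by $\mathcal{R}$ a family of abstention functions $r\colon \mathcal{X} \to \mathbb{R}$, with $r(x) \le 0$ indicating an abstention on $x \in \mathcal{X}$ (or that $x$ is rejected), and $r(x) > 0$ that $x$ is predicted upon (or that $x$ is accepted).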
We consider a specific online learning scenario whose regime lies between bandit and full information, sometimes referred to as bandit with side-information (e.g., Mannor and Shamir (2011); Caron et al. (2012); Alon et al. (2013, 2014, 2015); Kocák et al. (2014); Neu (2015); Cohen et al. (2016)). In our case, the arms are pairs made of a predictor function $h$ and an abstention function $r$ in a given family $\mathcal{E} \subseteq \mathcal{H} \times \mathcal{R}$. We will denote by $\xi_j = (h_j, r_j)$, $j \in [K]$, the elements of $\mathcal{E}$. In fact, depending on the setting, $K$ may be finite or (uncountably) infinite. Given $h$, one natural choice for the associated abstention function $r$ is a confidence-based abstention function of the form $r_h(x) = |h(x)| - \theta$, for some threshold $\theta > 0$. Yet, more general pairs $(h, r)$ can be considered here. This provides an important degree of flexibility in the design of algorithms where abstentions are allowed, as shown in (Cortes et al., 2016b, a). Appendix A presents a concrete example illustrating the benefits of learning with these pairs of functions.
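The online learning protocol is described as follows. The set $\mathcal{E}$ is known to the learning algorithm beforehand. At each round $t \in [T]$, the online algorithm receives an input $x_t \in \mathcal{X}$ and chooses (possibly at random) an arm (henceforth also called “expert” or “pair”) $\xi_{I_t} = (h_{I_t}, r_{I_t}) \in \mathcal{E}$, where $I_t$ denotes the index of the selected arm. If the inequality $r_{I_t}(x_t) \le 0$ holds, then the algorithm abstains and incurs as loss an abstention cost $c(x_t) \in [0, 1]$. Otherwise, it predicts based on the sign of $h_{I_t}(x_t)$, receives the true label $y_t \in \{\pm 1\}$, and incurs the loss $\ell(y_t, h_{I_t}(x_t))$. Thus, the overall abstention loss $L$ of expert $\xi = (h, r)$ on the labeled pair $z = (x, y)$ is defined as follows:
$$L(\xi, z) = \ell(y, h(x))\, 1_{r(x) > 0} + c(x)\, 1_{r(x) \le 0}. \qquad (1)$$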
For simplicity, we will assume throughout that the abstention cost $c(x)$ is a (known) constant $c \in [0, 1]$, independent of $x$, though all our results can be straightforwardly extended to the case when $c$ is a (Lipschitz) function of $x$, which is indeed desirable in some applications.
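As an illustration of the abstention loss (1) with a constant cost, here is a minimal Python sketch. The function names (`zero_one_loss`, `abstention_loss`) and the representation of predictors and abstention functions as plain callables are assumptions made for this example; they do not come from the paper.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def zero_one_loss(y: int, score: float) -> float:
    """0/1 loss for a label y in {-1, +1}; a zero score counts as an error."""
    return 1.0 if y * score <= 0 else 0.0


def abstention_loss(
    h: Callable[[Vector], float],
    r: Callable[[Vector], float],
    x: Vector,
    y: int,
    c: float,
    loss: Callable[[int, float], float] = zero_one_loss,
) -> float:
    """Abstention loss of the pair (h, r) on (x, y) with constant cost c.

    The pair abstains when r(x) <= 0 and pays c; otherwise it predicts
    with the sign of h(x) and pays the classification loss.
    """
    if r(x) <= 0:
        return c
    return loss(y, h(x))
```

With the hypothetical pair `h = lambda x: x[0]` and `r = lambda x: abs(x[0]) - 0.5`, the point `[0.2]` falls in the abstention region and costs `c` regardless of its label, while `[1.0]` is predicted upon and costs 0 or 1 depending on the label.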
Our problem can be naturally cast as an online learning problem with side information in the form of a feedback graph. Online learning with a feedback graph is a general framework that covers a variety of problems with partial information, including the full information scenario, where the graph is fully connected, and the bandit scenario where all vertices admit only self-loops and are disconnected (Alon et al., 2013, 2014). In our case, we have a directed graph $G_t = (V, E_t)$ that depends on the instance $x_t$ received by the algorithm at round $t$. Here, $V$ denotes the finite set of vertices of this graph, which, in the case of a finite set of arms, coincides with the set of experts $\mathcal{E}$, while $E_t$ denotes the set of directed edges at round $t$. The directed edge $\xi_i \to \xi_j$ is in $E_t$ if the loss of expert $\xi_j$ is observed when expert $\xi_i$ is selected by the algorithm at round $t$. In our problem, if the learner chooses to predict at round $t$ (i.e., if $r_{I_t}(x_t) > 0$, with $I_t$ the index of the selected expert), then she observes the loss $L(\xi_j, (x_t, y_t))$ of all experts $\xi_j$, since the label $y_t$ is revealed to her. If instead she abstains at round $t$ (i.e., if $r_{I_t}(x_t) \le 0$), then she only observes $L(\xi_j, (x_t, y_t))$ for those experts $\xi_j$ that are abstaining in that round, that is, the set of $j$ such that $r_j(x_t) \le 0$, since for all such $j$, we have $L(\xi_j, (x_t, y_t)) = c$. Notice that in both cases the learner can observe the loss of her own action. Thus, the feedback graph we are operating with is a nearly fully connected directed graph with self-loops, except that it admits only one-way edges from predicting to abstaining vertices (see Figure 1 for an example). Observe also that the feedback graph $G_t$ is fully determined by $x_t$.
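The feedback structure described above can be sketched in a few lines of Python. This is only an illustration: the function name `out_neighbors` and the encoding of a round as the list of values of the abstention functions at the current input are assumptions of this example.

```python
from typing import Sequence, Set


def out_neighbors(i: int, r_values: Sequence[float]) -> Set[int]:
    """Experts whose loss is observed at this round when expert i is selected.

    r_values[j] is the value of expert j's abstention function at the
    current input; expert j abstains when r_values[j] <= 0.
    """
    if r_values[i] > 0:
        # Expert i predicts: the label is revealed, so every loss is observed.
        return set(range(len(r_values)))
    # Expert i abstains: only abstaining experts (loss equal to c) are observed.
    return {j for j, value in enumerate(r_values) if value <= 0}
```

In both branches the selected expert belongs to its own out-neighborhood, which matches the self-loops of the graph, and edges go only one way from predicting to abstaining experts.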
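We will consider both an adversarial setting (Section 3), where no distributional assumption is made about the sequence $z_t = (x_t, y_t)$, $t \in [T]$, and a stochastic setting (Section 4), where $z_t$ is assumed to be drawn i.i.d. from some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \{\pm 1\}$. For both settings, we measure the performance of an algorithm by its (pseudo-)regret $R_T$, defined as $R_T = \sup_{\xi \in \mathcal{E}} \mathbb{E}\big[\sum_{t=1}^{T} L(\xi_{I_t}, z_t) - \sum_{t=1}^{T} L(\xi, z_t)\big]$, where $I_t$ is the index of the expert selected at round $t$ and the expectation is taken both with respect to the algorithm’s choice of actions $I_t$ and, in the stochastic setting, the random draw of the $z_t$s.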
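For intuition, the following sketch computes the realized regret of one run against the best fixed expert in hindsight. It is an empirical counterpart of the quantity above for a finite set of experts, not the pseudo-regret itself, since no expectation is taken. The name `realized_regret` and the loss-table layout are assumptions of this example.

```python
from typing import Sequence


def realized_regret(
    loss_table: Sequence[Sequence[float]], choices: Sequence[int]
) -> float:
    """Cumulative loss of the chosen experts minus that of the best fixed expert.

    loss_table[t][j] is the abstention loss of expert j at round t and
    choices[t] is the index of the expert selected at round t.
    """
    num_rounds = len(choices)
    num_experts = len(loss_table[0])
    incurred = sum(loss_table[t][choices[t]] for t in range(num_rounds))
    best_fixed = min(
        sum(loss_table[t][j] for t in range(num_rounds))
        for j in range(num_experts)
    )
    return incurred - best_fixed
```

Averaging this quantity over independent runs gives a Monte Carlo estimate of the expected regret against the expert that is best in hindsight.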
In the stochastic setting, we will be mainly concerned with the case where $\mathcal{E}$ is a finite set of $K$ experts $\xi_1, \ldots, \xi_K$. We then denote by $\mu_j$ the expected loss of expert $\xi_j$, $\mu_j = \mathbb{E}_{z \sim \mathcal{D}}[L(\xi_j, z)]$, by $\mu^*$ the expected loss of the best expert, $\mu^* = \min_{j \in [K]} \mu_j$, and by $\Delta_j$ the loss gap to the best, $\Delta_j = \mu_j - \mu^*$. In the adversarial setting, we will analyze both the finite and infinite expert scenarios. In the infinite case, since $L$ is non-convex in the relevant parameters (Eq. (1)), further care is needed.
3 Adversarial setting
As a warm-up, we start with the adversarial setting with finitely many experts. Following ideas from Alon et al. (2014, 2015), we design an online algorithm for the abstention scenario by combining standard finite-arm bandit algorithms, like exp3 (Auer et al., 2003), with the feedback graph of Section 2. We call the resulting algorithm exp3abs (exp3 with abstention). The algorithm is a variant of exp3 where the importance weighting scheme to achieve unbiased loss estimates is based on the probability of the loss of an expert being observed, as opposed to that of an expert being selected (see Appendix B, Algorithm 3). The following guarantee holds for this algorithm.
Theorem 1
Let exp3abs be run with learning rate $\eta$ over a set of $K$ experts. Then, the algorithm admits the following regret guarantee after $T$ rounds:
In particular, if exp3abs is run with , then
The proof of this result, as well as all other proofs, is given in the appendix. The dependency of the bound on the number of experts $K$ is clearly more favorable than the standard bound for exp3 ($\sqrt{\log K}$ instead of $\sqrt{K}$). Theorem 1 is in fact reminiscent of what one can achieve using the contextual-bandit algorithm EXP4 (Auer et al., 2002b) run on $K$ experts, each one having two actions.
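The importance-weighting idea behind exp3abs, dividing an observed loss by the probability that it is observed rather than by the probability that the expert is selected, can be sketched as follows. This is a simplified illustration under the feedback graph of Section 2 and is not the paper's Algorithm 3: the function name `exp3_abs_round`, the argument layout, and the plain exponential-weights update are assumptions of this example. The full loss vector is passed only for simulation convenience; entries that the learner could not observe are never read.

```python
import math
import random
from typing import List, Sequence, Tuple


def exp3_abs_round(
    weights: Sequence[float],
    r_values: Sequence[float],
    losses: Sequence[float],
    eta: float,
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """One round of an Exp3-style update with abstention feedback.

    r_values[j] <= 0 means that expert j abstains at this round.
    Returns the selected expert and the updated weights.
    """
    num_experts = len(weights)
    total = sum(weights)
    probs = [w / total for w in weights]
    chosen = rng.choices(range(num_experts), weights=probs)[0]
    chosen_predicts = r_values[chosen] > 0
    # Probability that the selected expert predicts (and the label is revealed).
    p_predict = sum(p for p, value in zip(probs, r_values) if value > 0)

    new_weights = []
    for j in range(num_experts):
        abstains = r_values[j] <= 0
        observed = chosen_predicts or abstains
        # An abstaining expert is observed whatever is selected; a predicting
        # expert is observed only when the selected expert predicts.
        p_observed = 1.0 if abstains else p_predict
        estimate = losses[j] / p_observed if observed else 0.0
        new_weights.append(weights[j] * math.exp(-eta * estimate))
    return chosen, new_weights
```

Because a predicting expert has positive weight, `p_predict` is strictly positive whenever it is used as a divisor, so the loss estimates are well defined.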
We now turn our attention to the case of an uncountably infinite $\mathcal{E}$. To model this more general framework, one might be tempted to focus on parametric classes of functions $h$ and $r$ (e.g., a family of linear functions), introduce some convex surrogate of the abstention loss (1), and work in the corresponding parameter space through some Bandit Convex Optimization technique (e.g., (Hazan, 2016)). Unfortunately, this approach is not easy to put in place, since the surrogate loss not only needs to ensure convexity and some form of calibration, but also the ability for the algorithm to observe the loss of its own action (the self-loops in the graph of Figure 1).
We have been unable to get around this problem by just resorting to convex surrogate losses (and we strongly suspect that it is not possible), and in what follows we instead introduce a surrogate abstention loss which is Lipschitz but not convex. Moreover, we take the more general viewpoint of competing with pairs of Lipschitz functions with bounded Lipschitz constant. Let us then consider the version of the abstention loss (1) in which $\ell$ is a Lipschitz variant of the 0/1-loss with finite slope at the origin (see Figure 2 (a)), together with a class of experts $\mathcal{E}$ whose functions $h$ and $r$ are assumed to be Lipschitz with respect to an appropriate distance on $\mathcal{X}$, with a Lipschitz constant that determines the size of the family $\mathcal{E}$.
Using ideas from (Cesa-Bianchi et al., 2017), we present an algorithm that approximates the action space by a finite cover while using the structure of the abstention setting. The crux of the problem is to define a Lipschitz function that upper bounds the abstention loss while maintaining the same feedback assumptions, namely the feedback graph given in Figure 1. One Lipschitz function that precisely solves this problem is the following:
for . is plotted in Figure 2(b). Notice that this function is consistent with the feedback requirements of Section 2: implies that is known to the algorithm (i.e., is independent of ) for all such that , while gives complete knowledge of for all , since is observed.
We can then adapt the machinery from (Cesa-Bianchi et al., 2017) so as to apply a contextual version of exp3abs to the resulting sequence of surrogate losses. The algorithm adaptively covers $\mathcal{X}$ with balls of a fixed radius, each ball hosting an instance of exp3abs. We call this algorithm Contexp3abs (see Appendix B.2 for details).
Theorem 2
Consider the abstention loss
and let , with made of pairs of Lipschitz functions as described above. If Contexp3abs is run with parameter and an appropriate learning rate (see Appendix B), then, it admits the following regret guarantee:
where is the number of such that .
In the above, hides constant and factors, while disregards constants like , and various log factors. Contexp3abs is also computationally efficient, thereby providing a compelling solution to the infinite armed case of online learning with abstention.
4 Stochastic setting
We now turn to studying the stochastic setting. As pointed out in Section 2, the problem can be cast as an instance of online learning with time-varying feedback graphs $G_t$. Thus, a natural method for tackling the problem would be to extend existing algorithms designed for the stochastic setting with feedback graphs to our abstention scenario (Cohen et al., 2016; Caron et al., 2012). We cannot benefit from the algorithm of Cohen et al. (2016) in our scenario. This is because at the heart of its design and theoretical guarantees lies the assumption that the graphs and losses are independent. The dependency of the feedback graphs on the observations $z_t = (x_t, y_t)$, which also define the losses, is precisely a property that we wish to exploit in our scenario.
An alternative is to extend the ucbn algorithm of Caron et al. (2012), for which the authors provide gap-based regret guarantees. This algorithm is defined for a stochastic setting with an undirected feedback graph that is fixed over time. The algorithm can be straightforwardly extended to the case of directed time-varying feedback graphs (see Algorithm 1). We will denote that extension by ucbnt to explicitly differentiate it from ucbn. Let $N_t(j)$ denote the set of out-neighbors of vertex $\xi_j$ in the directed graph at time $t$, i.e., the set of vertices that are destinations of an edge from $\xi_j$. Then, as with ucbn, the algorithm updates, at each round $t$, the upper-confidence bound of every expert for which a feedback is received (those in $N_t(I_t)$, where $I_t$ is the index of the selected expert), as opposed to updating only the upper-confidence bound of the expert selected, as in the standard ucb of Auer et al. (2002a).
In the context of learning with abstention, the natural feedback graph $G_t$ at time $t$ depends on the observation $x_t$ and varies over time. Can we extend the regret guarantees of Caron et al. (2012) to ucbnt with such graphs? We will show in Section 4.1 that vanishing regret guarantees do not hold for ucbnt run with graphs $G_t$. This is because of a fundamental estimation bias problem that arises when the graph at time $t$ depends on the observation $x_t$. This issue affects more generally any natural method using the $G_t$ graphs. Nevertheless, we will show in Section 4.2 that ucbnt does benefit from favorable guarantees, provided the feedback graph it uses at round $t$ is replaced by one that only depends on events up to time $t - 1$.
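The update rule just described, where every expert in the out-neighborhood of the selected expert is refreshed, can be sketched as follows. This is an illustrative reading of the description above, not the paper's Algorithm 1: the class name `UcbNT`, the confidence width `sqrt(2 log t / n)`, and the use of a lower confidence bound on the loss (the loss-minimization counterpart of an upper confidence bound on rewards) are assumptions of this example.

```python
import math
from typing import Dict, List


class UcbNT:
    """UCB-style learner for losses with graph feedback (illustrative sketch)."""

    def __init__(self, num_experts: int) -> None:
        self.counts: List[int] = [0] * num_experts
        self.means: List[float] = [0.0] * num_experts
        self.t = 0

    def select(self) -> int:
        """Pick the expert with the smallest optimistic loss estimate."""
        self.t += 1

        def index(j: int) -> float:
            if self.counts[j] == 0:
                return -math.inf  # never observed: try it first
            width = math.sqrt(2.0 * math.log(self.t) / self.counts[j])
            return self.means[j] - width

        return min(range(len(self.counts)), key=index)

    def update(self, observed: Dict[int, float]) -> None:
        """Update every expert whose loss was observed at this round.

        observed maps each expert in the out-neighborhood of the selected
        expert to its observed loss.
        """
        for j, loss in observed.items():
            self.counts[j] += 1
            self.means[j] += (loss - self.means[j]) / self.counts[j]
```

Restricting `observed` to the selected expert alone recovers a standard UCB update, which is the baseline behavior without any sharing across experts.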
4.1 Bias problem
Assume there are two experts, $\xi_1$ (red) and $\xi_2$ (blue), defined over an interval of inputs (see Figure 3). On one half of the interval, the red expert is abstaining and incurring a loss $c$, whereas the blue expert is never abstaining. Assume that the probability mass is quasi-uniform over the interval but with slightly more mass over the other half, where both experts predict. The algorithm may then start out by observing points in this region. Here, both experts accept and the algorithm obtains error estimates corresponding to the solid red and blue lines for that half. When the algorithm then observes a point in the half where the red expert abstains, it naturally selects the red abstaining expert since it admits a better current estimated loss. However, on that half, the red expert is worse than the blue expert. Furthermore, it is abstaining and thus providing no updates for the blue expert (which is instead predicting). Hence, the algorithm continues to maintain an estimate of the blue expert’s loss at the level of the blue solid line observed on the other half; it then continues to select the red expert in all subsequent rounds and incurs a high regret. (Footnote 1: For the sake of clarity, we did not introduce specific real values for the expected loss of each expert on each of the half intervals, but that can be done straightforwardly. We have also verified experimentally with such values that the bias problem just pointed out indeed leads to poor regret for ucbnt.)
This simple example shows that, unlike the adversarial scenario (Section 3), the feedback graph used at round $t$, here, cannot depend on the input $x_t$, and that, in general, the indiscriminate use of feedback graphs may result in biased loss observations. On the other hand, we know that if we were to avoid using feedback graphs at all (which is always possible using ucb), we would always be able to define unbiased loss estimates. A natural question is then: can we construct time-varying feedback graphs that lead to unbiased loss observations? In the next section, we show how to design such a sequence of auxiliary feedback graphs, which in turn allows us to then extend ucbnt to the setting of time-varying feedback graphs for general loss functions. Under this assumption, we can achieve unbiased empirical estimates of the average losses of the experts, which will allow us to apply standard concentration bounds in the proof of this algorithm.
4.2 Time-varying graphs for ucbnt
We now show that ucbnt benefits from favorable guarantees, so long as the feedback graph it uses at time $t$ depends only on events up to time $t - 1$. This extension works for general bounded losses and does not only apply to our specific abstention loss $L$.
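So, let us assume that the feedback graph in round $t$ (and the associated out-neighborhoods) in Algorithm 1 only depends on the losses and inputs observed in rounds $s = 1, \ldots, t - 1$, and let us denote this feedback graph by $\widehat{G}_t = (V, \widehat{E}_t)$, so as not to get confused with $G_t$. Under this assumption, we can derive strong regret guarantees for ucbnt with time-varying graphs using a newly introduced notion of admissible coverings. For the feedback graph $\widehat{G}_t$ at time $t$, let $\mathcal{C}_t$ be a collection of subsets of $V$ covering $V$, such that, for every $C \in \mathcal{C}_t$, $\xi_i, \xi_j \in C$ means that $(\xi_i, \xi_j) \in \widehat{E}_t$ and $(\xi_j, \xi_i) \in \widehat{E}_t$. We call such a collection an admissible covering of $\widehat{G}_t$. Let $\mathfrak{C}_t$ denote the set of all admissible coverings of $\widehat{G}_t$, and let $\mathfrak{C} = \bigcap_{t \in [T]} \mathfrak{C}_t$, i.e., the collection of shared admissible coverings that apply across all time steps. Then by construction, for any $\mathcal{C} \in \mathfrak{C}$ and $C \in \mathcal{C}$, $\xi_i, \xi_j \in C$ means that $(\xi_i, \xi_j) \in \widehat{E}_t$ and $(\xi_j, \xi_i) \in \widehat{E}_t$ for every $t \in [T]$. Note that the definition of $\mathfrak{C}$ is equivalent to considering the set of edges that are shared across all $\widehat{G}_t$, and then considering admissible coverings over the graph induced by these shared edges. Moreover, since every $\widehat{G}_t$ contains all self-loops, $\mathfrak{C}$ is always non-empty.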
Theorem 3
Assume that, for all $t \in [T]$, the feedback graph used at round $t$ depends only on information up to time $t - 1$. Then, the regret of ucbnt is bounded as follows:
The theorem gives a bound on the regret based on any admissible covering that applies to every feedback graph seen during learning, and the minimum chooses the admissible covering with the smallest regret.
Theorem 3 can be interpreted as an extension of Theorem 2 in Caron et al. (2012) to time-varying feedback graphs. Its proof involves showing that the use of feedback graphs that depend only on information up to time $t - 1$ can result in unbiased loss estimates, and it considers shared admissible coverings that apply across the sequence of feedback graphs to derive a time-varying bound that leverages the shared updates from the graph.
Moreover, the bound illustrates that if the feedback graphs in a problem admit a shared admissible covering with a small number of elements (e.g., if the feedback graphs can be decomposed into a small number of components that are fixed across time), then this bound can be substantially tighter than the bound guaranteed by the standard UCB algorithm. In addition, this regret guarantee is never worse than that of the standard UCB, since the (trivial) covering that splits the set of experts into singletons is always an admissible covering of every feedback graph in the sequence. Furthermore, note that if the feedback graph is fixed throughout all rounds and we interpret the doubly-directed edges as edges of an undirected graph, then the shared admissible coverings coincide with the admissible coverings of that fixed graph. Thus, we straightforwardly obtain the following result, which is comparable to Theorem 2 in (Caron et al., 2012).
Corollary 1
If the feedback graph is fixed over time, then the guarantee of Theorem 3 is upper-bounded by:
the outer minimum being over all admissible coverings of the fixed feedback graph.
Caron et al. (2012) present matching lower bounds for the case of stochastic bandits with a fixed feedback graph. Since we can again design abstention scenarios with fixed feedback graphs, these bounds carry over to our setting.
Now, how can we use the results of this section to design an algorithm for the abstention scenario? The natural feedback graphs $G_t$ we discussed in Section 3 are no longer applicable since $G_t$ depends on $x_t$. Nevertheless, we will present two solutions to this problem. In Section 4.3, we present a solution with a fixed graph that closely captures the problem of learning with abstention. Next, in Section 4.4, we will show how to define and leverage a time-varying graph that is estimated based on past observations.
4.3 ucbn with the subset feedback graph
In this section, we define a subset feedback graph, $G^{\mathrm{SUB}}$, that captures the most informative feedback in the problem of learning with abstention and yet is safe in the sense that it does not depend on $x_t$. The definition of the graph is based on the following simple observation: if the abstention region associated with $\xi_i$ is a subset of that of $\xi_j$, then, if $\xi_i$ is selected at some round and is abstaining, so is $\xi_j$. For an example, see Figure 4 (top). Crucially, this implication holds regardless of the particular input point received in the region of abstention of $\xi_i$. Thus, the set of vertices of $G^{\mathrm{SUB}}$ is the set of experts, and $G^{\mathrm{SUB}}$ admits an edge from $\xi_i$ to $\xi_j$ iff $\{x \in \mathcal{X} : r_i(x) \le 0\} \subseteq \{x \in \mathcal{X} : r_j(x) \le 0\}$. Since $G^{\mathrm{SUB}}$ does not vary with time, it trivially verifies the condition of the previous section. Thus, ucbnt run with $G^{\mathrm{SUB}}$ admits the regret guarantees of Theorem 3, where we only need to consider the set of admissible coverings of the fixed graph $G^{\mathrm{SUB}}$.
The example of Section 4.1 illustrated a bias problem in a special case where the feedback graphs were not subgraphs of the subset graph $G^{\mathrm{SUB}}$. The following result shows more generally that feedback graphs not included in $G^{\mathrm{SUB}}$ may result in catastrophic regret behavior.
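To make the subset construction concrete, the sketch below builds such a graph when each abstention region is an annulus around the origin, encoded by its inner and outer radii, so that region inclusion reduces to interval inclusion. Both the encoding and the function name `subset_graph` are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple


def subset_graph(regions: List[Tuple[float, float]]) -> Dict[int, Set[int]]:
    """Out-neighborhoods of a subset feedback graph.

    regions[i] = (inner, outer) describes the abstention region of expert i
    as the annulus inner <= ||x|| <= outer (assumed non-empty). There is an
    edge from i to j iff the region of i is included in the region of j.
    """
    edges: Dict[int, Set[int]] = {i: set() for i in range(len(regions))}
    for i, (inner_i, outer_i) in enumerate(regions):
        for j, (inner_j, outer_j) in enumerate(regions):
            if inner_j <= inner_i and outer_i <= outer_j:
                edges[i].add(j)
    return edges
```

Every expert is its own out-neighbor, and the graph is computed once before learning starts because it depends only on the abstention functions, never on the inputs received.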
Proposition 1
Assume that ucbnt is run with feedback graphs that are not subgraphs of the subset graph $G^{\mathrm{SUB}}$. Then, there exists a family of predictors $\mathcal{H}$, a Lipschitz loss function $\ell$ in (1), and a distribution $\mathcal{D}$ over the $z_t$s for which ucbnt incurs linear regret with arbitrarily high probability.
The proof of the proposition is given in Appendix C.3. In view of this result, no fixed feedback graph for ucbnt can be more informative than $G^{\mathrm{SUB}}$. But how can we leverage past observations (up to time $t - 1$) to derive a feedback graph that would be more informative than the simple subset graph $G^{\mathrm{SUB}}$? The next section provides a solution based on feedback graphs estimated based on past observations and a new algorithm.
4.4 UCBGT algorithm
We seek graphs that admit the subset graph $G^{\mathrm{SUB}}$ as a subgraph. We will show how certain types of edges can be safely added to $G^{\mathrm{SUB}}$ based on past observations. This leads to a new algorithm, ucbgt (ucb with estimated time-varying graph), whose pseudocode is given in Algorithm 2.
As illustrated by Figure 4, the key idea of ucbgt is to augment $G^{\mathrm{SUB}}$ with edges from $\xi_i$ to $\xi_j$ where the subset property may not hold, but where the implication $r_i(x) \le 0 \Rightarrow r_j(x) \le 0$ holds with high probability over the choice of $x$, that is, the region $\{x : r_i(x) \le 0 \text{ and } r_j(x) > 0\}$ admits low probability $p_{i,j}$. Of course, adding such an edge can cause the estimation bias of Section 4.1. But, if we restrict ourselves to cases where $p_{i,j}$ is upper bounded by some carefully chosen quantity that changes over rounds, the effect of this bias will be limited. In return, as illustrated in Figure 4, the resulting feedback graph can be substantially more beneficial since it may have many more edges than $G^{\mathrm{SUB}}$, hence leading to more frequent updates of the experts’ losses and more favorable regret guarantees. This benefit is further corroborated by our experimental results (Section 5).
Since we do not have access to $p_{i,j}$, the probability of the region where $\xi_i$ abstains while $\xi_j$ predicts, we use instead empirical estimates $\widehat{p}_{i,j}$ computed from past inputs. At time $t$, if expert $\xi_i$ is selected, we update expert $\xi_j$ if $\widehat{p}_{i,j}$ is at most the round-dependent threshold mentioned above. If the expert chosen abstains while expert $\xi_j$ predicts and satisfies this condition, then we do not have access to the true label $y_t$. In that case, we update optimistically our empirical estimate as if the expert $\xi_j$ had loss $0$ at that round (Step (*) in Alg. 2).
The feedback graph just described can be defined via the out-neighborhood of vertex $\xi_i$, namely the set of experts $\xi_j$ that satisfy the update condition above. The following regret guarantee holds for ucbgt.
Theorem 4
For any $t \in [T]$, let the feedback graph be defined by the out-neighborhood just described. Then, the regret of ucbgt is bounded as follows:
Since the graph of ucbgt has more edges than the subset graph $G^{\mathrm{SUB}}$, it admits at least as many admissible coverings as $G^{\mathrm{SUB}}$, which leads to a more favorable guarantee than that of ucbnt run with $G^{\mathrm{SUB}}$. The proof of this result differs from the standard UCB analysis and that of Theorem 3 in that it involves showing that the ucbgt algorithm can adequately control the amount of bias introduced by the skewed loss estimates. The experiments in the next section provide an empirical validation of this theoretical comparison.
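The estimated graph used by ucbgt can be illustrated with the following sketch, which estimates, from past inputs, how often one expert abstains while another predicts, and keeps the edges for which this estimate is small. It is a hedged illustration only: the threshold `gamma` is passed in as a free parameter because the round-dependent threshold of the paper is not reproduced here, and the name `estimated_out_neighbors` and the data layout are assumptions of this example.

```python
from typing import Sequence, Set


def estimated_out_neighbors(
    i: int, past_r_values: Sequence[Sequence[float]], gamma: float
) -> Set[int]:
    """Out-neighborhood of expert i estimated from past inputs.

    past_r_values[s][j] is the value of expert j's abstention function at
    the input of past round s. Expert j is kept when the empirical frequency
    of "i abstains while j predicts" is at most gamma.
    """
    if not past_r_values:
        return {i}  # no history yet: keep only the self-loop
    num_rounds = len(past_r_values)
    num_experts = len(past_r_values[0])
    neighbors: Set[int] = set()
    for j in range(num_experts):
        mismatches = sum(
            1 for values in past_r_values if values[i] <= 0 and values[j] > 0
        )
        if mismatches / num_rounds <= gamma:
            neighbors.add(j)
    return neighbors
```

Any expert whose abstention region contains that of expert `i` has zero mismatches, so the subset-graph edges are always retained, and the graph uses only inputs from earlier rounds.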
5 Experiments
In this section, we report the results of several experiments on ten datasets comparing ucbgt, ucbnt with the subset feedback graph $G^{\mathrm{SUB}}$, vanilla ucb (with no sharing of information across experts), as well as Full-Supervision, fs. fs is an algorithm that at each round chooses the expert with the smallest abstention loss so far and, even if this expert abstains, receives the true label and can update the empirical abstention loss estimates for all experts. fs reflects an unrealistic and overly optimistic scenario that clearly falls outside the abstention setting, but it provides an upper bound for the best performance we may hope for.
We used the following eight datasets from the UCI data repository: HIGGS, phishing, ijcnn, covtype, eye, skin, cod-rna, and guide. We also used the CIFAR dataset from (Krizhevsky et al., 2009), where we extracted the first twenty-five principal components and used their projections as features, and a synthetic dataset of points drawn according to a uniform distribution. For each dataset, all the algorithms were tested with the same number of experts and for the same number of rounds. The experts were chosen in the following way. The predictors are hyperplanes centered at the origin whose normal vector in $\mathbb{R}^d$ is drawn randomly from a Gaussian distribution, where $d$ is the dimension of the feature space of the dataset. The abstention functions are concentric annuli around the origin. For each dataset, we generated a set of predictors and each predictor is paired with the 21 abstention functions. For a fixed set of experts, we first calculated the regret by averaging over five random draws of the data, where the best-in-class expert was determined in hindsight as the one with the minimum average cumulative abstention loss. We then repeated this experiment five times over different sets of experts and averaged the results. We report these results for several values of the abstention cost $c$.
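The construction of experts described above can be sketched as follows. Several choices here are assumptions made only for illustration and are not taken from the paper: a standard Gaussian for the normal vectors, the convention that an expert abstains inside its annulus, the encoding of annuli by inner and outer radii, and the name `make_experts`.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Expert = Tuple[Callable[[Vector], float], Callable[[Vector], float]]


def make_experts(
    dim: int,
    num_predictors: int,
    annuli: Sequence[Tuple[float, float]],
    rng: random.Random,
) -> List[Expert]:
    """Pair each random hyperplane predictor with every annulus abstention function."""
    experts: List[Expert] = []
    for _ in range(num_predictors):
        # Normal vector of a hyperplane through the origin.
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]

        def h(x: Vector, w: List[float] = w) -> float:
            return sum(wi * xi for wi, xi in zip(w, x))

        for inner, outer in annuli:

            def r(x: Vector, a: float = inner, b: float = outer) -> float:
                norm = math.sqrt(sum(xi * xi for xi in x))
                # Non-positive values mean "abstain".
                return -1.0 if a <= norm <= b else 1.0

            experts.append((h, r))
    return experts
```

The total number of experts is the number of predictors times the number of annuli, so pairing each predictor with 21 abstention functions, as in the text, multiplies the number of predictors by 21.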
Figure 5 shows the averaged regret with standard deviations across the five repetitions for the different algorithms as a function of for two datasets. In Appendix D, we present plots of the regret for all ten datasets. These results show that ucbgt outperforms both ucbnt and ucb on all datasets for all abstention cost values. Remarkably, ucbgt’s performance is close to that of fs for most datasets, thereby implying that ucbgt attains almost the best regret that we could hope for. We also find that ucbnt performs better than the vanilla ucb.
Figure 5 also illustrates the fraction of points in which the chosen expert abstains, as well as the number of edges in the feedback graph as a function of rounds. We only plot the number of edges of ucbgt since that is the only graph that varies with time. For both experiments depicted and in general for the rest of the datasets, the number of edges for ucbgt is between 1 million to 3 million, which is at least a factor of 5 more than for ucbnt, where the number of edges we observed are of the order . fs enjoys the full information property and the number of edges is fixed at 4 million (complete graph). The increased information sharing of ucbgt is clearly a strong contributing factor to the algorithm’s improvement in regret relative to ucbnt. In general, we find that, provided that the estimation bias is controlled, the higher is the number of edges, the smaller the regret. Regarding the value of the cost , as expected, we observe that the fraction of points that the chosen expert abstains on always decreases as increases, but also that that fraction depends on the dataset and the experts used.
Finally, Appendix D includes more experiments for different aspects of the problem. In particular, we tested how the number of experts or a different choice of experts (confidence-based experts) affected the results. We also experimented with some extreme abstention costs and, as expected, found the fraction of abstained points to be large for and small for . In all of these additional experiments, ucbgt outperformed ucbnt.
6 Conclusion
We presented a comprehensive analysis of the novel setting of online learning with abstention, including algorithms with favorable guarantees both in the stochastic and adversarial scenarios, and extensive experiments demonstrating the performance of ucbgt in practice. Our algorithms and analysis can be straightforwardly extended to similar problems, including the multiclass and regression settings, as well as other related scenarios, such as online learning with budget constraints. A key idea behind the design of our algorithms in the stochastic setting is to leverage the stochastic sequence of feedback graphs. This idea can perhaps be generalized and applied to other problems where time-varying feedback graphs naturally appear. Furthermore, our regret guarantees can instead be expressed in terms of the independence number of time-varying graphs by proceeding as in (Lykouris et al., 2019).
References
 Online learning with feedback graphs: beyond bandits. JMLR. Cited by: §B.1, §1, §1, §2, §3.
 Nonstochastic multi-armed bandits with graph-structured feedback. In CoRR, Cited by: §B.1, §1, §1, §2, §2, §3.
 From bandits to experts: a tale of domination and independence. In NIPS, Cited by: §1, §2, §2.
 Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47 (2–3), pp. 235–256. Cited by: §1, §4.
 The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77. Cited by: §3.
 Classification with a reject option using a hinge loss. JMLR, pp. 291–307. Cited by: §1.
 Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5 (1), pp. 1–122. Cited by: §B.1.
 Leveraging side observations in stochastic bandits. In UAI, Cited by: §1, §1, §2, §4.2, §4.2, §4.2, §4, §4, §4.
 Algorithmic chaining and the role of partial feedback in online nonparametric learning. In MLR, Cited by: §B.2, §1, §3, §3, Remark 1.
 An optimum character recognition system using decision function. IEEE T. C.. Cited by: §1.
 On optimum recognition error and reject tradeoff. IEEE T. C.. Cited by: §1.
 Nearest-neighbor searching and metric space dimensions. In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, Cited by: Remark 1.
 Online learning with feedback graphs without the graphs. In ICML, Cited by: §1, §1, §2, §4.
 Boosting with abstention. In NIPS, Cited by: §1, §2.
 Learning with rejection. In ALT, pp. 67–82. Cited by: Figure 6, §1, §1, §2.
 On the foundations of noise-free selective classification. JMLR. Cited by: §1.
 Agnostic selective classification. In NIPS, Cited by: §1.
 Online learning with prior knowledge. In COLT, pp. 499–513. Cited by: Remark 1.
 Introduction to online convex optimization. Foundations and Trends in Optimization, Now Publishers Inc.. Cited by: §3.
 Efficient learning by implicit exploration in bandit problems with side observations. In NIPS, pp. 613–621. Cited by: §1, §2.
 CIFAR10 (Canadian Institute for Advanced Research). External Links: Link Cited by: §5.
 Knows What It Knows: a framework for self-aware learning. In ICML, Cited by: Appendix A, §1.
 The weighted majority algorithm. Information and computation 108 (2), pp. 212–261. Cited by: §1.
 Feedback graph regret bounds for Thompson sampling and UCB. In ArXiv, Cited by: §6.
 From bandits to experts: on the value of side-observations. In NIPS, pp. 291–307. Cited by: §1, §2.
 Explore no more: improved high-probability regret bounds for nonstochastic bandits. In NIPS, pp. 3168–3176. Cited by: §1, §2.
 Trading off mistakes and don’t-know predictions. In NIPS, Cited by: Appendix A.
 The extended Littlestone’s dimension for learning with mistakes and abstentions. In COLT, Cited by: Appendix A.
Appendix A Further Related Work
Learning with abstention is a useful paradigm in applications where the cost of misclassifying a point is high. More concretely, suppose the cost of abstention is less than and consider the set of points along the real line illustrated in Figure 6 where and indicate their labels. The best threshold classifier is the hypothesis given by threshold , since it correctly classifies points to the right of , with an expected loss of . On the other hand, the best abstention pair would abstain on the region left of and correctly classify the rest, with an expected loss of . Since , the abstention pair always admits a better loss than the best threshold classifier.
Within the online learning literature, work related to our scenario includes the KWIK (knows what it knows) framework of Li et al. (2008), in which the learning algorithm is required to make only correct predictions but admits the option of abstaining from making a prediction. The objective is then to learn a concept exactly with the fewest number of abstentions. If in our framework we received the label at every round, KWIK could be seen as a special case of our framework for online learning with abstention with an infinite misclassification cost and some finite abstention cost. A relaxed version of the KWIK framework was introduced and analyzed by Sayedi et al. (2010), where a fixed number of incorrect predictions are allowed, with a learning algorithm related to the solution of the ’mega-egg game puzzle’. A theoretical analysis of learning in this framework was also recently given by Zhang and Chaudhuri (2016). Our framework does not strictly cover this relaxed framework. However, for some choices of the misclassification cost depending on the horizon, the framework is very close to ours. The analysis in these frameworks was given in terms of mistake bounds since the problem is assumed to be realizable. We will not restrict ourselves to realizable problems and, instead, will provide regret guarantees.
Appendix B Additional material for the adversarial setting
We first present the pseudocode and proofs for the finite arm setting and next analyze the infinite arm setting.
B.1 Finite arm setting
Algorithm 3 contains the pseudocode for exp3abs, an algorithm for online learning with abstention under an adversarial data model that guarantees small regret. The algorithm itself is a simple adaptation of the ideas in (Alon et al., 2014, 2015), where we incorporate the side information that the loss of an abstaining arm is always observed, while the loss of a predicting arm is observed only if the algorithm actually plays a predicting arm. In the pseudocode and in the proof that follows, is a shorthand for .
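The side information described above can be illustrated with a single round of an Exp3-style update. The sketch below is written from the prose only and is not a transcription of Algorithm 3: the function name, its arguments, and the plain exponential-weights update are assumptions. An abstaining expert's loss equals the abstention cost and is always known, while a predicting expert's loss is revealed only when the played expert predicts, so it is divided by the total probability mass on predicting experts to obtain an unbiased, nonnegative estimate.

```python
import numpy as np

def exp3_abs_round(weights, abstains, losses, cost, eta, rng):
    """One illustrative round (hypothetical helper, not the paper's pseudocode).

    weights  : positive expert weights, shape (K,)
    abstains : boolean mask, True where the expert abstains on the current input
    losses   : zero-one losses of the experts (only used where they predict)
    """
    p = weights / weights.sum()
    played = rng.choice(len(p), p=p)

    # Probability that the label is revealed = mass on predicting experts.
    predict_mass = p[~abstains].sum()
    label_seen = not abstains[played]

    est = np.zeros_like(p)
    est[abstains] = cost  # always observed, so no importance weighting is needed
    if label_seen:
        # Importance-weighted estimate for the predicting experts.
        est[~abstains] = losses[~abstains] / predict_mass

    new_weights = weights * np.exp(-eta * est)
    return played, est, new_weights
```

When the played expert abstains, the predicting experts receive a zero estimate for that round, which is what keeps the estimate unbiased in expectation over the draw of the played expert.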
Proof of Theorem 1.
By applying the standard regret bound of Hedge (e.g., (Bubeck and Cesa-Bianchi, 2012)) to distributions generated by exp3abs and to the nonnegative loss estimates , the following holds:
(2) 
for any fixed . Using the fact that and , we can write
For each , we can split the nodes of into the two subsets and where if a node is abstaining at time then , and otherwise . Thus, for any round , we can write
The first inequality holds since if is an abstaining expert at time , we know that and , while for the accepting experts we know that anyway. The second inequality holds because if is an accepting expert, we have . Combining this inequality with (2) concludes the proof.
B.2 Infinite arm setting
Here, the input space is assumed to be totally bounded, so that there exists a constant such that, for all , can be covered with at most balls of radius . Let be a shorthand for , the range space of the pairs . An covering of with respect to the Euclidean distance on has size for some constant .
The online learning scenario for the loss under the abstention setting’s feedback graphs is as follows. Given an unknown sequence of pairs , for every round :
1. The environment reveals input ;
2. The learner selects an action and incurs loss ;
3. The learner obtains feedback from the environment.
Our algorithm is described as Algorithm 4. The algorithm essentially works as follows. At each round , if a new incoming input is not contained in any existing ball generated so far, then a new ball centered at is created, and a new instance of exp3abs is allocated to handle . Otherwise, the exp3abs instance associated with the closest input so far is used. Each allocated exp3abs instance operates on the discretized action space .
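The ball-allocation step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than Algorithm 4 itself: the class name, the `make_learner` factory (standing in for allocating a fresh exp3abs instance), and the use of the Euclidean norm with a fixed radius are all hypothetical choices.

```python
import numpy as np

class BallAllocator:
    """Greedy covering that routes each input to a local learner (hypothetical helper)."""

    def __init__(self, radius, make_learner):
        self.radius = radius
        self.make_learner = make_learner  # factory for a new local learner instance
        self.centers = []
        self.learners = []

    def route(self, x):
        """Return (ball index, learner) responsible for input x."""
        x = np.asarray(x, dtype=float)
        if self.centers:
            dists = np.linalg.norm(np.stack(self.centers) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= self.radius:
                # x falls inside an existing ball: reuse the closest center's learner.
                return nearest, self.learners[nearest]
        # x is not covered by any existing ball: open a new one centered at x.
        self.centers.append(x)
        self.learners.append(self.make_learner())
        return len(self.centers) - 1, self.learners[-1]
```

Because a new ball is opened only when an input lies farther than the radius from every existing center, the centers are pairwise separated by more than the radius, so their number is controlled by the covering properties of the input space.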
Consider the function
where is the Lipschitz variant of the 0/1-loss mentioned in Section 3 of the main text (Figure 2 (a)). For any fixed , the function is Lipschitz when viewed as a function of , and is Lipschitz for any fixed when viewed as a function of . Hence
so that is Lipschitz w.r.t. the Euclidean distance on . Furthermore, a quick comparison to the abstention loss
reveals that (recall Figure 2 (b) in the main text):

is an upper bound on , i.e.,

approximates in that
(3)
With the above properties of at hand, we are ready to prove Theorem 2.
Proof of Theorem 2.
On each ball that Contexp3abs allocates during its online execution, Theorem 1 supplies the following regret guarantee for the associated instance of exp3abs:
where is the number of points falling into ball . Now, taking into account that is Lipschitz, and that the functions and are assumed to be Lipschitz on , a direct adaptation of the proof of Theorem 1 in (Cesa-Bianchi et al., 2017) gives the bound
being the maximum number of balls created by Contexp3abs. Using and setting yields
Next, optimizing for by setting (and disregarding