
# Online Learning with Abstention

Corinna Cortes    Giulia DeSalvo    Claudio Gentile    Mehryar Mohri    Scott Yang†

†Work done at the Courant Institute of Mathematical Sciences.
###### Abstract

We present an extensive study of a key problem in online learning where the learner can opt to abstain from making a prediction, at a certain cost. In the adversarial setting, we show how existing online algorithms and guarantees can be adapted to this problem. In the stochastic setting, we first point out a bias problem that limits the straightforward extension of algorithms such as ucb-n to this context. Next, we give a new algorithm, ucb-gt, that exploits historical data and time-varying feedback graphs. We show that this algorithm benefits from more favorable regret guarantees than a natural extension of ucb-n. We further report the results of a series of experiments demonstrating that ucb-gt largely outperforms that extension of ucb-n, as well as other standard baselines.

*Keywords:* online learning, abstention option, feedback graphs

## 1 Introduction

We consider an online learning scenario, prevalent in many applications, where the learner is granted the option of abstaining from making a prediction, at a certain cost. For example, in the classification setting, at each round, the learner can choose to make a prediction and incur a standard zero-one misclassification cost, or elect to abstain, in which case she incurs an abstention cost, typically less than one. Abstention can thus represent an attractive option to avoid a higher cost of misclassification. Note, however, that when the learner abstains, she does not receive the true label (correct class), which results in a loss of information.

This scenario of online learning with abstention is relevant to many real-life problems. As an example, consider the scenario where a doctor can choose to make a diagnosis based on the current information available about a patient, or abstain and request further laboratory tests, which can represent both a time delay and a financial cost. In this case, the abstention cost is usually substantially lower than that of a wrong diagnosis. The online model is appropriate since it captures the gradual experience a doctor gains by testing, examining and following new patients.

Another instance of this problem appears in the design of spoken-dialog applications such as those in modern personal assistants. Each time the user asks a question, the assistant can either offer a direct response to the question, at the risk of providing an inaccurate response, or choose to say “I am sorry, I do not understand?”, which results in a longer and thereby more costly dialog requesting the user to reformulate his question. Similar online learning problems arise in the context of self-driving cars where, at each instant, the assistant must determine whether to continue steering the car or return the control to the driver. Online learning with abstention also naturally models many problems arising in electronic commerce platforms such as an Ad Exchange, an online platform set up by a publisher where several advertisers bid in order to compete for an ad slot, the abstention cost being the opportunity loss of not bidding for a specific ad slot.

In the batch setting, the problem of learning with abstention has been studied in a number of publications, starting with (Chow, 1957, 1970). Its theoretical aspects have been analyzed by several authors in the last decade. El-Yaniv and Wiener (2010, 2011) studied the trade-off between the coverage and accuracy of classifiers. Bartlett and Wegkamp (2008) introduced a loss function including explicitly an abstention cost and gave a consistency analysis of a surrogate loss that they used to derive an algorithm. More recently, Cortes et al. (2016b, a) presented a comprehensive study of the problem, including an analysis of the properties of a corresponding abstention (or rejection) loss with a series of theoretical guarantees and algorithmic results both for learning with kernel-based hypotheses and for boosting.

This paper presents an extensive study of the problem of online learning with abstention, in both the adversarial and the stochastic settings. We consider the common scenario of prediction with expert advice (Littlestone and Warmuth, 1994) and adopt the same general abstention loss function as in (Cortes et al., 2016b), with each expert formed by a pair made of a predictor and an abstention function.

A key aspect of the problem we investigate, which makes it distinct from both batch learning with abstention, where labels are known for all training points, and standard online learning (in the full information setting) is the following: if the algorithm abstains from making a prediction for the input point received at a given round, the true label of that point is not revealed. As a result, the loss of the experts that would have instead made a prediction on that point cannot be determined at that round. Thus, we are dealing with an online learning scenario with partial feedback. If the algorithm chooses to predict, then the true label is revealed and the losses of all experts, including abstaining ones, are known. But, if the algorithm elects to abstain, then only the losses of the abstaining experts are known, all of them being equal to the same abstention cost.

As we shall see, our learning problem can be cast as a specific instance of online learning with a feedback graph, a framework introduced by Mannor and Shamir (2011) and later extensively analyzed by several authors (Caron et al., 2012; Alon et al., 2013, 2014, 2015; Kocák et al., 2014; Neu, 2015; Cohen et al., 2016). In our context, the feedback graph varies over time, a scenario for which most of the existing algorithms and analyses (specifically, in the stochastic setting) do not readily apply. Our setting is distinct from the KWIK (knows what it knows) framework of Li et al. (2008) and its later extensions, though there are some connections, as discussed in Appendix A.

Our contribution can be summarized as follows. In Section 3, we analyze an adversarial setting both in the case of a finite family of experts and that of an infinite family. We show that the problem of learning with abstention can be cast as that of online learning with a time-varying feedback graph tailored to the problem. In the finite case, we show how ideas from (Alon et al., 2014, 2015) can be extended and combined with this time-varying feedback graph to devise an algorithm, exp3-abs, that benefits from favorable guarantees. In turn, exp3-abs is used as a subroutine for the infinite case where we show how a surrogate loss function can be carefully designed for the abstention loss, while maintaining the same partial observability. We use the structure of this loss function to extend ContextualExp3 (Cesa-Bianchi et al., 2017) to the abstention scenario and prove regret guarantees for its performance.

In Section 4, we shift our attention to the stochastic setting. Stochastic bandits with a fixed feedback graph have been previously studied by Caron et al. (2012) and Cohen et al. (2016). We first show that an immediate extension of these algorithms to the time-varying graphs in the abstention scenario faces a technical bias problem in the estimation of the expert losses. Next, we characterize a set of feedback graphs that can circumvent this bias problem in the general setting of online learning with feedback graphs. We further design a new algorithm, ucb-gt, whose feedback graph is estimated based on past observations. We prove that the algorithm admits more favorable regret guarantees than the ucb-n algorithm (Caron et al., 2012). Finally, in Section 5 we report the results of several experiments with both artificial and real-world datasets demonstrating that ucb-gt in practice significantly outperforms an unbiased, but limited, extension of ucb-n, as well as a standard bandit baseline, like UCB (Auer et al., 2002a).

## 2 Learning Problem

Let $\mathcal{X}$ denote the input space (e.g., $\mathcal{X}$ is a bounded subset of $\mathbb{R}^d$). We denote by $\mathcal{H}$ a family of predictors $h \colon \mathcal{X} \to \mathbb{R}$, and consider the familiar binary classification problem where the loss of $h$ on a labeled pair $z = (x, y) \in \mathcal{X} \times \{-1, +1\}$ is defined by either the 0/1-loss $\ell(y, h(x)) = 1_{yh(x) \le 0}$, or some Lipschitz variant thereof (see Section 3). In all cases, we assume $\ell \in [0, 1]$. We also denote by $\mathcal{R}$ a family of abstention functions $r \colon \mathcal{X} \to \mathbb{R}$, with $r(x) \le 0$ indicating an abstention on $x$ (or that $x$ is rejected), and $r(x) > 0$ that $x$ is predicted upon (or that $x$ is accepted).

We consider a specific online learning scenario whose regime lies between bandit and full information, sometimes referred to as bandit with side-information (e.g., Mannor and Shamir (2011); Caron et al. (2012); Alon et al. (2013, 2014, 2015); Kocák et al. (2014); Neu (2015); Cohen et al. (2016)). In our case, the arms are pairs made of a predictor function $h \in \mathcal{H}$ and an abstention function $r$ in a given family $\mathcal{R}$. We will denote by $\xi = (h, r)$, with $h \in \mathcal{H}$ and $r \in \mathcal{R}$, the elements of $\mathcal{E} \subseteq \mathcal{H} \times \mathcal{R}$. In fact, depending on the setting, $\mathcal{E}$ may be finite or (uncountably) infinite. Given $h$, one natural choice for the associated abstention function is a confidence-based abstention function of the form $r(x) = |h(x)| - \theta$, for some threshold $\theta > 0$. Yet, more general pairs can be considered here. This provides an important degree of flexibility in the design of algorithms where abstentions are allowed, as shown in (Cortes et al., 2016b, a). Appendix A presents a concrete example illustrating the benefits of learning with these pairs of functions.

The online learning protocol is described as follows. The set $\mathcal{E}$ is known to the learning algorithm beforehand. At each round $t = 1, \ldots, T$, the online algorithm receives an input $x_t \in \mathcal{X}$ and chooses (possibly at random) an arm (henceforth also called “expert” or “pair”) $\xi_t = (h_t, r_t) \in \mathcal{E}$. If the inequality $r_t(x_t) \le 0$ holds, then the algorithm abstains and incurs as loss an abstention cost $c(x_t)$. Otherwise, it predicts based on the sign of $h_t(x_t)$, receives the true label $y_t \in \{-1, +1\}$, and incurs the loss $\ell(y_t, h_t(x_t))$. Thus, the overall abstention loss of expert $\xi = (h, r)$ on the labeled pair $z = (x, y)$ is defined as follows:

$$L(\xi, z) = \ell(y, h(x))\, 1_{r(x) > 0} + c(x)\, 1_{r(x) \le 0}. \tag{1}$$

For simplicity, we will assume throughout that the abstention cost is a (known) constant $c \in [0, 1]$, independent of $x$, though all our results can be straightforwardly extended to the case where $c$ is a (Lipschitz) function of $x$, which is indeed desirable in some applications.
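As a concrete illustration, the abstention loss of Eq. (1) with a constant cost can be sketched as follows; the particular `(h, r)` pair and cost value below are hypothetical, not taken from the paper.

```python
# Sketch of the abstention loss in Eq. (1), assuming a constant cost c
# and the 0/1 misclassification loss; the (h, r) callables are illustrative.

def abstention_loss(h, r, x, y, c=0.3):
    """L(xi, z) = l(y, h(x)) 1[r(x) > 0] + c 1[r(x) <= 0]."""
    if r(x) > 0:                      # expert predicts
        return float(y * h(x) <= 0)   # 0/1 loss on the revealed label
    return c                          # expert abstains: fixed cost, no label

# Confidence-based expert: predict sign(h(x)), abstain when |h(x)| <= theta.
h = lambda x: 2.0 * x - 1.0
r = lambda x: abs(h(x)) - 0.5         # abstain on low-confidence inputs

print(abstention_loss(h, r, x=0.9, y=+1))   # confident and correct
print(abstention_loss(h, r, x=0.6, y=+1))   # low confidence: abstains
```

Note that in the abstaining branch the label `y` is never touched, mirroring the fact that no label is revealed on abstention.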

Our problem can be naturally cast as an online learning problem with side information in the form of a feedback graph. Online learning with a feedback graph is a general framework that covers a variety of problems with partial information, including the full information scenario, where the graph is fully connected, and the bandit scenario, where all vertices admit only self-loops and are disconnected (Alon et al., 2013, 2014). In our case, we have a directed graph $G_t = (V, E_t)$ that depends on the instance $x_t$ received by the algorithm at round $t$. Here, $V$ denotes the finite set of vertices of this graph, which, in the case of a finite set of arms, coincides with the set of experts $\mathcal{E}$, while $E_t$ denotes the set of directed edges at round $t$. The directed edge $(i, j)$ is in $E_t$ if the loss of expert $\xi_j$ is observed when expert $\xi_i$ is selected by the algorithm at round $t$. In our problem, if the learner chooses to predict at round $t$ (i.e., if $r_t(x_t) > 0$), then she observes the loss of all experts, since the label $y_t$ is revealed to her. If instead she abstains at round $t$ (i.e., if $r_t(x_t) \le 0$), then she only observes the losses of those experts that are abstaining in that round, that is, the set of $\xi = (h, r)$ such that $r(x_t) \le 0$, since for all such $\xi$ the loss equals the abstention cost. Notice that in both cases the learner can observe the loss of her own action. Thus, the feedback graph we are operating with is a nearly fully connected directed graph with self-loops, except that it admits only one-way edges from predicting to abstaining vertices (see Figure 1 for an example). Observe also that the feedback graph is fully determined by $x_t$.
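The edge rule just described can be made concrete with a few lines of code; representing each expert by its abstention value $r_i(x_t)$ at the current input is an illustrative simplification.

```python
# Build the round-t abstention feedback graph described above:
# edge (i, j) means "selecting expert i reveals the loss of expert j at x_t".

def feedback_edges(r_values):
    """r_values[i] = r_i(x_t); returns the set of directed edges E_t."""
    K = len(r_values)
    edges = set()
    for i in range(K):
        if r_values[i] > 0:
            # i predicts: the label y_t is revealed, every loss is observed
            edges.update((i, j) for j in range(K))
        else:
            # i abstains: only abstaining experts' losses (= c) are observed
            edges.update((i, j) for j in range(K) if r_values[j] <= 0)
    return edges

# Three experts at some x_t: expert 0 predicts, experts 1 and 2 abstain.
E_t = feedback_edges([0.4, -0.1, -0.7])
print(sorted(E_t))
```

The result exhibits exactly the structure in Figure 1: every vertex has a self-loop, and the only missing edges are those from abstaining vertices to predicting ones.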

We will consider both an adversarial setting (Section 3), where no distributional assumption is made about the sequence $z_t = (x_t, y_t)$, $t = 1, \ldots, T$, and a stochastic setting (Section 4), where $z_t$ is assumed to be drawn i.i.d. from some unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \{-1, +1\}$. For both settings, we measure the performance of an algorithm by its (pseudo-)regret $R_T$, defined as $R_T = \mathbb{E}\big[\sum_{t=1}^T L(\xi_t, z_t)\big] - \min_{\xi \in \mathcal{E}} \mathbb{E}\big[\sum_{t=1}^T L(\xi, z_t)\big]$, where the expectation is taken both with respect to the algorithm’s choice of actions $\xi_t$ and, in the stochastic setting, the random draw of the $z_t$s.

In the stochastic setting, we will be mainly concerned with the case where $\mathcal{E}$ is a finite set of experts $\{\xi_1, \ldots, \xi_K\}$. We then denote by $\mu_i$ the expected loss of expert $\xi_i$, $\mu_i = \mathbb{E}_z[L(\xi_i, z)]$, by $\mu^*$ the expected loss of the best expert, $\mu^* = \min_{i \in [K]} \mu_i$, and by $\Delta_i$ the loss gap to the best, $\Delta_i = \mu_i - \mu^*$. In the adversarial setting, we will analyze both the finite and infinite expert scenarios. In the infinite case, since $L$ is non-convex in the relevant parameters (Eq. (1)), further care is needed.

## 3 Adversarial Setting

As a warm-up, we start with the adversarial setting with finitely-many experts. Following ideas from Alon et al. (2014, 2015), we design an online algorithm for the abstention scenario by combining standard finite-arm bandit algorithms, like exp3 (Auer et al., 2003), with the feedback graph of Section 2. We call the resulting algorithm exp3-abs (exp3 with abstention). The algorithm is a variant of exp3 where the importance weighting scheme to achieve unbiased loss estimates is based on the probability of the loss of an expert being observed as opposed to that of an expert being selected — see Appendix B (Algorithm 3). The following guarantee holds for this algorithm.

###### Theorem 1

Let exp3-abs be run with learning rate $\eta > 0$ over a set of $K$ experts. Then, the algorithm admits the following regret guarantee after $T$ rounds:

$$R_T(\textsc{exp3-abs}) \le \frac{\log K}{\eta} + \frac{\eta T (c^2 + 1)}{2}.$$

In particular, if exp3-abs is run with $\eta = \sqrt{2 \log K / (T(c^2+1))}$, then $R_T(\textsc{exp3-abs}) \le \sqrt{2 (c^2+1)\, T \log K}$.

The proof of this result, as well as all other proofs, is given in the appendix. The dependency of the bound on the number of experts $K$ is clearly more favorable than the standard bound for exp3 ($\sqrt{T \log K}$ instead of $\sqrt{TK \log K}$). Theorem 1 is in fact reminiscent of what one can achieve using the contextual-bandit algorithm EXP4 (Auer et al., 2002b) run on $K$ experts, each one having two actions.
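One round of the exponential-weights update underlying exp3-abs can be sketched as follows. The details here are an assumption-laden reconstruction: losses are importance-weighted by the probability that they are *observed* under the abstention feedback graph (an abstaining expert's loss is always observed, a predicting expert's loss is observed with the probability that the sampled expert predicts).

```python
import math
import random

# Sketch of one round of exp3-abs (assumed details): exponential weights
# where the loss estimate of expert j is importance-weighted by the
# probability that j's loss is OBSERVED, not that j is selected.

def exp3_abs_round(weights, r_values, losses, eta):
    """One round; r_values[j] = r_j(x_t), losses[j] = L(xi_j, z_t)."""
    total = sum(weights)
    p = [w / total for w in weights]
    i_t = random.choices(range(len(p)), weights=p)[0]   # sampled expert
    predicted = r_values[i_t] > 0                       # did we predict?
    q_predict = sum(p[j] for j in range(len(p)) if r_values[j] > 0)
    for j in range(len(p)):
        observed = predicted or r_values[j] <= 0        # feedback graph
        if observed:
            # abstaining experts are observed with probability 1;
            # predicting experts with probability q_predict
            q = 1.0 if r_values[j] <= 0 else q_predict
            weights[j] *= math.exp(-eta * losses[j] / q)
    return i_t, weights

random.seed(0)
w = [1.0, 1.0, 1.0]
i_t, w = exp3_abs_round(w, r_values=[0.2, -0.3, 0.5],
                        losses=[1.0, 0.3, 0.0], eta=0.1)
print(i_t, w)
```

Since abstaining experts always have observation probability one, their estimates carry no variance, which is the source of the improved dependency on $K$.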

We now turn our attention to the case of an uncountably infinite set of experts $\mathcal{E}$. To model this more general framework, one might be tempted to focus on parametric classes of functions $h$ and $r$, e.g., the family of linear functions

$$\{(h, r) : h(x) = w^\top x,\ r(x) = |w^\top x| - \theta,\ w \in \mathbb{R}^d,\ \theta > 0\},$$

introduce some convex surrogate of the abstention loss (1), and work in the parametric space of $(w, \theta)$ through some Bandit Convex Optimization technique (e.g., (Hazan, 2016)). Unfortunately, this approach is not easy to put in place, since the surrogate loss not only needs to ensure convexity and some form of calibration, but also the ability for the algorithm to observe the loss of its own action (the self-loops in the graph of Figure 1).

We have been unable to get around this problem by just resorting to convex surrogate losses (and we strongly suspect that it is not possible), and in what follows we instead introduce a surrogate abstention loss which is Lipschitz but not convex. Moreover, we take the more general viewpoint of competing with pairs of Lipschitz functions with bounded Lipschitz constant. Let us then consider the version of the abstention loss (1) with , where is the 0/1-loss with slope at the origin,  (see Figure 2 (a)), and the class of experts . Here, functions and in the definition of are assumed to be -Lipschitz with respect to an appropriate distance on , for some constant which determines the size of the family .

Using ideas from (Cesa-Bianchi et al., 2017), we present an algorithm that approximates the action space by a finite cover while using the structure of the abstention setting. The crux of the problem is to define a Lipschitz function that upper bounds the abstention loss while maintaining the same feedback assumptions, namely the feedback graph given in Figure 1. One Lipschitz function that precisely solves this problem is the following:

This surrogate is plotted in Figure 2(b). Notice that this function is consistent with the feedback requirements of Section 2: an abstention implies that the surrogate loss is known to the algorithm (i.e., it is independent of $y$) for all experts that also abstain, while a prediction gives complete knowledge of the surrogate loss for all experts, since $y$ is then observed.

We can then adapt the machinery from (Cesa-Bianchi et al., 2017) so as to apply a contextual version of exp3-abs to the sequence of losses . The algorithm adaptively covers with balls of a fixed radius , each ball hosting an instance of exp3-abs. We call this algorithm Contexp3-abs – see Appendix B.2 for details.

###### Theorem 2

Consider the abstention loss

$$L(\xi, z) = f_\gamma(-y h(x))\, 1_{r(x) > 0} + c\, 1_{r(x) \le 0},$$

and let the set of experts be made of pairs of Lipschitz functions as described above. If Contexp3-abs is run with parameter $\gamma$ and an appropriate learning rate (see Appendix B), then it admits the following regret guarantee:

$$R_T(\textsc{Contexp3-abs}) \le \tilde{O}\Big(T^{\frac{d+1}{d+2}}\, \gamma^{-\frac{d}{d+2}}\Big) + M^*_T(\gamma),$$

where $M^*_T(\gamma)$ is the number of rounds $t$ such that the margin of the best-in-class expert is at most $\gamma$.

In the above, the $\tilde{O}$ notation hides constants and logarithmic factors. Contexp3-abs is also computationally efficient, thereby providing a compelling solution to the infinite-armed case of online learning with abstention.

## 4 Stochastic setting

We now turn to studying the stochastic setting. As pointed out in Section 2, the problem can be cast as an instance of online learning with time-varying feedback graphs $G_t$. Thus, a natural method for tackling the problem would be to extend existing algorithms designed for the stochastic setting with feedback graphs to our abstention scenario (Cohen et al., 2016; Caron et al., 2012). We cannot, however, benefit from the algorithm of Cohen et al. (2016) in our scenario, because at the heart of its design and theoretical guarantees lies the assumption that the graphs and losses are independent. The dependency of the feedback graphs on the observations $x_t$, which also define the losses, is precisely a property that we wish to exploit in our scenario.

An alternative is to extend the ucb-n algorithm of Caron et al. (2012), for which the authors provide gap-based regret guarantees. This algorithm is defined for a stochastic setting with an undirected feedback graph that is fixed over time. The algorithm can be straightforwardly extended to the case of directed time-varying feedback graphs (see Algorithm 1). We will denote that extension by ucb-nt to explicitly differentiate it from ucb-n. Let $N_t(i)$ denote the set of out-neighbors of vertex $i$ in the directed graph at time $t$, i.e., the set of vertices that are destinations of an edge from $i$. Then, as with ucb-n, the algorithm updates, at each round $t$, the upper-confidence bound of every expert for which a feedback is received (those in $N_t(i_t)$, where $i_t$ is the expert selected), as opposed to updating only the upper-confidence bound of the expert selected, as in the standard ucb of Auer et al. (2002a).
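The shared-update rule can be sketched as follows; this is a minimal illustration of the neighborhood update, not the paper's Algorithm 1 (the confidence width used here is the standard UCB1 form, an assumption).

```python
import math

# Sketch of ucb-nt (assumed details): UCB over losses where, after
# selecting expert i_t, the empirical means of ALL experts in the
# out-neighborhood N_t(i_t) are updated, not just i_t's.

class UCBNT:
    def __init__(self, K):
        self.n = [0] * K        # observation counts
        self.mean = [0.0] * K   # empirical mean losses
        self.t = 0

    def select(self):
        self.t += 1
        # optimism for losses: subtract the confidence width, pick the min
        idx = [m - math.sqrt(2 * math.log(self.t) / n) if n > 0 else -math.inf
               for m, n in zip(self.mean, self.n)]
        return min(range(len(idx)), key=idx.__getitem__)

    def update(self, out_neighbors, losses):
        # out_neighbors: N_t(i_t); losses[j] is defined for each j in it
        for j in out_neighbors:
            self.n[j] += 1
            self.mean[j] += (losses[j] - self.mean[j]) / self.n[j]

alg = UCBNT(3)
i = alg.select()                 # unobserved experts are tried first
alg.update({i, (i + 1) % 3}, {i: 0.4, (i + 1) % 3: 0.6})
print(i, alg.mean)
```

The only difference from vanilla UCB is the `update` loop ranging over the whole out-neighborhood rather than the single pulled arm.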

In the context of learning with abstention, the natural feedback graph at time $t$ depends on the observation $x_t$ and varies over time. Can we extend the regret guarantees of Caron et al. (2012) to ucb-nt with such graphs? We will show in Section 4.1 that vanishing regret guarantees do not hold for ucb-nt run with the graphs $G_t$. This is because of a fundamental estimation bias problem that arises when the graph at time $t$ depends on the observation $x_t$. This issue affects more generally any natural method using these graphs. Nevertheless, we will show in Section 4.2 that ucb-nt does benefit from favorable guarantees, provided the feedback graph it uses at round $t$ is replaced by one that only depends on events up to time $t-1$.

### 4.1 Bias problem

Assume there are two experts: $\xi_1$ (red) and $\xi_2$ (blue) (see Figure 3). On the right half of the interval, the red expert is abstaining and incurring a loss $c$, whereas the blue expert is never abstaining. Assume that the probability mass is quasi-uniform over the interval but with slightly more mass over the left half. The algorithm may then start out by observing points in this region. Here, both experts accept and the algorithm obtains error estimates corresponding to the solid red and blue lines. When the algorithm observes a point in the right half, it naturally selects the red abstaining expert since it admits a better current estimated loss. However, on that half, the red expert is worse than the blue expert. Furthermore, it is abstaining and thus providing no updates for the blue expert (which is instead predicting). Hence, the algorithm continues to maintain an estimate of the blue expert’s loss at the level of the blue solid line; it then continues to select the red expert and incurs a high regret.

For the sake of clarity, we did not introduce specific real values for the expected loss of each expert on each of the half intervals, but that can be done straightforwardly. We have also verified experimentally with such values that the bias problem just pointed out indeed leads to poor regret for ucb-nt.

This simple example shows that, unlike the adversarial scenario (Section 3), the graph $G_t$ here cannot depend on the input $x_t$, and that, in general, the indiscriminate use of feedback graphs may result in biased loss observations. On the other hand, we know that if we were to avoid using feedback graphs at all (which is always possible using ucb), we would always be able to define unbiased loss estimates. A natural question is then: can we construct time-varying feedback graphs that lead to unbiased loss observations? In the next section, we show how to design such a sequence of auxiliary feedback graphs, which in turn allows us to extend ucb-nt to the setting of time-varying feedback graphs for general loss functions. Under this construction, we can achieve unbiased empirical estimates of the average losses of the experts, which will allow us to apply standard concentration bounds in the analysis of the algorithm.
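The bias at work here can be reproduced in a few lines. In this toy Monte Carlo (all numbers hypothetical), the blue expert's estimate is only updated when the input lands in the left half, because on the right half the chosen expert abstains and blocks the label; its empirical mean then tracks the conditional loss on the left half rather than the true mean.

```python
import random

# Toy illustration of the estimation bias of Section 4.1: an expert that
# predicts everywhere, but whose feedback is blocked whenever x lands in
# the right half, ends up with an estimate of its CONDITIONAL loss.

random.seed(1)
TRUE_LOSS = {"left": 0.4, "right": 0.1}    # blue's expected loss per half

n, est = 0, 0.0
for _ in range(20000):
    x = random.random()
    if x <= 0.5:                            # feedback blocked on the right
        loss = float(random.random() < TRUE_LOSS["left"])
        n += 1
        est += (loss - est) / n             # running mean of observed losses

true_mean = 0.5 * TRUE_LOSS["left"] + 0.5 * TRUE_LOSS["right"]  # 0.25
print(f"biased estimate ~ {est:.3f}, true mean = {true_mean}")
```

The estimate converges near 0.4 while the true expected loss is 0.25, so a comparison based on it systematically misranks the experts.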

### 4.2 Time-varying graphs for ucb-nt

We now show that ucb-nt benefits from favorable guarantees, so long as the feedback graph it uses at time $t$ depends only on events up to time $t-1$. This extension works for general bounded losses and does not only apply to our specific abstention loss $L$.

So, let us assume that the feedback graph in round $t$ (and the associated out-neighborhoods) in Algorithm 1 only depends on the observed losses and inputs up to round $t-1$, and let us denote this feedback graph by $\tilde{G}_t = (V, \tilde{E}_t)$, so as not to get confused with $G_t$. Under this assumption, we can derive strong regret guarantees for ucb-nt with time-varying graphs using a newly introduced notion of admissible coverings. For the feedback graph $\tilde{G}_t$ at time $t$, let $\mathcal{C}$ be a collection of subsets of $V$ covering $V$, such that, for every $C \in \mathcal{C}$, $i, j \in C$ means that $(i, j) \in \tilde{E}_t$ and $(j, i) \in \tilde{E}_t$. We call such a collection an admissible covering of $\tilde{G}_t$. Let $\mathcal{F}_t$ denote the set of all admissible coverings of $\tilde{G}_t$, and let $\mathcal{F} = \bigcap_{t=1}^{T} \mathcal{F}_t$, i.e., the collection of shared admissible coverings that apply across all time steps. Then, by construction, for any $\mathcal{C} \in \mathcal{F}$ and $C \in \mathcal{C}$, $i, j \in C$ means that $(i, j) \in \tilde{E}_t$ and $(j, i) \in \tilde{E}_t$ for every $t$. Note that the definition of $\mathcal{F}$ is equivalent to considering the set of edges that are shared across all $\tilde{G}_t$, and then considering admissible coverings over the graph induced by these shared edges. Moreover, since every vertex admits a self-loop in every $\tilde{G}_t$, the covering into singletons belongs to every $\mathcal{F}_t$, so $\mathcal{F}$ is always non-empty.
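The admissible-covering condition above can be checked mechanically; the helper below is a small illustrative sketch (graphs are represented as edge sets over a vertex list).

```python
# Sketch of the admissible-covering condition: a covering of the vertex
# set is admissible for a graph iff any two distinct vertices sharing a
# set are connected by edges in BOTH directions.

def is_admissible(covering, edges, vertices):
    covered = set().union(*covering) if covering else set()
    if covered != set(vertices):
        return False
    return all((i, j) in edges and (j, i) in edges
               for C in covering for i in C for j in C if i != j)

def shared_admissible(covering, edge_sets, vertices):
    # membership in F: admissible for every feedback graph in the sequence
    return all(is_admissible(covering, E, vertices) for E in edge_sets)

V = [0, 1, 2]
E1 = {(i, i) for i in V} | {(0, 1), (1, 0), (1, 2)}
E2 = {(i, i) for i in V} | {(0, 1), (1, 0)}
print(shared_admissible([{0, 1}, {2}], [E1, E2], V))   # doubly-linked pair
print(shared_admissible([{1, 2}, {0}], [E1, E2], V))   # (2, 1) missing
```

As the text notes, the singleton covering passes for any graph sequence with self-loops, so the shared family is never empty.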

###### Theorem 3

Assume that, for all $t$, the feedback graph $\tilde{G}_t$ depends only on information up to time $t-1$. Then, the regret of ucb-nt is bounded as follows:

$$O\left(\mathbb{E}\left[\min_{\mathcal{C} \in \mathcal{F}} \sum_{C \in \mathcal{C}} \frac{\max_{j \in C} \Delta_j}{\min_{j \in C} \Delta_j^2} \log(T) + K\right]\right).$$

The theorem gives a bound on the regret based on any admissible covering that applies to every feedback graph seen during learning, and the minimum chooses the admissible covering with the smallest regret.

Theorem 3 can be interpreted as an extension of Theorem 2 in Caron et al. (2012) to time-varying feedback graphs. Its proof involves showing that the use of feedback graphs that depend only on information up to can result in unbiased loss estimates, and it considers shared admissible coverings that apply across the sequence of feedback graphs to derive a time-varying bound that leverages the shared updates from the graph.

Moreover, the bound illustrates that if the feedback graphs in a problem admit a shared admissible covering $\mathcal{C}$ with a small number of elements (e.g., if the feedback graphs can be decomposed into a small number of components that are fixed across time), then this bound can be up to a factor of $K/|\mathcal{C}|$ tighter than the bound guaranteed by the standard UCB algorithm. Moreover, this regret guarantee is always at least as favorable as that of the standard UCB, since the (trivial) admissible covering that splits $V$ into singletons is always an admissible covering of every $\tilde{G}_t$. Furthermore, note that if the feedback graph is fixed throughout all rounds and we interpret the doubly-directed edges as edges of an undirected graph, then $\mathcal{F}$ coincides with the set of admissible coverings of that undirected graph. Thus, we straightforwardly obtain the following result, which is comparable to Theorem 2 in (Caron et al., 2012).

###### Corollary 1

If the feedback graph is fixed over time, then the guarantee of Theorem 3 is upper-bounded by:

$$O\left(\min_{\mathcal{C}} \sum_{C \in \mathcal{C}} \frac{\max_{i \in C} \Delta_i}{\min_{i \in C} \Delta_i^2} \log(T) + K\right),$$

the outer minimum being over all admissible coverings of the fixed graph.

Caron et al. (2012) present matching lower bounds for the case of stochastic bandits with a fixed feedback graph. Since we can again design abstention scenarios with fixed feedback graphs, these bounds carry over to our setting.

Now, how can we use the results of this section to design an algorithm for the abstention scenario? The natural feedback graphs $G_t$ we discussed earlier are no longer applicable, since $G_t$ depends on $x_t$. Nevertheless, we will present two solutions to this problem. In Section 4.3, we present a solution with a fixed graph that closely captures the problem of learning with abstention. Next, in Section 4.4, we will show how to define and leverage a time-varying graph that is estimated based on past observations.

### 4.3 ucb-n with the subset feedback graph

In this section, we define a subset feedback graph that captures the most informative feedback in the problem of learning with abstention and yet is safe, in the sense that it does not depend on $x_t$. The definition of the graph is based on the following simple observation: if the abstention region associated with expert $\xi_i$ is a subset of that of expert $\xi_j$, then, whenever $\xi_i$ is selected at some round and is abstaining, so is $\xi_j$. For an example, see Figure 4 (top). Crucially, this implication holds regardless of the particular input point received in the region of abstention of $\xi_i$; and when $\xi_i$ is instead predicting, the true label is revealed and the loss of every expert is observed anyway. Thus, the set of vertices of the subset feedback graph is $\mathcal{E}$, and the graph admits an edge from $\xi_i$ to $\xi_j$ iff $\{x : r_i(x) \le 0\} \subseteq \{x : r_j(x) \le 0\}$. Since this graph does not vary with time, it trivially verifies the condition of the previous section. Thus, ucb-nt run with the subset feedback graph admits the regret guarantees of Theorem 3, where we only need to consider the set of admissible coverings of this fixed graph.
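For experts whose abstention regions are nested intervals of radii (as with the annulus experts used in the experiments of Section 5), the subset graph reduces to interval containment; the regions below are hypothetical.

```python
# Sketch of the fixed subset feedback graph: each expert's abstention
# region is represented as a radius interval [lo, hi], and an edge
# i -> j is added iff region(i) is contained in region(j), so
# "i abstains" implies "j abstains" for EVERY input x.

def subset_graph(regions):
    K = len(regions)
    return {(i, j) for i in range(K) for j in range(K)
            if regions[j][0] <= regions[i][0] and regions[i][1] <= regions[j][1]}

regions = [(0.2, 0.4), (0.1, 0.5), (0.3, 0.6)]   # hypothetical radii
E_sub = subset_graph(regions)
print(sorted(E_sub))
```

Because containment holds independently of the round's input, these edges never trigger the bias problem of Section 4.1.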

The example of Section 4.1 illustrated a bias problem in a special case where the feedback graphs were not subgraphs of the subset feedback graph. The following result shows more generally that feedback graphs not included in the subset feedback graph may result in catastrophic regret behavior.

###### Proposition 1

Assume that ucb-nt is run with feedback graphs that are not subgraphs of the subset feedback graph. Then, there exists a family of predictors, a Lipschitz loss function $\ell$ in (1), and a distribution over the $z_t$s for which ucb-nt incurs linear regret with arbitrarily high probability.

The proof of the proposition is given in Appendix C.3. In view of this result, no fixed feedback graph for ucb-nt can be more informative than the subset feedback graph. But how can we leverage past observations (up to time $t-1$) to derive a feedback graph that would be more informative than this simple subset graph? The next section provides a solution, based on feedback graphs estimated from past observations, and a new algorithm.

### 4.4 UCB-GT algorithm

We seek graphs that admit the subset feedback graph as a subgraph. We will show how certain types of edges can be safely added to it based on past observations. This leads to a new algorithm, ucb-gt (ucb with estimated time-varying graph), whose pseudocode is given in Algorithm 2.

As illustrated by Figure 4, the key idea of ucb-gt is to augment the subset graph with edges from $\xi_i$ to $\xi_j$ for which the subset property may not hold, but for which the implication “$\xi_i$ abstains $\Rightarrow$ $\xi_j$ abstains” holds with high probability over the choice of $x$, that is, the region $\{x : r_i(x) \le 0 \wedge r_j(x) > 0\}$ admits low probability. Of course, adding such an edge can cause the estimation bias of Section 4.1. But, if we restrict ourselves to cases where the probability of this region is upper bounded by some carefully chosen quantity that changes over rounds, the effect of this bias will be limited. In reverse, as illustrated in Figure 4, the resulting feedback graph can be substantially more beneficial, since it may have many more edges than the subset graph, hence leading to more frequent updates of the experts’ losses and more favorable regret guarantees. This benefit is further corroborated by our experimental results (Section 5).

Since we do not have access to these probabilities, we use instead empirical estimates computed from past observations. At time $t$, if expert $\xi_i$ is selected, we update expert $\xi_j$ whenever the estimated probability of the region above falls below the current threshold. If the chosen expert abstains while expert $\xi_j$ predicts and satisfies this condition, then we do not have access to the true label $y_t$. In that case, we optimistically update our empirical estimate as if the expert had incurred a loss of 0 at that round (Step (*) in Alg. 2).
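The resulting out-neighborhood rule can be sketched as follows; the estimate table `p_hat` and the threshold `alpha_t` are illustrative placeholders for the quantities maintained by Algorithm 2.

```python
# Sketch of the ucb-gt out-neighborhood (assumed details): start from the
# subset-graph edges and add an edge i -> j whenever the empirical
# probability of the "bad" event {r_i(x) <= 0 and r_j(x) > 0} -- the only
# event under which selecting an abstaining i hides j's loss -- stays
# below a round-dependent threshold alpha_t.

def ucb_gt_neighborhood(i, subset_edges, p_hat, K, alpha_t):
    """p_hat[(i, j)] estimates P[r_i(x) <= 0, r_j(x) > 0] from past x's."""
    return {j for j in range(K)
            if (i, j) in subset_edges or p_hat.get((i, j), 1.0) <= alpha_t}

subset_edges = {(0, 0), (0, 1), (1, 1)}
p_hat = {(0, 2): 0.01, (1, 0): 0.30, (1, 2): 0.02}
print(ucb_gt_neighborhood(0, subset_edges, p_hat, 3, 0.05))
print(ucb_gt_neighborhood(1, subset_edges, p_hat, 3, 0.05))
```

Unseen pairs default to probability 1 so that no edge is added without evidence, keeping the added bias under control.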

The feedback graph just described can be defined via the out-neighborhood of each vertex $i$: the subset-graph neighbors of $\xi_i$, together with those $\xi_j$ whose estimated probability of the region above is below the threshold. The following regret guarantee holds for ucb-gt.

###### Theorem 4

For any $t$, let the feedback graph be defined by the out-neighborhoods described above. Then, the regret of ucb-gt is bounded as follows:

$$O\left(\mathbb{E}\left[\min_{\mathcal{C} \in \mathcal{F}} \sum_{C \in \mathcal{C}} \frac{\max_{j \in C} \Delta_j}{\min_{j \in C} \Delta_j^2} \log(T) + K\right]\right).$$

Since the graph of ucb-gt has more edges than the subset feedback graph, it admits at least as many admissible coverings, which leads to a more favorable guarantee than that of ucb-nt run with the subset graph. The proof of this result differs from the standard UCB analysis and that of Theorem 3 in that it involves showing that the ucb-gt algorithm can adequately control the amount of bias introduced by the skewed loss estimates. The experiments in the next section provide an empirical validation of this theoretical comparison.

## 5 Experiments

In this section, we report the results of several experiments on ten datasets comparing ucb-gt, ucb-nt with the subset feedback graph, vanilla ucb (with no sharing of information across experts), as well as Full-Supervision, fs. fs is an algorithm that at each round chooses the expert with the smallest empirical abstention loss so far; even if this expert abstains, the algorithm receives the true label and can update the empirical abstention loss estimates for all experts. fs reflects an unrealistic and overly optimistic scenario that clearly falls outside the abstention setting, but it provides an upper bound on the best performance we may hope for.

We used the following eight datasets from the UCI data repository: HIGGS, phishing, ijcnn, covtype, eye, skin, cod-rna, and guide. We also used the CIFAR dataset from (Krizhevsky et al., 2009), where we extracted the first twenty-five principal components and used their projections as features, and a synthetic dataset of points drawn according to the uniform distribution. All the algorithms were tested on the same sets of generated experts. The experts were chosen in the following way. The predictors are hyperplanes centered at the origin whose normal vectors are drawn randomly from a standard Gaussian distribution with dimension matching the feature space of the dataset. The abstention functions are concentric annuli around the origin, and each predictor is paired with 21 such abstention functions. For a fixed set of experts, we first calculated the regret by averaging over five random draws of the data, where the best-in-class expert was determined in hindsight as the one with the minimum average cumulative abstention loss. We then repeated this experiment five times over different sets of experts and averaged the results. We report these results for several values of the abstention cost $c$.
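A hypothetical sketch of this expert construction (all specific constants are illustrative, not the paper's):

```python
import math
import random

# Hypothetical sketch of the experiments' expert construction: random
# hyperplanes through the origin paired with concentric-annulus
# abstention functions; the dimensions, counts, and radii are illustrative.

random.seed(0)
d = 25                                     # feature dimension
n_predictors = 4
radii = [0.1, 0.3, 0.5, 0.7, 0.9]

experts = []
for _ in range(n_predictors):
    w = [random.gauss(0.0, 1.0) for _ in range(d)]   # h(x) = w . x
    for a in radii:                        # abstain when a <= ||x|| <= a + 0.1
        experts.append((w, (a, a + 0.1)))

def abstains(expert, x):
    _, (lo, hi) = expert
    norm = math.sqrt(sum(v * v for v in x))
    return lo <= norm <= hi

x = [0.0] * d
x[0] = 0.15
print(len(experts), abstains(experts[0], x))
```

Because the annuli of a fixed predictor are not nested here, the subset graph over such experts is sparse, which is exactly the regime where ucb-gt's estimated extra edges pay off.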

Figure 5 shows the averaged regret with standard deviations across the five repetitions for the different algorithms as a function of the abstention cost $c$ for two datasets. In Appendix D, we present plots of the regret for all ten datasets. These results show that ucb-gt outperforms both ucb-nt and ucb on all datasets for all abstention cost values. Remarkably, ucb-gt’s performance is close to that of fs for most datasets, thereby implying that ucb-gt attains almost the best regret that we could hope for. We also find that ucb-nt performs better than the vanilla ucb.

Figure 5 also illustrates the fraction of points on which the chosen expert abstains, as well as the number of edges in the feedback graph as a function of the number of rounds. We only plot the number of edges for ucb-gt, since that is the only graph that varies with time. For both experiments depicted, and in general for the rest of the datasets, the number of edges for ucb-gt is between 1 million and 3 million, which is at least a factor of 5 more than for ucb-nt. fs enjoys the full information property, and its number of edges is fixed at 4 million (complete graph). The increased information sharing of ucb-gt is clearly a strong contributing factor to its improvement in regret relative to ucb-nt. In general, we find that, provided that the estimation bias is controlled, the higher the number of edges, the smaller the regret. Regarding the value of the cost $c$, as expected, we observe that the fraction of points on which the chosen expert abstains always decreases as $c$ increases, but also that this fraction depends on the dataset and the experts used.

Finally, Appendix D includes further experiments on different aspects of the problem. In particular, we tested how the number of experts, or a different choice of experts (confidence-based experts), affected the results. We also experimented with extreme abstention costs and, as expected, found the fraction of abstained points to be large for values of $c$ close to 0 and small for values of $c$ close to 1. In all of these additional experiments, ucb-gt outperformed ucb-nt.

## 6 Conclusion

We presented a comprehensive analysis of the novel setting of online learning with abstention, including algorithms with favorable guarantees in both the stochastic and adversarial scenarios, and extensive experiments demonstrating the performance of ucb-gt in practice. Our algorithms and analysis can be straightforwardly extended to similar problems, including the multi-class and regression settings, as well as other related scenarios, such as online learning with budget constraints. A key idea behind the design of our algorithms in the stochastic setting is to leverage the stochastic sequence of feedback graphs. This idea can perhaps be generalized and applied to other problems where time-varying feedback graphs naturally appear. Furthermore, our regret guarantees can instead be expressed in terms of the independence numbers of the time-varying graphs by proceeding as in (Lykouris et al., 2019).

## References

• N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren (2015) Online learning with feedback graphs: beyond bandits. JMLR.
• N. Alon, N. Cesa-Bianchi, C. Gentile, S. Mannor, Y. Mansour, and O. Shamir (2014) Nonstochastic multi-armed bandits with graph-structured feedback. CoRR.
• N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour (2013) From bandits to experts: a tale of domination and independence. In NIPS.
• P. Auer, N. Cesa-Bianchi, and P. Fischer (2002a) Finite-time analysis of the multi-armed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
• P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002b) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77.
• P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2003) The nonstochastic multi-armed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77.
• P. Bartlett and M. Wegkamp (2008) Classification with a reject option using a hinge loss. JMLR.
• S. Bubeck and N. Cesa-Bianchi (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5 (1), pp. 1–122.
• S. Caron, B. Kveton, M. Lelarge, and S. Bhagat (2012) Leveraging side observations in stochastic bandits. In UAI.
• N. Cesa-Bianchi, P. Gaillard, C. Gentile, and S. Gerchinovitz (2017) Algorithmic chaining and the role of partial feedback in online nonparametric learning. In COLT.
• C. K. Chow (1957) An optimum character recognition system using decision functions. IEEE T. C.
• C. K. Chow (1970) On optimum recognition error and reject trade-off. IEEE T. C.
• K. L. Clarkson (2006) Nearest-neighbor searching and metric space dimensions. In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice.
• A. Cohen, T. Hazan, and T. Koren (2016) Online learning with feedback graphs without the graphs. In ICML.
• C. Cortes, G. DeSalvo, and M. Mohri (2016a) Boosting with abstention. In NIPS.
• C. Cortes, G. DeSalvo, and M. Mohri (2016b) Learning with rejection. In ALT, pp. 67–82.
• R. El-Yaniv and Y. Wiener (2010) On the foundations of noise-free selective classification. JMLR.
• R. El-Yaniv and Y. Wiener (2011) Agnostic selective classification. In NIPS.
• E. Hazan and N. Megiddo (2007) Online learning with prior knowledge. In COLT, pp. 499–513.
• E. Hazan (2016) Introduction to online convex optimization. Foundations and Trends in Optimization, Now Publishers.
• T. Kocák, G. Neu, M. Valko, and R. Munos (2014) Efficient learning by implicit exploration in bandit problems with side observations. In NIPS, pp. 613–621.
• A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto.
• L. Li, M. Littman, and T. Walsh (2008) Knows What It Knows: a framework for self-aware learning. In ICML.
• N. Littlestone and M. K. Warmuth (1994) The weighted majority algorithm. Information and Computation 108 (2), pp. 212–261.
• T. Lykouris, E. Tardos, and D. Wali (2019) Feedback graph regret bounds for Thompson sampling and UCB. ArXiv.
• S. Mannor and O. Shamir (2011) From bandits to experts: on the value of side-observations. In NIPS.
• G. Neu (2015) Explore no more: improved high-probability regret bounds for non-stochastic bandits. In NIPS, pp. 3168–3176.
• A. Sayedi, M. Zadimoghaddam, and A. Blum (2010) Trading off mistakes and don't-know predictions. In NIPS.
• C. Zhang and K. Chaudhuri (2016) The extended Littlestone's dimension for learning with mistakes and abstentions. In COLT.

## Appendix A Further Related Work

Learning with abstention is a useful paradigm in applications where the cost of misclassifying a point is high. More concretely, suppose the abstention cost is less than the misclassification cost, and consider the set of points along the real line illustrated in Figure 6, where the signs indicate their labels. The best threshold classifier correctly classifies only the points to the right of its threshold and therefore incurs the misclassification cost on the remaining points. On the other hand, the best abstention pair abstains on the leftmost region and correctly classifies the rest, thereby incurring only the abstention cost on that region. Since the abstention cost is smaller than the misclassification cost, the abstention pair always admits a better expected loss than the best threshold classifier.

Within the online learning literature, work related to our scenario includes the KWIK (knows what it knows) framework of Li et al. (2008), in which the learning algorithm is required to make only correct predictions but admits the option of abstaining from making a prediction. The objective is then to learn a concept exactly with the fewest number of abstentions. If in our framework we received the label at every round, KWIK could be seen as a special case of our framework for online learning with abstention with an infinite misclassification cost and some finite abstention cost. A relaxed version of the KWIK framework was introduced and analyzed by Sayedi et al. (2010), where a fixed number of incorrect predictions is allowed, with a learning algorithm related to the solution of the 'mega-egg game puzzle'. A theoretical analysis of learning in this framework was also recently given by Zhang and Chaudhuri (2016). Our framework does not strictly cover this relaxed framework. However, for some choices of the misclassification cost depending on the horizon, the framework is very close to ours. The analyses in these frameworks were given in terms of mistake bounds, since the problem is assumed to be realizable. We do not restrict ourselves to realizable problems and, instead, provide regret guarantees.

## Appendix B Pseudocode and Proofs

We first present the pseudocode and proofs for the finite arm setting and next analyze the infinite arm setting.

### B.1 Finite arm setting

Algorithm 3 contains the pseudocode for exp3-abs, an algorithm for online learning with abstention under an adversarial data model that guarantees small regret. The algorithm itself is a simple adaptation of the ideas in (Alon et al., 2014, 2015), which incorporates the side information that the loss of an abstaining arm is always observed, while the loss of a predicting arm is observed only if the algorithm actually plays a predicting arm.
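The feedback model just described can be sketched in code. This is an illustrative reconstruction of a single exp3-abs-style round, not the paper's exact Algorithm 3; the function name and signature are ours. Losses revealed by the feedback model are importance-weighted by their observation probability before the exponential update.

```python
import numpy as np

def exp3_abs_round(log_w, eta, losses, abstains, rng):
    """One round of an exp3-abs-style update (sketch).

    losses[j]   : loss of expert j this round
    abstains[j] : True if expert j abstains this round
    Abstaining experts' losses are always observed; predicting experts'
    losses are observed only when the sampled arm itself predicts.
    """
    q = np.exp(log_w - log_w.max())
    q /= q.sum()
    i = rng.choice(len(q), p=q)             # sampled arm I_t
    p_acc = q[~abstains].sum()              # prob. of playing a predicting arm
    loss_hat = np.zeros_like(q)
    for j in range(len(q)):
        if abstains[j]:
            loss_hat[j] = losses[j]         # always observed, P_t = 1
        elif not abstains[i]:
            loss_hat[j] = losses[j] / p_acc # observed, importance-weighted
    return log_w - eta * loss_hat, i
```

Conditioned on the past, each estimate has expectation equal to the true loss, which is the property the proof below relies on.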

Proof of Theorem 1.
By applying the standard regret bound of Hedge (e.g., (Bubeck and Cesa-Bianchi, 2012)) to the distributions $q_t$ generated by exp3-abs and to the non-negative loss estimates $\widehat{L}_t$, the following holds:

$$\mathbb{E}\left[\sum_{t=1}^{T}\sum_{\xi_j\in E} q_t(\xi_j)\,\mathbb{E}_t\bigl[\widehat{L}_t(\xi_j)\bigr]-\sum_{t=1}^{T}\mathbb{E}_t\bigl[\widehat{L}_t(\xi^\star)\bigr]\right]\le \frac{\log K}{\eta}+\frac{\eta}{2}\sum_{t=1}^{T}\mathbb{E}\left[\sum_{\xi_j\in E} q_t(\xi_j)\,\mathbb{E}_t\bigl[\widehat{L}_t(\xi_j)^2\bigr]\right]\qquad(2)$$

for any fixed $\xi^\star\in E$. Using the fact that $\mathbb{E}_t[\widehat{L}_t(\xi_j)]=L_t(\xi_j)$ and $\mathbb{E}_t[\widehat{L}_t(\xi_j)^2]=L_t(\xi_j)^2/P_t(\xi_j)$, we can rewrite both sides of (2) in terms of the true losses $L_t$.

For each $t$, we can split the nodes of the feedback graph into the two subsets $V_{\mathrm{abs},t}$ and $V_{\mathrm{acc},t}$: a node $\xi_j$ belongs to $V_{\mathrm{abs},t}$ if it abstains at time $t$, and to $V_{\mathrm{acc},t}$ otherwise. Thus, for any round $t$, we can write

$$\begin{aligned}\sum_{\xi_j\in E} q_t(\xi_j)\,\frac{L_t(\xi_j)^2}{P_t(\xi_j)} &= \sum_{\xi_j\in V_{\mathrm{abs},t}} q_t(\xi_j)\,\frac{L_t(\xi_j)^2}{P_t(\xi_j)}+\sum_{\xi_j\in V_{\mathrm{acc},t}} q_t(\xi_j)\,\frac{L_t(\xi_j)^2}{P_t(\xi_j)}\\ &\le \sum_{\xi_j\in V_{\mathrm{abs},t}} q_t(\xi_j)\,c^2+\sum_{\xi_j\in V_{\mathrm{acc},t}} \frac{q_t(\xi_j)}{P_t(\xi_j)}\\ &\le c^2+1\,.\end{aligned}$$

The first inequality holds since, if $\xi_j$ is an abstaining expert at time $t$, we know that $P_t(\xi_j)=1$ and $L_t(\xi_j)\le c$, while for the accepting experts we know that $L_t(\xi_j)\le 1$ anyway. The second inequality holds because, if $\xi_j$ is an accepting expert, its loss is observed exactly when an accepting arm is played, so that $P_t(\xi_j)=\sum_{\xi\in V_{\mathrm{acc},t}} q_t(\xi)$, whence $\sum_{\xi_j\in V_{\mathrm{acc},t}} q_t(\xi_j)/P_t(\xi_j)\le 1$. Combining this inequality with (2) concludes the proof.

### B.2 Infinite arm setting

Here, the input space $\mathcal{X}$ is assumed to be totally bounded, so that, for all $\varepsilon>0$, $\mathcal{X}$ can be covered with finitely many balls of radius $\varepsilon$. Let $\mathcal{Y}$ be a shorthand for the range space of the pairs $(a,r)$. An $\varepsilon$-covering of $\mathcal{Y}$ with respect to the Euclidean distance on $\mathcal{Y}$ has size $K_\varepsilon$.

The online learning scenario for the loss under the abstention setting's feedback graphs is as follows. Given an unknown sequence of pairs $(x_t,z_t)$, for every round $t$:

1. The environment reveals the input $x_t$;

2. The learner selects an action $\xi_{I_t}\in E$ and incurs the corresponding loss;

3. The learner obtains feedback from the environment.

Our algorithm is described as Algorithm 4. The algorithm essentially works as follows. At each round $t$, if a new incoming input $x_t$ is not contained in any existing ball generated so far, then a new ball centered at $x_t$ is created, and a new instance of exp3-abs is allocated to handle it. Otherwise, the exp3-abs instance associated with the closest input so far is used. Each allocated exp3-abs instance operates on the discretized action space given by the $\varepsilon$-covering of $\mathcal{Y}$.
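The ball-allocation step can be sketched as follows. This is an illustrative reconstruction, not the paper's Algorithm 4: the class name is ours, and `make_learner` stands in for allocating a fresh exp3-abs instance over the discretized action space.

```python
import numpy as np

class BallAllocator:
    """Route each input to the exp3-abs instance of the nearest existing ball,
    or open a new ball when the input lies outside all existing balls."""

    def __init__(self, radius, make_learner):
        self.radius = radius
        self.make_learner = make_learner
        self.centers = []   # ball centers created so far
        self.learners = []  # one exp3-abs instance per ball

    def route(self, x):
        x = np.asarray(x, dtype=float)
        if self.centers:
            dists = [np.linalg.norm(x - c) for c in self.centers]
            j = int(np.argmin(dists))
            if dists[j] <= self.radius:     # x falls into an existing ball
                return self.learners[j]
        self.centers.append(x)              # open a new ball centered at x
        self.learners.append(self.make_learner())
        return self.learners[-1]
```

Since the input space is totally bounded, the number of balls created this way stays finite, which is what bounds $N_T$ in the proof below.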

Consider the function

$$\tilde L(a,r)=\begin{cases} c & \text{if } r\le -\gamma\\[2pt] 1+\frac{1-c}{\gamma}\,r & \text{if } r\in(-\gamma,0)\\[2pt] 1-\frac{1-f_\gamma(-a)}{\gamma}\,r & \text{if } r\in[0,\gamma)\\[2pt] f_\gamma(-a) & \text{if } r\ge\gamma\,, \end{cases}$$

where $f_\gamma$ is the Lipschitz variant of the 0/1-loss mentioned in Section 3 of the main text (Figure 2(a)). For any fixed $a$, the function $\tilde L(a,\cdot)$ is $\frac{1}{\gamma}$-Lipschitz as a function of $r$, and, for any fixed $r$, the function $\tilde L(\cdot,r)$ is $\frac{1}{2\gamma}$-Lipschitz as a function of $a$. Hence

$$\begin{aligned}\bigl|\tilde L(a,r)-\tilde L(a',r')\bigr| &\le \bigl|\tilde L(a,r)-\tilde L(a,r')\bigr|+\bigl|\tilde L(a,r')-\tilde L(a',r')\bigr|\\ &\le \frac{1}{\gamma}\,|r-r'|+\frac{1}{2\gamma}\,|a-a'|\\ &\le \sqrt{\frac{1}{\gamma^2}+\frac{1}{4\gamma^2}}\,\sqrt{(a-a')^2+(r-r')^2}\\ &< \frac{2}{\gamma}\sqrt{(a-a')^2+(r-r')^2}\,,\end{aligned}$$

so that $\tilde L$ is $\frac{2}{\gamma}$-Lipschitz with respect to the Euclidean distance on $\mathcal{Y}$. Furthermore, a quick comparison to the abstention loss

$$L(a,r)=f_\gamma(a)\,1_{r>0}+c\,1_{r\le 0}$$

reveals that (recall Figure 2(b) in the main text):

• $\tilde L$ is an upper bound on $L$, i.e.,

$\tilde L(a,r)\ge L(a,r)$ for all $(a,r)\in\mathcal{Y}$;
• $\tilde L$ approximates $L$ in that

$\tilde L(a,r)=L(a,r)$ for all $(a,r)\in\mathcal{Y}$ such that $|r|\ge\gamma$. (3)
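A small numerical sketch of $\tilde L$ can be used to sanity-check its continuity at the breakpoints and its $\frac{1}{\gamma}$-Lipschitz behavior in $r$. The ramp used for $f_\gamma$ below and the values of $\gamma$ and $c$ are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

GAMMA, C = 0.2, 0.4  # illustrative values of gamma and the abstention cost c

def f_gamma(u):
    """Assumed ramp surrogate for the 0/1-loss: 1 for u <= 0, 0 for u >= 2*gamma,
    linear in between (hence 1/(2*gamma)-Lipschitz).  The paper's f_gamma may differ."""
    return float(np.clip(1.0 - u / (2.0 * GAMMA), 0.0, 1.0))

def L_tilde(a, r):
    """Piecewise-linear surrogate loss from the appendix (reconstructed)."""
    if r <= -GAMMA:
        return C
    if r < 0.0:
        return 1.0 + (1.0 - C) / GAMMA * r
    if r < GAMMA:
        return 1.0 - (1.0 - f_gamma(-a)) / GAMMA * r
    return f_gamma(-a)
```

The pieces join continuously: the second piece equals $c$ at $r=-\gamma$ and $1$ at $r=0$, and the third piece equals $1$ at $r=0$ and $f_\gamma(-a)$ at $r=\gamma$.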

With the above properties of $\tilde L$ at hand, we are ready to prove Theorem 2.

Proof of Theorem 2.
On each ball $B$ that Contexp3-abs allocates during its online execution, Theorem 1 supplies the following regret guarantee for the associated instance of exp3-abs:

$$\frac{\log K_\varepsilon}{\eta}+\frac{\eta}{2}\,T_B\,(c^2+1)\,,$$

where $T_B$ is the number of points falling into ball $B$. Now, taking into account that $\tilde L$ is $\frac{2}{\gamma}$-Lipschitz, and that the functions defining the experts are assumed to be Lipschitz with constant $L_E$, a direct adaptation of the proof of Theorem 1 in (Cesa-Bianchi et al., 2017) gives the bound

$$\sup_{\xi\in E}\,\mathbb{E}\left[\sum_{t=1}^{T}\tilde L(\xi_{I_t},z_t)-\sum_{t=1}^{T}\tilde L(\xi,z_t)\right]\le N_T\,\frac{\log K_\varepsilon}{\eta}+\frac{\eta}{2}\,T\,(c^2+1)+\frac{L_E\,\varepsilon}{2\gamma}\,T\,,$$

with $N_T$ being the maximum number of balls created by Contexp3-abs. Using $c^2+1\le 2$ and setting $\eta=\sqrt{N_T\log K_\varepsilon/T}$ yields

$$\sup_{\xi\in E}\,\mathbb{E}\left[\sum_{t=1}^{T}\tilde L(\xi_{I_t},z_t)-\sum_{t=1}^{T}\tilde L(\xi,z_t)\right]\le 2\sqrt{T\,N_T\log K_\varepsilon}+\frac{L_E\,\varepsilon}{2\gamma}\,T\,.$$
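The tuning of the learning rate can be verified directly; a short check, assuming the choice $\eta=\sqrt{N_T\log K_\varepsilon/T}$ together with the crude bound $c^2+1\le 2$:

```latex
\[
  N_T\frac{\log K_\varepsilon}{\eta} + \frac{\eta}{2}\,T\,(c^2+1)
  \;\le\; N_T\frac{\log K_\varepsilon}{\eta} + \eta\,T
  \;=\; \sqrt{T\,N_T\log K_\varepsilon} + \sqrt{T\,N_T\log K_\varepsilon}
  \;=\; 2\sqrt{T\,N_T\log K_\varepsilon}.
\]
```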

Next, optimizing for by setting (and disregarding