# Multi-View Active Learning in the Non-Realizable Case

Wei Wang    Zhi-Hua Zhou
###### Abstract

The sample complexity of active learning under the realizability assumption has been well studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under the multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be $\widetilde{O}(\log\frac{1}{\epsilon})$, in contrast to the single-view setting, where a polynomial improvement is the best possible achievement. We also prove that in the general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is $\widetilde{O}(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, in contrast to previous polynomial bounds, where the order of $1/\epsilon$ depends on the parameter in Tsybakov noise.

$\ast$ Corresponding author. Email: zhouzh@nju.edu.cn

National Key Laboratory for Novel Software Technology

Nanjing University, Nanjing 210093, China

Key words: active learning, non-realizable case

## 1 Introduction

In active learning David94 (); DasguptaNIPS07 (); Freund97 (), the learner draws unlabeled data from the unknown distribution defined on the learning task and actively queries some labels from an oracle. In this way, the active learner can achieve good performance with far fewer labels than passive learning. The number of queried labels that is necessary and sufficient for obtaining a good learner is known as the sample complexity of active learning.

Many theoretical bounds on the sample complexity of active learning have been derived based on the realizability assumption (i.e., there exists a hypothesis perfectly separating the data in the hypothesis class) Balcan2007 (); BalcanCOLT2008 (); Dasgupta05 (); DasguptaNIPS05 (); Dasgupta2005 (); Freund97 (). The realizability assumption, however, rarely holds in practice. Recently, the sample complexity of active learning in the non-realizable case (i.e., the data cannot be perfectly separated by any hypothesis in the hypothesis class because of the noise) has been studied BalcanBL06 (); DasguptaNIPS07 (); Hanneke07 (). It is worth noting that these bounds obtained in the non-realizable case match the lower bound Kaariainen06 (), in the same order as the upper bound of passive learning ( denotes the generalization error rate of the optimal classifier in the hypothesis class and bounds how close to the optimal classifier in the hypothesis class the active learner has to get). This suggests that perhaps active learning in the non-realizable case is not as efficient as that in the realizable case. To improve the sample complexity of active learning in the non-realizable case remarkably, the model of the noise or some assumptions on the hypothesis class and the data distribution must be considered. Tsybakov noise model Tsybakov04 () is more and more popular in theoretical analysis on the sample complexity of active learning. However, existing result CastroN08 () shows that obtaining exponential improvement in the sample complexity of active learning with unbounded Tsybakov noise is hard.

Inspired by WangZ08 (), which proved that the multi-view setting Blum:Mitchell1998 () can remarkably improve the sample complexity of active learning in the realizable case, we expect that the multi-view setting will also help active learning in the non-realizable case. In this paper, we present the first analysis of the sample complexity of active learning in the non-realizable case under the multi-view setting, where the non-realizability is caused by Tsybakov noise. Specifically:

- We define $\alpha$-expansion, which extends the definition in Balcan:Blum:Yang2005 () and WangZ08 () to the non-realizable case, and -condition for multi-view setting.

- We prove that the sample complexity of active learning with Tsybakov noise under the multi-view setting can be improved to $\widetilde{O}(\log\frac{1}{\epsilon})$ when the learner satisfies the non-degradation condition. (The $\widetilde{O}$ notation is used to hide the factor $\log\log\frac{1}{\epsilon}$.) This exponential improvement holds whether or not the Tsybakov noise is bounded, in contrast to the single-view setting, where a polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise.

- We also prove that, when the non-degradation condition does not hold, the sample complexity of active learning with unbounded Tsybakov noise under the multi-view setting is $\widetilde{O}(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, i.e., the sample complexity is always $\widetilde{O}(\frac{1}{\epsilon})$ no matter how large the unbounded Tsybakov noise is. In previous polynomial bounds, by contrast, the order of $1/\epsilon$ depends on the parameter in Tsybakov noise and is larger than 1 when the unbounded Tsybakov noise exceeds some degree (see Section 2). This shows that, when the non-degradation condition does not hold, the multi-view setting still leads to a faster convergence rate, and our polynomial improvement in the sample complexity is better than previous polynomial bounds when the unbounded Tsybakov noise is large.

The rest of this paper is organized as follows. After introducing related work in Section 2 and preliminaries in Section 3, we define $\alpha$-expansion in the non-realizable case in Section 4. We then analyze the sample complexity of active learning with Tsybakov noise under the multi-view setting with and without the non-degradation condition in Sections 5 and 6, respectively, and verify the improvement in the sample complexity empirically in Section 7. Finally, we conclude the paper in Section 8.

## 2 Related Work

Generally, the non-realizability of learning task is caused by the presence of noise. For learning the task with arbitrary forms of noise, Balcan et al. BalcanBL06 () proposed the agnostic active learning algorithm and proved that its sample complexity is .222The notation is used to hide the factor . Hoping to get tighter bound on the sample complexity of the algorithm , Hanneke Hanneke07 () defined the disagreement coefficient , which depends on the hypothesis class and the data distribution, and proved that the sample complexity of the algorithm is . Later, Dasgupta et al. DasguptaNIPS07 () developed a general agnostic active learning algorithm which extends the scheme in David94 () and proved that its sample complexity is .

Recently, the popular Tsybakov noise model Tsybakov04 () has been considered in the theoretical analysis of active learning, and some bounds on the sample complexity have been obtained. For some simple cases, where Tsybakov noise is bounded, it has been proved that an exponential improvement in the sample complexity is possible Balcan2007 (); CastroAllerton06 (); Hanneke2009 (). As for the situation where Tsybakov noise is unbounded, only polynomial improvement in the sample complexity has been obtained. Balcan et al. Balcan2007 () assumed that the samples are drawn uniformly from the unit ball in and proved that the sample complexity of active learning with unbounded Tsybakov noise is ( depends on Tsybakov noise). This uniform distribution assumption, however, rarely holds in practice. Castro and Nowak CastroN08 () showed that the sample complexity of active learning with unbounded Tsybakov noise is ( depends on another form of Tsybakov noise, depends on the Hölder smoothness and is the dimension of the data). This result is also based on the strong uniform distribution assumption. Cavallanti et al. CavallantiCG08 () assumed that the labels of examples are generated according to a simple linear noise model and indicated that the sample complexity of active learning with unbounded Tsybakov noise is . Hanneke Hanneke2009 () proved that the algorithms or variants thereof in BalcanBL06 () and DasguptaNIPS07 () can achieve the polynomial sample complexity for active learning with unbounded Tsybakov noise. For active learning with unbounded Tsybakov noise, Castro and Nowak CastroN08 () also proved that at least labels are required to learn an -approximation of the optimal classifier ( depends on Tsybakov noise). This result shows that a polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise in the single-view setting.
Wang wangNIPS2009 () introduced smooth assumption to active learning with approximate Tsybakov noise and proved that if the classification boundary and the underlying distribution are smooth to -th order and , the sample complexity of active learning is ; if the boundary and the distribution are infinitely smooth, the sample complexity of active learning is . Nevertheless, this result is for approximate Tsybakov noise and the assumption on large smoothness order (or infinite smoothness order) rarely holds for data with high dimension in practice.

## 3 Preliminaries

In multi-view setting, the instances are described with several different disjoint sets of features. For the sake of simplicity, we only consider two-view setting in this paper. Suppose that is the instance space, and are the two views, is the label space and is the distribution over . Suppose that is the optimal Bayes classifier, where and are the optimal Bayes classifiers in the two views, respectively. Let and be the hypothesis class in each view and suppose that and . For any instance , the hypothesis makes that if and otherwise, where is a subset of . In this way, any hypothesis corresponds to a subset of (as for how to combine the hypotheses in the two views, see Section 5). Considering that and denote the same instance in different views, we overload to denote the instance set without confusion. Let correspond to the optimal Bayes classifier . It is well-known DEVROYE1996 () that , where . Here, we also overload to denote the instances set . The error rate of a hypothesis under the distribution is . In general, and the excess error of can be denoted as follows, where and is a pseudo-distance between the sets and .

$$R(S_v) - R(S_v^*) = \int_{S_v \Delta S_v^*} \left|2\varphi_v(x_v) - 1\right| p_{x_v}\, dx_v \triangleq d(S_v, S_v^*) \tag{1}$$
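As a sanity check on (1), the following minimal sketch evaluates both sides on a small discrete distribution. The four-point instance space, the names `DIST`, `error_rate` and `excess_distance`, and the choice of the Bayes-optimal set as $\{x : \varphi(x) \geq 1/2\}$ are illustrative assumptions, not part of the paper; the integral is replaced by a finite sum.

```python
from fractions import Fraction as F

# Hypothetical finite instance space: x -> (mass p(x), phi(x) = P(y = 1 | x)).
DIST = {
    "a": (F(1, 4), F(9, 10)),
    "b": (F(1, 4), F(3, 5)),
    "c": (F(1, 4), F(2, 5)),
    "d": (F(1, 4), F(1, 10)),
}


def error_rate(S):
    """R(S): error of the classifier that predicts positive exactly on S."""
    return sum(p * ((1 - phi) if x in S else phi) for x, (p, phi) in DIST.items())


def excess_distance(S, S_star):
    """d(S, S*): sum of |2*phi(x) - 1| * p(x) over the symmetric difference."""
    return sum(DIST[x][0] * abs(2 * DIST[x][1] - 1) for x in S ^ S_star)


# Assumed Bayes-optimal set: all points with phi(x) >= 1/2.
S_STAR = {x for x, (_, phi) in DIST.items() if phi >= F(1, 2)}
S = {"a", "c"}  # an arbitrary, suboptimal hypothesis

# Both quantities should coincide, as equation (1) states.
print(error_rate(S) - error_rate(S_STAR), excess_distance(S, S_STAR))
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than with a floating-point tolerance.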

Let denote the error rate of the optimal Bayes classifier which is also called as the noise rate in the non-realizable case. In general, is less than . In order to model the noise, we assume that the data distribution and the Bayes decision boundary in each view satisfies the popular Tsybakov noise condition Tsybakov04 () that for some finite , and all , where corresponds to the best learning situation and the noise is called bounded CastroN08 (); while corresponds to the worst situation. When , the noise is called unbounded CastroN08 (). According to Proposition 1 in Tsybakov04 (), it is easy to know that (2) holds.

$$d(S_v, S_v^*) \geq C_1\, d_\Delta^{k}(S_v, S_v^*) \tag{2}$$

Here , , is also a pseudo-distance between the sets and , and . We will use the following lemma Anthony1999 (), which gives the standard sample complexity for non-realizable learning tasks.

###### Lemma 1

Suppose that is a set of functions from to with finite VC-dimension and is the fixed but unknown distribution over . For any , , there is a positive constant , such that if the size of sample from is , then with probability at least , for all , the following holds.

$$\left|\frac{1}{N}\sum_{i=1}^{N} I\big(h(x_i) \neq y_i\big) - E_{(x,y)\in D}\, I\big(h(x) \neq y\big)\right| \leq \epsilon$$
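The uniform deviation bounded in Lemma 1 can be observed numerically. The sketch below is only an illustration under assumed conditions: instances are uniform on $[0, 1]$, the clean label is $1$ for $x \geq 0.5$ and is flipped with probability `noise`, and the hypothesis class is a finite grid of threshold classifiers. None of these choices come from the paper, and a finite grid is a simplification of a full class with finite VC-dimension.

```python
import random


def max_deviation(n, thresholds, noise=0.1, seed=0):
    """Largest gap between empirical and expected error over the thresholds."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    # Clean label is 1 when x >= 0.5; each label is flipped with prob. `noise`.
    ys = [int(x >= 0.5) ^ int(rng.random() < noise) for x in xs]
    worst = 0.0
    for t in thresholds:
        # Empirical error of the classifier h_t(x) = 1[x >= t].
        empirical = sum(int(x >= t) != y for x, y in zip(xs, ys)) / n
        # Exact expected error of h_t under this toy distribution.
        expected = noise + (1 - 2 * noise) * abs(t - 0.5)
        worst = max(worst, abs(empirical - expected))
    return worst


grid = [i / 10 for i in range(11)]
for n in (100, 1000, 20000):
    print(n, max_deviation(n, grid))
```

The printed gap is a single random draw for each sample size, so it illustrates the trend of the lemma rather than verifying its constant.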

## 4 α-Expansion in the Non-realizable Case

Multi-view active learning, first described in MusleaMK02 (), focuses on the contention points (i.e., unlabeled instances on which different views predict different labels) and queries some of their labels. It is motivated by the observation that querying the labels of contention points may help at least one of the two views to learn the optimal classifier. Let $S_1 \oplus S_2$ denote the contention points between $S_1$ and $S_2$; then $Pr(S_1 \oplus S_2)$ denotes the probability mass on the contention points. “$\Delta$” and “$\oplus$” denote the same operation. In this paper, we use “$\Delta$” when referring to the excess error between $S_v$ and $S^*$ and use “$\oplus$” when referring to the difference between the two views $S_1$ and $S_2$. In order to study multi-view active learning, the properties of contention points should be considered. One basic property is that $Pr(S_1 \oplus S_2)$ should not be too small, otherwise the two views could be exactly the same and the two-view setting would degenerate into the single-view setting.

In multi-view learning, the two views represent the same learning task and are generally consistent with each other, i.e., for any instance $x = (x_1, x_2)$ the labels of $x$ in the two views are the same. Hence we first assume that $S_1^* = S_2^* = S^*$. The situation where $S_1^* \neq S_2^*$ is discussed further in Section 5.2. The instances on which the two views agree can be denoted as $(S_1 \cap S_2) \cup (\overline{S_1} \cap \overline{S_2})$. However, some of these agreed instances may be assigned a different label by the optimal classifier $S^*$, i.e., the instances in $(S_1 \cap S_2 - S^*) \cup (\overline{S_1} \cap \overline{S_2} - \overline{S^*})$. Intuitively, if the contention points can convey some information about $(S_1 \cap S_2 - S^*) \cup (\overline{S_1} \cap \overline{S_2} - \overline{S^*})$, then querying the labels of contention points could help to improve $S_1$ and $S_2$. Based on this intuition and the requirement that $Pr(S_1 \oplus S_2)$ should not be too small, we give our definition of $\alpha$-expansion in the non-realizable case.

###### Definition 1

is -expanding if for some and any , , (3) holds.

$$Pr(S_1 \oplus S_2) \geq \alpha\Big(Pr(S_1 \cap S_2 - S^*) + Pr(\overline{S_1} \cap \overline{S_2} - \overline{S^*})\Big) \tag{3}$$

We say that is -expanding with respect to hypothesis class if the above holds for all , (here we denote by the set { : } for ).
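To make inequality (3) concrete, the following sketch evaluates both sides for one pair of hypotheses on a finite instance space and reports the largest $\alpha$ for which (3) holds for that pair. The function names, the uniform eight-point distribution and the particular sets are hypothetical; checking the definition itself would require taking the minimum over all admissible pairs.

```python
from fractions import Fraction as F


def prob(A, mass):
    """Probability mass of a finite set A."""
    return sum(mass[x] for x in A)


def expansion_terms(S1, S2, S_star, mass):
    """Return (Pr of contention points, Pr of agreed-but-wrong points)."""
    universe = set(mass)
    contention = S1 ^ S2  # the two views disagree here
    # Both views predict positive, the optimal classifier predicts negative.
    both_pos_wrong = (S1 & S2) - S_star
    # Both views predict negative, the optimal classifier predicts positive.
    both_neg_wrong = ((universe - S1) & (universe - S2)) - (universe - S_star)
    return prob(contention, mass), prob(both_pos_wrong | both_neg_wrong, mass)


def largest_alpha(S1, S2, S_star, mass):
    """Largest alpha satisfying (3) for this pair; None if any alpha works."""
    lhs, rhs = expansion_terms(S1, S2, S_star, mass)
    return None if rhs == 0 else lhs / rhs


MASS = {x: F(1, 8) for x in range(1, 9)}  # hypothetical uniform distribution
S_STAR = {1, 2, 3, 4}
S1, S2 = {1, 2, 3, 5}, {1, 2, 4, 5}
print(expansion_terms(S1, S2, S_STAR, MASS), largest_alpha(S1, S2, S_STAR, MASS))
```

In this toy case the views disagree on points 3 and 4 and are jointly wrong only on point 5, so the contention mass is twice the agreed-but-wrong mass.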

Balcan et al. Balcan:Blum:Yang2005 () also gave a definition of expansion, , for realizable learning task under the assumptions that the learner in each view is never “confident but wrong” and the learning algorithm is able to learn from positive data only. Here denotes the instances which are classified as positive confidently in each view. Generally, in realizable learning tasks, we aim at studying the asymptotic performance and assume that the performance of initial classifier is better than guessing randomly, i.e., . This ensures that is larger than . In addition, in Balcan:Blum:Yang2005 () the instances which are agreed by the two views but are predicted different label by the optimal classifier can be denoted as . So, it can be found that Definition 1 and the definition of expansion in Balcan:Blum:Yang2005 () are based on the same intuition that the amount of contention points is no less than a fraction of the amount of instances which are agreed by the two views but are predicted different label by the optimal classifiers.

## 5 Multi-view Active Learning with Non-degradation Condition

In this section, we first consider the multi-view learning in Table 1 and analyze whether the multi-view setting can remarkably improve the sample complexity of active learning in the non-realizable case. In the multi-view setting, the classifiers are often combined to make predictions, and many strategies can be used to combine them. In this paper, we consider the following two combination schemes, $h_+$ and $h_-$, for binary classification:

 (4)

### 5.1 The Situation Where $S_1^* = S_2^*$

With (4), the error rates of the combined classifiers $h_+^i$ and $h_-^i$ satisfy (5) and (6), respectively.

$$R(h_+^i) - R(S^*) = R(S_1^i \cap S_2^i) - R(S^*) \leq d_\Delta(S_1^i \cap S_2^i, S^*) \tag{5}$$

$$R(h_-^i) - R(S^*) = R(S_1^i \cup S_2^i) - R(S^*) \leq d_\Delta(S_1^i \cup S_2^i, S^*) \tag{6}$$
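Equations (5) and (6) identify the two combined classifiers with the intersection and the union of the positive sets of the two views. A minimal set-based sketch of this reading follows; the function names, the toy sets, and the use of the symmetric-difference mass for $d_\Delta$ are assumptions made for illustration.

```python
from fractions import Fraction as F


def combine_plus(S1, S2):
    """h_+: predicts positive only where both views predict positive."""
    return S1 & S2


def combine_minus(S1, S2):
    """h_-: predicts positive where at least one view predicts positive."""
    return S1 | S2


def d_delta(S, S_star, mass):
    """Assumed form of d_Delta: probability mass of the symmetric difference."""
    return sum(mass[x] for x in S ^ S_star)


MASS = {x: F(1, 8) for x in range(1, 9)}  # hypothetical uniform distribution
S_STAR = {1, 2, 3, 4}
S1, S2 = {1, 2, 3, 5}, {1, 2, 4, 5}

plus, minus = combine_plus(S1, S2), combine_minus(S1, S2)
# The two combinations differ exactly on the contention points.
print(minus - plus == S1 ^ S2)
print(d_delta(plus, S_STAR, MASS), d_delta(minus, S_STAR, MASS))
```

Here the union-based combination is closer to the optimal set than the intersection-based one, which illustrates why the analysis tracks both classifiers and only claims that at least one of them is accurate.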

Here $S_v^i$ ($v = 1, 2$) corresponds to the classifier of view $v$ in the $i$-th round. In each round of multi-view active learning, labels of some contention points are queried to augment the training data set, and the classifier in each view is then refined. As discussed in WangZ08 (), we also assume that the learner in Table 1 satisfies the non-degradation condition as the amount of labeled training examples increases, i.e., (7) holds, which implies that the excess error of $S_v^{i+1}$ is no larger than that of $S_v^i$ in the region $\overline{S_1^i \oplus S_2^i}$.

$$Pr\big(S_v^{i+1} \Delta S^* \,\big|\, \overline{S_1^i \oplus S_2^i}\big) \leq Pr\big(S_v^i \Delta S^* \,\big|\, \overline{S_1^i \oplus S_2^i}\big) \tag{7}$$
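Condition (7) can be checked directly on a finite instance space. The sketch below is a hypothetical illustration: the helper names, the uniform distribution and the candidate updates of view 1 are all assumptions, and the conditional probability is defined as zero when the agreed region has no mass.

```python
from fractions import Fraction as F


def cond_prob(A, region, mass):
    """Pr(A | region) on a finite space; returns 0 if the region has no mass."""
    denom = sum(mass[x] for x in region)
    if denom == 0:
        return F(0)
    return sum(mass[x] for x in A & region) / denom


def non_degradation_holds(S_old, S_new, S1_old, S2_old, S_star, mass):
    """Check (7) for one view: error on the previously agreed region
    does not increase after the update."""
    agreed = set(mass) - (S1_old ^ S2_old)  # complement of contention points
    return cond_prob(S_new ^ S_star, agreed, mass) <= cond_prob(
        S_old ^ S_star, agreed, mass
    )


MASS = {x: F(1, 8) for x in range(1, 9)}  # hypothetical uniform distribution
S_STAR = {1, 2, 3, 4}
S1, S2 = {1, 2, 3, 5}, {1, 2, 4, 5}

# An update of view 1 that fixes its mistakes satisfies the condition ...
print(non_degradation_holds(S1, {1, 2, 3, 4}, S1, S2, S_STAR, MASS))
# ... while one that forgets agreed, correct points violates it.
print(non_degradation_holds(S1, {3, 4}, S1, S2, S_STAR, MASS))
```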

To illustrate the non-degradation condition, we give the following example: Suppose the data in () fall into different clusters, denoted by , and every cluster has the same probability mass for simplicity. The positive class is the union of some clusters while the negative class is the union of the others. Each positive (negative) cluster in is associated with only positive (negative) clusters in (i.e., given an instance in , will only be in one of these ). Suppose the learning algorithm will predict all instances in each cluster with the same label, i.e., the hypothesis class consists of the hypotheses which do not split any cluster. Thus, the cluster can be classified according to the posterior probability and querying the labels of instances in cluster will not influence the estimation of the posterior probability for cluster (). It is evident that the non-degradation condition holds in this task. Note that the non-degradation assumption may not always hold, and we will discuss on this in Section 6. Now we give Theorem 1.

###### Theorem 1

For data distribution -expanding with respect to hypothesis class according to Definition 1, when the non-degradation condition holds, if and , the multi-view active learning in Table 1 will generate two classifiers and , at least one of which is with error rate no larger than with probability at least .
Here, where denotes the VC-dimension of the hypothesis class , , and .

###### Proof.

Let . First we prove that if each view () satisfies Tsybakov noise condition, i.e., for some finite , and all , Tsybakov noise condition can also be met in , i.e., for some finite , and all . Suppose Tsybakov noise condition cannot be met in , then for and , there exists some to satisfy that . So we get

$$Pr_{x_v \in X_v}\big(|\varphi_v(x_v) - 1/2| \leq t\big) \geq Pr_{x_v \in Q_i}\big(|\varphi_v(x_v) - 1/2| \leq t\big) > C_3 t_*^{\lambda_3}.$$

It is in contradiction with that satisfies Tsybakov noise condition. Thus, we get that Tsybakov noise condition can also be met in . Without loss of generality, suppose that Tsybakov noise condition in all and can be met for the same finite and .

Since , according to Lemma 1 we know that with probability at least . With , we get . It is easy to find that holds with probability at least .

For , number of labels are queried randomly from . Thus, similarly according to Lemma 1 we have with probability at least . Let and , it is easy to get

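$$Pr\big(S^* \cap (S_1^{i+1} \oplus S_2^{i+1}) \,\big|\, \overline{Q_i}\big) - Pr\big(\overline{S^*} \cap (S_1^{i+1} \oplus S_2^{i+1}) \,\big|\, \overline{Q_i}\big) = -2\tau_{i+1}\, Pr\big(S_1^{i+1} \oplus S_2^{i+1} \,\big|\, \overline{Q_i}\big).$$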

Considering the non-degradation condition and , we calculate that

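$$\begin{aligned}
d_\Delta\big(S_1^{i+1} \cap S_2^{i+1} | \overline{Q_i},\, S^* | \overline{Q_i}\big)
&= \tfrac{1}{2}\Big(d_\Delta\big(S_1^{i+1} | \overline{Q_i},\, S^* | \overline{Q_i}\big) + d_\Delta\big(S_2^{i+1} | \overline{Q_i},\, S^* | \overline{Q_i}\big)\Big) \\
&\quad + \tfrac{1}{2} Pr\big(S^* \cap (S_1^{i+1} \oplus S_2^{i+1}) \,\big|\, \overline{Q_i}\big) - \tfrac{1}{2} Pr\big(\overline{S^*} \cap (S_1^{i+1} \oplus S_2^{i+1}) \,\big|\, \overline{Q_i}\big) \\
&\leq \tfrac{1}{2}\Big(d_\Delta\big(S_1^{i} | \overline{Q_i},\, S^* | \overline{Q_i}\big) + d_\Delta\big(S_2^{i} | \overline{Q_i},\, S^* | \overline{Q_i}\big)\Big) - \tau_{i+1}\, Pr\big(S_1^{i+1} \oplus S_2^{i+1} \,\big|\, \overline{Q_i}\big) \\
&= d_\Delta\big(S_1^{i} \cap S_2^{i} | \overline{Q_i},\, S^* | \overline{Q_i}\big) - \tau_{i+1}\, Pr\big(S_1^{i+1} \oplus S_2^{i+1} \,\big|\, \overline{Q_i}\big).
\end{aligned}$$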

So we have

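$$\begin{aligned}
d_\Delta\big(S_1^{i+1} \cap S_2^{i+1},\, S^*\big)
&= d_\Delta\big(S_1^{i+1} \cap S_2^{i+1} | Q_i,\, S^* | Q_i\big)\, Pr(Q_i) + d_\Delta\big(S_1^{i+1} \cap S_2^{i+1} | \overline{Q_i},\, S^* | \overline{Q_i}\big)\, Pr(\overline{Q_i}) \\
&\leq \tfrac{1}{8} Pr(Q_i) + d_\Delta\big(S_1^{i} \cap S_2^{i} | \overline{Q_i},\, S^* | \overline{Q_i}\big)\, Pr(\overline{Q_i}) - \tau_{i+1}\, Pr\big((S_1^{i+1} \oplus S_2^{i+1}) \cap \overline{Q_i}\big).
\end{aligned}$$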

Considering , we have

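$$d_\Delta\big(S_1^{i+1} \cap S_2^{i+1},\, S^*\big) \leq Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*}) + \tfrac{1}{8} Pr(S_1^i \oplus S_2^i) - \tau_{i+1}\, Pr\big((S_1^{i+1} \oplus S_2^{i+1}) \cap \overline{Q_i}\big).$$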

Similarly, we get

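$$d_\Delta\big(S_1^{i+1} \cup S_2^{i+1},\, S^*\big) \leq Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*}) + \tfrac{1}{8} Pr(S_1^i \oplus S_2^i) + \tau_{i+1}\, Pr\big((S_1^{i+1} \oplus S_2^{i+1}) \cap \overline{Q_i}\big).$$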

Let , we have

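$$\begin{aligned}
d_\Delta\big(S_1^{i} \cap S_2^{i},\, S^*\big)
&= d_\Delta\big(S_1^{i} \cap S_2^{i} | Q_i,\, S^* | Q_i\big)\, Pr(Q_i) + d_\Delta\big(S_1^{i} \cap S_2^{i} | \overline{Q_i},\, S^* | \overline{Q_i}\big)\, Pr(\overline{Q_i}) \\
&= \big(\tfrac{1}{2} - \gamma_i\big) Pr(S_1^i \oplus S_2^i) + Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*})
\end{aligned}$$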

and .

Since in each round of the multi-view active learning some contention points of the two views are queried and added into the training set, the difference between the two views is decreasing, i.e., $Pr(S_1^{i+1} \oplus S_2^{i+1})$ is no larger than $Pr(S_1^i \oplus S_2^i)$.

Case 1: If , with respect to Definition 1, we have

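$$\begin{aligned}
\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^{i} \cup S_2^{i}, S^*)}
&\leq \frac{\frac{1}{8} Pr(S_1^i \oplus S_2^i) + |\tau_{i+1}|\, Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)}{\big(\frac{1}{2} + \gamma_i\big) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)} \\
&\leq \frac{\big(\frac{1}{8} + \gamma_i\big) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)}{\big(\frac{1}{2} + \gamma_i\big) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)}
\leq \frac{5\alpha + 8}{8\alpha + 8};
\end{aligned}$$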

Case 2: If , with respect to Definition 1, we have

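$$\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^{i} \cap S_2^{i}, S^*)} \leq \frac{\frac{1}{8} Pr(S_1^i \oplus S_2^i) + |\tau_{i+1}|\, Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)}{\big(\frac{1}{2} + |\gamma_i|\big) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)} \leq \frac{5\alpha + 8}{8\alpha + 8};$$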

Case 3: If and , with respect to Definition 1, we have

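$$\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^{i} \cap S_2^{i}, S^*)} \leq \frac{\frac{1}{8} Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)}{\big(\frac{1}{2} - \gamma_i\big) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)} \leq \frac{\alpha + 8}{2\alpha + 8};$$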

Case 4: If and , with respect to Definition 1, we have

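$$\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^{i} \cup S_2^{i}, S^*)} \leq \frac{\frac{1}{8} Pr(S_1^i \oplus S_2^i) + \tau_{i+1}\, Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)}{\big(\frac{1}{2} + \gamma_i\big) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha} Pr(S_1^i \oplus S_2^i)} \leq \frac{5\alpha + 8}{6\alpha + 8};$$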

Case 5: If and , with respect to Definition 1, we have

 dΔ(Si+11∪Si+12,S∗)dΔ(Si1∪Si2,