Active Robust Learning
Abstract
In many practical applications of learning algorithms, unlabeled data is cheap and abundant whereas labeled data is expensive. Active learning algorithms have been developed to achieve better performance at lower labeling cost. Most active learning algorithms rely on two criteria, representativeness and informativeness, and advanced recent methods consider both. Despite the vast literature, very few active learning methods account for noisy instances, i.e., label-noisy and outlier instances, and existing methods pay little attention to the accuracy of the representativeness and informativeness measures themselves. Based on the idea that inaccuracy in these measures and neglect of noisy instances are two sides of the same coin and inherently related, a new loss function is proposed. This loss function decreases the effect of noisy instances while at the same time reducing bias. We define "instance complexity" as a new notion of complexity for instances of a learning problem, and prove that noisy instances in the data, if any, are the ones with maximum instance complexity. Based on this loss function, which uses two functions to classify ordinary and noisy instances, a new classifier, named the "Simple-Complex Classifier", is proposed: it contains a simple function and a complex function, with the complex function responsible for selecting noisy instances. The resulting optimization problem, for both learning and active learning, is highly non-convex and very challenging; a convex relaxation is proposed to solve it. In every iteration of active learning, a problem differing only slightly from the previous one must be solved. To exploit this, an algorithm is proposed that makes the most of the information available from previous solutions. An accelerated version of the optimization algorithm is also given. Theoretical and experimental studies show that this method is efficient.
1 Introduction
Supervised machine learning methods need labeled data. Label information is expensive, and human annotators often provide noisy labels. On the other hand, unlabeled data is cheap and easy to obtain. To reduce the cost of building efficient learning systems, semi-supervised and active learning methods have been developed. In active learning, it is assumed that the labels of some data points are more useful than others [1]. Semi-supervised learning methods, in turn, assume that unlabeled data carries knowledge that can be utilized for learning.
Unfortunately, these methods sometimes suffer degraded performance [2, 3]. One of the reasons is noisy instances. Noisy instances can be outliers or label-noisy points. Such data points strongly affect the classification boundary and reduce the generalizability of the resulting classifier. In spite of this, there is little research on noisy data in active learning [4].
Recently, Hanneke [3] showed that, in active learning with convex surrogate losses, achieving the maximum improvement in sample complexity, i.e., the number of queries necessary for a certain accuracy, is not possible in the presence of noise. This is because noisy data points have a large impact on the classifier boundary and move it away from the optimum, which results in more queries. In active learning, especially in the early stages, there are very few labeled instances, which makes active learning methods even more vulnerable to noisy instances. Moreover, correctly labeled outlier instances may reduce the generalizability of the classifier [4]. With the incomplete information available in active learning, it is not possible to simply remove instances considered noisy at each stage: by doing so, we may lose some portions of the boundary, since with new labels these instances may no longer be considered noisy.
In this paper, we argue that in order to have more effective learning and active learning algorithms in the presence of noisy instances, we must have a noise-resistant mechanism at the classifier level. We propose a method that, under mild conditions, removes the effect of both outlier and label-noisy instances as much as possible. It is proved that this method is unbiased.
Based on this method, Robust Active Learning (RAL) is devised to minimize the effect of noisy instances on both the classifier boundary and the sample complexity. In semi-supervised and active learning, where not all the data is labeled, learning can easily become biased. This is more severe in active learning, where the labeled data are no longer i.i.d. In RAL, unlabeled data is used to alleviate this problem.
Informativeness and representativeness are the two criteria that many active learning methods use, and RAL considers both. In addition, by paying attention to noisy instances and reducing their effect through an unbiased method, RAL also attends to the accuracy of these criteria.
The intuition behind the proposed method is that the data useful for learning lies in a range of complexities. We define the notion of instance complexity as the amount of distortion that can be added to an instance without changing the learned function much. For noisy instances, even a small distortion makes the learned function very different. Likewise, for label-noisy instances a small change of the instance may result in quite a different learned function. Non-noisy instances, on the other hand, are more resistant to distortion. The notion of instance complexity is defined based on the equivalence of robustness and regularization [5].
In order to discriminate noisy instances, we devise a new loss function. For noisy instances its value is small, and for other instances its value equals the ordinary loss. Two functions are involved in this loss: a simpler function corresponding to the ordinary classification function, and a more complex function used to discriminate noisy instances.
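The paper's exact loss appears in the equations below; purely as a hedged illustration of the idea that the ordinary loss should vanish exactly when the complex function claims an instance as noisy, a product of two hinge terms behaves this way (the function names and the product form here are our own assumptions, not the paper's definition):

```python
import numpy as np

def hinge(z):
    """Standard hinge surrogate: max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def simple_complex_loss(y, f_x, g_x):
    """Sketch: multiply the hinge losses of the simple function f and
    the complex function g.  If g 'claims' the instance as noisy
    (y * g_x >= 1), its hinge factor is 0 and the instance contributes
    nothing to the cost of f.  If g is silent (g_x == 0), the factor
    hinge(0) = 1 leaves the ordinary loss of f unchanged."""
    return hinge(y * f_x) * hinge(y * g_x)

# Ordinary instance: g is silent, the ordinary hinge loss of f survives.
ordinary = simple_complex_loss(+1, 0.2, 0.0)   # hinge(0.2) * hinge(0) = 0.8
# Noisy instance: g claims it, so the large loss of f is zeroed out.
noisy = simple_complex_loss(+1, -2.0, 2.0)     # hinge(-2) * hinge(2) = 0.0
```

Note how this realizes the two requirements stated later (the loss of an instance must be zero whenever the complex function assigns it a nonzero value) without any explicit indicator variable.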
The resulting optimization problem is highly non-convex and very challenging. A convex relaxation of RAL is developed to obtain a good approximation to the solution of the problem; we show that the objective of this approximation is close to the global optimum of the original non-convex problem.
The main contributions of the paper are as follows:

- Definition of instance complexity based on the distortion of noisy instances

- A new loss function robust to outlier and label-noisy instances

- A convex relaxation for Robust Active Learning

- An efficient algorithm for solving the convex relaxation problem, capable of effective warm-starting, which is necessary for active learning
In the following, we first review related work. After that, notation and some preliminaries are discussed. In Section 2, the problem formulation is stated. In Section 3, the Simple-Complex classifier is discussed. In the next section, some theoretical results regarding the Simple-Complex classifier are introduced. In Section 5, the RAL convex relaxation is solved using a Nesterov-type method. In Section 6, experimental results are analyzed. Finally, future work and conclusions are given in the last section.
1.1 Motivation and Related Works
One may ask: why not eliminate outlier instances before active learning begins? Before starting, and in the first stages of active learning, we have very few labels; therefore, a sufficiently accurate estimate of the classifier is not available. In this situation, eliminating data may lose important information. Furthermore, in some applications we may not even have an accurate estimate of the instances themselves. For example, in learning from distributions, where every instance is itself a distribution, many modern methods use embeddings of distributions in a reproducing kernel Hilbert space. Usually only a small sample is drawn from each distribution, and the estimated kernel embedding is inaccurate [6]. When we do not have an accurate estimate of either the learned function or the instances, eliminating precious information may not be a good choice.
Also, in some applications we need to relearn the function when new data arrives. Unless we have plenty of data, we do not know in advance whether a data point is an outlier, yet we have to learn using the existing data.
Depending on the degree of outlyingness [7], eliminating instances in the early stages of learning carries different degrees of risk. If the degree of outlyingness of an instance is very high, it may be safe to delete it; if not, deleting the instance poses considerable risk to learning.
In many active learning applications, acquiring label information is highly expensive, for example when labels come from an expensive or time-consuming experiment, or from costly expert time. In other cases it may be crucial to reach the highest possible accuracy with the available data (e.g., minefield prediction). In such cases it is very important to use all means to achieve a more accurate classifier with as few labeled data as possible, and the speed of querying may therefore be less important.
Active learning for support vector machines was introduced in [8]. Their approach is to select the data point that minimizes the version space; for support vector machines this means selecting the data point nearest to the current boundary. Extending this approach to selecting a batch of data points is difficult.
Building on active learning for support vector machines, a min-max framework for active learning was proposed [9]. In this framework, the point nearest to the boundary, regardless of its label, is selected. When a new labeled instance is added, assuming the classifier is fixed results in overestimating the impact of this new labeled point [9]. Therefore, we must optimize simultaneously over both the classifier and the query point; note that the query point is then no longer the point nearest to the current classification boundary. This approach selects the most informative instance, where the informativeness of a data point is measured by the uncertainty of predicting its label using the labeled data.
Selecting only the most informative instances makes learning biased to the sample. To address this, representativeness is usually also considered, though many active learning methods combine representativeness and informativeness in an ad hoc way. In [10], the representativeness of an example is measured by the uncertainty of predicting its label based on the unlabeled data, within the min-max framework [9]. Minimizing the SVM cost function jointly over the classifier, the unlabeled data, and the selection variable makes the problem very challenging; instead, [10] used least squares as the loss function. The objective value under the least-squares loss has a closed form, which they exploit to compute the query point. Although they used the least-squares loss for active learning, they used an SVM to compute the classifier when reporting accuracy. In classification, the hinge loss is considered a better surrogate for the 0-1 loss than least squares. Even though using different losses for active learning and learning may bias the model, they did not prove any results regarding the unbiasedness of the learning. In our method, we use the hinge loss for both classification and active learning, and solve a semi-supervised learning problem in every iteration. In this way, better measures of informativeness and representativeness are obtained.
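To make the min-max selection rule concrete, here is a toy sketch of one common reading of the criterion: for each candidate, refit the classifier under each possible label, take the worst case over labels, and query the candidate whose worst case is smallest. The 1-D linear model, the grid-search inner solver, and all parameter values are our own simplifications, not the cited papers' formulation:

```python
import numpy as np

def cost(w, X, y, lam=0.1):
    """Regularized hinge cost of a toy 1-D linear classifier f(x) = w*x."""
    return np.mean(np.maximum(0.0, 1.0 - y * w * X)) + lam * w ** 2

def min_cost(X, y):
    """Crude grid-search stand-in for the inner minimization over f."""
    grid = np.linspace(-5.0, 5.0, 1001)
    return min(cost(w, X, y) for w in grid)

def minmax_query(X_lab, y_lab, pool):
    """Min-max selection sketch: worst case over the unknown label of the
    refitted minimal cost; query the candidate minimizing that worst case."""
    scores = [
        max(min_cost(np.append(X_lab, x), np.append(y_lab, s))
            for s in (-1.0, 1.0))
        for x in pool
    ]
    return int(np.argmin(scores))

X_lab = np.array([-2.0, 2.0])
y_lab = np.array([-1.0, 1.0])
picked = minmax_query(X_lab, y_lab, [0.1, 3.0])
```

On this toy data the near-boundary candidate (0.1) is picked over the far, confidently classifiable one (3.0), matching the version-space intuition described above.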
[4] states that outlier and label-noisy data are harmful for active learning and proposes a forward-backward approach that explores the unlabeled and labeled data to get rid of them. In the forward phase, queries add unlabeled data to the labeled set. In the backward phase, the most probably noisy instances, those that deteriorate the classifier the most, are removed from the labeled set to protect the classifier from their impact. The decision on which data points are noisy is made at two levels: instance level and label level.
The authors in [11] proposed a convex relaxation for active learning. They first construct a matrix based on the uncertainty and divergence of the data, and then select some rows/columns of this matrix using an integer quadratic program, for which they develop a convex relaxation. As stated in [10], this way of combining informativeness and representativeness is ad hoc, and a more principled approach is usually preferred.
Regarding instance complexity, as in [5], it differs from the influence function (IF). The IF considers the change of the learned function from the perspective of the classifier [12]; instance complexity, on the other hand, considers the change of the classifier caused by disturbances of an instance, from the perspective of the instance.
In [7], a geometric theory of outliers is developed, whose definition we use.
2 Active and Semi-supervised Learning
The min-max framework for active learning [9, 10] attempts to find the most useful instances using labeled and unlabeled data. Unfortunately, noisily labeled data can severely impact the classification boundary. Also, it is well known that a classifier may overfit to correctly labeled outliers [4]. Therefore, noisy data may decrease generalization. In order to improve accuracy in the presence of noisy training data, the following scheme is proposed.
2.1 The Framework
Let be the non-noisy initial labeled set. Define the cost function as
(1) 
Let be the optimal classifier. The version-space minimization approach selects the instance closest to the classification boundary, i.e., . Unfortunately, especially in the early stages of active learning, the current boundary differs considerably from the optimal boundary. As the number of labeled instances increases, the current boundary is expected to move closer to the optimal one. Based on [9, 10], the cost function of the min-max framework for active learning can be written as
(2) 
In the second equality, depends on and . Since there may be noisy instances, we cannot assume that these coefficients are equal for every point. Using , the importance of the loss of an instance can be adjusted relative to the empirical risk of the other instances; a small means that the data point is not important for learning the function. For representativeness [10] of the query points, minimization over the labels of the unlabeled data is used, i.e.,
(3)  
(4) 
Unfortunately, the coefficients for the unlabeled instances are unknown. But we know that they are zero for noisy instances, whose impact on the classifier is undesirable. Letting be the set of noisy instances, we can simply set
Assume we have access to a function whose value is zero for non-noisy instances and , i.e., the hypothetical class label, for noisy instances.
If including the loss of an instance in the cost function is useful, then must be zero. Conversely, if the function assigns a value other than zero to an instance, the loss of that instance must be zero. We can enforce this using the loss
Corollary 0.1
With defined as above, problem (3) is equivalent to the same problem when there are no noisy instances.
Unfortunately, the set of noisy instances as well as the function are unknown. How is it possible to remove those instances from learning when they are unknown?
3 Simple-Complex Classifier
In order to answer this question, we first define the notion of instance complexity. This definition is motivated by the equivalence of regularization and robustness [5]. Let , and .
Definition 1
Instance Complexity. Define the instance complexity of instance as
This definition is very intuitive. If an instance is simple, changing it to with large does not change the function learned on the data much. But even a small perturbation of a complex instance makes the learned function very different; in other words, the classifier is overly sensitive to complex instances. The following theorem proves that if there are any noisy instances, they are the most complex ones.
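The sensitivity reading of this definition can be sketched numerically. The following is entirely our own toy construction (a 1-D least-squares learner standing in for the paper's classifier, and a finite-difference proxy standing in for the formal definition): perturb one instance slightly, refit, and measure how much the learned function moves. The mislabeled instance reacts most strongly:

```python
import numpy as np

def fit(X, y):
    """Toy stand-in for the learner: 1-D least squares through the origin."""
    return (X @ y) / (X @ X)

def instance_complexity(X, y, i, delta=1e-3):
    """Finite-difference proxy for instance complexity: how strongly the
    learned function reacts to a small distortion of instance i."""
    Xp = X.copy()
    Xp[i] += delta
    return abs(fit(Xp, y) - fit(X, y)) / delta

X = np.array([1.0, 2.0, -1.0, -2.0, 2.0])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])  # last instance is mislabeled
scores = [instance_complexity(X, y, i) for i in range(len(X))]
```

Here the mislabeled point (index 4) scores several times higher than any clean point, in line with the theorem below stating that noisy instances attain the maximum instance complexity.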
Theorem 1
Assume is learned on , which contains a subset of size consisting of noisy instances. Then these instances have the highest instance complexity values .
Based on Theorem (1), we need a mechanism to select the most complex instances. As stated before, instances useful for learning lie in a range of complexities: too-simple instances are not useful for learning, and too-complex instances are noisy and harmful to it. Therefore, the more complex instances must be classified by the function and the simple instances by the function . In order to restrict from classifying simple instances, the total energy of this function must be limited; in other words, since there is only a limited amount of noisy data, the number of nonzero values of must be limited. Then, with complex enough and more complex than , the noisy instances will be selected by . Considering the constraint for the initially labeled instances and assuming there are only noisy instances, problem (4) becomes:
(6)  
We can adjust the complexity of the function via the properties of its reproducing kernel Hilbert space and the parameter .
Replacing with is very intuitive. In the Theoretical Results section, we prove that this cost function finds noisy instances. Furthermore, it is proven that this cost function yields an unbiased classifier. Let and . It is very intuitive that, in the ideal case, we want the loss vectors and to be orthogonal.
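For concreteness, the standard hinge surrogate used below, together with one natural rendering of the orthogonality requirement on the two loss vectors (the notation here is ours; the paper's own equations are omitted above), is

```latex
\ell_{\mathrm{hinge}}(z) \;=\; \max(0,\,1-z), \qquad
\sum_{i}\,\ell_{\mathrm{hinge}}\!\big(y_i f(x_i)\big)\,
\ell_{\mathrm{hinge}}\!\big(y_i g(x_i)\big) \;=\; 0 ,
```

i.e., every instance incurs a nonzero loss under at most one of the two functions: ordinary instances are paid for by the simple function, noisy ones by the complex function.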
The loss function is non-convex. If we assume , then can be approximated by any surrogate loss such as the hinge loss; in the following, is taken to be the hinge loss. Let and , where and . Replacing both losses with the hinge loss, we reach the following problem:
(7)  
(9)  
The proof of the following is in the Supplementary Material.
Theorem 2
Let and . The problem above is equivalent to
(10)  
s.t.  (11)  
(12)  
(13)  
(14) 
The rank constraint makes the above problem non-convex. Removing it and deriving the dual of the inner problem with respect to , similarly to [13], the following is obtained. See the proof in the Supplementary Material.
Theorem 3
A convex relaxation of the problem above is
(15)  
s.t.  (16)  
(17)  
(18)  
(19)  
(20) 
4 Theoretical Results
In the following theorem, based on the definition of outlyingness in [7], the discrimination of noisy instances is considered. The proof of this theorem, based on a set of lemmas, is in the Supplementary Material.
Theorem 4
If the classifier is simple enough and there exists a direction whose sign for instance complies sufficiently with the instance labels, so that the following inequalities are satisfied,
where , then the classifier will discriminate noisy instances.
When there is a coefficient in the loss function, it is possible that learning becomes biased. In the following, it is proved that under mild conditions the Simple-Complex classifier is not biased.
Theorem 5
Let be the distribution of the data without noise, and let be a distribution identical to but contaminated with noise. Now assume the following condition is satisfied at optimality in (7),
(21) 
If is fixed, then
where ,
This result, proved based on [14] in the Supplementary Material, shows that if the weighted risk function based on the Simple-Complex classifier is minimized, then with high probability an upper bound on the expected risk over a test sample is minimized. In other words, the Simple-Complex problem corrects bias, and this unbiasing mechanism is what minimizes the impact of noisy instances.
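The role the per-instance weights play in this bias correction can be seen in a toy Monte Carlo sketch, which is entirely our own construction and not the paper's proof: zeroing the weight of contaminated losses and renormalizing leaves the clean expected risk essentially intact, while the unweighted mean is badly biased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-instance losses drawn from a clean distribution (mean 1.0), with a
# fraction eps of them replaced by arbitrary large (noisy) losses.
n, eps = 100_000, 0.1
clean = rng.normal(1.0, 0.5, n)
noisy_mask = rng.random(n) < eps
losses = np.where(noisy_mask, 50.0, clean)

# Down-weight the noisy instances to zero and renormalize: the weighted
# empirical risk recovers the clean expected risk of 1.0, whereas the
# naive unweighted mean is pulled far away by the contamination.
weights = np.where(noisy_mask, 0.0, 1.0)
naive = losses.mean()
weighted = (weights * losses).sum() / weights.sum()
```

The contaminated sample stands in for the distribution above; the zero/one weights stand in for the coefficients the classifier assigns to noisy and ordinary instances.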
5 Robust Active Learning
The objective function (3) is convex with respect to and concave with respect to . Based on the minimax lemma, the order of optimization can be exchanged. Using a binary variable , problem (3), with objective (4) replaced by objective (6), becomes
(22)  
A noisy instance cannot be selected for querying; therefore, . In this problem, for all . Using , we can compare the noisiness of two instances, or add constraints on the noisiness of instances. This can even be used to query the degree of noisiness of an instance.
Unfortunately, this problem is a non-convex integer program and is very difficult to solve. The constraints on the domains of the variables and are relaxed to .
Theorem 6
If and are relaxed in the above problem, then it is equivalent to
s.t.  (23)  
(24)  
(25)  
(26)  
(27) 
This problem is highly non-convex. Before devising a convex relaxation for it, a convex relaxation for the problem without is proposed. In this case, the sole source of non-convexity is (24). Based on the convex relaxation proposed by Goemans and Williamson [15], this constraint can be relaxed to
(28) 
where . Based on the result of Goemans and Williamson [15], this relaxation is provably accurate. Using the Schur complement lemma, this is equivalent to
(29)  
(30) 
where . Unfortunately, this constraint is still non-convex. Since , we have . In this case, the convex relaxation of the problem is
(31)  
s.t.  (32)  
(33)  
(34)  
(35)  
(36) 
where is as in (29).
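The rank-one relaxation and its accuracy can be illustrated with the classic random-hyperplane rounding of Goemans and Williamson. The sketch below is generic and in our own notation, not the paper's exact construction: given a PSD solution standing in for the rank-one matrix of sign variables, factor it and round with random hyperplanes.

```python
import numpy as np

def gw_round(Q, n_trials=50, rng=None):
    """Goemans-Williamson style rounding sketch: factor the PSD matrix
    Q (a relaxation of q q^T with unit diagonal) as V V^T and round with
    random hyperplanes, keeping the sign vector that best aligns with Q."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Eigen-factorization (Cholesky may fail on a singular PSD matrix).
    w, U = np.linalg.eigh(Q)
    V = U * np.sqrt(np.clip(w, 0.0, None))
    best, best_val = None, -np.inf
    for _ in range(n_trials):
        q = np.sign(V @ rng.normal(size=V.shape[1]))  # random hyperplane
        q[q == 0] = 1.0
        val = q @ Q @ q  # alignment of the rounded signs with Q
        if val > best_val:
            best, best_val = q, val
    return best

# A rank-one PSD matrix Q = q* q*^T is rounded back to +/- q*.
q_star = np.array([1.0, -1.0, 1.0, 1.0])
Q = np.outer(q_star, q_star)
q = gw_round(Q)
```

When the relaxed solution happens to be rank one, as in this example, rounding recovers the underlying sign vector exactly (up to a global sign); otherwise it yields a feasible binary approximation.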
For the problem in Theorem (6), constraint (24) can be written as (29) but with , where ; it can be written
(37)  
(38)  
(39)  
(40) 
Since we do not want to select noisy instances for querying, at least one of or is very small and the other is less than one; therefore, the last term is small, and the last equation simply becomes . Moreover, is a good approximation, since is complex and can therefore fit the data more easily. Using this approximation, the equation becomes
(41)  
Instead, using the following approximation may produce better results
(42)  
(43)  
(44)  
(45)  
(46)  
(47)  
(48)  
(49)  
(50)  
(51)  
(52)  
(53) 
In addition to the above constraints, we must add to the objective function to enforce . Furthermore, we have
(54)  
(55) 
Therefore,
(56)  
(57) 
The final form of the problem, using (41), is
(58)  
s.t.  (59)  
(60)  
(61)  
(62)  
(63)  
(64)  
(65)  
(66)  
(67)  
(68)  
(69) 
Based on the representer theorem, . For notational simplicity in the objective function (58), let and . (If the linear approximation for is used, define .) Then the constraint set of the above problem can be represented, with proper definitions of the operators , , , , , as
(70)  
In this way, the problem takes a more standard conic form:
(71) 
where
(72) 
and we have
(73) 
with some simplification of and
(74)  
(75) 
we have
(76)  
(77)  
(78)  
(79) 
And finally,
(80)  
5.1 Solving Robust Active Learning Problem
The function is concave with respect to and convex with respect to , and the constraint set of problem (71) is affine. Based on the minimax lemma, . Therefore, this problem is a convex-concave saddle-point problem. It is well known that the operator defined as for this problem is maximal monotone. If , then is the saddle point of problem (72). We propose two methods for this problem; the first is based on the forward-backward-forward (Tseng's) method [16].
5.1.1 Forward-Backward-Forward Method
By building a maximal monotone operator and finding its fixed point, i.e., , for problem (71), the saddle point of the problem can be obtained. By Corollary 24.5 in [16], the following operator is maximally monotone.
(81)  
(82)  
(83)  
(84) 
Based on Theorem 25.10 in [16], using Tseng's method and defining , the following iteration converges to the saddle point of the above problem
(85)  
(86)  
(87)  
(88) 
For we have:
and for :
(89) 
where , is the proper inner product for , and is an operator such that . We define as
(90) 
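The paper's iteration operates on the conic problem above; as a minimal, self-contained illustration of the forward-backward-forward step itself, consider the bilinear saddle problem min over x, max over y of x^T A y. This toy problem is unconstrained, so the backward (resolvent) step is the identity and only the forward step plus the forward correction remain; the matrix and step size are our own choices:

```python
import numpy as np

def F(z, A):
    """Monotone (skew) operator of the bilinear saddle problem
    min_x max_y x^T A y:  F(x, y) = (A y, -A^T x)."""
    n = A.shape[0]
    x, y = z[:n], z[n:]
    return np.concatenate([A @ y, -A.T @ x])

def tseng_fbf(z0, A, gamma=0.3, iters=300):
    """Tseng's forward-backward-forward iteration.  The backward
    (resolvent) step is trivial here because the toy problem is
    unconstrained: a forward step, then a correction that re-evaluates
    the operator at the intermediate point."""
    z = z0.copy()
    for _ in range(iters):
        Fz = F(z, A)
        zh = z - gamma * Fz                  # forward step
        z = zh - gamma * (F(zh, A) - Fz)     # forward correction
    return z

A = np.array([[1.0, 0.2],
              [0.0, 1.0]])
z = tseng_fbf(np.array([1.0, -1.0, 0.5, 2.0]), A)
```

Plain gradient descent-ascent diverges on such bilinear problems, whereas the forward correction makes each iteration a strict contraction here, so the iterates converge to the saddle point at the origin.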
This is a proximal-point step in a conic space. The proof of the following theorem, which is based on [17], can be found in the Supplementary Material.
Theorem 7
The dual of the above problem is
(91)  
where . Differentiating with respect to and gives
(92)  
(93)  
and setting these to zero, we have
(94)  