Robust and Discriminative Labeling for Multilabel Active Learning Based on Maximum Correntropy Criterion
Abstract
Multilabel learning has attracted great interest in many real-world applications. Having an oracle assign many labels to a single instance is a highly costly task. Meanwhile, it is also hard to build a good model without identifying the discriminative labels. Can we simultaneously reduce the labeling cost and improve the ability to train a good model for multilabel learning?
Active learning addresses the limited-training-samples problem by querying the most valuable samples, achieving better performance at little cost. In multilabel active learning, some research has been done on querying the relevant labels with fewer training samples, or on querying all labels without identifying the discriminative information. None of these methods can effectively handle outlier labels when measuring uncertainty. Since the Maximum Correntropy Criterion (MCC) provides a robust analysis of outliers in many machine learning and data mining algorithms, in this paper we derive a robust multilabel active learning algorithm based on MCC by merging uncertainty and representativeness, and propose an efficient alternating optimization method to solve it. With MCC, our method can eliminate the influence of outlier labels that are not discriminative when measuring uncertainty. To further improve the information measurement, we merge uncertainty and representativeness using the predicted labels of the unknown data. This not only enhances the uncertainty measurement but also improves the similarity measurement of multilabel data with label information. Experiments on benchmark multilabel data sets show superior performance over the state-of-the-art methods.
I Introduction
Machine learning is the topic of the day. However, the problem of limited training samples remains a persistent challenge in the field [1], especially now that massive amounts of data are generated within short periods of time. Active learning, a subfield of machine learning, is an effective approach to addressing the limited-training-samples problem in classification. It has been elaborately developed for various classification tasks by querying the most valuable samples: an iterative loop finds the most valuable samples for the 'oracle' to label, and gradually improves the model's generalization ability until a convergence condition is satisfied [1]. In general, there are two motivations behind the design of a practical active learning algorithm, namely uncertainty and representativeness [2, 3, 4]. Uncertainty is the criterion used to select the samples that can help improve the generalization ability of classification models, ensuring that the classification results on unknown data are more reliable. Representativeness measures the overall patterns of the unlabeled data to prevent the bias of a classification model trained with few or no initial labeled data. Whichever kind of active learning method is used, the key lies in how to select the most valuable samples, which is referred to as the query function.
Among tasks such as object recognition, scene classification [5], and image retrieval, multilabel classification, which aims to assign each instance multiple labels, may be the most difficult one [6, 7, 8]. For each training sample, different combinations of labels need to be considered. Compared with single-label classification, labeling for multilabel classification is more costly [9]. Multilabel learning has been successfully applied in machine learning and computer vision, including web classification [10], video annotation [11], and so on [12, 13, 14]. Many types of machine learning algorithms have been developed to solve these classification tasks, but the limited-training-samples problem persists in many of them. Hence, active learning has become even more important for solving this problem and reducing the cost of the various classification tasks.
State-of-the-art multilabel active learning algorithms can be classified into three categories based on the query function. The first category relies on the labeled data to design a query function with uncertainty [15, 16, 17]. In these methods, the design of the query function ignores the structural information in the large-scale unlabeled data, leading to serious sample bias or undesirable performance. To eliminate this problem, a second category, which depends on representativeness, has been developed [18, 19, 20]. These approaches elaborately consider the structural information of the unlabeled data but discard the discriminative (uncertain) information; therefore, a large number of samples would be required before an optimal boundary is found. Since utilizing either the uncertainty criterion or the representativeness criterion alone may not achieve desirable performance, a third category, which combines both criteria [3, 4, 21], arises naturally. These methods are either heuristic in designing the specific query criterion or ad hoc in measuring the uncertainty and representativeness of the samples. Although they are effective, the two parts are independent: the uncertainty still relies only on the limited labeled samples, and the information from the two criteria is not mutually enhanced. Most importantly, these methods ignore the outlier labels that exist in multilabel classification when designing a query function for active learning.
However, outlier labels have a significant influence on the measurement of uncertainty and representativeness in multilabel learning. In the following, we discuss the outlier label and its negative influence on both measurements in detail.
Fig. 1 shows a simple example of the influence of outlier labels. As input, we annotate the image with three labels, namely tree, elephant, and lion. Hence, the image feature is composed of three parts: the feature of the tree, the feature of the elephant, and the feature of the lion. Intuitively, the tree contributes much more to the image feature than the elephant and the lion, and the lion contributes the least. If we use the image with these three labels to learn a lion/non-lion binary classification model, the model would actually depend on the tree's and elephant's features rather than the lion's. It would thus be a biased model for classifying lions and non-lions. Given a test image in which a lion covers most of the image, the trained model would fail to recognize the lion. If we use such a model to measure uncertainty in active learning, it may produce wrong measurements for images with the lion label. We call the lion label in the input image an outlier label.
Furthermore, we present the formal definition of an outlier label. Consider the sample-label pairs of an instance, among which one label is the most relevant to the instance. A label is defined as an outlier label if it has two properties: first, it is a relevant label of the instance; second, compared with the most relevant label, it is much less relevant to the instance. Fig. 2 illustrates these two properties and makes the outlier label easier to understand. According to this definition, the lion is naturally treated as an outlier label. Since the lion's feature is not very prominent, if we treat the lion as a positive label and use the image in Fig. 2 to learn a model, the model would not be able to effectively query an informative sample. Hence, if the influence of the outlier label can be avoided or decreased when we query the most informative sample, it is very useful for building a promising classification model. The definition of the outlier label also fits the fact that an outlier label may not be noticed by the oracle at first glance. In Fig. 2, the tree can be recognized at first glance by the oracle, but the lion is well hidden and may easily be overlooked. The definition of the outlier label is also consistent with the query types proposed in [22].
For two multilabel images with the same labels but different outlier labels, the features of the two images may differ greatly. Therefore, it is very hard to judge the similarity between two instances with different outlier labels based on features alone. In Fig. 3, we provide a simple example of this problem. We present the similarity between the SIFT features using a Gaussian kernel, and the label similarity based on MCC. In Fig. 3, the similarity between image 1 and image 2 should be larger than that between image 2 and image 3, since the labels of image 1 and image 2 are exactly the same. However, the result is the opposite when similarity is measured with their SIFT features. The outlier label is the lion in image 1 and the tree trunk in image 2, and these two outlier labels greatly increase the feature difference between the two images. In summary, the measurement of uncertainty and representativeness deteriorates in the presence of outlier labels.
To address the above problems, in this paper we propose a robust multilabel active learning (RMLAL) algorithm, which effectively merges uncertainty and representativeness based on MCC.
As to robustness, correntropy has proved promising in information theoretic learning (ITL) [23] and can efficiently handle large outliers [24, 25]. In conventional active learning algorithms, the mean square error (MSE) cannot easily control the large errors caused by outliers. For example, in Fig. 2 there are two labels for the image: tree and lion. If we use a lion model to fit the image, the predicted value must be very far from the lion's label. If we use the MSE loss to measure the discrepancy between the predicted value and the label, a large error is introduced, since MSE amplifies the error by squaring it. MCC instead measures the loss between the predicted value and the label with a kernel function: if a large enough error occurs, the MCC value is almost zero, so the influence of the large error is restrained. We therefore replace the MSE loss with MCC in the minimum margin model of the proposed formulation. In this way, the proposed method is able to eliminate the influence of outlier labels, making the query function more robust.
As to discriminative labeling, we use MCC to measure the loss between the true label and the predicted label. MCC can enhance the most discriminative information and suppress useless or unexpected information. Hence, with MCC in the proposed method, if a label is not an outlier label, it plays an important role in the construction of the query model; otherwise, the model decreases the influence of the outlier label when measuring uncertainty. With such an approach, the effects of discriminative labels are enhanced and those of outlier labels are suppressed, so discriminative labeling is achieved.
As to representativeness, we mix the predicted labels of the unlabeled data with MCC as the representativeness. As shown in Fig. 3, samples may have the same labels but different outlier labels, making their features quite different. If we just use the corresponding features to measure the similarity, it will lead to a wrong judgment. Hence, we propose to use a combination of label similarity and feature similarity to define the consistency. The combination makes the measurement of representativeness more general. To decrease the computational complexity of the proposed method, the half-quadratic optimization technique is adopted to optimize the MCC.
The contributions of our work can be summarized as follows:

To the best of our knowledge, it is the first work to focus on the outlier labels in multilabel active learning. We find a robust and effective query model for multilabel active learning.

The predicted labels of the unlabeled data and the labels of the labeled data are utilized with MCC to merge the uncertain and representative information, deriving an approach that makes the uncertain information more precise.

The proposed representativeness measurement considers label similarity via MCC. It can handle outlier labels effectively and make the similarity measurement more accurate for multilabel data. Meanwhile, it provides a new way to merge representativeness into uncertainty.
The rest of the paper is organized as follows: Section 2 briefly introduces related work. Section 3 defines and discusses a new objective function for robust multilabel active learning, and then proposes an algorithm based on half-quadratic optimization. Section 4 evaluates our method on several benchmark multilabel data sets. Finally, we summarize the paper in Section 5.
II Related Work
The multilabel problem is universal in the real world, so multilabel classification has drawn great interest in many fields. For a multilabel instance, the human annotator needs to consider all the relevant labels. Hence, labeling for multilabel tasks is more costly than for single-label learning, yet research on active learning for multilabel learning remains limited.
In multilabel learning, one instance corresponds to more than one label. A direct way to solve a multilabel problem is to convert it into several binary problems [26, 27]. In these approaches, the uncertainty is measured for each label, and a combination strategy is then adopted to measure the uncertainty of the instance. [26] trained a binary SVM model for each label and combined them with different strategies for instance selection. [27] predicted the number of relevant labels for each instance by logistic regression, and then adopted SVM models to minimize the expected loss for instance selection. Recently, [12] adopted mutual information to design the selection criterion for Bayesian multilabel active learning, and [28] selected valuable instances by minimizing the empirical risk. Other works combine informativeness and representativeness for a better query [29, 30]. [29] combined the label cardinality inconsistency and the separation margin with a trade-off parameter. [30] took into account the cluster structure of the unlabeled instances as well as the class assignments of the labeled examples for a better selection of instances. All the above algorithms were designed to query all the labels of the selected instances without identifying discriminative labels. Another kind of approach queries label-instance pairs, with a relevant label and an instance at each iteration [22, 31, 32]. [22] queried the most relevant label based on the query types. [32] selected label-instance pairs based on a label ranking model. In these approaches, the most relevant label is assigned to the instance, and some relevant labels may be lost due to the limited number of queried labels. Therefore, many more label-instance pairs may need to be queried to achieve good performance. It has been shown that combining informativeness and representativeness is very effective in active learning.
We adopt this strategy in the paper.
Whether selecting an instance by all its labels or by label-instance pairs, most active learning algorithms select uncertain instances based on very limited samples and ignore the label information. For example, if all labels are given for one instance and the instance has many outlier labels, it may decrease the performance of the classification task. To address these problems, we use the predicted labels of the unlabeled data to enhance the uncertainty measurement and adopt MCC to consider as many relevant labels as possible while excluding the outlier labels. To the best of our knowledge, this is the first time MCC has been adopted in multilabel active learning with label information for querying.
III Multilabel Active Learning
Suppose we are given a multilabel data set with n samples and K possible labels for each sample. Initially, a small number of samples are labeled. Without loss of generality, we denote the labeled samples as the set L, with an associated label set for each labeled sample, and the remaining unlabeled samples as the set U, which is the candidate set for active learning. Moreover, we denote the sample that we want to query in the active learning process as x_s, and the label matrix of the labeled data as Y_L. These symbols are used throughout the following discussion.
III-A Maximum Correntropy Criterion
In multilabel classification tasks, outlier labels pose a great challenge to training a precise classifier, mainly due to the unpredictable nature of the errors (bias) caused by those outliers. In active learning in particular, the limited labeled samples with outliers easily lead to a great bias. This directly biases the uncertainty information, making the queried instances undesirable or even harming performance. Recently, the concept of correntropy was first proposed in information theoretic learning (ITL), and it has drawn much attention in the signal processing and machine learning communities for robust analysis, since it can effectively handle outliers [33, 34]. In fact, correntropy is a similarity measurement between two arbitrary random variables A and B [24, 33], defined by
V(A, B) = E[κ(A, B)]  (1)
where κ(·, ·) is a kernel function satisfying Mercer's theorem and E[·] is the expectation operator. We can observe that the definition of correntropy is based on the kernel method; hence it shares the advantages of kernel techniques. However, unlike conventional kernel-based methods, correntropy works on pairwise samples independently and has a strong theoretical foundation [24]. By this definition, correntropy is symmetric, positive, and bounded.
However, in real-world applications, the joint probability density function of A and B is unknown, and only a finite number N of data points {(a_i, b_i)}, i = 1, …, N, drawn from A and B are available. Thus, the following sample estimator of correntropy is usually adopted:
V̂(A, B) = (1/N) Σ_{i=1}^{N} κ_σ(a_i − b_i)  (2)
where κ_σ is the Gaussian kernel function κ_σ(e) = exp(−e²/(2σ²)). According to [24, 33], maximizing the correntropy between A and B gives
max Σ_{i=1}^{N} κ_σ(a_i − b_i)  (3)
The objective function (3) is called MCC, where σ is the kernel width; the auxiliary variable used to optimize it is specified in Proposition 2. Compared with MSE, which is a global metric, correntropy is a local metric: its value is mainly determined by the kernel function along the line A = B.
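To make this locality concrete, the following sketch (illustrative only; function names and data are invented, not the paper's code) contrasts the Gaussian-kernel sample estimator in (2) with MSE when a single prediction error becomes huge:

```python
import numpy as np

def gaussian_correntropy(a, b, sigma=1.0):
    """Sample estimator of correntropy, Eq. (2): the mean of a Gaussian
    kernel applied to the pointwise errors."""
    e = np.asarray(a, float) - np.asarray(b, float)
    return np.mean(np.exp(-e**2 / (2.0 * sigma**2)))

def mse(a, b):
    """Mean square error, for comparison."""
    e = np.asarray(a, float) - np.asarray(b, float)
    return np.mean(e**2)

# Clean predictions vs. predictions with one gross outlier error
# (e.g. the biased prediction caused by an outlier label).
y_true = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
y_clean = np.array([0.9, 1.1, -0.8, -1.2, 0.9])
y_outlier = y_clean.copy()
y_outlier[0] = -9.0

# MSE is dominated by the single outlier (global metric), while the
# outlier's kernel value is ~0, so correntropy barely moves (local metric).
mse_ratio = mse(y_true, y_outlier) / mse(y_true, y_clean)
corr_drop = gaussian_correntropy(y_true, y_clean) - gaussian_correntropy(y_true, y_outlier)
```

Here the MSE grows by several orders of magnitude, while the correntropy drops by at most 1/N, which is exactly the insensitivity to large errors the method exploits.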
III-B Multilabel Active Learning Based on MCC
Usually, in active learning methods, uncertainty is measured according to the labeled data, whereas representativeness is measured according to the unlabeled data. In this paper, we propose a novel approach that merges the uncertainty and representativeness of instances in active learning based on MCC. Mathematically, it is formulated as an optimization problem w.r.t. the classifier and the query sample:
(4) 
where the classifier lives in a reproducing kernel Hilbert space and a regularization term constrains its complexity; the loss is the MCC loss function, and the label set of all the unlabeled data is calculated by the current classifiers. However, there is a problem in solving (4): the labels of the unlabeled data are unknown. Our goal is to find the optimal classifier and query sample with (4), while the labels of the queried sample are assigned by the oracle only after the query. Therefore, the labels used in (4) must be approximated before the query. We replace them with pseudo labels and obtain the following problem:
(5) 
where each pseudo label belongs to {−1, 1}: it equals 1 if the instance contains the corresponding label and −1 otherwise. In (5), the first three terms correspond to the regularized risk over all the labeled samples after the query, which carries the uncertainty information embedded in the current classifier; we call them the uncertainty part. Meanwhile, in the last term, the unlabeled data are also embedded in the current classifier to enhance the uncertainty part. However, the role of the last term is not only to enhance the uncertainty information: its main function is to describe the distribution difference between the labeled samples after the query and all the available samples, which captures the representative information embedded in the labeled samples. A trade-off parameter balances the uncertainty and representativeness information in the formulation. In the rest of this section, we analyze this objective function in a specific form and propose a practical algorithm to solve the optimization problem.
III-C Uncertainty Based on MCC
Minimum margin is the most popular and direct approach to measuring the uncertainty of an unlabeled sample by its position relative to the boundary. Let f be the classifier trained on the labeled samples. The sample to query from the unlabeled data based on the margin can be found as follows:
(6) 
Generally, with the labeled samples, we can find a classification model for a binary class problem in a supervised learning manner with the following objective function:
(7) 
where the classifier belongs to a reproducing kernel Hilbert space endowed with a kernel function, a loss function measures the prediction error, and each label belongs to {−1, 1}. Following the works of [3, 35], Proposition 1 connects margin-based query selection with the min-max formulation of active learning.
Proposition 1
The criterion of the minimum margin to find a desirable sample can be written as
where a pseudo label is assigned to the queried sample.
In previous works, the quadratic loss (MSE) is usually adopted as the loss function, but it is not robust in the presence of outliers. To overcome this problem, we introduce MCC as the loss function, given by
(8) 
where σ is the kernel width. Following Proposition 1, we can observe that the objective function (6) is equivalent to the objective function (8). To solve (8), we optimize the worst case of (8) for selection. The objective function becomes
(9) 
In our work, we decompose multilabel classification into several binary classification problems with label correlation. For ease of presentation, we consider the simple case of learning one classifier for each label independently. We then use the summation over the binary classifiers as the minimum margin in multilabel learning:
(10) 
where each binary classifier separates one label from the other labels. The objective function of the multilabel learning task based on MCC can then be formalized as:
(11) 
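A minimal sketch of this summed-margin selection (illustrative only: the label correlation and the MCC weighting of the full objective (11) are omitted, and the scores are made up):

```python
import numpy as np

def multilabel_margin_uncertainty(scores):
    """scores: (n_samples, n_labels) array of decision values from the
    per-label binary classifiers. The summed distance to the decision
    boundaries is small for uncertain samples, in the spirit of (10)."""
    return np.sum(np.abs(scores), axis=1)

scores = np.array([
    [0.9, -1.2,  0.8],   # confident on every label
    [0.1, -0.05, 0.2],   # near every boundary -> most uncertain
    [0.7,  0.6, -0.9],
])
margins = multilabel_margin_uncertainty(scores)
query_idx = int(np.argmin(margins))   # minimum margin = most valuable sample
```

The second sample, whose decision values all lie near zero, is selected.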
III-D Representativeness Based on MCC
Since the labeled samples are very limited, it is important to utilize the unlabeled data to enhance the performance of active learning. However, the labels of the unlabeled data are unknown, and it is hard to add unlabeled data to a supervised model. To enhance the uncertainty information, we merge the representative information into it through the predicted labels of the unlabeled data. Conventional similarity measurements are based on instance features alone and cannot use the unlabeled data to enhance the uncertainty information. To overcome this problem, and to account for the influence of outlier labels, we take the predicted labels of the unlabeled data into consideration for the similarity measurement. We define a novel consistency between two instances based on MCC, combining their label similarity and feature similarity:
(12) 
where the feature similarity between two samples is computed with a kernel function. The consistencies over the unlabeled data form a symmetric similarity matrix, which we collect as:
(13) 
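The following sketch shows one plausible instantiation of such a label-and-feature consistency (a hypothetical combination for illustration; the paper's exact form is given in (12), and all names and data here are invented):

```python
import numpy as np

def consistency(x_i, x_j, y_i, y_j, sigma_x=1.0, sigma_y=1.0):
    """Combine a Gaussian feature similarity with a correntropy-style
    label similarity, so that disagreement on a single (outlier) label
    reduces the consistency smoothly rather than catastrophically."""
    feat = np.exp(-np.sum((x_i - x_j) ** 2) / (2 * sigma_x ** 2))
    lab = np.mean(np.exp(-((y_i - y_j) ** 2) / (2 * sigma_y ** 2)))
    return feat * lab

# Two instances with identical labels but somewhat different features
# (e.g. caused by different outlier labels) keep a high consistency.
x1, x2 = np.array([0.0, 0.1]), np.array([0.4, 0.0])
y1 = np.array([1, -1, 1])
y2 = np.array([1, -1, 1])
c_same_labels = consistency(x1, x2, y1, y2)

# Flipping one label lowers only the label part, and only partially.
y3 = np.array([1, -1, -1])
c_diff_label = consistency(x1, x2, y1, y3)
```

Because the label term averages per-label kernel values, one mismatched label shrinks the consistency by a bounded amount instead of dominating it.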
With such a consistency matrix, representativeness amounts to finding the sample that best represents the unlabeled data set. To this end, [4] proposed a convex optimization framework by introducing variables that indicate the probability that one sample represents another, collected in a matrix:
(14) 
In our MCC-based consistency measurement, if a sample is very similar to one point and dissimilar to another, the corresponding consistency values differ greatly. Such a consistency measurement already makes the gap between representatives and non-representatives large. Therefore, we set the probability vector of a sample to the all-ones vector if it is the representative one, and to the all-zeros vector otherwise. The result is then simply the summation of the consistencies between the query sample and the unlabeled data. Hence, we can collect the similarities between the query sample and the unlabeled data as:
(15) 
Similarly, consistency and probability matrices are defined between the unlabeled data and the labeled data, respectively. The similarities between the query sample and the labeled data can be collected as follows:
(16) 
To query a desirable representative sample, one that not only represents the unlabeled data but also carries no overlapping information with the labeled data, the representative sample should describe the unlabeled data and the labeled data in contrasting ways. Hence, we maximize the difference between (15) and (16) to measure the representativeness:
(17) 
Since the numbers of unlabeled and labeled data differ greatly, we replace the sums with expectation operators and introduce a trade-off parameter:
(18) 
III-E The Proposed Robust Multilabel Active Learning
To enhance the query information from both uncertainty and representativeness, our approach combines them with a trade-off parameter, and the objective function can be presented as:
(19) 
To merge the representative part into the uncertainty part, we use the predicted labels of the unlabeled data. Denoting the predicted label set of each sample in the unlabeled data accordingly, the objective function based on MCC can be defined as:
(20) 
To query a specific point from the unlabeled data in the objective function (20), we use numerical optimization techniques. An indicator vector is introduced: a binary vector whose length equals the number of unlabeled samples, where each entry denotes whether the corresponding sample is queried. If a sample is queried, its entry is 1; otherwise it is 0. Then, the optimization can be formulated as:
(21) 
For a binary classifier, we use a linear regression model in the kernel space as the classifier, where a feature mapping maps the input into the kernel space. The label set of a multilabel instance can then be given by
(22) 
where an identity matrix (which can also be replaced by a label correlation matrix) enters through the Kronecker product between matrices. Stacking the binary classifiers accordingly, the multilabel classifier can be presented in a single matrix form. The objective function can be formalized as:
(23) 
where the indicator vector has length equal to the number of unlabeled samples. We derive an iterative algorithm based on the half-quadratic technique with an alternating optimization strategy to solve (23) efficiently [36]. Based on the theory of convex conjugate functions, we can easily derive Proposition 2 [37, 38].
Proposition 2
There exists a convex conjugate function φ such that

exp(−e²/(2σ²)) = max_p ( p e²/(2σ²) − φ(p) ),

where p is the auxiliary variable; for a fixed e, the maximum is reached at p = −exp(−e²/(2σ²)).
Following Proposition 2, the objective function (23) can be reformulated as:
(24) 
where the auxiliary variables are introduced for the labeled and unlabeled samples, respectively, with their optimal values given by Proposition 2.
The objective function (24) can be solved by the alternating optimization strategy.
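The alternating scheme can be sketched on a simplified problem (a hypothetical correntropy-loss linear regression, not the paper's full objective (24); all names are invented): with the auxiliary weights fixed, the subproblem is a weighted ridge regression, and with the model fixed, the weights follow the Proposition 2-style update.

```python
import numpy as np

def mcc_linear_regression(X, y, sigma=5.0, lam=1e-2, n_iter=30):
    """Half-quadratic alternating optimization for a correntropy loss:
    1) with w fixed, the optimal auxiliary weights are
       p_i = exp(-e_i^2 / (2 sigma^2));
    2) with p fixed, w solves a weighted ridge regression."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        e = y - X @ w
        p = np.exp(-e ** 2 / (2 * sigma ** 2))   # auxiliary variables
        XtP = X.T * p                            # columns of X.T scaled by p_i
        w = np.linalg.solve(XtP @ X + lam * np.eye(d), XtP @ y)
    return w

# A line y = 2x with one gross outlier: the outlier's weight decays
# toward zero across iterations, so the fit stays close to slope 2.
X = np.arange(1.0, 6.0).reshape(-1, 1)
y = 2.0 * X.ravel()
y[-1] = -100.0
w = mcc_linear_regression(X, y)
```

Each iteration is a closed-form solve, which is why the half-quadratic reformulation keeps the overall optimization cheap.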
III-F The Solution
In this part, we discuss the details of the algorithm for solving the objective function (23). We solve it with an alternating strategy in two steps. First, the query indicator is fixed. In this step, the classifier is adopted in kernel form, and one classifier is learned for each label. The classifiers are learned from the following formulation:
(26)  
As stated above, the matrix in the Kronecker product is an identity matrix. Meanwhile, an auxiliary variable is introduced. Then, the objective function becomes
(27)  
where all-ones vectors of the appropriate lengths are used, the label matrix of the labeled data appears, and vec(·) denotes the function that converts a matrix into a vector along its columns. The augmented Lagrangian function is given by
(28)  
The updating rules are as follows:
The resulting problem is a sparse one and can be solved with the SPAMS toolbox (http://spamsdevel.gforge.inria.fr/downloads.html). The iteration stops when the convergence condition is satisfied.
In the second step, the classifiers are fixed in order to solve for the query indicator. As stated above, the objective function is
(29) 
A linear program can be used to solve (29), and we select the most valuable sample, i.e., the one corresponding to the largest value in the solution. We summarize our algorithm in Algorithm 1.
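The overall loop has the standard pool-based structure, sketched below with the paper-specific pieces abstracted away (fit_fn and score_fn are placeholders standing in for the classifier update and the objective (29); this is a skeleton, not the paper's Algorithm 1 verbatim):

```python
import numpy as np

def pool_based_loop(X, init_idx, fit_fn, score_fn, budget=3):
    """Skeleton of a pool-based active learning loop: fit on the labeled
    pool, score every unlabeled candidate, then query the sample with
    the largest score (the entry the indicator vector would select)."""
    labeled = list(init_idx)
    unlabeled = [i for i in range(len(X)) if i not in labeled]
    for _ in range(budget):
        model = fit_fn(X[labeled])                 # retrain on labeled pool
        scores = score_fn(model, X[unlabeled])     # value of each candidate
        q = unlabeled.pop(int(np.argmax(scores)))  # query; oracle labels q
        labeled.append(q)
    return labeled

# Toy run: scoring by the first feature queries the largest values first.
X = np.arange(10, dtype=float).reshape(10, 1)
order = pool_based_loop(X, [0], fit_fn=lambda Xl: None,
                        score_fn=lambda m, Xu: Xu[:, 0])
```

In RMLAL, score_fn would correspond to solving (29) for the fixed classifiers at each iteration.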
IV Experiments
In this section, we present experimental results validating the effectiveness of the proposed method on 12 multilabel data sets from the Mulan project (http://mulan.sourceforge.net/datasetsmlc.html). The characteristics of the data sets are given in Table I. To demonstrate the superiority of our method, the following methods are used as competitors.

RANDOM is the baseline which randomly selects instances for labeling.

AUDI[32] combines label ranking with threshold learning, and then exploits both uncertainty and diversity in the instance space as well as the label space.

Adaptive[29] combines the maxmargin prediction uncertainty and the label cardinality inconsistency as the criterion for active selection.

QUIRE[3] provides a systematic way for measuring and combining the informativeness and representativeness of an unlabeled instance by incorporating the correlation among labels.

Batchrank[21] selects the best query by solving an NP-hard optimization problem based on mutual information.

RMLAL: Robust Multilabel Active Learning is the proposed method in this paper.
Dataset  domain  #instance  #label  #feature  #LC 
Corel16k  images  13,766  153  500  2.86 
Mediamill  video  43,097  101  120  4.37 
Emotions  music  593  6  72  1.87 
Enron  text  1,702  53  1,001  3.38 
Image  images  2,000  5  294  1.24 
Medical  text  978  45  1,449  1.25 
Scene  images  2,407  6  294  1.07 
Health  text  5,000  32  612  1.66 
Social  text  5,000  39  1,047  1.28 
Corel5k  images  5,000  374  499  3.52 
Genbase  biology  662  27  1,185  1.25 
CAL500  music  502  174  68  26.04 
Dataset  Vs QUIRE  Vs AUDI  Vs Adaptive  Vs Batchrank  Vs Random 
Corel16k  25/0/0  25/0/0  25/0/0  25/0/0  25/0/0 
Mediamill  5/16/4  10/12/3  25/0/0  25/0/0  25/0/0 
Emotions  25/0/0  25/0/0  25/0/0  25/0/0  25/0/0 
Enron  19/5/1  25/0/0  25/0/0  25/0/0  25/0/0 
Image  15/10/0  17/8/0  25/0/0  25/0/0  25/0/0 
Medical  13/10/2  25/0/0  25/0/0  25/0/0  25/0/0 
Scene  15/5/5  25/0/0  25/0/0  25/0/0  25/0/0 
Health  13/10/2  18/5/2  25/0/0  25/0/0  25/0/0 
Social  25/0/0  25/0/0  25/0/0  25/0/0  25/0/0 
Corel5k  25/0/0  25/0/0  25/0/0  25/0/0  25/0/0 
Genbase  25/0/0  20/5/0  25/0/0  7/15/3  25/0/0 
CAL500  25/0/0  25/0/0  25/0/0  25/0/0  25/0/0 
LC is the average number of labels per instance. We randomly divided each data set into two equal parts. One was regarded as the testing set. From the other part, we randomly selected 4% as the initial labeled set, and the remaining samples of this part were used as the unlabeled set. Among the compared methods, AUDI and QUIRE query one relevant label-instance pair at each iteration. Note that querying all labels for one instance is equivalent to querying that many label-instance pairs. Hence, for a fair comparison, we treated the corresponding number of label-instance pairs as one query instance in AUDI and QUIRE. For Batchrank, the trade-off parameter was set to 1 in the original paper; for a fair comparison, we chose the trade-off parameter from the same candidate set as for the proposed method. The other methods and parameters were all set as in the original papers. For the kernel parameters, we adopted the same values for all methods.
Without loss of generality, we adopted liblinear (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) as the classifier for all methods and evaluated the performance with micro-F1 [40], which is commonly used as a performance measure in multilabel learning. Following [12], for each data set we repeated each method five times and report the average results. We stopped the querying process after 100 iterations, with one instance queried at each iteration.
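For reference, micro-F1 pools the counts over all instance-label pairs before computing a single F1 (a standard definition, sketched here with invented toy data):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1: pool true positives, false positives, and
    false negatives across all labels and instances, then compute
    one F1 score from the pooled counts."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true != 1) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred != 1))
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp else 0.0

# Rows are instances, columns are labels (1 = relevant, 0 = irrelevant).
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
score = micro_f1(Y_true, Y_pred)   # tp=2, fp=0, fn=1
```

Pooling the counts makes micro-F1 weight frequent labels more heavily, which suits multilabel data with skewed label distributions.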
IV-A Results
We report the average results on each data set in Fig. 4. Besides, we compare each competitor with the proposed method in each run using the paired t-test at the 95% significance level, and show the Win/Tie/Lose counts for all data sets in Table II. From these results, we can observe that the proposed method performs best on most of the data sets, achieving the best results throughout almost the whole active learning process. In general, QUIRE and AUDI, which query label-instance pairs, show superior performance to Batchrank and Adaptive, which query all labels of an instance. This demonstrates that querying the relevant labels is more efficient than querying all labels of one instance. Nevertheless, our method achieves the best performance while querying all labels of one instance. The reason may be that, although Batchrank and Adaptive query all labels, they cannot avoid the influence of outlier labels because they ignore label correlation, which makes the queried samples undesirable. This also explains why Batchrank and Adaptive perform worse than the random method on some data sets. As for QUIRE and AUDI, some label information is lost when they query only a limited number of relevant labels, so they need more samples to achieve good performance. The results demonstrate that the proposed method can not only achieve discriminative labeling but also avoid the influence of outlier labels. In a nutshell, merging uncertainty and representativeness with MCC effectively solves the problems in multilabel active learning stated above.
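The Win/Tie/Lose protocol above can be sketched with a paired t-test over matched per-run scores. This is a sketch under our assumptions (the critical value 2.776 is the two-sided 95% value for df = 4, matching 5 repeated runs), not the paper's exact code:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(ours, theirs):
    """Paired t-statistic over matched per-run scores."""
    diffs = [a - b for a, b in zip(ours, theirs)]
    sd = stdev(diffs)
    if sd == 0:
        m = mean(diffs)
        return float('inf') if m > 0 else float('-inf') if m < 0 else 0.0
    return mean(diffs) / (sd / sqrt(len(diffs)))

def win_tie_lose(ours, theirs, t_crit=2.776):
    """Return 'win'/'tie'/'lose' for `ours` vs `theirs` at the 95% level.

    t_crit is the two-sided critical value of the t-distribution for
    df = len(ours) - 1; 2.776 corresponds to 5 paired runs."""
    t = paired_t(ours, theirs)
    if abs(t) > t_crit:
        return 'win' if t > 0 else 'lose'
    return 'tie'
```

Each table cell then counts these outcomes over all comparisons for one data set.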
For the computational cost, the time complexity of the proposed method is , where and are the numbers of iterations. The time complexities of Adaptive and AUDI are and , respectively. QUIRE and Batchrank are costly, and the time complexity of both is . Here, is the number of classes, is the number of unlabeled data, is the number of labeled data, and is the dimension of the data. Hence, compared with Adaptive and AUDI, the proposed method is more costly, but it is relatively efficient compared with QUIRE and Batchrank. The time complexities of all methods are summarized in Table III.
Methods  Time complexity 
RMLAL  
Adaptive  
AUDI  
Batchrank  
QUIRE 
IV-B Evaluation of Parameters
In the proposed method, the kernel parameter is very important for MCC, as it controls all the robust properties of correntropy [24]. There are also two trade-off parameters, one on the uncertainty part and one on the representativeness part. For convenience, in our experiments we defined the kernel size . Meanwhile, we fixed the kernel size as in the label space , and as in the feature space, where is the dimension of the data. To discover the influence of the kernel size on the proposed method, we evaluated the kernel size of MCC in the label space. We report the average results when the kernel size was set as , respectively, on two popular benchmark data sets, emotions and scene [21], which have the same number of labels but different LC. The trade-off parameters were set as . The other settings were the same as in the previous experiments.
Fig. 5 shows the average results over 10 runs as the kernel size changes. From Fig. 5, we can observe that the larger the kernel size , the better the results obtained by the proposed method. This may be because, when the kernel size is large, the values of the outlier labels based on MCC become small in the objective function, so the influence of the outlier labels is decreased as much as possible. Hence, a larger kernel size can be set for better performance. Fig. 6 shows the average results over 10 runs with different pairs of trade-off parameters. From these results, we can observe that the uncertainty and representativeness terms have a big influence on the results. This may be because the numbers of labeled and unlabeled samples change during the active learning process: the uncertainty information relates to the labeled data, while the representativeness information relates to the unlabeled data. Hence, when the two trade-off parameters are fixed, it is hard to control the required information across iterations. From Fig. 6, we can also observe that when is large and is small, the results on the two data sets are consistent and relatively good. Although these results are not the best, the proposed method performs stably and is superior to the results obtained when is small and is large. Hence, in practice, a large value for and a small value for can be adopted so that the unlabeled data are fully used.
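The role of the kernel size can be seen directly from the correntropy-induced (Welsch) loss and its half-quadratic weight, sketched below. This is a generic illustration of the MCC loss; the exact scaling convention in the paper's objective may differ:

```python
from math import exp

def correntropy_loss(e, sigma):
    """Correntropy-induced loss: sigma^2 * (1 - exp(-e^2 / (2 sigma^2))).

    Unlike the squared loss e^2, it saturates at sigma^2 for large
    residuals, so an outlier label contributes a bounded penalty."""
    return sigma ** 2 * (1.0 - exp(-(e ** 2) / (2.0 * sigma ** 2)))

def hq_weight(e, sigma):
    """Half-quadratic weight exp(-e^2 / (2 sigma^2)) attached to each
    residual: near 1 for small errors, near 0 for outliers. The kernel
    size sigma controls how sharply large residuals are down-weighted."""
    return exp(-(e ** 2) / (2.0 * sigma ** 2))
```

For a fixed residual, changing sigma changes both the saturation level of the loss and how much of the residual survives in the weighted objective.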
IV-C Further Analysis
To further explain the motivation of the proposed method, we replaced the MCC loss function with MSE, which is usually adopted in the state-of-the-art methods [3, 22]. Meanwhile, a visual data set, PASCAL VOC2007 (http://host.robots.ox.ac.uk/pascal/VOC/voc2007/examples/index.html), was adopted. We selected a subset of 4666 samples covering 20 classes from PASCAL VOC2007. The PHOW features and spatial histograms of each image were obtained with the VLFeat toolbox (http://www.vlfeat.org/). To observe the motivation directly, we show the images queried at the first, twentieth, fortieth, sixtieth, eightieth, and one hundredth iterations in Fig. 7 and Fig. 8, which were obtained by the proposed method based on MCC and on MSE, respectively. The results obtained by MCC and MSE over the whole active learning process are shown in Fig. 9. From Fig. 7, we can observe that the labels of each image are all very relevant to the image, and there are no outlier labels. Moreover, compared with the background, the object corresponding to the image's label covers a larger region of the image, so the object is very relevant to the image. In Fig. 8, the leaves, which are the background of the image selected at iteration , cover a larger region than the bird, which is a label of the image; likewise, the mountain, which is the background of the image selected at iteration , covers a larger region than the cow, which is a label of the image. Hence, the labels of the images selected at the and iterations are less relevant to the images than the backgrounds are. Compared with the background, the labels of these images appear to be outlier labels.
In a word, the method based on MSE may select images whose backgrounds cover larger regions than the objects corresponding to the images' labels, while the proposed method based on MCC decreases the influence of outlier labels and selects images whose labels are more obvious than the backgrounds, making full use of the labels in the images.
V Conclusion
Outlier labels are very common in multilabel scenarios and may bias the supervised information. In this paper, we propose a robust multilabel active learning method based on MCC to solve this problem. The proposed method queries samples that not only build a strong query model to measure the uncertainty but also represent the similarity of multilabel data well. Different from traditional active learning methods that combine uncertainty and representativeness heuristically, we merge the representativeness into the uncertainty through the prediction labels of the unlabeled data with MCC, enhancing the uncertainty information. With MCC, the supervised information of outlier labels is suppressed while that of discriminative labels is enhanced. The proposed method outperforms state-of-the-art methods in most of the experiments. The experimental analysis also reveals that it is beneficial to update the trade-off parameters that balance the uncertainty and representativeness information during the query process. In future work, we plan to develop an adaptive mechanism to tune these parameters automatically, making the algorithm more practical.
References
 [1] B. Settles, “Active learning literature survey,” University of Wisconsin, Madison, vol. 52, no. 5566, p. 11, 2010.
 [2] Z. Wang and J. Ye, “Querying discriminative and representative samples for batch mode active learning,” ACM Transactions on Knowledge Discovery from Data, vol. 9, no. 3, p. 17, 2015.
 [3] S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 10, pp. 1936–1949, 2014.
 [4] E. Elhamifar, G. Sapiro, A. Yang, and S. Sasrty, “A convex optimization framework for active learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 209–216.
 [5] X. Li and Y. Guo, “Multilevel adaptive active learning for scene classification,” in 13th European Conference on Computer Vision–ECCV 2014, 2014, pp. 234–249.
 [6] J. Tang, Z.-J. Zha, D. Tao, and T.-S. Chua, “Semantic-gap-oriented active learning for multilabel image annotation,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2354–2360, 2012.
 [7] B. Zhang, Y. Wang, and F. Chen, “Multilabel image classification via high-order label correlation driven active learning,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1430–1441, 2014.
 [8] R. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for weakly-supervised multilabel image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 121–135, 2015.
 [9] M. Singh, E. Curran, and P. Cunningham, “Active learning for multilabel image annotation,” in Proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science, 2009, pp. 173–182.
 [10] X. Chen, A. Shrivastava, and A. Gupta, “NEIL: Extracting visual knowledge from web data,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1409–1416.
 [11] C. Vondrick and D. Ramanan, “Video annotation and tracking with active learning,” in Advances in Neural Information Processing Systems, 2011, pp. 28–36.
 [12] D. Vasisht, A. Damianou, M. Varma, and A. Kapoor, “Active learning for sparse Bayesian multilabel classification,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 472–481.
 [13] C. Wan, X. Li, B. Kao, X. Yu, Q. Gu, D. Cheung, and J. Han, “Classification with active learning and metapaths in heterogeneous information networks,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015, pp. 443–452.
 [14] M. Zuluaga, G. Sergent, A. Krause, and M. Püschel, “Active learning for multi-objective optimization,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 462–470.
 [15] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, 2002.
 [16] Y. Guo, “Active instance sampling via matrix partition,” in Advances in Neural Information Processing Systems, 2010, pp. 802–810.
 [17] Y. Guo and D. Schuurmans, “Discriminative batch mode active learning,” in Advances in neural information processing systems, 2008, pp. 593–600.
 [18] C. Ye, J. Wu, V. S. Sheng, S. Zhao, P. Zhao, and Z. Cui, “Multilabel active learning with chi-square statistics for image classification,” in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015, pp. 583–586.
 [19] X. Kong, W. Fan, and P. S. Yu, “Dual active feature and sample selection for graph classification,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 654–662.
 [20] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, and R. Ji, “Representative discovery of structure cues for weakly-supervised image segmentation,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 470–479, 2014.
 [21] S. Chakraborty, V. Balasubramanian, Q. Sun, S. Panchanathan, and J. Ye, “Active batch selection via convex relaxations with guaranteed solution bounds,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 10, pp. 1945–1958, 2015.
 [22] S.-J. Huang, S. Chen, and Z.-H. Zhou, “Multilabel active learning: query type matters,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015, pp. 946–952.
 [23] J. C. Principe, D. Xu, and J. Fisher, “Information theoretic learning,” Unsupervised adaptive filtering, vol. 1, pp. 265–319, 2000.
 [24] W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: properties and applications in non-Gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
 [25] R. He, B.-G. Hu, W.-S. Zheng, and X.-W. Kong, “Robust principal component analysis based on maximum correntropy criterion,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1485–1494, 2011.
 [26] X. Li, L. Wang, and E. Sung, “Multilabel SVM active learning for image classification,” in International Conference on Image Processing, vol. 4, 2004, pp. 2207–2210.
 [27] B. Yang, J.-T. Sun, T. Wang, and Z. Chen, “Effective multilabel active learning for text classification,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 917–926.
 [28] T. Windheuser, H. Ishikawa, and D. Cremers, “Generalized roof duality for multilabel optimization: Optimal lower bounds and persistency,” in ECCV, 2012, pp. 400–413.
 [29] X. Li and Y. Guo, “Active learning with multilabel SVM classification,” in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.
 [30] S. Sarawagi and A. Bhamidipaty, “Interactive deduplication using active learning,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 269–278.
 [31] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, and H.-J. Zhang, “Two-dimensional active learning for image classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
 [32] S.-J. Huang and Z.-H. Zhou, “Active query driven by uncertainty and diversity for incremental multilabel learning,” in 2013 IEEE 13th International Conference on Data Mining, 2013, pp. 1079–1084.
 [33] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens, “Learning with the maximum correntropy criterion induced losses for regression,” Journal of Machine Learning Research, vol. 16, pp. 993–1034, 2015.
 [34] R. He, W.-S. Zheng, T. Tan, and Z. Sun, “Half-quadratic-based iterative minimization for robust sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 261–275, 2014.
 [35] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Semi-supervised SVM batch mode active learning for image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
 [36] J. C. Bezdek and R. J. Hathaway, “Convergence of alternating optimization,” Neural, Parallel & Scientific Computations, vol. 11, no. 4, pp. 351–368, 2003.
 [37] X.-T. Yuan and B.-G. Hu, “Robust feature extraction via information theoretic learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 1193–1200.
 [38] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
 [39] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
 [40] Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, “Manifold regularized multitask learning for semi-supervised multilabel image classification,” IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 523–536, 2013.