Robust and Discriminative Labeling for Multi-label Active Learning Based on Maximum Correntropy Criterion


Bo Du, Zengmao Wang, Lefei Zhang, Liangpei Zhang, Dacheng Tao. This work was supported in part by the National Natural Science Foundation of China under Grants 61471274, 41431175 and 61401317, by the Natural Science Foundation of Hubei Province under Grant 2014CFB193, by the Fundamental Research Funds for the Central Universities under Grant 2042014kf0239, and by Australian Research Council Projects FT-130101457, DP-140102164 and LP-150100671. B. Du, Z. Wang and L. Zhang are with the School of Computer, Wuhan University, Wuhan 430079, China (email: gunspace@163.com; wzm902009@gmail.com; zhanglefei@whu.edu.cn). (Corresponding author: Lefei Zhang.) L. Zhang is with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430072, China (email: zlp62@whu.edu.cn). D. Tao is with the School of Information Technologies and the Faculty of Engineering and Information Technologies, University of Sydney, J12/318 Cleveland St, Darlington NSW 2008, Australia (email: dacheng.tao@sydney.edu.au). ©20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract

Multi-label learning has attracted great interest in many real-world applications. Asking an oracle to assign the many possible labels of a single instance is highly costly, and it is also hard to build a good model without identifying the discriminative labels. Can we simultaneously reduce the labeling cost and improve the ability to train a good model for multi-label learning?

Active learning addresses the shortage of training samples by querying the most valuable samples, achieving better performance at little cost. In multi-label active learning, existing research either queries the relevant labels of a few training samples or queries all labels without identifying the discriminative information; neither can effectively handle outlier labels when measuring uncertainty. Since the Maximum Correntropy Criterion (MCC) provides robust analysis against outliers in many machine learning and data mining algorithms, in this paper we derive a robust multi-label active learning algorithm based on MCC that merges uncertainty and representativeness, and propose an efficient alternating optimization method to solve it. With MCC, our method eliminates the influence of outlier labels that are not discriminative when measuring uncertainty. To further improve the information measurement, we merge uncertainty and representativeness through the predicted labels of the unlabeled data, which not only enhances the uncertainty but also improves the similarity measurement of multi-label data with label information. Experiments on benchmark multi-label data sets show superior performance over state-of-the-art methods.

Active learning, Multi-label learning, Multi-label classification

I Introduction

Machine learning is a topic of the day; however, the shortage of training samples remains a persistent challenge in the field [1], especially nowadays, when huge amounts of data are generated within short periods of time. Active learning, a subfield of machine learning, is an effective approach to address this shortage in classification. It has been elaborately developed for various classification tasks: an iterative loop queries the most valuable samples for an 'oracle' to label, gradually improving the model's generalization ability until a convergence condition is satisfied [1]. In general, two motivations drive the design of a practical active learning algorithm: uncertainty and representativeness [2, 3, 4]. Uncertainty is the criterion used to select samples that help improve the generalization ability of classification models, ensuring that the classification results on unknown data are more reliable. Representativeness measures the overall patterns of the unlabeled data to prevent the bias a classification model suffers with few or no initial labeled data. Whichever kind of active learning method is used, the key lies in how to select the most valuable samples, which is referred to as the query function.

Among tasks such as object recognition, scene classification [5], and image retrieval, multi-label classification, which aims to assign each instance multiple labels, may be the most difficult [6, 7, 8]. For each training sample, different combinations of labels need to be considered; compared with single-label classification, labeling for multi-label classification is therefore more costly [9]. Multi-label learning has been successfully applied in machine learning and computer vision, including web classification [10], video annotation [11], and so on [12, 13, 14]. Many types of machine learning algorithms have been developed to solve these classification tasks, yet the shortage of training samples remains unsolved in most of them. Hence, active learning has become even more important for addressing this shortage and reducing the cost of the various classification tasks.

Fig. 1: The influence of an outlier label in the learning process.

State-of-the-art multi-label active learning algorithms can be classified into three categories based on the query function. The first category relies on the labeled data to design a query function based on uncertainty [15, 16, 17]. These methods ignore the structural information in the large-scale unlabeled data, leading to serious sample bias or undesirable performance. To eliminate this problem, a second category based on representativeness has been developed [18, 19, 20]. These approaches elaborately consider the structural information of the unlabeled data but discard the discriminative (uncertain) information; a large number of samples is therefore required before an optimal boundary can be found. Since relying on either the uncertainty criterion or the representativeness criterion alone may not achieve desirable performance, a third category that combines both criteria [3, 4, 21] arose naturally. These methods are either heuristic in designing the specific query criterion or ad hoc in measuring the uncertainty and representativeness of the samples. Although effective, they treat the two parts independently: the uncertainty still relies only on the limited labeled samples, and the information from the two criteria is not mutually enhanced. Most importantly, they ignore the outlier labels that exist in multi-label classification when designing a query function for active learning.

However, outlier labels have a significant influence on the measurement of uncertainty and representativeness in multi-label learning. In the following, we discuss the outlier label and its negative influence on both measurements in detail.

Fig. 1 shows a simple example of the influence of outlier labels. As the input, we annotate the image with three labels: tree, elephant, and lion. The image feature is thus composed of three parts: the features of the tree, the elephant, and the lion. Intuitively, the tree contributes far more to the image feature than the elephant or the lion, and the lion contributes the least. If we use this image with its three labels to learn a lion/non-lion binary classification model, the model would actually depend on the tree's and elephant's features rather than the lion's, and would thus be biased for classifying lions versus non-lions. Given a test image in which a lion covers most of the region, the trained model would fail to recognize the lion. If we use such a model to measure uncertainty in active learning, it may mismeasure images carrying the lion label. We call the lion label in the input image an outlier label.

We now give a formal definition of an outlier label. Consider an instance together with its set of relevant labels, and let $y^*$ denote the most relevant label of the instance. A label $y$ is defined as an outlier label if it has two properties: first, $y$ is a relevant label of the instance; second, compared with the most relevant label $y^*$, $y$ is much less relevant to the instance. Fig. 2 illustrates the two properties and makes the outlier label easier to understand. According to this definition, the lion is naturally treated as an outlier label. Since the lion's features are not very salient, if we treat the lion as a positive label and use the image in Fig. 2 to learn a model, the model would not be able to effectively query an informative sample. Hence, if the influence of the outlier label can be avoided or decreased when we query the most informative sample, it becomes much easier to build a promising classification model. The definition also fits the fact that an outlier label may escape the oracle's attention at first glance: in Fig. 2, the tree is recognized immediately, but the lion is well hidden and may be overlooked through carelessness. The definition of the outlier label is also consistent with the query types proposed in [22].

Fig. 2: Illustration of the two properties of outlier labels. Left: the outlier label (lion) is relevant to the image; right: the outlier label (lion) is much less relevant to the image than the most relevant label (tree) is.

For two multi-label images with the same labels but different outlier labels, the features of the two images may differ greatly. It is therefore very hard to judge the similarity between two instances with different outlier labels from their features alone. Fig. 3 provides a simple example of this problem: we show the similarity between SIFT features under a Gaussian kernel, and the label similarity based on MCC. The similarity between image 1 and image 2 should be larger than that between image 2 and image 3, since the labels of images 1 and 2 are exactly the same; however, the result is the opposite when similarity is measured with their SIFT features. The outlier label is the lion in image 1 and the tree trunk in image 2, and these two outlier labels greatly increase the feature difference between the images. In summary, the measurement of both uncertainty and representativeness deteriorates in the presence of outlier labels.

Fig. 3: The influence of outlier labels on the measurement of similarity.

To address the above problems, in this paper we propose a robust multi-label active learning (RMLAL) algorithm, which effectively merges uncertainty and representativeness based on MCC.

As to robustness, correntropy has proved promising in information theoretic learning (ITL) [23] and can efficiently handle large outliers [24, 25]. In conventional active learning algorithms, the mean square error (MSE) cannot easily control the large errors caused by outliers. For example, in Fig. 2 the image has two labels, tree and lion. If we use a lion model to learn the image, the prediction value will be far from the lion's label; measuring this gap with the MSE loss introduces a large error, since MSE squares the error. MCC instead evaluates the gap between the prediction value and the label with a kernel function: for a sufficiently large error, the MCC value is almost zero, so the influence of the large error is restrained. We therefore replace the MSE loss with MCC in the minimum-margin model of the proposed formulation. In this way, the proposed method eliminates the influence of outlier labels, making the query function more robust.
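The contrast between the two losses can be seen numerically. The following is an illustrative sketch (the kernel width `sigma=1.0` is an arbitrary choice, not a value from the paper): the squared error grows without bound on a large residual, while the MCC-induced loss saturates near 1.

```python
import numpy as np

def mse_loss(error):
    """Squared error: grows without bound, so outliers dominate the objective."""
    return error ** 2

def correntropy_loss(error, sigma=1.0):
    """MCC-induced loss: 1 minus the Gaussian kernel of the error.
    Bounded in [0, 1), so even a huge error contributes at most ~1."""
    return 1.0 - np.exp(-error ** 2 / (2 * sigma ** 2))

errors = np.array([0.1, 0.5, 8.0])   # the last value mimics an outlier label
print(mse_loss(errors))              # outlier contributes 64.0
print(correntropy_loss(errors))      # outlier contributes ~1.0 (saturated)
```

For small errors both losses behave similarly, which is why MCC keeps discriminative information while suppressing outliers.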

As to discriminative labeling, we use MCC to measure the loss between the true label and the predicted label. MCC amplifies the most discriminative information and suppresses useless or unexpected information. Hence, with MCC in the proposed method, a label that is not an outlier label plays an important role in constructing the query model; otherwise, the model decreases the outlier label's influence on the uncertainty measurement. In this way, the effect of discriminative labels is enhanced while that of outlier labels is suppressed, and discriminative labeling is achieved.

As to representativeness, we mix the predicted labels of the unlabeled data with the MCC to form the representativeness. As shown in Fig. 3, samples with the same labels but different outlier labels can have very different features, so measuring similarity from the features alone leads to wrong judgments. Hence, we propose to define consistency as a combination of label similarity and feature similarity, which makes the measurement of representativeness more general. To decrease the computational complexity of the proposed method, the half-quadratic optimization technique is adopted to optimize the MCC.

The contributions of our work can be summarized as follows:

  • To the best of our knowledge, this is the first work to focus on outlier labels in multi-label active learning. We derive a robust and effective query model for multi-label active learning.

  • The predicted labels of the unlabeled data and the labels of the labeled data are combined with MCC to merge the uncertain and representative information, yielding an approach that makes the uncertain information more precise.

  • The proposed representativeness measurement considers label similarity through MCC. It handles outlier labels effectively and makes the similarity measurement more accurate for multi-label data. Meanwhile, it provides a new way to merge representativeness into uncertainty.

The rest of the paper is organized as follows: Section II briefly introduces related works. Section III defines and discusses a new objective function for robust multi-label active learning, and then proposes an algorithm based on half-quadratic optimization. Section IV evaluates our method on several benchmark multi-label data sets. Finally, we summarize the paper in Section V.

II Related Works

Multi-label problems are ubiquitous in the real world, so multi-label classification has attracted great interest in many fields. For a multi-label instance, a human annotator needs to consider all the relevant labels; labeling in multi-label tasks is therefore more costly than in single-label learning, yet research on active learning for multi-label learning remains scarce.

In multi-label learning, one instance corresponds to more than one label. A direct way to solve a multi-label problem is to convert it into several binary problems [26, 27]. In these approaches, the uncertainty is measured for each label, and a combination strategy is then adopted to measure the uncertainty of the instance. [26] trained a binary SVM model for each label and combined them with different strategies for instance selection. [27] predicted the number of relevant labels for each instance with logistic regression, and then adopted SVM models to minimize the expected loss for instance selection. Recently, [12] adopted mutual information to design the selection criterion for Bayesian multi-label active learning, and [28] selected valuable instances by minimizing the empirical risk. Other works combine informativeness and representativeness for a better query [29, 30]: [29] combined the label cardinality inconsistency and the separation margin with a tradeoff parameter, while [30] took into account the cluster structure of the unlabeled instances as well as the class assignments of the labeled examples. All the above algorithms query all labels of the selected instances without identifying the discriminative labels. Another kind of approach queries label-instance pairs, selecting a relevant label together with an instance at each iteration [22, 31, 32]: [22] queried the most relevant label based on the query types, and [32] selected label-instance pairs based on a label ranking model. In these approaches, only the most relevant label is assigned to the instance, so some relevant labels may be lost under the limited query budget, and many more label-instance pairs may need to be queried to achieve good performance. Combining informativeness and representativeness has proved very effective in active learning, and we adopt this strategy in this paper.

Whether selecting instances by all their labels or by label-instance pairs, most active learning algorithms select uncertain instances based on very limited samples and ignore the label information. For example, if an instance is given all its labels and many of them are outlier labels, the instance may decrease the performance of the classification task. To address these problems, we use the predicted labels of the unlabeled data to enhance the uncertainty measurement and adopt the MCC to account for as many relevant labels as possible while excluding the outlier labels. To the best of our knowledge, this is the first time MCC has been adopted in multi-label active learning with label information for querying.

III Multi-label Active Learning

Suppose we are given a multi-label data set $X$ with $n$ samples and $m$ possible labels for each sample. Initially, we label $l$ samples in $X$. Without loss of generality, we denote the labeled samples as the set $L = \{(x_i, Y_i)\}_{i=1}^{l}$, where $Y_i$ is the label set for sample $x_i$; the remaining $n-l$ unlabeled samples are denoted as the set $U$, which is the candidate set for active learning. Moreover, we denote by $x^s$ the sample we want to query in the active learning process, and by $Y_L$ the label matrix of the labeled data. These symbols are used as above in the following discussion.

III-A Maximum Correntropy Criterion

In multi-label classification tasks, outlier labels pose a great challenge to training a precise classifier, mainly due to the unpredictable nature of the errors (bias) they cause. In active learning in particular, the limited labeled samples containing outliers easily produce a large bias, which directly biases the uncertainty information and in turn makes the queried instances undesirable or even degrades performance. Recently, the concept of correntropy was first proposed in information theoretic learning (ITL) and has drawn much attention in the signal processing and machine learning communities for robust analysis, as it can effectively handle outliers [33, 34]. Correntropy is a similarity measurement between two arbitrary random variables $X$ and $Y$ [24, 33], defined by

$V(X, Y) = \mathbb{E}\left[\kappa(X, Y)\right]$ (1)

where $\kappa(\cdot, \cdot)$ is a kernel function satisfying Mercer's theorem and $\mathbb{E}[\cdot]$ is the expectation operator. The definition of correntropy is based on the kernel method, and it therefore enjoys the same advantages as kernel techniques. However, unlike conventional kernel-based methods, correntropy works independently on pairwise samples and has a strong theoretical foundation [24]. Under this definition, correntropy is symmetric, positive, and bounded.

However, in real-world applications, the joint probability density function of $X$ and $Y$ is unknown, and only a finite number of data points are available. Suppose $N$ samples $\{(x_i, y_i)\}_{i=1}^{N}$ are available; the sample estimator of correntropy is then usually adopted as

$\hat{V}_{N,\sigma}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} g_\sigma(x_i - y_i)$ (2)

where $g_\sigma(\cdot)$ is the Gaussian kernel function $g_\sigma(e) = \exp\left(-e^2/2\sigma^2\right)$. According to [24, 33], the correntropy between $X$ and $Y$ is given by

(3)

The objective function (3) is called the MCC, where the auxiliary parameter will be specified in Proposition 2. Compared with MSE, which is a global metric, correntropy is a local metric: its value is mainly determined by the kernel function along the line $x = y$.
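The sample estimator in (2) can be sketched in a few lines. This is an illustrative implementation with a Gaussian kernel (the kernel width `sigma=1.0` is an assumed value); it also shows the locality property: a single gross outlier can reduce the estimate by at most 1/N.

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Sample estimator of correntropy (Eq. 2): the mean Gaussian
    kernel evaluated on the pairwise errors x_i - y_i."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-e ** 2 / (2 * sigma ** 2)))

# Locality: the value is driven by samples near the line x = y.
clean = correntropy([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])         # = 1.0
with_outlier = correntropy([1.0, 2.0, 3.0], [1.0, 2.0, 30.0])
print(clean, with_outlier)  # the outlier changes the value by at most 1/N
```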

III-B Multi-label Active Learning Based on MCC

Usually, in active learning methods, uncertainty is measured from the labeled data whereas representativeness is measured from the unlabeled data. In this paper, we propose a novel approach that merges the uncertainty and representativeness of instances in active learning based on MCC. Mathematically, it is formulated as an optimization problem with respect to the classifier $f$ and the query sample $x^s$:

(4)

where $\mathcal{H}$ is a reproducing kernel Hilbert space, the norm term constrains the complexity of the classifier, the loss is the MCC loss function, and the label set of the unlabeled data is calculated by the current classifiers. However, solving (4) poses a problem: the labels of $x^s$ are unknown. Our goal is to find the optimal $f$ and $x^s$ through (4), while the labels of $x^s$ are assigned by the oracle only after the query; the labels of $x^s$ therefore need to be approximated before the query. We replace the precise labels of $x^s$ with pseudo labels to solve (4), and obtain the following problem:

(5)

where the pseudo label for $x^s$ belongs to $\{-1, +1\}$: it equals 1 if the queried label set contains the label, and -1 otherwise. In (5), the first three terms correspond to the regularized risk over all labeled samples after the query, which carries the uncertainty information embedded in the current classifier; we call them the uncertainty part. Meanwhile, in the last term, the unlabeled data are also embedded in the current classifier to enhance the uncertainty part. The main function of the last term, however, is to describe the distribution difference between the labeled samples after the query and all the available samples, which captures the representative information embedded in the labeled samples. A tradeoff parameter balances the uncertainty and representative information in the formulation. In the remainder of this section, we analyze this objective function in a specific form and propose a practical algorithm to solve the optimization problem.

III-C Uncertainty Based on MCC

Minimum margin is the most popular and direct approach to measuring the uncertainty of an unlabeled sample via its position relative to the boundary. Let $f$ be the classifier trained on the labeled samples. The sample to query from the unlabeled data based on the margin can be found as follows:

(6)

Generally, with the labeled samples, we can find a classification model for a binary problem in a supervised learning approach with the following objective function

(7)

where $\mathcal{H}$ is a reproducing kernel Hilbert space endowed with a kernel function, $\ell(\cdot, \cdot)$ is the loss function, and the classifier $f$ belongs to $\mathcal{H}$. Following the works of [3, 35], Proposition 1 connects margin-based query selection with the min-max formulation of active learning.

Proposition 1

The criterion of the minimum margin to find a desirable sample can be written as

where the pseudo label is the one assigned to the candidate sample $x^s$.

In previous works, the loss function is usually the quadratic loss of MSE, which is not robust in the presence of outliers. To overcome this problem, we introduce the MCC as the loss function, given by

(8)

where $\sigma$ is the kernel width. Following Proposition 1, we can observe that the objective function (6) is equivalent to the objective function (8). To solve (8), noting that the pseudo label is unknown, we optimize the worst case of (8) for selection. The objective function becomes

(9)

In our work, we cast multi-label classification as several binary classification problems with label correlation. For convenience of presentation, we consider the simple case of learning one classifier for each label independently. We then use the summation over the binary classifiers as the minimum margin in multi-label learning, presented by

(10)

where $f_k$ is the binary classifier separating the $k$-th label from the other labels, $k = 1, \ldots, m$. The objective function of the multi-label learning task based on MCC can then be formalized as:

(11)
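The minimum-margin criterion of Eq. (10) can be sketched as follows. This is a simplified illustration that ignores the MCC weighting of Eq. (11): each one-vs-rest classifier contributes its distance to the decision boundary, and the sample with the smallest summed margin is the most uncertain. The `scores` array is a hypothetical example, not data from the paper.

```python
import numpy as np

def multilabel_margin_uncertainty(decision_values):
    """Minimum-margin criterion for multi-label data (in the spirit of
    Eq. 10): sum the distances |f_k(x)| to each binary decision boundary
    and return the index of the sample with the smallest sum.
    decision_values: (n_samples, n_labels) array of f_k(x) scores."""
    margins = np.abs(decision_values).sum(axis=1)
    return int(np.argmin(margins))

# Three unlabeled samples scored by three one-vs-rest classifiers.
scores = np.array([[ 1.5, -2.0,  1.0],
                   [ 0.1, -0.2,  0.3],    # closest to every boundary
                   [ 2.0,  1.8, -2.5]])
print(multilabel_margin_uncertainty(scores))  # 1
```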

III-D Representativeness Based on MCC

Since the labeled samples in $L$ are very limited, it is important to utilize the unlabeled data to enhance the performance of active learning. However, the labels of the unlabeled data are unknown, and it is hard to incorporate unlabeled data into a supervised model. To enhance the uncertainty information, we merge the representative information into it through the predicted labels of the unlabeled data. Existing similarity measurements are based on instance features alone and cannot use the unlabeled data to enhance the uncertainty information. To overcome this problem, and to account for the influence of outlier labels, we take the predicted labels of the unlabeled data into consideration in the similarity measurement. We define a novel consistency between two instances based on MCC, combining their label similarity and feature similarity:

(12)

where the feature similarity between two samples is computed with a kernel function. Let the symmetric similarity matrix for the unlabeled data collect the pairwise consistencies of its samples. We can write this matrix as:

(13)
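A minimal sketch of such a label-and-feature consistency follows. The multiplicative combination of the two Gaussian kernels and the kernel widths are illustrative assumptions, not the paper's exact Eq. (12); the point is that identical (predicted) label vectors pull the consistency up even when features differ.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian similarity between two vectors."""
    d2 = np.sum((np.asarray(a) - np.asarray(b)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def consistency(x_i, x_j, y_i, y_j, sigma_x=1.0, sigma_y=1.0):
    """Consistency of two instances combining feature similarity and
    (predicted-)label similarity; the product form is an assumption."""
    return gaussian_kernel(x_i, x_j, sigma_x) * gaussian_kernel(y_i, y_j, sigma_y)

# Same labels dominate: instances 1 and 2 share labels, 3 does not.
x1, y1 = [0.0, 0.0], [1, -1, 1]
x2, y2 = [0.5, 0.5], [1, -1, 1]
x3, y3 = [0.4, 0.4], [-1, 1, -1]
print(consistency(x1, x2, y1, y2) > consistency(x1, x3, y1, y3))  # True
```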

With such a consistency matrix, representativeness amounts to finding the sample that best represents the unlabeled data set. To this end, [4] proposed a convex optimization framework introducing variables that indicate the probability that one sample represents another, collected in a matrix

(14)

In our MCC-based consistency measurement, if a point is very similar to one sample and dissimilar to another, the two consistencies already differ greatly, so the measurement makes the gap between representatives and non-representatives large. We therefore set the probability vector of a sample to the all-ones vector if it is the representative one, and to the all-zeros vector otherwise. The resulting score is simply the summation of the consistencies between the sample and the unlabeled data. Hence, we can collect the similarities between the query sample and the unlabeled data as:

(15)

Similarly, let the consistency matrix and the probability matrix between the unlabeled data and the labeled data be defined in the same way. The similarities between the query sample and the labeled data can then be collected as follows:

(16)

To query a desirable representative sample, which not only represents the unlabeled data but also has little overlap with the information in the labeled data, the representative sample should behave oppositely on the unlabeled and labeled data. Hence, we maximize the difference of (15) and (16) to measure the representativeness:

(17)

Since the numbers of unlabeled and labeled data differ greatly, we replace the sums with expectation operators and introduce a tradeoff parameter:

(18)
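The intuition behind this difference of expectations can be sketched as follows. This is an illustrative scoring function in the spirit of Eq. (18): coverage of the unlabeled pool minus redundancy with the labeled set, where `beta` is a hypothetical tradeoff parameter and the consistency values are made up for the example.

```python
import numpy as np

def representativeness(cons_to_unlabeled, cons_to_labeled, beta=1.0):
    """Representativeness score: mean consistency with the unlabeled
    pool (coverage) minus beta times the mean consistency with the
    labeled set (redundancy), using expectations so the two sets'
    very different sizes do not dominate the score."""
    return np.mean(cons_to_unlabeled) - beta * np.mean(cons_to_labeled)

# Candidate A covers the pool but overlaps the labeled set; B does not.
a = representativeness([0.9, 0.8, 0.9], [0.9, 0.8])
b = representativeness([0.8, 0.7, 0.8], [0.1, 0.2])
print(b > a)  # True: B adds more new information
```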

III-E The Proposed Robust Multi-label Active Learning

To enhance the query information of uncertainty and representativeness, our approach combines them with a tradeoff parameter, and the objective function can be presented as:

(19)

To merge the representative part into the uncertainty part, we use the predicted labels of the unlabeled data. Denoting the predicted label set of each sample in the unlabeled data accordingly, the objective function based on MCC can be defined as:

(20)

To query a specific point from the unlabeled data in the objective function (20), we use numerical optimization techniques. We introduce an indicator vector $q$, a binary vector whose length equals the number of unlabeled samples: its entry is 1 if the corresponding sample is queried as $x^s$, and 0 otherwise. The optimization can then be formulated as:

(21)

For a binary classifier, we use a linear regression model in the kernel space as the classifier, where $\phi$ denotes the feature mapping to the kernel space; the label set of a multi-label sample can then be given by

(22)

where $I$ is an identity matrix, which can also be replaced by the label correlation matrix, and $\otimes$ is the Kronecker product between matrices. The multi-label classifier can then be presented in stacked form, and the objective function can be formalized as:

(23)

where the vector of all ones has the same length as the indicator vector $q$. We derive an iterative algorithm based on the half-quadratic technique with an alternating optimization strategy to solve (23) efficiently [36]. Based on the theory of convex conjugate functions, we can easily derive Proposition 2 [37, 38].

Proposition 2

A convex conjugate function exists such that

where $p$ is the auxiliary variable; for a fixed error $e$, the right-hand side reaches its maximum at $p = -g_\sigma(e)$.

Following Proposition 2, the objective function (23) can be formulated as:

(24)

where the auxiliary variables correspond to the labeled and unlabeled terms and are updated according to Proposition 2.

The objective function (24) can be solved by the alternating optimization strategy.
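To make the half-quadratic idea concrete, here is a minimal sketch on a plain MCC regression problem (not the full objective (24), which the paper solves with ADMM and an LP): each iteration fixes the auxiliary weights at the magnitude of the maximizer from Proposition 2, then solves the resulting weighted least-squares problem. The data, `sigma`, and the small ridge term are illustrative assumptions.

```python
import numpy as np

def mcc_linear_regression(X, y, sigma=1.0, iters=20):
    """Half-quadratic optimization of an MCC regression objective.
    Each iteration (i) fixes the auxiliary weights p_i = g_sigma(e_i)
    on the current residuals, then (ii) solves the resulting weighted
    least-squares problem, so outliers are progressively down-weighted."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        e = y - X @ w
        p = np.exp(-e ** 2 / (2 * sigma ** 2))   # auxiliary weights
        W = np.diag(p)
        w = np.linalg.solve(X.T @ W @ X + 1e-8 * np.eye(d), X.T @ W @ y)
    return w

# A line fit where one target is an outlier label: MCC suppresses it.
X = np.array([[1.0, 1], [2, 1], [3, 1], [4, 1]])   # [feature, bias]
y = np.array([1.0, 2.0, 3.0, 40.0])                # last target corrupted
w = mcc_linear_regression(X, y)
print(np.round(w, 2))  # slope ~1, intercept ~0: the outlier is ignored
```

An ordinary least-squares fit on the same data would be pulled heavily toward the corrupted target, which is exactly the failure mode MCC avoids.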

First, $q$ is fixed, and the objective function (24) is minimized to find the optimal classifier. This subproblem can be solved by the alternating direction method of multipliers (ADMM) [39].

Second, the classifier obtained in the first step is fixed, and the objective function (24) becomes

(25)

where

To solve (25), as in [21], we relax $q$ to the continuous range [0, 1]; $q$ can then be solved with a linear program. The sample corresponding to the largest value in $q$ is queried as $x^s$.
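The relaxation step can be sketched as follows. This is an illustrative LP (using SciPy's `linprog` for convenience, with a hypothetical `scores` vector standing in for the coefficients of (25) and an assumed simplex constraint): because the optimum of a linear objective over this feasible set sits on a vertex, the relaxed solution recovers the argmax sample.

```python
import numpy as np
from scipy.optimize import linprog

def select_query(scores):
    """Relax the binary indicator q to [0, 1] and solve the linear
    program max scores^T q subject to sum(q) = 1, 0 <= q <= 1.
    The LP optimum lies on a vertex, so argmax(q) picks one sample."""
    n = len(scores)
    res = linprog(c=-np.asarray(scores),           # linprog minimizes
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n)
    return int(np.argmax(res.x))

print(select_query([0.2, 0.9, 0.4]))  # 1
```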

III-F The Solution

In this part, we discuss the details of the algorithm for solving the objective function (23). We solve it with the alternating strategy in two steps. First, the indicator vector $q$ is fixed. In this step, the classifier is adopted in kernel form, and a coefficient vector is learned for each binary classifier from the following formulation

(26)

As stated above, $I$ is an identity matrix. An auxiliary variable is further introduced, and the objective function becomes

(27)

where the all-ones vectors have the appropriate lengths, the label matrix of the labeled data enters in vectorized form, and vec(·) is the function that converts a matrix to a vector along its columns. The augmented Lagrangian function is given by

(28)

The updating rules are as follows:

The resulting subproblem is a sparse one and can be solved with the SPAMS toolbox (http://spams-devel.gforge.inria.fr/downloads.html). The iteration stops when the convergence condition is satisfied.

In the second step, the classifier is fixed to solve $q$. As stated above, the objective function is

(29)

where the scores are defined as in (25). A linear program can be used to solve (29), and we select the most valuable sample corresponding to the largest value in $q$. We summarize our algorithm in Algorithm 1.

IV Experiments

In this section, we present experimental results validating the effectiveness of the proposed method on 12 multi-label data sets from the Mulan project (http://mulan.sourceforge.net/datasets-mlc.html). The characteristics of the data sets are summarized in Table I. To demonstrate the superiority of our method, the following methods are regarded as competitors.

0:  Input: the labeled data set and the unlabeled data set, the tradeoff parameters, and the initial variables and parameters.
1:  repeat
2:     Fix the indicator vector $q$, and solve the objective function (27) with the ADMM strategy to obtain the classifier coefficients.
3:     With the classifier coefficients fixed, calculate the indicator vector $q$ by solving (29), and select the sample corresponding to the largest value in $q$.
4:  until the tolerance is satisfied
5:  Output: the query index of the unlabeled samples.
Algorithm 1 The Proposed Robust Multi-label Active Learning (RMLAL)
  1. RANDOM: the baseline, which randomly selects instances for labeling.

  2. AUDI [32]: combines label ranking with threshold learning, and exploits both uncertainty and diversity in the instance space as well as the label space.

  3. Adaptive [29]: combines max-margin prediction uncertainty and label cardinality inconsistency as the criterion for active selection.

  4. QUIRE [3]: provides a systematic way of measuring and combining the informativeness and representativeness of an unlabeled instance by incorporating the correlation among labels.

  5. Batchrank [21]: selects the best query by convexly relaxing an NP-hard optimization problem based on mutual information.

  6. RMLAL: Robust Multi-label Active Learning, the method proposed in this paper.

Dataset domain #instance #label #feature #LC
Corel16k images 13,766 153 500 2.86
Mediamill video 43,097 101 120 4.37
Emotions music 593 6 72 1.87
Enron text 1,702 53 1,001 3.38
Image images 2,000 5 294 1.24
Medical text 978 45 1,449 1.25
Scene images 2,407 6 294 1.07
Health text 5,000 32 612 1.66
Social text 5,000 39 1,047 1.28
Corel5k images 5,000 374 499 3.52
Genbase biology 662 27 1,185 1.25
CAL500 music 502 174 68 26.04
TABLE I: Characteristics of the data sets, including the domain, the numbers of instances, labels, and features, and the label cardinality (LC).
Dataset Vs QUIRE Vs AUDI Vs Adaptive Vs Batchrank Vs Random
Corel16k 25/0/0 25/0/0 25/0/0 25/0/0 25/0/0
Mediamill 5/16/4 10/12/3 25/0/0 25/0/0 25/0/0
Emotions 25/0/0 25/0/0 25/0/0 25/0/0 25/0/0
Enron 19/5/1 25/0/0 25/0/0 25/0/0 25/0/0
Image 15/10/0 17/8/0 25/0/0 25/0/0 25/0/0
Medical 13/10/2 25/0/0 25/0/0 25/0/0 25/0/0
Scene 15/5/5 25/0/0 25/0/0 25/0/0 25/0/0
Health 13/10/2 18/5/2 25/0/0 25/0/0 25/0/0
Social 25/0/0 25/0/0 25/0/0 25/0/0 25/0/0
Corel5k 25/0/0 25/0/0 25/0/0 25/0/0 25/0/0
Genbase 25/0/0 20/5/0 25/0/0 7/15/3 25/0/0
CAL500 25/0/0 25/0/0 25/0/0 25/0/0 25/0/0
TABLE II: Win/Tie/Loss counts of our method versus the competitors based on paired t-test at 95 percent significance level.

LC is the average number of labels per instance. We randomly divided each data set into two equal parts. One part was used as the testing set. From the other part, we randomly selected 4% as the initial labeled set, and the remaining samples were used as the unlabeled set. Among the compared methods, AUDI and QUIRE query one relevant label-instance pair at each iteration. Note that querying all labels of one instance is equivalent to querying several label-instance pairs. Hence, for a fair comparison, we treated the corresponding label-instance pairs as one query instance in AUDI and QUIRE. For Batchrank, the tradeoff parameter is set to 1 in the original paper; for a fair comparison, we chose it from the same candidate set as for the proposed method. The other methods and their parameters were all set as in the original papers. For the kernel parameters, we adopted the same values for all methods.
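The split described above can be sketched in a few lines; the function name `split_pools` and the seed handling are illustrative, but the fractions (one half for testing, 4% of the remainder as the initial labeled pool) come from the text.

```python
import numpy as np

def split_pools(n_samples, init_frac=0.04, seed=0):
    """Sketch of the experimental split: half of the data for testing;
    of the other half, init_frac as the initial labeled pool and the
    rest as the unlabeled pool."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    half = n_samples // 2
    test, rest = idx[:half], idx[half:]
    n_init = max(1, int(round(init_frac * len(rest))))
    labeled, unlabeled = rest[:n_init], rest[n_init:]
    return test, labeled, unlabeled
```

For a data set of 1,000 samples this yields 500 test samples, 20 initially labeled samples, and 480 unlabeled samples, with the three index sets disjoint.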

Without loss of generality, we adopted liblinear333https://www.csie.ntu.edu.tw/~cjlin/liblinear/ as the classifier for all methods, and evaluated the performance with micro-F1 [40], which is commonly used as a performance measure in multi-label learning. Following [12], for each data set we repeated each method 5 times and report the average results. We stopped the querying process after 100 iterations, with one instance queried at each iteration.
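For reference, micro-F1 pools the true positives, false positives, and false negatives over all labels before computing a single F1 score; a minimal sketch (the function name `micro_f1` is ours):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """Micro-F1 over binary label matrices (instances x labels):
    TP/FP/FN are pooled across all labels before computing F1."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    return 2.0 * tp / (2.0 * tp + fp + fn)
```

Because the counts are pooled globally, frequent labels dominate the score, which is why micro-F1 is a common choice when label frequencies are imbalanced.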

Fig. 4: Comparison of different active learning methods on twelve benchmark datasets: (a) Corel16k; (b) Mediamill; (c) Enron; (d) Image; (e) Scene; (f) Health; (g) Emotions; (h) Medical; (i) Social; (j) Corel5k; (k) Genbase; (l) CAL500. The curves show the micro-F1 accuracy over queries, and each curve represents the average result of 5 runs.

IV-A Results

We report the average results on each data set in Fig. 4. In addition, we compare the proposed method with each competitor in each run using the paired t-test at the 95% significance level, and show the Win/Tie/Loss counts for all data sets in Table II. From these results, we can observe that the proposed method performs best on most of the data sets, achieving the best results throughout almost the entire active learning process. In general, QUIRE and AUDI, which query label-instance pairs for labeling, are superior to Batchrank and Adaptive, which query all labels of an instance. This suggests that querying only the relevant labels is more efficient than querying all labels of one instance. Nevertheless, our method achieves the best performance even though it queries all labels of one instance rather than relevant label-instance pairs. The reason may be that, although Batchrank and Adaptive query all labels, they cannot avoid the influence of outlier labels since they do not consider label correlation, which makes the queried samples less desirable. This also explains why Batchrank and Adaptive perform worse than the random method on some data sets. As for QUIRE and AUDI, some label information is lost when only a limited number of relevant labels is queried, so they need more samples to achieve comparable performance. The results demonstrate that the proposed method can not only achieve discriminative labeling but also avoid the influence of outlier labels. In a nutshell, by merging uncertainty and representativeness with MCC, the proposed method effectively solves the problems in multi-label active learning stated above.
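The Win/Tie/Loss counting can be sketched as follows for a single comparison point. With 5 runs the paired t-test has 4 degrees of freedom, whose two-sided 95% critical value is 2.776 (from the standard t-table); the function name `win_tie_loss` is ours.

```python
import numpy as np

def win_tie_loss(ours, theirs, t_crit=2.776):
    """Paired t-test sketch: returns 'win'/'tie'/'loss' for our
    method vs. a competitor over paired per-run scores. t_crit is
    the two-sided 95% critical value for the given df (2.776 for
    df = 4, i.e., 5 runs)."""
    d = np.asarray(ours, float) - np.asarray(theirs, float)
    n = len(d)
    se = d.std(ddof=1) / np.sqrt(n)        # standard error of the mean difference
    if se == 0:                            # identical differences across runs
        return "tie" if d.mean() == 0 else ("win" if d.mean() > 0 else "loss")
    t = d.mean() / se
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"
```

Each cell of Table II aggregates these outcomes over the evaluation points along the learning curve.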

For the computational cost, the time complexity of the proposed method depends on the numbers of ADMM and alternating iterations, the number of classes, the numbers of labeled and unlabeled samples, and the dimension of the data. QUIRE and Batchrank are the most costly and share the same time complexity. Hence, compared with Adaptive and AUDI, the proposed method is more costly, but it is relatively efficient compared with QUIRE and Batchrank. We show the time complexity of all the methods in Table III.

Methods Time complexity
RMLAL
Adaptive
AUDI
Batchrank
QUIRE
TABLE III: The time complexity of all the methods.

IV-B Evaluation of Parameters

In the proposed method, the kernel size is very important for the MCC, as it controls all the robust properties of correntropy [24]. There are also two tradeoff parameters, on the uncertain part and the representative part respectively. For convenience, in our experiments we fixed the kernel size in the feature space according to the dimension of the data, and evaluated the influence of the kernel size for MCC in the label space. We report the average results for several settings of this kernel size on two popular benchmark data sets, emotions and scene [21], which have the same number of labels but different LC. The tradeoff parameters were set to fixed values, and the other settings were the same as in the previous experiments.
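To make the role of the kernel size concrete, the Gaussian kernel weight induced by the correntropy loss is w(e) = exp(-e^2 / (2*sigma^2)): large errors, such as those caused by outlier labels, receive weights near zero, and sigma sets the scale at which an error starts to count as an outlier. A minimal sketch (the function name `correntropy_weight` is ours):

```python
import numpy as np

def correntropy_weight(err, sigma):
    """Gaussian-kernel weight induced by the correntropy loss:
    w(e) = exp(-e^2 / (2 * sigma^2)). Large errors (e.g., outlier
    labels) are driven toward zero weight; sigma controls the scale
    at which errors are treated as outliers."""
    return np.exp(-np.asarray(err, float) ** 2 / (2.0 * sigma ** 2))
```

With sigma = 1, an inlier error of 0.1 keeps nearly full weight while an outlier error of 5 is suppressed almost entirely; as sigma grows very large relative to the errors, all weights approach 1 and the behaviour tends toward that of MSE.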

Fig. 5: Comparison of different kernel sizes on two data sets: (a) Scene; (b) Emotions.
Fig. 6: Comparison of different tradeoff parameter pairs on two data sets: (a) Scene; (b) Emotions.

Fig. 5 shows the average results over 10 runs as the kernel size changes. From Fig. 5, we can observe that the larger the kernel size, the better the results obtained by the proposed method. This may be because, when the kernel size is large, the contributions of the outlier labels to the MCC-based objective function are small, so the influence of the outlier labels is decreased as much as possible. Hence, we can set the kernel size to a larger value for better performance. Fig. 6 shows the average results over 10 runs with different pairs of the tradeoff parameters. From these results, we can observe that the balance between uncertain and representative information has a large influence on the results. This may be because the numbers of labeled and unlabeled samples change during the active learning process: the uncertain information is related to the labeled data and the representative information to the unlabeled data, so fixed tradeoff values make it hard to control the required information in different iterations. From Fig. 6, we can also observe that when the tradeoff parameter on the representative part is large and that on the uncertain part is small, the results on the two data sets are consistent and relatively good. Although these results are not the best, the proposed method performs stably and is superior to the opposite setting. Hence, in practice, a large tradeoff on the representative part and a small one on the uncertain part can be adopted so that the unlabeled data are fully used.

IV-C Further Analysis

To further explain the motivation of the proposed method, we replaced the MCC loss function with MSE, which is usually adopted in state-of-the-art methods [3, 22]. A visual data set, PASCAL VOC2007444http://host.robots.ox.ac.uk/pascal/VOC/voc2007/examples/index.html, was also adopted. We selected a subset of 4,666 samples and 20 classes from PASCAL VOC2007. The PHOW features and spatial histograms of each image were extracted with the VLFeat toolbox555http://www.vlfeat.org/. To observe the motivation directly, we show the images queried at the first, twentieth, fortieth, sixtieth, eightieth, and one hundredth iterations in Fig. 7 and Fig. 8, which were obtained by the proposed method based on MCC and MSE respectively. The results of MCC and MSE over the whole active learning process are shown in Fig. 9. From Fig. 7, we can observe that the labels of each image are all highly relevant to the image, and there are no outlier labels. Moreover, compared with the background, the object corresponding to each image's label covers a larger region of the image, which makes the object highly relevant to the image. In Fig. 8, the leaves in the background of the image selected at the twentieth iteration cover a larger region than the bird, which is a label of the image, and the mountain in the background of the image selected at the sixtieth iteration covers a larger region than the cow, which is a label of the image. Hence, the labels of the images selected at the twentieth and sixtieth iterations are less relevant to the images than their backgrounds, and these labels appear to be outlier labels.
In a word, the method based on MSE may select images whose backgrounds cover larger regions than the objects corresponding to their labels, while the proposed method based on MCC can decrease the influence of the outlier labels and select images whose labeled objects are more prominent than the backgrounds, making full use of the labels in the images.

(a) cat, TV monitor
(b) dining table, chair, bottle
(c) horse, person
(d) motorbike
(e) bicycle, person
(f) boat
Fig. 7: The images queried based on MCC at several iterations: (a) the 1st iteration; (b) the 20th iteration; (c) the 40th iteration; (d) the 60th iteration; (e) the 80th iteration; (f) the 100th iteration.
(a) aeroplane
(b) bird
(c) bird, person
(d) cow
(e) chair, dining table
(f) cat
Fig. 8: The images queried based on MSE at several iterations: (a) the 1st iteration; (b) the 20th iteration; (c) the 40th iteration; (d) the 60th iteration; (e) the 80th iteration; (f) the 100th iteration.
Fig. 9: The average results of the proposed method based on MCC and MSE in the whole active learning process

V Conclusion

Outlier labels are very common in multi-label scenarios and may bias the supervised information. In this paper, we propose a robust multi-label active learning method based on MCC to solve this problem. The proposed method queries the samples that can not only build a strong query model to measure uncertainty but also represent the similarity of the multi-label data well. Different from traditional active learning methods that combine uncertainty and representativeness heuristically, we merge representativeness into uncertainty through the predicted labels of the unlabeled data with MCC, enhancing the uncertain information. With MCC, the supervised information of outlier labels is suppressed, and that of discriminative labels is enhanced. The proposed method outperforms state-of-the-art methods in most of the experiments. The experimental analysis also reveals that it is beneficial to update the tradeoff parameters that balance the uncertain and representative information during the query process. In future work, we plan to develop an adaptive mechanism to tune these parameters automatically, making our algorithm more practical.

References

  • [1] B. Settles, “Active learning literature survey,” Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
  • [2] Z. Wang and J. Ye, “Querying discriminative and representative samples for batch mode active learning,” ACM Transactions on Knowledge Discovery from Data, vol. 9, no. 3, p. 17, 2015.
  • [3] S.-J. Huang, R. Jin, and Z.-H. Zhou, “Active learning by querying informative and representative examples,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 10, pp. 1936–1949, 2014.
  • [4] E. Elhamifar, G. Sapiro, A. Yang, and S. Sasrty, “A convex optimization framework for active learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 209–216.
  • [5] X. Li and Y. Guo, “Multi-level adaptive active learning for scene classification,” in 13th European Conference on Computer Vision–ECCV 2014, 2014, pp. 234–249.
  • [6] J. Tang, Z.-J. Zha, D. Tao, and T.-S. Chua, “Semantic-gap-oriented active learning for multilabel image annotation,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2354–2360, 2012.
  • [7] B. Zhang, Y. Wang, and F. Chen, “Multilabel image classification via high-order label correlation driven active learning,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1430–1441, 2014.
  • [8] R. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for weakly-supervised multi-label image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 1, pp. 121–135, 2015.
  • [9] M. Singh, E. Curran, and P. Cunningham, “Active learning for multi-label image annotation,” in Proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science, 2009, pp. 173–182.
  • [10] X. Chen, A. Shrivastava, and A. Gupta, “Neil: Extracting visual knowledge from web data,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1409–1416.
  • [11] C. Vondrick and D. Ramanan, “Video annotation and tracking with active learning,” in Advances in Neural Information Processing Systems, 2011, pp. 28–36.
  • [12] D. Vasisht, A. Damianou, M. Varma, and A. Kapoor, “Active learning for sparse bayesian multilabel classification,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 472–481.
  • [13] C. Wan, X. Li, B. Kao, X. Yu, Q. Gu, D. Cheung, and J. Han, “Classification with active learning and meta-paths in heterogeneous information networks,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015, pp. 443–452.
  • [14] M. Zuluaga, G. Sergent, A. Krause, and M. Püschel, “Active learning for multi-objective optimization,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 462–470.
  • [15] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, 2002.
  • [16] Y. Guo, “Active instance sampling via matrix partition,” in Advances in Neural Information Processing Systems, 2010, pp. 802–810.
  • [17] Y. Guo and D. Schuurmans, “Discriminative batch mode active learning,” in Advances in neural information processing systems, 2008, pp. 593–600.
  • [18] C. Ye, J. Wu, V. S. Sheng, S. Zhao, P. Zhao, and Z. Cui, “Multi-label active learning with chi-square statistics for image classification,” in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015, pp. 583–586.
  • [19] X. Kong, W. Fan, and P. S. Yu, “Dual active feature and sample selection for graph classification,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 654–662.
  • [20] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, and R. Ji, “Representative discovery of structure cues for weakly-supervised image segmentation,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 470–479, 2014.
  • [21] S. Chakraborty, V. Balasubramanian, Q. Sun, S. Panchanathan, and J. Ye, “Active batch selection via convex relaxations with guaranteed solution bounds,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 10, pp. 1945–1958, 2015.
  • [22] S.-J. Huang, S. Chen, and Z.-H. Zhou, “Multi-label active learning: query type matters,” in Proceedings of the 24th International Conference on Artificial Intelligence, 2015, pp. 946–952.
  • [23] J. C. Principe, D. Xu, and J. Fisher, “Information theoretic learning,” Unsupervised adaptive filtering, vol. 1, pp. 265–319, 2000.
  • [24] W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: properties and applications in non-gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
  • [25] R. He, B.-G. Hu, W.-S. Zheng, and X.-W. Kong, “Robust principal component analysis based on maximum correntropy criterion,” IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1485–1494, 2011.
  • [26] X. Li, L. Wang, and E. Sung, “Multilabel svm active learning for image classification,” in International Conference on Image Processing, vol. 4, 2004, pp. 2207–2210.
  • [27] B. Yang, J.-T. Sun, T. Wang, and Z. Chen, “Effective multi-label active learning for text classification,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 917–926.
  • [28] T. Windheuser, H. Ishikawa, and D. Cremers, “Generalized roof duality for multi-label optimization: Optimal lower bounds and persistency,” in ECCV, 2012, pp. 400–413.
  • [29] X. Li and Y. Guo, “Active learning with multi-label SVM classification,” in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.
  • [30] S. Sarawagi and A. Bhamidipaty, “Interactive deduplication using active learning,” in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 269–278.
  • [31] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, and H.-J. Zhang, “Two-dimensional active learning for image classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [32] S.-J. Huang and Z.-H. Zhou, “Active query driven by uncertainty and diversity for incremental multi-label learning,” in 2013 IEEE 13th International Conference on Data Mining, 2013, pp. 1079–1084.
  • [33] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. Suykens, “Learning with the maximum correntropy criterion induced losses for regression,” Journal of Machine Learning Research, vol. 16, pp. 993–1034, 2015.
  • [34] R. He, W.-S. Zheng, T. Tan, and Z. Sun, “Half-quadratic-based iterative minimization for robust sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 261–275, 2014.
  • [35] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu, “Semi-supervised svm batch mode active learning for image retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
  • [36] J. C. Bezdek and R. J. Hathaway, “Convergence of alternating optimization,” Neural, Parallel & Scientific Computations, vol. 11, no. 4, pp. 351–368, 2003.
  • [37] X.-T. Yuan and B.-G. Hu, “Robust feature extraction via information theoretic learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 1193–1200.
  • [38] S. Boyd and L. Vandenberghe, Convex optimization.    Cambridge university press, 2004.
  • [39] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [40] Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, “Manifold regularized multitask learning for semi-supervised multilabel image classification,” IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 523–536, 2013.