# Multi-Instance Multi-Label Learning

###### Abstract

In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming single-instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning the single-instances or single-label examples directly.

Zhi-Hua Zhou (corresponding author; e-mail: zhouzh@lamda.nju.edu.cn), Min-Ling Zhang, Sheng-Jun Huang, Yu-Feng Li

National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210046, China

Key words:  Machine Learning, Multi-Instance Multi-Label Learning, MIML, Multi-Label Learning, Multi-Instance Learning

## 1 Introduction

In traditional supervised learning, an object is represented by an instance, i.e., a feature vector, and associated with a class label. Formally, let $\mathcal{X}$ denote the instance space (or feature space) and $\mathcal{Y}$ the set of class labels. The task is to learn a function $f: \mathcal{X} \to \mathcal{Y}$ from a given data set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$. Although this formalization is prevailing and successful, many real-world problems do not fit this framework well. In particular, each object in this framework belongs to only one concept and therefore the corresponding instance is associated with a single class label. However, many real-world objects are complicated and may belong to multiple concepts simultaneously. For example, an image can belong to several classes simultaneously, e.g., grasslands, lions, Africa, etc.; a text document can be classified into several categories if it is viewed from different aspects, e.g., scientific novel, Jules Verne’s writing or even books on traveling; a web page can be recognized as news page, sports page, soccer page, etc. In a specific real task, maybe only one of the multiple concepts is the right semantic meaning. For example, in image retrieval when a user is interested in an image with lions, s/he may be only interested in the concept lions instead of the other concepts grasslands and Africa associated with that image. The difficulty here is caused by those objects that involve multiple concepts. Choosing the right semantic meaning for such objects in a specific scenario is the fundamental difficulty of many tasks. Rather than starting from a large universe of all possible concepts involved in the task, it may be helpful to first get the subset of concepts associated with the concerned object, and then make a choice within this small subset. However, getting the subset of concepts, that is, assigning proper class labels to such objects, is still a challenging task.

We notice that as an alternative to representing an object by a single instance, in many cases it is possible to represent a complicated object using a set of instances. For example, multiple patches can be extracted from an image where each patch is described by an instance, and thus the image can be represented by a set of instances; multiple sections can be extracted from a document where each section is described by an instance, and thus the document can be represented by a set of instances; multiple links can be extracted from a web page where each link is described by an instance, and thus the web page can be represented by a set of instances. Using multiple instances to represent those complicated objects may be helpful because some inherent patterns which are closely related to some labels may become explicit and clearer. In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework, where an example is described by multiple instances and associated with multiple class labels.

Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects. To exploit the advantages of the MIML representation, new learning algorithms are needed. We propose the MimlBoost algorithm and the MimlSvm algorithm based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings under the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we also propose the D-MimlSvm (i.e., Direct MimlSvm) algorithm which tackles MIML problems directly in a regularization framework. Experiments show that this “direct” algorithm outperforms the “indirect” MimlSvm algorithm.

In some practical tasks we do not have access to the real objects themselves, such as the real images and the real web pages; instead, we are given observational data where each real object has already been represented by a single instance. Thus, in such cases we cannot capture more information from the real objects using the MIML representation. Even in this situation, however, MIML is still useful. We propose the InsDif (i.e., INStance DIFferentiation) algorithm which transforms single-instances into MIML examples for learning. This algorithm is able to achieve better performance than learning the single-instances directly in some tasks. This is not surprising: for an object associated with multiple class labels, if it is described by only a single instance, the information corresponding to these labels is mixed and thus difficult to learn; if we can transform the single-instance into a set of instances in some proper way, the mixed information might be separated to some extent and thus become less difficult to learn.

MIML can also be helpful for learning single-label objects. We propose the SubCod (i.e., SUB-COncept Discovery) algorithm which works by first discovering sub-concepts of the target concept and then transforming the data into MIML examples for learning. This algorithm is able to achieve better performance than learning the single-label examples directly in some tasks. This is also not surprising: for a label corresponding to a high-level complicated concept, it may be quite difficult to learn this concept directly since many different lower-level concepts are mixed; if we can transform the single label into a set of labels corresponding to some sub-concepts, which are relatively clearer and easier to learn, we can learn these labels first and then derive the high-level complicated label from them with less difficulty.

The rest of this paper is organized as follows. In Section 2, we review some related work. In Section 3, we propose the MIML framework. In Section 4 we propose the MimlBoost and MimlSvm algorithms, and apply them to tasks where the objects are represented as MIML examples. In Section 5 we present the D-MimlSvm algorithm and compare it with the “indirect” MimlSvm algorithm. In Sections 6 and 7, we study the usefulness of MIML when we do not have access to real objects. Concretely, in Section 6, we propose the InsDif algorithm and show that using MIML can be better than learning single-instances directly; in Section 7 we propose the SubCod algorithm and show that using MIML can be better than learning single-label examples directly. Finally, we conclude the paper in Section 8.

## 2 Related Work

Much work has been devoted to the learning of multi-label examples under the umbrella of multi-label learning. Note that multi-label learning studies the problem where a real-world object described by one instance is associated with a number of class labels, which is different from multi-class learning or multi-task learning [28]. (Most work on multi-label learning assumes that an instance can be associated with multiple valid labels, but there is also some work assuming that only one of the labels among those associated with an instance is correct [35].) In multi-class learning each object is only associated with a single label, while in multi-task learning different tasks may involve different domains and different data sets. Actually, traditional two-class and multi-class problems can both be cast into multi-label problems by restricting each instance to have only one label. The generality of multi-label problems, however, inevitably makes them more difficult to address.

One famous approach to solving multi-label problems is Schapire and Singer’s AdaBoost.MH [56], which is an extension of AdaBoost and is the core of a successful multi-label learning system BoosTexter [56]. This approach maintains a set of weights over both training examples and their labels in the training phase, where training examples and their corresponding labels that are hard (easy) to predict get incrementally higher (lower) weights. Later, De Comité et al. [22] used alternating decision trees [30] which are more powerful than decision stumps used in BoosTexter to handle multi-label data and thus obtained the AdtBoost.MH algorithm. Probabilistic generative models have been found useful in multi-label learning. McCallum [47] proposed a Bayesian approach for multi-label document classification, where a mixture probabilistic model (one mixture component per category) is assumed to generate each document and an EM algorithm is employed to learn the mixture weights and the word distributions in each mixture component. Ueda and Saito [65] presented another generative approach, which assumes that the multi-label text has a mixture of characteristic words appearing in single-label text belonging to each of the multi-labels. It is noteworthy that the generative models used in [47] and [65] are both based on learning text frequencies in documents, and are thus specific to text applications.

Many other multi-label learning algorithms have been developed, such as decision trees, neural networks, $k$-nearest neighbor classifiers, support vector machines, etc. Clare and King [21] developed a multi-label version of C4.5 decision trees through modifying the definition of entropy. Zhang and Zhou [79] presented the multi-label neural network Bp-Mll, which is derived from the Backpropagation algorithm by employing an error function to capture the fact that the labels belonging to an instance should be ranked higher than those not belonging to that instance. Zhang and Zhou [80] also proposed the Ml-$k$nn algorithm, which identifies the $k$ nearest neighbors of the concerned instance and then assigns labels according to the maximum a posteriori principle. Elisseeff and Weston [27] proposed the RankSvm algorithm for multi-label learning by defining a specific cost function and the corresponding margin for multi-label models. Other kinds of multi-label Svms have been developed by Boutell et al. [11] and Godbole and Sarawagi [33]. In particular, by hierarchically approximating the Bayes optimal classifier for the H-loss, Cesa-Bianchi et al. [15] proposed an algorithm which outperforms simple hierarchical Svms. Recently, non-negative matrix factorization has also been applied to multi-label learning [43], and multi-label dimensionality reduction methods have been developed [74, 85].

Roughly speaking, earlier approaches to multi-label learning attempt to divide multi-label learning into a number of two-class classification problems [36, 72] or transform it into a label ranking problem [56, 27], while some later approaches try to exploit the correlation between the labels [65, 43, 85].

Most studies on multi-label learning focus on text categorization [56, 47, 65, 22, 33, 39, 74], and several studies aim to improve the performance of text categorization systems by exploiting additional information given by the hierarchical structure of classes [14, 53, 15] or unlabeled data [43]. In addition to text categorization, multi-label learning has also been found useful in many other tasks such as scene classification [11], image and video annotation [38, 48], bioinformatics [21, 27, 12, 7, 13], and even association rule mining [63, 50].

There is a lot of research on multi-instance learning, which studies the problem where a real-world object described by a number of instances is associated with a single class label. Here the training set is composed of many bags each containing multiple instances; a bag is labeled positively if it contains at least one positive instance and negatively otherwise. The goal is to label unseen bags correctly. Note that although the training bags are labeled, the labels of their instances are unknown. This learning framework was formalized by Dietterich et al. [24] when they were investigating drug activity prediction.

Long and Tan [44] studied the Pac-learnability of multi-instance learning and showed that if the instances in the bags are independently drawn from a product distribution, the Apr (Axis-Parallel Rectangle) proposed by Dietterich et al. [24] is Pac-learnable. Auer et al. [5] showed that if the instances in the bags are not independent then Apr learning under the multi-instance learning framework is NP-hard. Moreover, they presented a theoretical algorithm that does not require a product distribution, which was transformed into a practical algorithm named Multinst [4]. Blum and Kalai [10] described a reduction from Pac-learning under the multi-instance learning framework to Pac-learning with one-sided random classification noise. They also presented an algorithm with smaller sample complexity than that of the algorithm of Auer et al. [5].

Many multi-instance learning algorithms have been developed during the past decade. To name a few, Diverse Density [45] and Em-dd [83], $k$-nearest neighbor algorithms Citation-$k$nn and Bayesian-$k$nn [67], decision tree algorithms Relic [54] and Miti [9], neural network algorithms Bp-mip and extensions [90, 77] and Rbf-mip [78], rule learning algorithm Ripper-mi [20], support vector machines and kernel methods mi-Svm and Mi-Svm [3], Dd-Svm [18], MissSvm [88], Mi-Kernel [32], Bag-Instance Kernel [19], Marginalized Mi-Kernel [42] and convex-hull method Ch-Fd [31], ensemble algorithms Mi-Ensemble [91], MiBoosting [70] and MilBoosting [6], logistic regression algorithm Mi-lr [51], etc. Actually, almost all popular machine learning algorithms have their multi-instance versions. Most algorithms attempt to adapt single-instance supervised learning algorithms to the multi-instance representation, by shifting their focus from discrimination on instances to discrimination on bags [91]. Recently, there has also been a proposal to adapt the multi-instance representation to single-instance algorithms by representation transformation [93].

It is worth mentioning that standard multi-instance learning [24] assumes that if a bag contains a positive instance then the bag is positive; this implies that there exists a key instance in a positive bag. Many algorithms were designed based on this assumption. For example, the point with maximal diverse density identified by the Diverse Density algorithm [45] actually corresponds to a key instance; many Svm algorithms defined the margin of a positive bag by the margin of its most positive instance [3, 19]. As research on multi-instance learning has progressed, however, other assumptions have been introduced [29]. For example, in contrast to assuming that there is a key instance, some work has assumed that there is no key instance and every instance contributes to the bag label [70, 17]. There is also an argument that the instances in the bags should not be treated independently [88]. All those assumptions have been put under the umbrella of multi-instance learning, and generally, in tackling real tasks it is difficult to know which assumption fits best. In other words, multi-instance learning algorithms based on different assumptions may be superior on different tasks.

In the early years of research on multi-instance learning, most work considered multi-instance classification with discrete-valued outputs. Later, multi-instance regression with real-valued outputs was studied [2, 52], and different versions of generalized multi-instance learning have been defined [68, 58]. The main difference between standard multi-instance learning and generalized multi-instance learning is that in standard multi-instance learning there is a single concept, and a bag is positive if it has an instance satisfying this concept; while in generalized multi-instance learning [68, 58] there are multiple concepts, and a bag is positive only when all concepts are satisfied (i.e., the bag contains instances from every concept). Recently, research on multi-instance clustering [82], multi-instance semi-supervised learning [49] and multi-instance active learning [60] has also been reported.
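The difference between the standard and the generalized assumption can be illustrated with a minimal sketch. It assumes, purely for illustration, that instance-level concept tests are available as Python predicates; in real multi-instance tasks they are unknown and must be learned. The names `standard_mi_label` and `generalized_mi_label` are hypothetical.

```python
from typing import Callable, Iterable, Sequence

Instance = Sequence[float]
Concept = Callable[[Instance], bool]  # hypothetical instance-level concept test


def standard_mi_label(bag: Iterable[Instance], concept: Concept) -> bool:
    """Standard assumption: a bag is positive iff some instance satisfies the concept."""
    return any(concept(x) for x in bag)


def generalized_mi_label(bag: Sequence[Instance], concepts: Iterable[Concept]) -> bool:
    """Generalized assumption: a bag is positive only if every concept
    is satisfied by at least one instance of the bag."""
    return all(any(c(x) for x in bag) for c in concepts)


# Toy bag with two instances and three toy concepts (all illustrative).
bag = [(0.2, 0.9), (0.8, 0.1)]
high_first = lambda x: x[0] > 0.5
high_second = lambda x: x[1] > 0.5
both_high = lambda x: x[0] > 0.5 and x[1] > 0.5

print(standard_mi_label(bag, high_first))
print(generalized_mi_label(bag, [high_first, high_second]))
print(generalized_mi_label(bag, [high_first, both_high]))
```

In the toy bag, each of the first two concepts is satisfied by a different instance, so the bag is positive under the generalized assumption even though no single instance satisfies both.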

Multi-instance learning has also attracted the attention of the Ilp community. It has been suggested that multi-instance problems could be regarded as a bias on inductive logic programming, and the multi-instance paradigm could be the key between the propositional and relational representations, being more expressive than the former, and much easier to learn than the latter [23]. Alphonse and Matwin [1] approximated a relational learning problem by a multi-instance problem, fed the resulting data to feature selection techniques adapted from propositional representations, and then transformed the filtered data back to relational representation for a relational learner. Thus, the expressive power of relational representation and the ease of feature selection on propositional representation are gracefully combined. This work confirms that multi-instance learning can really act as a bridge between propositional and relational learning.

Multi-instance learning techniques have already been applied to diverse applications including image categorization [17, 18], image retrieval [71, 84], text categorization [3, 60], web mining [86], spam detection [37], computer security [54], face detection [66, 76], computer-aided medical diagnosis [31], etc.

## 3 The MIML Framework

Let $\mathcal{X}$ denote the instance space and $\mathcal{Y}$ the set of class labels. Then, formally, the MIML task is defined as:

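• MIML (multi-instance multi-label learning): To learn a function $f: 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$ from a given data set $\{(X_1, Y_1), (X_2, Y_2), \cdots, (X_m, Y_m)\}$, where $X_i \subseteq \mathcal{X}$ is a set of instances $\{x_{i1}, x_{i2}, \cdots, x_{i,n_i}\}$, $x_{ij} \in \mathcal{X}$ $(j = 1, 2, \cdots, n_i)$, and $Y_i \subseteq \mathcal{Y}$ is a set of labels $\{y_{i1}, y_{i2}, \cdots, y_{i,l_i}\}$, $y_{ik} \in \mathcal{Y}$ $(k = 1, 2, \cdots, l_i)$. Here $n_i$ denotes the number of instances in $X_i$ and $l_i$ the number of labels in $Y_i$.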

It is interesting to compare MIML with the existing frameworks of traditional supervised learning, multi-instance learning, and multi-label learning.

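• Traditional supervised learning (single-instance single-label learning): To learn a function $f: \mathcal{X} \to \mathcal{Y}$ from a given data set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$.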

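• Multi-instance learning (multi-instance single-label learning): To learn a function $f: 2^{\mathcal{X}} \to \mathcal{Y}$ from a given data set $\{(X_1, y_1), (X_2, y_2), \cdots, (X_m, y_m)\}$, where $X_i \subseteq \mathcal{X}$ is a set of instances $\{x_{i1}, x_{i2}, \cdots, x_{i,n_i}\}$, $x_{ij} \in \mathcal{X}$ $(j = 1, 2, \cdots, n_i)$, and $y_i \in \mathcal{Y}$ is the label of $X_i$. (According to notions used in multi-instance learning, $(X_i, y_i)$ is a labeled bag while $X_i$ is an unlabeled bag.) Here $n_i$ denotes the number of instances in $X_i$.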

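• Multi-label learning (single-instance multi-label learning): To learn a function $f: \mathcal{X} \to 2^{\mathcal{Y}}$ from a given data set $\{(x_1, Y_1), (x_2, Y_2), \cdots, (x_m, Y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $Y_i \subseteq \mathcal{Y}$ is a set of labels $\{y_{i1}, y_{i2}, \cdots, y_{i,l_i}\}$, $y_{ik} \in \mathcal{Y}$ $(k = 1, 2, \cdots, l_i)$. Here $l_i$ denotes the number of labels in $Y_i$.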
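The four frameworks differ only in whether the input is one instance or a set of instances, and whether the output is one label or a set of labels. The following minimal sketch makes this concrete with Python data classes. The class names, the field names and the helper `as_miml` are illustrative assumptions, not part of the formal definitions; the helper shows that each of the three simpler example types can be viewed as a special case of an MIML example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Instance = Tuple[float, ...]  # a feature vector


@dataclass(frozen=True)
class SislExample:  # traditional supervised learning: one instance, one label
    x: Instance
    y: str


@dataclass(frozen=True)
class MilExample:  # multi-instance learning: a bag of instances, one label
    X: Tuple[Instance, ...]
    y: str


@dataclass(frozen=True)
class MllExample:  # multi-label learning: one instance, a set of labels
    x: Instance
    Y: FrozenSet[str]


@dataclass(frozen=True)
class MimlExample:  # MIML: a bag of instances, a set of labels
    X: Tuple[Instance, ...]
    Y: FrozenSet[str]


def as_miml(example) -> MimlExample:
    """View any of the four example types as an MIML example
    (a single instance becomes a bag of size one, a single label a set of size one)."""
    X = example.X if hasattr(example, "X") else (example.x,)
    Y = example.Y if hasattr(example, "Y") else frozenset({example.y})
    return MimlExample(X=X, Y=Y)


image = MimlExample(
    X=((0.1, 0.7), (0.4, 0.2), (0.9, 0.3)),  # e.g., three image patches
    Y=frozenset({"lions", "grasslands", "Africa"}),
)
print(as_miml(SislExample(x=(0.1, 0.7), y="lions")))
```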

From Fig. 1 we can see the differences among these learning frameworks. In fact, the multi-instance, multi-label and MIML frameworks all result from the ambiguities in representing real-world objects. Multi-instance learning studies the ambiguity in the input space (or instance space), where an object has many alternative input descriptions, i.e., instances; multi-label learning studies the ambiguity in the output space (or label space), where an object has many alternative output descriptions, i.e., labels; while MIML considers the ambiguities in both the input and output spaces simultaneously. In solving real-world problems, having a good representation is often more important than having a strong learning algorithm, because a good representation may capture more meaningful information and make the learning task easier to tackle. Since many real objects inherently have input ambiguity as well as output ambiguity, MIML is more natural and convenient for tasks involving such objects.

It is worth mentioning that MIML is more reasonable than (single-instance) multi-label learning in many cases. Suppose a multi-label object is described by one instance but associated with $l$ class labels, namely label$_1$, label$_2$, $\cdots$, label$_l$. If we represent the multi-label object using a set of $n$ instances, namely instance$_1$, instance$_2$, $\cdots$, instance$_n$, the underlying information in a single instance may become easier to exploit, and for each label the number of training instances can be significantly increased. So, transforming multi-label examples to MIML examples for learning may be beneficial in some tasks, which will be shown in Section 6. Moreover, when representing the multi-label object using a set of instances, the relation between the input patterns and the semantic meanings may become more easily discoverable. Note that in some cases, understanding why a particular object has a certain class label is even more important than simply making an accurate prediction, and MIML offers a possibility for this purpose. For example, under the MIML representation, we may discover that one object has label$_1$ because it contains instance$_n$; it has label$_l$ because it contains instance$_i$; while the occurrence of both instance$_1$ and instance$_i$ triggers label$_j$.

MIML can also be helpful for learning single-label examples involving complicated high-level concepts. For example, as Fig. 2(a) shows, the concept Africa has a broad connotation and the images belonging to Africa have great variance, thus it is not easy to classify the top-left image in Fig. 2(a) into the Africa class correctly. However, if we can exploit some low-level sub-concepts that are less ambiguous and easier to learn, such as tree, lions, elephant and grassland shown in Fig. 2(b), it is possible to induce the concept Africa much more easily than by learning it directly. The usefulness of MIML in this process will be shown in Section 7.

## 4 Solving MIML Problems by Degeneration

It is evident that traditional supervised learning is a degenerated version of multi-instance learning as well as a degenerated version of multi-label learning, while traditional supervised learning, multi-instance learning and multi-label learning are all degenerated versions of MIML. So, a simple idea to tackle MIML is to identify its equivalence in the traditional supervised learning framework, using multi-instance learning or multi-label learning as the bridge, as shown in Fig. 3.

• Solution A: Using multi-instance learning as the bridge:

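The MIML learning task, i.e., to learn a function $f_{MIML}: 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$, can be transformed into a multi-instance learning task, i.e., to learn a function $f_{MIL}: 2^{\mathcal{X}} \times \mathcal{Y} \to \{-1, +1\}$. For any $y \in \mathcal{Y}$, $f_{MIL}(X_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. The proper labels for a new example $X^*$ can be determined according to $Y^* = \{y \mid \text{sign}[f_{MIL}(X^*, y)] = +1\}$. This multi-instance learning task can be further transformed into a traditional supervised learning task, i.e., to learn a function $f_{SISL}: \mathcal{X} \times \mathcal{Y} \to \{-1, +1\}$, under a constraint specifying how to derive $f_{MIL}(X_i, y)$ from $f_{SISL}(x_{ij}, y)$ $(j = 1, 2, \cdots, n_i)$. For any $y \in \mathcal{Y}$, $f_{SISL}(x_{ij}, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. Here the constraint can be $f_{MIL}(X_i, y) = \text{sign}\left[\sum_{j=1}^{n_i} f_{SISL}(x_{ij}, y)\right]$, which has been used by Xu and Frank [70] in transforming multi-instance learning tasks into traditional supervised learning tasks. Note that other kinds of constraint can also be used here.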

• Solution B: Using multi-label learning as the bridge:

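The MIML learning task, i.e., to learn a function $f_{MIML}: 2^{\mathcal{X}} \to 2^{\mathcal{Y}}$, can be transformed into a multi-label learning task, i.e., to learn a function $f_{MLL}: \mathcal{Z} \to 2^{\mathcal{Y}}$. For any $z_i \in \mathcal{Z}$, $f_{MLL}(z_i) = f_{MIML}(X_i)$ if $z_i = \phi(X_i)$, $\phi: 2^{\mathcal{X}} \to \mathcal{Z}$. The proper labels for a new example $X^*$ can be determined according to $Y^* = f_{MLL}(\phi(X^*))$. This multi-label learning task can be further transformed into a traditional supervised learning task, i.e., to learn a function $f_{SISL}: \mathcal{Z} \times \mathcal{Y} \to \{-1, +1\}$. For any $y \in \mathcal{Y}$, $f_{SISL}(z_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. That is, $f_{MLL}(z_i) = \{y \mid f_{SISL}(z_i, y) = +1\}$. Here the mapping $\phi$ can be implemented with constructive clustering, which was proposed by Zhou and Zhang [93] in transforming multi-instance bags into traditional single-instances. Note that other kinds of mappings can also be used here.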
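The two degeneration chains can be sketched at prediction time as follows. This is a minimal illustration under stated assumptions, not the algorithms proposed later: the single-instance scorer `toy_f_sisl` and the bag-to-vector mapping `toy_phi` (a plain feature average) are hypothetical stand-ins for a trained classifier and for constructive clustering, and ties in `sign` are mapped to $-1$, a case the description above leaves open.

```python
from typing import Callable, Hashable, Iterable, Sequence, Set

Instance = Sequence[float]
Label = Hashable
Scorer = Callable[[Instance, Label], int]  # plays the role of f_SISL, returns +1 or -1


def sign(v: float) -> int:
    # Assumption: a zero sum is treated as -1.
    return 1 if v > 0 else -1


def f_mil(f_sisl: Scorer, bag: Iterable[Instance], y: Label) -> int:
    """Solution A constraint: f_MIL(X, y) = sign(sum_j f_SISL(x_j, y))."""
    return sign(sum(f_sisl(x, y) for x in bag))


def predict_solution_a(f_sisl: Scorer, bag: Sequence[Instance],
                       labels: Iterable[Label]) -> Set[Label]:
    """Y* = {y | sign[f_MIL(X*, y)] = +1}."""
    return {y for y in labels if f_mil(f_sisl, bag, y) == 1}


def predict_solution_b(f_sisl: Scorer, phi: Callable[[Sequence[Instance]], Instance],
                       bag: Sequence[Instance], labels: Iterable[Label]) -> Set[Label]:
    """Y* = f_MLL(phi(X*)) with f_MLL(z) = {y | f_SISL(z, y) = +1}."""
    z = phi(bag)
    return {y for y in labels if f_sisl(z, y) == 1}


# Hypothetical stand-ins, for illustration only.
def toy_f_sisl(x: Instance, y: Label) -> int:
    index = {"big": 0, "bright": 1}[y]
    return 1 if x[index] > 0 else -1


def toy_phi(bag: Sequence[Instance]) -> Instance:
    return tuple(sum(column) / len(bag) for column in zip(*bag))


bag = [(1.0, -1.0), (2.0, -3.0), (-1.0, -2.0)]
print(predict_solution_a(toy_f_sisl, bag, ["big", "bright"]))
print(predict_solution_b(toy_f_sisl, toy_phi, bag, ["big", "bright"]))
```

Solution A aggregates instance-level decisions into a bag-level decision per label, whereas Solution B first collapses the bag into a single vector and then decides per label.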

In the rest of this section we will propose two MIML algorithms, MimlBoost and MimlSvm. MimlBoost is an illustration of Solution A, which uses category-wise decomposition for the A1 step in Fig. 3 and MiBoosting for A2; MimlSvm is an illustration of Solution B, which uses clustering-based representation transformation for the B1 step and MlSvm for B2. Other MIML algorithms can be developed by taking alternative options. Both MimlBoost and MimlSvm are quite simple. We will see that for dealing with complicated objects with multiple semantic meanings, good performance can be obtained under the MIML framework even by using such simple algorithms. This demonstrates that the MIML framework is very promising, and we expect better performance can be achieved in the future if researchers put forward more powerful MIML algorithms.

### 4.1 MimlBoost

Now we propose the MimlBoost algorithm according to the first solution mentioned above, that is, identifying the equivalence in the traditional supervised learning framework using multi-instance learning as the bridge. Note that this strategy can also be used to derive other kinds of MIML algorithms.

Given any set $\Omega$, let $|\Omega|$ denote its size, i.e., the number of elements in $\Omega$; given any predicate $\pi$, let $[\![\pi]\!]$ be 1 if $\pi$ holds and 0 otherwise; given $(X_i, Y_i)$, for any $y \in \mathcal{Y}$, let $\Psi(X_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise, where $\Psi$ is a function $\Psi: 2^{\mathcal{X}} \times \mathcal{Y} \to \{-1, +1\}$ which judges whether a label $y$ is a proper label of $X_i$ or not. The basic assumption of MimlBoost is that the labels are independent so that the MIML task can be decomposed into a series of multi-instance learning tasks, by treating each label as a task. The pseudo-code of MimlBoost is summarized in Appendix A (Table A.1).

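In the first step of MimlBoost, each MIML example $(X_u, Y_u)$ $(u = 1, 2, \cdots, m)$ is transformed into a set of $|\mathcal{Y}|$ multi-instance bags, i.e., $\{[(X_u, y_1), \Psi(X_u, y_1)], [(X_u, y_2), \Psi(X_u, y_2)], \cdots, [(X_u, y_{|\mathcal{Y}|}), \Psi(X_u, y_{|\mathcal{Y}|})]\}$. Note that $[(X_u, y_v), \Psi(X_u, y_v)]$ $(v = 1, 2, \cdots, |\mathcal{Y}|)$ is a labeled multi-instance bag where $(X_u, y_v)$ is a bag containing $n_u$ instances, i.e., $\{(x_{u1}, y_v), (x_{u2}, y_v), \cdots, (x_{u,n_u}, y_v)\}$, and $\Psi(X_u, y_v) \in \{-1, +1\}$ is the label of this bag.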

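Thus, the original MIML data set is transformed into a multi-instance data set containing $m \times |\mathcal{Y}|$ bags. We order them as $[(X_1, y_1), \Psi(X_1, y_1)], \cdots, [(X_1, y_{|\mathcal{Y}|}), \Psi(X_1, y_{|\mathcal{Y}|})], [(X_2, y_1), \Psi(X_2, y_1)], \cdots, [(X_m, y_{|\mathcal{Y}|}), \Psi(X_m, y_{|\mathcal{Y}|})]$, and let $[(X^{(i)}, y^{(i)}), \Psi(X^{(i)}, y^{(i)})]$ denote the $i$-th of these $m \times |\mathcal{Y}|$ bags, which contains $n_i$ instances.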
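The transformation described above can be sketched in a few lines. The function name `miml_to_mi_bags` and the use of plain tuples and lists are illustrative assumptions; each instance is paired with the candidate label so that a single multi-instance learner can serve all labels.

```python
from typing import Hashable, List, Sequence, Set, Tuple

Instance = Tuple[float, ...]
Label = Hashable
MimlExample = Tuple[Sequence[Instance], Set[Label]]   # (X_u, Y_u)
LabeledBag = Tuple[List[Tuple[Instance, Label]], int]  # ([(x, y_v), ...], Psi)


def miml_to_mi_bags(dataset: Sequence[MimlExample],
                    label_space: Sequence[Label]) -> List[LabeledBag]:
    """Turn m MIML examples into m * |Y| labeled multi-instance bags,
    ordered by example first and by label second."""
    bags: List[LabeledBag] = []
    for X_u, Y_u in dataset:
        for y_v in label_space:
            bag = [(x, y_v) for x in X_u]   # every instance is tagged with y_v
            psi = 1 if y_v in Y_u else -1   # Psi(X_u, y_v)
            bags.append((bag, psi))
    return bags


dataset = [
    ([(0.1, 0.2), (0.3, 0.4)], {"lions", "Africa"}),
    ([(0.5, 0.6)], {"grasslands"}),
]
label_space = ["lions", "grasslands", "Africa"]
for bag, psi in miml_to_mi_bags(dataset, label_space):
    print(psi, bag)
```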

Then, from the data set a multi-instance learning function $f_{MIL}$ can be learned, which can accomplish the desired MIML function because $f_{MIML}(X^*) = \{y \mid \text{sign}[f_{MIL}(X^*, y)] = +1\}$. In this paper, the MiBoosting algorithm [70] is used to implement $f_{MIL}$. Note that by using MiBoosting, the MimlBoost algorithm assumes that all instances in a bag contribute independently and equally to the label of that bag.

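For convenience, let $(B, g)$ denote the bag $[(X, y), \Psi(X, y)]$, $B \in \mathcal{B}$, $g \in \mathcal{G}$, and let $E$ denote the expectation. Then, here the goal is to learn a function $\mathcal{F}(B)$ minimizing the bag-level exponential loss $E_{\mathcal{B}} E_{\mathcal{G}|\mathcal{B}}[\exp(-g\mathcal{F}(B))]$, which ultimately estimates the bag-level log-odds function $\frac{1}{2}\log\frac{Pr(g = 1 \mid B)}{Pr(g = -1 \mid B)}$ on the training set. In each boosting round, the aim is to expand $\mathcal{F}(B)$ into $\mathcal{F}(B) + c f(B)$, i.e., adding a new weak classifier, so that the exponential loss is minimized. Assuming that all instances in a bag contribute equally and independently to the bag’s label, $f(B) = \frac{1}{n_B}\sum_j h(b_j)$ can be derived, where $h(b_j) \in \{-1, +1\}$ is the prediction of the instance-level classifier $h(\cdot)$ for the $j$-th instance of the bag $B$, and $n_B$ is the number of instances in $B$.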

It has been shown by [70] that the best $f(B)$ to be added can be achieved by seeking $h(\cdot)$ which maximizes $\sum_i \left[\frac{1}{n_i} W^{(i)} g^{(i)} \sum_j h(b^{(i)}_j)\right]$, given the bag-level weights $W = \exp(-g\mathcal{F}(B))$. By assigning each instance the label of its bag and the corresponding weight $W^{(i)}/n_i$, $h(\cdot)$ can be learned by minimizing the weighted instance-level classification error. This actually corresponds to Step 3a of MimlBoost. When $f(B)$ is found, the best multiplier $c > 0$ can be obtained by directly optimizing the exponential loss:

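$$\begin{aligned} E_{\mathcal{B}} E_{\mathcal{G}|\mathcal{B}}\left[\exp\left(-g\mathcal{F}(B) + c\left(-g f(B)\right)\right)\right] &= \sum_i W^{(i)} \exp\left[c\left(-\frac{g^{(i)} \sum_j h(b^{(i)}_j)}{n_i}\right)\right] \\ &= \sum_i W^{(i)} \exp\left[(2e^{(i)} - 1)c\right], \end{aligned} \tag{1}$$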

where $e^{(i)} = \frac{1}{n_i}\sum_j [\![h(b^{(i)}_j) \neq g^{(i)}]\!]$ (computed in Step 3b). Minimization of this expectation actually corresponds to Step 3d, where numeric optimization techniques such as the quasi-Newton method can be used. Note that in Step 3c if $e^{(i)} \geq 0.5$, the Boosting process will stop [89]. Finally, the bag-level weights are updated in Step 3f according to the additive structure of $\mathcal{F}(B)$.
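As a minimal numerical sketch of Steps 3b and 3d, the code below computes the per-bag error rates $e^{(i)}$ and then minimizes the right-hand side of Eq. 1 over $c \geq 0$. Instead of a quasi-Newton method, it uses bisection on the derivative of the loss, which suffices because the loss is convex in $c$. The function names, the search cap `c_max` and the tolerance are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def bag_error_rates(instance_preds: Sequence[Sequence[int]],
                    bag_labels: Sequence[int]) -> List[float]:
    """e^(i): fraction of instances in bag i whose prediction h differs from g^(i)."""
    return [sum(1 for h in preds if h != g) / len(preds)
            for preds, g in zip(instance_preds, bag_labels)]


def loss_derivative(c: float, weights: Sequence[float], errors: Sequence[float]) -> float:
    """d/dc of sum_i W_i * exp((2 e_i - 1) c); increasing in c since the loss is convex."""
    return sum(w * (2 * e - 1) * math.exp((2 * e - 1) * c)
               for w, e in zip(weights, errors))


def best_multiplier(weights: Sequence[float], errors: Sequence[float],
                    c_max: float = 50.0, tol: float = 1e-10) -> float:
    """Minimize sum_i W_i * exp((2 e_i - 1) c) over c in [0, c_max] by bisection."""
    if loss_derivative(0.0, weights, errors) >= 0.0:
        return 0.0      # the loss does not decrease for any c > 0
    if loss_derivative(c_max, weights, errors) < 0.0:
        return c_max    # still decreasing at the cap (e.g., every e_i < 0.5)
    lo, hi = 0.0, c_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if loss_derivative(mid, weights, errors) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Two bags: h is right on every instance of the first, wrong on every instance of the second.
errors = bag_error_rates([[1, 1], [1, 1]], [1, -1])
c = best_multiplier([0.75, 0.25], errors)
print(errors, c)
```

In this toy case the loss reduces to $0.75e^{-c} + 0.25e^{c}$, whose minimizer is $c = \frac{1}{2}\ln 3$, which the bisection recovers.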

### 4.2 MimlSvm

Now we propose the MimlSvm algorithm according to the second solution mentioned before, that is, identifying the equivalence in the traditional supervised learning framework using multi-label learning as the bridge. Note that this strategy can also be used to derive other kinds of MIML algorithms.

Again, given any set $\Omega$, let $|\Omega|$ denote its size, i.e., the number of elements in $\Omega$; given $(X_i, Y_i)$ and $z_i = \phi(X_i)$ where $\phi: 2^{\mathcal{X}} \to \mathcal{Z}$, for any $y \in \mathcal{Y}$, let $\Phi(z_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise, where $\Phi$ is a function $\Phi: \mathcal{Z} \times \mathcal{Y} \to \{-1, +1\}$. The basic assumption of MimlSvm is that the spatial distribution of the bags carries relevant information, and information helpful for label discrimination can be discovered by measuring the closeness between each bag and the representative bags identified through clustering. The pseudo-code of MimlSvm is summarized in Appendix A (Table A.2).

In the first step of MimlSvm, the $X_u$ of each MIML example $(X_u, Y_u)$ $(u = 1, 2, \cdots, m)$ is collected and put into a data set $\Gamma$. Then, in the second step, $k$-medoids clustering is performed on $\Gamma$. Since each data item in $\Gamma$, i.e., $X_u$, is an unlabeled multi-instance bag instead of a single instance, the Hausdorff distance [26] is employed to measure the distance. The Hausdorff distance is a well-known metric for measuring the distance between two bags of points, which has often been used in computer vision tasks; other techniques that can measure the distance between bags of points, such as the set kernel [32], can also be used here. In detail, given two bags $A = \{a_1, a_2, \cdots, a_{n_A}\}$ and $B = \{b_1, b_2, \cdots, b_{n_B}\}$, the Hausdorff distance between $A$ and $B$ is defined as

$$d_H(A, B) = \max\left\{\max_{a \in A}\min_{b \in B}\|a - b\|,\ \max_{b \in B}\min_{a \in A}\|b - a\|\right\}, \tag{2}$$

where $\|a - b\|$ measures the distance between the instances $a$ and $b$, which takes the form of Euclidean distance here.
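Eq. 2 translates directly into code. The following is a minimal sketch using only the Python standard library (`math.dist` requires Python 3.8 or later); the function name `hausdorff` and the representation of bags as lists of tuples are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple

Instance = Tuple[float, ...]


def hausdorff(A: Sequence[Instance], B: Sequence[Instance]) -> float:
    """Hausdorff distance between two non-empty bags under the Euclidean norm."""
    def directed(P: Sequence[Instance], Q: Sequence[Instance]) -> float:
        # For each point of P, the distance to its nearest point in Q; keep the worst case.
        return max(min(math.dist(p, q) for q in Q) for p in P)

    return max(directed(A, B), directed(B, A))


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (3.0, 0.0)]
print(hausdorff(A, B))  # → 2.0
```

Here the directed distance from $A$ to $B$ is 1 while the directed distance from $B$ to $A$ is 2, so the maximum of the two is returned.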

After the clustering process, the data set $\Gamma$ is divided into $k$ partitions, whose medoids are $M_t$ $(t = 1, 2, \cdots, k)$, respectively. With the help of these medoids, the original multi-instance example $X_u$ is transformed into a $k$-dimensional numerical vector $z_u$, where the $i$-th $(i = 1, 2, \cdots, k)$ component of $z_u$ is the distance between $X_u$ and $M_i$, that is, $d_H(X_u, M_i)$. In other words, $z_{ui}$ encodes some structure information of the data, that is, the relationship between $X_u$ and the $i$-th partition of $\Gamma$. This process resembles the constructive clustering process used by Zhou and Zhang [93] in transforming multi-instance examples into single-instance examples, except that in [93] the clustering is executed at the instance level while here it is executed at the bag level. Thus, the original MIML examples $(X_u, Y_u)$ $(u = 1, 2, \cdots, m)$ have been transformed into multi-label examples $(z_u, Y_u)$ $(u = 1, 2, \cdots, m)$, which corresponds to Step 3 of MimlSvm.
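The bag-level clustering and the subsequent representation transformation can be sketched as follows. This is a simplified illustration rather than the exact procedure of Table A.2: it uses a naive alternating $k$-medoids scheme, and it takes the first $k$ bags as initial medoids, which is an assumption made only to keep the example deterministic. The names `k_medoids` and `embed` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Instance = Tuple[float, ...]
Bag = Sequence[Instance]


def hausdorff(A: Bag, B: Bag) -> float:
    def directed(P: Bag, Q: Bag) -> float:
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))


def k_medoids(bags: Sequence[Bag], k: int, max_iter: int = 100) -> List[Bag]:
    """Naive alternating k-medoids over bags, with the Hausdorff distance."""
    medoid_idx = list(range(k))  # assumption: the first k bags are the initial medoids
    for _ in range(max_iter):
        # Assignment step: attach every bag to its nearest medoid.
        clusters: List[List[int]] = [[] for _ in range(k)]
        for i, bag in enumerate(bags):
            nearest = min(range(k), key=lambda t: hausdorff(bag, bags[medoid_idx[t]]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the bag with the smallest total distance.
        new_idx = []
        for t, members in enumerate(clusters):
            if not members:
                new_idx.append(medoid_idx[t])
                continue
            new_idx.append(min(
                members,
                key=lambda c: sum(hausdorff(bags[c], bags[j]) for j in members)))
        if new_idx == medoid_idx:
            break
        medoid_idx = new_idx
    return [bags[i] for i in medoid_idx]


def embed(bag: Bag, medoids: Sequence[Bag]) -> List[float]:
    """z_u: the i-th component is d_H(X_u, M_i)."""
    return [hausdorff(bag, M) for M in medoids]


bags = [[(0.0,)], [(10.0,)], [(1.0,)], [(11.0,)]]
medoids = k_medoids(bags, k=2)
print(embed([(2.0,)], medoids))  # → [2.0, 8.0]
```

After this step, each bag is an ordinary $k$-dimensional vector, so any multi-label learner for single instances can be applied.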

Then, from the data set a multi-label learning function $f_{MLL}$ can be learned, which can accomplish the desired MIML function because $f_{MIML}(X^*) = f_{MLL}(z^*)$. In this paper, the MlSvm algorithm [11] is used to implement $f_{MLL}$. Concretely, MlSvm decomposes the multi-label learning problem into multiple independent binary classification problems (one per class), where each example associated with the label set $Y$ is regarded as a positive example when building the Svm for any class $y \in Y$, and as a negative example when building the Svm for any class $y \notin Y$, as shown in Step 4 of MimlSvm. In making predictions, the T-Criterion [11] is used, which actually corresponds to Step 5 of the MimlSvm algorithm. That is, the test example is labeled by all the class labels with positive Svm scores, except that when all the Svm scores are negative, the test example is labeled by the class label with the top (least negative) score.
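The per-class decomposition and the T-Criterion can be sketched independently of any particular Svm implementation. In the minimal sketch below, the Svm scores are assumed to be given as a dictionary from class label to real-valued output; the function names `binary_targets` and `t_criterion` are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Set

Label = Hashable


def binary_targets(label_sets: Sequence[Set[Label]], y: Label) -> List[int]:
    """Training targets for the binary problem of class y:
    +1 for examples whose label set contains y, -1 otherwise."""
    return [1 if y in Y else -1 for Y in label_sets]


def t_criterion(scores: Dict[Label, float]) -> Set[Label]:
    """Predict all labels with a positive score; if there is none,
    predict the single label with the top (least negative) score."""
    positive = {y for y, s in scores.items() if s > 0}
    if positive:
        return positive
    return {max(scores, key=scores.get)}


print(binary_targets([{"a", "b"}, {"c"}, {"a"}], "a"))
print(t_criterion({"a": 0.3, "b": -0.2, "c": 1.1}))
print(t_criterion({"a": -0.9, "b": -0.1}))
```

The fallback in `t_criterion` guarantees that every test example receives at least one label.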

### 4.3 Experiments

#### 4.3.1 Multi-Label Evaluation Criteria

In traditional supervised learning where each object has only one class label, accuracy is often used as the performance evaluation criterion. Typically, accuracy is defined as the percentage of test examples that are correctly classified. When learning with complicated objects associated with multiple labels simultaneously, however, accuracy becomes less meaningful. For example, if approach $A_1$ missed one proper label while approach $A_2$ missed four proper labels for a test example having five labels, it is obvious that $A_1$ is better than $A_2$, but the accuracy of $A_1$ and $A_2$ may be identical because both of them incorrectly classified the test example.

Five criteria are often used for evaluating the performance of learning with multi-label examples [56, 92]; they are hamming loss, one-error, coverage, ranking loss and average precision. Using the same notation as in Sections 3 and 4, given a test set $S=\{(X_1,Y_1),(X_2,Y_2),\cdots,(X_p,Y_p)\}$, these five criteria are defined as below. Here, $h(X_i)$ returns a set of proper labels of $X_i$; $h(X_i,y)$ returns a real value indicating the confidence for $y$ to be a proper label of $X_i$; $rank^h(X_i,y)$ returns the rank of $y$ derived from $h(X_i,y)$.

• $hloss_S(h)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{|\mathcal{Y}|}\,|h(X_i)\,\Delta\,Y_i|$, where $\Delta$ stands for the symmetric difference between two sets. The hamming loss evaluates how many times an object-label pair is misclassified, i.e., a proper label is missed or a wrong label is predicted. The performance is perfect when $hloss_S(h)=0$; the smaller the value of $hloss_S(h)$, the better the performance of $h$.

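• $\text{one-error}_S(h)=\frac{1}{p}\sum_{i=1}^{p}[\![\,\arg\max_{y\in\mathcal{Y}}h(X_i,y)\notin Y_i\,]\!]$, where $[\![\pi]\!]$ equals 1 if $\pi$ holds and 0 otherwise. The one-error evaluates how many times the top-ranked label is not a proper label of the object. The performance is perfect when $\text{one-error}_S(h)=0$; the smaller the value of $\text{one-error}_S(h)$, the better the performance of $h$.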

• $coverage_S(h)=\frac{1}{p}\sum_{i=1}^{p}\max_{y\in Y_i}rank^h(X_i,y)-1$. The coverage evaluates how far, on average, one needs to go down the list of labels in order to cover all the proper labels of the object. It is loosely related to precision at the level of perfect recall. The smaller the value of $coverage_S(h)$, the better the performance of $h$.

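• $rloss_S(h)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i||\overline{Y_i}|}\,\big|\{(y_1,y_2)\mid h(X_i,y_1)\le h(X_i,y_2),\ (y_1,y_2)\in Y_i\times\overline{Y_i}\}\big|$, where $\overline{Y_i}$ denotes the complementary set of $Y_i$ in $\mathcal{Y}$. The ranking loss evaluates the average fraction of label pairs that are misordered for the object. The performance is perfect when $rloss_S(h)=0$; the smaller the value of $rloss_S(h)$, the better the performance of $h$.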

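• $avgprec_S(h)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i|}\sum_{y\in Y_i}\frac{|\{y'\mid rank^h(X_i,y')\le rank^h(X_i,y),\ y'\in Y_i\}|}{rank^h(X_i,y)}$. The average precision evaluates the average fraction of proper labels ranked above a particular label $y\in Y_i$. The performance is perfect when $avgprec_S(h)=1$; the larger the value of $avgprec_S(h)$, the better the performance of $h$.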
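The five criteria above can be computed along the following lines. This is a sketch following the standard definitions of these criteria in the multi-label literature; the function name, the dense array layout (one row per test example, one column per label), and the assumption that every example has at least one proper and one non-proper label are choices made here for illustration.

```python
import numpy as np

def multilabel_criteria(scores, predicted, truth):
    """Hamming loss, one-error, coverage, ranking loss and average precision.

    scores:    (p, Q) real-valued confidences
    predicted: (p, Q) boolean predicted label sets
    truth:     (p, Q) boolean proper label sets
    """
    scores = np.asarray(scores, dtype=float)
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    p, q = truth.shape
    # ranks[i, y] = 1 for the label with the highest score of example i.
    order = (-scores).argsort(axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(p)[:, None], order] = np.arange(1, q + 1)

    hloss = np.mean(predicted != truth)
    one_error = np.mean(~truth[np.arange(p), scores.argmax(axis=1)])
    coverage = np.mean([ranks[i, truth[i]].max() - 1 for i in range(p)])
    rloss, avgprec = [], []
    for i in range(p):
        pos, neg = scores[i, truth[i]], scores[i, ~truth[i]]
        # Fraction of (proper, non-proper) pairs that are misordered.
        rloss.append(np.mean(pos[:, None] <= neg[None, :]))
        # For the j-th best-ranked proper label, j proper labels rank at or above it.
        r = np.sort(ranks[i, truth[i]])
        avgprec.append(np.mean(np.arange(1, r.size + 1) / r))
    return {
        "hamming_loss": float(hloss),
        "one_error": float(one_error),
        "coverage": float(coverage),
        "ranking_loss": float(np.mean(rloss)),
        "average_precision": float(np.mean(avgprec)),
    }
```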

In addition to the above criteria, we design two new multi-label criteria, average recall and average F1, as below.

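• $avgrecl_S(h)=\frac{1}{p}\sum_{i=1}^{p}\frac{|\{y\mid rank^h(X_i,y)\le|h(X_i)|,\ y\in Y_i\}|}{|Y_i|}$. The average recall evaluates the average fraction of proper labels that have been predicted. The performance is perfect when $avgrecl_S(h)=1$; the larger the value of $avgrecl_S(h)$, the better the performance of $h$.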

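• $avgF1_S(h)=\frac{2\times avgprec_S(h)\times avgrecl_S(h)}{avgprec_S(h)+avgrecl_S(h)}$. The average F1 expresses a tradeoff between the average precision and the average recall. The performance is perfect when $avgF1_S(h)=1$; the larger the value of $avgF1_S(h)$, the better the performance of $h$.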
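The two additional criteria can be sketched as below. The sketch reads "proper labels that have been predicted" as the intersection of the predicted and proper label sets, and takes average F1 as the harmonic mean of average precision and average recall; the function names and the boolean-matrix layout are assumptions of this illustration.

```python
import numpy as np

def average_recall(predicted, truth):
    """Mean fraction of each example's proper labels that were predicted.

    Assumes every example has at least one proper label.
    """
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    hits = (predicted & truth).sum(axis=1)
    return float(np.mean(hits / truth.sum(axis=1)))

def average_f1(avgprec, avgrecl):
    """Harmonic mean of average precision and average recall."""
    if avgprec + avgrecl == 0:
        return 0.0
    return 2.0 * avgprec * avgrecl / (avgprec + avgrecl)
```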

Note that since the above criteria measure the performance from different aspects, it is difficult for one algorithm to outperform another on every one of these criteria.

In the following we study the performance of MIML algorithms on two tasks involving complicated objects with multiple semantic meanings. We will show that for such tasks, MIML is a good choice, and good performance can be achieved even by using simple MIML algorithms such as MimlBoost and MimlSvm.

#### 4.3.2 Scene Classification

The scene classification data set consists of 2,000 natural scene images belonging to the classes desert, mountains, sea, sunset and trees. Over 22% of the images belong to multiple classes simultaneously. Each image has already been represented as a bag of nine instances generated by the Sbn method [46], which uses a Gaussian filter to smooth the image and then subsamples the image to a matrix of color blobs, where each blob is a set of pixels within the matrix. An instance corresponds to the combination of a single blob with its four neighboring blobs (up, down, left, right), which is described with 15 features. The first three features represent the mean R, G, B values of the central blob and the remaining twelve features express the differences in mean color values between the central blob and the other four neighboring blobs respectively. (The data set is available at http://lamda.nju.edu.cn/data_MIMLimage.ashx.)

We evaluate the performance of the MIML algorithms MimlBoost and MimlSvm. Note that MimlBoost and MimlSvm are merely proposed to illustrate the two general degeneration solutions to MIML problems shown in Fig. 3. We do not claim that they are the best algorithms that can be developed through the degeneration paths. There may exist other processes for transforming MIML examples into multi-instance single-label (MISL) examples or single-instance multi-label (SIML) examples. Even when using the same degeneration process as that used in MimlBoost and MimlSvm, there are many alternatives for realizing the second step. For example, by using mi-Svm [3] to replace the MiBoosting used in MimlBoost and by using the two-layer neural network structure [81] to replace the MlSvm used in MimlSvm, we get MimlSvm$_{mi}$ and MimlNn respectively. Their performance is also evaluated in our experiments.

We compare the MIML algorithms with several state-of-the-art algorithms for learning with multi-label examples, including AdtBoost.MH [22], RankSvm [27], MlSvm [11] and Ml-$k$nn [80]; these algorithms have been introduced briefly in Section 2. Note that these are single-instance algorithms that regard each image as a 135-dimensional feature vector, which is obtained by concatenating the nine 15-dimensional instances in order from upper-left to bottom-right.

The parameter configurations of RankSvm, MlSvm and Ml-$k$nn are set by considering the strategies adopted in [27], [11] and [80] respectively. For RankSvm, a polynomial kernel is used, where polynomial degrees of 2 to 9 are considered as in [27] and chosen by hold-out tests on training sets. For MlSvm, a Gaussian kernel is used. For Ml-$k$nn, the number of nearest neighbors considered is set to 10.

The boosting rounds of AdtBoost.MH and MimlBoost are set to 25 and 50, respectively. The performance of the two algorithms at different boosting rounds is shown in Appendix B (Fig. B.1); it can be observed that at those rounds the performance of the algorithms has become stable. Gaussian kernel Libsvm [16] is used for Step 3a of MimlBoost. MimlSvm and MimlSvm$_{mi}$ are also realized with Gaussian kernels. The parameter $k$ of MimlSvm is set to be 20% of the number of training images. The performance of this algorithm with different $k$ values is shown in Appendix B (Fig. B.2); it can be observed that the setting of $k$ does not significantly affect the performance of MimlSvm. Note that in Appendix B (Figs. B.1 and B.2) we plot $1-$ average precision, $1-$ average recall and $1-$ average F1 such that in all the figures, the lower the curve, the better the performance.

Here in the experiments, 1,500 images are used as training examples while the remaining 500 images are used for testing. Experiments are repeated for thirty runs by using random training/test partitions, and the average and standard deviation are summarized in Table 1, where the best performance on each criterion has been highlighted in boldface. (For the shared implementation of AdtBoost.MH (http://www.grappa.univ-lille3.fr/grappa/en_index.php3?info=software), ranking loss, average recall and average F1 are not available in the program’s outputs.)

Pairwise $t$-tests at the 95% significance level disclose that all the MIML algorithms are significantly better than AdtBoost.MH and MlSvm on all the seven evaluation criteria. This is impressive since, as mentioned before, these evaluation criteria measure the learning performance from different aspects and one algorithm rarely outperforms another algorithm on all criteria. MimlSvm and MimlSvm$_{mi}$ are both significantly better than RankSvm on all the evaluation criteria, while MimlBoost and MimlNn are both significantly better than RankSvm on the first five criteria. MimlNn is significantly better than Ml-$k$nn on all the evaluation criteria. Both MimlBoost and MimlSvm are significantly better than Ml-$k$nn on all criteria except hamming loss. MimlSvm is significantly better than Ml-$k$nn on one-error, average precision, average recall and average F1, while there are ties on the other criteria. Moreover, note that the best performance on every evaluation criterion is always attained by a MIML algorithm. Overall, comparison on the scene classification task shows that the MIML algorithms can be significantly better than the non-MIML algorithms; this validates the power of the MIML framework.

#### 4.3.3 Text Categorization

The Reuters-21578 data set is used in this experiment. The seven most frequent categories are considered. After removing documents that do not have labels or main texts, and randomly removing some documents that have only one label, a data set containing 2,000 documents is obtained, where over 14.9% of the documents have multiple labels. Each document is represented as a bag of instances according to the method used in [3]. Briefly, the instances are obtained by splitting each document into passages using overlapping windows of at most 50 words each. As a result, there are 2,000 bags and the number of instances in each bag varies from 2 to 26 (3.6 on average). The instances are represented based on term frequency. The words with high frequencies are considered, excluding “function words” that have been removed from the vocabulary using the Smart stop-list [55]. It has been found that based on document frequency, the dimensionality of the data set can be reduced to 1-10% without loss of effectiveness [73]. Thus, we use the top 2% most frequent words, and therefore each instance is a 243-dimensional feature vector. (The data set is available at http://lamda.nju.edu.cn/data_MIMLtext.ashx.)
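The bag construction described above can be sketched as follows. The text only states that the windows overlap and hold at most 50 words, so the stride of 25 words used here is an assumption, as are the function names and the pre-tokenised input.

```python
from collections import Counter

def split_into_passages(words, window=50, stride=25):
    """Split a token list into overlapping passages of at most `window` words.

    The stride (amount by which consecutive windows are shifted) is an
    assumed value; the source does not specify the overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    passages = []
    start = 0
    while start < len(words):
        passages.append(words[start:start + window])
        if start + window >= len(words):
            break  # the current window already reaches the end of the document
        start += stride
    return passages

def term_frequencies(passage, vocabulary):
    """Term-frequency vector of one passage over a fixed vocabulary."""
    counts = Counter(passage)
    return [counts[w] for w in vocabulary]
```

Each passage becomes one instance of the document's bag after being mapped to its term-frequency vector over the selected vocabulary.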

The parameter configurations of RankSvm, MlSvm and Ml-$k$nn are set in the same way as in Section 4.3.2. The boosting rounds of AdtBoost.MH and MimlBoost are set to 25 and 50, respectively. Linear kernels are used. The parameter $k$ of MimlSvm is set to be 20% of the number of training documents. The single-instance algorithms regard each document as a 243-dimensional feature vector which is obtained by aggregating all the instances in the same bag; this is equivalent to representing the document using a single term frequency feature vector.

Here in the experiments, 1,500 documents are used as training examples while the remaining 500 documents are used for testing. Experiments are repeated for thirty runs by using random training/test partitions, and the average and standard deviation are summarized in Table 2, where the best performance on each criterion has been highlighted in boldface.

Pairwise $t$-tests at the 95% significance level disclose that, impressively, both MimlSvm and MimlSvm$_{mi}$ are significantly better than all the non-MIML algorithms. MimlNn is significantly better than AdtBoost.MH, RankSvm, and Ml-$k$nn on all the evaluation criteria; it is significantly better than MlSvm on hamming loss, average recall and average F1, while there are ties on the other criteria. MimlBoost is significantly better than AdtBoost.MH on all criteria except that there is a tie on hamming loss; significantly better than RankSvm on all criteria; significantly better than MlSvm on average recall, with a tie on average F1; and significantly better than Ml-$k$nn on one-error, coverage, ranking loss and average precision. Moreover, note that the best performance on every evaluation criterion is always attained by a MIML algorithm. Overall, comparison on the text categorization task shows that the MIML algorithms are better than the non-MIML algorithms; this validates the power of the MIML framework.

## 5 Solving MIML Problems by Regularization

The degeneration methods presented in Section 4 may lose information during the degeneration process, and thus a “direct” MIML algorithm is desirable. In this section we propose a regularization method for MIML. In contrast to MimlSvm and MimlSvm$_{mi}$, this method is developed from the regularization framework directly and so we call it D-MimlSvm. The basic assumption of D-MimlSvm is that the labels associated with the same example have some relatedness, and the performance of classifying the bags depends on the loss between the labels and the predictions on the bags as well as on the constituent instances. Moreover, considering that for any class label the number of positive examples is smaller than that of negative examples, this method incorporates a mechanism to deal with class imbalance. We employ the constrained concave-convex procedure (Cccp), which has well-studied convergence properties [62], to solve the resultant non-convex optimization problem. We also present a cutting plane algorithm that finds the solution efficiently.

### 5.1 The Loss Function

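Given a set of MIML training examples $\{(X_i,Y_i)\mid 1\le i\le m\}$, the goal of D-MimlSvm is to learn a mapping $f:2^{\mathcal{X}}\to 2^{\mathcal{Y}}$ where the proper label set for each bag $X\subseteq\mathcal{X}$ corresponds to $f(X)\subseteq\mathcal{Y}$. Specifically, D-MimlSvm chooses to instantiate $f$ with $T$ functions, i.e. $f=(f_1,f_2,\cdots,f_T)$, where $T$ is the number of labels in the label space $\mathcal{Y}=\{l_1,l_2,\cdots,l_T\}$. Here, the $t$-th function $f_t:2^{\mathcal{X}}\to\mathbb{R}$ determines the belongingness of $l_t$ for $X$, i.e. $f(X)=\{l_t\mid f_t(X)>0,\ 1\le t\le T\}$. In addition, each single instance $x$ in a bag $X$ can be viewed as a bag $\{x\}$ containing only one instance, such that $f(\{x\})$ is also a well-defined function. For convenience, $f(\{x\})$ and $f_t(\{x\})$ are simplified as $f(x)$ and $f_t(x)$ in the rest of this section.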

To train the component functions $f_t$ ($1\le t\le T$) in $f$, D-MimlSvm employs the following empirical loss function $V$ involving two terms $V_1$ and $V_2$ (balanced by $\lambda$):

$$V(\{X_i\}_{i=1}^{m},\{Y_i\}_{i=1}^{m},f)=V_1(\{X_i\}_{i=1}^{m},\{Y_i\}_{i=1}^{m},f)+\lambda\cdot V_2(\{X_i\}_{i=1}^{m},f) \tag{3}$$

Here, the first term $V_1$ considers the loss between the ground-truth label set of each training bag $X_i$, i.e. $Y_i$, and its predicted label set, i.e. $f(X_i)$. Let $y_{it}=+1$ if $l_t\in Y_i$ holds ($1\le t\le T$). Otherwise, $y_{it}=-1$. Furthermore, let $(z)_+=\max(0,z)$ denote the hinge loss function. Accordingly, the first loss term $V_1$ is defined as:

$$V_1(\{X_i\}_{i=1}^{m},\{Y_i\}_{i=1}^{m},f)=\frac{1}{mT}\sum_{i=1}^{m}\sum_{t=1}^{T}\big(1-y_{it}f_t(X_i)\big)_+ \tag{4}$$

The second term $V_2$ considers the loss between $f(X_i)$ and the predictions of $X_i$’s constituent instances, i.e. $\{f(x_{ij})\mid 1\le j\le n_i\}$, which reflects the relationships between the bag $X_i$ and its instances $\{x_{i1},x_{i2},\cdots,x_{in_i}\}$. Here, the common assumption in multi-instance learning is that the strength for $X_i$ to hold a label is equal to the maximum strength for its instances to hold the label, i.e. $f_t(X_i)=\max_{j=1,\cdots,n_i}f_t(x_{ij})$. (Note that this assumption may be restrictive to some extent. There are many cases where the label of the bag does not rely on the instance with the maximum prediction, as discussed in Section 2. In addition, in classification only the sign of the prediction is important [19], i.e. $\mathrm{sign}(f_t(X_i))=\mathrm{sign}(\max_{j=1,\cdots,n_i}f_t(x_{ij}))$. However, in this paper the above common assumption is still adopted due to its popularity and simplicity.) Accordingly, the second loss term $V_2$ is defined as:

$$V_2(\{X_i\}_{i=1}^{m},f)=\frac{1}{mT}\sum_{i=1}^{m}\sum_{t=1}^{T}l\Big(f_t(X_i),\max_{j=1,\cdots,n_i}f_t(x_{ij})\Big) \tag{5}$$

Here, $l(v_1,v_2)$ can be defined in various ways and is set to be the $l_1$ loss in this paper, i.e. $l(v_1,v_2)=|v_1-v_2|$. By combining Eq. 4 and Eq. 5, the empirical loss function $V$ in Eq. 3 is then specified as:

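$$\begin{aligned}V(\{X_i\}_{i=1}^{m},\{Y_i\}_{i=1}^{m},f)={}&\frac{1}{mT}\sum_{i=1}^{m}\sum_{t=1}^{T}\big(1-y_{it}f_t(X_i)\big)_+\\&+\frac{\lambda}{mT}\sum_{i=1}^{m}\sum_{t=1}^{T}l\Big(f_t(X_i),\max_{j=1,\cdots,n_i}f_t(x_{ij})\Big)\end{aligned} \tag{6}$$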
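To make the two terms concrete, the empirical loss can be evaluated numerically once the outputs of the component functions are available. The sketch below assumes the bag-level and instance-level outputs have already been computed and takes $l$ to be the absolute difference; the function name and the array layout are illustrative assumptions.

```python
import numpy as np

def empirical_loss(bag_scores, instance_scores, y, lam):
    """Evaluate V = V1 + lam * V2 for precomputed function outputs.

    bag_scores:      (m, T) array with entries f_t(X_i)
    instance_scores: list of m arrays, each (n_i, T), with entries f_t(x_ij)
    y:               (m, T) array of +1/-1 label indicators y_it
    lam:             trade-off weight between the two terms
    """
    bag_scores = np.asarray(bag_scores, dtype=float)
    y = np.asarray(y, dtype=float)
    # V1: hinge loss between labels and bag-level predictions, averaged over m*T.
    v1 = np.maximum(0.0, 1.0 - y * bag_scores).mean()
    # Per bag and label, the maximum prediction over the bag's instances.
    inst_max = np.stack(
        [np.asarray(s, dtype=float).max(axis=0) for s in instance_scores]
    )
    # V2: absolute difference (assumed choice of l) between bag-level
    # prediction and the instance-level maximum, averaged over m*T.
    v2 = np.abs(bag_scores - inst_max).mean()
    return float(v1 + lam * v2)
```

Taking the mean over the $m \times T$ array implements the $\frac{1}{mT}$ normalisation of both terms.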

### 5.2 Representer Theorem for MIML

For simplicity, we assume that each function $f_t$ is a linear model, i.e., $f_t(x)=\langle w_t,\phi(x)\rangle$, where $\phi$ is the feature map induced by a kernel function $k$ and $\langle\cdot,\cdot\rangle$ denotes the standard inner product in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ induced by the kernel $k$. We recall that an instance can be regarded as a bag containing only one instance, so the kernel $k$ can be any kernel defined on a set of instances, such as the set kernel [32]. In the case of classification, objects (bags or instances) are classified according to the sign of $f_t$.

D-MimlSvm assumes that the labels associated with a bag should have some relatedness; otherwise they should not be associated with the bag simultaneously. To reflect this basic assumption, D-MimlSvm regularizes the empirical loss function in Eq. 6 with an additional term $\Omega(f)$:

$$\Omega(f)+\gamma\cdot V(\{X_i\}_{i=1}^{m},\{Y_i\}_{i=1}^{m},f) \tag{7}$$

Here, $\gamma$ is a regularization parameter balancing the model complexity $\Omega(f)$ and the empirical risk $V$. Inspired by [28], we assume that the relatedness among the labels can be measured by the mean function $w_0$,

$$w_0=\frac{1}{T}\sum_{t=1}^{T}w_t \tag{8}$$

The original idea in [28] is to minimize $\sum_{t=1}^{T}\|w_t-w_0\|^2$ and meanwhile minimize $\|w_0\|^2$, i.e. to set the regularizer as:

$$\Omega(f)=\frac{1}{T}\sum_{t=1}^{T}\|w_t-w_0\|^2+\eta\|w_0\|^2 \tag{9}$$

According to Eq. 8, the first term in the RHS of Eq. 9 can be rewritten as:

$$\frac{1}{T}\sum_{t=1}^{T}\|w_t-w_0\|^2=\frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2-\|w_0\|^2 \tag{10}$$
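The step from Eq. 8 to Eq. 10 follows by expanding the squared norm and using the definition of $w_0$ as the mean of the $w_t$; the intermediate steps, which are not spelled out in the text, can be written as:

```latex
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\|w_t-w_0\|^2
  &= \frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2
     - 2\Big\langle \frac{1}{T}\sum_{t=1}^{T} w_t,\; w_0 \Big\rangle
     + \|w_0\|^2 \\
  &= \frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2 - 2\|w_0\|^2 + \|w_0\|^2 \\
  &= \frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2 - \|w_0\|^2 .
\end{aligned}
```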

Therefore, by substituting Eq. 10 into Eq. 9, the regularizer can be simplified as:

$$\Omega(f)=\frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2+\mu\|w_0\|^2 \tag{11}$$

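Further note that $\|w_t\|^2=\|f_t\|_{\mathcal{H}}^2$ and $\|w_0\|^2=\big\|\frac{\sum_{t=1}^{T}f_t}{T}\big\|_{\mathcal{H}}^2$; by substituting Eq. 11 into Eq. 7, we have the regularization framework of D-MimlSvm as follows: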

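$$\min_{f\in\mathcal{H}}\ \frac{1}{T}\sum_{t=1}^{T}\|f_t\|_{\mathcal{H}}^2+\mu\bigg\|\frac{\sum_{t=1}^{T}f_t}{T}\bigg\|_{\mathcal{H}}^2+\gamma\cdot V(\{X_i\}_{i=1}^{m},\{Y_i\}_{i=1}^{m},f) \tag{12}$$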

Here, $\mu$ is a parameter to trade off the discrepancy and commonness among the labels, that is, how similar or dissimilar the $w_t$’s are. Referring to Eq. 10, substituting it into Eq. 9 gives $\Omega(f)=\frac{1}{T}\sum_{t=1}^{T}\|w_t\|^2+(\eta-1)\|w_0\|^2$, so we have $\mu=\eta-1$. Intuitively, when $\mu$ (or $\eta$) is large, minimization of Eq. 12 will force $\|w_0\|^2$ to tend to zero and the discrepancy among the labels becomes more important; when $\mu$ (or $\eta$) is small, minimization of Eq. 12 will force $\sum_{t=1}^{T}\|w_t-w_0\|^2$ to tend to zero and the commonness among the labels becomes more important [28].

Given the above setup, we can prove the following representer theorem.

###### Theorem 1

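The minimizer of the optimization problem in Eq. 12 admits an expansion

$$f_t(x)=\sum_{i=1}^{m}\Big(\alpha_{t,i0}\,k(x,X_i)+\sum_{j=1}^{n_i}\alpha_{t,ij}\,k(x,x_{ij})\Big)$$

where all $\alpha_{t,i0},\alpha_{t,ij}\in\mathbb{R}$.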

###### Proof.

Analogous to [28], we first introduce a combined feature map

$$\Psi(x,t)=\bigg(\frac{\phi(x)}{\sqrt{r}},\ \underbrace{0,\cdots,0}_{t-1},\ \phi(x),\ \underbrace{0,\cdots,0}_{T-t}\bigg)$$

and its decision function, i.e., $\hat{f}(x,t)=\langle\hat{w},\Psi(x,t)\rangle$, where

$$\hat{w}=\big(\sqrt{r}\,w_0,\ w_1-w_0,\ \cdots,\ w_T-w_0\big).$$

Here . Let denote the kernel function induced by and is its corresponding RKHS. We have Eqs. 13 and 14.

 ^f(x,t)=⟨^w,Ψ(x,t)⟩=⟨(w0+wt−w0),ϕ(x)⟩=⟨wt,ϕ