Multi-Instance Multi-Label Learning
Abstract
In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm, which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming single-instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning the single-instances or single-label examples directly.
Zhi-Hua Zhou\@footnotemark\@footnotetextCorresponding author. Email: zhouzh@lamda.nju.edu.cn, Min-Ling Zhang, Sheng-Jun Huang, Yu-Feng Li
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046, China
Key words: Machine Learning, Multi-Instance Multi-Label Learning, MIML, Multi-Label Learning, Multi-Instance Learning
1 Introduction
In traditional supervised learning, an object is represented by an instance, i.e., a feature vector, and associated with a class label. Formally, let $\mathcal{X}$ denote the instance space (or feature space) and $\mathcal{Y}$ the set of class labels. The task is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ from a given data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$. Although this formalization is prevailing and successful, there are many real-world problems which do not fit in this framework well. In particular, each object in this framework belongs to only one concept and therefore the corresponding instance is associated with a single class label. However, many real-world objects are complicated and may belong to multiple concepts simultaneously. For example, an image can belong to several classes simultaneously, e.g., grasslands, lions, Africa, etc.; a text document can be classified into several categories if it is viewed from different aspects, e.g., scientific novel, Jules Verne's writing or even books on traveling; a web page can be recognized as news page, sports page, soccer page, etc. In a specific real task, maybe only one of the multiple concepts is the right semantic meaning. For example, in image retrieval, when a user is interested in an image with lions, s/he may be interested only in the concept lions instead of the other concepts grasslands and Africa associated with that image. The difficulty here is caused by those objects that involve multiple concepts. To choose the right semantic meaning for such objects in a specific scenario is the fundamental difficulty of many tasks. In contrast to starting from a large universe of all possible concepts involved in the task, it may be helpful to get the subset of concepts associated with the concerned object first, and then make a choice within this small subset later. However, getting the subset of concepts, that is, assigning proper class labels to such objects, is still a challenging task.
We notice that as an alternative to representing an object by a single instance, in many cases it is possible to represent a complicated object using a set of instances. For example, multiple patches can be extracted from an image where each patch is described by an instance, and thus the image can be represented by a set of instances; multiple sections can be extracted from a document where each section is described by an instance, and thus the document can be represented by a set of instances; multiple links can be extracted from a web page where each link is described by an instance, and thus the web page can be represented by a set of instances. Using multiple instances to represent those complicated objects may be helpful because some inherent patterns which are closely related to some labels may become explicit and clearer. In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework, where an example is described by multiple instances and associated with multiple class labels.
Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects. To exploit the advantages of the MIML representation, new learning algorithms are needed. We propose the MimlBoost algorithm and the MimlSvm algorithm based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings under the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we also propose the D-MimlSvm (i.e., Direct MimlSvm) algorithm, which tackles MIML problems directly in a regularization framework. Experiments show that this "direct" algorithm outperforms the "indirect" MimlSvm algorithm.
In some practical tasks we do not have access to the real objects themselves, such as the real images and the real web pages; instead, we are given observational data where each real object has already been represented by a single instance. Thus, in such cases we cannot capture more information from the real objects using the MIML representation. Even in this situation, however, MIML is still useful. We propose the InsDif (i.e., INStance DIFferentiation) algorithm, which transforms single-instances into MIML examples for learning. This algorithm is able to achieve a better performance than learning the single-instances directly in some tasks. This is not strange because for an object associated with multiple class labels, if it is described by only a single instance, the information corresponding to these labels is mixed and thus difficult for learning; if we can transform the single-instance into a set of instances in some proper way, the mixed information might be detached to some extent and thus less difficult for learning.
MIML can also be helpful for learning single-label objects. We propose the SubCod (i.e., SUB-COncept Discovery) algorithm, which works by discovering sub-concepts of the target concept first and then transforming the data into MIML examples for learning. This algorithm is able to achieve a better performance than learning the single-label examples directly in some tasks. This is also not strange because for a label corresponding to a high-level complicated concept, it may be quite difficult to learn this concept directly since many different lower-level concepts are mixed; if we can transform the single-label into a set of labels corresponding to some sub-concepts, which are relatively clearer and easier for learning, we can learn these labels first and then derive the high-level complicated label based on them with less difficulty.
The rest of this paper is organized as follows. In Section 2, we review some related work. In Section 3, we propose the MIML framework. In Section 4 we propose the MimlBoost and MimlSvm algorithms, and apply them to tasks where the objects are represented as MIML examples. In Section 5 we present the D-MimlSvm algorithm and compare it with the "indirect" MimlSvm algorithm. In Sections 6 and 7, we study the usefulness of MIML when we do not have access to real objects. Concretely, in Section 6, we propose the InsDif algorithm and show that using MIML can be better than learning single-instances directly; in Section 7 we propose the SubCod algorithm and show that using MIML can be better than learning single-label examples directly. Finally, we conclude the paper in Section 8.
2 Related Work
Much work has been devoted to the learning of multi-label examples under the umbrella of multi-label learning. Note that multi-label learning studies the problem where a real-world object described by one instance is associated with a number of class labels\@footnotemark\@footnotetextMost work on multi-label learning assumes that an instance can be associated with multiple valid labels, but there is also some work assuming that only one of the labels among those associated with an instance is correct [35]., which is different from multi-class learning or multi-task learning [28]. In multi-class learning each object is associated with only a single label, while in multi-task learning different tasks may involve different domains and different data sets. Actually, traditional two-class and multi-class problems can both be cast into multi-label problems by restricting each instance to have only one label. The generality of multi-label problems, however, inevitably makes them more difficult to address.
One famous approach to solving multi-label problems is Schapire and Singer's AdaBoost.MH [56], which is an extension of AdaBoost and is the core of the successful multi-label learning system BoosTexter [56]. This approach maintains a set of weights over both training examples and their labels in the training phase, where training examples and their corresponding labels that are hard (easy) to predict get incrementally higher (lower) weights. Later, De Comité et al. [22] used alternating decision trees [30], which are more powerful than the decision stumps used in BoosTexter, to handle multi-label data and thus obtained the AdtBoost.MH algorithm. Probabilistic generative models have also been found useful in multi-label learning. McCallum [47] proposed a Bayesian approach for multi-label document classification, where a mixture probabilistic model (one mixture component per category) is assumed to generate each document and an EM algorithm is employed to learn the mixture weights and the word distributions in each mixture component. Ueda and Saito [65] presented another generative approach, which assumes that the multi-label text has a mixture of characteristic words appearing in single-label text belonging to each of the multiple labels. It is noteworthy that the generative models used in [47] and [65] are both based on learning word frequencies in documents, and are thus specific to text applications.
Many other multi-label learning algorithms have been developed, such as decision trees, neural networks, nearest neighbor classifiers, support vector machines, etc. Clare and King [21] developed a multi-label version of C4.5 decision trees through modifying the definition of entropy. Zhang and Zhou [79] presented the multi-label neural network BpMll, which is derived from the Backpropagation algorithm by employing an error function that captures the fact that the labels belonging to an instance should be ranked higher than those not belonging to that instance. Zhang and Zhou [80] also proposed the Mlnn algorithm, which identifies the nearest neighbors of the concerned instance and then assigns labels according to the maximum a posteriori principle. Elisseeff and Weston [27] proposed the RankSvm algorithm for multi-label learning by defining a specific cost function and the corresponding margin for multi-label models. Other kinds of multi-label Svms have been developed by Boutell et al. [11] and Godbole and Sarawagi [33]. In particular, by hierarchically approximating the Bayes optimal classifier for the H-loss, Cesa-Bianchi et al. [15] proposed an algorithm which outperforms simple hierarchical Svms. Recently, non-negative matrix factorization has also been applied to multi-label learning [43], and multi-label dimensionality reduction methods have been developed [74, 85].
Roughly speaking, earlier approaches to multi-label learning attempt to divide multi-label learning into a number of two-class classification problems [36, 72] or transform it into a label ranking problem [56, 27], while some later approaches try to exploit the correlation between the labels [65, 43, 85].
Most studies on multi-label learning focus on text categorization [56, 47, 65, 22, 33, 39, 74], and several studies aim to improve the performance of text categorization systems by exploiting additional information given by the hierarchical structure of classes [14, 53, 15] or unlabeled data [43]. In addition to text categorization, multi-label learning has also been found useful in many other tasks such as scene classification [11], image and video annotation [38, 48], bioinformatics [21, 27, 12, 7, 13], and even association rule mining [63, 50].
There is a lot of research on multi-instance learning, which studies the problem where a real-world object described by a number of instances is associated with a single class label. Here the training set is composed of many bags each containing multiple instances; a bag is labeled positively if it contains at least one positive instance and negatively otherwise. The goal is to label unseen bags correctly. Note that although the training bags are labeled, the labels of their instances are unknown. This learning framework was formalized by Dietterich et al. [24] when they were investigating drug activity prediction.
Long and Tan [44] studied the PAC-learnability of multi-instance learning and showed that if the instances in the bags are independently drawn from a product distribution, the APR (Axis-Parallel Rectangle) approach proposed by Dietterich et al. [24] is PAC-learnable. Auer et al. [5] showed that if the instances in the bags are not independent, then APR learning under the multi-instance learning framework is NP-hard. Moreover, they presented a theoretical algorithm that does not require a product distribution, which was transformed into a practical algorithm named Multinst [4]. Blum and Kalai [10] described a reduction from PAC-learning under the multi-instance learning framework to PAC-learning with one-sided random classification noise. They also presented an algorithm with smaller sample complexity than that of the algorithm of Auer et al. [5].
Many multi-instance learning algorithms have been developed during the past decade. To name a few: Diverse Density [45] and Emdd [83], the nearest neighbor algorithms Citation-knn and Bayesian-knn [67], the decision tree algorithms Relic [54] and Miti [9], the neural network algorithms Bpmip and extensions [90, 77] and Rbfmip [78], the rule learning algorithm Ripper-mi [20], support vector machines and kernel methods such as mi-Svm and Mi-Svm [3], Dd-Svm [18], MissSvm [88], Mi-Kernel [32], Bag-Instance Kernel [19], Marginalized Mi-Kernel [42] and the convex-hull method Ch-Fd [31], the ensemble algorithms Mi-Ensemble [91], MiBoosting [70] and MilBoosting [6], the logistic regression algorithm Mi-lr [51], etc. Actually, almost all popular machine learning algorithms have their multi-instance versions. Most algorithms attempt to adapt single-instance supervised learning algorithms to the multi-instance representation, by shifting their focus from discrimination on instances to discrimination on bags [91]. Recently there have also been proposals on adapting the multi-instance representation to single-instance algorithms by representation transformation [93].
It is worth mentioning that standard multi-instance learning [24] assumes that if a bag contains a positive instance then the bag is positive; this implies that there exists a key instance in a positive bag. Many algorithms were designed based on this assumption. For example, the point with maximal diverse density identified by the Diverse Density algorithm [45] actually corresponds to a key instance; many Svm algorithms define the margin of a positive bag by the margin of its most positive instance [3, 19]. As research on multi-instance learning has gone on, however, some other assumptions have been introduced [29]. For example, in contrast to assuming that there is a key instance, some work has assumed that there is no key instance and every instance contributes to the bag label [70, 17]. There is also an argument that the instances in the bags should not be treated independently [88]. All these assumptions have been put under the umbrella of multi-instance learning, and generally, in tackling real tasks it is difficult to know which assumption fits best. In other words, in different tasks, multi-instance learning algorithms based on different assumptions may have different superiorities.
In the early years of multi-instance learning research, most work considered multi-instance classification with discrete-valued outputs. Later, multi-instance regression with real-valued outputs was studied [2, 52], and different versions of generalized multi-instance learning have been defined [68, 58]. The main difference between standard multi-instance learning and generalized multi-instance learning is that in standard multi-instance learning there is a single concept, and a bag is positive if it has an instance satisfying this concept; while in generalized multi-instance learning [68, 58] there are multiple concepts, and a bag is positive only when all concepts are satisfied (i.e., the bag contains instances from every concept). Recently, research on multi-instance clustering [82], multi-instance semi-supervised learning [49] and multi-instance active learning [60] has also been reported.
Multi-instance learning has also attracted the attention of the Ilp community. It has been suggested that multi-instance problems could be regarded as a bias on inductive logic programming, and the multi-instance paradigm could be the key between the propositional and relational representations, being more expressive than the former, and much easier to learn than the latter [23]. Alphonse and Matwin [1] approximated a relational learning problem by a multi-instance problem, fed the resulting data to feature selection techniques adapted from propositional representations, and then transformed the filtered data back to a relational representation for a relational learner. Thus, the expressive power of the relational representation and the ease of feature selection on the propositional representation are gracefully combined. This work confirms that multi-instance learning can really act as a bridge between propositional and relational learning.
Multi-instance learning techniques have already been applied to diverse applications including image categorization [17, 18], image retrieval [71, 84], text categorization [3, 60], web mining [86], spam detection [37], computer security [54], face detection [66, 76], computer-aided medical diagnosis [31], etc.
3 The MIML Framework
Let $\mathcal{X}$ denote the instance space and $\mathcal{Y}$ the set of class labels. Then, formally, the MIML task is defined as:

MIML (multi-instance multi-label learning): To learn a function $f_{MIML}: 2^{\mathcal{X}} \rightarrow 2^{\mathcal{Y}}$ from a given data set $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_m, Y_m)\}$, where $X_i \subseteq \mathcal{X}$ is a set of instances $\{x_{i1}, x_{i2}, \ldots, x_{in_i}\}$, $x_{ij} \in \mathcal{X}$ ($j = 1, 2, \ldots, n_i$), and $Y_i \subseteq \mathcal{Y}$ is a set of labels $\{y_{i1}, y_{i2}, \ldots, y_{il_i}\}$, $y_{ik} \in \mathcal{Y}$ ($k = 1, 2, \ldots, l_i$). Here $n_i$ denotes the number of instances in $X_i$ and $l_i$ the number of labels in $Y_i$.
It is interesting to compare MIML with the existing frameworks of traditional supervised learning, multiinstance learning, and multilabel learning.

Traditional supervised learning (single-instance single-label learning): To learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ from a given data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$.

Multi-instance learning (multi-instance single-label learning): To learn a function $f_{MIL}: 2^{\mathcal{X}} \rightarrow \{-1, +1\}$ from a given data set $\{(X_1, y_1), (X_2, y_2), \ldots, (X_m, y_m)\}$, where $X_i \subseteq \mathcal{X}$ is a set of instances $\{x_{i1}, x_{i2}, \ldots, x_{in_i}\}$, $x_{ij} \in \mathcal{X}$ ($j = 1, 2, \ldots, n_i$), and $y_i \in \{-1, +1\}$ is the label of $X_i$.\@footnotemark\@footnotetextAccording to notions used in multi-instance learning, $(X_i, y_i)$ is a labeled bag while $X_i$ an unlabeled bag. Here $n_i$ denotes the number of instances in $X_i$.

Multi-label learning (single-instance multi-label learning): To learn a function $f_{MLL}: \mathcal{X} \rightarrow 2^{\mathcal{Y}}$ from a given data set $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_m, Y_m)\}$, where $x_i \in \mathcal{X}$ is an instance and $Y_i \subseteq \mathcal{Y}$ is a set of labels $\{y_{i1}, y_{i2}, \ldots, y_{il_i}\}$, $y_{ik} \in \mathcal{Y}$ ($k = 1, 2, \ldots, l_i$). Here $l_i$ denotes the number of labels in $Y_i$.
From Fig. 1 we can see the differences among these learning frameworks. In fact, these multi-frameworks result from the ambiguities in representing real-world objects. Multi-instance learning studies the ambiguity in the input space (or instance space), where an object has many alternative input descriptions, i.e., instances; multi-label learning studies the ambiguity in the output space (or label space), where an object has many alternative output descriptions, i.e., labels; while MIML considers the ambiguities in both the input and output spaces simultaneously. In solving real-world problems, having a good representation is often more important than having a strong learning algorithm, because a good representation may capture more meaningful information and make the learning task easier to tackle. Since many real objects inherently carry input ambiguity as well as output ambiguity, MIML is more natural and convenient for tasks involving such objects.
It is worth mentioning that MIML is more reasonable than (single-instance) multi-label learning in many cases. Suppose a multi-label object is described by one instance but associated with $l$ class labels, namely label$_1$, label$_2$, $\ldots$, label$_l$. If we can represent the multi-label object using a set of $l$ instances, namely {instance$_1$, instance$_2$, $\ldots$, instance$_l$}, the underlying information in a single instance may become easier to exploit, and for each label the number of training instances can be significantly increased. So, transforming multi-label examples into MIML examples for learning may be beneficial in some tasks, which will be shown in Section 6. Moreover, when representing the multi-label object using a set of instances, the relation between the input patterns and the semantic meanings may become more easily discoverable. Note that in some cases, understanding why a particular object has a certain class label is even more important than simply making an accurate prediction, and MIML offers a possibility for this purpose. For example, under the MIML representation, we may discover that one object has label$_1$ because it contains instance$_1$; it has label$_2$ because it contains instance$_2$; while the occurrence of both instance$_1$ and instance$_2$ triggers label$_3$.
MIML can also be helpful for learning single-label examples involving complicated high-level concepts. For example, as Fig. 2(a) shows, the concept Africa has a broad connotation and the images belonging to Africa have great variance, thus it is not easy to classify the top-left image in Fig. 2(a) into the Africa class correctly. However, if we can exploit some low-level sub-concepts that are less ambiguous and easier to learn, such as tree, lions, elephant and grassland shown in Fig. 2(b), it may become much easier to induce the concept Africa from these sub-concepts than to learn it directly. The usefulness of MIML in this process will be shown in Section 7.
4 Solving MIML Problems by Degeneration
It is evident that traditional supervised learning is a degenerated version of multi-instance learning as well as a degenerated version of multi-label learning, while traditional supervised learning, multi-instance learning and multi-label learning are all degenerated versions of MIML. So, a simple idea to tackle MIML is to identify its equivalence in the traditional supervised learning framework, using multi-instance learning or multi-label learning as the bridge, as shown in Fig. 3.

Solution A: Using multi-instance learning as the bridge:
The MIML learning task, i.e., to learn a function $f_{MIML}: 2^{\mathcal{X}} \rightarrow 2^{\mathcal{Y}}$, can be transformed into a multi-instance learning task, i.e., to learn a function $f_{MIL}: 2^{\mathcal{X}} \times \mathcal{Y} \rightarrow \{-1, +1\}$. For any $y \in \mathcal{Y}$, $f_{MIL}(X_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. The proper labels for a new example $X^*$ can be determined according to $Y^* = \{y \mid \operatorname{sign}[f_{MIL}(X^*, y)] = +1\}$. This multi-instance learning task can be further transformed into a traditional supervised learning task, i.e., to learn a function $f_{SISL}: \mathcal{X} \times \mathcal{Y} \rightarrow \{-1, +1\}$, under a constraint specifying how to derive $f_{MIL}(X_i, y)$ from the values of $f_{SISL}(x_{ij}, y)$ ($j = 1, 2, \ldots, n_i$). For any $y \in \mathcal{Y}$, $f_{SISL}(x_{ij}, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. Here the constraint can be $f_{MIL}(X_i, y) = \operatorname{sign}\left[\sum_{j=1}^{n_i} f_{SISL}(x_{ij}, y)\right]$, which has been used by Xu and Frank [70] in transforming multi-instance learning tasks into traditional supervised learning tasks. Note that other kinds of constraints can also be used here.
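As a concrete illustration, the label-wise decomposition of Solution A can be sketched as follows. This is a minimal sketch, not the paper's implementation: the sum-then-sign constraint follows Xu and Frank's rule quoted above, while the data layout and helper names are hypothetical.

```python
# Sketch of Solution A: decompose MIML examples into per-label multi-instance
# bags, and recover bag-level predictions from an instance-level scorer by the
# sign-of-sum constraint. All names here are illustrative, not from the paper.
import numpy as np

def miml_to_mi(miml_data, label_space):
    """Turn each MIML example (X, Y) into one (bag, label, +1/-1) triplet
    per candidate label y: positive iff y is a proper label of the bag."""
    mi_data = []
    for X, Y in miml_data:
        for y in label_space:
            mi_data.append((X, y, +1 if y in Y else -1))
    return mi_data

def bag_predict(f_sisl, X, y):
    """Bag-level label derived from instance-level predictions:
    sign of the sum of f_SISL over the instances of the bag."""
    return int(np.sign(sum(f_sisl(x, y) for x in X)) or 1)

def miml_predict(f_sisl, X, label_space):
    """Proper labels of a new bag: all labels with a positive bag-level sign."""
    return {y for y in label_space if bag_predict(f_sisl, X, y) == +1}
```

A trained instance-level classifier would play the role of `f_sisl`; any function `instance -> label -> score` works with this sketch.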

Solution B: Using multi-label learning as the bridge:
The MIML learning task, i.e., to learn a function $f_{MIML}: 2^{\mathcal{X}} \rightarrow 2^{\mathcal{Y}}$, can be transformed into a multi-label learning task, i.e., to learn a function $f_{MLL}: \mathcal{Z} \rightarrow 2^{\mathcal{Y}}$. For any $z_i \in \mathcal{Z}$, $f_{MLL}(z_i) = f_{MIML}(X_i)$ if $z_i = \phi(X_i)$, where $\phi: 2^{\mathcal{X}} \rightarrow \mathcal{Z}$. The proper labels for a new example $X^*$ can be determined according to $Y^* = f_{MLL}(\phi(X^*))$. This multi-label learning task can be further transformed into a traditional supervised learning task, i.e., to learn a function $f_{SISL}: \mathcal{Z} \times \mathcal{Y} \rightarrow \{-1, +1\}$. For any $y \in \mathcal{Y}$, $f_{SISL}(z_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise. That is, $f_{MLL}(z_i) = \{y \mid f_{SISL}(z_i, y) = +1\}$. Here the mapping $\phi$ can be implemented with constructive clustering, which was proposed by Zhou and Zhang [93] in transforming multi-instance bags into traditional single instances. Note that other kinds of mappings can also be used here.
In the rest of this section we will propose two MIML algorithms, MimlBoost and MimlSvm. MimlBoost is an illustration of Solution A, which uses category-wise decomposition for the A1 step in Fig. 3 and MiBoosting for A2; MimlSvm is an illustration of Solution B, which uses clustering-based representation transformation for the B1 step and MlSvm for B2. Other MIML algorithms can be developed by taking alternative options. Both MimlBoost and MimlSvm are quite simple. We will see that for dealing with complicated objects with multiple semantic meanings, good performance can be obtained under the MIML framework even by using such simple algorithms. This demonstrates that the MIML framework is very promising, and we expect better performance can be achieved in the future as researchers put forward more powerful MIML algorithms.
4.1 MimlBoost
Now we propose the MimlBoost algorithm according to the first solution mentioned above, that is, identifying the equivalence in the traditional supervised learning framework using multi-instance learning as the bridge. Note that this strategy can also be used to derive other kinds of MIML algorithms.
Given any set $S$, let $|S|$ denote its size, i.e., the number of elements in $S$; given any predicate $\pi$, let $[\![\pi]\!]$ be 1 if $\pi$ holds and 0 otherwise; given $(X_i, Y_i)$, for any $y \in \mathcal{Y}$, let $\Psi(X_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise, where $\Psi$ is a function which judges whether a label $y$ is a proper label of $X_i$ or not. The basic assumption of MimlBoost is that the labels are independent, so that the MIML task can be decomposed into a series of multi-instance learning tasks to solve, by treating each label as a task. The pseudo-code of MimlBoost is summarized in Appendix A (Table A.1).
In the first step of MimlBoost, each MIML example $(X_i, Y_i)$ is transformed into a set of $|\mathcal{Y}|$ multi-instance bags, i.e., $\{[(X_i, y_1), \Psi(X_i, y_1)], [(X_i, y_2), \Psi(X_i, y_2)], \ldots, [(X_i, y_{|\mathcal{Y}|}), \Psi(X_i, y_{|\mathcal{Y}|})]\}$. Note that $[(X_i, y_k), \Psi(X_i, y_k)]$ is a labeled multi-instance bag, where $(X_i, y_k)$ is a bag containing $n_i$ instances, i.e., $\{(x_{i1}, y_k), (x_{i2}, y_k), \ldots, (x_{in_i}, y_k)\}$, and $\Psi(X_i, y_k) \in \{-1, +1\}$ is the label of this bag.
Thus, the original MIML data set is transformed into a multi-instance data set containing $m \times |\mathcal{Y}|$ bags. We order them as $B_1, B_2, \ldots, B_{m \times |\mathcal{Y}|}$, and let $B_i$ denote the $i$th of these $m \times |\mathcal{Y}|$ bags, which contains $n_i$ instances.
Then, from this data set a multi-instance learning function $f_{MIL}$ can be learned, which can accomplish the desired MIML function because $f_{MIML}(X^*) = \{y \mid \operatorname{sign}[f_{MIL}(X^*, y)] = +1\}$. In this paper, the MiBoosting algorithm [70] is used to implement $f_{MIL}$. Note that by using MiBoosting, the MimlBoost algorithm assumes that all instances in a bag contribute independently and equally to the label of that bag.
For convenience, let $(B, g)$ denote a transformed bag with its label, where $B$ contains the instances $b_1, b_2, \ldots, b_{n_B}$, and let $\mathbb{E}$ denote the expectation. Then, here the goal is to learn a function $F(B)$ minimizing the bag-level exponential loss $\mathbb{E}_B \mathbb{E}_{g|B}[\exp(-g F(B))]$, which ultimately estimates the bag-level log-odds function on the training set. In each boosting round, the aim is to expand $F(B)$ into $F(B) + c f(B)$, i.e., adding a new weak classifier, so that the exponential loss is minimized. Assuming that all instances in a bag contribute equally and independently to the bag's label, $f(B) = \frac{1}{n_B} \sum_{j=1}^{n_B} f(b_j)$ can be derived, where $f(b_j) \in \{-1, +1\}$ is the prediction of the instance-level classifier $f(\cdot)$ for the $j$th instance of the bag $B$, and $n_B$ is the number of instances in $B$.
It has been shown by [70] that the best $f(B)$ to be added can be achieved by seeking $f(\cdot)$ which maximizes $\sum_i \frac{W_i}{n_i} \sum_{j=1}^{n_i} g_i f(b_{ij})$, given the bag-level weights $W_i = \exp(-g_i F(B_i))$. By assigning each instance the label of its bag and the corresponding weight $W_i / n_i$, $f(\cdot)$ can be learned by minimizing the weighted instance-level classification error. This actually corresponds to Step 3a of MimlBoost. When $f(\cdot)$ is found, the best multiplier $c > 0$ can be got by directly optimizing the exponential loss:
(1) $\quad \mathbb{E}\left[\exp\left(-g\left(F(B) + c f(B)\right)\right)\right] = \sum_i W_i \exp\left[(2 e_i - 1) c\right]$
where $e_i = \frac{1}{n_i} \sum_{j=1}^{n_i} [\![g_i \neq f(b_{ij})]\!]$ (computed in Step 3b). Minimization of this expression actually corresponds to Step 3d, where numeric optimization techniques such as the quasi-Newton method can be used. Note that in Step 3c, if no positive $c$ can decrease the loss, i.e., $\sum_i (1 - 2 e_i) W_i \leq 0$, the Boosting process will stop [89]. Finally, the bag-level weights are updated in Step 3f according to the additive structure of $F(B)$, i.e., $W_i \leftarrow W_i \exp[(2 e_i - 1) c]$.
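The boosting round described above can be sketched as follows. This is a hedged sketch under the stated assumptions, not the paper's implementation: the weak learner is abstracted behind a hypothetical `fit_weighted` interface returning a $\{-1,+1\}$-valued predictor, and the multiplier $c$ is found by a simple grid search rather than a quasi-Newton method.

```python
# Sketch of one MiBoosting-style round: train an instance-level weak learner
# on weighted instances, compute bag-level error rates e_i, pick the multiplier
# c minimizing sum_i W_i * exp((2 e_i - 1) c), and update the bag weights.
import numpy as np

def boosting_round(bags, g, W, fit_weighted):
    """bags: list of (n_i, d) arrays; g: bag labels in {-1,+1}; W: bag weights.
    fit_weighted(X, y, w) is a hypothetical weak learner returning a predictor."""
    # Step 3a: each instance inherits its bag's label, weighted by W_i / n_i.
    Xs = np.vstack(bags)
    ys = np.concatenate([np.full(len(b), gi) for b, gi in zip(bags, g)])
    ws = np.concatenate([np.full(len(b), Wi / len(b)) for b, Wi in zip(bags, W)])
    f = fit_weighted(Xs, ys, ws)
    # Step 3b: bag-level error rates e_i (fraction of misclassified instances).
    e = np.array([np.mean(f(b) != gi) for b, gi in zip(bags, g)])
    # Step 3d: grid search for the multiplier c > 0 minimizing Eq. (1).
    def loss(c):
        return float(np.sum(W * np.exp((2.0 * e - 1.0) * c)))
    cs = np.linspace(0.0, 10.0, 1001)
    c = float(cs[int(np.argmin([loss(ci) for ci in cs]))])
    # Step 3f: weight update following the additive structure of F(B).
    W_new = W * np.exp((2.0 * e - 1.0) * c)
    return f, c, W_new / W_new.sum()
```

The grid search stands in for any one-dimensional optimizer; only the loss of Eq. (1) matters to the round.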
4.2 MimlSvm
Now we propose the MimlSvm algorithm according to the second solution mentioned before, that is, identifying the equivalence in the traditional supervised learning framework using multi-label learning as the bridge. Note that this strategy can also be used to derive other kinds of MIML algorithms.
Again, given any set $S$, let $|S|$ denote its size, i.e., the number of elements in $S$; given $(X_i, Y_i)$ and $z_i = \phi(X_i)$ where $\phi: 2^{\mathcal{X}} \rightarrow \mathcal{Z}$, for any $y \in \mathcal{Y}$, let $\Phi(z_i, y) = +1$ if $y \in Y_i$ and $-1$ otherwise, where $\Phi$ is a function $\Phi: \mathcal{Z} \times \mathcal{Y} \rightarrow \{-1, +1\}$. The basic assumption of MimlSvm is that the spatial distribution of the bags carries relevant information, and information helpful for label discrimination can be discovered by measuring the closeness between each bag and the representative bags identified through clustering. The pseudo-code of MimlSvm is summarized in Appendix A (Table A.2).
In the first step of MimlSvm, the $X_i$ of each MIML example $(X_i, Y_i)$ is collected and put into a data set $\Gamma = \{X_1, X_2, \ldots, X_m\}$. Then, in the second step, $k$-medoids clustering is performed on $\Gamma$. Since each data item in $\Gamma$, i.e., $X_i$, is an unlabeled multi-instance bag instead of a single instance, the Hausdorff distance [26] is employed to measure the distance. The Hausdorff distance is a famous metric for measuring the distance between two bags of points, which has often been used in computer vision tasks; other techniques that can measure the distance between bags of points, such as the set kernel [32], can also be used here. In detail, given two bags $A = \{a_1, a_2, \ldots, a_{n_A}\}$ and $B = \{b_1, b_2, \ldots, b_{n_B}\}$, the Hausdorff distance between $A$ and $B$ is defined as
(2) $\quad d_H(A, B) = \max\left\{\max_{a \in A} \min_{b \in B} \|a - b\|, \; \max_{b \in B} \min_{a \in A} \|b - a\|\right\}$
where $\|a - b\|$ measures the distance between the instances $a$ and $b$, which takes the form of the Euclidean distance here.
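As a minimal sketch, the Hausdorff distance of Eq. (2) with Euclidean base distance can be computed as below; the bags are assumed to be NumPy arrays of shape (number of instances, feature dimension).

```python
# Sketch of the Hausdorff distance of Eq. (2) between two bags of instances,
# using the Euclidean distance between instances.
import numpy as np

def hausdorff(A, B):
    """A, B: arrays of shape (n_A, d) and (n_B, d)."""
    # Pairwise Euclidean distances between every instance of A and of B.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # max over a of min over b, and the symmetric term; take the larger.
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```

Because the pairwise distance matrix is materialized, this sketch is quadratic in bag size; that is fine for the small bags typical of image patches or document sections.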
After the clustering process, the data set $\Gamma$ is divided into $k$ partitions, whose medoids are $M_1, M_2, \ldots, M_k$, respectively. With the help of these medoids, the original multi-instance example $X_i$ is transformed into a $k$-dimensional numerical vector $z_i$, where the $j$th component of $z_i$ is the distance between $X_i$ and $M_j$, that is, $d_H(X_i, M_j)$. In other words, $z_i$ encodes some structure information of the data, that is, the relationship between $X_i$ and the $j$th partition of $\Gamma$. This process resembles the constructive clustering process used by Zhou and Zhang [93] in transforming multi-instance examples into single-instance examples, except that in [93] the clustering is executed at the instance level while here it is executed at the bag level. Thus, the original MIML examples $(X_i, Y_i)$ have been transformed into multi-label examples $(z_i, Y_i)$, which corresponds to Step 3 of MimlSvm.
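The representation transformation step just described can be sketched as below. This is an illustrative sketch: the `dist` argument stands for the bag-level Hausdorff distance, the medoid bags are assumed to have been obtained by a prior k-medoids run, and the helper names are hypothetical.

```python
# Sketch of the bag-to-vector transformation: after k-medoids clustering of
# the training bags, each bag becomes the k-dimensional vector of its
# distances to the medoid bags.
import numpy as np

def bag_to_vector(X, medoids, dist):
    """Map a bag X to z in R^k, with z_j = dist(X, M_j)."""
    return np.array([dist(X, M) for M in medoids])

def transform_dataset(bags, medoids, dist):
    """Turn multi-instance bags into single-instance feature vectors,
    ready for a multi-label learner such as MlSvm."""
    return np.vstack([bag_to_vector(X, medoids, dist) for X in bags])
```

At prediction time, a new bag is mapped through the same medoids, so the learned multi-label model sees a consistent $k$-dimensional representation.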
Then, from the data set $\{(z_1, Y_1), (z_2, Y_2), \ldots, (z_m, Y_m)\}$ a multi-label learning function $f_{MLL}$ can be learned, which can accomplish the desired MIML function because $f_{MIML}(X^*) = f_{MLL}(\phi(X^*)) = f_{MLL}(z^*)$. In this paper, the MlSvm algorithm [11] is used to implement $f_{MLL}$. Concretely, MlSvm decomposes the multi-label learning problem into multiple independent binary classification problems (one per class), where each example associated with the label set $Y$ is regarded as a positive example when building the Svm for any class $y \in Y$, while regarded as a negative example when building the Svm for any class $y \notin Y$, as shown in Step 4 of MimlSvm. In making predictions, the T-Criterion [11] is used, which actually corresponds to Step 5 of the MimlSvm algorithm. That is, the test example is labeled by all the class labels with positive Svm scores, except that when all the Svm scores are negative, the test example is labeled by the class label with the top (least negative) score.
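The T-Criterion prediction rule described above admits a very small sketch; the per-class SVM scores are assumed to be given by the trained binary classifiers.

```python
# Sketch of the T-Criterion: output every label with a positive SVM score;
# if all scores are negative, output the single label with the largest
# (least negative) score so that the prediction is never empty.
def t_criterion(scores):
    """scores: dict mapping label -> real-valued SVM score."""
    positive = {y for y, s in scores.items() if s > 0}
    if positive:
        return positive
    # Fallback: top-scoring label when no score is positive.
    return {max(scores, key=scores.get)}
```

The fallback branch is what distinguishes the T-Criterion from plain per-class thresholding, which could leave a test example with no label at all.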
4.3 Experiments
4.3.1 Multi-Label Evaluation Criteria
In traditional supervised learning where each object has only one class label, accuracy is often used as the performance evaluation criterion. Typically, accuracy is defined as the percentage of test examples that are correctly classified. When learning with complicated objects associated with multiple labels simultaneously, however, accuracy becomes less meaningful. For example, if approach $h_1$ missed one proper label while approach $h_2$ missed four proper labels for a test example having five labels, it is obvious that $h_1$ is better than $h_2$, but the accuracy of $h_1$ and $h_2$ may be identical because both of them incorrectly classified the test example.
Five criteria are often used for evaluating the performance of learning with multi-label examples [56, 92]: hamming loss, one-error, coverage, ranking loss and average precision. Using the same notation as in Sections 3 and 4, given a test set $S = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_p, Y_p)\}$, these five criteria are defined as below. Here, $h(X)$ returns a set of proper labels for $X$; $f(X, y)$ returns a real value indicating the confidence for $y$ to be a proper label of $X$; $rank_f(X, y)$ returns the rank of $y$ derived from $f(X, y)$.

$hloss_S(h) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|\mathcal{Y}|}\left|h(X_i)\,\Delta\,Y_i\right|$, where $\Delta$ stands for the symmetric difference between two sets. The hamming loss evaluates how many times an object-label pair is misclassified, i.e., a proper label is missed or a wrong label is predicted. The performance is perfect when $hloss_S(h) = 0$; the smaller the value of $hloss_S(h)$, the better the performance of $h$.

$one\text{-}error_S(f) = \frac{1}{p}\sum_{i=1}^{p}\left[\!\left[\arg\max_{y \in \mathcal{Y}} f(X_i, y) \notin Y_i\right]\!\right]$. The one-error evaluates how many times the top-ranked label is not a proper label of the object. The performance is perfect when $one\text{-}error_S(f) = 0$; the smaller the value of $one\text{-}error_S(f)$, the better the performance of $f$.

$coverage_S(f) = \frac{1}{p}\sum_{i=1}^{p}\max_{y \in Y_i} rank_f(X_i, y) - 1$. The coverage evaluates how far it is needed, on average, to go down the ranked list of labels in order to cover all the proper labels of the object. It is loosely related to precision at the level of perfect recall. The smaller the value of $coverage_S(f)$, the better the performance of $f$.

$rloss_S(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i||\bar{Y}_i|}\left|\left\{(y_1, y_2) \mid f(X_i, y_1) \le f(X_i, y_2),\ (y_1, y_2) \in Y_i \times \bar{Y}_i\right\}\right|$, where $\bar{Y}_i$ denotes the complementary set of $Y_i$ in $\mathcal{Y}$. The ranking loss evaluates the average fraction of label pairs that are misordered for the object. The performance is perfect when $rloss_S(f) = 0$; the smaller the value of $rloss_S(f)$, the better the performance of $f$.

$avgprec_S(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i|}\sum_{y \in Y_i}\frac{\left|\{y' \in Y_i \mid rank_f(X_i, y') \le rank_f(X_i, y)\}\right|}{rank_f(X_i, y)}$. The average precision evaluates the average fraction of proper labels ranked above a particular label $y \in Y_i$. The performance is perfect when $avgprec_S(f) = 1$; the larger the value of $avgprec_S(f)$, the better the performance of $f$.
In addition to the above criteria, we design two new multilabel criteria, average recall and average F1, as below.

$avgrecl_S(h) = \frac{1}{p}\sum_{i=1}^{p}\frac{|h(X_i) \cap Y_i|}{|Y_i|}$. The average recall evaluates the average fraction of proper labels that have been predicted. The performance is perfect when $avgrecl_S(h) = 1$; the larger the value of $avgrecl_S(h)$, the better the performance of $h$.

$avgF1_S = \frac{2 \times avgprec_S \times avgrecl_S}{avgprec_S + avgrecl_S}$. The average F1 expresses a tradeoff between the average precision and the average recall. The performance is perfect when $avgF1_S = 1$; the larger the value of $avgF1_S$, the better the performance.
Note that since the above criteria measure the performance from different aspects, it is difficult for one algorithm to outperform another on every one of these criteria.
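The criteria above can be implemented compactly. The sketch below (hypothetical function names; label sets encoded as Python sets of integers, confidences as per-label score lists) follows our reading of the definitions, including the harmonic-mean form of average F1.

```python
def rank(scores, y):
    """Rank of label y under the confidence scores (1 = highest)."""
    order = sorted(range(len(scores)), key=lambda l: -scores[l])
    return order.index(y) + 1

def multilabel_metrics(Y_true, Y_pred, F, n_labels):
    """Y_true/Y_pred: lists of label sets; F: per-example score lists."""
    p = len(Y_true)
    # hamming loss: size of the symmetric difference, normalized
    hloss = sum(len(Yt ^ Yp) for Yt, Yp in zip(Y_true, Y_pred)) / (p * n_labels)
    # one-error: top-ranked label is not a proper label
    one_err = sum(max(range(n_labels), key=lambda l: f[l]) not in Yt
                  for Yt, f in zip(Y_true, F)) / p
    # coverage: depth needed to cover all proper labels
    coverage = sum(max(rank(f, y) for y in Yt) - 1 for Yt, f in zip(Y_true, F)) / p
    # ranking loss: fraction of misordered proper/improper label pairs
    rloss = 0.0
    for Yt, f in zip(Y_true, F):
        comp = set(range(n_labels)) - Yt
        rloss += sum(f[y] <= f[yb] for y in Yt for yb in comp) / (len(Yt) * len(comp))
    rloss /= p
    # average precision: fraction of proper labels ranked above each proper label
    avgprec = sum(
        sum(sum(rank(f, y2) <= rank(f, y) for y2 in Yt) / rank(f, y) for y in Yt)
        / len(Yt) for Yt, f in zip(Y_true, F)) / p
    # average recall: fraction of proper labels that were predicted
    avgrecl = sum(len(Yt & Yp) / len(Yt) for Yt, Yp in zip(Y_true, Y_pred)) / p
    avgf1 = 2 * avgprec * avgrecl / (avgprec + avgrecl)
    return dict(hloss=hloss, one_error=one_err, coverage=coverage,
                rloss=rloss, avgprec=avgprec, avgrecl=avgrecl, avgf1=avgf1)
```

A perfect predictor attains 0 for the first four criteria and 1 for the last three, matching the "perfect performance" statements above.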
In the following we study the performance of MIML algorithms on two tasks involving complicated objects with multiple semantic meanings. We will show that for such tasks, MIML is a good choice, and good performance can be achieved even by using simple MIML algorithms such as MimlBoost and MimlSvm.
4.3.2 Scene Classification
The scene classification data set consists of 2,000 natural scene images belonging to the classes desert, mountains, sea, sunset and trees. Over 22% of the images belong to multiple classes simultaneously. Each image has already been represented as a bag of nine instances generated by the Sbn method [46], which uses a Gaussian filter to smooth the image and then subsamples the image into a matrix of color blobs, where each blob is a set of pixels within the matrix. An instance corresponds to the combination of a single blob with its four neighboring blobs (up, down, left, right), and is described by 15 features. The first three features represent the mean R, G, B values of the central blob, and the remaining twelve features express the differences in mean color values between the central blob and the four neighboring blobs, respectively.\@footnotemark\@footnotetextThe data set is available at http://lamda.nju.edu.cn/data_MIMLimage.ashx.
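A rough sketch of the Sbn-style representation described above, with an illustrative grid size (the paper's exact smoothing and subsampling parameters are not reproduced here): each interior blob yields one 15-dimensional instance.

```python
def sbn_instances(image, grid=4):
    """image: H x W list of (r, g, b) tuples. Rough sketch of the SBN idea:
    average the image into a grid x grid matrix of color blobs, then describe
    each interior blob by its mean RGB plus the RGB differences to its four
    neighbors (up, down, left, right), giving 15 features per instance.
    The grid size is illustrative, and Gaussian smoothing is omitted."""
    H, W = len(image), len(image[0])
    bh, bw = H // grid, W // grid

    def blob(i, j):  # mean color of blob (i, j)
        px = [image[r][c] for r in range(i * bh, (i + 1) * bh)
                          for c in range(j * bw, (j + 1) * bw)]
        return [sum(p[k] for p in px) / len(px) for k in range(3)]

    blobs = [[blob(i, j) for j in range(grid)] for i in range(grid)]
    bag = []
    for i in range(1, grid - 1):
        for j in range(1, grid - 1):
            c = blobs[i][j]
            feat = list(c)  # 3 features: mean R, G, B of the central blob
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                # 12 features: color differences to the four neighbors
                feat += [blobs[ni][nj][k] - c[k] for k in range(3)]
            bag.append(feat)
    return bag
```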
We evaluate the performance of the MIML algorithms MimlBoost and MimlSvm. Note that MimlBoost and MimlSvm are merely proposed to illustrate the two general degeneration solutions to MIML problems shown in Fig. 3. We do not claim that they are the best algorithms that can be developed along the degeneration paths. There may exist other processes for transforming MIML examples into multi-instance single-label (MISL) examples or single-instance multi-label (SIML) examples. Even with the same degeneration processes as those used in MimlBoost and MimlSvm, there are many alternative ways to realize the second step. For example, by using miSvm [3] to replace the MiBoosting used in MimlBoost and by using the two-layer neural network structure [81] to replace the MlSvm used in MimlSvm, we get MimlSvm_mi and MimlNn, respectively. Their performance is also evaluated in our experiments.
We compare the MIML algorithms with several state-of-the-art algorithms for learning with multi-label examples, including AdtBoost.MH [22], RankSvm [27], MlSvm [11] and Mlnn [80]; these algorithms have been briefly introduced in Section 2. Note that these are single-instance algorithms that regard each image as a 135-dimensional feature vector, obtained by concatenating the nine instances in order from the upper-left to the bottom-right.
The parameter configurations of RankSvm, MlSvm and Mlnn follow the strategies adopted in [27], [11] and [80], respectively. For RankSvm, a polynomial kernel is used, where polynomial degrees from 2 to 9 are considered as in [27] and chosen by hold-out tests on the training sets. For MlSvm, a Gaussian kernel is used. For Mlnn, the number of nearest neighbors considered is set to 10.
The boosting rounds of AdtBoost.MH and MimlBoost are set to 25 and 50, respectively; the performance of the two algorithms at different boosting rounds is shown in Appendix B (Fig. B.1), where it can be observed that at those rounds the performance of both algorithms has become stable. Gaussian-kernel Libsvm [16] is used for Step 3a of MimlBoost. MimlSvm and MimlSvm_mi are also realized with Gaussian kernels. The clustering parameter of MimlSvm is set to 20% of the number of training images; the performance of this algorithm under different settings of this parameter is shown in Appendix B (Fig. B.2), where it can be observed that the setting does not significantly affect the performance of MimlSvm. Note that in Appendix B (Figs. B.1 and B.2) we plot one minus average precision, one minus average recall and one minus average F1, so that in all the figures, the lower the curve, the better the performance.
Here in the experiments, 1,500 images are used as training examples while the remaining 500 images are used for testing. Experiments are repeated for thirty runs by using random training/test partitions, and the average and standard deviation are summarized in Table 1,\@footnotemark\@footnotetextFor the shared implementation of AdtBoost.MH (http://www.grappa.univlille3.fr/ grappa/en_index.php3?info=software), ranking loss, average recall and average F1 are not available in the program’s outputs. where the best performance on each criterion has been highlighted in boldface.
Table 1: Experimental results (mean±std) on the scene classification data set (↓: the smaller the better; ↑: the larger the better).

Compared     hamming    one-       coverage    ranking    average      average    average
Algorithms   loss ↓     error ↓    ↓           loss ↓     precision ↑  recall ↑   F1 ↑
MimlBoost    .193±.007  .347±.019   .984±.049  .178±.011  .779±.012    .433±.027  .556±.023
MimlSvm      .189±.009  .354±.022  1.087±.047  .201±.011  .765±.013    .556±.020  .644±.018
MimlSvm_mi   .195±.008  .317±.018  1.068±.052  .197±.011  .783±.011    .587±.019  .671±.015
MimlNn       .185±.008  .351±.026  1.057±.054  .196±.013  .771±.015    .509±.022  .613±.020
AdtBoost.MH  .211±.006  .436±.019  1.223±.050  N/A        .718±.012    N/A        N/A
RankSvm      .210±.024  .395±.075  1.161±.154  .221±.040  .746±.044    .529±.068  .620±.059
MlSvm        .232±.004  .447±.023  1.217±.054  .233±.012  .712±.013    .073±.010  .132±.017
Mlnn         .191±.006  .370±.017  1.085±.048  .203±.010  .759±.011    .407±.026  .529±.023
Pairwise tests at the 95% significance level disclose that all the MIML algorithms are significantly better than AdtBoost.MH and MlSvm on all seven evaluation criteria. This is impressive since, as mentioned before, these criteria measure the learning performance from different aspects and one algorithm rarely outperforms another on all of them. MimlSvm and MimlSvm_mi are both significantly better than RankSvm on all the evaluation criteria, while MimlBoost and MimlNn are both significantly better than RankSvm on the first five criteria. MimlNn is significantly better than Mlnn on all the evaluation criteria. Both MimlBoost and MimlSvm_mi are significantly better than Mlnn on all criteria except hamming loss. MimlSvm is significantly better than Mlnn on one-error, average precision, average recall and average F1, while there are ties on the other criteria. Moreover, note that the best performance on every evaluation criterion is always attained by an MIML algorithm. Overall, the comparison on the scene classification task shows that the MIML algorithms can be significantly better than the non-MIML algorithms; this validates the power of the MIML framework.
4.3.3 Text Categorization
The Reuters21578 data set is used in this experiment. The seven most frequent categories are considered. After removing documents that do not have labels or main texts, and randomly removing some documents that have only one label, a data set containing 2,000 documents is obtained, where over 14.9% of the documents have multiple labels. Each document is represented as a bag of instances according to the method used in [3]. Briefly, the instances are obtained by splitting each document into passages using overlapping windows of at most 50 words each. As a result, there are 2,000 bags, and the number of instances in each bag varies from 2 to 26 (3.6 on average). The instances are represented based on term frequency. The words with high frequencies are considered, excluding the "function words" that have been removed from the vocabulary using the Smart stoplist [55]. It has been found that, based on document frequency, the dimensionality of the data set can be reduced to 1%–10% of the original size without loss of effectiveness [73]. Thus, we use the top 2% most frequent words, and therefore each instance is a 243-dimensional feature vector.\@footnotemark\@footnotetextThe data set is available at http://lamda.nju.edu.cn/data_MIMLtext.ashx
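The passage-based bag construction described above can be sketched as follows; the 50-word maximum follows the text, while the stride of the overlapping windows and the helper names are assumptions.

```python
import re
from collections import Counter

def passages(text, size=50, overlap=25):
    """Split a document into overlapping windows of at most `size` words,
    sliding by `size - overlap` words; each passage becomes one instance.
    The stride (degree of overlap) is an assumption, not the paper's setting."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    step = size - overlap
    return [words[i:i + size]
            for i in range(0, max(len(words) - overlap, 1), step)]

def tf_vector(passage, vocab):
    """Term-frequency vector of one passage over a fixed vocabulary
    (stopwords are assumed to have been removed from `vocab`)."""
    tf = Counter(passage)
    return [tf[w] for w in vocab]
```

Applying `tf_vector` with a 243-word vocabulary to every passage of a document yields its bag-of-instances representation.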
The parameter configurations of RankSvm, MlSvm and Mlnn are set in the same way as in Section 4.3.2. The boosting rounds of AdtBoost.MH and MimlBoost are set to 25 and 50, respectively. Linear kernels are used. The clustering parameter of MimlSvm is set to 20% of the number of training documents. The single-instance algorithms regard each document as a 243-dimensional feature vector obtained by aggregating all the instances in the same bag; this is equivalent to representing the document with a single term-frequency feature vector.
Here in the experiments, 1,500 documents are used as training examples while the remaining 500 documents are used for testing. Experiments are repeated for thirty runs by using random training/test partitions, and the average and standard deviation are summarized in Table 2, where the best performance on each criterion has been highlighted in boldface.
Table 2: Experimental results (mean±std) on the text categorization data set (↓: the smaller the better; ↑: the larger the better).

Compared     hamming    one-       coverage    ranking    average      average    average
Algorithms   loss ↓     error ↓    ↓           loss ↓     precision ↑  recall ↑   F1 ↑
MimlBoost    .053±.004  .094±.014  .387±.037   .035±.005  .937±.008    .792±.010  .858±.008
MimlSvm      .033±.003  .066±.011  .313±.035   .023±.004  .956±.006    .925±.010  .940±.008
MimlSvm_mi   .041±.004  .055±.009  .284±.030   .020±.003  .965±.005    .921±.012  .942±.007
MimlNn       .038±.002  .080±.010  .320±.030   .025±.003  .950±.006    .834±.011  .888±.008
AdtBoost.MH  .055±.005  .120±.017  .409±.047   N/A        .926±.011    N/A        N/A
RankSvm      .120±.013  .196±.126  .695±.466   .085±.077  .868±.092    .411±.059  .556±.068
MlSvm        .050±.003  .081±.011  .329±.029   .026±.003  .949±.006    .777±.016  .854±.011
Mlnn         .049±.003  .126±.012  .440±.035   .045±.004  .920±.007    .821±.021  .867±.013
Pairwise tests at the 95% significance level disclose that, impressively, both MimlSvm and MimlSvm_mi are significantly better than all the non-MIML algorithms. MimlNn is significantly better than AdtBoost.MH, RankSvm and Mlnn on all the evaluation criteria, and significantly better than MlSvm on hamming loss, average recall and average F1, while there are ties on the other criteria. MimlBoost is significantly better than AdtBoost.MH on all criteria except for a tie on hamming loss; significantly better than RankSvm on all criteria; significantly better than MlSvm on average recall, with a tie on average F1; and significantly better than Mlnn on one-error, coverage, ranking loss and average precision. Moreover, note that the best performance on every evaluation criterion is always attained by an MIML algorithm. Overall, the comparison on the text categorization task shows that the MIML algorithms are better than the non-MIML algorithms; this validates the power of the MIML framework.
5 Solving MIML Problems by Regularization
The degeneration methods presented in Section 4 may lose information during the degeneration process, and thus a "direct" MIML algorithm is desirable. In this section we propose a regularization method for MIML. In contrast to MimlBoost and MimlSvm, this method is developed from the regularization framework directly, and so we call it DMimlSvm. The basic assumption of DMimlSvm is that the labels associated with the same example have some relatedness, and that the performance of classifying the bags depends on the loss between the labels and the predictions on the bags as well as on their constituent instances. Moreover, considering that for any class label the number of positive examples is usually smaller than the number of negative examples, the method incorporates a mechanism to deal with class imbalance. We employ the constrained concave-convex procedure (Cccp), which has well-studied convergence properties [62], to solve the resultant non-convex optimization problem. We also present a cutting-plane algorithm that finds the solution efficiently.
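As background for the optimization strategy just mentioned, the core convex-concave idea behind Cccp can be illustrated on a toy unconstrained problem: split the objective into a convex part plus a concave part, linearize the concave part at the current point, and minimize the resulting convex surrogate; repeating this monotonically decreases the objective. The functions below are made up for illustration; the paper's constrained quadratic subproblems and cutting-plane solver are far more involved.

```python
def cccp(x0, grad_concave, solve_convex, iters=50):
    """Toy unconstrained concave-convex procedure: at each step,
    linearize the concave part at the current point and minimize
    the resulting convex surrogate."""
    x = x0
    for _ in range(iters):
        g = grad_concave(x)   # gradient of the concave part at x
        x = solve_convex(g)   # argmin of (convex part + g * x)
    return x

# Toy objective J(x) = x**4 - 2*x**2: u(x) = x**4 is convex and
# v(x) = -2*x**2 is concave with gradient v'(x) = -4*x.
# The surrogate min_x x**4 + g*x has stationary point 4*x**3 = -g.
x_star = cccp(3.0, lambda x: -4.0 * x, lambda g: (-g / 4.0) ** (1.0 / 3.0))
```

Starting from `x0 = 3.0`, the iterates converge to `x = 1.0`, a local minimizer of the non-convex objective `J`.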
5.1 The Loss Function
Given a set of MIML training examples $\{(X_i, Y_i)\}_{i=1}^{m}$, the goal of DMimlSvm is to learn a mapping $f$ such that the proper label set for each bag is determined by $f$. Specifically, DMimlSvm chooses to instantiate $f$ with $T$ functions, i.e., $f = (f_1, f_2, \dots, f_T)$, where $T$ is the number of labels in the label space $\mathcal{Y}$. Here, the $l$-th function $f_l$ determines the belongingness of the $l$-th label, i.e., $f_l(X) > 0$ if and only if the $l$-th label is a proper label of $X$. In addition, each single instance $x$ in a bag can be viewed as a bag containing only one instance, such that $f_l(\{x\})$ is also a well-defined function. For convenience, such notations are simplified in the rest of this section (e.g., $f_l(\{x\})$ is written as $f_l(x)$).
To train the component functions in $f$, DMimlSvm employs the following empirical loss function, which involves two terms balanced by a trade-off parameter:
(3) 
Here, the first term considers the loss between the ground-truth label set of each training bag $X_i$, i.e., $Y_i$, and its predicted label set. Let $\Phi(Y_i, l) = +1$ if the $l$-th label belongs to $Y_i$, and $\Phi(Y_i, l) = -1$ otherwise. Furthermore, let $\ell(z) = \max(0, 1 - z)$ denote the hinge loss function. Accordingly, the first loss term is defined as:
(4) 
The second term considers the loss between $f_l(X_i)$ and the predictions on $X_i$'s constituent instances, which reflects the relationship between the bag $X_i$ and its instances. Here, the common assumption in multi-instance learning is that the strength for $X_i$ to hold a label is equal to the maximum strength for its instances to hold the label, i.e., $f_l(X_i) = \max_{x \in X_i} f_l(x)$.\@footnotemark\@footnotetextNote that this assumption may be restrictive to some extent. There are many cases where the label of the bag does not rely on the instance with the maximum prediction, as discussed in Section 2. In addition, in classification only the sign of the prediction is important [19]. However, in this paper the above common assumption is still adopted due to its popularity and simplicity. Accordingly, the second loss term is defined as:
(5) 
Here, the loss between $f_l(X_i)$ and $\max_{x \in X_i} f_l(x)$ can be defined in various ways; this paper adopts a simple instantiation. By combining Eq. 4 and Eq. 5, the empirical loss function in Eq. 3 is then specified as:
(6)  
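To make the description concrete, the loss terms described above can be sketched in LaTeX as follows; the symbols ($m$ bags, $T$ labels, balance parameter $\lambda$) and the averaging constants are illustrative assumptions rather than the paper's exact formulation.

```latex
% Hinge loss, with \Phi(Y_i, l) = +1 if the l-th label is in Y_i, -1 otherwise:
\ell(z) = \max(0, 1 - z)

% Bag-level term (cf. Eq. 4):
\mathcal{L}_{\mathrm{bag}} = \frac{1}{mT} \sum_{i=1}^{m} \sum_{l=1}^{T}
    \ell\bigl( \Phi(Y_i, l)\, f_l(X_i) \bigr)

% Bag-instance consistency term (cf. Eq. 5), under the multi-instance
% assumption f_l(X) = \max_{x \in X} f_l(x):
\mathcal{L}_{\mathrm{inst}} = \frac{1}{mT} \sum_{i=1}^{m} \sum_{l=1}^{T}
    \mathrm{loss}\Bigl( f_l(X_i),\; \max_{x \in X_i} f_l(x) \Bigr)

% Combined empirical loss (cf. Eq. 3), balanced by \lambda:
V = (1 - \lambda)\, \mathcal{L}_{\mathrm{bag}} + \lambda\, \mathcal{L}_{\mathrm{inst}}
```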
5.2 Representer Theorem for MIML
For simplicity, we assume that each function $f_l$ is a linear model, i.e., $f_l(X) = \langle w_l, \phi(X) \rangle$, where $\phi$ is the feature map induced by a kernel function $k$ and $\langle \cdot, \cdot \rangle$ denotes the standard inner product in the Reproducing Kernel Hilbert Space (RKHS) induced by $k$. We recall that an instance can be regarded as a bag containing only one instance, so the kernel $k$ can be any kernel defined on sets of instances, such as the set kernel [32]. In the case of classification, objects (bags or instances) are classified according to the sign of $f_l$.
DMimlSvm assumes that the labels associated with a bag should have some relatedness; otherwise they would not be associated with the bag simultaneously. To reflect this basic assumption, DMimlSvm regularizes the empirical loss function in Eq. 6 with an additional term:
(7) 
Here, the coefficient in Eq. 7 is a regularization parameter balancing the model complexity and the empirical risk. Inspired by [28], we assume that the relatedness among the labels can be measured by the mean of the functions $f_1, \dots, f_T$,
(8) 
The original idea in [28] is to minimize the complexity of the mean function and meanwhile minimize the deviation of each $f_l$ from the mean, i.e., to set the regularizer as:
(9) 
According to Eq. 8, the first term on the right-hand side of Eq. 9 can be rewritten as:
(10) 
Therefore, by substituting Eq. 10 into Eq. 9, the regularizer can be simplified as:
(11) 
Further, by substituting Eq. 11 into Eq. 7, we obtain the regularization framework of DMimlSvm as follows:
(12) 
Here, the parameter in Eq. 12 trades off the discrepancy and the commonness among the labels, that is, how similar or dissimilar the $f_l$'s are. Intuitively, when this parameter is large, minimization of Eq. 12 forces the mean function to tend to zero, and the discrepancy among the labels becomes more important; when it is small, minimization of Eq. 12 forces the deviations from the mean to tend to zero, and the commonness among the labels becomes more important [28].
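The mean-function construction behind the regularizer (cf. Eqs. 8–11) can be sketched in the style of [28] as follows; the weight vectors $w_l$, their mean $w_0$, and the trade-off coefficients $\lambda_1, \lambda_2$ are illustrative names rather than the paper's exact notation, while the final expansion is a standard variance-decomposition identity.

```latex
% Mean of the T per-label weight vectors (cf. Eq. 8):
w_0 = \frac{1}{T} \sum_{l=1}^{T} w_l

% Regularizer: penalize the mean model and the per-label
% deviations from it (cf. Eq. 9):
\Omega(f) = \lambda_1 \lVert w_0 \rVert^2
          + \frac{\lambda_2}{T} \sum_{l=1}^{T} \lVert w_l - w_0 \rVert^2

% Expanding the deviation term (cf. Eqs. 10-11):
\frac{1}{T} \sum_{l=1}^{T} \lVert w_l - w_0 \rVert^2
  = \frac{1}{T} \sum_{l=1}^{T} \lVert w_l \rVert^2 - \lVert w_0 \rVert^2
```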
Given the above setup, we can prove the following representer theorem.