# Fast Multi-Instance Multi-Label Learning

###### Abstract

In many real-world tasks, particularly those involving data objects with complicated semantics such as images and texts, one object can be represented by multiple instances and simultaneously be associated with multiple labels. Such tasks can be formulated as multi-instance multi-label learning (MIML) problems, and have been extensively studied during the past few years. Existing MIML approaches have been found useful in many applications; however, most of them can only handle moderate-sized data. To efficiently handle large data sets, in this paper we propose the MIMLfast approach, which first constructs a low-dimensional subspace shared by all labels, and then trains label-specific linear models to optimize an approximated ranking loss via stochastic gradient descent. Although the MIML problem is complicated, MIMLfast is able to achieve excellent performance by exploiting label relations with the shared space and discovering sub-concepts for complicated labels. Experiments show that the performance of MIMLfast is highly competitive with state-of-the-art techniques, whereas its time cost is much lower; in particular, on a data set with 20K bags and 180K instances, MIMLfast is more than 100 times faster than existing MIML approaches. On a larger data set where none of the existing approaches can return results in 24 hours, MIMLfast takes only 12 minutes. Moreover, our approach is able to identify the most representative instance for each label, thus providing a chance to understand the relation between input patterns and output label semantics.

Corresponding author. Email: zhouzh@nju.edu.cn

National Key Laboratory for Novel Software Technology

Nanjing University, Nanjing 210093, China

Key words: MIML, multi-instance multi-label learning, fast, key instance, sub-concepts

## 1 Introduction

In traditional supervised learning, one object is represented by a single instance and associated with only one label. However, in many real-world applications, one object can be naturally decomposed into multiple instances, and has multiple class labels simultaneously. For example, in image classification problems, an image usually contains multiple objects, and can be divided into several segments, where each segment is represented with an instance, and corresponds to a semantic label ZZ07 (); in text categorization tasks, an article may belong to multiple categories, and can be represented by a bag of instances, one for a paragraph YZH09 (); in gene function prediction tasks, a gene usually has multiple labels since it is related to multiple functions, and can be represented with a set of images with different views LJKYZ (). Multi-instance multi-label learning (MIML) is a recently proposed framework for such complicated objects ZZHL12 ().

During the past years, many MIML algorithms have been proposed ZZ07 (); ZHMWQW08 (); JWZ09 (); YZH09 (); ZW09 (); LO10 (); N10 (); Z10 (); BFR12 (); ZZHL12 (). They achieved decent performance and validated the superiority of the MIML framework in different applications. However, as the expressive power increases, the hypothesis space of MIML expands dramatically, resulting in the high complexity and low efficiency of existing approaches. These approaches are usually time-consuming and cannot handle large-scale data, which strongly limits the application of multi-instance multi-label learning.

In this paper, we propose a novel approach MIMLfast to learn on multi-instance multi-label data fast. Though simple linear models are employed for efficiency, MIMLfast provides an effective approximation of the original MIML problem. Specifically, to utilize the relations among multiple labels, we first learn a shared space for all the labels from the original features, and then train label specific linear models from the shared space. To identify the key instance to represent a bag for a specific label, we train the classification model on the instance level, and then select the instance with maximum prediction. To make the learning efficient, we employ stochastic gradient descent (SGD) to optimize an approximated ranking loss. At each step of SGD, MIMLfast randomly samples a triplet which consists of a bag, a relevant label of the bag and an irrelevant label, and optimizes the model to rank the relevant label before the irrelevant one if such an order is violated.

While most existing approaches focus on improving generalization, another important task of MIML learning is to understand the relation between input patterns and output label semantics LHJZ12 (). Our approach can naturally identify the most representative instance for each label. In addition, we propose to discover sub-concepts for complicated labels, which frequently occur in MIML tasks.

The rest of the paper is organized as follows. We propose the MIMLfast approach in Section 2, and then present the experiments in Section 3. Section 4 reviews some related work, followed by the conclusion in Section 5.

## 2 The MIMLfast Approach

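We denote by $\{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$ the training data that consists of $n$ examples, where each bag $X_i$ has $z_i$ instances $\{\mathbf{x}_{i,1}, \mathbf{x}_{i,2}, \cdots, \mathbf{x}_{i,z_i}\}$, and $Y_i$ contains the labels associated with $X_i$, which is a subset of all possible labels $\{y_1, y_2, \cdots, y_L\}$.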

We first discuss how to build the classification model on the instance level, and then how to obtain the labels of bags from instance predictions. To handle a problem with multiple labels, the simplest way is to degenerate it into a series of single-label problems by training one model for each label independently. However, such a degenerating approach may lose information since it treats the labels independently and ignores the relations among them. In our approach, we formulate the model as a combination of two components. The first component learns a linear mapping from the original feature space to a low-dimensional space, which is shared by all the labels. The second component then learns label-specific models based on the shared space. The two components are optimized interactively to fit training examples from all labels. In this way, examples from each label contribute to the optimization of the shared space, and labels with strong relations are expected to help each other. Formally, given an instance $\mathbf{x}$, we define the classification model on label $l$ as

$$f_l(\mathbf{x}) = \mathbf{w}_l^\top W_0 \mathbf{x},$$

where $W_0$ is an $m \times d$ matrix which maps the original feature vectors to the shared space, and $\mathbf{w}_l$ is the $m$-dimensional weight vector for label $l$. $d$ and $m$ are the dimensionalities of the feature space and the shared space, respectively.

Objects in multi-instance multi-label learning tasks usually have complicated semantics, and thus examples with diverse contents may be assigned the same label. For example, the content of an image labeled apple can be a mobile phone, a laptop or just a real apple. It is difficult to train a single model to classify images with such diverse contents into the same category. Instead, we propose to learn multiple models for a complicated label, one for each sub-concept, and automatically decide which sub-concept an example belongs to. The model of each sub-concept is much simpler and may be more easily trained to fit the data. We assume that there are $K$ sub-concepts for each label. For a given example with label $l$, the sub-concept it belongs to is automatically determined by first examining the prediction values of the $K$ models, and then selecting the sub-concept with the maximum prediction value. Now we can redefine the prediction of instance $\mathbf{x}$ on label $l$ as:

$$f_l(\mathbf{x}) = \max_{k = 1, \cdots, K} f_{l,k}(\mathbf{x}) = \max_{k = 1, \cdots, K} \mathbf{w}_{l,k}^\top W_0 \mathbf{x}, \tag{1}$$

where $\mathbf{w}_{l,k}$ corresponds to the $k$-th sub-concept of label $l$, and $W_0$ is the matrix that maps the original features to the shared space. Note that although we assume there are $K$ sub-concepts for each label, empty sub-concepts are allowed, i.e., examples of a simple label may be distributed in only a few or even one sub-concept.
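The two-level prediction with sub-concepts can be illustrated with a minimal sketch in plain Python. It assumes a shared mapping `W0` (an $m \times d$ matrix stored as a list of rows) and one weight vector per sub-concept of a label; the helper names and the toy numbers are hypothetical and serve only as illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(mat: Matrix, vec: Vector) -> List[float]:
    """Multiply an (m x d) matrix, stored as a list of rows, by a length-d vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_instance(W0: Matrix, w_label: Sequence[Vector], x: Vector) -> Tuple[float, int]:
    """Return (best score, index of the winning sub-concept) for one label.

    W0      : shared m x d mapping (toy values below are hypothetical)
    w_label : K weight vectors of length m, one per sub-concept of the label
    """
    z = matvec(W0, x)                      # project x into the shared space
    scores = [dot(w, z) for w in w_label]  # one score per sub-concept
    best_k = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_k], best_k


# Toy setting: d = 3 features, m = 2 shared dimensions, K = 2 sub-concepts.
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 1.0]]
w_label = [[1.0, 0.0],   # sub-concept 0
           [0.0, 0.5]]   # sub-concept 1
score, k = predict_instance(W0, w_label, [0.2, 1.0, 1.0])
print(score, k)
```

Because the projection into the shared space is computed once per instance, adding sub-concepts only adds inner products in the low-dimensional space.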

We then look at how to get the predictions of bags from the instance-level models. It is usually assumed that a bag is positive if and only if it contains at least one positive instance DLL97 (); BFR12 (). Under this assumption, the prediction of a bag $X$ on label $l$ can be defined as the maximum of the predictions of all instances in this bag:

$$f_l(X) = \max_{\mathbf{x} \in X} f_l(\mathbf{x}),$$

where $f_l(\mathbf{x})$ is the instance-level prediction. We call the instance with the maximum prediction the key instance of $X$ on label $l$.
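The max-pooling step from instances to bags, together with key instance selection, can be sketched as follows. The scoring function `toy_score` is a hypothetical stand-in for the instance-level model of one label, not the learned model itself.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def bag_prediction(bag: Sequence[Vector],
                   score_fn: Callable[[Vector], float]) -> Tuple[float, int]:
    """Return (bag score, index of the key instance) for one label.

    The bag score is the maximum instance score, and the key instance is
    the instance that attains it.
    """
    if not bag:
        raise ValueError("a bag must contain at least one instance")
    scores = [score_fn(x) for x in bag]
    key = max(range(len(scores)), key=scores.__getitem__)
    return scores[key], key


def toy_score(x: Vector) -> float:
    """Hypothetical instance-level scorer used only for this illustration."""
    return 2.0 * x[0] - x[1]


bag = [[0.1, 0.5], [0.9, 0.2], [0.4, 0.4]]
bag_score, key_index = bag_prediction(bag, toy_score)
print(bag_score, key_index)
```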

With the above model, for an example $X$ and one of its relevant labels $l$, we define $R(X, l)$ as

$$R(X, l) = \sum_{j \in \bar{Y}} I[f_j(X) > f_l(X)], \tag{2}$$

where $\bar{Y}$ denotes the set of irrelevant labels of $X$, and $I[\cdot]$ is the indicator function which returns 1 if the argument is true and 0 otherwise. Essentially, $R(X, l)$ counts how many irrelevant labels are ranked before label $l$ on the bag $X$.

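Based on $R(X, l)$, the number of irrelevant labels ranked before $l$, we further define the ranking error UBG09 () with respect to an example $X$ on label $l$ as

$$\epsilon(X, l) = \sum_{i=1}^{R(X, l)} \frac{1}{i}. \tag{3}$$

It is obvious that the ranking error is larger when $l$ is ranked lower. Finally, we have the ranking error on the whole data set as

$$\text{Rank Error} = \sum_{i=1}^{n} \sum_{l \in Y_i} \epsilon(X_i, l).$$

Based on Eq. 2, the ranking error can be spread into all irrelevant labels in $\bar{Y}$ as:

$$\epsilon(X, l) = \sum_{j \in \bar{Y}} \epsilon(X, l) \frac{I[f_j(X) > f_l(X)]}{R(X, l)}. \tag{4}$$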
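As a small worked example, the sketch below computes the count of wrongly ranked irrelevant labels and the resulting ranking error for one bag. It assumes the ranking error is the harmonic sum $\sum_{i=1}^{R} 1/i$ of that count, in the style of the rank-weighted losses cited above; the label names and scores are made up.

```python
from typing import Dict, Iterable


def rank_count(scores: Dict[str, float], label: str, irrelevant: Iterable[str]) -> int:
    """Number of irrelevant labels scored strictly higher than `label`."""
    return sum(1 for j in irrelevant if scores[j] > scores[label])


def ranking_error(r: int) -> float:
    """Harmonic-sum penalty 1 + 1/2 + ... + 1/r (0.0 when r == 0)."""
    return sum(1.0 / i for i in range(1, r + 1))


# Bag-level scores for four labels; "sea" is the relevant label.
scores = {"sea": 0.4, "sunset": 0.9, "trees": 0.6, "desert": 0.1}
irrelevant = ["sunset", "trees", "desert"]

r = rank_count(scores, "sea", irrelevant)  # "sunset" and "trees" outrank "sea"
eps = ranking_error(r)                     # 1 + 1/2
print(r, eps)
```

The harmonic weights grow quickly for the first few mistakes and slowly afterwards, so pushing a relevant label from rank 3 to rank 1 is rewarded more than pushing it from rank 30 to rank 28.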

Due to non-convexity and discontinuity, it is rather difficult to optimize the above equation directly because such optimization often leads to NP-hard problems. We instead explore the following hinge loss, which has been shown as an optimal choice among all convex surrogate losses BDLSS12 (),

$$\Psi(X, l) = \sum_{j \in \bar{Y}} \epsilon(X, l) \frac{|1 + f_j(X) - f_l(X)|_+}{R(X, l)}, \tag{5}$$

where $|q|_+ = q$ if $q > 0$; otherwise, $|q|_+ = 0$. The surrogate loss $\Psi(X, l)$ can be viewed as an upper bound of the ranking error $\epsilon(X, l)$ with the following lemma:

###### Lemma 1

$\Psi(X, l) \geq \epsilon(X, l)$.

###### Proof.

This lemma holds from $|1 + q|_+ \geq I[q > 0]$. ∎
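The bound can be spelled out term by term. The following is a sketch that assumes the surrogate weights each hinge term by $\epsilon(X, l) / R(X, l)$, mirroring the decomposition in Eq. 4, and that $R(X, l) > 0$; the shorthand $q_j$ is introduced here only for illustration.

```latex
\begin{align*}
q_j &:= f_j(X) - f_l(X), \\
|1 + q_j|_+ &\ge I[q_j > 0] \quad \text{for every } j \in \bar{Y}, \\
\Psi(X, l) = \sum_{j \in \bar{Y}} \epsilon(X, l)\,\frac{|1 + q_j|_+}{R(X, l)}
  &\ge \sum_{j \in \bar{Y}} \epsilon(X, l)\,\frac{I[q_j > 0]}{R(X, l)} = \epsilon(X, l).
\end{align*}
```

The second line holds because $|1 + q_j|_+ > 1$ whenever $q_j > 0$, while the indicator is 0 and the hinge term is non-negative whenever $q_j \le 0$.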

We then employ stochastic gradient descent (SGD) RM51 () to minimize the ranking error. At each iteration of SGD, we randomly sample a bag $X$, one of its relevant labels $l$, and one of its irrelevant labels $j \in \bar{Y}$ to form a triplet $(X, l, j)$, which will induce a loss:

$$\mathcal{L}(X, l, j) = \epsilon(X, l)\, |1 + f_j(X) - f_l(X)|_+. \tag{6}$$

We then have the following lemma to disclose the relation between and .

###### Lemma 2

, where denotes the expectation on the uniform distribution over .

###### Proof.

This lemma follows from the fact that probability of randomly choosing in is . ∎

To minimize the triplet loss $\mathcal{L}(X, l, j)$, it is required to calculate $R(X, l)$ in advance, i.e., we have to compare $f_l(X)$ with $f_j(X)$ for each $j \in \bar{Y}$, which could be time-consuming when the number of possible labels is large. Therefore, we use an approximation to estimate $R(X, l)$ in our implementation, inspired by Weston et al. WBU11 (). Specifically, at each SGD iteration, we randomly sample labels from the irrelevant label set $\bar{Y}$ one by one, until a violated label $j$ occurs. Here we call $j$ a violated label if it was ranked before $l$. Without loss of generality, we assume that the first violated label is found at the $v$-th sampling step, and then $R(X, l)$ can be approximated by $\lfloor |\bar{Y}| / v \rfloor$ with the following lemma:
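The sampling procedure can be sketched as below. Several details are assumptions made for illustration rather than statements about the authors' implementation: labels are drawn with replacement, at most $|\bar{Y}|$ draws are made, and a label counts as violated when its score comes within a margin of 1 of the relevant label's score (matching the hinge loss). The number of wrongly ranked labels is then estimated by integer division of $|\bar{Y}|$ by the number of draws.

```python
import random
from typing import Dict, Optional, Sequence, Tuple


def sample_violator(scores: Dict[str, float],
                    label: str,
                    irrelevant: Sequence[str],
                    rng: random.Random,
                    margin: float = 1.0) -> Optional[Tuple[str, int, int]]:
    """Draw irrelevant labels until one violates the margin.

    Returns (violated label, number of draws v, estimate |irrelevant| // v),
    or None if no violation is found within |irrelevant| draws.
    """
    for v in range(1, len(irrelevant) + 1):
        j = rng.choice(irrelevant)                # uniform draw, with replacement
        if scores[j] > scores[label] - margin:    # assumed violation test
            return j, v, len(irrelevant) // v
    return None


rng = random.Random(0)
scores = {"sea": 0.4, "sunset": 0.9, "trees": -2.0, "desert": -3.0}
print(sample_violator(scores, "sea", ["sunset", "trees", "desert"], rng))
```

When violations are frequent, the first draw usually succeeds and the estimate is large; when they are rare, many draws are needed and the estimate shrinks, so the cost of each SGD step adapts to how well the label is already ranked.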

###### Lemma 3

We denote by a random event with representing the event that first violated label is at the -th sampling step. We have

###### Proof.

For convenience, we set and assume without loss of generality. It is easy to derive the probability

and we further have

where we use and . This completes the proof. ∎

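We assume that the triplet sampled at the $t$-th SGD iteration is $(X, l, j)$; on label $l$, the key instance is $\mathbf{x}$, which achieves the maximum prediction on the $k$-th sub-concept, while on label $j$, the instance $\bar{\mathbf{x}}$ achieves the maximum prediction on the $\bar{k}$-th sub-concept. Then we have the approximated ranking loss for the triplet:

$$\mathcal{L}(X, l, j) \approx \begin{cases} 0 & \text{if } j \text{ is not violated}, \\ S_{\bar{Y}, v} \left( 1 + [\mathbf{w}^t_{j,\bar{k}}]^\top W_0^t \bar{\mathbf{x}} - [\mathbf{w}^t_{l,k}]^\top W_0^t \mathbf{x} \right) & \text{otherwise}. \end{cases}$$

Here we introduce $S_{\bar{Y}, v} = \sum_{i=1}^{\lfloor |\bar{Y}| / v \rfloor} \frac{1}{i}$ for the convenience of presentation, where $\bar{Y}$ is the irrelevant label set and $v$ is the sampling step at which the violated label was found. So, if a violated label $j$ is sampled, we perform the gradient descent on the three parameters according to:

$$W_0^{t+1} = W_0^t - \gamma_t S_{\bar{Y}, v} \left( \mathbf{w}^t_{j,\bar{k}} \bar{\mathbf{x}}^\top - \mathbf{w}^t_{l,k} \mathbf{x}^\top \right), \tag{7}$$

$$\mathbf{w}^{t+1}_{l,k} = \mathbf{w}^t_{l,k} + \gamma_t S_{\bar{Y}, v} W_0^t \mathbf{x}, \tag{8}$$

$$\mathbf{w}^{t+1}_{j,\bar{k}} = \mathbf{w}^t_{j,\bar{k}} - \gamma_t S_{\bar{Y}, v} W_0^t \bar{\mathbf{x}}, \tag{9}$$

where $\gamma_t$ is the step size of SGD at the $t$-th iteration. After the update of the parameters, $\mathbf{w}_{l,k}$, $\mathbf{w}_{j,\bar{k}}$ and each column of $W_0$ are normalized to have an L2 norm smaller than a constant $C$.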
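A single update for a violated triplet can be sketched in plain Python as follows. The sketch assumes the per-triplet loss is $S \cdot (1 + \mathbf{w}_j^\top W_0 \bar{\mathbf{x}} - \mathbf{w}_l^\top W_0 \mathbf{x})$, where `S` is the rank-dependent weight, `w_l` and `w_j` are the winning sub-concept vectors of the relevant and the violated label, and `x`, `x_bar` are their key instances. The normalization is implemented here as rescaling onto a ball of radius `C`, which is one possible reading of the norm constraint; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # m rows, d columns


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply an (m x d) matrix by a length-d vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]


def clip_norm(v: Vector, C: float) -> Vector:
    """Rescale v so that its L2 norm is at most C."""
    norm = math.sqrt(sum(a * a for a in v))
    return list(v) if norm <= C else [a * C / norm for a in v]


def sgd_step(W0: Matrix, w_l: Vector, w_j: Vector, x: Vector, x_bar: Vector,
             S: float, gamma: float, C: float) -> Tuple[Matrix, Vector, Vector]:
    """One gradient step on S * (1 + w_j^T W0 x_bar - w_l^T W0 x)."""
    m, d = len(W0), len(x)
    z, z_bar = matvec(W0, x), matvec(W0, x_bar)  # projections under the old W0
    # Gradient w.r.t. W0 is S * (w_j x_bar^T - w_l x^T).
    new_W0 = [[W0[r][c] - gamma * S * (w_j[r] * x_bar[c] - w_l[r] * x[c])
               for c in range(d)] for r in range(m)]
    new_w_l = [a + gamma * S * b for a, b in zip(w_l, z)]      # raise f_l
    new_w_j = [a - gamma * S * b for a, b in zip(w_j, z_bar)]  # lower f_j
    # Bound the norm of each column of W0 and of both weight vectors by C.
    cols = [clip_norm([new_W0[r][c] for r in range(m)], C) for c in range(d)]
    new_W0 = [[cols[c][r] for c in range(d)] for r in range(m)]
    return new_W0, clip_norm(new_w_l, C), clip_norm(new_w_j, C)


W0 = [[1.0, 0.0], [0.0, 1.0]]
new_W0, new_w_l, new_w_j = sgd_step(W0, [0.0, 0.0], [0.0, 0.0],
                                    x=[1.0, 0.0], x_bar=[0.0, 1.0],
                                    S=1.5, gamma=0.1, C=10.0)
print(new_W0, new_w_l, new_w_j)
```

In this toy call both weight vectors start at zero, so only they move: the relevant label's score on its key instance goes up and the violated label's score goes down.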

The pseudo code of MIMLfast is presented in Algorithm 1. First, each column of and for all labels and all sub-concepts are initialized at random with mean 0 and standard deviation . Then at each iteration of SGD, a triplet is randomly sampled, and their corresponding key instance and sub-concepts are identified. After that, gradient descent is performed to update the three parameters: , and according to Eqs. 7 to 9. At last, the updated parameters are normalized such that their norms will be upper bounded by . This procedure is repeated until some stop criteria reached. In our experiments, we sample a small subset from the training data to form a validation set, and stop the training if the ranking loss does not decrease anymore on the validation set.

We then present some theoretical guarantees on the convergence rate of the optimization. Denoting by

the loss of -th SGD iteration with model parameters , , and

the optimal solution, we have:

###### Theorem 1

Suppose , , and . By choosing proper , and , it holds that

where .

###### Proof.

We present the main steps due to space limitation. Since the function is convex with respect to and , we have

From Eqs. (8) and (9), we have

where

This follows that

In a similar manner, we have

Summing over , and by setting and simple calculation, we have

Further, we have

by selecting proper initial values and simple calculation. This theorem follows as desired. ∎

In the test phase of the algorithm, for a test bag, we can get the prediction value on each label, and consequently the ranking of all labels. For a single-label classification problem, it is very easy to get the label of the bag by selecting the one with the largest prediction value. However, in multi-label learning, the bag may have more than one label, and thus one does not know how many labels should be selected as relevant ones from the ranked label list FHLB08 (). To solve this problem, we assign each bag a dummy label, denoted by $\hat{y}$, and train the model to rank the dummy label before all irrelevant labels while after the relevant ones. To implement this idea, we pay special attention to constructing the irrelevant label set $\bar{Y}$. Specifically, when a bag $X$ and its label $l$ are sampled (in Line 6 of Algorithm 1), the algorithm first examines whether $l$ is the dummy label. If $l = \hat{y}$, then $\bar{Y}$ consists of all the irrelevant labels; otherwise, $\bar{Y}$ contains both the dummy label and all the irrelevant labels. In this way, the model is trained to rank the dummy label between the relevant labels and the irrelevant ones. For a test bag, the labels with a larger prediction value than that of the dummy label are selected as relevant labels.
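The two roles of the dummy label, shaping the irrelevant set during training and acting as a threshold at test time, can be sketched as follows. The label string `"<dummy>"` and both function names are hypothetical choices made for this illustration.

```python
from typing import Dict, List, Sequence, Set

DUMMY = "<dummy>"  # hypothetical name for the dummy label


def irrelevant_set(all_labels: Sequence[str], relevant: Set[str], sampled: str) -> List[str]:
    """Build the irrelevant label set for a sampled (bag, label) pair.

    If the sampled label is the dummy label, only the truly irrelevant labels
    are used; otherwise the dummy label is added so that relevant labels are
    pushed above it.
    """
    base = [lab for lab in all_labels if lab not in relevant and lab != DUMMY]
    return base if sampled == DUMMY else base + [DUMMY]


def select_relevant(scores: Dict[str, float]) -> List[str]:
    """Return the labels scored above the dummy label, best first."""
    threshold = scores[DUMMY]
    chosen = [lab for lab, s in scores.items() if lab != DUMMY and s > threshold]
    return sorted(chosen, key=lambda lab: -scores[lab])


scores = {"sea": 1.3, "sunset": 0.7, DUMMY: 0.2, "trees": -0.4, "desert": -1.1}
predicted = select_relevant(scores)
print(predicted)
```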

Table 1: Characteristics of the data sets.

Data | # ins. | # bag | # label | # label per bag |
---|---|---|---|---|
Letter Frost | 565 | 144 | 26 | 3.6 |
Letter Carroll | 717 | 166 | 26 | 3.9 |
MSRC v2 | 1758 | 591 | 23 | 2.5 |
Reuters | 7119 | 2000 | 7 | 1.2 |
Bird Song | 10232 | 548 | 13 | 2.1 |
Scene | 18000 | 2000 | 5 | 1.2 |
Corel5K | 47065 | 5000 | 260 | 3.4 |
MSRA | 270000 | 30000 | 99 | 2.7 |

## 3 Experiments

### 3.1 Settings

We compare MIMLfast with six state-of-the-art MIML methods: DBA YZH09 (), a generative model for MIML learning; KISAR LHJZ12 (), an MIML algorithm that tries to discover instance-label relations; MIMLBoost ZZ07 (), a boosting method that decomposes MIML into multi-instance single-label problems; MIMLkNN Z10 (), an MIML nearest neighbor algorithm; MIMLSVM ZZ07 (), an SVM-style algorithm that decomposes MIML into single-instance multi-label problems; and RankLossSIM BFR12 (), an MIML algorithm that minimizes ranking loss for instance annotation.

Table 2: Comparison results (mean±std.) on the six moderate-sized data sets. h.l.: hamming loss; o.e.: one error; co.: coverage; r.l.: ranking loss; a.p.: average precision. N/A indicates that no result was obtained.

Criterion | MIMLfast | DBA | KISAR | MIMLBoost | MIMLkNN | MIMLSVM | RankLossSIM |
---|---|---|---|---|---|---|---|
Letter Carroll | | | | | | | |
h.l. | .134±.012 | .180±.010 | .150±.008 | .153±.008 | .170±.017 | .154±.007 | .132±.006 |
o.e. | .119±.050 | .248±.036 | .058±.096 | .645±.062 | .312±.043 | .554±.043 | .167±.050 |
co. | .380±.029 | .909±.023 | .870±.018 | .730±.039 | .460±.030 | .905±.020 | .389±.037 |
r.l. | .130±.013 | .622±.033 | .873±.043 | .477±.035 | .194±.019 | .710±.029 | .134±.017 |
a.p. | .715±.032 | .324±.029 | .181±.027 | .263±.020 | .611±.023 | .350±.022 | .708±.026 |
Letter Frost | | | | | | | |
h.l. | .136±.014 | .166±.010 | .200±.013 | .139±.007 | .139±.010 | .154±.013 | .136±.010 |
o.e. | .151±.041 | .228±.056 | .380±.064 | .257±.101 | .288±.077 | .581±.045 | .203±.055 |
co. | .375±.042 | .857±.032 | .906±.019 | .728±.038 | .463±.035 | .884±.028 | .372±.038 |
r.l. | .134±.019 | .580±.033 | .705±.036 | .478±.030 | .199±.018 | .810±.101 | .138±.019 |
a.p. | .704±.034 | .358±.030 | .264±.028 | .235±.014 | .612±.027 | .226±.060 | .686±.035 |
MSRC v2 | | | | | | | |
h.l. | .100±.007 | .140±.006 | .086±.004 | N/A | .131±.007 | .084±.003 | .110±.004 |
o.e. | .295±.025 | .415±.026 | .341±.031 | N/A | .440±.031 | .320±.029 | .302±.028 |
co. | .238±.014 | .837±.018 | .254±.015 | N/A | .312±.020 | .256±.018 | .239±.013 |
r.l. | .108±.009 | .675±.017 | .131±.010 | N/A | .165±.013 | .125±.011 | .107±.007 |
a.p. | .688±.017 | .326±.016 | .666±.018 | N/A | .591±.018 | .685±.018 | .687±.013 |
Reuters | | | | | | | |
h.l. | .028±.004 | .043±.004 | .032±.003 | N/A | .034±.004 | .042±.004 | .037±.003 |
o.e. | .044±.008 | .077±.011 | .057±.010 | N/A | .065±.011 | .100±.015 | .055±.007 |
co. | .035±.004 | .089±.010 | .036±.004 | N/A | .043±.004 | .050±.006 | .036±.004 |
r.l. | .014±.004 | .062±.008 | .016±.003 | N/A | .023±.004 | .031±.005 | .016±.003 |
a.p. | .972±.005 | .922±.008 | .966±.006 | N/A | .958±.006 | .939±.009 | .967±.005 |
Bird Song | | | | | | | |
h.l. | .073±.009 | .116±.005 | .098±.011 | N/A | .081±.007 | .073±.005 | .087±.008 |
o.e. | .055±.017 | .101±.020 | .159±.039 | N/A | .122±.029 | .111±.025 | .064±.046 |
co. | .150±.013 | .292±.015 | .186±.018 | N/A | .175±.015 | .173±.013 | .133±.011 |
r.l. | .036±.007 | .132±.010 | .067±.012 | N/A | .059±.010 | .054±.006 | .027±.008 |
a.p. | .921±.014 | .786±.013 | .847±.026 | N/A | .878±.017 | .888±.011 | .930±.025 |
Scene | | | | | | | |
h.l. | .188±.009 | .269±.009 | .194±.005 | N/A | .196±.007 | .200±.008 | .204±.007 |
o.e. | .351±.023 | .386±.025 | .351±.020 | N/A | .370±.018 | .380±.021 | .392±.019 |
co. | .207±.012 | .334±.011 | .204±.008 | N/A | .222±.009 | .225±.010 | .237±.010 |
r.l. | .189±.014 | .348±.012 | .185±.010 | N/A | .207±.011 | .212±.011 | .222±.010 |
a.p. | .770±.015 | .600±.013 | .772±.012 | N/A | .757±.011 | .750±.012 | .738±.011 |

We perform the experiments on 6 moderate-sized data sets and 2 large data sets. Among the moderate-sized data sets, Scene and Reuters are two benchmark data sets commonly used in existing MIML works. Scene ZZ07 () consists of 2000 images for scene classification, and is associated with 5 possible labels: desert, mountains, sea, sunset and trees. For each image, a bag of 9 instances is extracted via SBN MR98 (). Reuters is constructed based on the Reuters-21578 data set S02 () with the sliding window technique in ATH02 (). The other four moderate-sized data sets were collected by Fern et al. in their recent work BFR12 (): Letter Carroll and Letter Frost are constructed using the UCI Letter Recognition data set FS91 (), where a bag is created for each word, and labels correspond to the letters. Bird Song consists of bird song recordings at the H. J. Andrews (HJA) Experimental Forest. Each bag is extracted from a 10-second audio recording, while labels correspond to species of birds. MSRC v2 is a subset of the Microsoft Research Cambridge (MSRC) image data set WCM05 (). Based on the ground-truth segmentation, histograms of gradients and colors are extracted to form an instance for each segment. The two large data sets are Corel5K and MSRA. Corel5K DBDF02 () contains 5000 segmented images and 260 class labels, and each image is represented by 9 instances on average. MSRA LWH09 () is a multimedia database collected by Microsoft Research Asia; the subset used in this work contains 30000 images with 99 possible labels, and each image is represented with a bag of 9 instances. The detailed characteristics of these data sets are summarized in Table 1.

For MSRA and Corel5K, since existing MIML approaches cannot handle large scale data, we examine the performances of compared approaches on a series of subsets with different number of training bags (which will be specified later). For each data set, of the data are randomly sampled for training, and the remaining examples are taken as test set. We repeat the random data partition for thirty times, and report the average results over the thirty repetitions.

For MIMLfast, the step size is in the form according to B10 (). The parameters are selected by 3-fold cross validation on the training data with regard to ranking loss. The candidate values for the parameters are as below: , , , and . In our experience, the algorithm is not very sensitive to and ; and the influence of will be studied in Section 3.5. For the compared approaches, parameters are determined in the same way if no value suggested in their literatures.

The performance of the compared approaches is evaluated with five commonly used MIML criteria: hamming loss, one error, coverage, ranking loss and average precision. For average precision, a larger value implies better performance, while for the other four criteria, the smaller, the better. Note that coverage is normalized by the number of labels such that all criteria are in the interval $[0, 1]$. The definitions of these criteria can be found in SS00 (); ZZHL12 ().
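For concreteness, three of the ranking-based criteria can be sketched per bag as below. The formulas follow the commonly used multi-label definitions and are assumptions here, since the exact definitions are deferred to the cited references; coverage is divided by the number of labels to match the normalization mentioned above. Dataset-level values would be averages of these per-bag values.

```python
from typing import Dict, Set


def one_error(scores: Dict[str, float], relevant: Set[str]) -> float:
    """1.0 if the top-ranked label is not a relevant label, else 0.0."""
    top = max(scores, key=scores.get)
    return 0.0 if top in relevant else 1.0


def coverage(scores: Dict[str, float], relevant: Set[str]) -> float:
    """How far down the ranked list one must go to cover all relevant labels,
    counted from 0 and normalized by the number of labels."""
    if not relevant:
        return 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    deepest = max(ranked.index(lab) for lab in relevant)
    return deepest / len(scores)


def ranking_loss(scores: Dict[str, float], relevant: Set[str]) -> float:
    """Fraction of (relevant, irrelevant) pairs that are ordered wrongly."""
    irrelevant = [lab for lab in scores if lab not in relevant]
    if not relevant or not irrelevant:
        return 0.0
    bad = sum(1 for r in relevant for i in irrelevant if scores[r] <= scores[i])
    return bad / (len(relevant) * len(irrelevant))


scores = {"sea": 0.9, "sunset": 0.2, "trees": 0.5, "desert": 0.1}
relevant = {"sea", "sunset"}
print(one_error(scores, relevant), coverage(scores, relevant), ranking_loss(scores, relevant))
```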

### 3.2 Performance Comparison

We first report the comparison results on the six moderate-sized data sets in Table 2. As shown in the table, our approach MIMLfast achieves the best performance in most cases. DBA tends to favor text data, and is outperformed by MIMLfast on all the data sets. KISAR achieves results comparable with MIMLfast on Scene but is less effective on the other data sets. MIMLBoost can handle only the two smallest data sets, and does not yield good performance. MIMLkNN and MIMLSVM work steadily on all the data sets, but are not competitive when compared with MIMLfast. Finally, RankLossSIM is comparable to MIMLfast on 4 of the 6 data sets, and even achieves better coverage and ranking loss on the Bird Song data set. However, on the other two data sets with relatively more bags, i.e., Reuters and Scene, it is significantly worse than our approach on all five criteria.

MSRA and Corel5K contain 30000 and 5000 bags respectively, which are too large for most existing MIML approaches. We thus perform the comparison on subsets of them with different data sizes. We vary the number of bags from 1000 to 5000 for Corel5K, and from 5000 to 30000 for MSRA, and plot the performance curves in Figures 1 and 2, respectively. MIMLBoost did not return results in 24 hours even for the smallest data size, and thus it is not included in the comparison. RankLossSIM is not presented on MSRA for the same reason. We also exclude DBA on MSRA because its performance is too poor. As can be observed in Figures 1 and 2, MIMLfast is clearly better than the others on these two large data sets. In particular, when the data size reaches 25K, the other methods cannot work, but MIMLfast still works well.

### 3.3 Efficiency Comparison

It is crucial to study the efficiency of the compared MIML approaches, because our basic motivation is to develop a method that can work on large scale MIML data. All the experiments are performed on a machine with GHz CPUs and 32GB main memory. Again, we first show the time cost of each algorithm on the six moderate-sized data sets in Figure 3. Since the results on the two smallest data sets Letter Carroll and Letter Frost are similar, we take one of them as representative to save space. Obviously, our approach is the most efficient one on all the data sets. MIMLBoost is the most time-consuming one, followed by RankLossSIM and MIMLkNN.

The superiority of our approach is more pronounced on larger data sets. As shown in Figure 4, on Corel5K, MIMLBoost failed to return a result in 24 hours even with the smallest subset, while RankLossSIM can handle only 1000 examples. The time costs of existing methods increase dramatically as the data size increases. In contrast, MIMLfast takes only 1 minute even for the largest size in Figure 4(a). In Figure 4(b), on the largest MSRA data, the superiority of MIMLfast is even more apparent. None of the existing approaches can deal with more than 20K examples. In contrast, on data of 20,000 bags and 180,000 instances, MIMLfast is more than 100 times faster than the most efficient existing approach; when the data size becomes larger, none of the existing approaches can return a result in 24 hours, and MIMLfast takes only 12 minutes.

### 3.4 Key Instance Detection

In MIML, a set of labels are assigned to a group of instances, and thus it is interesting to understand the relation between input patterns and output label semantics. Inspired by LHJZ12 (), by assuming that each label is triggered by its most positive instance, our MIMLfast approach is able to identify the key instance for each label.

We first give an intuitive evaluation of the key instance detection of MIMLfast. On MSRA, following LHJZ12 (), we first partition each image into a set of patches with k-means clustering, and then extract an instance from each cluster. In Figure 5, we show two example images, and highlight the regions corresponding to the key instance detected by our approach for each label. Note that since the image regions are obtained by clustering, an instance may correspond to multiple regions in the same cluster rather than a single region. The results clearly show that MIMLfast can detect reasonable key instances for the labels.

We also evaluate the key instance detection accuracy quantitatively. On 4 of the 8 MIML data sets, i.e., Letter Carroll, Letter Frost, MSRC v2 and Bird Song, the instance labels are available, thus providing a test bed for key instance detection. Among the existing MIML methods, RankLossSIM and KISAR are able to detect the key instance for each label, and are compared with our approach. For MIMLfast and RankLossSIM, the key instance for a specific label is identified by selecting the instance with the maximum prediction value on that label, while for KISAR, the key instance is the one closest to the prototype of the label as in LHJZ12 (). We examine the ground truth of the detected key instances and present the accuracies in Table 3. We can observe that KISAR is less accurate than the other two methods, probably because it does not build the model on the instance level, and detects key instances based on unsupervised prototypes. When compared with RankLossSIM, which is specially designed for instance annotation, our approach is more accurate on the two larger data sets, comparable on Letter Carroll, and slightly worse on Letter Frost.
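One way to compute such an accuracy is sketched below. It assumes that a detection counts as correct when the ground-truth label of the selected instance equals the bag label under consideration; this scoring rule and the data layout (`labels`, `instance_labels`, `scores`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, TypedDict


class Bag(TypedDict):
    labels: List[str]               # relevant labels of the bag
    instance_labels: List[str]      # ground-truth label of each instance
    scores: Dict[str, List[float]]  # label -> score of each instance


def key_instance_accuracy(bags: List[Bag]) -> float:
    """Fraction of (bag, label) pairs whose top-scoring instance carries that label."""
    correct = 0
    total = 0
    for bag in bags:
        for lab in bag["labels"]:
            inst_scores = bag["scores"][lab]
            key = max(range(len(inst_scores)), key=inst_scores.__getitem__)
            correct += int(bag["instance_labels"][key] == lab)
            total += 1
    return correct / total if total else 0.0


bags: List[Bag] = [
    {"labels": ["a", "b"],
     "instance_labels": ["a", "b", "a"],
     "scores": {"a": [0.9, 0.1, 0.3], "b": [0.2, 0.1, 0.8]}},
    {"labels": ["c"],
     "instance_labels": ["c", "c"],
     "scores": {"c": [0.1, 0.4]}},
]
accuracy = key_instance_accuracy(bags)
print(accuracy)
```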

Table 3: Key instance detection accuracy (mean±std.).

Data | MIMLfast | KISAR | RankLossSIM |
---|---|---|---|
Letter Carroll | 0.67±0.03 | 0.41±0.03 | 0.67±0.03 |
Letter Frost | 0.67±0.03 | 0.47±0.04 | 0.70±0.03 |
MSRC v2 | 0.66±0.03 | 0.62±0.03 | 0.64±0.02 |
Bird Song | 0.58±0.04 | 0.31±0.03 | 0.42±0.02 |

### 3.5 Sub-Concept Discovery

To examine the effectiveness of sub-concept discovery, we run MIMLfast with a varying number of sub-concepts $K$ on the two benchmark data sets: Scene for image classification and Reuters for text categorization. Table 4 presents the results with $K$ set to 1, 5, 10 and 15. For each value of $K$, we run 10-fold cross validation and report the average results as well as standard deviations. Note that $K$ is selected by cross validation on the training data in Section 3.2. As shown in Table 4, compared with neglecting the sub-concepts ($K = 1$), the exploitation of sub-concepts is helpful ($K = 5$, 10 and 15 are all better than $K = 1$). When $K$ gets larger, the difference between results with different values is not very significant. This may be because, if we set $K$ to a value larger than what is really needed, some sub-concepts might capture no examples, and thus an overly large $K$ will not degrade the performance too much, although it might hamper the efficiency.

We further examine the sub-concepts discovered by MIMLfast. We take the Scene data set as an illustration and show some example images of the top-four sub-concepts discovered for the label sea in Figure 6. It is interesting to see that these four sub-concepts are with reasonable but different perceptions: the first sub-concept corresponds to sea with beach and blue sky, the second sub-concept corresponds to big wave in the sea, etc.

Table 4: Results of MIMLfast (mean±std.) with different numbers of sub-concepts.

# sub-concepts | 1 | 5 | 10 | 15 |
---|---|---|---|---|
Scene | | | | |
hamming loss | .191±.011 | .186±.009 | .182±.014 | .181±.011 |
one error | .366±.038 | .354±.026 | .338±.030 | .344±.031 |
coverage | .224±.018 | .213±.015 | .202±.017 | .210±.014 |
ranking loss | .209±.020 | .196±.017 | .184±.018 | .192±.016 |
average precision | .754±.023 | .764±.018 | .777±.020 | .769±.019 |
Reuters | | | | |
hamming loss | .027±.008 | .026±.006 | .025±.006 | .025±.007 |
one error | .042±.013 | .040±.009 | .037±.007 | .040±.010 |
coverage | .036±.007 | .035±.006 | .034±.006 | .035±.007 |
ranking loss | .015±.006 | .014±.005 | .013±.005 | .014±.006 |
average precision | .972±.010 | .974±.007 | .976±.006 | .974±.008 |

Table 5: Comparison of MIMLfast with its two variants V1 and V2 (mean±std.).

Criterion | MIMLfast | V1 | V2 |
---|---|---|---|
Scene | | | |
hamming loss | .188±.009 | .211±.009 | .196±.012 |
one error | .351±.023 | .409±.023 | .358±.030 |
coverage | .207±.012 | .239±.011 | .208±.014 |
ranking loss | .189±.014 | .228±.013 | .192±.016 |
avg. precision | .770±.015 | .730±.014 | .767±.018 |
Reuters | | | |
hamming loss | .028±.004 | .038±.004 | .035±.003 |
one error | .044±.008 | .060±.011 | .046±.010 |
coverage | .035±.004 | .038±.005 | .035±.004 |
ranking loss | .014±.004 | .019±.004 | .015±.003 |
avg. precision | .972±.005 | .963±.007 | .971±.006 |

### 3.6 Comparison with Variants

To further examine how MIMLfast works, we study two variants, V1 and V2. V1 gives up the shared space mapping in Eq. 1 and directly learns a linear model for each label. It is constructed to examine whether learning the shared space is helpful. V2 simply selects the top-ranked labels as relevant ones, with the number of selected labels set to the average number of relevant labels on the training data. It is constructed to examine whether the dummy label provides a good separation of relevant and irrelevant labels.

Table 5 shows the results on the two benchmark data sets. V1 is significantly worse than MIMLfast on all criteria, implying that learning the shared space for all the labels is better than learning each label independently. On hamming loss, MIMLfast achieves significantly better performance than V2, while on the other four criteria, they achieve comparable performance, implying that the use of the dummy label does not affect the ranking of the labels but provides a reasonable separation of relevant and irrelevant labels.

## 4 Related Work

Many MIML approaches have been proposed during the past few years. For example, MIMLSVM ZZ07 () degenerated the MIML problem into single-instance multi-label tasks. MIMLBoost ZZ07 () degenerated MIML to multi-instance single-label learning. A generative model for MIML was proposed by Yang et al. YZH09 (). Nearest neighbor and neural network approaches for MIML were proposed in Z10 () and ZW09 (), respectively. Zha et al. ZHMWQW08 () proposed a hidden conditional random field model for MIML image annotation. Briggs et al. BFR12 () proposed to optimize ranking loss for MIML instance annotation. In LHJZ12 (), the authors tried to discover what patterns trigger what labels in MIML learning by constructing a prototype for each label with clustering. Existing MIML approaches have achieved success in many applications, mostly with moderate-sized data owing to the high computational load. To handle large-scale data, MIML approaches with high efficiency are demanded.

In WBU11 (), a similar technique was used to optimize the WARP loss for image annotation; however, it dealt with the single-instance single-label problem, which is quite different from our MIML problem. In ZZHL12 (), an approach for discovering sub-concepts of complicated concepts was proposed based on clustering. However, it focused on single-label learning, which is quite different from our MIML task. Moreover, MIMLfast exploits label information and discovers sub-concepts using a supervised model rather than heuristic clustering.

## 5 Conclusion

MIML is a framework for learning with complicated objects, and has been proved to be effective in many applications. However, existing MIML approaches are usually too time-consuming to deal with large-scale problems. In this paper, we propose the MIMLfast approach to learn with MIML examples fast. On one hand, efficiency is highly improved by optimizing the approximated ranking loss with SGD based on a two-level linear model; on the other hand, effectiveness is achieved by exploiting label relations in a shared space and discovering sub-concepts for complicated labels. Moreover, our approach can naturally detect the key instance for each label, thus providing a chance to discover the relation between input patterns and output label semantics. In the future, we will try to optimize other loss functions rather than ranking loss. Also, larger-scale problems will be studied.

## References

- [1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 561–568. MIT Press, Cambridge, MA, 2002.
- [2] S. Ben-David, D. Loker, N. Srebro, and K. Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, 2012.
- [3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 2010.
- [4] F. Briggs, X.Z. Fern, and R. Raich. Rank-loss support instance machines for MIML instance annotation. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 534–542, Beijing, China, 2012.
- [5] T.G. Dietterich, R.H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71, 1997.
- [6] P. Duygulu, K. Barnard, J.F.G. de Freitas, and D.A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the 7th European Conference on Computer Vision, pages 97–112, Copenhagen, Denmark, 2002.
- [7] P.W. Frey and D.J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2):161–182, 1991.
- [8] J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, and K. Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
- [9] R. Jin, S. Wang, and Z.H. Zhou. Learning a distance metric from multi-instance multi-label data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 896–902, Miami, FL, 2009.
- [10] H. Li, M. Wang, and X.S. Hua. MSRA-MM 2.0: A large-scale web multimedia dataset. In Proceedings of the IEEE International Conference on Data Mining Workshops, pages 164–169, 2009.
- [11] Y.-F. Li, J.-H. Hu, Y. Jiang, and Z.-H. Zhou. Towards discovering what patterns trigger what labels. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, pages 1012–1018, Toronto, Canada, 2012.
- [12] Y.X. Li, S. Ji, S. Kumar, J. Ye, and Z.H. Zhou. Drosophila gene expression pattern annotation through multi-instance multi-label learning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1445–1450, Pasadena, CA, 2009.
- [13] J. Luo and F. Orabona. Learning from candidate labeling sets. In Advances in Neural Information Processing Systems 23. MIT Press, Cambridge, MA, 2010.
- [14] O. Maron and A.L. Ratan. Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conference on Machine Learning, pages 341–349, Madison, WI, 1998.
- [15] N. Nguyen. A new SVM approach to multi-instance multi-label learning. In Proceedings of the 10th IEEE International Conference on Data Mining, pages 384–392, Sydney, Australia, 2010.
- [16] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- [17] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000.
- [18] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
- [19] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th International Conference on Machine Learning, pages 1057–1064, Montreal, Canada, 2009.
- [20] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pages 2764–2770, Barcelona, Spain, 2011.
- [21] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In 10th IEEE International Conference on Computer Vision, pages 1800–1807, Beijing, China, 2005.
- [22] S.H. Yang, H. Zha, and B.G. Hu. Dirichlet-Bernoulli alignment: A generative model for multi-class multi-label multi-instance corpora. In Advances in Neural Information Processing Systems 22, pages 2143–2150. MIT Press, Cambridge, MA, 2009.
- [23] Z.J. Zha, X.S. Hua, T. Mei, J. Wang, G.J. Qi, and Z. Wang. Joint multi-label multi-instance learning for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, Anchorage, AK, 2008.
- [24] M.-L. Zhang. A k-nearest neighbor based multi-instance multi-label learning algorithm. In Proceedings of the 22nd IEEE International Conference on Tools with Artificial Intelligence, pages 207–212, Arras, France, 2010.
- [25] M.-L. Zhang and Z.-J. Wang. MIMLRBF: RBF neural networks for multi-instance multi-label learning. Neurocomputing, 72(16):3951–3956, 2009.
- [26] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In Advances in Neural Information Processing Systems 19, pages 1609–1616. MIT Press, Cambridge, MA, 2007.
- [27] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li. Multi-instance multi-label learning. Artificial Intelligence, 176(1):2291–2320, 2012.