Infinite-Label Learning with Semantic Output Codes

Yang Zhang, Rupam Acharyya, Ji Liu, Boqing Gong

Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816
yangzhang@knigths.ucf.edu, bgong@crcv.ucf.edu
Department of Computer Science, University of Rochester, Rochester, NY 14627
{racharyy,jliu}@cs.rochester.edu

Abstract

We formalize a new statistical machine learning paradigm, called infinite-label learning, that annotates a data point with more than one relevant label from a candidate set which pools both the finite labels observed at training and a potentially infinite number of previously unseen labels. Infinite-label learning fundamentally expands the scope of conventional multi-label learning and better meets the practical requirements of various real-world applications, such as image tagging, ads-query association, and article categorization. However, how can we learn a labeling function capable of assigning to a data point the labels omitted from the training set? To answer this question, we seek clues from recent work on zero-shot learning, where the key is to represent a class/label by a vector of semantic codes, as opposed to treating it as an atomic label. We validate infinite-label learning with a PAC-learning bound in theory and with empirical studies on both synthetic and real data.
1 Introduction
Recent years have witnessed a great surge of work on zero-shot learning [1, 2, 3, 4, 5, 6, 7, 8]. While conventional supervised learning methods (for classification problems) learn classifiers only for the classes that are supported by data examples in the training set, zero-shot learning also strives to construct classifiers for novel classes for which there is no training data at all. This is achieved by using class/label descriptors, often in the form of vectors of numeric or symbolic values, in contrast to the atomic classes used in many other machine learning settings.
Zero-shot learning is appealing because it may bring machine learning closer to one of the remarkable capabilities of human learning: generalizing learned concepts from old tasks to not only new data but also new (and yet similar) tasks [9, 10]. There are also many real-world applications of zero-shot learning. It facilitates cold-start [11] when the data of some classes are not yet available. In computer vision, the number of available images per object follows a long-tail distribution, and zero-shot learning becomes key to handling the rare classes. We note that the success of zero-shot learning largely depends on how much knowledge the label descriptors encode for relating the novel classes to those seen at training, as well as how that knowledge is involved in the decision-making process, e.g., for classification.
Despite the extensive existing work on zero-shot learning for multi-class classification, it has rarely been studied for multi-label classification [12, 13, 14], except by [15, 16]. We contend that it is actually indispensable for multi-label learning algorithms to handle previously unseen labels. New topics constantly arise in news (and research, education, etc.) articles over time. Creative hashtags can become popular on social networks overnight. There are about 53M tags on Flickr, and many of them are associated with no or very few images. Furthermore, it is often laborious and costly to collect clean training data for multi-label learning problems, especially when there are many labels [17, 14]. Hence, we contend that zero-shot multi-label learning, which we call infinite-label learning to make it concise and interesting, addresses an arguably much more pressing need than zero-shot multi-class classification.
To this end, we provide a formal treatment of the infinite-label learning problem in this paper. To the best of our knowledge, the two existing works [15, 16] on this problem only study specific application scenarios. Zhang et al. analyze the rankability of word embeddings [18, 19] and then use them to assign both seen and unseen tags to images [16]. Nam et al. propose a regularization technique for the WSABIE algorithm [20] to account for label hierarchies [15]. In sharp contrast, we formalize infinite-label learning as a general machine learning framework and show its feasibility both theoretically, under some mild assumptions, and empirically, on synthetic and real data. Our results indicate that infinite-label learning is hard yet not impossible to solve, given appropriate semantic codes of the labels.
The semantic codes of the labels can be derived from a large knowledge base in addition to the training set. Thanks to the studies in linguistics, words in speech recognition are represented by combinations of phonemes [21]. In computer vision, an object class is defined by a set of visual attributes [5, 22]. More recently, the distributed representations of English words [18, 19] have found their applications in a variety of tasks. In biology, the labels are naturally represented by some structural or vectorial codes (e.g., protein structures [23]).
We organize the remainder of the paper as follows. We discuss related work and machine learning settings next. After that, Section 3 formally states the problem of infinite-label learning, followed by some plausible modeling assumptions and algorithmic solutions. We then present a PAC-learning bound and some empirical results for this new machine learning framework in Sections 4 and 5, respectively. Section 6 concludes the paper.
2 Related work
We note that the semantic codes of labels were actually used in various machine learning settings even before zero-shot learning became popular [1, 2]. Bonilla et al. use vectorial representations of tasks to enrich the kernel function and parameterize a gating function [24]. Similarly, Bakker and Heskes input the task vectors to a neural network for task clustering and gating [25]. Wimalawarne et al. address multi-task learning when the tasks are each indexed by a pair of indices [26]. A large-margin algorithm is introduced in [27] to transform classification problems into multivariate regression by considering the semantic codes of the classes as the regression target. It is worth pointing out that this line of work, unlike zero-shot learning or the infinite-label learning studied in this paper, does not explore the learned models' generalization capabilities to novel labels or tasks unseen in training.
Learning to rank [28, 29] and the cold-start problem in recommender systems [30, 31, 32] explicitly attempt to tackle unseen labels (e.g., queries and items, respectively) thanks to label descriptions that are often vectorial. Zero-shot learning shares the same spirit [1, 2, 5, 3, 4, 7, 6, 8, 33, 34], and we particularly study infinite-label learning in this paper.
Additionally, the semantic codes of labels or tasks also seamlessly unify a variety of machine learning settings [35, 23]. The infinite-label learning studied in this paper can be regarded as a special case of pairwise learning; in particular, a special case of setting D in [23]. Nonetheless, we think infinite-label learning is worth studying as a standalone machine learning framework given its great application potential. Besides, the PAC-learning analysis presented in this paper sheds light on the learnability of pairwise learning [23].
3 Infinite-label learning
The goal of infinite-label learning is to learn a decision function that annotates a data point with all relevant labels from a candidate set which pools both the finite labels observed at training and a potentially infinite number of previously unseen labels. In this section, we formally state the problem and then lay out a modeling assumption about the data generation process. We also provide some potential algorithmic solutions to infinite-label learning.
Table 1: Conceptual comparison between multi-label learning and infinite-label learning (ZSML), covering the data generation process, training data, labeling function, and an example solution.
3.1 Problem statement
Suppose we have a set of vectorial labels $\{b_m \in \mathbb{R}^s\}_{m=1}^{M}$ and a training sample $\{(x_n, \{y_{nm}\}_{m=1}^{M})\}_{n=1}^{N}$, where the annotation $y_{nm} = 1$ indicates that the $m$-th label, which is described by $b_m$, is relevant to the $n$-th data point $x_n \in \mathbb{R}^d$, and $y_{nm} = -1$ otherwise. For convenience, we refer to the label vectors $b_m$ as labels in the rest of the paper. Unlike conventional multi-label learning, where one learns a decision function to assign the seen labels $\{b_m\}_{m=1}^{M}$ to an input data point, infinite-label learning enables the decision function to additionally tag the data point with previously unseen labels $b \in \mathbb{R}^s$, where $b \notin \{b_m\}_{m=1}^{M}$.
Table 1 conceptually compares infinite-label learning with multi-label learning. The rows on training data and labeling functions summarize the above discussion, and the other rows are elaborated in the next two subsections, respectively.
3.2 The non-i.i.d. training set
One of our objectives in this paper is to derive a PAC-learning bound for infinite-label learning. To facilitate that, we carefully examine the data generation process of infinite-label learning in this subsection.
At first glance, one might impose a joint distribution $P(x, b, y)$ over the data $x$, the vectorial label $b$, and the indicator $y$ of the relevance between the label and the data, and then assume the training sample (the pairs $(x, b)$ and indicators $y$ jointly) is drawn i.i.d. from $P$. Noting that this setup implies nothing but a binary classification over the augmented data $(x, b)$, the existing PAC-learning results could be used to bound the generalization error of infinite-label learning.
The above reasoning is actually problematic because its corresponding generalization risk cannot be estimated empirically in real scenarios and applications. To see this point more clearly, we write out the generalization risk and its straightforward empirical estimate below,
$R(h) = \mathbb{E}_{(x, b, y) \sim P}\, \mathbb{1}\left[h(x, b) \neq y\right], \quad (1)$

$\widehat{R}(h) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[h(x_n, b_n) \neq y_n\right], \quad (2)$

where $h$ is a hypothesis from a prefixed family $\mathcal{H}$ and $\mathbb{1}[\cdot]$ denotes the 0-1 loss. We can see that, under the i.i.d. assumption, the data $x_n$ and label $b_n$ must be sampled simultaneously, i.e., they are exclusively coupled.
However, the seen labels and data in the training set are often examined pairwise and are not exclusively coupled. For example, to build the NUS-WIDE dataset [39], the annotators are asked to judge the relevance between all 81 tags and each image. Such practice instead suggests an alternative empirical risk,

$\widehat{R}'(h) = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \mathbb{1}\left[h(x_n, b_m) \neq y_{nm}\right],$

which couples every data point $x_n$ with every seen label $b_m$.
Moreover, the decoupling of data examples and labels in this empirical risk also implies that the training set in infinite-label learning (and multi-label learning) is virtually generated from the joint distribution $P(x, b, y)$ in the following non-i.i.d. manner.

1. Sample $N$ times from $P(x)$ and obtain $\{x_n\}_{n=1}^{N}$;
2. Sample $M$ times from $P(b \mid x_1, \dots, x_N)$ and obtain $\{b_m\}_{m=1}^{M}$;
3. Sample $y_{nm}$ from $P(y \mid x_n, b_m)$ for $n = 1, \dots, N$ and $m = 1, \dots, M$.
As a result, the training set is a non-i.i.d. sample drawn from $P(x, b, y)$. The existing PAC-learning bounds for binary classification cannot be directly applied here. Instead, we would like to bound the difference between the generalization risk and the non-i.i.d. empirical risk. Overcoming this disparity is accordingly another contribution of this paper, presented in Section 4.
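The three-step generation process above can be sketched in a few lines of Python; the particular Gaussian distributions and the sign-based conditional $P(y \mid x, b)$ below are illustrative stand-ins, not the paper's actual distributions:

```python
import random

def sample_training_set(N, M, d=2, seed=0):
    """Simulate the non-i.i.d. generation process of Section 3.2:
    (1) draw N data points x_n from P(x);
    (2) draw M label vectors b_m from P(b) -- here independent of the x's,
        anticipating Assumption I;
    (3) draw each indicator y_nm from P(y | x_n, b_m).
    The Gaussians and the sign-based conditional are toy choices."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
    bs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    # a toy P(y | x, b): relevant iff the inner product is positive
    ys = [[1 if dot(x, b) > 0 else -1 for b in bs] for x in xs]
    return xs, bs, ys

xs, bs, ys = sample_training_set(N=5, M=3)
```

Note that the indicators $y_{nm}$ form an $N \times M$ grid over the sampled data and labels, which is exactly why they are not i.i.d. triplets.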
Remarks.
We note that the joint distribution $P(x, b, y)$ is in general not sufficient to specify the conditional distribution $P(b \mid x_1, \dots, x_N)$ that appears in the above sampling procedure. To rectify this, we make an independence assumption detailed in Section 4.
3.3 Views of the problem and algorithmic solutions
Given the sampling procedure described above, solving infinite-label learning is essentially to approximate the conditional distribution $P(y \mid x, b)$. One potential hypothesis set for the decision function takes the simple bilinear form $f(x, b) = \mathrm{sign}(x^\top W b)$. Following the discussion of zero-data learning [1], we may understand this decision function from at least two views. From the data point of view, it is a decision function defined over the augmented input data $(x, b)$. From the model point of view, it defines a hyperplane $W^\top x$ that separates the labels into two subsets, regardless of whether they are seen or unseen: one subset of the labels is relevant to the data instance and the other is not, because the hyperplane is indexed by the input data. To learn the model parameters $W$, maximum margin regression [27] may be readily applied.
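A minimal sketch of the bilinear decision rule, with a hypothetical toy parameter matrix $W$; any candidate label vector, seen or unseen, can be scored:

```python
def bilinear_score(x, W, b):
    """f(x, b) = x^T W b: the data point x indexes a hyperplane W^T x in the
    label space, separating relevant from irrelevant labels."""
    Wb = [sum(W[i][j] * b[j] for j in range(len(b))) for i in range(len(x))]
    return sum(x[i] * Wb[i] for i in range(len(x)))

def assign_labels(x, W, labels):
    """Tag x with every candidate label (seen or unseen) with a positive score."""
    return [m for m, b in enumerate(labels) if bilinear_score(x, W, b) > 0]

W = [[1.0, 0.0], [0.0, -1.0]]                  # toy 2x2 parameter matrix (hypothetical)
labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # label vectors; any of them may be unseen
print(assign_labels([1.0, 1.0], W, labels))    # scores 1.0, -1.0, 0.0 -> [0]
```

For a fixed $x$, the vector $W^\top x$ acts as the normal of a hyperplane in the label space, which is the "model view" described above.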
We can also use a neural network $g(\cdot)$ to infer the hyperplanes in the label space, such that $f(x, b) = \mathrm{sign}(g(x)^\top b)$. Additionally, the kernel methods reviewed in [23] for pairwise learning are natural choices for solving infinite-label learning. In general, the kernel function gives rise to decision functions of the form $f(x, b) = \mathrm{sign}\big(\sum_{m} \alpha_m(x)\, k(b_m, b)\big)$, where $\alpha_m(\cdot)$ takes a data point as input and outputs a scalar, and $k(\cdot, \cdot)$ is a kernel function over the labels.
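A kernel-based scorer can be sketched similarly; the RBF kernel and the coefficient functions below are hypothetical choices, used only to show how an unseen label is scored through its kernel similarity to the seen labels:

```python
import math

def rbf(b1, b2, gamma=1.0):
    """RBF kernel over label vectors."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(b1, b2)))

def kernel_score(x, seen_labels, alphas, b, gamma=1.0):
    """Score f(x, b) = sum_m alpha_m(x) * k(b_m, b): an unseen label b is
    scored through its kernel similarity to the seen labels b_m.
    The data-dependent coefficients alpha_m are toy stand-ins here."""
    return sum(alphas[m](x) * rbf(bm, b, gamma)
               for m, bm in enumerate(seen_labels))

seen = [[0.0, 0.0], [1.0, 1.0]]
alphas = [lambda x: x[0], lambda x: -x[1]]     # illustrative coefficient functions
score = kernel_score([2.0, 1.0], seen, alphas, [0.1, 0.1])
```

A label close to a seen label with a large positive coefficient receives a high score, which is how kernel similarity in the label space drives generalization to unseen labels.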
4 A PAC-learning analysis
In this section, we investigate the learnability of infinite-label learning under the PAC-learning framework. Given the training set $\{(x_n, \{y_{nm}\}_{m=1}^{M})\}_{n=1}^{N}$ and the seen vectorial labels $\{b_m\}_{m=1}^{M}$, the theorem below sheds light on the numbers of data points and labels that are necessary to achieve a particular level of generalization error, with respect to not only test data examples but also previously unseen labels.
Our result depends on two mild assumptions.
Assumption I: $P(b \mid x) = P(b)$.
Recall the second step of the sampling procedure in Section 3.2. The conditional distribution $P(b \mid x_1, \dots, x_N)$ is in general intractable from the joint distribution $P(x, b, y)$ alone. To rectify this, we introduce our first assumption: knowing the data does not alter the marginal distribution of the labels, i.e., $P(b \mid x) = P(b)$. This is reasonable for infinite-label learning, especially when the label descriptions come from a knowledge base distinct from the training set; for example, the word embeddings used in [16] are learned without accessing the images, and the label hierarchies in [15] are given independently of the news articles.
Thanks to Assumption I, we arrive at a simplified model of the infinite-label learning problem. Table 1 draws the corresponding graphical model, from which we can see that the distribution over the indicator variable $y$ differs from that in the graphical model of multi-label learning. It is this change of modeling that makes the generalization to new labels feasible.
Assumption II: The conditional distribution of the indicator variable $y$ is controlled by a binary random variable $\epsilon \in \{-1, +1\}$ through $y = \epsilon \cdot y^{*}(x, b)$, where $y^{*}(x, b)$ is the noiseless indication of the relevance of the label $b$ to the data $x$.
This assumption is partially grounded in the fact that, in practice, the annotations of a training sample are often incomplete when the number of seen labels is large. Take the user-tagged data on social networks for instance. Some labels are actually relevant to the data, but their indicators are flipped to $-1$ (from the ground truth $+1$) merely because the corresponding labels fail to capture the users' attention and are thus missed by the users.
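Assumption II can be simulated as follows; using $\mathrm{sign}(x^\top W b)$ as the noiseless relevance is an illustrative choice (the synthetic experiments in Section 5.1 use the same form), and the flip probability `rho` is a free parameter of the sketch:

```python
import random

def noisy_indicator(x, b, W, rho, rng):
    """Assumption II (sketch): the observed indicator equals the noiseless
    relevance -- taken here as sign(x^T W b) -- with its sign flipped
    independently with probability rho (e.g., a user overlooking a tag)."""
    score = sum(x[i] * W[i][j] * b[j]
                for i in range(len(x)) for j in range(len(b)))
    y_clean = 1 if score > 0 else -1
    return -y_clean if rng.random() < rho else y_clean

rng = random.Random(0)
y = noisy_indicator([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.1, rng)
```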
Theorem 1.
For any $\delta \in (0, 1)$, any hypothesis $h \in \mathcal{H}$, and under Assumptions I and II, the following holds with probability at least $1 - \delta$:
(3) 
Ignoring logarithmic factors, this generalization error bound vanishes as both $N$ and $M$ converge to infinity. An immediate implication is the learnability of infinite-label learning: to obtain a certain accuracy on all labels, both seen and unseen, one does not have to observe all of them in the training phase.
Proof.
To prove the theorem, we essentially need to consider the following probability bound:
(4)  
(5) 
where the last inequality is due to the fact , and
The first equality follows Assumption II and the second is due to Assumption I as well as the independence between the binary random variable and the data .
Next we consider the bounds for the two terms (4) and (5), respectively. For the first term (4), we have the following,
(6) 
Denote by a function given and . We have . Moreover, are i.i.d. random variables in thanks to the sampling procedure described in Section 3.2. Define to be a hypothesis space
Using the new notations, we can rewrite (6) by
(7) 
where the inequality uses the growth function bound [40, Section 3.2], and is the number of maximally possible configurations (or values of ) given data points in over the hypothesis space . Since is in the form of , is bounded by with . Hence, we have
(4)  
(8) 
Next, we find the bound for term (5) which depends on Assumption I only.
(9) 
where the last inequality uses the union bound. Denote by for short. Then we have and are i.i.d. variables taking values from given . Define to be a hypothesis space
Then we can cast the probabilistic factor in (9) into
(10) 
where the last inequality uses the growth function bound again. We omit the remaining parts of the proof; the supplementary material includes the complete proof.
∎
5 Empirical studies
In this section, we continue to investigate the properties of infinite-label learning. While the theoretical result in Section 4 justifies its learnability, there are many other questions of interest to explore for the practical use of infinite-label learning. We focus on the following two questions and provide some empirical insights using synthetic data and real data, respectively.

After we learn a decision function from the training set, how many and what types of unseen labels can we confidently handle using this decision function?

What is the effect of varying the number $M$ of seen labels, given a fixed union of seen and unseen labels? Namely, given a labeling task, can we collect training data for only a subset of the labels and yet achieve good performance on all the labels at the test phase? We learn different decision functions by varying the number of seen labels and then examine how well they assign all of the candidate labels.
We use the bilinear form $f(x, b) = \mathrm{sign}(x^\top W b)$ for the decision function in the experiments. Our goal is not to achieve state-of-the-art performance on the datasets but rather to demonstrate the feasibility of solving infinite-label learning problems and to reveal some trends and insights as the ratio between the numbers of seen and unseen labels changes.
5.1 Synthetic experiments
To answer the first question, we generate synthetic data, which allows us to conveniently control the number of labels in the experiments.
Data.
We randomly sample 500 training data points and 1,000 test data points from a five-component Gaussian mixture model. We also sample 10 seen labels and an additional 2,990 unseen labels from a Gaussian distribution. Note that only the seen labels are revealed during the training stage. The mixture weights are sampled from a Dirichlet distribution, and both the means and the standard deviations of the mixture components are sampled from a standard normal distribution (using the randn function in MATLAB). Finally, we generate a "ground-truth" matrix $W$ from a standard normal distribution. The ground-truth label assignments are thus given by $y = \mathrm{sign}(x^\top W b)$ for both training and test data and for both seen and unseen labels. Following Assumption II, we randomly flip the sign of each $y$ with a fixed probability.
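The generation procedure can be sketched as follows (in pure Python rather than MATLAB; the Dirichlet draw is realized by normalizing Gamma samples, and the flip probability `rho` is a placeholder since its exact value is not stated here):

```python
import random

def synthetic_data(n_data=500, n_labels=10, d=2, s=2, rho=0.1, seed=0):
    """Sketch of the synthetic setup: data from a 5-component Gaussian mixture
    (weights ~ Dirichlet via normalized Gamma draws), label vectors from a
    Gaussian, and ground truth y = sign(x^T W b) with each sign flipped
    with probability rho (Assumption II)."""
    rng = random.Random(seed)
    g = [rng.gammavariate(1.0, 1.0) for _ in range(5)]
    weights = [gi / sum(g) for gi in g]               # Dirichlet(1,...,1) sample
    means = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]

    def draw_x():
        k = rng.choices(range(5), weights=weights)[0]  # pick a mixture component
        return [rng.gauss(means[k][i], 1.0) for i in range(d)]

    xs = [draw_x() for _ in range(n_data)]
    bs = [[rng.gauss(0, 1) for _ in range(s)] for _ in range(n_labels)]
    W = [[rng.gauss(0, 1) for _ in range(s)] for _ in range(d)]

    def label(x, b):
        score = sum(x[i] * W[i][j] * b[j] for i in range(d) for j in range(s))
        y = 1 if score > 0 else -1
        return -y if rng.random() < rho else y        # Assumption II noise

    ys = [[label(x, b) for b in bs] for x in xs]
    return xs, bs, ys
```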
Figures 1(a) and (b) show the sampled data points and labels. The training data and the seen labels are shown in red, while all the other (test) data points and labels are unseen during training. We choose low dimensions for the data and vectorial labels so that we can visualize them and gain a concrete understanding of the results to be produced.
Table 2: Image tagging results (%) on NUS-WIDE when 81, 73, and 65 of the 81 tags are seen during training.

Method           | 81 of 81 tags seen     | 73 of 81 tags seen     | 65 of 81 tags seen
                 | MiAP  P    R    F1     | MiAP  P    R    F1     | MiAP  P    R    F1
LabelEM [41]     | 47.4  26.2 44.7 33.1   | 41.8  23.4 39.8 29.4   | 38.4  21.4 36.4 26.9
ConSE [42]       | 47.5  26.5 44.9 33.2   | 46.9  26.0 44.3 32.7   | 44.9  24.3 41.5 30.7
ESZSL [7]        | 45.8  25.9 44.2 30.7   | 45.6  25.6 43.6 28.1   | 43.8  23.8 40.6 30.1
Bilinear+RankNet | 53.8  30.1 51.4 38.0   | 52.8  29.5 50.2 37.1   | 49.5  27.5 46.8 34.6
Table 3: Image tagging results (%) on NUS-WIDE when 57, 49, and 41 of the 81 tags are seen during training.

Method           | 57 of 81 tags seen     | 49 of 81 tags seen     | 41 of 81 tags seen
                 | MiAP  P    R    F1     | MiAP  P    R    F1     | MiAP  P    R    F1
LabelEM [41]     | 32.1  16.5 28.2 20.8   | 30.0  15.7 26.8 19.5   | 32.4  18.1 30.8 22.8
ConSE [42]       | 41.8  22.9 39.0 28.9   | 40.1  22.0 37.5 27.8   | 38.7  22.2 37.9 28.0
ESZSL [7]        | 41.6  22.8 38.9 28.7   | 39.6  21.7 36.9 27.3   | 38.4  21.9 37.4 27.6
Bilinear+RankNet | 46.8  26.3 44.8 33.1   | 45.1  25.8 44.0 32.5   | 41.2  23.7 40.3 29.8
Algorithm.
Given the training set of the 10 seen labels, we learn the labeling function $f(x, b) = \mathrm{sign}(x^\top W b)$ by minimizing a hinge loss,

$\min_{W} \sum_{n=1}^{N} \sum_{m=1}^{M} \max\big(0,\ 1 - y_{nm}\, x_n^\top W b_m\big),$

and then try to assign both seen and unseen labels to the test data points using the learned $f$. It is interesting to note that a similar formulation and its dual have been studied by Szedmak et al. [27], where, however, they are mainly used to reduce the complexity of multi-class classification. It is also worth pointing out that this is not a regression problem, though its form shares some similarity with (multivariate) support vector regression [43].
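A plain SGD solver for a pairwise hinge loss of this kind might look as follows; it is an illustrative optimizer (the maximum margin regression of [27] is the solver discussed in the text), and the regularization weight `lam` is an assumption of the sketch:

```python
def hinge_sgd(xs, bs, ys, epochs=20, lr=0.01, lam=0.001):
    """Fit W in f(x, b) = sign(x^T W b) by SGD on the pairwise hinge loss
    sum_{n,m} max(0, 1 - y_nm x_n^T W b_m), with a small L2 penalty."""
    d, s = len(xs[0]), len(bs[0])
    W = [[0.0] * s for _ in range(d)]
    for _ in range(epochs):
        for n, x in enumerate(xs):
            for m, b in enumerate(bs):
                score = sum(x[i] * W[i][j] * b[j]
                            for i in range(d) for j in range(s))
                for i in range(d):
                    for j in range(s):
                        grad = lam * W[i][j]
                        if ys[n][m] * score < 1:   # inside the margin
                            grad -= ys[n][m] * x[i] * b[j]
                        W[i][j] -= lr * grad
    return W
```

Once learned, $W$ scores any candidate label vector, so the same function is used for both seen and unseen labels at test time.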
We incrementally challenge the learned infinite-label model by gradually increasing the difficulty of the test phase. Namely, we rank all the labels according to their distances to the seen labels $\{b_m\}_{m=1}^{10}$, where the distance between an unseen label and the seen ones is computed by aggregating its Euclidean distances to them. We then evaluate the label assignment results for every 500 consecutive labels in the ranked list (as well as for the 10 seen labels). Arguably, the last 500 labels, which form the subset of unseen labels farthest from the seen ones, impose the biggest challenge on the learned model.
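The ranking of unseen labels can be sketched as below; taking the minimum Euclidean distance to the seen set is an assumption, since the exact aggregation is not specified here:

```python
def rank_unseen_by_distance(seen, unseen):
    """Order unseen label vectors by their distance to the seen set, here
    taken as the minimum Euclidean distance to any seen label (an assumed
    aggregation). Returns the indices of `unseen`, nearest first."""
    def dist(b, c):
        return sum((u - v) ** 2 for u, v in zip(b, c)) ** 0.5
    return sorted(range(len(unseen)),
                  key=lambda m: min(dist(unseen[m], b) for b in seen))
```

Buckets of 500 consecutive indices in this ordering then give the increasingly difficult test subsets described above.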
Results.
Figure 1(c) shows the label assignment errors for different subsets of the test labels. We run 5 rounds of experiments, each with different randomly sampled data, and report their individual results as well as the average. We borrow the Hamming loss from multi-label classification [13] as the evaluation metric: it is the fraction of entries on which the predicted label assignment of a data point disagrees with the ground-truth assignment, averaged over all data points. Note that this is inherently different from the classification accuracy/error used for evaluating multi-class classification.
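The Hamming loss over a batch of predictions can be computed as:

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of (data point, label) entries whose predicted relevance
    disagrees with the ground truth, over all candidate labels."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(t != p
                for rt, rp in zip(Y_true, Y_pred)
                for t, p in zip(rt, rp))
    return wrong / total

# e.g., 1 disagreement out of 4 entries:
print(hamming_loss([[1, -1], [1, 1]], [[1, -1], [-1, 1]]))  # -> 0.25
```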
We draw the following observations from Figure 1(c). First, infinite-label learning is feasible: we obtain decent results for up to 3,000 labels with only 10 of them seen in training. Second, when the unseen labels are not far from the seen ones, the label assignment results are on par with the performance of assigning only the seen labels to the test data (cf. the Hamming losses over the first, second, and third 500 unseen labels). Third, labels that are far from the seen labels may cause larger confusion for the infinite-label model learned from finitely many seen labels. Increasing the number of seen labels and/or data points during training can improve the model's generalization to unseen labels, as suggested by Theorem 1 and revealed in the next experiment.
5.2 Image tagging
We experiment with image tagging to seek some clues for answering the second question raised at the beginning of this section. Suppose we are given a limited budget and have to build a tagging model as good as possible under this budget. Thanks to infinite-label learning, we may compile a training set using only a subset of the labels of interest, as opposed to asking the annotators to examine all of them. Then, how many seen labels should we use in order to achieve about the same performance as using all of the labels for training? We give some empirical results to answer this question. Our experiment protocol largely follows the prior work [16].
Data.
We conduct the experiments on the NUS-WIDE dataset [39]. It has 269,648 images in total, of which we were only able to retrieve 223,821 using the provided image URLs. Among them, 134,281 are training images and the rest are test images according to the official split of NUS-WIDE. We further randomly leave out 20% of the training images as the validation set for tuning hyperparameters. Each image is represented by the normalized activations of the last fully connected layer of VGGNet-19 [44].
Each image in NUS-WIDE has been manually annotated with its relevant tags out of 81 candidate tags in total. We obtain the tags' vectorial representations using the pre-trained GloVe word vectors [19]. While all 81 tags are considered at the test stage, we randomly choose 81 (100% of the 81 tags), 73 (90%), 65 (80%), 57 (70%), 49 (60%), and 41 (50%) seen tags for training different labeling functions. If a training image has no relevant tags under one of these settings, we simply drop it from the training set.
Learning and evaluation.
Image tagging is often evaluated based on the top few tags returned by a system, assuming that users do not care about the remaining ones. We report results measured by four popular metrics: Mean image Average Precision (MiAP) [45] and the top-3 precision, recall, and F1-score. Accordingly, in order to impose the ranking property on our labeling function, we learn it using the RankNet loss [46],

$\min_{W} \sum_{n=1}^{N} \frac{1}{c_n} \sum_{m:\, y_{nm} = 1}\ \sum_{m':\, y_{nm'} = -1} \log\big(1 + \exp(x_n^\top W b_{m'} - x_n^\top W b_m)\big),$

where $c_n$ is the number of tags relevant to the $n$-th image. The hyperparameters are tuned by the MiAP results on the validation set.
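For a single image, a RankNet-style loss over (relevant, irrelevant) tag pairs can be sketched as follows; normalizing by the number of relevant tags is an assumption about the exact form used:

```python
import math

def ranknet_loss(scores, relevant):
    """RankNet-style loss for one image: each (relevant, irrelevant) tag pair
    contributes log(1 + exp(score_irrelevant - score_relevant)), averaged over
    the relevant tags. `scores` holds f(x, b_m) for every candidate tag;
    `relevant` is the set of indices of the tags relevant to this image."""
    rel = [scores[m] for m in relevant]
    irr = [s for m, s in enumerate(scores) if m not in relevant]
    c = max(len(rel), 1)
    return sum(math.log(1.0 + math.exp(sq - sp))
               for sp in rel for sq in irr) / c
```

Driving this loss down pushes every relevant tag's score above every irrelevant tag's score, which is exactly the ranking property the top-3 metrics reward.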
Baselines.
We compare our results to those of three state-of-the-art zero-shot learning methods: LabelEM [33], whose compatibility function is the same as ours except that it learns the model parameters by a structured SVM [41]; ConSE [42], which estimates the representations of the unseen labels by a weighted combination of the seen ones'; and ESZSL [7], which enjoys a closed-form solution thanks to its choice of the Frobenius norm.
The methods above were developed for multi-class classification, namely, to exclusively classify an image into one and only one of the classes. As a result, when we train them on the image tagging data, different tags may share duplicated images and thus give rise to conflicting terms in the objective functions. We resolve this issue by removing such conflicting terms during training. By doing so, we observe about 0.5%–2% absolute improvement for the baselines over blindly applying them to our infinite-label learning problem.
Results.
Tables 2 and 3 present the results of our method and the competitive baselines evaluated by the MiAP, top3 precision, recall, and F1score. We also plot the MiAPs and F1scores in Figure 2 to visualize the differences between different methods and, more importantly, the changes over different numbers of seen labels. Recall that no matter how many labels are seen in the training set, the task remains the same at the test phase, i.e., to find all relevant labels from the 81 candidate tags for each image.
We can see that the performance of all the methods decreases as fewer seen tags are used for training. However, the performance drop is fairly gentle for our method: the MiAP drops by 1% (respectively, 3%) absolutely when moving from 100% seen labels to 90% (respectively, from 90% to 80%). This confirms our conjecture that, with the semantic codes of the labels, we may save some annotation labor without sacrificing the overall performance at the test phase.
Additionally, we find that our learned decision function outperforms all the competitive baseline methods by a large margin under every experiment setting. This is mainly attributable to the fact that it is learned with a loss function tailored to the infinite-label learning problem, while the others are not. To some extent, such results affirm the significance of studying infinite-label learning as a standalone problem, because the existing methods for zero-shot classification are suboptimal for this new learning framework.
6 Conclusion
In this paper, we study a new machine learning framework, infinite-label learning. It fundamentally expands the scope of multi-label learning in the sense that the learned decision function can assign both seen and unseen labels to a data point from a potentially infinite number of candidate labels. We provide a formal treatment of infinite-label learning, discuss its distinction from existing machine learning settings, lay out mild assumptions leading to a PAC-learning bound for the new problem, and empirically examine its feasibility and properties.
There are many potential avenues for future work. Our current PAC bound can likely be improved, and the assumptions could be relaxed. A theoretical understanding of performance under the MiAP evaluation is also needed, given that MiAP is prevalent in evaluating multi-label learning results, especially for image tagging. One particularly interesting application of infinite-label learning is to extreme multi-label classification problems [14]. We will explore them in future work.
References
 [1] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. Zerodata learning of new tasks. In AAAI, volume 1, page 3, 2008.
 [2] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zeroshot learning with semantic output codes. In Advances in neural information processing systems, pages 1410–1418, 2009.
 [3] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visualsemantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
 [4] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zeroshot learning through crossmodal transfer. In Advances in neural information processing systems, pages 935–943, 2013.
 [5] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attributebased classification for zeroshot visual object categorization. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(3):453–465, 2014.
 [6] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zeroshot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4247–4255, 2015.
 [7] Bernardino RomeraParedes and PHS Torr. An embarrassingly simple approach to zeroshot learning. In Proceedings of The 32nd International Conference on Machine Learning, pages 2152–2161, 2015.
 [8] WeiLun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zeroshot learning for object recognition in the wild. In European Conference on Computer Vision, pages 52–68. Springer, 2016.
 [9] Yael Moses, Shimon Ullman, and Shimon Edelman. Generalization to novel images in upright and inverted faces. Perception, 25(4):443–461, 1996.
 [10] Peter JB Hancock, Vicki Bruce, and A Mike Burton. Recognition of unfamiliar faces. Trends in cognitive sciences, 4(9):330–337, 2000.
 [11] Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. Methods and metrics for coldstart recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 253–260. ACM, 2002.
 [12] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multilabel data. In Data mining and knowledge discovery handbook, pages 667–685. Springer, 2009.
 [13] MinLing Zhang and ZhiHua Zhou. A review on multilabel learning algorithms. Knowledge and Data Engineering, IEEE Transactions on, 26(8):1819–1837, 2014.
 [14] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multilabel classification. In Advances in Neural Information Processing Systems, pages 730–738, 2015.
 [15] Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. Predicting unseen labels using label hierarchies in largescale multilabel learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 102–118. Springer, 2015.
 [16] Yang Zhang, Boqing Gong, and Mubarak Shah. Fast zeroshot image tagging. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [17] Wei Bi and James Kwok. Efficient multilabel classification with many labels. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 405–413, 2013.
 [18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [19] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
 [20] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pages 2764–2770, 2011.
 [21] Alex Waibel. Modular construction of time-delay neural networks for speech recognition. Neural computation, 1(1):39–46, 1989.
 [22] Alireza Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.
 [23] Michiel Stock, Tapio Pahikkala, Antti Airola, Bernard De Baets, and Willem Waegeman. Efficient pairwise learning using kernel ridge regression: an exact two-step method. arXiv preprint arXiv:1606.04275, 2016.
 [24] Edwin V Bonilla, Felix V Agakov, and Christopher KI Williams. Kernel multi-task learning using task-specific features. In AISTATS, pages 43–50, 2007.
 [25] Bart Bakker and Tom Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.
 [26] Kishan Wimalawarne, Masashi Sugiyama, and Ryota Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in neural information processing systems, pages 2825–2833, 2014.
 [27] Sandor Szedmak, John Shawe-Taylor, and Emilio Parado-Hernandez. Learning via linear operators: Maximum margin regression; multiclass and multiview learning at one-class complexity. University of Southampton, Tech. Rep, 2006.
 [28] Michiel Stock, Thomas Fober, Eyke Hüllermeier, Serghei Glinca, Gerhard Klebe, Tapio Pahikkala, Antti Airola, Bernard De Baets, and Willem Waegeman. Identification of functionally related enzymes by learning-to-rank methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 11(6):1157–1169, 2014.
 [29] Tapio Pahikkala, Antti Airola, Michiel Stock, Bernard De Baets, and Willem Waegeman. Efficient regularized least-squares algorithms for conditional ranking on relational data. Machine Learning, 93(2-3):321–356, 2013.
 [30] Ryan Prescott Adams, George E Dahl, and Iain Murray. Incorporating side information in probabilistic matrix factorization with Gaussian processes. arXiv preprint arXiv:1003.4944, 2010.
 [31] Tinghui Zhou, Hanhuai Shan, Arindam Banerjee, and Guillermo Sapiro. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In Proceedings of the 2012 SIAM international Conference on Data mining, pages 403–414. SIAM, 2012.
 [32] Yi Fang and Luo Si. Matrix co-factorization for recommendation with rich side information and implicit feedback. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, pages 65–69. ACM, 2011.
 [33] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
 [34] Yanwei Fu, Yongxin Yang, Tim Hospedales, Tao Xiang, and Shaogang Gong. Transductive multi-label zero-shot learning. In British Machine Vision Conference, pages 7.1–7.11, 2014.
 [35] Yongxin Yang and Timothy M Hospedales. A unified perspective on multi-domain and multi-task learning. arXiv preprint arXiv:1412.7489, 2014.
 [36] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In NIPS, volume 22, pages 772–780, 2009.
 [37] Yi Zhang and Jeff G Schneider. Multi-label output codes using canonical correlation analysis. In International Conference on Artificial Intelligence and Statistics, pages 873–882, 2011.
 [38] Krishnakumar Balasubramanian and Guy Lebanon. The landmark selection method for multiple output prediction. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 983–990, 2012.
 [39] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM international conference on image and video retrieval, page 48. ACM, 2009.
 [40] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
 [41] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of machine learning research, 6(Sep):1453–1484, 2005.
 [42] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
 [43] Harris Drucker, Christopher JC Burges, Linda Kaufman, Alex Smola, Vladimir Vapnik, et al. Support vector regression machines. Advances in neural information processing systems, 9:155–161, 1997.
 [44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [45] Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees GM Snoek, and Alberto Del Bimbo. Socializing the semantic gap: A comparative survey on image tag assignment, refinement and retrieval. arXiv preprint arXiv:1503.08248, 2015.
 [46] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96. ACM, 2005.