# Multi-Label Zero-Shot Learning with Transfer-Aware Label Embedding Projection

## Abstract

Zero-shot learning transfers knowledge from seen classes to novel unseen classes, reducing the human labor of labelling data for building new classifiers. Most work on zero-shot learning, however, has focused on the standard multi-class setting, while the more challenging multi-label zero-shot problem has received limited attention. In this paper we propose a transfer-aware embedding projection approach to tackle multi-label zero-shot learning. The approach projects the label embedding vectors into a low-dimensional space to induce better inter-label relationships and explicitly facilitate information transfer from seen labels to unseen labels, while simultaneously learning a max-margin multi-label classifier with the projected label embeddings. Auxiliary information can be conveniently incorporated to guide the label embedding projection and further improve label relation structures for zero-shot knowledge transfer. We conduct experiments on zero-shot multi-label image classification. The results demonstrate the efficacy of the proposed approach.

## 1 Introduction

Despite the advances in supervised learning techniques such as deep neural network models, the conventional supervised learning setting requires a large number of labelled instances for each class to perform training, and hence induces substantial annotation costs. It is important to develop algorithms that reduce the annotation cost of training classification models. Zero-shot learning (ZSL), which transfers knowledge from annotated seen classes to predict unseen classes that have no labeled data, has hence received a lot of attention [Lampert et al., 2009; Akata et al., 2015; Romera-Paredes and Torr, 2015; Zhang and Saligrama, 2015; Changpinyo et al., 2017].

One primary source deployed in zero-shot learning for bridging the gap between seen and unseen classes is the attribute description of the class labels [Lampert et al., 2009; Lampert et al., 2014; Romera-Paredes and Torr, 2015; Fu et al., 2015]. The attributes are typically defined by domain experts who are familiar with the common and specific characteristics of different category concepts, and hence are able to carry transferable information across classes. Nevertheless, human labor is still involved in defining the attribute-based class representations. This has propelled the research community to exploit more easily accessible free information sources from the Internet, including textual descriptions from Wikipedia articles [Qiao et al., 2016; Akata et al., 2015], word embedding vectors trained on large text corpora using natural language processing (NLP) techniques [Akata et al., 2015; Frome et al., 2013; Xian et al., 2016; Zhang and Saligrama, 2015; Al-Halah et al., 2016], co-occurrence statistics of hit-counts from search engines [Rohrbach et al., 2010; Mensink et al., 2014], and WordNet hierarchy information of the labels [Rohrbach et al., 2010; Rohrbach et al., 2011; Li et al., 2015b]. These works demonstrated impressive results on several standard zero-shot datasets. However, the majority of research effort has concentrated on multi-class zero-shot classification, while the more challenging multi-label zero-shot learning problem has received very limited attention [Mensink et al., 2014; Zhang et al., 2016; Lee et al., 2017].

In this work we propose a novel transfer-aware label embedding projection method to tackle multi-label zero-shot learning, as shown in Figure 1. Label embeddings have been exploited in standard multi-label classification to capture label relationships. We use the word embeddings [Pennington et al., 2014] produced from large corpora with NLP techniques as the initial semantic label embedding vectors. These semantic embedding vectors have the nice property of capturing general similarities between any pair of label phrases/words, but may not be optimal for multi-label classification and information transfer across classes. Hence we project the label embedding vectors into a low-dimensional semantic space in a transfer-aware manner to gain transferable label relationships by enforcing similarity between seen and unseen class labels and separability across unseen labels. We simultaneously co-project the labeled seen class instances into the same semantic space under a max-margin multi-label classification framework to ensure the predictability of the embeddings. Moreover, we incorporate auxiliary information to guide the label embedding projection towards suitable inter-label relationships. To investigate the proposed approach, we conduct ZSL experiments on two standard multi-label image classification datasets, PASCAL VOC2007 and VOC2012. The empirical results demonstrate the effectiveness of the proposed approach in comparison with a number of related ZSL methods.

## 2 Related Work

Multi-label Classification. Multi-label classification is relevant in many application domains, where each data instance can be assigned to multiple classes. Many multi-label learning works developed in the literature have centered on exploiting the correlation/interdependency information between the multiple labels, including the max-margin learning methods with pairwise ranking loss [Elisseeff et al., 2001], weighted approximate pairwise ranking loss (WARP) [Weston et al., 2011], and calibrated separation ranking loss (CSRL) [Guo and Schuurmans, 2011]. Moreover, incomplete labels are frequently encountered in many multi-label applications due to noise or crowd-sourcing, where only a subset of the true labels is provided on some training instances. Multi-label learning methods with missing labels have largely depended on observed label correlations to overcome the label incompleteness of the training data [Bucak et al., 2011; Yu et al., 2014; Yang et al., 2016]. These methods, however, assume that every label is observed on at least a subset of the training data, and they cannot handle the more challenging zero-shot learning setting where some labels are completely missing from the training instances.

Zero-shot Learning. There have been a significant number of works in multi-class zero-shot image classification, including the ones that explore different transferring embedding strategies [Romera-Paredes and Torr, 2015; Frome et al., 2013; Norouzi et al., 2013; Xian et al., 2016] or different information sources [Akata et al., 2015; Mensink et al., 2014]. Many methods represent labels in a semantic attribute space [Lampert et al., 2009] or word embedding space [Mikolov et al., 2013; Pennington et al., 2014] to perform zero-shot learning by computing similarities between the instances and labels. [Romera-Paredes and Torr, 2015] proposed a simple approach to learn a projection matrix that maps image features into the attribute space, while [Frome et al., 2013] used a CNN architecture followed by a transformation matrix to map images into the word embedding vector space. [Norouzi et al., 2013] also took advantage of CNNs but they expressed image embeddings as convex combinations of seen class embeddings. [Akata et al., 2015] considered learning a bilinear compatibility function for image features and output label embeddings. They evaluated attributes, word embedding vectors, as well as WordNet hierarchy and online text information, for producing label embeddings. In [Xian et al., 2016], the authors proposed to use tensors as nonlinear latent embedding functions. [Li et al., 2015a] learned the projection matrix by minimizing a max-margin loss in a semi-supervised way. [Zhang and Saligrama, 2015] proposed to embed both image features and attribute signatures of labels into a common semantic space which has the seen classes as bases. More recently, [Changpinyo et al., 2017] proposed a method to generate visual exemplars from semantic attributes, and then use them as optimized class prototypes for prediction on test instances. This work also projects both semantic and visual feature vectors into an intermediate space. Nonetheless, all these methods are designed for multi-class zero-shot learning problems.

Despite the many works above on multi-class ZSL, to the best of our knowledge, there has not been much work on multi-label ZSL, with the following exceptions. In [Fu et al., 2014], the authors proposed to address multi-label zero-shot learning by mapping images into the semantic word space. However, in the testing phase it needs to consider all possible combinations of the outputs, which form the power set of the unseen tags/classes. This prevents it from being applied to larger datasets. The authors of [Mensink et al., 2014] proposed to express unseen class classifiers as weighted sums of seen class classifiers, where the weights are estimated from different kinds of co-occurrence statistics. This approach, however, treats the unseen class classifiers separately, without considering the correlations/dependencies among the classes. [Zhang et al., 2016] proposed a fast zero-shot image tagging algorithm, which learns the principal direction of each image to separate tags into positive and negative ones. Their approach, however, uses fixed pre-given label embeddings, which may not be the best for capturing useful class correlations between seen and unseen classes towards information transfer. More recently, in [Gaure et al., 2017] the authors adopted a generative probabilistic framework to leverage the co-occurrence statistics of the seen labels for multi-label zero-shot prediction. This method, however, depends heavily on the auxiliary resource for gaining quality label co-occurrence statistics. [Lee et al., 2017] proposed to construct a knowledge graph based on the WordNet hierarchy for modeling label relations, and then propagate confidence scores from the seen to the unseen labels through the graph. Its performance largely relies on the quality of the knowledge graph. By contrast, our proposed approach can project existing label embeddings into a more suitable low-dimensional semantic space to automatically retrieve better label relations for knowledge transfer between seen and unseen classes, while flexibly exploiting auxiliary information for additional help.

## 3 Proposed Approach

### 3.1 Problem Definition and Notations

We consider multi-label zero-shot learning in the following setting. Assume we have a set of labeled training images , where denotes the -dim visual features extracted using CNNs for the images, and denotes the corresponding label indicator matrix across a set of seen classes, : “1” indicates the presence of the corresponding label (i.e., positive labels) and “0” indicates the absence of the corresponding label (i.e., negative labels). For multi-label classification, each row of can have multiple “1” values. Moreover, we also assume there are a set of unseen classes, such that , and the labels for the unseen classes are completely missing in our labeled training data. In addition, we assume the word embeddings of the seen classes and unseen classes are both given: , where are the seen class embeddings, are the unseen class embeddings, and their concatenation is for all the classes. We aim to learn a multi-label prediction model from the training data that allows us to perform multi-label classification on the unseen classes.

We use the following general notations in the presentation below. For any matrix, e.g., , we use to denote its -th row vector. We use to denote the Frobenius norm of a matrix and use tr() to denote the trace of a matrix. For , we use to denote its complement such that . We also reuse the notation to denote a set of indices of its non-zero values within proper contexts. We use to denote the Euclidean norm and denote the rectified operator as . We use 1 to denote a column vector of all 1s, assuming its size can be determined in the context, and use to denote an identity matrix. We use to denote a matrix with all 0s and use to denote a matrix with all 1s.

### 3.2 Max-margin Multi-label Learning with Semantic Embedding Projection

Instead of entirely relying on the pre-given label embeddings in obtained from word embeddings to facilitate cross-class information adaptation, we propose to co-project the input image visual features and the label embeddings into a more suitable common low-dimensional semantic space such that the similarity matching scores of each image with its positive labels in this semantic space will be higher than those with its negative labels. Specifically, we want to learn a projection function that maps an instance from the visual feature space into a semantic space ; assuming a linear projection we have , where is a projection matrix. Simultaneously, we learn another linear projection function such that , where is a projection matrix, which maps a class from the original word embedding space into the same semantic space . Then the similarity matching score between an instance and the -th class label can be computed as the inner product of their projected representations in the common semantic space:

(1) |
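As a minimal sketch of the scoring rule described above, the following computes the inner product between each linearly projected instance and each linearly projected label embedding. The argument names (`features`, `visual_proj`, `label_emb`, `label_proj`) and the toy numbers are illustrative assumptions, not notation taken from the paper.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def similarity_scores(features, visual_proj, label_emb, label_proj):
    """Score every (instance, label) pair in the shared semantic space.

    features:    n x d visual feature rows
    visual_proj: d x r linear projection for instances (assumed name)
    label_emb:   L x m word-embedding rows, one per label
    label_proj:  m x r linear projection for labels (assumed name)
    Returns an n x L list of inner products.
    """
    projected_instances = matmul(features, visual_proj)   # n x r
    projected_labels = matmul(label_emb, label_proj)      # L x r
    return [
        [sum(p * q for p, q in zip(inst, lab)) for lab in projected_labels]
        for inst in projected_instances
    ]


# Toy example: 2 instances, 3 labels, 1-dimensional semantic space.
features = [[1, 0], [0, 2]]
visual_proj = [[1], [1]]
label_emb = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
label_proj = [[1], [-1], [0]]
scores = similarity_scores(features, visual_proj, label_emb, label_proj)
```

Each row of `scores` holds one instance's matching scores against all labels, which is the quantity the ranking loss below operates on.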

To encode the assumption that the similarity score between an instance and any of its positive labels should be higher than the similarity score between the instance and any of its negative labels , i.e., , we formulate the projection learning problem within a max-margin multi-label learning framework:

(2) |

where denotes a max-margin ranking loss and is a model regularization term. In this work we adopt a calibrated separation ranking loss:

(3) |

where can be considered as the matching score for an auxiliary class 0, which produces a separation threshold score on the -th instance such that the scores for positive labels should be higher than it and the scores for negative labels should be lower than it, i.e., , to minimize the loss.
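The behaviour described above can be illustrated with one common form of a calibrated separation ranking loss. This sketch assumes a unit margin and a max over the positive and over the negative labels; both choices, and all function and argument names, are assumptions made for illustration and may differ from the exact form of Eq.(3).

```python
def hinge(z):
    """Rectified operator: max(0, z)."""
    return max(0.0, z)


def calibrated_separation_loss(scores, threshold, positives, margin=1.0):
    """Per-instance loss with a separating threshold score.

    scores:    list of matching scores, one per label
    threshold: score of the auxiliary "class 0" for this instance
    positives: set of indices of the instance's positive labels
    The loss is zero once every positive score exceeds the threshold by
    `margin` and every negative score falls below it by `margin`.
    """
    pos_terms = [hinge(margin + threshold - scores[j]) for j in positives]
    neg_terms = [
        hinge(margin + scores[j] - threshold)
        for j in range(len(scores))
        if j not in positives
    ]
    worst_pos = max(pos_terms) if pos_terms else 0.0
    worst_neg = max(neg_terms) if neg_terms else 0.0
    return worst_pos + worst_neg


# Label 0 is positive; label 1 violates the margin on the negative side.
loss_violated = calibrated_separation_loss([2.0, 0.5, -1.5], 0.0, {0})
# Both labels are well separated from the threshold, so the loss vanishes.
loss_separated = calibrated_separation_loss([2.0, -2.0], 0.0, {0})
```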

We assume the projection matrix has orthogonal columns to maintain a succinct label embedding projection. For the regularization term over , we consider a Frobenius norm regularizer, , where can be considered as an auxiliary column to , and is a trade-off weight parameter.

### 3.3 Transfer-Aware Label Embedding Projection

Employing the ranking loss to minimize classification error on seen classes can ensure the predictability of the projected label embedding. However for ZSL our goal is to predict labels from the unseen classes. This requires a label embedding representation that can encode suitable inter-class label relations to facilitate information transfer from seen to the unseen classes such that the similarity score can well reflect the relative prediction scores on an unseen class under the learned model parameters and . Our intuition is that classification or ranking on the target unseen class labels would be easier if they are well separated in the projected embedding space and knowledge transfer would be easier if unseen classes and seen classes have high similarities in the projected label embedding space. We hence propose to guide the label embedding projection learning by encoding this intuition through a transfer-aware regularization objective such that:

which can be equivalently expressed in a more compact form:

(4) |

where is a balance parameter for , and

(5) |

Here we use the inner product of a pair of projected label embedding vectors as the similarity value for the corresponding pair of classes, and aim to maximize the similarities across seen and unseen classes and minimize the similarities between unseen classes. By incorporating this regularization objective into the framework in Eq.(2), we obtain the following Transfer-Aware max-margin Embedding Projection (TAEP) learning problem:

(6) | ||||

s.t. | ||||

The objective learns and by enforcing positive labels to rank higher than negative labels, while incorporating the regularization term to refine the label embedding structure in the semantic space. can help produce better inter-class relationship structure for cross-class knowledge transfer. The regularization form also has a nice property — it allows a closed-form solution for to be derived and hence simplifies the training procedure.
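The intuition behind the transfer-aware regularizer (high seen-to-unseen similarity, low unseen-to-unseen similarity, both measured by inner products of projected label embeddings) can be sketched as a single quantity to be minimized. The unweighted sum used here is an assumption for illustration; the paper's regularizer may normalize or weight the two parts differently.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transfer_aware_penalty(projected, seen, unseen):
    """Unseen-unseen similarity minus seen-unseen similarity.

    projected: list of projected label embedding vectors, one per label
    seen:      indices of seen labels
    unseen:    indices of unseen labels
    Lower values mean unseen labels are better separated from each other
    and closer to the seen labels.
    """
    unseen_unseen = sum(
        dot(projected[j], projected[k]) for j in unseen for k in unseen if j != k
    )
    seen_unseen = sum(dot(projected[j], projected[k]) for j in seen for k in unseen)
    return unseen_unseen - seen_unseen


seen, unseen = [0, 1], [2, 3]
# Unseen labels are orthogonal to each other, each aligned with one seen label.
separated = [[1, 0], [0, 1], [1, 0], [0, 1]]
# Unseen labels collapse onto the same direction.
collapsed = [[1, 0], [0, 1], [1, 0], [1, 0]]
penalty_separated = transfer_aware_penalty(separated, seen, unseen)
penalty_collapsed = transfer_aware_penalty(collapsed, seen, unseen)
```

In this toy case the separated layout receives the lower penalty, matching the stated goal of the regularization term.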

Note after learning the projection matrices and , it will be straightforward to rank all unseen labels for instance based on the prediction scores for all .
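Ranking the unseen labels for one test instance then reduces to sorting them by score. A minimal sketch, with hypothetical label names and scores:

```python
def rank_labels(scores, label_names):
    """Return label names sorted from highest to lowest prediction score."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [label_names[j] for j in order]


# Hypothetical unseen labels and their scores for a single image.
ranking = rank_labels([0.2, 1.5, -0.3], ["cat", "dog", "sofa"])
```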

### 3.4 Integration of Auxiliary Information

In addition to explicit word embeddings, similarity information about the class labels can be derived from some external resources. We propose to leverage such auxiliary information to further improve label embedding projection.

In general, we can assume there is some auxiliary source in terms of a similarity matrix over the seen and unseen labels; i.e., defines the similarity between a label pair . Then , where , is the normalized Laplacian matrix of . We use a manifold regularization term to enforce the projected label embeddings to be better aligned with the inter-class affinity :

(7) |

where is a balance parameter for . This regularization form has the following advantages. First, it can be conveniently integrated into the learning framework in Eq.(6) by simply updating the regularization function to:

(8) |

Second, it is convenient to exploit different auxiliary resources by simply replacing (or ) with the one computed from the specific resource. In this work we study two different auxiliary information resources, the WordNet [Miller, 1995] hierarchy and web co-occurrence statistics.
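As a sketch of the manifold regularization idea, the code below builds the standard symmetric normalized Laplacian of a label affinity matrix and evaluates the usual smoothness term over projected label embeddings. Taking the degree as the row sum of the affinity matrix is the textbook definition and is assumed here; the function names are illustrative.

```python
import math


def normalized_laplacian(affinity):
    """I - D^{-1/2} A D^{-1/2}, with D the diagonal matrix of row sums of A."""
    n = len(affinity)
    degree = [sum(row) for row in affinity]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [
        [
            (1.0 if i == j else 0.0) - inv_sqrt[i] * affinity[i][j] * inv_sqrt[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def manifold_penalty(projected, laplacian):
    """tr(P^T L P) = sum_ij L[i][j] * <p_i, p_j> for projected embeddings P."""
    n = len(projected)
    return sum(
        laplacian[i][j] * sum(a * b for a, b in zip(projected[i], projected[j]))
        for i in range(n)
        for j in range(n)
    )


# Two labels that the auxiliary source marks as similar.
lap = normalized_laplacian([[0, 1], [1, 0]])
aligned = manifold_penalty([[1.0], [1.0]], lap)    # similar labels, same embedding
opposed = manifold_penalty([[1.0], [-1.0]], lap)   # similar labels, opposite embeddings
```

The penalty is small when labels with high affinity receive similar projected embeddings, which is the alignment the regularizer is meant to encourage.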

WordNet: WordNet [Miller, 1995] is a large lexical database of English. Words are grouped into a hierarchical tree structure based on their semantic meanings. Since words are organized based on ontology, their semantic relationships can be reflected by their connection paths. We find the shortest path between any two words based on the “is-a” taxonomy, and then define the similarity between two labels and as the reciprocal of the path length between the corresponding words, i.e., .
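The reciprocal-path-length similarity can be sketched on a toy "is-a" hierarchy with a breadth-first search. The hierarchy below is hypothetical and is not taken from WordNet; treating a label as having similarity 1 with itself and unconnected labels as having similarity 0 are assumptions added to keep the function total.

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """Number of is-a edges on the shortest path, or None if unconnected."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    queue = deque([(source, 0)])
    visited = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def path_similarity(edges, a, b):
    """Reciprocal of the shortest is-a path length between two labels."""
    if a == b:
        return 1.0  # assumption: self-similarity is 1
    length = shortest_path_length(edges, a, b)
    return 0.0 if length is None else 1.0 / length


# Hypothetical (child, parent) is-a edges.
toy_edges = [
    ("cat", "feline"),
    ("dog", "canine"),
    ("feline", "carnivore"),
    ("canine", "carnivore"),
]
sim_cat_dog = path_similarity(toy_edges, "cat", "dog")
```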

Co-occurrence statistics: Many researchers have exploited online data, for example Hit-Count, to compute similarity between labels [Rohrbach et al., 2010; Mensink et al., 2014]. The Hit-Count denotes how many times in total and appear together in the auxiliary source (for example, the number of records returned by a search engine). It is the co-occurrence statistic of and at the scale of the entire World Wide Web. Following previous works, we use the Flickr Image Hit Count to compute the Dice coefficient as the similarity between two labels, i.e., .
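For reference, the standard Dice coefficient over hit counts can be written in a few lines. The counts in the example are made up for illustration, and returning 0 when both labels have no hits is an added assumption.

```python
def dice_similarity(count_i, count_j, count_ij):
    """Standard Dice coefficient: 2 * |i and j| / (|i| + |j|).

    count_i, count_j: hit counts of each label on its own
    count_ij:         hit count of the two labels appearing together
    """
    total = count_i + count_j
    if total == 0:
        return 0.0  # assumption: no evidence means no similarity
    return 2.0 * count_ij / total


# Hypothetical hit counts for two labels.
sim = dice_similarity(100, 300, 50)
```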

### 3.5 Dual Formulation and Learning Algorithm

With the orthogonality constraint on and the appearance of in both the objective function and the linear inequality constraints, it is difficult to perform learning directly on Eq.(6). We hence deploy the standard Lagrangian dual formulation of the max-margin learning problem for fixed . This leads to the following equivalent dual formulation of Eq.(6):

(9) | ||||

s.t. | ||||

where the primal and can be recovered from the dual variables by and .

One nice property about the dual formulation in Eq.(9) is that it allows a convenient closed-form solution for . To solve this min-max optimization problem, we develop an iterative alternating optimization algorithm to perform training. We start from an infeasible initialization point by setting both and as zeros. Then in each iteration, we perform the following two steps, which will quickly move into the feasible region after one iteration.

#### Step 1:

Given the current fixed , the inner maximization over is a linearly constrained convex quadratic program. Though we can solve it directly using a quadratic programming solver, this approach suffers from a scalability problem: the Hessian matrix over will be very large whenever the data size or the label size is large. Hence we adopt a coordinate descent method to iteratively update each row of with the other rows fixed, since the constraints over each row of can be separated. The maximization over the -th row can be equivalently written as the following simple quadratic minimization problem:

(10) | ||||

s.t. | ||||

where and . After obtaining the optimal solution , we can update with , where denotes a one-hot vector with a single 1 in its -th entry and 0s in all other entries.

#### Step 2:

After updating each row in , we fix the value and perform minimization over . By taking a negative sign from Eq.(9), we have the following maximization problem:

(11) |

which has a closed-form solution. Let . Then the solution for is the top-r eigenvectors of .
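The closed-form step relies on a standard linear-algebra fact (Ky Fan's maximum principle). As a sketch with assumed symbols, let $U$ be a matrix with $r$ orthonormal columns, $Q$ a symmetric matrix with eigenvalues $\lambda_1(Q) \ge \lambda_2(Q) \ge \dots$ and corresponding unit eigenvectors $v_1, v_2, \dots$; the symbols $U$ and $Q$ are illustrative names, not necessarily the paper's notation. Then:

```latex
\max_{U^\top U = I_r} \operatorname{tr}\!\left(U^\top Q\, U\right)
  \;=\; \sum_{k=1}^{r} \lambda_k(Q),
\qquad \text{attained at } U^\star = [\, v_1, \dots, v_r \,].
```

Any trace maximization of this shape under an orthonormality constraint is therefore solved by stacking the eigenvectors that belong to the $r$ largest eigenvalues.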

Table 1: Multi-label zero-shot learning results on VOC2007 and VOC2012.

Methods | VOC2007 MiAP | VOC2007 micro-F1 | VOC2007 macro-F1 | VOC2007 Hamm. | VOC2012 MiAP | VOC2012 micro-F1 | VOC2012 macro-F1 | VOC2012 Hamm.
---|---|---|---|---|---|---|---|---
ConSE | 49.98 | 30.80 | 27.57 | 28.12 | 49.95 | 33.48 | 28.83 | 27.13
LatEm-M | 52.45 | 35.32 | 36.69 | 26.28 | 51.44 | 35.74 | 36.33 | 26.21
DMP | 53.52 | 36.70 | 40.44 | 25.72 | 52.92 | 35.73 | 41.04 | 26.12
Fast0Tag | 52.39 | 35.01 | 36.76 | 26.53 | 52.29 | 34.23 | 35.38 | 26.41
TAEP | 57.42 | 38.48 | 42.33 | 24.98 | 54.39 | 37.63 | 41.58 | 25.25
TAEP-C | 59.22 | 39.84 | 43.77 | 24.01 | 57.13 | 39.30 | 42.97 | 24.27
TAEP-H | 57.62 | 38.95 | 43.29 | 24.46 | 56.10 | 38.89 | 42.23 | 24.44

## 4 Experiments

To investigate the empirical performance of the proposed method, we conducted experiments on two standard multi-label image classification datasets to test its performance on multi-label zero-shot classification and generalized multi-label zero-shot classification.

### 4.1 Experimental Setting

#### Datasets

In our experiments we used two standard multi-label datasets: the PASCAL VOC2007 dataset and the VOC2012 dataset. The PASCAL VOC2007 dataset contains 20 visual object classes. There are 9963 images in total, 5011 for training and 4952 for testing. The VOC2012 dataset contains 5717 training images and 5823 validation images from 20 classes. We used the validation set for test evaluation.

Detailed settings. For each image, we used VGG19 [Simonyan and Zisserman, 2014] pre-trained on ImageNet to extract the 4096-dim visual features. For the label embeddings, we used the 300-dim word embedding vectors pre-trained by GloVe [Pennington et al., 2014]. All image feature vectors and word embedding vectors are normalized. To determine the hyper-parameters, we further split the seen classes into two disjoint subsets with an equal number of classes for training and validation. We train the model on the training set and choose hyper-parameters based on the performance on the validation set. For the proposed model, we choose , and from and respectively. After parameter selection, the training and validation data are put back together to train the model for the final evaluation on unseen test data.
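The class-level split used for hyper-parameter selection can be sketched as follows. The text does not say how the two halves are chosen, so the seeded random shuffle here is an assumption, as are the function name and the class indices.

```python
import random


def split_classes(class_ids, seed=0):
    """Split class ids into two disjoint halves (training and validation classes)."""
    ids = list(class_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment with a fixed seed
    half = len(ids) // 2
    return sorted(ids[:half]), sorted(ids[half:])


# Ten hypothetical seen-class indices split into two groups of five.
train_ids, val_ids = split_classes(range(10), seed=0)
```

Splitting by class rather than by image keeps the validation classes unseen during training, so model selection mimics the zero-shot test condition.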

Evaluation metrics. We used four different multi-label evaluation metrics: MiAP, micro-F1, macro-F1 and Hamming loss. The Mean image Average Precision (MiAP) [Li et al., 2016] measures how well the labels are ranked on a given image based on the prediction scores. The other three standard evaluation metrics for multi-label classification measure how well the predicted labels match the ground truth labels on the test data.
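For concreteness, the sketch below gives textbook definitions of the four metrics on binary label matrices (lists of rows, one row per image). These are the usual definitions and are assumed to match the ones used in the experiments; details such as tie handling in the ranking are not specified in the text.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (image, label) entries that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def _counts(true_vals, pred_vals):
    pairs = list(zip(true_vals, pred_vals))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def micro_f1(y_true, y_pred):
    """F1 over all entries pooled together."""
    flat_true = [t for row in y_true for t in row]
    flat_pred = [p for row in y_pred for p in row]
    return _f1(*_counts(flat_true, flat_pred))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    cols_true, cols_pred = list(zip(*y_true)), list(zip(*y_pred))
    per_label = [_f1(*_counts(t, p)) for t, p in zip(cols_true, cols_pred)]
    return sum(per_label) / len(per_label)


def image_average_precision(true_row, scores):
    """Average precision of the label ranking for a single image."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    hits, precisions = 0, []
    for rank, j in enumerate(order, start=1):
        if true_row[j] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_image_average_precision(y_true, score_rows):
    """MiAP: the per-image average precision averaged over all images."""
    aps = [image_average_precision(t, s) for t, s in zip(y_true, score_rows)]
    return sum(aps) / len(aps)


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
ap = image_average_precision([1, 0, 1], [0.9, 0.8, 0.1])
```

Hamming loss is an error rate, so lower is better, while higher is better for the other three metrics.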

### 4.2 Multi-label Zero-shot Learning Results

#### Comparison methods

We compared the proposed method with four related multi-label ZSL methods, ConSE, LatEm-M, DMP and Fast0Tag, which also adopt the visual-semantic projection strategy. The first two methods are multi-label adaptations of two standard ZSL approaches, the convex combination of semantic embeddings (ConSE) [Norouzi et al., 2013] and the latent embedding (LatEm) method [Xian et al., 2016]. For LatEm, we adopted a multi-label ranking objective to replace the original objective and denote this variant as the Latent Embedding Multi-label method (LatEm-M). The direct multi-label zero-shot prediction method (DMP) [Fu et al., 2014] and the fast tagging method (Fast0Tag) [Zhang et al., 2016] are specifically developed for multi-label zero-shot learning. For our proposed transfer-aware max-margin embedding projection (TAEP) method, we also provide comparisons for two TAEP variants with different types of auxiliary information: TAEP-H uses the WordNet hierarchy as auxiliary information, and TAEP-C uses the Flickr Image Hit-Count as auxiliary information.

Zero-shot multi-label learning results. We divided the classes of each dataset into two subsets of equal size, and used them as seen and unseen classes respectively. All methods use the seen class instances in the training set to train their models and make predictions on the unseen class instances in the test set. We selected the hyper-parameters for the comparison methods based on grid search. With the selected parameters fixed, we repeated 5 runs for each approach and report its mean performance in Table 1. We can see that the direct multi-label prediction method, DMP, outperforms both ConSE and LatEm-M on the two datasets in terms of almost all measures. This shows that the specialized multi-label ZSL method, DMP, does have an advantage over extended multi-class ZSL methods. Fast0Tag is a bit less effective than DMP, but still consistently outperforms ConSE. The proposed TAEP, on the other hand, consistently outperforms all four comparison methods across all measures, with notable improvements on both datasets. By integrating auxiliary information, the proposed TAEP-C and TAEP-H further improve the performance of the proposed model TAEP, with TAEP-C achieving the best results in terms of all measures. These results verify the efficacy of the proposed model. They also demonstrate the usefulness of auxiliary information and validate the effective information integration mechanism of our proposed model.

Table 2: Generalized multi-label zero-shot learning results on VOC2007 and VOC2012.

Methods | VOC2007 MiAP | VOC2007 micro-F1 | VOC2007 macro-F1 | VOC2007 Hamm. | VOC2012 MiAP | VOC2012 micro-F1 | VOC2012 macro-F1 | VOC2012 Hamm.
---|---|---|---|---|---|---|---|---
ConSE | 64.10 | 42.11 | 32.29 | 12.78 | 62.85 | 41.17 | 35.72 | 13.04
LatEm-M | 66.46 | 43.11 | 32.37 | 12.56 | 63.06 | 39.95 | 32.35 | 13.31
DMP | 67.79 | 43.97 | 34.13 | 12.37 | 64.24 | 41.29 | 32.39 | 13.02
Fast0Tag | 67.34 | 43.54 | 33.31 | 12.49 | 64.63 | 41.28 | 32.46 | 12.97
TAEP | 68.16 | 43.61 | 35.29 | 12.01 | 64.67 | 40.60 | 34.07 | 12.75
TAEP-C | 69.87 | 44.75 | 35.62 | 11.98 | 65.33 | 42.10 | 36.74 | 12.53
TAEP-H | 69.74 | 44.55 | 35.56 | 12.00 | 65.10 | 41.39 | 35.95 | 12.94

Generalized multi-label zero-shot learning results. Although zero-shot learning has often been evaluated only on the unseen classes in the literature, it is natural to evaluate multi-label zero-shot learning on all the classes, which is referred to as generalized multi-label zero-shot learning. Hence we conducted experiments to test the generalized zero-shot classification performance of the comparison methods. Each method is still trained on the same seen classes , but the test set now contains all the seen and unseen labels, i.e., . The average comparison results on the two datasets are reported in Table 2. We can see that the two specialized multi-label zero-shot learning methods, DMP and Fast0Tag, outperform the adapted methods ConSE and LatEm-M in terms of most measures on both VOC2007 and VOC2012, while TAEP achieves performance competitive with them. By further incorporating the auxiliary information, the proposed methods, TAEP-C and TAEP-H, not only consistently outperform all four comparison methods on both datasets in terms of all the evaluation metrics, but also consistently outperform the base model TAEP. TAEP-C again produced the best results in most cases. These results suggest that our proposed model provides an effective framework for learning transfer-aware label embeddings for generalized multi-label zero-shot learning, and that it also provides an effective mechanism for incorporating free auxiliary information.

### 4.3 Impact of Label Embedding Regularization

In this section we study the impact of the label embedding projection regularization term , i.e., the transfer-aware part of the proposed model. For TAEP, we first set the parameters to the same values, , as those that generate Table 1, and then reduce by a factor of 10 each time to repeat the experiments. That is, we try ={}. Since is the weight for the regularization term , by doing this we are actually reducing the contribution of the embedding projection regularization term. The results in terms of MiAP are presented in Figure 2. Similarly, we also tested the impact of auxiliary information through the regularization term for TAEP-H and TAEP-C by reducing by factors of . From Figure 2 we can see that, as decreases, the performance of TAEP decreases on both datasets. This suggests that the label embedding projection regularization term is a necessary and useful component. By regularizing the label embeddings to induce better inter-label relationships, the cross-class information transfer can be facilitated in zero-shot learning. Similarly, we also observe that when decreases, the performance of TAEP-C and TAEP-H decreases as well on both datasets. This again verifies the usefulness of auxiliary information and the effectiveness of the auxiliary integration mechanism of the proposed transfer-aware embedding projection method.

## 5 Conclusion

In this paper we proposed a transfer-aware label embedding approach for multi-label zero-shot image classification. This approach projects both images and labels into the same semantic space to rank the similarity scores of the images with positive and negative labels under a max-margin learning framework, while guiding the label embedding projection with a transfer-aware regularization objective to achieve suitable inter-label relations for information adaptation. The regularization framework also allows convenient incorporation of auxiliary information. We conducted experiments to compare our approach with a few related ZSL methods on multi-label image classification tasks. The results demonstrated the efficacy of the proposed approach.

### References

- Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
- Z. Al-Halah, M. Tapaswi, and R. Stiefelhagen. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In CVPR, 2016.
- S. Bucak, R. Jin, and A. Jain. Multi-label learning with incomplete class assignments. In CVPR, 2011.
- S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV, 2017.
- A. Elisseeff, J. Weston, et al. A kernel method for multi-labelled classification. In NIPS, 2001.
- A. Frome, G. S Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
- Y. Fu, Y. Yang, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-label zero-shot learning. In BMVC, 2014.
- Y. Fu, T. M Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. TPAMI, 37(11):2332–2345, 2015.
- A. Gaure, A. Gupta, V. Kumar Verma, and P. Rai. A probabilistic framework for zero-shot multi-label learning. In UAI, 2017.
- Y. Guo and D. Schuurmans. Adaptive large margin training for multilabel classification. In AAAI, 2011.
- C. H Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
- C. H Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2014.
- C.-W. Lee, W. Fang, C.-K. Yeh, and Y.-C. Frank Wang. Multi-label zero-shot learning with structured knowledge graphs. arXiv preprint arXiv:1711.06526, 2017.
- X. Li, Y. Guo, and D. Schuurmans. Semi-supervised zero-shot classification with label representation learning. In ICCV, 2015a.
- X. Li, S. Liao, W. Lan, X. Du, and G. Yang. Zero-shot image tagging by hierarchical semantic embedding. In SIGIR, 2015b.
- X. Li, T. Uricchio, L. Ballan, M. Bertini, C. GM Snoek, and A. Del Bimbo. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Computing Surveys (CSUR), 49(1):14, 2016.
- T. Mensink, E. Gavves, and C. GM Snoek. COSTA: Co-occurrence statistics for zero-shot classification. In CVPR, 2014.
- T. Mikolov, I. Sutskever, K. Chen, G. S Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
- G. A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
- M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
- J. Pennington, R. Socher, and C. D Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
- R. Qiao, L. Liu, C. Shen, and A. van den Hengel. Less is more: zero-shot learning from online textual documents with noise suppression. In CVPR, 2016.
- M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where–and why? semantic relatedness for knowledge transfer. In CVPR, 2010.
- M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
- B. Romera-Paredes and P. HS Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
- Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
- H. Yang, J. T. Zhou, and J. Cai. Improving multi-label learning with missing labels by structured semantic correlations. In ECCV, 2016.
- H. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.
- Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
- Y. Zhang, B. Gong, and M. Shah. Fast zero-shot image tagging. In CVPR, 2016.