Class label autoencoder for zero-shot learning
Existing zero-shot learning (ZSL) methods usually learn a projection function between a feature space and a semantic embedding space (text or attribute space) on the training seen classes or the testing unseen classes. However, such a projection function cannot be used between the feature space and multi-semantic embedding spaces, which describe different semantic information of the same class in diverse ways. To deal with this issue, we present a novel ZSL method based on learning a class label autoencoder (CLA). CLA not only builds a uniform framework that adapts to multi-semantic embedding spaces, but also constructs an encoder-decoder mechanism that constrains the bidirectional projection between the feature space and the class label space. Moreover, CLA jointly considers the relationship of feature classes and the relevance of the semantic classes to improve zero-shot classification. The CLA solution provides both the unseen class labels and the relations among the different class representations (feature or semantic information), which encode the intrinsic structure of classes. Extensive experiments demonstrate that CLA outperforms state-of-the-art methods on four benchmark datasets: AwA, CUB, Dogs and ImNet-2.
Keywords: class label autoencoder, multi-semantic embedding space, zero-shot learning, transfer learning
Large-scale visual recognition has become feasible with the support of large-scale datasets (for example, ImageNet Russakovsky2015ImageNet ()) and the advances of deep learning methods Krizhevsky2012ImageNet () Sermanet2013OverFeat () Simonyan2014Very () Szegedy2015Going (). However, visual recognition is still challenging "in the wild" because of object classes with rare samples and fine-grained object categories. For example, we cannot always collect samples of all the classes to be recognized for learning the related model. Moreover, in traditional visual recognition methods, a model learned on coarse classes can hardly be used to recognize fine-grained classes. To deal with visual recognition in these situations, the main idea of ZSL is to learn a transfer model between the feature space and the semantic space on seen classes, for which labeled samples are available, and to use it for classifying unseen classes, for which samples cannot be collected. In other words, in ZSL the training and testing class sets are disjoint, which is handled by modeling the transfer relationship based on the interactive relevance between feature classes and semantic classes. For instance, suppose 'horse' is an unseen class while 'zebra' is a seen class. These classes share some semantic information (e.g. both 'horse' and 'zebra' have the 'has tail' attribute) and differ in other semantic information (e.g. zebra has the 'is striped' attribute, but horse has the 'is solid color' attribute). In ZSL, the knowledge transfer model is learned between the visual feature of 'zebra' and the semantic information of 'has tail' in the training set; then, in classification, the visual feature of 'horse' is projected into the semantic or label space by the transfer model for recognizing 'horse' in the testing set.
In ZSL, the transfer model is learned between the visual feature and the semantic information on the training seen classes only, so it suffers from the projection domain shift problem Fu2015Transductive () Kodirov2017 () because the testing unseen classes and the training seen classes are disjoint. To alleviate the effect of the projection domain shift problem, we face two challenges Changpinyo2016 (). One is how to use the semantic information for modeling the knowledge transfer relationship, and the other is how to construct the projection function between the visual feature and multi-semantic information so that it is optimally discriminative on unseen classes.
To address the first challenge, we usually assume that seen and unseen classes are correlated in the semantic embedding space (e.g. attribute space farhadi2009describing () lampert2009learning () parikh2011relative () or text space Frome2013DeViSE () mikolov2013efficient () Socher2013Zero ()). In the semantic embedding space, all class names are embedded as vectors, which are called class prototypes Fu2014Transductive (). Some ZSL methods Fu2014Transductive () jayaraman2014zero () Lampert2014 () li2015semi () li2014attributes () norouzi2013zero () romera2015embarrassingly () try to find the relation among the different spaces (feature, semantic embedding and class label space) for modeling the knowledge transfer, and others attempt to transform the semantic embedding into a new representation for constraining the transfer relationship between seen and unseen classes by Canonical Correlation Analysis (CCA) Fu2015Transductive () or Sparse Coding (SC) yu2017transductive () zhang2015zero ().
To handle the second challenge, attribute-based classification Lampert2014 (), the classical method, constructs a probability model to predict the visual attributes of unseen classes. Recent methods tend to build a linear Frome2013DeViSE () 7298911 () Akata2016Label () Kodirov2017 (), nonlinear Socher2013Zero () 7780384 (), or hybrid norouzi2013zero () zhang2015zero () projection function among the different spaces (feature, semantic embedding, and class label space). Furthermore, two tendencies show promising results. One is that the structure of semantic classes Changpinyo2016 () or structure propagation Lin2018structure () is considered for enhancing the transfer model based on the above projection function. The other is that an autoencoder mechanism is utilized to constrain the bidirectional projection between feature and semantic information for improving the compatibility of the transfer model Kodirov2017 (). However, this autoencoder mechanism does not model the relation between the feature space and multi-semantic spaces. Moreover, it is difficult to approximate the intrinsic structure of the unseen classes because the linear model learned on seen classes is static.
Based on the above, our work is motivated by the autoencoder mechanism Kodirov2017 (), structure fusion Lin2017275 () Lin20161 () Lin2014146 () 7268821 () 7301305 () Lin20131286 () and structure propagation Lin2017Dynamic () Lin2018structure (), and aims to address the two challenges jointly. The difference is that CLA models the projection function between the feature space and the class label space by an autoencoder mechanism in order to handle the transfer relationship among the feature space and multi-semantic embedding spaces, whereas Kodirov2017 () only models the relation between the feature space and a single semantic space, and Lin2018structure () does not impose a bidirectional constraint between the feature space and the class label space when using multi-semantic information to build the transfer model. Therefore, we expect that CLA can not only construct a uniform framework that adapts to multi-semantic embedding spaces, but also reform the encoder-decoder mechanism to constrain the bidirectional projection between the feature space and the class label space. Figure 1 illustrates the idea of CLA conceptually.
Our main contributions are twofold. First, we propose a novel idea: constructing the projection function between the feature space and the class label space with an autoencoder mechanism so that multi-semantic information can be considered in ZSL. Second, we present a feasible method that improves the recognition of unseen classes by mining the evolving relationship among different manifold structures (the feature distribution structure of seen classes, the feature distribution structure of unseen classes, and the semantic distribution structure between seen and unseen classes). In the experiments, we evaluate CLA on four benchmark datasets for ZSL. The experimental results are promising for improving the recognition performance on unseen object classes.
2 Related Works
In ZSL, semantic information (for example, attributes lampert2009learning () obtained from manual annotation Akata2013Label () and text information Socher2013Zero () derived from language by machine learning Yu2013Designing () or data mining Elhoseiny2013Write ()) describes the characteristics of each class corresponding to its class label. Recent methods draw on the semantic information to recognize visual unseen classes by learning intermediate attribute classifiers Rohrbach2010What () Rohrbach2011Evaluating () Lampert2014 (), combining seen class proportions zhang2015zero () 7781018 () norouzi2013zero () Changpinyo2016 (), and compatibility learning Akata2016Label () 7298911 () Frome2013DeViSE () Socher2013Zero () romera2015embarrassingly () 7780384 (). For directly classifying unseen classes, the current focus is how to learn the projection or compatibility function between the feature space and the class label space with the assistance of the semantic information so as to suppress the projection domain shift problem. We attempt to enhance the learning of the projection or compatibility function by structure fusion of multi-semantic classes and an autoencoder mechanism, which are the methods most closely related to CLA.
To the best of our knowledge, structure fusion was first proposed for multi-feature fusion in shape classification Lin20131286 () in our earlier research. Structure is defined as the graph structure among the features of data samples. The linear relation of multi-feature structure fusion can be captured based on a manifold learning method Lin20131286 () Lin2014146 () or a probability model 7268821 () for improving the performance of image classification, shape analysis and human action recognition. In further works, we constructed the non-linear relation of heterogeneous structure fusion to remarkably improve image classification 7301305 () Lin20161 () and feature encoding Lin2017275 (). In recent works, we explored dynamic structure fusion, which refines the relation of objects for semi-supervised multi-modality classification Lin2017Dynamic (), and structure propagation, which updates the relevance of multi-semantic classes by iterative computation for ZSL Lin2018structure (). These works show that the relationship between structures is very important for the discriminative learning of object classification. Therefore, we expect to handle structure fusion of multi-semantic classes for ZSL in a more suitable way.
The autoencoder is a bidirectional mechanism for encoding and decoding that is used in many works Baldi1989Hornik () Rifai2011Contractive () Xie2015Unsupervised () Chen2012Marginalized () Badrinarayanan2017SegNet () Yan2015Attribute2Image () Reed2016Generative () Kodirov2017 (). In terms of semantic projection, autoencoder methods are roughly divided into two categories: non-semantic and semantic encoder-decoder methods. Non-semantic encoder-decoder methods usually learn the intrinsic structure of data for visualization/clustering Xie2015Unsupervised () or classification Chen2012Marginalized (), while semantic encoder-decoder methods generally share the latent embedding space between the encoder and the decoder by semantic regularization Yan2015Attribute2Image () Reed2016Generative () or learn an end-to-end deep model for ZSL with a reconstruction loss between the convolutional and deconvolutional neural networks Kodirov2017 ().
To take the bidirectional constraints into account, we build the projection model between visual features and class labels for ZSL following the autoencoder idea. At the same time, we use the structure fusion idea so that the model can process multi-semantic information and improve the performance of ZSL. To this end, we propose CLA to unify the two ideas in ZSL.
3 Class Label Autoencoder
3.1 Linear autoencoder
A linear autoencoder is a simple model with only one hidden layer, which is shared between the encoder and the decoder. With this mechanism, the input data are encoded by projecting them into the hidden layer and then decoded by reconstructing the original data space Kodirov2017 (). A linear autoencoder therefore attains better coding quality by minimizing the error between the original and the reconstructed data. We extend the autoencoder mechanism to the class label space and expect to directly encode the visual feature into the class label with the help of the semantic information. The input data is a visual feature matrix ( feature vectors of dimensions), which is projected into a -dimension class label space by a transformation matrix . The class label representation is . According to the autoencoder mechanism, the class label representation is mapped back to become by a transformation matrix . Tied weights can be used to further simplify the autoencoder model Ranzato2007Sparse (). We expect to minimize the reconstruction error between and . To this end, the following objective is built.
By relaxing the hard constraint into a soft one, we can reformulate (1) as the following unconstrained optimization problem.
Here, is a tradeoff parameter for balancing the encoder and the decoder.
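For concreteness, one way to write objectives (1) and (2) is sketched below. The symbols X, Y, W, \lambda, d, k and N are assumed notation, chosen to mirror the semantic autoencoder of Kodirov2017 () with the semantic matrix replaced by a class label matrix; they may differ from the authors' own symbols.

```latex
% Assumed notation (not taken from the paper):
%   X \in R^{d \times N}  visual features, one column per sample
%   Y \in R^{k \times N}  class label representation
%   W \in R^{k \times d}  encoder; the tied decoder is W^{\top}
%   \lambda > 0           tradeoff between the decoder and encoder terms
\begin{align}
\min_{W}\; & \lVert X - W^{\top} W X \rVert_F^2 \quad \text{s.t.}\quad W X = Y, \tag{1}\\
\min_{W}\; & \lVert X - W^{\top} Y \rVert_F^2 + \lambda\, \lVert W X - Y \rVert_F^2. \tag{2}
\end{align}
```

In this form the first term of (2) is the decoder (reconstruction) error and the second term is the encoder (projection) error, so \lambda balances the two directions.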
3.2 ZSL Model
In ZSL, the visual features form two sets. One set is represented as a feature matrix ( feature vectors of dimensions) with a class label matrix ( label vectors of dimensions) for seen classes, and the other is described as a feature matrix ( feature vectors of dimensions) without a class label matrix ( label vectors of dimensions) for unseen classes. The semantic feature set is defined as , in which or respectively is the feature matrix of seen or unseen classes. We expect to learn a transformation matrix on seen classes and to transfer its knowledge to a transformation matrix for unseen classes while taking multi-semantic information into account. Therefore, we want to find the relationship between and . Through this relationship, useful information can be transferred from seen classes to unseen classes. To this end, we define the following formula.
Here, ( is the number of seen classes) is the similarity matrix of seen classes, and is the structure representation of seen classes. is a projection matrix for encoding seen classes. By formula (3), we decompose into two parts, in which describes the intrinsic discriminative characteristic of seen classes and extracts the common information in seen classes. For unseen classes, we define a similar formula as follows.
Here, ( is the number of unseen classes) is the similarity matrix of unseen classes, and is the structure representation of unseen classes. is a projection matrix for encoding unseen classes. By formula (4), we decompose into two parts, in which depicts the intrinsic discriminative characteristic of unseen classes and extracts the common information in unseen classes. To describe the transfer relationship of the common information between seen and unseen classes, we define the following formula.
Here, cannot be obtained directly by the autoencoder mechanism. Therefore, we compute by and , which is the similarity matrix that transfers the common information between seen and unseen classes. In formulas (3), (4) and (5), the similarity matrix is computed as follows.
Here, when and respectively are visual class features or semantic class representations of seen classes, is the element of ; when and respectively are visual class features or semantic class representations of unseen classes, is the element of ; and if and respectively are visual class features or semantic class representations of a seen class and an unseen class, is the element of .
According to the above definitions, we can equivalently reformulate formula (2) for seen classes as follows.
To optimize formula (8), we transform it into the well-known Sylvester equation (Appendix A shows the details). The Bartels-Stewart algorithm Bartels1972Solution () solves this problem efficiently, for example by running the Matlab function "sylvester". Once we obtain , can be computed by formulas (3) and (5). In terms of formula (4), can be calculated by the following formula.
We can determine an estimated label to a by the following formula.
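As a rough illustration of the Sylvester step, the sketch below solves the plain linear label autoencoder of Section 3.1 rather than formula (8) itself. It assumes a feature matrix `X` (d × N), a one-hot label matrix `Y` (k × N), an encoder `W` (k × d) with tied decoder `W.T`, and a tradeoff parameter `lam`; these names and the function `fit_label_autoencoder` are illustrative and not taken from the paper. Setting the gradient of ||X − WᵀY||² + lam·||WX − Y||² to zero gives (Y Yᵀ) W + W (lam·X Xᵀ) = (1 + lam)·Y Xᵀ, which SciPy's `solve_sylvester` (a Bartels-Stewart solver, playing the role of Matlab's `sylvester`) handles directly.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def fit_label_autoencoder(X, Y, lam=1.0):
    """Solve min_W ||X - W^T Y||_F^2 + lam * ||W X - Y||_F^2 (assumed form).

    X: (d, N) feature matrix, Y: (k, N) label matrix. Returns W of shape (k, d).
    """
    A = Y @ Y.T                  # (k, k) left coefficient
    B = lam * (X @ X.T)          # (d, d) right coefficient
    C = (1.0 + lam) * (Y @ X.T)  # (k, d) right-hand side
    # solve_sylvester returns the W that satisfies A W + W B = C.
    return solve_sylvester(A, B, C)


# Toy data (hypothetical): 60 samples, 8-dim features, 4 classes.
rng = np.random.default_rng(0)
d, k, N = 8, 4, 60
X = rng.standard_normal((d, N))
labels = np.arange(N) % k      # every class appears, so Y Y^T is invertible
Y = np.eye(k)[:, labels]       # one-hot columns, shape (k, N)

W = fit_label_autoencoder(X, Y, lam=0.5)
residual = Y @ Y.T @ W + 0.5 * W @ (X @ X.T) - 1.5 * (Y @ X.T)
print(W.shape, np.abs(residual).max())
```

The solver needs `A` and `-B` to share no eigenvalue; with both Gram matrices positive definite, as in this toy setup, that condition holds.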
3.3 Multi-semantic structure evolution fusion
In formulas (8), (9) and (10), the similarity matrix (, or ) is computed by formula (6). In ZSL, semantic information usually describes all class prototypes. Therefore, the similarity matrix is often the structure of the semantic information. If multi-semantic information is available (for example, attributes lampert2009learning () or word vectors Socher2013Zero ()), the structure of the semantic information can be modeled as a linear combination of multiple similarity matrices by the following formula.
Here, is the number of kinds of semantic information. is the linear coefficient of the similarity matrix, which is (the structure representation of the th semantic information in seen classes), (the structure representation of the th semantic information in unseen classes) or (the structure representation of the th semantic information between seen and unseen classes).
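The linear combination in formula (11) amounts to a weighted sum of the per-semantic similarity matrices. A minimal sketch, assuming NumPy arrays of equal shape; the function name `fuse_structures` and the example matrices and weights are hypothetical, and the sketch imposes no constraint on the coefficients:

```python
import numpy as np


def fuse_structures(similarity_mats, coeffs):
    """Return sum_m coeffs[m] * similarity_mats[m] for equally shaped matrices."""
    stacked = np.stack(similarity_mats)        # (M, c, c)
    coeffs = np.asarray(coeffs, dtype=float)   # (M,)
    # Contract the coefficient axis with the first axis of the stack.
    return np.tensordot(coeffs, stacked, axes=1)


# Two toy 2x2 class-similarity matrices, e.g. one per semantic space.
A_att = np.eye(2)
A_w2v = np.ones((2, 2))
fused = fuse_structures([A_att, A_w2v], [0.25, 0.75])
print(fused)
```

The same helper applies unchanged to the seen, unseen, and seen-unseen similarity matrices, since each is fused over the same set of semantic spaces.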
Besides multi-semantic information, visual features can also help to construct the structure representation of object classes (a simple way to describe a visual class is to average all visual features with the same label). However, the labels of unseen classes are only known as estimates from formula (10). In other words, we calculate the structure representation of the visual feature classes from the estimated labels. Therefore, the above similarity matrix comes from not only semantic information but also visual features. We reformulate formula (11) as follows.
Here, , and are respectively the structure representations of the visual feature classes in seen classes, in unseen classes, and between seen and unseen classes. is fixed because the labels of seen classes are known, while and change dynamically with the estimated labels of unseen classes. Therefore, we respectively use and to weight the different structure representations. We can reformulate formula (8) in terms of formula (12) as follows.
To optimize formula (13), we fix to optimize by transforming the problem into the well-known Sylvester equation (Appendix A shows the details), and then fix to solve by linear programming. We obtain the initial estimated labels by formulas (9) and (10). Since the initial estimated labels may not be accurate, an evolution process is used to refine the model by iterative computation. We therefore update and with the estimated labels. For updating , we construct formula (14) to constrain the positive propagation of labels.
Here, the element of and forms the column of , while the element of and forms the column of . Formula (14) can be transformed into a linear program for solving . The evolution process is implemented by iterating over formulas (12), (9), (10) and (14).
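The visual class structure used in this section starts from class prototypes obtained by averaging all visual features that share a label (true labels for seen classes, estimated labels for unseen classes). A minimal sketch of that averaging step, assuming features stored column-wise and integer labels; the function name `class_prototypes` and the toy data are hypothetical:

```python
import numpy as np


def class_prototypes(X, labels, num_classes):
    """Average the columns of X (d, N) that share a label; returns (d, num_classes).

    Classes without any assigned sample keep a zero prototype.
    """
    labels = np.asarray(labels)
    protos = np.zeros((X.shape[0], num_classes))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[:, c] = X[:, mask].mean(axis=1)
    return protos


# Three 2-dim samples: the first two belong to class 0, the third to class 1.
X_toy = np.array([[1.0, 3.0, 10.0],
                  [2.0, 4.0, 20.0]])
protos = class_prototypes(X_toy, [0, 0, 1], num_classes=2)
print(protos)
```

Because the unseen-class labels are only estimates, these prototypes would be recomputed whenever the estimated labels change during the evolution process.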
To describe CLA in detail, Algorithm 1 gives the pseudo code of the proposed algorithm, which includes three steps. The first step (line 1 and line 2) initializes the structure representation of unseen classes and the structure relationship. The second step (line 3 and line 4) computes the fused structure, which is used to compute a projection matrix of seen classes and to update the structure relationship. The third step (from line 6 to line 11) is an evolution process that refines the classification performance on unseen classes by iterative computation. In addition, the evolution process can also fuse the recognition results of other ZSL methods to further improve the classification performance on unseen classes.
We evaluate the proposed CLA method on four challenging datasets: Animals with Attributes (AwA) Lampert2014 (), CUB-200-2011 Birds (CUB) Wah2011The (), Stanford Dogs (Dogs) Deng2013Fine (), and ILSVRC2012/ILSVRC2010 (ImNet-2) Russakovsky2015ImageNet (). For ImNet-2, we use the same configuration as in Kodirov2017 (): the 1000 classes of ILSVRC2012 are the seen classes and the 360 classes of ILSVRC2010 are the unseen classes. These datasets can be categorized into fine-grained recognition (CUB and Dogs) and non-fine-grained recognition (AwA and ImNet-2) for ZSL. Tab.1 shows the statistics and the extracted features (the details of the image and semantic features are given in section 4.2) for these datasets.
4.2 Image and semantic feature
Image and semantic feature descriptions are necessary for modeling ZSL. Because deep features capture the discriminative characteristics of objects based on large-scale databases, we adopt as the image feature the output (a 1024-dimension feature vector) of the pre-trained GoogLeNet 7298911 () Szegedy2015Going (), which is an end-to-end paradigm for processing whole image inputs. The semantic features are extracted by four methods on the different datasets. The first method obtains the feature vector from attributes (att) farhadi2009describing () by human annotation and judgment on AwA and CUB. The second method extracts word vectors (w2v) by predicting words of text documents with a two-layer neural network Mikolov2013Distributed () on AwA, CUB, ImNet-2 and Dogs. The third method obtains GloVe (glo) from co-occurrence statistics of words in a large unlabeled text corpus Pennington2014Glove () on AwA, CUB, and Dogs. The fourth method gets a hierarchical embedding (hie) from the vectorial class structure that describes the hierarchical relationship of classes (for example, WordNet 7298911 () Miller2002WordNet ()) on AwA, CUB, and Dogs.
4.3 Classification and validation protocols
Classification accuracy is computed by averaging the accuracies of all test classes in each dataset. The learned model has three parameters, which are (the tradeoff parameter and in formula (13)), (the total iteration number in Algorithm 1), and (the tradeoff parameter and in formula (14)). The training class set is alternately split into a learning set and a validation set in accordance with the proportion between the training class set and the test class set. We obtain corresponding to the best result in by cross-validation. In all experiments, is set to and is equal to .
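The class-averaged accuracy described at the start of this section differs from plain sample accuracy when test classes are imbalanced. A minimal sketch, assuming integer class labels; the function name `mean_per_class_accuracy` and the toy labels are hypothetical:

```python
import numpy as np


def mean_per_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))


# Class 0 has four test samples (three correct), class 1 has two (one correct).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
acc = mean_per_class_accuracy(y_true, y_pred)
print(acc)  # (0.75 + 0.5) / 2 = 0.625, whereas sample accuracy would be 4/6
```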
4.4 Comparison with baseline approaches
In this section, because the autoencoder mechanism and structure propagation are the basic ideas behind CLA, we implement two existing methods built on these ideas as baseline approaches: semantic autoencoder (SAE) Kodirov2017 () and structure propagation (SP) Lin2018structure (). SAE has two configurations, which encode from the visual to the semantic space (V to S) or from the semantic to the visual space (S to V). The detailed experimental results are given in Tab.2. In the different semantic spaces, the classification performance of CLA is clearly superior to that of the baseline methods. On AwA, CUB, Dogs and ImNet-2, the performance of CLA respectively improves , , , and at least. Because the performance improvement is not significant on ImNet-2, we report the comparison of Top-n accuracy (n is 1, 2, 3, 4 or 5) between CLA and the baseline methods on unseen classes in Tab.3. CLA still obtains the best performance among the compared methods.
| SAE (V to S) | att | N/A | N/A |
| SAE (S to V) | att | N/A | N/A |
| Top-n | Semantic feature | SAE (V to S) | SAE (S to V) | SP | CLA |
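The Top-n accuracy reported in Tab.3 counts a test sample as correct when its true class is among the n highest-scoring classes. A minimal sketch, assuming a score matrix with one row per sample and one column per unseen class; the function name `top_n_accuracy` and the toy scores are hypothetical, and the sketch averages over samples, since the text does not state whether the reported figures are averaged per sample or per class:

```python
import numpy as np


def top_n_accuracy(scores, y_true, n):
    """Fraction of rows whose true class is among the n largest scores."""
    scores = np.asarray(scores)
    top = np.argsort(-scores, axis=1)[:, :n]            # indices of the n best classes
    hits = (top == np.asarray(y_true)[:, None]).any(axis=1)
    return float(hits.mean())


# Three samples scored against three classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
y_true = [2, 0, 1]
top1 = top_n_accuracy(scores, y_true, n=1)
top2 = top_n_accuracy(scores, y_true, n=2)
print(top1, top2)
```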
4.5 Comparison with existing methods for multi-semantic fusion
In this section, multi-semantic fusion is implemented by CLA, SJE 7298911 (), LatEm 7780384 (), SynC Changpinyo2016 () and SP Lin2018structure (). The detailed experimental results are shown in Tab.4. w indicates that multi-semantic fusion includes att, w2v, glo and hie, while w/o indicates that multi-semantic fusion contains only w2v, glo and hie. On the different datasets, the classification performance of CLA is better than that of the other methods. On AwA, CUB, and Dogs, the performance of CLA respectively increases , , and at least.
4.6 Comparison with state-of-the-arts
In this section, we compare CLA with state-of-the-art methods, which include SJE 7298911 (), LatEm 7780384 (), SynC Changpinyo2016 (), SAE Kodirov2017 (), SP Lin2018structure (), DMaP Li2017Paths () and AR-CPR 8016672 (), in Tab.5. CLA outperforms the other state-of-the-art methods except on CUB: when the semantic representation is att, DMaP Li2017Paths () is better than CLA for ZSL. DMaP focuses on the manifold structure consistency between the semantic representation and the image feature, so it better distinguishes unseen classes in the att semantic representation. The details of the result analysis are given in section 4.8.
| SAE (V to S) Kodirov2017 () | att | N/A |
| SAE (S to V) Kodirov2017 () | att | N/A |
4.7 Parameter analysis
Two parameters have an important impact on CLA, which are (the tradeoff parameter and in formula (13)) and (the total iteration number in Algorithm 1). As mentioned above, we select from by cross-validation on seen classes. In this section, we examine the effect of on the classification performance of CLA under different settings. Figure 2 shows the classification performance of CLA with different on CUB. CLA is sensitive to , so we carefully select this parameter by cross-validation on the training set. In the above experiments, we empirically set to 50; this parameter controls the progress of structure evolution. Figure 3 shows the classification performance of CLA with structure evolution for different on CUB. The classification performance of CLA improves as increases and then tends to be stable. Therefore, we set to 50, because the classification performance of CLA is mostly better and stable at this value. The classification accuracy of CLA is not sensitive to changes in the other parameter.
4.8 Experimental results analysis
The above experiments involve eight approaches, which consider the manifold structure from different aspects to bridge the visual feature space and the semantic space. SJE 7298911 () constructs a structured output space by balancing the different output embeddings according to their confidence contributions. LatEm 7780384 () captures a structured model that forms an overall piecewise linear function, and obtains a latent space in which the flexible model fits the unseen classes. SynC Changpinyo2016 () takes into account the manifold structure in the semantic space to obtain optimal discriminative performance in the model space. SAE Kodirov2017 () mines the latent manifold structure for classifying unseen classes through the encoder-decoder paradigm with bidirectional constraints. SP Lin2018structure () optimizes the relationship of the manifold structures in the semantic and image spaces, and improves positive structure propagation by iterative computation for ZSL. DMaP Li2017Paths () builds manifold structure consistency between the semantic representation and the image feature by using dual visual-semantic mapping paths. AR-CPR 8016672 () learns a deep regression model that matches the manifold structure between images and class prototypes by rectifying the class representation. The proposed CLA jointly considers structure evolution and the bidirectional constraints between the feature space and the class label space to recognize unseen classes. From the above experiments, we make the following observations.
CLA outperforms the baseline approaches SAE and SP. SAE tries to find the mapping relationship under bidirectional constraints, from semantic to image and from image to semantic, while SP attempts to complement the projection relevance by positive structure propagation, which is based on the relationship between unseen and seen classes. The proposed CLA jointly considers the impact of the bidirectional constraints, from class label to image and from image to class label, and of structure evolution, which involves not only the relevance between unseen and seen classes but also the relationship among unseen classes or among seen classes.
The performance improvement of CLA over the baseline approaches differs across datasets. The advance is significant on AwA, CUB, and Dogs, but slight on ImNet-2. The main reason is that the diversity of a large-scale dataset causes a larger intra-class divergence than in a small-scale dataset. In the Top-n accuracy experiment, the proposed CLA still shows clear advantages over the other methods.
In multi-semantic fusion, the performance of CLA is superior to that of the other methods, which are SJE, LatEm, SynC, and SP. The proposed multi-semantic fusion method performs better than the single-semantic method because the multiple kinds of semantic information complement each other. On the Dogs dataset, the performance of CLA is slightly better than that of SP, while it significantly outperforms that of the other methods. The main reason is that the bidirectional constraint mechanism has less effect on classification accuracy than structure evolution in multi-semantic fusion methods.
In both multi-semantic fusion and single-semantic methods, structure evolution improves the classification accuracy for ZSL more than the bidirectional constraint mechanism in the supervised semantic space (att). This relationship is irregular in the unsupervised semantic spaces (w2v, glo, and hie). However, their joint effect consistently improves the classification accuracy for ZSL.
The performance of CLA is superior to that of DMaP except in the att semantic space of CUB, because the structure matching of DMaP is a key point for classifying fine-grained categories with a supervised semantic space (att). CLA integrates structure evolution and the bidirectional constraint mechanism, while DMaP focuses on the consistency of the various manifold structures. Therefore, the performance of CLA approaches that of DMaP in this situation, and even greatly outperforms that of DMaP with an unsupervised semantic space (w2v).
The performance of CLA and AR-CPR is the same in the att semantic space of CUB. AR-CPR attempts to train a deep network and rectify the class prototypes to enhance the classification accuracy on unseen classes, while CLA tries to learn the projection function by integrating structure evolution and the bidirectional constraint mechanism. Both methods obtain the best result from different aspects. This shows that the class structure distribution and its constraints are very important for bridging the gap between visual features and semantic representations.
Most of the computational load of CLA comes from solving equations (14) and (13). Specifically, the complexity of equation (14) is ( is the number of semantic spaces, and is the number of bits in the input Karmarkar1984A ()) in the worst case. The complexity of equation (13) is . Because and the number of iterations are often much smaller than the feature dimension, the computational complexity of CLA is .
We have proposed the class label autoencoder (CLA) method to address multi-semantic fusion in ZSL. CLA not only adapts multi-semantic structure distributions to a uniform ZSL framework, but also constrains the bidirectional mapping between the feature space and the class label space by the encoder-decoder paradigm. Furthermore, CLA fuses the relationship of feature classes, the relevance of the semantic classes, and the interaction between feature and semantic classes to improve zero-shot classification. Finally, by iterative evolution, the optimization of CLA obtains both the unseen class labels and the relations among the different class representations (feature or semantic information), which encode the intrinsic structure of classes. To evaluate the proposed CLA, we carried out comparison experiments against baseline methods, multi-semantic fusion methods, and state-of-the-art methods on AwA, CUB, Dogs and ImNet-2. The experimental results demonstrate that CLA is effective in ZSL.
The authors would like to thank the anonymous reviewers for their insightful comments, which helped improve the quality of this paper. In particular, the authors thank Dr. Yongqin Xian from MPI for Informatics, who provided the data source. This work was supported by NSFC (Program No.61771386) and the Natural Science Basic Research Plan in Shaanxi Province of China (Program No.2017JZ020).
- (1) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
- (2) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems(NIPS), 2012, pp. 1097–1105.
- (3) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. Lecun, Overfeat: Integrated recognition, localization and detection using convolutional networks, arXiv preprint arXiv:1312.6229.
- (4) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
- (5) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2015, pp. 1–9.
- (6) Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, Transductive multi-view zero-shot learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (11) (2015) 2332–2345.
- (7) E. Kodirov, T. Xiang, S. Gong, Semantic autoencoder for zero-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2017, pp. 4447–4456.
- (8) S. Changpinyo, W. L. Chao, B. Gong, F. Sha, Synthesized classifiers for zero-shot learning, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2016, pp. 5327–5336.
- (9) A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, Describing objects by their attributes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1778–1785.
- (10) C. H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 951–958.
- (11) D. Parikh, K. Grauman, Relative attributes, in: IEEE International Conference on Computer Vision(ICCV), 2011, pp. 503–510.
- (12) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, Devise: a deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems(NIPS), 2013, pp. 2121–2129.
- (13) T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
- (14) R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. D. Manning, A. Y. Ng, Zero-shot learning through cross-modal transfer, in: Advances in Neural Information Processing Systems(NIPS), 2013, pp. 935–943.
- (15) Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, S. Gong, Transductive multi-view embedding for zero-shot recognition and annotation, in: European Conference on Computer Vision(ECCV), 2014, pp. 584–599.
- (16) D. Jayaraman, K. Grauman, Zero-shot recognition with unreliable attributes, in: Advances in Neural Information Processing Systems(NIPS), 2014, pp. 3464–3472.
- (17) C. H. Lampert, H. Nickisch, S. Harmeling, Attribute-based classification for zero-shot visual object categorization, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3) (2014) 453–465.
- (18) X. Li, Y. Guo, D. Schuurmans, Semi-supervised zero-shot classification with label representation learning, in: IEEE International Conference on Computer Vision(ICCV), 2015, pp. 4211–4219.
- (19) Z. Li, E. Gavves, T. Mensink, C. G. Snoek, Attributes make sense on segmented objects, in: European Conference on Computer Vision(ECCV), 2014, pp. 350–365.
- (20) M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, arXiv preprint arXiv:1312.5650.
- (21) B. Romera-Paredes, P. H. Torr, An embarrassingly simple approach to zero-shot learning, in: International Conference on Machine Learning(ICML), 2015, pp. 2152–2161.
- (22) Y. Yu, Z. Ji, X. Li, J. Guo, Z. Zhang, H. Ling, F. Wu, Transductive zero-shot learning with a self-training dictionary approach, arXiv preprint arXiv:1703.08893.
- (23) Z. Zhang, V. Saligrama, Zero-shot learning via semantic similarity embedding, in: IEEE International Conference on Computer Vision(ICCV), 2015, pp. 4166–4174.
- (24) Z. Akata, S. Reed, D. Walter, H. Lee, B. Schiele, Evaluation of output embeddings for fine-grained image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2927–2936.
- (25) Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, Label-embedding for image classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (7) (2016) 1425–1438.
- (26) Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, B. Schiele, Latent embeddings for zero-shot classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 69–77.
- (27) G. Lin, Y. Chen, F. Zhao, Structure propagation for zero-shot learning, arXiv preprint arXiv:1711.09513.
- (28) G. Lin, C. Fan, H. Zhu, Y. Miu, X. Kang, Visual feature coding based on heterogeneous structure fusion for image classification, Information Fusion 36 (2017) 275–283.
- (29) G. Lin, G. Fan, X. Kang, E. Zhang, L. Yu, Heterogeneous feature structure fusion for classification, Pattern Recognition 53 (2016) 1–11.
- (30) G. Lin, H. Zhu, X. Kang, C. Fan, E. Zhang, Feature structure fusion and its application, Information Fusion 20 (2014) 146–154.
- (31) G. Lin, H. Zhu, X. Kang, Y. Miu, E. Zhang, Feature structure fusion modelling for classification, IET Image Processing 9 (10) (2015) 883–888.
- (32) G. Lin, G. Fan, L. Yu, X. Kang, E. Zhang, Heterogeneous structure fusion for target recognition in infrared imagery, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015, pp. 118–125.
- (33) G. Lin, H. Zhu, X. Kang, C. Fan, E. Zhang, Multi-feature structure fusion of contours for unsupervised shape classification, Pattern Recognition Letters 34 (11) (2013) 1286–1290.
- (34) G. Lin, K. Liao, B. Sun, Y. Chen, F. Zhao, Dynamic graph fusion label propagation for semi-supervised multi-modality classification, Pattern Recognition 68 (2017) 14–23.
- (35) Z. Akata, F. Perronnin, Z. Harchaoui, C. Schmid, Label-embedding for attribute-based classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 819–826.
- (36) F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, S. F. Chang, Designing category-level attributes for discriminative visual recognition, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2013, pp. 771–778.
- (37) M. Elhoseiny, B. Saleh, A. Elgammal, Write a classifier: Zero-shot learning using purely textual descriptions, in: IEEE International Conference on Computer Vision(ICCV), 2013, pp. 2584–2591.
- (38) M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, B. Schiele, What helps where and why? semantic relatedness for knowledge transfer, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2010, pp. 910–917.
- (39) M. Rohrbach, M. Stark, B. Schiele, Evaluating knowledge transfer and zero-shot learning in a large-scale setting, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2011, pp. 1641–1648.
- (40) Z. Zhang, V. Saligrama, Zero-shot learning via joint latent similarity embedding, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 6034–6042.
- (41) P. Baldi, K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks 2 (1) (1989) 53–58.
- (42) S. Rifai, P. Vincent, X. Muller, X. Glorot, Y. Bengio, Contractive auto-encoders: Explicit invariance during feature extraction, in: International Conference on Machine Learning (ICML), 2011, pp. 833–840.
- (43) J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, arXiv preprint arXiv:1511.06335.
- (44) M. Chen, Z. Xu, K. Weinberger, F. Sha, Marginalized denoising autoencoders for domain adaptation, arXiv preprint arXiv:1206.4683.
- (45) V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convolutional encoder-decoder architecture for scene segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99) (2017) 1–1. doi:10.1109/TPAMI.2016.2644615.
- (46) X. Yan, J. Yang, K. Sohn, H. Lee, Attribute2image: Conditional image generation from visual attributes, arXiv preprint arXiv:1512.00570.
- (47) S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, arXiv preprint arXiv:1605.05396.
- (48) M. A. Ranzato, Y. L. Boureau, Y. Lecun, Sparse feature learning for deep belief networks, in: Advances in Neural Information Processing Systems(NIPS), 2007, pp. 1185–1192.
- (49) R. H. Bartels, G. W. Stewart, Solution of the matrix equation AX + XB = C [F4], Communications of the ACM 15 (9) (1972) 820–826.
- (50) C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 dataset, California Institute of Technology.
- (51) J. Deng, J. Krause, L. Fei-Fei, Fine-grained crowdsourcing for fine-grained recognition, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2013, pp. 580–587.
- (52) T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems(NIPS), 2013, pp. 3111–3119.
- (53) J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in: Conference on Empirical Methods in Natural Language Processing(EMNLP), 2014, pp. 1532–1543.
- (54) G. A. Miller, Wordnet: A lexical database for the english language, Contemporary Review 241 (1) (2002) 206–208.
- (55) Y. Li, D. Wang, H. Hu, Y. Lin, Y. Zhuang, Zero-shot recognition using dual visual-semantic mapping paths, in: IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2017, pp. 5207–5215.
- (56) C. Luo, Z. Li, K. Huang, J. Feng, M. Wang, Zero-shot learning via attribute regression and class prototype rectification, IEEE Transactions on Image Processing 27 (2) (2018) 637–648.
- (57) N. Karmarkar, A new polynomial-time algorithm for linear programming, Combinatorica 4 (4) (1984) 373–395.
Then, we take the derivative of formula (15) and set it to zero, which gives the following formula.
If , , , and , we can reformulate formula (16) as the following formula.
Here, formula (17) is a well-known Sylvester equation.
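For readers unfamiliar with this class of problem, a Sylvester equation has the generic form AW + WB = C, where the unknown is the matrix W. The following is a minimal numerical sketch of how such an equation can be solved, not the implementation used in this paper. The matrices `A`, `B` and `C` are random stand-ins (assumptions for illustration only) and do not correspond to the specific terms of formula (17); the helper name `solve_sylvester_kron` is likewise hypothetical. The sketch uses the Kronecker-product vectorization, which is easy to verify but only practical for small matrices.

```python
import numpy as np


def solve_sylvester_kron(A, B, C):
    """Solve A @ W + W @ B = C for W via Kronecker vectorization.

    Uses vec(A W) = (I_n kron A) vec(W) and vec(W B) = (B^T kron I_m) vec(W),
    where vec() stacks columns (column-major order).
    """
    m, n = C.shape
    K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
    w = np.linalg.solve(K, C.flatten(order="F"))
    return w.reshape((m, n), order="F")


# Hypothetical stand-in matrices (NOT the terms of formula (17)).
rng = np.random.default_rng(0)
P = rng.standard_normal((3, 3))
Q = rng.standard_normal((5, 5))
A = P @ P.T + np.eye(3)   # symmetric positive definite, 3 x 3
B = Q @ Q.T + np.eye(5)   # symmetric positive definite, 5 x 5
C = rng.standard_normal((3, 5))

W = solve_sylvester_kron(A, B, C)

# The residual of the original equation should be close to zero.
residual = np.linalg.norm(A @ W + W @ B - C)
print(W.shape, residual)
```

Because both stand-in matrices are positive definite, no eigenvalue of `A` is the negative of an eigenvalue of `B`, so the solution is unique. For larger problems, the Bartels-Stewart algorithm (reference 49) is the standard choice; SciPy exposes a solver of this kind as `scipy.linalg.solve_sylvester(A, B, C)`.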