Rethinking Knowledge Graph Propagation for Zero-Shot Learning
The potential of graph convolutional neural networks for the task of zero-shot learning has been demonstrated recently. These models are highly sample efficient as related concepts in the graph structure share statistical strength allowing generalization to new classes when faced with a lack of data. However, knowledge from distant nodes can get diluted when propagating through intermediate nodes, because current approaches to zero-shot learning use graph propagation schemes that perform Laplacian smoothing at each layer. We show that extensive smoothing does not help the task of regressing classifier weights in zero-shot learning. In order to still incorporate information from distant nodes and utilize the graph structure, we propose an Attentive Dense Graph Propagation Module (ADGPM). ADGPM allows us to exploit the hierarchical graph structure of the knowledge graph through additional connections. These connections are added based on a node’s relationship to its ancestors and descendants and an attention scheme is further used to weigh their contribution depending on the distance to the node. Finally, we illustrate that finetuning of the feature representation after training the ADGPM leads to considerable improvements. Our method achieves competitive results, outperforming previous zero-shot learning approaches.
1 Introduction
With the ever-growing supply of image data from an ever-expanding number of classes, there is an increasing need to use prior knowledge to classify images from unseen classes into correct categories based on semantic relationships between seen and unseen classes. This task is called zero-shot image classification. To obtain satisfactory performance on this task, it is crucial to model precise class relationships based on prior class knowledge. Previously, prior knowledge has been incorporated in the form of semantic descriptions of classes, such as attributes akata2015evaluation (); romera2015embarrassingly (); long2017zero () or word embeddings socher2013zero (); frome2013devise (); li2017zero (), or by using semantic relations such as knowledge graphs palatucci2009zero (); rohrbach2011evaluating (); salakhutdinov2011learning (). Approaches that use knowledge graphs are less explored and are generally based on the assumption that unknown classes can exploit similarity to known classes. Recently, the benefit of hybrid approaches that combine knowledge graphs and semantic class descriptions has been illustrated wang2018zero ().
The current state-of-the-art approach wang2018zero () processes knowledge graphs by making use of recent developments in applying neural network techniques to non-Euclidean spaces, such as graph and manifold spaces bronstein2017geometric (). The authors employ a multi-layer graph convolutional neural network (GCN) kipf2016semi () and phrase the problem as weight regression, where the GCN is trained to regress classifier weights for each class. GCNs balance model complexity and expressiveness with a simple, scalable model relying on the idea of message passing, i.e., nodes pass knowledge to their neighbors. However, these models were originally designed for classification tasks, albeit semi-supervised, an arguably simpler task than regression. Recent work has shown that GCNs perform a form of Laplacian smoothing, where feature representations become more similar as depth increases, leading to easier classification li2018deeper (). In the regression setting, however, the aim is to exchange information between nodes in the graph, and extensive smoothing is not desired as it dilutes information and does not allow for accurate regression. For instance, in a connected graph all features in a GCN with $n$ layers will, under some conditions, converge to the same representation as $n \to \infty$, hence washing out all information li2018deeper ().
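The smoothing effect can be made concrete with a small numerical sketch. It assumes a hypothetical four-node path graph with self-loops and repeatedly applies row-normalized propagation without trainable weights or nonlinearities, which is a simplification of a real GCN layer; the names `ADJ`, `FEATS` and `spread_after` are illustrative only.

```python
def row_normalize(adj):
    """Divide each row of a dense adjacency matrix by its row sum (D^-1 A)."""
    return [[v / sum(row) for v in row] for row in adj]


def propagate(norm_adj, feats):
    """One smoothing step: each node takes the mean of its neighbors' features."""
    n = len(norm_adj)
    dim = len(feats[0])
    return [
        [sum(norm_adj[i][j] * feats[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


def spread(feats):
    """Largest gap between any two nodes, taken over all feature dimensions."""
    dim = len(feats[0])
    return max(
        max(f[d] for f in feats) - min(f[d] for f in feats) for d in range(dim)
    )


# Hypothetical path graph 0-1-2-3 with self-loops on the diagonal.
ADJ = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
# One scalar feature per node; the two ends of the path start far apart.
FEATS = [[1.0], [0.0], [0.0], [-1.0]]


def spread_after(num_layers):
    """Feature spread after stacking `num_layers` propagation steps."""
    norm_adj = row_normalize(ADJ)
    feats = FEATS
    for _ in range(num_layers):
        feats = propagate(norm_adj, feats)
    return spread(feats)
```

Calling `spread_after` with increasing depth shows the node features collapsing towards a common value, which is harmless for classification but removes exactly the per-class differences that weight regression needs.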
We therefore argue that this approach is not ideal for the task of zero-shot learning and that the number of layers in the GCN should be small in order to avoid smoothing. We illustrate this phenomenon in practice by proposing a Graph Propagation Module (GPM) that consists of only a single layer and that consistently outperforms previously reported results. As in wang2018zero (), we use the GPM in a model-of-models framework by training it to predict a set of logistic regression classifiers, one for each class, on top of a set of extracted features produced by a CNN.
Choosing a small number of layers, however, has the effect that knowledge will not propagate well through the graph, as a GCN with $n$ layers only aggregates information from nodes that are at most $n$ hops away. With a shallow model, only nearby neighbors therefore influence a given node. Thus, we propose a dense connectivity scheme, where nodes are connected directly to descendants/ancestors in order to include distant information. These connections allow us to propagate information without many smoothing operations. However, this leads to the problem that all descendants/ancestors are weighed equally when computing the regression weight vector for a given class, whereas, intuitively, nodes closer to a given node should have higher importance. To remedy this, we extend this framework by adding an attention scheme that considers the distance between nodes in order to weigh the contribution of different nodes. Figure 1 illustrates the difference in the way knowledge is propagated in this Attentive Dense Graph Propagation Module (ADGPM) compared to the GPM.
We further consider the problem of domain shift in zero-shot learning, which is the problem that current methods often struggle to perform well on both seen and unseen classes chao2016empirical (). Seen and unseen classes can be considered overlapping domains with a set of shared appearances; however, they also exhibit domain differences kodirov2017semantic (). To remedy this and allow the feature extraction stage of the pre-trained CNN to adjust to the newly learned classifiers, we propose a two-phase training scheme. In the first phase, the ADGPM is trained to predict the last layer CNN weights. In the second phase, we finetune the CNN to predict the ADGPM output in order to allow the CNN feature representation to adjust to the predicted weights, incorporating the implicit constraint of the knowledge graph.
We present an analysis of our intuitions for zero-shot learning and illustrate how these intuitions can be combined to design an ADGPM that outperforms previous zero-shot learning results. To summarize, the key benefit of the proposed ADGPM is that it explicitly exploits the hierarchical structure of the knowledge graph in order to perform zero-shot learning by more efficiently propagating knowledge through the proposed dense connectivity structure. Specifically, we perform experiments on various splits of the 21K ImageNet dataset as well as the AWA2 dataset. On the 21K classes of ImageNet, we obtain relative improvements of more than 50% in Top-1 accuracy over the previously reported best results, and our proposed ADGPM further improves on the GPM. In a realistic setting, zero-shot models should generalize well to the unseen classes but still perform well on seen classes norouzi2013zero (). We therefore also evaluate the ability of the models to perform well on the seen classes and illustrate that the performance on the seen classes benefits from our finetuning approach. The source code for the experiments performed in this paper is provided in the supplementary materials.
2 Related Work
Graph convolutional networks are a class of graph neural networks, based on local graph operators bruna2013spectral (); defferrard2016convolutional (); kipf2016semi (). Their advantage is that their graph structure allows the sharing of statistical strength between classes making these methods highly sample efficient. After being introduced in bruna2013spectral (), they were extended with an efficient filtering approach based on recurrent Chebyshev polynomials, reducing their computational complexity to the equivalent of the commonly used CNNs in image processing operating on regular grids defferrard2016convolutional (). kipf2016semi () further proposed simplifications to improve scalability and robustness and applied their approach to semi-supervised learning on graphs. Their approach is termed graph convolutional network (GCN).
Zero-shot learning has in recent years been considered from a variety of viewpoints, such as manifold alignment deutsch2017zero (); li2017zero (), linear auto-encoders kodirov2017semantic (), and low-rank embedded dictionary learning approaches ding2017low (), using semantic relationships based on attributes misra2017red (); socher2013zero (); frome2013devise () and relations in knowledge graphs wang2018zero (); mensink2012metric (); rohrbach2011evaluating (); palatucci2009zero (). One of the early works larochelle2008zero () proposed a method based on the idea of a model-of-models approach, where a model is trained to predict models based on their description, so that each class is modeled as a function of its description. This idea has recently been used in wang2018zero (), the work most similar to our own, where a GCN kipf2016semi () is trained to predict a set of logistic regression classifiers, one for each class, on top of pre-trained CNN features in order to predict unseen classes. Their approach has yielded impressive performance on a set of zero-shot learning tasks and can, to the authors' knowledge, be considered the current state-of-the-art.
3 Approach
Here we first formalize the problem of zero-shot learning, describe a simple one-layer model, the GPM, and then extend this model to the full ADGPM. Our zero-shot learning framework is illustrated in Figure 2. Inspired by wang2018zero (), we train a model, in our case the ADGPM, to predict the last layer CNN weights for each class/concept.
3.1 Zero-shot learning
Zero-shot classification aims to assign the class labels of a set of test data points to a set of classes $\mathcal{C}_{te}$. However, unlike in common supervised classification, the test data points have to be assigned to previously unseen classes, given an $S$-dimensional semantic representation vector $z \in \mathbb{R}^{S}$ per class and a set of training data points $\mathcal{D}_{tr} = \{(X_i, c_i),\ i = 1, \dots, N_{tr}\}$, where $X_i$ denotes the $i$-th training image and $c_i \in \mathcal{C}_{tr}$ the corresponding class label. Here $\mathcal{C}$ denotes the set of all classes and $\mathcal{C}_{te}$ and $\mathcal{C}_{tr}$ the test and training classes, respectively. Note that training and test classes are disjoint for the zero-shot learning task, i.e., $\mathcal{C}_{te} \cap \mathcal{C}_{tr} = \emptyset$. In this work, we perform zero-shot classification by using the word embedding of the class labels and the knowledge graph to predict classifiers for each unknown class in the form of last layer CNN weights.
3.2 Graph Propagation Module
Given a graph with $N$ nodes and with $S$ input features per node, $X \in \mathbb{R}^{N \times S}$ is used to denote the feature matrix. Here each node represents a distinct concept/class in the classification task. In this work, each concept is represented by a word vector of the class name. The connections between the classes in the knowledge graph are encoded in the form of a symmetric adjacency matrix $A \in \mathbb{R}^{N \times N}$, which also includes self-loops. We employ a simple propagation rule to perform convolutions on the graph

$$H^{(l+1)} = \sigma\left(D^{-1} A H^{(l)} \Theta^{(l)}\right), \tag{1}$$

where $H^{(l)}$ represents the activations in the $l$-th layer and $\Theta^{(l)}$ denotes the trainable weight matrix for layer $l$. For the first layer, $H^{(0)} = X$. $\sigma(\cdot)$ denotes a nonlinear activation function, in our case ReLU. $D \in \mathbb{R}^{N \times N}$ is a degree matrix with $D_{ii} = \sum_j A_{ij}$, which normalizes rows in $A$ to ensure that the scale of the feature representations is not modified by $A$. Similar to previous work done on graph convolutional neural networks, this propagation rule can be interpreted as a spectral convolution kipf2016semi (). Guided by experimental evidence, we only employ a single hidden layer model to avoid the smoothing effect and also do not employ the symmetric normalization that was used in wang2018zero ().
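As an illustration of this propagation rule, the following is a minimal sketch in plain Python, assuming dense list-of-lists matrices and ReLU as the nonlinearity. The function names are hypothetical and this is not the authors' PyTorch implementation.

```python
def relu(x):
    """Rectified linear unit applied to a single value."""
    return x if x > 0.0 else 0.0


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def propagation_layer(adj, feats, weight):
    """Apply sigma(D^-1 A H Theta) with ReLU as sigma.

    adj    : N x N symmetric adjacency matrix including self-loops
    feats  : N x S input activations H
    weight : S x F trainable weight matrix Theta
    """
    # D^-1 A: divide every row by its degree so the feature scale is preserved.
    norm_adj = [[v / sum(row) for v in row] for row in adj]
    aggregated = matmul(norm_adj, feats)    # average over each node's neighbors
    projected = matmul(aggregated, weight)  # linear map into F dimensions
    return [[relu(v) for v in row] for row in projected]
```

For a three-node chain with self-loops and an identity weight matrix, each output row is simply the mean of the word vectors of a node and its direct neighbors.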
The model is trained to predict the classifier weights for the seen classes by optimizing the loss

$$\mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \sum_{j=1}^{P} \left(W_{i,j} - \tilde{W}_{i,j}\right)^2, \tag{2}$$

where $\tilde{W} \in \mathbb{R}^{M \times P}$ denotes the prediction of GPM for the known classes and therefore corresponds to the $M$ rows of the GPM output that correspond to the training classes. $M$ denotes the number of training classes and $P$ denotes the dimensionality of the weight vectors. The ground truth weights are obtained by extracting the last layer weights of a pre-trained CNN and denoted as $W \in \mathbb{R}^{M \times P}$.
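The training objective can be sketched as follows, assuming a squared-error loss that is summed over all weight dimensions and scaled by $1/(2M)$ for $M$ training classes. The helper `select_rows` and the argument names are hypothetical.

```python
def select_rows(output, train_indices):
    """Keep only the rows of the graph output that belong to training classes."""
    return [output[i] for i in train_indices]


def weight_regression_loss(predicted, target):
    """Squared error between predicted and ground-truth classifier weights.

    predicted : M x P weights predicted for the M training classes
    target    : M x P last-layer weights taken from the pre-trained CNN
    """
    num_classes = len(target)
    total = 0.0
    for pred_row, true_row in zip(predicted, target):
        total += sum((t - p) ** 2 for p, t in zip(pred_row, true_row))
    return total / (2 * num_classes)
```

Only rows of seen classes enter the loss; the rows of unseen classes are produced by the same forward pass and are used only at test time.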
During testing, the features of new images are extracted from the CNN and the GPM predicted classifiers are used to classify the features.
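The test-time procedure can be sketched as below. The sketch assumes that each predicted classifier is stored as a row of weights followed by a single bias term and that classes are ranked by the resulting linear score; the class names in the usage note are made up.

```python
def classify(feature, classifiers, class_names, k=1):
    """Rank classes by the score w . x + b of their predicted linear classifier.

    feature     : CNN feature values for one image
    classifiers : one row per class, weights followed by one bias (assumed layout)
    class_names : label for each row of `classifiers`
    """
    scores = []
    for name, row in zip(class_names, classifiers):
        weights, bias = row[:-1], row[-1]
        score = sum(w * x for w, x in zip(weights, feature)) + bias
        scores.append((score, name))
    # Highest score first; the first k labels are the top-k prediction.
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scores[:k]]
```

For example, `classify([1.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.2]], ["zebra", "whale", "otter"], k=2)` ranks `zebra` ahead of `otter`.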
3.3 Attentive Dense Graph Propagation Module
Our ADGPM for zero-shot learning aims to utilize the hierarchical graph structure for the zero-shot learning task and avoids the dilution of knowledge by intermediate nodes. This is achieved using a dense graph connectivity scheme consisting of two phases, namely the descendant propagation phase and the ancestor propagation phase. This two-phase approach further enables the model to learn separate relations between a node and its parents and between a node and its children. Unlike in the GPM, we do not use the knowledge graph relations directly as an adjacency graph to include information from neighbors further away. We therefore do not suffer from the problem of knowledge being washed out due to averaging over the graph. Instead, we introduce two separate connectivity patterns, one where nodes are connected to all their ancestors and one where nodes are connected to all their descendants. We utilize two adjacency matrices: $A_a \in \mathbb{R}^{N \times N}$ denotes the connections between nodes and their ancestors, whereas $A_d$ denotes the connections between nodes and their descendants. Note that $A_d = A_a^{T}$. Unlike in previous approaches, this connectivity pattern allows nodes direct access to knowledge in their extended neighborhood as opposed to knowledge that has been modified by intermediate nodes. Note that both these adjacency matrices include self-loops. The connection pattern is illustrated in Figure 1. The same propagation rule as in Equation 1 is applied consecutively for the two connectivity patterns, leading to the overall ADGPM propagation rule

$$H = \sigma\left(D_a^{-1} A_a \, \sigma\left(D_d^{-1} A_d X \Theta_d\right) \Theta_a\right), \tag{3}$$

where $D_a$ and $D_d$ are the degree matrices of $A_a$ and $A_d$, and $\Theta_d$ and $\Theta_a$ are the trainable weight matrices of the descendant and the ancestor propagation phase, respectively.
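The two dense connectivity patterns can be derived from the direct parent relations of the hierarchy. The following sketch assumes the hierarchy is acyclic and given as a hypothetical `parents` dictionary; it builds the ancestor matrix by walking up the hierarchy and obtains the descendant matrix as its transpose.

```python
def ancestor_adjacency(parents, num_nodes):
    """Connect every node to itself and to all of its ancestors.

    parents : dict mapping a node index to the list of its direct parents
    """
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for node in range(num_nodes):
        adj[node][node] = 1  # self-loop
        stack = list(parents.get(node, []))
        while stack:
            anc = stack.pop()
            if adj[node][anc] == 0:
                adj[node][anc] = 1
                # Continue upwards so that distant ancestors are linked directly.
                stack.extend(parents.get(anc, []))
    return adj


def transpose(matrix):
    """Transpose a dense matrix; turns ancestor links into descendant links."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical hierarchy: 0 = animal, 1 = mammal, 2 = dog, 3 = cat.
PARENTS = {1: [0], 2: [1], 3: [1]}
A_ANCESTOR = ancestor_adjacency(PARENTS, 4)
A_DESCENDANT = transpose(A_ANCESTOR)
```

In this toy hierarchy the row for `dog` in `A_ANCESTOR` links it to `mammal` and `animal` directly, so knowledge from `animal` reaches `dog` in a single propagation step.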
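Attention weights. In order to allow ADGPM to weigh the contribution of various neighbors in the dense graph, we propose an attention scheme that weighs a given node's neighbors based on their graph distance from the node. Note that the distance is computed on the knowledge graph and not on the dense graph. We use $\alpha^{a} = \{\alpha^{a}_k\}_{k=0}^{K}$ and $\alpha^{d} = \{\alpha^{d}_k\}_{k=0}^{K}$ to denote the attention weights for the ancestor and the descendant propagation phase, respectively. $\alpha^{a}_k$ and $\alpha^{d}_k$ correspond to weights for nodes that are $k$ hops away from the given node. $\alpha^{a}_0$ and $\alpha^{d}_0$ correspond to self-loops, and $\alpha^{a}_K$ and $\alpha^{d}_K$ correspond to the weights for nodes further than $K-1$ hops away. We normalize the weights using a softmax function, $\alpha^{a}_k = \mathrm{softmax}(\alpha^{a}_k) = \frac{\exp(\alpha^{a}_k)}{\sum_{i=0}^{K} \exp(\alpha^{a}_i)}$. Similarly, $\alpha^{d}_k = \mathrm{softmax}(\alpha^{d}_k)$. The weighted propagation rule in Equation 3 becomes

$$H = \sigma\left(\sum_{k=0}^{K} \alpha^{a}_k \left(D^{(k)}_a\right)^{-1} A^{(k)}_a \, \sigma\left(\sum_{k=0}^{K} \alpha^{d}_k \left(D^{(k)}_d\right)^{-1} A^{(k)}_d X \Theta_d\right) \Theta_a\right), \tag{4}$$

where $A^{(k)}_a$ and $A^{(k)}_d$ are used to denote the parts of the ancestor and descendant adjacency matrices that only contain the $k$-hop edges, and $D^{(k)}_a$ and $D^{(k)}_d$ are the corresponding degree matrices.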
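The distance-based weighting can be sketched for a single propagation phase as follows. The sketch assumes that hop distances on the knowledge graph are precomputed, that each hop-specific adjacency matrix is row-normalized on its own, and that all distances beyond the last attention index share the final weight; the names `hop_dist` and `raw_attention` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of raw attention parameters."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_by_hops(hop_dist, num_weights):
    """Build one adjacency matrix per attention weight.

    hop_dist[i][j] is the knowledge-graph distance from node i to a connected
    node j (0 for the self-loop), or None if the dense graph has no edge i -> j.
    Distances of num_weights - 1 or more share the last matrix.
    """
    n = len(hop_dist)
    mats = [[[0.0] * n for _ in range(n)] for _ in range(num_weights)]
    for i in range(n):
        for j in range(n):
            d = hop_dist[i][j]
            if d is not None:
                mats[min(d, num_weights - 1)][i][j] = 1.0
    return mats


def attentive_propagation_matrix(hop_dist, raw_attention):
    """Return sum_k alpha_k * D_k^-1 A_k for one propagation phase."""
    alphas = softmax(raw_attention)
    mats = split_by_hops(hop_dist, len(raw_attention))
    n = len(hop_dist)
    result = [[0.0] * n for _ in range(n)]
    for alpha, mat in zip(alphas, mats):
        for i in range(n):
            degree = sum(mat[i])
            if degree == 0:
                continue  # node i has no neighbor at this distance
            for j in range(n):
                result[i][j] += alpha * mat[i][j] / degree
    return result


# Hypothetical chain 0 <- 1 <- 2 <- 3: entry [i][j] is the distance to ancestor j.
HOPS = [
    [0, None, None, None],
    [1, 0, None, None],
    [2, 1, 0, None],
    [3, 2, 1, 0],
]
```

With three attention values, the ancestors of node 3 that are two and three hops away share the last weight, so larger raw values for the first entries make closer ancestors contribute more to the aggregated representation.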
3.4 Finetuning
Training of the proposed model is done in two stages, where the first stage trains the ADGPM to predict the last layer weights of a pre-trained CNN using Equation 2. Note that, in this case, the predicted weights $\tilde{W}$ are the rows of the ADGPM output that correspond to the training classes. In order to allow the feature representation of the CNN to adapt to the new class classifiers, we train the CNN in a second stage where the last layer weights are fixed to the predicted weights of the training classes in the ADGPM and only the feature representation is updated. This can be viewed as utilizing the ADGPM as a constraint for the CNN finetuning, as we indirectly incorporate the graph information in order to constrain the CNN output space.
3.5 Training details
We use a ResNet-50 He2015 () model that has been pre-trained on the ImageNet 2012 dataset. Following wang2018zero (), we use the GloVe text model pennington2014glove (), which has been trained on the Wikipedia dataset, as the feature representation of our concepts in the graph. The ADGPM model consists of two layers as illustrated in Equation 3, and the final output dimension corresponds to the number of weights in the last layer of the ResNet-50 architecture, i.e., 2049 per class for the 2048 weights and the bias. Following the observation of wang2018zero (), we perform L2-normalization on the outputs as it regularizes the outputs into similar ranges. Similarly, we also normalize the ground truth weights produced by the CNN. We further make use of Dropout srivastava2014dropout () in each layer. The model is trained using Adam kingma2014adam () with weight decay. We make use of leaky ReLUs. The number of attention values per phase was limited, as additional weights had diminishing returns. The proposed ADGPM model is implemented in PyTorch paszke2017automatic () and training and testing is performed on a GTX 1080Ti GPU. Finetuning is done for 20 epochs using SGD with a learning rate of 0.0001 and momentum of 0.9.
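The L2-normalization step can be sketched in a few lines. The sketch assumes that each row holds the weight vector of one class and guards against all-zero rows with a small hypothetical `eps`; it is an illustration, not the PyTorch code used for the experiments.

```python
import math


def l2_normalize(rows, eps=1e-12):
    """Scale every row to unit Euclidean length (all-zero rows stay zero)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / max(norm, eps) for v in row])
    return normalized
```

Applying the same normalization to the predicted weights and to the ground truth CNN weights keeps both in a comparable range before the regression loss is computed.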
4 Experiments
We performed a comparative evaluation of the proposed GPM and ADGPM against previous state-of-the-art on two common zero-shot learning datasets, namely ImageNet and AWA2.
ImageNet deng2009imagenet () is the largest commonly used dataset for zero-shot learning. In our work, we follow the train/test split suggested by frome2013devise (), who proposed to use the 21K ImageNet dataset for zero-shot evaluation. They define three tasks of increasing difficulty, denoted as "2-hops", "3-hops" and "All". Hops refer to the distance of classes from the ImageNet 2012 1K classes in the ImageNet hierarchy and are thus a measure of how far unseen classes are from seen classes. "2-hops" contains all the classes within two hops of the seen classes and consists of 1,589 classes, while "3-hops" contains 7,860 classes. "All" contains all 20,841 classes. None of these classes are contained in the ImageNet 2012 dataset, which was used to pre-train the ResNet-50 model. Mirroring the experimental setup in frome2013devise (); norouzi2013zero (); wang2018zero (), we further evaluate the performance when training categories are included as potential labels. Note that since the only difference is the number of classes during testing, the model does not have to be retrained. We denote these splits as "2-hops+1K", "3-hops+1K" and "All+1K".
AWA2 xian2017zero () is a replacement for the original AWA dataset and represents more traditional zero-shot learning datasets, where most approaches rely on class-attribute information. It consists of 50 animal classes, with a total of 37,322 images and an average of 746 per class. The dataset further consists of 85-attribute features per class. We report results on the proposed split in xian2017zero () to ensure that there is no overlap between the test classes and the ImageNet 2012 dataset. In the proposed split, 40 classes are used for training and 10 for testing.
AWA2 test classes are contained in the 21K ImageNet classes and several of the training classes (24 out of 40) that are in the proposed split overlap with the ImageNet 2012 dataset. We, therefore, use a unified approach for both datasets. Note, we are not using attribute information but rely on the ImageNet graph obtained from WordNet and word embedding as semantic class information.
4.1 Baseline approaches
We compare our ADGPM to the following baselines:
- DeViSE frome2013devise () linearly maps visual information in the form of features extracted by a convolutional neural network to the semantic word-embedding space. The transformation is learned using a hinge ranking loss. Classification is performed by assigning the visual features to the class of the nearest word embedding.
- ConSE norouzi2013zero () projects image features into a semantic word-embedding space as a convex combination of the semantic embeddings of the closest seen classes, weighted by the probabilities that the image belongs to the seen classes. The probabilities are predicted using a pre-trained convolutional classifier. Similar to DeViSE, ConSE assigns images to the nearest classes in the embedding space.
- EXEM changpinyo2017predicting () creates visual class exemplars by averaging the PCA projections of images belonging to the same seen class. A kernel-based regressor is then learned to map a semantic embedding vector to the class exemplar. For zero-shot learning, visual exemplars can be predicted for the unseen classes using the learned regressor and images can be assigned using nearest neighbor classification.
- SYNC changpinyo2016synthesized () aligns a semantic space (e.g., the word-embedding space) with a visual model space, adds a set of phantom object classes in order to connect seen and unseen classes, and derives new embeddings as convex combinations of these phantom classes.
- GCNZ wang2018zero () represents the current state of the art and is the approach most related to our proposed ADGPM. A GCN is trained to predict the last layer weights of a convolutional neural network.
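To make the convex-combination idea behind ConSE concrete, the following is a minimal sketch, assuming that the `top_t` most probable seen classes are combined with renormalized probabilities and that the result is matched to class embeddings by cosine similarity. All names and the toy embeddings are hypothetical and do not reproduce the original implementation.

```python
import math


def conse_embedding(probs, seen_embeddings, top_t):
    """Convex combination of the top_t most probable seen-class embeddings."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_t]
    total = sum(probs[i] for i in ranked)  # renormalize so the weights sum to 1
    dim = len(seen_embeddings[0])
    return [
        sum(probs[i] * seen_embeddings[i][d] for i in ranked) / total
        for d in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_class(embedding, class_embeddings):
    """Return the class whose embedding is most similar to `embedding`."""
    return max(
        class_embeddings,
        key=lambda name: cosine(embedding, class_embeddings[name]),
    )
```

In contrast to this embedding-space matching, GCNZ and the proposed modules predict classifier weights directly and score image features with them.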
4.2 Comparison to state-of-the-art methods: ImageNet
| Test set | Model | Hit@1 (%) | Hit@2 (%) | Hit@5 (%) | Hit@10 (%) | Hit@20 (%) |
|---|---|---|---|---|---|---|
| 2-hops | ConSE changpinyo2016synthesized () | 8.3 | 12.9 | 21.8 | 30.9 | 41.7 |
| | EXEM changpinyo2017predicting () | 12.5 | 19.5 | 32.3 | 43.7 | 55.2 |
| | SYNC changpinyo2016synthesized () | 10.5 | 17.7 | 28.6 | 40.1 | 52.0 |
| | GCNZ wang2018zero () | 19.8 | 33.3 | 53.2 | 65.4 | 74.6 |
| 3-hops | ConSE changpinyo2016synthesized () | 2.6 | 4.1 | 7.3 | 11.1 | 16.4 |
| | SYNC changpinyo2016synthesized () | 2.9 | 4.9 | 9.2 | 14.2 | 20.9 |
| | EXEM changpinyo2017predicting () | 3.6 | 5.9 | 10.7 | 16.1 | 23.1 |
| | GCNZ wang2018zero () | 4.1 | 7.5 | 14.2 | 20.2 | 27.7 |
| All | ConSE changpinyo2016synthesized () | 1.3 | 2.1 | 3.8 | 5.8 | 8.7 |
| | SYNC changpinyo2016synthesized () | 1.4 | 2.4 | 4.5 | 7.1 | 10.9 |
| | EXEM changpinyo2017predicting () | 1.8 | 2.9 | 5.3 | 8.2 | 12.2 |
| | GCNZ wang2018zero () | 1.8 | 3.3 | 6.3 | 9.1 | 12.7 |
| Test set | Model | Hit@1 (%) | Hit@2 (%) | Hit@5 (%) | Hit@10 (%) | Hit@20 (%) |
|---|---|---|---|---|---|---|
| 2-hops+1K | DeViSE frome2013devise () | 0.8 | 2.7 | 7.9 | 14.2 | 22.7 |
| | ConSE norouzi2013zero () | 0.3 | 6.2 | 17.0 | 24.9 | 33.5 |
| | ConSE wang2018zero () | 0.1 | 11.2 | 24.3 | 29.1 | 32.7 |
| | GCNZ wang2018zero () | 9.7 | 20.4 | 42.6 | 57.0 | 68.2 |
| 3-hops+1K | DeViSE frome2013devise () | 0.5 | 1.4 | 3.4 | 5.9 | 9.7 |
| | ConSE norouzi2013zero () | 0.2 | 2.2 | 5.9 | 9.7 | 14.3 |
| | ConSE wang2018zero () | 0.2 | 3.2 | 7.3 | 10.0 | 12.2 |
| | GCNZ wang2018zero () | 2.2 | 5.1 | 11.9 | 18.0 | 25.6 |
| All+1K | DeViSE frome2013devise () | 0.3 | 0.8 | 1.9 | 3.2 | 5.3 |
| | ConSE norouzi2013zero () | 0.2 | 1.2 | 3.0 | 5.0 | 7.5 |
| | ConSE wang2018zero () | 0.1 | 1.5 | 3.5 | 4.9 | 6.2 |
| | GCNZ wang2018zero () | 1.0 | 2.3 | 5.3 | 8.1 | 11.7 |
Quantitative results for the comparison on the ImageNet datasets are shown in Table 1. We use DGPM to denote the accuracy that is achieved after training the ADGPM model without attention and before finetuning the CNN. DGPM(f) and ADGPM(f) are used to denote the results for DGPM and ADGPM after finetuning, respectively. We further report the accuracy achieved by the one-layer GPM model and a finetuned version of it, GPM(f). Compared to previous results such as ConSE changpinyo2016synthesized (), EXEM changpinyo2017predicting (), and GCNZ wang2018zero (), our proposed methods outperform the previous results by a considerable margin, achieving, for instance, more than 50% relative improvement for Top-1 accuracy on the 21K ImageNet "All" dataset. We observe that our methods especially outperform the baseline models on the "All" task, illustrating the potential of our methods to more efficiently propagate knowledge. Furthermore, we perform an ablation experiment for the 2-hops dataset where we analyze the effect of adding attention and finetuning to the model and observe that finetuning the DGPM model and introducing attention consistently lead to improvements in model performance. Results further illustrate that the proposed ADGPM(f), which learns both feature extraction and the zero-shot classifiers, achieves better performance than all variants, demonstrating that our dense attention scheme can make features more general and incorporates more graph information.
Qualitative results of the finetuned ADGPM and GPM are shown in Figure 3. Example images from unseen test classes are displayed and we compare the results of our proposed ADGPM and GPM to results produced by a pre-trained ResNet. Note that ResNet can only predict training classes, while the others predict classes not seen in training. For comparison, we also provide results for our re-implementation of GCNZ. We observe that GPM and ADGPM generally provide coherent top-5 results. All methods struggle to predict the opener and tend to predict some type of plane instead; however, ADGPM does include opener in its top-5 results. We further observe that the prediction task on this dataset for zero-shot learning is difficult as it contains classes of fine granularity, such as many different types of squirrels, planes and furniture. Additional examples are provided in the supplementary material.
Study of the generalized setting. Results for the generalized setting, which includes the training labels as potential labels during classification of the zero-shot examples, are shown in Table 2. For the baselines, we include two implementations of ConSE, one that uses AlexNet as a backbone norouzi2013zero () and one that uses ResNet-50 wang2018zero (). Compared to Table 1, we observe that the accuracy is considerably lower, but GPM, DGPM and ADGPM still outperform the previous state-of-the-art approach GCNZ on most metrics. For the 2-hops dataset, DGPM and GCNZ perform similarly on Top-1 accuracy, but DGPM performs better on the remaining Top-k measures. Results for the finetuned methods show that, similar to the non-generalized setting, considerable improvements can be obtained compared to the baseline. Note that the difference in performance between GPM(f) and ADGPM(f) is smaller in the generalized setting than in the non-generalized setting.
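The difference between the two evaluation settings can be sketched with a small Hit@k computation, assuming that every image comes with a score per class label and that the generalized setting simply enlarges the candidate label set with the seen classes. The scores and class names below are made up for illustration.

```python
def rank_labels(scores, candidates):
    """Sort the candidate labels by descending score."""
    return sorted(candidates, key=lambda label: scores[label], reverse=True)


def hit_at_k(scores_per_image, true_labels, candidates, k):
    """Percentage of images whose true label is among the k best candidates."""
    hits = 0
    for scores, truth in zip(scores_per_image, true_labels):
        if truth in rank_labels(scores, candidates)[:k]:
            hits += 1
    return 100.0 * hits / len(true_labels)


# Hypothetical scores for two test images of unseen classes.
SCORES = [
    {"zebra": 0.9, "okapi": 0.8, "tapir": 0.1},
    {"zebra": 0.2, "okapi": 0.3, "tapir": 0.6},
]
TRUTH = ["okapi", "tapir"]
UNSEEN = ["okapi", "tapir"]
SEEN = ["zebra"]
```

Here `hit_at_k(SCORES, TRUTH, UNSEEN, 1)` counts both images as hits, whereas `hit_at_k(SCORES, TRUTH, UNSEEN + SEEN, 1)` loses the first image to the seen class `zebra`, which mirrors why accuracy drops when the training labels are added as candidates.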
Study of the domain shift issue. Performance on seen classes also needs to be considered, as there is a tradeoff between performance on seen and unseen classes. In a realistic setting, zero-shot learning models should perform well on both seen and unseen classes. To put the zero-shot performance into perspective, we perform experiments where we analyze how the model's performance on the original seen classes is affected by domain shift as additional unseen classes (all 2-hops classes) are introduced. Table 3 shows the results when the model is tested on the validation dataset from ImageNet 2012. We compare the performance to our re-implementation of the GCNZ model with a ResNet-50 backbone and also to the performance of the original ResNet-50 model, which is trained only on the seen classes. It can be observed that all our methods outperform GCNZ on Hit@1 and Hit@2 accuracy. We also observe that finetuning the feature extraction part of the CNN improves the performance on the training classes considerably, decreasing the gap to the performance of the pre-trained ResNet-50.
Scalability. To obtain good scalability, it is important that the adjacency matrix $A$ is a sparse matrix, so that the complexity of computing $D^{-1} A X \Theta$ is linearly proportional to the number of edges present in $A$. Our approach utilizes the structure of knowledge graphs, where entities only have few ancestors and descendants, to ensure this. The adjacency matrix for the ImageNet hierarchy used in our experiments, for instance, is highly sparse and remains sparse after our dense connections are added.
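The scalability argument can be sketched by storing the graph as an edge list, assuming each directed edge `(i, j)` appears exactly once and self-loops are listed explicitly. The helper names are hypothetical.

```python
def density(num_nodes, edges):
    """Fraction of non-zero entries in an N x N adjacency matrix."""
    return len(edges) / (num_nodes * num_nodes)


def sparse_propagate(num_nodes, edges, feats):
    """Compute D^-1 A X from an edge list.

    The aggregation loop touches every edge exactly once, so its cost grows
    linearly with the number of edges rather than quadratically with N.
    """
    dim = len(feats[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    degree = [0] * num_nodes
    for src, dst in edges:  # edge (src, dst) means A[src][dst] = 1
        degree[src] += 1
        for d in range(dim):
            out[src][d] += feats[dst][d]
    for i in range(num_nodes):
        if degree[i]:
            out[i] = [v / degree[i] for v in out[i]]
    return out
```

Adding dense ancestor and descendant connections only appends edges to this list, so the cost of one propagation step grows with the number of added edges.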
4.3 Comparison to state-of-the-art methods: AWA2
The results are presented in Table 4. We observe that GPM, DGPM and ADGPM outperform the baseline approaches. Note that our models differ considerably from the baselines as they do not make use of the attributes provided in the dataset. To illustrate the merits of our approach, we re-implement wang2018zero (), as it represents the method that is most closely related to our approach and also makes use of word embeddings and a knowledge graph. We observe that our methods also outperform wang2018zero (); however, the improvement is lower than on the ImageNet dataset, which we believe is due to the arguably simpler task with a considerably lower number of classes. We observe an additional increase in performance by making use of finetuning. Note that the simple GPM outperforms the proposed ADGPM on this dataset. However, similar to the ImageNet experiments, we observe a large improvement by adding the simple attention scheme to DGPM, and therefore believe that a more granular attention scheme can allow ADGPM to outperform GPM. The results illustrate that the proposed GPM and ADGPM can universally address the problem of zero-shot learning as long as test classes are represented in the knowledge graph.
| Model | Accuracy (%) |
|---|---|
| ConSE norouzi2013zero () | 44.5 |
| DeViSE frome2013devise () | 59.7 |
| SYNC changpinyo2016synthesized () | 46.6 |
5 Conclusion
In contrast to previous approaches using graph convolutional neural networks for zero-shot learning, our proposed GPM and ADGPM tailor the graph approach to the problem by exploiting the hierarchical structure of the knowledge graph. This allowed us to address the problem of knowledge being diluted as it is passed along the graph. Experiments illustrate the ability of the proposed methods, outperforming previous state-of-the-art methods for zero-shot learning. In future work, we aim to investigate the potential of more advanced attention mechanisms to further improve the performance of ADGPM compared to GPM. The inclusion of additional semantic information for settings where these are available for a subset of nodes is another future direction.
References
-  Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
-  M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
-  J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
-  S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5327–5336, 2016.
-  S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3476–3485, 2017.
-  W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European Conference on Computer Vision, pages 52–68. Springer, 2016.
-  M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
-  S. Deutsch, S. Kolouri, K. Kim, Y. Owechko, and S. Soatto. Zero shot learning via multi-scale manifold regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7112–7119, 2017.
-  Z. Ding, M. Shao, and Y. Fu. Low-rank embedded ensemble semantic dictionary for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2050–2058, 2017.
-  A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
-  T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
-  E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3174–3183, 2017.
-  H. Larochelle, D. Erhan, and Y. Bengio. Zero-data learning of new tasks. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, pages 646–651. AAAI Press, 2008.
-  Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI Press, 2018.
-  Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Y. Long, L. Liu, L. Shao, F. Shen, G. Ding, and J. Han. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In Proceedings of the European Conference on Computer Vision, pages 488–501. Springer, 2012.
-  I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1792–1801, 2017.
-  M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, 2014.
-  M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, pages 1410–1418, 2009.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems Workshop, 2017.
-  J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
-  M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1641–1648. IEEE, 2011.
-  B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
-  R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1481–1488. IEEE, 2011.
-  R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. arXiv preprint arXiv:1803.08035, 2018.
-  Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.