Rethinking Zero-Shot Learning: A Conditional Visual Classification Perspective
Zero-shot learning (ZSL) aims to recognize instances of unseen classes solely based on the semantic descriptions of the classes. Existing algorithms usually formulate it as a semantic-visual correspondence problem by learning mappings from one feature space to the other. Despite being reasonable, such approaches implicitly discard the highly precious discriminative power of visual features and thus produce undesirable results. We instead reformulate ZSL as a conditional visual classification problem, i.e., classifying visual features with classifiers generated from the semantic descriptions. With this reformulation, we develop algorithms targeting various ZSL settings: for the conventional setting, we propose to train a deep neural network that directly generates visual feature classifiers from the semantic attributes with an episode-based training scheme; for the generalized setting, we concatenate the learned, highly discriminative classifiers for seen classes with the generated classifiers for unseen classes to classify visual features of all classes; for the transductive setting, we exploit unlabeled data to effectively calibrate the classifier generator using a novel learning-without-forgetting self-training mechanism, guided by a robust generalized cross-entropy loss. Extensive experiments show that our proposed algorithms outperform state-of-the-art methods by large margins on most benchmark datasets in all ZSL settings.
Deep learning methods have achieved revolutionary successes on many tasks in computer vision owing to the availability of abundant labeled training data [45, 44, 17, 20, 19, 21]. However, labeling large-scale training data for each task is both labor-intensive and unscalable. Inspired by humans' remarkable ability to recognize instances of unseen classes solely from class descriptions, without seeing any visual example of those classes, researchers have extensively studied an image classification setting analogous to this human ability, called zero-shot learning (ZSL) [41, 31, 22, 33]: labeled training images of seen classes and semantic descriptions of both seen and unseen classes are given, and the task is to classify test images into seen and unseen classes.
Existing approaches usually formulate ZSL as a visual-semantic correspondence problem: they learn the visual-semantic relationship from seen classes and apply it to unseen classes, considering that the seen and unseen classes are related in the semantic space [1, 43, 13]. These methods usually project either visual features or semantic features from one space to the other, or alternatively project both types of features into an intermediate embedding space. In the shared embedding space, the associations between the two types of features are used to guide the learning of the projection functions.
However, these methods fail to recognize the tremendous effort invested in obtaining discriminative visual features over a large number of classes through training powerful deep neural network classifiers with huge computational and data resources, and thus implicitly discard the highly precious discriminative power of visual features. Specifically, on one hand, the visual features used in most ZSL methods are extracted by powerful deep neural networks (e.g., ResNet101) trained on large-scale datasets (e.g., ImageNet). These visual features are already highly discriminative; reprojecting them to any other space, especially a lower-dimensional one, impairs their discriminability, because dimension reduction usually significantly shrinks data variance. It is surprising that the majority of existing ZSL approaches transform the visual feature vectors in various ways [22, 33, 13]. On the other hand, by the nature of classification problems, the competition information among different classes is crucial for classification performance. Yet many ZSL approaches ignore this class-separation information during training because they focus on learning the associations between visual and semantic features, failing to recognize that ZSL is essentially a classification problem.
Inspired by the above observations, we propose to solve ZSL in a novel conditional visual feature classification framework. In the proposed framework, we generate visual feature classifiers from the semantic attributes, and thus intrinsically preserve the discriminability of the visual features while exploiting the competition information among different classes. Within this framework, we propose various strategies to address different ZSL problems.
For the conventional ZSL problem, where only unseen classes are involved in evaluation, we propose to train a deep neural network that generates visual feature classifiers directly from the semantic attributes. We train the network with a cosine-similarity based cross-entropy loss, which mitigates the impact of the variances of features from the two different domains when calculating their correlations. Borrowing ideas from meta-learning, we train our model in an episode-based way by composing numerous “fake” new ZSL tasks, so that its generalizability to “real” new ZSL tasks at test time is enhanced. For the generalized setting, in which seen classes are included in ZSL evaluations, we concatenate the classifiers for seen and unseen classes to classify visual features of all classes. Since the classifiers for seen classes are trained with labeled samples, they are highly discriminative in discerning whether an incoming image belongs to the seen classes or not. This desirable property prevents our method from significant performance drops when many more classes are involved in evaluation. For the transductive setting, in which images of unseen classes are available during training, we take advantage of these unlabeled data to calibrate our classifier generator using the pseudo labels generated by itself. To limit the harm of incorrect pseudo labels and to avoid over-adapting the model to new classes, we propose to use the generalized cross-entropy loss to guide the calibration process under an effective learning-without-forgetting training scheme.
In summary, our contributions are as follows:
We reformulate ZSL as a conditional visual classification problem, by which we can essentially benefit from the high discriminability of visual features and the inter-class competition information among training classes to solve the ZSL problem in various settings.
We propose various effective techniques to address different ZSL problems uniformly within the proposed framework.
Experiments show that our algorithms significantly outperform state-of-the-art methods by large margins on most benchmark datasets in all the ZSL settings.
2 Related Work
Zero-Shot Learning (ZSL) aims to recognize unseen classes based on their semantic associations with seen classes. The semantic associations can be built on human-annotated attributes [34, 26, 2], word vectors [10, 43, 4], text descriptions [16, 6], etc. In practice, ZSL is performed by first learning an embedding space where semantic vectors and visual features interact. Then, within the learned embedding space, for the visual features of any given image of the unseen classes, the best match among the semantic vectors of the unseen classes is selected.
According to the embedding space used, existing methods can generally be categorized into three groups. Some approaches select the semantic space as the embedding space and project visual features into it [15, 10]. Projecting visual features into an often much lower-dimensional semantic space shrinks the variance of the projected data points and thus aggravates the hubness problem, i.e., some candidates become the biased best matches to many of the queries. Alternatively, some methods project both visual and semantic features into a common intermediate space [1, 35, 47]. However, due to the lack of training samples from unseen classes, these methods are prone to classifying test samples into seen classes. The third category of methods chooses the visual space as the embedding space and learns a mapping from the semantic space to the visual space. Benefiting from the abundant data diversity in the visual space, these methods can mitigate the hubness problem to some extent.
Recently, a new branch of methods has emerged that approaches ZSL via data augmentation, using either variational auto-encoders (VAEs) or Generative Adversarial Networks (GANs) [5, 42, 8, 48, 50]. These methods learn from the visual and semantic features of seen classes to produce generators that synthesize visual features from class semantic descriptions. The synthesized visual features are then used to train a standard classifier for object recognition.
ZSL becomes easier when unlabeled test samples are available during training, i.e., the so-called transductive ZSL. This is because unlabeled test samples can be utilized to reach clearer decision boundaries for both seen and unseen classes; in fact, the problem then resembles semi-supervised learning. Propagated Semantic Transfer (PST) conducts label propagation from seen classes to unseen classes by exploiting the class manifold structure. Unsupervised Domain Adaptation (UDA) formulates the problem as a cross-domain data association problem and solves it with regularized sparse coding. Quasi-Fully Supervised Learning (QFSL) aims to strengthen the mapping from the visual space to the semantic space by explicitly requiring the visual features to be mapped to the categories (seen and unseen) they belong to.
Unlike the above methods, we approach ZSL from the perspective of conditional visual feature classification. Perhaps most similar to our algorithms are [16, 38], which also approach ZSL by generating classifiers. However, the former projects visual features to a lower-dimensional space, harming the discriminability of the visual features, while the latter uses a graph convolutional network to model the semantic relationships and output classifiers, requiring categorical relationships as additional input. We instead generate classifiers directly from attributes with a deep neural network and train the model with a novel cosine-similarity based cross-entropy loss. Besides, neither of the two methods uses episode-based training to enhance model adaptability to novel classes. Moreover, they are only feasible for the conventional ZSL setting, while our method is flexible across various ZSL settings.
Zero-shot learning (ZSL) is to recognize objects of unseen classes given only semantic descriptions of the classes. Formally, suppose we have three sets of data: a training set, a test set, and a set of class semantic descriptions. The training and test sets consist of images with their corresponding labels, and there is no overlap between training (seen) classes and test (unseen) classes. The goal of ZSL is to learn transferable information from the training set that can be used to classify the unseen classes in the test set, with the help of semantic descriptions for both seen and unseen classes. The semantic descriptions can be human-annotated class attributes or articles describing the classes.
We solve ZSL in a conditional visual feature classification framework. Specifically, we predict the probability of an image x belonging to a class given the semantic description of the class; the candidate classes are only the unseen ones in the standard setting, and both seen and unseen classes in the generalized setting. When the (unlabeled) test images are available during training, we call the problem transductive ZSL; for convenience, we sometimes call the setting where they are unavailable inductive ZSL.
3.1 Zero-Shot Learning
By approaching ZSL via visual classification conditioned on attributes, we need to generate visual feature classifiers from the attributes. We achieve this by learning a deep neural network that takes the semantic feature vector of a class as input and outputs the classifier weight vector for that class. Since the model must generate classifiers for novel classes at test time, we adopt the episode-based training mechanism, an effective and popular technique in meta-learning [37, 9, 18], to mimic this scenario during training.
The key to episode-based training is to sample in each mini-batch a “fake” new task that matches the scenario in which the model is tested. This process is called an episode. The goal is to expose the model to numerous “fake” new tasks during training, such that it generalizes better to real new tasks at test time. To construct a ZSL episode, we randomly sample a ZSL task from the training set and the seen-class attributes, containing samples from N classes, K samples per class. Note that for each sample, we dismiss its global (dataset-wise) label and replace it with a local (mini-batch-wise) label in {1, ..., N}, while still maintaining the class separation (samples with the same global label share the same local label). This cuts off the connections across tasks induced by the shared global label pool, so that each mini-batch is treated as a new task. The attribute vectors of the N sampled classes are associated with the task.
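The episode construction above can be sketched as follows; this is a minimal numpy sketch, where the function name and sampling interface are our own, not the paper's:

```python
import numpy as np

def sample_episode(features, labels, n_way=32, k_shot=4, rng=None):
    """Sample one 'fake' ZSL task (an episode): n_way classes with
    k_shot samples each.  Global labels are replaced by local labels
    0..n_way-1 so that each mini-batch is treated as a fresh task."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    feats, local_labels = [], []
    for local, c in enumerate(classes):
        idx = rng.choice(np.flatnonzero(labels == c), size=k_shot, replace=False)
        feats.append(features[idx])
        local_labels.extend([local] * k_shot)
    return np.concatenate(feats), np.array(local_labels), classes
```

The returned `classes` array indicates which attribute vectors should be fed to the classifier generator for this episode.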
For each task, the generator network takes the sampled classes' attribute vectors as input and outputs their classifier weights. With the generated classifier, we can calculate classification scores for the task's visual features. Rather than the extensively used dot product, we use cosine similarity.
Algorithm 1. Proposed ZSL approach
Input: Training set and attributes.
Output: Classifier weight generation network.
while not done do
1. Randomly sample a ZSL task from the training set and attributes.
2. Calculate the loss according to Eq. (3).
3. Update the generator through back-propagation.
Cosine similarity based classification score function. Traditional multi-layer neural networks use the dot product between the output vector of the previous layer and the incoming weight vector as the input to the activation function. [23, 11] recently showed that replacing the dot product with cosine similarity can bound and reduce the variance of the neurons and thus yield models of better generalization. We are calculating correlations between data from two dramatically different domains; the attribute domain in particular has discontinuous, high-variance features. Using cosine similarity mitigates the harmful effect of the high variances and yields desirable Softmax activations. With this consideration, we define our classification score function as
where a learnable scalar controls the peakiness of the probability distribution generated by the Softmax operator, and each class's classifier weight vector is produced by the generator.
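A small numpy sketch of the scaled cosine-similarity score described above; the fixed `sigma` stands in for the learnable peakiness scalar, and the function name is ours:

```python
import numpy as np

def cosine_scores(X, W, sigma=10.0):
    """Scaled cosine-similarity classification scores.
    X: (n, d) visual features; W: (c, d) generated classifier weights.
    sigma plays the role of the learnable peakiness scalar (fixed here)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return sigma * (Xn @ Wn.T)  # (n, c) logits fed to Softmax
```

Because both operands are normalized, every logit is bounded in [-sigma, sigma], which is exactly the variance-bounding effect the text appeals to.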
With this definition, the loss of a typical ZSL task is defined as a cosine-similarity based cross-entropy loss over the task's samples plus a regularization term (Eq. (3)), where a hyper-parameter weights the $\ell_2$-norm regularization of the learnable parameters of the generator network.
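The equation body was lost in extraction; one plausible reconstruction of Eq. (3), consistent with the surrounding description but in our own notation ($\mathcal{S}$: the task's samples, $\sigma$: peakiness scalar, $w_c$: generated weight for class $c$, $\lambda$: regularization weight, $\theta$: generator parameters), is:

```latex
\mathcal{L}(\theta) \;=\; -\frac{1}{|\mathcal{S}|}\sum_{(x_i,\,y_i)\in\mathcal{S}}
\log \frac{\exp\!\big(\sigma \cos(w_{y_i}, x_i)\big)}
          {\sum_{c=1}^{N}\exp\!\big(\sigma \cos(w_c, x_i)\big)}
\;+\; \lambda\,\|\theta\|_2^2
```

This is a sketch of the loss form, not necessarily the paper's exact equation.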
Algorithm 1 outlines our training procedures.
3.2 Generalized Zero-Shot Learning
With the learned classifier generator, given the attributes of unseen classes at test time, we generate the corresponding classifier weights and use them to classify visual features of unseen classes according to Eq. (2).
When both seen and unseen classes are involved in evaluation, i.e., the generalized ZSL setting, we combine the classifiers for both seen and unseen classes to classify images from all classes. Specifically, we generate classifiers for the unseen classes from their attributes, and we already hold the trained classifiers for the seen classes; we use their concatenation as the classifier for all classes.
It is worth noting that since the seen-class classifier has already been trained with labeled samples, it should be very discriminative in discerning whether an incoming image belongs to the seen classes or not. As will be shown in the experiments, this desirable property prevents our method from significant recognition accuracy drops when many more classes are involved in the evaluations.
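The GZSL prediction step reduces to one concatenation followed by the cosine score of Eq. (2); a minimal numpy sketch (function name ours):

```python
import numpy as np

def gzsl_predict(x, W_seen, W_unseen, sigma=10.0):
    """GZSL classification sketch: concatenate the trained seen-class
    classifier with the generated unseen-class one, then pick the class
    with the highest cosine-similarity score."""
    W_all = np.concatenate([W_seen, W_unseen], axis=0)   # (c_s + c_u, d)
    xn = x / np.linalg.norm(x)
    Wn = W_all / np.linalg.norm(W_all, axis=1, keepdims=True)
    return int(np.argmax(sigma * (Wn @ xn)))             # index into all classes
```

Indices below `c_s` correspond to seen classes and the rest to unseen classes, so the same argmax decides both "which class" and "seen vs. unseen".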
3.3 Transductive Zero-Shot Learning
Thanks to the conditional visual classification formulation of ZSL, the above inductive approach can readily be adapted to the transductive ZSL setting: we can utilize test data during training to calibrate our classifier generator and output classifiers with better decision boundaries for both seen and unseen classes. We achieve this via self-training. Specifically, we alternate between generating pseudo labels for images of unseen classes using the classifier generator and updating the generator using the generated pseudo labels. Two key problems must be solved for this idea to work. The first is how to prevent the generator from over-adapting to unseen classes such that the knowledge previously learned from seen classes is lost, hurting performance on seen classes. The second is how to prevent the generator from being impaired by incorrect pseudo labels. We propose a novel self-training based transductive ZSL algorithm that avoids both problems. Figure 1 illustrates our algorithm.
To generate pseudo labels for the test images, we first generate classifier weights for the unseen classes from their attributes. With these weights, we calculate the classification scores S of the test images according to Eq. (2), from which pseudo labels can be obtained. There inevitably exist noisy labels among them; we propose to mitigate their impact by a novel filtering strategy based on classification-score peakiness.
For each test image, consider its classification scores over all the classes, and in particular the highest and second highest among them. The pseudo label assigned to the image is the class attaining the highest score. However, we regard this assignment as a “confident” one only if the score distribution is peaky enough (Eq. (5)), where a threshold controls the required peakiness. This constraint prevents ambiguous label assignments from being exploited for calibrating the classifier generator.
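The filtering step can be sketched as follows; since the exact peakiness criterion is not recoverable from the text, this sketch assumes a margin between the top two scores, and the function name is ours:

```python
import numpy as np

def confident_pseudo_labels(S, tau=0.2):
    """Assign pseudo labels and keep only 'confident' ones whose top
    score beats the runner-up by at least tau.  The margin criterion is
    our assumption; the paper only requires the score distribution to
    be peaky enough w.r.t. a threshold."""
    top2 = np.sort(S, axis=1)[:, -2:]        # (n, 2): runner-up, best
    margin = top2[:, 1] - top2[:, 0]
    return np.argmax(S, axis=1), margin > tau
```

Only the samples flagged `True` (and their attributes) enter the confident set used for generator calibration.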
After obtaining the confident set and the corresponding attributes, we can use them to adjust the generator. However, fine-tuning with only these confident unseen-class samples would cause a strong bias towards the unseen classes, such that the knowledge previously acquired about seen classes would be forgotten after a few iterations. Worse, the incorrect pseudo labels may damage the generator when their proportion is high. We propose a novel learning-without-forgetting training scheme to avoid this.
Along with sampling a ZSL task from the confident test set and its pseudo labels to calibrate the generator to unseen classes, we sample another ZSL task from the labeled training set to preserve the generator's memory of the seen classes and dilute the impact of the noisy pseudo labels. Further, while updating the generator, we also update the seen-class classifier to adjust the decision boundaries of seen classes towards the unseen ones.
Algorithm 2. Proposed approach for transductive ZSL
Input: Training set, attribute sets of seen and unseen classes, test images, and hyper-parameters.
Output: Class labels of the test images, weight generator, and classifier weights for seen classes.
1. Obtain the weight generator from the training set and attributes using Algorithm 1.
2. Obtain the classifier weights for seen classes.
3. Calculate the classifier weights for unseen classes.
4. Generate pseudo labels for the test images according to Eq. (2).
5. Select the confident test set based on Eq. (5).
6. Sample ZSL tasks from the labeled training set and from the confident test set.
7. Calculate the loss according to Eq. (7).
8. Update the generator and the seen-class classifier through back-propagation.
Moreover, we introduce the recently proposed generalized cross-entropy loss to handle the pseudo-labeled unseen-class task and limit the impact of incorrect pseudo labels on the classifier weight generator:
where the probability of a sample belonging to its (pseudo) class is calculated according to Eq. (2), and $q \in (0, 1]$ is a hyper-parameter of which a higher value is preferred when the noise level is high. It can be shown that Eq. (6) reduces to Eq. (3) as $q$ approaches 0; when $q = 1$, it becomes the Mean Absolute Error (MAE) loss. The cross-entropy loss is powerful for classification tasks but noise-sensitive, while the MAE loss performs worse on conventional classification but is robust to noisy labels. Tuning $q$ between 0 and 1 fits different noise levels.
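The generalized cross-entropy loss has the standard closed form $L_q(p) = (1 - p^q)/q$; a small sketch illustrating the two limiting behaviors described above (function name ours):

```python
import numpy as np

def generalized_ce(p_true, q=0.7):
    """Generalized cross-entropy L_q(p) = (1 - p^q) / q on the predicted
    probability of the (pseudo-)true class.  As q -> 0 this approaches
    the usual -log p; at q = 1 it is the noise-robust, MAE-like 1 - p."""
    p_true = np.asarray(p_true, dtype=float)
    return (1.0 - p_true ** q) / q
```

A correct, confident prediction (p = 1) incurs zero loss for any q, while badly mislabeled samples contribute a bounded loss when q is large, which is what limits the damage from wrong pseudo labels.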
By handling the pseudo-labeled unseen-class task with the generalized cross-entropy loss and the seen-class task with the conventional cross-entropy loss of Eq. (3), we obtain our loss function for transductive ZSL as the sum of the two task losses (Eq. (7)). Algorithm 2 outlines the training procedure.
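The body of Eq. (7) was lost in extraction; a plausible reconstruction, in our own notation ($\mathcal{T}^{s}$: task sampled from the labeled seen classes, $\mathcal{T}^{u}$: task sampled from the confidently pseudo-labeled unseen-class images), is:

```latex
\mathcal{L}_{\mathrm{trans}} \;=\; \mathcal{L}\big(\mathcal{T}^{s}\big)
\;+\; \mathcal{L}_{q}\big(\mathcal{T}^{u}\big)
```

Here $\mathcal{L}$ is the cross-entropy loss of Eq. (3) and $\mathcal{L}_q$ the generalized cross-entropy loss of Eq. (6); this is a sketch of the stated structure, not necessarily the paper's exact equation.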
4.1 Datasets and Evaluation Settings
We employ the most widely used zero-shot learning datasets for performance evaluation, namely CUB, AwA1, AwA2, SUN, and aPY. The statistics of the datasets are shown in Table 1. We follow the GBU setting and evaluate both the conventional ZSL setting and the generalized ZSL (GZSL) setting. In conventional ZSL, test samples are restricted to the unseen classes, while in GZSL they may come from either seen or unseen classes. For both settings, we use top-1 (T1) Mean Class Accuracy (MCA) as the evaluation metric. For GZSL, we evaluate the MCA for both seen and unseen classes, and also calculate their harmonic mean H.
4.2 Implementation details
Following common practice, we use ResNet101 trained on ImageNet for feature extraction, which yields a 2048-dimensional vector for each input image. The classifier generation model consists of two pairs of FC+ReLU layers, i.e., FC-ReLU-FC-ReLU, which maps semantic vectors to visual classifier weights. The dimension of the intermediate hidden layer is 1600 for all five datasets. We train with the Adam optimizer and a fixed learning rate for all datasets on 1,000,000 randomly sampled ZSL tasks. Each task consists of 32 randomly sampled classes with 4 samples per class, except on aPY, where we sample fewer classes per task because there are only 20 training classes in total. The regularization hyper-parameter is tuned separately for AwA1, AwA2, CUB, SUN, and aPY.
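The FC-ReLU-FC-ReLU generator can be sketched as a plain forward pass; this numpy sketch uses the stated dimensions (hidden 1600, ResNet101 features 2048) but random untrained weights, with biases omitted for brevity, so it only illustrates the architecture:

```python
import numpy as np

def init_generator(att_dim, hid_dim=1600, feat_dim=2048, rng=None):
    """Random weights for the FC-ReLU-FC-ReLU generator described above
    (hidden size 1600, ResNet101 feature size 2048).  A forward-pass
    sketch, not the trained model."""
    rng = rng or np.random.default_rng(0)
    return {"W1": 0.01 * rng.standard_normal((att_dim, hid_dim)),
            "W2": 0.01 * rng.standard_normal((hid_dim, feat_dim))}

def generate_classifiers(attributes, params):
    """Map class attribute vectors (c, att_dim) to classifier weight
    vectors (c, feat_dim) with FC-ReLU-FC-ReLU."""
    h = np.maximum(attributes @ params["W1"], 0.0)   # FC + ReLU
    return np.maximum(h @ params["W2"], 0.0)         # FC + ReLU
```

In the actual implementation the two matrices would be `torch.nn.Linear` layers trained end-to-end with Adam, as described in the text.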
For transductive ZSL, the experimental setting is the same as in the corresponding inductive case for each dataset. For all datasets, we update the pseudo labels of unseen classes every 10,000 iterations and execute 50 such updates. The peakiness threshold and the hyper-parameter of the generalized cross-entropy loss are fixed across all datasets. We develop our algorithms in PyTorch.
4.3 Ablation Studies
By formulating ZSL as a visual classification problem conditioned on the attributes, we naturally benefit from the high discriminability of visual features. Meanwhile, to combat the significant variance of visual and attribute features, we propose to replace the widely used dot product with cosine similarity when calculating the classification score. Moreover, we introduce the episode-based training scheme to enhance the adaptability of our model to new tasks. We conduct ablation studies to evaluate the effectiveness of these designs.
Preserving visual feature discriminability. To study the importance of preserving visual discriminability, we implement two baselines: in one, we project visual features to the attribute space; in the other, to an intermediate space (of half the dimension of the visual space). All other settings are the same as in our method.
Table 2 shows that performance degrades significantly when projecting visual features to either the semantic space or the intermediate space, no matter whether the dot-product or the cosine-similarity based classification score function is used. As analyzed before, image feature embeddings for ZSL are usually generated offline by powerful feature extraction networks, so high discriminability has already been secured. Reprojecting them to either the attribute space or the intermediate space inevitably impairs this discriminability. Worse, these spaces are often of lower dimension than the visual embedding space; the visual variance, which is crucial for discriminability, shrinks once the feature embeddings are reprojected to lower-dimensional spaces. With the discriminability of visual features damaged, the hubness problem becomes even more severe, leading to much worse results.
Cosine similarity based classification score function. We compare the dot-product and cosine-similarity based loss functions within all three classification spaces. Table 2 shows that the classification space is the more dominant factor: neither score function works well if the classification space is inappropriate. When the visual embedding space is selected for classification, the proposed cosine-similarity based score function yields much better performance than the dot-product based one. We speculate the reason is that class attribute values are not continuous, so there is large variance among the attribute vectors of different classes. Consequently, the classifier weights derived from them also possess large variance, which might cause high-variance inputs to the Softmax activation function. Unlike the dot product, our cosine-similarity based score function normalizes the classifier weights before taking their dot product with the visual embeddings. This normalization bounds and reduces the variance of the classifier weights, contributing to better performance.
Episode-based training mechanism. The proposed episode-based training mechanism trains our classifier weight generator in the way it works at test time. From Table 2, we observe gains of about 3 points in both the ZSL and GZSL settings when this training mechanism is adopted. This matches our expectation: after being exposed to numerous (fake) new ZSL tasks during training, the weight generator acquires the knowledge of how to handle real new ZSL tasks at test time, so better performance can be expected.
4.4 Comparative Results
Zero-shot learning. Table 3 shows comparative results of the proposed method and state-of-the-art ones on the inductive ZSL problem. For conventional ZSL, our method achieves the best results on three of the five datasets. Remarkably, on the AwA2 dataset, our method beats the second best by about 4 points.
Generalized zero-shot learning. More interesting observations can be made in the GZSL setting, where classification is performed over both seen and unseen classes. With more classes involved, the classification accuracy on unseen classes drops for all methods. However, our method exhibits much more robustness than the others and drops only moderately on these datasets. Remarkably, our method sometimes secures unseen-class accuracy that is relatively about 100% higher than the second best (on aPY). We attribute these striking improvements to our consideration of inter-class separation during training: the resulting classifiers for seen classes possess favorable class-separation properties and are highly discriminative in discerning whether an incoming image belongs to the classes they were trained for.
Contrary to its striking advantage in recognizing unseen classes, our method seems somewhat “forgetful” and is outperformed by many methods in recognizing seen classes. This is because during training we constantly sample new ZSL tasks to train the weight generator to acquire the knowledge of handling new ZSL tasks. Unlike existing methods, which process the whole dataset together or are specially designed to keep the training memory, our method intentionally forgets the global class structure of the training set. Therefore, as its capability of handling new ZSL tasks increases, it inevitably sacrifices some competence in classifying seen classes. Despite this, our method surpasses the others by large margins on three of the five datasets in terms of the harmonic mean (H), while being very close to the feature-synthesis based method f-CLSWGAN, which generates additional data for training.
Transductive zero-shot learning. When test data are available during training, better performance is expected, as we can utilize them to mitigate the classification bias towards seen classes. Table 4 verifies this: our transductive algorithm significantly outperforms its inductive counterpart, substantiating the effectiveness of our novel learning-without-forgetting self-training technique. Further, with the generalized cross-entropy loss for unseen classes, Ours-trans (GXE) consistently performs better than Ours-trans (XE), which uses the conventional cross-entropy loss. This shows the effectiveness of the generalized cross-entropy loss in limiting the negative impact of incorrect pseudo labels. As in the inductive setting, our method significantly outperforms existing ones, especially for unseen classes in GZSL.
4.5 Further Analyses
Analyzing the self-training process. In the transductive ZSL setting, we calibrate the weight generator towards unseen classes using test data in a novel self-training fashion: we alternate between generating pseudo labels for unseen images and updating the generator with the high-confidence pseudo labels. Through this self-training strategy, the bias of the generator towards seen classes is progressively eliminated, boosting unseen-class recognition as a consequence.
To analyze how this self-training process works, we plot in Figure 2 the changes of the training loss, the classification accuracy, the number of confident unseen samples (used for updating the model), and the portion of correctly labeled ones among them. As training proceeds, the training loss keeps decreasing and the collection of confident samples is consistently enlarged. At the same time, the accuracy of the pseudo-label assignment also improves. This means that the unlabeled images used for training grow in both quantity and quality, which in turn further improves the classifier generator.
Number of classes per episode. Table 5 shows that ZSL accuracy changes little with the number of classes sampled in each mini-batch, which contradicts observations in prior episode-based few-shot learning work. We speculate that sampling more classes per mini-batch there helps boost the discriminability of the feature extraction model, as it must extract distinct features for more classes in each mini-batch. This does not apply to us since we use pretrained features, and sampling more classes in one mini-batch to train the classifier generator can be approximated by sampling multiple mini-batches.
Embedding visualization. Recall that we calculate the probability of an image x belonging to a class given the class attribute by computing the cosine similarity between x and the classifier weight generated from the attribute (Eq. (2)). As the cosine similarity of two vectors equals their dot product after normalization, we can view the normalized classifier weight vector as the prototype of its class. Under this interpretation, the probability of x belonging to a class can be measured by the distance between the normalized feature and the normalized classifier weight vector. Thus, we can visualize the normalized classifier weight vectors and the normalized visual feature vectors to qualitatively evaluate the discriminability of the classifiers.
We plot the t-SNE visualizations of the classifier weights and their overlaps with the visual features of unseen classes in Figure 3. We can see that our class prototypes are more spatially dispersed than those of DEM, which does not consider the inter-class separation information when generating class prototypes. Besides, we observe that when visual features are projected to the attribute space, the corresponding class prototypes are severely clustered. This substantiates the merits of formulating ZSL as a conditional visual classification problem, by which we naturally benefit from the high discriminability of the visual features and the inter-class separation information to obtain discriminative classifiers for both seen and unseen classes. Moreover, the distribution of the class prototypes in the transductive setting is even more dispersed than in the inductive setting, evidencing the effectiveness of our transductive ZSL algorithm in exploiting unlabeled test data to enhance the discriminability of the classifiers for both seen and unseen classes.
When the class prototypes are overlapped with the visual features of unseen classes, we observe that the features of unseen classes lie close to their corresponding class prototypes while staying far from those of seen classes. In contrast, this favorable distribution is not observed in the plots of DEM or of the algorithm that projects visual features into the attribute space. This further substantiates the superiority of our method.
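The joint embedding underlying Figure 3 can be sketched as follows, using scikit-learn's t-SNE. This is an illustrative reconstruction, not the paper's plotting code; the function name and the perplexity value are ours.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_prototypes_and_features(weights, features, perplexity=30):
    """Jointly embed normalized classifier weights (class prototypes) and
    normalized visual features into 2-D with t-SNE, so prototypes and
    features can be overlaid in one scatter plot."""
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    emb = TSNE(n_components=2, init="pca", perplexity=perplexity,
               random_state=0).fit_transform(
        np.vstack([w, f]).astype(np.float32))
    # Split the joint embedding back into prototype and feature parts.
    return emb[:len(w)], emb[len(w):]
```

Plotting the prototype part with distinct markers over the feature part reproduces the kind of overlap visualization discussed above.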
In this paper, we reformulate ZSL as a visual feature classification problem conditioned on the attributes. Under this reformulation, we develop algorithms for various ZSL settings. For the conventional setting, we learn a deep neural network that generates visual feature classifiers directly from the attributes, guided by a cosine-similarity-based cross-entropy loss and an episode-based training scheme. For the generalized setting, we concatenate the classifiers for seen and unseen classes to recognize objects from all classes. For the transductive setting, we develop a novel learning-without-forgetting self-training mechanism that calibrates the classifier generator towards unseen classes while maintaining good performance on seen classes. Experiments on widely used datasets verify the effectiveness of the proposed methods and demonstrate remarkable advantages over the state of the art, especially for unseen classes in the generalized ZSL setting.
6 Further Analysis on the GZSL Performance
The experiments in the main article show that the proposed method achieves significant performance gains over existing methods in the generalized ZSL setting. Here, we offer further explanation of this result.
As explained in the main text, our advantage in GZSL stems from our reformulation of ZSL as a conditional visual classification problem. With this formulation, at test time we generate the classifiers for both seen and unseen classes from the corresponding attributes and combine them (by concatenating the classifier weight matrices) to classify images from all classes. Figure 4 illustrates the process. Since the classifier generation model is trained on seen classes, the classifiers generated for seen classes at test time are highly discriminative in discerning whether an incoming image belongs to a class observed during training. Consequently, the involvement of seen classes during GZSL testing affects our method far less than existing ones, leading to much better recognition results.
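The concatenation step in Figure 4 can be sketched as follows. This is a minimal illustration under the cosine-classification formulation; the function name is ours.

```python
import numpy as np

def gzsl_predict(features, seen_weights, unseen_weights):
    """GZSL inference sketch: stack the classifier weight matrices for seen
    and unseen classes, then classify each test image over the joint label
    space by cosine similarity. Rows of the stacked matrix index classes,
    seen classes first."""
    all_w = np.vstack([seen_weights, unseen_weights])
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = all_w / np.linalg.norm(all_w, axis=1, keepdims=True)
    # Predicted indices < len(seen_weights) correspond to seen classes.
    return (f @ w.T).argmax(axis=1)
```

Because the seen-class block of the stacked matrix comes from the trained generator, it remains highly discriminative and rejects unseen-class images, which is the mechanism the paragraph above describes.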
It is also worth noting that in the GZSL setting, our performance on seen classes is less competitive and often inferior to the state of the art. To understand why, we plot in Figure 5 the performance on the AwA1 dataset as a function of training iteration. In the conventional ZSL setting, the accuracy (ZSL-T1) first increases and then stabilizes as the training loss decreases. In the generalized ZSL setting, the accuracies for unseen classes (GZSL-Unseen) and seen classes (GZSL-Seen) follow quite different trajectories: GZSL-Seen peaks at the very beginning, drops thereafter, and later stabilizes, while GZSL-Unseen first increases and then stabilizes. GZSL-Seen decreases much more slowly than GZSL-Unseen increases, so their harmonic mean GZSL-H follows a trajectory similar to that of GZSL-Unseen.
The plot indicates that our classifier generator quickly acquires the knowledge needed to classify the observed classes, reaching peak seen-class performance early. As it is exposed to various randomly sampled new ZSL tasks, it is then tuned towards categorizing unseen classes. As a side effect of this drift, the classification boundaries for seen classes become vaguer but remain fairly discriminative. The enhanced competence in recognizing unseen classes, combined with a fair retention of the ability to recognize seen classes, leads to our distinguished GZSL performance.
7 Classification Result Visualizations
To facilitate analysis, we visualize the classification results of our method in the conventional ZSL setting. Figure 6 shows the visualizations on the AwA1 dataset: for each class, we show the top-ranked images retrieved by classification score given the semantic description of the class.
From the top-ranked images, we can see that our method reasonably captures the discriminative visual properties of each unseen class based solely on its semantic embedding. We can also see that the misclassified images have appearances so similar to those of the predicted class that even humans cannot easily distinguish between the two. For example, the "bat" images in the first row of Figure 6 look very similar to the "rat" images. Without careful observation, humans can sometimes make mistakes in differentiating them, even though we have seen many images of both classes before. Considering that the attributes of the two classes are very similar and our model has never "seen" any images of either class, such mistakes are understandable.
This research is supported in part by the NSF IIS award 1651902, U.S. Army Research Office Award W911NF-17-1-0367, and NEC labs America.
-  Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
-  Yashas Annadani and Soma Biswas. Preserving semantic relations for zero-shot learning. In CVPR, 2018.
-  Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.
-  Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV, 2017.
-  Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding network. In CVPR, 2018.
-  Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed Elgammal. Link the head to the “beak”: Zero shot learning from noisy text description at part precision. In CVPR, 2017.
-  Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In CVPR, 2009.
-  Rafael Felix, BG Vijay Kumar, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, 2018.
-  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
-  Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In NIPS, 2013.
-  Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, 2015.
-  Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
-  Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
-  Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
-  Kai Li, Zhengming Ding, Kunpeng Li, Yulun Zhang, and Yun Fu. Support neighbor loss for person re-identification. In ACM MM, 2018.
-  Kai Li, Martin Renqiang Min, Bing Bai, Yun Fu, and Hans Peter Graf. On novel object recognition: A unified framework for discriminability and adaptability. In CIKM, 2019.
-  Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. Guided attention inference network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
-  Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Attention bridging network for knowledge transfer. In ICCV, 2019.
-  Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Visual semantic reasoning for image-text matching. In ICCV, 2019.
-  Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, 2018.
-  Chunjie Luo, Jianfeng Zhan, Lei Wang, and Qiang Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. arXiv preprint arXiv:1702.05870, 2017.
-  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
-  Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A Murthy. A generative model for zero shot learning using conditional variational autoencoders. In CVPR Workshops, 2018.
-  Pedro Morgado and Nuno Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, 2017.
-  Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
-  Genevieve Patterson and James Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
-  Marcus Rohrbach, Sandra Ebert, and Bernt Schiele. Transfer learning in a transductive setting. In NIPS, 2013.
-  Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
-  Eli Schwartz, Leonid Karlinsky, Joseph Shtok, Sivan Harary, Mattias Marder, Rogerio Feris, Abhishek Kumar, Raja Giryes, and Alex M Bronstein. Delta-encoder: an effective sample synthesis method for few-shot object recognition. arXiv preprint arXiv:1806.04734, 2018.
-  Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
-  Jie Song, Chengchao Shen, Jie Lei, An-Xiang Zeng, Kairi Ou, Dacheng Tao, and Mingli Song. Selective zero-shot classification with augmented attributes. In ECCV, 2018.
-  Jie Song, Chengchao Shen, Yezhou Yang, Yang Liu, and Mingli Song. Transductive unbiased embedding for zero-shot learning. In CVPR, 2018.
-  Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
-  Vinay Kumar Verma and Piyush Rai. A simple exponential family framework for zero-shot learning. In ECML-PKDD, 2017.
-  Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
-  Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, 2018.
-  Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical report, California Institute of Technology, 2010.
-  Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
-  Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
-  Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
-  Yulun Zhang, Chen Fang, Yilin Wang, Zhaowen Wang, Zhe Lin, Yun Fu, and Jimei Yang. Multimodal style transfer via graph cuts. In ICCV, 2019.
-  Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. In ICLR, 2019.
-  Zhilu Zhang and Mert R Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv preprint arXiv:1805.07836, 2018.
-  Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
-  Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.
-  Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In ICCV, 2019.