Convolutional Prototype Learning for Zero-Shot Recognition

Abstract

Zero-shot learning (ZSL) has received increasing attention in recent years, especially in areas such as fine-grained object recognition, retrieval, and image captioning. The key to ZSL is to transfer knowledge from the seen to the unseen classes via auxiliary class attribute vectors. However, the projection functions popularly learned in previous works cannot generalize well, since they assume distribution consistency between the seen and unseen domains at the sample level. Besides, the provided non-visual and class-unique attributes can significantly degrade recognition performance in semantic space. In this paper, we propose a simple yet effective convolutional prototype learning (CPL) framework for zero-shot recognition. By assuming distribution consistency at the task level, our CPL is capable of transferring knowledge smoothly to recognize unseen samples. Furthermore, inside each task, discriminative visual prototypes are learned via a distance-based training mechanism. Consequently, we can perform recognition in visual space instead of semantic space. An extensive group of experiments is then carefully designed and presented, demonstrating that CPL is more effective than currently available alternatives under various settings.


1 Introduction

Figure 1: The visual images and class prototypes provided for several classes in benchmark dataset AWA2 [40].

In recent years, deep convolutional neural networks have achieved significant progress in object classification. However, most existing recognition models need to collect and annotate a large amount of target class samples for model training, and these operations are obviously expensive and cumbersome. In addition, the number of target classes is very large, and novel categories appear dynamically in nature every day. Moreover, for fine-grained object recognition (e.g., birds), it is hard to collect enough image samples for each category (e.g., skimmer) due to the rarity of the target class. Thus, zero-shot learning (ZSL) [27, 46, 4, 28, 41, 9, 34, 18] has been proposed to address the above issues.

ZSL aims to recognize objects for which no instances have been seen during the training phase. Due to its lower requirement on labeled samples and its wide range of application scenarios, it has recently received a lot of attention and achieved significant advances in computer vision. Because of the lack of labeled samples in the unseen class domain, ZSL requires auxiliary information to learn transferable knowledge from the seen to the unseen class domain. To this end, existing methods usually provide a semantic description extracted from text (e.g., an attribute vector [19, 22, 24] or a word vector [1, 10, 35]) for each category to relate the seen and unseen class domains, as shown in Figure 1. This mimics the human ability to recognize novel objects in the world. For example, given the description that “a wolf looks like a dog, but has a drooping tail”, we can recognize a wolf even without having seen one, as long as we have seen a “dog”. Zero-shot learning is thus feasible by imitating such a learning mechanism.

Figure 2: Several instances with various colors from the unseen class “horse” in benchmark dataset AWA2 [40].

The key to ZSL is how to effectively learn the visual-semantic projection function. Once this projection function is learned from the source object domain, we can transfer the learned knowledge to the target object domain. At the test phase, a test sample is first projected into the embedding space, and recognition is then conducted by computing the similarity between the test sample and the unseen class attributes. To this end, various ZSL methods have recently been proposed to learn an effective projection function between the visual and semantic spaces. However, because there exists a significant distribution gap between the seen and unseen samples in most real scenarios, the generalization ability of such methods is seriously limited. For example, both “zebra” and “pig” have the attribute “tail”, but the visual appearance of their tails is often greatly different. To solve this problem, many ZSL methods adopt transductive learning [11, 14, 29, 47], which introduces visual samples of unseen classes in the training phase. Obviously, transductive learning can mitigate the domain shift; however, it is hard to apply in many real scenarios because obtaining the corresponding samples for all target classes is a great challenge. In this paper, we aim at developing an inductive ZSL method that can reduce the domain distribution mismatch between the seen and the unseen class data, using only the training set.

Figure 3: The predictability of each binary attribute, measured as classification accuracy with a pre-trained Resnet101 [15], as reported in [45].

In addition, both the manually defined attribute vectors and the automatically extracted word vectors in ZSL are obtained independently of the visual samples, and uniquely for each class. As a result, the class descriptors are usually inaccurate and lack diversity. For this reason, the projection learned in previous works may not effectively address the intra-class variation problem. For instance, a “horse” can appear in many colors, as shown in Figure 2. Besides, there exist some non-visual attributes in the provided class descriptors, such as “timid”, “solitary”, and “active” in the benchmark dataset AWA2 [40]. Obviously, these attributes are hard to predict based only on visual information. As reported in [45], their predictability can even drop to the level of random guessing, as shown in Figure 3. Thus, the learned knowledge cannot be transferred effectively to the unseen class domain, although we can achieve significant performance in the seen class domain thanks to supervised learning. In particular, due to different shooting angles, the visual instance of an object generally does not exhibit all the attributes provided by its class descriptor. As a result, a common projection from the visual to the semantic space is usually inaccurate due to the lack of some attributes (e.g., paws and tail are not captured).

Motivated by these two key observations above, in this work, we focus on developing a novel inductive ZSL model. Different from previous works, we assume distribution consistency between the seen and unseen class samples at task-level, instead of sample-level, for better transferring knowledge. Furthermore, considering the non-visual components and uniqueness of provided class descriptors, we choose to learn discriminative visual prototypes, thus performing zero-shot recognition in visual space, instead of semantic space.

We emphasize our contributions in four aspects:

  • A simple yet effective convolutional prototype learning (CPL) framework is proposed for zero-shot recognition task.

  • Our CPL is able to transfer knowledge smoothly with an assumption of distribution consistency at task-level, thus recognizing unseen samples more accurately.

  • To avoid the problems of recognition in semantic space, discriminative visual prototypes are learned via a distance based training mechanism in our CPL.

  • The improvements over currently available ZSL alternatives are especially significant under various ZSL settings.

2 Related work

According to whether information about the test data is involved during model learning, existing ZSL models fall into inductive [2, 4, 21, 33] and transductive settings [11, 12, 20, 29]. Specifically, the transduction in ZSL can be embodied in two progressive degrees: transductive for specific unseen classes [26] and transductive for specific test samples [47]. Although transductive settings can alleviate the domain shift caused by the different distributions of the training and the test samples, it is not realistic to obtain all test samples. Thus, we adopt an inductive setting in this work.

From the view of constructing the visual-semantic interactions, existing inductive ZSL methods fall into four categories. The first group learns a projection function from the visual to the semantic space with a linear [3, 24, 25] or a non-linear model [7, 28, 35, 43]. The test unseen data are then classified by matching the visual representations in the class semantic embedding space with the unseen class prototypes. To capture more distribution information from visual space, recent work focuses on generating pseudo examples for unseen classes with seen class examples [13], web data [29], generative adversarial networks (GAN) [41, 48], etc. Then supervised learning methods can be employed to perform recognition task. The second group chooses the reverse projection direction [2, 38, 44], to alleviate the hubness problem caused by nearest neighbour search in a high dimensional space [32]. The test unseen data are then classified by resorting to the most similar pseudo visual examples in the visual space. The third group is a combination of the first two groups by taking the encoder-decoder paradigm but with the visual feature or class prototype reconstruction constraint [2, 21]. It has been verified that the projection function learned in this way is able to generalize better to the unseen classes. The last group mainly learns an intermediate space, where both the visual and the semantic spaces are projected to [5, 6, 17, 26].

To avoid the problems caused by the non-visual components and the uniqueness of the provided class attributes in ZSL, we choose to perform recognition in visual space. Thus, our CPL can be considered a member of the second group.

3 Methodology

In this section, we first set up the zero-shot recognition problem (Section 3.1), then develop a convolutional prototype learning (CPL) framework for this task (Section 3.2), and finally discuss how to perform recognition on test data (Section 3.3).

3.1 Problem definition

Suppose we have an unseen class set $\mathcal{C}^u$ containing $L$ different classes, for which no labeled samples are available. Each unseen class is provided with an attribute vector, and we denote by $\mathcal{A}^u = \{a^u_l\}_{l=1}^{L}$ the set of all unseen class attribute vectors. Given an unlabeled sample set $\mathcal{X}^u$, the goal of zero-shot learning is to infer the correct class label from the $L$ unseen classes for each sample in $\mathcal{X}^u$. It is impossible to learn an effective classifier from these data alone. To tackle this problem, an additional training set $\mathcal{D}^s = \{\mathcal{X}^s, \mathcal{Y}^s\}$ that covers $S$ seen classes is usually adopted to learn transferable knowledge that helps the classification on $\mathcal{X}^u$. Here $\mathcal{X}^s$ and $\mathcal{Y}^s$ represent the sets of training samples and their corresponding labels, respectively. Let $\mathcal{C}^s$ denote the set of seen classes, and $\mathcal{A}^s$ the corresponding attribute vectors. It is worth noting that, generally, $\mathcal{C}^s \cap \mathcal{C}^u = \emptyset$. More importantly, the test-time search space is $\mathcal{C}^u$ under the standard ZSL setting, while it is $\mathcal{C}^s \cup \mathcal{C}^u$ under the generalized ZSL setting.

Let $f_\theta(\cdot)$ denote a convolutional neural network based embedding module, which learns a feature representation $f_\theta(x)$ for any input image $x$. We use $g_\phi(\cdot)$ as a classifier module that assigns a label to a test image within the $L$ unseen classes, according to the unseen class attribute set $\mathcal{A}^u$. Denote the true class label of $x$ as $y$. Note that these two modules can be integrated into a unified network and trained from scratch in an end-to-end manner. The overall cost function is denoted as $\mathcal{L}$ for simplicity.

3.2 CPL: algorithm

Generic zero-shot learning models usually make a distribution consistency assumption between the training and test sets (i.e., the independent and identically distributed assumption), which guarantees that a model trained on the training set can generalize to the test set. However, because of the existence of unseen classes, the sample distribution of the training set is rather different from that of the test set, so the generalization performance on the test set cannot be well guaranteed. As an illustration, from the left side of Figure 4, we can observe that there may exist a large distribution gap between the training and test sets in the sample space because these two sets are class-disjoint, which weakens knowledge transfer. We cannot even estimate this gap due to the complete lack of training samples from the unseen class set $\mathcal{C}^u$. Fortunately, we can make the above distribution consistency assumption in a task space instead of the sample space (see the right side of Figure 4). The task space is composed of many similar tasks, e.g., zero-shot tasks. From the perspective of task-level distribution, we can construct a large number of zero-shot tasks within the training set $\mathcal{D}^s$ by simulating the target zero-shot task. Consequently, the sample distribution gap can be mitigated.

Figure 4: Converting the distribution consistency assumption in the sample level into the task level. Each geometric shape indicates one sample and each color represents one class.

According to the above task-level distribution consistency assumption, we propose the following episode-based convolutional prototype learning framework. Specifically, let $\{\mathcal{T}_t\}_{t=1}^{T}$ be a set of $T$ zero-shot tasks randomly sampled from the training set $\mathcal{D}^s$. Then our episode-based convolutional prototype learning (CPL) framework can be formulated as follows:

$$\min_{\Theta} \; \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(\mathcal{T}_t; \Theta), \qquad (1)$$

where $\Theta$ denotes the parameter set of our convolutional prototype learning model and $\mathcal{L}(\mathcal{T}_t; \Theta)$ is the loss on task $\mathcal{T}_t$. The core idea here is to simulate the target zero-shot task and conduct a large number of similar zero-shot tasks with the training set. In this way, we can build a task-based space to bridge the gap among different sample distributions. It is worth noting that the classes covered by different sampled tasks should be as disjoint as possible, which facilitates more transferable knowledge for new tasks. At each training step, we generate discriminative visual prototypes inside each sampled task according to the current model.
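As a concrete illustration of this episodic sampling, the following Python sketch shows how one zero-shot task with a chosen number of classes and support samples per class could be drawn from the seen-class training data. It is our own illustration rather than the authors' released code; the function name sample_task and the arrays features/labels are assumed.

import numpy as np

def sample_task(features, labels, num_classes, num_support, rng=None):
    """Draw one zero-shot episode: num_classes seen classes with
    num_support labeled samples each, mimicking the target zero-shot task."""
    rng = np.random.default_rng() if rng is None else rng
    chosen = rng.choice(np.unique(labels), size=num_classes, replace=False)
    feats, labs = [], []
    for new_label, cls in enumerate(chosen):
        idx = np.flatnonzero(labels == cls)          # all samples of this class
        idx = rng.choice(idx, size=num_support, replace=False)
        feats.append(features[idx])
        labs.append(np.full(num_support, new_label))  # relabel within the episode
    return np.concatenate(feats), np.concatenate(labs), chosen

Sampling many such episodes, each with a different subset of seen classes, is what realizes the task-level consistency assumption in practice.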

Figure 5: The illustration of our proposed CPL framework for zero-shot recognition.

More specifically, for discriminative recognition, the minimization of $\mathcal{L}$ is encouraged to achieve two goals: i) minimizing the classification error of seen class samples via visual prototypes (CEP); and ii) minimizing the encoding cost of seen class samples via visual prototypes (PEC). To this end, as shown in Figure 5, we consider a decomposition of the objective function in Eq. (1) into two functions, corresponding to the two aforementioned objectives, as

$$\mathcal{L}(\mathcal{T}_t; \Theta) = \mathcal{L}_{\mathrm{CEP}}(\mathcal{T}_t; \Theta) + \lambda \, \mathcal{L}_{\mathrm{PEC}}(\mathcal{T}_t; \Theta), \qquad (2)$$

where $\lambda$ balances the effects of the two functions.

The detailed definitions of $\mathcal{L}_{\mathrm{CEP}}$ and $\mathcal{L}_{\mathrm{PEC}}$ are presented in Section 3.2.1 and Section 3.2.2, respectively.

3.2.1 Classification error via prototypes (CEP)

Suppose there exist $C$ classes and $K$ support samples per class in the $t$-th sampled task $\mathcal{T}_t$. Then the sets of class attributes and classes in this task are denoted as $\mathcal{A}_t$ and $\mathcal{C}_t$, respectively. In our CPL model, we need to learn $C$ visual prototypes, denoted as $\{p_k\}_{k=1}^{C}$, to represent these classes in visual space. Once we obtain the prototypes, the probability of a sample $x$ in $\mathcal{T}_t$ belonging to the $k$-th prototype has the following relationship with the similarity between $f_\theta(x)$ and $p_k$:

(3)

where $p_k = g_\phi(a_k)$ and $a_k$ is the attribute vector of the $k$-th class in $\mathcal{T}_t$. That is, we learn visual prototypes from class attributes via a non-linear function $g_\phi(\cdot)$, as illustrated in Figure 5. This design is inspired by multi-view learning [42]. Then, the above probability can be further normalized as

$$p(x \in p_k) = \frac{\exp\!\left(-\| f_\theta(x) - p_k \|_2^2 / \tau\right)}{\sum_{j=1}^{C} \exp\!\left(-\| f_\theta(x) - p_j \|_2^2 / \tau\right)}, \qquad (4)$$

where $\tau$ is the temperature used to mitigate overfitting [16], and $\tau = 1$ is a common option. In particular, $\tau > 1$ “softens” the softmax (raises the output entropy). As $\tau \to \infty$, $p(x \in p_k) \to 1/C$, which leads to maximum uncertainty. As $\tau \to 0^{+}$, the probability collapses to a point mass (i.e., the nearest prototype receives probability 1). Since $\tau$ does not change the maximizer of the softmax function, the class prediction remains unchanged if $\tau$ is applied after convergence. Plugging the probability in Eq. (4) into the cross-entropy loss over the training samples yields

$$\mathcal{L}_{\mathrm{CEP}}(\mathcal{T}_t; \Theta) = -\frac{1}{|\mathcal{T}_t|} \sum_{(x, y) \in \mathcal{T}_t} \sum_{k=1}^{C} \mathbb{1}(y = k) \log p(x \in p_k), \qquad (5)$$

where $\mathbb{1}(y = k)$ equals 1 if $y = k$ and 0 otherwise.

Consequently, we can guarantee the classification accuracy of all the samples effectively by learning discriminative prototypes.
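For clarity, a minimal NumPy sketch of the CEP term in Eqs. (3)-(5) is given below. The names cep_loss, z, and protos are our placeholders for the embedded samples and the prototypes generated from class attributes; this is an illustrative sketch, not the authors' implementation.

import numpy as np

def cep_loss(z, protos, y, tau=1.0):
    """CEP term: distance-based softmax over prototypes (Eq. 4)
    fed into a cross-entropy loss (Eq. 5).
    z: (N, D) image embeddings, protos: (C, D) visual prototypes,
    y: (N,) integer labels in [0, C), tau: softmax temperature."""
    d2 = ((z[:, None, :] - protos[None, :, :]) ** 2).sum(-1)   # (N, C) squared distances
    logits = -d2 / tau                                         # similarity, as in Eq. (3)
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()              # Eq. (5)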

3.2.2 Prototype encoding cost (PEC)

Furthermore, to make prototype learning generalize more smoothly to new tasks, we should improve the representativeness of the learned prototypes in visual space. To this end, we additionally minimize the prototype encoding cost as follows:

$$\mathcal{L}_{\mathrm{PEC}}(\mathcal{T}_t; \Theta) = \frac{1}{|\mathcal{T}_t|} \sum_{(x, y) \in \mathcal{T}_t} \| f_\theta(x) - p_y \|_2^2, \qquad (6)$$

where $p_y$ is the prototype of the $y$-th class, i.e., the prototype corresponding to the label of sample $x$.
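Continuing the same sketch (and reusing cep_loss from above, with NumPy arrays as inputs), the PEC term of Eq. (6) and the combined episode objective of Eq. (2) could be written as follows; lam stands for the balancing weight in Eq. (2). This is again an illustration under our assumed notation.

def pec_loss(z, protos, y):
    """PEC term (Eq. 6): squared distance between each embedded sample
    and the prototype of its own class."""
    return ((z - protos[y]) ** 2).sum(-1).mean()

def cpl_loss(z, protos, y, lam=1.0, tau=1.0):
    """Combined episode loss of Eq. (2): CEP + lambda * PEC."""
    return cep_loss(z, protos, y, tau) + lam * pec_loss(z, protos, y)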

3.3 CPL: recognition

During the test phase, we have a zero-shot task with the unseen class set $\mathcal{C}^u$. To recognize the label of any sample $x$ in $\mathcal{X}^u$, we first obtain the visual prototypes of all the classes covered by $\mathcal{C}^u$ via the function $g_\phi(\cdot)$ learned during the training phase. We then compare the sample with all prototypes and classify it to the nearest one. As a result, the label of sample $x$, denoted $\hat{y}$, is predicted as

$$\hat{y} = \arg\min_{l \in \{1, \dots, L\}} \| f_\theta(x) - p_l \|_2^2, \qquad (7)$$

where $p_l = g_\phi(a^u_l)$ is the visual prototype of the $l$-th unseen class.
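A minimal sketch of this nearest-prototype rule (Eq. (7)) follows; test_embeddings and unseen_prototypes are assumed to be the outputs of the learned image embedding and attribute embedding modules, respectively.

import numpy as np

def predict(test_embeddings, unseen_prototypes):
    """Nearest-prototype recognition of Eq. (7).
    test_embeddings: (N, D) embedded test images,
    unseen_prototypes: (L, D) prototypes produced from unseen class attributes."""
    d2 = ((test_embeddings[:, None, :] - unseen_prototypes[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # index of the nearest prototype per sample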

4 Experimental results and analysis

In this section, we first detail our experimental protocol, and then present the experimental results by comparing our CPL with state-of-the-art methods for zero-shot recognition on four benchmark datasets under various settings.

Dataset | # attributes | # classes (total / training / validation / test) | # images (total / training / test)
SUN | 102 | 717 / 580 / 65 / 72 | 14340 / 10320 / 2580+1440
AWA2 | 85 | 50 / 27 / 13 / 10 | 37322 / 23527 / 5882+7913
CUB | 312 | 200 / 100 / 50 / 50 | 11788 / 7057 / 1764+2967
aPY | 64 | 32 / 15 / 5 / 12 | 15339 / 5932 / 1483+7924
Table 1: Statistics of the four datasets. Note that the test images include images from both the seen and unseen class domains (listed as seen + unseen).

4.1 Evaluation setup and metrics

Datasets. Among the most widely used datasets for ZSL, we select four attribute datasets. Two of them are coarse-grained, one small (aPascal & Yahoo (aPY) [8]) and one medium-scale (Animals with Attributes (AWA2) [40]). The other two datasets (SUN Attribute (SUN) [31] and CUB-200-2011 Birds (CUB) [37]) are both fine-grained and medium-scale. The statistics of all datasets are given in Table 1.

Protocols. We adopt the rigorous protocol1 proposed in [40], ensuring that none of the unseen classes appear in ILSVRC 2012 1K, since ILSVRC 2012 1K is used to pre-train the Resnet model; otherwise the zero-shot rule would be violated. In particular, this protocol involves two settings: standard ZSL and generalized ZSL. The latter has emerged recently; under it, the test set contains data samples from both seen and unseen classes, which is clearly more reflective of real-world application scenarios. By contrast, the test set in standard ZSL only contains data samples from the unseen classes.

Evaluation metrics. During the test phase, we are interested in achieving high performance on both densely and sparsely populated classes. Thus, we use the unified evaluation protocol proposed in [40], where the average accuracy is computed independently for each class. Specifically, under the standard ZSL setting, we measure the average per-class top-1 accuracy by

$$\mathrm{acc} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{\#\,\text{correct predictions in class } c}{\#\,\text{samples in class } c}.$$

Under the generalized ZSL setting, we compute the harmonic mean ($H$) of $\mathrm{acc}_s$ and $\mathrm{acc}_u$ to favor high accuracy on both seen and unseen classes:

$$H = \frac{2 \times \mathrm{acc}_s \times \mathrm{acc}_u}{\mathrm{acc}_s + \mathrm{acc}_u},$$

where $\mathrm{acc}_s$ and $\mathrm{acc}_u$ are the per-class accuracies of recognizing the test samples from the seen and the unseen classes, respectively.
Implementation details. By tuning on the validation set, we set the number of classes $C$ in each zero-shot task to be the same as the number of unseen classes, and the number of support samples $K$ per class is also chosen on the validation set. The parameters $\lambda$ and $\tau$ are likewise selected on the validation set. For each dataset, our CPL is trained for 40 epochs, with different weight decay values for the two fine-grained and the two coarse-grained datasets. Meanwhile, the learning rate of the Adam optimizer is also selected on the validation set. Specifically, i) we adopt Resnet101 as the image embedding module, pre-trained on ILSVRC 2012 1K without fine-tuning; the input to Resnet101 is a color image that is first normalized with per-channel means and standard deviations. ii) We utilize an MLP network as the attribute embedding module. The hidden layer size (as in Fig. 5) is set to 1200 for CUB and 1024 for the other three datasets, and the output layer has the same size (2048) as the image embedding module for all datasets. In addition, we add weight decay ($\ell_2$ regularisation) and a ReLU non-linearity to both the hidden and output layers.
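As a rough PyTorch-style sketch of the attribute embedding module described above (one hidden layer, ReLU on both hidden and output layers, 2048-d output to match the Resnet101 features), the following might serve; the class name and defaults are our own assumptions rather than the authors' released code.

import torch.nn as nn

class AttributeEmbedding(nn.Module):
    """Maps a class attribute vector to a 2048-d visual prototype.
    Hidden size: 1200 for CUB, 1024 for the other three datasets."""
    def __init__(self, attr_dim, hidden_dim=1024, out_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim, hidden_dim),
            nn.ReLU(),                      # ReLU on the hidden layer
            nn.Linear(hidden_dim, out_dim),
            nn.ReLU(),                      # the paper also applies ReLU to the output layer
        )

    def forward(self, attributes):
        # weight decay (l2 regularisation) would be set in the optimizer, e.g. Adam(..., weight_decay=...)
        return self.net(attributes)

In this sketch the module plays the role of the non-linear mapping from class attributes to visual prototypes, producing one prototype per attribute vector.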

Compared methods. We choose to compare with a wide range of competitive and representative inductive ZSL approaches, especially those that have achieved the state-of-the-art results recently. In particular, such compared approaches involve both shallow and deep models.

Method SUN AWA2 CUB aPY

DAP [23] 39.9 46.1 40.0 33.8
IAP [23] 19.4 35.9 24.0 36.6
CONSE [30] 38.8 44.5 34.3 26.9
CMT [35] 39.9 37.9 34.6 28.0
SSE [46] 51.5 61.0 43.9 34.0
LATEM [39] 55.3 55.8 49.3 35.2
DEVISE [10] 56.5 59.7 52.0 39.8
SJE [1] 53.7 61.9 53.9 32.9
SYNC [4] 56.3 46.6 55.6 23.9
SAE [21] 40.3 54.1 33.3 8.3
DEM [44] 61.9 67.1 51.7 35.0
PSR [2] 61.4 63.8 56.0 38.4
DLFZRL [36] 59.3 63.7 57.8 44.5
CPL 62.2 72.7 56.4 45.3
Table 2: Comparative results of standard zero-shot learning on four datasets.
SUN | AWA2 | CUB | aPY
Method | u s H | u s H | u s H | u s H
DAP [23] 4.2 25.1 7.2 0.0 84.7 0.0 1.7 67.9 3.3 4.8 78.3 9.0
IAP [23] 1.0 37.8 1.8 0.9 87.6 1.8 0.2 72.8 0.4 5.7 65.6 10.4
CONSE [30] 6.8 39.9 11.6 0.5 90.6 1.0 1.6 72.2 3.1 0.0 91.2 0.0
CMT [35] 8.1 21.8 11.8 0.5 90.0 1.0 7.2 49.8 12.6 1.4 85.2 2.8
SSE [46] 2.1 36.4 4.0 8.1 82.5 14.8 8.5 46.9 14.4 0.2 78.9 0.4
LATEM [39] 14.7 28.8 19.5 11.5 77.3 20.0 15.2 57.3 24.0 0.1 73.0 0.2
DEVISE [10] 16.9 27.4 20.9 17.1 74.7 27.8 23.8 53.0 32.8 4.9 76.9 9.2
SJE [1] 14.7 30.5 19.8 8.0 73.9 14.4 23.5 59.2 33.6 3.7 55.7 6.9
SYNC [4] 7.9 43.3 13.4 10.0 90.5 18.0 11.5 70.9 19.8 7.4 66.3 13.3
SAE [21] 8.8 18.0 11.8 1.1 82.2 2.2 7.8 54.0 13.6 0.4 80.9 0.9
DEM [44] 20.5 34.3 25.6 30.5 86.4 45.1 19.6 57.9 29.2 11.1 75.1 19.4
PSR [2] 20.8 37.2 26.7 20.7 73.8 32.3 24.6 54.3 33.9 13.5 51.4 21.4
DLFZRL [36] - - 24.6 - - 45.1 - - 37.1 - - 31.0
CPL 21.9 32.4 26.1 51.0 83.1 63.2 28.0 58.6 37.9 19.6 73.2 30.9
Table 3: Comparative results of generalized zero-shot learning on four datasets. For each dataset, u and s denote the per-class accuracy on the unseen and seen test classes, respectively, and H is their harmonic mean.

4.2 Standard ZSL

We first compare our CPL method with several state-of-the-art ZSL approaches under the standard setting. The comparative results on the four datasets are shown in Table 2. It can be observed that: i) Our model consistently performs best on all the datasets except CUB, validating that our episode-based convolutional prototype learning framework is indeed effective for zero-shot recognition tasks. ii) For three datasets (SUN, AWA2, and aPY), the improvements obtained by our model over the strongest competitor range from 0.3% to 5.6%. This effectively sets new baselines in the area of ZSL, given that most of the compared approaches rely on far more complicated image generation networks and some of them even combine two or more feature/semantic spaces. iii) In particular, for the two coarse-grained datasets (aPY and AWA2), our CPL achieves significant improvements of 0.5% and 5.6% over the strongest competitors, showing its great advantage in coarse-grained object recognition problems. iv) Although DLFZRL [36] performs better than our CPL by 1.4% on CUB, our CPL still holds the advantage in most cases.

We further discuss the reasons why our CPL outperforms existing methods on zero-shot recognition tasks. First, compared with DEM [44], which uses a non-linear neural network to model the relationship between the visual and the semantic space, our model also possesses such a non-linear property. Besides, our framework improves the discriminability and representativeness of the learned prototypes at the task level by minimizing the classification error of seen class samples and the encoding cost of prototypes, respectively. Thus, recognition performance can be improved dramatically. Second, to achieve better performance on novel classes, PSR [2] proposes to preserve semantic relations in the semantic space when learning the embedding function from the semantic to the visual space. However, it is hard to select an appropriate threshold to define semantically similar and dissimilar classes in practice. By contrast, our framework does not need to predefine these relations. Third, a recent work, DLFZRL [36], aims to learn discriminative and generalizable representations from image features and thereby improves the performance of existing methods. In particular, it utilizes DEVISE [10] as the embedding function and obtains very competitive results on the four benchmark datasets. Unlike DLFZRL [36], our CPL has different motivations and endows the learned representations with a novel concept (i.e., visual prototypes).

4.3 Generalized ZSL

In real applications, whether a sample is from a seen or unseen class is unknown in advance. Hence, generalized ZSL is a more practical and challenging task than standard ZSL. Here, we further evaluate the proposed model under the generalized ZSL setting. The compared approaches are consistent with those in standard ZSL. Table 3 reports the comparative results, which are much lower than those in standard ZSL. This is not surprising, since the seen classes are included in the search space and act as distractors for the samples that come from unseen classes. Additionally, it can be observed that our method generally improves the overall performance (i.e., the harmonic mean H) over the strongest competitor by an obvious margin (0.8%-18.1%). Such a promising performance boost mainly comes from the improvement of the mean class accuracy on the unseen classes, without much performance degradation on the seen classes. In particular, our method obtains the best performance for recognizing unseen samples on all four benchmarks, with improvements ranging from 1.1% to 20.5%. These compelling results also verify that our method can significantly alleviate the strong bias towards seen classes. This is mainly due to the fact that: i) Our CPL significantly improves the representativeness and discriminability of the learned prototypes by means of the corresponding objective functions. ii) Unlike most existing inductive approaches (e.g., DEM [44]) that assume distribution consistency between seen and unseen classes at the sample level, our CPL makes this assumption at the task level. Consequently, our CPL can learn more transferable knowledge that generalizes well to the unseen class domain.

4.4 Ablation study

Effectiveness of distribution consistency at task-level. First, we evaluate the assumption of distribution consistency at the task level. In this experiment, CPL denotes the assumption of distribution consistency at the task level, which randomly chooses $C$ classes and $K$ support samples per class for each zero-shot task; here we set $C$ to be the same as the number of unseen classes. Let CPL-S denote the assumption of distribution consistency at the sample level, which randomly selects a batch of samples for each iteration. We then conduct the corresponding experiments on the four datasets and report the comparison results in Figure 6. It can be concluded that: i) Our CPL performs better than CPL-S, especially on the two coarse-grained datasets. Concretely, CPL obtains a 2.5% improvement over CPL-S on the AWA2 dataset and a 2.6% improvement on the aPY dataset, while CPL and CPL-S have basically the same performance on the two fine-grained datasets. That is to say, our CPL based on the task-level assumption can generalize well to novel classes and effectively mitigate the significant distribution gap at the sample level. ii) Significant differences between classes are more beneficial for CPL to learn transferable knowledge.

Figure 6: The comparison results of task-level and sample-level.
Value of $C$ 3 6 9 12 15 20
CPL 42.3 42.6 42.9 45.3 42.1 42.7
Table 4: Influence of the parameter $C$ on the ZSL task using aPY.

Effectiveness of $C$. Second, we discuss the influence of $C$ in our CPL framework. Under the standard ZSL setting, we set different values of $C$ on the aPY dataset, keeping the other settings identical across experiments. This means that we build a corresponding set of zero-shot tasks for each experiment, where each zero-shot task consists of $C$ seen classes. The results are reported in Table 4. It can be seen that setting $C$ to the number of unseen classes outperforms the other setups, with a 2.6% improvement. Such comparative results further verify that our assumption of distribution consistency at the task level is more suitable for the ZSL problem. In addition, the learned knowledge can be well transferred from the seen to the unseen classes.

Method SUN AWA2 CUB aPY
CPL w/o $\mathcal{L}_{\mathrm{CEP}}$ 59.7 69.9 53.6 41.8
CPL w/o $\mathcal{L}_{\mathrm{PEC}}$ 60.1 70.0 48.3 39.2
CPL 62.2 72.7 56.4 45.3
Table 5: Effectiveness of the CEP and PEC functions.

Effectiveness of CEP and PEC functions. Finally, we perform ablation experiments on the loss functions under the standard ZSL setting, defining two baselines to prove the importance of each objective. For the first baseline, we drop the distance-based cross-entropy term $\mathcal{L}_{\mathrm{CEP}}$ to evaluate its contribution; the remaining $\mathcal{L}_{\mathrm{PEC}}$ term only improves the representativeness of the learned prototypes in visual space. For the second baseline, only the $\mathcal{L}_{\mathrm{CEP}}$ loss is used to learn the embedding model, which helps us better understand the importance of the objective $\mathcal{L}_{\mathrm{PEC}}$; this baseline improves the classification accuracy of the samples by learning discriminative prototypes. The comparative results are shown in Table 5. It can be seen that: i) The improvements obtained by introducing the $\mathcal{L}_{\mathrm{CEP}}$ loss range from 2.5% to 3.5%. This indicates that the objective $\mathcal{L}_{\mathrm{CEP}}$, which aims to improve the discriminability of the learned prototypes, is beneficial for solving zero-shot recognition tasks. ii) The improvements range from 2.1% to 8.1% when including the $\mathcal{L}_{\mathrm{PEC}}$ loss. This shows that improving the representativeness of the learned prototypes makes our model generalize better to novel classes. iii) By improving both the discriminability and representativeness of the learned prototypes, we achieve remarkable performance on zero-shot recognition tasks.

5 Conclusions and future work

In this paper, we propose a convolutional prototype learning (CPL) framework that performs zero-shot recognition in visual space and thus avoids a series of problems caused by the provided class attributes. Meanwhile, the generalization ability of our CPL is significantly improved by assuming distribution consistency between the seen and unseen domains at the task level, instead of the popularly used sample level. We have carried out extensive ZSL experiments on four benchmarks, and the results demonstrate the clear superiority of the proposed CPL over state-of-the-art ZSL approaches. It is also worth noting that the number of prototypes per class is fixed in our CPL; in essence, a single prototype is generally insufficient to represent a class and differentiate it from other classes. Thus, our ongoing research includes learning prototypes adaptively according to the data distribution.

Footnotes

  1. http://www.mpi-inf.mpg.de/zsl-benchmark

References

  1. Z. Akata, S. Reed, D. Walter, H. Lee and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2927–2936. Cited by: §1, Table 2, Table 3.
  2. Y. Annadani and S. Biswas (2018) Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7603–7612. Cited by: §2, §2, §4.2, Table 2, Table 3.
  3. A. Bansal, K. Sikka, G. Sharma, R. Chellappa and A. Divakaran (2018) Zero-shot object detection. In European Conference Computer Vision, pp. 397–414. Cited by: §2.
  4. S. Changpinyo, W. Chao, B. Gong and F. Sha (2016) Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5327–5336. Cited by: §1, §2, Table 2, Table 3.
  5. S. Changpinyo, W. Chao, B. Gong and F. Sha (2018) Classifier and exemplar synthesis for zero-shot learning. arXiv preprint arXiv:1812.06423. Cited by: §2.
  6. S. Changpinyo, W. Chao and F. Sha (2017) Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3476–3485. Cited by: §2.
  7. L. Chen, H. Zhang, J. Xiao, W. Liu and S. Chang (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1043–1052. Cited by: §2.
  8. A. Farhadi, I. Endres, D. Hoiem and D. Forsyth (2009) Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1778–1785. Cited by: §4.1.
  9. R. Felix, V. B. Kumar, I. Reid and G. Carneiro (2018) Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision, pp. 21–37. Cited by: §1.
  10. A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §1, §4.2, Table 2, Table 3.
  11. Y. Fu, T. M. Hospedales, T. Xiang and S. Gong (2015) Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence 37 (11), pp. 2332–2345. Cited by: §1, §2.
  12. Z. Fu, T. Xiang, E. Kodirov and S. Gong (2018) Zero-shot learning on semantic class prototype graph. IEEE transactions on pattern analysis and machine intelligence 40 (8), pp. 2009–2022. Cited by: §2.
  13. Y. Guo, G. Ding, J. Han and Y. Gao (2017) Zero-shot learning with transferred samples. IEEE Transactions on Image Processing 26 (7), pp. 3277–3290. Cited by: §2.
  14. Y. Guo, G. Ding, X. Jin and J. Wang (2016) Transductive zero-shot recognition via shared model space learning. In Thirtieth AAAI Conference on Artificial Intelligence, Vol. 3, pp. 8. Cited by: §1.
  15. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Figure 3.
  16. G. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.2.1.
  17. Y. Hubert Tsai, L. Huang and R. Salakhutdinov (2017) Learning robust visual-semantic embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3571–3580. Cited by: §2.
  18. M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang and E. P. Xing (2019) Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11487–11496. Cited by: §1.
  19. P. Kankuekul, A. Kawewong, S. Tangruamsub and O. Hasegawa (2012) Online incremental attribute-based zero-shot learning. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3657–3664. Cited by: §1.
  20. E. Kodirov, T. Xiang, Z. Fu and S. Gong (2015) Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2452–2460. Cited by: §2.
  21. E. Kodirov, T. Xiang and S. Gong (2017) Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174–3183. Cited by: §2, §2, Table 2, Table 3.
  22. C. H. Lampert, H. Nickisch and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. Cited by: §1.
  23. C. H. Lampert, H. Nickisch and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. Cited by: Table 2, Table 3.
  24. C. Lampert, H. Nickisch and S. Harmeling (2014) Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (3), pp. 453–465. Cited by: §1, §2.
  25. Y. Li, J. Zhang, J. Zhang and K. Huang (2018) Discriminative learning of latent features for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7463–7471. Cited by: §2.
  26. S. Liu, M. Long, J. Wang and M. I. Jordan (2018) Generalized zero-shot learning with deep calibration network. In Advances in Neural Information Processing Systems, pp. 2006–2016. Cited by: §2, §2.
  27. T. Mensink, E. Gavves and C. G. Snoek (2014) Costa: co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2441–2448. Cited by: §1.
  28. P. Morgado and N. Vasconcelos (2017) Semantically consistent regularization for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6060–6069. Cited by: §1, §2.
  29. L. Niu, A. Veeraraghavan and A. Sabharwal (2018) Webly supervised learning meets zero-shot learning: a hybrid approach for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7171–7180. Cited by: §1, §2, §2.
  30. M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado and J. Dean (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650. Cited by: Table 2, Table 3.
  31. G. Patterson and J. Hays (2012) Sun attribute database: discovering, annotating, and recognizing scene attributes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2751–2758. Cited by: §4.1.
  32. M. Radovanović, A. Nanopoulos and M. Ivanović (2010) Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (Sep), pp. 2487–2531. Cited by: §2.
  33. B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pp. 2152–2161. Cited by: §2.
  34. E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247–8255. Cited by: §1.
  35. R. Socher, M. Ganjoo, C. D. Manning and A. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935–943. Cited by: §1, §2, Table 2, Table 3.
  36. B. Tong, C. Wang, M. Klinkigt, Y. Kobayashi and Y. Nonaka (2019-06) Hierarchical disentanglement of discriminative latent features for zero-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, §4.2, Table 2, Table 3.
  37. C. Wah, S. Branson, P. Welinder, P. Perona and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1.
  38. X. Wang, Y. Ye and A. Gupta (2018) Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6857–6866. Cited by: §2.
  39. Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein and B. Schiele (2016-06) Latent embeddings for zero-shot classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 2, Table 3.
  40. Y. Xian, C. H. Lampert, B. Schiele and Z. Akata (2018) Zero-shot learning a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence. Cited by: Figure 1, Figure 2, §1, §4.1, §4.1, §4.1.
  41. Y. Xian, T. Lorenz, B. Schiele and Z. Akata (2018) Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5542–5551. Cited by: §1, §2.
  42. C. Xu, D. Tao and C. Xu (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634. Cited by: §3.2.1.
  43. Y. Yu, Z. Ji, Y. Fu, J. Guo, Y. Pang and Z. (. Zhang (2018) Stacked semantic-guided attention model for fine-grained zero-shot learning. In Advances in Neural Information Processing Systems, pp. 5998–6007. Cited by: §2.
  44. L. Zhang, T. Xiang and S. Gong (2017) Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030. Cited by: §2, §4.2, §4.3, Table 2, Table 3.
  45. X. Zhang, S. Gui, Z. Zhu, Y. Zhao and J. Liu (2019) Hierarchical prototype learning for zero-shot recognition. External Links: arXiv:1910.11671 Cited by: Figure 3, §1.
  46. Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE international conference on computer vision, pp. 4166–4174. Cited by: §1, Table 2, Table 3.
  47. A. Zhao, M. Ding, J. Guan, Z. Lu, T. Xiang and J. Wen (2018) Domain-invariant projection learning for zero-shot recognition. In Advances in Neural Information Processing Systems, pp. 1025–1036. Cited by: §1, §2.
  48. Y. Zhu, M. Elhoseiny, B. Liu, X. Peng and A. Elgammal (2018) A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.