Zero-Shot Learning via Latent Space Encoding

Yunlong Yu,  Zhong Ji,  Jichang Guo, and Zhongfei (Mark) Zhang
Abstract

Zero-Shot Learning (ZSL) is typically achieved by resorting to a class semantic embedding space to transfer the knowledge from the seen classes to unseen ones. Capturing the common semantic characteristics between the visual modality and the class semantic modality (e.g., attributes or word vectors) is a key to the success of ZSL. In this paper, we present a novel approach called Latent Space Encoding (LSE) for ZSL based on an encoder-decoder framework, which learns a highly effective latent space from which both the visual space and the semantic embedding space can be well reconstructed. For each modality, the encoder-decoder framework jointly maximizes the recoverability of the original space from the latent space and the predictability of the latent space from the original space, thus making the latent space feature-aware. To relate the visual and class semantic modalities, their features referring to the same concept are enforced to share the same latent codings. In this way, the semantic relations of different modalities are generalized with the latent representations. We also show that the proposed encoder-decoder framework is easily extended to more modalities. Extensive experimental results on four benchmark datasets (AwA, CUB, aPY, and ImageNet) clearly demonstrate the superiority of the proposed approach on several ZSL tasks, including traditional ZSL, generalized ZSL, and zero-shot retrieval (ZSR).

I Introduction

Although the success of Convolutional Neural Networks (CNNs) [5, 6, 7] has greatly enhanced the performance of object classification, many existing models are based on supervised learning and require labour-intensive work to collect a large number of annotated instances for each involved class. Besides, the models have to be retrained whenever new classes are added to the classification system, which brings a huge computational burden. These issues severely limit the scalability of conventional classification models.

It is thus appealing to introduce models for Zero-Shot Learning (ZSL) [1, 2, 3, 4, 8, 9, 11], which enable a classification system to classify instances from unseen categories for which no data are available for training. ZSL is typically achieved by transferring the knowledge from abundantly labeled seen classes to unlabeled unseen classes via a class semantic embedding space, where the names of both the seen and unseen classes are embedded as vectors called class prototypes. Such a space can be a human-defined attribute space [12, 13, 14] spanned by a pre-defined attribute ontology, or a word vector space learned from a large text corpus with unsupervised language processing techniques [15, 16]. In this way, the semantic relationships between the seen and unseen classes can be directly measured with the distances between the class prototypes in the class semantic embedding space.

In general, the performance of ZSL relies on the following three aspects: i) the representations of the visual instances; ii) the semantic representations of both the seen and unseen classes; and iii) the interactions between the visual instances and the class prototypes. On the one hand, the representations of visual instances are obtained with off-the-shelf Convolutional Neural Networks (CNNs), such as VGG [5], GoogleNet [6], and ResNet [7]. On the other hand, the class semantic embeddings are as important as the visual representations, and the existing class prototypes are also collected in advance. In this way, the instance visual representations and the class semantic representations are obtained independently. With the availability of the visual features and semantic class prototypes, the existing ZSL approaches mainly focus on learning a generalized interaction model that connects the visual space and the class semantic embedding space using the labeled seen classes only. ZSL is then achieved by resorting to the semantic distances between the unseen instances and the unseen class prototypes under the learned interaction model.

The approaches for constructing the interactions between the visual space and the class semantic embedding space can be divided into two categories: i) label-embedding approaches, and ii) visual instance generative approaches. Specifically, the label-embedding approaches [13, 21] focus on learning a general function that projects the visual representations to the class semantic space with the labeled seen instances. The testing unseen instances are then classified by matching their representations in the class semantic embedding space against the unseen class prototypes. On the other hand, the visual instance generative approaches [17, 22] learn an inverse projection function to generate pseudo visual instances from the class semantic representations. In this way, the testing visual instances are classified by resorting to the most similar pseudo visual instances of the unseen classes in the visual space. Experimental results show that the generative approaches perform better than the label-embedding approaches since the latter are prone to the hubness issue.
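To make the two directions concrete, the following is a small illustrative sketch (not the paper's method): both mapping directions are implemented as simple ridge regressions on random placeholder data, with the first matching in the semantic space and the second matching pseudo visual exemplars in the visual space. All names and shapes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_sem, n_train = 1024, 85, 500          # visual dim, attribute dim, #seen instances
X = rng.normal(size=(n_train, d_vis))          # visual features of seen-class instances
S = rng.normal(size=(n_train, d_sem))          # class semantic vectors of those instances
proto_unseen = rng.normal(size=(10, d_sem))    # unseen class prototypes
x_test = rng.normal(size=(d_vis,))             # one test instance

def ridge(A, B, lam=1.0):
    """Solve min_W ||A W - B||^2 + lam ||W||^2 in closed form."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)

# (i) label-embedding direction: visual -> semantic, match in the semantic space
W_fwd = ridge(X, S)                            # d_vis x d_sem
pred_sem = x_test @ W_fwd
label_fwd = int(np.argmin(np.linalg.norm(proto_unseen - pred_sem, axis=1)))

# (ii) reverse (generative) direction: semantic -> visual, match in the visual space
W_rev = ridge(S, X)                            # d_sem x d_vis
pseudo_vis = proto_unseen @ W_rev              # one pseudo visual exemplar per unseen class
label_rev = int(np.argmin(np.linalg.norm(pseudo_vis - x_test, axis=1)))
```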

The existing approaches mostly require learning an explicit projection function to relate different modalities. However, since the optimal projection between two different spaces can be complicated and even indescribable, assuming an explicit encoding function may fail to model it well. Besides, each modality has its own distinctive characteristics despite the common semantic information shared across modalities, which easily spoils an explicit encoding. In this work, we assume that intrinsic class semantic patterns hide in the various modalities, and that each modality can be well reconstructed from these intrinsic class semantic patterns.

To mathematically formulate this assumption, we learn the model in an implicit way. Instead of learning a projection function to connect different modalities directly, we encode each input modality into a latent space to learn the intrinsic class semantic patterns, and enforce the related modalities to share the same latent space. In this way, the relations between different modalities are established through the shared latent space. Specifically, we formulate each modality as an encoder-decoder framework, in which the features are encoded as vectors embedded in a latent space from which the original features can be well reconstructed. It should be noted that the encoder and decoder in our framework are symmetric so that they can be modeled by the same set of parameters. More specifically, the input matrix is implicitly decomposed into a code matrix and an encoding matrix. This encoding process assumes that the vectors of the code matrix are uncorrelated. Compared with explicit encoding, the implicit encoding reduces the risk of using an inappropriate predefined encoding function and is thus likely to yield a better encoding result. In order to learn an efficient latent space, the encoder-decoder framework jointly maximizes the recoverability of the original space from the latent space and the predictability of the latent space from the original space, thus making the latent space feature-aware. The features from different modalities that describe the same instance are forced to share the same latent representation. Such a constraint effectively connects the features from different modalities via the common latent space. Besides, the proposed model can easily integrate additional modalities into the framework, making use of their complementary information to further improve the performance when needed.

In summary, our main contributions are threefold:

  1. We introduce an encoder-decoder framework to exploit the intrinsic co-occurrence semantic patterns of different modalities. In this way, a better latent space for mitigating the distribution divergence across seen and unseen classes can be recovered.

  2. The symmetric constraint of the encoder-decoder framework ensures that the features of one modality are easily recovered from the other modalities, which underpins the transferability and discriminability of the proposed approach.

  3. We also demonstrate that the proposed framework is suitable for the multi-modality setting by exploring both the common and complementary information among different modalities. The experimental results show that finding an appropriate weight for each modality yields improved performance compared with that of any single modality.

We conduct extensive experiments on four datasets, i.e., the Animals with Attributes (AwA) [13], Caltech-UCSD Birds (CUB) [30], and aPY [12] attribute datasets, and ImageNet. The experimental results on traditional ZSL, generalized ZSL, and zero-shot retrieval demonstrate that the proposed approach not only transfers the source information to the target domain well but also preserves the discriminability between the seen classes and the unseen ones.

II Related Work

Our work is related to several zero-shot learning scenarios, including traditional zero-shot learning, generalized zero-shot learning, and zero-shot retrieval. We review the differences from and connections to the related work for each scenario.

II-A Traditional Zero-Shot Learning (TZSL)

Inspired by human beings' inferential ability to recognize unseen categories from experiential knowledge about seen categories and descriptions of the unseen ones, TZSL is first attempted in [29], which introduces a model that generalizes to unseen classes or tasks via their corresponding class descriptions. Motivated by this transfer mechanism, [13] represents each class with its corresponding class-level attributes and introduces two probabilistic models for TZSL. Considering that the collection of class-level attributes is time-consuming, [21] and [24] incorporate natural language techniques into ZSL and use a high-dimensional word vector to represent the name of each class. Likewise, [8] also represents the ontological relationships between different classes using the WordNet taxonomy. Once the class semantic representations are obtained, subsequent TZSL approaches mainly focus on learning the interactions between the visual modality and the class semantic modality. It is a cross-modality problem since the visual features and the class semantic features lie in different modalities. The existing approaches can be divided into three categories according to the direction of the mapping function between the visual space and the class embedding space. First, the simplest way is to learn a model that projects the visual features to the class embedding space via linear regression [10] or a neural network [21]. However, such a directional mapping easily suffers from the hubness issue [31], that is, the tendency of some unseen class prototypes ("hubs") to appear among the top neighbours of many test instances. To address this issue, Shigeto et al. [17] propose to learn a reverse mapping function that projects the class semantic embedding vectors into the visual space. Third, inspired by cross-modality learning, many approaches focus on relating the two spaces through a common latent space for both the visual space and the class semantic embedding space. By constructing a bilinear mapping, DeViSE [24], SJE [8], ESZSL [18], and JEDM [32] learn a translator function to measure the linking strength between the image visual features and the class semantic vectors.
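A minimal sketch of the bilinear compatibility idea mentioned above is shown below. It is not any specific published model; the matrix W is a random placeholder standing in for a learned compatibility matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d_vis, d_sem = 1024, 85
W = rng.normal(size=(d_vis, d_sem))       # learned compatibility matrix (placeholder)
x = rng.normal(size=(d_vis,))             # test image feature
prototypes = rng.normal(size=(10, d_sem)) # unseen class semantic vectors

scores = x @ W @ prototypes.T             # linking strength F(x, a_c) = x^T W a_c
predicted_class = int(np.argmax(scores))
```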

II-B Generalized Zero-Shot Learning (GZSL)

TZSL assumes that the testing instances are only classified into the candidate unseen classes. This scenario is unrealistic since instances from seen classes are more likely to appear at test time in the real world. GZSL [11, 27] is a more open setting that classifies the testing instances into both the seen and unseen classes. Compared with TZSL, GZSL is a more challenging task. It requires not only transferring the information from the source domain to the target domain but also distinguishing the seen classes from the unseen ones. This is a dilemma: effective transfer to the unseen classes relies on having more related seen classes, yet more related seen classes diminish the discriminability between the seen and unseen classes. In other words, most testing instances tend to be classified into the affinal seen classes rather than their ground-truth unseen ones. Although the existing TZSL approaches can be applied to GZSL directly, their classification performances are poor. Recently, a few approaches have tried to address this issue. For example, [11] proposes a simple approach to balance two conflicting forces: recognizing data from seen classes versus those from unseen ones. In order to improve the discriminative ability between the seen and unseen classes, [27] proposes a maximum margin framework for semantic manifold based recognition to ensure that the instances are projected closer to their corresponding class prototypes than to others (both seen and unseen classes).
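As a hedged illustration of one simple way to trade off seen versus unseen predictions, in the spirit of the balancing idea in [11], the sketch below subtracts a constant margin from the scores of seen classes before taking the argmax. The score matrix and the margin value are illustrative placeholders, not values from the paper.

```python
import numpy as np

def gzsl_predict(scores, seen_mask, gamma=0.5):
    """scores: (n_test, n_classes) compatibility scores over ALL classes.
    seen_mask: boolean array marking which columns are seen classes."""
    adjusted = scores.copy()
    adjusted[:, seen_mask] -= gamma        # penalize seen classes by a margin
    return adjusted.argmax(axis=1)

rng = np.random.default_rng(2)
scores = rng.normal(size=(5, 50))          # 5 test instances, 40 seen + 10 unseen classes
seen_mask = np.arange(50) < 40
labels = gzsl_predict(scores, seen_mask)
```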

II-C Zero-Shot Retrieval (ZSR)

Given a testing instance, zero-shot classification assigns it to its most relevant candidate class. In contrast, ZSR is the inverse process that retrieves images related to the specified attribute descriptions of unseen classes. In this way, ZSR can be seen as a special case of cross-modality retrieval. The performance of ZSR relies on two aspects: i) the consistency of the intra-class visual representations and ii) the effective semantic alignment between the different modalities. Many existing ZSL approaches [23, 34, 35, 36] have been extended to retrieval tasks. However, just like TZSL and GZSL, most existing approaches only retrieve instances from the unseen set, and generalized ZSR is still an open issue.

In summary, the common, practical, and inevitable challenge of ZSL (TZSL, GZSL, and ZSR) lies in preserving the semantic consistency between different modalities. Capturing the common semantic characteristics between different modalities is thus a key to the success of ZSL.

Fig. 1: Illustration of the latent space encoding with images as visual modality and the text documents as class semantic modality. The latent space indicates the co-occurrence links between different modalities. Our method is based on an encoder-decoder framework for each modality to find an effective latent intrinsic semantic pattern of different modalities.

III Proposed Approach

We propose a new encoder-decoder framework, called Latent Space Encoding (LSE), for finding the intrinsic co-occurrence semantic patterns of different modalities, as shown in Fig. 1. In this framework, each modality involves an encoding and a decoding process. Particularly, given as input a vector from one modality, the encoder implicitly decomposes it into an encoding matrix and a code vector, which acts as the latent semantic feature. In the decoding process, both the decoding matrix and the code vector are used to reconstruct the original features. It should be noted that the decoding matrix and the encoding matrix in this framework are enforced to be symmetric so that they can be represented by the same parameters. Such a setting makes the original vector highly recoverable via the decoding matrix and the code vector. To relate different modalities, a pair of features from different modalities are forced to share the same latent representation when they refer to the same concept. In this way, different modalities are related via the shared latent space. An efficient optimization approach is introduced to learn the latent space together with the encoding and decoding matrices. In this section, we first describe the proposed LSE approach and then apply it to ZSL.

III-A Problem Formulation

Suppose that we have $M$ modalities, each of which consists of $n$ instances from $C$ different classes. Let $\{(\mathbf{x}_i^{(m)}, y_i)\mid i=1,\dots,n,\; m=1,\dots,M\}$ denote all training instances, where $\mathbf{x}_i^{(m)}$ is the $i$-th instance of the $m$-th modality and $y_i$ is the label of the corresponding instance. The pair $\mathbf{x}_i^{(m)}$ and $\mathbf{x}_i^{(m')}$ ($m\neq m'$) share the same class label $y_i$, which means that they represent the same concept.

III-B Latent Space Encoding

Given a feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ whose $n$ rows are the feature vectors of the $m$-th modality (the modality index is omitted here for brevity), LSE learns a list of semantic patterns to represent the original vectors based on the encoder-decoder framework. Specifically, in the encoding process, it decomposes the input matrix as the product of a code matrix $\mathbf{Y}\in\mathbb{R}^{n\times k}$ consisting of $n$ code vectors ($k$ being the dimensionality of the latent space) and a linear encoding matrix $\mathbf{W}\in\mathbb{R}^{k\times d}$, i.e., $\mathbf{X}\approx\mathbf{Y}\mathbf{W}$. In the decoding process, the code matrix is obtained as the product of the original input matrix $\mathbf{X}$ and the decoding matrix $\mathbf{W}^{\top}$, i.e., $\mathbf{Y}\approx\mathbf{X}\mathbf{W}^{\top}$. This ensures that the learned code vectors are directly derived from the original input features. Generally, the effectiveness of the encoder-decoder framework depends upon both the representability of the latent space and the recoverability of the original input space.

To improve both the recoverability of the original input space and the representability of the latent space, the difference between the input matrix $\mathbf{X}$ and its reconstruction $\mathbf{Y}\mathbf{W}$ from the latent code matrix and the encoding matrix should be minimized. Meanwhile, the difference between the latent code matrix $\mathbf{Y}$ and its reconstruction $\mathbf{X}\mathbf{W}^{\top}$ from the original input matrix and the decoding matrix is also expected to be minimized. Denoting the objective as $\mathcal{L}(\mathbf{W},\mathbf{Y})$, we thus have:

$$\mathcal{L}(\mathbf{W},\mathbf{Y}) = \|\mathbf{X}-\mathbf{Y}\mathbf{W}\|_F^2 + \lambda\,\|\mathbf{Y}-\mathbf{X}\mathbf{W}^{\top}\|_F^2, \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix and $\lambda$ is a parameter balancing the two terms. The first term decomposes the original input matrix into a latent code matrix and an encoding matrix, which ensures that the latent features well capture the original features. The second term constrains the latent code matrix to be derived from the original input matrix via the decoding matrix. It should be noted that the encoding matrix and the decoding matrix are symmetric (transposes of each other) so that they can be represented by the same parameters. Such a design not only makes the original vectors highly recoverable but also allows Eq. (1) to be optimized with the closed-form solution introduced below.
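To make the objective concrete, below is a small numerical sketch, under the notation just introduced (rows of X are instances), that evaluates the two terms of Eq. (1); all matrices are random placeholders.

```python
import numpy as np

def lse_objective(X, Y, W, lam):
    """Eq. (1): recoverability of X from the codes plus predictability of the codes from X."""
    recon = np.linalg.norm(X - Y @ W, 'fro') ** 2      # first term:  X ~ Y W
    pred = np.linalg.norm(Y - X @ W.T, 'fro') ** 2     # second term: Y ~ X W^T
    return recon + lam * pred

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))                         # n = 200 instances, d = 64 features
Y = rng.normal(size=(200, 10))                         # k = 10 latent dimensions
W = rng.normal(size=(10, 64))                          # shared encoding/decoding parameters
print(lse_objective(X, Y, W, lam=0.1))
```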

Given $\mathbf{Y}$, the optimal $\mathbf{W}$ that minimizes $\mathcal{L}$ is obtained by setting the derivative of Eq. (1) with respect to $\mathbf{W}$ to 0, which yields the condition

$$\mathbf{Y}^{\top}\mathbf{Y}\,\mathbf{W} + \lambda\,\mathbf{W}\mathbf{X}^{\top}\mathbf{X} = (1+\lambda)\,\mathbf{Y}^{\top}\mathbf{X}. \qquad (2)$$

To avoid redundant information in the latent space and enable the latent code vectors to encode the original input features more compactly, we assume that the dimensional axes of $\mathbf{Y}$ are uncorrelated and thus orthonormal, as shown in Eq. (3):

$$\mathbf{Y}^{\top}\mathbf{Y} = \mathbf{I}, \qquad (3)$$

where $\mathbf{I}$ is the identity matrix. Consequently, we obtain the optimal $\mathbf{W}$ with a closed-form expression:

$$\mathbf{W} = (1+\lambda)\,\mathbf{Y}^{\top}\mathbf{X}\,(\mathbf{I}+\lambda\,\mathbf{X}^{\top}\mathbf{X})^{-1}. \qquad (4)$$

To this end, the encoding matrix can be derived from the latent code matrix and the original input matrix, and the remaining task is to find an efficient code matrix. Substituting the optimal $\mathbf{W}$ into Eq. (1) leads to:

$$\mathcal{L}(\mathbf{Y}) = \mathrm{tr}(\mathbf{X}\mathbf{X}^{\top}) + \lambda k - (1+\lambda)^2\,\mathrm{tr}\!\left(\mathbf{Y}^{\top}\mathbf{X}(\mathbf{I}+\lambda\,\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y}\right), \qquad (5)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. With $\mathrm{tr}(\mathbf{X}\mathbf{X}^{\top}) + \lambda k$ being a constant, minimizing Eq. (5) is equivalent to maximizing $\mathrm{tr}\big(\mathbf{Y}^{\top}\mathbf{X}(\mathbf{I}+\lambda\,\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y}\big)$, which can be seen as an expression of the recoverability of the original input space and the representability of the latent space. Consequently, we can derive the following formulation:

$$\max_{\mathbf{Y}}\;\mathrm{tr}\!\left(\mathbf{Y}^{\top}\mathbf{X}(\mathbf{I}+\lambda\,\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y}\right),\quad \mathrm{s.t.}\;\;\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}. \qquad (6)$$

By replacing $\mathbf{X}(\mathbf{I}+\lambda\,\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}$ with $\mathbf{M}$, Eq. (6) is rewritten as:

$$\max_{\mathbf{Y}}\;\mathrm{tr}\!\left(\mathbf{Y}^{\top}\mathbf{M}\mathbf{Y}\right),\quad \mathrm{s.t.}\;\;\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}, \qquad (7)$$

where $\mathbf{M} = \mathbf{X}(\mathbf{I}+\lambda\,\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}$.

In summary, the latent code matrix is learned in an implicit manner by balancing the predictability and the recoverability of the latent space, making the latent space feature-aware.

III-C Capturing the Intrinsic Co-Patterns across Modalities

Let $\mathbf{X}_m\in\mathbb{R}^{n\times d_m}$ be the original input matrix of the $m$-th modality. The model proposed above can encode each original matrix as a matrix embedded in a latent space. If the pair $\mathbf{x}_i^{(m)}$ and $\mathbf{x}_i^{(m')}$ from different modalities represent the same concept, their latent representations are strongly correlated. Here we constrain the correlated modalities to share the same latent code representations. To this end, the final objective function of the proposed LSE is obtained as follows:

$$\max_{\mathbf{Y}}\;\mathrm{tr}\!\left(\mathbf{Y}^{\top}\Big(\sum_{m=1}^{M}\mathbf{M}_m\Big)\mathbf{Y}\right),\quad \mathrm{s.t.}\;\;\mathbf{Y}^{\top}\mathbf{Y}=\mathbf{I}, \qquad (8)$$

where $\mathbf{M}_m = \mathbf{X}_m(\mathbf{I}+\lambda\,\mathbf{X}_m^{\top}\mathbf{X}_m)^{-1}\mathbf{X}_m^{\top}$. Using the Lagrange multiplier approach, each column $\mathbf{y}$ of the optimal $\mathbf{Y}$ is obtained to satisfy the following condition:

$$\Big(\sum_{m=1}^{M}\mathbf{M}_m\Big)\,\mathbf{y} = \eta\,\mathbf{y}, \qquad (9)$$

where $\eta$ is the corresponding Lagrange multiplier. It can be seen that the optimization for $\mathbf{Y}$ can be transformed into an eigenvalue problem. The normalized eigenvectors of $\sum_{m}\mathbf{M}_m$ corresponding to the top $k$ largest eigenvalues form the columns of the optimal code matrix $\mathbf{Y}$, where $k$ is the dimensionality of the latent space. Consequently, the principal co-patterns of different modalities are revealed via the common latent code matrix $\mathbf{Y}$.

The computational complexity of constructing $\mathbf{M}_m$ is about $O(d_m^3 + d_m^2 n + d_m n^2)$, where $d_m$ is the dimensionality of the input feature matrix $\mathbf{X}_m$ and $n$ is the number of input training instances. Since the dimensionality $k$ of the latent space is much smaller than $n$, the eigenvalue problem of Eq. (9) can be solved efficiently with iterative methods such as the Arnoldi iteration [33], whose computational complexity is about $O(n^2 k)$. In this way, the overall computational complexity of the proposed approach is roughly $O\big(\sum_{m}(d_m^3 + d_m^2 n + d_m n^2) + n^2 k\big)$.
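As an illustration of this iterative strategy, the sketch below extracts the top-k eigenvectors of the symmetric summed matrix with SciPy's ARPACK-based solver, which relies on an implicitly restarted Arnoldi/Lanczos iteration; M_sum is an assumed placeholder for the sum of the M_m matrices.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def top_k_codes(M_sum, k):
    """Return the n x k matrix Y whose columns are the eigenvectors of the
    symmetric n x n matrix M_sum with the k largest eigenvalues."""
    vals, vecs = eigsh(M_sum, k=k, which='LA')   # largest algebraic eigenvalues
    order = np.argsort(vals)[::-1]               # sort from largest to smallest
    return vecs[:, order]
```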

Input: X_m: the feature matrix of the m-th modality, m ∈ {1, 2},
   where X_1 denotes the visual feature matrix and X_2 denotes the class semantic feature matrix;
   λ: the balancing parameter;
   k: the dimensionality of the latent space;
   x: the testing instance;
   A_u: the semantic feature matrix of the unseen classes.
Output: The predicted class labels of the unseen data.
Training:
   1: M_1 = X_1 (I + λ X_1^T X_1)^{-1} X_1^T
   2: M_2 = X_2 (I + λ X_2^T X_2)^{-1} X_2^T
   3: Y = eigenvector(M_1 + M_2) {eigenvectors corresponding to the top k largest eigenvalues}
   4: The encoding matrices:
       W_m = (1 + λ) Y^T X_m (I + λ X_m^T X_m)^{-1}, m = 1, 2.
   5: The decoding matrices: W_m^T, m = 1, 2.
Testing:
   6: Obtain the latent representations of all the unseen classes from the semantic features A_u:
     Y_u = A_u W_2^T;
   7: Obtain the visual representations of all the unseen classes from the latent representations:
     V_u = Y_u W_1;
   8: Obtain the label of the testing instance x with:
     l(x) = argmax_c s(x, v_c), where v_c is the c-th row of V_u.
Algorithm 1 The implementation of LSE for TZSL
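For concreteness, the following is a compact, hedged NumPy sketch of Algorithm 1 under the notation assumed above (rows of X1 and X2 are paired training instances, rows of A_u are unseen class prototypes). All names are illustrative, and cosine similarity is used as one possible choice for the compatibility score s(·,·).

```python
import numpy as np

def lse_train(X_list, lam, k):
    """Learn the shared latent codes Y (n x k) and encoding matrices W_m (k x d_m)."""
    n = X_list[0].shape[0]
    M_sum = np.zeros((n, n))
    for X in X_list:                                        # M_m = X (I + lam X^T X)^(-1) X^T
        d = X.shape[1]
        M_sum += X @ np.linalg.solve(np.eye(d) + lam * (X.T @ X), X.T)
    vals, vecs = np.linalg.eigh(M_sum)                      # dense symmetric eigendecomposition
    Y = vecs[:, np.argsort(vals)[::-1][:k]]                 # top-k eigenvectors as columns
    W_list = []
    for X in X_list:                                        # Eq. (4): closed-form encoding matrix
        d = X.shape[1]
        W = (1 + lam) * np.linalg.solve(np.eye(d) + lam * (X.T @ X), X.T @ Y).T
        W_list.append(W)                                    # W_m is k x d_m; the decoder is W_m^T
    return Y, W_list

def lse_predict(x, A_u, W_vis, W_sem):
    """Classify a test visual feature x (d1,) against unseen prototypes A_u (C_u x d2)."""
    latent = A_u @ W_sem.T                                  # step 6: prototypes -> latent space
    visual = latent @ W_vis                                 # step 7: latent -> visual space
    sims = (visual @ x) / (np.linalg.norm(visual, axis=1) * np.linalg.norm(x) + 1e-12)
    return int(np.argmax(sims))                             # step 8: most compatible class

# Toy usage with random data (shapes only, for illustration).
rng = np.random.default_rng(3)
X1, X2 = rng.normal(size=(200, 64)), rng.normal(size=(200, 32))
Y, (W1, W2) = lse_train([X1, X2], lam=0.1, k=10)
A_u = rng.normal(size=(5, 32))
label = lse_predict(rng.normal(size=64), A_u, W1, W2)
```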

III-D Applying LSE to ZSL

Given a set of training instances $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ from the seen classes $\mathcal{Y}_s$, where $\mathbf{x}_i$ and $y_i$ are respectively the visual feature and the label of the $i$-th instance, zero-shot learning aims to learn a classifier for a label set $\mathcal{Y}_u$ that is disjoint from $\mathcal{Y}_s$, i.e., $\mathcal{Y}_s\cap\mathcal{Y}_u=\emptyset$. In order to transfer the information from the seen classes to the unseen ones, each class $c\in\mathcal{Y}_s\cup\mathcal{Y}_u$ is associated with a semantic vector $\mathbf{a}_c$, e.g., attributes or word embeddings.

ZSL can be seen as a special case of the proposed LSE approach, where the visual space is the first modality and the class semantic embedding space is the second modality. The visual modality and the class semantic modality are connected by learning a shared latent representation with Eq. (8). Once the optimal code matrix $\mathbf{Y}$ is obtained, the encoding and decoding matrices are easily derived with Eq. (4). In the testing stage, the unseen instances are classified by computing the similarity between the visual features and the unseen class semantic embedding vectors with the learned encoding and decoding matrices. In order to alleviate the influence of the hubness issue mentioned in [31], we perform ZSL in the visual space by encoding the class semantic vectors into the visual space:

$$l(\mathbf{x}) = \arg\max_{c}\; s(\mathbf{x}, \mathbf{v}_c), \qquad (10)$$

where $l(\mathbf{x})$ returns the label of the test instance $\mathbf{x}$, $\mathbf{v}_c = \mathbf{a}_c\mathbf{W}_2^{\top}\mathbf{W}_1$ is the vector in the visual space projected from the $c$-th unseen class embedding vector $\mathbf{a}_c$, and $s(\cdot,\cdot)$ is the compatibility score between the test instance and the class embedding vector. An illustration of the implementation of LSE for TZSL is shown in Algorithm 1.

In some cases, more than one type of class semantic embedding space is available, each capturing an aspect of the structure of the class semantics. To explore the complementary information of different modalities, we can learn a better code matrix by combining them together. After learning the latent code matrix with Eq. (8), the encoder and decoder for each modality are derived with Eq. (4) correspondingly. Consequently, we model the final prediction as

$$l(\mathbf{x}) = \arg\max_{c}\;\sum_{m}\alpha_m\, s\big(\mathbf{x}, \mathbf{v}_c^{(m)}\big), \qquad (11)$$

where $\mathbf{v}_c^{(m)}$ is the vector in the visual space projected from the $c$-th unseen class embedding vector of modality $m$, and $\alpha_m$ is the weight parameter for modality $m$. In our experiments, we perform a grid search over the weights $\alpha_m$ on the unseen classes.
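A minimal sketch of this weighted fusion follows; the score array and weights are illustrative placeholders, with one row of scores per class-semantic modality.

```python
import numpy as np

def fuse_and_predict(scores_per_modality, alphas):
    """scores_per_modality: (M, C) array of s(x, v_c^(m)); alphas: (M,) modality weights."""
    combined = np.tensordot(alphas, scores_per_modality, axes=1)  # sum_m alpha_m * s_m
    return int(np.argmax(combined))

scores = np.array([[0.2, 0.7, 0.1],    # e.g., attribute modality
                   [0.4, 0.5, 0.3]])   # e.g., word vector modality
print(fuse_and_predict(scores, alphas=np.array([0.6, 0.4])))
```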

IV Experiments

In this section, we design extensive experiments to evaluate the proposed LSE approach. First, we introduce the experimental setups, including the datasets, features, and evaluation metrics used in the experiments. Second, we provide the TZSL, GZSL, and ZSR results on the four benchmark datasets, respectively. Finally, further analyses of the proposed approach are discussed.

IV-A Experimental Setup

Datasets. Three benchmark attribute datasets and a large-scale image dataset are used for our evaluations. (a) The Animals with Attributes (AwA) dataset consists of 30,475 images from 50 animal classes. Each class is associated with 85 class-level attributes. We follow the same seen/unseen split as that in [13] for the experiments. (b) Caltech-UCSD Birds 2011 (CUB) [30] is a fine-grained dataset with 200 different bird classes, which consists of 11,788 images. Each class is annotated with 312 attributes. To facilitate direct comparison, we follow the split suggested in [8], in which 150 classes are used for training and the remaining 50 classes for testing. (c) aPascal-aYahoo [12] combines the aPascal dataset (12,695 images from 20 classes) with the aYahoo dataset (2,644 images from 12 classes). Each image is annotated with 64 binary attributes. To represent each class with an attribute vector, we average the attributes of the images in each class. In the experiments, aPascal is used as the seen data and aYahoo as the unseen data. (d) For ImageNet, we follow the same seen/unseen split as that in [27], where 1,000 classes from ILSVRC2012 are used for training, while 360 non-overlapping classes from ILSVRC2010 are used for testing. The details of these four datasets are listed in TABLE I.

Dataset     SS (Attr. / Wordvec.)    Training (Images / Classes)    Testing (Images / Classes)
AwA         85 / 100                 24,295 / 40                    6,180 / 10
CUB         312 / 400                8,855 / 150                    2,933 / 50
aPY         64 / -                   12,695 / 20                    2,644 / 12
ImageNet    - / 1,000                200,000 / 1,000                54,000 / 360
TABLE I: The statistics of the four datasets used in the experiments. "SS" denotes the dimensionality of the class semantic space (attributes / word vectors).

Semantic embedding space. For the AwA and CUB datasets, both the attribute space and the word vector space are used as the semantic embedding space. For easy comparison with the existing approaches, we train a word2vec model [15] on a corpus of 4.6M Wikipedia documents to obtain a 100-dimensional vector for each AwA class name and a 400-dimensional vector for each CUB class name. For the aPY dataset, only the attribute space is used since few approaches are evaluated with word vectors on it. For the ImageNet dataset, a 1,000-dimensional word vector is used to represent each class name.

Visual feature space. In order to better compare with the existing approaches, we use deep features extracted from popular CNN architectures. For a fair comparison, two types of deep features, 4,096-dim VGG [5] features and 1,024-dim GoogleNet [6] features, are used for the three benchmark attribute datasets. These features are available from [23] and [37], respectively. For the ImageNet dataset, we use the 1,024-dim GoogleNet features provided in [28].

Evaluation metric. Following traditional supervised classification, many ZSL approaches [23, 28, 36] are evaluated with per-image accuracy (PI), which checks, for each test instance, whether the predicted label is the correct class label. However, this criterion may encourage biased predictions toward densely populated classes. Thus, per-class accuracy (PC) [13, 32, 37] is commonly used for ZSL. In our experiments, PC is adopted to evaluate the ZSL and GZSL performances. For ZSR, mean average precision (mAP) [23, 35, 36] is used to measure the performance.
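For clarity, a short sketch of the two accuracy variants discussed above is given below; both functions take integer label arrays.

```python
import numpy as np

def per_image_accuracy(y_true, y_pred):
    """PI: fraction of test instances whose predicted label is correct."""
    return float(np.mean(y_true == y_pred))

def per_class_accuracy(y_true, y_pred):
    """PC: accuracy computed per class, then averaged over classes."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))
```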

Implementation details. Our LSE approach has two parameters to investigate: the balance parameter λ and the dimensionality k of the latent space. As in [34], their values are set by class-wise cross-validation using the training data. It should be noted that the dimensionality of the latent space is always smaller than that of the input space. All the experiments are conducted on a computer with 4-core 3.3GHz CPUs and 24GB RAM.

IV-B TZSL Results

IV-B1 TZSL Results with Attributes

Method F AwA CUB aPY
DAP [13] G 60.1 36.7 35.5
RRZSL [17] G 66.4 45.4 38.8
ESZSL [18]§ G 76.8 49.1 47.3
SAE [28]§ G 81.4 46.2 41.3
SSE [23] V 76.3 30.4 46.2
JLSE [34] V 80.5 41.8 50.4
MLZSL [35] V 77.3 43.3 53.2
MFMR [36] V/G 79.8/76.6 47.7/46.2 48.2/46.4
SynC [37] V/G 78.6/73.4 50.3/54.4 48.9/44.2
LSE V/G 81.9/81.6 55.4/53.2 47.6/53.9
TABLE II: Comparison to the existing TZSL approaches in terms of classification accuracy (%) on three datasets with attributes. Two types of deep features (VGG and GoogleNet) are used. ‘V’ and ‘G’ are short for VGG and GoogleNet features, respectively. ‘§’  indicates the methods with which the classification performances are obtained by ourselves. For each dataset, the best one with VGG features is marked with underline and the best one with GoogleNet features is marked in bold.

In order to evaluate the effectiveness of the proposed approach, nine state-of-the-art ZSL approaches are selected for comparison: 1) DAP [13], RRZSL [17], ESZSL [18], and SAE [28] are compared using the GoogleNet features; 2) SSE [23], JLSE [34], and MLZSL [35] are compared using the VGG features; 3) MFMR [36] and SynC [37] are compared using both the GoogleNet and the VGG features. The performance results of the selected approaches are all taken from the original papers except for ESZSL [18] and SAE [28], which are fine-tuned by ourselves using the codes released by the authors; the hyperparameters of both models are selected from {0.01, 0.1, 1, 10, 100}.

TABLE II presents the classification accuracy of our approach and the nine competitive baselines with attributes. Generally, our approach achieves state-of-the-art performance on the three benchmark datasets. Specifically, it outperforms all the competitors on the AwA dataset, with 1.4% and 0.2% improvements over the closest VGG-based competitor (i.e., JLSE) and the closest GoogleNet-based competitor (i.e., SAE), respectively. On the CUB dataset, the relative accuracy gain of LSE over SynC [37], i.e., the second best approach, is 5.3% with the VGG features. For the aPY dataset, LSE also beats all the competitors by a large margin using the GoogleNet features.

In addition, compared with the approaches that mostly focus on learning an explicit mapping function [17, 18, 23], the proposed approach achieves obvious improvements on the three datasets, showing the effectiveness and superiority of our implicit encoding and of the balance between the encoding and decoding processes.

IV-B2 ZSL Results on the ImageNet Dataset

Five state-of-the-art competitors are selected for the ImageNet dataset. Among them, DeViSE [24] is an end-to-end deep embedding framework that connects the visual features and the word vectors via a common compatibility matrix. Building on the embedding representations of the visual features from DeViSE, AMP [25] constructs a class prototype graph to measure the similarity between the visual embedding representations and the class prototypes. ConSE [26] learns an n-way probabilistic classifier for the seen classes and infers the unseen classifiers via the semantic relationships between the seen and unseen classes. SS-Voc [27] and SAE [28] are two embedding approaches: SS-Voc [27] improves classification performance by utilizing vocabulary over unsupervised items to train the model, and SAE [28] adds a reconstruction term to encourage learning a more generalized model. The comparison results are reported in TABLE III.

For a fair comparison with the alternatives, we use Top@1 and Top@5 classification accuracies to evaluate the approaches. From the comparison results, we can find that the proposed LSE obtains superior performance on the ImageNet dataset. Specifically, it outperforms the closest competitor by 1.8% in Top@5. This is impressive since the number of testing instances is large.

Method Top@1 Top@5
DeViSE [24] 5.2 12.8
AMP [25] 6.1 13.1
ConSE [26] 7.8 15.5
SS-Voc [27] 9.5 16.8
SAE [28]§ 12.1 25.6
LSE 12.4 27.4
TABLE III: The classification performance (%) of different TZSL approaches on ImageNet dataset.

IV-B3 ZSL Results with Multimodal Features

Method          F   SS    AwA    CUB    Average
SJE [8]         G   A     66.7   50.1   58.4
SJE [8]         G   W     51.2   28.4   39.8
SJE [8]         G   A+W   73.5   51.0   62.3
LatEm [38]      G   A     72.5   45.6   59.1
LatEm [38]      G   W     52.3   33.1   42.7
LatEm [38]      G   A+W   76.1   47.4   61.8
RKT [40]        V   A     76.0   39.6   57.8
RKT [40]        V   W     76.4   25.6   51.0
RKT [40]        V   A+W   82.4   46.2   64.3
BiDiLEL [39]    V   A     78.3   48.6   63.5
BiDiLEL [39]    V   W     57.0   33.6   45.3
BiDiLEL [39]    V   A+W   77.8   51.3   64.6
LSE             V   A     81.9   55.6   68.8
LSE             V   W     74.9   35.2   55.1
LSE             V   A+W   83.2   56.3   69.8
LSE             G   A     81.6   53.2   67.4
LSE             G   W     77.3   34.7   56.0
LSE             G   A+W   84.5   54.3   69.4
TABLE IV: Comparison results (in %) of the existing TZSL approaches with multiple modalities on the AwA and CUB datasets. Two types of semantic embedding space are used. ‘A’ and ‘W’ are short for attributes and word vectors, respectively. ‘V’ and ‘G’ are short for VGG and GoogleNet features, respectively. For each dataset, the best one with VGG features is marked with underline and the best one with GoogleNet features is marked in bold.
Method        AwA (U-U / S-S / U-T / S-T)    CUB (U-U / S-S / U-T / S-T)    aPY (U-U / S-S / U-T / S-T)    ImageNet (U-U / S-S / U-T / S-T)
ESZSL [18]    76.8 / 87.9 / 26.2 / 87.6      49.1 / 66.4 / 15.1 / 65.2      47.3 / 75.6 / 24.6 / 70.3      25.7 / 94.9 / 9.1 / 94.8
SynC [11]     73.4 / 81.0 / 0.4 / 81.0       54.4 / 73.0 / 13.2 / 72.0      44.2 / 72.9 / 18.4 / 68.9      23.4 / 93.7 / 7.2 / 93.6
JEDM [32]     77.4 / 84.2 / 31.9 / 82.6      48.4 / 51.4 / 12.3 / 48.9      49.6 / 76.4 / 35.5 / 71.6      25.9 / 93.2 / 8.8 / 93.0
SAE [28]      81.4 / 84.7 / 35.5 / 85.2      46.2 / 57.1 / 27.4 / 56.7      41.3 / 71.7 / 29.7 / 66.6      25.6 / 92.9 / 8.4 / 92.2
MFMR [36]     76.6 / 81.2 / 33.2 / 79.7      46.2 / 48.6 / 12.5 / 42.8      46.4 / 65.3 / 31.3 / 54.3      21.6 / 89.2 / 6.9 / 88.9
LSE           82.0 / 88.2 / 42.4 / 87.9      53.2 / 64.1 / 33.6 / 62.1      53.9 / 75.7 / 51.2 / 74.2      27.4 / 93.6 / 12.4 / 93.3
TABLE V: Performance (%) comparison with the state-of-the-art approaches on GZSL under the U-U, S-S, U-T, and S-T scenarios. The best performance is marked in bold under different scenarios.

One major limitation of many existing ZSL approaches is that they mostly focus on two modalities, e.g., the visual and the attribute modalities. However, in the real world the semantic information hides in different modalities. Thus, it is desirable to develop the capability of handling more than two modalities. A main advantage of LSE is that it can fuse multimodal features into the framework. To this end, we evaluate our approach with multimodal features on the AwA and CUB datasets, respectively. Four related multimodal fusion approaches are selected for comparison. SJE [8] and LatEm [38] are two GoogleNet-based approaches, and RKT [40] and BiDiLEL [39] are two VGG-based approaches.

TABLE IV summarizes our comparison results with the competing approaches on the AwA and CUB datasets. From the table, we have the following observations: 1) Our LSE approach achieves the best performance on both datasets with different modalities, except for the result using VGG visual features and word vectors as semantic embedding features. Specifically, the proposed LSE outperforms the second best method, BiDiLEL [39], by 5.3%, 9.8%, and 5.2% on average with VGG visual features using attributes, word vectors, and attributes+word vectors as semantic embedding features, respectively. Besides, with GoogleNet visual features, the proposed LSE has 8.3%, 13.3%, and 7.6% gains over the second best one, i.e., LatEm [38], with attributes, word vectors, and attributes+word vectors, respectively. 2) On both datasets, the results of the different ZSL approaches with attributes are better than those with word vectors, indicating that the attribute information contains more transferable semantics than the existing word vector representations. 3) The classification results of all ZSL approaches exploiting both the attributes and word vectors are much better than those with a single one, which demonstrates that the attributes and word vectors provide not only common information but also complementary features. 4) In contrast to those on the AwA dataset, the results on CUB with word vectors are clearly inferior to those with attributes. This may be due in part to the fact that CUB is a fine-grained dataset whose inter-class differences are small, making the word vectors contain less discriminative information.

IV-C Results of GZSL

Fig. 2: The confusion matrices of SynC [11] and our LSE on the AwA dataset under the U-T scenario, where the columns are the classes that the testing instances truly belong to and the rows are the classes into which the testing instances are classified.

We also evaluate our approach under the GZSL setting on the four datasets. Four scenarios, U-U, S-S, U-T, and S-T, are evaluated. U-U is actually TZSL, which means that the testing instances are assumed to be classified into the candidate unseen classes. S-S is traditional supervised classification; in the experiments, 80% of the instances from the seen classes are randomly selected to train the model and the remaining 20% are used for testing. U-T is the scenario where the candidate classes for testing instances from unseen classes are all classes, including both the seen and unseen classes, while S-T classifies the testing instances from seen classes into both the seen and unseen classes. Among these four scenarios, high performance on U-T and S-T is particularly encouraged since they have more practical significance. TABLE V compares our model with five competitors on the four datasets. For the AwA, CUB, and aPY datasets, per-class accuracy is used to evaluate the performance, while for the ImageNet dataset, Top@5 classification accuracy is used. All the performances of the competitors are obtained by fine-tuning the models using the codes released by the authors. The hyperparameters of the competitors are selected from {0.01, 0.1, 1, 10, 100}.

From the results shown in TABLE V, we observe that 1) the performance differences between S-S and S-T are small, which means that most testing instances from seen classes are classified into the seen classes. However, the performance differences between U-U and U-T are very large, which indicates that many testing unseen instances are wrongly classified into the seen classes. 2) LSE performs satisfactorily under U-T scenario and beats the other competitors by a large margin. Specifically, it has 6.9%, 6.2%, 15.7%, and 3.3% improvements over the second best methods on AwA, CUB, aPY, and ImageNet datasets, respectively.

In order to show a clearer comparison, we further visualize the classification results in terms of the confusion matrices of SynC [11] and LSE on the AwA dataset under the U-T scenario. As illustrated in Figure 2, SynC wrongly classifies most testing instances into the corresponding affinal seen classes. For example, the instances from the "chimpanzee" class, an unseen class, are mostly classified into its affinal seen class "gorilla". For LSE, in contrast, although many testing instances are also classified into the seen classes, most of them are classified into the correct classes. This indicates that the proposed approach not only transfers the information from the seen classes to the unseen ones, but also has the ability to distinguish the seen classes from the unseen ones. From the perspective of transferability, more affinal seen classes allow the information to be easily transferred to the unseen classes. However, more affinal seen classes also weaken the discriminability of the model, since most unseen instances tend to be classified into the seen classes. The comparison results in TABLE V and Figure 2 illustrate that LSE finds a better tradeoff than the competing approaches.

Method AwA CUB aPY Average
SSE-INT [23] 46.3 4.7 15.4 22.1
JSLE [34] 66.5 23.9 32.7 41.0
SynC [37] 65.4 34.3 30.4 43.4
MLZSL [35] 68.1 25.3 36.9 43.4
MFMR [36] 70.8 30.6 45.6 49.0
LSE 73.2 44.8 42.3 53.4
TABLE VI: Zero-shot retrieval mAP (%) comparison on three benchmark datasets. The results of the selected approaches are cited from the original papers.

IV-D Zero-Shot Retrieval Performance

Given an unseen class prototype as a query, the task of ZSR is to retrieve its related instances from the unseen candidate set. In the experiments, the VGG visual features are used to obtain the retrieval performance, and five state-of-the-art VGG-based approaches are selected for comparison. Since no comparable ZSR approaches have been evaluated on the ImageNet dataset in the literature, we conduct experiments on the three attribute datasets. TABLE VI presents the ZSR results in terms of mAP. From the results, we can find that LSE performs the best on the AwA and CUB datasets. Specifically, LSE obtains 2.4% and 10.5% mAP gains over the best counterparts on the AwA and CUB datasets, respectively. Furthermore, the proposed LSE significantly and consistently outperforms the closest competitor, i.e., MFMR [36], by 4.4% on average. The superior ZSR performance of LSE indicates both the strong visual consistency between the corresponding classes of the different modalities and the effective semantic alignment across different modalities achieved by our LSE approach.

IV-E Parameter Sensitivity Analysis

Fig. 3: The influences of different latent dimensionalities on the AwA and CUB datasets; "Attr" and "Wordvec" are short for attributes and word vectors, respectively.
Fig. 4: The influences of different values of λ on the AwA and CUB datasets; "Attr" and "Wordvec" are short for attributes and word vectors, respectively.

To evaluate the effects of the parameters of our method on unseen data, we report the TZSL classification accuracy on the AwA and CUB datasets under different settings with respect to different parameter values. Specifically, there are two parameters in the training stage, λ and k: λ is the balance parameter, which is selected from a set of candidate values, and k is the dimensionality of the latent space, which is smaller than that of the input space. In the experiments, we vary one parameter at a time while fixing the other to its optimal value.

The two sub-figures in Fig. 3 illustrate the influences of different latent dimensionalities on the AwA and CUB datasets. We observe that the curves vary across datasets. This is reasonable since the dimensionalities of the original input semantic embedding spaces vary across datasets. The curves in Fig. 3 (a) show that the performances initially increase, achieve their peaks, and then decline with a further increase of the latent dimensionality; we report the best performances at their peaks. The performances are more robust to the latent dimensionality on the CUB dataset: as illustrated in Fig. 3 (b), the curves tend to be flat when the latent dimensionality is larger than 60. The curves for these two datasets are drawn by setting the balancing parameter λ to 0.1.

Fig. 4 reports the classification performances using various values of λ on the AwA and CUB datasets. The two sub-figures of Fig. 4 show that, when λ equals 0, the performances are worse than those with λ equal to 0.1. This indicates that the encoding constraint (see Eq. (1)) boosts the classification performance and improves the generalized transfer ability of the framework on unseen data. When λ equals 0, the proposed approach reduces to finding a latent space for the different input modalities with matrix factorization, which shares the same idea as MFMR [36]. As shown in Fig. 4 (a), when λ is larger than 0.1, the curves generally decrease with the increase of λ on the AwA dataset with different features, which indicates that the decoding process plays a more important role than the encoding process in the framework. The curves for the CUB dataset in Fig. 4 (b) have a similar trend to those in Fig. 4 (a) but are more robust to the various values of λ. Compared to AwA, CUB is a fine-grained dataset in which the inter-class differences are small; thus, CUB is insensitive to λ. In the experiments, we set λ to 0.1 for the AwA dataset and 0.2 for the CUB dataset under the different settings.

IV-F Computational Cost

Method Training Testing
ESZSL [18] 0.62 0.04
MFMR [36] 66.95 1.01
SynC [37] 9.86 4.22
SAE [28] 1.19 0.34
LSE 57.76 0.42
fast-LSE 0.60 0.42
TABLE VII: The computational cost (in second) of different linear ZSL approaches on AwA dataset.

In this section, we compare the computational cost of LSE with those of four other linear ZSL approaches: ESZSL [18], MFMR [36], SynC [37], and SAE [28]. As illustrated in TABLE VII, LSE is somewhat less efficient than its counterparts in the training stage. This is because the computational cost of LSE mainly comes from the eigenvalue decomposition in Eq. (9), which depends on the number of training instances. As stated in [20], the transferability of a model depends on the representation of the training classes rather than the number of training instances. We therefore apply LSE with the fast strategy proposed in [20], representing each training class by the visual pattern obtained by averaging the visual features of the images in that class, and call the resulting variant fast-LSE. Experimental results show that this fast strategy improves the computational efficiency dramatically.
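A hedged sketch of this class-averaging strategy is given below (assuming rows of X are instances); the eigenvalue problem then scales with the number of seen classes rather than the number of training instances.

```python
import numpy as np

def class_mean_features(X, labels):
    """X: n x d visual features (rows are instances); labels: (n,) integer class labels.
    Returns a C x d matrix with one averaged visual pattern per class."""
    classes = np.unique(labels)
    return np.stack([X[labels == c].mean(axis=0) for c in classes], axis=0)
```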

V Conclusion and Future Work

In this paper, we have proposed a novel latent space encoding approach for ZSL. It learns the intrinsic semantic information of different modalities by implicitly decomposing the input features based on an encoder-decoder framework. The proposed framework can also be extended to multimodal settings. Experimental results on TZSL, GZSL, and ZSR demonstrate that the proposed approach not only transfers the information from the seen domain to the unseen domain efficiently but also distinguishes the seen classes from the unseen ones well.

In the future, we shall extend the framework into an end-to-end deep approach on larger-scale datasets, and improve the compactness of the latent space by preserving the structure of the learned code vectors in other multimodal applications.

References

  • [1] Y. Fu, M. Hospedales, Timothy, T. Xiang, and S. Gong, “Learning multimodal latent attributes,” IEEE Trans. on Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 303-316, 2014.
  • [2] Y. Fu, M. Hospedales, Timothy, T. Xiang, and S. Gong, “Transductive multi-view zero-shot learning,” IEEE Trans. on Pattern Anal. Mach. Intell., vol. 37, no. 11, pp. 2332-2345, 2015.
  • [3] Y. Guo, G. Ding, J. Han, and Y. Gao, “Zero-shot Learning with Transferred Samples,” IEEE Trans. on Image Process., vol. 26, no. 7, pp. 3277-3290, 2017.
  • [4] Y. Long, L. Liu, L. Shao, et al., “From Zero-Shot Learning to Conventional Supervised Classification: Unseen Visual Data Synthesis,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Honolulu, Hawaii, USA, July 2017.
  • [5] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Int. Conf. Learn. Rep., Banff, Canada, April 2014.
  • [6] C. Szegedy, W. Liu, Y. Jia, et al., “Going deeper with convolutions,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Boston, USA, June 2015, pp. 1-9,
  • [7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Las Vegas, USA, June 2016, pp. 770-778.
  • [8] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, “Evaluation of output embeddings for fine-grained image classification.” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., 2015, pp. 2927-2936.
  • [9] X. Li, Y. Guo, and D. Schuurmans, “Semi-supervised zero-shot classification with label representation learning.” in Proc. IEEE Int. Conf. on Comput. Vis., Santiago, Chile, Dec. 2015, pp.  4211-4219.
  • [10] A. Lazaridou, E. Bruni, and M. Baroni, “Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world,” Proc. ACL., 2014, pp. 1403-1414.
  • [11] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha, “An empirical study and analysis of generalized zero-shot l earning for object recognition in the wild,” in Eur. Conf. on Comput. Vis., Amsterdam, Netherlands, Oct. 2016, pp. 52-68.
  • [12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Miami, USA, June 2009, pp. 1778-1785.
  • [13] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 453-465, 2014.
  • [14] S. J. Hwang, F. Sha, and K. Grauman, “Sharing features between objects and their attributes,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., 2011, pp. 1761-1768.
  • [15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in Neural Inf. Process. Syst., Nevada, US, Dec. 2013, pp. 3111-3119.
  • [16] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proc. Conf. Empi. Meth. Natural Lan. Proc., Doha, Qatar, Oct. 2014, pp. 1532-1543.
  • [17] Y. Shigeto, I. Suzuki, and K. Hara, “Ridge Regression, Hubness, and Zero-Shot Learning,” in Eur. Conf. Mach. Learn., Porto, Portugal, Sep. 2015, pp. 135-151.
  • [18] B. Romera-Paredes and P. H. S Torr, “An embarrassingly simple approach to zero-shot learning,” in Proc. Int. Conf. Mach. Learn., Lille, France, July 2015, pp. 2152-2161.
  • [19] Z. Ji, Y. Yu, Y. Pang, J.  Guo, and Z. Zhang, “Manifold regularized cross-modal embedding for zero-shot learning,” Inf. Sci., pp. 48-58, 2017.
  • [20] Y. Yu, Z. Ji, J. Guo, and Y. Pang, “Transductive zero-shot learning with adaptive structural embedding,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1-12, 2017.
  • [21] R. Socher, M. Ganjoo, C. D. Manning, et al., “Zero-shot learning through cross-modal transfer,” Advances in Neural Inf. Process. Syst., Nevada, US, Dec. 2013, pp. 935-943.
  • [22] L. Jiang, J. Li, Z. Yan, and C. Zhang, “Zero-Shot Learning by Generating Pseudo Feature Representations,” arXiv:1703.06389, 2017.
  • [23] Z. Zhang and V. Saligrama, “Zero-shot learning via semantic similarity embedding,” in Proc. IEEE Int. Conf. on Comput. Vis., Santiago, Chile, Dec. 2015, pp. 4166-4174.
  • [24] A. Frome, G. S. Corrado, J. Shlens, et al., “DeViSE: A deep visual-semantic embedding model,” Advances in Neural Inf. Process Syst., Nevada, US, Dec. 2013, pp. 2121-2129.
  • [25] Z. Fu, T. Xiang, and E. Kodirov, “Zero-shot object recognition by semantic manifold distance,” in Proc. Comput. Visi. Pattern Recognit., Boston, USA, June 2015, pp. 2635-2644.
  • [26] M. Norouzi, T. Mikolov, S. Bengio, et al., “Zero-Shot Learning by Convex Combination of Semantic Embeddings,” Int. Conf. on Learn. Repr., Banff, Canada, April 2014, pp. 1-9.
  • [27] Y. Fu and L. Sigal, “Semi-supervised Vocabulary-informed Learning,” in Proc. Comput. Visi. Pattern Recognit., Las Vegas, USA, June 2016, pp. 5337-5346.
  • [28] E. Kodirov, T. Xiang, and S. Gong, “Semantic Autoencoder for Zero-Shot Learning,” in Proc. Comput. Visi. Pattern Recognit., 2017.
  • [29] H. Larochelle, D. Erhan, and Y. Bengio, “Zero-data Learning of New Tasks,” in AAAI Conf. Art. Intell., Chicago, Illinois, USA, July, 2008, pp. 646-651.
  • [30] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset”, Technical rep., 2011.
  • [31] B. Marco, L. Angeliki, and D. Georgiana, “Hubness and pollution: Delving into cross-space mapping for zero-shot learning,” Proc. ACL., 2015, pp. 270-280.
  • [32] Y. Yu, Z. Ji, X. Li, J. Guo, et al., “Transductive zero-shot learning with a self-training dictionary approach,” arXiv:1703.08893, 2017.
  • [33] R. B. Lehoucq and D. C. Sorensen, “Deflation techniques for an implicitly restarted Arnoldi iteration,” J. Matrix Analysis and Applications, no. 17, vol. 4, 1996, pp. 789-821.
  • [34] Z. Zhang and V. Saligrama, “Zero-shot learning via joint latent similarity embedding,” in Proc. Comput. Visi. Pattern Recognit., Las Vegas, USA, June 2016, pp. 6034-6042.
  • [35] M. Bucher, S. Herbin, and F. Jurie, “Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classiffication,” in Eur. Conf. Comput. Visi., Amsterdam, Netherlands, Oct. 2016, pp. 730-746.
  • [36] X. Xu, F. Shen, Y. Yang, et al, “Matrix tri-factorization with manifold regularizations for zero-shot learning,” in Proc. IEEE Conf. on Comput. Vis. Pattern Recognit., Honolulu, Hawaii, USA, July, 2017.
  • [37] S. Changpinyo, W. L. Chao, and B. Gong, “Synthesized Classifiers for Zero-Shot Learning,” in Proc. Comput. Visi. Pattern Recognit., Las Vegas, USA, June 2016, pp. 5327-5336.
  • [38] Y. Q. Xian, Z. Akata, and G. Sharma, “Latent embeddings for zero-shot classification,” in Proc. Comput. Visi. Pattern Recognit., Las Vegas, USA, June 2016, pp. 69-77.
  • [39] Q. Wang and K. Chen, “Zero-Shot Visual Recognition via Bidirectional Latent Embedding,” in Int. J. Comput. Visi., vol. 124, no. 3, pp. 356-383, 2017.
  • [40] D. Wang, Y. Li, Y. Lin, and Y. Zhuang, “Relational knowledge transfer for zero-shot learning,” in AAAI Conf. Art. Intell., Phoenix, Arizona, USA, Feb. 2016, pp. 1-7.