Discriminative Learning of Latent Features for Zero-Shot Recognition
Zero-shot learning (ZSL) aims to recognize unseen image categories by learning an embedding space between image and semantic representations. For years, the central task in existing works has been to learn the proper mapping matrices aligning the visual and semantic spaces, whilst the importance of learning discriminative representations for ZSL has been ignored. In this work, we revisit existing methods and demonstrate the necessity of learning discriminative representations for both visual and semantic instances of ZSL. We propose an end-to-end network that is capable of 1) automatically discovering discriminative regions by a zoom network; and 2) learning discriminative semantic representations in an augmented space introduced for both user-defined and latent attributes. Our proposed method is tested extensively on two challenging ZSL datasets, and the experimental results show that the proposed method significantly outperforms state-of-the-art methods.
In recent years, zero-shot learning (ZSL) has gained popularity in object recognition [1, 9, 10, 12, 14, 15, 17, 30]. Unlike traditional object recognition methods, which predict the presence of an object instance by assigning its image one of the category labels seen in the training set, zero-shot learning aims to recognize an object instance from a new category never seen before. Therefore, in the ZSL task, the seen categories in the training set and the unseen categories in the test set are disjoint. Typically, descriptors of the categories (e.g., user-defined attribute annotations [1, 15], text descriptions of the categories, word vectors of the class names [7, 19], etc.) are provided for both seen and unseen classes; some of those descriptors are shared between categories. Those descriptors are often called side information or semantic representations. In this work, we focus on learning for ZSL with attributes.
As shown in Figure 1, a general assumption underlying typical ZSL methods is that there exists a shared embedding space, in which a mapping function, $F(x, y; W) = \phi(x)^{\top} W \psi(y)$, is defined to measure the compatibility between the image features $\phi(x)$ and the semantic representations $\psi(y)$ for both seen and unseen classes. $W$ is the visual-semantic mapping matrix to be learned. Existing ZSL approaches mainly focus on introducing linear or non-linear modelling methods, utilizing various optimization objectives and designing specific regularization terms to learn the visual-semantic mapping, or more specifically, to learn $W$.
To date, the learning of the mapping matrix $W$, though important to ZSL, has mainly been driven by minimizing the alignment loss between the visual and semantic spaces. However, the final goal of ZSL is to classify unseen categories. Therefore, the visual features $\phi(x)$ and the semantic representations $\psi(y)$ should arguably be discriminative enough to distinguish different objects. Unfortunately, this issue has thus far been neglected in ZSL, and almost all methods follow the same paradigm: 1) extracting image features by hand-crafting or using pre-trained CNN models; and 2) utilizing the human-designed attributes as the semantic representations. This paradigm has several pitfalls.
Firstly, image features that are either hand-crafted or taken from a pre-trained CNN model may not be representative enough for the zero-shot recognition task. Although the features from a pre-trained CNN model are learned, they are restricted to a fixed set of images (e.g., ImageNet), which is not optimal for a particular ZSL task.
Secondly, the user-defined attributes are semantically descriptive, but they are not exhaustive, which limits their discriminativeness in classification. There may exist discriminative visual clues not reflected by the pre-defined attributes in ZSL datasets, e.g., the huge mouths of hippos. On the other hand, as shown in Figure 1, the annotated attributes, such as big, strong and ground, are shared by many object categories. This is desirable for knowledge transfer between categories, especially from seen to unseen categories. However, if two categories (e.g., cheetah and tiger) share too many user-defined attributes, they will be hardly distinguishable in the space of attribute vectors.
Thirdly, low-level feature extraction and embedding space construction in existing ZSL approaches are treated separately and usually carried out in isolation. As a result, few existing works consider these two components in a unified framework.
To address those pitfalls, we propose an end-to-end model capable of learning latent discriminative features (LDF) for ZSL in both visual and semantic space. Specifically, our contributions are:
1) A cascaded zooming mechanism to learn features from object-centric regions. Our model can automatically identify the most discriminative region in an image and then zoom it into a larger scale for learning in a cascaded network structure. In this way, our model can concentrate on learning features from a region with object as a focus.
2) A framework to jointly learn the latent attributes and the user-defined attributes. We formulate the learning of latent attributes as a category-ranking problem to ensure the learned attributes are discriminative. Meanwhile, the discriminative region mining and the latent attributes modelling are jointly learned in our model and assist each other to gain further improvement.
3) An end-to-end network structure for ZSL. The obtained image features can be regulated to be more compatible with the semantic space, which contains both the user-defined attributes and latent discriminative attributes.
2 Related Work
Early works on zero-shot learning (ZSL) follow an intuitive approach to object recognition that first trains different attribute classifiers and then recognizes an image by comparing its predicted attributes with the descriptions of unseen classes [6, 15]. Among these works, the Direct Attribute Prediction (DAP) model predicts the posterior of each attribute, and the class posteriors for an image are then obtained by maximum a posteriori estimation. In the Indirect Attribute Prediction (IAP) model, in contrast, the attribute posteriors are computed from the class posteriors of seen classes. In these methods, each attribute classifier is trained individually and the relationship between the attributes of a class is not considered.
To address this issue, most of recent ZSL works are embedding-based methods, which seek to build a common embedding space for images and their semantic features. The DeViSE model  and the ALE model  are based on a bilinear embedding model, where a linear transformation matrix is learned with a hinge ranking loss. The ESZSL model  adds a Frobenius norm regularizer into the embedding space construction. The SJE model  combines several compatibility functions linearly to form a joint embedding space. The LatEM model  improves SJE with more nonlinearity by incorporating latent variables. Recently, the SCoRe model  adds a semantically consistent regularization to make the learned transformation matrix perform better on test images. The MFMR model  learns the projection matrix by decomposing the visual feature matrix. The majority of ZSL methods thus far extract image features from whole image with fixed pre-trained CNN models. In contrast, image features in our model are learned to be more representative with the mining of latent discriminative regions and the end-to-end training style.
In typical embedding space construction approaches, only the space of user-defined attributes is used to embed the seen and unseen classes. In contrast, the JSLA model [20, 21] and the LAD model propose to model latent attributes for ZSL, which is similar to our work. JSLA learns latent discriminative attributes by minimizing the intra-class distance between the attributes, while in LAD the discriminativeness of latent attributes is indirectly achieved by training seen-class classifiers over the latent attributes. Different from them, our model directly regulates both the inter-class and intra-class distances between latent attributes to achieve discriminativeness. Moreover, JSLA and LAD still use fixed pre-extracted image features, which are less representative than ours.
Another branch of ZSL approaches are based on hybrid models, which aim to use the combination of seen classes to classify unseen images. The ConSE model  convexly combines the classification probabilities of seen classes to classify unseen objects. The SynC model  introduces synthetic classifiers of unseen classes by linearly combining the classifiers of seen classes. In our method, when the learned latent attributes are utilized for ZSL prediction, the latent attribute prototype for an unseen class is obtained by combining the prototypes of seen classes. To this end, our prediction model is among the family of hybrid models; and beyond that our model also learns embeddings for both user-defined attributes and latent attributes in one network.
3 Task Definition
In the zero-shot learning task, the training set, i.e., the seen classes, is defined as $\mathcal{S} = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ is the $i$-th image of the seen classes and $y_i^s \in \mathcal{Y}^s$ is its corresponding class label. The test set, i.e., the unseen classes, is defined as $\mathcal{U} = \{(x_j^u, y_j^u)\}_{j=1}^{N_u}$, where $x_j^u$ denotes the $j$-th unseen image and $y_j^u \in \mathcal{Y}^u$ is its label. The seen and unseen classes are disjoint, i.e., $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$. Additionally, the user-defined attributes for the seen and unseen classes are denoted as $A^s = \{a_i^s\}_{i=1}^{C_s}$ and $A^u = \{a_j^u\}_{j=1}^{C_u}$, where $a_i^s$ and $a_j^u$ indicate the attribute vectors of the $i$-th seen class and the $j$-th unseen class, respectively. At the test stage, given a test image $x^u$ and the attribute annotations of the test classes $A^u$, the goal of ZSL is to predict the corresponding category $y^u$ for $x^u$.
4 Our Method
The framework of the proposed method is illustrated in Figure 2. Note that the architecture in principle contains multiple scales and for clarity, we illustrate the network with two scales as an example. In each scale, the network consists of three different components, 1) the image feature network (FNet) to extract image representations, 2) the zoom network (ZNet) to locate the most discriminative region and then zoom it to larger scale and 3) the embedding network (ENet) to build the embedding space where the visual and semantic information are associated. For the first scale, the input of the FNet is the image of its original size and the ZNet is responsible for producing the zoomed region. Then for the second scale, the zoomed region is fed into the FNet to obtain more discriminative image features.
4.1 The Image Feature Network (FNet)
Different from existing works [5, 18, 31], we would like to learn image features together with the embedding for zero-shot learning. Therefore, our framework starts with a stack of convolutional layers responsible for learning image features, which is termed the FNet. The choice of the FNet architecture is flexible, and two possible variants are considered in our approach, i.e., VGG19 and GoogLeNet. For VGG19, the FNet spans conv1 to fc7; for GoogLeNet, it spans conv1 to pool5. Given an image or a zoomed region $x$, the image representation is denoted as:
$$\phi(x) = W_{F} * x$$
where $W_{F}$ indicates the overall parameters of the FNet, and $*$ denotes a set of operations of the FNet. Different from traditional ZSL approaches, the parameters of the FNet are jointly trained with the other parts of our framework; thus the obtained features are well regulated by the embedding component. We show that this leads to a performance improvement.
4.2 The Zoom Network (ZNet)
The final goal of zero-shot learning is to classify different object categories. Existing studies show that learning from object regions can benefit object categorization at the image level [8, 32]. Inspired by these studies, we hypothesize that there may exist some discriminative regions in an image which benefit zero-shot learning. Such a region could contain only an object instance or object parts. On the other hand, for ZSL, a candidate region also needs to reflect the user-defined attributes, some of which describe the background, such as swim, tree and mountains. Therefore, a target region is expected to contain some background to enhance the attribute embedding. We call this type of region an object-centric region. To identify such regions, we introduce the zoom network (ZNet), which adopts an incremental zoom-in approach to let the network automatically search for a proper discriminative region from coarse to fine. Here, "proper" means that the target region is discriminative for classification and at the same time matches the annotated attributes.
Specifically, our ZNet takes the output of the last convolutional layer in the FNet (e.g., conv5_4 in VGG19) as its input. For computational efficiency, the candidate region is assumed to be a square, and its location can be represented with three parameters:
$$[z_x, z_y, z_s] = W_{Z} * \phi(x)_{conv}$$
where $z_x$ and $z_y$ indicate the x-axis and y-axis coordinates of the center of the searched square, respectively, and $z_s$ represents the side length of the square. $\phi(x)_{conv}$ denotes the output of the last convolutional layer of the FNet. The ZNet consists of two stacked fully-connected layers (1024-3) followed by the sigmoid activation function, and $W_{Z}$ denotes the parameters of the ZNet.
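After obtaining the location of the square, the searched region can be obtained by directly cropping from the original image. However, the non-continuous cropping operation is not convenient to optimize in backward propagation. Inspired by prior work, the sigmoid function is used to first produce a two-dimensional continuous mask $M(u, v)$ over pixel coordinates $(u, v)$. Formally,
$$M(u, v) = [f(u - z_x + 0.5 z_s) - f(u - z_x - 0.5 z_s)] \cdot [f(v - z_y + 0.5 z_s) - f(v - z_y - 0.5 z_s)]$$
where $(z_x, z_y)$ is the center of the square, $z_s$ is its side length, $f(t) = 1 / (1 + \exp(-\kappa t))$ and $\kappa$ is set to 10 in all experiments.
Then the cropped region $x^{crop}$ can be obtained by element-wise multiplication between the original image $x$ and the continuous mask $M$:
$$x^{crop} = x \odot M$$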
Finally, to obtain a better representation of the more finely localized cropped region, we further use bilinear interpolation to adaptively zoom the cropped region to the same size as the original image. The zoomed region is then fed into a copy of the FNet in the next scale to extract a more discriminative representation.
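As a rough illustration of the continuous mask and crop described above, the following NumPy sketch builds a soft square mask from a center and a side length and applies it to an image. The function names `soft_box_mask` and `soft_crop`, the use of pixel units for the box parameters, and the tanh form of the sigmoid (chosen here only for numerical stability) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def soft_box_mask(height, width, zx, zy, zs, k=10.0):
    """Continuous approximation of a square crop mask.

    zx, zy: box center in pixels (assumed units); zs: side length in pixels.
    k: steepness of the sigmoid; larger values give sharper box edges.
    """
    def f(t):
        # Logistic sigmoid 1 / (1 + exp(-k t)), written with tanh so that
        # large |k t| does not overflow.
        return 0.5 * (1.0 + np.tanh(0.5 * k * t))

    xs = np.arange(width, dtype=np.float64)
    ys = np.arange(height, dtype=np.float64)
    half = 0.5 * zs
    # Each 1-D profile is close to 1 inside [center - half, center + half]
    # and close to 0 outside.
    mask_x = f(xs - (zx - half)) - f(xs - (zx + half))
    mask_y = f(ys - (zy - half)) - f(ys - (zy + half))
    return np.outer(mask_y, mask_x)  # shape (height, width)


def soft_crop(image, zx, zy, zs, k=10.0):
    """Element-wise product of an image (H, W) or (H, W, C) with the mask."""
    mask = soft_box_mask(image.shape[0], image.shape[1], zx, zy, zs, k)
    if image.ndim == 3:
        mask = mask[..., None]  # broadcast over channels
    return image * mask


if __name__ == "__main__":
    img = np.ones((32, 32, 3))
    cropped = soft_crop(img, zx=16, zy=16, zs=8)
    print(cropped[16, 16, 0], cropped[0, 0, 0])
```

Because the mask is a smooth function of the three box parameters, gradients can flow back to whatever network predicts them, which is the reason the text gives for avoiding a hard crop.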
4.3 The Embedding Network (ENet)
4.3.1 The Baseline Embedding Model
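The embedding network (ENet) aims to learn an embedding space where the visual and semantic information are associated. In this section, we first introduce a baseline embedding model, where the semantic representation $\psi(y)$ is defined by the user-defined attributes $a_y$. In this model, the mapping function to be learned is therefore defined as $F(x, y; W) = \phi(x)^{\top} W a_y$.
The attribute space is adopted as the embedding space and the compatibility score is defined by the inner product:
$$s = \langle W^{\top} \phi(x), a_y \rangle$$
where $\phi(x)$ is the $d$-dim image representation obtained by the FNet and $a_y$ is the $k$-dim annotated attribute vector of category $y$. $W \in \mathbb{R}^{d \times k}$ is the weight to learn in a fully connected layer, which can be considered as a linear projection matrix that maps $\phi(x)$ to the attribute space.
The compatibility score measures the similarity between an image and the attribute annotations of the classes. It is similar to the classification score in traditional object recognition. Thus, to learn the matrix $W$, a standard softmax loss can be used:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{i, y_i})}{\sum_{c} \exp(s_{i, c})} \qquad (6)$$
where $s_{i,c} = \langle W^{\top} \phi(x_i), a_c \rangle$ is the compatibility score between training image $x_i$ and seen class $c$, $y_i$ is the label of $x_i$, the sum in the denominator runs over the seen classes, and $N$ is the number of training images.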
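To make the baseline embedding concrete, here is a minimal NumPy sketch of inner-product compatibility scores between projected image features and class attribute vectors, followed by a softmax cross-entropy loss over the seen classes. The names `compatibility_scores` and `softmax_loss` and all array shapes are assumptions for illustration; in the actual model the projection is a fully connected layer trained jointly with the feature network.

```python
import numpy as np


def compatibility_scores(features, W, class_attributes):
    """Inner-product scores between projected features and class attributes.

    features: (N, d) image features; W: (d, k) projection matrix;
    class_attributes: (C, k) one attribute vector per class.
    Returns an (N, C) score matrix.
    """
    return features @ W @ class_attributes.T


def softmax_loss(scores, labels):
    """Mean cross-entropy of the true class under a softmax over the scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(5, 8))   # 5 images, 8-dim features
    attrs = rng.normal(size=(4, 3))   # 4 seen classes, 3 attributes
    W = rng.normal(size=(8, 3)) * 0.1
    labels = np.array([0, 1, 2, 3, 0])
    print(softmax_loss(compatibility_scores(feats, W, attrs), labels))
```

With an all-zero projection every class receives the same score, so the loss reduces to the logarithm of the number of classes, which is a convenient sanity check.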
4.3.2 The Augmented Embedding Model
The baseline embedding model, adopted by most existing ZSL methods, has achieved promising performance. However, it is based on user-defined attributes, which are limited in number and usually not discriminative. To address this issue, we introduce an augmented attribute space, where an image is projected onto both user-defined attributes (UA) and latent discriminative attributes (LA).
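Specifically, our embedding network (ENet) learns a matrix $W_{aug} \in \mathbb{R}^{d \times 2k}$ mapping the $d$-dim image features $\phi(x)$ to a $2k$-dim augmented space, and the embedded image features are computed as follows:
$$\phi_e(x) = W_{aug}^{\top} \phi(x) \qquad (7)$$
The goal is to associate the embedded image features with both the UA and the LA. For simplicity, we equally divide $\phi_e(x)$ into two $k$-dim parts:
$$\phi_e(x) = [\phi_{att}(x); \phi_{lat}(x)]$$
Then we let the first $k$-dim embedded feature $\phi_{att}(x)$ correspond to the UA and the second $k$-dim component $\phi_{lat}(x)$ be associated with the LA. Based on this assumption, for $\phi_{att}(x)$, similar to the baseline model, the softmax loss is used to train the ZSL model. Formally,
$$L_{att} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp \langle \phi_{att}(x_i), a_{y_i} \rangle}{\sum_{c} \exp \langle \phi_{att}(x_i), a_c \rangle}$$
where $a_c$ is the $k$-dim attribute vector of class $c$, $y_i$ is the label of training image $x_i$, the sum in the denominator runs over the seen classes, and $N$ is the number of training images.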
For the second embedded feature $\phi_{lat}(x)$, the goal is to make the learned features discriminative for object recognition. We propose to use the triplet loss to learn the latent discriminative attributes by regulating the inter-class and intra-class distances between the latent attribute features:
$$L_{lat} = \max\big(0,\; m + \mathrm{dist}(\phi_{lat}(x_i), \phi_{lat}(x_k)) - \mathrm{dist}(\phi_{lat}(x_i), \phi_{lat}(x_j))\big) \qquad (10)$$
where $x_i$ and $x_k$ are images from the same class and $x_j$ is from a different class. $\mathrm{dist}(u, v) = \|u - v\|_2^2$ is the squared Euclidean distance between $u$ and $v$. $m$ is the margin of the triplet loss and is set to 1.0 for all experiments.
It is noted that the UA feature $\phi_{att}(x)$ and the LA feature $\phi_{lat}(x)$ are associated with different loss functions. $\phi_{lat}(x)$ can be learned to be discriminative by specifically exploiting the category information in (10).
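The triplet objective on the latent attribute features can be sketched in a few lines of NumPy. This is only an illustration for a single triplet with the margin of 1.0 stated in the text; the function names and the toy vectors are hypothetical, and how triplets are sampled during training is not specified here.

```python
import numpy as np


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    diff = np.asarray(u, dtype=np.float64) - np.asarray(v, dtype=np.float64)
    return float(np.sum(diff ** 2))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the same-class pair is closer than the
    different-class pair by at least `margin` (in squared distance)."""
    return max(0.0, margin
               + squared_distance(anchor, positive)
               - squared_distance(anchor, negative))


if __name__ == "__main__":
    a = [0.0, 0.0]
    p = [0.0, 1.0]          # same class as the anchor
    print(triplet_loss(a, p, [3.0, 0.0]))  # far negative: no penalty
    print(triplet_loss(a, p, [1.0, 0.0]))  # close negative: positive penalty
```

The loss only pushes on triplets that violate the margin, so it shrinks intra-class distances and enlarges inter-class distances at the same time.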
For each scale, the network is trained with both the softmax loss $L_{att}$ and the triplet loss $L_{lat}$. For a two-scale network (i.e., scales $s_1$ and $s_2$), the whole LDF model is trained with the following loss function:
$$L = L_{att}^{s_1} + L_{lat}^{s_1} + L_{att}^{s_2} + L_{lat}^{s_2}$$
The final objective function for a multi-scale network can be constructed similarly by aggregating the loss functions of all scales.
4.4 ZSL Prediction
In the proposed LDF model, the test images can be projected into both user-defined attributes (UA) and latent attributes (LA) as in (7). Thus, ZSL prediction can be performed in both the UA space and the LA space.
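Prediction with UA. Given a test image $x$, it can be projected to the UA representation $\phi_{att}(x)$. To predict its class label, the compatibility scores are used to select the best-matching unseen category:
$$y^{*} = \arg\max_{c \in \mathcal{Y}^u} \langle \phi_{att}(x), a_c^u \rangle \qquad (13)$$
where $\mathcal{Y}^u$ is the set of unseen classes and $a_c^u$ is the attribute vector of unseen class $c$.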
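Prediction with LA. The test image $x$ can also be projected to its LA representation, $\phi_{lat}(x)$. To perform ZSL in the LA space, the LA prototypes of the unseen classes are required.
Firstly, the LA prototypes of the seen classes are computed. Concretely, all samples $x_i$ from a seen class $s$ are projected to their LA features, and the mean of these features is used as the LA prototype of class $s$, i.e., $\bar{\phi}_{lat}^{s} = \frac{1}{n_s} \sum_{i} \phi_{lat}(x_i)$, where $n_s$ is the number of samples in class $s$.
Then, for an unseen class $u$, we compute the relationship between class $u$ and all the seen classes in the UA space. This relationship can be obtained by solving the following ridge regression problem:
$$\beta^{u} = \arg\min_{\beta} \Big\| a^{u} - \sum_{s} \beta_{s} a^{s} \Big\|_2^2 + \lambda \|\beta\|_2^2$$
where $a^{u}$ and $a^{s}$ are the attribute vectors of unseen class $u$ and seen class $s$, and $\lambda$ is the regularization weight.
By applying the same relationship to the LA space, the prototype of unseen class $u$ can be obtained:
$$\bar{\phi}_{lat}^{u} = \sum_{s} \beta_{s}^{u} \bar{\phi}_{lat}^{s} \qquad (15)$$
Finally, the classification result for a test image $x$ with LA representation $\phi_{lat}(x)$ is obtained as follows:
$$y^{*} = \arg\max_{u} \langle \phi_{lat}(x), \bar{\phi}_{lat}^{u} \rangle \qquad (16)$$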
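The LA prediction steps above (seen-class prototypes, ridge regression in the UA space, transfer of the coefficients to the LA space, and prototype matching) can be sketched as follows. This is a minimal NumPy illustration under several assumptions: the closed-form ridge solution is used, the regularization weight `lam` is an arbitrary placeholder, test features are matched to prototypes by inner product, and all function names are hypothetical.

```python
import numpy as np


def class_prototypes(la_features, labels, num_classes):
    """Mean LA feature per seen class; returns (num_classes, k)."""
    return np.stack([la_features[labels == c].mean(axis=0)
                     for c in range(num_classes)])


def ridge_coefficients(seen_attrs, unseen_attr, lam=1.0):
    """Express one unseen attribute vector as a combination of seen ones.

    Solves min_b ||unseen_attr - seen_attrs.T @ b||^2 + lam * ||b||^2
    via the normal equations (A A^T + lam I) b = A a.
    seen_attrs: (Cs, k); unseen_attr: (k,). Returns (Cs,).
    """
    gram = seen_attrs @ seen_attrs.T + lam * np.eye(seen_attrs.shape[0])
    return np.linalg.solve(gram, seen_attrs @ unseen_attr)


def unseen_prototypes(seen_attrs, unseen_attrs, seen_protos, lam=1.0):
    """Reuse the UA-space coefficients to combine seen LA prototypes."""
    betas = np.stack([ridge_coefficients(seen_attrs, a, lam)
                      for a in unseen_attrs])      # (Cu, Cs)
    return betas @ seen_protos                     # (Cu, k_lat)


def predict_unseen(la_features, prototypes):
    """Index of the prototype with the largest inner product per sample."""
    return np.argmax(la_features @ prototypes.T, axis=1)


if __name__ == "__main__":
    seen_attrs = np.eye(3)
    unseen_attrs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
    seen_protos = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    protos = unseen_prototypes(seen_attrs, unseen_attrs, seen_protos)
    print(predict_unseen(np.array([[1.0, 0.0], [0.0, 1.0]]), protos))
```

Note that the coefficients are estimated purely from attribute vectors, so any mismatch between the UA and LA geometries carries over to the synthesized unseen prototypes.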
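Combining multiple spaces. We can consider both the UA and LA spaces and use the concatenated UA-LA feature to perform ZSL prediction. Formally,
$$y^{*} = \arg\max_{c \in \mathcal{Y}^u} \big\langle [\phi_{att}(x); \phi_{lat}(x)],\; [a_c^u; \bar{\phi}_{lat}^{c}] \big\rangle \qquad (17)$$
where $\phi_{att}(x)$ and $\phi_{lat}(x)$ are the UA and LA features of the test image $x$, and $a_c^u$ and $\bar{\phi}_{lat}^{c}$ are the attribute vector and the LA prototype of unseen class $c$.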
Combining multiple scales. For a two-scale LDF model, the UA and LA features are obtained in each scale, and the resulting multi-scale features can be combined to gain further improvement.
For the multi-scale UA features, we first concatenate the features of the two scales and then train a new projection matrix to obtain the combined UA feature. For the multi-scale LA features, the combined feature is obtained by directly concatenating the normalized features of the two scales. Finally, the ZSL prediction can be performed using (17) with the combined UA feature and the combined LA feature.
The proposed LDF model is evaluated on two representative ZSL benchmarks: Animals with Attributes (AwA) and Caltech-UCSD Birds 200-2011 (CUB). AwA includes 30,475 images from 50 common animal categories. The 85 class-level attributes (continuous) and the standard 40/10 zero-shot split are adopted in our experiments. CUB is a fine-grained bird dataset with 200 bird species and 11,788 images. Following SynC, we use a 150/50 split for zero-shot learning and 312-dim attribute vectors at the class level.
5.2 Implementation Details
The FNets are initialized using two different CNN models pre-trained on ImageNet, i.e., GoogLeNet and VGG19, to learn the image representations. For AwA, only one zoom operation is performed and the LDF model contains two scales, as the objects in AwA images are usually large and centered.
Training strategy: We first adopt the strategy used in prior work to initialize the ZNet. Then the other components of the LDF model are learned. The detailed process is as follows:
Step 1: The FNet in each scale is initialized with the same GoogLeNet (or VGG19) pre-trained on ImageNet. Notice that in the subsequent steps of training, the parameters in each scale are not shared.
Step 2: In each scale, the initialized FNet is used to search for a discriminative square, which is then used to pre-train the ZNet. The side of the searched square is assumed to be half the size of the original image. We then slide this square over the last convolutional layer of the FNet and select the region with the highest activations. Finally, the coordinates of the searched region are used to train the ZNet with an L2 loss.
Step 3: We keep the parameters of the ZNet fixed and train both the FNet and the ENet.
Step 4: Finally, the parameters of the whole LDF model are fine-tuned in an end-to-end approach.
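The region search in Step 2 can be illustrated with a short NumPy sketch that slides a square window over an activation map and returns the window with the largest summed activation as a normalized center and side length. Collapsing the convolutional output to a single 2-D map (for example by summing over channels), the normalization to the range 0 to 1, and the function name `best_square` are assumptions for this example; the paper does not spell out these details here.

```python
import numpy as np


def best_square(activation_map, side):
    """Find the side x side window with the highest total activation.

    activation_map: (H, W) non-negative map (assumed to be the last
    convolutional output aggregated over channels).
    Returns (zx, zy, zs): window center and side length, normalized by the
    map width/height so that they can serve as regression targets.
    """
    h, w = activation_map.shape
    # Summed-area table with a zero row/column prepended.
    integral = np.pad(activation_map, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    # sums[r, c] is the total over activation_map[r:r+side, c:c+side].
    sums = (integral[side:, side:] - integral[:-side, side:]
            - integral[side:, :-side] + integral[:-side, :-side])
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    zx = (c + side / 2.0) / w
    zy = (r + side / 2.0) / h
    zs = side / float(w)
    return zx, zy, zs


if __name__ == "__main__":
    amap = np.zeros((8, 8))
    amap[4:8, 2:6] = 1.0                     # bright blob in the lower middle
    print(best_square(amap, side=4))         # half-size window, as in Step 2
```

The returned triple could then be regressed by the zoom network with an L2 loss before the later joint training steps.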
To verify the effectiveness of the different components in our LDF model, four baselines are designed to compare with the proposed LDF model.
SS-BE-Fixed (Single Scale & Baseline Embedding Model & Fixed Image Representations). In this baseline, the ZNet is removed, and only the full-size images are utilized to extract image features. Moreover, the FNet is fixed during the training. For semantic representations, only the user-defined attributes are considered (Section 4.3.1).
SS-BE-Learned (Single Scale & Baseline Embedding Model & Learned Image Representations). Compared with the SS-BE-Fixed baseline, the only difference is that the FNet can be learned in this baseline.
SS-AE-Learned (Single Scale & Augmented Embedding Model & Learned Image Representations). Compared with the SS-BE-Learned baseline, this baseline aims to build the augmented embedding space (Section 4.3.2) with considering both UA and LA.
MS-BE-Learned (Multi Scale & Baseline Embedding Model & Learned Image Representations). Compared with the SS-BE-Learned baseline, the only difference is the ZNet is added into this model (Section 4.2).
Table 1. ZSL results (MCA, %) on AwA and CUB. Each cell gives the VGG19 result followed by the GoogLeNet result in parentheses; "-" denotes a result that is not reported.
| Method | AwA | CUB |
|---|---|---|
| DAP | 57.2 (60.5) | 44.5 (39.1) |
| ESZSL | 75.3 (59.6) | - (44.0) |
| SJE | - (66.7) | - (50.1) |
| LatEM | - (71.9) | - (45.5) |
| SynC | - (72.9) | - (54.5) |
| JLSE | 80.46 (-) | 42.11 (-) |
| MFMR | 79.8 (76.6) | 47.7 (46.2) |
| Low-Rank | 82.8 (76.6) | 45.2 (56.2) |
| SCoRe | 82.8 (78.3) | 59.5 (58.4) |
| LAD | 82.48 (-) | 56.63 (-) |
| JSLA | 82.9 (-) | 57.1 (-) |
| SS-BE-Fixed (Ours) | 75.20 (73.70) | 50.51 (50.31) |
| SS-BE-Learned (Ours) | 79.35 (75.19) | 59.32 (58.26) |
| SS-AE-Learned (Ours) | 81.36 (77.77) | 65.99 (66.96) |
| MS-BE-Learned (Ours) | 81.80 (78.31) | 64.85 (64.39) |
| LDF (Ours) | 83.40 (79.13) | 67.12 (70.37) |
5.4 Experimental Results
The multi-way classification accuracy (MCA) is used for evaluating the ZSL models. The comparison results using two different CNN models are shown in Table 1.
Effect of feature learning. From Table 1, we first notice that, without any specially designed regularization terms, the SS-BE-Learned baseline already achieves performance comparable with the state of the art and surpasses the SS-BE-Fixed baseline (e.g., 79.35% vs. 75.20% on AwA and 59.32% vs. 50.51% on CUB). Most existing ZSL methods use fixed image features and only focus on learning the visual-semantic mapping with various human-designed regularization terms. We show that feature learning, which is neglected in the image feature extraction process, is also important to ZSL and deserves more attention. By simply fine-tuning the FNet in an end-to-end framework, SS-BE-Learned can associate the image features with the semantic information of the attributes for different ZSL tasks and obtain better performance.
Effect of ZNet. The MS-BE-Learned baseline uses the ZNet to automatically discover discriminative regions in full-size images and leverages the coarse-to-fine representations to obtain better performance. We can see that the MS-BE-Learned baseline outperforms both the SS-BE-Learned baseline and most of the state-of-the-art methods (Table 1, 81.80% on AwA, 64.85% on CUB).
We further analyze the performance of each scale in MS-BE-Learned model, and show the results in Table 2. It can be seen that, the performance of the first scale, \ie, MS-BE-Learned (Scale 1), is comparable with the single scale baseline, SS-BE-Learned. With more discriminative image features utilized, the performance of the second and the third scale improves continuously.
Effect of the latent attribute modelling. The SS-AE-Learned baseline aims to build an augmented embedding space. It is more reasonable to associate image features with both user-defined and latent attributes in our augmented space. It can be observed from Table 1 that the SS-AE-Learned model outperforms the SS-BE-Learned baseline on both the AwA (81.36%) and CUB (66.96%) datasets.
Table 2. MS-BE-Learned (Scale X) denotes the ZSL results using the image features of scale X only.
| Method | AwA | CUB |
|---|---|---|
| SS-BE-Learned | 79.35 (75.19) | 59.32 (58.26) |
| MS-BE-Learned (Scale 1) | 79.20 (75.68) | 59.88 (58.87) |
| MS-BE-Learned (Scale 2) | 79.87 (77.02) | 61.04 (61.81) |
| MS-BE-Learned (Scale 3) | - (-) | 62.04 (62.72) |
| MS-BE-Learned (All Scale) | 81.80 (78.31) | 64.85 (64.39) |
Table 3. SS-AE-Learned (UA/LA) denotes the results predicted with the UA features only or the LA features only.
| Method | AwA | CUB |
|---|---|---|
| SS-BE-Learned | 79.35 (75.19) | 59.32 (58.26) |
| SS-AE-Learned (UA) | 80.97 (77.24) | 62.17 (59.40) |
| SS-AE-Learned (LA) | 78.76 (75.75) | 63.08 (66.11) |
| SS-AE-Learned (UA & LA) | 81.36 (77.77) | 65.99 (66.96) |
We believe that, in the augmented attribute space, the learning of LA helps the learning of UA. Further experiments are conducted to verify this, and the results are shown in Table 3. For the SS-AE-Learned baseline, we use only the obtained UA representation to perform ZSL prediction as in (13), denoted as SS-AE-Learned (UA). We can see that, when using UA features only, the performance of SS-AE-Learned (UA) is higher than that of SS-BE-Learned (e.g., 80.97% vs. 79.35%). This indicates that better UA representations are obtained in the augmented attribute space.
Comparisons with state-of-the-art methods. Compared with previous methods in Table 1, the LDF model improves the state-of-the-art performance on both datasets. In general, the proposed model based on VGG19 performs better on AwA, while the GoogLeNet-based model shows superiority on CUB. On AwA, our LDF achieves 83.40%, which is slightly higher than JSLA (82.81%). On the more challenging CUB dataset, where 50 bird species need to be classified, our model obtains a more obvious improvement: the LDF model reaches 70.37%, a substantial gain over the state-of-the-art SCoRe (from 58.4% to 70.37%).
Furthermore, the latent discriminative region mining (the ZNet) and the latent discriminative attribute modelling (the ENet) are jointly learned in the proposed LDF model. We believe the two components can assist each other in the joint learning framework. To verify this assumption, a further analysis of the LDF model is performed, and the results are shown in Table 4. It can be seen that, when using only the combined UA features to perform ZSL prediction, i.e., LDF (UA), the performance of LDF is higher than that of the MS-BE-Learned baseline. When using only the combined LA features, the performance of LDF (LA) also exceeds that of SS-AE-Learned (LA). This confirms the advantage of the joint learning approach.
Table 4. LDF (LA/UA) denotes the ZSL results predicted with the combined LA features only or the combined UA features only.
| Method | AwA | CUB |
|---|---|---|
| SS-AE-Learned (LA) | 78.76 (75.75) | 63.08 (66.11) |
| LDF (LA) | 79.35 (76.84) | 66.47 (69.94) |
| MS-BE-Learned (UA) | 81.80 (78.31) | 64.85 (64.39) |
| LDF (UA) | 82.47 (78.77) | 65.94 (65.78) |
| LDF (LA & UA) | 83.40 (79.13) | 67.12 (70.37) |
Discriminativeness of LA. The LA features are learned to be discriminative by exploiting the category information as in (10), and we believe the learned LA space is more discriminative than the UA space. To illustrate this, we show some examples on AwA in Figure 6. The test images are projected to their UA features and LA features with (11). Then for a UA element or a LA element, the images which have largest and smallest activations of the component are shown. It can be observed that, for LA features, the images with large activations belong to one same category and the images with small activations are of the other category. In contrast, the user-defined attributes are usually shared in multiple categories. It confirms the apparent discriminative property of the learned latent attributes.
Additionally, to quantitatively compare the learned LA space with the UA space, we calculate the cosine similarities between unseen classes using both the LA and UA prototypes, and the results are shown in Figure 3. The LA prototypes are obtained by directly averaging the LA features of the images of each unseen class, and the UA prototypes are the class-level attribute annotations. It can be seen that, compared with the UA prototypes, the cosine similarities between different LA prototypes are clearly smaller for most categories, except for the pig and the hippopotamus. Unlike attributes annotated by experts, our LA prototypes are learned from the images only. Thus, categories with similar appearances, e.g., pig vs. hippopotamus, get closer in the LA space.
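The prototype comparison described above amounts to a pairwise cosine-similarity matrix over class prototypes. The following NumPy sketch shows one way to compute it; the function name and the toy prototypes are illustrative assumptions and are not values from the paper.

```python
import numpy as np


def cosine_similarity_matrix(prototypes):
    """Pairwise cosine similarities between the rows of `prototypes`.

    prototypes: (C, k) array with one prototype per class.
    Returns a (C, C) symmetric matrix with ones on the diagonal.
    """
    norms = np.linalg.norm(prototypes, axis=1, keepdims=True)
    unit = prototypes / np.clip(norms, 1e-12, None)  # guard against zero rows
    return unit @ unit.T


if __name__ == "__main__":
    toy = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    print(np.round(cosine_similarity_matrix(toy), 3))
```

Smaller off-diagonal values mean that the class prototypes are better separated in that space, which is the sense in which the two spaces are compared.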
It is noted that when we perform ZSL prediction with LA features, an LA representation (prototype) of each test category is needed but absent from the dataset. Thus, the LA prototypes of unseen classes have to be computed with (15), leveraging the relationship between the unseen class and the seen classes. However, this relationship is computed in the UA space and cannot exactly reflect the true relationship between LA prototypes. This bias degrades the ZSL performance when LA prototypes are used for prediction with (16). It also explains why, in Table 3, the performance of SS-AE-Learned (LA) is lower than that of SS-AE-Learned (UA) on AwA, although the learned LA space is actually more discriminative than the UA space.
Visualizations of discriminative regions. In Figure 5, we show the regions discovered by the LDF model. The left three columns show examples selected from AwA. We can see that, for images with a single instance, the LDF model progressively searches for finer regions until it finds the main object; for images with multiple instances, the model tends to find a large square including the multiple objects. Another interesting observation on AwA is that, for some specific categories, e.g., whale, the identified regions include obviously more background elements than others. The reason is that the searched regions of the humpback whale are required to match their user-defined attributes, some of which, such as swims, water and ocean, are highly related to the background water in the images.
The examples in the right three columns are sampled from CUB. We are aware that the CUB dataset provides bounding box annotations; however, our model can automatically discover object-centric regions without such annotations, which is another advantage of our framework. It is noted that the network in prior work on region discovery performs fine-grained object recognition, a different task from ours, and it can discover some object parts. In contrast, in our ZSL model, the searched regions should be associated with the user-defined attributes, which, for example, correspond to the whole body of the birds from bills to tails. Thus, it is expected that the model will focus on regions containing the whole object rather than its parts, and our analysis confirms this.
In this paper, an end-to-end model is proposed to learn the latent discriminative features for ZSL in both visual and semantic space. For visual space, we introduce the zoom net to automatically search for discriminative regions. For semantic space, we propose an augmented attribute space with both the user-defined attributes and the latent attributes. The latent attributes are learned to be discriminative with category information. Finally, the two components could assist each other in the end-to-end joint learning framework.
This work is funded by the National Key Research and Development Program of China (Grant 2016YFB1001004 and Grant 2016YFB1001005), the National Natural Science Foundation of China (Grant 61673375, Grant 61721004 and Grant 61403383) and the Projects of Chinese Academy of Sciences (Grant QYZDB-SSW-JSC006 and Grant 173211KYSB20160008).
Appendix A How to Identify the Discriminative Region from an Image?
To locate the discriminative region of an image in zero-shot learning (ZSL), two weakly supervised learning approaches can be considered: 1) directly regressing the location of the region (e.g., the zoom scheme proposed in our LDF model); 2) extracting multiple region proposals (e.g., with EdgeBox [34]) for the image and then selecting the most discriminative one. In this paper, we did not adopt the region proposal approach for the following reasons. First, the goal of the region proposal algorithm [34] is to identify “objects”. However, as shown in Figure 5 and discussed in Section 4.2, in ZSL the identified region may contain context elements needed to match its user-defined attributes. Such a region is not exactly the “object” region and is hard for EdgeBox to capture. Second, processing multiple proposals (typically 2,000) for each image is quite inefficient, and selecting the proper region from 2,000 candidates is also difficult in weakly supervised settings. To verify this, we conducted an experiment that tests the region proposal approach for ZSL.
Specifically, we first extract 2,000 EdgeBox proposals for each image. Then we replace the pool5 layer in SS-BE-baseline (VGG19) with the RoI pooling layer proposed in Fast R-CNN [11]. The images and their region proposals are fed into the model, which outputs a compatibility score for each region. Following the standard multiple instance learning (MIL) setting, the region with the highest compatibility score is selected to compute the loss function as in (6). The network obtains 72.67% on the AwA dataset. This result is even lower than that of SS-BE-Learned (Table 1, 78.35%), which directly extracts image features from full-size images. Moreover, the runtime is 78 times longer than that of our zoom scheme.
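The MIL selection step described above can be illustrated with a minimal sketch. It assumes that each region proposal has already been embedded into the attribute space and that compatibility is a plain dot product with the ground-truth class attribute vector; the function names `compatibility` and `select_best_region` are hypothetical, and the loss in (6) is not reproduced here.

```python
from typing import List, Sequence, Tuple


def compatibility(region_embedding: Sequence[float],
                  class_attributes: Sequence[float]) -> float:
    """Dot-product compatibility (an assumed scoring function)."""
    return sum(r * a for r, a in zip(region_embedding, class_attributes))


def select_best_region(region_embeddings: List[Sequence[float]],
                       class_attributes: Sequence[float]) -> Tuple[int, float]:
    """Return the index and score of the highest-scoring region.

    In the MIL setting, only this single region would contribute to the loss.
    """
    scores = [compatibility(r, class_attributes) for r in region_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Toy embeddings for three proposals (illustrative values only).
    regions = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.1]]
    attrs = [1.0, 1.0]
    print(select_best_region(regions, attrs))
```

With 2,000 proposals per image, every proposal must be scored before the maximum can be taken, which is consistent with the much longer runtime reported for this approach.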
Appendix B The Bilinear Interpolation Operation
In Section 4.2, to obtain a better representation of the finer localized cropped region, bilinear interpolation is used to adaptively zoom the cropped region to the same size as the original image. Concretely, the value of each point in the zoomed region is computed as a linear combination of the values of its four nearest points in the cropped region. The position of a zoomed point is divided by the upsampling factor, i.e., the ratio of the zoomed size to the cropped size; the integral part of the result locates the four neighbors, and the fractional part determines their weights.
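As an illustration of the bilinear zoom described in this appendix, the following sketch upsamples a square crop in plain Python. The function name `bilinear_zoom`, the list-of-lists image format, and the clamping of neighbors at the border are assumptions made for this example and are not details taken from the model.

```python
import math
from typing import List


def bilinear_zoom(crop: List[List[float]], out_size: int) -> List[List[float]]:
    """Zoom a square crop to out_size x out_size by bilinear interpolation."""
    in_size = len(crop)
    lam = out_size / in_size  # upsampling factor
    zoomed = []
    for i in range(out_size):
        y = i / lam
        m = int(math.floor(y))          # integral part (row of top neighbors)
        fy = y - m                      # fractional part (vertical weight)
        m1 = min(m + 1, in_size - 1)    # clamp at the bottom border
        row = []
        for j in range(out_size):
            x = j / lam
            n = int(math.floor(x))      # integral part (column of left neighbors)
            fx = x - n                  # fractional part (horizontal weight)
            n1 = min(n + 1, in_size - 1)  # clamp at the right border
            value = ((1 - fy) * (1 - fx) * crop[m][n]
                     + (1 - fy) * fx * crop[m][n1]
                     + fy * (1 - fx) * crop[m1][n]
                     + fy * fx * crop[m1][n1])
            row.append(value)
        zoomed.append(row)
    return zoomed


if __name__ == "__main__":
    for row in bilinear_zoom([[0.0, 2.0], [4.0, 6.0]], 4):
        print(row)
```

Each output value is a weighted sum of four neighbors whose weights sum to one, so the zoomed region stays within the value range of the crop.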
Appendix C Experiments with Three Scales on AwA
As mentioned in Section 5.2, for the AwA dataset only one zoom operation is performed and the two-scale model is adopted. We attribute this to the fact that the objects in AwA images are usually large and centered. To verify this, in this section we analyze the performance of the three-scale MS-BE-Learned baseline on AwA. The experiment is conducted with GoogLeNet, and all experimental settings are the same as described in Section 5.2. The performance of each single scale is shown in Table 5.
Additionally, we examine the parameter in (2) that represents the length of the cropped region. For scale 1 and scale 2, we record its value for all unseen images and report the mean in Table 5. When the three-scale model is adopted on AwA, the performance of the second scale is higher than that of the first scale (77.12% vs. 75.47%). However, the third scale brings no further improvement (77.05% vs. 77.12%). Inspecting the mean value in the second scale, we find that the relative size of the cropped region is nearly 1 (0.98); that is, the zoom net in the second scale performs almost no cropping and passes its input almost unchanged to the third scale. As claimed above, the objects in AwA images are large and centered. After a single zoom operation, the network already captures the main object, so the third scale contributes nothing to the model.
|Model|ZSL performance on AwA (%)|Mean value of the crop-length parameter in (2)|
|---|---|---|
|MS-BE-Learned (Scale 1)|75.47|0.87|
|MS-BE-Learned (Scale 2)|77.12|0.98|
|MS-BE-Learned (Scale 3)|77.05|-|
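As a rough worked example of why a mean value of 0.98 implies almost no cropping, the sketch below converts the two reported mean values into a crop side length and a retained-area fraction. It assumes that the parameter is the side length of the square crop expressed as a fraction of the image side, and that the input is 224 pixels wide; both are illustrative assumptions rather than details taken from the paper.

```python
def crop_side(image_side: int, length_ratio: float) -> float:
    """Side length in pixels of a square crop covering length_ratio of the side."""
    return image_side * length_ratio


def area_fraction(length_ratio: float) -> float:
    """Fraction of the image area kept by a square crop with this side ratio."""
    return length_ratio ** 2


if __name__ == "__main__":
    IMAGE_SIDE = 224  # assumed input resolution
    for scale, ratio in [("scale 1", 0.87), ("scale 2", 0.98)]:
        side = crop_side(IMAGE_SIDE, ratio)
        kept = area_fraction(ratio)
        print(f"{scale}: crop side {side:.1f} px, area kept {kept:.1%}")
```

Under these assumptions, the second zoom keeps roughly 96% of the area of its input, whereas the first zoom keeps about 76%, which matches the observation that only the first zoom changes the view substantially.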
[Table 6: ZSL performance on AwA versus the dimension of LA]
Appendix D The Effect of the Dimension of Latent Attribute
As mentioned in Section 4.3.2, the dimension of the latent attributes (LA) is set to be the same as that of the user-defined attributes (UA). In this section, we explore the effect of the latent attributes' dimension through experiments on the AwA dataset with GoogLeNet. Specifically, we train the SS-AE-Learned baseline with different dimensions of LA and perform ZSL prediction with the latent attributes only. The results are shown in Table 6. The ZSL performance improves as the dimension of LA grows, but the improvement is slight, and the performance is in general robust to the dimension of LA.
Appendix E The Discriminativeness of the Learned Latent Attributes
In this section, we show more visualized examples to illustrate the discriminative property of the latent attributes. For a given latent attribute element, the images with the largest and smallest activations on this element are shown in Figure 6. For comparison, the examples selected with the learned UA features are shown in Figure 7. Figure 7 shows that the user-defined attributes are shared by many objects. Another observation is that the predictions of user-defined attributes are affected by mid-level cues, e.g., colors. For example, for the UDA5 element, the chimpanzee, whale and pig objects are falsely predicted as orange because of the orange backgrounds. For the UDA64 element, the Persian cat and pig images are falsely predicted as arctic, possibly because both animals have white appearances.
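The selection of images with the largest and smallest activations on one attribute element can be sketched as follows. The function name `extreme_activations` and the toy feature matrix are hypothetical; the sketch assumes that each image is represented by a list of attribute activations.

```python
from typing import List, Sequence, Tuple


def extreme_activations(features: Sequence[Sequence[float]],
                        element: int,
                        k: int) -> Tuple[List[int], List[int]]:
    """Return (indices of the k largest, indices of the k smallest) activations.

    features[i][element] is the activation of image i on the chosen element.
    """
    order = sorted(range(len(features)), key=lambda idx: features[idx][element])
    smallest = order[:k]
    largest = order[::-1][:k]
    return largest, smallest


if __name__ == "__main__":
    # Four images, two attribute elements (illustrative values only).
    toy_features = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5], [0.9, 0.1]]
    print(extreme_activations(toy_features, element=0, k=2))
```

If an element is discriminative, the images returned at the two extremes would be expected to come from visibly different categories.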
Appendix F Generalized Zero-Shot Learning Results
In conventional zero-shot learning (cZSL), ZSL methods are trained on seen classes and evaluated on unseen ones. The basic assumption in cZSL is that test instances always come from the unseen classes, which is unrealistic in real-world applications. Motivated by this, recent ZSL works [4, 30] measure zero-shot performance in the generalized zero-shot learning (gZSL) setting, in which test images are assumed to come from all target classes, including both seen and unseen categories.
Similar to [4], 20% of the images from seen classes are held out and merged with the images from unseen classes to form the new test set. We evaluate the proposed LDF model over the joint label space of seen and unseen classes using two accuracies: the accuracy of classifying test images from unseen classes into the joint label space, and the accuracy of classifying test images from seen classes into the joint label space. Moreover, similar to [30], the harmonic mean of the two is computed to assess ZSL methods with respect to both the seen-class accuracy and the unseen-class accuracy. Formally, the harmonic mean is twice the product of the two accuracies divided by their sum.
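The gZSL metrics described above can be sketched in a few lines. The sketch assumes plain per-image accuracy (the evaluation protocol may instead average per class), and the function names and toy labels are hypothetical and serve only to illustrate the computation.

```python
from typing import Sequence, Set


def subset_accuracy(y_true: Sequence[str], y_pred: Sequence[str],
                    classes: Set[str]) -> float:
    """Accuracy over test images whose true label belongs to `classes`.

    Predictions are made over the joint label space of seen and unseen classes.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in classes]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


def harmonic_mean(acc_unseen: float, acc_seen: float) -> float:
    """Twice the product of the two accuracies divided by their sum."""
    total = acc_unseen + acc_seen
    return 0.0 if total == 0 else 2 * acc_unseen * acc_seen / total


if __name__ == "__main__":
    seen, unseen = {"cat", "dog"}, {"zebra"}
    y_true = ["cat", "cat", "dog", "zebra", "zebra"]   # toy labels
    y_pred = ["cat", "dog", "dog", "zebra", "cat"]     # toy predictions
    acc_s = subset_accuracy(y_true, y_pred, seen)
    acc_u = subset_accuracy(y_true, y_pred, unseen)
    print(acc_u, acc_s, harmonic_mean(acc_u, acc_s))
```

The harmonic mean is dominated by the smaller of the two accuracies, so a method that performs well only on seen classes still receives a low score.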
The experiments are performed on both the AwA and CUB datasets. The GoogLeNet model is used, and the results are shown in Table 7. On both datasets, the proposed LDF model significantly outperforms previous methods on all three metrics, which confirms the advantage of our method under the gZSL setting.
Footnote: In the supplementary materials (Appendix C), we show that if three scales are used on AwA, the third scale is actually useless for object recognition.
- Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(7):1425–1438, 2016.
- Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, pages 2927–2936, 2015.
- S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336, 2016.
- W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pages 52–68, 2016.
- Z. Ding, M. Shao, and Y. Fu. Low-rank embedded ensemble semantic dictionary for zero-shot learning. In CVPR, pages 2050–2058, 2017.
- A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785, 2009.
- A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
- J. Fu, H. Zheng, and T. Mei. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.
- Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, pages 584–599, 2014.
- Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(11):2332–2345, 2015.
- R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
- C. Huang, C. C. Loy, and X. Tang. Local similarity-aware deep feature embedding. In NIPS, pages 1262–1270, 2016.
- H. Jiang, R. Wang, S. Shan, Y. Yang, and X. Chen. Learning discriminative latent attributes for zero-shot classification. In ICCV, pages 4223–4232, 2017.
- N. Karessli, Z. Akata, A. Bulling, and B. Schiele. Gaze embeddings for zero-shot image classification. In CVPR, pages 4525–4534, 2017.
- C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958, 2009.
- C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(3):453–465, 2014.
- Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, pages 3279–3287, 2017.
- P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, pages 6060–6069, 2017.
- M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
- P. Peng, Y. Tian, T. Xiang, Y. Wang, and T. Huang. Joint learning of semantic and latent attributes. In ECCV, pages 336–353, 2016.
- P. Peng, Y. Tian, T. Xiang, Y. Wang, M. Pontil, and T. Huang. Joint semantic and latent attribute modelling for cross-class transfer learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
- S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.
- B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
- C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
- K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research (JMLR), 10(Feb):207–244, 2009.
- Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, pages 69–77, 2016.
- Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.
- X. Xu, F. Shen, Y. Yang, D. Zhang, H. T. Shen, and J. Song. Matrix tri-factorization with manifold regularizations for zero-shot learning. In CVPR, pages 3798–3807, 2017.
- J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision (IJCV), 73(2):213–238, 2007.
- Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, pages 6034–6042, 2016.
- C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.