Task-Aware Deep Sampling for Feature Generation


Xin Wang, Fisher Yu, Trevor Darrell, Joseph E. Gonzalez
UC Berkeley
{xinw, fy, trevor, jegonzal}@eecs.berkeley.edu
Abstract

The human ability to imagine the variety of appearances of novel objects based on past experience is crucial for quickly learning novel visual concepts based on few examples. Endowing machines with a similar ability to generate feature distributions for new visual concepts is key to sample-efficient model generalization. In this work, we propose a novel generator architecture suitable for feature generation in the zero-shot setting. We introduce task-aware deep sampling (TDS) which injects task-aware noise layer-by-layer in the generator, in contrast to existing shallow sampling (SS) schemes where random noise is only sampled once at the input layer of the generator. We propose a sample-efficient learning model composed of a TDS generator, a discriminator and a classifier (e.g., a soft-max classifier). We find that our model achieves state-of-the-art results on the compositional zero-shot learning benchmarks as well as improving upon the established benchmarks in conventional zero-shot learning with a faster convergence rate.

 


1 Introduction

Machine vision systems have achieved great success in a range of applications [11, 15, 16] over the past years. However, they often rely on large-scale training data which is costly and sometimes even impossible to annotate. Furthermore, even when there is substantial data, skew in the labels can result in limited data for some classes [6, 20]. Enabling machine vision systems to synthesize features for novel visual concepts (e.g., categories, attribute-object pairs) alleviates issues of data scarcity and imbalance and can help classifiers generalize in low-shot regimes [14, 37, 39, 7, 9].

Generative adversarial networks (GANs) [12] have been used to generate image features of novel concepts and/or to augment imbalanced training data [37, 39, 7]. These generators have the potential to generate useful image features and improve model generalization given a few exemplar images (few-shot), or using only task descriptions (additional information describing the image categories, e.g., attributes or attribute-object pairs; each category is associated with one task description) with no images (zero-shot). However, existing work focuses on loss designs [7, 9] and adopts generators which sample the random noise shallowly, only once at the input layer, conditioned on the task (Figure 1 left). To capture the target data distribution, the generator, which is usually composed of multiple fully connected (FC) or convolutional layers, is required to transform the initial random distribution to the target data distribution. Given that the target data distribution is often multi-modal and high dimensional, the generators usually need complex designs (e.g., self-attention [44]) and are difficult to train to convergence [4, 13].

In this work, we rethink the data sampling procedure of conditional feature generators and propose a novel task-aware deep sampling (TDS) generator design (Figure 1 right). TDS is built on two main innovations: deep sampling and multi-step task conditioning. We sample a new set of random noise conditioned on the input task description (i.e., task-aware noise) at each level and inject the task-aware noise into the generator layer-by-layer, enabling deep sampling with multi-step task conditioning, in contrast to shallow sampling where random noise is only sampled once at the input layer (Figure 1 left); this allows for local data sampling at each level and leads to faster convergence, as we show in Section 4. This iterative deep sampling procedure is motivated by the classic Markov Chain Monte Carlo (MCMC) sampling [3], which is widely used to create samples from multi-modal, high dimensional data distributions. In addition, task-aware noise helps the generator create samples from a task-adaptive distribution rather than from a distribution unconditioned on the task (i.e., unconditioned deep sampling (UDS) in Figure 1 middle); this provides direct injection of task information at each level, which we find improves the model performance in practice.

We develop a zero-shot learning model composed of our TDS generator, a discriminator and a classifier (e.g., a simple soft-max classifier or a meta-learning approach) as illustrated in Figure 2. Our key innovation is the TDS generator design; we keep the same discriminator designs as the existing work [39, 31, 41]. Furthermore, the proposed TDS generator is general, as it can be easily incorporated into other zero-shot feature generation networks; e.g., we show performance improvements using TDS with the CLSWGAN [39] model.

To evaluate our method, we conduct extensive experimental studies on the compositional zero-shot classification task of object-attribute categories on three benchmark datasets (MITStates [17], UT-Zappos [42] and StanfordVRD [21]) and on conventional zero-shot learning on the Caltech-UCSD Birds-200-2011 (CUB) [35] and Oxford Flowers (FLO) [27] datasets. Our results show that TDS achieves state-of-the-art performance on compositional zero-shot learning and improves the performance of existing methods on established zero-shot benchmarks. In addition, we compare the TDS generator with several alternative generator designs to analyze the effects of deep sampling and multi-step task conditioning.


Figure 1: Different sampling schemes for the generators. Left: shallow sampling (SS). The random noise is sampled only once at the input layer of the generator with single-step task conditioning. Middle: unconditioned deep sampling (UDS). Unconditioned (task-independent) noise is injected into each layer of the generator; this provides deep sampling with single-step task conditioning. The blue box denotes an MLP sub-network that transforms the random noise. Right: task-aware deep sampling (TDS). Noise conditioned on the input task description (i.e., task-aware noise) is injected layer-by-layer into the generator; this allows for both deep sampling and multi-step task conditioning. Here, an additional MLP sub-network transforms the task description.

2 Related work

Our key innovation is the generator design, which is related to conditional GANs [23, 31, 41], an important family of GANs. Mirza and Osindero [23] first proposed conditional generative adversarial nets to generate MNIST digits conditioned on class labels. Recent work [31, 45, 39, 41] adopts text and attributes as the task information for image or feature generation. The generators in these works adopt shallow sampling, where random noise is sampled once at the input layer of the generator. In contrast, we adopt task-aware deep sampling, where task-aware noise is injected layer-by-layer into the generator. In the concurrent work StyleGAN [18], Karras et al. propose a style-based generator architecture that uses a constant input and injects layer-wise unconditioned noise, similar to UDS (Figure 1 middle), showing improvements in image generation.

Our zero-shot learning model is related to data hallucination [37, 14, 9] and feature generation [39, 5, 7] work in zero-shot learning (ZSL) and few-shot learning (FSL). GAN-based feature generation networks have been adopted to generate synthetic features for novel concepts and/or to augment imbalanced training data, which improves model generalization. Our model is similar to existing work that is also composed of a generator, a discriminator and a classifier [39, 7, 9]; we focus on the generator design, whereas the existing work emphasizes loss function designs.

We mainly evaluate our model on compositional zero-shot learning (CZSL) of attribute-object categories, a special case of zero-shot learning recently proposed by Misra et al. [24]. In CZSL, each visual concept (category) is represented by an attribute-object pair (e.g., red elephant, modern city, etc.) or a subject-predicate-object (SPO) triplet (e.g., person ride horse, dog on grass, etc.; SPO triplets are used in the StanfordVRD dataset). Similar to ZSL, images of only a subset of compositions are labeled during training, and the model learns to recognize unseen object-attribute compositions during inference. The compositionality and contextual nature of the task make it an interesting testbed for compositional reasoning and sample-efficient learning. Prior approaches either focus on task-aware embedding learning of the image features [36] or metric learning between the object-attribute pairs and the image features [24, 25, 26]. In contrast, we use TDS to generate synthetic features of the unseen object-attribute pairs and train a classifier on the synthetic data, which is complementary to the existing approaches.

3 Feature generation via task-aware deep sampling

Learning good feature embeddings for images often requires a substantial amount of training data and annotations; however, large-scale datasets are often expensive or even impossible to collect and annotate in practice. In this work, we examine designs of feature generators that synthesize features conditioned on task descriptions. The generators can then be used to train classifiers in the limited-data setting. We introduce a novel task-aware deep sampling (TDS) approach to construct the feature generator, with two key designs: deep sampling and multi-step task conditioning. We describe the TDS design in more detail in Section 3.1. The TDS feature generator, together with a discriminator and a classifier, composes our zero-shot learning model (Figure 2). In Section 3.2, we introduce the overall model structure and the training objectives.

3.1 Task-aware deep sampling

We first describe the setup. We are given a dataset $\mathcal{D} = \{(x_i, y_i, t_{y_i})\}$, where $x_i$ are image features extracted from pre-trained CNN models (e.g., ResNet [16]), $y_i$ are the class labels, and $t_{y_i}$ is the task description of the category $y_i$. In the zero-shot learning setting, two disjoint sets of class labels (i.e., seen classes and unseen classes) together with the associated task descriptions are available for training and testing, but only the image features of the seen classes are available for training. The feature generator $G$ generates image features by sampling random noise $z$ conditioned on the task description, that is, $\tilde{x}_i = G(z, t_{y_i})$. We omit the subscript $i$ below for simplicity.

In the generator design, there are two main factors: noise sampling and task conditioning. Many prior works [39, 31, 41] use both the random noise $z$ and the task description $t$ as the inputs to the generator, i.e., $\tilde{x} = G(z, t)$, as illustrated in Figure 1 (left). We refer to this procedure as shallow sampling with single-step task conditioning, because the noise is sampled unconditionally only once at the input layer of the generator, and the task condition is injected by concatenation in one single step. In contrast, we propose task-aware deep sampling (TDS) with deep sampling and multi-step task conditioning.
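As a point of reference, a shallow-sampling conditional generator can be sketched as follows in PyTorch. This is a minimal illustration of the sampling scheme only; the layer sizes, activations and class name are our own assumptions, not the exact architectures of [39, 31, 41].

```python
import torch
import torch.nn as nn

class ShallowSamplingGenerator(nn.Module):
    """Shallow sampling (SS): noise is drawn once and concatenated with
    the task description at the input layer only."""
    def __init__(self, task_dim, noise_dim, feat_dim, hidden_dim=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(task_dim + noise_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, feat_dim),
            nn.ReLU(),  # CNN features after ReLU are non-negative
        )

    def forward(self, task_desc):
        # Single-step conditioning: one noise sample, one concatenation.
        z = torch.randn(task_desc.size(0), self.noise_dim, device=task_desc.device)
        return self.net(torch.cat([task_desc, z], dim=1))
```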

Deep sampling. Our first innovation is to replace shallow sampling with deep sampling, where multiple sets of random noise are sampled and injected into the generator layer-by-layer. Instead of sampling the random noise at the input layer, we use only the task description as the input to the generator, and at each layer an additional set of random noise is sampled and injected, as illustrated in Figure 1 (middle). Intuitively, the mapping from the space of task descriptions to the image feature space works as a “short-cut”, and the noise injected at each layer progressively creates samples from the local distribution at each level along this “short-cut”. One can also draw an analogy to the classic Markov Chain Monte Carlo (MCMC) method [3], which adopts an iterative sampling procedure and is widely used for sampling multi-modal and high dimensional distributions.

Multi-step task conditioning. Another key aspect of our proposed generator structure is multi-step task conditioning: we inject the task information multiple times, not just once. In unconditioned deep sampling (UDS), shown in Figure 1 (middle), the sampled noise $z$, transformed by a multilayer network to adjust its scale and variance, is added to each feature layer of the generator. The injected noise is independent of the task description (referred to as unconditioned noise). As the distribution transformation proceeds through the levels of the generator, the initial task information becomes diluted and the generator may create samples that are irrelevant to the given task. Moreover, the variance of the feature distributions may differ across tasks. Therefore, we propose multi-step task conditioning by injecting task-aware noise at each level of the generator, namely task-aware deep sampling (TDS), as illustrated in Figure 1 (right). To construct the task-aware noise, we first transform the task description with a multilayer subnetwork for normalization and rescaling, and then multiply the transformed task description with the unconditioned noise $z$. In the experiments, we find that the task-aware noise is clustered by task while the unconditioned noise follows the same distribution across different tasks.
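To make the two designs concrete, the following PyTorch sketch shows one way to implement a TDS generator under our reading of the description above: the task description is the only input, and at each layer a fresh unconditioned noise sample is modulated by a transformed task embedding to form task-aware noise, which is then injected additively. The layer widths, activations and module names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class TDSGenerator(nn.Module):
    """Task-aware deep sampling (TDS): the task description is the only
    input; task-aware noise is injected layer-by-layer."""
    def __init__(self, task_dim, feat_dim, hidden_dims=(2048, 2048, 2048)):
        super().__init__()
        dims = (task_dim,) + tuple(hidden_dims) + (feat_dim,)
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        # One sub-network per layer that maps the task description to a
        # modulation vector matching the layer width.
        self.task_mods = nn.ModuleList(
            [nn.Sequential(nn.Linear(task_dim, 1024), nn.ReLU(),
                           nn.Linear(1024, dims[i + 1]))
             for i in range(len(dims) - 1)]
        )

    def forward(self, task_desc):
        h = task_desc
        for layer, task_mod in zip(self.layers, self.task_mods):
            h = torch.relu(layer(h))
            # Deep sampling: a fresh unconditioned noise sample per layer.
            z = torch.randn_like(h)
            # Multi-step task conditioning: modulate the noise with the
            # transformed task description (task-aware noise), then inject.
            h = h + task_mod(task_desc) * z
        return h

# Usage sketch: synthesize features for a batch of task descriptions.
# gen = TDSGenerator(task_dim=600, feat_dim=512)
# fake_features = gen(torch.randn(8, 600))
```

Compared with the shallow-sampling sketch above, the only structural change is that sampling and conditioning are repeated inside the layer loop.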

In summary, the task-aware deep sampling (TDS) generator includes deep sampling and multi-step task conditioning. In the next section we will discuss how the generator can be utilized in a zero-shot learning model pipeline and generate synthetic features for unseen classes.

Figure 2: Our overall model is composed of a TDS generator, a discriminator and a classifier. The TDS generator is used to generate synthetic features of the novel categories to augment the imbalanced training data. The classifier, which can be a simple soft-max classifier or a more advanced meta-learning method, is trained on the augmented datasets to make predictions over both seen and unseen categories.

3.2 Overall model structure and objectives

Our overall model pipeline is composed of the TDS generator $G$, a discriminator $D$ and a classifier, as illustrated in Figure 2. The discriminator is used during training to distinguish whether an input feature is real or generated, while the classifier performs multi-class classification, mapping image features to the unseen categories (or to both seen and unseen categories in generalized ZSL).

To train the model, we extend the improved WGAN-GP [13] by integrating the task descriptions into both the generator and the discriminator. The extended WGAN loss is defined as

$$\mathcal{L}_{WGAN} = \mathbb{E}\big[D(x, E(t))\big] - \mathbb{E}\big[D(\tilde{x}, E(t))\big] - \lambda\,\mathbb{E}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x}, E(t))\|_2 - 1\big)^2\Big], \qquad (1)$$

where $E$ is a subnetwork (a FC network or an identity function) that transforms the task description $t$, $\tilde{x} = G(z, t)$ is a generated feature, and $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$ with $\alpha \sim U(0, 1)$. The first two terms approximate the Wasserstein distance and the third term is the gradient penalty, which enforces the gradient of $D$ to have unit norm along the linear interpolation between a real and a generated feature pair. We use the same gradient penalty coefficient $\lambda$ as in [13, 39].
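The conditional gradient penalty in Eq. (1) can be computed as in the sketch below; the discriminator signature D(features, task_embedding) and the default coefficient are assumptions made for illustration.

```python
import torch

def gradient_penalty(D, real_feats, fake_feats, task_emb, lam=10.0):
    """WGAN-GP penalty on interpolations between real and generated features,
    with the discriminator conditioned on the (transformed) task description."""
    alpha = torch.rand(real_feats.size(0), 1, device=real_feats.device)
    interp = alpha * real_feats + (1.0 - alpha) * fake_feats
    interp.requires_grad_(True)
    d_out = D(interp, task_emb)
    grads = torch.autograd.grad(
        outputs=d_out, inputs=interp,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True, retain_graph=True)[0]
    # Encourage unit gradient norm along the interpolation.
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```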

The WGAN loss does not guarantee that the generated features are well-suited for training the classifier. Similar to [39, 7], we introduce an additional classification loss defined as

$$\mathcal{L}_{CLS} = -\mathbb{E}_{\tilde{x}}\big[\log P(y \mid \tilde{x}; \theta)\big], \qquad (2)$$

where $\tilde{x} = G(z, t)$ and $y$ is the class label of $\tilde{x}$. $P(y \mid \tilde{x}; \theta)$ is the conditional probability predicted by the classifier parameterized by $\theta$. In contrast to [39, 7], which use a pre-trained classifier, our classifier is jointly trained with the generator on the synthetic features generated on the fly.

We also introduce a regression loss term to regularize the generated features of the seen classes to be close to the real image feature samples. We use a mean squared loss defined as $\mathcal{L}_{REG} = \mathbb{E}\big[\|\tilde{x} - x\|_2^2\big]$, where $x$ is a sampled image feature from the class of $\tilde{x}$. Intuitively, this regularization term helps the synthetic features softly model the cluster centers of the seen features and better capture the statistics of the target distribution. The overall objective is

$$\min_{G} \max_{D} \; \mathcal{L}_{WGAN} + \beta\,\mathcal{L}_{CLS} + \gamma\,\mathcal{L}_{REG}, \qquad (3)$$

where $\beta$ and $\gamma$ are fixed loss weights unless otherwise specified. All three parts of the model, the generator, the discriminator and the classifier, are trained in an end-to-end manner.
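One possible realization of this joint training step is sketched below in PyTorch; it reuses the gradient_penalty helper from the previous sketch, and the optimizer setup, classifier interface and default loss weights are illustrative assumptions rather than our exact training code.

```python
import torch.nn.functional as F

def training_step(G, D, clf, x_real, y, task_desc, task_emb,
                  opt_g, opt_d, opt_c, beta=1.0, gamma=1.0):
    """One joint update of the generator, discriminator and classifier."""
    # --- Discriminator (critic) update: maximize the Wasserstein estimate ---
    x_fake = G(task_desc).detach()
    d_loss = D(x_fake, task_emb).mean() - D(x_real, task_emb).mean() \
             + gradient_penalty(D, x_real, x_fake, task_emb)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator and classifier update: minimize Eq. (3) ---
    x_fake = G(task_desc)
    wgan_term = -D(x_fake, task_emb).mean()
    cls_term = F.cross_entropy(clf(x_fake), y)   # L_CLS on synthetic features
    reg_term = F.mse_loss(x_fake, x_real)        # L_REG toward real features
    g_loss = wgan_term + beta * cls_term + gamma * reg_term
    opt_g.zero_grad(); opt_c.zero_grad()
    g_loss.backward()
    opt_g.step(); opt_c.step()
    return d_loss.item(), g_loss.item()
```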

4 Experiments

In this section, we present the experimental evaluation of TDS on compositional zero-shot learning and conventional zero-shot learning; the appendix provides an extension of our model to few-shot learning. Our approach achieves state-of-the-art results on the compositional zero-shot classification of attribute-object pairs on three benchmark datasets, which we describe in Section 4.1 together with an ablation study of the generator designs. In Section 4.2, we plug the proposed TDS generator into the existing zero-shot learning framework CLSWGAN [39], observing improved classification accuracy and faster convergence.

4.1 Compositional zero-shot learning

Task setup. Compositional zero-shot learning (CZSL) aims to learn a classifier that recognizes unseen visual concepts (i.e., categories). The visual concepts are represented by attribute-object pairs (e.g., red apple, modern city) or subject-predicate-object (SPO) triplets (e.g., people walk dog, phone on table). The compositional zero-shot learning datasets (e.g., MITStates [17], UT-Zappos [42]) provide a vocabulary of attributes and objects, and the attribute-object pair serves as the task description $t$ representing the visual concept/category. Image features of only a subset of attribute-object compositions are available during training, and the model is evaluated on the image features of unseen compositions during testing. CZSL can be viewed as a special case of conventional zero-shot learning, but it is more challenging given its compositional and contextual nature, as noted by [24].

Datasets. We conduct experiments on three datasets: MITStates [17], UT-Zappos [42] and StanfordVRD [21]. For MITStates, each image is labeled with an attribute-object pair (e.g., modern city, sunny valley, etc.). The model is trained on 34K images with 1,292 labeled seen pairs and tested on 34K images with 700 unseen pairs. UT-Zappos is a fine-grained dataset where each image is associated with a material attribute and shoe type pair (e.g., leather slippers, cotton sandals, etc.). There are 16 different attribute classes and 12 object classes. Following [25], images of 83 pairs are used for training and a disjoint set of images of 33 pairs is used for testing. For StanfordVRD, the visual concept is represented with an SPO (subject, predicate, object) triplet such as person wears jeans, elephant on grass, etc. The dataset has 7,701 SPO triplets and 1,029 of them appear only in the test set. Similar to [24], we use the ground-truth bounding boxes and treat the problem as classification into SPO tuples rather than detection.

Experimental details. We extract the image features with ResNet-18 and ResNet-101 [16] pretrained on ImageNet following [24, 25, 36], and also include the more recent DLA-34 and DLA-102 [43] for benchmarking. We report the top-1 accuracy of the unseen compositions following [24, 36, 26]. We use GloVe [28] to convert the attributes and objects into 300-dimensional word embeddings. The word embeddings of attributes and objects are transformed by two 2-layer FC networks with a hidden unit size of 1024, and the concatenation of the two transformed embeddings is used as the task description input to both the generator and the discriminator. We use a soft-max classifier for classification. The discriminator is a 3-layer FC network with a hidden size of 1024. For the generator, we use a 4-layer FC network where the hidden unit size of the first three layers is 2048 and the size of the last layer matches the target feature dimension. Of the two per-layer sub-networks in Figure 1, one is a single-layer FC network with no bias whose output size matches the corresponding feature layer of the generator, and the other is a 2-layer FC network with a hidden unit size of 1024 in the first layer and an output size matching the corresponding feature layer of the generator in the second layer. We adopt the Adam [19] optimizer, with a separate initial learning rate for the embedding network $E$. We decrease the learning rate by a factor of 10 at epoch 30, train the network for 40 epochs in total, and report the accuracy of the last epoch. The batch size is 128.
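As an illustration of how the task description can be assembled from the attribute and object word vectors described above, a minimal sketch follows; the output size of the 2-layer FC networks and the class name are our assumptions.

```python
import torch
import torch.nn as nn

class PairEmbedding(nn.Module):
    """Map GloVe vectors of an attribute and an object to the task description
    fed to both the generator and the discriminator."""
    def __init__(self, word_dim=300, hidden=1024):
        super().__init__()
        self.attr_net = nn.Sequential(nn.Linear(word_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        self.obj_net = nn.Sequential(nn.Linear(word_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))

    def forward(self, attr_vec, obj_vec):
        # The task description is the concatenation of the two transformed embeddings.
        return torch.cat([self.attr_net(attr_vec), self.obj_net(obj_vec)], dim=1)
```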

| Dataset | Model | ResNet-18 Top-1 (%) | ResNet-101 Top-1 (%) | DLA-34 Top-1 (%) | DLA-102 Top-1 (%) |
|---|---|---|---|---|---|
| MIT-States | RedWine [24] | 12.0 | 17.4 | 14.6 | 17.0 |
| MIT-States | AttOperator [25] | 14.2 | 15.7 | 14.4 | 15.8 |
| MIT-States | TAFE-Net [36] | 15.1 | 17.2 | 16.1 | 17.0 |
| MIT-States | GenModel [26] | 17.8 | 20.0 | - | - |
| MIT-States | TDS | 20.9 | 22.8 | 21.8 | 23.1 |
| UT-Zappos | RedWine [24] | 40.3 | 43.2 | 36.8 | 37.6 |
| UT-Zappos | AttOperator [25] | 46.2 | 50.6 | 39.8 | 47.5 |
| UT-Zappos | GenModel [26] | 48.3 | 51.9 | - | - |
| UT-Zappos | TDS | 49.0 | 51.7 | 47.4 | 52.6 |
| StanfordVRD | RedWine [24] | 8.3 | 10.1 | 9.5 | 9.8 |
| StanfordVRD | AttOperator [25] | 8.0 | 11.5 | 7.9 | 10.8 |
| StanfordVRD | TAFE-Net [36] | 10.4 | 12.3 | 10.4 | 12.7 |
| StanfordVRD | TDS | 12.7 | 13.1 | 11.8 | 13.5 |

Table 1: Top-1 accuracy of unseen compositions in compositional zero-shot learning on MIT-States (700 unseen pairs), UT-Zappos (33 unseen pairs) and StanfordVRD (1029 unseen triplets). TDS achieves state-of-the-art results on all three datasets with four different feature extractors.
Figure 3: Shallow sampling with multi-step task conditioning. The two SS-MTC variants adopt a simple addition (left) and an affine transformation (right), respectively, to inject task information at each layer of the generator.

| Model | MIT-States Top-1 (%) | UT-Zappos Top-1 (%) | StanfordVRD Top-1 (%) |
|---|---|---|---|
| SS | 12.4 | 40.0 | 8.3 |
| SS-MTC (add) | 18.3 | 43.4 | 9.3 |
| SS-MTC (affine) | 19.2 | 44.3 | 10.1 |
| UDS | 14.8 | 41.4 | 8.7 |
| TDS | 20.9 | 49.0 | 12.7 |

Table 2: Top-1 accuracy of unseen compositions. The two SS-MTC variants, which utilize multi-step conditioning, outperform SS. UDS, with deep sampling, achieves higher accuracy than SS. Overall, TDS outperforms all the alternatives.

Quantitative results. We present the top-1 accuracy of the unseen attribute-object pairs (or SPO triplets) in Table 1 following [24, 25, 26, 36]. RedWine [24] and AttOperator [25] are metric-learning based approaches that compare the similarity of the image embeddings and the task embeddings (their numbers with different feature extractors are obtained by running the official implementation provided with AttOperator at https://github.com/Tushar-N/attributes-as-operators). GenModel [26] adopts a reconstruction loss in the objective function when learning the feature representation, and TAFE-Net [36] learns task-aware feature embeddings for a shared classifier. Our classifier is trained directly on the synthetic image features of the unseen compositions; at test time, only the real image features of the unseen compositions are fed into the classifier, without being combined with the task descriptions as in the existing approaches. As Table 1 shows, our TDS model achieves state-of-the-art results on all three datasets with four different image feature extractors. In particular, we observe over 2% improvement over the prior art, GenModel on MIT-States and TAFE-Net on StanfordVRD.

Deep sampling and multi-step task conditioning. We analyze the two innovations, deep sampling and multi-step task conditioning, with two additional variants of shallow sampling with multi-step task conditioning (SS-MTC), illustrated in Figure 3. The additive variant, SS-MTC (add), adds task information to every layer of the generator, and SS-MTC (affine) applies an affine transformation of the features conditioned on the task at each level, inspired by FiLM [29] and TAFE-Net [36]. In Table 2, we present the top-1 accuracy of the unseen compositions on the three datasets using ResNet-18 as the feature extractor. We observe that both SS-MTC variants perform better than vanilla shallow sampling (SS) with single-step task conditioning, and that SS-MTC (affine) is stronger than SS-MTC (add) given its more complex transformation. For deep sampling, we find that unconditioned deep sampling (UDS) is better than SS, where both use single-step task conditioning. In all cases, the proposed TDS, which utilizes both deep sampling and multi-step conditioning, achieves the best results among all the considered variants.
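For clarity, the two conditioning variants compared here can be written as a single per-layer operation, as in the sketch below; the module names, the choice of single linear layers to produce the scale and shift from the task description, and the mode flag are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

class MTCLayer(nn.Module):
    """One generator layer with multi-step task conditioning.
    mode='add' injects the task embedding additively (SS-MTC, add);
    mode='affine' applies a FiLM-style affine transform (SS-MTC, affine)."""
    def __init__(self, in_dim, out_dim, task_dim, mode="affine"):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.to_shift = nn.Linear(task_dim, out_dim)
        self.to_scale = nn.Linear(task_dim, out_dim)
        self.mode = mode

    def forward(self, h, task_desc):
        h = torch.relu(self.fc(h))
        if self.mode == "add":
            return h + self.to_shift(task_desc)
        return self.to_scale(task_desc) * h + self.to_shift(task_desc)
```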

Visualization of task-aware noise. TDS differs from UDS mainly in the injection of task-aware noise, which allows for sampling from a task-adaptive distribution. In Figure 4, we use t-SNE [22] to visualize the noise injected at the last layer of the generator in UDS and TDS for the 33 unseen compositions on UT-Zappos. We observe that the task-aware noise is clustered by task while the unconditioned noise is mixed in one cluster.
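This kind of visualization can be produced with scikit-learn, as in the minimal sketch below; it assumes the injected noise vectors and their composition indices have already been collected into NumPy arrays.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_injected_noise(noise, task_ids):
    """noise: (N, d) array of noise vectors injected at the last generator layer;
    task_ids: (N,) integer array giving the composition of each sample."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(noise)
    plt.scatter(emb[:, 0], emb[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.axis("off")
    plt.show()
```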

Loss ablation. We introduce a regression loss term $\mathcal{L}_{REG}$ in our objective function. As shown in Table 3, the accuracy drops significantly if $\mathcal{L}_{REG}$ is removed. We conjecture that this regularization helps the synthetic features capture the statistics (e.g., the centroid) of the target data distribution quickly, so that useful features are generated at an earlier stage of training to help train the classifier.

Figure 4: T-SNE visualization of the unconditioned noise used in UDS (left) and task-aware noise injected in the last layer of TDS (right) of 33 unseen attribute-object compositions on UT-Zappos. The task-aware noise is clustered based on the task while the unconditioned noise is mixed in one cluster.

| Model | MIT-States Top-1 (%) | UT-Zappos Top-1 (%) | StanfordVRD Top-1 (%) |
|---|---|---|---|
| TDS without $\mathcal{L}_{REG}$ | 0.33 | 20.2 | 0.17 |
| TDS with $\mathcal{L}_{REG}$ | 20.9 | 49.0 | 12.7 |

Table 3: Loss ablation. The prediction accuracy drops significantly when $\mathcal{L}_{REG}$ is removed from the objective.

4.2 Zero-shot learning

In this section, we evaluate our model on both conventional zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) on two benchmark datasets: Oxford Flowers (FLO) [27] and Caltech-UCSD Birds-200-2011 (CUB) [35]. Feature generation based methods have already been used in these settings, e.g., CLSWGAN [39], whose model is similar to ours (a generator, a discriminator and a classifier). We investigate the generalizability of our TDS generator by replacing the existing generator in CLSWGAN with TDS while keeping the discriminator, the classifier and the loss function the same as CLSWGAN. We find that simply replacing the generator improves the prediction accuracy in both ZSL and GZSL and yields a faster convergence rate than the original CLSWGAN.

Datasets. CUB contains 11,788 images from 200 different types of birds annotated with 312 attributes. FLO has 8,189 images from 102 different types of flowers. Following [39], we use the fine-grained visual descriptions collected by Reed et al. [30] and extract 1,024-dimensional semantic embeddings as the task descriptions. Following [40, 38, 1], the image features are extracted with a pretrained ResNet-101 [16].

Experimental details. Our implementation is based on the official code of CLSWGAN (https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/zero-shot-learning/feature-generating-networks-for-zero-shot-learning/). For a fair comparison, we keep the same hidden unit sizes in the generator as CLSWGAN and set the embedding network $E$ to the identity function for consistency. We use the same hyper-parameters as CLSWGAN for training, except that we simplify the training schedule by reducing the learning rate by a factor of 10 every 30 epochs and training the network for 60 epochs in all experiments, whereas CLSWGAN uses different numbers of training epochs for different datasets in the ZSL and GZSL settings.

Quantitative results. We consider both the ZSL and GZSL settings in our evaluation. In ZSL, we report the average per-class top-1 accuracy. In GZSL, we report the average per-class top-1 accuracy of the unseen classes ($u$) and the seen classes ($s$), and their harmonic mean ($h$). In Table 4, we list the performance of major baselines (DEVISE [8], SJE [2], LATEM [38], ESZSL [32], ALE [1] and SP-AEN [5]) on FLO and CUB for reference and use CLSWGAN for the main comparison. For both CLSWGAN and our method, we repeat 50 runs and report the mean accuracy and the standard deviation. As the table shows, our method improves the prediction performance of CLSWGAN on both datasets in both ZSL and GZSL with minimal hyper-parameter tuning.
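For reference, the harmonic mean is computed from the seen and unseen accuracies following the standard GZSL protocol [40]:

$$h = \frac{2\,u\,s}{u + s}.$$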

Faster convergence. We also observe that our model converges faster than CLSWGAN. In Figure 5, we plot the test accuracy on FLO and CUB under the ZSL setting at every training epoch. Our approach converges at around 30 epochs on FLO and 15 epochs on CUB, while CLSWGAN converges at around 40 epochs on both datasets. We conjecture this is because deep sampling allows for local data sampling at each layer of the generator along a fixed mapping from task descriptions to target image features, instead of a single global transformation from the initial random distribution to the target data distribution, and is thus easier to optimize.

| Model | ZSL FLO top-1 | ZSL CUB top-1 | GZSL FLO u | GZSL FLO s | GZSL FLO h | GZSL CUB u | GZSL CUB s | GZSL CUB h |
|---|---|---|---|---|---|---|---|---|
| DEVISE [8] | 45.9 | 52.0 | 9.9 | 44.2 | 16.2 | 23.8 | 53.0 | 32.8 |
| SJE [2] | 53.4 | 53.9 | 13.9 | 47.6 | 21.5 | 23.5 | 59.2 | 33.6 |
| LATEM [38] | 40.4 | 49.3 | 6.6 | 47.6 | 11.5 | 15.2 | 57.3 | 24.0 |
| ESZSL [32] | 51.0 | 53.9 | 11.4 | 56.8 | 19.0 | 12.6 | 63.8 | 21.0 |
| ALE [1] | 48.5 | 54.9 | 13.3 | 61.6 | 21.9 | 23.7 | 62.8 | 34.4 |
| SP-AEN [5] | - | 55.4 | - | - | - | 34.7 | 70.6 | 46.6 |
| CLSWGAN* | 62.7 ± 1.37 | 56.0 ± 1.37 | 54.9 ± 1.41 | 76.4 ± 2.0 | 63.8 ± 0.94 | 44.3 ± 1.33 | 56.3 ± 2.0 | 49.6 ± 0.38 |
| TDS | 66.9 ± 0.95 | 56.7 ± 0.61 | 57.33 ± 1.73 | 79.5 ± 2.32 | 66.6 ± 0.89 | 44.6 ± 1.22 | 57.0 ± 1.58 | 50.0 ± 0.34 |

Table 4: Zero-shot learning benchmark results. We list the performance of major baselines and conduct a controlled study with CLSWGAN by replacing its generator with TDS while keeping the rest the same. We repeat both CLSWGAN and TDS 50 times and report the mean accuracy and the standard deviation. TDS achieves higher accuracy under both ZSL and generalized ZSL (GZSL) on both the FLO and CUB datasets.
Figure 5: Top-1 test accuracy vs. training epochs. TDS achieves a faster convergence rate on both FLO and CUB. In particular, TDS converges at 15 epochs while CLSWGAN converges at 40 epochs on CUB.

5 Conclusion and future work

In this paper, we proposed task-aware deep sampling (TDS) to construct conditional feature generators, which produce synthetic features of novel visual concepts that can be used to train classifiers in the zero-shot learning setting. TDS has two key designs, deep sampling and multi-step task conditioning, in contrast to the widely adopted shallow sampling with single-step task conditioning. TDS achieves state-of-the-art results on compositional zero-shot learning and improves upon established benchmarks in conventional zero-shot learning. Extensive ablation studies indicate that the proposed TDS method not only improves prediction accuracy but also leads to faster convergence. In addition, an extension of our model can be applied to few-shot learning, for which we provide initial results in the appendix. We believe TDS also has potential in other applications, e.g., image synthesis, and we leave these studies for future work.

Acknowledgments

We thank Zuxuan Wu for the helpful discussion and Anna Rohrbach and Seth Dong Huk Park for the insightful comments. This work was supported by Berkeley AI Research, RISE Lab and Berkeley DeepDrive. In addition to NSF CISE Expeditions Award CCF-1730628, this research is supported by gifts from Alibaba, Amazon Web Services, Ant Financial, Arm, CapitalOne, Ericsson, Facebook, Google, Huawei, Intel, Microsoft, Nvidia, Scotiabank, Splunk and VMware.

References

  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence, 38(7):1425–1438, 2016.
  • [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
  • [3] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine learning, 50(1-2):5–43, 2003.
  • [4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
  • [5] L. Chen, H. Zhang, J. Xiao, W. Liu, and S.-F. Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1043–1052, 2018.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [7] R. Felix, B. V. Kumar, I. Reid, and G. Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In European Conference on Computer Vision, pages 21–37. Springer, 2018.
  • [8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
  • [9] H. Gao, Z. Shou, A. Zareian, H. Zhang, and S.-F. Chang. Low-shot learning via covariance-preserving adversarial augmentation networks. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • [10] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
  • [11] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • [14] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.
  • [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [17] P. Isola, J. J. Lim, and E. H. Adelson. Discovering states and transformations in image collections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1383–1391, 2015.
  • [18] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [21] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
  • [22] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [23] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [24] I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1792–1801, 2017.
  • [25] T. Nagarajan and K. Grauman. Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 169–185, 2018.
  • [26] Z. Nan, Y. Liu, N. Zheng, and S.-C. Zhu. Recognizing unseen attribute-object pair with generative model. In AAAI 2019, 2019.
  • [27] M.-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447–1454, 2006.
  • [28] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [29] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [30] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.
  • [31] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069, 2016.
  • [32] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
  • [33] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [34] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
  • [35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [36] X. Wang, F. Yu, R. Wang, T. Darrell, and J. E. Gonzalez. Tafe-net: Task-aware feature embeddings for low shot learning. arXiv preprint arXiv:1904.05967, 2019.
  • [37] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.
  • [38] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 69–77, 2016.
  • [39] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5542–5551, 2018.
  • [40] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4582–4591, 2017.
  • [41] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
  • [42] A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In Proceedings of the IEEE International Conference on Computer Vision, pages 5570–5579, 2017.
  • [43] F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403–2412, 2018.
  • [44] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [45] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.

Appendix

A.1 Few-shot learning

We can extend our experiments to few-shot learning. We follow the low-shot ImageNet benchmark proposed by Hariharan et al. [14]. Our model can be plugged into their framework, and the features generated by TDS trained with a simple logistic regression model can also directly benefit more advanced few-shot learning methods such as Matching Networks (MN) [34].

Experimental setup. The 1K ImageNet classes are randomly divided into 389 base categories and 611 novel categories. Following Hariharan et al. [14], these are further divided into two disjoint splits of base classes (193 and 196 classes) and novel classes (300 and 311 classes). We also adopt the meta-training and meta-testing schemes in our experiments: the model, including the TDS feature generator, is trained and validated on one pair of base and novel splits during meta-training, and the final evaluation is conducted on the remaining base and novel splits. During meta-testing, we feed the few examples from the novel classes to TDS and augment the novel classes until they have the same number of training examples as the base classes. The classifier (a logistic regression model over 1,000 classes; this differs from the logistic regression model used in [37], which uses a smaller label space, and we follow the original model provided by Hariharan and Girshick at https://github.com/facebookresearch/low-shot-shrink-hallucinate for easy comparison) is then trained on the augmented dataset. The image features are extracted with a pre-trained ResNet-18 model trained with the SGM loss [14]. We use the top-5 accuracy on the novel classes, the base classes and all classes as the evaluation metrics. We consider the extremely low-shot setting where $n$ (the number of examples per novel class) is set to 1, 2 and 5. For each setting, we repeat 5 runs and report the average accuracy, as in [14, 37]. We keep the training hyper-parameters the same as Hariharan et al. [14].
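The meta-testing augmentation step described above could look roughly as follows; the generator interface, the conditioning input (written here as cond, which may be derived from the few examples or from a task description) and the helper name are illustrative assumptions.

```python
import torch

def augment_novel_class(G, few_shot_feats, cond, target_count):
    """Pad a novel class with synthetic features until it has as many
    training examples as the base classes.
    cond: a (1, d) conditioning vector for this class."""
    n_missing = max(0, target_count - few_shot_feats.size(0))
    if n_missing == 0:
        return few_shot_feats
    with torch.no_grad():
        # The generator samples fresh noise internally for each row of cond.
        fake_feats = G(cond.expand(n_missing, -1))
    return torch.cat([few_shot_feats, fake_feats], dim=0)
```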

Baselines. We consider the simple logistic regression model as our meta classifier, as in Hariharan et al. [14], and also the widely used Matching Networks (MN) [34]. For experiments with MN, we directly augment the novel classes during meta-testing using the TDS generator trained with the logistic regression classifier. During meta-training, the MN model is trained in the same way as the original model on the base splits, without the feature generator. The goal is to evaluate whether the generated synthetic features are transferable across different meta-learning backbones. We leave comparisons with other meta-learning backbones (e.g., PN [33], PMN [37], FewShotWithoutForgetting [10]) and with the recent work [9], which focuses on GAN loss design rather than generator design, for future study.

Quantitative results. We provide the evaluation results in Table 5. Our model improves the original logistic regression model by roughly 15 points, and the hallucination approach of Hariharan et al. [14] by 5 points, in one-shot accuracy on the novel classes. Moreover, directly adding the generated synthetic features during meta-testing improves the original MN by 2 points. This indicates that the generated features have the potential to generalize across different meta-learning backbones.

| Model | Novel n=1 | Novel n=2 | Novel n=5 | Base n=1 | Base n=2 | Base n=5 | All n=1 | All n=2 | All n=5 |
|---|---|---|---|---|---|---|---|---|---|
| LogReg [14] | 23.14 | 42.37 | 61.68 | 91.00 | 89.32 | 86.67 | 49.37 | 60.52 | 71.34 |
| LogReg + Gen [14] | 32.80 | 46.37 | 61.70 | 88.43 | 87.12 | 86.62 | 54.31 | 62.12 | 71.33 |
| LogReg + TDS | 37.87 | 51.15 | 62.78 | 88.02 | 86.68 | 85.48 | 57.25 | 64.89 | 71.55 |
| MN [14] | 41.21 | 50.75 | 60.08 | 80.93 | 81.28 | 83.13 | 56.56 | 62.55 | 68.99 |
| MN + TDS | 43.20 | 51.01 | 58.75 | 80.54 | 81.18 | 83.53 | 57.64 | 62.68 | 68.33 |

Table 5: Few-shot learning results on the low-shot ImageNet benchmark [14]. All entries are top-5 accuracy (%) on the novel, base and all classes for n = 1, 2 and 5 examples per novel class.

A.2 Compositional zero-shot learning

In this section, we provide an additional evaluation of the choice of $\gamma$, which controls the weight of $\mathcal{L}_{REG}$. We use grid search to determine the value of $\gamma$ and plot the top-1 accuracy of unseen compositions on MITStates, UT-Zappos and StanfordVRD in Figure 6 (the value of $\gamma$ is plotted on a logarithmic scale). We find that the prediction accuracy increases as $\gamma$ increases, and that the performance starts to saturate or drop beyond a certain value, which is the value we use in our experiments.

Figure 6: Top-1 accuracy of unseen compositions with different values of $\gamma$ (plotted on a logarithmic scale). The prediction accuracy increases as $\gamma$ increases and starts to saturate or drop beyond a certain value, which we therefore choose for our experiments.