Meta-Learning across Meta-Tasks for Few-Shot Learning


Abstract

Existing meta-learning based few-shot learning (FSL) methods typically adopt an episodic training strategy whereby each episode contains a meta-task. Across episodes, these tasks are sampled randomly and their relationships are ignored. In this paper, we argue that the inter-meta-task relationships should be exploited to learn models that generalize better to unseen classes with only a few shots. Specifically, we consider the relationships between two types of meta-tasks and propose different strategies to exploit them. (1) Two meta-tasks with disjoint sets of classes: these are interesting because their relationship is reminiscent of that between the source seen classes and the target unseen classes, characterized by a domain gap caused by class differences. A novel meta-training strategy named meta-domain adaptation (MDA) is proposed to make the meta-learned model more robust to this domain gap. (2) Two meta-tasks with identical sets of classes: these are interesting because they can be used to learn models that are robust against poorly sampled few-shots. To that end, a novel meta-knowledge distillation (MKD) strategy is formulated. Extensive experiments demonstrate that both MDA and MKD significantly boost the performance of a variety of existing FSL methods and thus achieve new state-of-the-art results on three benchmarks.


1 Introduction

Most object recognition models (especially recent ones based on deep neural networks) require hundreds of labelled training samples from each object class. However, collecting and annotating large quantities of training samples is often infeasible or even impossible for certain classes in real-life scenarios Yang et al. (2012); Antonie et al. (2001). One approach to addressing this challenge is few-shot learning (FSL) Li et al. (2003, 2006); Santoro et al. (2016b); Vinyals et al. (2016); Ravi and Larochelle (2017); Finn et al. (2017), which aims to recognize a set of unseen classes with only a few training samples by learning from a set of seen classes each containing ample samples.

Recently, FSL research has been dominated by meta-learning based methods Finn et al. (2017); Snell et al. (2017); Sung et al. (2018); Ren et al. (2018); Chen et al. (2019a); Allen et al. (2019); Lee et al. (2019); Jamal and Qi (2019). These methods typically adopt an episodic training strategy. In each episode, a meta-task is constructed by sampling $N$ seen classes with few ($K$) shots each as a support set, together with a separate query set over the same classes. Each meta-task is designed to simulate the $N$-way $K$-shot unseen class classification task. Across episodes, the meta-tasks are sampled randomly and independently. For each meta-task, a feature extractor and a classifier are learned; although the former is normally shared across tasks, the latter is learned whilst ignoring any relationships among the tasks. However, since these tasks are sampled from the same pool of seen classes, they are inevitably related. In this paper, we propose to exploit the relationships between different tasks so that a model learned from seen classes can generalize better to unseen classes with only a few training samples. In particular, we focus on exploring two types of meta-task relationships and designing different learning strategies accordingly.

(a) Conventional meta-training strategy
(b) Our proposed meta-training strategies
Figure 1: (a) Conventional meta-training strategy: a pair of episodes/meta-tasks is assumed to be independent even if the two episodes have disjoint sets of classes or exactly the same set of classes. (b) Our proposed meta-training strategies (i.e. MDA and MKD) followed by the conventional meta-test strategy: in each meta-training iteration, the red episode has a disjoint set of classes w.r.t. the two blue episodes, while the two blue episodes have the same set of classes (but with totally different samples).

The first type is the relationship between two meta-tasks that have completely different sets of classes (see the episode pair with disjoint classes in Fig. 1(a)). This relationship is interesting because it is reminiscent of that between unseen and seen classes. Viewing different tasks with different classes as domains, a key attribute of this relationship is the domain gap caused by class differences. Since an FSL model learned on seen classes needs to be adapted rapidly to unseen classes, this domain gap issue must be addressed, as in zero-shot learning Zhao et al. (2018). Joint learning over two such meta-tasks and introducing domain adaptation (DA) learning objectives Cortes et al. (2019); Zhang et al. (2019b); Rahman et al. (2020) across meta-tasks thus enable the FSL model to meta-learn how to be robust against the domain gap between unseen and seen classes. To this end, we introduce a DA loss between these two meta-tasks and name the resultant training strategy meta-domain adaptation (MDA).

The second type of meta-task relationship is the one between two meta-tasks consisting of the same set of classes (see the episode pair with identical classes in Fig. 1(a)). We aim to take advantage of this relationship to address a specific challenge in few-shot learning, that is, how to learn a classifier from poorly sampled few-shot training data. Since each class is represented by only a handful of ($K$) samples, it is crucial for the model to be able to cope with outlying samples. In particular, with few samples per class, most existing FSL methods resort to very simple classifiers (e.g. the nearest neighbor classifier with each class represented as the sample mean, as adopted in prototypical networks Snell et al. (2017)) which are sensitive to the sampling of training data. Given two meta-tasks with the same set of classes, it is now possible to enforce that the two classifiers learned with different support sets behave consistently. In other words, they should be insensitive to the random sampling of the data in the support sets. Inspired by the original knowledge distillation Hinton et al. (2015), a novel meta-knowledge distillation (MKD) strategy is thus formulated in this work.

By adopting both the MDA and MKD strategies for episodic training, a novel meta-training method is presented in Fig. 1(b), which can be applied to any existing meta-learning model. Specifically, we sample three meta-tasks in each training iteration, among which two contain the same set of seen classes (represented as the two blue episodes) and the third (represented as the red episode) has a disjoint set of classes from the two blue episodes. With the three tasks, MKD is performed between the two blue episodes by enforcing classifier prediction consistency via knowledge distillation Hinton et al. (2015), and MDA is performed between the red episode and the blue episodes via minimizing the domain adaptation loss Zhang et al. (2019b). Once learned, we test the FSL model in the conventional way of meta-test, as is shown in Fig. 1(b).

Our contributions are: (1) For the first time, we propose to exploit the relationships across different meta-tasks explicitly for meta-learning. (2) We consider two types of relationships across FSL meta-tasks/episodes and propose two corresponding training strategies (i.e. MDA and MKD) to address two key challenges faced by FSL: the seen-unseen domain gap caused by class differences, and poorly sampled few-shots. (3) Our proposed strategies are generally applicable to all meta-learning based FSL methods (i.e. methods adopting episodic training) and clearly boost their performance (see details in Sec. 4). (4) Extensive experiments show that existing models learned with our training strategies achieve new state-of-the-art performance.

2 Related Work

2.1 Few-Shot Learning

In recent years, most few-shot learning (FSL) approaches Vinyals et al. (2016); Ravi and Larochelle (2017); Finn et al. (2017); Snell et al. (2017); Sung et al. (2018); Mishra et al. (2018); Oreshkin et al. (2018); Qiao et al. (2018); Ye et al. (2018); Lee et al. (2019); Rusu et al. (2019); Allen et al. (2019) are based on meta-learning with an episodic training strategy. These methods can be categorized into three groups: metric-based, model-based, and optimization-based approaches. (1) Metric-based methods Vinyals et al. (2016); Snell et al. (2017); Sung et al. (2018); Allen et al. (2019) try to learn a suitable metric for nearest neighbor search based classification. Instead of embedding all samples into a shared task-independent metric space, Qiao et al. (2019) further learn an episodic-wise adaptive metric for classification. (2) Model-based methods Santoro et al. (2016a); Munkhdalai and Yu (2017) fine-tune their models trained on the seen classes and then quickly adapt them to the unseen classes. (3) Optimization-based methods Ravi and Larochelle (2017); Finn et al. (2017); Li et al. (2017); Lee et al. (2019) exploit novel optimization algorithms instead of the gradient descent algorithm, again for quick adaptation from seen to unseen classes. Regardless of which group an existing FSL method belongs to, it ignores the relationships between the meta-tasks randomly sampled in different episodes. There is only one exception – meta-transfer learning Sun et al. (2019) randomly samples a batch of independent episodes, records the class with the lowest accuracy in each meta-task/episode, and re-samples ‘hard’ tasks from the set of recorded classes. Instead of hard task mining for meta-learning, we deliberately construct meta-task pairs with either identical or completely different classes, in order to meta-learn a model that is robust against both the domain gap caused by class differences and the poorly sampled training data caused by having only a few shots per class.

2.2 Domain Adaptation

Domain adaptation (DA) Pan et al. (2010); Cortes et al. (2019); Rahman et al. (2020) aims to reduce the domain gap between the source and target domains. Under the popular unsupervised DA setting Gong et al. (2012); Ganin and Lempitsky (2015), a large amount of labelled source data along with abundant unlabelled target data are provided for training. A number of recent DA works Tzeng et al. (2017); Pinheiro (2018); Long et al. (2018); Sohn et al. (2019); Zou et al. (2019); Zhang et al. (2019b); Chen et al. (2019b) are based on adversarial learning, which aligns the source and target distributions by reducing the domain gap in a minimax game. In this classic DA setting, the source and target domains are assumed to share the same set of classes. In our work, however, we aim to minimize the domain gap caused by disjoint sets of classes rather than that caused by different underlying data distributions, and we face the additional challenge that only a few training samples are available.

Note that recently cross-domain FSL Dong and Xing (2018); Tseng et al. (2020) has started to draw attention, where the unseen classes in FSL are also from another problem domain (e.g., photo to sketch). Our current work is clearly different from this new FSL setting in that we strictly follow the conventional FSL setting but exploit the relationships between meta-tasks with disjoint sets of classes.

2.3 Knowledge Distillation

Knowledge distillation (KD) Hinton et al. (2015) has become topical recently and several works have focused on KD with meta-learning Flennerhag et al. (2019); Jang et al. (2019). Concretely, Flennerhag et al. (2019) proposes a framework to transfer knowledge across learning processes, and Jang et al. (2019) proposes a novel transfer learning approach based on meta-learning to automatically learn what to transfer from the source network to the target network. Moreover, in meta-learning based FSL, Robust-dist Dvornik et al. (2019) learns an ensemble of networks and distills the ensemble into a single network to remove the overhead at test time. KD is also employed in our meta-knowledge distillation (MKD) strategy. However, the objective is not to train a smaller target network more effectively, but to alleviate the effects of badly sampled meta-tasks by distilling knowledge from a better sampled one.

3 Methodology

3.1 Problem Definition

Let $\mathcal{C}_s$ denote a set of seen classes and $\mathcal{C}_u$ denote a set of unseen classes, where $\mathcal{C}_s \cap \mathcal{C}_u = \emptyset$. We are then given a large sample set $\mathcal{D}_s$ from $\mathcal{C}_s$, a few-shot sample set $\mathcal{D}_u$ from $\mathcal{C}_u$, and a test set $\mathcal{D}_{test}$ from $\mathcal{C}_u$, where $\mathcal{D}_u \cap \mathcal{D}_{test} = \emptyset$. Concretely, $\mathcal{D}_s = \{(x_i, y_i) \,|\, y_i \in \mathcal{C}_s\}_{i=1}^{n_s}$, where $x_i$ denotes the $i$-th image, $y_i$ is the class label of $x_i$, and $n_s$ denotes the number of images in $\mathcal{D}_s$. Similarly, the $K$-shot (i.e. each unseen class has $K$ labelled images) sample set $\mathcal{D}_u = \{(x_i, y_i) \,|\, y_i \in \mathcal{C}_u\}_{i=1}^{n_u}$, where $n_u = K \, |\mathcal{C}_u|$. The goal of FSL is to predict the labels of test images in $\mathcal{D}_{test}$ by training a model with $\mathcal{D}_s$ and $\mathcal{D}_u$.

3.2 Meta-Learning for FSL

Meta-learning based FSL methods Vinyals et al. (2016); Finn et al. (2017); Snell et al. (2017); Sung et al. (2018); Lee et al. (2019) typically evaluate their models over unseen class classification meta-tasks (or episodes) sampled from the unseen class data. To form an $N$-way $K$-shot $Q$-query episode $T = (\mathcal{S}, \mathcal{Q})$, a subset of unseen classes $\mathcal{C}_T$ is first randomly sampled from $\mathcal{C}_u$, where $|\mathcal{C}_T| = N$. A support set $\mathcal{S}$ and a query set $\mathcal{Q}$ ($\mathcal{S} \cap \mathcal{Q} = \emptyset$) are then generated by sampling $K$ support images and $Q$ query images from each class in $\mathcal{C}_T$, respectively. An effective way to exploit the large sample set $\mathcal{D}_s$ is to mimic this few-shot meta-test setting via episodic training on the seen classes.
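
To make the episodic sampling above concrete, the following is a minimal sketch of how an $N$-way $K$-shot $Q$-query episode can be drawn from a labelled dataset. It is an illustration only (not the authors' released code); `data_by_class`, a dictionary mapping each class label to its sample indices, is an assumed helper structure.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, q_query, classes=None):
    """Draw an N-way K-shot Q-query episode as (classes, support, query).

    data_by_class: dict mapping class label -> list of sample indices.
    classes: optionally fix the class subset, which makes it easy to build
             episode pairs that share classes (for MKD) or exclude them
             (for MDA).
    """
    if classes is None:
        classes = random.sample(sorted(data_by_class.keys()), n_way)
    support, query = [], []
    for c in classes:
        picked = random.sample(data_by_class[c], k_shot + q_query)
        support += [(idx, c) for idx in picked[:k_shot]]
        query += [(idx, c) for idx in picked[k_shot:]]
    return classes, support, query
```

Calling this twice with the same `classes` (and the first episode's indices removed from `data_by_class`) yields two episodes with identical classes but disjoint samples; resampling with the first episode's classes excluded yields a class-disjoint episode pair.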

In this meta-learning framework, a typical FSL approach designs a few-shot classification loss $\mathcal{L}_{fsl}$ for measuring the gap between the predicted labels and the ground-truth labels of the query set $\mathcal{Q}$ over each episode $T$:

$\mathcal{L}_{fsl}(T; \theta) = \frac{1}{|\mathcal{Q}|} \sum_{x \in \mathcal{Q}} \ell_{ce}\big(M(x; \mathcal{S}, \theta),\, y_x\big),$   (1)

where $\ell_{ce}$ is the cross-entropy loss, $y_x$ is the ground-truth label of $x$, and $M$ can be any FSL model with a set of parameters $\theta$ as long as it adopts episodic training. The FSL model can be further represented as $M(x; \mathcal{S}, \theta) = f_T\big(E(x; \theta)\big)$, where $E(\cdot; \theta)$ denotes the feature extractor with output feature dimension $d$, and $f_T$ denotes the scoring function constructed from $\mathcal{S}$ within episode $T$. For conciseness, we replace $E(x; \theta)$ with $E(x)$ below. The FSL model is then trained over the meta-training set by minimizing the loss function $\mathcal{L}_{fsl}$ and is tested over the meta-test set.
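
As a concrete instance of Eq. (1), the sketch below computes the episodic loss with a prototypical-network-style scoring function $f_T$ (negative squared Euclidean distance to class prototypes). The choice of $f_T$ is only an assumption for illustration; any episodic scoring function fits the same interface.

```python
import torch
import torch.nn.functional as F

def episodic_loss(encoder, support_x, support_y, query_x, query_y, n_way):
    """Eq. (1) with a ProtoNet-style scoring function f_T.

    support_y / query_y hold episode-local labels in {0, ..., n_way - 1}.
    Returns the loss and the query scores (reused later in Eqs. (11)-(12)).
    """
    z_s = encoder(support_x)                      # (N*K, d) support features
    z_q = encoder(query_x)                        # (N*Q, d) query features
    # Class prototypes: mean support embedding per class.
    protos = torch.stack([z_s[support_y == c].mean(0) for c in range(n_way)])
    # Scores = negative squared Euclidean distance to each prototype.
    scores = -torch.cdist(z_q, protos) ** 2       # (N*Q, N)
    return F.cross_entropy(scores, query_y), scores
```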

3.3 Meta-Learning across Meta-Tasks (MLMT)

(a) Meta-domain adaptation (MDA)
(b) Meta-knowledge distillation (MKD)
Figure 2: Schematic of our proposed meta-domain adaptation (MDA) and meta-knowledge distillation (MKD) strategies for meta-learning across meta-tasks (MLMT).

Existing meta-learning approaches described above take either one episode or a batch of episodes per training iteration and minimize loss functions defined within each episode independently, ignoring the underlying relations across different meta-tasks. In contrast, in our meta-learning across meta-tasks (MLMT) method, each pair of meta-tasks is constructed to have either identical or completely disjoint sets of classes. Different training strategies are then devised to exploit these two types of relationships (see Fig. 2).

Meta-Domain Adaptation (MDA)

We sample an $N_s$-way $K_s$-shot episode/task $T_s$ from $\mathcal{D}_s$ as the source episode and an $N$-way $K$-shot episode $T_t$ from $\mathcal{D}_s$ as the target episode, where $\mathcal{C}_{T_s} \subset \mathcal{C}_s$, $\mathcal{C}_{T_t} \subset \mathcal{C}_s$, and $\mathcal{C}_{T_s} \cap \mathcal{C}_{T_t} = \emptyset$. Note that since the two episodes are sampled from disjoint sets of classes, their numbers of ways or shots can also be different.

Let $f_s$ denote the scoring function constructed from the support set $\mathcal{S}_s$ within the source episode $T_s$, which is decided by the meta-learning FSL model $M$. We first introduce an auxiliary scoring function $f'$ sharing the same hypothesis space with $f_s$. Since $f'$ is used to score each sample in $T_t$ on the classes of $T_s$, $f'$ is designed as a metric-learning network that computes the similarity scores of query-prototype pairs. We set $f'$ to be a multi-layer perceptron (MLP) module (see its detailed architecture in Sec. 4.1) stacked after the absolute difference between a query sample and a source class prototype (i.e. the mean representation of support samples from this source class). Since adversarial learning is widely used for domain adaptation, our MDA problem is formulated as:

$\min_{\theta,\, f_s} \; \mathcal{L}_{fsl}(T_s; \theta) + \lambda\, \mathcal{L}_{da}(T_s, T_t; \theta, f'^{*}),$   (2)

$f'^{*} = \arg\max_{f'} \; \mathcal{L}_{da}(T_s, T_t; \theta, f'),$   (3)

where $\lambda$ is the trade-off coefficient between the few-shot classification loss $\mathcal{L}_{fsl}$ and the DA loss $\mathcal{L}_{da}$. Many existing DA losses could be employed here (see Table 2). In this work, we only consider the margin disparity discrepancy (MDD) Zhang et al. (2019b). We then have:

$\mathcal{D}_s(T_s; \theta, f') = \frac{1}{|\mathcal{Q}_s|} \sum_{x \in \mathcal{Q}_s} \ell_s\big(f'(E(x)),\, h_s(x)\big),$   (4)

$\mathcal{L}_{da}(T_s, T_t; \theta, f') = \mathcal{D}_t(T_t; \theta, f') - \gamma\, \mathcal{D}_s(T_s; \theta, f'),$   (5)

where $\gamma$ is a hyper-parameter, $h_s(x)$ is the pseudo-label assigned to $x$ by $f_s$ (see Eq. (7) below), and $\mathcal{D}_s$ and $\mathcal{D}_t$ are the two margin disparities of the source and target episodes, respectively ($\mathcal{D}_t$ is defined as in Eq. (4) but over $\mathcal{Q}_t$ with the modified loss $\ell_t$ of Eq. (9)). We train $f'$ to maximize the discrepancy between the two episodes in Eq. (3) and train $\theta$, $f_s$ to minimize the maximum MDD in Eq. (2). In this minimax manner, the domain gap between the two episodes caused by their disjoint sets of classes should be reduced. We find that introducing MDA into episodic training indeed helps to improve the generalization ability during meta-test (see Fig. 4). Note that our MDA designed for FSL can cope with the class difference by inducing a metric-learning based auxiliary classifier, while this issue cannot be addressed by the original MDD (since it assumes that the source and target domains share the same set of classes).

Furthermore, we adopt the softmax function for classification. Concretely, for a scoring vector $z = (z_1, \dots, z_{N_s})$, $\sigma(z)$ is defined as:

$\sigma_j(z) = \frac{\exp(z_j)}{\sum_{k=1}^{N_s} \exp(z_k)}, \quad j = 1, \dots, N_s.$   (6)

Therefore, with the pseudo-label of $x$ under $f_s$ defined as

$h_s(x) = \arg\max_j\, \sigma_j\big(f_s(E(x))\big),$   (7)

$\ell_s$ in Eqs. (4)-(5) is the cross-entropy loss:

$\ell_s\big(f'(E(x)),\, h_s(x)\big) = -\log \sigma_{h_s(x)}\big(f'(E(x))\big).$   (8)

Similarly, $\ell_t$ in Eq. (5) is a modified cross-entropy loss:

$\ell_t\big(f'(E(x)),\, h_s(x)\big) = \log\big(1 - \sigma_{h_s(x)}(f'(E(x)))\big),$   (9)

which was introduced in Goodfellow et al. (2014) to ease the burden of vanishing or exploding gradients.

Note that in Eq. (9), although $x \in \mathcal{Q}_t$ does not belong to any class in $\mathcal{C}_{T_s}$, the similarity scores after softmax, $\sigma(f_s(E(x)))$ and $\sigma(f'(E(x)))$, can be considered to come from distributions in an $N_s$-dimensional space. That is also the reason why we use the binary cross-entropy loss in both Eq. (8) and Eq. (9). Moreover, since $f_s$ is decided by the meta-learning based FSL method and may contain no learnable parameters (e.g. prototypical networks Snell et al. (2017) use the negative Euclidean distance as the score), we cut off the gradients over $f_s$ in Eq. (5) and directly train the feature extractor $E$ to minimize this discrepancy loss through a gradient reversal layer (GRL) Ganin and Lempitsky (2015). The schematic of our MDA strategy is shown in Fig. 2(a).
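
Putting Eqs. (4)-(9) and the GRL together, the sketch below is one plausible PyTorch realization of the MDA loss. It is a hedged illustration of the mechanics (the sign conventions and the small numerical epsilon are our assumptions), not the exact released implementation; the returned $\mathcal{L}_{da}$ is meant to enter the total loss of Eq. (16) with a negative coefficient, so that $f'$ (`f_prime` here) maximizes it while the GRL makes the feature extractor minimize it.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -1 backward."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def mda_loss(f_prime, feat_s_query, feat_t_query, scores_s, scores_t, gamma):
    """L_da of Eq. (5): D_t (modified CE, Eq. (9)) - gamma * D_s (CE, Eq. (8)).

    feat_s_query / feat_t_query: encoder features of source/target queries.
    scores_s / scores_t: f_s scores of those queries over the N_s source
    classes, used only to obtain pseudo-labels h_s (gradients cut off).
    """
    h_src = scores_s.detach().argmax(dim=1)       # Eq. (7) on source queries
    h_tgt = scores_t.detach().argmax(dim=1)       # Eq. (7) on target queries
    # GRL between E and f': f' maximizes L_da while the encoder minimizes it.
    logits_s = f_prime(GradReverse.apply(feat_s_query))
    logits_t = f_prime(GradReverse.apply(feat_t_query))
    d_s = F.cross_entropy(logits_s, h_src)        # Eq. (8), source disparity
    p_t = F.softmax(logits_t, dim=1)
    # Eq. (9): modified (Goodfellow-style) cross-entropy on target queries.
    d_t = torch.log(1.0 - p_t.gather(1, h_tgt[:, None]) + 1e-6).mean()
    return d_t - gamma * d_s                      # Eq. (5)
```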

Meta-Knowledge Distillation (MKD)

As is shown in Fig. 2(b), we consider another type of relationship between two meta-tasks which are sampled from exactly the same set of classes but with different samples. Specifically, we are given two $N$-way $K$-shot $Q$-query episodes $T_1$ and $T_2$ (both over the same class subset $\mathcal{C}_T \subset \mathcal{C}_s$), where $\mathcal{S}_1 \cap \mathcal{S}_2 = \emptyset$ and $\mathcal{Q}_1 \cap \mathcal{Q}_2 = \emptyset$. Our MKD strategy between these two episodes aims to transfer knowledge from the strong classifier to the weak one, which is weak because its shots are more negatively impacted by outlying samples.

Let $f_1$ and $f_2$ be the scoring functions of the classifiers within the two episodes, respectively. We first define an indicator function as:

$\mathbb{1}[c] = 1$ if the condition $c$ is true, and $\mathbb{1}[c] = 0$ otherwise.   (10)

To determine which classifier (scoring function) is better, we compute the few-shot classification accuracies of the two classifiers on the merged queries from both episodes. Concretely, over the merged query set $\mathcal{Q}_{12} = \mathcal{Q}_1 \cup \mathcal{Q}_2$, we have:

$Acc_1 = \frac{1}{|\mathcal{Q}_{12}|} \sum_{x \in \mathcal{Q}_{12}} \mathbb{1}\big[\arg\max_j \sigma_j\big(f_1(E(x))\big) = y_x\big],$   (11)

$Acc_2 = \frac{1}{|\mathcal{Q}_{12}|} \sum_{x \in \mathcal{Q}_{12}} \mathbb{1}\big[\arg\max_j \sigma_j\big(f_2(E(x))\big) = y_x\big],$   (12)

where $y_x$ denotes the ground-truth label of $x$, $\mathcal{Q}_1 \cap \mathcal{Q}_2 = \emptyset$, and $|\mathcal{Q}_{12}| = 2NQ$. The classifier with the higher accuracy is thus considered to be the better one. Without loss of generality, we assume that $f_1$ is better (i.e. $Acc_1 \ge Acc_2$) and call $T_1$ the main episode. The optimization problem for MKD is then stated as:

$\min_{\theta} \; \mathcal{L}_{fsl}(T_1; \theta) + \mathcal{L}_{fsl}(T_2; \theta) + \mu\, \mathcal{L}_{kd}(T_1, T_2; \theta),$   (13)

where $\mu$ denotes a hyper-parameter, $\mathcal{L}_{fsl}(T_1; \theta)$ and $\mathcal{L}_{fsl}(T_2; \theta)$ are respectively the few-shot classification losses defined over $T_1$ and $T_2$, and $\mathcal{L}_{kd}$ is the knowledge distillation loss that is defined with a temperature $\tau$ as in Hinton et al. (2015):

$\mathcal{L}_{kd}(T_1, T_2; \theta) = \frac{1}{|\mathcal{Q}_{12}|} \sum_{x \in \mathcal{Q}_{12}} KD\big(f_1(E(x)),\, f_2(E(x))\big).$   (14)

When the softmax function $\sigma$ (Eq. (6), here over the $N$ classes of $\mathcal{C}_T$) is used for classification, we define $KD(\cdot, \cdot)$ in Eq. (14) as:

$KD\big(z^{(1)}, z^{(2)}\big) = -\sum_{j=1}^{N} \sigma_j\big(z^{(1)}/\tau\big)\, \log \sigma_j\big(z^{(2)}/\tau\big).$   (15)
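
The sketch below realizes the teacher selection of Eqs. (10)-(12) and the temperature-scaled KD loss of Eq. (15). Detaching the teacher scores is an implementation choice we assume (so that distillation only pushes the weaker classifier's side toward the stronger one).

```python
import torch.nn.functional as F

def pick_main_episode(scores_1, scores_2, labels):
    """Eqs. (10)-(12): compare the two classifiers on the merged queries.

    scores_i: f_i scores of ALL merged queries Q_1 U Q_2, shape (M, N).
    labels: ground-truth labels of the merged queries, shape (M,).
    Returns True if episode 1 is the main (teacher) episode.
    """
    acc_1 = (scores_1.argmax(dim=1) == labels).float().mean()
    acc_2 = (scores_2.argmax(dim=1) == labels).float().mean()
    return bool(acc_1 >= acc_2)

def mkd_loss(teacher_scores, student_scores, tau):
    """Asymmetric KD loss of Eqs. (15)/(18) with temperature tau."""
    # Detaching the teacher is our assumption: only the student side is
    # pulled toward the (better) teacher's softened predictions.
    p_teacher = F.softmax(teacher_scores.detach() / tau, dim=1)
    log_p_student = F.log_softmax(student_scores / tau, dim=1)
    return -(p_teacher * log_p_student).sum(dim=1).mean()
```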

3.4 MLMT-Based FSL Algorithm

For implementation simplicity, in each training iteration, we randomly sample one $N_s$-way $K_s$-shot $Q_s$-query source episode/meta-task $T_s$ and two $N$-way $K$-shot $Q$-query target episodes/meta-tasks $T_1$ and $T_2$. More specifically, the source episode is limited to have a disjoint set of classes w.r.t. either target episode (i.e. $\mathcal{C}_{T_s} \cap \mathcal{C}_{T_1} = \mathcal{C}_{T_s} \cap \mathcal{C}_{T_2} = \emptyset$), while the two target episodes are limited to have exactly the same set of classes but with different samples (i.e. $\mathcal{C}_{T_1} = \mathcal{C}_{T_2}$, $\mathcal{S}_1 \cap \mathcal{S}_2 = \emptyset$, and $\mathcal{Q}_1 \cap \mathcal{Q}_2 = \emptyset$).

In each training iteration, we first determine the main target episode to compute the MKD loss over the two target episodes. We then compute the MDA loss between the source episode and the main target episode. The total loss for MLMT is finally given by:

$\mathcal{L}_{total} = \mathcal{L}_{fsl}(T_s) + \mathcal{L}_{fsl}(T_1) + \mathcal{L}_{fsl}(T_2) - \lambda\, \mathcal{L}_{da}(T_s, T_m) + \mu\, \mathcal{L}_{kd}(T_m, T_o),$   (16)

where $T_m$ denotes the main target episode, $T_o$ denotes the other target episode, and $\{T_m, T_o\} = \{T_1, T_2\}$. Note that minimizing $-\lambda\, \mathcal{L}_{da}$ is actually equal to maximizing $\mathcal{L}_{da}$ (as required for $f'$ in Eq. (3)). However, with the gradient reversal layer (GRL) between $E$ and $f'$, we are still training $E$ to minimize $\mathcal{L}_{da}$.
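
To see how the single objective of Eq. (16) realizes the minimax of Eqs. (2)-(3), the short assembly below combines the loss terms; it assumes the helper functions sketched earlier in this section.

```python
def mlmt_total_loss(fsl_s, fsl_1, fsl_2, l_da, l_kd, lam, mu):
    """Eq. (16): joint objective of one MLMT training iteration.

    fsl_*: few-shot classification losses of the three episodes (Eq. (1));
    l_da:  MDA loss between T_s and the main target episode (Eq. (5)),
           with the GRL already applied inside mda_loss;
    l_kd:  MKD loss between the two target episodes (Eqs. (14)/(17)).
    The -lam * l_da term trains f' to maximize L_da (Eq. (3)), while the
    GRL flips the encoder's gradients so that E still minimizes L_da.
    """
    return fsl_s + fsl_1 + fsl_2 - lam * l_da + mu * l_kd
```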

In practical implementation, when computing the MKD loss $\mathcal{L}_{kd}$, we can even exploit the queries from $T_s$ to further improve the generalization ability of MKD. Although samples in $\mathcal{Q}_s$ do not belong to any class in $\mathcal{C}_{T_1}$ or $\mathcal{C}_{T_2}$, the two classifiers' outputs are still aligned by minimizing the MKD loss, enforcing that they behave consistently even on such ‘unseen’ class data (i.e. $\mathcal{Q}_s$, unseen by them). A model learned with this MKD strategy is thus more robust against the class-difference caused domain gap (i.e. from seen classes to unseen ones) during meta-test, in addition to our MDA strategy. Given $\mathcal{Q}_{all} = \mathcal{Q}_1 \cup \mathcal{Q}_2 \cup \mathcal{Q}_s$, we reformulate Eqs. (14)-(15) as follows:

$\mathcal{L}_{kd}(T_m, T_o; \theta) = \frac{1}{|\mathcal{Q}_{all}|} \sum_{x \in \mathcal{Q}_{all}} KD\big(f_m(E(x)),\, f_o(E(x))\big),$   (17)

$KD\big(z^{(1)}, z^{(2)}\big) = -\sum_{j=1}^{N} \sigma_j\big(z^{(1)}/\tau\big)\, \log \sigma_j\big(z^{(2)}/\tau\big),$   (18)

where $f_m$ and $f_o$ denote the scoring functions of the main and the other target episode, respectively.
0:  Input: any meta-learning based FSL method $M$ (with feature extractor $E$ and parameters $\theta$); the seen class sample set $\mathcal{D}_s$; hyper-parameters $\lambda$, $\mu$, $\gamma$, $\tau$
0:  Output: the learned $M$ (i.e. the optimal $\theta$ and $f'$)
1:  for all iteration = 1, …, MaxIteration do
2:     Randomly sample one $N_s$-way $K_s$-shot source episode (i.e. meta-task) $T_s$ and two $N$-way $K$-shot target episodes (i.e. meta-tasks) $T_1$ and $T_2$ from $\mathcal{D}_s$, satisfying that $\mathcal{C}_{T_s} \cap \mathcal{C}_{T_1} = \emptyset$ and $\mathcal{C}_{T_1} = \mathcal{C}_{T_2}$;
3:     Compute the margin disparity $\mathcal{D}_s(T_s; \theta, f')$ with Eq. (4), and obtain $\mathcal{D}_{t_1}$, $\mathcal{D}_{t_2}$ in the same way;
4:     Construct the merged query set $\mathcal{Q}_{12} = \mathcal{Q}_1 \cup \mathcal{Q}_2$ based on the two target episodes;
5:     Compute $Acc_1$ and $Acc_2$ with Eq. (11) and Eq. (12), respectively;
6:     if  $Acc_1 \ge Acc_2$  then
7:         $T_m \leftarrow T_1$; $T_o \leftarrow T_2$;
8:     else
9:         $T_m \leftarrow T_2$; $T_o \leftarrow T_1$;
10:     end if
11:     Compute $\mathcal{L}_{da}(T_s, T_m; \theta, f')$ with Eq. (5), and obtain the MDA loss $-\lambda\, \mathcal{L}_{da}$;
12:     Construct $\mathcal{Q}_{all} = \mathcal{Q}_1 \cup \mathcal{Q}_2 \cup \mathcal{Q}_s$ based on the three episodes;
13:     Compute the MKD loss $\mathcal{L}_{kd}$ with Eq. (17);
14:     Compute the total loss $\mathcal{L}_{total}$ with Eq. (16);
15:     Compute the gradients $\nabla_{\theta, f'}\, \mathcal{L}_{total}$;
16:     Update $\theta$, $f'$ using stochastic gradient descent;
17:  end for
18:  return $M$.
Algorithm 1 MLMT-Based FSL

By combining MDA and MKD for episodic training, our MLMT-based FSL algorithm is summarized in Algorithm 1. Once learned, with the optimal FSL model found by our algorithm, we randomly sample 2,000 $N$-way $K$-shot meta-test episodes from the unseen class data and average the top-1 test accuracies over these episodes as the final FSL results.
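
For concreteness, one training iteration of Algorithm 1 can be assembled from the sketches above as follows. This is a hedged skeleton: `score_with(encoder, T, x)` is an assumed helper that builds $f_T$ from episode T's support set and scores inputs x, and the optimizer is assumed to cover both the encoder and `f_prime`.

```python
import torch

def mlmt_iteration(encoder, f_prime, optimizer, T_s, T_1, T_2,
                   lam, mu, gamma, tau):
    """One training iteration of Algorithm 1, built from the sketches above.

    Each episode is a dict with keys support_x, support_y, query_x,
    query_y, n_way (episode-local labels).
    """
    # Step 3 and Eq. (1): few-shot losses of the three episodes.
    fsl_s, _ = episodic_loss(encoder, **T_s)
    fsl_1, _ = episodic_loss(encoder, **T_1)
    fsl_2, _ = episodic_loss(encoder, **T_2)

    # Steps 4-10: pick the main (teacher) target episode on merged queries.
    merged_x = torch.cat([T_1["query_x"], T_2["query_x"]])
    merged_y = torch.cat([T_1["query_y"], T_2["query_y"]])
    sc1 = score_with(encoder, T_1, merged_x)   # assumed helper
    sc2 = score_with(encoder, T_2, merged_x)
    main_is_1 = pick_main_episode(sc1, sc2, merged_y)
    T_m = T_1 if main_is_1 else T_2
    teacher, student = (sc1, sc2) if main_is_1 else (sc2, sc1)

    # Step 13: MKD over the merged target queries (Eq. (14); adding T_s's
    # queries as in Eq. (17) is a straightforward extension).
    l_kd = mkd_loss(teacher, student, tau)

    # Step 11 and Eq. (5): MDA between T_s and the main target episode.
    l_da = mda_loss(f_prime,
                    encoder(T_s["query_x"]), encoder(T_m["query_x"]),
                    score_with(encoder, T_s, T_s["query_x"]),
                    score_with(encoder, T_s, T_m["query_x"]), gamma)

    # Steps 14-16 and Eq. (16): total loss and one SGD update.
    loss = mlmt_total_loss(fsl_s, fsl_1, fsl_2, l_da, l_kd, lam, mu)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```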

4 Experiments


miniImageNet tieredImageNet CUB
Method Backbone 1-shot 5-shot 1-shot 5-shot 1-shot 5-shot


MatchingNet Vinyals et al. (2016) Conv-4 - - - -
Meta-LSTM Ravi and Larochelle (2017) Conv-4 - - - -
MAML Finn et al. (2017) Conv-4
ProtoNets Snell et al. (2017) Conv-4
RelationNet Sung et al. (2018) Conv-4
IMP Allen et al. (2019) Conv-4 - - - -
SNAIL Mishra et al. (2018) ResNet-12 - - - -
TADAM Oreshkin et al. (2018) ResNet-12 - - - -
MTL Sun et al. (2019) ResNet-12 - - - -
VariationalFSL Zhang et al. (2019a) ResNet-12 - - - -
TapNet Yoon et al. (2019) ResNet-12 - -
MetaOptNet Lee et al. (2019) ResNet-12 - -
CAN Hou et al. (2019) ResNet-12 - -
PPA Qiao et al. (2018) WRN - - - -
LEO Rusu et al. (2019) WRN
Robust-dist++ Dvornik et al. (2019) WRN - - - -
wDAE Gidaris and Komodakis (2019) WRN - -
CC+rot Gidaris et al. (2019) WRN - -
Mangla et al. (2019) WRN - -
MetaOptNet Lee et al. (2019) WRN
IMP Allen et al. (2019) WRN
FEAT Ye et al. (2018) WRN
MetaOptNet+MLMT (ours) WRN
IMP+MLMT (ours) WRN
FEAT+MLMT (ours) WRN


Table 1: Comparative results of conventional FSL on the three benchmark datasets. The average 5-way few-shot classification accuracies (%, top-1) along with 95% confidence intervals are reported on the test split of each dataset.

4.1 Datasets and Settings

Datasets. Three widely-used benchmark datasets are selected: (1) miniImageNet: This dataset was proposed in Vinyals et al. (2016) and contains 100 classes from ILSVRC-12 Russakovsky et al. (2015). Each class has 600 images. We split the dataset into 64 training classes, 16 validation classes and 20 test classes as in Ravi and Larochelle (2017). (2) tieredImageNet: This dataset Ren et al. (2018) is a larger subset of ILSVRC-12, which contains 608 classes and 779,165 images in total. We split it into 351 training classes, 97 validation classes and 160 test classes as in Ren et al. (2018). (3) CUB-200-2011 Birds (CUB): CUB Wah et al. (2011) has 200 bird classes and 11,788 images in total. We split it into 100 training classes, 50 validation classes and 50 test classes as in Chen et al. (2019a). All images of the three datasets are resized to the same fixed resolution.

Evaluation Protocols. We evaluate performance under the 5-way 5-shot/1-shot settings. Each episode has 5 classes randomly sampled from the test split, with 5 shots (or 1 shot) and 15 queries per class. We thus have $N = 5$, $K \in \{1, 5\}$ and $Q = 15$, as in previous works. We report the average 5-way classification accuracy (%, top-1) over 2,000 test episodes as well as the 95% confidence interval.
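
The reported numbers can be reproduced from per-episode accuracies with a standard normal-approximation interval; the sketch below reflects our assumption about how such intervals are typically computed, not a detail stated in the paper.

```python
import numpy as np

def mean_and_ci95(accs):
    """Average episode accuracy with a 95% confidence half-width.

    accs: array of per-episode top-1 accuracies (one entry per episode).
    """
    accs = np.asarray(accs)
    mean = accs.mean()
    # 1.96 * standard error of the mean (normal approximation).
    half_width = 1.96 * accs.std(ddof=1) / np.sqrt(len(accs))
    return mean, half_width

# e.g. mean, ci = mean_and_ci95(episode_accs)
# print(f"{100 * mean:.2f} ± {100 * ci:.2f}")
```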

Implementation Details. Our algorithm is implemented in PyTorch. WideResNet-28-10 (WRN) Zagoruyko and Komodakis (2016) is adopted as the feature extractor as in Oreshkin et al. (2018); Qiao et al. (2018); Ye et al. (2018); Rusu et al. (2019), and the output feature dimension is $d = 640$. We pre-train WRN to accelerate the entire training process. The auxiliary scoring function $f'$ used for our MDA strategy is formed by 4 fully-connected (FC) layers: {FC layer (640, 1024), batch normalization, ReLU, dropout(0.5)}, {FC layer (1024, 1024), ReLU, dropout(0.5)}, {FC layer (1024, 64), ReLU}, {FC layer (64, 1)}. The stochastic gradient descent (SGD) optimizer is employed with an initial learning rate of 1e-3 and a Nesterov momentum of 0.9. The learning rate is halved every 10 epochs. According to the validation performance of our algorithm, we uniformly set the hyper-parameters $\lambda$, $\mu$, $\gamma$, and $\tau$ across all datasets. The code and data will be released soon.
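
The auxiliary scoring function $f'$ described above maps the absolute difference between a query feature and a class prototype to a scalar similarity score. A direct PyTorch transcription of the listed layers might look like the sketch below (the placement of the absolute-difference step follows Sec. 3.3).

```python
import torch.nn as nn

class AuxiliaryScorer(nn.Module):
    """f' of Sec. 4.1: MLP on |query feature - class prototype| -> score."""
    def __init__(self, feat_dim=640):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.BatchNorm1d(1024),
            nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, query_feats, prototypes):
        # query_feats: (M, d); prototypes: (N_s, d) -> scores: (M, N_s)
        diff = (query_feats[:, None, :] - prototypes[None, :, :]).abs()
        return self.mlp(diff.flatten(0, 1)).view(query_feats.size(0), -1)
```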

4.2 Main Results

Note that we can employ any FSL model as the baseline in Algorithm 1. In this work, without loss of generality, we apply our meta-training strategies (i.e. MDA and MKD) to three state-of-the-art FSL models: MetaOptNet Lee et al. (2019), IMP Allen et al. (2019), and FEAT Ye et al. (2018). After adopting our strategies, each model is thus named with the suffix ‘+MLMT’ (e.g. MetaOptNet+MLMT). As described in Algorithm 1, we need to sample one $N_s$-way $K_s$-shot source episode and two $N$-way $K$-shot target episodes in each training iteration, which can be regarded as one larger episode in total. For fair comparison, we thus re-implement MetaOptNet Lee et al. (2019), IMP Allen et al. (2019), and FEAT Ye et al. (2018) by employing WRN as the backbone and sampling one episode of the same total size in each training iteration.

The comparative results on the three datasets are shown in Table 1. Models using the same backbones are placed together. ‘Conv-4’ denotes the simple feature extractor with only 4 convolutional blocks, which is widely used in previous works. We can make the following observations: (1) Models using WRN as the backbone generally outperform those adopting other feature extractors, showing that a stronger feature extractor always leads to better results. (2) Models trained with our MDA and MKD strategies (i.e. MLMT) achieve new state-of-the-art results on all three datasets. Importantly, the improvements over their original versions without our strategies range from 1.4% to 4.8%. This clearly validates the effectiveness of our proposed meta-training strategies for meta-learning based FSL. (3) The improvements obtained by our MLMT under the 1-shot setting are generally larger than those under the 5-shot setting. One plausible explanation is that fewer support samples result in less stable models (more prone to poor data sampling when only one shot is sampled), and our meta-training strategies (particularly MKD) can alleviate such negative effects and thus achieve better performance.

4.3 Further Evaluation

Figure 3: Ablative results for our full MLMT strategy (including both MDA and MKD) on the test split of miniImageNet. The error bars indicate the 95% confidence intervals.


Method 1-shot 5-shot


FEAT
FEAT+MDA (CDAN)
FEAT+MDA (AFN)
FEAT+MDA (ours)


Table 2: Comparison among different implementations of MDA on the test split of miniImageNet.


Method EQ 1-shot 5-shot


FEAT -
FEAT+MKD (symKL) w/o EQ
FEAT+MKD (symKL) w/ EQ
FEAT+MKD (KD) w/o EQ
FEAT+MKD (KD) w/ EQ


Table 3: Comparison among different implementations of MKD on the test split of miniImageNet.

Ablative Results. To demonstrate the contribution of each meta-training strategy, we conduct experiments by introducing our strategies one by one into FEAT Ye et al. (2018) on miniImageNet under the 5-way settings. The ablative results in Fig. 3 show that: (1) Adding MDA or MKD alone to the original FEAT model clearly yields performance improvements (see FEAT+MDA vs. FEAT or FEAT+MKD vs. FEAT). It is also observed that MKD outperforms MDA. (2) The combination of MDA and MKD (i.e. MLMT) achieves further improvements (see FEAT+MLMT vs. FEAT+MDA or FEAT+MLMT vs. FEAT+MKD), suggesting that our two strategies are complementary to each other.

Moreover, we compare different implementations of MDA and MKD in Table 2 and Table 3, respectively. Firstly, for our MDA strategy, we adopt CDAN Long et al. (2018) and AFN Xu et al. (2019) as alternative MDA implementations (in place of MDD in Eq. (5)). The results in Table 2 show that the MDD loss is the best for MDA. In our ongoing research, we will exploit new DA losses for MDA. Secondly, for our MKD strategy, we compare our asymmetric knowledge distillation loss (denoted as ‘KD’) in Eq. (18) to the symmetric Kullback–Leibler (KL) divergence loss (denoted as ‘symKL’):

$\mathcal{L}_{symKL}\big(z^{(1)}, z^{(2)}\big) = \mathrm{KL}\big(\sigma(z^{(1)}) \,\|\, \sigma(z^{(2)})\big) + \mathrm{KL}\big(\sigma(z^{(2)}) \,\|\, \sigma(z^{(1)})\big),$   (19)

where $z^{(1)}$ and $z^{(2)}$ are two unnormalized $N$-dimensional scoring vectors. Note that we use query images from the source episode as external queries (denoted as ‘EQ’) when applying MKD over the two target episodes $T_1$ and $T_2$ in Algorithm 1. Therefore, we also conduct experiments to study the effect of EQ. It can be seen from Table 3 that: (1) The asymmetric KD loss leads to better results than the symKL loss. (2) The external queries indeed improve the performance of both KD and symKL, validating our explanation above Eq. (17).
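
For reference, a minimal sketch of the symKL baseline of Eq. (19) on two batches of unnormalized scoring vectors (the small epsilon for numerical stability is our addition):

```python
import torch.nn.functional as F

def symkl_loss(scores_1, scores_2):
    """Symmetric KL divergence of Eq. (19) between two scoring batches."""
    p = F.softmax(scores_1, dim=1)
    q = F.softmax(scores_2, dim=1)
    log_p = (p + 1e-8).log()
    log_q = (q + 1e-8).log()
    kl_pq = (p * (log_p - log_q)).sum(dim=1)
    kl_qp = (q * (log_q - log_p)).sum(dim=1)
    return (kl_pq + kl_qp).mean()
```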

(a) FEAT+MDA
(b) FEAT
(c) FEAT+MKD
(d) FEAT
Figure 4: Visualization of the generalization ability of our two meta-training strategies (i.e. MDA and MKD) on the test split of miniImageNet under the 5-way 5-shot setting. We check the test performance of the learned models at each training epoch.

Visualization Results. We further provide visualization of the generalization ability of our two meta-training strategies (i.e. MDA and MKD) during meta-test in Fig. 4. (1) Visualization of MDA: We randomly sample 1,000 episode pairs from the test split of miniImageNet, where the two 5-way 5-shot episodes in each pair have disjoint sets of classes. We compute the average 5-way classification accuracy over all 2,000 episodes and the average MDD over all 1,000 episode pairs at each training epoch. Note that we compute MDD using the original definition in Zhang et al. (2019b) with our trained $f'$. We present the visualization results of FEAT+MDA and FEAT in Fig. 4(a) and Fig. 4(b), respectively. We can observe that FEAT+MDA has higher accuracies and lower MDD values (i.e. a smaller domain gap between two episodes) than FEAT. This provides direct evidence that our MDA strategy can boost the generalization ability of the learned model during meta-test. (2) Visualization of MKD: We randomly sample 1,000 episode pairs, where the two 5-way 5-shot episodes in each pair have the same set of classes. Similarly, we compute the average accuracy over all 2,000 episodes and the average $\mathcal{L}_{kd}$ in Eq. (14) over all 1,000 episode pairs at each training epoch. The visualization results in Fig. 4(c) and Fig. 4(d) show that FEAT+MKD has higher accuracies and lower $\mathcal{L}_{kd}$ values (i.e. better performance consistency between two episodes) than FEAT. This provides further evidence that our MKD has a better generalization ability during meta-test.

5 Conclusions

We have investigated the meta-learning based FSL problem. For the first time, we have exploited two types of relationships across meta-tasks in the meta-learning framework and modeled them explicitly as two meta-training strategies. Extensive experiments show that our proposed strategies can boost existing episodic-training based FSL methods and achieve new state-of-the-art on three benchmarks. We hope that our current work can inspire more studies on the relationship across different meta-tasks in a meta-learning framework, even beyond the FSL problem.

APPENDIX

In this document, we provide more supporting results to show the effectiveness of our algorithm. Firstly, we show more ablative results on tieredImageNet and CUB. Secondly, we give more visualization results on the generalization ability of our two meta-training strategies (i.e. MDA and MKD) during meta-validation. Finally, we show several examples of the data distribution of meta-tasks.

Appendix A Ablative Results

Similar to the ablation study on miniImageNet, we conduct experiments by introducing our strategies one by one into FEAT on tieredImageNet and CUB under the 5-way settings. The ablative results in Fig. 5 show that: (1) On both tieredImageNet and CUB, adding MDA or MKD alone to the original FEAT model clearly yields performance improvements (see FEAT+MDA vs. FEAT or FEAT+MKD vs. FEAT). It is also observed that MKD slightly outperforms MDA on both datasets. (2) The combination of MDA and MKD (i.e. MLMT) achieves further improvements (see FEAT+MLMT vs. FEAT+MDA or FEAT+MLMT vs. FEAT+MKD), suggesting that our two strategies are complementary to each other.

Appendix B Visualization Results

We provide more visualization results of the generalization ability of our two meta-training strategies (i.e. MDA and MKD) during meta-validation in Fig. 6.

Visualization of MDA. We randomly sample 1,000 episode pairs from the validation split of miniImageNet, where the two 5-way 5-shot episodes in each pair have disjoint sets of classes. We compute the average 5-way classification accuracy over all 2,000 episodes and the average MDD over all 1,000 episode pairs at each training epoch. We present the visualization results of FEAT+MDA and FEAT in Fig. 6(a) and Fig. 6(b), respectively. We can observe that FEAT+MDA has higher accuracies and lower MDD values (i.e. smaller domain gap between two episodes) than FEAT. This provides direct evidence that our MDA strategy can boost the generalization ability of the learned model during meta-validation.

Visualization of MKD. We randomly sample 1,000 episode pairs, where the two 5-way 5-shot episodes in each pair have the same set of classes. Similarly, we compute the average accuracy over all 2,000 episodes and the average $\mathcal{L}_{kd}$ in Eq. (14) over all 1,000 episode pairs at each training epoch. The visualization results in Fig. 6(c) and Fig. 6(d) show that FEAT+MKD has higher accuracies and lower $\mathcal{L}_{kd}$ values (i.e. better performance consistency between two episodes) than FEAT. This provides further evidence that our MKD has a better generalization ability during meta-validation. Moreover, the results of FEAT+MKD after convergence have smaller variance than those of FEAT, which also validates that our MKD can help improve the model stability.

(a) tieredImageNet
(b) CUB
Figure 5: Ablative results for our full MLMT strategy (including both MDA and MKD) on the test split of tieredImageNet and CUB, respectively. The error bars indicate the 95% confidence intervals.
(a) FEAT+MDA
(b) FEAT
(c) FEAT+MKD
(d) FEAT
Figure 6: Visualization of the generalization ability of our two meta-training strategies (i.e. MDA and MKD) on the validation split of miniImageNet under the 5-way 5-shot setting. We check the validation performance of the learned models at each training epoch.

Appendix C Qualitative Results

We further give qualitative results to show the effectiveness of our proposed MLMT. Concretely, we sample five meta-tasks in the test split of miniImageNet under the 5-way 5-shot setting and obtain the feature vectors of all images using CNNs trained with FEAT+MLMT and FEAT, respectively. We then apply t-SNE van der Maaten and Hinton (2008) to project these feature vectors into a 2-dimensional space in Fig. 7. In each small figure, samples with the same color belong to the same class. The two figures in each column represent the same meta-task. Similarly, we show the qualitative results on the test split of miniImageNet under the 5-way 1-shot setting in Fig. 8. We can observe that the feature vectors obtained by FEAT+MLMT are more discriminative than those by FEAT, validating that our MLMT can help improve the generalization ability during meta-test.
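
The projection step itself is standard; a minimal sketch with scikit-learn's t-SNE (our choice of library for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(features, seed=0):
    """Project episode features of shape (M, d) to 2-D for plotting."""
    return TSNE(n_components=2, init="pca",
                random_state=seed).fit_transform(np.asarray(features))
```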

(a) FEAT+MLMT
(b) FEAT
Figure 7: Examples of meta-tasks in the test split of miniImageNet under the 5-way 5-shot setting.
(a) FEAT+MLMT
(b) FEAT
Figure 8: Examples of meta-tasks in the test split of miniImageNet under the 5-way 1-shot setting.

References

  1. Allen et al. (2019). Infinite mixture prototypes for few-shot learning. In ICML, pp. 232–241.
  2. Antonie et al. (2001). Application of data mining techniques for medical image classification. In International Conference on Multimedia Data Mining, pp. 94–101.
  3. Chen et al. (2019a). A closer look at few-shot classification. In ICLR.
  4. Chen et al. (2019b). Transferability vs. discriminability: batch spectral penalization for adversarial domain adaptation. In ICML, pp. 1081–1090.
  5. Cortes et al. (2019). Adaptation based on generalized discrepancy. Journal of Machine Learning Research (JMLR) 20 (1), pp. 1–30.
  6. Dong and Xing (2018). Domain adaption in one-shot learning. In ECML-PKDD, pp. 573–588.
  7. Dvornik et al. (2019). Diversity with cooperation: ensemble methods for few-shot classification. In ICCV, pp. 3723–3731.
  8. Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135.
  9. Flennerhag et al. (2019). Transferring knowledge across learning processes. In ICLR.
  10. Ganin and Lempitsky (2015). Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180–1189.
  11. Gidaris et al. (2019). Boosting few-shot visual learning with self-supervision. In ICCV, pp. 8059–8068.
  12. Gidaris and Komodakis (2019). Generating classification weights with GNN denoising autoencoders for few-shot learning. In CVPR, pp. 21–30.
  13. Gong et al. (2012). Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073.
  14. Goodfellow et al. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  15. Hinton et al. (2015). Distilling the knowledge in a neural network. CoRR abs/1503.02531.
  16. Hou et al. (2019). Cross attention network for few-shot classification. In Advances in Neural Information Processing Systems, pp. 4005–4016.
  17. Jamal and Qi (2019). Task agnostic meta-learning for few-shot learning. In CVPR, pp. 11719–11727.
  18. Jang et al. (2019). Learning what and where to transfer. In ICML, pp. 3030–3039.
  19. Lee et al. (2019). Meta-learning with differentiable convex optimization. In CVPR, pp. 10657–10665.
  20. Li et al. (2003). A bayesian approach to unsupervised one-shot learning of object categories. In ICCV, pp. 1134–1141.
  21. Li et al. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28 (4), pp. 594–611.
  22. Li et al. (2017). Meta-sgd: learning to learn quickly for few shot learning. CoRR abs/1707.09835.
  23. Long et al. (2018). Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, pp. 1647–1657.
  24. Mangla et al. (2019). Charting the right manifold: manifold mixup for few-shot learning. CoRR abs/1907.12087.
  25. Mishra et al. (2018). A simple neural attentive meta-learner. In ICLR.
  26. Munkhdalai and Yu (2017). Meta networks. In ICML, pp. 2554–2563.
  27. Oreshkin et al. (2018). TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731.
  28. Pan et al. (2010). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22 (2), pp. 199–210.
  29. Pinheiro (2018). Unsupervised domain adaptation with similarity learning. In CVPR, pp. 8004–8013.
  30. Qiao et al. (2019). Transductive episodic-wise adaptive metric for few-shot learning. In ICCV, pp. 3603–3612.
  31. Qiao et al. (2018). Few-shot image recognition by predicting parameters from activations. In CVPR, pp. 7229–7238.
  32. Rahman et al. (2020). On minimum discrepancy estimation for deep domain adaptation. In Domain Adaptation for Visual Understanding, pp. 81–94.
  33. Ravi and Larochelle (2017). Optimization as a model for few-shot learning. In ICLR.
  34. Ren et al. (2018). Meta-learning for semi-supervised few-shot classification. In ICLR.
  35. Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  36. Rusu et al. (2019). Meta-learning with latent embedding optimization. In ICLR.
  37. Santoro et al. (2016a). One-shot learning with memory-augmented neural networks. CoRR abs/1605.06065.
  38. Santoro et al. (2016b). Meta-learning with memory-augmented neural networks. In ICML, pp. 1842–1850.
  39. Snell et al. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090.
  40. Sohn et al. (2019). Unsupervised domain adaptation for distance metric learning. In ICLR.
  41. Sun et al. (2019). Meta-transfer learning for few-shot learning. In CVPR, pp. 403–412.
  42. Sung et al. (2018). Learning to compare: relation network for few-shot learning. In CVPR, pp. 1199–1208.
  43. Tseng et al. (2020). Cross-domain few-shot classification via learned feature-wise transformation. In ICLR.
  44. Tzeng et al. (2017). Adversarial discriminative domain adaptation. In CVPR, pp. 2962–2971.
  45. van der Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
  46. Vinyals et al. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
  47. Wah et al. (2011). The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
  48. Xu et al. (2019). Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In ICCV, pp. 1426–1435.
  49. Yang et al. (2012). Unsupervised template learning for fine-grained object recognition. In Advances in Neural Information Processing Systems, pp. 3122–3130.
  50. Ye et al. (2018). Learning embedding adaptation for few-shot learning. CoRR abs/1812.03664.
  51. Yoon et al. (2019). TapNet: neural network augmented with task-adaptive projection for few-shot learning. In ICML, pp. 7115–7123.
  52. Zagoruyko and Komodakis (2016). Wide residual networks. In BMVC.
  53. Zhang et al. (2019a). Variational few-shot learning. In ICCV, pp. 1685–1694.
  54. Zhang et al. (2019b). Bridging theory and algorithm for domain adaptation. In ICML, pp. 7404–7413.
  55. Zhao et al. (2018). Domain-invariant projection learning for zero-shot recognition. In Advances in Neural Information Processing Systems, pp. 1019–1030.
  56. Zou et al. (2019). Consensus adversarial domain adaptation. In AAAI, pp. 5997–6004.