Incremental Few-Shot Learning with Attention Attractor Networks
Abstract
Machine learning classifiers are often trained to recognize a set of pre-defined classes. However, in many applications, it is often desirable to have the flexibility of learning additional concepts, with limited data and without re-training on the full training set. This paper addresses this problem, incremental few-shot learning, where a regular classification network has already been trained to recognize a set of base classes, and several extra novel classes are being considered, each with only a few labeled examples. After learning the novel classes, the model is then evaluated on the overall classification performance on both base and novel classes. To this end, we propose a meta-learning model, the Attention Attractor Network, which regularizes the learning of novel classes. In each episode, we train a set of new weights to recognize novel classes until they converge, and we show that the technique of recurrent backpropagation can backpropagate through the optimization process and facilitate the learning of these parameters. We demonstrate that the learned attractor network can help recognize novel classes while remembering old classes without the need to review the original training set, outperforming various baselines.
1 Introduction
The availability of large-scale datasets with detailed annotation, such as ImageNet imagenet, played a significant role in the recent success of deep learning. The need for such a large dataset is however a limitation, since its collection requires intensive human labor. This is also strikingly different from human learning, where new concepts can be learned from very few examples. One line of work that attempts to bridge this gap is few-shot learning koch2015siamese; matching; proto, where a model learns to output a classifier given only a few labeled examples of the unseen classes. While this is a promising line of work, its practical usability is a concern, because few-shot models only focus on learning novel classes, ignoring the fact that many common classes are readily available in large datasets.
An approach that aims to enjoy the best of both worlds, the ability to learn from large datasets for common classes with the flexibility of few-shot learning for others, is incremental few-shot learning lwof. This setting combines incremental learning, where we want to add new classes without catastrophic forgetting mccloskey1989catastrophic, with few-shot learning, where the new classes, unlike the base classes, have only a small number of examples. One use case to illustrate the problem is a visual aid system. Most objects of interest are common to all users, e.g., cars, pedestrian signals; however, users would also like to augment the system with additional personalized items or important landmarks in their area. Such a system needs to be able to learn new classes from few examples, without harming the performance on the original classes, and typically without access to the dataset used to train the original classes.
In this work we present a novel method for incremental few-shot learning where, during meta-learning, we optimize a regularizer that reduces catastrophic forgetting during incremental few-shot learning. Our proposed regularizer is inspired by attractor networks localist and can be thought of as a memory of the base classes, adapted to the new classes. We also show how this regularizer can be optimized, using recurrent backpropagation rbp; rbp2; rbp3 to backpropagate through the few-shot optimization stage. Finally, we show empirically that our proposed method produces state-of-the-art results in incremental few-shot learning on miniImageNet matching and tieredImageNet fewshotssl tasks.
2 Related Work
Recently, there has been a surge in interest in few-shot learning koch2015siamese; matching; proto; lake2011oneshot, where a model for novel classes is learned with only a few labeled examples. One family of approaches for few-shot learning, including Deep Siamese Networks koch2015siamese, Matching Networks matching and Prototypical Networks proto, follows the line of metric learning. In particular, these approaches use deep neural networks to learn a function that maps the input space to an embedding space where examples belonging to the same category are close and those belonging to different categories are far apart. Recently, garcia2017few proposes a graph-neural-network-based method which captures the information propagation from the labeled support set to the query set. fewshotssl extends Prototypical Networks to leverage unlabeled examples while doing few-shot learning. Despite their simplicity, these methods are very effective and often competitive with the state of the art.
Another class of approaches aims to learn models which can adapt to the episodic tasks. In particular, metalstm treats the long short-term memory (LSTM) as a meta-learner such that it can learn to predict the parameter update of a base learner, e.g., a convolutional neural network (CNN). MAML maml instead learns the hyperparameters or the initial parameters of the base learner by backpropagating through the gradient descent steps. santoro2016one uses a read/write augmented memory, and mishra2017meta combines soft attention with temporal convolutions, which enables retrieval of information from past episodes.
Methods described above belong to the general class of meta-learning models. First proposed in Schmidhuber1987evolutionary; naik1992meta; Thrun1998, meta-learning is a machine learning paradigm where the meta-learner tries to improve the base learner using the learning experiences from multiple tasks. Meta-learning methods typically learn the update policy yet lack an overall learning objective in the few-shot episodes. Furthermore, they could potentially suffer from short-horizon bias shorthorizon if at test time the model is trained for longer steps. To address this problem, diffsolver proposes to use fast-convergent models like logistic regression (LR), which can be backpropagated via a closed-form update rule. Compared to diffsolver, our proposed method using recurrent backpropagation rbp; rbp2; rbp3 is more general as it does not require a closed-form update, and the inner-loop solver can employ any existing continuous optimizer.
Our work is also related to incremental learning, a setting where information arrives continuously while prior knowledge needs to be transferred. A key challenge is catastrophic forgetting mccloskey1989catastrophic; mcclelland1995there, i.e., the model forgets the learned knowledge. Various memory-based models have since been proposed, which store training examples explicitly icarl; mbpa; castro2018end; varcontinual, regularize the parameter updates kirkpatrick2017overcoming, or learn a generative model fearnet. However, in these studies, incremental learning typically starts from scratch, and usually performs worse than a regular model that is trained with all available classes together, since it needs to learn a good representation while dealing with catastrophic forgetting.
Incremental few-shot learning is also known as low-shot learning. To leverage a good representation, hariharan2017lowshot; wang2018lowshot; lwof start off with a pretrained network on a set of base classes, and try to augment the classifier with a batch of new classes that have not been seen during training. hariharan2017lowshot proposes the squared gradient magnitude loss, which makes the classifier learned from the low-shot examples have a smaller gradient value when learning on all examples. wang2018lowshot propose the prototypical matching networks, a combination of prototypical networks and matching networks. The paper also adds hallucination, which generates new examples. lwof proposes an attention-based model which generates weights for novel categories. They also promote the use of cosine similarity between feature representations and weight vectors to classify images.
In contrast, during each few-shot episode, we directly learn a classifier network that is randomly initialized and solved till convergence, unlike lwof, which outputs the prediction directly. Since the model cannot see base class data within the support set of each few-shot learning episode, it is challenging to learn a classifier that jointly classifies both base and novel categories. Towards this end, we propose to add a learned regularizer, which is predicted by a meta-network, the "attention attractor network". The network is learned by differentiating through the few-shot learning optimization iterations. We found that using an iterative solver with the learned regularizer significantly improves the classifier model on the task of incremental few-shot learning.
3 Model
In this section, we first define the setup of incremental few-shot learning, and then we introduce our new model, the Attention Attractor Network, which attends to the set of base classes according to the few-shot training data by using the attractor regularizing term. Figure 1 illustrates the high-level model diagram of our method.
3.1 Incremental Few-Shot Learning
The outline of our meta-learning approach to incremental few-shot learning is: (1) we learn a fixed feature representation and a classifier on a set of base classes; (2) in each training and testing episode we train a novel-class classifier with our meta-learned regularizer; (3) we optimize our meta-learned regularizer on the combined base and novel class classification, adapting it to perform well in conjunction with the base classifier. Details of these stages follow.
Pretraining Stage:
We learn a base model for the regular supervised classification task on a dataset D_a = {(x_i, y_i)}, where x_i is the i-th example from dataset D_a and y_i ∈ {1, …, K} its labeled class. The purpose of this stage is to learn both a good base classifier and a good representation. The parameters of the base classifier are learned in this stage and will be fixed after pretraining. We denote the parameters of the top fully connected layer of the base classifier by W_a ∈ R^{D×K}, where D is the dimension of our learned representation and K is the number of base classes.
Incremental Few-Shot Episodes:
A few-shot dataset D_b is presented, from which we can sample few-shot learning episodes E = (S, Q). Note that this can be the same data source as the pretraining dataset D_a, but sampled episodically. For each N-shot, K'-way episode, there are K' novel classes disjoint from the base classes. Each novel class has N and N' images from the support set S and the query set Q respectively. S and Q can be regarded as this episode's training and validation sets. In each episode we learn a classifier on the support set S whose learnable parameters W_b are called the fast weights, as they are only used during this episode. To evaluate the performance on a joint prediction of both base and novel classes, i.e., a (K + K')-way classification, a mini-batch of base class examples sampled from D_a is also added to Q to form the joint query set. This means that the learning algorithm, which only has access to samples from the novel classes, is evaluated on the joint query set.
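For concreteness, the episode construction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; all names (`sample_episode`, `novel_data`, `base_pool`) are ours. The support set contains only novel-class examples, while the query set mixes novel and base examples for the joint evaluation.

```python
import numpy as np

def sample_episode(novel_data, base_pool, n_shot, k_way, n_query, rng):
    """Sample one incremental few-shot episode (illustrative sketch).

    novel_data: dict {class_id: (num_images, feat_dim) array} of novel classes
    base_pool:  list of (feature, base_label) pairs drawn from the base classes
    Returns: support set (novel only) and joint query set (novel + base).
    """
    classes = rng.choice(sorted(novel_data), size=k_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        order = rng.permutation(len(novel_data[c]))
        for i in order[:n_shot]:                  # N support shots per class
            support.append((novel_data[c][i], episode_label))
        for i in order[n_shot:n_shot + n_query]:  # N' query images per class
            query.append((novel_data[c][i], episode_label))
    # Add an equally sized mini-batch of base examples so that evaluation
    # becomes a joint (K + K')-way prediction over the combined query set.
    picks = rng.choice(len(base_pool), size=len(query), replace=False)
    query += [base_pool[i] for i in picks]
    return support, query
```

The returned support set is what the fast weights are fit on; only the query set ever mixes in base classes.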
Meta-Learning Stage:
In meta-training, we iteratively sample few-shot episodes and try to learn the meta-parameters θ in order to minimize the joint prediction loss on the joint query set. In particular, we design a regularizer R(·, θ) such that the fast weights are learned via minimizing the loss ℓ(W_b) + R(W_b, θ), where ℓ is typically the cross-entropy loss for few-shot classification. The meta-learner tries to learn meta-parameters θ such that the optimal fast weights W_b* w.r.t. the above loss function perform well on the joint query set. In our model, the meta-parameters θ are encapsulated in our attention attractor network, which produces regularizers for the fast weights in the few-shot learning objective.
Joint Prediction on Base and Novel Classes:
We now introduce the details of our joint prediction framework performed in each few-shot episode. First, we construct an episodic classifier, e.g., a logistic regression (LR) model or a multi-layer perceptron (MLP), which takes the learned image features as inputs and classifies them according to the few-shot classes.
During training on the support set S, we learn the fast weights W_b via minimizing the following regularized cross-entropy objective, which we call the episodic objective:

L(W_b, θ) = Σ_{(x_i, y_i) ∈ S} ℓ(ŷ_i, y_i) + R(W_b, θ).    (1)

This is a general formulation, and the specific functional form of the regularization term R(W_b, θ) will be specified later. The predicted output is obtained via ŷ_i = softmax(h(x_i; W_b)), where h is our classification network and W_b are the fast weights in the network. In the case of LR, h is a linear model: h(x; W_b) = W_b^⊤ x. h can also be an MLP for more expressive power.
During testing on the query set, in order to predict both base and novel classes, we directly augment the softmax with the fixed base class weights W_a: ŷ(x) = softmax([W_a^⊤ x ; h(x; W_b*)]), where W_b* are the optimal fast weights that minimize the regularized classification objective in Eq. (1).
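In the LR case, this augmented softmax is just a concatenation of the two logit blocks before a single normalization. A minimal sketch (our variable names; weights stored as one column per class):

```python
import numpy as np

def joint_predict(x, w_base, w_fast):
    """Joint (K + K')-way prediction: concatenate the frozen base-class logits
    with the episodic fast-weight logits and take a single softmax.
    x: length-D feature vector; w_base: D x K (fixed); w_fast: D x K'.
    Returns a probability vector over the K + K' joint classes.
    """
    logits = np.concatenate([x @ w_base, x @ w_fast])
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

Because both blocks share one softmax, overly confident fast weights can suppress the base logits, which is exactly the interference the attractor regularizer targets.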
3.2 Attention Attractor Networks
Directly learning in the few-shot episode, e.g., by setting R to zero or to simple weight decay, can cause catastrophic forgetting on the base classes. This is because W_b, which is trained to maximize the correct novel class probability, can dominate the base classes in the joint prediction. In this section, we introduce the Attention Attractor Network to address this problem. The key feature of our attractor network is the regularization term R:

R(W_b, θ) = Σ_{k'=1}^{K'} (w_{k'} − u_{k'})^⊤ diag(exp(γ)) (w_{k'} − u_{k'}),    (2)
where u_{k'} is the so-called attractor, w_{k'} is the k'-th column of W_b, and γ is a learned vector. This sum of squared Mahalanobis distances from the attractors adds a bias to the learning signal arriving solely from novel classes. Note that for a classifier such as an MLP, one can extend this regularization term in a layer-wise manner. Specifically, one can have separate attractors per layer, and the number of attractors equals the number of output dimensions of that layer.
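A sketch of the regularizer in Eq. (2) under our reading of it, with the diagonal metric parameterized as exp(γ) so it stays positive. The function name and variable names are ours, not the paper's code:

```python
import numpy as np

def attractor_regularizer(w_fast, attractors, log_gamma):
    """Sum of squared Mahalanobis distances between fast-weight columns and
    their attractors: sum_k' (w_k' - u_k')^T diag(exp(gamma)) (w_k' - u_k').
    w_fast, attractors: D x K' arrays (one column per novel class);
    log_gamma: length-D learned vector; exp(log_gamma) keeps the metric
    positive semi-definite, so R stays convex in w_fast.
    """
    diff = w_fast - attractors                 # D x K' residuals
    return float(np.sum(np.exp(log_gamma)[:, None] * diff ** 2))
```

The exp parameterization is what makes the convexity claim later in this section hold regardless of the learned γ.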
To ensure that the model performs well on base classes, the attractors must contain some information about examples from base classes. Since we cannot directly access these base examples, we propose to use the slow weights W_a to encode such information. Specifically, each base class k has a learned attractor vector U_k stored in the memory matrix U = [U_1, …, U_K]. It is computed as U_k = f_φ(W_{a,k}), where f_φ is an MLP whose learnable parameters are φ and W_{a,k} is the k-th column of W_a. For each novel class k', its classifier w_{k'} is regularized towards its attractor u_{k'}, which is a weighted sum of the U_k vectors. Intuitively, the weighting is an attention mechanism where each novel class attends to the base classes according to the level of interference, i.e., how prediction of new class k' causes the forgetting of base class k.
For each class k' in the support set, we compute the cosine similarity between the average representation of the class and each base weight column, then normalize using a softmax function:

a_{k',k} = exp(τ A(x̄_{k'}, W_{a,k})) / Σ_{j=1}^{K} exp(τ A(x̄_{k'}, W_{a,j})),    (3)

where A(·, ·) is the cosine similarity function, x̄_{k'} is the average representation of the inputs of class k' in the support set, and τ is a learnable temperature scalar. a_{k'} encodes a normalized pairwise attention matrix between the novel classes and the base classes. The attention vector is then used to compute a linear weighted sum of entries in the memory matrix U: u_{k'} = Σ_{k=1}^{K} a_{k',k} U_k + u_0, where u_0 is an embedding vector and serves as a bias for the attractor.
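The attention step in Eq. (3) and the subsequent memory lookup can be sketched as follows. This is a minimal NumPy rendering under our notation (names are illustrative), assuming the average support feature per novel class is precomputed:

```python
import numpy as np

def attention_attractors(x_mean, w_base, memory, u0, tau):
    """Each novel class attends over the base-class weight columns by cosine
    similarity (softmax with temperature tau); its attractor is the
    attention-weighted sum of memory slots plus the shared bias u0.
    x_mean: K' x D average support features per novel class
    w_base: D x K base weights; memory: K x D slots U_k; u0: length-D bias
    """
    xn = x_mean / np.linalg.norm(x_mean, axis=1, keepdims=True)
    wn = w_base / np.linalg.norm(w_base, axis=0, keepdims=True)
    sim = tau * (xn @ wn)                       # K' x K cosine similarities
    a = np.exp(sim - sim.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)           # row-wise softmax attention
    return a @ memory + u0                      # K' x D attractors u_{k'}
```

With a large temperature the attention saturates and each novel class is pulled almost entirely towards the memory slot of its most similar base class.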
Our design takes inspiration from attractor networks attractor; localist, where for each base class one learns an "attractor" that stores the relevant memory regarding that class. We call our full model "dynamic attractors", as they may vary with each episode even after meta-learning. In contrast, if we only have the bias term u_0, i.e., a single attractor shared by all novel classes, it will not change from one episode to the other after meta-learning. We call this model variant the "static attractor".
In summary, our meta-parameters θ include φ, u_0, γ, and τ, which is on the same scale as the number of parameters in W_a. It is important to note that R is convex w.r.t. W_b. Therefore, if we use the LR model as the classifier, the overall training objective on episodes in Eq. (1) is convex, which implies that the optimum is guaranteed to be unique and achievable. Here we emphasize that the optimal parameters W_b* are functions of the meta-parameters θ and the few-shot samples S.
During meta-learning, θ is updated to minimize an expected loss on the query set Q, which contains both base and novel classes, averaging over all few-shot learning episodes:

min_θ E_{(S, Q)} [ Σ_{(x, y) ∈ Q} ℓ(ŷ(x), y) ],    (4)

where the predicted class is given by ŷ(x) = softmax([W_a^⊤ x ; h(x; W_b*(θ, S))]).
3.3 Learning via Recurrent Backpropagation
As there is no closed-form solution to the episodic objective (the optimization problem in Eq. 1), in each episode we need to minimize L to obtain W_b* through an iterative optimizer. The question is how to efficiently compute ∂W_b*/∂θ, i.e., backpropagating through the optimization. One option is to unroll the iterative optimization process in the computation graph and use backpropagation through time (BPTT) bptt. However, the number of iterations for a gradient-based optimizer to converge can be on the order of thousands, and BPTT can be computationally prohibitive. Another way is to use truncated BPTT tbptt (TBPTT), which optimizes for T steps of gradient-based optimization and is commonly used in meta-learning problems. However, when T is small the training objective can be significantly biased.
Alternatively, the recurrent backpropagation (RBP) algorithm rbp2; rbp3; rbp allows us to backpropagate through the fixed point efficiently without unrolling the computation graph and storing intermediate activations. Consider a vanilla gradient descent process on W_b with step size α. The update can be written as W_b^{(t+1)} = Φ(W_b^{(t)}) ≜ W_b^{(t)} − α ∇L(W_b^{(t)}). Since the difference Φ(W_b*) − W_b* = −α ∇L(W_b*) is identically zero at the fixed point, using the implicit function theorem we have ∂W_b*/∂θ = (I − J_Φ)^{−1} ∂Φ/∂θ, where J_Φ denotes the Jacobian matrix of the mapping Φ evaluated at W_b*. Algorithm 1 outlines the key steps for learning the episodic objective using RBP in the incremental few-shot learning setting. Note that the RBP algorithm implicitly inverts (I − J_Φ) by computing the matrix-inverse vector product, and has the same time complexity as truncated BPTT given the same number of unrolled steps, but RBP does not have to store intermediate activations.
Damped Neumann RBP
To compute the matrix-inverse vector product (I − J^⊤)^{−1} v, rbp propose to use the Neumann series: (I − J^⊤)^{−1} v = Σ_{n=0}^{∞} (J^⊤)^n v. Note that J^⊤ v can be computed by standard backpropagation. However, directly applying the Neumann RBP algorithm sometimes leads to numerical instability. Therefore, we propose to add a damping term εI to J^⊤. This results in the following update: v^{(n+1)} = (J^⊤ − εI) v^{(n)}, accumulated as Σ_n v^{(n)}. In practice, we found that a damping term with ε = 0.1 helps alleviate the issue significantly.
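The damped series can be written as a short fixed-point loop that only needs Jacobian-vector products (in RBP, J^⊤ v comes from one backprop pass). A numerical sketch, with the caveat that the damping trades a small bias in the hypergradient for stability; the function name is ours:

```python
import numpy as np

def damped_neumann_rbp(jvp, v, n_steps, eps=0.1):
    """Approximate (I - J)^{-1} v with the damped Neumann series
    sum_n ((J - eps*I)^n v). `jvp(u)` returns J @ u (or J^T @ u when v is
    the outer-loss gradient, as in RBP). eps=0 recovers the undamped
    series; eps > 0 stabilizes it at the cost of a small bias.
    """
    term = v.copy()    # current series term (J - eps*I)^n v
    total = v.copy()   # running partial sum
    for _ in range(n_steps):
        term = jvp(term) - eps * term
        total = total + term
    return total
```

As a sanity check, for gradient descent W ← W − α∇L on a quadratic loss with Hessian A, we have J = I − αA, so the undamped series should converge to (αA)^{-1} v.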
4 Experiments
We experiment on two few-shot classification datasets, miniImageNet and tieredImageNet. Both are subsets of ImageNet imagenet, with image sizes reduced to 84 × 84 pixels. We also modified the datasets to accommodate the incremental few-shot learning setting. ¹Code released at: https://github.com/renmengye/incfewshotattractorpublic
4.1 Datasets


miniImageNet Proposed by matching, miniImageNet contains 100 object classes and 60,000 images. We used the splits proposed by metalstm, where training, validation, and testing have 64, 16 and 20 classes respectively.

tieredImageNet Proposed by fewshotssl, tieredImageNet is a larger subset of ILSVRC-12. It features a categorical split among training, validation, and testing subsets. The categorical split means that classes that belong to the same high-level category, e.g. "working dog" and "terrier" or some other dog breed, are not split between training, validation and test. This is a harder task, but one that more strictly evaluates generalization to new classes. It is also an order of magnitude larger than miniImageNet.
Table 1: Comparison of methods for incremental few-shot learning.

Method | Few-shot learner | Episodic objective | Attention mechanism
Imprint qi2018imprinting | Prototypes | N/A | N/A
LwoF lwof | Prototypes + base classes | N/A | Attention on base classes
Ours | A fully trained classifier | Cross entropy on novel classes | Attention on learned attractors
Table 2: Incremental few-shot learning results on miniImageNet.

Model | 1-shot Acc. ↑ | 1-shot Δ ↓ | 5-shot Acc. ↑ | 5-shot Δ ↓
ProtoNet proto | 42.73 ± 0.15 | 20.21 | 57.05 ± 0.10 | 31.72
Imprint qi2018imprinting | 41.10 ± 0.20 | 22.49 | 44.68 ± 0.23 | 27.68
LwoF lwof | 52.37 ± 0.20 | 13.65 | 59.90 ± 0.20 | 14.18
Ours | 54.95 ± 0.30 | 11.84 | 63.04 ± 0.30 | 10.66

Table 3: Incremental few-shot learning results on tieredImageNet.

Model | 1-shot Acc. ↑ | 1-shot Δ ↓ | 5-shot Acc. ↑ | 5-shot Δ ↓
ProtoNet proto | 30.04 ± 0.21 | 29.54 | 41.38 ± 0.28 | 26.39
Imprint qi2018imprinting | 39.13 ± 0.15 | 22.26 | 53.60 ± 0.18 | 16.35
LwoF lwof | 52.40 ± 0.33 | 8.27 | 62.63 ± 0.31 | 6.72
Ours | 56.11 ± 0.33 | 6.11 | 65.52 ± 0.31 | 4.48

Δ = average decrease in accuracy caused by joint prediction within base and novel classes.
↑ (↓) represents that higher (lower) is better.
4.2 Experiment setup
We use a standard ResNet backbone resnet to learn the feature representation through supervised training. For miniImageNet experiments, we follow mishra2017meta and use a modified version of ResNet-10. For tieredImageNet, we use the standard ResNet-18 resnet, but replace all batch normalization batchnorm layers with group normalization groupnorm, as there is a large distributional shift from training to testing in tieredImageNet due to the categorical splits. We used standard data augmentation, with random crops and horizontal flips. We use the same pretrained checkpoint as the starting point for meta-learning.
In the meta-learning stage, as well as the final evaluation, we sample a few-shot episode from D_b, together with a regular mini-batch from D_a. The base class images are added to the query set of the few-shot episode. The base and novel classes are maintained in equal proportion in our experiments. For all the experiments, we consider 5-way classification with 1 or 5 support examples (i.e. shots). In the experiments, we use a query set of size 25 × 2 = 50.
We use L-BFGS zhu1997algorithm to solve the inner loop of our models to make sure W_b converges. We use the ADAM kingma2014adam optimizer for meta-learning with a learning rate of 1e-3, which decays by a factor of 10 after 4,000 steps, for a total of 8,000 steps. We fix recurrent backpropagation to 20 iterations and ε = 0.1.
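To make the inner loop concrete, here is a dependency-free sketch that fits the episodic logistic-regression fast weights to (near) convergence on a support set. Plain gradient descent with weight decay stands in for the L-BFGS solver and the learned regularizer; all names are ours:

```python
import numpy as np

def fit_fast_weights(feats, labels, k_way, wd=1e-2, lr=0.5, steps=500):
    """Minimize regularized cross-entropy over the support set by gradient
    descent (a stand-in for the paper's L-BFGS inner solver).
    feats: N x D support features; labels: length-N ints in [0, k_way).
    Returns the D x k_way fast-weight matrix.
    """
    n, d = feats.shape
    w = np.zeros((d, k_way))
    for _ in range(steps):
        logits = feats @ w
        logits -= logits.max(axis=1, keepdims=True)   # stable softmax
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad_logits = p
        grad_logits[np.arange(n), labels] -= 1.0      # d(CE)/d(logits)
        grad = feats.T @ grad_logits / n + wd * w     # + weight-decay term
        w -= lr * grad
    return w
```

Because the objective is convex in w (cf. Sec. 3.2), any reasonable first-order or quasi-Newton solver reaches the same optimum; L-BFGS simply gets there in far fewer iterations.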
We study two variants of the classifier network. The first is a logistic regression model with a single weight matrix W_b. The second is a 2-layer fully connected MLP model with 40 hidden units in the middle and tanh nonlinearity. To make training more efficient, we also add a shortcut connection in our MLP, which directly links the input to the output. In the second stage of training, we keep all backbone weights frozen and only train the meta-parameters θ.
4.3 Evaluation metrics
We consider the following evaluation metrics: 1) overall accuracy on individual query sets and the joint query set ("Base", "Novel", and "Both"); and 2) decrease in performance caused by joint prediction within the base and novel classes, considered separately ("Δ_a" and "Δ_b"). Finally we take the average Δ = (Δ_a + Δ_b)/2 as a key measure of the overall decrease in accuracy.
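Concretely, the decrease metrics can be computed as follows (a sketch with our own function name; accuracies just need to be in a consistent unit):

```python
def joint_prediction_deltas(base_alone, base_joint, novel_alone, novel_joint):
    """Decrease in accuracy caused by joint prediction: Delta_a for the base
    classes, Delta_b for the novel classes, and their average Delta.
    Inputs are accuracies, e.g. fractions in [0, 1] or percentages.
    """
    delta_a = base_alone - base_joint      # base-class drop under joint softmax
    delta_b = novel_alone - novel_joint    # novel-class drop under joint softmax
    return delta_a, delta_b, 0.5 * (delta_a + delta_b)
```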
4.4 Comparisons
We implemented and compared to three methods. First, we adapted Prototypical Networks proto to the incremental few-shot setting. For each base class we store a base representation, which is the average representation (prototype) over all images belonging to the base class. During the few-shot learning stage, we again average the representation of the few-shot classes and add them to the bank of base representations. Finally, we retrieve the nearest neighbor by comparing the representation of a test image with entries in the representation store. In summary, both W_a and W_b are stored as the average representation of all images seen so far that belong to a certain class. We also compare to the following methods:


Weights Imprinting ("Imprint") qi2018imprinting: the base weights W_a are learned regularly through supervised pretraining, and W_b are computed using prototypical averaging.

Learning without Forgetting ("LwoF") lwof: Similar to qi2018imprinting, W_b are computed using prototypical averaging. In addition, W_a is fine-tuned during episodic meta-learning. We implemented the most advanced variant proposed in the paper, which involves a class-wise attention mechanism. This model is the previous state-of-the-art method on incremental few-shot learning, and has better performance compared to other low-shot models wang2018lowshot; hariharan2017lowshot.
4.5 Results
We first evaluate our vanilla approach on the standard few-shot classification benchmark where no base classes are present in the query set. Our vanilla model consists of a pretrained CNN and a single-layer logistic regression with weight decay learned from scratch; this model performs on par with other competitive meta-learning approaches (1-shot 55.40 ± 0.51, 5-shot 70.17 ± 0.46). Note that our model uses the same backbone architecture as mishra2017meta and lwof, and is directly comparable with their results. Similar findings of strong results using simple logistic regression on few-shot classification benchmarks are also recently reported in closerlook. Our full model has similar performance as the vanilla model on pure few-shot benchmarks, and the full table is available in the Supp. Materials.
Next, we compare our models to other methods on incremental few-shot learning benchmarks in Tables 2 and 3. On both benchmarks, our best-performing model shows a significant margin over the prior works that predict the prototype representation without using an iterative optimization proto; qi2018imprinting; lwof.
Table 4: Ablation studies on miniImageNet.

Model | 1-shot Acc. ↑ | 1-shot Δ ↓ | 5-shot Acc. ↑ | 5-shot Δ ↓
LR | 52.74 ± 0.24 | 13.95 | 60.34 ± 0.20 | 13.60
LR +S | 53.63 ± 0.30 | 12.53 | 62.50 ± 0.30 | 11.29
LR +A | 55.31 ± 0.32 | 11.72 | 63.00 ± 0.29 | 10.80
MLP | 49.36 ± 0.29 | 16.78 | 60.85 ± 0.29 | 12.62
MLP +S | 54.46 ± 0.31 | 11.74 | 62.79 ± 0.31 | 10.77
MLP +A | 54.95 ± 0.30 | 11.84 | 63.04 ± 0.30 | 10.66

Table 5: Ablation studies on tieredImageNet.

Model | 1-shot Acc. ↑ | 1-shot Δ ↓ | 5-shot Acc. ↑ | 5-shot Δ ↓
LR | 48.84 ± 0.23 | 10.44 | 62.08 ± 0.20 | 8.00
LR +S | 55.36 ± 0.32 | 6.88 | 65.53 ± 0.30 | 4.68
LR +A | 55.98 ± 0.32 | 6.07 | 65.58 ± 0.29 | 4.39
MLP | 41.22 ± 0.35 | 10.61 | 62.70 ± 0.31 | 7.44
MLP +S | 56.16 ± 0.32 | 6.28 | 65.80 ± 0.31 | 4.58
MLP +A | 56.11 ± 0.33 | 6.11 | 65.52 ± 0.31 | 4.48

"+S" stands for static attractors, and "+A" for attention attractors.
4.6 Ablation studies
To understand the effectiveness of each part of the proposed model, we consider the following variants:


Vanilla ("LR", "MLP") optimizes a logistic regression or an MLP network at each few-shot episode, with a weight decay regularizer.

Static attractor ("+S") learns a fixed attractor center u_0 and attractor slope γ shared by all classes.

Attention attractor ("+A") learns the full attention attractor model. For MLP models, the weights below the final layer are controlled by attractors predicted by the average representation across all the episodes. f_φ is an MLP with one hidden layer of 50 units.
Tables 4 and 5 show the ablation experiment results. In all cases, the learned regularization function shows better performance than a manually set weight decay constant on the classifier network, in terms of both jointly predicting base and novel classes, as well as less degradation from individual prediction. On miniImageNet, our attention attractors have a clear advantage over static attractors.
Formulating the classifier as an MLP network is slightly better than the linear models in our experiments. Although the final performance is similar, our RBP-based algorithm has the flexibility of equipping the fast episodic model with more capacity. Unlike diffsolver, we do not rely on an analytic form of the gradients of the optimization process.
Comparison to truncated BPTT (TBPTT)
An alternative way to learn the regularizer is to unroll the inner optimization for a fixed number of steps in a differentiable computation graph, and then backpropagate through time. Truncated BPTT is a popular learning algorithm in many recent meta-learning approaches gd2; metalstm; maml; mbpa; metareg. As shown in Figure 2, the performance of TBPTT-learned models is comparable to ours; however, when solved to convergence at test time, the performance of the TBPTT models drops significantly. This is expected, as they are only guaranteed to work well for a certain number of steps, and fail to learn a good regularizer. While an early-stopped TBPTT model can do equally well, in practice it is hard to tell when to stop; whereas for the RBP model, doing the full episodic training is very fast since the number of support examples is small.
Visualization of attractor dynamics
We visualize attractor dynamics in Figure 3. Our learned attractors pull the fast weights close towards the base class weights. In comparison, lwof only modifies the prototypes slightly.
Varying the number of base classes
While the framework proposed in this paper cannot be directly applied to class-incremental continual learning, as there is no module for memory consolidation, we can simulate the continual learning process by varying the number of base classes, to see how the proposed models are affected by different stages of continual learning. Figure 4 shows that the learned regularizers consistently improve over baselines with weight decay only. The overall accuracy increases from 50 to 150 base classes due to better representations in the backbone network, and drops at 200 classes due to a more challenging classification task.
5 Conclusion
Incremental few-shot learning, the ability to jointly predict based on a set of pre-defined concepts as well as additional novel concepts, is an important step towards making machine learning models more flexible and usable in everyday life. In this work, we propose an attention attractor model, which regulates a per-episode training objective by attending to the set of base classes. We show that our iterative model, which solves the few-shot objective till convergence, is better than baselines that do one-step inference, and that recurrent backpropagation is an effective and modular tool for learning in a general meta-learning setting, whereas truncated backpropagation through time fails to learn functions that converge well. Future directions of this work include sequential iterative learning of few-shot novel concepts, and hierarchical memory organization.
Acknowledgment
Supported by NSERC and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.
References
Appendix A Regular Few-Shot Classification
We include standard 5-way few-shot classification results in Table 6. As mentioned in the main text, a simple logistic regression model can achieve competitive performance on few-shot classification using pretrained features. Our full model shows similar performance on regular few-shot classification. This confirms that the learned regularizer is mainly solving the interference problem between the base and novel classes.
Table 6: Regular 5-way few-shot classification on miniImageNet.

Model | Backbone | 1-shot | 5-shot
MatchingNets matching | C64 | 43.60 | 55.30
Meta-LSTM metalstm | C32 | 43.40 ± 0.77 | 60.20 ± 0.71
MAML maml | C64 | 48.70 ± 1.84 | 63.10 ± 0.92
RelationNet relationnet | C64 | 50.44 ± 0.82 | 65.32 ± 0.70
R2-D2 diffsolver | C256 | 51.20 ± 0.60 | 68.20 ± 0.60
SNAIL mishra2017meta | ResNet | 55.71 ± 0.99 | 68.88 ± 0.92
ProtoNet proto | C64 | 49.42 ± 0.78 | 68.20 ± 0.66
ProtoNet* proto | ResNet | 50.09 ± 0.41 | 70.76 ± 0.19
LwoF lwof | ResNet | 55.45 ± 0.89 | 70.92 ± 0.35
LR | ResNet | 55.40 ± 0.51 | 70.17 ± 0.46
Ours Full | ResNet | 55.75 ± 0.51 | 70.14 ± 0.44
Appendix B Visualization of Few-Shot Episodes
We include more visualizations of few-shot episodes in Figure 5, highlighting the differences between our method and "Dynamic Few-Shot Learning without Forgetting" lwof.
Appendix C Visualization of Attention Attractors
To further understand the attractor mechanism, we picked 5 semantic classes in miniImageNet and visualized their attention attractors across 20 episodes, shown in Figure 6. The attractors roughly form semantic clusters, whereas the static attractor stays in the center of all attractors.
Model  1-shot Acc.  Δ  Δa  Δb  5-shot Acc.  Δ  Δa  Δb
LR  52.74 ± 0.24  13.95  8.98  24.32  60.34 ± 0.20  13.60  10.81  15.97
LR +S  53.63 ± 0.30  12.53  9.44  15.62  62.50 ± 0.30  11.29  13.84  8.75
LR +A  55.31 ± 0.32  11.72  12.72  10.71  63.00 ± 0.29  10.80  13.59  8.01
MLP  49.36 ± 0.29  16.78  8.95  24.61  60.85 ± 0.29  12.62  11.35  13.89
MLP +S  54.46 ± 0.31  11.74  12.73  10.74  62.79 ± 0.31  10.77  12.61  8.80
MLP +A  54.95 ± 0.30  11.84  12.81  10.87  63.04 ± 0.30  10.66  12.55  8.77
Model  1-shot Acc.  Δ  Δa  Δb  5-shot Acc.  Δ  Δa  Δb
LR  48.84 ± 0.23  10.44  11.65  9.24  62.08 ± 0.20  8.00  5.49  10.51
LR +S  55.36 ± 0.32  6.88  7.21  6.55  65.53 ± 0.30  4.68  4.72  4.63
LR +A  55.98 ± 0.32  6.07  6.64  5.51  65.58 ± 0.29  4.39  4.87  3.91
MLP  41.22 ± 0.35  10.61  11.25  9.98  62.70 ± 0.31  7.44  6.05  8.82
MLP +S  56.16 ± 0.32  6.28  6.83  5.73  65.80 ± 0.31  4.58  4.66  4.51
MLP +A  56.11 ± 0.33  6.11  6.79  5.43  65.52 ± 0.31  4.48  4.91  4.05
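The three unlabeled columns after each accuracy in the tables above are consistent with the metrics of the main text: Δa and Δb, the accuracy drops on base and novel classes caused by predicting over the joint label space, and Δ, their average. A hedged sketch (the helper name and argument names are hypothetical):

```python
def joint_prediction_deltas(base_alone, base_joint, novel_alone, novel_joint):
    """Accuracy decrease (percentage points) caused by predicting over the
    joint base + novel label space instead of each subset alone."""
    delta_a = base_alone - base_joint      # drop on base classes
    delta_b = novel_alone - novel_joint    # drop on novel classes
    return (delta_a + delta_b) / 2.0, delta_a, delta_b
```

For example, base accuracy dropping from 70.00 to 57.28 and novel accuracy from 50.00 to 39.29 gives Δa = 12.72, Δb = 10.71, and Δ ≈ 11.72.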
Appendix D Dataset Statistics
In this section, we include more details on the datasets we used in our experiments.
Table 9: Dataset splits and statistics for miniImageNet and tieredImageNet.

Classes  Purpose  miniImageNet Split  N. Cls  N. Img  tieredImageNet Split  N. Cls  N. Img
Base  Train  Train-Train  64  38,400  Train-A-Train  200  203,751
Base  Val  Train-Val  64  18,748  Train-A-Val  200  25,460
Base  Test  Train-Test  64  19,200  Train-A-Test  200  25,488
Novel  Train  Train-Train  64  38,400  Train-B  151  193,996
Novel  Val  Val  16  9,600  Val  97  124,261
Novel  Test  Test  20  12,000  Test  160  206,209
D.1 Validation and testing splits for base classes
In standard few-shot learning, the meta-training, validation, and test sets contain disjoint sets of object classes. In our incremental few-shot learning setting, however, evaluating the model on base class predictions requires additional validation and test splits of the meta-training set. Splits and dataset statistics are listed in Table 9. For miniImageNet, lwof released additional images for evaluating the training classes, namely “Train-Val” and “Train-Test”. For tieredImageNet, we split out 20% of the images for validation and testing of the base classes.
D.2 Novel classes
In the miniImageNet experiments, the same training set is used for both the base and the novel classes. In order to pretend that the classes in the few-shot episode are novel, following lwof, we masked them out of the 64 base classes. In other words, we essentially train for a 59+5 classification task. We found that under this setting, the progress of meta-learning in the second stage is limited, since all classes have already been seen before.
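The masking step can be sketched as follows; the helper name, class indices, and mask value are illustrative, not the authors' exact implementation.

```python
import numpy as np

def mask_base_logits(logits, episode_classes, mask_value=-1e9):
    # Suppress the episode's 5 classes in the 64-way base classifier, so the
    # joint task becomes 59 base + 5 novel classes (hypothetical helper).
    masked = logits.copy()
    masked[..., episode_classes] = mask_value
    return masked

rng = np.random.RandomState(0)
base_logits = rng.randn(4, 64)            # logits for 4 query images
episode_classes = [3, 17, 21, 40, 63]     # the 5 classes "pretended" novel
masked = mask_base_logits(base_logits, episode_classes)
# argmax over the masked logits can never pick a masked-out class; joint
# prediction then appends the 5 novel-class logits to these 59 base ones.
```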
In the tieredImageNet experiments, to emulate learning novel classes during the second stage, we split the training classes into 200 base classes (“Train-A”) and 151 novel classes (“Train-B”), for meta-learning purposes only. During the first stage, the classifier is trained on the Train-A-Train data. In each meta-learning episode, we sample few-shot examples from the novel classes (Train-B) and a base query set from Train-A-Val.
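The episode sampling described above can be sketched as below; the dataset dictionaries (class id → list of image ids) and parameter names are placeholders, and the real pipeline queries novel classes against the full base split rather than a small sample.

```python
import random

def sample_episode(train_b, train_a_val, n_way=5, n_shot=1, n_query=5):
    # One incremental few-shot episode: support examples come from novel
    # (Train-B) classes; queries come from both the novel classes and the
    # base (Train-A-Val) classes.
    novel = random.sample(sorted(train_b), n_way)
    support = {c: random.sample(train_b[c], n_shot) for c in novel}
    novel_query = {c: random.sample([i for i in train_b[c]
                                     if i not in support[c]], n_query)
                   for c in novel}
    base_query = {c: random.sample(train_a_val[c], n_query)
                  for c in random.sample(sorted(train_a_val), n_way)}
    return support, novel_query, base_query
```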
200 Base Classes (“Train-A”):
n02128757, n02950826, n01694178, n01582220, n03075370, n01531178, n03947888, n03884397, n02883205, n03788195, n04141975, n02992529, n03954731, n03661043, n04606251, n03344393, n01847000, n03032252, n02128385, n04443257, n03394916, n01592084, n02398521, n01748264, n04355338, n02481823, n03146219, n02963159, n02123597, n01675722, n03637318, n04136333, n02002556, n02408429, n02415577, n02787622, n04008634, n02091831, n02488702, n04515003, n04370456, n02093256, n01693334, n02088466, n03495258, n02865351, n01688243, n02093428, n02410509, n02487347, n03249569, n03866082, n04479046, n02093754, n01687978, n04350905, n02488291, n02804610, n02094433, n03481172, n01689811, n04423845, n03476684, n04536866, n01751748, n02028035, n03770439, n04417672, n02988304, n03673027, n02492660, n03840681, n02011460, n03272010, n02089078, n03109150, n03424325, n02002724, n03857828, n02007558, n02096051, n01601694, n04273569, n02018207, n01756291, n04208210, n03447447, n02091467, n02089867, n02089973, n03777754, n04392985, n02125311, n02676566, n02092002, n02051845, n04153751, n02097209, n04376876, n02097298, n04371430, n03461385, n04540053, n04552348, n02097047, n02494079, n03457902, n02403003, n03781244, n02895154, n02422699, n04254680, n02672831, n02483362, n02690373, n02092339, n02879718, n02776631, n04141076, n03710721, n03658185, n01728920, n02009229, n03929855, n03721384, n03773504, n03649909, n04523525, n02088632, n04347754, n02058221, n02091635, n02094258, n01695060, n02486410, n03017168, n02910353, n03594734, n02095570, n03706229, n02791270, n02127052, n02009912, n03467068, n02094114, n03782006, n01558993, n03841143, n02825657, n03110669, n03877845, n02128925, n02091032, n03595614, n01735189, n04081281, n04328186, n03494278, n02841315, n03854065, n03498962, n04141327, n02951585, n02397096, n02123045, n02095889, n01532829, n02981792, n02097130, n04317175, n04311174, n03372029, n04229816, n02802426, n03980874, n02486261, n02006656, n02025239, n03967562, n03089624, n02129165, 
n01753488, n02124075, n02500267, n03544143, n02687172, n02391049, n02412080, n04118776, n03838899, n01580077, n04589890, n03188531, n03874599, n02843684, n02489166, n01855672, n04483307, n02096177, n02088364.
151 Novel Classes (“Train-B”):
n03720891, n02090379, n03134739, n03584254, n02859443, n03617480, n01677366, n02490219, n02749479, n04044716, n03942813, n02692877, n01534433, n02708093, n03804744, n04162706, n04590129, n04356056, n01729322, n02091134, n03788365, n01739381, n02727426, n02396427, n03527444, n01682714, n03630383, n04591157, n02871525, n02096585, n02093991, n02013706, n04200800, n04090263, n02493793, n03529860, n02088238, n02992211, n03657121, n02492035, n03662601, n04127249, n03197337, n02056570, n04005630, n01537544, n02422106, n02130308, n03187595, n03028079, n02098413, n02098105, n02480855, n02437616, n02123159, n03803284, n02090622, n02012849, n01744401, n06785654, n04192698, n02027492, n02129604, n02090721, n02395406, n02794156, n01860187, n01740131, n02097658, n03220513, n04462240, n01737021, n04346328, n04487394, n03627232, n04023962, n03598930, n03000247, n04009552, n02123394, n01729977, n02037110, n01734418, n02417914, n02979186, n01530575, n03534580, n03447721, n04118538, n02951358, n01749939, n02033041, n04548280, n01755581, n03208938, n04154565, n02927161, n02484975, n03445777, n02840245, n02837789, n02437312, n04266014, n03347037, n04612504, n02497673, n03085013, n02098286, n03692522, n04147183, n01728572, n02483708, n04435653, n02480495, n01742172, n03452741, n03956157, n02667093, n04409515, n02096437, n01685808, n02799071, n02095314, n04325704, n02793495, n03891332, n02782093, n02018795, n03041632, n02097474, n03404251, n01560419, n02093647, n03196217, n03325584, n02493509, n04507155, n03970156, n02088094, n01692333, n01855032, n02017213, n02423022, n03095699, n04086273, n02096294, n03902125, n02892767, n02091244, n02093859, n02389026.