Few-Shot Learning with Embedded Class Models and Shot-Free Meta Training
Abstract
We propose a method for learning embeddings for few-shot learning that is suitable for use with any number of ways and any number of shots (shot-free). Rather than fixing the class prototypes to be the Euclidean average of sample embeddings, we allow them to live in a higher-dimensional space (embedded class models) and learn the prototypes along with the model parameters. The class representation function is defined implicitly, which allows us to deal with a variable number of shots per class with a simple constant-size architecture. The class embedding encompasses metric learning, which facilitates adding new classes without crowding the class representation space. Despite being general and not tuned to the benchmark, our approach achieves state-of-the-art performance on the standard few-shot benchmark datasets.
1 Introduction
Consider Figure 1: Given one or a few images of an Amanita Muscaria (left), one can easily recognize it in the wild. Identifying a Russula (center) may require more samples, enough to distinguish it from the deadly Amanita Phalloides (right), but likely not millions of them. We refer to this as few-shot learning. This ability comes from having seen and touched millions of other objects, in different environments, under different lighting conditions, partial occlusions and other nuisances. We refer to this as meta-learning. We wish to exploit the availability of large annotated datasets to meta-train models so they can learn new concepts from few samples, or “shots.” We refer to this as meta-training for few-shot learning.
In this paper we develop a framework for both meta-training (learning a potentially large number of classes from a large annotated dataset) and few-shot learning (using the learned model to train new concepts from few samples), designed to have the following characteristics.
Open set: Accommodate an unknown, growing, and possibly unbounded number of new classes in an “open set” or “open universe” setting. Some of the simpler methods available in the literature, for instance those based on nearest-neighbors of fixed embeddings [15], do so in theory. In these methods, however, there is no actual few-shot learning per se, as all learnable parameters are set at meta-training.
Continual: Enable leveraging few-shot data to improve the model parameters, even those inferred during meta-training. While each class may only have few samples, as the number of classes grows, the few-shot training set may grow large. We want a model flexible enough to enable “life-long” or “continual” learning.
Shot Free: Accommodate a variable number of shots for each new category. Some classes may have a few samples, others a few hundred; we do not want to meta-train different models for different numbers of shots, nor to restrict ourselves to all new classes having the same number of shots, as many recent works do. This restriction may be a side-effect of the available benchmarks, which only test a few combinations of shots and “ways” (classes).
Embedded Class Models: Learn a representation of the classes that is not constrained to live in the same space as the representation of the data. All known methods for few-shot learning choose an explicit function to compute class representatives (a.k.a. “prototypes” [15], “proxies,” “means,” “modes,” or “templates”) as some form of averaging in the embedding (feature) space of the data. By decoupling the data (feature space) from the classes (class embedding), we free the latter to live in a richer space, where they can better represent complex distributions, and possibly grow over time.
To this end, our contributions are described as follows:

Shot-free: A meta-learning model and sampling scheme that is suitable for use with any number of ways and any number of shots, and can operate in an open-universe, life-long setting. When we fix the shots, as done in the benchmarks, we achieve essentially state-of-the-art performance, but with a model that is far more flexible.

Embedded Identities: We abstract the identities to a different space than the features, thus enabling capturing more complex classes.

Implicit Class Representation: The class representation function has a variable number of arguments, the shots in the class. Rather than fixing the number of shots, or choosing a complex architecture to handle variable numbers, we show that learning an implicit form of the class function enables seamless meta-training, while requiring a relatively simple optimization problem to be solved at few-shot time. We do not use either recurrent architectures that impose artificial ordering, or complex set-functions.

Metric Learning is incorporated in our model, enabling us to add new classes without crowding the class representation space.

Performance: Since there is no benchmark to showcase all the features of our model, we use existing benchmarks for few-shot learning that fix the number of ways and shots to a few samples. Some of the top performing methods are tailored to the benchmark, training different models for different numbers of shots, which does not scale and cannot handle the standard case where each way comes with its own number of shots. Despite being general and not tuned to the benchmark, our approach achieves state-of-the-art performance.
In the next section we present a formalism for ordinary classification that, while somewhat pedantic, allows us to generalize to life-long, open universe, meta- and few-shot training. The general model allows us to analyze existing work under a common language, and highlights limitations that motivate our proposed solution in Sect. 2.3.
1.1 Background, Notation; Ordinary Classification
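In ordinary classification, we call $\mathcal{M} = \{(x_i, y_i)\}_{i=1}^{M}$, with $y_i \in \{1, \dots, B\}$, a “large-scale” training set; $(x, y) \sim P(x, y)$ a sample from the same distribution. If it is in the training set, we write formally $P(y = k \mid x_i) = \delta(k - y_i)$. Outside the training set, we approximate this probability with
(1)  $P_w(y = k \mid x) := \dfrac{\exp(-\phi_w(x)_k)}{\sum_{j=1}^{B} \exp(-\phi_w(x)_j)}$
where the discriminant $\phi_w : X \to \mathbb{R}^B$ is an element of a sufficiently rich parametric class of functions with parameters, or “weights,” $w$, and the subscript $k$ indicates the $k$-th component. The empirical cross-entropy loss is defined as
(2)  $L(w) := -\sum_{k=1}^{B} \sum_{i=1}^{M} P(y = k \mid x_i) \log P_w(y = k \mid x_i) = -\sum_{i=1}^{M} \log P_w(y_i \mid x_i)$
minimizing which is equivalent to maximizing $\prod_{i=1}^{M} P_w(y_i \mid x_i)$. If $\mathcal{M}$ is i.i.d., this yields the maximum-likelihood estimate $\hat w$, which depends on the dataset $\mathcal{M}$ and for which $P_{\hat w}(y \mid x)$ approximates $P(y \mid x)$. We write cross-entropy explicitly as a function of the discriminant as
(3)  $L(w) = \sum_{i=1}^{M} \ell(\phi_w(x_i)_{y_i})$
by substituting (1) into (2), where $\ell$ is given, with a slight abuse of notation, by
(4)  $\ell(v_i) := v_i + \mathrm{LSE}(-v)$
with the log-sum-exp $\mathrm{LSE}(v) := \log \sum_{k=1}^{B} \exp(v_k)$. Next, we introduce the general form for few-shot and life-long learning, used later to taxonomize modeling choices made by different approaches in the literature.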
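As a small numerical illustration of the loss just described, the sketch below evaluates the per-sample cross-entropy through a log-sum-exp. It assumes the discriminant is distance-like, so that the softmax is taken over its negated components; that sign convention, the function names and the example values are assumptions made for illustration only.

```python
import math

def logsumexp(v):
    """Numerically stable log(sum_k exp(v_k))."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def posterior(phi):
    """Softmax over negated discriminant values (assumed sign convention)."""
    neg = [-p for p in phi]
    lse = logsumexp(neg)
    return [math.exp(n - lse) for n in neg]

def cross_entropy(phi, label):
    """-log P(label | x), written as phi[label] + LSE(-phi)."""
    return phi[label] + logsumexp([-p for p in phi])

# Hypothetical discriminant outputs for one sample and three classes.
phi = [0.2, 1.5, 3.0]
probs = posterior(phi)
loss = cross_entropy(phi, 0)
```

Subtracting the maximum inside `logsumexp` keeps the exponentials bounded, which matters once the discriminant is scaled as in the later sections.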
1.2 General Few-Shot Learning
Let $\mathcal{F} = \{(x_j, y_j)\}$ be the few-shot training set, with $k = B+1, \dots, B+K$ the classes, or “ways,” and $N(k)$ the “shots,” or samples per class. We assume that meta- and few-shot data live in the same domain (e.g., natural images), while the meta- and few-shot classes are disjoint, which we indicate with $y_j > B$. (Footnote 1: The number of ways $K$ is a-priori unknown and potentially unbounded. It typically ranges from a few to a few hundred, while $N(k)$ is anywhere from one to a few thousand. The meta-training set has typically $M$ in the millions and $B$ in the thousands. Most benchmarks assume the same number of shots for each way, so there is a single number $N$, an artificial and unnecessary restriction. There is no loss of generality in assuming the classes are disjoint, as few-shot classes that are shared with the meta-training set can just be incorporated into the latter.)
During meta-training, from the dataset $\mathcal{M}$ we learn a parametric representation (feature, or embedding) of the data $z = \phi_w(x)$, for use later for few-shot training. During few-shot training, we use $N(k)$ samples for each new category $k > B$ to train a classifier, with $k$ potentially growing unbounded (life-long learning). First, we define “useful” and then formalize a criterion to learn the parameters $w$, both during meta- and few-shot training.
Unlike standard classification, discussed in the previous section, here we do not know the number of classes ahead of time, so we need a representation that is more general than a $B$-dimensional vector $\phi_w(x)$. To this end, consider two additional ingredients: A representation of the classes $c_k$ (identities, prototypes, proxies), and a mechanism to associate a datum $x_j$ to a class $k$ through its representative $c_k$. We therefore have three functions, all in principle learnable and therefore indexed by parameters $w$. The data representation maps each datum to a fixed-dimensional vector, possibly normalized,
(5)  $\phi_w : X \to \mathbb{R}^F; \quad x \mapsto z = \phi_w(x)$
We also need a class representation, that maps the features $z_j$ sharing the same identity $y_j = k$ to some representative $c_k$ through a function $\psi_w$ that yields, for each $k = B+1, \dots, B+K$,
(6)  $c_k = \psi_w(\{z_j \mid y_j = k\})$
where $z_j = \phi_w(x_j)$. Note that the argument of $\psi_w$ has variable dimension. Finally, the class membership can be decided based on the posterior probability of a datum belonging to a class, approximated with a sufficiently rich parametric function class in the exponential family as we did for standard classification,
(7)  $P_w(y = k \mid x_j) := \dfrac{\exp(-\chi_w(z_j, c_k))}{\sum_{k'} \exp(-\chi_w(z_j, c_{k'}))}$
where $\chi_w$ is analogous to the discriminant in (1). The cross-entropy loss (2) can then be written as
(8)  $L(w) = \sum_{k=B+1}^{B+K} \sum_{j : y_j = k} \ell(\chi_w(z_j, c_k))$
with $\ell$ given by (4) and $c_k$ by (6). The loss is minimized when $w = \hat w(\mathcal{F})$, a function of the few-shot set $\mathcal{F}$. Note, however, that this loss can also be applied to the meta-training set, by changing the outer sum to $k = 1, \dots, B$, or to any combination of the two, by selecting subsets of $\{1, \dots, B+K\}$. Different approaches to few-shot learning differ in the choice of model and mixture of meta- and few-shot training sets used in one iteration of parameter update, or training “episode.”
2 Stratification of Few-shot Learning Models
Starting from the most general form of few-shot learning described thus far, we restrict the model until there is no few-shot learning left, to capture the modeling choices made in the literature.
2.1 Meta Training
In general, during meta-training for few-shot learning, one solves some form of
(9)  $\hat w = \arg\min_w \sum_{(x_i, y_i) \in \mathcal{M}} \ell(\chi_w(z_i, c_{y_i}))$ subject to $z_i = \phi_w(x_i)$ and $c_k = \psi_w(\{z_j \mid y_j = k\})$
Implicit class representation function: Instead of the explicit form in (6), one can infer the function $\psi_w$ implicitly: Let $r = \min_w L(w)$ be the minimum of the optimization problem above. If we consider $c = \{c_1, \dots, c_B\}$ as free parameters in $L(w, c)$, the equation $L(w, c) = r$ defines $c$ implicitly as a function of $w$, $c = \psi(w)$. One can then simply find $w$ and $c$ simultaneously by solving
(10)  $\hat w, \hat c = \arg\min_{w, c} \sum_{k=1}^{B} \sum_{i : y_i = k} \ell(\chi_w(z_i, c_k))$
which is equivalent to the previous problem, even if there is no explicit functional form for the class representation $\psi_w$. As we will see, this simplifies meta-learning, as there is no need to design a separate architecture $\psi_w$ with a variable number of inputs, but requires solving a (simple) optimization during few-shot learning. This is unlike all other known few-shot learning methods, which learn or fix $\psi_w$ during meta-learning, and keep it fixed henceforth.
Far from being a limitation, the implicit solution has several advantages, including bypassing the need to explicitly define a function with a variable number of inputs (or a set function) $\psi_w$. It also enables the identity representation to live in a different space than the data representation, again unlike existing work that assumes a simple functional form such as the mean.
2.2 Few-shot Training
Life-long few-shot learning: Once meta-training is done, one can use the same loss function in (10) for $k > B$ to achieve life-long, few-shot learning. While each new category $k > B$ is likely to have few samples $N(k)$, in the aggregate the number of samples $\sum_k N(k)$ is bound to grow large, which we can exploit to update the embedding $\phi_w$, the metric $\chi_w$ and the class function $\psi_w$.
Metric learning: A simpler model consists of fixing the parameters of the data representation, $\hat\phi := \phi_{\hat w}$, and using the same loss function, but summed for $k > B$, to learn from few shots the new class proxies $c_k$ and change the metric $\chi_W$, with parameters $W$, as the class representation space becomes crowded. If we fix the data representation, during the few-shot training phase, we solve
(11)  $\hat W, \hat c = \arg\min_{W, c} \sum_{k=B+1}^{B+K} \sum_{j : y_j = k} \ell(\chi_W(\hat z_j, c_k))$, with $\hat z_j = \hat\phi(x_j)$
where the dependency on the meta-training phase is through $\hat\phi$ and both $\hat W$ and $\hat c$ depend on the few-shot dataset $\mathcal{F}$.
New class identities: One further simplification step is to also fix the metric $\chi_{\hat W}$, leaving only the class representatives to be estimated
(12)  $\hat c = \arg\min_{c} \sum_{k=B+1}^{B+K} \sum_{j : y_j = k} \ell(\chi_{\hat W}(\hat z_j, c_k))$
The above is the implicit form of the parametric function $\psi_w$, with parameters $w = \hat w$, as seen previously. Thus evaluating $\hat c_k = \psi_{\hat w}(\{\hat z_j \mid y_j = k\})$ requires solving an optimization problem.
No few-shot learning: Finally, one can fix even the function $\psi$ explicitly, forgoing few-shot learning and simply computing
(13)  $\hat c_k = \psi_{\hat w}(\{\hat z_j \mid y_j = k\}), \quad k > B$
that depends on $\mathcal{M}$ through $\hat w$, and on $\mathcal{F}$ through the features $\{\hat z_j\}$.
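The last two levels of this stratification can be contrasted in a few lines. The sketch below is a toy illustration under stated assumptions, not the implementation used in the experiments: features are held fixed, the discriminant is assumed to be the plain squared Euclidean distance (no normalization or scaling), and the helper names, step size and data are hypothetical. `mean_prototypes` plays the role of an explicit class function, while `optimize_centers` runs a fixed number of gradient steps on the cross-entropy over the class representatives only.

```python
import math

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def posterior(z, centers):
    """P(k | z) proportional to exp(-||z - c_k||^2)."""
    neg = [-sqdist(z, c) for c in centers]
    m = max(neg)
    e = [math.exp(n - m) for n in neg]
    total = sum(e)
    return [v / total for v in e]

def loss(feats, labels, centers):
    """Cross-entropy summed over all few-shot samples."""
    return sum(-math.log(posterior(z, centers)[y]) for z, y in zip(feats, labels))

def mean_prototypes(feats, labels, num_classes):
    """Explicit class function: per-class mean, for any number of shots."""
    dim = len(feats[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for z, y in zip(feats, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += z[d]
    return [[v / counts[k] for v in sums[k]] for k in range(num_classes)]

def optimize_centers(feats, labels, centers, steps=200, lr=0.02):
    """Implicit class function: gradient steps on the loss over centers only."""
    centers = [list(c) for c in centers]  # work on a copy
    for _ in range(steps):
        grads = [[0.0] * len(c) for c in centers]
        for z, y in zip(feats, labels):
            p = posterior(z, centers)
            for k, c in enumerate(centers):
                # d loss / d c_k = (1[y == k] - P(k | z)) * 2 (c_k - z)
                coeff = (1.0 if k == y else 0.0) - p[k]
                for d in range(len(c)):
                    grads[k][d] += coeff * 2.0 * (c[d] - z[d])
        for k, c in enumerate(centers):
            for d in range(len(c)):
                c[d] -= lr * grads[k][d]
    return centers

# Hypothetical fixed embeddings: class 0 has three shots, class 1 has one.
feats = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0], [1.0, 1.0]]
labels = [0, 0, 0, 1]
means = mean_prototypes(feats, labels, 2)
learned = optimize_centers(feats, labels, means)
```

Neither function depends on the number of shots per class, which is the property the implicit formulation is meant to preserve. In this unnormalized toy the optimized centers keep drifting apart as more steps are taken, so the step count acts as a stopping rule.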
We articulate our modeling and sampling choices in the next section, after reviewing the most common approaches in the literature in light of the stratification described.
2.3 Related Prior Work
Most current approaches fall under the case (13), thus involving no few-shot learning, forgoing the possibility of life-long learning and imposing additional undue limitations by constraining the prototypes to live in the same space as the features. Many are variants of Prototypical Networks [15], where only one of the three components of the model is learned: the class function $\psi$ is fixed to be the mean, so $c_k = \frac{1}{N(k)} \sum_{j : y_j = k} z_j$, and the metric $\chi$ is the Euclidean distance. The only learning occurs at meta-training, and the trainable portion of the model, the embedding $\phi_w$, is a conventional neural network. In addition, the sampling scheme used for training makes the model dependent on the number of shots, again unnecessarily.
Other work can be classified into two main categories: gradient-based [11, 3, 9, 14] and metric-based [16, 21, 10, 4]. In the first, a meta-learner is trained to adapt the parameters of a network to match the few-shot training set. [11] uses the base set to learn long short-term memory (LSTM) units [6] that update the base classifier with the data from the few-shot training set. MAML [3] learns an initialization for the network parameters that can be adapted by gradient descent in a few steps. LEO [14] is similar to MAML, but uses a task-specific initial condition and performs the adaptation in a lower-dimensional space. Most of these algorithms adapt the embedding $\phi_w$ and use an ordinary classifier at few-shot test time. There is a different $\phi_w$ for every few-shot training set, with little reuse and no continual learning.
On the metric learning side, [21] trains a weighted classifier using an attention mechanism [23] that is applied to the output of a feature embedding trained on the base set. This method requires the shots at meta- and few-shot training to match. Prototypical Networks [16] are trained with episodic sampling and a loss function based on the performance of a nearest-mean classifier [20] applied to a few-shot training set. [4] generate classification weights for a novel class based on a feature extractor using the base training set. Finally, [1] incorporates ridge regression in an end-to-end manner into a deep-learning network. These methods learn a single embedding $\phi_w$, which is reused across few-shot training tasks. The class identities are then obtained through a function $\psi$ defined a-priori, such as the sample mean in [16], an attention kernel [21], or ridge regression [1]. The form of $\psi$ or $\chi$ does not change at few-shot training. [10] use task-specific adaptation networks to adapt the embedding network, with output on a task-dependent metric space. In this method, the forms of $\chi$ and $\psi$ are fixed and the output of $\phi_w$ is modulated based on the few-shot training set.
Next, we describe our model that, to the best of our knowledge, is the first and only to learn each component of the model: The embedding $\phi_w$, the metric $\chi_w$, and implicitly the class representation $\psi_w$.
3 Proposed Model
Using the formalism of Sect. 2 we describe our modeling choices. Note that there is redundancy in the model class $\chi_w(\phi_w(\cdot), \cdot)$, as one could fix the data representation $\phi_w$, and devolve all modeling capacity to $\chi_w$, or vice-versa. The choice depends on the application context. We outline our choices, motivated by limitations of prior work.
Embedding $\phi_w$: In line with recent work, we choose a deep convolutional network. The details of the architecture are in Sect. 4.
Class representation function $\psi$: We define it implicitly by treating the class representations as parameters along with the weights $w$. As we saw earlier, this means that at few-shot training, we have to solve a simple optimization problem (12) to find the representatives of new classes, rather than computing the mean as in Prototypical Networks and its variants:
(14)  $\{\hat c_k\} = \arg\min_{\{c_k\}} \sum_{k} \sum_{j : y_j = k} \ell(\chi_w(\phi_w(x_j), c_k))$
Note that the class estimates $\hat c_k$ depend on the parameters $w$ in $\phi_w$. If few-shot learning is resource constrained, one can still learn the class representations implicitly during meta-training, and approximate them with a fixed function, such as the mean, during the few-shot phase.
Metric $\chi$: We choose a discriminant induced by the Euclidean distance in the space of class representations, to which data representations are mapped by a learnable parameter matrix $W$:
(15)  $\chi_W(z_j, c_k) = \|W \phi_w(x_j) - c_k\|^2$
Generally, we pick the dimension of $c_k$ larger than the dimension of $\phi_w(x_j)$, to enable capturing complex multi-modal identity representations. Note that this choice encompasses metric learning: If $Q$ was a symmetric (positive-definite) matrix representing a change of inner product, then $\|W \phi_w(x_j) - c_k\|_Q^2$ would be captured by simply choosing the weights $\tilde W = Q^{1/2} W$ and class proxies $\tilde c_k = Q^{1/2} c_k$. Since both the weights and the class proxies are free, there is no gain in generality in adding the metric parameters $Q$. Of course, $W \phi_w(x_j)$ can be replaced by any non-linear map, effectively “growing” the model via
(16)  $\chi_W(z_j, c_k) = \|f_W(\phi_w(x_j)) - c_k\|^2$
for some parametric family $f_W$ such as a deep neural network.
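To make the role of the learnable matrix concrete, here is a minimal sketch of a discriminant that first lifts a feature into a higher-dimensional class space and then takes the squared Euclidean distance to each class representative. The dimensions (2-D features, 3-D class space), the values of `W` and `centers`, and the function names are all hypothetical, and the posterior assumes a softmax over negated distances.

```python
import math

def matvec(W, z):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def discriminant(W, z, c):
    """Squared Euclidean distance between the lifted feature W z and c."""
    return sum((a - b) ** 2 for a, b in zip(matvec(W, z), c))

def class_posterior(W, z, centers):
    """Softmax over negated discriminants, one entry per class."""
    neg = [-discriminant(W, z, c) for c in centers]
    m = max(neg)
    e = [math.exp(n - m) for n in neg]
    total = sum(e)
    return [v / total for v in e]

# Hypothetical sizes: 2-D features lifted into a 3-D class space.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centers = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
z = [0.9, 0.1]
probs = class_posterior(W, z, centers)
```

Because the class representatives never have to be expressed as an average of features, their dimension is set by the number of rows of `W` rather than by the embedding.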
4 Implementation
Embedding
We use two different architectures. The first [16, 21] consists of four convolution blocks, each with 64 filters followed by batch normalization and ReLU, whose output is passed through max-pooling. Following the convention in [4], we call this architecture C64. The other network is a modified ResNet [5], similar to [10]. We call this ResNet-12.
In addition, we normalize the embedding to live on the unit sphere, i.e. $\phi_w(x) \in S^{F-1}$, where $F$ is the dimension of the embedding. This normalization is added as a layer to ensure that the feature embeddings are on the unit sphere, as opposed to applying it post-hoc. This adds some complications during meta-training due to poor scaling of gradients [22], which we address with a single-parameter layer after normalization, whose sole purpose is scaling the output of the normalization layer. This layer is not required at test time.
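The normalization and scale layers described above amount to the following forward computation. This is only a sketch: the scale is a single scalar as stated in the text, but its value here, the `eps` guard and the function names are assumptions, and no gradient handling is shown.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Project a vector onto the unit sphere (eps guards against zero norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def normalize_and_scale(v, scale):
    """Unit-normalize v, then multiply by a single scalar (training only)."""
    return [scale * x for x in l2_normalize(v)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

raw = [3.0, 4.0]                                # hypothetical raw embedding
unit = l2_normalize(raw)                        # what is kept at test time
scaled = normalize_and_scale(raw, scale=10.0)   # hypothetical scale value
```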
Class representation:
As noted earlier, this is implicit during meta-training. In order to show the flexibility of our framework, we increase the dimension of the class representation.
Metric
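We choose the angular distance in feature space, which is the hypersphere:
(17)  $\chi(z_j, c_k) = s \, \|z_j - c_k\|^2 = 2s \, (1 - \cos\theta)$
where $s$ is the scaling factor used during training and $\theta$ the angle between the normalized arguments. As the representation $z_j$ is normalized, the class-conditional model is a Fisher-Von Mises (spherical Gaussian) distribution. However, this requires the class representatives $c_k$ to be normalized as well, so during meta-training we apply the same normalization and scale function to the implicit representation, giving
(18)  $P_w(y = k \mid x_j) \propto \exp(-\chi(z_j, c_k)) \propto \exp(2s \cos\theta)$
up to the normalization constant.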
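The link between a Euclidean distance on the sphere and an angular model can be spelled out in two lines. The derivation below assumes the discriminant is the scaled squared Euclidean distance between two unit vectors $z$ and $c$, with scale $s$; these symbols are notation chosen for the illustration.

```latex
\begin{aligned}
\|z - c\|^2 &= \|z\|^2 + \|c\|^2 - 2\, z^\top c
             = 2 - 2\cos\theta = 2\,(1 - \cos\theta),
             \qquad \|z\| = \|c\| = 1, \\
\exp\!\big(-s\,\|z - c\|^2\big) &= e^{-2s}\,\exp\!\big(2s\, z^\top c\big)
             \;\propto\; \exp\!\big(2s\cos\theta\big).
\end{aligned}
```

Under these assumptions the right-hand side has the form of a von Mises-Fisher density in $z$ with mean direction $c$ and concentration $2s$, so a larger scale corresponds to more tightly concentrated classes.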
Sampling
At each iteration during meta-training, images from the training set are presented to the network in the form of episodes [21, 11, 16]; each episode consists of images sampled from a subset of the classes. The images are selected by first sampling classes from the meta-training set and then sampling images from each of the sampled classes. The loss function is now restricted to the classes present in the episode as opposed to the entire set of classes available at meta-training. This setting allows the network to learn a better embedding for open-set classification, as shown in [2, 21].
Unlike existing methods that use episodic sampling [11, 15], we do not split the images within an episode into a meta-train set and a meta-test set. For instance, Prototypical Networks [16] use the elements in the meta-train set to learn the mean of the class representation. [11] learns the initial conditions for optimization. This requires a notion of training “shot,” and results in multiple networks to match the shots one expects at few-shot training.
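The sampling procedure just described can be sketched as follows. The function name, its arguments and the toy label list are hypothetical; the point is only that an episode is a flat set of images from a few sampled classes, with no split into support and query subsets.

```python
import random

def sample_episode(labels, num_classes, images_per_class, rng):
    """Sample classes first, then images within each sampled class.

    Returns the sampled classes and the indices of the episode images.
    All images in the episode are used together in the loss.
    """
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    classes = rng.sample(sorted(by_class), num_classes)
    episode = []
    for y in classes:
        episode.extend(rng.sample(by_class[y], images_per_class))
    return classes, episode

# Hypothetical toy label list: 6 classes with 5 images each.
labels = [i // 5 for i in range(30)]
rng = random.Random(0)
classes, episode = sample_episode(labels, num_classes=3, images_per_class=4, rng=rng)
```

The loss for the episode would then be restricted to the classes in `classes`, as stated in the text.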
Regularization
First, we notice that the loss function (10) has a degenerate solution where all the centers and the embeddings are the same. In this case, the posterior $P_w(y = k \mid x_j)$ takes the same value for all $j$ and $k$, i.e., it is a uniform distribution. For this degenerate case, the entropy is maximum, so we use entropy to bias the solution away from the trivial one. We also use Dropout [17] on top of the embedding during meta-training. Even when using episodic sampling, the embedding tends to overfit on the base set in the absence of dropout. We do not use this at few-shot train and test time.
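A short sketch shows why entropy flags the degenerate solution. The exact form of the regularizer is not given in the text, so the additive penalty and its `weight` below are assumptions made purely for illustration, as are the function names and example distributions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def regularized_loss(cross_entropy, posteriors, weight):
    """Cross-entropy plus weight times the mean posterior entropy.

    The additive form and `weight` are assumptions, not taken from the text.
    """
    mean_h = sum(entropy(p) for p in posteriors) / len(posteriors)
    return cross_entropy + weight * mean_h

# Degenerate case: identical centers and embeddings give a uniform posterior.
uniform = [0.2] * 5
# A confident posterior over the same five classes.
peaked = [0.9, 0.025, 0.025, 0.025, 0.025]
```

For five classes the uniform posterior attains the maximum entropy, log 5, so any penalty that grows with entropy pushes the solution away from the collapsed configuration.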
Figure 2 summarizes our architecture for the loss function during meta-training. This has layers that are only needed for training, such as the scale layer, Dropout and the loss. During few-shot training, we only use the learned embedding.
5 Experimental Results
We test our algorithm on three datasets: miniImagenet [21], tieredImagenet [12] and CIFAR Few-Shot [1]. The miniImagenet dataset consists of images of size 84 × 84 sampled from 100 classes of the ILSVRC [13] dataset, with 600 images per class. We used the data split outlined in [11], where 64 classes are used for training, 16 classes are used for validation, and 20 classes are used for testing.
We also use tieredImagenet [12]. This is a larger subset of ILSVRC, and consists of 779,165 images of size 84 × 84 representing 608 classes hierarchically grouped into 34 high-level classes. The split of this dataset ensures that sub-classes of the 34 high-level classes are not spread over the training, validation and testing sets, minimizing the semantic overlap between training and test sets. The result is 448,695 images in 351 classes for training, 124,261 images in 97 classes for validation, and 206,209 images in 160 classes for testing. For a fair comparison, we use the same training, validation and testing splits as in [12], and use the classes at the lowest level of the hierarchy.
Finally, we use CIFAR Few-Shot (CIFAR-FS) [1], a reorganized version of the CIFAR-100 [8] dataset containing images of size 32 × 32. We use the same data split as in [1], dividing the 100 classes into 64 for training, 16 for validation, and 20 for testing.
5.1 Comparison to Prototypical Networks
Many recent methods are variants of Prototypical Networks, so we perform a detailed comparison with it. We keep the training procedure, network architecture, batch size and data augmentation the same. The performance gains are therefore solely due to the improvements in our method.
We use ADAM [7] for training with an initial learning rate of , and a decay factor of every 2,000 iterations. We use the validation set to determine the best model. Our data augmentation consists of mean subtraction, standarddeviation normalization, random cropping and random flipping during training. Each episode contains 15 query samples per class during training. In all our experiments, we set and did not tune this parameter.
Unless otherwise noted, we always test few-shot algorithms on 2000 episodes, with 30 query points per class per episode. At few-shot training, we experimented with setting the class identity to be implicit (optimized) or average prototype (fixed). The latter may be warranted when the few-shot phase is resource-constrained and yields similar performance. To compare computation time, we use the fixed mean. Note that, in all cases, the class prototypes are learned implicitly during meta-training.
The results of this comparison are shown in Table 1. From this table we see that for the 5-shot 5-way case we perform similarly to Prototypical Networks. However, for the 1-shot case we see significant improvements across all three datasets. Also, the performance of Prototypical Networks drops when the train and test shots differ: Table 1 shows a significant drop in performance when we test models in a 5-shot setting after training with 1-shot. In contrast, our method maintains nearly the same performance. Consequently, we only train one model and test it across the different shot scenarios, hence the moniker “shot-free.”
Table 1. Comparison of our method with our implementation of Prototypical Networks [16].
Dataset          Testing Scenario  Training Scenario  Our implementation of [16]  Our Method
miniImagenet     1-shot 5-way      1-shot 5-way       43.88 ± 0.40                49.07 ± 0.43
miniImagenet     5-shot 5-way      1-shot 5-way       58.33 ± 0.35                64.98 ± 0.35
miniImagenet     5-shot 5-way      5-shot 5-way       65.49 ± 0.35                65.73 ± 0.36
tieredImagenet   1-shot 5-way      1-shot 5-way       41.36 ± 0.40                48.19 ± 0.43
tieredImagenet   5-shot 5-way      1-shot 5-way       55.93 ± 0.39                64.60 ± 0.39
tieredImagenet   5-shot 5-way      5-shot 5-way       65.51 ± 0.38                65.50 ± 0.39
CIFAR Few-Shot   1-shot 5-way      1-shot 5-way       50.74 ± 0.48                55.14 ± 0.48
CIFAR Few-Shot   5-shot 5-way      1-shot 5-way       64.63 ± 0.42                70.33 ± 0.40
CIFAR Few-Shot   5-shot 5-way      5-shot 5-way       71.57 ± 0.38                71.66 ± 0.39
5.2 Effect of Dimension of Class Identities
Class identities can live in a space of different dimension than the feature embedding. This can be done in two ways: by lifting the embedding into a higher-dimensional space or by projecting the class identity into the embedding dimension. If the dimension of the class identity changes, we also need to modify the metric according to (15). The weight matrix $W$, whose shape is set by the dimension of the embedding and the dimension of the class identities, can be learned during meta-training. This is equivalent to adding a fully connected layer through which the class identities are passed before normalization. Thus, we now learn the embedding, the class identities and the weight matrix $W$. We show experimental results with the C64 architecture on the miniImagenet dataset in Table 2. Here, we tested the dimension of the class identities to be 1×, 2×, 5× and 10× the dimension of the embedding. From this table we see that increasing the dimension gives us a performance boost. However, this increase saturates at 2× the dimension of the embedding space.
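Table 2. Effect of the dimension of the class identities, relative to the dimension of the embedding, on miniImagenet.
Dimension    1x     2x     5x     10x
Performance  49.07  51.46  51.46  51.32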
5.3 Comparison to the State-of-the-art
In order to compare with the stateoftheart, we use the ResNet12 base architecture, train our approach using SGD with Nesterov momentum with an initial learning rate of , weight decay of , momentum of and eight episodes per batch. Our learning rate was decreased by a factor of every time the validation error did not improve for 1000 iterations. We did not tune these parameters. As mentioned earlier, we train one model and test across various shots. We also compare our method with class identities in a space with twice the dimension of the embedding. Lastly, we compare our method with a variant of ResNet where we change the filter sizes to (64,160,320,640) from (64,128,256,512).
The results of our comparison for miniImagenet are shown in Table 3. Modulo empirical fluctuations, our method performs at the state of the art and in some cases exceeds it. We wish to point out that SNAIL [9], TADAM [10, 18], LEO [14] and MTFL [18] pre-train the network for a 64-way classification task on miniImagenet and a 351-way classification task on tieredImagenet. However, all the models trained for our method are trained from scratch and use no form of pre-training. We also do not use the meta-validation set for tuning any parameters other than selecting the best trained model using the error on this set. Furthermore, unlike all other methods, we did not have to train multiple networks and tune the training strategy for each case. Lastly, LEO [14] uses a very deep 28-layer Wide-ResNet as a base model compared to our shallower ResNet-12. A fair comparison would involve training our methods with the same base network. However, we include this comparison for complete transparency.
Table 3. Results on miniImagenet.
Algorithm                                1-shot 5-way  5-shot 5-way  10-shot 5-way
Meta LSTM [11]                           43.44         60.60         -
Matching networks [21]                   44.20         57.0          -
MAML [3]                                 48.70         63.1          -
Prototypical Networks [16]               49.40         68.2          -
Relation Net [19]                        50.40         65.3          -
R2D2 [1]                                 51.20         68.2          -
SNAIL [9]                                55.70         68.9          -
Gidaris et al. [4]                       55.95         73.00         -
TADAM [10]                               58.50         76.7          80.8
MTFL [18]                                61.2          75.5          -
LEO [14]                                 61.76         77.59         -
Our Method (ResNet-12)                   59.00         77.46         82.33
Our Method (ResNet-12) 2x dims.          60.64         77.02         -
Our Method (ResNet-12) Variant           59.04         77.64         82.48
Our Method (ResNet-12) Variant 2x dims.  60.71         77.26         -
Table 4. Results on tieredImagenet and CIFAR Few-Shot.
Algorithm                                1-shot 5-way  5-shot 5-way  10-shot 5-way
tieredImagenet
MAML [3]                                 51.67         70.30         -
Prototypical Networks [12]               53.31         72.69         -
Relation Net [19]                        54.48         71.32         -
LEO [14]                                 65.71         81.31         -
Our Method (ResNet-12)                   63.99         81.97         85.89
Our Method (ResNet-12) 2x dims.          66.87         82.64         -
Our Method (ResNet-12) Variant           63.52         82.59         86.62
Our Method (ResNet-12) Variant 2x dims.  66.87         82.43         -
CIFAR Few-Shot
MAML [3]                                 58.9          71.5          -
Prototypical Networks [16]               55.5          72.0          -
Relation Net                             55.0          69.3          -
R2D2 [1]                                 65.3          79.4          -
Our Method (ResNet-12)                   69.15         84.70         87.64
The performance of our method on tieredImagenet is shown in Table 4. This table shows that we are the top performing method for 1-shot 5-way and 5-shot 5-way. We test on this dataset because it is much larger and does not have semantic overlap between meta-training and few-shot training, even though only a few baselines exist for this dataset compared to miniImagenet. Also shown in Table 4 is the performance of our method on the CIFAR Few-Shot dataset. We show results on this dataset to illustrate that our method can generalize across datasets. From this table we see that our method performs the best for CIFAR Few-Shot.
As a final remark, there is no consensus on the few-shot training and testing paradigm in the literature. There are too many variables that can affect performance. Even with all major factors such as network architecture, training procedure and batch size remaining the same, factors such as the number of query points used for testing affect the performance. Methods in the existing literature use anywhere between 15 and 30 points for testing, and for some methods it is unclear what this choice was. This calls for stricter protocols for evaluation, and richer benchmark datasets.
6 Discussion
We have presented a method for meta-learning for few-shot learning where all three ingredients of the problem are learned: The representation of the data $\phi$, the representation of the classes $\psi$, and the metric or membership function $\chi$. The method has several advantages compared to prior approaches. First, by allowing the class representation and the data representation spaces to be different, we can allocate more representative power to the class prototypes. Second, by learning the class models implicitly we can handle a variable number of shots without having to resort to complex architectures, or worse, training different architectures, one for each number of shots. Finally, by learning the membership function we implicitly learn the metric, which allows class prototypes to redistribute during few-shot learning.
While some of these benefits are not immediately evident due to limited benchmarks, the improved generality allows our model to extend to a continual learning setting where the number of new classes grows over time, and is flexible in allowing each new class to come with its own number of shots. Despite the added generality, our model is simpler than some of the top performing ones in the benchmarks, having a single model, and yet it performs on par or better in the few-shot setting.
References
[1] L. Bertinetto, J. F. Henriques, P. H. S. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. CoRR, abs/1805.08136, 2018.
[2] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
[3] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[4] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[7] D. P. Kingma and J. L. Ba. ADAM: A method for stochastic optimization. International Conference on Learning Representations 2015, 2015.
[8] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[9] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
[10] B. N. Oreshkin, P. Rodríguez, and A. Lacoste. Improved few-shot learning with task conditioning and metric scaling. In NIPS, 2018.
[11] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[12] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. CoRR, abs/1803.00676, 2018.
[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, Dec. 2015.
[14] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. CoRR, abs/1807.05960, 2018.
[15] J. Snell, K. Swersky, and R. S. Zemel. Modelprototypical networks for few-shot learning. In NIPS, 2017.
[16] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4080–4090, 2017.
[17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014.
[18] Q. Sun, Y. Liu, T. Chua, and B. Schiele. Meta-transfer learning for few-shot learning. CoRR, abs/1812.02391, 2018.
[19] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[20] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567–6572, 2002.
[21] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
[22] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pages 1041–1049, New York, NY, USA, 2017. ACM.
[23] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.