# A Theoretical Analysis of the Number of Shots in Few-Shot Learning

###### Abstract

Few-shot classification is the task of predicting the category of an example from a set of few labeled examples. The number of labeled examples per category is called the number of shots (or shot number). Recent works tackle this task through meta-learning, where a meta-learner extracts information from observed tasks during meta-training to quickly adapt to new tasks during meta-testing. In this formulation, the number of shots exploited during meta-training has an impact on the recognition performance at meta-test time. Generally, the shot number used in meta-training should match the one used in meta-testing to obtain the best performance. We introduce a theoretical analysis of the impact of the shot number on Prototypical Networks, a state-of-the-art few-shot classification method. From our analysis, we propose a simple method that is robust to the choice of shot number used during meta-training, which is a crucial hyperparameter. A model trained with our method for an arbitrary meta-training shot number retains strong performance across different values of the meta-testing shot number. We experimentally demonstrate our approach on different few-shot classification benchmarks.

## 1 Introduction

Human cognition has the impressive ability to grasp new concepts from exposure to a handful of examples (Yger et al., 2015). In comparison, while modern deep learning methods achieve unprecedented performance with very deep neural networks (He et al., 2016; Szegedy et al., 2015), they require extensive amounts of data to train, often ranging in the millions. Recently proposed methods for few-shot learning aim to bridge the sample-efficiency gap between deep learning and human learning in fields such as computer vision and reinforcement learning (Santoro et al., 2016; Ravi and Larochelle, 2017; Finn et al., 2017; Vinyals et al., 2016). These methods fall under the framework of meta-learning, in which a meta-learner extracts knowledge from many related tasks (in the meta-training phase) and leverages that knowledge to quickly learn new tasks (in the meta-testing phase). In this paper we are interested in the few-shot classification problem, where each task is defined as an n-way classification problem with k samples (shots) per class available for training.

Many meta-learning methods use the episodic training setup, in which the meta-learner iterates through episodes in the meta-training phase. In each episode, a task is drawn from some population and a limited amount of support and query data from that task is made available. The meta-learner then learns a task-specific classifier on the support data, and the classifier predicts on the query data. Updates to the meta-learner are computed based on the performance of the classifier on the query set. Evaluation of the meta-learner (during a phase called meta-testing) is also carried out in episodes in a similar fashion, except that the meta-learner is no longer updated and the performance on query data across multiple episodes is aggregated.

In the episodic setup, the selection of k during meta-training can have significant effects on the learning outcomes of the meta-learner. Intuitively, if support data is expected to be scarce, the meta-learner needs to provide a strong inductive bias to the task-specific learner, as the danger of overfitting is high. In contrast, if support data is expected to be abundant, then the meta-learner can provide generally more relaxed biases to the task-specific learner in order to achieve a better fit to the task data. It is therefore plausible for a meta-learner trained with one value of k to be sub-optimal at adapting to tasks with a different value of k, and thus exhibit “meta-overfitting” to k. In experiments, k is often simply kept fixed between meta-training and meta-testing, but in real-world usage, one cannot expect to know beforehand the amount of support data available from unseen tasks during deployment.

In this paper we focus on Prototypical Networks (Snell et al., 2017), a.k.a. ProtoNet. ProtoNet is of practical interest because of its flexibility: a single trained instance of ProtoNet can be used on new tasks with any number of classes n and shots k. However, ProtoNet exhibits performance degradation when the k used in training does not match the k used in testing (for example, the 1-shot accuracy of a 10-shot-trained network is worse, in absolute terms, than that of a 1-shot-trained network). First, we undertake a theoretical investigation to elicit the connection from k to a lower bound on expected performance, as well as to the intrinsic dimension of the learned embedding space. Then, we conduct experiments to empirically verify our theoretical results across various settings. Guided by our new understanding of the effects of k, we propose an elegant method to tackle performance degradation in mismatched cases.
Our contributions are threefold:

We provide performance bounds for ProtoNets given an embedding function. From these bounds, we argue that k affects learning and performance by scaling the contribution of the intra-class variance.

Through VC-learnability theory, we connect the value of k used in meta-training to the intrinsic dimension of the embedding space.

The most important contribution of this paper (introduced in Section 3.3) is a new method that improves upon vanilla ProtoNets by eliminating the performance degradation in cases where k is mismatched between meta-training and meta-testing. Our evaluation protocol more closely adheres to real-world scenarios, where the model is exposed to different numbers of training samples.

## 2 Background

### 2.1 Problem setup

The few-shot classification problem considered in this paper is set up as follows. Consider a space of classes C with a probability distribution τ; n classes are sampled i.i.d. from τ to form an n-way classification problem. For each class c, k support data points are sampled from the class-conditional distribution P(x | y = c), where x ∈ R^D, D denotes the dimension of the data, and y denotes the class assignment of x. Note that we assume that y is singular (i.e. each x can have only one label) and does not depend on the episode (e.g. a data point with the label “cat” will always have the label “cat”), in contrast to the episode-level label ŷ defined later on.

Additionally, the set Q containing the query data is sampled from the joint distribution P(x, y). (We assume equal likelihood of drawing from each class, for simplicity.)
For each x in Q, let ŷ(x) denote its label in the context of the few-shot classification task.
Define S_c = {(x_1, c), …, (x_k, c)} as the augmented set of supports for class c, and denote the union of the support sets of the n classes as S.
The few-shot classification task is to predict ŷ(x) for each x in Q given S. During meta-training, the ground truth label for each query is also available to the learner.

### 2.2 Meta-learning setup

Meta-learning approaches train on a distribution of tasks to obtain information that generalizes to unseen tasks. For few-shot classification, a task is determined by which classes are involved in the n-way classification problem. During meta-training, the meta-learner observes episodes of few-shot classification tasks consisting of n classes, k labelled samples per class, and the unlabelled query samples, as previously described. The collection of all classes observed during meta-training forms the meta-training split. Critically, we assume that every unseen class that the learner is evaluated on (during meta-testing) is also drawn from the same distribution τ.

### 2.3 Prototypical Networks

ProtoNets (Snell et al., 2017) compute M-dimensional embeddings for all samples in S and Q. The embedding function f_φ : R^D → R^M is usually a deep network. The prototype representation μ̂_c of each class c is formed by averaging the embeddings of all supports of that class: μ̂_c = (1/k) Σ_{(x, c) ∈ S_c} f_φ(x). Classification of any input (e.g. a query point x ∈ Q) is performed by computing the softmax over squared Euclidean distances of the input point’s embedding to the prototypes. Let p(y = c | x) denote the prediction of the classifier for category c:

$$p(y = c \mid x) \;=\; \frac{\exp\!\left(-\lVert f_\phi(x) - \hat{\mu}_c \rVert^2\right)}{\sum_{c'} \exp\!\left(-\lVert f_\phi(x) - \hat{\mu}_{c'} \rVert^2\right)} \tag{1}$$

The parameters φ of the embedding function are learned through meta-training: the negative log-likelihood of the correct class is minimized on the *query* points through SGD.
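The classification rule above can be summarized in a short numerical sketch (an illustrative NumPy implementation with synthetic episode data, not the authors' code; the helper names are our own):

```python
import numpy as np

def prototypes(support_emb, support_lab, n_way):
    """Per-class prototype: the mean of the k support embeddings of that class."""
    return np.stack([support_emb[support_lab == c].mean(axis=0) for c in range(n_way)])

def protonet_probs(query_emb, protos):
    """Softmax over negative squared Euclidean distances to the prototypes (equation 1)."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)  # (q, n_way)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# toy 2-way, 3-shot episode with 2-d embeddings
rng = np.random.default_rng(0)
s_emb = np.concatenate([rng.normal(0.0, 0.1, (3, 2)), rng.normal(3.0, 0.1, (3, 2))])
s_lab = np.array([0, 0, 0, 1, 1, 1])
protos = prototypes(s_emb, s_lab, n_way=2)
probs = protonet_probs(np.array([[0.0, 0.0], [3.0, 3.0]]), protos)  # one query near each class
```

At meta-training time, the negative log-likelihood of the correct class under `probs` would be minimized with respect to the embedding parameters.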

As explained in (Law et al., 2019), ProtoNets can be seen as a metric learning approach optimized for the supervised hard clustering task (Law et al., 2016). The model is learned so that the representations of similar examples (i.e. belonging to the same category) are all grouped into the same cluster in the embedding space. We propose in this paper a subsequent metric learning step which learns a linear transformation that maximizes the inter-to-intra class variance ratio.

## 3 Proposed method

We first present theoretical results explaining the effect of the shot number k on ProtoNets, and then introduce our method for addressing performance degradation in cases of mismatched shots.

### 3.1 Relating k to a lower bound of expected accuracy

To better understand the role of k in the performance of ProtoNets, we study how it contributes to the expected accuracy across episodes when using any fixed embedding function f_φ (e.g. the embedding function obtained at the end of the meta-training phase). With 1(·) denoting the indicator function, we define the expected accuracy (over sampled classes, supports, and query points) as:

$$\mathrm{Acc} \;=\; \mathbb{E}\!\left[\mathbf{1}\!\left(\operatorname*{arg\,max}_{c}\; p(y = c \mid x) \;=\; \hat{y}(x)\right)\right] \tag{2}$$

Definitions: Throughout this section, we use the following symbols to denote the means and variances of embeddings under different expectations:

###### Remark.

μ_c = E_{x|c}[f_φ(x)] is the expectation of the embedding conditioned on class c. μ = E_{c∼τ}[μ_c] is the (full) expectation of the embedding, which can be expressed as the expectation of μ_c over classes. Σ_b = Var_{c∼τ}(μ_c) is the variance of class means in the embedding space; it can be interpreted as the signal of the input to the classifier, as larger Σ_b implies larger distances between classes. Σ_w = E_{c∼τ}[Var_{x|c}(f_φ(x))] is the expected intra-class variance; it represents the noise in the above signal.
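The two trace quantities, inter-class "signal" and intra-class "noise", can be estimated from any labelled set of embeddings. A sketch with synthetic data and our own helper names (with equal class sizes and biased covariance estimates, the two traces sum exactly to the trace of the total covariance, by the law of total variance):

```python
import numpy as np

def signal_noise(emb, labels):
    """Trace of the inter-class covariance (variance of class means, the 'signal')
    and of the expected intra-class covariance (the 'noise')."""
    classes = np.unique(labels)
    means = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    sigma_b = np.cov(means.T, bias=True)  # covariance of class means
    sigma_w = np.mean([np.cov(emb[labels == c].T, bias=True) for c in classes], axis=0)
    return np.trace(sigma_b), np.trace(sigma_w)

rng = np.random.default_rng(1)
# three well-separated classes with small within-class spread
emb = np.concatenate([rng.normal(m, 0.1, (200, 2)) for m in (0.0, 2.0, 4.0)])
lab = np.repeat([0, 1, 2], 200)
signal, noise = signal_noise(emb, lab)
```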

Modelling assumptions of ProtoNets: The use of the squared Euclidean distance and softmax activation in ProtoNets implies that classification with ProtoNets is equivalent to mixture density estimation on the support set with spherical Gaussian densities (Snell et al., 2017). Specifically, we adopt the modelling assumptions that the distribution of the embedding given any class assignment is normal, f_φ(x) | c ∼ N(μ_c, Σ_c), with equal covariance for all classes in the embedding space, Σ_c = Σ for all c. (The second assumption is more general than the one used in the original paper, as we do not require Σ to be diagonal.)

We present the analysis for the special case of episodes with binary classification (i.e. with n = 2) for ease of presentation, but the conclusion generalizes to arbitrary n (see appendix). Also, as noted in Section 2.1, we assume equal likelihood between the classes. We emphasize that the assignment of labels can be permuted freely without affecting the classifier’s prediction, due to symmetry; hence, we only need to consider one case for the ground truth label without loss of generality. Let a and b denote any pair of classes sampled from τ. Let the query x be drawn from P(x | y = a), and overload a and b to also indicate the ground truth labels in the context of that episode; then equation 2 can be written as:

$$\mathrm{Acc} \;=\; \mathbb{E}_{a,\,b,\,S,\,x}\!\left[\mathbf{1}\!\left(p(y = a \mid x) > p(y = b \mid x)\right)\right] \tag{3}$$

Additionally, noting that p(y = a | x) can be expressed as a sigmoid function, p(y = a | x) = σ(‖f_φ(x) − μ̂_b‖² − ‖f_φ(x) − μ̂_a‖²) with σ(t) = 1/(1 + e^{−t}),

We can express equation 3 as a probability:

$$\mathrm{Acc} \;=\; \Pr(\alpha > 0), \qquad \text{where } \alpha \;=\; \lVert f_\phi(x) - \hat{\mu}_b \rVert^2 \;-\; \lVert f_\phi(x) - \hat{\mu}_a \rVert^2 \tag{4}$$

We will introduce a few auxiliary results before stating the main result for this section.

###### Proposition 1.

From the one-sided Chebyshev (Cantelli) inequality, provided E[α] > 0, it immediately follows that:

$$\Pr(\alpha > 0) \;\ge\; \frac{\mathbb{E}[\alpha]^2}{\mathbb{E}[\alpha]^2 + \mathrm{Var}(\alpha)} \tag{5}$$
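As a sanity check, the one-sided Chebyshev (Cantelli) bound Pr(α > 0) ≥ E[α]² / (E[α]² + Var(α)) can be verified by Monte Carlo (a sketch; the Gaussian `alpha` below merely stands in for the random score difference, and any distribution with positive mean would do):

```python
import numpy as np

rng = np.random.default_rng(2)
# alpha stands in for the random score difference of equation 4
alpha = rng.normal(2.0, 1.0, size=200_000)

mean, var = alpha.mean(), alpha.var()
empirical = (alpha > 0).mean()        # Monte Carlo estimate of Pr(alpha > 0)
cantelli = mean**2 / (mean**2 + var)  # one-sided Chebyshev lower bound
```

For this example the bound (about 0.8) is loose relative to the true probability (about 0.98), but it is what makes the dependence of the accuracy on Var(α), and hence on k, analytically tractable.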

In Lemma 1 and Lemma 2, we derive the expectation and variance of α when conditioned on the classes sampled in an episode. Then, in Theorem 3, we compose them into the RHS of Proposition 1 through the law of total expectation.

###### Lemma 1.

Consider the space of classes C with sampling distribution τ, and let a, b ∼ τ be the pair of classes sampled for the episode. Let k be the shot number, let the query x be drawn from class a, and let μ̂_a and μ̂_b denote the prototypes, i.e. the means of the k embedded supports of classes a and b respectively. Consider α as defined earlier. Assume f_φ(x) | c ∼ N(μ_c, Σ_c) and Σ_c = Σ for any choice of c. Then,

$$\mathbb{E}[\alpha \mid a, b] \;=\; \lVert \mu_a - \mu_b \rVert^2, \quad\text{and}\quad \mathbb{E}_{a,b}\!\left[\mathbb{E}[\alpha \mid a, b]\right] \;=\; 2\,\mathrm{tr}(\Sigma_b)$$

###### Lemma 2.

Under the same notation and assumptions as Lemma 1, and additionally invoking the definitions of Σ_b and Σ_w, we have:

(6)

The proofs of the above lemmas are in the appendix. With the results above, we are ready to state our main theoretical result in this section.

###### Proof.

Several observations can be made from Theorem 3:

1. The shot number k only appears in the first two terms of the denominator. It does *not* contribute to the last term of the denominator. This implies diminishing returns in expected accuracy when more support data is added without altering the embedding function.

2. By observing the degree of the k terms in equation 7 (and treating the last term of the denominator as a constant), it is clear that increasing k will decrease the sensitivity (magnitude of the partial derivative) of this lower bound to the noise Σ_w, and increase its sensitivity to the signal Σ_b.

3. If one postulates that meta-learning updates to f_φ resemble gradient ascent on this accuracy lower bound, then learning with smaller k emphasizes minimizing the noise, while learning with higher k emphasizes maximizing the signal.

In conclusion, these observations give us a plausible reason for the performance degradation observed with mismatched shots: when an embedding function has been optimized (trained) for a large k, the relatively high intra-class noise Σ_w is amplified by the now smaller k used at test time, resulting in degraded performance. Conversely, an embedding function trained for a small k already has small Σ_w, so increasing k during testing has a further diminished effect on performance.

### 3.2 Interpretation in terms of VC dimension

In any given episode, a nearest-neighbour prediction is performed from the support data (with a fixed embedding function). Therefore, a PAC-learnability interpretation of the relation between the number of support data and the complexity of the classifier can be made. Specifically, for binary classification, classical PAC learning theory (Vapnik et al., 1994) states that, with probability at least 1 − η, the following inequality on the difference between the empirical error (on the support samples) and the true error holds for any classifier h:

$$R(h) - \hat{R}(h) \;\le\; \sqrt{\frac{d_{VC}\left(\ln\frac{2m}{d_{VC}} + 1\right) - \ln\frac{\eta}{4}}{m}} \tag{11}$$

where R and R̂ denote the true and empirical error, d_{VC} is the VC dimension, and m is the number of support samples per class. (This considers the learning of a single episode through the formation of prototypes.) Under this binary classification setting, the prediction of the prototypical network is equal to σ(α), as shown earlier. Denoting w = 2(μ̂_a − μ̂_b) and β = ‖μ̂_b‖² − ‖μ̂_a‖², we can manipulate α as follows:

$$\alpha \;=\; \lVert f_\phi(x) - \hat{\mu}_b \rVert^2 - \lVert f_\phi(x) - \hat{\mu}_a \rVert^2 \;=\; 2(\hat{\mu}_a - \hat{\mu}_b)^{\top} f_\phi(x) + \lVert \hat{\mu}_b \rVert^2 - \lVert \hat{\mu}_a \rVert^2 \;=\; w^{\top} f_\phi(x) + \beta \tag{12}$$

From equation 12, the prototypes form a linear classifier with offset β in the embedding space. The VC dimension of this type of classifier is d + 1, where d is the intrinsic dimension of the embedding space (Vapnik, 1998). In ProtoNets, the intrinsic dimension of the embedding space is not only influenced by the network architecture, but more importantly determined by the parameters φ themselves, making it a learned property. For example, if the embedding function can be represented by a linear transformation W, then the intrinsic dimension of the embedding space is upper bounded by the rank of W (since all embeddings must lie in the column space of W). Thus, the number of support samples required to learn from an episode is proportional to the intrinsic dimension of the embedding space. We hypothesize that an embedding function optimal for lower-shot (e.g. one-shot) classification affords fewer intrinsic dimensions than one that is optimal for higher-shot (e.g. 10-shot) classification.
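The algebraic equivalence between prototype classification and a linear classifier is easy to verify numerically (a sketch with arbitrary prototypes and query embedding; `c_a`, `c_b`, and `z` are illustrative names):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
c_a, c_b = rng.normal(size=3), rng.normal(size=3)  # the two class prototypes
z = rng.normal(size=3)                             # embedded query f(x)

# ProtoNet's binary prediction: sigmoid of the squared-distance difference
p_proto = sigmoid(((z - c_b) ** 2).sum() - ((z - c_a) ** 2).sum())

# the same prediction as a linear classifier w^T z + offset in the embedding space
w = 2.0 * (c_a - c_b)
offset = (c_b ** 2).sum() - (c_a ** 2).sum()
p_linear = sigmoid(w @ z + offset)
```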

### 3.3 Reconciling shot discrepancy through embedding space transformation

Based on the above results, we postulate that when there is a discrepancy between meta-training shots and meta-testing shots (in deployment, the number of support samples per class is likely random from episode to episode), classification accuracy at meta-test time can be improved by maximizing the ratio of inter-class to intra-class variance. Intuitively, this can be achieved by minimizing the noise and maximizing the signal through a transformation that lies in the space of non-dominant eigenvectors of Σ_w while also being aligned to the dominant eigenvectors of Σ_b. Furthermore, section 3.2 suggests that reducing the dimensionality of the embedding space will improve performance in low-shot regimes. Hence, we propose a modification of ProtoNets which we call Embedding Space Transformation (EST). EST works by applying a transformation

$$f_{\mathrm{EST}}(x) \;=\; A\, f_\phi(x) \tag{13}$$

to the outputs of the embedding function at test time, effectively transforming the embedding space. Here, A is a linear transformation computed after meta-training has been completed. To compute A, we first iterate through all classes in the meta-training split and compute their in-class means and covariance matrices in the embedding space. We can then find the covariance of the means, Σ_b, and the mean of the covariances, Σ_w, across classes. Finally, A is obtained by taking the leading eigenvectors of Σ_b − ρΣ_w: the difference between the covariance matrix of the means and the mean covariance matrix, with weight parameter ρ. The exact procedure for computing A is presented in the appendix.
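A minimal sketch of this construction in NumPy, under the description above (the helper names, the weight `rho`, and the output dimension `d` are our own notation; the authors' exact procedure is in their appendix):

```python
import numpy as np

def est_transform(emb, labels, d, rho=1.0):
    """Return a d x M transform A: the top-d eigenvectors of (Sigma_b - rho * Sigma_w),
    where Sigma_b is the covariance of class means and Sigma_w the mean within-class
    covariance, both estimated on the meta-training embeddings."""
    classes = np.unique(labels)
    means = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    sigma_b = np.cov(means.T, bias=True)
    sigma_w = np.mean([np.cov(emb[labels == c].T, bias=True) for c in classes], axis=0)
    evals, evecs = np.linalg.eigh(sigma_b - rho * sigma_w)  # ascending eigenvalues
    return evecs[:, ::-1][:, :d].T                          # top-d eigenvectors as rows

rng = np.random.default_rng(4)
# classes separated along the first axis only; the second axis is pure noise
emb = np.concatenate([rng.normal([m, 0.0], [0.1, 1.0], (200, 2)) for m in (0.0, 2.0, 4.0)])
lab = np.repeat([0, 1, 2], 200)
A = est_transform(emb, lab, d=1)
transformed = emb @ A.T  # test-time embeddings in the reduced space
```

On this toy data, the retained direction aligns with the axis that separates the classes and discards the noisy axis, which is exactly the intended signal-preserving, noise-suppressing behaviour.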

## 4 Experiments and results

In this section, our first two experiments aim at supporting our theoretical results in sections 3.1 and 3.2, while our third experiment demonstrates the improvement of EST over vanilla ProtoNets on benchmark datasets. To illustrate the applicability of our results to different embedding function architectures, all experiments are performed with both a vanilla 4-layer CNN (as in (Snell et al., 2017)) and a 7-layer residual network (He et al., 2016). Detailed descriptions of the architectures can be found in the appendix. Experiments are performed on three datasets: Omniglot (Lake et al., 2015), mini-imagenet (Vinyals et al., 2016), and tiered-imagenet (Ren et al., 2018). We followed standard data processing procedures, which are detailed in the appendix.

### 4.1 Training shots affect variance contribution

The total variance observed among embeddings of all data points can be seen as a composition of inter-class and (expected) intra-class variance, based on the law of total variance: Var(f_φ(x)) = Var_c(E[f_φ(x) | c]) + E_c[Var(f_φ(x) | c)] = Σ_b + Σ_w. Our analysis predicts that as we increase the shot number used during training, the ratio of inter-class to intra-class variance will decrease.

To verify this hypothesis, we trained ProtoNets (vanilla and residual) with a range of shots (1 to 10 on mini-imagenet and tiered-imagenet, 1 to 5 on Omniglot) until convergence, with 3 random initializations per setting. Then, we computed the inter-class and intra-class covariance matrices across all samples in the training set embedded by each network. To quantify the amplitude of each matrix, we take its trace. The ratio of inter-class to intra-class variance is presented in figure 3: as the training shot k increases, the inter-class to intra-class variance ratio decreases. This trend can be observed in both vanilla and residual embeddings, and across all three datasets, lending strong support to our result in section 3.1. We also observe that the ratio between inter-class and intra-class variance is significantly higher on Omniglot than on the other two datasets. This may be reflective of the relative difficulty of each dataset, and is consistent with the accuracy of ProtoNet on these datasets.

### 4.2 Training shots affect intrinsic dimension

We consider the intrinsic dimension (id) of an embedding function with extrinsic dimension M (operated on a data set) to be the minimum integer d such that all embedded points of that data set lie within a d-dimensional subspace of R^M (Bishop, 2006). A simple method for estimating d is principal component analysis (PCA) of the embedded data set. By eigen-decomposing the covariance matrix of the embeddings, we obtain the principal components, expressed as the significant eigenvalues, and the principal directions, expressed as the corresponding eigenvectors. The number of significant eigenvalues approximates the intrinsic dimension of the embedding space. When the subspace is linear, this approximation is exact; otherwise, it serves as an upper bound on the true intrinsic dimension (Fukunaga and Olsen, 1971).

We determine the number of significant eigenvalues by an explained-variance over total-variance criterion. With eigenvalues λ_1 ≥ … ≥ λ_M in descending order, the qualifying metric is η_j = (Σ_{i=1}^{j} λ_i) / (Σ_{i=1}^{M} λ_i). In our experiments, we set a fixed threshold on η_j. Similar to the previous experiment, we train ProtoNets with different shots to convergence. The total covariance matrix is then computed on the training set and its eigen-decomposition is performed. The approximate id is plotted for various values of k in figure 2. We see a clear trend: as the training shot increases, the id of the embedding space increases.
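The estimator can be sketched as follows (illustrative code with a synthetic data set whose variance is concentrated in three directions, so the estimate should recover an intrinsic dimension of 3; the threshold value is an assumption for the example):

```python
import numpy as np

def intrinsic_dim(emb, threshold=0.95):
    """Smallest j such that the explained-variance ratio
    eta_j = (lambda_1 + ... + lambda_j) / (lambda_1 + ... + lambda_M)
    of the leading eigenvalues reaches the threshold."""
    evals = np.linalg.eigvalsh(np.cov(emb.T, bias=True))[::-1]  # descending eigenvalues
    ratio = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

rng = np.random.default_rng(5)
# 10-d ambient space with variance concentrated in the first three axes
emb = rng.normal(size=(1000, 10)) * np.array([2.0] * 3 + [0.01] * 7)
d_hat = intrinsic_dim(emb)
```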

### 4.3 Experiments with EST

We evaluate the performance of EST on the three aforementioned data sets. The performance is compared against our implementation of vanilla ProtoNets as a baseline, as well as a variant of ProtoNets using principal components obtained from all embedded points (ProtoNet-PCA).

All methods in this section use the same set of trained ProtoNets. As before, networks are trained with k ∈ {1, …, 5} on Omniglot and k ∈ {1, …, 10} on mini-imagenet and tiered-imagenet. Additionally, we also trained a mixed-shot network for each data set: for each episode, a value of k is randomly selected within the specified range, and the corresponding number of support samples is sampled. Hyper-parameters for training are described in the appendix.

Each model is evaluated on the test split of the corresponding data set (e.g. networks trained on Omniglot are only evaluated on Omniglot). On Omniglot, 5 test runs are performed per network to evaluate the k-shot performance for k ∈ {1, …, 5}. Each run consists of 600 episodes with the corresponding number of support samples and 5 query samples per class. Performance is aggregated across runs for the combined performance. Similarly, on mini-imagenet and tiered-imagenet, 10 test runs are performed (k ∈ {1, …, 10}), with 600 episodes per run and 15 query samples in each episode.

Model configuration: *Vanilla ProtoNet* is used as our baseline. We present the performance of multiple ProtoNets trained with different shots to illustrate the performance degradation issue.
*ProtoNet-PCA* uses principal components of the training-split embeddings in place of A, with all components other than the leading ones zeroed out. We carry out a parameter sweep on mini-imagenet to select the number of retained components; the same value is used on the other two datasets. For selecting the training shot of the embedding network, we evaluated all trained embedding networks and chose the training shot with the best overall performance.
*ProtoNet-EST* has three parameters that need to be determined: ρ, d, and the training shot of the embedding network. For our experiments, we set ρ and d based on performance on mini-imagenet. For selecting the training shot, we use the same strategy as before, evaluating ProtoNet-EST with all trained embedding networks, and found the same trend to hold.

As an ablation study, *FC-ProtoNet* adds a fully connected layer to the embedding network such that its output dimension matches that of EST. Results of this variant can be found in the appendix.

EST performance results: Table 1 summarizes the performance of the evaluated methods on all data sets. Due to space constraints, only the 1-shot, 5-shot, 10-shot, and 1–10-shot average performance are included; additional results are in the appendix. The best-performing method in each evaluation is in bold. On Omniglot, there is no significant difference in performance between the best-performing vanilla ProtoNet and any other method; we attribute this to the already high accuracy of the baseline model. On mini-imagenet and tiered-imagenet, ProtoNet-EST significantly outperforms the baseline methods and ProtoNet-PCA in terms of average accuracy over test runs with different shots.

We observe that matching the training shot to the test shot generally provides the best performance for vanilla ProtoNets. Importantly, training with a mixture of different k values does not provide optimal performance even when evaluated on the same mixture of k values; instead, the resulting performance is mediocre across all test shots. ProtoNet-EST provides minor improvements over the best-performing baseline method under most test-shot settings. We hypothesize that this is due to EST aligning the embedding space to the directions with high inter-class variance and low intra-class variance. The comparison against the direct PCA approach demonstrates that the performance uplift is not entirely attributable to reducing the dimensionality of the embedding space.

In conclusion, EST improves the performance of ProtoNets in the more challenging data sets when evaluated with various test-shots. It successfully tackles performance degradation when test shots and training shots are different. This improvement is vital to the deployment of ProtoNets in real world scenarios where the number of support samples cannot be determined in advance.

Omniglot, 20-way, with 4-layer CNN.
Omniglot, 20-way, with 7-layer ResNet.
Mini-imagenet, 5-way, with 4-layer CNN.
Mini-imagenet, 5-way, with 7-layer ResNet.
Tiered-imagenet, 5-way, with 4-layer CNN.
Tiered-imagenet, 5-way, with 7-layer ResNet.

## 5 Conclusion and future work

We have explored how the number of support samples used during meta-training influences the learned embedding function’s performance and intrinsic dimension. Our proposed method transforms the embedding space to maximize the inter-to-intra class variance ratio while constraining the dimensionality of the space itself. In terms of applications, our method can be combined with other works (Oreshkin et al., 2018; Ye et al., 2018; Rusu et al., 2019; Dong and Xing, 2018; Ren et al., 2018) that have an embedding learning component. We believe our approach is a significant step toward reducing the impact of the shot number in meta-training, which is a crucial hyperparameter for few-shot classification.

## 6 Acknowledgments

We thank Clément Fuji Tsang and Mark Brophy for helpful feedback on early versions of this paper.

## References

- Bishop (2006). Pattern recognition and machine learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
- Dong and Xing (2018). Few-shot semantic segmentation with prototype learning. In British Machine Vision Conference (BMVC).
- Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1126–1135.
- Fukunaga and Olsen (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers C-20, pp. 176–183.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448–456.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Lake et al. (2015). Human-level concept learning through probabilistic program induction. Science 350(6266), pp. 1332–1338.
- Law et al. (2019). Dimensionality reduction for representing the knowledge of probabilistic models. In International Conference on Learning Representations (ICLR).
- Law et al. (2016). Closed-form training of Mahalanobis distance for supervised clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3909–3917.
- Oreshkin et al. (2018). TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 721–731.
- Ravi and Larochelle (2017). Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR).
- Ren et al. (2018). Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Rencher and Schaalje (2008). Linear models in statistics. 2nd edition, John Wiley & Sons, Inc.
- Rusu et al. (2019). Meta-learning with latent embedding optimization. CoRR abs/1807.05960.
- Santoro et al. (2016). Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1842–1850.
- Snell et al. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pp. 4077–4087.
- Szegedy et al. (2015). Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR).
- Vapnik et al. (1994). Measuring the VC-dimension of a learning machine. Neural Computation 6(5), pp. 851–876.
- Vapnik (1998). Statistical machine learning. John Wiley & Sons, Inc.
- Vinyals et al. (2016). Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), pp. 3637–3645.
- Ye et al. (2018). Learning embedding adaptation for few-shot learning. CoRR abs/1812.03664.
- Yger et al. (2015). Fast learning with weak synaptic plasticity. Journal of Neuroscience 35(39), pp. 13351–13362.

## Appendix A Appendix

### a.1 Algorithm for EST

Below is the exact procedure for computing the transformation A used in the embedding space transformation.

### a.2 Network architecture

The vanilla CNN has the exact same architecture as the original ProtoNet (Snell et al., 2017). It consists of four convolution layers of depth 64; each convolution layer is followed by ReLU activation, max-pooling, and batch normalization (Ioffe and Szegedy, 2015). The 7-layer ResNet is constructed with one vanilla convolution layer of depth 64 followed by three residual blocks, all joined by max-pooling layers; each residual block consists of two sets of conv-batchnorm-ReLU layers, with depths 128-256-256.

### a.3 Dataset description and pre-processing

Experiments are performed on three datasets: Omniglot (Lake et al., 2015), mini-imagenet (Vinyals et al., 2016), and tiered-imagenet (Ren et al., 2018). For Omniglot experiments, we follow the same configuration as in the original paper where 1200 classes augmented with rotations (4800 total) are used for training, and the remaining classes are used for testing.

For mini-imagenet experiments, we use the splits proposed by Ravi and Larochelle (2017), where 64 classes are used for training, 16 for validation, and 20 for testing. Mirroring the original paper, we resize all mini-imagenet images to 84x84; no data augmentation is applied. As most state-of-the-art few-shot classification methods achieve saturated accuracies on Omniglot, and mini-imagenet’s small number of classes makes claims about generalization difficult, we also conduct experiments on tiered-imagenet.

Tiered-imagenet is also a subset of ImageNet-1000. It groups classes into broader categories corresponding to higher-level nodes in the ImageNet hierarchy: 34 categories, each containing between 10 and 30 classes, split into 20 training, 6 validation, and 8 testing categories. In total, there are 351 classes for training, 97 for validation, and 160 for testing. Preprocessing of the images follows the same steps as for mini-imagenet.

### a.4 ProtoNet training

The training procedure of ProtoNets largely mirrors the protocol of Snell et al. (2017). On Omniglot, we train the network to convergence over 30000 episodes. On mini-imagenet and tiered-imagenet, we monitor the performance of the network on the validation set and select the best-performing checkpoint after training for 50000 episodes. The Adam (Kingma and Ba, 2014) optimizer is used, with an initial learning rate that is decayed by half every 2000 episodes. On Omniglot, we train with 60 classes and 5 query points per episode. On mini-imagenet and tiered-imagenet, we train with 20 classes and 15 query points per episode.

### a.5 Derivation Details

Proof of Lemma 1:

###### Proof.

Similarly for :

Putting together and :

Then, since , we have:

∎

For the proof of Lemma 2, we first restate a result on quadratic forms of normally distributed random vectors from Rencher and Schaalje (2008).

###### Theorem 4.

Consider a random vector $y \sim \mathcal{N}(\mu, \Sigma)$ and a symmetric matrix of constants $A$. Then:

$$\mathbb{E}\left[y^\top A y\right] = \operatorname{tr}(A\Sigma) + \mu^\top A \mu, \qquad \operatorname{Var}\left[y^\top A y\right] = 2\operatorname{tr}\left((A\Sigma)^2\right) + 4\,\mu^\top A \Sigma A \mu.$$
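As a numerical sanity check, the standard expectation identity behind Theorem 4, $\mathbb{E}[y^\top A y] = \operatorname{tr}(A\Sigma) + \mu^\top A \mu$, can be verified by Monte Carlo (a sketch with arbitrary choices of $\mu$, $\Sigma$, and $A$ that are our own, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = np.array([1.0, -0.5, 2.0])
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)   # symmetric positive definite covariance
A = rng.normal(size=(d, d))
A = (A + A.T) / 2                 # symmetric matrix of constants

# Closed form: E[y' A y] = tr(A Sigma) + mu' A mu
theory = np.trace(A @ Sigma) + mu @ A @ mu

# Monte Carlo estimate over draws y ~ N(mu, Sigma)
y = rng.multivariate_normal(mu, Sigma, size=500_000)
estimate = np.einsum("ni,ij,nj->n", y, A, y).mean()
```

With half a million samples, the Monte Carlo estimate matches the closed form to well within sampling error.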

Proof of Lemma 2:

###### Proof.

From line 2 to line 3, we used the Cauchy–Schwarz inequality. From line 3 to line 4, note that for all . By applying Theorem 4, we have:

Finally,

∎

Extending to the multiclass case: let denote the query data pair, and let the set of classes be denoted as . Let . Then we have a correct prediction if . Hence:

By Fréchet's inequality:

Noting that Theorem 3 can be applied to each term in the summation:

It is then clear that the observations made in the binary case also apply to the multiclass case.
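Spelled out, the one-sided Fréchet inequality used above lower-bounds the probability that all pairwise comparisons succeed simultaneously. Writing $A_c$ for the event that the query is scored correctly against class $c$ (a placeholder notation of our own), the bound over $m$ such events reads:

```latex
% One-sided Fréchet inequality over the m pairwise events A_1, ..., A_m
\Pr\Bigl(\bigcap_{c=1}^{m} A_c\Bigr)
  \;\ge\; \sum_{c=1}^{m} \Pr(A_c) \;-\; (m - 1)
  \;=\; 1 \;-\; \sum_{c=1}^{m} \bigl(1 - \Pr(A_c)\bigr)
```

Each term $1 - \Pr(A_c)$ is a binary (two-class) failure probability, which is why the binary-case bound of Theorem 3 can be applied to each term of the sum.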

### a.6 Additional Results

Additionally, we experimented with directly setting the output dimension of the embedding network to 60 by adding a fully connected layer to the embedding network. This variant of ProtoNet performs worse than both the base variant and all other methods.

| Model | Training Shots | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ProtoNet + FC | 5 | 44.77 | 53.75 | 58.04 | 61.06 | 62.26 | 64.60 | 65.19 | 66.63 | 66.65 | 67.52 | 61.05 ± 0.28 |

*mini-imagenet-5-way*, with a 4-layer CNN + 1 fully-connected-layer embedding network. Columns 1–10 report accuracy (%) for each number of testing shots.

| Model | Training Shots | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla ProtoNet | 1 | 48.89 | 56.54 | 60.31 | 63.12 | 64.70 | 66.02 | 66.62 | 67.99 | 68.41 | 68.90 | 63.15 ± 0.21 |
| EST ProtoNet | 1 | 49.07 | 56.49 | 60.62 | 62.67 | 64.83 | 66.23 | 66.84 | 67.90 | 67.68 | 68.73 | 63.11 ± 0.22 |
| PCA ProtoNet | 1 | 49.01 | 56.35 | 60.07 | 62.42 | 64.38 | 65.28 | 66.56 | 67.81 | 67.59 | 68.26 | 62.77 ± 0.22 |
| Vanilla ProtoNet | 5 | 44.75 | 56.61 | 61.52 | 65.32 | 67.23 | 69.04 | 70.66 | 71.47 | 71.84 | 72.36 | 65.08 ± 0.23 |
| EST ProtoNet | 5 | 50.22 | 59.04 | 64.14 | 66.61 | 68.25 | 69.46 | 70.80 | 71.60 | 72.61 | 73.29 | 66.60 ± 0.23 |
| PCA ProtoNet | 5 | 48.72 | 58.43 | 63.17 | 66.07 | 68.63 | 69.56 | 70.55 | 71.21 | 72.09 | 72.82 | 66.12 ± 0.24 |
| Vanilla ProtoNet | 10 | 39.99 | 52.73 | 59.71 | 63.41 | 66.23 | 68.27 | 69.86 | 71.03 | 71.72 | 72.47 | 63.54 ± 0.25 |
| EST ProtoNet | 10 | 48.98 | 57.83 | 63.13 | 66.39 | 68.12 | 69.82 | 70.63 | 71.85 | 72.79 | 73.22 | 66.28 ± 0.23 |
| PCA ProtoNet | 10 | 48.04 | 57.05 | 62.46 | 64.63 | 67.61 | 68.98 | 69.71 | 71.78 | 71.75 | 72.42 | 65.44 ± 0.24 |
| Vanilla ProtoNet | 1-10 | 49.36 | 58.67 | 62.77 | 65.76 | 67.96 | 69.20 | 70.05 | 70.90 | 71.34 | 72.27 | 65.83 ± 0.21 |

*mini-imagenet-5-way*, with a 4-layer CNN embedding network. Columns 1–10 report accuracy (%) for each number of testing shots.

| Model | Training Shots | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla ProtoNet | 1 | 47.37 | 55.72 | 59.27 | 61.50 | 63.70 | 64.28 | 65.01 | 66.43 | 67.20 | 67.99 | 61.85 ± 0.29 |
| EST ProtoNet | 1 | 47.71 | 55.05 | 59.31 | 61.97 | 63.48 | 64.65 | 65.35 | 66.33 | 66.42 | 67.44 | 61.77 ± 0.31 |
| PCA ProtoNet | 1 | 47.72 | 54.60 | 58.69 | 60.73 | 62.78 | 64.92 | 65.88 | 65.88 | 66.11 | 66.87 | 61.42 ± 0.32 |
| Vanilla ProtoNet | 5 | 42.33 | 55.06 | 61.32 | 64.57 | 66.51 | 67.86 | 69.33 | 70.15 | 71.32 | 72.05 | 64.05 ± 0.33 |
| EST ProtoNet | 5 | 48.85 | 58.38 | 62.75 | 65.16 | 67.24 | 68.39 | 69.89 | 70.99 | 70.87 | 72.09 | 65.46 ± 0.32 |
| PCA ProtoNet | 5 | 48.34 | 57.44 | 0.79 | 64.96 | 67.07 | 67.93 | 69.08 | 70.40 | 70.38 | 71.65 | 64.96 ± 0.31 |
| Vanilla ProtoNet | 10 | 35.38 | 49.40 | 56.76 | 61.25 | 64.56 | 66.56 | 68.05 | 69.31 | 70.12 | 71.03 | 61.24 ± 0.36 |
| EST ProtoNet | 10 | 47.33 | 56.75 | 62.80 | 65.78 | 66.84 | 69.07 | 69.98 | 70.82 | 71.99 | 71.20 | 65.26 ± 0.32 |
| PCA ProtoNet | 10 | 46.55 | 56.61 | 60.81 | 64.01 | 66.53 | 67.70 | 69.18 | 69.83 | 70.62 | 71.22 | 64.31 ± 0.33 |
| Vanilla ProtoNet | 1-10 | 47.65 | 56.23 | 62.12 | 63.93 | 66.34 | 67.94 | 68.44 | 68.93 | 70.80 | 70.96 | 64.33 ± 0.30 |

*tiered-imagenet-5-way*, with a 4-layer CNN embedding network. Columns 1–10 report accuracy (%) for each number of testing shots.

| Model | Training Shots | 1 | 2 | 3 | 4 | 5 | Average Accuracy |
|---|---|---|---|---|---|---|---|
| Vanilla ProtoNet | 1 | 95.07 ± 0.17 | 97.89 ± 0.09 | 98.45 ± 0.08 | 98.75 ± 0.06 | 98.89 ± 0.06 | 97.81 ± 0.06 |
| Vanilla ProtoNet | 2 | 94.59 ± 0.18 | 97.69 ± 0.09 | 98.44 ± 0.07 | 98.69 ± 0.06 | 98.89 ± 0.06 | 97.66 ± 0.07 |
| Vanilla ProtoNet | 3 | 94.19 ± 0.18 | 97.57 ± 0.09 | 98.30 ± 0.07 | 98.63 ± 0.07 | 98.79 ± 0.06 | 97.50 ± 0.07 |
| Vanilla ProtoNet | 4 | 93.79 ± 0.18 | 97.41 ± 0.10 | 98.19 ± 0.08 | 98.54 ± 0.07 | 98.75 ± 0.06 | 97.34 ± 0.07 |
| Vanilla ProtoNet | 5 | 93.42 ± 0.18 | 97.34 ± 0.10 | 98.18 ± 0.07 | 98.53 | | |

Columns 1–5 report accuracy (%) for each number of testing shots.