Stochastic Prototype Embeddings
Abstract
Supervised deep-embedding methods project inputs of a domain to a representational space in which same-class instances lie near one another and different-class instances lie far apart. We propose a probabilistic method that treats embeddings as random variables. Extending a state-of-the-art deterministic method, Prototypical Networks (Snell et al., 2017), our approach supposes the existence of a class prototype around which class instances are Gaussian distributed. The prototype posterior is a product distribution over labeled instances, and query instances are classified by marginalizing relative prototype proximity over embedding uncertainty. We describe an efficient sampler for approximate inference that allows us to train the model at roughly the same space and time cost as its deterministic sibling. Incorporating uncertainty improves performance on few-shot learning and gracefully handles label noise and out-of-distribution inputs. Compared to the state-of-the-art stochastic method, Hedged Instance Embeddings (Oh et al., 2019), we achieve superior large- and open-set classification accuracy. Our method also aligns class-discriminating features with the axes of the embedding space, yielding an interpretable, disentangled representation.
1 Introduction
Supervised deep-embedding methods map instances from an input space to a latent embedding space in which same-label pairs are near and different-label pairs are far. The embedding thus captures semantic relationships without discarding inter-class structure. In contrast, consider a standard neural network classifier with a softmax output layer trained with a cross-entropy loss. Although its penultimate layer might be treated as an embedding, the classifier's training objective attempts to orthogonalize all classes and thereby eliminate any information about inter-class structure.
Nearly all methods previously proposed for deep embeddings are deterministic: an instance projects to a single point in the embedding space. Deterministic embeddings fail to capture uncertainty, whether due to out-of-distribution inputs (e.g., data corruption) or to label ambiguity (e.g., overlapping classes). Representing uncertainty is important for many reasons, including robust classification and decision making, informing downstream models, interpreting representations, and detecting out-of-distribution samples. In this article, we propose a method for discovering stochastic embeddings, in which each embedded instance is a random variable whose distribution reflects the uncertainty in the embedding space.
Our proposed method, the Stochastic Prototype Embedding (SPE), is an extension of the Prototypical Network (PN) (Snell et al., 2017). As in the PN, the SPE assumes each class can be characterized by a prototype in the embedding space, and an instance is classified based on its proximity to a prototype. In the case of the SPE, the embeddings and prototypes are Gaussian random variables, each class instance is assumed to be a Gaussian perturbation of the prototype, and a query instance is classified by marginalizing over the embedding uncertainty. Using a synthetic data set, we demonstrate that the embedding uncertainty is related to both input and label noise. On a few-shot learning task, we show that the SPE significantly outperforms its state-of-the-art deterministic sibling, the PN. And on a challenging classification task, we find that the SPE outperforms Hedged Instance Embeddings (HIB) (Oh et al., 2019), the state-of-the-art stochastic embedding method.
2 Related Work
Supervised embedding methods are popular in the few-shot learning literature (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Triantafillou et al., 2017; Finn et al., 2017; Edwards and Storkey, 2017; Scott et al., 2018; Ridgeway and Mozer, 2018; Mishra et al., 2018), where the goal is to classify query instances based on one or a small number of labeled exemplars of novel classes. These methods operate by embedding the queries and exemplars using a pretrained network and classifying each query according to its proximity to the exemplars. Embedding methods are also critical in open-set recognition domains such as face recognition and person re-identification (Chopra et al., 2005; Li et al., 2014; Yi et al., 2014; Zheng et al., 2015; Schroff et al., 2015; Liu et al., 2015; Ustinova and Lempitsky, 2016; Song et al., 2016; Wang et al., 2017).
Loss functions used to obtain embeddings can be characterized according to the number of instances required to specify the loss. To describe these losses, we will use the notation $z_i^a$ for an embedding of instance $i$ of class $a$. Pairwise losses attempt to minimize within-class distances, $\|z_i^a - z_j^a\|$, and maximize between-class distances, $\|z_i^a - z_j^b\|$ (Chopra et al., 2005; Hadsell et al., 2006; Yi et al., 2014). Triplet losses attempt to ensure within-class instances are closer than between-class instances, $\|z_i^a - z_j^a\| < \|z_i^a - z_k^b\|$ (Schroff et al., 2015; Song et al., 2016; Wang et al., 2017). Quadruplet losses attempt to ensure every within-class pair is closer than every between-class pair, $\|z_i^a - z_j^a\| < \|z_k^b - z_l^c\|$ for $b \neq c$ (Ustinova and Lempitsky, 2016). Finally, cluster-based losses attempt to use all instances of a class (Rippel et al., 2016; Fort, 2017; Song et al., 2017; Snell et al., 2017; Ridgeway and Mozer, 2018). In particular, the Prototypical Network (Snell et al., 2017) computes the mean of a set of instances of a class, $\rho_a$, and ensures that additional instances of that class, $z_i^a$, satisfy a proximity constraint such as $\|z_i^a - \rho_a\| < \|z_i^a - \rho_b\|$ for all $b \neq a$; a code sketch of these constraints appears below. Cluster-based methods represent the state of the art, outperforming pairwise and triplet losses in particular, as one might expect given the chronology of publication.
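To make the taxonomy concrete, here is a minimal NumPy sketch (function names are ours, not from the literature) of the quantities the loss families constrain:

```python
import numpy as np

def pairwise_terms(za_i, za_j, zb_k):
    """Distances constrained by pairwise and triplet losses: pairwise losses
    shrink d_within and grow d_between; triplet losses enforce
    d_within < d_between for an anchor za_i."""
    d_within = np.linalg.norm(za_i - za_j)    # same-class pair
    d_between = np.linalg.norm(za_i - zb_k)   # different-class pair
    return d_within, d_between

def prototype_constraint(z, protos, a):
    """Cluster-based (PN-style) constraint: an embedding z of class a should
    be closer to its own class prototype than to any other prototype."""
    d = np.linalg.norm(protos - z, axis=1)    # distance to each prototype
    return d[a] < np.min(np.delete(d, a))
```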
Recently, probabilistic embedding methods have begun to appear. Allen et al. (2019) extend PNs via Bayesian nonparametric methods that treat each prototype as a mixture distribution, though they do not explore uncertainty in the embedding space nor leverage the embedding to handle noisy inputs and noisy labels, which is a significant aspect of our work. Vilnis and McCallum (2018) propose an unsupervised method for learning density-based word embeddings, where each embedding is represented by a Gaussian distribution; however, this work is not comparable to our supervised method. Deep Variational Transfer (Belhaj et al., 2018) is a generative form of the discriminative model we propose; it has the drawback that it needs to model the input distribution. Its authors applied the approach to covariate shift, a somewhat different problem than the one we tackle.
Two prior methods have been proposed for discovering stochastic embeddings in a supervised setting, i.e., for few-shot and open-set recognition. The Hedged Instance Embedding (HIB) (Oh et al., 2019) utilizes a probabilistic alternative to the contrastive loss and is trained using a variational approximation to the information-bottleneck principle. HIB is critically dependent on a constant, $\beta$, that determines characteristics of the information bottleneck (i.e., how much of the input entropy is retained in the embedding); choosing this constant is a matter of art. The Oracle-Prioritized Belief Network (OPBN) (Karaletsos et al., 2016) is a generative model that learns a joint distribution over inputs and oracle-provided triplet constraints. The OPBN was not tested on few-shot and open-set recognition because it requires extensions to be applied to classification tasks. In the deterministic setting, Scott et al. (2018) argue that cluster-based methods outperform pairwise and triplet methods; thus, we have reason to expect that in a stochastic setting, a cluster-based method like the one we propose in this article, SPE, will outperform pairwise (HIB) and triplet (OPBN) methods.
3 The Model
The SPE assumes that the latent representation, $z$, is a Gaussian random variable conditioned on the input, $x$:

(1) $p(z \mid x) = \mathcal{N}\big(z;\, \mu(x),\, \mathrm{diag}(\sigma^2(x))\big)$
with mean, $\mu(x)$, and variance, $\sigma^2(x)$, computed by a deep neural network, similar to a Variational Autoencoder (Kingma and Welling, 2014). The classification, $y$, in turn is conditioned on $z$, with $p(y \mid z)$ taking the same form as in the original PN (Snell et al., 2017), to be described shortly. Given an input, a class prediction is made by marginalizing over the embedding uncertainty:

(2) $p(y \mid x) = \int p(y \mid z)\, p(z \mid x)\, dz$
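As a concrete illustration, the following NumPy sketch (function names are ours, not the paper's) splits a network output of length $2D$ into the mean and softplus-constrained diagonal variance of Equation 1, and draws a reparameterized sample of $z$ of the kind used to approximate the marginalization in Equation 2; the softplus choice follows Appendix B:

```python
import numpy as np

def softplus(a):
    # numerically stable log(1 + exp(a))
    return np.logaddexp(0.0, a)

def gaussian_embedding(net_output):
    """Split a length-2D network output into the mean and diagonal
    variance of p(z | x) (Equation 1); softplus keeps variances positive."""
    D = net_output.shape[-1] // 2
    mu = net_output[..., :D]
    sigma2 = softplus(net_output[..., D:])
    return mu, sigma2

def sample_z(mu, sigma2, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
```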
Figure 1a depicts the relationship between the input, latent, and class representations. We train the SPE using the standard few-shot learning paradigm, consisting of a sequence of episodes, each with instances of several classes. We split the instances of each class $c$ into support examples, defining a set $S_c$, and query examples. The support instances for class $c$ are used to determine the class prototype, $\rho_c$, and the query instances are evaluated to predict class labels (Equation 2).
3.1 Forming class prototypes
In the SPE, each class $c$ has an associated prototype, $\rho_c$, in the embedding space, and each instance $i$ of class $c$ projects to an embedding, $z_i^c$, in the neighborhood of $\rho_c$ such that:

(3) $p(z_i^c \mid \rho_c) = \mathcal{N}\big(z_i^c;\, \rho_c,\, \sigma_o^2 I\big)$
We assume that the prototype is consistent with all support instances, allowing us to express the likelihood of $\rho_c$ as a product distribution:

(4) $p(\rho_c \mid S_c) \propto \prod_{i \in S_c} \mathcal{N}\big(\rho_c;\, \mu_i,\, \mathrm{diag}(\sigma_i^2)\big)$

where $(\mu_i, \sigma_i^2)$ are the embedding parameters of support instance $i$.
Because each factor is Gaussian, the resulting product is too:

(5) $p(\rho_c \mid S_c) = \mathcal{N}\big(\rho_c;\, \mu_c,\, \mathrm{diag}(\sigma_c^2)\big), \qquad \sigma_c^2 = \Big(\textstyle\sum_{i \in S_c} \sigma_i^{-2}\Big)^{-1}, \qquad \mu_c = \sigma_c^2 \odot \textstyle\sum_{i \in S_c} \mu_i / \sigma_i^2$

where the divisions are elementwise and $\odot$ denotes the Hadamard product. Essentially, the prototype is a confidence-weighted average of the support instances. This formulation has a clear advantage over the deterministic PN, which is premised on an unweighted average, because it de-emphasizes noisy support instances.
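The fusion in Equation 5 is simple to implement; a minimal NumPy sketch (names ours) computes the prototype parameters from the per-instance support embeddings:

```python
import numpy as np

def prototype_posterior(support_mu, support_sigma2):
    """Product-of-Gaussians fusion of Equation 5. Inputs are n_s x D arrays
    of per-instance means and diagonal variances; precisions add, and the
    prototype mean is a confidence-weighted average of the support means."""
    precision = 1.0 / support_sigma2
    proto_sigma2 = 1.0 / precision.sum(axis=0)
    proto_mu = proto_sigma2 * (precision * support_mu).sum(axis=0)
    return proto_mu, proto_sigma2
```

A support instance with large variance contributes little to the prototype mean, which is precisely the discounting of noisy support instances described above.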
3.2 Prediction and approximate inference
We assume a softmax prediction for a query embedding, $z$:

(6) $p(y = c \mid z) = \dfrac{\mathcal{N}\big(z;\, \mu_c,\, \mathrm{diag}(\sigma_c^2)\big)}{\sum_{c'} \mathcal{N}\big(z;\, \mu_{c'},\, \mathrm{diag}(\sigma_{c'}^2)\big)}$

with $\mu_c$ and $\sigma_c^2$ as before, yielding the class posterior for query $x$ (writing $\mu_x = \mu(x)$ and $\sigma_x^2 = \sigma^2(x)$):

(7) $p(y = c \mid x) = \int \mathcal{N}\big(z;\, \mu_x,\, \mathrm{diag}(\sigma_x^2)\big)\, p(y = c \mid z)\, dz$
The class distribution is equivalent to that produced by the deterministic PN as $\sigma_x^2 \to 0$ when $\sigma_c^2 = \sigma_{c'}^2$ for all class pairs $(c, c')$. However, in the general case, the integral has no closed-form solution; thus, we must sample to approximate $p(y \mid x)$, both for training and evaluation. We employ two samplers, which we refer to as naïve and intersection.
3.2.1 Naïve sampling
A direct approach to approximating the class posterior is to express Equation 2 as an expectation, $p(y \mid x) = \mathbb{E}_{z \sim p(z \mid x)}\big[p(y \mid z)\big]$, and to replace the expectation with the average over a set of samples. We utilize the reparameterization trick of Kingma and Welling (2014) to train the model. Although this is the simplest approach, it is sample-inefficient during training, and when the number of samples is reduced, model performance suffers.
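A minimal NumPy sketch of this estimator, assuming the Gaussian-likelihood softmax of Equation 6 (function names are ours):

```python
import numpy as np

def log_normal(z, mu, sigma2):
    """Log density of a diagonal Gaussian, summed over embedding dimensions."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (z - mu) ** 2 / sigma2,
                         axis=-1)

def naive_class_posterior(query_mu, query_sigma2, proto_mu, proto_sigma2,
                          n_samples, rng):
    """Monte Carlo estimate of p(y | x): draw reparameterized samples from
    p(z | x) and average the Equation 6 softmax across samples.
    proto_mu and proto_sigma2 are C x D arrays of prototype parameters."""
    eps = rng.standard_normal((n_samples, query_mu.shape[0]))
    z = query_mu + np.sqrt(query_sigma2) * eps              # n_samples x D
    logp = log_normal(z[:, None, :], proto_mu[None], proto_sigma2[None])
    logp -= logp.max(axis=1, keepdims=True)                 # stabilize softmax
    p = np.exp(logp)
    return (p / p.sum(axis=1, keepdims=True)).mean(axis=0)  # length-C estimate
```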
3.2.2 Intersection sampling
In Equation 7, the product of Gaussian densities in the numerator can be rewritten:

(8) $\mathcal{N}\big(z;\, \mu_x,\, \mathrm{diag}(\sigma_x^2)\big)\, \mathcal{N}\big(z;\, \mu_c,\, \mathrm{diag}(\sigma_c^2)\big) = S_{xc}\, \mathcal{N}\big(z;\, \mu_{xc},\, \mathrm{diag}(\sigma_{xc}^2)\big)$

where $\sigma_{xc}^2 = \big(\sigma_x^{-2} + \sigma_c^{-2}\big)^{-1}$, $\mu_{xc} = \sigma_{xc}^2 \odot \big(\mu_x / \sigma_x^2 + \mu_c / \sigma_c^2\big)$, and $S_{xc} = \mathcal{N}\big(\mu_x;\, \mu_c,\, \mathrm{diag}(\sigma_x^2 + \sigma_c^2)\big)$ is a scaling constant. Substituting Equation 8 into Equation 7,

(9) $p(y = c \mid x) = S_{xc}\; \mathbb{E}_{z \sim \mathcal{N}(\mu_{xc},\, \mathrm{diag}(\sigma_{xc}^2))} \left[ \dfrac{1}{\sum_{c'} \mathcal{N}\big(z;\, \mu_{c'},\, \mathrm{diag}(\sigma_{c'}^2)\big)} \right]$
By approximating the expectation with samples from $\mathcal{N}\big(\mu_{xc},\, \mathrm{diag}(\sigma_{xc}^2)\big)$, we obtain a sampler that focuses on the intersection of the input distribution and a given class distribution, as illustrated in Figure 1b. During training with a cross-entropy loss, we need only sample for the known (target) class $c$. As we will demonstrate, this method is more robust and significantly more sample efficient than the naïve sampler, requiring only a single sample to train effectively.
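A NumPy sketch of the intersection estimator follows, reusing log_normal from the naïve-sampler sketch; the variable names are ours:

```python
import numpy as np

def intersection_posterior(query_mu, query_sigma2, proto_mu, proto_sigma2,
                           c, n_samples, rng):
    """Estimate p(y = c | x) via Equations 8 and 9 by sampling from the
    intersection Gaussian rather than from p(z | x); n_samples = 1
    suffices during training."""
    # Equation 8: product of the query and class-c Gaussians
    sigma2_xc = 1.0 / (1.0 / query_sigma2 + 1.0 / proto_sigma2[c])
    mu_xc = sigma2_xc * (query_mu / query_sigma2 + proto_mu[c] / proto_sigma2[c])
    log_scale = log_normal(query_mu, proto_mu[c],
                           query_sigma2 + proto_sigma2[c])   # log S_xc
    # Equation 9: expectation under the intersection distribution
    eps = rng.standard_normal((n_samples, mu_xc.shape[0]))
    z = mu_xc + np.sqrt(sigma2_xc) * eps
    denom = np.exp(log_normal(z[:, None, :], proto_mu[None],
                              proto_sigma2[None])).sum(axis=1)
    return np.exp(log_scale) * np.mean(1.0 / denom)
```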
4 Experimental Results
We report on three sets of experiments. In Section 4.1, we demonstrate, using a synthetic data set, that SPE infers the generative structure of a domain, disentangles class-discriminating features, and provides meaningful estimates of label uncertainty and input noise. In Section 4.2, we show that SPE obtains state-of-the-art results on few-shot learning via a comparison to its deterministic sibling, PN, the previous state-of-the-art method. We evaluate on a standard data set used to compare methods in the few-shot learning literature, Omniglot (Lake et al., 2015). In Section 4.3, we show that SPE obtains state-of-the-art results on large-set classification via a comparison to the only other fully developed stochastic method for supervised embeddings, HIB (Oh et al., 2019). We evaluate on the only data set that Oh et al. (2019) used to explore HIB, a multi-digit variant of MNIST. For details regarding network architectures and hyperparameters, see Appendix A, and for simulation details, including the choice of initialization for $\sigma_o$, see Appendix B.
4.1 Synthetic color-orientation data set
The data set consists of images of 'L' shapes, with four classes that are distinguished by orientation, color, or both (Figure 2a). Instances are sampled from a class-conditional isotropic Gaussian distribution in the generative space. (Treating these qualitatively different dimensions as isotropic is sensible because both can be mapped as directional quantities.) Because classes overlap on both the color and orientation dimensions, elicited embeddings should indicate increased uncertainty near class boundaries. Full details of the synthetic data set can be found in Appendix A.2.
We trained a two-dimensional, intersection-sampling SPE on samples from this domain, using two instances per class to form prototypes. Classification accuracy on held-out samples approaches that of a Bayes-optimal classifier, whose accuracy is limited by the overlap between classes. For visualization, Figure 2b presents a grid of examples with the class centroids in the corners and the other examples obtained by linear interpolation in the generative space. The resulting embeddings are presented in Figure 2c. Although the correspondence between Figures 2b and 2c seems trivial (mirror one set along the horizontal axis to obtain the other), remember that the input space is the high-dimensional pixel space while the latent space is two-dimensional. The network has captured the structure of the domain by disentangling the two factors of variation. Further, the embedding variance encodes label ambiguity; instances halfway between two classes on one dimension have maximal variance along that dimension.

Label ambiguity is one type of uncertainty. An equally important source of uncertainty comes from noisy or out-of-distribution (OOD) inputs. We examined OOD inputs generated in two different ways. In the left panel of Figure 3, we show the consequence of adding pixel-hue noise to the four class centroids. Only one of these centroids is shown along the abscissa, but all four are used to make the graph, with many samples per noise level. The grey and black bars in the graph indicate variance on the horizontal and vertical dimensions of the embedding space, respectively. As pixel-hue noise increases, uncertainty in color grows but uncertainty in orientation does not. In the right panel of Figure 3, we show the consequence of shortening the leg length of the shape. Shortening the legs removes cues that are used for determining both color and orientation; as a result, the uncertainty grows on both dimensions.
4.2 Omniglot
The Omniglot data set contains images of labeled, handwritten characters from diverse alphabets and is one of the standard data sets for comparing methods in the few-shot learning literature. The data set contains 1623 unique characters, each with 20 instances. Following Snell et al. (2017), each grayscale image is resized from 105×105 to 28×28, and we augment the original classes with all rotations in 90-degree increments, resulting in 6492 total classes. We train PNs and SPEs episodically, where each training episode contains a random sample of classes and query instances drawn from each.
To compare the relative effectiveness of the naïve and intersection samplers, we train the SPE on Omniglot varying both the sampler and the number of samples drawn per training query, denoted $s$. We evaluate in a fixed shot and class setting, where "shot" refers to the number of support examples used to compute each prototype. Figure 4 shows test classification accuracy as the number of samples drawn per training trial ($s$) increases. As we previously stated, the intersection-sampling SPE is far more sample efficient, to the point that the intersection sampler with a single sample outperforms the naïve sampler with many times as many. We have verified that the pattern in Figure 4 is consistent across simulations; consequently, we present only intersection-sampling SPE results in the remainder of the article, and all SPEs are trained with a single sample ($s = 1$) per query. This choice puts the SPE on par with the PN in time and space requirements, even though using more samples may boost classification accuracy, as suggested by the trend in Figure 4.
Figure 5 is a visualization of a two-dimensional embedding learned by the intersection-sampling SPE on Omniglot. All classes shown in the figure were held out during training. Omniglot characters clearly vary along more than two dimensions, so a two-dimensional SPE cannot learn a fully disentangled representation as it did with the synthetic data set. However, we can still interpret the axes of the embedding. The horizontal axis appears to represent character complexity, with single-stroke characters on the left and many-stroke characters on the right. The vertical axis appears to encode the aspect ratio of the characters, with horizontally extended characters on the bottom and vertically extended characters on the top.
Figure 6a compares the PN and SPE with low-dimensional embeddings on Omniglot test classes. Each bar is the mean accuracy across four conditions: 1-shot/5-class, 5-shot/5-class, 1-shot/20-class, and 5-shot/20-class. The first pair of bars reflects the standard comparison in which the support set (1 or 5 instances) is used to obtain an embedding for each class, prototypes are formed, and query instances are classified. SPE is reliably better than the PN. Because the Omniglot data are carefully curated, the instances have little noise and therefore offer little opportunity to leverage SPE's assessment of uncertainty. Consequently, we corrupted instances by masking out rectangular regions of the input, as proposed by Oh et al. (2019). (See Appendix E for details.) The second and third pairs of bars in Figure 6a correspond to the situations where the support and query instances are corrupted, respectively. SPE's advantage over PN increases significantly when the support instances are corrupted because SPE's confidence-weighted prototypes (Equation 5) discount noisier support examples. Although the SPE is still superior when only the query is corrupted, the benefit is small. We also compared PN and SPE using a higher-dimensional embedding, but with high-dimensional embeddings both methods are near ceiling on this data set, resulting in comparable performance between the two methods. (See Appendix D for additional results, broken down by condition.)
To emphasize: SPE outperforms the PN, arguably the leading few-shot learning method, especially when inputs are corrupted, at essentially the same computational cost for training. And by providing an estimate of the uncertainty associated with embedded instances, the SPE offers the possibility of detecting OOD samples and informing downstream systems that operate on the embedding.
4.3 N-digit MNIST
The N-digit MNIST data set was proposed to evaluate HIB (Oh et al., 2019); it is formed by horizontal concatenation of N MNIST digit images. The resulting images are 28 × 28N pixels. To compare with HIB, we study 2- and 3-digit MNIST, and use a network architecture identical to that in Oh et al. (2019). Oh et al. (2019) split the data into a training set (with 70% of the total classes), a seen test set, and an unseen test set. For 2-digit MNIST, the seen test set has the same 70% of classes as the training set and the unseen test set has the remaining 30%. For 3-digit MNIST, the training set has 700 classes, and the seen and unseen test sets each contain a sample of the seen or unseen classes, respectively. We use the same train and test data splits as Oh et al. (2019), but we further divide the training split to include a validation set for early stopping.
Figure 7 shows two views of the 2D embedding learned by the SPE on the 2-digit MNIST test set. Each number is a class label, where the two digits of the label identify the first and second MNIST digits of the class; the location of a label in the space corresponds to the mean of its prototype. In the left plot, each class is colored according to the first digit. The right plot is the same embedding, but each prototype is colored according to the second digit. The SPE learns a remarkably robust factorial representation in which the horizontal dimension represents the first digit of a class and the vertical dimension represents the second digit. A black bounding box indicates the unseen test classes, i.e., classes not presented during training. Impressively, the unseen test classes are embedded in exactly the positions where they belong, indicating that the SPE can discover relationships among classes that allow it to generalize to classes it has never seen during training. Furthermore, the embedding has captured inter-class similarity structure by placing visually similar digits close to one another. For example, on both the vertical and horizontal bands, nines (teal) and fours (purple) are adjacent, and fives (brown) and threes (red) are adjacent. The adjacency relationships vary a bit from one dimension of the mapping to the other; for example, sixes (pink) are adjacent to eights (yellow) and zeros (blue) in the vertical bands, but adjacent to fives (brown) and zeros in the horizontal bands. HIB is able to discover a similar structure along one dimension (Oh et al., 2019), but its second dimension is somewhat more entangled, suggesting that the SPE learns a more robust representation. Additionally, embeddings for the unseen classes are not presented for HIB. The ability to sensibly embed novel classes is essential for any model that will be used for open-set recognition or few-shot learning.
Figures 6b and 6c compare N-digit MNIST test accuracy on seen and unseen classes, respectively.[1] Each bar is the mean test accuracy across the Cartesian product of conditions specified by the number of MNIST digits in each image, N ∈ {2, 3}, and the dimensionality of the embedding, D ∈ {2, 3}. As in the Omniglot simulation, we varied whether support and query instances were clean or corrupted. The SPE outperforms HIB in all six comparisons; across the 24 individual conditions, SPE is worse than HIB on only 7. As in the Omniglot simulation, SPE shines brightest when support instances may be corrupted. (Appendix D.2 provides tabular results by condition, not only for HIB and SPE, but also for their deterministic counterparts, the contrastive loss and PN. Because the deterministic methods perform consistently worse than the stochastic methods, we omit them from the figure.)

[1] HIB results are from Oh et al. (2019). We thank the authors for providing us results on unseen classes, which were not included in their publication.
Whereas the SPE is a discriminative model with a specified classification procedure, Oh et al. (2019) had the freedom to design one. They use all available data and perform leave-one-out nearest-neighbor classification. To be consistent with our episodic test procedure, the SPE uses only a small number of support instances per class to form prototypes. It is particularly impressive that the SPE, based on a single stored prototype per class and a small fraction of the labeled data, can outperform a memory-based nonparametric method that is able to model arbitrary distributions in latent space.
5 Discussion and Conclusions
Our Stochastic Prototype Embedding (SPE) method outperforms a state-of-the-art deterministic method, the Prototypical Network (PN), on few-shot learning, particularly when support instances may be corrupted. Because the SPE reduces to the PN under certain restrictions, it seems unlikely to fare worse; and because it can handle uncertainty in both the query and the support set, it has ample opportunity to improve on the PN. Many extensions have been proposed to the PN (e.g., Fort, 2017; Allen et al., 2019). These extensions are mostly compatible with ours, and the methods could potentially be combined to attain even stronger few-shot learning performance under uncertainty.
SPE also significantly outperforms the only existing alternative stochastic method, the Hedged Instance Embedding (HIB), on the complete battery of large-set classification tasks used to evaluate HIB. Beyond its performance gains, SPE has no hand-tuned parameters, whereas HIB has a constant, $\beta$, that determines the characteristics of an information bottleneck (i.e., how much of the input entropy is retained in the embedding). Although one could simply set $\beta = 0$, doing so would encourage the net to perform like a softmax classifier and discard all information about inter-class similarity. Such similarities are essential in order to generalize to unseen classes (e.g., Figure 7).
We proposed and evaluated an intersection sampler to train the SPE, which makes the SPE as time and space efficient for training as the deterministic PN, and more efficient for training than HIB, which relies on drawing multiple samples per item. (Our evaluation method for SPE presently involves drawing a large number of samples from the naïve sampler, though this conservative decision was arbitrary and not tuned.)
An unanticipated virtue of SPE is its ability to obtain interpretable, disentangled representations (Figures 2, 5, 7). Because uncertainty is encoded in a diagonal covariance matrix, any classification ambiguity maps to uncertainty in the values of individual features of the embedding. Thus, class-discriminating feature dimensions must align with the principal axes of the embedding space. In contrast to traditional unsupervised disentangling methods, which aim to discover the underlying generative factors of a domain, the SPE obtains a supervised analog in which the underlying class-discriminative factors are represented explicitly. This representation facilitates generalization to novel, unseen classes and is therefore valuable for few-shot and lifelong learning paradigms.
References
Allen et al. (2019) Allen, K. R., Shelhamer, E., Shin, H., and Tenenbaum, J. B. (2019). Infinite Mixture Prototypes for Few-Shot Learning. arXiv preprint arXiv:1902.04552 [cs.LG].
Belhaj et al. (2018) Belhaj, M., Protopapas, P., and Pan, W. (2018). Deep Variational Transfer: Transfer Learning through Semi-supervised Deep Generative Models. arXiv preprint arXiv:1812.03123 [cs.LG].
 Chopra et al. (2005) Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a Similarity Metric Discriminatively, with Application to Face Verification. In IEEE Conference on Computer Vision and Pattern Recognition.
 Edwards and Storkey (2017) Edwards, H. and Storkey, A. (2017). Towards a Neural Statistician. In International Conference on Learning Representations.
Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135.
Fort (2017) Fort, S. (2017). Gaussian Prototypical Networks for Few-Shot Learning on Omniglot. In Second Workshop on Bayesian Deep Learning (NIPS 2017).
 Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality Reduction by Learning an Invariant Mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1735–1742.
 Karaletsos et al. (2016) Karaletsos, T., Belongie, S., and Rätsch, G. (2016). Bayesian Representation Learning with Oracle Constraints. In International Conference on Learning Representations.
 Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). AutoEncoding Variational Bayes. In International Conference on Learning Representations.
Koch et al. (2015) Koch, G., Zemel, R., and Salakhutdinov, R. (2015). Siamese Neural Networks for One-Shot Image Recognition. In ICML Deep Learning Workshop, volume 2.
Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-Level Concept Learning through Probabilistic Program Induction. Science, 350(6266):1332–1338.
Li et al. (2014) Li, W., Zhao, R., Xiao, T., and Wang, X. (2014). DeepReID: Deep Filter Pairing Neural Network for Person Re-identification. In IEEE Conference on Computer Vision and Pattern Recognition.
Liu et al. (2015) Liu, J., Deng, Y., Bai, T., Wei, Z., and Huang, C. (2015). Targeting Ultimate Accuracy: Face Recognition via Deep Embedding. arXiv preprint arXiv:1506.07310 [cs.CV].
Masson and Loftus (2003) Masson, M. E. J. and Loftus, G. R. (2003). Using Confidence Intervals for Graphically Based Data Interpretation. Canadian Journal of Experimental Psychology, 57:203–220.
 Mishra et al. (2018) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018). A Simple Neural Attentive MetaLearner. In International Conference on Learning Representations.
 Oh et al. (2019) Oh, S. J., Gallagher, A. C., Murphy, K. P., Schroff, F., Pan, J., and Roth, J. (2019). Modeling Uncertainty with Hedged Instance Embeddings. In International Conference on Learning Representations.
Ridgeway and Mozer (2018) Ridgeway, K. and Mozer, M. C. (2018). Learning Deep Disentangled Embeddings With the F-Statistic Loss. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 185–194. Curran Associates, Inc.
 Rippel et al. (2016) Rippel, O., Paluri, M., Dollar, P., and Bourdev, L. (2016). Metric Learning with Adaptive Density Discrimination. In International Conference on Learning Representations.
 Schroff et al. (2015) Schroff, F., Kalenichenko, D., and Philbin, J. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition.
Scott et al. (2018) Scott, T., Ridgeway, K., and Mozer, M. C. (2018). Adapted Deep Embeddings: A Synthesis of Methods for k-Shot Inductive Transfer Learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 76–85. Curran Associates, Inc.
Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems 30, pages 4077–4087.
 Song et al. (2017) Song, H. O., Jegelka, S., Rathod, V., and Murphy, K. (2017). Deep Metric Learning via Facility Location. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2206–2214.
 Song et al. (2016) Song, H. O., Xiang, Y., Jegelka, S., and Savarese, S. (2016). Deep Metric Learning via Lifted Structured Feature Embedding. In IEEE Conference on Computer Vision and Pattern Recognition.
Triantafillou et al. (2017) Triantafillou, E., Zemel, R., and Urtasun, R. (2017). Few-Shot Learning Through an Information Retrieval Lens. In Advances in Neural Information Processing Systems 30, pages 2255–2265.
Ustinova and Lempitsky (2016) Ustinova, E. and Lempitsky, V. (2016). Learning Deep Embeddings with Histogram Loss. In Advances in Neural Information Processing Systems 29, pages 4170–4178.
 Vilnis and McCallum (2018) Vilnis, L. and McCallum, A. (2018). Word Representations via Gaussian Embedding. In International Conference on Learning Representations.
Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29, pages 3630–3638.
 Wang et al. (2017) Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. (2017). Deep Metric Learning with Angular Loss. In IEEE International Conference on Computer Vision, pages 2612–2620.
Yi et al. (2014) Yi, D., Lei, Z., Liao, S., and Li, S. Z. (2014). Deep Metric Learning for Person Re-identification. In International Conference on Pattern Recognition, pages 34–39.
Zheng et al. (2015) Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. (2015). Scalable Person Re-identification: A Benchmark. In IEEE International Conference on Computer Vision.
Appendix A Network Architectures and Hyperparameters
A.1 Omniglot
For all Omniglot experiments, the network consisted of four convolutional blocks. The first three blocks had a convolutional layer with 64 filters, a 3×3 kernel, zero-padding of length 1, and a stride of 1, followed by a batch-normalization layer, a ReLU activation, and 2×2 max-pooling. The fourth and final block had a convolutional layer with 2D filters, a 3×3 kernel, zero-padding of length 1, and a stride of 1, followed by 2×2 max-pooling, where D represents the dimensionality of the embedding space. The flattened output of the network is a vector of length 2D, where the first D elements were considered the mean of the Gaussian distribution and the remaining D elements were the diagonal covariance entries. The weights were initialized using He initialization and the biases with a uniform distribution.
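For concreteness, a PyTorch sketch of this encoder follows; the filter count and kernel size are assumptions on our part, following the standard four-block encoder of Snell et al. (2017):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # 3x3 conv (stride 1, padding 1) -> batch norm -> ReLU -> 2x2 max-pool
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2))

class SPEEncoder(nn.Module):
    """Four convolutional blocks; the final block emits 2*D channels so that
    pooling a 28x28 input down to 1x1 yields a length-2*D vector, which is
    split into the mean and the softplus-transformed diagonal variance."""

    def __init__(self, D):
        super().__init__()
        self.D = D
        self.features = nn.Sequential(
            conv_block(1, 64),
            conv_block(64, 64),
            conv_block(64, 64),
            nn.Conv2d(64, 2 * D, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(2))

    def forward(self, x):                  # x: batch x 1 x 28 x 28
        out = self.features(x).flatten(1)  # batch x 2*D
        mu, raw = out[:, :self.D], out[:, self.D:]
        return mu, F.softplus(raw)         # mean and diagonal variance
```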
All Omniglot models were trained with an initial learning rate that was cut in half at a fixed epoch interval. Training was stopped early, using a patience parameter, when performance on the validation set no longer increased.
A.2 Synthetic data
The images in the synthetic data set are larger than those used for Omniglot. For orientation, we chose two class centers on the circular orientation dimension, each with a common standard deviation. For color, we manipulated the hue and kept value and saturation constant. Like orientation, hue is a circular quantity; we chose the two color class centers and standard deviation in the same way as for orientation. Additionally, we added noise to a minority (15%) of the images used to train the model. For these, we added Gaussian noise to the hue of each pixel inside the shape, with the standard deviation of the hue noise drawn uniformly from a fixed range. We also added noise to the leg lengths of the L shapes: the leg length was chosen uniformly between 10% and 98% of its original length. See Figure 3 for some examples.
The network followed an architecture similar to the one we used for Omniglot, except that we added two additional blocks of convolution, batch normalization, ReLU, and max-pooling because the images are larger. We used two instances per class to form prototypes. We used a learning rate of 0.0001, and training was stopped early, using a patience parameter, when performance on the validation set no longer increased.
A.3 N-Digit MNIST
For all N-digit MNIST experiments, we constructed an architecture which we believe to be identical to that used for the HIB MNIST experiments, based on code provided by the authors (Oh et al., 2019). The network consisted of two convolutional blocks followed by two fully-connected layers. The convolutional blocks each contained a convolutional layer followed by a ReLU activation and max-pooling; the second convolutional layer was identical to the first but with more filters. The output of the second convolutional block was flattened, passed through a fully-connected layer with a ReLU activation, and then through a final fully-connected layer with 2D units, where D represents the dimensionality of the embedding space. As in the Omniglot architectures, the first D entries of the output vector are treated as the mean and the remaining D elements as the diagonal covariance entries. The weights were initialized using Xavier uniform initialization and the biases were initialized to zero.
The PN and SPE are trained episodically, with all performance results in the main article measured as the mean over a large number of random test episodes. All N-digit MNIST models were trained with an initial learning rate that was cut in half at a fixed epoch interval, and training was stopped early, using a patience parameter, when performance on the validation set no longer increased. For 2-digit MNIST, each episode in training, validation, and seen-class testing contained all seen classes and a fixed number of support instances per class; for testing of unseen classes, each episode contained all unseen classes. For 3-digit MNIST, each episode contained a sample of classes, with a fixed number of support instances per class for training/validation and a different number for seen- and unseen-class testing.
Appendix B Simulation Details
For all SPE models, the within-class spread around the prototype (Equation 3) is isotropic with variance $\sigma_o^2$, where $\sigma_o$ is a trainable parameter. We initialize $\sigma_o$ based on $n_s$, the number of support examples per episode during training, and $D$, the dimensionality of the embedding. We chose this prescription for two reasons: (1) as the number of support examples increases, the variance of the prototype distribution approaches zero, so scaling linearly with $n_s$ tends to provide a stronger training signal early on, and (2) the amount of noise in the projection of an embedding should scale with the dimensionality of the embedding space so as to maintain unit volume. All models used this prescription.
The variance of each dimension $d$ of an embedding, $\sigma_d^2$, is guaranteed to be nonnegative by applying a softplus transfer function to the corresponding network output.
Whether trained with the naïve or intersection sampler, we evaluate model performance using the naïve sampler with a large number of samples. This approach ensures that we are comparing the quality of models based only on the method by which they were trained.
Appendix C SPE Variants
We assumed diagonal covariance matrices throughout this work. Switching to a full covariance matrix would require matrix inversion, which is ordinarily infeasible; but because one purpose of deep embeddings is visualization, there may be interesting cases involving 2D embeddings where the cost of inversion is trivial. Moreover, using a diagonal covariance matrix causes class-discriminating features to be aligned with the axes of the latent space, as we argued in the main article, and this alignment is a virtue for interpretation.
Appendix D Tabular Results
D.1 Omniglot
Clean Support, Clean Query

        1-shot, 5-class   5-shot, 5-class   1-shot, 20-class   5-shot, 20-class   Mean
PN           75.7              82.6               45.0               55.9         64.8
SPE          76.9              82.3               49.7               55.3         66.1

Corrupt Support, Clean Query

        1-shot, 5-class   5-shot, 5-class   1-shot, 20-class   5-shot, 20-class   Mean
PN           50.0              65.9               23.6               31.7         42.8
SPE          50.7              73.9               25.6               41.6         48.0

Clean Support, Corrupt Query

        1-shot, 5-class   5-shot, 5-class   1-shot, 20-class   5-shot, 20-class   Mean
PN           48.9              52.3               21.7               25.6         37.1
SPE          47.8              52.3               22.8               26.8         37.4
D.2 N-Digit MNIST
Clean Support, Clean Query (seen test classes)

              N=2, D=2   N=2, D=3   N=3, D=2   N=3, D=3   Mean
Contrastive     88.2       95.0       65.8       87.3     84.1
HIB             87.9       95.2       65.0       87.3     83.9
PN              91.1       95.0       65.8       90.6     85.6
SPE             93.0       94.2       80.2       89.0     89.1

Clean Support, Clean Query (unseen test classes)

              N=2, D=2   N=2, D=3   N=3, D=2   N=3, D=3   Mean
Contrastive     85.5       84.8       59.0       85.5     78.7
HIB             87.3       91.0       64.4       88.2     82.7
PN              82.0       89.5       64.3       89.1     81.2
SPE             90.0       89.3       80.2       88.2     86.9

Corrupt Support, Clean Query (seen test classes)

              N=2, D=2   N=2, D=3   N=3, D=2   N=3, D=3   Mean
Contrastive     76.2       92.2       49.5       77.6     73.9
HIB             81.6       94.3       54.0       81.2     77.8
PN              72.7       93.3       44.6       82.7     73.3
SPE             92.4       93.8       76.7       87.8     87.7

Corrupt Support, Clean Query (unseen test classes)

              N=2, D=2   N=2, D=3   N=3, D=2   N=3, D=3   Mean
Contrastive     76.5       73.3       42.6       73.2     66.4
HIB             80.8       86.7       53.9       81.2     75.7
PN              70.9       86.3       42.9       79.6     69.9
SPE             88.8       86.3       75.4       86.3     84.2

Clean Support, Corrupt Query (seen test classes)

              N=2, D=2   N=2, D=3   N=3, D=2   N=3, D=3   Mean
Contrastive     43.5       51.6       29.3       44.7     42.3
HIB             49.9       57.8       31.8       49.9     47.4
PN              53.1       61.1       33.8       56.4     51.1
SPE             53.7       58.2       40.2       48.1     50.1

Clean Support, Corrupt Query (unseen test classes)

              N=2, D=2   N=2, D=3   N=3, D=2   N=3, D=3   Mean
Contrastive     46.3       44.8       26.2       42.0     39.8
HIB             53.5       57.0       32.1       50.2     48.2
PN              51.1       57.9       33.0       54.8     49.2
SPE             56.3       56.5       39.3       46.6     49.7
Appendix E Corruption Procedure
The algorithm for applying corruption was identical to the scheme used in Oh et al. (2019). A random rectangular occlusion of black pixels was determined by first sampling a patch width, w, and a patch height, h, from a uniform distribution, and then sampling the top-left corner coordinates uniformly over the valid positions. This resulted in an occlusion of area w × h. Note that if w = 0 or h = 0, the image was left unoccluded. Figure 8 shows examples of occluded 2-digit images.
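A NumPy sketch of the occlusion procedure follows; the sampling bounds max_w and max_h are hypothetical stand-ins for the values used by Oh et al. (2019), which are not reproduced here:

```python
import numpy as np

def occlude(image, rng, max_w, max_h):
    """Black out a random rectangle; max_w and max_h are our assumed
    upper bounds on the patch width and height."""
    H, W = image.shape[:2]
    w = int(rng.integers(0, max_w + 1))     # patch width
    h = int(rng.integers(0, max_h + 1))     # patch height
    if w == 0 or h == 0:
        return image                        # left unoccluded
    x = int(rng.integers(0, W - w + 1))     # top-left corner
    y = int(rng.integers(0, H - h + 1))
    out = image.copy()
    out[y:y + h, x:x + w] = 0.0             # occlusion of area w * h
    return out
```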
For Omniglot, we trained/validated on corrupted imagery only if the test set contained a corrupted support or corrupted query set. When testing on clean support and clean query sets, the training and validation sets were left unoccluded. When testing on corrupted imagery, the training and validation sets corrupted each character independently with a fixed probability.
The training and validation sets for N-digit MNIST corrupted each digit of each image independently with a fixed probability, regardless of test imagery. This matched Oh et al. (2019).
During testing on both data sets, we considered both clean and corrupt support sets, as well as clean and corrupt query sets. A clean set was one in which all digits/characters were unoccluded. A corrupt set occluded each digit/character in each image according to the procedure described above.