Noise Contrastive Meta-Learning for Conditional Density Estimation using Kernel Mean Embeddings
Abstract
Current meta-learning approaches focus on learning functional representations of relationships between variables, i.e. on estimating conditional expectations in regression. In many applications, however, we are faced with conditional distributions which cannot be meaningfully summarized using expectation only (due to e.g. multimodality). Hence, we consider the problem of conditional density estimation in the meta-learning setting. We introduce a novel technique for meta-learning which combines neural representation and noise-contrastive estimation with the established literature of conditional mean embeddings into reproducing kernel Hilbert spaces. The method is validated on synthetic and real-world problems, demonstrating the utility of sharing learned representations across multiple conditional density estimation tasks.
Jean-François Ton, University of Oxford, ton@stats.ox.ac.uk
Lucian Chan, University of Oxford, leung.chan@stats.ox.ac.uk
Yee Whye Teh, University of Oxford, teh@stats.ox.ac.uk
Dino Sejdinovic, University of Oxford, dino.sejdinovic@stats.ox.ac.uk
Preprint. Under review.
1 Introduction
The estimation of conditional densities based on paired samples is a general and ubiquitous task when modelling relationships between random objects $x$ and $y$. While the problem of regression focuses on estimating the conditional expectations of responses $y$ given the features $x$, many scenarios require a more expressive representation of the relationship between $x$ and $y$. In particular, the distribution of $y$ given $x$ may exhibit multimodality or heteroscedasticity, thus requiring a flexible nonparametric model of the full conditional density. Estimating conditional densities becomes even more challenging when the sample size is small, especially when $x$ and $y$ are multivariate. Hence, we approach this problem from a meta-learning perspective, where we are faced with a number of conditional density estimation tasks, allowing us to transfer information between them via a shared learned representation of both the responses $y$ and the features $x$.
Our contribution can be viewed as a development which parallels that of neural processes [Garnelo et al., 2018b] and conditional neural processes [Garnelo et al., 2018a] in the context of regression and functional relationships, but is applicable to a much broader set of relationships between random objects, i.e. those where the response $y$ cannot be meaningfully represented using a single function of the features $x$. To that end, we will make use of the framework of conditional mean embeddings (CME) of distributions into reproducing kernel Hilbert spaces (RKHSs) [Song et al., 2013, Muandet et al., 2017].
Let us consider a simple illustrative example of one such relationship where there is no functional relationship between $x$ and $y$ in the data space. Assume that we are given a dataset sampled uniformly from an annulus centred at the origin. Any regression model would fail to capture the dependence between $x$ and $y$ because clearly $\mathbb{E}[y|x] = 0$. However, we can consider "augmenting" the representation of $y$ by using a feature map such as $\phi(y) = y^2$. The relationship between the two variables now becomes trivial since $\mathbb{E}[\phi(y)|x]$ is a simple function of $x$. In general, however, we will require a much more expressive feature mapping so that the CME, i.e. the conditional expectation of the feature map, captures all relevant information about the conditional density $p(y|x)$. In the RKHS literature, the feature maps that yield kernel mean embeddings that fully characterize probability distributions correspond to the notion of characteristic kernels [Sriperumbudur et al., 2011] and are infinite-dimensional. However, such kernels can often be too simplistic for specific tasks (e.g. a simple Gaussian kernel is known to be characteristic). Moreover, even though they give a unique representation of a probability distribution and can be a useful tool to represent conditional distributions (in particular, the CME can be used to estimate conditional expectations $\mathbb{E}[h(y)|x]$ for a broad class of functions $h$, namely functions in the RKHS determined by the feature map $\phi_y$), they do not yield (conditional) density estimates and it is not clear how to adapt them for such tasks. In this contribution, we propose to use neural networks to learn appropriate feature maps $\phi_x$ and $\phi_y$ by adopting the meta-learning framework, i.e. by considering a number of (similar) conditional density estimation tasks simultaneously.
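The annulus example can be checked numerically. The following is a minimal sketch, assuming an annulus centred at the origin with hypothetical radii 1 and 2 and the illustrative feature $y^2$ (neither value is taken from the paper): the conditional mean of $y$ is close to zero in every slice of $x$, while the conditional mean of $y^2$ clearly changes with $x$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_annulus(n, r_in=1.0, r_out=2.0):
    """Sample n points uniformly from the annulus r_in <= sqrt(x^2 + y^2) <= r_out.

    The radii are hypothetical values chosen for illustration only.
    """
    # Uniform over area means r^2 is uniform on [r_in^2, r_out^2].
    r = np.sqrt(rng.uniform(r_in**2, r_out**2, size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return r * np.cos(theta), r * np.sin(theta)

x, y = sample_annulus(200_000)

# Two narrow slices of the conditioning variable x.
near_zero = np.abs(x) < 0.1
near_edge = np.abs(np.abs(x) - 1.8) < 0.1

# Conditional mean of y (uninformative) and of y^2 (informative) in each slice.
mean_y = (y[near_zero].mean(), y[near_edge].mean())
mean_y2 = ((y[near_zero] ** 2).mean(), (y[near_edge] ** 2).mean())
print("E[y | x]   in the two slices:", mean_y)
print("E[y^2 | x] in the two slices:", mean_y2)
```

A single hand-picked feature suffices here only because the geometry is so simple; the learned feature maps discussed in the text play the same role when no such feature is known in advance.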
While CME estimation for fixed feature maps is well understood [Song et al., 2013, Muandet et al., 2017], we are here concerned with the challenge of linking our CME estimates back to the conditional density estimation (CDE) task, while simultaneously learning the feature maps defining the CME. To address this challenge, we propose to use a technique based on noise contrastive estimation (NCE) [Gutmann and Hyvärinen, 2012], treating CMEs as dataset features in a binary classifier discriminating between true and artificially generated samples of $(x, y)$ pairs.
The proposed method is validated on synthetic and real-world data demonstrating multimodal properties, namely on Ramachandran plots from computational chemistry [Gražulis et al., 2011], which represent relationships between dihedral angles in molecular structures, as well as on the NYC taxi data used in Trippe and Turner [2018] to model the conditional densities of dropoff locations given the taxi tips.
2 Background
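We first introduce some notation that we use throughout this paper. We denote the observed dataset by $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. We also define the learned RKHS/feature maps of inputs and responses as $\phi_x : \mathcal{X} \to \mathcal{H}_{\mathcal{X}}$ and $\phi_y : \mathcal{Y} \to \mathcal{H}_{\mathcal{Y}}$ respectively.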
2.1 Conditional Mean Embeddings (CME)
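Kernel mean embeddings of distributions provide a powerful framework for representing and manipulating probability distributions [Song et al., 2013, Muandet et al., 2017]. Formally, given sets $\mathcal{X}$ and $\mathcal{Y}$, with a distribution $P$ over the random variables $(X, Y)$ taking values in $\mathcal{X} \times \mathcal{Y}$, the conditional mean embedding (CME) of the conditional distribution of $Y \mid X = x$, assumed to have density $p(y|x)$, is defined as:
$$\mu_{Y|X=x} := \mathbb{E}_{Y|X=x}[\phi_y(Y)] = \int_{\mathcal{Y}} \phi_y(y)\, p(y|x)\, dy. \tag{1}$$
Hence, for each value $x$ of the conditioning variable, we obtain an element $\mu_{Y|X=x}$ of $\mathcal{H}_{\mathcal{Y}}$. Following Song et al. [2013], the conditional mean embedding can be associated with the operator $C_{Y|X} : \mathcal{H}_{\mathcal{X}} \to \mathcal{H}_{\mathcal{Y}}$, which satisfies
$$\mu_{Y|X=x} = C_{Y|X}\, \phi_x(x). \tag{2}$$
It can be shown [Song et al., 2013] that we can write $C_{Y|X} = C_{YX} C_{XX}^{-1}$ where $C_{YX} := \mathbb{E}[\phi_y(Y) \otimes \phi_x(X)]$ and $C_{XX} := \mathbb{E}[\phi_x(X) \otimes \phi_x(X)]$.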
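As a result, the finite sample estimator of $C_{Y|X}$ based on the dataset $\{(x_i, y_i)\}_{i=1}^n$ can be written as
$$\hat{C}_{Y|X} = \Phi_y (K + \lambda I)^{-1} \Phi_x^\top \tag{3}$$
where $\Phi_y := (\phi_y(y_1), \dots, \phi_y(y_n))$ and $\Phi_x := (\phi_x(x_1), \dots, \phi_x(x_n))$ are the feature matrices, $K := \Phi_x^\top \Phi_x$ is the kernel matrix with entries $K_{ij} = k_x(x_i, x_j) := \langle \phi_x(x_i), \phi_x(x_j) \rangle$, and $\lambda > 0$ is a regularization parameter. Hence $\hat{\mu}_{Y|X=x} = \hat{C}_{Y|X}\, \phi_x(x)$ simplifies to a weighted sum of the feature maps of the observed points $y_i$:
$$\hat{\mu}_{Y|X=x} = \sum_{i=1}^n \beta_i(x)\, \phi_y(y_i) = \Phi_y\, \beta(x), \tag{4}$$
$$\beta(x) = (\beta_1(x), \dots, \beta_n(x))^\top = (K + \lambda I)^{-1} K_{:x}, \tag{5}$$
where $K_{:x} := (k_x(x_1, x), \dots, k_x(x_n, x))^\top$. In fact, when using finite-dimensional feature maps, the conditional mean embedding operator is simply a solution to a vector-valued ridge regression problem (regressing $\phi_y(y)$ to $\phi_x(x)$), which allows computation scaling linearly in the number of observations. Namely, the Woodbury matrix identity allows us to have computations of either order $O(n^3)$ or $O(d^3 + d^2 n)$, where $d$ is the dimension of the feature map.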
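The equivalence between the two ways of computing the estimator can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: features are stored as columns, the function names are made up for this example, and the regularizer is assumed to enter as $\lambda I$. The dual form solves an $n \times n$ system, the primal (ridge regression) form a $d \times d$ system.

```python
import numpy as np

def cme_dual(Phi_x, Phi_y, phi_x_new, lam):
    """Dual form: Phi_y (K + lam I)^{-1} Phi_x^T phi_x(x), with an n x n solve."""
    n = Phi_x.shape[1]
    K = Phi_x.T @ Phi_x                       # (n, n) kernel matrix
    beta = np.linalg.solve(K + lam * np.eye(n), Phi_x.T @ phi_x_new)
    return Phi_y @ beta                       # weighted sum of phi_y(y_i)

def cme_primal(Phi_x, Phi_y, phi_x_new, lam):
    """Primal form: Phi_y Phi_x^T (Phi_x Phi_x^T + lam I)^{-1} phi_x(x), with a d x d solve."""
    d = Phi_x.shape[0]
    A = Phi_x @ Phi_x.T + lam * np.eye(d)     # (d, d) regularized covariance
    return Phi_y @ (Phi_x.T @ np.linalg.solve(A, phi_x_new))

rng = np.random.default_rng(0)
Phi_x = rng.normal(size=(5, 40))   # d_x = 5 features, n = 40 observations
Phi_y = rng.normal(size=(3, 40))   # d_y = 3 features
phi_new = rng.normal(size=5)       # phi_x evaluated at a new input

mu_dual = cme_dual(Phi_x, Phi_y, phi_new, lam=0.1)
mu_primal = cme_primal(Phi_x, Phi_y, phi_new, lam=0.1)
print(np.allclose(mu_dual, mu_primal))
```

When the feature dimension is much smaller than the number of observations, the primal form is the cheaper of the two.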
2.2 Noise Contrastive Estimation (NCE) of Unnormalized Statistical Models
The seminal work on noise contrastive estimation by Gutmann and Hyvärinen [2012] allows converting density estimation into binary classification, via learning to discriminate between noisy artificial data and real data. More concretely, assume that the true underlying density of the data is $p(y)$ and the density of the fake data is $p_f(y)$. Following Gutmann and Hyvärinen [2012] we set up the experiments such that we see $\kappa$ times more fake examples than real ones, which are all fed together with their labels (True/Fake) into the classifier. Hence, the data arises from the mixture $\frac{1}{1+\kappa}\, p(y) + \frac{\kappa}{1+\kappa}\, p_f(y)$ and the probability that any given $y$ comes from the true distribution is
$$P(\text{True} \mid y) = \frac{p(y)}{p(y) + \kappa\, p_f(y)}. \tag{6}$$
Since our goal is to learn the true density $p(y)$, we can construct the probabilistic classifier where we model $P(\text{True} \mid y)$ as $P_\theta(\text{True} \mid y) = \sigma\big(\log p_\theta(y) - \log(\kappa\, p_f(y))\big)$, where $\sigma$ is the logistic function and $\theta$ are the parameters of the classifier, resulting in the corresponding density model $p_\theta(y)$. Gutmann and Hyvärinen [2012] show empirically that one can model the unnormalized density (say, $p^0_\alpha$) and the corresponding normalizing constant separately, by writing $\log p_\theta(y) = \log p^0_\alpha(y) + c$ with $\theta = (\alpha, c)$, to obtain a normalized density. In the next section, we will adapt these ideas to the context of conditional density estimation in the meta-learning setting. In particular, we will build classifiers that use conditional mean embeddings to model the conditional density $p(y|x)$.
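The link between the classifier and the density can be verified numerically. In the sketch below the two Gaussian densities and the value of the noise ratio are arbitrary stand-ins chosen for illustration. It checks that the posterior probability of the "True" label is a logistic function of the log-density ratio, and that the density can be recovered from the Bayes-optimal classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss_pdf(y, mean, std):
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

kappa = 5.0                              # fake-to-true ratio (illustrative value)
y = np.linspace(-3.0, 3.0, 7)
p_true = gauss_pdf(y, 0.0, 1.0)          # stand-in for the data density
p_fake = gauss_pdf(y, 0.0, 2.0)          # stand-in for the noise density

# Posterior probability of the "True" label, written in two equivalent ways.
post_ratio = p_true / (p_true + kappa * p_fake)
post_logit = sigmoid(np.log(p_true) - np.log(kappa * p_fake))

# Inverting the Bayes-optimal classifier recovers the data density.
recovered = kappa * p_fake * post_ratio / (1.0 - post_ratio)
print(np.allclose(post_ratio, post_logit), np.allclose(recovered, p_true))
```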
3 Methodology
3.1 Conditional Mean Embeddings for Noise Contrastive Estimation
As described above, the key ingredient of noise contrastive estimation is a classifier which can discriminate between samples from the true density, in our case the conditional density $p(y|x)$, and those from the fake density $p_f(y)$. For a given $x$, assuming that the classifier observes samples from the mixture $\frac{1}{1+\kappa}\, p(y|x) + \frac{\kappa}{1+\kappa}\, p_f(y)$, the probability that $y$ arises from the true conditional distribution as opposed to the fake density is given by:
$$P(\text{True} \mid y, x) = \frac{p(y|x)}{p(y|x) + \kappa\, p_f(y)}. \tag{7}$$
Assuming for the moment that the learned probabilistic classifier attains Bayes optimality, we can deduce the pointwise evaluations of the true conditional density directly from expression (7) as
$$p(y|x) = \kappa\, p_f(y)\, \frac{P(\text{True} \mid y, x)}{1 - P(\text{True} \mid y, x)}. \tag{8}$$
We note that this expression is already normalized. However, given that we only have approximations to the Bayes classifier, it will be useful, following Gutmann and Hyvärinen [2012], to model the normalizing constant separately.
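In particular, consider the density model given by
$$p_\theta(y|x) = p_f(y)\, \exp\big(s_\theta(x, y) + b_\theta(x)\big) \tag{9}$$
for some function $s_\theta$, which following terminology in Mnih and Teh [2012] we will refer to as the scoring function. Here, $b_\theta(x)$ arises from the normalizing constant for each conditional density $p_\theta(\cdot|x)$. Under this model, the probability that $y$ arises from the true conditional distribution is given by:
$$P_\theta(\text{True} \mid y, x) = \frac{p_\theta(y|x)}{p_\theta(y|x) + \kappa\, p_f(y)} = \frac{\exp\big(s_\theta(x, y) + b_\theta(x)\big)}{\exp\big(s_\theta(x, y) + b_\theta(x)\big) + \kappa} \tag{10}$$
$$= \sigma\big(s_\theta(x, y) + b_\theta(x) - \log \kappa\big), \tag{11}$$
where $\sigma$ is the logistic function. Eq. (11) gives us the form of the probabilistic classifier we will adopt; it remains to construct the scoring function $s_\theta$ appropriately and, in particular, to specify how it relates to the feature maps $\phi_x$ and $\phi_y$. While the contribution $b_\theta(x)$ is directly determined by the choice of $s_\theta$, computing it will be intractable for any given $x$. We will hence decouple the two, and model $b_\theta$ as a separate neural network with input $x$ and its own set of parameters to be learned (collated into the overall parameter set $\theta$).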
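As a small consistency check, the sketch below assumes that the density model is the fake density tilted by the exponentiated score plus offset, i.e. $p_f(y)\exp(s + b)$; this form is an assumption of the example, chosen because the text states that the density is read off using pointwise evaluations of the fake density. Under that assumption the classifier probability depends on the fake density only through the noise ratio.

```python
import numpy as np

def classifier_prob(s, b, kappa):
    """P(True | y, x) as a logistic function of score + offset - log(kappa)."""
    return 1.0 / (1.0 + np.exp(-(s + b - np.log(kappa))))

def model_density(s, b, p_fake):
    """Assumed density model: fake density tilted by exp(score + offset)."""
    return p_fake * np.exp(s + b)

rng = np.random.default_rng(0)
s = rng.normal(size=6)                    # arbitrary scores s(x, y)
b = rng.normal(size=6)                    # arbitrary offsets b(x)
p_fake = rng.uniform(0.1, 1.0, size=6)    # arbitrary fake-density values
kappa = 10.0                              # illustrative noise ratio

p_model = model_density(s, b, p_fake)
prob_from_density = p_model / (p_model + kappa * p_fake)
prob_from_logit = classifier_prob(s, b, kappa)
print(np.allclose(prob_from_density, prob_from_logit))
```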
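We will map $x$ and $y$ using feature maps $\phi_x$ and $\phi_y$. In order to facilitate learning of these feature maps, they will be parametrized using neural networks (with both sets of parameters collated into $\theta$). Hence, we use finite-dimensional feature maps here, but other choices are possible. Next, we compute the Conditional Mean Embedding Operator (CMEO) $\hat{C}_{Y|X}$ given in (3).
Given $\hat{C}_{Y|X}$, we can estimate the conditional mean embedding for any new $x^*$ using
$$\hat{\mu}_{Y|X=x^*} = \hat{C}_{Y|X}\, \phi_x(x^*). \tag{12}$$
Note that $\hat{\mu}_{Y|X=x^*} \in \mathcal{H}_{\mathcal{Y}}$. We can now compute $\langle \hat{\mu}_{Y|X=x^*}, \phi_y(y^*) \rangle_{\mathcal{H}_{\mathcal{Y}}}$ for any new $y^*$. This is an evaluation of the conditional mean embedding at any given new response. We expect this value to be high when $y^*$ is drawn from the true conditional distribution $p(\cdot|x^*)$ and low in cases where $y^*$ is drawn from the fake distribution and falls in a region where the true conditional density is low. This is readily seen from observing that the true CME evaluated at $y^*$ can be written as
$$\langle \mu_{Y|X=x^*}, \phi_y(y^*) \rangle_{\mathcal{H}_{\mathcal{Y}}} = \mathbb{E}\big[k_y(Y, y^*) \mid X = x^*\big] = \int k_y(y, y^*)\, p(y|x^*)\, dy, \tag{13}$$
where $k_y(y, y') := \langle \phi_y(y), \phi_y(y') \rangle_{\mathcal{H}_{\mathcal{Y}}}$. This suggests the following form of the scoring function:
$$s_\theta(x, y) = \langle \hat{\mu}_{Y|X=x}, \phi_y(y) \rangle_{\mathcal{H}_{\mathcal{Y}}}. \tag{14}$$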
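The score for many input/response pairs can be computed at once. The following is a minimal sketch with features stored as columns and random matrices standing in for the learned feature maps; the function name `score_matrix` is hypothetical. It evaluates the inner product between the estimated embedding at each new input and the feature map of each candidate response, and checks the result against an explicitly formed operator.

```python
import numpy as np

def score_matrix(Phi_x_ctx, Phi_y_ctx, Phi_x_new, Phi_y_new, lam):
    """S[i, j] = <mu_hat(x_i), phi_y(y_j)> for new inputs x_i and candidate responses y_j."""
    n = Phi_x_ctx.shape[1]
    K = Phi_x_ctx.T @ Phi_x_ctx                                          # (n, n)
    beta = np.linalg.solve(K + lam * np.eye(n), Phi_x_ctx.T @ Phi_x_new)  # (n, n_new)
    k_y = Phi_y_ctx.T @ Phi_y_new                                        # (n, m_new) response kernel
    return beta.T @ k_y                                                  # (n_new, m_new)

rng = np.random.default_rng(0)
Phi_x_ctx = rng.normal(size=(4, 30))   # context input features
Phi_y_ctx = rng.normal(size=(6, 30))   # context response features
Phi_x_new = rng.normal(size=(4, 5))    # features of 5 new inputs
Phi_y_new = rng.normal(size=(6, 7))    # features of 7 candidate responses

S = score_matrix(Phi_x_ctx, Phi_y_ctx, Phi_x_new, Phi_y_new, lam=0.5)

# Reference computation with the operator formed explicitly.
C_hat = Phi_y_ctx @ np.linalg.solve(
    Phi_x_ctx.T @ Phi_x_ctx + 0.5 * np.eye(30), Phi_x_ctx.T
)
S_ref = (C_hat @ Phi_x_new).T @ Phi_y_new
print(np.allclose(S, S_ref))
```

Writing the score as kernel-weighted sums avoids forming the operator, which matters when the response feature dimension is large.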
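Given a set $\{(x_i, y_i)\}_{i=1}^n$ of true examples as well as the fake responses $\{y^f_{ij}\}_{j=1}^{\kappa}$ associated to each input $x_i$, we can now train the classifier using model (11) by maximizing the conditional log-likelihood of the True/Fake labels,
$$\sum_{i=1}^n \Big[ \log P_\theta(\text{True} \mid y_i, x_i) + \sum_{j=1}^{\kappa} \log P_\theta(\text{Fake} \mid y^f_{ij}, x_i) \Big], \tag{15}$$
or, equivalently, by minimizing the logistic loss:
$$\sum_{i=1}^n \Big[ \log\big(1 + \exp(-s_\theta(x_i, y_i) - b_\theta(x_i) + \log \kappa)\big) + \sum_{j=1}^{\kappa} \log\big(1 + \exp(s_\theta(x_i, y^f_{ij}) + b_\theta(x_i) - \log \kappa)\big) \Big]. \tag{16}$$
After the classifier has been learned, conditional density estimates can simply be read off from (9). Note that we need to be able to evaluate the fake density pointwise. We will take a closer look at the choices of fake densities in Section 3.3.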
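The logistic loss can be written compactly with a numerically stable softplus. The sketch below assumes that the classifier logit is score plus offset minus the log noise ratio, with one row of fake scores per true example; the array shapes and the function name `nce_logistic_loss` are choices made for this illustration.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def nce_logistic_loss(s_true, s_fake, b):
    """Logistic NCE loss.

    s_true: (n,) scores of true responses
    s_fake: (n, kappa) scores of fake responses, one row per input
    b:      (n,) offsets b(x_i)
    """
    kappa = s_fake.shape[1]
    logit_true = s_true + b - np.log(kappa)
    logit_fake = s_fake + b[:, None] - np.log(kappa)
    # True examples should get large logits, fake examples small ones.
    return softplus(-logit_true).sum() + softplus(logit_fake).sum()

rng = np.random.default_rng(0)
s_true = rng.normal(size=8)
s_fake = rng.normal(size=(8, 10))
b = rng.normal(size=8)
loss = nce_logistic_loss(s_true, s_fake, b)

# Reference: negative log-likelihood of the True/Fake labels.
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
p_true = sig(s_true + b - np.log(10))
p_fake = sig(s_fake + b[:, None] - np.log(10))
loss_ref = -np.log(p_true).sum() - np.log(1.0 - p_fake).sum()
print(loss, loss_ref)
```

Using `np.logaddexp` keeps the loss finite even for very large positive or negative logits, where the naive expression would overflow.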
We note that using the above criterion may be of independent interest when learning feature maps for conditional mean embeddings, i.e. where the goal is not necessarily density estimation, but other uses of conditional mean embeddings discussed in Song et al. [2013]. Namely, even though estimation of the conditional mean embedding corresponds to regression in the feature space, it is inappropriate to use the squared error loss of the feature-mapped responses to learn the feature maps themselves, as the notion of distance in the loss changes as the feature maps change, so losses are not comparable across different feature maps. In fact, it would be optimal for the feature map to be constant as the squared error would then be zero, and we would not have learned anything useful about the relationship between $x$ and $y$.
3.2 Meta-Learning of Conditional Densities
We now describe how to train our developed model in the meta-learning setting. Let $\{\mathcal{T}_j\}_{j=1}^J$ be the set of conditional density estimation tasks with $\mathcal{T}_j$ corresponding to the dataset $\{(x_i^j, y_i^j)\}_{i=1}^{n_j}$, where $x$ and $y$ share the same domains across the tasks. We use an approach similar to that of the Neural Process (NP) [Garnelo et al., 2018b], where during training we define a context set and a target set. For example, for task $\mathcal{T}_j$ we use $m$ samples as context and the remaining $n_j - m$ as target. The conditional mean embedding operator in (3) will be estimated using the context set, whereas the conditional mean embeddings will be evaluated on the target set, as in (12).
Next, for each target example, we sample $\kappa$ fake samples from $p_f$ and represent them in $\mathcal{H}_{\mathcal{Y}}$ using the feature map $\phi_y$ so that (11) can be computed for each of these samples ($1$ true and $\kappa$ fakes). By also providing the labels (i.e. True/Fake), we proceed by training the classifier, i.e. learning the parameters $\theta$ of the neural networks $\phi_x$, $\phi_y$ and $b_\theta$ using the objective (16) jointly over all tasks. The resulting feature maps hence generalize across tasks and can be readily applied to a new, previously unseen dataset, where we are simply required to compute the scoring function using the conditional mean embedding operator estimated on this new dataset and insert it into (9).
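One training episode of the procedure just described can be sketched as a forward pass in NumPy. Everything specific in this sketch is an assumption made for illustration: the `tanh` linear maps stand in for the neural feature maps, the linear offset stands in for the offset network, the toy tasks and all sizes are invented, fake responses are drawn by perturbing pooled responses with Gaussian noise, and the classifier logit is taken to be score plus offset minus the log noise ratio. In practice the loss would be differentiated with an automatic differentiation library; only its evaluation is shown here.

```python
import numpy as np

def feature_map(v, W):
    """Stand-in for a neural feature map: maps n scalars to a (d, n) feature matrix."""
    return np.tanh(W @ np.atleast_2d(v))

def softplus(z):
    return np.logaddexp(0.0, z)

def episode_loss(x, y, n_context, kappa, params, lam, bandwidth, rng):
    """Logistic NCE loss for one task, using a random context/target split."""
    perm = rng.permutation(len(x))
    ctx, tgt = perm[:n_context], perm[n_context:]
    n_tgt = len(tgt)

    # Conditional mean embedding operator estimated on the context set only.
    Px_ctx = feature_map(x[ctx], params["Wx"])                    # (d_x, m)
    Py_ctx = feature_map(y[ctx], params["Wy"])                    # (d_y, m)
    K = Px_ctx.T @ Px_ctx
    C_hat = Py_ctx @ np.linalg.solve(K + lam * np.eye(n_context), Px_ctx.T)

    # Embeddings evaluated at the target inputs.
    mu = C_hat @ feature_map(x[tgt], params["Wx"])                # (d_y, n_tgt)

    # kappa fake responses per target: resample pooled responses, add Gaussian noise.
    y_fake = rng.choice(y, size=(n_tgt, kappa))
    y_fake = y_fake + bandwidth * rng.standard_normal((n_tgt, kappa))

    # Scores of true and fake responses.
    s_true = np.sum(mu * feature_map(y[tgt], params["Wy"]), axis=0)
    Py_fake = feature_map(y_fake.ravel(), params["Wy"]).reshape(-1, n_tgt, kappa)
    s_fake = np.einsum("dt,dtk->tk", mu, Py_fake)                 # (n_tgt, kappa)

    b = params["wb"] * x[tgt]                                     # stand-in offset b(x)
    logit_true = s_true + b - np.log(kappa)
    logit_fake = s_fake + b[:, None] - np.log(kappa)
    return (softplus(-logit_true).sum() + softplus(logit_fake).sum()) / n_tgt

rng = np.random.default_rng(0)
params = {"Wx": rng.normal(size=(8, 1)), "Wy": rng.normal(size=(8, 1)), "wb": 0.1}

# Hypothetical toy tasks sharing the same input and response domains.
tasks = []
for _ in range(3):
    y_task = rng.uniform(-3.0, 3.0, size=60)
    x_task = np.cos(y_task + rng.uniform(0.0, 1.0)) + 0.1 * rng.standard_normal(60)
    tasks.append((x_task, y_task))

# Joint objective: sum of episode losses over all tasks, with shared parameters.
total = sum(episode_loss(x_t, y_t, 20, 5, params, 0.1, 0.3, rng) for x_t, y_t in tasks)
print(total)
```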
3.3 Choice of the Fake Distribution in NCE
The choice of the fake distribution plays a key role in the learning process here, especially because we are interested in conditional densities. In particular, if the fake density $p_f(y)$ is different from the marginal density $p(y)$, then our model could learn to distinguish between the fake and true samples of $y$ simply by constructing a "good enough" model of the marginal density $p(y)$ on a given task while completely ignoring the dependence on $x$ (this can be achieved by making the feature maps of $x$ constant). This becomes obvious if, say, the supports of the fake and the true marginal distribution are disjoint, where clearly no information about $x$ is needed to build a classifier, i.e. the classification problem is "too easy". Thus, ideally we wish to draw fake samples from the true marginal $p(y)$ in a given task. While we could achieve this by drawing a $y$ paired to another $x$, i.e. from the empirical distribution of pooled $y$s in a given task, recall that we also require the existence of a fake density which can be computed pointwise and inserted into (9). Hence, we propose to use a kernel density estimate (KDE) of the $y$s as our fake density in any given task. In particular, the kernel density estimator of $p(y)$ is computed on all responses $y$ (context and target). In order to sample from this fake distribution, we simply draw from the empirical distribution of pooled $y$s and add Gaussian noise with standard deviation equal to the bandwidth of the KDE (assuming we are using a Gaussian KDE for simplicity here; other choices of kernel are of course possible with appropriate modification of the type of noise). As our experiments demonstrate, this choice ensures that the fake samples are sufficiently hard to distinguish from the true ones, requiring the model to learn meaningful feature maps which capture the dependence between $x$ and $y$ and are informative for the CDE task.
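Both operations required of the fake distribution, sampling and pointwise density evaluation, are straightforward for a Gaussian KDE. The sketch below covers one-dimensional responses; the function names, the bandwidth value and the pooled responses are placeholders chosen for this example.

```python
import numpy as np

def kde_sample(y_pool, bandwidth, n_samples, rng):
    """Draw from a Gaussian KDE: pick pooled responses at random and add Gaussian noise."""
    idx = rng.integers(0, len(y_pool), size=n_samples)
    return y_pool[idx] + bandwidth * rng.standard_normal(n_samples)

def kde_density(y_query, y_pool, bandwidth):
    """Evaluate the Gaussian KDE pointwise at each query response."""
    diffs = (y_query[:, None] - y_pool[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernel.mean(axis=1)

rng = np.random.default_rng(0)
y_pool = rng.normal(size=50)        # pooled context and target responses (placeholder data)
bandwidth = 0.3                     # placeholder bandwidth

fakes = kde_sample(y_pool, bandwidth, n_samples=1000, rng=rng)
p_fake_at_fakes = kde_density(fakes, y_pool, bandwidth)

# Sanity check: the density integrates to approximately one.
grid = np.linspace(-10.0, 10.0, 4001)
integral = kde_density(grid, y_pool, bandwidth).sum() * (grid[1] - grid[0])
print(integral)
```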
Finally, we note that while in principle it is possible to consider families of fake distributions which also depend on the conditioning variable $x$, we do not explore this direction here. Such an approach would require a nontrivial construction of a model of fake conditional densities that is easy to sample from, can be computed pointwise, and, according to the same rationale as above, shares the same marginal density with the true conditional model we are interested in.
4 Related Work
NCE for learning representations has been considered before and the closest work to our paper is Mnih and Teh [2012], which focuses on learning discrete distributions in the context of Natural Language Processing (NLP). They achieve impressive speedups over other word embeddings as they avoid having to compute the normalizing constant thanks to the NCE setup of the optimization. More recently, Van den Oord et al. [2018] also introduce an NCE method for representation learning; however, they focus on learning an expressive representation in the unsupervised setting, thereby optimizing a mutual information objective instead.
Another method that uses the idea of fake examples in order to learn an expressive feature map is that of Zhang et al. [2018], who train a GAN in order to use the resulting discriminator for few-shot classification.
In terms of using RKHSs in density models, several works, for example Dai et al. [2018] and Arbel and Gretton [2017], have considered training kernel exponential family models, where the main bottleneck is to compute the normalizing constant. Dai et al. [2018] exploit the flexibility of kernel exponential families to learn conditional densities and avoid the problem of computing normalizing constants by solving so-called nested Fenchel duals. Arbel and Gretton [2017] train kernel exponential family models using score matching criteria, which allows them to bypass normalizing constant computation. The method however requires computing and storing the first and second order derivatives of the kernel function for each dimension and each sample and as such requires $O(n^2 d^2)$ memory and $O(n^3 d^3)$ time, where $n$ is the number of data points and $d$ the dimension of the problem.
Sugiyama et al. [2010] propose a method of learning the conditional density by learning a ratio of the joint and the marginal. They model the conditional density as a linear combination of a set of basis functions. This method works well on reasonably complicated tasks, although the optimal choice of basis functions is still unclear.
In terms of few-shot learning, the only paper (to the best of our knowledge) that considers CDE is Dutordoir et al. [2018]. They propose to extend the inputs with additional latent variables and use a GP to project these extended vectors onto samples from the conditional density. Contrary to our method they do not learn a feature map for the output specifically but rather use a multi-output GP onto which they stack a probabilistic projection into the original output space.
5 Experiments
5.1 Experiments on Synthetic Data
We first validate our method on a synthetic dataset. In our experimental setup, we wish to measure how well our method can pick up multimodality and heteroscedasticity in the response variable and so we construct datasets with this in mind as follows: we first sample $y$ from a uniform distribution and then set $x$ to a noisy function of $y$ whose parameters vary between tasks. Note that in this case $x$ can be written as a simple function of $y$ with added noise, but not vice versa over the whole range, leading to the multimodality of $p(y|x)$. Note also that the marginal distribution of $y$ is known and hence using uniform fake samples is sufficient in this case. In Figure 1, we compare our method with a number of alternative conditional density estimation methods. These methods include KDE (KDE applied to the neighbourhood of $x$), DDE [Dai et al., 2018], KCEF [Arbel and Gretton, 2017], and LSCDE [Sugiyama et al., 2010].
During meta-learning, each task is split into context points and target points. We also fix $\kappa$ as suggested in [Gutmann and Hyvärinen, 2012]. At testing time, we evaluate the method on 100 new tasks with 50 context/training points each. We simply pass the new context points to our model, which can evaluate the density with a simple forward pass as in (8). The non-meta-learning baselines are trained on each of the 100 datasets separately. We have included additional experiments with 30 and 15 context points in the Appendix to illustrate the robustness of the methods with varying training data sizes, as well as additional information on the neural network architectures.
In the table below we report the mean log-likelihood over the 100 different datasets. The high variance in some methods stems from the varying difficulty of tasks. We also report the p-values of the one-sided signed Wilcoxon test, which confirms that the likelihood of our method, MetaCDE, is significantly higher than that of each method we compare against. For more experiments and further clarifications, cf. the Appendix, where the high variance is explained using a histogram of differences in likelihood w.r.t. MetaCDE.
| | MetaCDE | DDE | LSCDE | KCEF | KDE |
| Mean over 100 log-likelihoods | 197.36 ± 25.26 | 162.98 ± 68.67 | 44.95 ± 74.36 | 388.30 ± 699.65 | 116.31 ± 235.80 |
| P-value for Wilcoxon test | NA | 9.681e-06 | < 2.2e-16 | < 2.2e-16 | 1.92e-07 |
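For readers who want to reproduce this kind of comparison, a one-sided Wilcoxon signed-rank test on paired per-task log-likelihoods can be sketched as follows. This is a plain NumPy version using the large-sample normal approximation without tie or continuity corrections, so it is an approximation rather than the exact procedure behind the reported p-values; `scipy.stats.wilcoxon` with `alternative="greater"` is a library alternative. The data below are synthetic placeholders, not results from the paper.

```python
import math
import numpy as np

def wilcoxon_greater(a, b):
    """One-sided Wilcoxon signed-rank test of H1: a tends to exceed b (paired samples).

    Uses the normal approximation and assumes there are no ties among |a - b|.
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0.0]                                   # discard zero differences
    n = len(d)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1     # ranks of |d|, from 1 to n
    w_plus = ranks[d > 0].sum()                       # sum of ranks of positive differences
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))        # upper-tail normal probability

rng = np.random.default_rng(0)
baseline_ll = rng.normal(size=100)                            # placeholder per-task scores
method_ll = baseline_ll + 1.0 + 0.5 * rng.normal(size=100)    # placeholder, shifted upwards

p_value = wilcoxon_greater(method_ll, baseline_ll)
p_reverse = wilcoxon_greater(baseline_ll, method_ll)
print(p_value, p_reverse)
```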
5.2 Experiments on Ramachandran plots for molecules
Finding all energetically favourable conformations for flexible molecular structures in both bound and unbound states is one of the biggest challenges in computational chemistry [Hawkins, 2017], as the number of possibilities increases exponentially with the dimension. Knowledge about the distributions of dihedral angles in molecules (represented using Ramachandran plots [Mardia, 2013]) is used in different sampling schemes and it is currently limited by the library curated by chemists. Here, we attempt to apply MetaCDE in order to learn richer relationships between dihedral angles, which can lead to an improved performance in the molecule sampling scheme.
The data we have used in our experiments was extracted from the Crystallography Open Database [Gražulis et al., 2011]. The multimodality of the dataset arises from molecular symmetries such as reflection and rotational symmetry. Namely, when we rotate the molecule around the symmetry axis or reflect along the plane of symmetry, it results in a conformation indistinguishable from the original.
In our experiments we consider the cases where we have 80 data points per task during training and only 20 at testing time. In our meta-learning setup we take 20 context and 60 target points. We again fix $\kappa$ following Gutmann and Hyvärinen [2012] and evaluate our method using log-likelihood on 100 examples for pairs of dihedral angles which have not been seen during training. As in the previous experiment, we perform a one-sided signed Wilcoxon test to confirm that MetaCDE achieves a significantly higher held-out log-likelihood than other methods.
| | MetaCDE | DDE | LSCDE | KCEF | KDE |
| Mean over 100 log-likelihoods | 297.58 ± 67.63 | 315.49 ± 204.82 | 335.85 ± 192.09 | 596.95 ± 871.97 | 422.99 ± 346.46 |
| P-value for Wilcoxon test | NA | 4.932e-05 | 2.692e-05 | 1.397e-05 | 9.49e-07 |
Further clarifications and illustrations are given in the Appendix. Note that one could take into account that the data itself lies on a torus by simply preprocessing the angles $x$ and $y$ into $(\cos x, \sin x)$ and $(\cos y, \sin y)$, respectively. This is sensible and easily implemented in our method.
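The torus preprocessing amounts to embedding each angle on the unit circle. This is a minimal sketch, assuming angles are given in radians; the function name `embed_angle` is chosen for this example.

```python
import numpy as np

def embed_angle(theta):
    """Map angles in radians to points (cos theta, sin theta) on the unit circle."""
    theta = np.asarray(theta, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

angles = np.array([-np.pi, 0.0, np.pi])
embedded = embed_angle(angles)     # shape (3, 2)

# The two ends of the angular range map to the same point, as they should.
print(np.allclose(embedded[0], embedded[2]))
```

After this step, angles just below $\pi$ and just above $-\pi$ are close in the input space seen by the feature maps, which is not the case for the raw angles.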
5.3 Experiments on NYC taxi data
Lastly, we illustrate our algorithm on the NYC taxi dataset from January 2016, which contains over one million data points (data taken from https://www1.nyc.gov/site/tlc/about/tlctriprecorddata.page). We are interested in estimating the conditional densities $p(\text{dropoff location} \mid \text{tip amount})$. Each pickup location defines a task, and our goal is to model the dropoff density conditionally on the tip amount; the conditional density therefore changes from task to task.
At testing time, we give the model 200 datapoints of unseen pickup locations and model the conditional density based on those. In Figure 3, we illustrate one testing case and show how the density evolves as the tip amount increases. The pickup location in Brooklyn (red dot) is not seen by the method. In particular, we see that as the tip amount increases, the trips become more likely to end in Manhattan. We illustrate additional unseen pickup locations in the Appendix as well as additional information on these experiments.
6 Conclusions and Future Work
We introduced a novel method for conditional density estimation in a meta-learning setting. We applied our method to a variety of synthetic and real-world data, with strong performance on an application in computational chemistry and an illustrative example using NYC Taxi data. Owing to the meta-learning framework, experiments indicate that the developed method is able to capture the correct density structure even when presented with small sample sizes at testing time. Similarly to the neural process [Garnelo et al., 2018b], our method is able to construct a task embedding. In our case, however, the embedding of each task takes the form of a conditional mean embedding operator, computed with feature maps learned using noise contrastive estimation. Further study could involve other choices of fake distribution $p_f$, including those depending on the conditioning variable. An interesting avenue of applications would be in modelling conditional distributions in the reinforcement learning setting. In particular, Lyle et al. [2019] and Bellemare et al. [2017] have shown the benefits of using a distributional perspective on reinforcement learning as opposed to only modelling expectations of returns received by the agents.
Acknowledgements
We would like to thank Anthony Caterini, Qinyi Zhang, Emilien Dupont, David Rindt, Robert Hu, Leon Law, Jin Xu, Edwin Fong and Kaspar Martens for helpful discussions and feedback. JFT is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). YWT and DS are supported in part by Tencent AI Lab and DS is supported in part by the Alan Turing Institute (EP/N510129/1). YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.
References
 Arbel and Gretton [2017] Michael Arbel and Arthur Gretton. Kernel conditional exponential family. arXiv preprint arXiv:1711.05363, 2017.
 Bellemare et al. [2017] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 449–458. JMLR.org, 2017.
 Dai et al. [2018] Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. arXiv preprint arXiv:1811.02228, 2018.
 Dutordoir et al. [2018] Vincent Dutordoir, Hugh Salimbeni, James Hensman, and Marc Deisenroth. Gaussian process conditional density estimation. In Advances in Neural Information Processing Systems, pages 2385–2395, 2018.
 Garnelo et al. [2018a] Marta Garnelo, Dan Rosenbaum, Chris J Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J Rezende, and SM Eslami. Conditional neural processes. arXiv preprint arXiv:1807.01613, 2018a.
 Garnelo et al. [2018b] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018b.
 Gražulis et al. [2011] Saulius Gražulis, Adriana Daškevič, Andrius Merkys, Daniel Chateigner, Luca Lutterotti, Miguel Quiros, Nadezhda R Serebryanaya, Peter Moeck, Robert T Downs, and Armel Le Bail. Crystallography open database (cod): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic acids research, 40(D1):D420–D427, 2011.
 Gutmann and Hyvärinen [2012] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.
 Hawkins [2017] Paul C. D. Hawkins. Conformation Generation: The State of the Art. Journal of Chemical Information and Modeling, 57(8):1747–1756, 2017. doi: 10.1021/acs.jcim.7b00221. URL https://doi.org/10.1021/acs.jcim.7b00221. PMID: 28682617.
 Lyle et al. [2019] Clare Lyle, Pablo Samuel Castro, and Marc G Bellemare. A comparative analysis of expected and distributional reinforcement learning. arXiv preprint arXiv:1901.11084, 2019.
 Mardia [2013] Kanti V. Mardia. Statistical approaches to three key challenges in protein structural bioinformatics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(3):487–514, 2013.
 Mnih and Teh [2012] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.
 Muandet et al. [2017] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
 Song et al. [2013] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
 Sriperumbudur et al. [2011] Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res., 12:2389–2410, July 2011. ISSN 1532-4435.
 Sugiyama et al. [2010] Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. Leastsquares conditional density estimation. IEICE Transactions on Information and Systems, 93(3):583–594, 2010.
 Trippe and Turner [2018] Brian L Trippe and Richard E Turner. Conditional density estimation with bayesian normalising flows. arXiv preprint arXiv:1802.04908, 2018.
 Van den Oord et al. [2018] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 Zhang et al. [2018] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. Metagan: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems, pages 2365–2374, 2018.
Appendix A Synthetic dataset setup and further experiments
In this experiment we are given a variable number of context points at testing time, namely 50, 30 and 15. Each of the non-meta-learning models (DDE, KCEF, KDE, LSCDE) is trained on the new datasets. Our MetaCDE is trained with 50, 30 and 15 context points on the tasks respectively, along with target points. At testing time, we simply pass the data through our model without having to retrain on the new unseen dataset. Note that we again report the p-values of the one-sided signed Wilcoxon test, and we can see that as we decrease the number of context points, our method significantly outperforms the other methods.
A.1 Model specifications
For our MetaCDE we used a hidden layer Neural Network with activation functions and optimizer for all of our feature maps. We cross validate on held out dataset, over 32 and 64 hidden nodes per layer and for the regularization parameter. We fix the learning rate at . We also set .

KCEF: We used the CV function built into their GitHub repository

LSCDE: We CV for in and in

KDE: We CV over in and bandwidth in

DDE: We CV over the bandwidth of and
A.2 Using 50 context points
| | MetaCDE | DDE | LSCDE | KCEF | KDE |
| 100 log-likelihoods | 197.36 ± 25.26 | 162.98 ± 68.67 | 44.95 ± 74.36 | 388.30 ± 699.65 | 116.31 ± 235.80 |
| P-value for Wilcoxon test | NA | 9.681e-06 | < 2.2e-16 | < 2.2e-16 | 1.92e-07 |
A.3 Using 30 context points
| | MetaCDE | DDE | LSCDE | KCEF | KDE |
| 100 log-likelihoods | 114.92 ± 17.25 | 64.61 ± 54.33 | 23.02 ± 65.31 | 233.38 ± 528.99 | 29.64 ± 194.51 |
| P-value for Wilcoxon test | NA | 4.017e-14 | < 2.2e-16 | < 2.2e-16 | 1.389e-13 |
A.4 Using 15 context points

| | MetaCDE | DDE | LSCDE | KCEF | KDE |
|---|---|---|---|---|---|
| 100 log-likelihoods | 46.99 ± 12.24 | 0.58 ± 40.70 | 57.99 ± 59.13 | 142.19 ± 259.59 | 87.50 ± 224.13 |
| P-value for Wilcoxon test | NA | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 |
A.5 Plotting the histogram of the difference in log-likelihoods
Next, we illustrate why the variance of the log-likelihood estimates is so large. To this end, we plot the difference between the log-likelihood of MetaCDE and that of each of the other methods (DDE, LSCDE, KCEF, KDE).
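The sketch below shows one way such a histogram of per-task differences could be computed. The log-likelihood values are made up for illustration and are not results from the paper. They are chosen so that a single task with a very poor baseline fit produces a heavy tail, which is one way a large variance could arise. The `histogram` helper is a hypothetical name.

```python
def histogram(values, n_bins=10):
    """Return (edges, counts) for equal-width bins spanning min..max of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values agree
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical per-task log-likelihoods (illustrative values only).
meta_ll = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
baseline_ll = [10.2, 3.1, 9.9, -40.0, 8.7, 12.5]

# Positive differences mean MetaCDE scored higher on that task.
diffs = [m - b for m, b in zip(meta_ll, baseline_ll)]
edges, counts = histogram(diffs, n_bins=5)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left:8.2f}, {right:8.2f}) {'#' * c}")
```

In this toy example most differences fall in the first bin, while the single outlying task sits alone in the last bin and dominates the spread.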
Appendix B Further illustration of the Ramachandran plots
B.1 Additional information on the experimental setup
In this experiment, we look at Ramachandran plots for molecules. Each plot indicates the energetically stable regions of a pair of correlated torsions in the molecule. Specifically, we are interested in estimating the distributions of these correlated dihedral angles. In the experiment, we compute the conditional density for each correlated torsion, given 20 context points at test time. For meta-learning training we use 20 context points and 60 target points.
Note that the data was extracted from a crystallography database [Gražulis et al., 2011]. It is possible that some specific pairs of dihedral angles are rarely seen in the dataset. Hence, in some cases we may obtain a conditional density that places high probability on a region without any observations. This is reasonable, as the database covers only a small part of chemical space and some potential areas could be overlooked. Given that we assume that the support of our conditioning variable ranges from , we will inevitably also compute conditional distributions on areas where the configurations are not defined; the densities in those areas can be safely ignored, as a computational biologist would not have queried these configurations in the first place.
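The context/target split used during meta-learning training (20 context points and 60 target points per task) can be sketched as follows. This is an assumption about the mechanics rather than the authors' code: it supposes that the two sets are drawn at random without overlap from a task's paired samples, and the function name and the synthetic task are hypothetical.

```python
import random

def split_context_target(pairs, n_context, n_target, seed=0):
    """Randomly draw disjoint context and target sets from one task's (x, y) pairs."""
    if n_context + n_target > len(pairs):
        raise ValueError("task has too few points for the requested split")
    rng = random.Random(seed)
    chosen = rng.sample(pairs, n_context + n_target)  # sampling without replacement
    return chosen[:n_context], chosen[n_context:]

# Hypothetical task: 100 synthetic (x, y) pairs standing in for one molecule's data.
rng = random.Random(1)
task = [(rng.random(), rng.random()) for _ in range(100)]

context, target = split_context_target(task, n_context=20, n_target=60)
print(len(context), len(target))
```

At test time only the context set would be needed, since the model is evaluated without retraining.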
B.2 Model specifications
For our MetaCDE we used a 3-hidden-layer NN with activation functions for all of our feature maps. We cross-validate over 32 and 64 hidden nodes per layer and for the regularization parameter. We fix the learning rate at . We also set .

KCEF: We used the cross-validation function built into the authors' GitHub repository.

LSCDE: We CV for in and in

KDE: We CV over in and bandwidth in

DDE: We CV over bandwidth of and
B.3 Additional illustration on the Ramachandran plots
B.4 Plotting the histogram of the difference in log-likelihoods
Next, we illustrate why the variance of the log-likelihood estimates is so large. To this end, we plot the difference between the log-likelihood of MetaCDE and that of each of the other methods (DDE, LSCDE, KCEF, KDE).
B.5 Note on the results
We note that the non-meta-learning methods sometimes do better than our method, although not significantly so. After investigating the dataset in more detail, we see that these are cases where the data looks like the plot below. Our proposed method seems to be much more conservative on these datasets, having a higher variance, whereas the other methods are able to focus all their mass on those lines. Nevertheless, MetaCDE does recognize that the data follows a line. These lines, however, are less useful to scientists, as they are more interested in more complicated structures (see Figure 8).
Furthermore, our method does not always appear to capture the true trend given the limited amount of data. However, it seems able to capture some interesting patterns that would be useful for scientists to include in their models. Recently, there has been work on Ramachandran plots for molecules, but it relies on handcrafting the density maps. Our model would allow us to compute the density maps without prior knowledge.
Appendix C Further illustration of the NYC taxi dataset
C.1 Experimental Setup
We extracted the publicly available dataset from the website (data taken from: https://www1.nyc.gov/site/tlc/about/tlctriprecorddata.page). We first restricted ourselves to drop-off locations in from to in longitude and to in latitude. Next, we gave our meta-learning model 200 data points for context during training and 300 for target. At test time we are presented with 200 context points and are required to compute the conditional density given a tip. Again, we use a hidden layer NN with nodes and CV over and . We use the optimizer and fix the learning rate to . We also set .
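The restriction to a longitude/latitude rectangle described above might look like the following sketch. The bounds shown are placeholders chosen for illustration, not the values used in the experiment, and the record layout (`dropoff_lon`, `dropoff_lat`, `tip`) is a hypothetical simplification of the trip record data.

```python
def in_box(lon, lat, lon_range, lat_range):
    """True when the point lies inside the closed longitude/latitude rectangle."""
    return lon_range[0] <= lon <= lon_range[1] and lat_range[0] <= lat <= lat_range[1]

def filter_dropoffs(trips, lon_range, lat_range):
    """Keep trips whose drop-off location falls inside the rectangle."""
    return [
        t for t in trips
        if in_box(t["dropoff_lon"], t["dropoff_lat"], lon_range, lat_range)
    ]

# Placeholder bounds (assumed, not from the paper).
LON_RANGE = (-74.05, -73.90)
LAT_RANGE = (40.60, 40.90)

# Made-up records with hypothetical field names.
trips = [
    {"dropoff_lon": -73.98, "dropoff_lat": 40.75, "tip": 2.5},
    {"dropoff_lon": -73.50, "dropoff_lat": 40.75, "tip": 1.0},  # too far east
    {"dropoff_lon": -73.95, "dropoff_lat": 41.20, "tip": 0.0},  # too far north
]

kept = filter_dropoffs(trips, LON_RANGE, LAT_RANGE)
print(len(kept), "of", len(trips), "trips kept")
```

The filtered trips would then be divided into context and target sets of the sizes stated above.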
C.2 Note on the dataset
In the main text we have seen how the drop-off density changes as the tip amount increases. This shift in density reflects the data itself, as passengers are more likely to pay higher tips for longer journeys. Below we plot the drop-off locations for one specific pickup location, colored by the tip paid.