Variational Learning on Aggregate Outputs
with Gaussian Processes
Abstract
While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the finescale spatial modelling of malaria incidence, with over million observations.
Variational Learning on Aggregate Outputs
with Gaussian Processes
Ho Chung Leon Law^{†}^{†}thanks: Department of Statistics, Oxford, UK. <ho.law@stats.ox.ac.uk, dino.sejdinovic@stats.ox.ac.uk> University of Oxford Dino Sejdinovic^{1}^{1}footnotemark: 1 University of Oxford Ewan Cameron^{†}^{†}thanks: Big Data Institute, Oxford, UK. <dr.ewan.cameron@gmail.com, timcdlucas@gmail.com, katherine.battle@bdi.ox.ac.uk> University of Oxford Tim CD Lucas^{2}^{2}footnotemark: 2 University of Oxford Seth Flaxman^{†}^{†}thanks: Department of Mathematics and Data Science Institute, London, UK. <s.flaxman@imperial.ac.uk> Imperial College London Katherine Battle^{2}^{2}footnotemark: 2 University Of Oxford Kenji Fukumizu^{†}^{†}thanks: Tachikawa, Japan. <fukumizu@ism.ac.jp> Institute of Statistical Mathematics
noticebox[b]Preprint. \end@float
1 Introduction
A typical supervised learning setup assumes existence of a set of inputoutput examples from which a functional relationship or a conditional probabilistic model of outputs given inputs can be learned. A prototypical usecase is the situation where obtaining outputs for new, previously unseen, inputs is costly, i.e., labelling is expensive and requires human intervention, but measurements of inputs are cheap and automated. Similarly, in many applications, due to a much greater cost in acquiring labels, they are only available at a much coarser resolution than the level at which the inputs are available and at which we wish to make predictions. This is the problem of weakly supervised learning on aggregate outputs [14, 20], which has been studied in the literature in a variety of forms, with classification and regression notably being developed separately and without any unified treatment which can allow more flexible observation models. In this contribution, we consider a framework of observation models of aggregated outputs given bagged inputs, which reside in exponential families. While we develop a more general treatment, the main focus in the paper is on the Poisson likelihood for count data, which is motivated by the applications in spatial statistics. In particular, we consider the important problem of finescale mapping of diseases. High resolution maps of infectious disease risk can offer a powerful tool for developing National Strategic Plans, allowing accurate stratification of intervention types to areas of greatest impact [5]. In low resource settings these maps must be constructed through probabilistic models linking the limited observational data to a suite of spatial covariates (often from remotesensing images) describing social, economic, and environmental factors thought to influence exposure to the relevant infectious pathways. In this paper, we apply our method to the incidence of clinical malaria cases. Point incidence data of malaria is typically available at a high temporal frequency (weekly or monthly), but lacks spatial precision, being aggregated by administrative district or by health facility catchment. The challenge for risk modelling is to produce finescale predictions from these coarse incidence data, leveraging the remotesensing covariates and appropriate regularity conditions to ensure a wellbehaved problem.
Methodologically, the Poisson distribution is a popular choice for modelling count data. In the mapping setting, the intensity of the Poisson distribution is modelled as a function of spatial and other covariates. We use Gaussian processes (GPs) as a flexible model for the intensity. GPs are a widely used approach in spatial modelling but also one of the pillars of Bayesian machine learning, enabling predictive models which explicitly quantify their uncertainty.
Recently, we have seen many advances in variational GP posterior approximations, allowing them to couple with more complex observation likelihoods (e.g. binary or Poisson data [21, 17]) as well as a number of effective scalable GP approaches [24, 30, 8, 9], extending the applicability of GPs to dataset sizes previously deemed prohibitive.
Contribution Our contributions can be summarised as follows. A general framework is developed for aggregated observation models using exponential families and Gaussian processes. This is novel, as previous work on aggregation or bag models focuses on specific types of output models such as binary classification. Tractable and scalable variational inference methods are proposed for several instances of the aggregated observation models, making use of novel lower bounds on the model evidence. In experiments, it is demonstrated that the proposed methods can scale to dataset sizes of more than million observations. We thoroughly investigate an application of the developed methodology to disease mapping from coarse measurements, where the observation model is Poisson, giving encouraging results. Uncertainty quantification, which is explicit in our models, is essential for this application.
Related Work
The framework of learning from aggregate data was believed to have been first introduced in [20], which considers the two regimes of classification and regression. However, while the task of classification of individuals from aggregate data (also known as learning from label proportions) has been explored widely in the literature [23, 22, 13, 18, 35, 34, 14], there has been little literature on the analogous regression regime in the machine learning community. Perhaps the closest literature available is [13], who considers a general framework for learning from aggregate data, but also only considers the classification case for experiments. In this work, we will appropriately adjust the framework in [13] and take this to be our baseline. A related problem arises in the spatial statistics community under the name of ‘downscaling’, ‘finescale modelling’ or ‘spatial disaggregation’ [11, 10], in the analysis of disease mapping, agricultural data, and species distribution modelling, with a variety of proposed methodologies (cf. [33] and references therein), including kriging [6]. However, to the best of our knowledge, approaches making use of recent advances in scalable variational inference for GPs are not considered.
Another closely related topic is multiple instance learning (MIL), concerned with classification with maxaggregation over labels in a bag, i.e. a bag is positively labeled if at least one individual is positive, and it is otherwise negatively labelled. While the task in MIL is typically to predict labels of new unobserved bags, [7] demonstrates that individual labels of a GP classifier can also be inferred in MIL setting with variational inference. Our work parallels that approach, considering bag observation models in exponential families and deriving new approximation bounds for some common generalized linear models. In deriving these bounds, we have taken an approach similar to [17], who considers the problem of Gaussian processmodulated Poisson process estimation using variational inference. However, our problem is made more complicated by the aggregation of labels.
Other related research topics include distribution regression and set regression, as in [28, 15, 16] and [36]. In these regression problems, while the input data for learning is the same as the current setup, the goal is to learn a function at the bag level, rather than the individual level, the application of these methods in our setting, naively treating single individuals as “distributions”, may lead to suboptimal performance. An overview of some other approaches for classification using bags of instances is given in [4].
2 Bag observation model: aggregation in mean parameters
Suppose we have a statistical model for output , with parameter given by a function of input , i.e., . Although one can formulate in an arbitrary fashion, practitioners often only focus on interpretable simple models, hence we restrict our attention to arising from exponential families. We assume that is the mean parameter of the exponential family.
Assume that we have a fixed set of points such that is a bag of points with individuals, and we wish to estimate the regression value for each individual. However, instead of the typical setup where we have a paired sample of individuals and their outputs to use as a training set, we observe only aggregate outputs for each of the bags. Hence, our training data is of the form
(1) 
and the goal is to estimate parameters corresponding to individuals. To relate the aggregate and the bag , we use the following bag observation model:
(2) 
where is an optional fixed nonnegative weight used to adjust the scales (see Section 3 for an example). Note that the aggregation in the bag observation model is on the mean parameters for individuals, not necessarily on the individual responses . This implies that each individual contributes to the mean bag response and that the observation model for bags belongs to the same parametric form as the one for individuals. For tractable and scalable estimation, we will use variational methods, as the aggregated observation model leads to intractable posteriors. We consider the Poisson, normal, and exponential distributions, but devote a special focus to the Poisson model in this paper, and refer readers to Appendix A for other cases and experimental results for the Normal model in Appendix H.2.
It is also worth noting that we place no restrictions on the collection of the individuals, with the bagging process possibly dependent on covariates or any unseen factors. The bags can also be of different sizes, with potentially the same individuals appearing in multiple bags. After we obtain our individual model , we can use it for prediction of inbag individuals, as well as outofbag individuals.
3 Poisson bag model: Modelling aggregate counts
The Poisson distribution is considered for count observations, and this paper discusses the Poisson regression with intensity multiplied by a ‘population’ , which is a constant assumed to be known for each individual (or ‘subbag’) in the bag. The population for a bag is given by . An observed bag count is assumed to follow
Note that, by introducing unobserved counts , the bag observation has the same distribution as since the Poisson distribution is closed under convolutions. If a bag and its individuals correspond to an area and its partition in geostatistical applications, as in the malaria example in Section 4.2, the population in the above bag model can be regarded as the population of an area or a subarea. With this formulation, the goal is to estimate the basic intensity function from the aggregated observations (1). Assuming independence given , the negative loglikelihood (NLL) across bags is
(3) 
where denotes an equality up to additive constant. During training, this term will pass information from the bag level observations to the individual basic intensity . It is noted that once we have trained an appropriate model for , we will be able to make individual level predictions, and also bag level predictions if desired. We will consider baselines with (3) using penalized likelihoods inspired by manifold regularization in semisupervised learning [2] – presented in Appendix B. In the next section, we propose a model for based on GPs.
3.1 VBAggPoisson: Gaussian processes for aggregate counts
Suppose now we model as a Gaussian process (GP), then we have:
(4) 
where and are some appropriate mean function and covariance kernel . (For implementation, we consider a constant mean function.) Since the intensity is always nonnegative, in all models, we will need to use a transformation , where is a nonnegative valued function. We will consider cases and . A discussion of various choices of this link function in the context of Poisson intensities modulated by GPs is given in [17]. Modelling with a GP allows us to propagate uncertainty on the predictions to , which is especially important in this weakly supervised problem setting, where we do not directly observe any individual output . Since the total number of individuals in our target application of disease mapping is typically in the millions (see Section 4.2), we will approximate the posterior over using variational inference, with details found in Appendix E.
For scalability of the GP method, as in previous literature [7, 17], we use a set of inducing points , which are given by the function evaluations of the Gaussian process at landmark points ; i.e., . The distribution is thus given by
(5) 
The joint likelihood is given by:
(6) 
(7) 
where , with , denoting their respective evaluations of the mean function and being parameters of the mean and kernel functions of the GP. Proceeding similarly to [17], which discusses (nonbag) Poisson regression with GP, we obtain a lower bound of the marginal loglikelihood :
(8) 
where is a variational distribution to be optimized. The general solution to the maximization over of the evidence lower bound above is given by the posterior of the inducing points , which is intractable. We introduce a restriction to the class of to approximate the posterior . Suppose that the variational distribution is Gaussian, . We then need to maximize the lower bound over the variational parameters and .
The resulting gives an approximation to the posterior which also leads to a Gaussian approximation to the posterior , which we finally then transform through to obtain the desired approximate posterior on each (which is either lognormal or noncentral depending on the form of ). The approximate posterior on will then allow us to make predictions for individuals while, crucially, taking into account the uncertainties in (note that even the posterior predictive mean of will depend on the predictive variance in due to the nonlinearity ). We also want to emphasis the use of inducing variables is essential for scalability in our model: we cannot directly obtain approximations to the posterior of for all individuals, since this is often large in our problem setting (Section 4.2).
As the and are both Gaussian, the last term (KLdivergence) of (8) can be computed explicitly with exact form found in Appendix E.3. To consider the first two terms, let be the marginal normal distribution of , where follows the variational posterior . The distribution of is then , using (7) :
(9) 
In the first term of (8), each summand is of the form
(10) 
in which the second term is tractable for both of and . The integral of the first term, however with Gaussian is not tractable. To solve this, we take different approaches for and ; for the former, approximation by Taylor expansion is applied, while for the latter, further lower bound is taken.
First consider the case , and rewrite the first term of (8) as:
with and . By a Taylor series approximation for (similar to [29]) around , we obtain
(11) 
with details are in Appendix E.4. An alternative approach which we use for the case is to take a further lower bound, which is applicable to a general class of (we provide further details for the analogous approach for in Appendix E.2). We use the following Lemma (proof found in Appendix E.1):
Lemma 1.
Let be a random vector with probability density with marginal densities , and let , . Then, for any nonnegative valued function ,
Hence we obtain that
(12) 
Using the above two approximation schemes, our objective (up to constant terms) can be formulated as: 1)
(13) 
2)
(14) 
Given these objectives, we can now optimise these lower bounds with respect to variational parameters , parameters of the mean and kernel functions, using stochastic gradient descent (SGD) on bags. Additionally, we might also learn , locations for the landmark points. In this form, we can also see that the bound for has the added computational advantage of not requiring the full computation of the matrix , but only its diagonals, while for computation of involves full , which may be problematic for extremely large bag sizes.
4 Experiments
We will now demonstrate various approaches: Variational Bayes with Gaussian Process (VBAgg), a MAP estimator of Bayesian Poisson regression with explicit feature maps (Nyström) and a neural network (NN) – the latter two employing manifold regularisation with RBF kernel (unless stated otherwise). For additional baselines, we consider a constant within bag model (constant), i.e. and also consider creating ‘individual’ covariates by aggregation of the covariates within a bag (bagpixel). For details of all these approaches, see Appendix B. We also denote and as Exp and Sq respectively.
We implement our models in TensorFlow^{1}^{1}1Code will be available for use. and use SGD with Adam [12] to optimise their respective objectives, and we split the dataset into parts, namely train, earlystop, validation and test set. Here the earlystop set is used for early stopping for the Nyström, NN and bagpixel models, while the VBAgg approach ignores this partition as it optimises the lower bound to the marginal likelihood. The validation set is used for parameter tuning of any regularisation scaling, as well as learning rate, layer size and multiple initialisations. Throughout, VBAgg and Nyström have access to the same set of landmarks for fair comparison. It is also important to highlight that we perform early stopping and tuning based on bag level performance on NLL only, as this is the only information available to us.
For the VBAgg model, there are two approaches to tuning, one approach is to choose parameters based on NLL on the validation bag sets, another approach is to select all parameters based on the training objective , the lower bound to the marginal likelihood. We denote the latter approach VBAggObj and report its toy experimental results in Appendix H.1.1 for presentation purposes. In general, the results are relatively insensitive to this choice, especially when . To make predictions, we use the mean of our approximated posterior (provided by a lognormal and noncentral distribution for Exp and Sq). As an additional evaluation, we report mean square error (MSE) and bag performance results in Appendix H.
4.1 Poisson Model: Swiss Roll
We first demonstrate our method on the swiss roll dataset^{2}^{2}2The swiss roll manifold function (for sampling) can be found on the Python scikitlearn package., illustrated in Figure 1 (left). To make this an aggregate learning problem, we first construct bags with sizes drawn from a negative binomial distribution , where and represents the respective mean and standard deviation of .
We then randomly select points from the swiss roll manifold to be the locations, giving us a set of colored locations in . Ordering these random locations by their axis coordinate, we group them, filling up each bag in turn as we move along the axis. The aim of this is to simulate that in real life the partitioning of locations into bags is often not independent of covariates. Taking the colour of each location as the underlying rate at that location, we simulate , and take our observed outputs to be , where . Our goal is then to predict the underlying individual rate parameter , given only baglevel observations . To make this problem even more challenging, we embed the data manifold into by rotating it with a random orthogonal matrix. For the choice of for VBAgg and Nyström, we use the RBF kernel, with the bandwidth parameter learnt. For landmark locations, we use the Kmeans++ algorithm, so that landmark points lie evenly across the data manifold.
Varying number of Bags:
To see the effect of increasing number of bags available for training, we fix and , and vary the number of bags for the training set from to with the same number of bags for early stopping and validation. Each experiment is repeated for runs, and results are shown in Figure 1 for individual NLL on the train set. Again we emphasise that the individual labels are not used in training. We see that all versions of VBAgg outperform all other models, in terms of MSE and NLL, with statistical significance confirmed by a signed rank permutation test (see Appendix H.1.1). We also observe that the bagpixel model has poor performance, as a result of losing individual level covariate information in training by simply aggregating them.
Varying number of individuals per bag: To study the effect of increasing bag sizes (with larger bag sizes, we expect "disaggregation" to be more difficult), we fix the number of training bags to be with early stopping and validation set to be bags, while varying the number of individuals per bag through and in the negative binomial distribution. To keep the relative scales between and the same, we take . The results are shown in Figure 1, focusing on the best performing methods in the previous experiment. Here, we observe that VBAgg models again perform better than the Nyström and NN models with statistical significance as reported in Appendix H.1.1, with performance stable as increases.
Discussion To gain more insight into the VBAgg model, we look at the calibration of our two different Bayesian models: VBAggExp and VBAggSquare. We compute their respective posterior quantiles and observe the ratio of times the true lie in these quantiles. We present these in Appendix H.1.1. The calibration plots reveal an interesting nature about using the two different approximations for using versus for . While experiments showed that the two model perform similarly in terms of NLL, the calibration of the models is very different. While the VBAggSquare is well calibrated in general, the VBAggExp suffers from poor calibration. This is not surprising, as VBAggExp uses an additional lower bound on model evidence. Thus, uncertainty estimates given by VBAggExp should be treated with care.
4.2 Malaria Incidence Prediction
We now demonstrate the proposed methodology on an important real life malaria prediction problem for an endemic country from the Malaria Atlas Project database^{3}^{3}3Due to confidentiality reasons, we do not report country or plot the full map of our results.. In this problem, we would like to predict the underlying malaria incidence rate in each km by km region (referred to as a pixel), while having only observed aggregated incidences of malaria at much larger regional levels, which are treated as bags of pixels. These bags are nonoverlapping administrative units, with pixels per bag ranging from 13 to 6,667, with a total of 1,044,683 pixels. In total, data is available for bags^{4}^{4}4We consider bags for train, bags each for validation and earlystop, with bags for testing, with different splits across different trials, selecting them to ensure distributions of labels are similar across sets.. Along with these pixels, we also have population estimates (per people) for pixel in bag , spatial coordinates given by , as well as covariates , collected by remote sensing. Some examples of covariates includes accessibility, distance to water, mean of land surface temperature and stable night lights. It is clear that rather than expecting malaria incidence rate to be constant throughout the entire bag (as in Figure 4.2), we expect pixel incidence rate to vary, depending on social, economic and environmental factors [32]. Our goal is therefore to build models that can predict malaria incidence rates at a pixel level.
We assume a Poisson model on each individual pixel, i.e. , where is the underlying pixel incidence rate of malaria per people that we are interested in predicting. We consider the VBAgg, Nyström and NN as prediction models and use a kernel given as a sum of an ARD (automatic relevance determination) kernel on covariates and a Matérn kernel on spatial locations for the VBAgg and Nyström methods, learning all kernel parameters (the kernel expression is provided in Appendix G). We use the same kernel for manifold regularisation in the NN model. This kernel choice incorporates spatial information, while allowing feature selection amongst other covariates. For choice of landmarks, we ensure landmarks are placed evenly throughout space by using one landmark point per training bag (selected by kmeans++). This is so that the uncertainty estimates we obtain are not too sensitive to the choice of landmarks.
In this problem, no individuallevel labels are available, so we report Bag NLL and MSE (on observed incidences) on the test bags in Appendix G over different resplits of the data. Although we can see that Nyström is the best performing method, the improvement over VBAgg models is not statistically significant. On the other hand, both VBAgg and Nyström models statistically significantly outperform NN, which also has some instability in its predictions, as discussed in Appendix G.1. However, a caution should be exercised when using the measure of performance at the bag level as a surrogate for the measure of performance at the individual level: in order to perform well at the bag level, one can simply utilise spatial coordinates and ignore other covariates, as malaria intensity appears to smoothly vary between the bags (Left of Figure 4.2). However, we do not expect this to be true at the individual level.
To further investigate this, we consider a particular region, and look at the predicted individual malaria incidence rate, with results found in Figure 4.2 and in Appendix G.1 across different data splits, where the behaviours of each of these models can be observed. While Nyström and VBAgg methods both provide good baglevel performance, Nyström and VBAggExp can sometimes provide overlysmooth spatial patterns, which does not seem to be the case for the VBAggSq method (recall that VBAggSq performed best in both prediction and calibration for the toy experiments). In particular, VBAggSq consistently predicts higher intensity along rivers (a known factor [31]; indicated by triangles in Figure 4.2) using only coarse aggregated intensities, demonstrating that prediction of (unobserved) pixellevel intensities is possible using finescale environmental covariates, especially ones known to be relevant such as covariates indicated by the Topographic Wetness Index, a measure of wetness, see Appendix G.2 for more details.
In summary, by optimising the lower bound to the marginal likelihood, the proposed variational methods are able to learn useful relations between the covariates and pixel level intensities, while avoiding the issue of overfitting to spatial coordinates. Furthermore, they also give uncertainty estimates (Figure 4.2, right), which are essential for problems like these, where validation of predictions is difficult, but they may guide policy and planning.
5 Conclusion
Motivated by the vitally important problem of malaria, which is the direct cause of around 187 million clinical cases [3] and 631,000 deaths [5] each year in subSaharan Africa, we have proposed a general framework of aggregated observation models using Gaussian processes, along with scalable variational methods for inference in those models, making them applicable to large datasets. The proposed method allows learning in situations where outputs of interest are available at a much coarser level than that of the inputs, while explicitly quantifying uncertainty of predictions. The recent uptake of digital health information systems offers a wealth of new data which is abstracted to the aggregate or regional levels to preserve patient anonymity. The volume of this data, as well as the availability of much more granular covariates provided by remote sensing and other geospatially tagged data sources, allows to probabilistically disaggregate outputs of interest for finer risk stratification, e.g. assisting public health agencies to plan the delivery of disease interventions. This task demands new highperformance machine learning methods and we see those that we have developed here as an important step in this direction.
Acknowledgement
We thank Kaspar Martens for useful discussions, and Dougal Sutherland for providing the code base in which this work was based on. HCLL is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). HCLL and KF are supported by JSPS KAKENHI 26280009. EC and KB are supported by OPP1152978, TL by OPP1132415 and the MAP database by OPP1106023. DS is supported in part by the ERC (FP7/617071) and by The Alan Turing Institute (EP/N510129/1). The data were provided by the Malaria Atlas Project supported by the Bill and Melinda Gates Foundation.
References
 [1] LU Ancarani and G Gasaneo. Derivatives of any order of the confluent hypergeometric function f 1 1 (a, b, z) with respect to the parameter a or b. Journal of Mathematical Physics, 49(6):063508, 2008.
 [2] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
 [3] Samir Bhatt, DJ Weiss, E Cameron, D Bisanzio, B Mappin, U Dalrymple, KE Battle, CL Moyes, A Henry, PA Eckhoff, et al. The effect of malaria control on plasmodium falciparum in africa between 2000 and 2015. Nature, 526(7572):207, 2015.
 [4] Veronika Cheplygina, David M.J. Tax, and Marco Loog. On classification with bags, groups and sets. Pattern Recognition Letters, 59:11 – 17, 2015.
 [5] Peter W Gething, Daniel C Casey, Daniel J Weiss, Donal Bisanzio, Samir Bhatt, Ewan Cameron, Katherine E Battle, Ursula Dalrymple, Jennifer Rozier, Puja C Rao, et al. Mapping plasmodium falciparum mortality in africa between 1990 and 2015. New England Journal of Medicine, 375(25):2435–2445, 2016.
 [6] Pierre Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, 42(5):535–554, Jul 2010.
 [7] Manuel Haußmann, Fred A Hamprecht, and Melih Kandemir. Variational bayesian multiple instance learning with gaussian processes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6570–6579, 2017.
 [8] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. 2013.
 [9] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable Variational Gaussian Process Classification. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351–360, San Diego, California, USA, 09–12 May 2015. PMLR.
 [10] Richard Howitt and Arnaud Reynaud. Spatial disaggregation of agricultural production data using maximum entropy. European Review of Agricultural Economics, 30(3):359–387, 2003.
 [11] Petr Keil, Jonathan Belmaker, Adam M Wilson, Philip Unitt, and Walter Jetz. Downscaling of species distribution models: a hierarchical approach. Methods in Ecology and Evolution, 4(1):82–94, 2013.
 [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [13] Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. ACM, 2015.
 [14] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In UAI, pages 332–339, 2005.
 [15] H. C. L. Law, C. Yau, and D. Sejdinovic. Testing and learning on distributions with symmetric noise invariance. In NIPS, 2017.
 [16] Ho Chung Leon Law, Dougal Sutherland, Dino Sejdinovic, and Seth Flaxman. Bayesian approaches to distribution regression. In International Conference on Artificial Intelligence and Statistics, pages 1167–1176, 2018.
 [17] Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814–1822, 2015.
 [18] Vitalik Melnikov and Eyke Hüllermeier. Learning to aggregate using uninorms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 756–771. Springer, 2016.
 [19] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyonds. arXiv preprint arXiv:1605.09522, 2016.
 [20] David R Musicant, Janara M Christensen, and Jamie F Olson. Supervised learning by training on aggregate outputs. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 252–261. IEEE, 2007.
 [21] H. Nickisch and CE. Rasmussen. Approximations for binary gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, October 2008.
 [22] Giorgio Patrini, Richard Nock, Tiberio Caetano, and Paul Rivera. (Almost) no label no cry. In NIPS. 2014.
 [23] Novi Quadrianto, Alex J Smola, Tiberio S Caetano, and Quoc V Le. Estimating labels from label proportions. JMLR, 10:2349–2374, 2009.
 [24] Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, December 2005.
 [25] Ali Rahimi and Benjamin Recht. Random features for largescale kernel machines. In NIPS, pages 1177–1184, 2007.
 [26] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, 2006.
 [27] Alex J Smola and Peter L Bartlett. Sparse greedy gaussian process regression. In Advances in neural information processing systems, pages 619–625, 2001.
 [28] Zoltán Szabó, Bharath K Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
 [29] Yee W Teh, David Newman, and Max Welling. A collapsed variational bayesian inference algorithm for latent dirichlet allocation. In Advances in neural information processing systems, pages 1353–1360, 2007.
 [30] Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR.
 [31] DA Warrel, T Cox, J Firth, and Jr E Benz. Oxford textbook of medicine, 2017.
 [32] Daniel J Weiss, Bonnie Mappin, Ursula Dalrymple, Samir Bhatt, Ewan Cameron, Simon I Hay, and Peter W Gething. Reexamining environmental correlates of plasmodium falciparum malaria endemicity: a dataintensive variable selection approach. Malaria journal, 14(1):68, 2015.
 [33] António Xavier, Maria de Belém Costa Freitas, Maria do Socorro Rosário, and Rui Fragoso. Disaggregating statistical data at the field level: An entropy approach. Spatial Statistics, 23:91 – 108, 2018.
 [34] Felix X Yu, Krzysztof Choromanski, Sanjiv Kumar, Tony Jebara, and ShihFu Chang. On learning from label proportions. arXiv preprint arXiv:1402.5902, 2014.
 [35] Felix X Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, and ShihFu Chang. svm for learning with label proportions. arXiv preprint arXiv:1306.0886, 2013.
 [36] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. In NIPS, 2017.
Appendix A Aggregated Exponential Family Models
Consider an observation model of the form
(15) 
where response is onedimensional, is a natural parameter corresponding to the statistic , is a dispersion parameter, and is base measure. For simplicity, we will assume that natural parameters corresponding to the other parts of the sufficient statistic are fixed and folded into the base measure. Let be the corresponding mean parameter, i.e.
and be the link function mapping from mean to the natural parameters and its inverse. We wish to model the mean parameter using a Gaussian process on a domain together with a function which transforms the GP value to the natural parameter space, i.e.
(16) 
For example, the mean parameter for some models is restricted to the positive part of the real line, while the GP values cover the whole real line. We will consider the following examples:

Normal (with fixed variance). and can be identity, too, as there are no restrictions on the mean parameter space.

Poisson. , . should take a positive value, so we consider or .

Exponential. and , , . should take a positive value, so we consider or
Note that the link function is concave for all the examples above.
a.1 Bag model
We will consider the aggregation in the mean parameter space. Namely, let be independent aggregate responses for each of the bags of covariates , . We assume the following aggregation model:
(17) 
where are fixed weights to adjust the scales among the individuals and the bag (e.g., adjusting for population size).
We also can model individual (unobserved) variables (, which follow:
(18) 
Note that we consider aggregation in mean parameters of responses, not in the responses themselves. If we consider a case where underlying individual responses aggregate to as a weighted sum, the form of the bag likelihood and individual likelihood would be different unless we restrict attention to distribution families which are closed under both scaling and convolution. However, when aggregation occurs in the mean parameter space, the form of the bag likelihood and individual likelihood is always the same. This corresponds to the following measurement process:

Each individual has a mean parameter  if it were possible to sample a response for that particular individual, we would obtain a sample

However, we cannot sample the individual and we can only observe a bag response. But in that case, only a single bag response is taken and depends on all individuals simultaneously. Each individual contributes in terms of an increase in a mean bag response, but this measurement process is different from the twostage procedure by which we aggregate individual responses.
a.2 Marginal likelihood and ELBO
Let (bag observations). With the inducing points , the marginal likelihood is
(19) 
The evidence lower bound can be derived as
(20) 
where .
By setting the variational distribution as Gaussian, the third term is tractable. The first and second terms are however tractable only in limited cases. The cases we develop are the Poisson bag model, described in the main text, as well as the normal bag model and the exponential bag model, described below.
a.3 Normal bag model
is identity and , which makes both the first and the second terms tractable with the choice of . Moreover, the viewpoints of aggregating in the mean parameters and in the individual responses are equivalent for this model and we can also allow different variance parameters for different bags (and individuals).
Consider a bag of items . Each item is assumed to have a weight . At the individual level, we model the (unobserved) responses as
(21) 
where , thus is a mean parameter per unit weight corresponding to the item and it is assumed to be a function of both . Similarly, is a variance parameter per unit weight. At the bag level, we consider the following model for the observed aggregate response , assuming conditional independence of individual responses given covariates :
(22) 
where and are the mean and variance parameters per unit weight of the whole bag and is the total weight of bag . Although we can take to also be a function of the covariates, here for simplicity, we take to be constant per bag (note the abuse of notation). We can now compute the negative loglikelihood (NLL) across bags (assuming conditional independence given the ):
(23) 
where is the function we are interested in, and are the variance parameters to be learnt.
We can now consider the lower bound to the marginal likelihood as below (assuming here to simplify notation, while the analogous expression with nonuniform weights is straightforward):
(24) 
Using again a Gaussian distribution for , we have , which is a normal distribution and let be its marginal normal distribution of with mean and covariance given by and as before in (9).
Then all expectations with respect to are tractable and the ELBO is simply
(25) 
a.4 Exponential bag model
In this case, we have . We can apply the similar argument as in Lemma 1. For any with , by the concavity of ,
For , the last line is equal to
When using a normal , this is tractable for several choices of including and . If we let , and maximize
under the constraint , we obtain
Finally, we have a lower bound
(26) 
where
which is tractable for a Gaussian variational family. Also with an explicit form of , it is easy to take the derivatives of the resulting lower bound with respect to the variational parameters in .
Appendix B Alternative approaches
Constant
For the Poisson model, we can take , a constant rate across the bag, then:
hence the individual level predictive distribution is the form , and for unseen bag , , with predictive distribution given by .
bagpixel: Bag as Individual
Another baseline is to train a model from the weighted average of the covariates, given by in the Poisson case, and in the normal case. The purpose of this baseline is to demonstrate that modelling at the individual level is important during training. Since we now have labels and covariates at the bag level, we can consider the following model:
with for the Poisson model. For the normal model, we have:
where and is a parameter to be learnt (assuming constant across bags). Now we observe that these models are identical to the individual model, except for a difference in indexing. Hence, after learning the function at the bag level, we can transfer the model to the individual level. Essentially here we have created fake individual level instances by aggregation of individual covariates inside a bag.
Nyström: Bayesian MAP for Poisson regression on explicit feature maps
Instead of the posterior based on the model (6), we can also consider an explicit feature map in order to directly construct a MAP estimator. While this method does not provide posterior uncertainty over , it does provide an interesting connection to the settings we have considered and also manifoldregularized neural networks, as discussed below. Let be the covariance function defined on covariates , and consider its low rank approximation with landmark points and . By using landmark points , we have avoided computation of the full kernel matrix, reducing computational complexity. Under this setup, we have that , with being the explicit (Nyström) feature map. Using this explicit feature map , we have the following model:
where is a prior parameter and is the corresponding row of . We can then consider a MAP estimator of the model coefficients :
(27) 
This essentially recovers the same model as in (3) with the standard loss regularising the complexity of the function. This model can be thought of in several different ways, for example as a weight space view of the GP ([26] for an overview), or as a MAP of the Subset of Regressors (SoR) approximation [27] of the GP when . Additional we may include manifold regulariser as part of the prior, see discussion below about neural network.
NN: Manifoldregularized neural networks
The next approach we consider is a parametric model for as in [13], and search the best parameter to minimize negative loglikelihood across bags. This paper considers a neural network with parameters for the model , and uses the backpropagation to learn and hence individual level model . However, since we only have aggregated observations at the bag level, but lots of individual covariate information, it is useful to incorporate this information also, by enforcing smoothness on the data manifold given by the unlabelled data. To do this, following [13] and [22], we pursue a semisupervised view of the problem and include an additional manifold regularisation term [2] (rescaling with during implementation):