Multi-task Learning for Aggregated Data using Gaussian Processes
Abstract
Aggregated data is commonplace in areas such as epidemiology and demography. For example, census data for a population is usually given as averages defined over time periods or spatial resolutions (cities, regions or countries). In this paper, we present a novel multi-task learning model based on Gaussian processes for joint learning of variables that have been aggregated at different input scales. Our model represents each task as the linear combination of the realizations of latent processes that are integrated at a different scale per task. We are then able to compute the cross-covariance between the different tasks either analytically or numerically. We also allow each task to have a potentially different likelihood model and provide a variational lower bound that can be optimised in a stochastic fashion making our model suitable for larger datasets. We show examples of the model in a synthetic example, a fertility dataset and an air pollution prediction application.
1 Introduction
Many datasets in fields like ecology, epidemiology, remote sensing, sensor networks and demography appear naturally as aggregated, that is, variables in these datasets are measured or collected in intervals, areas or supports of different shapes and sizes. For example, census data are usually sampled or collected as aggregated at different administrative divisions, e.g. borough, town, postcode or city levels. In sensor networks, correlated variables are measured at different resolutions or scales. In the near future, air pollution monitoring across cities and regions could be done using a combination of a few high-quality low time-resolution sensors and several low-quality (low-cost) high time-resolution sensors. Joint modelling of the variables registered in the census data or the variables measured using different sensor configurations at different scales can improve predictions at the point or support levels.
In this paper, we are interested in providing a general framework for multi-task learning on these types of datasets. Our motivation is to use multi-task learning to jointly learn models for different tasks where each task is defined at (potentially) a different support of any shape and size and has a (potentially) different nature, i.e. it is a continuous, binary, categorical or count variable. We appeal to the flexibility of Gaussian processes (GPs) for developing a prior over such type of datasets and we also provide a scalable approach for variational Bayesian inference.
Gaussian processes have been used before for aggregated data (Smith et al., 2018; Law et al., 2018) and also in the context of the related field of multiple instance learning (Kim and De la Torre, 2010; Kandemir et al., 2016; Haußmann et al., 2017). In multiple instance learning, each instance in the dataset consists of a set (or bag) of inputs with only one output (or label) for that whole set. The aim is to provide predictions at the level of individual inputs. Smith et al. (2018) provide a new kernel function to handle single regression tasks defined at different supports. They use cross-validation for hyperparameter selection. Law et al. (2018) use the weighted sum of a latent function evaluated at different inputs as the prior for the rate of a Poisson likelihood. The latent function follows a GP prior. The authors use stochastic variational inference (SVI) for approximating the posterior distributions. While the previous two approaches perform the aggregation at the latent prior stage, Kim and De la Torre (2010); Kandemir et al. (2016) and Haußmann et al. (2017) perform the aggregation at the likelihood level. These three approaches target a binary classification problem. Both Kim and De la Torre (2010) and Haußmann et al. (2017) focus on the case for which the label of the bag corresponds to the maximum of the (unobserved) individual labels of each input. Kim and De la Torre (2010) approximate the maximum using a softmax function computed using a latent GP prior evaluated across the individual elements of the bag. They use the Laplace approximation for computing the approximated posterior (Rasmussen and Williams, 2006). Haußmann et al. (2017), on the other hand, approximate the maximum using the so called bag label likelihood, introduced by the authors, which is similar to a Bernoulli likelihood with soft labels given by a convex combination between the bag labels and the maximum of the (latent) individual labels. The latent individual labels in turn follow Bernoulli likelihoods with parameters given by a GP. The authors provide a variational bound and include inducing inputs for scalable Bayesian inference. Kandemir et al. (2016) follow a similar approach to Law et al. (2018) equivalent to setting all the weights in Law et al.’s model to one. The sum is then used to modulate the parameter of a Bernoulli likelihood that models the bag labels. They use a Fully Independent Training Conditional approximation for the latent GP prior (Snelson and Ghahramani, 2006). Other relevant work is described in Section 3. In contrast to these previous works, we provide a multi-task learning model for aggregated data that scales to large datasets.
For building the multi-task learning model we appeal to the linear model of coregionalisation (Journel and Huijbregts, 1978; Goovaerts, 1997) that has gained popularity in the multi-task GP literature in recent years (Bonilla et al., 2008; Alvarez et al., 2012). We also allow different likelihood functions (Moreno-Muñoz et al., 2018) and different input supports per individual task. Moreover, we introduce inducing inputs at the level of the underlying common set of latent functions, an idea initially proposed in Alvarez and Lawrence (2009). We then use stochastic variational inference for GPs (Hensman et al., 2013) leading to an approximation similar to the one obtained in Moreno-Muñoz et al. (2018). Empirical results show that the multi-task learning approach developed here provides accurate predictions in different challenging datasets where tasks have different supports.
2 Multi-task learning for aggregated data at different scales
In this section we first define the basic model in the single-task setting. We then extend the model to the multi-task setting and finally provide details for the stochastic variational formulation for approximate Bayesian inference.
2.1 Change of support using Gaussian processes
Change of support has been studied in geostatistics before (Gotway and Young, 2002). In this paper, we use a formulation similar to Kyriakidis (2004). We start by defining a stochastic process over the input interval using
where is a latent stochastic process that we assume follows a Gaussian process with zero mean and covariance and . Dividing by helps to keep the proportionality between the length of the interval and the area under in the interval. In other words, the process is modeled as a density meaning that inputs with widely differing supports will behave in a similar way. The first two moments for are given as and . The covariance for follows as since . Let us use to refer to . We can now use these mean and covariance functions for representing the Gaussian process prior for .
For some forms of it is possible to obtain an analytical expression for . For example, if follows an Exponentiated-Quadratic (EQ) covariance form, , where is the variance of the kernel and is the length-scale, it can be shown that follows as
where with , the error function defined as . Other kernels for also lead to analytical expressions for . See for example Smith et al. (2018).
So far, we have restricted the exposition to one-dimensional intervals. However, we can define the stochastic process over a general support , with measure , using
The support generally refers to an area or volume of any shape or size. Following similar assumptions to the ones we used for , we can build a GP prior to represent with covariance defined as . Let . If the support has a regular shape, e.g. a hyperrectangle, then assumptions on such as additivity or factorization across input dimensions will lead to kernels that can be expressed as addition of kernels or product of kernels acting over a single dimension. For example, let , where , and a GP over each . If each is an EQ kernel, then , where and are the intervals across each input dimension. If the support does not follow a regular shape, i.e it is a polytope, then we can approximate the double integration by numerical integration inside the support.
2.2 Multi-task learning setting
Our inspiration for multi-task learning is the linear model of coregionalisation (LMC) (Journel and Huijbregts, 1978). This model has connections with other multi-task learning approaches that use kernel methods, as reviewed in Alvarez et al. (2012). In the LMC, each output (or task in our case) is represented as a linear combination of a common set of latent Gaussian processes. Let be a set of GPs with zero means and covariance functions . Each GP is sampled independently and identically times to produce realizations that are used to represent the outputs. Let be a set of tasks where each task is defined at a different support . We use the set of realizations to represent each task as
(1) |
where the coefficients weigh the contribution of each integral term to . Since , with the Kronecker delta between and , the cross-covariance between and is then given as
where . Following the discussion in section 2.1, the double integral can be solved analytically for some options of , and . Generally, a numerical approximation can be obtained.
It is also worth mentioning at this point that the model does not require all the tasks to be defined at the area level. Some of the tasks could also be defined at the point level. Say for example that is defined at the support level , , whereas is defined at the point level, say , . In this case, . We can still compute the cross-covariance between and , , leading to, . For the case and (i.e. dimensionality of the input space), this is, , and an EQ kernel for , we get (we used to avoid an overparameterization for the variance). Again, if does not have a regular shape, we can approximate the integral numerically.
Let us define the vector-valued function . A GP prior over can use the kernel defined above so that
where each is known as a coregionalisation matrix. The scalar term modulates as a function of and .
The prior above can be used for modulating the parameters of likelihood functions that model the observed data. The most simple case corresponds to the multi-task regression problem that can be modeled using a multivariate Gaussian distribution. Let be a random vector modeling the observed data. In the multi-task regression problem , where is the mean vector and is a diagonal matrix with entries . We can use the GP prior as the prior over the mean vector . Since both the likelihood and the prior are Gaussian, both the marginal distribution for and the posterior distribution of given can be computed analytically. For example, the marginal distribution for is given as . Moreno-Muñoz et al. (2018) introduced the idea of allowing each task to have a different likelihood function and modulating the parameters of that likelihood function using one or more elements in the the vector-valued GP prior. For that general case, the marginal likelihood and the posterior distribution cannot be computed in closed form.
2.3 Stochastic variational inference
Let be a dataset of multiple tasks with potentially different supports per task, where , with , and , with and . We are interested in computing the posterior distribution , where , with and . In this paper, we will use stochastic variational inference to compute a deterministic approximation of the posterior distribution , by means of the the well known idea of inducing variables. In contrast to the use of SVI for traditional Gaussian processes, where the inducing variables are defined at the level of the process , we follow Álvarez et al. (2010) and Moreno-Muñoz et al. (2018), and define the inducing variables at the level of the latent processes . For simplicity in the notation, we assume . Let be the set of inducing variables, where , with the inducing inputs. Notice also that we have used a common set of inducing inputs for all latent functions but we can easily define a set per inducing vector .
For the multi-task regression case, it is possible to compute an analytical expression for the Gaussian posterior distribution over the inducing variables , , following a similar approach to Álvarez et al. (2010). However, such approximation is only valid for the case in which the likelihood model is Gaussian and the variational bound obtained is not amenable for stochastic optimisation. An alternative for finding also establishes a lower-bound for the log-marginal likelihood , but uses numerical optimisation for maximising the bound with respect to the mean parameters, , and the covariance parameters, , for the Gaussian distribution (Moreno-Muñoz et al., 2018). Such numerical procedure can be used for any likelihood model and the optimisation can be performed using mini-batches. We follow this approach.
Lower-bound
The lower bound for the log-marginal likelihood follows as
where , , and is the prior over the inducing variables. Here is a blockwise matrix with matrices . In turn each of these matrices have entries given by . Similarly, is a block-diagonal matrix with blocks given by with entries computed using . The optimal is chosen by numerically maximizing with respect to the parameters and . To ensure a valid covariance matrix we optimise the Cholesky factor for . See appendix A.1 for more details on the lower bound. The computational complexity is similar to the one for the model in Moreno-Muñoz et al. (2018).
Hyperparameter learning
When using the multi-task learning method, we need to optimise the hyperparameters associated with the LMC, these are: the coregionalisation matrices , the hyperparameters of the kernels , and any other hyperparameter associated to the likelihood functions that has not been considered as a member of the latent vector . Hyperparameter optimisation is done using the lower bound as the objective function. First is maximised with respect to the variational distribution and then with respect to the hyperparameters. The two-steps are repeated one after the other until reaching convergence. Such style of optimisation is known as variational EM (Expectation-Maximization). In the Expectation step we compute a variational posterior distribution and in the Maximization step we use a variational lower bound to find point estimates of any hyperparameters. For optimising the hyperparameters in , we also use a Cholesky decomposition for each matrix to ensure positive definiteness. So instead of optimising directly, we optimise , where . For the experimental section, we use the EQ kernel for , so we fix the variance for to one (the variance per output is already contained in the matrices ) and optimise the length-scales .
Predictive distribution
Given a new set of test inputs , the predictive distribution for is computed using , where and refer to the vector-valued functions and evaluated at . Notice that . Even though does not appear explicitly in the expression for , it has been used to compute the posterior for through the optimisation of where is explicitly taken into account. We are usually interested in the mean prediction and the predictive variance . Both can be computed by exchanging integrals in the double integration over and . See appendix A.1 for more details on this.
3 Related work
Machine learning methods for different forms of aggregated datasets are also known under the names of multiple instance learning, learning from label proportions or weakly supervised learning on aggregate outputs (Kück and de Freitas, 2005; Musicant et al., 2007; Quadrianto et al., 2009; Patrini et al., 2014; Kotzias et al., 2015; Bhowmik et al., 2015). Law et al. (2018) provides a summary of these different approaches. Commonly these methods start with the following setting: each instance in the dataset is in the form of a set of inputs for which there is only one corresponding output (e.g. the proportion of class labels, an average or a sample statistic). The prediction problem usually consists then in predicting the output for the individual inputs in a set. The setting we present in this paper is slightly different in the sense that, in general, for each instance, the input corresponds to a support of any shape and size and the output corresponds to a vector-valued output. Moreover, each task can have its own support. Similarly, while most of these ML approaches have been developed for either regression or classification, our model is built on top of (Moreno-Muñoz et al., 2018), allowing each task to have a potentially different likelihood.
As mentioned in the introduction, Gaussian processes have also been used for multiple instance learning or aggregated data (Kim and De la Torre, 2010; Kandemir et al., 2016; Haußmann et al., 2017; Smith et al., 2018; Law et al., 2018). Compared to these previous approaches, our model goes beyond the single task problem and allows learning multiple tasks simultaneously. Each task can have its own support at training and test time. Although our model was formulated for a continuous support , we can also define it in terms of a finite set of (previously defined) inputs in the support, e.g. a set which is more akin to the bag formulations in these previous works. This would require changing the integral in (1) for the sum . All the other components, including the inference approach, would remain equal.
In geostatistics, a similar problem has been studied under the names of downscaling or spatial disaggregation (Zhang et al., 2014), particularly using different forms of kriging (Goovaerts, 1997). It is also closely related to the problem of change of support described with detail in Gotway and Young (2002). Block-to-point kriging (or area-to-point kriging if the support is defined in a surface) is a common method for downscaling, this is, provide predictions at the point level provided data at the block level (Kyriakidis, 2004; Goovaerts, 2010). We extend the approach introduced in Kyriakidis (2004) later revisited by Goovaerts (2010) for count data, to the multi-task setting, including also a stochastic variational EM algorithm for scalable inference.
4 Experiments
In this section, we apply the multi-task learning model for prediction in three different datasets: a synthetic example for two tasks that follow a Poisson likelihood each, a two-dimensional input dataset of fertility rates aggregated by year of conception and ages in Canada, and an air-pollution sensor network where one task corresponds to a high-accuracy, low-frequency particulate matter sensor and another task corresponds to a low-cost, low-accuracy, high resolution sensor.
Synthetic data
In this section we evaluated our model with a synthetic dataset. For all of the experiments we use with an EQ covariance for the latent function . We set up a toy problem with tasks, where both likelihood functions are Poisson. We sample from the latent vector-valued GP and use those samples to modulate the Poisson likelihoods using and as the respective rates. The first task is generated using intervals of units, whereas the second task is generated using intervals of units. All the inputs are uniformly distributed in the range . We generated observations for task 1 and for task 2. For training the multi-task model, we select from the observations for task 1 and use all for the second task. The other data points for task 1 correspond to a gap in the interval of that we use as the test set. In this experiment, we evaluated our model’s capability in predicting one task, sampled more frequently, using the training information from a second task with a larger support. For optimization we used the standard LBFGS-B algorithm. In Figure 1 we show that the data in the second task, with a larger support, helps predicting the test data in the gap present in the first task, with a smaller support (right panel). However, this is not the case in the single task learning scenario where the predictions are basically constant and equal to 1 (left panel). Both models predict the training data equally well.
Fertility rates from a Canadian census
In this experiment, a subset of the Canadian fertility dataset is used from the Human Fertility Database (HFD) ^{1}^{1}1https://www.humanfertility.org. The dataset consists of live births’ statistics by year, age of mother and birth order. The ages of the mother are between and the years are between . It contains data points of fertility rate per birth order (the output variable) and there are four birth orders. We used the data points of the 1st birth only. The dataset was randomly split into training points and test points. We consider two tasks: the first task consists of a different number of data observations randomly taken from the training points. The second task consists of all the training data aggregated at two different resolutions, and (we wanted to test the predictive performance when the relation of high-resolution data to low-resolution data was to and another for to ). The aggregated data for the case is reduced to data points and the aggregated data for case is reduced to points. In the experiments, we train this multi-task model by slowly increasing the original resolution training data, while maintaining a fixed amount of training points mentioned before for the aggregated second task. The output variable (fertility rate for the first birth) was assumed to be Gaussian, so both tasks follow a Gaussian likelihood. We use with an EQ kernel with where the two input variables are age of mother and birth year. We used fixed inducing variables. The Adam optimiser, included in climin library, was used for the optimisation of the variational distribution (variational E-step) and the hyperparameters (variational M-step). We use a mini-batch of size samples. The prediction task consists of predicting the original resolution test data with the help of the second task which consists of the aggregated data ( or for two separate experiments). Figure 2 shows SMSE (Standardized Mean Squared Error) for five random selections of data points in the training and test sets. We notice that the multi-task learning model outperforms the single-task GP when there are few observations in the task with the original resolution data. This pattern holds below observations. At that point, both models perform equally well since the single-task GP now has enough training data. With respect to the two different resolutions, the performance of the multi-task model is better when the second task has a resolution rather than resolution, as one might also expect.
Air pollution monitoring network
Particulate air pollution can be measured accurately with high temporal precision by using a attenuation (BAM) sensor or similar. Unfortunately these are often prohibitively expensive. We propose instead that one can combine the measurements from a low-cost optical particle counter (OPC) which gives good temporal resolution but is often badly biased, with the results of a Cascade Impactor (CIs), which are a cheaper method for assessing the mass of particulate species but integrate over many hours (e.g. 6 or 24 hours).
One can formulate the problem as observations of integrals of a latent function. The CI integrating over 6 hour periods while the OPC sensor integrating over short 5 minute periods. We used data from two colocated particulate matter (PM2.5) sensors in Kampala (0.3073N 32.6205E) between 2019-03-13 and 2019-03-23. We used the average of six-hour periods from a calibrated mcerts-verified Osiris (Turnkey) particulate air pollution monitoring system as the low-resolution data, and then compared the prediction results to the original measurements (available at a 15 minute resolution). We used a PMS 5003 (Plantower) low-cost OPC to provide the high-resolution data. Typically we found these values would often be biased. We simply normalised (scaled) the data to emphasise that the absolute values of these variables are not of interest in this model.
Our multi-task model consists of a single latent function, , with covariance that follows an EQ form. We assume both outputs follow Gaussian likelihoods. In our model, one task represents the high accuracy low-resolution samples and the second task represents the low-accuracy high-resolution samples. The posterior GP both aims to fulfil the 6-hour long integrals of the high-accuracy data (from the Osiris instrument) while remaining correlated with the high-frequency bias data from the OPC. We used 600 iterations of the variational EM algorithm, with 200 evenly spaced inducing points and a fixed lengthscale of 0.75 hours. We only optimise the parameters of the coregionalisation matrix and the variance of the noise of each Gaussian likelihood.
Figure 3 illustrates the results for a 24 hour period. The training data consists of the high-resolution low-accuracy sensor and a low-frequency high accuracy sensor. The aim is to reconstruct the underlying level of pollution both sensors are measuring. To test whether the additional high-frequency data improves the accuracy we ran the coregionalisation both with and without this additional training data.
We found that the SMSE for the predictions over the 10 days tested were substantially smaller with multi-task learning compared to using only the low-resolution samples, and respectively (the difference is statistical significant using a paired -test with a value of for a confidence level of ). In summary, the model was able to incorporate this additional data and use it to improve the estimates while still ensuring the long integrals were largely satisfied.
5 Conclusion
In this paper, we have introduced a powerful framework for working with aggregated datasets that allows the user to combine observations from disparate data types, with varied support. This allows us to produce both finely resolved and accurate predictions by using the accuracy of low-resolution data and the fidelity of high-resolution side-information. We chose our inducing points to lie in the latent space, a distinction which allows us to estimate multiple tasks with different likelihoods. SVI and variational-EM with mini-batches make the framework scalable and tractable for potentially very large problems. In summary, this framework provides a vital toolkit, allowing a mixture of likelihoods, kernels and tasks and paves the way to the analysis of a very common and widely used data structure - that of values over a variety of supports on the domain.
6 Acknowledgement
MTS and MAA have been financed by the Engineering and Physical Research Council (EPSRC) Research Project EP/N014162/1. MAA has also been financed by the EPSRC Research Project EP/R034303/1.
References
- Álvarez et al. (2010) Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, pages 25–32, 2010.
- Alvarez and Lawrence (2009) Mauricio A. Alvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 57–64. Curran Associates, Inc., 2009.
- Alvarez et al. (2012) Mauricio A. Alvarez, Lorenzo Rosasco, Neil D. Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
- Bhowmik et al. (2015) Avradeep Bhowmik, Joydeep Ghosh, and Oluwasanmi Koyejo. Generalized linear models for aggregated data. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 93–101. PMLR, May 2015.
- Bonilla et al. (2008) Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task gaussian process prediction. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.
- Goovaerts (1997) Pierre Goovaerts. Geostatistics for natural resources evaluation. Oxford University Press, 1997.
- Goovaerts (2010) Pierre Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, 42(5):535–554, July 2010.
- Gotway and Young (2002) Carol A. Gotway and Linda J. Young. Combining incompatible spatial data. Journal of the American Statistical Association, 97(458):632–648, 2002.
- Haußmann et al. (2017) Manuel Haußmann, Fred A. Hamprecht, and Melih Kandemir. Variational Bayesian multiple instance learning with Gaussian processes. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 810–819, July 2017.
- Hensman et al. (2013) James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282–290, 2013.
- Journel and Huijbregts (1978) Andre G. Journel and Charles J. Huijbregts. Mining Geostatistics. Academic Press, 1978.
- Kandemir et al. (2016) Melih Kandemir, Manuel Haussmann, Ferran Diego, Kumar T. Rajamani, Jeroen van der Laak, and Fred A. Hamprecht. Variational weakly supervised Gaussian processes. In Proceedings of the British Machine Vision Conference 2016, BMVC, September 2016.
- Kim and De la Torre (2010) Minyoung Kim and Fernando De la Torre. Gaussian processes multiple instance learning. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 535–542. Omnipress, June 2010.
- Kotzias et al. (2015) Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 597–606. ACM, 2015.
- Kück and de Freitas (2005) Hendrik Kück and Nando de Freitas. Learning about individuals from group statistics. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI’05, pages 332–339. AUAI Press, 2005.
- Kyriakidis (2004) Phaedon C. Kyriakidis. A geostatistical framework for area-to-point spatial interpolation. Geographical Analysis, 36(3):259–289, 2004.
- Law et al. (2018) Ho Chung Leon Law, Dino Sejdinovic, Ewan Cameron, Tim C.D. Lucas, Seth Flaxman, Katherine Battle, and Kenji Fukumizu. Variational learning on aggregate outputs with gaussian processes. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, pages 6084–6094. Curran Associates Inc., 2018.
- Moreno-Muñoz et al. (2018) Pablo Moreno-Muñoz, Antonio Artés, and Mauricio A. Álvarez. Heterogeneous multi-output gaussian process prediction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6711–6720. Curran Associates, Inc., 2018.
- Musicant et al. (2007) David R. Musicant, Janara M. Christensen, and Jamie F. Olson. Supervised learning by training on aggregate outputs. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 252–261, October 2007.
- Patrini et al. (2014) Giorgio Patrini, Richard Nock, Paul Rivera, and Tiberio Caetano. (almost) no label no cry. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 190–198. Curran Associates, Inc., 2014.
- Quadrianto et al. (2009) Novi Quadrianto, Alex J. Smola, Tiberio S. Caetano, and Quoc V. Le. Estimating labels from label proportions. J. Mach. Learn. Res., 10:2349–2374, December 2009. ISSN 1532-4435.
- Rasmussen and Williams (2006) Carl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
- Smith et al. (2018) Michael T. Smith, Mauricio A. Álvarez, and Neil D Lawrence. Gaussian process regression for binned data. arXiv e-prints, September 2018.
- Snelson and Ghahramani (2006) Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.
- Zhang et al. (2014) Jingxiong Zhang, Peter Atkinson, and Michael F. Goodchild. Scale in Spatial Information and Analysis. CRC Press, 2014.
Appendix A Supplemental material
a.1 Additional details on SVI
Lower-bound
It can be shown (Moreno-Muñoz et al., 2018) that the bound is given as
where the expected value is taken with respect to the distribution, which is a Gaussian distribution with mean and covariance . For Gaussian likelihoods,
wecan compute the expected value in the bound in closed form. For other likelihoods, we can use numerical integration to approximate it. Instead of using the whole batch of data , we can use mini-batches to estimate the gradient.
Predictive distribution
For example, . The inner integral is computed with the conditional distribution and its form depend on the likelihood terms per task. The outer integral can be approximated using numerical integration or Monte-Carlo sampling. A similar procedure can be followed to compute .
a.2 SNLP for the fertility dataset
Figure 4 shows the results in terms of SNLP (standardized negative log probability density) for the Fertility dataset. We can notice a similar pattern to the one observed for the SMSE in Figure 2.