# Multi-task Learning for Aggregated Data using Gaussian Processes

## Abstract

Aggregated data is commonplace in areas such as epidemiology and demography. For example, census data for a population is usually given as averages defined over time periods or spatial resolutions (cities, regions or countries). In this paper, we present a novel multi-task learning model based on Gaussian processes for joint learning of variables that have been aggregated at different input scales. Our model represents each task as the linear combination of the realizations of latent processes that are integrated at a different scale per task. We are then able to compute the cross-covariance between the different tasks either analytically or numerically. We also allow each task to have a potentially different likelihood model and provide a variational lower bound that can be optimised in a stochastic fashion, making our model suitable for larger datasets. We demonstrate the model on a synthetic example, a fertility dataset and an air pollution prediction application.

## 1 Introduction

Many datasets in fields like ecology, epidemiology, remote sensing, sensor networks and demography appear naturally aggregated, that is, variables in these datasets are measured or collected in intervals, areas or supports of different shapes and sizes. For example, census data are usually sampled or collected as aggregated at different administrative divisions, e.g. borough, town, postcode or city levels. In sensor networks, correlated variables are measured at different resolutions or scales. In the near future, air pollution monitoring across cities and regions could be done using a combination of a few high-quality low time-resolution sensors and several low-quality (low-cost) high time-resolution sensors. Joint modelling of the variables registered in the census data or the variables measured using different sensor configurations at different scales can improve predictions at the point or support levels.

In this paper, we are interested in providing a general framework for multi-task learning on these types of datasets. Our motivation is to use multi-task learning to jointly learn models for different tasks where each task is defined at (potentially) a different support of any shape and size and has a (potentially) different nature, i.e. it is a continuous, binary, categorical or count variable. We appeal to the flexibility of Gaussian processes (GPs) for developing a prior over such type of datasets and we also provide a scalable approach for variational Bayesian inference.

Gaussian processes have been used before for aggregated data (Smith et al., ; Law et al., 2018; Tanaka et al., 2019) and also in the context of the related field of multiple instance learning (Kim and De la Torre, 2010; Kandemir et al., 2016; Haußmann et al., 2017). In multiple instance learning, each instance in the dataset consists of a set (or bag) of inputs with only one output (or label) for that whole set. The aim is to provide predictions at the level of individual inputs. Smith et al. () provide a new kernel function to handle single regression tasks defined at different supports. They use cross-validation for hyperparameter selection. Law et al. (2018) use the weighted sum of a latent function evaluated at different inputs as the prior for the rate of a Poisson likelihood. The latent function follows a GP prior. The authors use stochastic variational inference (SVI) for approximating the posterior distributions. Tanaka et al. (2019) mainly use GPs for creating data from different auxiliary sources. Furthermore, they only consider Gaussian regression and they do not include inducing variables. While Smith et al. () and Law et al. (2018) perform the aggregation at the latent prior stage, Kim and De la Torre (2010); Kandemir et al. (2016) and Haußmann et al. (2017) perform the aggregation at the likelihood level. These three approaches target a binary classification problem. Both Kim and De la Torre (2010) and Haußmann et al. (2017) focus on the case for which the label of the bag corresponds to the maximum of the (unobserved) individual labels of each input. Kim and De la Torre (2010) approximate the maximum using a softmax function computed using a latent GP prior evaluated across the individual elements of the bag. They use the Laplace approximation for computing the approximated posterior (Rasmussen and Williams, 2006). Haußmann et al. 
(2017), on the other hand, approximate the maximum using the so-called bag label likelihood, introduced by the authors, which is similar to a Bernoulli likelihood with soft labels given by a convex combination between the bag labels and the maximum of the (latent) individual labels. The latent individual labels in turn follow Bernoulli likelihoods with parameters given by a GP. The authors provide a variational bound and include inducing inputs for scalable Bayesian inference. Kandemir et al. (2016) follow a similar approach to Law et al. (2018), equivalent to setting all the weights in Law et al.’s model to one. The sum is then used to modulate the parameter of a Bernoulli likelihood that models the bag labels. They use a Fully Independent Training Conditional approximation for the latent GP prior (Snelson and Ghahramani, 2006). In contrast to these previous works, we provide a multi-task learning model for aggregated data that scales to large datasets and allows for heterogeneous outputs. At the time of submission of this paper, the idea of using multi-task learning for aggregated datasets was simultaneously proposed by Hamelijnck et al. () and Tanaka et al. (), two additional models to the one we propose in this paper. In our work, we allow heterogeneous likelihoods, which is different to both Hamelijnck et al. () and Tanaka et al. (). We also allow an exact solution to the integration of the latent function through the kernel in Smith et al. (), which is different to Hamelijnck et al. (). We also use inducing inputs to reduce computational complexity, another difference from the work in Tanaka et al. (). Other relevant work is described in Section 3.

For building the multi-task learning model we appeal to the linear model of coregionalisation (Journel and Huijbregts, 1978; Goovaerts, 1997) that has gained popularity in the multi-task GP literature in recent years (Bonilla et al., 2008; Alvarez et al., 2012). We also allow different likelihood functions (Moreno-Muñoz et al., 2018) and different input supports per individual task. Moreover, we introduce inducing inputs at the level of the underlying common set of latent functions, an idea initially proposed in Alvarez and Lawrence (2009). We then use stochastic variational inference for GPs (Hensman et al., 2013) leading to an approximation similar to the one obtained in Moreno-Muñoz et al. (2018). Empirical results show that the multi-task learning approach developed here provides accurate predictions in different challenging datasets where tasks have different supports.

## 2 Multi-task learning for aggregated data at different scales

In this section we first define the basic model in the single-task setting. We then extend the model to the multi-task setting and finally provide details for the stochastic variational formulation for approximate Bayesian inference.

### 2.1 Change of support using Gaussian processes

Change of support has been studied in geostatistics before (Gotway and Young, 2002). In this paper, we use a formulation similar to Kyriakidis (2004). We start by defining a stochastic process over the input interval $[x_a, x_b]$ using

$$ f(x_a, x_b) = \frac{1}{\Delta x}\int_{x_a}^{x_b} u(z)\,dz, $$

where $u(z)$ is a latent stochastic process that we assume follows a Gaussian process with zero mean and covariance $k(z,z')$, and $\Delta x = x_b - x_a$. Dividing by $\Delta x$ helps to keep the proportionality between the length of the interval and the area under $u(z)$ in the interval. In other words, the process is modeled as a density, meaning that inputs with widely differing supports will behave in a similar way. The first two moments for $f(x_a,x_b)$ are given as $\mathbb{E}[f(x_a,x_b)]=0$ and $\mathrm{cov}[f(x_a,x_b),f(x'_a,x'_b)]=\frac{1}{\Delta x\,\Delta x'}\int_{x_a}^{x_b}\int_{x'_a}^{x'_b}\mathrm{cov}[u(z),u(z')]\,dz'\,dz$. The covariance for $f(x_a,x_b)$ follows as $\frac{1}{\Delta x\,\Delta x'}\int_{x_a}^{x_b}\int_{x'_a}^{x'_b}k(z,z')\,dz'\,dz$ since $\mathrm{cov}[u(z),u(z')]=k(z,z')$. Let us use $k_f(x_a,x_b,x'_a,x'_b)$ to refer to this covariance. We can now use these mean and covariance functions for representing the Gaussian process prior for $f(x_a,x_b)$.

For some forms of $k(z,z')$ it is possible to obtain an analytical expression for $k_f$. For example, if $k(z,z')$ follows an Exponentiated-Quadratic (EQ) covariance form, $k(z,z')=\sigma^2\exp\!\left(-\frac{(z-z')^2}{\ell^2}\right)$, where $\sigma^2$ is the variance of the kernel and $\ell$ is the length-scale, it can be shown that $k_f$ follows as

$$ k_f(x_a,x_b,x'_a,x'_b) = \frac{\sigma^2\ell^2}{2\,\Delta x\,\Delta x'}\left[h\!\left(\frac{x_b-x'_a}{\ell}\right)+h\!\left(\frac{x_a-x'_b}{\ell}\right)-h\!\left(\frac{x_a-x'_a}{\ell}\right)-h\!\left(\frac{x_b-x'_b}{\ell}\right)\right], $$

where $h(t)=\sqrt{\pi}\,t\,\mathrm{erf}(t)+e^{-t^2}$, with $\mathrm{erf}(t)$ the error function defined as $\mathrm{erf}(t)=\frac{2}{\sqrt{\pi}}\int_{0}^{t}e^{-s^2}\,ds$. Other kernels for $u(z)$ also lead to analytical expressions for $k_f$. See for example Smith et al. ().
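To make the integrated EQ covariance concrete, the sketch below evaluates $k_f$ through $h(\cdot)$ and verifies it against brute-force numerical double integration. This is our own illustration, not the paper's implementation; the function names and default unit hyperparameters are assumptions.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import dblquad

def h(t):
    # h(t) = sqrt(pi) * t * erf(t) + exp(-t^2)
    return np.sqrt(np.pi) * t * erf(t) + np.exp(-t ** 2)

def kf_analytic(xa, xb, xa_, xb_, sigma2=1.0, ell=1.0):
    """Integrated EQ covariance between intervals [xa, xb] and [xa_, xb_]."""
    dx, dx_ = xb - xa, xb_ - xa_
    return (sigma2 * ell ** 2) / (2.0 * dx * dx_) * (
        h((xb - xa_) / ell) + h((xa - xb_) / ell)
        - h((xa - xa_) / ell) - h((xb - xb_) / ell))

def kf_numeric(xa, xb, xa_, xb_, sigma2=1.0, ell=1.0):
    """Brute-force check: (1/(dx dx')) * double integral of the EQ kernel."""
    val, _ = dblquad(lambda zp, z: sigma2 * np.exp(-(z - zp) ** 2 / ell ** 2),
                     xa, xb, lambda z: xa_, lambda z: xb_)
    return val / ((xb - xa) * (xb_ - xa_))
```

For overlapping or distant intervals alike, the two routes agree to numerical precision, which is a useful sanity check when implementing the kernel.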

So far, we have restricted the exposition to one-dimensional intervals. However, we can define the stochastic process over a general support $\upsilon$, with measure $|\upsilon|$, using

$$ f(\upsilon)=\frac{1}{|\upsilon|}\int_{z\in\upsilon}u(z)\,dz. $$

The support $\upsilon$ generally refers to an area or volume of any shape or size. Following similar assumptions to the ones we used for the interval case, we can build a GP prior to represent $f(\upsilon)$ with covariance defined as $k_f(\upsilon,\upsilon')=\frac{1}{|\upsilon||\upsilon'|}\int_{z\in\upsilon}\int_{z'\in\upsilon'}k(z,z')\,dz'\,dz$. If the support has a regular shape, e.g. a hyperrectangle, then assumptions on $k(z,z')$ such as additivity or factorization across input dimensions will lead to kernels $k_f$ that can be expressed as additions of kernels or products of kernels acting over a single dimension. For example, let $z=[z_1,\dots,z_p]^{\top}$ and assume the kernel factorizes as $k(z,z')=\prod_{i=1}^{p}k_i(z_i,z'_i)$, with a GP over each dimension. If each $k_i$ is an EQ kernel and $\upsilon$ is a hyperrectangle, then $k_f(\upsilon,\upsilon')$ is a product of one-dimensional integrated EQ kernels, one per pair of intervals across each input dimension. If the support does not follow a regular shape, e.g. it is a general polytope, then we can approximate the double integration by numerical integration inside the support.
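When the double integral has no closed form, a simple Monte Carlo estimate can stand in for it. The sketch below (our own illustration; how uniform samples are drawn inside an irregular support is assumed given) exploits that $\frac{1}{|\upsilon||\upsilon'|}\int\int k(z,z')\,dz'\,dz$ equals the expected kernel value under uniform samples from each support.

```python
import numpy as np

def kf_monte_carlo(samples_v, samples_vp, kern):
    """Monte Carlo estimate of (1/|v||v'|) * double integral of k over v x v'.

    samples_v, samples_vp: (N, p) arrays of points drawn uniformly inside
    each support. The measure factors |v|, |v'| cancel with the Monte Carlo
    volume terms, leaving a plain average of kernel evaluations."""
    K = kern(samples_v[:, None, :], samples_vp[None, :, :])
    return float(K.mean())
```

For a unit interval and an EQ kernel with unit hyperparameters, this average converges to the analytic integrated kernel value as the number of samples grows.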

### 2.2 Multi-task learning model

Our inspiration for multi-task learning is the linear model of coregionalisation (LMC) (Journel and Huijbregts, 1978). This model has connections with other multi-task learning approaches that use kernel methods. For example, Teh et al. (2005) and Bonilla et al. (2008) are particular cases of the LMC. A detailed review is available in Alvarez et al. (2012). In the LMC, each output (or task in our case) is represented as a linear combination of a common set of latent Gaussian processes. Let $\{u_q(z)\}_{q=1}^{Q}$ be a set of $Q$ GPs with zero means and covariance functions $k_q(z,z')$. Each GP $u_q(z)$ is sampled independently and identically $R_q$ times to produce $R_q$ realizations $\{u_q^{i}(z)\}_{i=1}^{R_q}$ that are used to represent the outputs. Let $\{f_d(\upsilon)\}_{d=1}^{D}$ be a set of $D$ tasks where each task is defined at a different support $\upsilon_d$. We use the set of realizations to represent each task as

$$ f_d(\upsilon)=\sum_{q=1}^{Q}\sum_{i=1}^{R_q}\frac{a_{d,q}^{i}}{|\upsilon|}\int_{z\in\upsilon}u_q^{i}(z)\,dz, \qquad (1) $$

where the coefficients $a_{d,q}^{i}$ weight the contribution of each integral term to $f_d(\upsilon)$. Since $\mathrm{cov}[u_q^{i}(z),u_{q'}^{i'}(z')]=k_q(z,z')\,\delta_{q,q'}\,\delta_{i,i'}$, with $\delta_{q,q'}$ the Kronecker delta between $q$ and $q'$, the cross-covariance between $f_d(\upsilon)$ and $f_{d'}(\upsilon')$ is then given as

$$ k_{f_d,f_{d'}}(\upsilon,\upsilon')=\sum_{q=1}^{Q}\frac{b_{d,d'}^{q}}{|\upsilon||\upsilon'|}\int_{z\in\upsilon}\int_{z'\in\upsilon'}k_q(z,z')\,dz'\,dz, $$

where $b_{d,d'}^{q}=\sum_{i=1}^{R_q}a_{d,q}^{i}a_{d',q}^{i}$. Following the discussion in Section 2.1, the double integral can be solved analytically for some choices of $k_q(z,z')$, $\upsilon$ and $\upsilon'$. Generally, a numerical approximation can be obtained.

It is also worth mentioning at this point that the model does not require all the tasks to be defined at the area level. Some of the tasks could also be defined at the point level. Say for example that $f_d$ is defined at the support level, $f_d(\upsilon)$, whereas $f_{d'}$ is defined at the point level, say $f_{d'}(z')$. In this case, $f_{d'}(z')=\sum_{q=1}^{Q}\sum_{i=1}^{R_q}a_{d',q}^{i}u_q^{i}(z')$. We can still compute the cross-covariance between $f_d(\upsilon)$ and $f_{d'}(z')$, leading to $k_{f_d,f_{d'}}(\upsilon,z')=\sum_{q=1}^{Q}\frac{b_{d,d'}^{q}}{|\upsilon|}\int_{z\in\upsilon}k_q(z,z')\,dz$. For the case $\upsilon=[x_a,x_b]$ and $p=1$ (i.e. the dimensionality of the input space), and an EQ kernel for $k_q(z,z')$, we get $k_{f_d,f_{d'}}(\upsilon,z')=\sum_{q=1}^{Q}b_{d,d'}^{q}\frac{\sqrt{\pi}\,\ell_q}{2\,\Delta x}\left[\mathrm{erf}\!\left(\frac{x_b-z'}{\ell_q}\right)-\mathrm{erf}\!\left(\frac{x_a-z'}{\ell_q}\right)\right]$ (we used $\sigma_q^2=1$ to avoid an overparameterization for the variance). Again, if $\upsilon$ does not have a regular shape, we can approximate the integral numerically.
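As an illustration of the area-to-point case in one dimension, the following sketch (our own code, unit kernel variance as in the text) evaluates the single integral of the EQ kernel over an interval via the error function and checks it against numerical quadrature.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def k_area_point(xa, xb, zp, ell=1.0):
    """Cross-covariance between an interval support [xa, xb] and a point zp
    under an EQ kernel with unit variance:
    (1/dx) * integral over [xa, xb] of exp(-(z - zp)^2 / ell^2) dz."""
    return (np.sqrt(np.pi) * ell / (2.0 * (xb - xa))) * (
        erf((xb - zp) / ell) - erf((xa - zp) / ell))

def k_area_point_numeric(xa, xb, zp, ell=1.0):
    """Brute-force quadrature check of the same quantity."""
    val, _ = quad(lambda z: np.exp(-(z - zp) ** 2 / ell ** 2), xa, xb)
    return val / (xb - xa)
```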

Let us define the vector-valued function $\mathbf{f}(\upsilon)=[f_1(\upsilon),\dots,f_D(\upsilon)]^{\top}$. A GP prior over $\mathbf{f}(\upsilon)$ can use the kernel defined above so that

$$ \mathbf{f}(\upsilon)\sim\mathcal{GP}\!\left(\mathbf{0},\,\sum_{q=1}^{Q}\frac{1}{|\upsilon||\upsilon'|}\mathbf{B}_q\int_{z\in\upsilon}\int_{z'\in\upsilon'}k_q(z,z')\,dz'\,dz\right), $$

where each $\mathbf{B}_q\in\mathbb{R}^{D\times D}$, with entries $b_{d,d'}^{q}$, is known as a coregionalisation matrix. The scalar term $\frac{1}{|\upsilon||\upsilon'|}\int_{z\in\upsilon}\int_{z'\in\upsilon'}k_q(z,z')\,dz'\,dz$ modulates $\mathbf{B}_q$ as a function of $\upsilon$ and $\upsilon'$.
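Putting the pieces together, a minimal sketch of the multi-task covariance for one-dimensional interval supports is given below (our own illustrative code, with $Q=1$ and an integrated EQ kernel of unit variance; helper names are assumptions, not the paper's API). It builds the blockwise matrix with entries $b_{d,d'}\,k_f(\upsilon_{d,i},\upsilon_{d',j})$ and confirms it is a valid covariance.

```python
import numpy as np
from scipy.special import erf

def h(t):
    return np.sqrt(np.pi) * t * erf(t) + np.exp(-t ** 2)

def kf_eq(a, b, a_, b_, ell):
    """Integrated EQ kernel (unit variance) between intervals [a,b], [a_,b_]."""
    return ell ** 2 / (2.0 * (b - a) * (b_ - a_)) * (
        h((b - a_) / ell) + h((a - b_) / ell)
        - h((a - a_) / ell) - h((b - b_) / ell))

def lmc_cov(intervals, B, ell):
    """Covariance over D tasks sharing one latent process (Q = 1).

    intervals: list of (N_d, 2) arrays of interval supports per task.
    B: (D, D) coregionalisation matrix (positive semidefinite).
    Returns the blockwise matrix with entries B[d, d'] * kf_eq(...)."""
    blocks = []
    for d, Vd in enumerate(intervals):
        row = []
        for d_, Vd_ in enumerate(intervals):
            block = np.array([[kf_eq(a, b, a_, b_, ell)
                               for (a_, b_) in Vd_] for (a, b) in Vd])
            row.append(B[d, d_] * block)
        blocks.append(row)
    return np.block(blocks)
```

Because the result is a Hadamard product of two positive semidefinite structures (the lifted coregionalisation matrix and the Gram matrix of the integrated latent process), its eigenvalues are non-negative up to numerical error.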

The prior above can be used for modulating the parameters of likelihood functions that model the observed data. The simplest case corresponds to the multi-task regression problem, which can be modeled using a multivariate Gaussian distribution. Let $\mathbf{y}(\upsilon)=[y_1(\upsilon),\dots,y_D(\upsilon)]^{\top}$ be a random vector modeling the observed data as a function of $\upsilon$. In the multi-task regression problem $\mathbf{y}(\upsilon)\sim\mathcal{N}(\boldsymbol{\mu}(\upsilon),\boldsymbol{\Sigma})$, where $\boldsymbol{\mu}(\upsilon)$ is the mean vector and $\boldsymbol{\Sigma}$ is a diagonal matrix with entries $\{\sigma_{y_d}^{2}\}_{d=1}^{D}$. We can use the GP prior $\mathbf{f}(\upsilon)$ as the prior over the mean vector $\boldsymbol{\mu}(\upsilon)$. Since both the likelihood and the prior are Gaussian, both the marginal distribution for $\mathbf{y}(\upsilon)$ and the posterior distribution of $\mathbf{f}(\upsilon)$ given $\mathbf{y}(\upsilon)$ can be computed analytically. For example, the marginal distribution for $\mathbf{y}(\upsilon)$ is given as $\mathcal{N}(\mathbf{0},\mathbf{K}+\boldsymbol{\Sigma})$, with $\mathbf{K}$ the covariance obtained from the kernel above. Moreno-Muñoz et al. (2018) introduced the idea of allowing each task to have a different likelihood function and modulated the parameters of that likelihood function using one or more elements in the vector-valued GP prior. For that general case, the marginal likelihood and the posterior distribution cannot be computed in closed form.

### 2.3 Stochastic variational inference

Let $\mathcal{D}=\{\mathbf{V},\mathbf{y}\}$ be a dataset of multiple tasks with potentially different supports per task, where $\mathbf{V}=\{\mathbf{V}_d\}_{d=1}^{D}$, with $\mathbf{V}_d=\{\upsilon_{d,j}\}_{j=1}^{N_d}$, and $\mathbf{y}=[\mathbf{y}_1^{\top},\dots,\mathbf{y}_D^{\top}]^{\top}$, with $\mathbf{y}_d=[y_d(\upsilon_{d,1}),\dots,y_d(\upsilon_{d,N_d})]^{\top}$. Notice that $\mathbf{y}$ without a subindex refers to the output vector for the whole dataset. We are interested in computing the posterior distribution $p(\mathbf{f}|\mathbf{y})$, where $\mathbf{f}=[\mathbf{f}_1^{\top},\dots,\mathbf{f}_D^{\top}]^{\top}$, with $\mathbf{f}_d=[f_d(\upsilon_{d,1}),\dots,f_d(\upsilon_{d,N_d})]^{\top}$. In this paper, we will use stochastic variational inference to compute a deterministic approximation of the posterior distribution, by means of the well known idea of inducing variables. In contrast to the use of SVI for traditional Gaussian processes, where the inducing variables are defined at the level of the process $\mathbf{f}$, we follow Álvarez et al. (2010) and Moreno-Muñoz et al. (2018), and define the inducing variables at the level of the latent processes $u_q(z)$. For simplicity in the notation, we assume $R_q=1$. Let $\mathbf{u}=\{\mathbf{u}_q\}_{q=1}^{Q}$ be the set of inducing variables, where $\mathbf{u}_q=[u_q(\mathbf{z}_1),\dots,u_q(\mathbf{z}_M)]^{\top}$, with $\mathbf{Z}=\{\mathbf{z}_m\}_{m=1}^{M}$ the inducing inputs. Notice also that we have used a common set of inducing inputs for all latent functions but we can easily define a set $\mathbf{Z}_q$ per inducing vector $\mathbf{u}_q$.

For the multi-task regression case, it is possible to compute an analytical expression for the Gaussian posterior distribution over the inducing variables $\mathbf{u}$, $q(\mathbf{u})$, following a similar approach to Álvarez et al. (2010). However, such an approximation is only valid for the case in which the likelihood model is Gaussian, and the variational bound obtained is not amenable to stochastic optimisation. An alternative for finding $q(\mathbf{u})$ also establishes a lower bound for the log-marginal likelihood $\log p(\mathbf{y})$, but uses numerical optimisation for maximising the bound with respect to the mean parameters, $\boldsymbol{\mu}$, and the covariance parameters, $\mathbf{S}$, for the Gaussian distribution $q(\mathbf{u})=\mathcal{N}(\mathbf{u}|\boldsymbol{\mu},\mathbf{S})$ (Moreno-Muñoz et al., 2018). Such a numerical procedure can be used for any likelihood model and the optimisation can be performed using mini-batches. We follow this approach.

#### Lower-bound

The lower bound for the log-marginal likelihood follows as

$$ \log p(\mathbf{y})\geq\int\!\!\int q(\mathbf{f},\mathbf{u})\log\frac{p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{u})\,p(\mathbf{u})}{q(\mathbf{f},\mathbf{u})}\,d\mathbf{f}\,d\mathbf{u}=\mathcal{L}, $$

where $q(\mathbf{f},\mathbf{u})=p(\mathbf{f}|\mathbf{u})q(\mathbf{u})$, $q(\mathbf{u})=\mathcal{N}(\mathbf{u}|\boldsymbol{\mu},\mathbf{S})$, and $p(\mathbf{u})=\mathcal{N}(\mathbf{u}|\mathbf{0},\mathbf{K}_{\mathbf{u}\mathbf{u}})$ is the prior over the inducing variables. Here $\mathbf{K}_{\mathbf{f}\mathbf{u}}$ is a blockwise matrix with matrices $\mathbf{K}_{\mathbf{f}_d\mathbf{u}_q}$. In turn each of these matrices has entries given by the cross-covariance $k_{f_d,u_q}(\upsilon,\mathbf{z})=\frac{a_{d,q}}{|\upsilon|}\int_{z'\in\upsilon}k_q(z',\mathbf{z})\,dz'$. Similarly, $\mathbf{K}_{\mathbf{u}\mathbf{u}}$ is a block-diagonal matrix with blocks given by $\mathbf{K}_{\mathbf{u}_q\mathbf{u}_q}$ with entries computed using $k_q(\mathbf{z},\mathbf{z}')$. The optimal $q(\mathbf{u})$ is chosen by numerically maximizing $\mathcal{L}$ with respect to the parameters $\boldsymbol{\mu}$ and $\mathbf{S}$. To ensure a valid covariance matrix we optimise the Cholesky factor for $\mathbf{S}$. See Appendix A.1 for more details on the lower bound. The computational complexity is similar to the one for the model in Moreno-Muñoz et al. (2018), where the number of latent parameter functions depends on the types of likelihoods used for the different tasks; for example, if we model all the outputs using Gaussian likelihoods, there is one latent parameter function per task. For details, see Moreno-Muñoz et al. (2018).
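For concreteness, a minimal numerical sketch of this bound for a single output with Gaussian likelihood is shown below. This is our own simplified illustration, not the paper's code: one latent function, generic precomputed kernel matrices, no mini-batching; in the aggregated model, `Kfu` would be built from the integrated cross-covariances between supports and inducing inputs.

```python
import numpy as np

def svgp_elbo(y, Kuu, Kfu, kff_diag, mu, S, noise):
    """Variational lower bound with inducing variables u, Gaussian likelihood.

    y: (N,) targets; Kuu: (M, M); Kfu: (N, M); kff_diag: (N,) prior variances;
    mu, S: parameters of q(u) = N(mu, S); noise: likelihood variance."""
    M = Kuu.shape[0]
    Kinv = np.linalg.inv(Kuu)
    A = Kfu @ Kinv                              # so that E_q[f] = A @ mu
    mean_f = A @ mu
    # marginal variance of q(f): kff - diag(A Kuf) + diag(A S A^T)
    var_f = kff_diag - np.einsum('nm,nm->n', A, Kfu) \
            + np.einsum('nm,mk,nk->n', A, S, A)
    # expected log-likelihood under q(f), closed form for the Gaussian case
    ell = -0.5 * np.sum(np.log(2 * np.pi * noise)
                        + ((y - mean_f) ** 2 + var_f) / noise)
    # KL(q(u) || p(u)) between two Gaussians
    kl = 0.5 * (np.trace(Kinv @ S) + mu @ Kinv @ mu - M
                + np.linalg.slogdet(Kuu)[1] - np.linalg.slogdet(S)[1])
    return ell - kl
```

By construction the returned value never exceeds the exact log-marginal likelihood of the corresponding full GP, which provides a handy correctness check.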

#### Hyperparameter learning

When using the multi-task learning method, we need to optimise the hyperparameters associated with the LMC, these are: the coregionalisation matrices $\{\mathbf{B}_q\}_{q=1}^{Q}$, the hyperparameters of the kernels $k_q(z,z')$, and any other hyperparameter associated with the likelihood functions that has not been considered as a member of the latent vector $\mathbf{f}(\upsilon)$. Hyperparameter optimisation is done using the lower bound $\mathcal{L}$ as the objective function. First $\mathcal{L}$ is maximised with respect to the variational distribution and then with respect to the hyperparameters. The two steps are repeated one after the other until reaching convergence. Such a style of optimisation is known as variational EM (Expectation-Maximization) when using the full dataset (Beal, 2003), or its stochastic version, when employing mini-batches (Hoffman et al., 2013). In the Expectation step we compute a variational posterior distribution and in the Maximization step we use the variational lower bound to find point estimates of the hyperparameters. For optimising the hyperparameters in $\{\mathbf{B}_q\}_{q=1}^{Q}$, we also use a Cholesky decomposition for each matrix to ensure positive definiteness. So instead of optimising $\mathbf{B}_q$ directly, we optimise $\mathbf{L}_q$, where $\mathbf{B}_q=\mathbf{L}_q\mathbf{L}_q^{\top}$. For the experimental section, we use the EQ kernel for each $k_q(z,z')$, so we fix the variance for $k_q$ to one (the variance per output is already contained in the matrices $\{\mathbf{B}_q\}_{q=1}^{Q}$) and optimise the length-scales $\ell_q$.
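The Cholesky parameterization of the coregionalisation matrices can be sketched as follows (an illustrative helper of ours, not the paper's code): the optimiser works on the unconstrained entries of $\mathbf{L}_q$ while $\mathbf{B}_q=\mathbf{L}_q\mathbf{L}_q^{\top}$ stays positive semidefinite by construction.

```python
import numpy as np

def coregionalisation_from_cholesky(free_params, D):
    """Map unconstrained parameters to a valid coregionalisation matrix.

    free_params: the D*(D+1)/2 entries of a lower-triangular factor L_q,
    in row-major order. Returns B_q = L_q @ L_q.T (always PSD)."""
    L = np.zeros((D, D))
    L[np.tril_indices(D)] = free_params
    return L @ L.T
```

Gradients with respect to `free_params` then follow from the chain rule or automatic differentiation, with no positive-definiteness constraint to enforce explicitly.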

#### Predictive distribution

Given a new set of test inputs $\mathbf{V}_*$, the predictive distribution for $\mathbf{y}_*$ is computed using $p(\mathbf{y}_*|\mathbf{y})=\int p(\mathbf{y}_*|\mathbf{f}_*)q(\mathbf{f}_*)\,d\mathbf{f}_*$, where $\mathbf{f}_*$ and $\mathbf{y}_*$ refer to the vector-valued functions $\mathbf{f}$ and $\mathbf{y}$ evaluated at $\mathbf{V}_*$. Notice that $q(\mathbf{f}_*)=\int p(\mathbf{f}_*|\mathbf{u})q(\mathbf{u})\,d\mathbf{u}$. Even though $\mathbf{y}$ does not appear explicitly in the expression for $p(\mathbf{y}_*|\mathbf{y})$, it has been used to compute the posterior $q(\mathbf{u})$ through the optimisation of $\mathcal{L}$, where $\mathbf{y}$ is explicitly taken into account. We are usually interested in the mean prediction $\mathbb{E}[\mathbf{y}_*]$ and the predictive variance $\mathrm{var}[\mathbf{y}_*]$. Both can be computed by exchanging integrals in the double integration over $\mathbf{f}_*$ and $\mathbf{u}$. See Appendix A.1 for more details on this.

## 3 Related work

Machine learning methods for different forms of aggregated datasets are also known under the names of multiple instance learning, learning from label proportions or weakly supervised learning on aggregate outputs (Kück and de Freitas, 2005; Musicant et al., 2007; Quadrianto et al., 2009; Patrini et al., 2014; Kotzias et al., 2015; Bhowmik et al., 2015). Law et al. (2018) provided a summary of these different approaches. Typically these methods start with the following setting: each instance in the dataset is in the form of a set of inputs for which there is only one corresponding output (e.g. the proportion of class labels, an average or a sample statistic). The prediction problem usually consists then in predicting the individual outputs for the individual inputs in the set. The setting we present in this paper is slightly different in the sense that, in general, for each instance, the input corresponds to a support of any shape and size and the output corresponds to a vector-valued output. Moreover, each task can have its own support. Similarly, while most of these ML approaches have been developed for either regression or classification, our model is built on top of Moreno-Muñoz et al. (2018), allowing each task to have a potentially different likelihood.

As mentioned in the introduction, Gaussian processes have also been used for multiple instance learning or aggregated data (Kim and De la Torre, 2010; Kandemir et al., 2016; Haußmann et al., 2017; Smith et al., ; Law et al., 2018; Tanaka et al., 2019; Hamelijnck et al., ; Tanaka et al., ). Compared to most of these previous approaches, our model goes beyond the single-task problem and allows learning multiple tasks simultaneously. Each task can have its own support at training and test time. Compared to other multi-task approaches, we allow for heterogeneous outputs. Although our model was formulated for a continuous support $\upsilon$, we can also define it in terms of a finite set of (previously defined) inputs in the support, e.g. a set $\{z_j\}_{j=1}^{J}\in\upsilon$, which is more akin to the bag formulations in these previous works. This would require changing the integral in (1) for the sum $\frac{a_{d,q}^{i}}{J}\sum_{j=1}^{J}u_q^{i}(z_j)$.

In geostatistics, a similar problem has been studied under the names of downscaling or spatial disaggregation (Zhang et al., 2014), particularly using different forms of kriging (Goovaerts, 1997). It is also closely related to the problem of change of support described in detail in Gotway and Young (2002). Block-to-point kriging (or area-to-point kriging if the support is defined on a surface) is a common method for downscaling, that is, providing predictions at the point level given data at the block level (Kyriakidis, 2004; Goovaerts, 2010). We extend the approach introduced in Kyriakidis (2004), later revisited by Goovaerts (2010) for count data, to the multi-task setting, also including a stochastic variational EM algorithm for scalable inference.

If we consider the high-resolution outputs as high-fidelity outputs and low-resolution outputs as low-fidelity outputs, our work also falls under the umbrella of multi-fidelity models where co-kriging using the linear model of coregionalisation has also been used as an alternative (Peherstorfer et al., 2018; Fernández-Godino et al., ).

## 4 Experiments

In this section, we apply the multi-task learning model for prediction in three different datasets: a synthetic example for two tasks that each have a Poisson likelihood, a two-dimensional input dataset of fertility rates aggregated by year of conception and ages in Canada, and an air-pollution sensor network where one task corresponds to a high-accuracy, low-frequency particulate matter sensor and another task corresponds to a low-cost, low-accuracy, high-resolution sensor. In these examples, we use $K$-means clustering over the input data to initialise the values of the inducing inputs, $\mathbf{Z}$, which are also kept fixed during optimisation. We assume the inducing inputs are points, but they could be defined as intervals or supports. For standard optimisation we used the L-BFGS-B algorithm and, when SVI was needed, the Adam optimiser included in the climin library was used for the optimisation of the variational distribution (variational E-step) and the hyperparameters (variational M-step). The implementation is based on the GPy framework and is available on GitHub: https://github.com/frb-yousefi/aggregated-multitask-gp.

#### Synthetic data

In Figure 1 we show that the data in the second task, with a larger support, helps predict the test data in the gap present in the first task, with a smaller support (right panel). However, this is not the case in the single-task learning scenario, where the predictions are basically constant and equal to 1 (left panel). Both models predict the training data equally well. The SMSE (standardized mean squared error) and SNLP (standardized negative log probability density) are calculated over five independent runs. For the multi-task scenario they are and , and for the single-task case they are and , respectively.

#### Fertility rates from a Canadian census

In this experiment, a subset of the Canadian fertility dataset is used from the Human Fertility Database (HFD) 1. The dataset consists of live births’ statistics by year, age of mother and birth order. The ages of the mother are between and the years are between . It contains data points of fertility rate per birth order (the output variable) and there are four birth orders. We used the data points of the 1st birth only. The dataset was randomly split into training points and test points. We consider two tasks: the first task consists of a different number of data observations randomly taken from the training points. The second task consists of all the training data aggregated at two different resolutions, and (we wanted to test the predictive performance when the relation of high-resolution data to low-resolution data was to and another for to ). The aggregated data for the case (a squared support of years for the input age times years for the input years of the study) is reduced to data points and the aggregated data for case is reduced to points.

In the experiments, we train this multi-task model by slowly increasing the original resolution training data, while maintaining a fixed amount of training points mentioned before for the aggregated second task. The output variable (fertility rate for the first birth) was assumed to be Gaussian, so both tasks follow a Gaussian likelihood. We use with an EQ kernel with where the two input variables are age of mother and birth year. We used fixed inducing variables and mini-batches of size samples. The prediction task consists of predicting the original resolution test data with the help of the second task which consists of the aggregated data ( or for two separate experiments).

Figure 2 shows SMSE for five random selections of data points in the training and test sets. We notice that the multi-task learning model outperforms the single-task GP when there are few observations in the task with the original resolution data. This pattern holds below observations. At that point, both models perform equally well since the single-task GP now has enough training data. With respect to the two different resolutions, the performance of the multi-task model is better when the second task has a resolution rather than resolution, as one might also expect.

#### Air pollution monitoring network

Particulate air pollution can be measured accurately with high temporal precision by using a β-attenuation monitor (BAM) or similar. Unfortunately these are often prohibitively expensive. We propose instead that one can combine the measurements from a low-cost optical particle counter (OPC), which gives good temporal resolution but is often badly biased, with the results of cascade impactors (CIs), which are a cheaper method for assessing the mass of particulate species but integrate over many hours (e.g. 6 or 24 hours).

One can formulate the problem as observations of integrals of a latent function: the CI integrates over 6-hour periods while the OPC sensor integrates over short 5-minute periods. We used data from two sensors measuring fine particulate matter of less than 2.5 micrometers in diameter (PM2.5), colocated in Kampala, Uganda at 0.3073N 32.6205E. The data were taken between 2019-03-13 and 2019-03-22. We used the average of six-hour periods from a calibrated MCERTS-verified Osiris (Turnkey) particulate air pollution monitoring system as the low-resolution data, and then compared the prediction results to the original measurements (available at a 15-minute resolution). We used a PMS 5003 (Plantower) low-cost OPC to provide the high-resolution data. Typically we found these values would often be biased. We simply normalised (scaled) the data to emphasise that the absolute values of these variables are not of interest in this model.

Our multi-task model consists of a single latent function, $u(z)$, with covariance that follows an EQ form. We assume both outputs follow Gaussian likelihoods. In our model, one task represents the high-accuracy low-resolution samples and the second task represents the low-accuracy high-resolution samples. The posterior GP aims both to fulfil the 6-hour-long integrals of the high-accuracy data (from the Osiris instrument) and to remain correlated with the high-frequency biased data from the OPC. We used 2000 iterations of the variational EM algorithm, with 200 evenly spaced inducing points and a fixed lengthscale of 0.75 hours. We only optimise the parameters of the coregionalisation matrix and the variance of the noise of each Gaussian likelihood.

Figure 3 illustrates the results for a 24 hour period. The training data consists of the high-resolution low-accuracy sensor and a low-frequency high accuracy sensor. The aim is to reconstruct the underlying level of pollution both sensors are measuring. To test whether the additional high-frequency data improves the accuracy we ran the coregionalisation both with and without this additional training data.

We found that the SMSE for the predictions over the 9 days tested was substantially smaller with multi-task learning compared to using only the low-resolution samples, and respectively (the difference is statistically significant using a paired $t$-test with a $p$-value of ). In summary, the model was able to incorporate this additional data and use it to improve the estimates while still ensuring the long integrals were largely satisfied.

## 5 Conclusion

In this paper, we have introduced a powerful framework for working with aggregated datasets that allows the user to combine observations from disparate data types, with varied support. This allows us to produce both finely resolved and accurate predictions by using the accuracy of low-resolution data and the fidelity of high-resolution side-information. We chose our inducing points to lie in the latent space, a distinction which allows us to estimate multiple tasks with different likelihoods. SVI and variational EM with mini-batches make the framework scalable and tractable for potentially very large problems. A potential extension would be to consider how the "mixing" achieved through coregionalisation could vary across the domain by extending, for example, the Gaussian Process Regression Network model (Wilson et al., 2012) to be able to deal with aggregated data. Such a model would allow latent functions of different lengthscales to be relevant at different locations in the domain. In summary, this framework provides a vital toolkit, allowing a mixture of likelihoods, kernels and tasks, and paves the way to the analysis of a very common and widely used data structure: that of values over a variety of supports on the domain.

## 6 Acknowledgement

MTS and MAA have been financed by the Engineering and Physical Sciences Research Council (EPSRC) Research Project EP/N014162/1. MAA has also been financed by the EPSRC Research Project EP/R034303/1.

## Appendix A Supplemental material

### a.1 Additional details on SVI

#### Lower-bound

It can be shown [Moreno-Muñoz et al., 2018] that the bound is given as

$$ \mathcal{L}=\sum_{d=1}^{D}\sum_{j=1}^{N_d}\mathbb{E}\left[\log p(y_d(\upsilon_{d,j})|f_d(\upsilon_{d,j}))\right]-\mathrm{KL}\left(q(\mathbf{u})\,\|\,p(\mathbf{u})\right), $$

where the expected value is taken with respect to the marginal distribution $q(f_d(\upsilon_{d,j}))=\int p(f_d(\upsilon_{d,j})|\mathbf{u})q(\mathbf{u})\,d\mathbf{u}$, which is a Gaussian distribution whose mean and covariance follow from the conditional $p(\mathbf{f}|\mathbf{u})$ and the variational parameters $\boldsymbol{\mu}$ and $\mathbf{S}$. For Gaussian likelihoods,

$$ p(y_d(\upsilon_{d,j})|f_d(\upsilon_{d,j}))=\mathcal{N}\left(y_d(\upsilon_{d,j})\,|\,f_d(\upsilon_{d,j}),\sigma_{y_d}^{2}\right), $$

we can compute the expected value in the bound in closed form. For other likelihoods, we can use numerical integration to approximate it, such as Gauss-Hermite quadrature as in Hensman et al. [2015] and Saul et al. [2016]. Instead of using the whole batch of data, we can use mini-batches to estimate the gradient.
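A Gauss-Hermite approximation of the expected log-likelihood can be sketched as follows (a generic illustration of ours, not tied to the paper's code; `log_lik` stands for any per-point log-likelihood that broadcasts over its arguments):

```python
import numpy as np

def expected_log_lik_gh(y, mean_f, var_f, log_lik, n_points=20):
    """Approximate E_{q(f_n)}[log p(y_n | f_n)] for Gaussian q(f_n) = N(m_n, v_n)
    with Gauss-Hermite quadrature, vectorised over data points."""
    t, w = np.polynomial.hermite.hermgauss(n_points)
    # change of variables f = m + sqrt(2 v) t turns the Gaussian expectation
    # into a Hermite-weighted sum over the quadrature nodes
    f = mean_f[:, None] + np.sqrt(2.0 * var_f)[:, None] * t[None, :]
    return (w[None, :] * log_lik(y[:, None], f)).sum(axis=1) / np.sqrt(np.pi)
```

For a Gaussian log-likelihood the quadrature is exact (the integrand is quadratic in $f$), which gives a convenient unit test against the closed-form expression used in the bound.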

#### Predictive distribution

As we mentioned before, we are usually interested in the mean prediction $\mathbb{E}[\mathbf{y}_*]$ and the predictive variance $\mathrm{var}[\mathbf{y}_*]$. Both can be computed by exchanging integrals in the double integration over $\mathbf{f}_*$ and $\mathbf{u}$. For example, $\mathbb{E}[\mathbf{y}_*]=\int\mathbb{E}[\mathbf{y}_*|\mathbf{f}_*]\,q(\mathbf{f}_*)\,d\mathbf{f}_*$. The inner expectation is computed with the conditional distribution $p(\mathbf{y}_*|\mathbf{f}_*)$ and its form depends on the likelihood term per task. The outer integral can be approximated using numerical integration or Monte Carlo sampling. A similar procedure can be followed to compute $\mathrm{var}[\mathbf{y}_*]$.

### a.2 SNLP for the fertility dataset

Figure 4 shows the results in terms of SNLP for the Fertility dataset. We can notice a similar pattern to the one observed for the SMSE in Figure 2.

### a.3 Comparing the model with different baselines for the fertility dataset

In Figure 5, different baselines are compared to the proposed method. Dependent GPs (DGP) [Boyle and Frean, 2005] and the Intrinsic Coregionalisation Model, or multi-task GPs (ICM) [Bonilla et al., 2008], use the centroid of each area as input. MTGPA (the proposed method) performs better than or similarly to the baselines as we increase the number of training points for the high-resolution output.

### a.4 Experimental results considering more tasks for the fertility dataset

In Figure 6 SNLP is calculated for four outputs (two outputs with high-resolution and a few data points and two outputs with low-resolution and many more data points). The high-resolution data correspond to the fertility rates of the first and second birth orders.

The first task consists of a different number of data observations randomly taken from the training points of the fertility rate of the first birth. The second task consists of all the training data at the first task aggregated at two different resolutions, and .

The third task consists of a different number of data observations randomly taken from the training points of the fertility rate of the second birth. The fourth task consists of all the training data of the third task aggregated at two different resolutions.

We use two different versions of our model and compare their SNLPs. In one version, all the outputs are modelled with Gaussian likelihoods (MTGPA); in the second, all the outputs are modelled with heteroscedastic Gaussian likelihoods (HetGPA). In the Gaussian case, only the mean parameter is modelled as a latent function, while the variance is a hyperparameter. In the heteroscedastic case, both the mean and the variance are assumed to follow latent functions. HetGPA outperforms MTGPA since it allows more flexibility through the latent function that models the variance of the Gaussian.
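The heteroscedastic likelihood can be written down in a few lines. This is a generic sketch in the style of chained GPs [Saul et al., 2016], not the paper's implementation: f and g stand for the two latent functions, and modelling the log-variance with g is one common way to keep the variance positive.

```python
import numpy as np

def het_gauss_log_lik(y, f, g):
    # y | f, g ~ N(f, exp(g)): f models the mean, g the log-variance,
    # so the variance exp(g) is positive for any real g
    var = np.exp(g)
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - f) ** 2 / var
```

Under this parameterisation the data can inform both latent functions, which is the extra flexibility HetGPA exploits over a fixed-variance Gaussian likelihood.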

### A.5 Bar plot for the synthetic dataset

Figure 7 shows the result for the synthetic count data with Poisson likelihood and prediction using the multi-task model. This plot is the same as Figure 1(b); however, the green bars have been removed for better visualisation.

### Footnotes

1. https://www.humanfertility.org

### References

1. Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, pages 25–32, 2010.
2. Mauricio A. Alvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS, pages 57–64. 2009.
3. Mauricio A. Alvarez, Lorenzo Rosasco, Neil D. Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, pages 195–266, 2012.
4. Matthew J. Beal. Variational algorithms for approximate Bayesian inference. Ph. D. Thesis, University College London, 2003.
5. Avradeep Bhowmik, Joydeep Ghosh, and Oluwasanmi Koyejo. Generalized linear models for aggregated data. In AISTATS, pages 93–101, 2015.
6. Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task Gaussian process prediction. In NIPS, pages 153–160. 2008.
7. Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In NIPS, pages 217–224. 2005.
8. M. Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T. Haftka. Review of multi-fidelity models. arXiv:1609.07196, 2016.
9. Pierre Goovaerts. Geostatistics for natural resources evaluation. Oxford University Press, 1997.
10. Pierre Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, pages 535–554, 2010.
11. Carol A. Gotway and Linda J. Young. Combining incompatible spatial data. Journal of the American Statistical Association, pages 632–648, 2002.
12. Oliver Hamelijnck, Theodoros Damoulas, Kangrui Wang, and Mark Girolami. Multi-resolution multi-task Gaussian processes. arXiv:1906.08344, 2019.
13. Manuel Haußmann, Fred A. Hamprecht, and Melih Kandemir. Variational Bayesian multiple instance learning with Gaussian processes. In CVPR, pages 810–819, 2017.
14. James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, pages 282–290, 2013.
15. James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In AISTATS, pages 351–360, 2015.
16. Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, pages 1303–1347, 2013.
17. Andre G. Journel and Charles J. Huijbregts. Mining Geostatistics. Academic Press, 1978.
18. Melih Kandemir, Manuel Haussmann, Ferran Diego, Kumar T. Rajamani, Jeroen van der Laak, and Fred A. Hamprecht. Variational weakly supervised Gaussian processes. In Proceedings of the British Machine Vision Conference (BMVC), 2016.
19. Minyoung Kim and Fernando De la Torre. Gaussian processes multiple instance learning. In ICML, pages 535–542, 2010.
20. Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606, 2015.
21. Hendrik Kück and Nando de Freitas. Learning about individuals from group statistics. In UAI, pages 332–339, 2005.
22. Phaedon C. Kyriakidis. A geostatistical framework for area-to-point spatial interpolation. Geographical Analysis, pages 259–289, 2004.
23. Ho Chung Leon Law, Dino Sejdinovic, Ewan Cameron, Tim C.D. Lucas, Seth Flaxman, Katherine Battle, and Kenji Fukumizu. Variational learning on aggregate outputs with Gaussian processes. In NeurIPS, pages 6084–6094, 2018.
24. Pablo Moreno-Muñoz, Antonio Artés, and Mauricio A. Álvarez. Heterogeneous multi-output Gaussian process prediction. In NeurIPS, pages 6711–6720. 2018.
25. David R. Musicant, Janara M. Christensen, and Jamie F. Olson. Supervised learning by training on aggregate outputs. In Seventh IEEE International Conference on Data Mining (ICDM), pages 252–261, 2007.
26. Giorgio Patrini, Richard Nock, Paul Rivera, and Tiberio Caetano. (almost) no label no cry. In NIPS, pages 190–198. 2014.
27. Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization. SIAM Review, 60(3):550–591, 2018.
28. Novi Quadrianto, Alex J. Smola, Tiberio S. Caetano, and Quoc V. Le. Estimating labels from label proportions. J. Mach. Learn. Res., pages 2349–2374, 2009.
29. Carl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
30. Alan D. Saul, James Hensman, Aki Vehtari, and Neil D. Lawrence. Chained Gaussian processes. In AISTATS, pages 1431–1440, 2016.
31. Michael T. Smith, Mauricio A. Álvarez, and Neil D. Lawrence. Gaussian process regression for binned data. arXiv:1809.02010, 2018.
32. Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, pages 1257–1264. 2006.
33. Yusuke Tanaka, Toshiyuki Tanaka, Tomoharu Iwata, Takeshi Kurashima, Maya Okawa, Yasunori Akagi, and Hiroyuki Toda. Spatially aggregated Gaussian processes with multivariate areal outputs. arXiv:1907.08350, 2019.
34. Yusuke Tanaka, Tomoharu Iwata, Toshiyuki Tanaka, Takeshi Kurashima, Maya Okawa, and Hiroyuki Toda. Refining coarse-grained spatial data using auxiliary spatial data sets with various granularities. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5091–5099, 2019.
35. Yee-Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS, pages 333–340, 2005.
36. Andrew G. Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In ICML, 2012.
37. Jingxiong Zhang, Peter Atkinson, and Michael F. Goodchild. Scale in Spatial Information and Analysis. CRC Press, 2014.