# Near-Optimal Active Learning of Multi-Output Gaussian Processes

###### Abstract

This paper

^{1}^{1}1This paper is an extended version (with proofs) of (?).addresses the problem of active learning of a multi-output Gaussian process (MOGP) model representing multiple types of coexisting correlated environmental phenomena. In contrast to existing works, our active learning problem involves selecting not just the most informative sampling locations to be observed but also the types of measurements at each selected location for minimizing the predictive uncertainty (i.e., posterior joint entropy) of a target phenomenon of interest given a sampling budget. Unfortunately, such an entropy criterion scales poorly in the numbers of candidate sampling locations and selected observations when optimized. To resolve this issue, we first exploit a structure common to sparse MOGP models for deriving a novel active learning criterion. Then, we exploit a relaxed form of submodularity property of our new criterion for devising a polynomial-time approximation algorithm that guarantees a constant-factor approximation of that achieved by the optimal set of selected observations. Empirical evaluation on real-world datasets shows that our proposed approach outperforms existing algorithms for active learning of MOGP and single-output GP models.

Near-Optimal Active Learning of Multi-Output Gaussian Processes

Yehong Zhang and Trong Nghia Hoang and Kian Hsiang Low and Mohan Kankanhalli Department of Computer Science, Interactive Digital Media Institute National University of Singapore, Republic of Singapore {yehong, lowkh, mohan}@comp.nus.edu.sg, idmhtn@nus.edu.sg

## 1 Introduction

For many budget-constrained environmental sensing and monitoring applications in the real world, active learning/sensing is an attractive, frugal alternative to passive high-resolution (hence, prohibitively costly) sampling of the spatially varying target phenomenon of interest. Different from the latter, active learning aims to select and gather the most informative observations for modeling and predicting the spatially varying phenomenon given some sampling budget constraints (e.g., quantity of deployed sensors, energy consumption, mission time).

In practice, the target phenomenon often coexists and correlates well with some auxiliary type(s) of phenomena whose measurements
may be more spatially correlated, less noisy (e.g., due to higher-quality sensors), and/or less tedious to sample (e.g., due to greater availability/quantity, higher sampling rate, and/or lower sampling cost of deployed sensors of these type(s)) and can consequently be exploited for improving its prediction.
For example, to monitor soil pollution by some heavy metal (e.g., Cadmium), its complex and time-consuming extraction from soil samples can be alleviated by supplementing its prediction with correlated auxiliary types of soil measurements (e.g., pH) that are easier to sample (?).
Similarly, to monitor algal bloom in the coastal ocean, plankton abundance correlates well with auxiliary types of ocean measurements (e.g., chlorophyll a, temperature, and salinity) that can be sampled more readily.
Other examples of real-world applications include remote sensing, traffic monitoring (?; ?; ?), monitoring of groundwater
and indoor environmental quality (?),
and precision agriculture (?), among others.
All of the above practical examples motivate the need to design and develop an active learning algorithm that selects not just the most informative sampling locations to be observed but also the types of measurements (i.e., target and/or auxiliary) at each selected location for minimizing the predictive uncertainty of unobserved areas of a target phenomenon given a sampling budget, which is the focus of our work here^{2}^{2}2Our work here differs from multivariate spatial sampling algorithms (?; ?) that aim to improve the prediction of all types of coexisting phenomena, for which existing active learning algorithms for sampling measurements only from the target phenomenon can be extended and applied straightforwardly, as detailed in Section 3..

To achieve this, we model all types of coexisting phenomena (i.e., target and auxiliary) jointly as a multi-output Gaussian process (MOGP) (?; ?; ?), which allows the spatial correlation structure of each type of phenomenon and the cross-correlation structure between different types of phenomena to be formally characterized. More importantly, unlike the non-probabilistic multivariate regression methods, the probabilistic MOGP regression model allows the predictive uncertainty of the target phenomenon (as well as the auxiliary phenomena) to be formally quantified (e.g., based on entropy or mutual information criterion) and consequently exploited for deriving the active learning criterion.

To the best of our knowledge, this paper is the first to present an efficient algorithm for active learning of a MOGP model. We consider utilizing the entropy criterion to measure the predictive uncertainty of a target phenomenon, which is widely used for active learning of a single-output GP model.
Unfortunately, for the MOGP model, such a criterion scales poorly in the number of candidate sampling locations of the target phenomenon (Section 3) and even more so in the number of selected observations (i.e., sampling budget) when optimized (Section 4). To resolve this scalability issue, we first exploit a structure common to a unifying framework of sparse MOGP models (Section 2) for deriving a novel active learning criterion (Section 3). Then, we define a relaxed notion of submodularity^{3}^{3}3The original notion of submodularity has been used in (?; ?) to theoretically guarantee the performance of their algorithms for active learning of a single-output GP model. called -submodularity and exploit the -submodularity property of our new criterion for devising a polynomial-time approximation algorithm that guarantees a constant-factor approximation of that achieved by the optimal set of selected observations (Section 4).
We empirically evaluate the performance of our proposed algorithm using three real-world datasets (Section 5).

## 2 Modeling Coexisting Phenomena with Multi-Output Gaussian Process (MOGP)

### Convolved MOGP (CMOGP) Regression.

A number of MOGP models such as co-kriging (?), parameter sharing (?), and linear model of coregionalization (LMC) (?; ?) have been proposed to handle multiple types of correlated outputs. A generalization of LMC called the convolved MOGP (CMOGP) model has been empirically demonstrated in (?) to outperform the others and will be the MOGP model of our choice due to its approximation whose structure can be exploited for deriving our active learning criterion and in turn an efficient approximation algorithm, as detailed later.

Let types of coexisting phenomena be defined to vary as a realization of a CMOGP over a domain corresponding to a set of sampling locations such that each location is associated with noisy realized (random) output measurement () if is observed (unobserved) for type for .
Let and . Then, measurement of type is defined as a convolution between a
smoothing kernel and a latent measurement function ^{4}^{4}4To ease exposition, we consider a single latent function. Note, however, that multiple latent functions can be used to improve the fidelity of modeling, as shown in (?). More importantly, our proposed algorithm and theoretical results remain valid with multiple latent functions. corrupted by an additive noise with noise variance :

As shown in (?), if is a GP, then is also a GP, that is, every finite subset of follows a multivariate Gaussian distribution. Such a GP is fully specified by its prior mean and covariance cov for all , the latter of which characterizes the spatial correlation structure for each type of phenomenon (i.e., ) and the cross-correlation structure between different types of phenomena (i.e., ). Specifically, let be a GP with prior covariance and where is the signal variance controlling the intensity of measurements of type , and are diagonal precision matrices controlling, respectively, the degrees of correlation between latent measurements and cross-correlation between latent and type measurements, and denotes a column vector comprising components of value . Then,

(1) |

where is a Kronecker delta of value if and , and 0 otherwise.

Supposing a column vector of realized measurements is available for some set of tuples of observed locations and their corresponding measurement types where , a CMOGP regression model can exploit these observations to provide a Gaussian predictive distribution of the measurements for any set of tuples of unobserved locations and their corresponding measurement types with the following posterior mean vector and covariance matrix:

(2) |

where and for any .

### Sparse CMOGP Regression.

A limitation of the CMOGP model is its poor scalability in the number of observations: Computing its Gaussian predictive distribution (2) requires inverting , which incurs time. To improve its scalability, a unifying framework of sparse CMOGP regression models such as the deterministic training conditional, fully independent training conditional, and partially independent training conditional (PITC) approximations (?) exploit a vector of inducing measurements for some small set of inducing locations (i.e., ) to approximate each measurement :

They also share two structural properties that can be exploited for deriving our active learning criterion and in turn an efficient approximation algorithm: (P1) Measurements of different types (i.e., and for ) are conditionally independent given , and (P2) and are conditionally independent given . PITC will be used as the sparse CMOGP regression model in our work here since the others in the unifying framework impose further assumptions. With the above structural properties, PITC can utilize the observations to provide a Gaussian predictive distribution where

(3) |

such that for any , () is a covariance matrix with covariance components for all and ( for all ), is the transpose of , and is a block-diagonal matrix constructed from the diagonal blocks of for any , each of which is a matrix for where and . Note that computing (3) does not require the inducing locations to be observed. Also, the covariance matrix in (2) is approximated by a reduced-rank matrix summed with the resulting sparsified residual matrix . So, by using the matrix inversion lemma to invert the approximated covariance matrix and applying some algebraic manipulations, computing (3) incurs time (?) in the case of and evenly distributed observations among all types.

## 3 Active Learning of CMOGP

Recall from Section 1 that the entropy criterion can be used to measure the predictive uncertainty of the unobserved areas of a target phenomenon. Using the CMOGP model (Section 2), the Gaussian posterior joint entropy (i.e., predictive uncertainty) of the measurements for any set of tuples of unobserved locations and their corresponding measurement types can be expressed in terms of its posterior covariance matrix (2) which is independent of the realized measurements :

Let index denote the type of measurements of the target phenomenon^{5}^{5}5Our proposed algorithm can be extended to handle multiple types of target phenomena, as demonstrated in Section 5..
Then, active learning of a CMOGP model involves selecting an optimal set of tuples (i.e., sampling budget) of sampling locations and their corresponding measurement types to be observed that minimize the posterior joint entropy of type measurements at the remaining unobserved locations of the target phenomenon:

(4) |

where is a finite set of tuples of candidate sampling locations of the target phenomenon and their corresponding measurement type available to be selected for observation. However, evaluating the term in (4) incurs time, which is prohibitively expensive when the target phenomenon is spanned by a large number of candidate sampling locations. If auxiliary types of phenomena are missing or ignored (i.e., ), then such a computational difficulty can be eased instead by solving the well-known maximum entropy sampling (MES) problem (?): which can be proven to be equivalent to (4) by using the chain rule for entropy and noting that is a constant. Evaluating the term in MES incurs time, which is independent of . Such an equivalence result can in fact be extended and applied to minimizing the predictive uncertainty of all types of coexisting phenomena, as exploited by multivariate spatial sampling algorithms (?; ?):

(5) |

where and is defined in a similar manner to but for measurement type . This equivalence result (5) also follows from the chain rule for entropy and the fact that is a constant. Unfortunately, it is not straightforward to derive such an equivalence result for our active learning problem (4) in which a target phenomenon of interest coexists with auxiliary types of phenomena (i.e., ): If we consider maximizing or , then it is no longer equivalent to minimizing (4) as the sum of the two entropy terms is not necessarily a constant.

### Exploiting Sparse CMOGP Model Structure.

We derive a new equivalence result by considering instead a constant entropy that is conditioned on the inducing measurements used in sparse CMOGP regression models (Section 2). Then, by using the chain rule for entropy and structural property P2 shared by sparse CMOGP regression models in the unifying framework (?) described in Section 2, (4) can be proven (see Appendix A) to be equivalent to

(6) |

where

(7) |

is the conditional mutual information between and given . Our novel active learning criterion in (6) exhibits an interesting exploration-exploitation trade-off: The inducing measurements can be viewed as latent structure of the sparse CMOGP model to induce conditional independence properties P1 and P2. So, on one hand, maximizing the term aims to select tuples of locations with the most uncertain measurements of the target phenomenon and their corresponding type to be observed given the latent model structure (i.e., exploitation). On the other hand, minimizing the term (7) aims to select tuples to be observed (i.e., possibly of measurement types ) so as to rely less on measurements of type at the remaining unobserved locations of the target phenomenon to infer latent model structure (i.e., exploration) since won’t be sampled.

Supposing , evaluating our new active learning criterion in (6) incurs time for every and a one-off cost of time (Appendix B). In contrast, computing the original criterion in (4) requires time for every , which is more costly, especially when the number of selected observations is much less than the number of candidate sampling locations of the target phenomenon due to, for example, a tight sampling budget or a large sampling domain that usually occurs in practice. The trick to achieving such a computational advantage can be inherited by our approximation algorithm to be described next.

## 4 Approximation Algorithm

Our novel active learning criterion in (6), when optimized, still suffers from poor scalability in the number of selected observations (i.e., sampling budget) like the old criterion in (4) because it involves evaluating a prohibitively large number of candidate selections of sampling locations and their corresponding measurement types (i.e., exponential in ). However, unlike the old criterion, it is possible to devise an efficient approximation algorithm with a theoretical performance guarantee to optimize our new criterion, which is the main contribution of our work in this paper.

The key idea of our proposed approximation algorithm is to greedily select the next tuple of sampling location and its corresponding measurement type to be observed that maximally increases our criterion in (6), and iterate this till tuples are selected for observation. Specifically, let

(8) |

denote our active learning criterion in (6) augmented by a positive constant to make non-negative. Such an additive constant is simply a technical necessity for proving the performance guarantee and does not affect the outcome of the optimal selection (i.e., ). Then, our approximation algorithm greedily selects the next tuple of sampling location and its corresponding measurement type that maximizes :

(9) |

where is a Kronecker delta of value if , and otherwise. The derivation of (9) is in Appendix C. Our algorithm updates and iterates the greedy selection (9) and the update till (i.e., sampling budget is depleted). The intuition to understanding (9) is that our algorithm has to choose between observing a sampling location with the most uncertain measurement (i.e., ) of the target phenomenon (i.e., type ) vs. that for an auxiliary type inducing the largest reduction in predictive uncertainty of the measurements at the remaining unobserved locations of the target phenomenon since .

It is also interesting to figure out whether our approximation algorithm may avoid selecting tuples of a certain auxiliary type and formally analyze the conditions under which it will do so, as elucidated in the following result:

###### Proposition 1

Let , , , and . Assuming absence of suppressor variables, for any .

Its proof (Appendix D) relies on the following assumption of the absence of suppressor variables which holds in many practical cases (?): Conditioning does not make and more correlated for any and . Proposition 1 reveals that when the signal-to-noise ratio of auxiliary type is low (e.g., poor-quality measurements due to high noise) and/or the cross correlation (1) between measurements of the target phenomenon and auxiliary type is small due to low , our greedy criterion in (9) returns a small value, hence causing our algorithm to avoid selecting tuples of auxiliary type .

###### Theorem 1 (Time Complexity)

Our approximation algorithm incurs time.

Its proof is in Appendix E. So, our approximation algorithm only incurs quartic time in the number of selected observations and cubic time in the number of candidate sampling locations of the target phenomenon.

### Performance Guarantee.

To theoretically guarantee the performance of our approximation algorithm, we will first motivate the need to define a relaxed notion of submodularity. A submodular set function exhibits a natural diminishing returns property: When adding an element to its input set, the increment in its function value decreases with a larger input set. To maximize a nondecreasing and submodular set function, the work of ? (?) has proposed a greedy algorithm guaranteeing a -factor approximation of that achieved by the optimal input set.

The main difficulty in proving the submodularity of (8) lies in its mutual information term being conditioned on . Some works (?; ?) have shown the submodularity of such conditional mutual information by imposing conditional independence assumptions (e.g., Markov chain). In practice, these strong assumptions (e.g., for any and ) severely violate the correlation structure of multiple types of coexisting phenomena and are an overkill: The correlation structure can in fact be preserved to a fair extent by relaxing these assumptions, which consequently entails a relaxed form of submodularity of (8); a performance guarantee similar to that of ? (?) can then be derived for our approximation algorithm.

###### Definition 1

A function G : is -submodular if for any and .

###### Lemma 1

Let . Given , if

(10) |

for any and , then is -submodular where .

Its proof is in Appendix F. Note that (10) relaxes the above example of conditional independence assumption (i.e., assuming ) to one which allows . In practice, is expected to be small: Since further conditioning monotonically decreases a posterior variance (?), an expected large set of tuples of remaining unobserved locations of the target phenomenon tends to be informative enough to make small and hence the non-negative variance reduction term and in (10) small.

Furthermore, (10) with a given small can be realized by controlling the discretization of the domain of candidate sampling locations. For example, by refining the discretization of (i.e., increasing ), the variance reduction term in (10) decreases because it has been shown in (?) to be submodular in many practical cases. We give another example in Lemma 2 to realize (10) by controlling the discretization such that every pair of selected observations are sufficiently far apart.

It is easy to derive that .
The “information never hurts” bound for entropy (?) entails a nondecreasing :
.
The first inequality requires so that ,^{6}^{6}6 is proven in Lemma 3 in Appendix D. which is reasonable in practice due to ubiquitous noise.
Combining this result with Lemma 1 yields the performance guarantee:

###### Theorem 2

Given , if (10) holds, then our approximation algorithm is guaranteed to select s.t. where .

Its proof (Appendix G) is similar to that of the well-known result of ? (?) except for exploiting -submodularity of in Lemma 1 instead of submodularity.

Finally, we present a discretization scheme that satisfies (10): Let be the smallest discretization width of for . Construct a new set of tuples of candidate sampling locations and their corresponding measurement types such that every pair of tuples are at least a distance of apart for some ; each candidate location thus has only one corresponding type. Such a construction constrains our algorithm to select observations sparsely across the spatial domain so that any has sufficiently many neighboring tuples of remaining unobserved locations of the target phenomenon from to keep small and hence the variance reduction term and in (10) small. Our previous theoretical results still hold if is used instead of . The result below gives the minimum value of to satisfy (10):

## 5 Experiments and Discussion

This section evaluates the predictive performance of our approximation algorithm (m-Greedy) empirically on three real-world datasets: (a) Jura dataset (?) contains concentrations of heavy metals collected at locations in a Swiss Jura region; (b) Gilgai dataset (?) contains electrical conductivity and chloride content generated from a line transect survey of locations of Gilgai territory in New South Wales, Australia; and (c) indoor environmental quality (IEQ) dataset (?) contains temperature (F) and light (Lux) readings taken by temperature sensors and light sensors deployed in the Intel Berkeley Research lab. The sampling locations for the Jura and IEQ datasets are shown in Appendix I.

The performance of m-Greedy is compared to that of the (a) maximum variance/entropy (m-Var) algorithm which greedily selects the next location and its corresponding measurement type with maximum posterior variance/entropy in each iteration; and (b) greedy maximum entropy (s-Var) (?) and mutual information (s-MI) (?) sampling algorithms for gathering observations only from the target phenomenon.

For all experiments, k-means is used to select inducing locations by clustering all possible locations available to be selected for observation into clusters such that each cluster center corresponds to an element of . The hyper-parameters (i.e., , , and for ) of MOGP and single-output GP models are learned using the data via maximum likelihood estimation (?). For each dataset, observations (i.e., for Jura and Gilgai datasets and for IEQ dataset) of type are randomly selected to form the test set ; the tuples of candidate sampling locations and corresponding type therefore become less than that of auxiliary types. The root mean squared error (RMSE) metric is used to evaluate the performance of the tested algorithms. All experimental results are averaged over random test sets. For a fair comparison, the measurements of all types are normalized before using them for training, prediction, and active learning.

### Jura Dataset.

Three types of correlated lg-Cd, Ni, and lg-Zn measurements are used in this experiment; we take the log of Cd and Zn measurements to remove their strong skewness, as proposed as a standard statistical practice in (?). The measurement types with the smallest and largest signal-to-noise ratios (respectively, lg-Cd and Ni; see Appendix J) are each set as type .

(a) | (b) | (c) |

(d) | (e) | (f) |

Figs. 1a-c and 1d-f show, respectively, results of the tested algorithms with lg-Cd and Ni as type . It can be observed that the RMSE of m-Greedy decreases more rapidly than that of m-Var, especially when observations of auxiliary types are selected after about . This is because our algorithm selects observations of auxiliary types that induce the largest reduction in predictive uncertainty of the measurements at the remaining unobserved locations of the target phenomenon (Section 4). In contrast, m-Var may select observations that reduce the predictive uncertainty of auxiliary types of phenomena, which does not directly achieve the aim of our active learning problem. With increasing , both m-Greedy and m-Var reach smaller RMSEs, but m-Greedy can achieve this faster with much less observations. As shown in Figs. 1a-f, m-Greedy performs much better than s-Var and s-MI, which means observations of correlated auxiliary types can indeed be used to improve the prediction of the target phenomenon. Finally, by comparing the results between Figs. 1a-c and 1d-f, the RMSE of m-Greedy with Ni as type decreases faster than that with lg-Cd as type , especially in the beginning (i.e., ) due to higher-quality Ni measurements (i.e., larger signal-to-noise ratio).

### Gilgai Dataset.

In this experiment, the lg-Cl contents at depth -cm and -cm are used jointly as two types of target phenomena while the log of electrical conductivity, which is easier to measure at these depths, is used as the auxiliary type. Fig. 5a shows results of the average RMSE over the two lg-Cl types with . Similar to the results of the Jura dataset, with two types of target phenomena, the RMSE of m-Greedy still decreases more rapidly with increasing than that of m-Var and achieves a much smaller RMSE than that of s-Var and s-MI; the results of s-Var and s-MI are also averaged over two independent single-output GP predictions of lg-Cl content at the two depths.

### IEQ Dataset.

Fig. 5b shows results with light as type and . The observations are similar to that of the Jura and Gilgai datasets: RMSE of m-Greedy decreases faster than that of the other algorithms. More importantly, with the same number of observations, m-Greedy achieves much smaller RMSE than s-Var and s-MI that can sample only from the target phenomenon. This is because m-Greedy selects observations of the auxiliary type (i.e., temperature) that are less noisy () than that of light (), which demonstrates its advantage over s-Var and s-MI when type measurements are noisy (e.g., due to poor-quality sensors).

(a) (b)

## 6 Related Work

Existing works on active learning with multiple output measurement types are not driven by the MOGP model and have not formally characterized the cross-correlation structure between different types of phenomena: Some spatial sampling algorithms (?; ?) have simply modeled the auxiliary phenomenon as a noisy perturbation of the target phenomenon that is assumed to be latent, which differs from our work here. Multi-task active learning (MTAL) and active transfer learning (ATL) algorithms have considered the prediction of each type of phenomenon as one task and used the auxiliary tasks to help learn the target task. But, the MTAL algorithm of ? (?) requires relations between different classification tasks to be manually specified, which is highly non-trivial to achieve in practice and not applicable to MOGP regression. Some ATL and active learning algorithms (?; ?) have used active learning criteria (e.g., margin-based criterion) specific to their classification or recommendation tasks that cannot be readily tailored to MOGP regression.

## 7 Conclusion

This paper describes a novel efficient algorithm for active learning of a MOGP model. To resolve the issue of poor scalability in optimizing the conventional entropy criterion, we exploit a structure common to a unifying framework of sparse MOGP models for deriving a novel active learning criterion (6). Then, we exploit the -submodularity property of our new criterion (Lemma 1) for devising a polynomial-time approximation algorithm (9) that guarantees a constant-factor approximation of that achieved by the optimal set of selected observations (Theorem 2). Empirical evaluation on three real-world datasets shows that our approximation algorithm m-Greedy outperforms existing algorithms for active learning of MOGP and single-output GP models, especially when measurements of the target phenomenon are more noisy than that of the auxiliary types. For our future work, we plan to extend our approach by generalizing non-myopic active learning (?; ?; ?; ?; ?; ?) of single-output GPs to that of MOGPs and improving its scalability to big data through parallelization (?; ?), online learning (?), and stochastic variational inference (?).

Acknowledgments. This research was carried out at the SeSaMe Centre. It is supported by Singapore NRF under its IRC@SG Funding Initiative and administered by IDMPO.

## References

- [Álvarez and Lawrence 2011] Álvarez, M. A., and Lawrence, N. D. 2011. Computationally efficient convolved multiple output Gaussian processes. JMLR 12:1459–1500.
- [Angulo and Bueso 2001] Angulo, J. M., and Bueso, M. C. 2001. Random perturbation methods applied to multivariate spatial sampling design. Environmetrics 12(7):631–646.
- [Bodik et al. 2004] Bodik, P.; Guestrin, C.; Hong, W.; Madden, S.; Paskin, M.; and Thibaux, R. 2004. http://www.select.cs.cmu.edu/data/labapp3/index.html.
- [Bonilla, Chai, and Williams 2008] Bonilla, E. V.; Chai, K. M. A.; and Williams, C. K. 2008. Multi-task Gaussian process prediction. In Proc. NIPS.
- [Bueso, Angulo, and Alonso 1998] Bueso, M. C.; Angulo, J. M.; and Alonso, F. J. 1998. A state-space model approach to optimum spatial sampling design based on entropy. Environmental and Ecological Statistics 5(1):29–44.
- [Bueso et al. 1999] Bueso, M. C.; Angulo, J. M.; Cruz-Sanjulián, J.; and García-Aróstegui, J. L. 1999. Optimal spatial sampling design in a multivariate framework. Math. Geology 31(5):507–525.
- [Cao, Low, and Dolan 2013] Cao, N.; Low, K. H.; and Dolan, J. M. 2013. Multi-robot informative path planning for active sensing of environmental phenomena: A tale of two algorithms. In Proc. AAMAS.
- [Chen et al. 2012] Chen, J.; Low, K. H.; Tan, C. K.-Y.; Oran, A.; Jaillet, P.; Dolan, J. M.; and Sukhatme, G. S. 2012. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, 163–173.
- [Chen et al. 2013] Chen, J.; Cao, N.; Low, K. H.; Ouyang, R.; Tan, C. K.-Y.; and Jaillet, P. 2013. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, 152–161.
- [Chen et al. 2015] Chen, J.; Low, K. H.; Jaillet, P.; and Yao, Y. 2015. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Trans. Autom. Sci. Eng. 12:901–921.
- [Chen, Low, and Tan 2013] Chen, J.; Low, K. H.; and Tan, C. K.-Y. 2013. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS.
- [Cover and Thomas 1991] Cover, T., and Thomas, J. 1991. Elements of Information Theory. John Wiley & Sons, Inc.
- [Das and Kempe 2008] Das, A., and Kempe, D. 2008. Algorithms for subset selection in linear regression. In Proc. STOC, 45–54.
- [Golub and Van Loan 1996] Golub, G. H., and Van Loan, C.-F. 1996. Matrix Computations. Johns Hopkins Univ. Press, 3rd edition.
- [Goovaerts 1997] Goovaerts, P. 1997. Geostatistics for Natural Resources Evaluation. Oxford Univ. Press.
- [Hoang et al. 2014] Hoang, T. N.; Low, K. H.; Jaillet, P.; and Kankanhalli, M. 2014. Nonmyopic -Bayes-optimal active learning of Gaussian processes. In Proc. ICML, 739–747.
- [Hoang, Hoang, and Low 2015] Hoang, T. N.; Hoang, Q. M.; and Low, K. H. 2015. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, 569–578.
- [Krause and Golovin 2014] Krause, A., and Golovin, D. 2014. Submodular function maximization. In Bordeaux, L.; Hamadi, Y.; and Kohli, P., eds., Tractability: Practical Approaches to Hard Problems. Cambridge Univ. Press. 71–104.
- [Krause and Guestrin 2005] Krause, A., and Guestrin, C. 2005. Near-optimal nonmyopic value of information in graphical models. In Proc. UAI.
- [Krause, Singh, and Guestrin 2008] Krause, A.; Singh, A.; and Guestrin, C. 2008. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR 9:235–284.
- [Le, Sun, and Zidek 2003] Le, N. D.; Sun, L.; and Zidek, J. V. 2003. Designing networks for monitoring multivariate environmental fields using data with monotone pattern. Technical Report #2003-5, Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC.
- [Ling, Low, and Jaillet 2016] Ling, C. K.; Low, K. H.; and Jaillet, P. 2016. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI.
- [Low et al. 2015] Low, K. H.; Yu, J.; Chen, J.; and Jaillet, P. 2015. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI.
- [Low, Dolan, and Khosla 2008] Low, K. H.; Dolan, J. M.; and Khosla, P. 2008. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, 23–30.
- [Low, Dolan, and Khosla 2009] Low, K. H.; Dolan, J. M.; and Khosla, P. 2009. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS.
- [Low, Dolan, and Khosla 2011] Low, K. H.; Dolan, J. M.; and Khosla, P. 2011. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, 753–760.
- [Nemhauser, Wolsey, and Fisher 1978] Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming 14(1):265–294.
- [Petersen and Pedersen 2012] Petersen, K. B., and Pedersen, M. S. 2012. The Matrix Cookbook.
- [Renner and Maurer 2002] Renner, R., and Maurer, U. 2002. About the mutual (conditional) information. In Proc. IEEE ISIT.
- [Roth and Small 2006] Roth, D., and Small, K. 2006. Margin-based active learning for structured output spaces. In Proc. ECML, 413–424.
- [Shewry and Wynn 1987] Shewry, M. C., and Wynn, H. P. 1987. Maximum entropy sampling. J. Applied Stat. 14(2):165–170.
- [Skolidis 2012] Skolidis, G. 2012. Transfer Learning with Gaussian Processes. Ph.D. Dissertation, University of Edinburgh.
- [Stewart and Sun 1990] Stewart, G. W., and Sun, J.-G. 1990. Matrix Perturbation Theory. Academic Press.
- [Teh and Seeger 2005] Teh, Y. W., and Seeger, M. 2005. Semiparametric latent factor models. In Proc. AISTATS, 333–340.
- [Webster and Oliver 2007] Webster, R., and Oliver, M. 2007. Geostatistics for Environmental Scientists. John Wiley & Sons, Inc., 2nd edition.
- [Webster 1977] Webster, R. 1977. Spectral analysis of Gilgai soil. Australian Journal of Soil Research 15(3):191–204.
- [Xu et al. 2014] Xu, N.; Low, K. H.; Chen, J.; Lim, K. K.; and Ozgul, E. B. 2014. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, 2585–2592.
- [Zhang et al. 2016] Zhang, Y.; Hoang, T. N.; Low, K. H.; and Kankanhalli, M. 2016. Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI.
- [Zhang 2010] Zhang, Y. 2010. Multi-task active learning with output constraints. In Proc. AAAI, 667–672.
- [Zhao et al. 2013] Zhao, L.; Pan, S. J.; Xiang, E. W.; Zhong, E.; Lu, Z.; and Yang, Q. 2013. Active transfer learning for cross-system recommendation. In Proc. AAAI, 1205–1211.

## Appendix A Derivation of Novel Active Learning Criterion (6)

The first equality follows from the fact that is a constant. The third equality is due to the chain rule for entropy as well as the definition of conditional mutual information and . The last equality follows from structural property P2 shared by sparse CMOGP regression models in the unifying framework (?) described in Section 2, which results in and .

## Appendix B Time Complexity of Evaluating Active Learning Criterion in (6)

where by definition (see last paragraph of Section 2). So, evaluating incurs time for every ; this worst-case time complexity occurs when all the tuples in are of measurement type (i.e., ).

where

for any , as derived in (?). Therefore, evaluating incurs time for every ; this worst-case time complexity occurs when all the tuples in are of one measurement type.

Let . Then, by the definition of (see last paragraph of Section 2),

Evaluating the term incurs time for every ; this worst-case time complexity occurs when all the tuples in are of one measurement type. Note that the term remains the same for every (i.e., since it is independent of ) and hence only needs to be computed once in time. Therefore, evaluating incurs time for every and a one-off cost of time. Consequently, evaluating incurs time for every and a one-off cost of time.

So, evaluating our active learning criterion in (6) incurs time for every and a one-off cost of time.

## Appendix C Derivation of Greedy Criterion in (9)

If , then

(11) |

The first equality follows from (6) and (8). The second equality is due to . The third equality is due to the chain rule for entropy as well as the definition of conditional mutual information . The second last equality follows from structural property P2 shared by sparse CMOGP regression models in the unifying framework (?) described in Section 2.

Otherwise (i.e., ),