# Grouped Gaussian Processes for Solar Power Prediction

###### Abstract

We consider multi-task regression models where the observations are assumed to be a linear combination of several latent node functions and weight functions, which are both drawn from Gaussian process priors. Driven by the problem of developing scalable methods for distributed solar power forecasting, we propose coupled priors over groups of (node or weight) processes to estimate a forecast model for solar power production at multiple distributed sites, exploiting spatial dependence between functions. Our results show that our approach provides better quantification of predictive uncertainties than competing benchmarks while maintaining high point-prediction accuracy.

Grouped Gaussian Processes for Solar Power Prediction

Astrid Dahl* School of Computer Science and Engineering University of New South Wales Sydney, Australia *astrid.dahl@student.unsw.edu.au Edwin V. Bonilla School of Computer Science and Engineering University of New South Wales Sydney, Australia

noticebox[b]Preprint. Work in progress.\end@float

## 1 Introduction

The problem of forecasting local solar output in the short term is of significant interest for the purpose of distributed grid control and household energy management Voyant et al. (2017); Widén et al. (2015). Variation in output is driven by two principal factors: diurnal cyclical effects (variation due to sun angle and distance) and variability due to weather effects, both inducing spatially-related dependence between proximate sites. In general, correlations across sites depend on many particulars relating to system configuration, local environment and so on. As such we wish to exploit spatial dependencies (and potentially other site-specific covariates) between sites in a flexible manner. More importantly, inherent to this application is the need for modeling uncertainty in a flexible and principled way.

Gaussian process (gp) models are a flexible nonparametric Bayesian approach that can be applied to various problems such as regression and classification Rasmussen and Williams (2006) and have been extended to numerous multivariate and multi-task problems including spatial and spatio-temporal contexts Cressie and Wikle (2011). Multi-task gp methods have been developed along several lines (see e.g. Álvarez et al., 2012, for a review). Of relevance here are various mixing approaches that combine multiple latent univariate Gaussian processes via linear or nonlinear mixing to predict multiple related tasks Wilson et al. (2012). The challenge in multi-task cases is maintaining scalability of the approach. To this end, both scalable inference methods and model constraints have been employed Álvarez et al. (2010); Matthews et al. (2017); Krauth et al. (2017) . In particular, latent Gaussian processes are generally constrained to be statistically independent Wilson et al. (2012); Krauth et al. (2017).

In this paper we consider the case where the statistical independence constraint is relaxed for subsets of latent functions. We build on the scalable generic inference method of Krauth et al. (2017) to allow nonzero covariance between arbitrary subsets, or ‘groups’ of latent functions. The grouping structure is flexible and can be tailored to applications of interest, and additional features can potentially be incorporated to govern input-dependent covariance across functions. By adopting separable kernel specifications, we maintain scalability of the approach whilst capturing latent dependence structures.

With this new multi-task gp model, we consider the specific challenge of forecasting power output of multiple, distributed residential solar sites. We apply our approach to capture spatial covariance between sites by explicitly incorporating spatial dependency between latent functions and test our method on two datasets comprised of solar output at distributed household sites in Australia. Our results show that introducing spatial covariance over groups of latent functions provides better quantification of predictive uncertainties relative to competing benchmark methods, while maintaining high point-prediction accuracy.

## 2 Related work

Gaussian Processes have been considered in the multi-task setting via a number of approaches. Several methods linearly combine multiple univariate gp models via coefficients that may be parameters (latent factor models as in Teh et al., 2005); linear coregional models (lcm; Goovaerts, 1997), or themselves input dependent (Wilson et al., 2012). Other multi-task gp approaches allow task-specific predictions through use of task-specific features or ‘free-form’ cross-task covariances (Bonilla et al., 2008), and more recently priors placed over cluster allocations allowing cluster-specific covariances (Hensman et al., 2014; Gardner et al., 2018). Combination via convolutions has also been developed and extended to sparse, variational settings (Álvarez and Lawrence, 2009, 2011).

Most mixing approaches focus on methods to combine multiple underlying independent latent functions. Coupling between node latent functions directly is considered by Remes et al. (2017), who build upon the Gaussian process regression network (gprn) framework of Wilson et al. (2012). Under this method, covariance for latent functions is internally parameterized rather than dependent on specific features governing cross-function covariance. Unlike our method, the natural disadvantage of such an approach is that it presents significant computational challenges in terms of scalability to a large number of observations and tasks.

##### Multi-task solar power forecasting

A number of studies have confirmed that multi-task learning approaches can be useful for distributed solar irradiance or solar power forecasting, finding that cross-site information can be relevant Yang et al. (2013, 2015). Several studies build on the early work of Sampson and Guttorp (1992) and consider co-kriging methods for distributed solar irradiance forecasting or spatial prediction (notably Yang et al., 2013; Shinozaki et al., 2016). Other approaches include a range of multivariate linear statistical methods, shown to be competitive at shorter horizons, and neural network methods Inman et al. (2013); Widén et al. (2015); Voyant et al. (2017). These approaches are generally constrained by data requirements, notably pre-flattening of data to remove diurnal cyclical trends, which requires knowledge of local system and environment variables.

In addition to kriging studies, gp models have been considered for short term solar forecasting Bilionis et al. (2014); Dahl and Bonilla (2017). Earlier approaches are generally constrained to small-data problems by poor scalability of exact gp models. More recently, Dahl and Bonilla (2017) use scalable sparse, variational inference to apply multi-task Gaussian (mtg) and linear coregional models (lcm) to forecast solar output at multiple, distributed residential sites. Multi-task approaches are found to improve model performance during mixed weather conditions, less so in sunny conditions.

## 3 Gaussian process models

A gp (Rasmussen and Williams, 2006) is formally defined as a distribution over functions such that is a Gaussian process with mean function and covariance function iff any subset of values follows a Gaussian distribution with mean and covariance , which are obtained by evaluating the corresponding mean function and covariance function at the training points .

### 3.1 Multi-task latent function models

Let us consider data of the form where each in is a -dimensional vector of input features and each in is a -dimensional vector of task outputs. Krauth et al. (2017) model the correlations between the outputs using independent latent functions each drawn from a zero-mean gp prior, i.e. . These latent functions are mapped to outputs via a so-called ‘black-box’ iid likelihood. Therefore, the prior over the latent function values corresponding to the observations along with the likelihood model are given by:

(1) |

where is the matrix of latent function variables; is the -dimensional vector for latent function ; and the corresponding hyper-parameters; is the -dimensional vector of latent function values corresponding to observation and are the likelihood parameters.

Krauth et al. (2017) exploit the structure of the model in Eq. 1 to develop a scalable algorithm via variational inference. While the likelihood in this model is suitable for most unstructured machine-learning problems such as standard regression and classification, the prior can be too restrictive for problems where dependences across tasks can be incorporated explicitly. In this paper, driven by the solar power prediction problem where spatial relatedness can be leveraged to improve predictions across different sites, we lift this statistical independence (across latent functions) constraint in the prior to propose a new multi-task model where some of the functions are coupled a priori.

## 4 Grouped Gaussian processes

Our likelihood model considers the general case where observations are a linear combination of latent Gaussian process models where the coefficients are also input-dependent and drawn from Gaussian process priors: where and is a diagonal matrix. -dimensional outputs are constructed at as the product of a matrix of weights, , and -dimensional vector of node functions .

This likelihood model is studied in Wilson et al. (2012) under the assumption that all the latent functions in and are statistically independent a priori, where and are referred to as weight functions and node functions respectively. Before explaining how we generalize such an approach to include coupled priors, we note that this likelihood model fits into the framework explained in §3.1 by setting and making

(2) |

where , and we have denoted and . and are effectively submatrices of formed by gathering and columns, respectively. It is useful to conceptualize as latent functions of dimension arranged into a tensor . The cross section recovers the matrix of weights Thus, latent functions may be referred to as belonging to one of rows or columns of .

### 4.1 Grouped prior

The matrix of all latent function variables is re-expressed in terms of submatrices , where is the total number of groups, and for each group the number of latent functions within is denoted , which we will also refer to as the group size. Each group is comprised of latent functions , and the covariance between two functions is non-zero iff the corresponding processes belong to the same group.

Hence, the prior on can be expressed similarly to the generic prior defined in Eq. (1):

(3) |

where is the covariance matrix generated by the group kernel function , which evaluates the covariance of functions and at the locations and , respectively. As described above, we note that iff the functions and do not belong to the same group .

As we shall see below, in some settings it is possible to characterize the functions with a -dimensional vector of features . Hence we can define an additional matrix containing all the features for the functions belonging to group . This structure allows arbitrary grouping of latent functions depending on the application (in our case, groups are structured for distributed forecasting, discussed below). Further, just as provides input-varying coefficients to , so does the group kernel, allowing input-varying cross-function covariance. Rather than constraining latent processes in and to be conditionally independent, we now assume that arbitrary subsets of these processes, or groups, are coupled. Since our model generalizes Gaussian process regression networks (gprn; Wilson et al., 2012) so as to allow arbitrary couplings in the prior over latent functions belonging to the same group, we refer to it as grouped Gaussian processes (ggp).

### 4.2 Separable kernels

Once latent functions are coupled a priori, scalability becomes an important consideration. Thus, although is not constrained in terms of kernel choice, for the problem at hand we consider separable kernels of the form , leading to covariance matrices of the Kronecker form , where and expresses covariance between functions.

By adopting the Kronecker-structured prior covariance over functions within a group, we reduce the maximum dimension of required matrix inversions, allowing scalable inference.

### 4.3 Grouped Gaussian processes for spatially dependent tasks

It is natural to consider a multi-task framework in a spatio-temporal setting such as distributed solar forecasting, where power output at solar sites in a region would a priori be expected to covary over time and space. Given the expectation of spatially-driven covariance across sites, i.e. tasks, we seek to exploit this structure to increase both efficiency and accuracy of multi-task forecasts. Our approach does this by incorporating explicit spatial dependencies between latent functions in the model.

Latent functions in the general framework do not necessarily map to a particular task. The question therefore arises as to how to use spatial information relating to tasks to structure covariance between latent functions. We solve this by setting and grouping latent functions within rows of . We then define a feature matrix that governs covariance across the functions in each row. With it is possible to obtain a very general representation of the multi-task process with full mixing between tasks via , which now contains node functions. This grouping structure allows parameters to vary across tasks, and at the same time, the coupled prior can act to regularize latent function predictions.

##### Model settings

In our setting, we consider each latent process in to be an independent gp, i.e., for . Furthermore, input features of are defined to be task features i.e. features for relate to task , specifically lagged-target values for .

We define spatial features governing weightings applied to node functions. For a given task , It can be seen that, in addition to depending on input features , relative weights placed on node functions are now governed by spatial covariance imposed over the weights in . This allows site-by-site optimization of spatial decay in (cross-task) weights in addition to site-specific parameterization and features in . In total, this grouping structure yields groups: groups of size (corresponding to ) and groups of size (corresponding to ).

Kernels and features for and are selected in line with previous studies relating to multi-task distributed forecasting Inman et al. (2013); Dahl and Bonilla (2017). In particular, for our task of forecasting distributed solar output over time, for , we define as a radial basis function kernel () applied to a feature vector of recent lagged observed power at site , i.e. for site at time , .

For row-group , we define a separable, multiplicative kernel structure as discussed above, i.e. . We set the kernel over the inputs as , where is a periodic kernel on a time index capturing diurnal cyclical trends in solar output.

We adopt a compact kernel over functions, specifically a separable rbf- Epanechnikov structure, i.e., , , where for site . By using a more flexible compact kernel, we aim to allow beneficial shared learning across tasks while reducing negative transfer by allowing cross-function weights to reduce to zero at an optimal distance.

## 5 Inference

All models are estimated based on the generic inference approach for latent variable Gaussian process models set out by Krauth et al. (2017). This is a sparse variational method that considers the case where latent functions are conditionally independent. We adapt that framework to the more general case where latent functions covary within groups, and for our case exploit the Kronecker structures noted at §4.

Under the sparse method, the prior at (3) is augmented with inducing variables, , drawn from the same gp priors as at new inducing points , where lie in the same space as , . Since are drawn from the same gp priors, inducing variables within a group are similarly coupled via evaluated at points . The prior in Eq. (3) is thus replaced by

(4) |

where , and . Imposing a separable kernel structure, is the covariance matrix induced by evaluated over , . Importantly, thus decomposes as a Kronecker .

### 5.1 Posterior estimation

The (analytically intractable) joint posterior distribution of the latent functions and inducing variables under the prior and likelihood models in Eqs. (4) and (1) is approximated via variational inference (Jordan et al., 1998). Specifically, The variational posterior is defined as a mixture of diagonal Gaussians (mog) with mixture proportions . We assume that also factorizes over blocks (and in the diagonal case over individual latent functions). The variational posterior is thus defined as

(5) |

where and . We then estimate the model by maximizing the so-called evidence lower bound (elbo), which de-constructs to which are the entropy, cross-entropy and expected log likelihood terms, respectively. The explicit expression required for is a generalization of the results in Krauth et al. (2017). For the entropy term we have that (using Jensen’s inequality):

(6) |

where is the vector and is the block diagonal matrix with diagonal elements (and equivalent for , ). For the cross-entropy and the expected log likelihood term:

(7) | ||||

(8) |

where results from integration of the joint approximate posterior over inducing variables .^{1}^{1}1Note that factorizes as and factorizes as .

Given factorization of the joint and variational posteriors over and and standard conjugacy results, we have

(9) |

The distribution similarly factorizes as

(10) |

may be estimated by Monte Carlo, requiring sampling only from -dimensional multivariate Gaussians where is the vector comprised of every th element of , and is the (full) matrix comprised of th diagonal elements of posterior covariance sub-matrices of . Thus, we estimate

(11) |

### 5.2 Prediction

Prediction for a new point given is taken as the expectation over the general posterior distribution for the new point:

(12) |

(13) |

Given the explicit expression for the posterior distribution, the expectation in Eq. (12) is estimated by sampling: where are samples from .

### 5.3 Complexity

Under the ggp with a Kronecker-structured prior the time complexity per iteration changes slightly from the independent function case. For the same and , fewer -dimensioned inversions are required for ggp versus gprn, without any increase in maximum dimension under the Kronecker specification assuming . This represents a substantial reduction in -dimensioned inversions, depending on the grouping scheme.

The cost of calculating is dominated by the cost of inversions, being for the grouped case and for the independent case. Under the diagonal posterior specification, in Eq. (6) reduces to the same form as the independent case of Krauth et al. (2017). Lastly, under the grouped structure requires sampling from low-dimensional multivariate Gaussians with non-diagonal posterior covariance matrices, whereas this is avoided under the independent framework. However, the low dimensionality (number of tasks in our empirical evaluation) involved yields minimal additional cost.

## 6 Experiments

We evaluate our approach on forecasting problems for distributed solar installations. The task is to forecast power production fifteen minutes ahead at multiple distributed sites in a region. We compare forecast performance of our ggp method to the fully independent gprn and several other benchmark models. Data consist of five minute average power readings from groups of proximate sites in the Adelaide and Sydney regions in Australia. We present results for two datasets: one of ten Adelaide sites (adelaide) and a second set of twelve Sydney sites (sydney), both over 60 days during Autumn 2016. We train all models on 36 days of data, and test forecast accuracy for 24 subsequent days (days are defined as 7 am to 7pm). In all, for each site, we have 5000 datapoints for training and 3636 datapoints for testing. The adelaide sites are spread over an approximate 30 kilometre (East-West) by 40 kilometre (North-South) area. sydney sites are evenly dispersed over an approximate 15 kilometre (East-West) by 20 kilometre (North-South) area.

##### Benchmark models

Benchmark models are drawn from Dahl and Bonilla (2017). We estimate (1) separate independent gp forecast models for each site (igp), (2) pooled multi-task models with task-specific (spatial) features for sites (mtg), and (3) multi-task linear coregional models (lcm). The final benchmark model (4) is the gprn with independent latent functions (Wilson et al., 2012). These models can be expressed in terms of the general latent function framework with differing values of , and , and different likelihood functions. As discussed in §4, where latent functions are independent, group size is equal to 1 and .
Models are estimated for diagonal and full mog posterior specifications, with .^{2}^{2}2Early experiments with a greater number of mixtures found that, in all multi-task models, the posterior mixture-weights were heavily concentrated on a single component and did not significantly alter results. In the case of the ggp, to maintain the scalable specification, we adopt a Kronecker construction of the full posterior for each group in line with the prior specification.

Both igp and mtg models have a standard, single task Gaussian likelihood functions, while the lcm model is comprised of node functions mapped to outputs via a matrix of deterministic weights, i.e. where and .

To compare model performance under equivalent settings, we consider the complexity of the different approaches and standardize model settings by reference to a consistent target computational complexity per iteration. In our variational framework, the time complexity is dominated by algebraic operations with cubic complexity on the number of inducing inputs . We therefore set for all adelaide models ( for sydney) and adjust accordingly.

Kernels for all models are presented at Table 1. We maintain similar kernel specification across models. All kernels are based on the specification described at §4.3.

model | (benchmark models) |
---|---|

igp | |

mtg | |

lcm | . |

gprn | |

ggp | (ggp model) |

##### Experiment settings and performance measures

All models are estimated based on the variational framework explained in §5. We optimize the elbo iteratively until its relative change over successive epochs is less than up to a maximum of 200 epochs. Optimization is performed using adam (Kingma and Ba, 2014) with settings . All data except time index features are normalized prior to optimization. Reported forecast accuracy measures are root mean squared error (rmse) and negative log predictive density (nlpd). The non-Gaussian likelihood of gprn models makes the usual analytical expression for nlpd intractable. We therefore estimate it using Monte Carlo: where are draws from their corresponding posterior over .

##### Results

Results are presented at Table 2 for the two datasets with diagonal and full-Gaussian posterior specifications. Our key observation is that ggp appears competitive with best performing benchmarks on both rmse and nlpd together, in contrast to results for competing baselines where models either perform well on rmse at the expense of poor performance under nlpd or vice versa.

Model | rmse | nlpd | rmse | nlpd | rmse | nlpd | rmse | nlpd |
---|---|---|---|---|---|---|---|---|

adelaide | sydney | |||||||

Diagonal posterior | Full posterior | Diagonal posterior | Full posterior | |||||

igp | 0.315 | 0.368 | 0.314 | 0.370 | 0.286 | 0.340 | 0.286 | 0.335 |

lcm | 0.294 | 0.240 | 0.293 | 0.240 | 0.310 | 0.273 | 0.302 | 0.257 |

mtg | 0.301 | 0.337 | 0.304 | 0.376 | 0.280 | 0.342 | 0.283 | 0.360 |

gprn | 0.278 | 0.311 | 0.283 | 0.320 | 0.281 | 0.323 | 0.284 | 0.326 |

ggp | 0.282 | 0.243 | 0.288 | 0.265 | 0.284 | 0.257 | 0.298 | 0.286 |

On adelaide, for example, ggp with diagonal posterior performed similarly to the best performing model (gprn) on rmse (0.282 for ggp versus 0.278 for gprn), however significantly better on nlpd (0.243 versus 0.311). ggp also performed similarly to lcm on nlpd (0.243 versus 0.240 for lcm), however better on rmse (0.282 versus 0.293). Results for sydney exhibit a similar pattern for nlpd. sydney results for rmse are more mixed. rmse for the ggp with diagonal posterior is mid-ranked for this dataset (0.284 versus 0.280 for the best performing model (mtg)), however, benchmark models that outperform ggp on rmse, i.e. gprn and mtg, perform poorly on nlpd.

We test statistical significance of differences in performance via bootstrapped 95 percent credible intervals. Results show nlpd for ggp (diagonal posterior) to be significantly lower than benchmarks for adelaide and sydney except lcm, which is not significantly different. rmse is confirmed lower for gprn compared to ggp for sydney and adelaide, while rmse for other benchmarks is higher or not significantly different. See the Appendix for details.

Additional analysis of the results indicate that the ggp benefits from the sparser diagonal posterior specification. The benefit of regularization under the ggp is clearer when considering mean predictive variance, as ggp variance is lower than all benchmark models for both adelaide and sydney experiments. See the Appendix for details.

Finally, the role of explicit spatial links can be examined via the optimized spatial kernel in the grouped gprn model. Orientation of spatial kernels in terms of latitude and longitude axes was found to range across groups in the ggp from strongly north-south to strongly east-west, confirming the usefulness of flexible, site-specific parameterizations. In addition, the degree to which kernel support was truncated by the Epanechnikov component ranged across groups and spatial axes. See the Appendix for details.

## 7 Discussion

We have proposed a general multi-task gp model, where groups of functions are coupled a priori. Our approach allows for input-varying covariance across tasks governed by kernels and features and, by building upon sparse variational methods and exploiting Kronecker structures, our inference method is inherently scalable to a large number of observations.

We have shown the applicability of our approach to forecasting short term distributed solar power at multiple households, where it matches point forecast performance of single-task learning approaches and other multi-task baselines under similar computational constraints while improving quantification of predictive variance. In general, the ggp strikes a balance between flexible, task-specific parameterization and effective regularization via structure imposed in the prior.

We emphasize, however, that other grouping structures, likelihood functions or applications are possible. For example, we could impose separate kernels for latent functions in diag(), or group subsets of functions relating to subsets of clustered sites.

#### Acknowledgments

This research was conducted with support from the Cooperative Research Centre for Low-Carbon Living in collaboration with the University of New South Wales and Solar Analytics Pty Ltd.

## Appendix A Additional results

Table 3 shows average predictive variance under models with full and diagonal posterior specifications for Adelaide and Sydney datasets. ggp predictive variance is lower than all benchmark models.

Adelaide | Sydney | |||
---|---|---|---|---|

diag. post. | full post. | diag. post. | full post. | |

IGP | 0.177 | 0.183 | 0.204 | 0.202 |

LCM | 0.162 | 0.160 | 0.180 | 0.178 |

MTG | 0.174 | 0.206 | 0.207 | 0.219 |

GPRN | 0.173 | 0.174 | 0.187 | 0.185 |

GGP | 0.140 | 0.136 | 0.157 | 0.142 |

The role of explicit spatial links can be examined via the optimized spatial kernel in the grouped gprn model. Orientation of spatial kernels in terms of latitude and longitude axes was found to vary across groups in the ggp (Figure 1), confirming the usefulness of flexible, site-specific parameterization. In addition, the degree to which kernel support was truncated by the Epanechnikov component varied across groups and spatial axes. No particular relation between radial basis and Epanechnikov lengthscales was observed.

Results for bootstrapped tests of significant differences in model performance are provided at Table 4. Bootstrap samples are conditioned upon optimized results reported at Table 2 in the main paper.

Mean rmse | diff. from ggp | Mean nlpd | diff. from ggp | |||

adelaide | ||||||

ggp (diagonal) | 0.282 | 0.243 | ||||

ggp (full) | 0.288 | 0.006 | * | 0.264 | 0.022 | * |

igp (diagonal) | 0.315 | 0.033 | * | 0.368 | 0.125 | * |

igp (full) | 0.314 | 0.032 | * | 0.370 | 0.127 | * |

gprn (diagonal) | 0.278 | -0.004 | * | 0.311 | 0.068 | * |

gprn (full) | 0.283 | 0.001 | * | 0.320 | 0.077 | * |

lcm (diagonal) | 0.294 | 0.012 | * | 0.240 | -0.003 | |

lcm (full) | 0.293 | 0.011 | * | 0.240 | -0.004 | |

mtg (diagonal) | 0.301 | 0.019 | * | 0.337 | 0.094 | * |

mtg (full) | 0.304 | 0.022 | * | 0.376 | 0.133 | * |

sydney | ||||||

ggp (diagonal) | 0.284 | 0.257 | ||||

ggp (full) | 0.298 | 0.014 | * | 0.286 | 0.029 | * |

igp (diagonal) | 0.287 | 0.003 | * | 0.340 | 0.084 | * |

igp (full) | 0.286 | 0.002 | 0.335 | 0.078 | * | |

gprn (diagonal) | 0.281 | -0.003 | * | 0.323 | 0.067 | * |

gprn (full) | 0.284 | 0.000 | 0.326 | 0.069 | * | |

lcm (diagonal) | 0.310 | 0.026 | * | 0.273 | 0.016 | * |

lcm (full) | 0.302 | 0.018 | * | 0.257 | 0.001 | |

mtg (diagonal) | 0.280 | -0.003 | 0.342 | 0.085 | * | |

mtg (full) | 0.283 | -0.001 | 0.360 | 0.103 | * | |

* Mean difference is significant at 95 percent credible interval. |

## References

- Álvarez and Lawrence (2009) Álvarez, M. and Lawrence, N. D. (2009). Sparse convolved Gaussian processes for multi-output regression. In Neural Information Processing Systems.
- Álvarez and Lawrence (2011) Álvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(5):1459–1500.
- Álvarez et al. (2010) Álvarez, M. A., Luengo, D., Titsias, M. K., and Lawrence, N. D. (2010). Efficient multioutput Gaussian processes through variational inducing kernels. In Artificial Intelligence and Statistics.
- Álvarez et al. (2012) Álvarez, M. A., Rosasco, L., and Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Found. Trends Mach. Learn., 4(3):195–266.
- Bilionis et al. (2014) Bilionis, I., Constantinescu, E. M., and Anitescu, M. (2014). Data-driven model for solar irradiation based on satellite observations. Solar Energy, 110:22–38.
- Bonilla et al. (2008) Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. (2008). Multi-task Gaussian process prediction. In Neural Information Processing Systems.
- Cressie and Wikle (2011) Cressie, N. and Wikle, C. K. (2011). Statistics for spatio-temporal data. John Wiley & Sons.
- Dahl and Bonilla (2017) Dahl, A. and Bonilla, E. (2017). Scalable Gaussian process models for solar power forecasting. In Woon, W. L., Aung, Z., Kramer, O., and Madnick, S., editors, Data Analytics for Renewable Energy Integration: Informing the Generation and Distribution of Renewable Energy, pages 94–106. Springer International Publishing.
- Gardner et al. (2018) Gardner, J. R., Pleiss, G., Wu, R., Weinberger, K. Q., and Wilson, A. G. (2018). Product kernel interpolation for scalable Gaussian processes. arXiv, abs/1802.08903.
- Goovaerts (1997) Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford Univ. Press.
- Hensman et al. (2014) Hensman, J., Rattray, M., and Lawrence, N. D. (2014). Fast nonparametric clustering of structured time-series. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Inman et al. (2013) Inman, R. H., Pedro, H. T., and Coimbra, C. F. (2013). Solar forecasting methods for renewable energy integration. Progress in energy and combustion science, 39(6):535–576.
- Jordan et al. (1998) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1998). An introduction to variational methods for graphical models. Springer.
- Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Krauth et al. (2017) Krauth, K., Bonilla, E. V., Cutajar, K., and Filippone, M. (2017). AutoGP: Exploring the capabilities and limitations of Gaussian process models. In Artificial Intelligence and Statistics.
- Matthews et al. (2017) Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6.
- Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian processes for machine learning. The MIT Press.
- Remes et al. (2017) Remes, S., Heinonen, M., and Kaski, S. (2017). A mutually-dependent Hadamard kernel for modelling latent variable couplings. In Zhang, M.-L. and Noh, Y.-K., editors, Proceedings of the Ninth Asian Conference on Machine Learning, volume 77 of Proceedings of Machine Learning Research, pages 455–470.
- Sampson and Guttorp (1992) Sampson, P. D. and Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119.
- Shinozaki et al. (2016) Shinozaki, K., Yamakawa, N., Sasaki, T., and Inoue, T. (2016). Areal solar irradiance estimated by sparsely distributed observations of solar radiation. IEEE Transactions on Power Systems, 31(1):35–42.
- Teh et al. (2005) Teh, Y. W., Seeger, M., and Jordan, M. I. (2005). Semiparametric latent factor models. In Artificial Intelligence and Statistics.
- Voyant et al. (2017) Voyant, C., Notton, G., Kalogirou, S., Nivet, M.-L., Paoli, C., Motte, F., and Fouilloy, A. (2017). Machine learning methods for solar radiation forecasting: A review. Renewable Energy, 105:569–582.
- Widén et al. (2015) Widén, J., Carpman, N., Castellucci, V., Lingfors, D., Olauson, J., Remouit, F., Bergkvist, M., Grabbe, M., and Waters, R. (2015). Variability assessment and forecasting of renewables: A review for solar, wind, wave and tidal resources. Renewable and Sustainable Energy Reviews, 44:356–375.
- Wilson et al. (2012) Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In International Conference on Machine Learning.
- Yang et al. (2013) Yang, D., Gu, C., Dong, Z., Jirutitijaroen, P., Chen, N., and Walsh, W. M. (2013). Solar irradiance forecasting using spatial-temporal covariance structures and time-forward kriging. Renewable Energy, 60:235–245.
- Yang et al. (2015) Yang, D., Ye, Z., Lim, L. H. I., and Dong, Z. (2015). Very short term irradiance forecasting using the lasso. Solar Energy, 114:314–326.