Data-driven polynomial chaos expansion for machine learning regression

E. Torre, S. Marelli, P. Embrechts, B. Sudret
Abstract

We present a regression technique for data-driven problems based on polynomial chaos expansion (PCE). PCE is a popular technique in the field of uncertainty quantification (UQ), where it is typically used to replace a runnable but expensive computational model subject to random inputs with an inexpensive-to-evaluate polynomial function. The metamodel thus obtained enables a reliable estimation of the statistics of the output, provided that a suitable probabilistic model of the input is available.

In classical machine learning (ML) regression settings, however, the system is only known through observations of its inputs and output, and the interest lies in obtaining accurate pointwise predictions of the latter. Here, we show that a PCE metamodel purely trained on data can yield pointwise predictions whose accuracy is comparable to that of other ML regression models, such as neural networks and support vector machines. The comparisons are performed on benchmark datasets available from the literature. The methodology also enables the quantification of the output uncertainties and is robust to noise. Furthermore, it enjoys additional desirable properties, such as good performance for small training sets and simplicity of construction, with only little parameter tuning required. In the presence of statistically dependent inputs, we investigate two ways to build the PCE, and show through simulations that one approach is superior to the other in the stated settings.

Keywords: sparse polynomial chaos expansions, machine learning, regression, uncertainty quantification, copulas.

1 Introduction

Machine learning (ML) is increasingly used today to make predictions of system responses and to aid or guide decision making. Given an $M$-dimensional input vector $\boldsymbol{x}$ to a system and the corresponding $N$-dimensional output vector $\boldsymbol{y}$, data-driven ML algorithms establish a map $\boldsymbol{x} \mapsto \boldsymbol{y}$ on the basis of an available sample set $\mathcal{X}$ of input observations and of the corresponding output values $\mathcal{Y}$. In classification tasks, the output is discrete, that is, it can take at most a countable number of different values (the class labels). In regression tasks, which this paper is concerned with, the output takes continuous values. Regression techniques include linear regression, neural networks (NN), kernel methods (such as Gaussian processes, GP), sparse kernel machines (such as support vector machines, SVM), graphical models (such as Bayesian networks and Markov random fields), and others. A comprehensive overview of these methods can be found in Bishop (2009) and in Witten et al. (2016).

Current research on ML algorithms focuses on various open issues. First, there is an increasing interest towards problems where the inputs to the system, and as a consequence the system’s response, are uncertain (see, e.g., Chan and Elsheikh (2018); Kasiviswanathan et al. (2016); Mentch and Hooker (2016)). Properly accounting for both aleatory and epistemic input uncertainties allows one to estimate the response statistics. Second, novel methods are sought to better automate the tuning of hyperparameters, to which several ML methods are highly sensitive (Snoek et al., 2012). Third, the paucity of data in specific problems calls for models that can be effectively trained on few observations only (Forman and Cohen, 2004). Finally, complex models, while possibly yielding accurate predictions, are often difficult if not impossible to interpret.

This manuscript revisits polynomial chaos expansions (PCE), a well-established metamodelling technique in uncertainty quantification (UQ), as a statistical ML regression algorithm (Vapnik, 1995) that can deal with these challenges (Sudret et al., 2015). UQ classically deals with problems where the input $\boldsymbol{X}$ to a system is uncertain, and is therefore modelled as a random vector. As a consequence, the output $Y$ is also uncertain, but its statistics are unknown and have to be estimated. Differently from ML, in UQ the model $\mathcal{M}$ is typically available (e.g., as a finite element code), and the response can be computed pointwise. However, $\mathcal{M}$ is often a computationally expensive black-box model, so that a Monte Carlo approach to estimating the statistics of $Y$ (generate a large input sample set to obtain a large output sample set) is not feasible. PCE is a UQ spectral technique that is used in these settings to express $Y$ as a polynomial function of $\boldsymbol{X}$. PCE thereby allows one to replace the original computational model with an inexpensive-to-run but accurate metamodel. The metamodel can be used to derive, for instance, estimates of various statistics of $Y$, such as its moments or its sensitivity to different components of $\boldsymbol{X}$ (Saltelli et al., 2000), in a computationally efficient way. Recently, we established a general framework to perform UQ (including PCE metamodelling) in the presence of complex input dependencies modelled through copulas (Torre et al., 2017).

Here, we re-establish PCE in a purely data-driven ML setup, where the goal is to obtain a model that can predict the response of a system given its inputs. For simplicity, we consider the case where the response is a scalar random variable $Y$; the generalisation to multivariate outputs is straightforward. In the setup of interest here, no computational model is available, but only a set $\mathcal{X}$ of input values and the corresponding responses $\mathcal{Y}$. $\mathcal{X}$ and $\mathcal{Y}$ are considered to be (possibly noisy) realisations of the random input $\boldsymbol{X}$ and of $Y$, the true relationship between which is deterministic but unknown.

After recalling the underlying theory (Section 2), we first show by means of simulation that data-driven PCE delivers accurate pointwise predictions (Section 3). In addition, PCE also enables a reliable estimation of the statistics of the response (such as its moments and PDF), thus enabling uncertainty quantification of the predictions being made. Dependencies among the input variables are effectively modelled by copula functions, purely inferred from the data as well. The approach is shown to be robust to noise in the data, and to work well also in the presence of small training sets. In Section 4, we further apply PCE to real data previously analysed by other authors with different NN and/or SVM algorithms, where it achieves a comparable performance. Importantly, the construction of the PCE metamodel does not rely on fine-tuning of critical hyperparameters. This and other desirable features of the PCE modelling framework are discussed in Section 5.

2 Methods

2.1 Measures of accuracy

Before introducing PCE, which is a popular metamodelling technique in UQ, as an ML technique used to make pointwise predictions, it is important to clarify the distinction between the error types that UQ and ML typically aim at minimizing.

ML algorithms are generally used to predict, for any given input, the corresponding output of the system. Their performance is assessed in terms of the distance of the prediction from the actual system response, calculated for each input and averaged over a large number of (ideally, all) possible inputs. A common error measure in this sense is the mean absolute error (MAE). For a regression model $\hat{\mathcal{M}}$ trained on a training data set $(\mathcal{X}, \mathcal{Y})$, the MAE is typically evaluated over a validation data set $\{(\boldsymbol{x}^{(i)}, y^{(i)}),\ i = 1, \ldots, n_V\}$ of size $n_V$ by

$$\mathrm{MAE} = \frac{1}{n_V} \sum_{i=1}^{n_V} \left| y^{(i)} - \hat{\mathcal{M}}(\boldsymbol{x}^{(i)}) \right|. \qquad (1)$$

The associated relative error (rMAE) considers point by point the ratio between the absolute error and the actual response, i.e.,

$$\mathrm{rMAE} = \frac{1}{n_V} \sum_{i=1}^{n_V} \frac{\left| y^{(i)} - \hat{\mathcal{M}}(\boldsymbol{x}^{(i)}) \right|}{\left| y^{(i)} \right|}, \qquad (2)$$

which is well defined if $y^{(i)} \neq 0$ for all $i = 1, \ldots, n_V$.
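For concreteness, a minimal sketch of how (1) and (2) can be computed (our own illustration, not code from the paper; array names are placeholders):

```python
import numpy as np

def mae_rmae(y_true, y_pred):
    """Mean absolute error (1) and relative MAE (2) over a validation set.

    Assumes every entry of y_true is non-zero, as required for (2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)
    return abs_err.mean(), np.mean(abs_err / np.abs(y_true))
```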

PCE is instead typically used to draw accurate estimates of various statistics of the output – such as its moments, its PDF, confidence intervals – given a probabilistic model of an uncertain input. The relative error associated to the estimates of the mean $\mu_Y$ and of the standard deviation $\sigma_Y$ is often quantified by

$$\varepsilon_{\mu} = \left| \frac{\hat{\mu}_Y - \mu_Y}{\mu_Y} \right|, \qquad \varepsilon_{\sigma} = \left| \frac{\hat{\sigma}_Y - \sigma_Y}{\sigma_Y} \right|, \qquad (3)$$

provided that $\mu_Y \neq 0$ and $\sigma_Y \neq 0$.

A popular measure of the error made on the full response PDF is the Kullback-Leibler divergence

$$d_{\mathrm{KL}}\!\left( \hat{f}_Y \,\|\, f_Y \right) = \int \hat{f}_Y(y) \, \log \frac{\hat{f}_Y(y)}{f_Y(y)} \, \mathrm{d}y, \qquad (4)$$

which quantifies the average difference between the logarithms of the predicted PDF $\hat{f}_Y$ and of the true PDF $f_Y$.
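One possible way to evaluate (4) in practice is sketched below, assuming that both PDFs are themselves approximated by Gaussian KDE on large output samples (an assumption of this sketch, not a prescription of the paper; the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kl_divergence(samples_pred, samples_true, n_grid=2000):
    """Approximate the KL divergence (4) between the predicted and true output
    PDFs, each estimated by Gaussian KDE, via trapezoidal integration."""
    f_pred = gaussian_kde(samples_pred)
    f_true = gaussian_kde(samples_true)
    lo = min(np.min(samples_pred), np.min(samples_true))
    hi = max(np.max(samples_pred), np.max(samples_true))
    y = np.linspace(lo, hi, n_grid)
    p = f_pred(y)
    q = np.maximum(f_true(y), 1e-300)   # guard against numerical underflow
    return np.trapz(p * np.log(np.maximum(p, 1e-300) / q), y)
```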

Provided a suitable model of the joint PDF of the input, PCE is known to converge (in the sense that the error on the mentioned output statistics decreases) very rapidly as the size of the training set increases, compared for instance to Monte-Carlo sampling (Puig et al., 2002; Todor and Schwab, 2007; Ernst et al., 2012). This happens because PCE, which is a spectral method, efficiently propagates the probabilistic information delivered by the input model to the output. Nevertheless, this does not necessarily imply a similarly rapid pointwise convergence of the error, which remains to be demonstrated. For a discussion of various types of convergence, see Ernst et al. (2012) and references therein.

2.2 Data-driven settings

PCE is designed to build an inexpensive-to-evaluate analytical model mapping an input random vector $\boldsymbol{X}$ onto an output random variable $Y$ (Ghanem and Spanos, 1990; Xiu and Karniadakis, 2002). PCE assumes that an unknown deterministic map $\mathcal{M}$ exists, such that $Y = \mathcal{M}(\boldsymbol{X})$. $Y$ is additionally assumed to have finite variance: $\mathbb{E}[Y^2] < \infty$. Under the further assumption that each component $X_i$ of $\boldsymbol{X}$ has finite moments of all orders (Ernst et al., 2012), the space of square-integrable functions with respect to the joint PDF $f_{\boldsymbol{X}}$ admits a basis $\{ \Psi_{\boldsymbol{\alpha}},\ \boldsymbol{\alpha} \in \mathbb{N}^M \}$ of polynomials orthonormal to each other with respect to the probability measure induced by $f_{\boldsymbol{X}}$, i.e., such that

$$\mathbb{E}\!\left[ \Psi_{\boldsymbol{\alpha}}(\boldsymbol{X}) \, \Psi_{\boldsymbol{\beta}}(\boldsymbol{X}) \right] = \int \Psi_{\boldsymbol{\alpha}}(\boldsymbol{x}) \, \Psi_{\boldsymbol{\beta}}(\boldsymbol{x}) \, f_{\boldsymbol{X}}(\boldsymbol{x}) \, \mathrm{d}\boldsymbol{x} = \delta_{\boldsymbol{\alpha}\boldsymbol{\beta}}. \qquad (5)$$

Here, $\delta_{\boldsymbol{\alpha}\boldsymbol{\beta}}$ is the Kronecker delta symbol. The $i$-th element $\alpha_i$ of the multi-index $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_M) \in \mathbb{N}^M$ indicates the degree of $\Psi_{\boldsymbol{\alpha}}$ in the $i$-th variable, $i = 1, \ldots, M$. $\Psi_{\boldsymbol{\alpha}}$ has a total degree given by $|\boldsymbol{\alpha}| = \sum_{i=1}^M \alpha_i$.

Thus, $Y$ admits the spectral representation

$$Y = \mathcal{M}(\boldsymbol{X}) = \sum_{\boldsymbol{\alpha} \in \mathbb{N}^M} y_{\boldsymbol{\alpha}} \, \Psi_{\boldsymbol{\alpha}}(\boldsymbol{X}). \qquad (6)$$

The goal of PCE is to determine the coefficients $y_{\boldsymbol{\alpha}}$ of the expansion, truncated at some polynomial order, given an initial set $(\mathcal{X}, \mathcal{Y})$ of observations (the training set or experimental design). In some engineering applications the model $\mathcal{M}$ is directly available (e.g., as a finite element model) but computationally expensive to evaluate: in these cases, PCE is used as a surrogate to replace the true model with an inexpensive-to-evaluate metamodel. In the standard ML settings considered here, conversely, $\mathcal{M}$ is unknown and has to be modelled solely on the basis of the available observations $(\mathcal{X}, \mathcal{Y})$. The primary goal of this paper is to show that the accuracy of a PCE metamodel built purely on $(\mathcal{X}, \mathcal{Y})$ can compete with that of state-of-the-art machine learning regression models, while requiring little parameter tuning and in some cases offering additional advantages.

2.3 PCE for independent inputs

In this section, we assume that the input random vector $\boldsymbol{X}$ has statistically independent components. The PCE basis is then built from the tensor product of univariate orthogonal polynomials. The case of inputs with dependent components is discussed later in Section 2.4.

2.3.1 Orthogonalization for independent inputs

When $\boldsymbol{X}$ has independent components, the multivariate polynomials $\Psi_{\boldsymbol{\alpha}}$ can be obtained as tensor products of univariate polynomials,

$$\Psi_{\boldsymbol{\alpha}}(\boldsymbol{x}) = \prod_{i=1}^{M} \phi^{(i)}_{\alpha_i}(x_i), \qquad (7)$$

where each $\phi^{(i)}_{\alpha_i}$ is the element of degree $\alpha_i$ of a basis of univariate polynomials orthonormal with respect to the $i$-th marginal $f_i$ of $\boldsymbol{X}$, that is, such that

$$\int \phi^{(i)}_{j}(x) \, \phi^{(i)}_{k}(x) \, f_i(x) \, \mathrm{d}x = \delta_{jk}. \qquad (8)$$

The problem of building a basis of multivariate polynomials mutually orthonormal with respect to $f_{\boldsymbol{X}}$ hence becomes the problem of building $M$ bases of mutually orthonormal univariate polynomials, each with respect to one marginal $f_i$. The $M$ bases can be built independently of each other.

2.3.2 Specification of the marginals and arbitrary PCE

The construction of the polynomial basis requires a model of the marginal input distributions. Families of univariate orthonormal polynomials associated to classical continuous distributions are described in Xiu and Karniadakis (2002). An input variable $X_i$ with a different (continuous, strictly increasing) distribution $F_i$ may be transformed into a random variable $\tilde{X}_i$ with distribution $\tilde{F}$ belonging to one of the classical families by the transformation

$$\tilde{X}_i = \tilde{F}^{-1}\!\left( F_i(X_i) \right). \qquad (9)$$

This relation offers a first possible approach to build the PCE of $Y$ for inputs with generic continuous marginals $F_i$: transform $\boldsymbol{X}$ into $\tilde{\boldsymbol{X}}$ through (9), and then build a PCE of $Y$ in terms of $\tilde{\boldsymbol{X}}$. The PCE then has to be trained on $(\tilde{\mathcal{X}}, \mathcal{Y})$, where $\tilde{\mathcal{X}}$ is obtained from $\mathcal{X}$ using (9).
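A minimal sketch of transformation (9), assuming for illustration that the marginal $F_i$ is a fitted lognormal and that the target classical distribution $\tilde{F}$ is the standard normal (for which the Hermite family is the associated orthonormal basis); all names and the parametric choice are assumptions of the sketch:

```python
import numpy as np
from scipy import stats

# Example observations of X_i (placeholder data).
x_i = stats.lognorm(s=0.5).rvs(size=1000, random_state=0)

# Estimate the marginal F_i (here a lognormal fit, purely illustrative).
shape, loc, scale = stats.lognorm.fit(x_i)
u = stats.lognorm.cdf(x_i, shape, loc=loc, scale=scale)   # u = F_i(x_i)
u = np.clip(u, 1e-10, 1 - 1e-10)                          # guard the tails

# Eq. (9) with F~ chosen as the standard normal CDF.
x_tilde = stats.norm.ppf(u)
```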

Alternatively, it is possible to directly construct a basis of polynomials orthonormal with respect to each marginal $f_i$ by Stieltjes or Gram-Schmidt orthogonalisation (Soize and Ghanem, 2004; Wan and Karniadakis, 2006). The polynomial chaos representation for arbitrary distributions is known as arbitrary PCE. In practice, performing arbitrary PCE on the original input is preferable to building the PCE on inputs transformed via (9). Indeed, the latter is usually a highly non-linear transformation, making the map from $\tilde{\boldsymbol{X}}$ to $Y$ more difficult to approximate by polynomials (Oladyshkin and Nowak, 2012).

In the applications presented in this paper, we opt for a non-parametric approach to infer the input marginals, and specify each marginal $f_i$ by kernel density estimation (KDE) (Rosenblatt, 1956; Parzen, 1962). Given a set $\{ x_i^{(1)}, \ldots, x_i^{(n)} \}$ of observations of $X_i$, the kernel density estimate of $f_i$ reads

$$\hat{f}_i(x) = \frac{1}{n h} \sum_{k=1}^{n} K\!\left( \frac{x - x_i^{(k)}}{h} \right), \qquad (10)$$

where $K(\cdot)$ is the kernel function and $h$ is the kernel bandwidth, which is learned from the data. Different kernels are used in practice, such as the Epanechnikov or the Gaussian kernel. In this paper we use the latter, that is, $K$ is selected as the standard normal PDF.

After estimating the input marginals by KDE, we resort to arbitrary PCE to build a basis of orthonormal polynomials. The PCE metamodel is then trained on the data $(\mathcal{X}, \mathcal{Y})$.
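A minimal KDE sketch in the spirit of (10), using a Gaussian kernel; the bandwidth rule (Silverman's here) is an illustrative choice, since the paper only states that the bandwidth is learned from the data:

```python
import numpy as np
from scipy.stats import gaussian_kde

x_i = np.random.default_rng(0).normal(size=500)      # placeholder observations of X_i
f_i_hat = gaussian_kde(x_i, bw_method="silverman")   # Gaussian-kernel estimate of f_i
grid = np.linspace(x_i.min(), x_i.max(), 200)
density = f_i_hat(grid)                              # estimated marginal PDF on a grid
```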

2.3.3 Truncation schemes

The sum in (6) involves an infinite number of terms. For practical purposes, it is truncated to a finite sum, according to one of various possible truncation schemes. The standard basis truncation (Xiu and Karniadakis, 2002) retains the subset of terms defined by

$$\mathcal{A}^{M,p} = \left\{ \boldsymbol{\alpha} \in \mathbb{N}^M : |\boldsymbol{\alpha}| \leq p \right\},$$

where $p$ is the chosen maximum polynomial degree and $|\boldsymbol{\alpha}|$ is the total degree of $\Psi_{\boldsymbol{\alpha}}$. Thus, only polynomials of total degree up to $p$ are considered. $\mathcal{A}^{M,p}$ contains $\binom{M+p}{p}$ elements. To further reduce the basis size, several additional truncation schemes have been proposed. Hyperbolic truncation (Blatman and Sudret, 2011) retains the polynomials with indices in

$$\mathcal{A}^{M,p,q} = \left\{ \boldsymbol{\alpha} \in \mathbb{N}^M : \| \boldsymbol{\alpha} \|_q \leq p \right\},$$

where $0 < q \leq 1$ and $\| \cdot \|_q$ is the $q$-norm defined by $\| \boldsymbol{\alpha} \|_q = \left( \sum_{i=1}^M \alpha_i^q \right)^{1/q}$.

A complementary strategy is to set a limit $r$ to the number of non-zero elements in $\boldsymbol{\alpha}$, that is, to the number of interactions among the components of $\boldsymbol{X}$ in the expansion. This maximum interaction truncation scheme retains the polynomials with indices in

$$\mathcal{A}^{M,p,r} = \left\{ \boldsymbol{\alpha} \in \mathbb{N}^M : |\boldsymbol{\alpha}| \leq p, \ \| \boldsymbol{\alpha} \|_0 \leq r \right\},$$

where $\| \boldsymbol{\alpha} \|_0 = \#\{ i : \alpha_i \neq 0 \}$.

In our case studies below we combined hyperbolic and maximum interaction truncation, thus retaining the expansion’s polynomials with multi-indices in

$$\mathcal{A}^{M,p,q,r} = \left\{ \boldsymbol{\alpha} \in \mathbb{N}^M : \| \boldsymbol{\alpha} \|_q \leq p, \ \| \boldsymbol{\alpha} \|_0 \leq r \right\}, \qquad (11)$$

where the triplet $(p, q, r)$ will be selected later on. This choice is motivated by the sparsity-of-effects principle, which assumes that only few meaningful interactions influence system responses and which holds in most engineering applications.
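The truncation set (11) is easy to enumerate explicitly. The following sketch (a brute-force illustration, not an efficient implementation) builds $\mathcal{A}^{M,p,q,r}$ for a given triplet $(p, q, r)$:

```python
import itertools
import numpy as np

def truncated_multi_indices(M, p, q, r):
    """Multi-index set (11): q-norm of alpha <= p and at most r non-zero entries.

    Brute-force enumeration over the full tensor grid; fine for the small M
    and p used in this kind of sketch."""
    indices = []
    for alpha in itertools.product(range(p + 1), repeat=M):
        a = np.array(alpha, dtype=float)
        if np.count_nonzero(a) > r:
            continue
        if np.sum(a ** q) ** (1.0 / q) <= p + 1e-12:
            indices.append(alpha)
    return sorted(indices)

# Example: 3 inputs, maximum degree 5, q = 0.75, at most 2 interacting variables.
A = truncated_multi_indices(M=3, p=5, q=0.75, r=2)
```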

2.3.4 Calculation of the coefficients

Having selected a truncation scheme and the corresponding set $\mathcal{A}$ of multi-indices, the coefficients $y_{\boldsymbol{\alpha}}$ in

$$Y \approx \mathcal{M}^{\mathrm{PC}}(\boldsymbol{X}) = \sum_{\boldsymbol{\alpha} \in \mathcal{A}} y_{\boldsymbol{\alpha}} \, \Psi_{\boldsymbol{\alpha}}(\boldsymbol{X}) \qquad (12)$$

need to be determined on the basis of the set $(\mathcal{X}, \mathcal{Y})$ of observed data. In these settings, the vector $\boldsymbol{y} = ( y_{\boldsymbol{\alpha}},\ \boldsymbol{\alpha} \in \mathcal{A} )$ of expansion coefficients can be determined by regression.

When the number $|\mathcal{A}|$ of regressors is smaller than the number $n$ of observations, $\boldsymbol{y}$ can be determined by solving the ordinary least squares (OLS) problem

$$\hat{\boldsymbol{y}} = \arg\min_{\boldsymbol{y}} \, \sum_{k=1}^{n} \left( y^{(k)} - \sum_{\boldsymbol{\alpha} \in \mathcal{A}} y_{\boldsymbol{\alpha}} \, \Psi_{\boldsymbol{\alpha}}(\boldsymbol{x}^{(k)}) \right)^{\!2}. \qquad (13)$$

The solution reads

$$\hat{\boldsymbol{y}} = \left( \boldsymbol{\Psi}^{\mathsf{T}} \boldsymbol{\Psi} \right)^{-1} \boldsymbol{\Psi}^{\mathsf{T}} \boldsymbol{\mathcal{Y}}, \qquad (14)$$

where $\boldsymbol{\Psi}$ is the $n \times |\mathcal{A}|$ matrix with elements $\Psi_{kj} = \Psi_{\boldsymbol{\alpha}_j}(\boldsymbol{x}^{(k)})$, $\boldsymbol{\alpha}_j$ is the $j$-th multi-index in $\mathcal{A}$, and $\boldsymbol{\mathcal{Y}} = ( y^{(1)}, \ldots, y^{(n)} )^{\mathsf{T}}$ is the vector of observed responses.
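As an illustration of (13)-(14), the sketch below assembles the design matrix and solves the least-squares problem, assuming for simplicity standard-normal inputs, for which the normalised probabilists' Hermite polynomials form the orthonormal univariate basis (for arbitrary marginals the univariate bases would instead come from Stieltjes or Gram-Schmidt orthogonalisation); all helper names are our own:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial

def psi(alpha, x):
    """Orthonormal multivariate polynomial (7) for standard-normal inputs:
    product of probabilists' Hermite polynomials He_n, normalised by sqrt(n!)."""
    val = np.ones(x.shape[0])
    for i, a in enumerate(alpha):
        coeffs = np.zeros(a + 1)
        coeffs[a] = 1.0
        val *= hermeval(x[:, i], coeffs) / np.sqrt(factorial(a))
    return val

def ols_coefficients(A, X, Y):
    """OLS solution of (13)-(14): Psi is the n x |A| design matrix."""
    Psi = np.column_stack([psi(alpha, X) for alpha in A])
    coeff, *_ = np.linalg.lstsq(Psi, Y, rcond=None)
    return Psi, coeff
```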

The OLS problem cannot be solved when $|\mathcal{A}| > n$, because in this case the matrix $\boldsymbol{\Psi}^{\mathsf{T}} \boldsymbol{\Psi}$ in (14) is not invertible. Also, OLS tends to overfit the data when $|\mathcal{A}|$ is large. Simpler models with fewer non-zero coefficients can be constructed by sparse regression.

Least angle regression (LAR), proposed in Efron et al. (2004), is an algorithm that achieves sparse regression by solving the $\ell_1$-regularised regression problem

$$\hat{\boldsymbol{y}} = \arg\min_{\boldsymbol{y}} \, \sum_{k=1}^{n} \left( y^{(k)} - \sum_{\boldsymbol{\alpha} \in \mathcal{A}} y_{\boldsymbol{\alpha}} \, \Psi_{\boldsymbol{\alpha}}(\boldsymbol{x}^{(k)}) \right)^{\!2} + \lambda \, \| \boldsymbol{y} \|_1. \qquad (15)$$

The last term in the expression is the regularisation term, which forces the minimisation to favour sparse solutions. The use of LAR in the context of PCE was initially proposed in Blatman and Sudret (2011), to which the reader is referred for more details. In the applications illustrated in Sections 3 and 4, we adopt LAR to determine the PCE coefficients.
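A sparse-regression sketch for (15), using scikit-learn's LassoLars with a cross-validated penalty as a stand-in for the hybrid LAR procedure of Blatman and Sudret (2011); `Psi` and `Y` are the design matrix and response vector of the OLS sketch above:

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

def sparse_coefficients(Psi, Y):
    """Approximate solution of (15): l1-penalised least-angle regression.

    The constant basis term is part of Psi, so no separate intercept is fitted."""
    lars = LassoLarsCV(fit_intercept=False, cv=5).fit(Psi, np.ravel(Y))
    return lars.coef_          # most entries are exactly zero (sparse expansion)
```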

2.3.5 Estimation of the output statistics

Given the statistical model of the input and the PCE metamodel (12) of the input-output map, the model response $Y$ is not only known pointwise, but can also be characterized statistically. For instance, the orthonormality of the polynomial basis ensures that the mean and the variance of $Y$ read, respectively,

$$\mu_Y = y_{\boldsymbol{0}}, \qquad \sigma_Y^2 = \sum_{\boldsymbol{\alpha} \in \mathcal{A} \setminus \{ \boldsymbol{0} \}} y_{\boldsymbol{\alpha}}^2. \qquad (16)$$

This property provides a useful interpretation of the expansion coefficients in terms of the first two moments of the output. Other output statistics, such as the Sobol partial sensitivity indices (Sobol’, 1993), can be obtained from the expansion coefficients analytically (Sudret, 2008).

Higher-order moments of the output, as well as other statistics (such as the full PDF), can be efficiently estimated through Monte-Carlo simulation, by sampling sufficiently many realisations of $\boldsymbol{X}$ and evaluating the corresponding PCE responses. Polynomial evaluation is computationally cheap and can be trivially vectorized, making estimation by resampling extremely efficient.
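A sketch combining (16) with the resampling step; `A`, `coeff` and `psi` are the hypothetical helpers from the sketches above, and `sample_inputs` stands for any sampler of the input model (a placeholder, not a library function):

```python
import numpy as np
from scipy.stats import gaussian_kde

def pce_statistics(A, coeff, sample_inputs, n_samples=100_000):
    """Mean and variance from (16), plus a KDE estimate of the full output PDF
    obtained by evaluating the cheap polynomial on a large input sample."""
    zero_idx = A.index(tuple([0] * len(A[0])))   # position of the constant term
    mean = coeff[zero_idx]                       # mu_Y = y_0
    var = np.sum(coeff ** 2) - mean ** 2         # sum of squared coefficients, alpha != 0
    X_big = sample_inputs(n_samples)             # e.g. draws from the input model
    Psi_big = np.column_stack([psi(a, X_big) for a in A])
    Y_big = Psi_big @ coeff
    pdf_hat = gaussian_kde(Y_big)                # estimated response PDF
    return mean, var, pdf_hat
```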

2.4 PCE for mutually dependent inputs

If the input vector $\boldsymbol{X}$ has statistically dependent components, its joint PDF is not the product of its marginals and the orthogonality property (5) does not hold. As a consequence, one cannot construct multivariate orthogonal polynomials by tensor product of univariate ones, as done in (7). Constructing a basis of orthogonal polynomials in this general case is still possible, but computationally demanding. For this reason, we investigate two alternative strategies.

The first strategy consists in ignoring the input dependencies and in building, by arbitrary PCE, a basis of polynomials orthonormal with respect to the product of the marginals $\prod_{i=1}^M f_i(x_i)$. This approach is labelled from here on as aPCEonX. While not forming a basis of the space of square-integrable functions with respect to the true joint PDF $f_{\boldsymbol{X}}$, the considered set of polynomials may still yield a similarly good pointwise approximation.

Accurate estimates of the output statistics may be obtained a posteriori by modelling the input dependencies through copula functions, as detailed in Appendix A. The joint CDF of a random vector with copula $C$ and marginals $F_i$ (here, obtained by KDE) is given by (A.1). Resampling from such a probabilistic input model yields more accurate estimates of the output statistics than resampling from the independence model $\prod_{i=1}^M F_i(x_i)$, that is, than ignoring dependencies.

The second possible approach, presented in more detail in Appendix B, consists in modelling directly the input dependencies by copulas, and then in transforming the input vector $\boldsymbol{X}$ into a vector $\boldsymbol{Z}$ of independent random variables with prescribed marginals (e.g., standard uniform) through the associated Rosenblatt transform (Rosenblatt, 1952), defined in (A.18). The PCE metamodel is then built between the transformed input vector $\boldsymbol{Z}$ and the output $Y$. When the input marginals are standard uniform distributions, the corresponding class of orthonormal polynomials is the Legendre family, and the resulting PCE is here indicated by lPCEonZ. The asymptotic properties of an orthonormal PCE, such as the relations (16), hold. The expression of $Y$ in terms of $\boldsymbol{Z}$ is given by relation (B.1).
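To fix ideas, the Rosenblatt transform has a simple closed form for a bivariate Gaussian copula, shown in the sketch below. This is an illustrative special case only: the paper employs C-vine copulas, whose transform chains the conditional CDFs of the fitted pair-copulas.

```python
import numpy as np
from scipy.stats import norm

def rosenblatt_gaussian_2d(u1, u2, rho):
    """Rosenblatt transform of a bivariate Gaussian copula with correlation rho:
    maps (U1, U2) ~ C onto a pair of independent standard uniforms (Z1, Z2)."""
    z1 = u1                                    # Z1 = U1
    x1, x2 = norm.ppf(u1), norm.ppf(u2)
    z2 = norm.cdf((x2 - rho * x1) / np.sqrt(1.0 - rho ** 2))   # Z2 = C(u2 | u1)
    return z1, z2
```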

At first sight, the second approach seems advantageous over the first one. It involves, in a different order, the same steps: modelling the input marginals and copula, and building the PCE metamodel. By capturing dependencies earlier on, it enables the construction of a basis of polynomials mutually orthonormal with respect to the joint PDF of the (transformed) inputs. It thereby provides a model with spectral properties (except for unavoidable errors due to truncation and parameter fitting). The experiments run in Section 3, however, show that the first approach typically yields more accurate pointwise predictions (although not necessarily better statistical estimates). This is due to the fact that it avoids mapping the input into independent variables via the (typically strongly non-linear) Rosenblatt transform. For a more detailed discussion, see Section 3.4.

3 Validation on synthetic data

3.1 Validation scheme

We first investigate data-driven regression by PCE on two different simulated data sets. The first data set is obtained through an analytical, highly non-linear function of three variables, subject to three uncertain inputs. The second data set is a finite element model of a horizontal truss structure, subject to uncertain inputs. In both cases, the inputs are statistically dependent, and their dependence is modelled through a canonical vine (C-vine) copula (see Appendix A).

We study the performance of the two alternative approaches described in Section 2.4: aPCEonX and lPCEonZ. In both cases we model the input copula as a C-vine, fully inferred from the data. For the aPCEonX the choice of the copula only plays a role in the estimation of the output statistics, while for the lPCEonZ it affects also the pointwise predictions. We additionally tested the performance obtained by using the true input copula (known here because it was used to generate the synthetic data) and a Gaussian copula inferred from data. We also investigated the performance obtained by using the true marginals instead of the ones fitted by KDE. Using the true copula and marginals yielded better statistical estimates, but is of little practical interest, as it would not be known in real applications. The Gaussian copula yielded generally similar or slightly worse accuracy. For brevity, we do not show these results.

To assess the pointwise accuracy, we generate a training set of increasing size, build the PCE metamodels both by aPCEonX and by lPCEonZ, and evaluate their rMAE on a validation set of fixed size. The statistical accuracy is quantified instead in terms of the error on the mean, standard deviation, and full PDF of the true models by (2)-(4). The statistical estimates obtained by the two PCE approaches are furthermore compared to the corresponding sample estimates obtained from the same training data (sample mean, sample standard deviation, and KDE of the PDF).

For each training set size and each error type, error bands are obtained as the minimum to maximum error across different realisations of the training and validation sets. The minimum error is often taken in machine learning studies as an indication of the best performance that can be delivered by a given algorithm in a given task. The maximum error analogously represents the worst-case scenario.

Finally, we assess the robustness to noise by adding a random perturbation to each output observation used to train the PCE model. The noise is drawn independently for each observation from a univariate Gaussian distribution with zero mean and prescribed standard deviation $\sigma_\epsilon$.
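A sketch of this perturbation step; using the sample mean of the training outputs as the reference scale is an assumption of the sketch (in the synthetic studies below, the true model mean is available and used instead):

```python
import numpy as np

def perturb_outputs(Y, noise_fraction, seed=0):
    """Add independent zero-mean Gaussian noise to each training output,
    with standard deviation set to a fraction of the (sample) output mean."""
    rng = np.random.default_rng(seed)
    sigma_eps = noise_fraction * np.abs(np.mean(Y))
    return Y + rng.normal(0.0, sigma_eps, size=np.shape(Y))
```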

3.2 Ishigami function

The first model we consider is

(17)

where

$$g(x_1, x_2, x_3) = \sin x_1 + a \sin^2 x_2 + b \, x_3^4 \sin x_1 \qquad (18)$$

is the Ishigami function (Ishigami and Homma, 1990), defined for inputs $\boldsymbol{x} \in [-\pi, \pi]^3$. The Ishigami function is often used as a test case in global sensitivity analysis due to its strong non-linearity and non-monotonicity. Model (17) is a rescaled version of (18); rescaling does not affect the approximation accuracy of the PCE, but enables a meaningful evaluation of the rMAE (2) by avoiding values of $Y$ close to $0$.

We model the input as a random vector with marginals , , and C-vine copula with density

Here, the two factors are the densities of the pairwise Gumbel and t copula families defined in Table A.4. Thus, one of the three inputs correlates positively with both of the others; the latter two are also positively correlated, but conditionally independent given the first. The joint PDF of $\boldsymbol{X}$ can be obtained from its marginals and copula through (A.1). The PDF of $Y$ in response to input $\boldsymbol{X}$, its mean $\mu_Y$ and its standard deviation $\sigma_Y$, obtained on sample points, are shown in the left panel of Figure 1.

Figure 1: Response PDFs of the Ishigami function. Left panel: true PDF of the Ishigami model’s response, obtained on sample points by KDE. Central panel: histogram obtained from the output observations used for training (gray bars), true response PDF as in the left panel (black), and PDFs obtained from the aPCEonX (red) and the lPCEonZ (green) by resampling. Right panel: as in the central panel, but for training data perturbed with Gaussian noise (). The blue line indicates the true PDF of the perturbed model.
Figure 2: PCE of Ishigami model: performance. From left to right: rMAE, error on the mean, error on the standard deviation, and Kullback-Leibler divergence of aPCEonX (red) and lPCEonZ (green), for a size of the training set increasing from to . The dash-dotted lines and the bands indicate, respectively, the average and the minimum to maximum errors over simulations. In the second to fourth panels, the blue lines correspond to the empirical estimates obtained from the training data (error bands not shown).
Figure 3: aPCEonX of Ishigami model: robustness to noise (for multiple noise levels). From left to right: rMAE, error on the mean, error on the standard deviation, and Kullback-Leibler divergence of the aPCEonX for an increasing amount of noise: (dark gray), (mid-light gray), and (light gray). The dash-dotted lines and the bands indicate, respectively, the average and the minimum to maximum error over simulations. The red lines, reported from Figure 2 for reference, indicate the mean error obtained for the noise-free data.
Figure 4: aPCEonX of Ishigami model: robustness to noise (w.r.t. sample estimation). From left to right: rMAE, error on the mean, error on the standard deviation, and Kullback-Leibler divergence obtained by aPCEonX (gray) and by direct sample estimation (blue), for noise . The dash-dotted lines and the bands indicate, respectively, the average and the minimum to maximum error over simulations. The red lines, reported from Figure 2 for reference, indicate the mean error obtained for the noise-free data.

Next, we build the aPCEonX and the lPCEonZ on training data, and assess their performance as described in Section 3.1. The errors are shown in Figure 2 (red: aPCEonX; green: lPCEonZ), as a function of the training set size. The dash-dotted line indicates the average error over the repetitions, while the shaded area around it spans the range from the minimum to the maximum error across repetitions. The aPCEonX yields a considerably lower rMAE. This is due to the strong non-linearity of the Rosenblatt transform used by the lPCEonZ to de-couple the components of the input data. Importantly, the methodology works well already in the presence of relatively few data points: the pointwise error and the Kullback-Leibler divergence both drop to small values already for small training sets. The central panel of Figure 1 shows the histogram obtained from the output observations of one training set, the true PDF (black), and the PDFs obtained by resampling from the aPCEonX and the lPCEonZ built on that training set. The statistics of the true response are better approximated by the aPCEonX than by the lPCEonZ or by sample estimation (blue solid lines in Figure 2).

Finally, we examine the robustness of the aPCEonX to noise. We perturb each observation in the training set by adding noise drawn from a Gaussian distribution with zero mean and standard deviation $\sigma_\epsilon$, assigned as a fixed proportion of the model’s true mean $\mu_Y$ (three increasing noise levels are considered, corresponding to fixed proportions of $\sigma_Y$ as well). The results, shown in Figures 3-4, indicate that the methodology is robust to noise. Indeed, the errors of all types are significantly smaller than the magnitude of the added noise, and decrease with increasing sample size (see Figure 3). For instance, even at the strongest noise level the rMAE decreases to low values when sufficiently many training points are used. The error on $\mu_Y$ is only marginally affected, which is expected since the noise is unbiased. More importantly, $\sigma_Y$ and the response PDF can be recovered with high precision even in the presence of strong noise (see also Figure 1, fourth panel). In this case, the PCE estimate of the standard deviation works significantly better than the sample estimate, as illustrated in Figure 4.

3.3 23-bar horizontal truss

We further replicate the analysis carried out in the previous section on a more complex finite element model of a horizontal truss (Blatman and Sudret, 2011). The structure consists of bars connected at upper nodes, and is meters long and meters high (see Figure 5). The bars belong to two different groups (horizontal and diagonal bars), both having uncertain Young modulus and uncertain cross-sectional area , :

where is the univariate lognormal distribution with mean and standard deviation .

Figure 5: Scheme of the horizontal truss model. -bar horizontal truss with bar cross-section and Young modulus (: horizontal and vertical bars, respectively), subject to loads . Modified from Blatman and Sudret (2011).

The four variables can be considered statistically independent, and their values influence the structural response to loading. An additional source of uncertainty comes from six random loads the truss is subject to, one on each upper node. The loads have Gumbel marginal distribution with mean and standard deviation :

(19)

where , , and $\gamma$ is the Euler-Mascheroni constant. In addition, the loads are made mutually dependent through the C-vine copula with density

(20)

where each is the density of the pair-copula between and , , and belongs to the Gumbel-Hougaard family defined in Table A.4, row 11.

The presence of the loads causes a downward vertical displacement $V$ at the mid-span of the structure. $V$ is taken to be the system’s uncertain response to the 10-dimensional random input consisting of the four structural variables and the six loads. The true statistics (mean, standard deviation, PDF) of $V$ are obtained by MCS over sample points, and are shown in the left panel of Figure 6.

Figure 6: Response PDFs of the horizontal truss. Left panel: true PDF of the truss response, obtained on sample points by KDE. Central panel: probability histogram obtained from the output observations used for training (gray bars), true response PDF as in the left panel (black), and PDFs obtained from the aPCEonX (red) and the lPCEonZ (green) by resampling. Right panel: as in the central panel, but for training data perturbed with Gaussian noise (). The blue line indicates the true PDF of the perturbed model.
Figure 7: PCE performance for the horizontal truss. From left to right: rMAE, error on the mean, error on the standard deviation, and Kullback-Leibler divergence of aPCEonX (red) and lPCEonZ (green), for a size of the training set increasing from to . The dash-dotted lines and the bands indicate, respectively, the average and the minimum to maximum errors over simulations. In the second to fourth panels, the blue lines correspond to the empirical estimates obtained from the training data (error bands not shown).

We analyse the system with the same procedure undertaken for the Ishigami model: we build aPCEonX and lPCEonZ on each of training sets of increasing size , and validate their performance. The pointwise error is evaluated on validation sets of fixed size , while the statistical errors are determined by large resampling.

The results are shown in Figure 7. Both PCEs exhibit high performance, yet the aPCEonX yields a significantly smaller pointwise error (first panel). The lPCEonZ yields a better estimate of the standard deviation, yet the empirical estimates obtained from the training data are the most accurate ones in this case.

Having selected the aPCEonX as the better of the two metamodels, we further assess its performance in the presence of noise. We perturb the response values used to train the model by adding Gaussian noise with increasing standard deviation , set to , , and of (equivalent to , , and of , respectively). The results are shown in Figures 8-9. The errors of all types are significantly smaller than the magnitude of the added noise, and decrease with increasing sample size for all noise levels (Figure 8). Also, the PCE estimates are significantly better than the sample estimates (Figure 9; see also Figure 6, fourth panel).

Figure 8: aPCEonX of horizontal truss: robustness to noise (for multiple noise levels). From left to right: rMAE, error on the mean, error on the standard deviation, and Kullback-Leibler divergence of aPCEonX for an increasing amount of noise: (dark gray), (mid-light gray), and (light gray). The dash-dotted lines and the bands indicate, respectively, the average and the minimum to maximum error over simulations. The red lines, reported from Figure 7 for reference, indicate the error for the noise-free data.
Figure 9: aPCEonX of horizontal truss: robustness to noise (w.r.t. sample estimation). From left to right: rMAE, error on the mean, error on the standard deviation, and Kullback-Leibler divergence obtained by aPCEonX (gray) and by direct sample estimation (blue), for noise . The dash-dotted lines and the bands indicate, respectively, the average and the minimum to maximum error over simulations. The red lines, reported from Figure 7 for reference, indicate the mean error obtained for the noise-free data.

3.4 Preliminary conclusion

The results obtained in the previous section allow us to draw some important preliminary conclusions on data-driven PCE. The methodology:

  • delivers reliable predictions of the system response to multivariate inputs;

  • produces reliable estimates of the response statistics if the input dependencies are properly modelled, as done here through copulas (for aPCEonX: a-posteriori);

  • works well already when trained on few observations;

  • deals effectively with noise, thus providing a tool for denoising;

  • involves only a few hyperparameters: the range of degrees allowed for the PCE and the truncation parameters. All have a clear meaning and require little tuning.

In order to build the expansion when the inputs are mutually dependent, we investigated two alternative approaches, labelled as lPCEonZ and aPCEonX. Of the two strategies, aPCEonX appears to be the more effective one in purely data-driven problems. It is worth mentioning, though, that lPCEonZ may provide superior statistical estimates if the joint distribution of the input is known with greater accuracy than in the examples shown here. This was the case, for instance, when we replaced the inferred marginals and copula used to build the lPCEonZ with the true ones (not shown here): in both examples above, we obtained more accurate estimates of $\mu_Y$, $\sigma_Y$, and the response PDF (but not better pointwise predictions) than using aPCEonX.

4 Results on real data sets

We now demonstrate the use of aPCEonX on three different real data sets. The selected data sets were previously analysed by other authors with different machine learning algorithms, which establish here the performance benchmark.

4.1 Analysis workflow

4.1.1 Statistical input model

The considered data sets comprise samples made of multiple input quantities and one scalar output. Adopting the methodology outlined in Section 2, we characterize the multivariate input $\boldsymbol{X}$ statistically by modelling its marginals through KDE, and we then resort to arbitrary PCE to express the output $Y$ as a polynomial of $\boldsymbol{X}$. The basis of the expansion thus consists of polynomials mutually orthonormal with respect to $\prod_{i=1}^M \hat{f}_i(x_i)$, where $\hat{f}_i$ is the marginal inferred for $X_i$.

4.1.2 Estimation of pointwise accuracy

Following the pointwise error assessment procedure carried out in the original publications, for the case studies considered here we assess the method’s performance by cross-validation. Standard $k$-fold cross-validation partitions the data into $k$ subsets, trains the model on $k-1$ of those (the training set), and assesses the pointwise error between the model’s predictions and the observations on the remaining one (the validation set). The procedure is then iterated over all possible combinations of training and validation sets. The final error is computed as the average error over all validation sets. The number $k$ of data subsets is chosen as in the reference studies. Differently from the synthetic models considered in the previous section, the true statistics of the system response are not known here, and the error on their estimates cannot be assessed.

A variation on standard cross-validation consists in performing a $k$-fold cross-validation on each of $m$ random shuffles of the data. The error is then typically reported as the average error obtained across all randomisations, ensuring that the final results are robust to the specific partitioning of the data into subsets. In the following, we refer to a $k$-fold cross-validation performed on $m$ random permutations of the data (i.e., $m$ random $k$-fold partitions) as an $m \times k$-fold randomised cross-validation.
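A sketch of this randomised cross-validation loop using scikit-learn's RepeatedKFold; `fit_pce` and `predict_pce` are placeholders for the PCE training and prediction steps, not library functions:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import mean_absolute_error

def randomised_cv_mae(X, Y, fit_pce, predict_pce, k, m, seed=0):
    """m x k-fold randomised cross-validation; returns mean and minimum MAE
    over all validation sets (X and Y are assumed to be numpy arrays)."""
    maes = []
    splitter = RepeatedKFold(n_splits=k, n_repeats=m, random_state=seed)
    for train, test in splitter.split(X):
        model = fit_pce(X[train], Y[train])
        maes.append(mean_absolute_error(Y[test], predict_pce(model, X[test])))
    return np.mean(maes), np.min(maes)
```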

4.1.3 Statistical estimation

Finally, we estimate for all case studies the response PDF a posteriori (AP) by resampling. To this end, we first model the input dependencies through a C-vine copula, inferred from the data as detailed in Appendix A.3. Afterwards, resampling involves the following steps:

  • sample a set $\mathcal{U}$ of points from the independent uniform distribution on the unit hypercube; we opt for Sobol’ quasi-random low-discrepancy sequences (Sobol’, 1967);

  • map $\mathcal{U}$ onto a set $\mathcal{V}$ with the desired dependence structure by the inverse Rosenblatt transform of the fitted copula;

  • map $\mathcal{V}$ onto $\hat{\mathcal{X}}$ by the inverse probability integral transform of each marginal $\hat{F}_i$. $\hat{\mathcal{X}}$ is a sample set of input observations with the fitted copula and marginals $\hat{F}_i$;

  • evaluate the set $\hat{\mathcal{Y}}$ of PCE responses to the inputs in $\hat{\mathcal{X}}$.

The PDF of $Y$ is then estimated on $\hat{\mathcal{Y}}$ by kernel density estimation.
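A sketch of these four steps plus the final KDE; `inv_rosenblatt`, `marginal_ppfs` and `pce_predict` are placeholders for the fitted copula's inverse Rosenblatt transform, the inverse CDFs of the KDE marginals, and the PCE predictor obtained earlier:

```python
import numpy as np
from scipy.stats import qmc, gaussian_kde

def estimate_output_pdf(dim, inv_rosenblatt, marginal_ppfs, pce_predict,
                        n_samples=2 ** 14, seed=0):
    """Resample the input model and return a KDE estimate of the response PDF."""
    U = qmc.Sobol(d=dim, scramble=True, seed=seed).random(n_samples)  # step 1
    V = inv_rosenblatt(U)                                             # step 2: copula
    X_hat = np.column_stack([ppf(V[:, i])                             # step 3: marginals
                             for i, ppf in enumerate(marginal_ppfs)])
    Y_hat = pce_predict(X_hat)                                        # step 4: PCE responses
    return gaussian_kde(Y_hat)                                        # KDE of the output PDF
```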

4.2 Combined-cycle power plant

The first real data set we consider consists of data points collected from a combined-cycle power plant (CCPP) over six years (2006-2011). The CCPP generates electricity by gas and steam turbines, combined in one cycle. The data comprise four ambient variables and the energy production, measured over time. The four ambient variables are the temperature, the pressure and the relative humidity measured in the gas turbine, and the exhaust vacuum measured in the steam turbine. All five quantities are hourly averages. The data are available online (Lichman, 2013).

The data were analysed in Tüfekci (2014) with neural network (NN) based ML methods to predict the energy output based on the measured ambient variables. The authors assessed the performance of various learners by randomised cross-validation over several random partitions of the data into training and validation sets. The best learner among those tested was a bagging reduced error pruning (BREP) NN, which yielded a mean MAE of 3.22 (see their Table 10, row 4). The lowest MAE of this model over the validation sets, corresponding to the “best” validation set, was 2.82. Besides providing an indicative lower bound on the errors, the minimum gives, when compared to the mean, an indication of the variability of the performance over different partitions of the data. The actual error variance over the validation sets was not provided in the mentioned study.

Model | MAE | min. MAE | mean-min | rMAE (%)
aPCEonX | 3.11 ± 0.03 | 3.05 | 0.06 | 0.68 ± 0.007
BREP-NN (Tüfekci, 2014) | 3.22 (± n.a.) | 2.82 | 0.40 | n.a.
Table 1: Errors on CCPP data. MAE yielded by the aPCEonX (first row) and by the BREP-NN model in Tüfekci (2014) (second row). From left to right: average MAE ± its standard deviation over all validation sets, its minimum (error on the “best” set), difference between the average and the minimum MAE, and rMAE.

We analyze the very same training sets by PCE. The results are reported in Table 1. The average MAE yielded by the aPCEonX is slightly smaller than that of the BREP NN model. More importantly, the difference between the average and the minimum error, calculated over the validation sets, is significantly lower with our approach, indicating a lower sensitivity of the results to the partition of the data, and therefore a higher reliability in the presence of random observations. The average error of the PCE predictions relative to the observed values is below .

Finally, we estimate the PDF of the hourly energy produced by the CCPP following the procedure described in Section 4.1.3. The results are shown in Figure 10. Reliable estimates of the energy PDF aid, for instance, energy production planning and management.

Figure 10: Estimated PDF of the energy produced by the CCPP. The bars indicate the histogram obtained from the observed CCPP energy output. The coloured lines show the PDFs of the PCE metamodels built on the training sets, for input dependencies modelled by C-vines.

4.3 Boston Housing

The second real data set used to validate the PCE-based ML method concerns housing values in the suburbs of Boston, collected in 1970. The data set, downloaded from Lichman (2013), was first published in Harrison and Rubinfeld (1978), and is a known reference in the machine learning and data mining communities.

The data comprise instances, each having attributes. One attribute (the proximity of the neighbourhood to the Charles river) is binary-valued and is therefore disregarded in our analysis. Of the remaining attributes, one is the median housing value of owner-occupied homes in the neighbourhood, in thousands of dollars (MEDV). The remaining attributes are, in order: the per capita crime rate by town (CRIM), the proportion of residential land zoned for lots over 25,000 sq.ft. (ZN), the proportion of non-retail business acres per town (INDUS), the nitric oxides concentration, in parts per 10 million (NOX), the average number of rooms per dwelling (RM), the proportion of owner-occupied units built prior to 1940 (AGE), the weighted distances to five Boston employment centres (DIS), the index of accessibility to radial highways (RAD), the full-value property-tax rate per $10,000 (TAX), the pupil-teacher ratio by town (PTRATIO), the index $B = 1000(Bk - 0.63)^2$, where Bk is the proportion of black residents by town, and the lower status of the population (LSTAT).

The data were analysed in previous studies with different regression methods to predict the median house value MEDV on the basis of the other attributes (Can, 1992; Gilley and Pace, 1995; Quinlan, 1993; R Kelley Pace, 1997). The original publication (Harrison and Rubinfeld, 1978) was concerned with determining whether the demand for clean air affected housing prices. The data were analysed with different supervised learning methods in Quinlan (1993). Among them, the best predictor was shown to be an NN model combined with instance-based learning, yielding an MAE of 2230 $ (rMAE: 12.9%) by cross-validation.

We model the data by PCE and quantify the performance by randomised cross-validation. The results are summarized in Table 2. The errors are comparable to those of the NN model with instance-based learning in Quinlan (1993). While the latter yields the lowest absolute error, the aPCEonX achieves a smaller relative error. In addition, it does not require the fine parameter tuning that affects most NN models. Finally, we estimate the PDF of the median house value as described in Section 4.1.3. The results are shown in Figure 11.

Model | MAE ($) | rMAE (%)
aPCEonX | 2483 ± 337 | 12.6 ± 2.0
NN (Quinlan, 1993) | 2230 (± n.a.) | 12.9 (± n.a.)
Table 2: Errors on Boston Housing data. MAE and rMAE yielded by the aPCEonX (first row) and by the NN model with instance-based learning from Quinlan (1993) (second row).
Figure 11: Estimated PDF of the variable MEDV. The bars indicate the sample PDF, as a histogram obtained using bins from to k (the maximum house value in the data set). The coloured lines show the PDFs of the PCE metamodels built on of the training sets (one per randomisation of the data), for input dependencies modelled by C-vines. The dots indicate the integrals of the estimated PDFs for house values above k.

4.4 Wine quality

The third real data set we consider concerns the quality of wines from the vinho verde region in Portugal. The data set consists of red samples and white samples, collected between 2004 and 2007. The data are available online at http://www3.dsi.uminho.pt/pcortez/wine/. Each wine sample was analysed in the laboratory for eleven physico-chemical parameters: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. In addition, each sample was given a quality score based on blinded sensory tests from three or more sensory assessors. The score is the median of the grades (integers between 0 and 10) assigned by the assessors.

The data were previously analysed in Cortez et al. (2009) to predict the wine quality score on the basis of the physico-chemical parameters. The study compared various ML algorithms, namely multiple regression, different types of NN, and support vector machine (SVM, see Schölkopf and Smola (2002)) regression. SVM regression outperformed the other methods, yielding the lowest MAE, as assessed by means of -fold randomised cross-validation.

We model the data by aPCEonX, and round the predicted wine ratings, which take continuous values, to the closest integer. The performance is then assessed through the same cross-validation procedure used in Cortez et al. (2009). The results are reported in Table 3. The MAE of the aPCEonX is comparable to that of the SVM regressor, and lower than that of the best NN for both wine types.

Model | red wine: MAE | red wine: rMAE (%) | white wine: MAE | white wine: rMAE (%)
aPCEonX | 0.44 ± 0.03 | 8.0 ± 0.6 | 0.50 ± 0.02 | 8.8 ± 0.3
SVM in Cortez et al. (2009) | 0.46 ± 0.00 | n.a. | 0.45 ± 0.00 | n.a.
Best NN in Cortez et al. (2009) | 0.51 ± 0.00 | n.a. | 0.58 ± 0.00 | n.a.
Table 3: Errors on wine data. MAE and rMAE yielded on red and white wine data by the aPCEonX, by the SVM in Cortez et al. (2009), and by the best NN model in Cortez et al. (2009).

Finally, our framework enables the estimation of the PDF of the wine rating as predicted by the PCE metamodels. The resulting PDFs are shown in Figure 12. One could analogously compute the conditional PDFs obtained by fixing any subset of inputs to given values (e.g., the residual sugar or alcohol content, which can be easily controlled in the wine-making process). This may help, for instance, in predicting the wine quality for fixed physico-chemical parameters, or in choosing the latter so as to optimize the wine quality or to minimize its uncertainty. Such an analysis goes beyond the scope of the present work.

Figure 12: Estimated PDFs of the wine rating. For each panel (left: red wine; right: white wine), the grey bars indicate the sample distribution of the median wine quality score assigned by the assessors. The coloured bars show the predicted PDFs obtained by resampling from the PCE metamodels built on of the total training sets, for input dependencies modelled by C-vines.

5 Discussion and conclusions

We proposed an approach to machine learning (ML) that capitalises on polynomial chaos expansion (PCE), an advanced regression technique from uncertainty quantification. PCE is a popular spectral method in engineering applications, where it is often used to replace expensive-to-run computational models subject to uncertain inputs with an inexpensive metamodel that retains the statistics of the output (e.g., moments, PDF). Our paper shows that PCE can also be used as an effective regression model in purely data-driven problems, where only input observations and corresponding system responses - but no computational model of the system - are available.

We tested the performance of PCE on simulated data first, and then on real data by cross-validation. The reference performance measure was the average point-wise error of the PCE metamodel over all test data. The simulations also allowed us to assess the ability of PCE to estimate the statistics of the response (its mean, standard deviation, and PDF) in the considered data-driven scenario. Both the point-wise and the statistical errors of the methodology were low, even when relatively few observations were used to train the model and in the presence of strong noise. The applications to real data showed a performance comparable, and sometimes slightly superior, to that of other ML methods used in previous studies, such as different types of neural networks and support vector machines.

PCE, however, offers several advantages. First, the framework performs well on very different tasks, with only little parameter tuning needed to adapt the methodology to the specific data considered. In fact, only the total degree $p$, the $q$-norm parameter and the interaction degree $r$ are to be specified. As a single analysis takes only a few seconds to a minute to complete on a standard laptop (depending on the size of the data), it is straightforward to loop it over an array of $(p, q, r)$ values, and to retain the PCE with minimum error in the end. This feature distinguishes PCE from the above-mentioned ML methods, which instead are known to be highly sensitive to their hyperparameters and require an appropriate and typically time-consuming calibration (Claesen and Moor, 2015). Indeed, it is worth noting that, in the comparisons we made, all PCE metamodels were built by using the very same procedure and hyperparameter ranges. When compared to the best NNs or SVMs found in other studies, which differed significantly from each other in their construction and structure, the PCE metamodels exhibited a comparable performance.

Second, PCE delivers not only accurate pointwise predictions of the output for any given input, but also statistics thereof in the presence of input uncertainties. This is made possible by combining the PCE metamodel with a proper probabilistic characterization of the input uncertainties through marginal distributions and copulas. The methodology works well also in the presence of several inputs (as tested on problems of dimension up to ) and of sample sets of comparably small size.

Third, the analytical expression of the output yielded by PCE in terms of a simple polynomial of the input makes the model easy to interpret. For instance, its coefficients are directly related to the first and second moments of the output. For independent inputs, Sobol sensitivity indices are also directly encoded in the polynomial coefficients (Sudret, 2008). Sensitivity indices for dependent inputs (e.g., Kucherenko et al. (2012)) may be computed numerically. Other statistics of the output, e.g., its full PDF, can be efficiently estimated by resampling.

Fourth, the polynomial form makes the calibrated metamodel portable to embedded devices (e.g., drones). For this kind of applications, the demonstrated robustness to noise in the data is a particularly beneficial feature.

Fifth and last, PCE needs relatively few data points to attain acceptable performance levels, as shown here on various test cases. This feature demonstrates the validity of PCE metamodelling for problems affected by data scarcity, also when combined with complex vine copula representations of the input dependencies.

One limitation of PCE-based regression as presented here is its difficulty in dealing with data sets of large size or with a large number of inputs. Both features lead to a substantially increased computational cost for fitting the PCE parameters and (if statistical estimation is wanted and the inputs are dependent) for inferring the copula. Various solutions are possible. As for the PCE construction, in the presence of very large training sets the PCE may be initially trained on a subset of the available observations, and subsequently refined by enriching the training set with points in the region where the observed error is larger. Regarding copula inference, which is only needed for an accurate quantification of the prediction uncertainty, a possible solution is to employ a Gaussian copula. The latter involves considerably faster fitting than the more complex vine copulas, and still yielded acceptable performance in our simulations. Alternatively, one may reduce the computational time needed for parameter estimation by parallel computing, as done in Wei et al. (2016).

Finally, the proposed methodology has been demonstrated here on data characterised by continuous input variables only. PCE construction in the presence of discrete data is equally possible, and the Stieltjes orthogonalisation procedure is known to be quite stable in that case (Gautschi, 1982). The a-posteriori quantification of the output uncertainty, however, generally poses a challenge. Indeed, it involves the inference of a copula among discrete random variables, which requires a different construction (Genest and Nešlehová, 2007). Recently, however, methods have been proposed to this end, including inference for R-vines (Panagiotelis et al., 2012; Panagiotelis et al., 2017). Further work is foreseen to integrate these advances with PCE metamodelling.

Acknowledgments

Emiliano Torre gratefully acknowledges financial support from RiskLab, Department of Mathematics, ETH Zurich and from the Risk Center of the ETH Zurich.

References

  • Aas (2016) Aas, K. (2016). Pair-copula constructions for financial applications: A review. Econometrics 4(4), 43.
  • Aas et al. (2009) Aas, K., C. Czado, A. Frigessi, and H. Bakken (2009). Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics 44(2), 182–198.
  • Applegate et al. (2006) Applegate, D. L., R. E. Bixby, V. Chvátal, and W. J. Cook (2006). The Traveling Salesman Problem: A Computational Study. New Jersey: Princeton University Press.
  • Bedford and Cooke (2002) Bedford, T. and R. M. Cooke (2002). Vines – a new graphical model for dependent random variables. The Annals of Statistics 30(4), 1031–1068.
  • Bishop (2009) Bishop, C. (2009). Pattern recognition and machine learning. Springer.
  • Blatman and Sudret (2011) Blatman, G. and B. Sudret (2011). Adaptive sparse polynomial chaos expansion based on Least Angle Regression. J. Comput. Phys 230, 2345–2367.
  • Can (1992) Can, A. (1992). Specification and estimation of hedonic housing price models. Regional Science and Urban Economics 22(3), 453–474.
  • Chan and Elsheikh (2018) Chan, S. and A. H. Elsheikh (2018). A machine learning approach for efficient uncertainty quantification using multiscale methods. Journal of Computational Physics 354, 493–511.
  • Claesen and Moor (2015) Claesen, M. and B. D. Moor (2015). Hyperparameter search in machine learning. arXiv 1502.02127.
  • Cortez et al. (2009) Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47, 547–553.
  • Czado (2010) Czado, C. (2010). Pair-Copula Constructions of Multivariate Copulas, pp. 93–109. Berlin, Heidelberg: Springer Berlin Heidelberg.
  • Dißmann et al. (2013) Dißmann, J., E. C. Brechmann, C. Czado, and D. Kurowicka (2013). Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics and Data Analysis 59, 52–69.
  • Efron et al. (2004) Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of Statistics 32, 407–499.
  • Ernst et al. (2012) Ernst, O. G., A. Mugler, H.-J. Starkloff, and E. Ullmann (2012). On the convergence of generalized polynomial chaos expansions. ESAIM: M2AN 46(2), 317–339.
  • Forman and Cohen (2004) Forman, G. and I. Cohen (2004). Learning from little: Comparison of classifiers given little training. In J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi (Eds.), Knowledge Discovery in Databases: PKDD 2004, Berlin, Heidelberg, pp. 161–172. Springer Berlin Heidelberg.
  • Gautschi (1982) Gautschi, W. (1982). On generating orthogonal polynomials. SIAM J. Sci. Stat. Comput. 3(3), 289–317.
  • Genest and Nešlehová (2007) Genest, C. and J. Nešlehová (2007). A primer on copulas for count data. The Astin Bulletin 37, 475–515.
  • Ghanem and Spanos (1990) Ghanem, R. and P.-D. Spanos (1990). Polynomial chaos in stochastic finite elements. J. Applied Mech. 57, 197–202.
  • Gilley and Pace (1995) Gilley, O. W. and R. K. Pace (1995). Improving hedonic estimation with an inequality restricted estimator. The Review of Economics and Statistics 77(4), 609–621.
  • Haff et al. (2010) Haff, I. H., K. Aas, and A. Frigessi (2010). On the simplified pair-copula construction – simply useful or too simplistic? Journal of Multivariate Analysis 101, 1296–1310.
  • Harrison and Rubinfeld (1978) Harrison, D. and D. L. Rubinfeld (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102.
  • Ishigami and Homma (1990) Ishigami, T. and T. Homma (1990). An importance quantification technique in uncertainty analysis for computer models. In Proc. ISUMA’90, First Int. Symp. Unc. Mod. An, pp. 398–403. University of Maryland.
  • Joe (1996) Joe, H. (1996). Families of m-variate distributions with given margins and m(m-1)/2 bivariate dependence parameters. In L. Rüschendorf, B. Schweizer, and M. D. Taylor (Eds.), Distributions with fixed marginals and related topics, Volume 28 of Lecture Notes–Monograph Series, pp. 120–141. Institute of Mathematical Statistics.
  • Joe (2015) Joe, H. (Ed.) (2015). Dependence modeling with copulas. CRC Press.
  • Kasiviswanathan et al. (2016) Kasiviswanathan, K. S., K. P. Sudheer, and J. He (2016). Quantification of Prediction Uncertainty in Artificial Neural Network Models, pp. 145–159. Cham: Springer International Publishing.
  • Kirk (2014) Kirk, J. (2014). Traveling salesman problem – genetic algorithm.
    https://ch.mathworks.com/matlabcentral/fileexchange/13680-traveling-salesman-problem-genetic-algorithm.
  • Kucherenko et al. (2012) Kucherenko, S., A. Tarantola, and P. Annoni (2012). Estimation of global sensitivity indices for models with dependent variables. Comput. Phys. Comm. 183, 937–946.
  • Kurowicka and Cooke (2005) Kurowicka, D. and R. M. Cooke (2005). Distribution-free continuous Bayesian belief nets. In A. Wilson, N. Limnios, S. Keller-McNulty, and Y. Armijo (Eds.), Modern Statistical and Mathematical Methods in Reliability, Chapter 10, pp. 309–322. World Scientific Publishing.
  • Kurz (2015) Kurz, M. (2015). Vine copulas with matlab.
    https://ch.mathworks.com/matlabcentral/fileexchange/46412-vine-copulas-with-matlab.
  • Lebrun and Dutfoy (2009) Lebrun, R. and A. Dutfoy (2009). Do Rosenblatt and Nataf isoprobabilistic transformations really differ? Prob. Eng. Mech. 24, 577–584.
  • Lichman (2013) Lichman, M. (2013). UCI machine learning repository.
  • Mentch and Hooker (2016) Mentch, L. and G. Hooker (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research 17, 1–41.
  • Morales-Nápoles (2011) Morales-Nápoles, O. (2011). Counting vines. In D. Kurowicka and H. Joe (Eds.), Dependence Modeling: Vine Copula Handbook, Chapter 9, pp. 189–218. World Scientific Publisher Co.
  • Nelsen (2006) Nelsen, R. (2006). An introduction to copulas (Second ed.). Springer Series in Statistics. Springer-Verlag New York.
  • Oladyshkin and Nowak (2012) Oladyshkin, S. and W. Nowak (2012). Data-driven uncertainty quantification using the arbitrary polynomial chaos expansion. Reliability Engineering and System Safety 106, 179–190.
  • Panagiotelis et al. (2012) Panagiotelis, A., C. Czado, and H. Joe (2012). Pair copula constructions for multivariate discrete data. J. Amer. Statist. Assoc. 107, 1063–1072.
  • Panagiotelis et al. (2017) Panagiotelis, A., C. Czado, H. Joe, and J. Stöber (2017). Model selection for discrete regular vine copulas. Computational Statistics & Data Analysis 106, 138–152.
  • Parzen (1962) Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3), 1065–1076.
  • Puig et al. (2002) Puig, B., F. Poirion, and C. Soize (2002). Non-Gaussian simulation using Hermite polynomial expansion: convergences. Prob. Eng. Mech. 17, 253–264.
  • Quinlan (1993) Quinlan, J. R. (1993). Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243.
  • Pace and Gilley (1997) Pace, R. K. and O. W. Gilley (1997). Using the spatial configuration of the data to improve estimation. Journal of Real Estate Finance and Economics 14(3), 333–340.
  • Rosenblatt (1952) Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics 23, 470–472.
  • Rosenblatt (1956) Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27(3), 832–837.
  • Saltelli et al. (2000) Saltelli, A., K. Chan, and E. Scott (Eds.) (2000). Sensitivity analysis. J. Wiley & Sons.
  • Schepsmeier (2015) Schepsmeier, U. (2015). Efficient information based goodness-of-fit tests for vine copula models with fixed margins. Journal of Multivariate Analysis 138(C), 34–52.
  • Schölkopf and Smola (2002) Schölkopf, B. and A. Smola (2002). Learning with kernels. MIT Press.
  • Sklar (1959) Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris 8, 229–231.
  • Snoek et al. (2012) Snoek, J., H. Larochelle, and R. P. Adams (2012). Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS’12, USA, pp. 2951–2959. Curran Associates Inc.
  • Sobol’ (1993) Sobol’, I. (1993). Sensitivity estimates for nonlinear mathematical models. Mathematical Modeling & Computational Experiment 1, 407–414.
  • Sobol’ (1967) Sobol’, I. M. (1967). On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics 7(4), 86–112.
  • Soize and Ghanem (2004) Soize, C. and R. Ghanem (2004). Physical systems with random uncertainties: chaos representations with arbitrary probability measure. SIAM J. Sci. Comput. 26(2), 395–410.
  • Stöber et al. (2013) Stöber, J., H. Joe, and C. Czado (2013). Simplified pair copula constructions — limitations and extensions. Journal of Multivariate Analysis 119, 101–118.
  • Stuart and Ord (1994) Stuart, A. and K. Ord (1994). Kendall’s advanced theory of statistics Vol. 1 – Distribution theory. Arnold.
  • Sudret (2008) Sudret, B. (2008). Global sensitivity analysis using polynomial chaos expansions. Reliab. Eng. Sys. Safety 93, 964–979.
  • Sudret et al. (2015) Sudret, B., S. Marelli, and C. Lataniotis (2015). Sparse polynomial chaos expansions as a machine learning regression technique. International Symposium on Big Data and Predictive Computational Modeling, invited lecture.
  • Todor and Schwab (2007) Todor, R.-A. and C. Schwab (2007). Convergence rates for sparse chaos approximations of elliptic problems with stochastic coefficients. IMA J. Numer. Anal 27, 232–261.
  • Torre et al. (2017) Torre, E., S. Marelli, P. Embrechts, and B. Sudret (2017). A general framework for data-driven uncertainty quantification under complex input dependencies using vine copulas. arXiv 1709.08626. Under revision in Probabilistic Engineering Mechanics.
  • Tüfekci (2014) Tüfekci, P. (2014). Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems 60, 126–140.
  • Vapnik (1995) Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
  • Wan and Karniadakis (2006) Wan, X. and G. E. Karniadakis (2006). Beyond Wiener-Askey expansions: handling arbitrary PDFs. J. Sci. Comput. 27, 455–464.
  • Wei et al. (2016) Wei, Z., D. Kim, and E. M. Conlon (2016). Parallel computing for copula parameter estimation with big data: A simulation study. arXiv 1609.05530.
  • Witten et al. (2016) Witten, I. H., E. Frank, M. A. Hall, and C. J. Pal (2016). Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques (4th ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
  • Xiu and Karniadakis (2002) Xiu, D. and G. Karniadakis (2002). The Wiener-Askey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput. 24(2), 619–644.

Appendix A Mutually dependent inputs modelled through copulas

In Section 2.3 we illustrated PCE for an input vector with independent, uniformly distributed components. That vector was obtained from the true input $\bm{X}$ either by assuming that the latter had independent components as well, in which case each component could be mapped separately through its own marginal distribution, or by assuming that a transformation performing this mapping was available.

This section recalls known results in probability theory that allow one to specify the joint distribution of $\bm{X}$ in terms of its marginal distributions and of a dependence function called the copula of $\bm{X}$. We focus in particular on regular vine copulas (R-vines), for which the needed transformation can be computed numerically. R-vines provide a flexible class of dependence models for $\bm{X}$. This will prove beneficial to the ability of the PCE models to predict statistics of the output accurately, compared to simpler dependence models. However, and perhaps counter-intuitively, the pointwise prediction error may not decrease accordingly (see Section 2.1 for details), especially when the underlying model is highly non-linear. Examples on simulated data and a discussion are provided in Section 3.

A.1 Copulas and Sklar’s theorem

An $M$-copula is defined as an $M$-variate joint CDF $C$ on $[0,1]^M$ with uniform marginals in the unit interval, that is,

$C(1, \dots, 1, u_i, 1, \dots, 1) = u_i$ for all $i = 1, \dots, M$ and all $u_i \in [0, 1]$.

Sklar’s theorem (Sklar, 1959) guarantees that any $M$-variate joint CDF can be expressed in terms of its marginals and of a copula, specified separately.

Theorem (Sklar).

For any $M$-variate CDF $F_{\bm{X}}$ with marginals $F_1, \dots, F_M$, an $M$-copula $C_{\bm{X}}$ exists such that

$F_{\bm{X}}(\bm{x}) = C_{\bm{X}}\big(F_1(x_1), \dots, F_M(x_M)\big)$ for all $\bm{x} \in \mathbb{R}^M$.   (A.1)

Moreover, $C_{\bm{X}}$ is unique on $\mathrm{Ran}(F_1) \times \dots \times \mathrm{Ran}(F_M)$, where $\mathrm{Ran}$ is the range operator. In particular, $C_{\bm{X}}$ is unique on $[0,1]^M$ if all $F_i$ are continuous, and it is given by

$C_{\bm{X}}(\bm{u}) = F_{\bm{X}}\big(F_1^{-1}(u_1), \dots, F_M^{-1}(u_M)\big)$, $\bm{u} \in [0,1]^M$.   (A.2)

Conversely, for any $M$-copula $C$ and any set of univariate CDFs $F_1, \dots, F_M$, the function $F$ defined by

$F(\bm{x}) = C\big(F_1(x_1), \dots, F_M(x_M)\big)$   (A.3)

is an $M$-variate CDF with marginals $F_1, \dots, F_M$.
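For instance, for a bivariate Gaussian vector the relations (A.1)-(A.2) can be verified numerically. The following minimal Python sketch (the correlation value and all names are illustrative, not taken from this work) evaluates the copula through (A.2) and checks it against the joint CDF:

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    rho = 0.6
    joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

    def copula_cdf(u1, u2):
        """Copula obtained from the joint CDF via (A.2): C(u) = F(F1^{-1}(u1), F2^{-1}(u2))."""
        return joint.cdf([norm.ppf(u1), norm.ppf(u2)])

    # Sklar's theorem (A.1): the joint CDF equals the copula evaluated at the marginal CDFs.
    x1, x2 = 0.3, -0.5
    print(joint.cdf([x1, x2]))                     # F(x1, x2)
    print(copula_cdf(norm.cdf(x1), norm.cdf(x2)))  # C(F1(x1), F2(x2)), same value up to numerics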

Throughout this study it is assumed that $\bm{X}$ has continuous marginals $F_i$. The relation (A.3) allows one to model any multivariate CDF by modelling separately its univariate marginal CDFs and a copula function $C$. One first models the marginals $F_i$, then transforms each component $X_i$ into a uniform random variable $U_i$ by the so-called probability integral transform (PIT)

$U_i = F_i(X_i)$, $i = 1, \dots, M$.   (A.4)

Finally, the copula of $\bm{X}$ is obtained as the joint CDF of $\bm{U} = (U_1, \dots, U_M)$. The copula models the dependence properties of the random vector. For instance, mutual independence is achieved by using the independence copula

$\Pi(\bm{u}) = \prod_{i=1}^{M} u_i$.   (A.5)

Table A.4 provides a list of different parametric families of pair copulas implemented in the VineCopulaMatlab toolbox (Kurz, 2015), which was also used in this study. Details on copula theory and on various copula families can be found in Nelsen (2006) and in Joe (2015).
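As a concrete illustration of how a single pair copula encodes dependence, a Clayton pair copula can be sampled by conditional inversion. The following minimal Python sketch is not part of the toolbox mentioned above; the parameter value and all names are illustrative:

    import numpy as np

    def sample_clayton(n, theta, seed=None):
        """Draw n samples (u1, u2) in [0, 1]^2 from a Clayton pair copula (theta > 0)."""
        rng = np.random.default_rng(seed)
        u1 = rng.uniform(size=n)
        w = rng.uniform(size=n)  # plays the role of the conditional CDF C(u2 | u1)
        # Analytical inversion of the Clayton conditional CDF:
        u2 = (u1 ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        return np.column_stack([u1, u2])

    uv = sample_clayton(1000, theta=2.0, seed=42)  # exhibits lower-tail dependence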

Sklar’s theorem can be re-stated in terms of probability densities. If $\bm{X}$ admits a joint PDF $f_{\bm{X}}$ and a copula density $c_{\bm{X}} = \partial^M C_{\bm{X}} / \partial u_1 \cdots \partial u_M$, then

$f_{\bm{X}}(\bm{x}) = c_{\bm{X}}\big(F_1(x_1), \dots, F_M(x_M)\big) \prod_{i=1}^{M} f_i(x_i)$,   (A.6)

where $f_i$ denotes the marginal PDF of $X_i$.

Once all marginal PDFs and the corresponding CDFs $F_i$ have been determined (see Section 2.3.2), each data point is mapped onto the unit hypercube by the PIT (A.4), obtaining a transformed data set of pseudo-observations of $\bm{U}$. The copula of $\bm{X}$ can then be inferred from these pseudo-observations.
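A minimal Python sketch of this step is given below; it uses rank-based (empirical) marginals for simplicity, whereas parametric or kernel-smoothed marginals can be used instead, and all names are illustrative:

    import numpy as np
    from scipy.stats import rankdata

    def pseudo_observations(x):
        """Map an (N, M) data matrix to pseudo-observations in (0, 1)^M via the PIT (A.4)."""
        x = np.asarray(x, dtype=float)
        n = x.shape[0]
        # Empirical PIT: column-wise ranks rescaled to the open unit interval.
        return np.column_stack([rankdata(col) / (n + 1.0) for col in x.T])

    rng = np.random.default_rng(0)
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=500)
    u = pseudo_observations(np.exp(z))  # copula inference is then carried out on u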

A.2 Vine copulas

In high dimension $M$, specifying an $M$-copula which properly describes all pairwise and higher-order input dependencies may be challenging. Multivariate extensions of pair-copula families (e.g., Gaussian or Archimedean copulas) are often inadequate when $M$ is large. In Joe (1996) and later in Bedford and Cooke (2002) and Aas et al. (2009), an alternative construction obtained by multiplying pair copulas (2-copulas) was introduced. Copula models built in this way are called vine copulas. Here we briefly introduce the vine copula formalism, referring to the cited references for details.

Let $\bm{x}_{-i}$ be the vector obtained from $\bm{x} = (x_1, \dots, x_M)$ by removing its $i$-th component, i.e., $\bm{x}_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_M)$. Similarly, let $\bm{x}_{-ij}$ be the vector obtained by removing the $i$-th and $j$-th components, and so on. For a general subset of indices $\mathcal{D} \subset \{1, \dots, M\}$, the subvector $\bm{x}_{\mathcal{D}}$ (containing the components indexed by $\mathcal{D}$) and the complementary vector $\bm{x}_{-\mathcal{D}}$ are defined analogously. Also, $F_{\mathcal{A}|\mathcal{B}}$ and $f_{\mathcal{A}|\mathcal{B}}$ indicate in the following the joint CDF and PDF, respectively, of the random vector $\bm{X}_{\mathcal{A}}$ conditioned on $\bm{X}_{\mathcal{B}} = \bm{x}_{\mathcal{B}}$, where $\mathcal{A}$ and $\mathcal{B}$ form a partition of $\{1, \dots, M\}$, i.e., $\mathcal{A} \cup \mathcal{B} = \{1, \dots, M\}$ and $\mathcal{A} \cap \mathcal{B} = \emptyset$.

Using (A.6), $f_{\bm{X}}$ can be expressed as

$f_{\bm{X}}(\bm{x}) = f_j(x_j)\; c_{-j|j}\big(F_{i|j}(x_i|x_j),\ i \ne j\big) \prod_{i \ne j} f_{i|j}(x_i|x_j)$,   (A.7)

where $c_{-j|j}$ is an $(M-1)$-copula density, namely that of the random variables $X_i$, $i \ne j$, conditioned on $X_j = x_j$, and $f_{i|j}$ is the conditional PDF of $X_i$ given $X_j = x_j$, $i \ne j$. Following Joe (1996), the univariate conditional CDFs $F_{i|\mathcal{D}}$ can be further expressed in terms of any conditional pair copula between $X_i$ and $X_j$, $j \in \mathcal{D}$:

$F_{i|\mathcal{D}}(x_i|\bm{x}_{\mathcal{D}}) = \dfrac{\partial C_{ij|\mathcal{D} \setminus j}\big(F_{i|\mathcal{D} \setminus j}(x_i|\bm{x}_{\mathcal{D} \setminus j}),\, F_{j|\mathcal{D} \setminus j}(x_j|\bm{x}_{\mathcal{D} \setminus j})\big)}{\partial F_{j|\mathcal{D} \setminus j}(x_j|\bm{x}_{\mathcal{D} \setminus j})}$.   (A.8)

An analogous relation readily follows for conditional densities:

$f_{i|\mathcal{D}}(x_i|\bm{x}_{\mathcal{D}}) = c_{ij|\mathcal{D} \setminus j}\big(F_{i|\mathcal{D} \setminus j}(x_i|\bm{x}_{\mathcal{D} \setminus j}),\, F_{j|\mathcal{D} \setminus j}(x_j|\bm{x}_{\mathcal{D} \setminus j})\big)\, f_{i|\mathcal{D} \setminus j}(x_i|\bm{x}_{\mathcal{D} \setminus j})$.   (A.9)
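For instance, for a Gaussian pair copula with parameter $\rho$, the conditional CDF appearing in (A.8) (often called the h-function) has the closed form reported in Aas et al. (2009):

$h(u|v; \rho) = \Phi\left( \dfrac{\Phi^{-1}(u) - \rho\, \Phi^{-1}(v)}{\sqrt{1 - \rho^2}} \right)$,

where $\Phi$ denotes the standard normal CDF.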

Substituting (A.8)-(A.9) iteratively into (A.7), Bedford and Cooke (2002) expressed $f_{\bm{X}}$ as a product of $M(M-1)/2$ pair copula densities multiplied by the marginal densities $f_1, \dots, f_M$. Recalling (A.6), it readily follows that the associated joint copula density $c_{\bm{X}}$ can be factorised into $M(M-1)/2$ pair copula densities. Copulas expressed in this way are called vine copulas.
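As an illustration, in dimension $M = 3$ one possible factorisation (a D-vine with ordering 1-2-3, following Aas et al. (2009)) reads

$f_{\bm{X}}(x_1, x_2, x_3) = f_1(x_1)\, f_2(x_2)\, f_3(x_3)\; c_{12}\big(F_1(x_1), F_2(x_2)\big)\; c_{23}\big(F_2(x_2), F_3(x_3)\big)\; c_{13|2}\big(F_{1|2}(x_1|x_2), F_{3|2}(x_3|x_2)\big)$,

which involves $3 \cdot 2 / 2 = 3$ pair copula densities.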

The factorisation is not unique: the pair copulas involved in the construction depend on the variables chosen in the conditioning equations (A.8)-(A.9) at each iteration. To organise them, Bedford and Cooke (2002) introduced a graphical model called the regular vine (R-vine). An R-vine among $M$ random variables is represented by a graph consisting of $M-1$ trees $T_1, \dots, T_{M-1}$, where each tree $T_i$ consists of a set of nodes $N_i$ and a set of edges $E_i$ between the nodes. The trees satisfy the following three conditions:

  1. Tree $T_1$ has nodes $N_1 = \{1, \dots, M\}$ and edges $E_1$;

  2. for $i = 2, \dots, M-1$, the nodes of $T_i$ are the edges of $T_{i-1}$: