Global sensitivity analysis for statistical model parameters


Joseph Hart, Department of Mathematics, North Carolina State University, Raleigh, NC
Julie Bessac, Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL
Emil Constantinescu, Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL
Abstract

Global sensitivity analysis (GSA) is frequently used to analyze the influence of uncertain parameters in mathematical models and simulations. In principle, tools from GSA may be extended to analyze the influence of parameters in statistical models. Such analyses may enable reduced or parsimonious modeling and greater predictive capability. However, difficulties such as parameter correlation, model stochasticity, multivariate model output, and unknown parameter distributions prohibit a direct application of GSA tools to statistical models. We introduce a novel framework that addresses these difficulties and enables GSA for statistical model parameters. Theoretical and computational properties are considered and illustrated on a synthetic example. The framework is applied to a Gaussian process model from the literature, which depends on 95 parameters. Non-influential parameters are discovered through GSA and a reduced model with equal or stronger predictive capability is constructed by using only 79 parameters.

1 Introduction

Global sensitivity analysis (GSA) aims to quantify the relative importance of input variables or factors in determining the value of a function [1, 2]. It has been used widely for analysis of parameter uncertainty in mathematical models and simulations [1, 3]. In particular, GSA may be used to improve modeling insight, encourage model parsimony, and accelerate the model-fitting process. In this paper we propose a novel method for GSA of parameters in statistical models. To our knowledge, the GSA tools developed for mathematical models and simulations have not been systematically extended to the analysis of statistical models. The combination of parameter correlation, model stochasticity, multivariate model output, and unknown parameter distributions prohibits a direct application of GSA tools to statistical models. Nevertheless, problem structure in statistical models may be exploited to enable efficient GSA. This paper provides a framework to use existing GSA tools along with tools from statistics to address these challenges and yield a powerful new GSA tool for analysis of statistical models.

This work is motivated by a statistical model that fuses two datasets of atmospheric wind speed in order to provide statistical prediction of wind speed in space and time [4]. The predictions are generated from a Gaussian process whose mean and covariance are parameterized through a large number of parameters. Our objective in this application is to reduce the dimension of the parameter space, making the model easier to fit and interpret. The GSA method is developed in an abstract setting and subsequently used to analyze this Gaussian process model.

There are a variety of methods under the umbrella of GSA; the most common are variance-based methods [5, 6, 7, 8]. For reasons of computational efficiency, derivative-based methods [9, 10, 11, 12, 13, 14, 15] have also gained attention recently; they are related to the classical Morris method [16]. Theoretical challenges in the aforementioned methods have motivated interest in alternatives such as moment-independent importance measures [17, 18, 19], Shapley effects [20, 21, 22, 23], and dependence measures [24]. In this paper we propose a derivative-based strategy for GSA of statistical models. In principle, any of the aforementioned methods may be used; computational considerations make derivative-based methods preferable. In particular, the number of samples required is independent of the parameter space dimension, and derivative-based methods do not require sampling from conditional distributions.

To perform GSA, a probability distribution must be defined on the input space, and the analysis is done with respect to it. GSA has been well developed and studied for problems where the inputs are independent. In many statistical models, however, the inputs (parameters) are correlated, thus posing additional challenges to traditional GSA tools. Developing GSA tools for problems with correlated inputs is an active area of research [25, 26, 27, 28, 29, 30, 18, 31, 19, 21, 17, 23, 20, 22]. In addition to the theoretical and computational challenges posed by input correlations, simply defining or fitting a joint distribution on the inputs may be challenging in the context of statistical models.

The traditional GSA framework has focused on real-valued deterministic functions with uncertain inputs. A space-time Gaussian process, as in our motivating application, is not a deterministic real-valued function but rather a stochastic vector-valued function, i.e. , where is a set of random vectors. Real-valued stochastic processes are considered in [32, 33] and vector-valued deterministic models in [34, 35]; generalizing GSA tools to stochastic or vector-valued functions is also an active area of research.

This article provides a framework to connect the mathematical and statistical tools needed to extend GSA to statistical models. We use the loss function associated with the statistical model estimation to define a joint probability distribution, which respects the correlation structure in the problem. This distribution is sampled using a Markov Chain Monte Carlo method, and derivative-based sensitivity indices are computed from these samples. In this framework, we are able to discover both sensitivities and correlation structures without requiring a priori knowledge about the parameters.

We provide definitions and construct the abstract framework in Section 2. Section 3 details the computational methodology and summarizes the proposed method. In Section 4, we present numerical results for a synthetic test problem and for our motivating application. Section 5 provides a brief summary of our conclusions.

2 Preliminaries

Let be a statistical model defined through parameters ,
. Let be a loss function (or cost function) associated with ; that is, to fit to the data , one computes . In the model-oriented context of statistics, parameters of the model are usually estimated by minimizing computed on observed data . The loss function is chosen given the model and its intended use. Common examples are least-squares and maximum likelihood. For simplicity, will be used instead of for the remainder of the paper. In our motivating application, is a Gaussian process with mean and covariance depending on a vector of parameters and is the negative log likelihood.
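
For concreteness, the sketch below shows one loss function of this type: the negative log likelihood of a zero-mean Gaussian process with a squared-exponential covariance. The parameterization (signal variance, length scale, nugget) and data are illustrative only and are not the model of [4].

```python
import numpy as np

def sq_exp_cov(x, theta):
    """Squared-exponential covariance matrix; theta = (signal variance, length scale, nugget)."""
    var, ell, nugget = theta
    d = x[:, None] - x[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2) + nugget * np.eye(len(x))

def neg_log_likelihood(theta, x, y):
    """Loss L(theta): negative log likelihood of data y under a zero-mean GP."""
    K = sq_exp_cov(x, theta)
    chol = np.linalg.cholesky(K)
    Kinv_y = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    return 0.5 * y @ Kinv_y + np.log(np.diag(chol)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

# Example: evaluate the loss at a candidate parameter vector on synthetic data
x = np.linspace(0, 1, 50)
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
print(neg_log_likelihood(np.array([1.0, 0.2, 0.05]), x, y))
```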

We are interested in the global sensitivity of with respect to . Since may be a complex mathematical object, we propose to analyze the global sensitivity of to through the global sensitivity of to . This makes our analysis dependent on the choice of loss function, which is consistent with a goal-oriented choice of . This is appropriate since encodes the dependence of on . Further, since is a deterministic real-valued function of , it is frequently easier to analyze than .

One challenge is that the statistical model may be mathematically well defined for parameters but yield a practically irrelevant solution in the context of a given application. To avoid this scenario, we let be the subset that restricts to parameters yielding relevant solutions. For instance, a quantity in may be required to be nonnegative so restricts to parameters respecting this constraint. We assume that is a Lebesgue measurable set; this is easily verified in most applications.

We make three assumptions about ; formally, they are expressed as the following.

  1. is differentiable.

  2. such that , , and , , exist and are finite.

  3. such that , .

Assumption I is necessary since we seek to use a derivative-based GSA. This assumption is easily verifiable in most cases. Assumption II is needed so that the global sensitivity indices (3) are well defined. Assumption III is a technical assumption requiring that the loss function be bounded below. Note that if is continuously differentiable and is compact, then all three assumptions follow immediately; this is a common case.

To define global sensitivity indices, we must specify a probability measure to integrate against. Let

(1)

for some and ; is the characteristic function of a set, and is the Euclidean norm. Note that is defined through constraints on so it is generally difficult to express in terms of simple algebraic constraints. In most cases, however, the constraints may be checked when is evaluated, and hence is easily evaluated through evaluating .

From Assumption II and the fact that , it follows that is integrable. We define the probability density function (PDF) as

(2)

Then is supported on and gives the greatest probability to regions where is close to its minimum, namely, where is a good fit. A PDF of this form corresponds to a Gibbs measure [36] with temperature ; the temperature determines how the probability mass disperses from the modes. The scalar is a regularization factor that aids when is too heavy tailed; this is illustrated in Section 4. The determination of and is considered in Section 3. Note that determining a PDF for correlated parameters may be challenging in general. This framework provides a natural way to define the PDF, but it comes at the cost of needing to determine and .
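
A minimal sketch of evaluating the unnormalized density (1) follows. It assumes the regularization enters as a quadratic penalty on the Euclidean norm of the parameters and that membership in the admissible set is checked by a user-supplied feasibility test; both are assumptions, and the exact form in (1) may differ.

```python
import numpy as np

def log_unnormalized_density(theta, loss, T, alpha, is_feasible):
    """Log of the unnormalized Gibbs density (1).

    Assumed form: chi_U(theta) * exp(-(L(theta) + alpha * ||theta||^2) / T), where the
    quadratic penalty is an assumed shape of the regularization term and `is_feasible`
    plays the role of the indicator of the admissible set U.  Returns -inf outside U;
    working on the log scale avoids under/overflow during MCMC.
    """
    if not is_feasible(theta):
        return -np.inf
    return -(loss(theta) + alpha * np.dot(theta, theta)) / T
```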

Definition 1.

Let the sensitivity index of with respect to be defined as

(3)

The reasoning behind (3) is that the expected partial derivative, , is a measure of global sensitivity that depends on the units of . This dependence may create challenges because parameters on multiple scales are difficult to compare. Scaling by alleviates this problem, yielding scale-invariant global sensitivity indices. For the subsequent development we assume that the gradient is available; finite-difference approximations may be used otherwise.
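
As an illustration, the estimator below computes one scale-invariant variant of such an index from samples of (2): the expected absolute partial derivative scaled by the sample standard deviation of each parameter. The precise scaling used in (3) may differ; this is a sketch of the general idea, not the exact formula.

```python
import numpy as np

def sensitivity_indices(samples, grads):
    """Estimate derivative-based sensitivity indices from MCMC output.

    samples : (N, p) array of draws from the PDF (2)
    grads   : (N, p) array with grads[n, i] = dL/dtheta_i evaluated at samples[n]

    Assumed (illustrative) form: S_i = std(theta_i) * E[|dL/dtheta_i|],
    a scale-invariant variant of the expected partial derivative.
    """
    return samples.std(axis=0) * np.abs(grads).mean(axis=0)
```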

Correlations in make Monte Carlo integration with uniform sampling intractable for computing the ’s. Importance sampling may be used if an efficient proposal distribution is found; however, this is also challenging in most cases. Therefore, we propose to compute the ’s with Markov Chain Monte Carlo (MCMC) methods.

In summary, the global sensitivity of to may be estimated by using only evaluations of and along with MCMC. This framework also admits additional useful information as by-products of estimating (3). More details are given in Section 3.

3 Computing sensitivities

In this section we present the main result of this study. The proposed method may be partitioned into three stages:

  1. Preprocessing where we collect information about the loss function,

  2. Sampling where samples are drawn from the probability measure (2),

  3. Post-processing where sensitivities as well as additional information are computed.

In the preprocessing stage we seek to find characteristic values for the parameters and the loss function. These characteristic values are used to determine the temperature and regularization factor in the PDF (2).

In the sampling stage we first determine the temperature and regularization factor. Subsequently an MCMC sampler is run to collect samples from (2).

In the post-processing stage we compute sensitivities by evaluating the gradient of the loss function at the samples drawn in the sampling stage. In addition, the robustness of the sensitivities with respect to perturbations in the temperature and the parameter correlations are extracted from the existing samples and gradient evaluations. These two pieces of information are by-products of computing sensitivities and require no additional computation.

These three stages are described in Subsections 3.1, 3.2, and 3.3, respectively. The method is summarized as a whole in Subsection 3.4.

3.1 Preprocessing stage

Characteristic magnitudes for and are needed to determine the regularization factor and temperature. To this end we introduce two auxiliary computations as a preprocessing step.

The first auxiliary computation runs an optimization routine to minimize ; the choice of optimizer is not essential here. Let be the minimizing parameter vector. For our purposes it is acceptable if is not the global minimizer of because we only need to capture characteristic magnitudes.

The second auxiliary computation uses to determine the range of loss function values that our MCMC sampler should explore. Let , and let be a random vector defined by

where all the ’s are independent of one another and denotes the uniform distribution. Hence, represents uniform uncertainty of about .

Determining is an application-dependent problem. In fact, its determination is the only portion of our proposed method that cannot be automated. To choose , we suggest fixing a value for , sampling from , and assessing the quality of ’s predictions using the sample. Repeating this sample and assessment process for various values of allows the user to determine a that yields reasonable predictions. This step is highly subjective and application dependent; however, it is a very natural means of inserting user specification. One simple way to do this is visualizing the model output for each sample and increasing until the outputs become unrealistic.

Taking large values for will result in the PDF giving significant probability to regions of the parameter space yielding poor fits, and hence sensitivity indices that are not useful. Taking small values for will result in the PDF giving significant probability to regions of the parameter space near local minima, thus making the sensitivity indices local. Since the choice of is strongly user dependent, the robustness of the sensitivity indices with respect to perturbations in is highly relevant; this is indirectly addressed by Theorem 2.

Once is specified, a threshold , which is used to compute the regularization factor and temperature (see Subsections 3.2.1 and 3.2.2), may be easily computed via Monte Carlo integration. We define the threshold

(4)

Note that the expectation in (4) is computed with respect to the independent uniform measure; all other expectations in the paper are computed with respect to the PDF (2).
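
A sketch of this Monte Carlo estimate is given below. It assumes the uniform uncertainty is applied componentwise within a relative half-width about the minimizer; the function and argument names are illustrative, not those of the paper.

```python
import numpy as np

def loss_threshold(loss, theta_star, delta, n_samples=1000, seed=0):
    """Monte Carlo estimate of the threshold (4).

    Each component is perturbed uniformly and independently within
    +/- delta * |theta_star_i| (an assumed form of the uniform uncertainty
    about theta_star).  Returns the estimate and its standard error.
    """
    rng = np.random.default_rng(seed)
    half_width = delta * np.abs(theta_star)
    thetas = theta_star + rng.uniform(-1.0, 1.0, (n_samples, len(theta_star))) * half_width
    values = np.array([loss(t) for t in thetas])
    return values.mean(), values.std() / np.sqrt(n_samples)
```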

3.2 Sampling stage

We use an MCMC method to sample from (2) through evaluations of the unnormalized density (1). Then the ’s may be computed through evaluations of at the sample points. Many MCMC methods may be used to sample ; see, for example [37, 38, 39, 40, 41, 42].

Determining which MCMC method to use and when it has converged may be challenging. Convergence diagnostics [43, 44] have been developed that may identify when the chain has not converged; however, they all have limitations and cannot ensure convergence [45]. In Section 4, adaptive MCMC [39] is used with the convergence diagnostic from [46].
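
For illustration, a plain random-walk Metropolis sampler targeting the log of the unnormalized density (1) is sketched below; it is a simple stand-in for the adaptive sampler [39] used in Section 4 and can be replaced by any MCMC implementation.

```python
import numpy as np

def random_walk_metropolis(log_density, theta0, n_steps, step_size, seed=0):
    """Draw samples from the PDF (2) given the log of the unnormalized density (1).

    A fixed-scale random-walk proposal is used here for simplicity; the paper's
    experiments use an adaptive sampler [39] instead.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    logp = log_density(theta)
    chain = np.empty((n_steps, theta.size))
    n_accept = 0
    for n in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.size)
        logp_prop = log_density(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:   # Metropolis accept/reject
            theta, logp = proposal, logp_prop
            n_accept += 1
        chain[n] = theta
    return chain, n_accept / n_steps
```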

Assuming that an MCMC sampler is specified, we focus on determining the temperature and regularization factors in Subsections 3.2.1 and 3.2.2, respectively.

3.2.1 Determining the regularization factor

To determine the regularization factor , consider the function

The PDF gives greatest probability to regions where is small. If is small relative to , then the local minima of are near the local minima of . Ideally we would like , but in some cases this results in being too heavy tailed. Instead we may require that for some ; that is, the regularization term contributes percent of the value of . Setting and replacing and with and , we get

(5)

In practice we suggest beginning with . If the MCMC sampler yields heavy-tailed distributions that converge slowly, then may be increased to aid the convergence. This case is illustrated in Section 4.

3.2.2 Determining the temperature

To determine the temperature , we first define

We seek to find so that with probability ; is suggested to mitigate wasted computation in regions where yields a poor fit. Let . We note that is a Lebesgue measurable set since is continuous and is Lebesgue measurable. We define the function by

(6)

Then gives the probability that . The optimal temperature is the solution of . Four results are given below showing that possesses advantageous properties making the nonlinear equation easily solvable. The proofs of the following propositions are given in the appendix.

Proposition 1.

If , then is differentiable on with

(7)
Proposition 2.

If , then is a strictly increasing function on .

Propositions 1 and 2 yield desirable properties of . The assumption that
is integrable is necessary for to be well defined. Note that this assumption follows from Assumption I when is bounded. Theorem 1 and Corollary 1 below give existence and uniqueness, respectively, for the solution of under mild assumptions.

Theorem 1.

If is a bounded set and such that , then such that .

Corollary 1.

If , is a bounded set, and such that
, then admits a unique solution.

The assumption that is bounded is reasonable in most applications; may be unbounded, but is restricted to relevant solutions that will typically be bounded. The assumption that means that is not chosen as the global minimum, which should always hold in practice. The assumption that is necessary for existence. Typically is much less than 1, while is chosen close to 1. The assumptions of Theorem 1 and Corollary 1 hold in most applications.

In summary, under mild assumptions is a scalar nonlinear equation admitting a unique solution and possessing nice properties (monotonicity and differentiability). Further, and may be approximated simultaneously by running MCMC. The challenge is that evaluating and in high precision requires running a long MCMC chain. In fact, is significantly more challenging to evaluate than . For this reason we suggest using derivative-free nonlinear solvers, which will still be efficient since is a well-behaved function. In the spirit of inexact Newton methods [47], shorter chains may be run for the early iterations solving and the precision increased near the solution. In practice, relatively few evaluations of are needed because of the properties shown above.
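
A sketch of such a derivative-free solve is given below: bisection on the scalar equation G(T) = p, where the hypothetical routine `estimate_G` runs an MCMC chain at temperature T and returns a Monte Carlo estimate of G(T). Monotonicity of G (Proposition 2) guarantees that bisection keeps a bracket of the root; the direction of monotonicity is detected numerically here rather than assumed, and chain lengths grow as the bracket shrinks, in the inexact spirit described above.

```python
def solve_for_temperature(estimate_G, p, T_low, T_high, n_iters=20, n0=2000):
    """Derivative-free (bisection) solve of the scalar equation G(T) = p.

    `estimate_G(T, n_samples)` is a hypothetical routine that runs an MCMC chain of
    length n_samples at temperature T and returns a Monte Carlo estimate of G(T).
    Short chains are used in early iterations and lengthened near the solution.
    """
    n = n0
    increasing = estimate_G(T_high, n) > estimate_G(T_low, n)
    for _ in range(n_iters):
        T_mid = 0.5 * (T_low + T_high)
        g_mid = estimate_G(T_mid, n)
        if (g_mid < p) == increasing:
            T_low = T_mid    # root lies above T_mid
        else:
            T_high = T_mid   # root lies below T_mid
        n = min(2 * n, 64 * n0)  # tighten the MCMC estimate near the solution
    return 0.5 * (T_low + T_high)
```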

3.3 Post-processing stage

Having attained samples from (2), the sensitivities (3) may be estimated by evaluating at the sample points and forming the Monte Carlo estimator for the expectations in (3). In addition to computing these sensitivities, we may extract two other useful pieces of information, namely, the robustness of the sensitivities with respect to perturbations in the temperature and the parameter correlations. These are described in Subsections 3.3.1 and 3.3.2, respectively.

3.3.1 Robustness with respect to the temperature

As a result of the uncertainty in the determination of (choice of , computation of , solution of ), we analyze the robustness of the sensitivities with respect to . Consider the functions

; clearly . Theorem 2 gives the derivative of the sensitivity index with respect to the temperature , namely, .

Theorem 2.

If , , and
exist and are finite, then is differentiable with

(8)

where is the covariance operator.

Theorem 2 allows to be computed from the samples and function evaluations used to compute . For small , so the robustness of may be estimated without any further computational expense.

Since the magnitude of may depend on , it is useful to normalize for each when assessing robustness. We define

(9)

which may be plotted for to assess robustness. Since this is only a local estimate we suggest taking , reflecting a uncertainty about .
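
Theorem 2 expresses the derivative with respect to the temperature as a covariance computable from the existing samples and gradient evaluations. As an alternative route to the same robustness check, the sketch below re-estimates the indices at a perturbed temperature by importance reweighting of the samples already drawn; this is a swapped-in technique rather than formula (8), and it assumes the quadratic regularization and illustrative index form used in the earlier sketches.

```python
import numpy as np

def reweighted_indices(samples, grads, losses, T, T_new, alpha):
    """Re-estimate sensitivity indices at temperature T_new from samples drawn at T.

    Importance weights follow from the assumed Gibbs form (1) with a quadratic
    regularization: w(theta) proportional to
    exp(-(L(theta) + alpha*||theta||^2) * (1/T_new - 1/T)).
    Like the covariance formula (8), this reuses existing loss and gradient evaluations.
    """
    penalized = losses + alpha * (samples ** 2).sum(axis=1)
    exponent = -penalized * (1.0 / T_new - 1.0 / T)
    w = np.exp(exponent - exponent.max())   # stabilize before normalizing
    w /= w.sum()
    mean = w @ samples
    std = np.sqrt(w @ (samples - mean) ** 2)
    return std * (w @ np.abs(grads))        # assumed scale-invariant index form
```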

3.3.2 Extracting parameter correlations

Parameters are typically correlated, and the correlation information is a valuable complement to the sensitivity indices. For instance, if is sensitive to two parameters that are highly correlated, then it may be possible to remove one of them from since the other may compensate. In addition, the correlations may reveal parameter misspecifications in .

The strength and nature of the correlations in are typically not known a priori. Correlation coefficients may be computed from the MCMC samples and returned as a by-product of computing sensitivity indices. The Pearson correlation coefficient (PCC) is commonly used to measure correlations from sampled data. Other measures of correlation may be interchanged within our framework as well.
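
From the pooled, post-burn-in samples this is a one-line computation, for example:

```python
import numpy as np

def parameter_correlations(samples):
    """Pearson correlation matrix of the parameters.

    samples : (N, p) array of pooled, post-burn-in MCMC draws; rowvar=False treats
    columns as variables (parameters).  Other correlation measures could be substituted.
    """
    return np.corrcoef(samples, rowvar=False)
```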

3.4 Summary of the method

This subsection summarizes our proposed method. The method is divided into three algorithms, one for each stage described in Section 3.

Algorithm 1 performs the auxiliary computations of Subsection 3.1. Note that determining in line 2 is the only application-specific portion of the proposed method; user discernment is necessary to choose .

Algorithm 2 requires the user to specify the parameter from Subsection 3.2.1, the parameter from Subsection 3.2.2, and the number of MCMC samples . We suggest starting with and rerunning Algorithm 2 with a larger if the convergence results indicate that the PDF is heavy tailed. Hence may be viewed as a computed quantity rather than one specified by the user. As mentioned in Subsection 3.2.2, we suggest using . It may be considered fixed and the user only needs to change it if they have a specific purpose which requires giving more weight to “poor” parameter choices. The choice of may be difficult; however, more samples may be appended after an initial run so can be adapted without any wasted computation.

Algorithm 3 is a simple post-processing of the MCMC samples to compute sensitivity indices, robustness estimates, and parameter correlations. One may also perform convergence diagnostics on the MCMC estimators of , , along with Algorithm 3.

Algorithm 1 Auxiliary Computation
1: compute via some optimization routine
2: determine through visualization of model outputs
3: estimate via Monte Carlo integration

Algorithm 2 Sampling
1: function (, , )
2: compute using (5)
3: solve
4: run MCMC sampler to draw samples from (2)
5: store MCMC samples in a matrix
6: test convergence of the sampler
7: end function

Algorithm 3 Sensitivities, Perturbations, and Correlations
1: evaluate at points in
2: estimate ,
3: estimate ,
4: compute empirical correlation matrices from
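
To make the three algorithms concrete, the self-contained toy example below runs the full pipeline on a three-parameter loss in which two parameters are coupled and influential and one is nearly non-influential. All numerical choices (box constraints, temperature, chain length) and the scale-invariant index form are illustrative assumptions, not the settings of Section 4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy loss: theta_0 and theta_1 are coupled and influential, theta_2 is nearly
# non-influential.  The admissible set U is taken to be the box [-3, 3]^3.
def loss(th):
    return 0.5 * (th[0] - th[1]) ** 2 + 0.5 * th[0] ** 2 + 0.01 * th[2] ** 2

def grad_loss(th):
    return np.array([(th[0] - th[1]) + th[0], -(th[0] - th[1]), 0.02 * th[2]])

in_U = lambda th: np.all(np.abs(th) <= 3.0)

# --- Algorithm 1: preprocessing ----------------------------------------------
theta_star = np.zeros(3)                              # minimizer (found by an optimizer in general)
delta = 1.0                                           # user-chosen perturbation size
perturb = rng.uniform(-delta, delta, (2000, 3))
D = np.mean([loss(theta_star + p) for p in perturb])  # threshold, as in (4)

# --- Algorithm 2: sampling -----------------------------------------------------
alpha, T = 0.0, D        # no regularization needed here; T set crudely to the threshold
def log_density(th):      # log of the assumed unnormalized Gibbs density (1)
    return -np.inf if not in_U(th) else -(loss(th) + alpha * th @ th) / T

chain = np.empty((30000, 3))
theta, logp = theta_star.copy(), log_density(theta_star)
for n in range(len(chain)):      # plain random-walk Metropolis (adaptive MCMC in the paper)
    prop = theta + 0.6 * rng.standard_normal(3)
    logp_prop = log_density(prop)
    if np.log(rng.uniform()) < logp_prop - logp:
        theta, logp = prop, logp_prop
    chain[n] = theta
samples = chain[5000:]           # discard burn-in

# --- Algorithm 3: post-processing ----------------------------------------------
grads = np.array([grad_loss(th) for th in samples])
S = samples.std(axis=0) * np.abs(grads).mean(axis=0)  # assumed scale-invariant index
corr = np.corrcoef(samples, rowvar=False)
print("sensitivity indices:", np.round(S, 3))
print("correlation matrix:\n", np.round(corr, 2))
```

The third index comes out roughly an order of magnitude smaller than the first two, and a strong positive correlation between the coupled parameters appears in the correlation matrix, mirroring the kind of output reported in Section 4.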

4 Numerical results

In this section we apply the proposed method to two problems. The first is a synthetic test problem meant to illustrate the methodological details described in Section 3. The second is our motivating application where is a space-time hierarchical Gaussian process used for wind speed forecast [4].

4.1 Synthetic test problem

This synthetic problem validates the proposed method of GSA and illustrates its properties on a simple example with least squares estimation. We demonstrate the difficulty of MCMC sampling with heavy-tailed distributions and how the regularization factor (5) alleviates this problem.

Consider a space-time process governed by the function

(10)

where

with , , and

(11)

We draw samples from (10) on a uniform grid of , which gives data

A model parameterized in the same form as (10) is proposed, but the parameters are assumed to be unknown. They are determined by minimizing the least squares loss function

Analytic solutions for the sensitivities are intractable; however, we can validate our results by comparing them with our knowledge of the true model that generated the data. In particular, the relative importance of the parameters is clear by examining (10) and (11). We expect to be the most important parameter and and to be the least important parameters.

The proposed method is used with , , , and . Five independent chains are generated from overdispersed initial iterates using adaptive MCMC [39]. When , the MCMC sampler fails to converge because the tail of is too heavy. To illustrate this, Figure 1 shows the iteration history for the parameter in each of the five chains after a burn-in period is discarded. The two leftmost frames indicate that is heavy tailed; the other three chains never reach the tail. A heavy-tailed PDF such as this requires extensive sampling, which makes the reliable computation of sensitivity indices intractable. Therefore, we use regularization to alleviate this problem by increasing as we monitor the sampler’s convergence. We find that yields converged chains with samples. The chains are deemed convergent by using the potential scale reduction factor (PSRF) [46] as well as visualizing the iteration histories and histograms from each of the five chains.
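
For reference, a minimal implementation of the PSRF computation for a single scalar parameter is sketched below; it follows the standard Gelman-Rubin form and is assumed to match the diagnostic of [46] up to small-sample corrections.

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin) for one scalar parameter.

    chains : (m, n) array, m independent chains of length n (burn-in already removed).
    Values near 1 indicate that the chains are consistent with convergence.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)
```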

Figure 1: Iteration history for parameter . Each frame corresponds to an independent chain.

Plotting the iteration history of indicates that a burn-in of is sufficient. Then the remaining samples from the five chains are pooled together so that sensitivities and correlations may be computed from them. Figure 2 shows the sensitivity indices and Pearson correlation matrix computed from the pooled samples. These results are consistent with our expectations: is seen as the most important parameter and as the least important. Two primary blocks are seen in the correlation plot, representing the set of spatial variables and the set of temporal variables. Negative correlations are observed in the off-diagonal blocks since the spatial and temporal variables are multiplied by one another and hence are inversely related.

Figure 2: Sensitivity indices (left) and Pearson correlation coefficients of the parameters (right) for the synthetic test problem.

Figure 3 displays (9) plotted for , . The horizontal lines indicate that errors in determining are immaterial since the analysis would be unchanged by perturbing .

Figure 3: Sensitivity index perturbations for the synthetic test problem. Each line corresponds to a given parameter.

4.2 Analysis of a space-time Gaussian process

In this section we apply the proposed method to analyze the motivating statistical model [4]. The model aims at forecasting wind speed by fusing two heterogeneous datasets: numerical weather prediction model (NWP) outputs and physical observations. Let denote the output of the NWP model and denote the observed measurements. They are modeled as a bivariate space-time Gaussian process specified in terms of mean and covariance structures as follows,

(12)

where is the set of parameters that describe the shapes of the means and covariances. The model is expressed in a hierarchical conditional manner to avoid the specification of the full joint covariance in (12); indeed, the mean and covariance of the distributions and are specified in terms of time, geographical coordinates, and parameters from the numerical model (the land-use parameter). More precisely,

(13)

where is a spatial location, is time, represents a sum of temporal harmonics with daily, half-daily, and 8-hour periodicities, is a categorical variable that represents the land-use associated with location , and are the latitude and longitude coordinates.

(14)

, are temporal squared exponential covariances expressed as

where is the Kronecker delta, , , and are parameters to be estimated. are land-use specific terms, and is linear in the latitude and longitude coordinates and quadratic in time. The parameters , , , along with the parameters of , and , will be estimated during the maximum likelihood procedure. We will denote the collection of all these parameters by .

The conditional distribution is expressed through its mean and covariance:

where is written similarly to as a product of temporal harmonics and a linear combination of the latitude and longitude coordinates. is a projection matrix specified in time, latitude, longitude, and the land-use parameter. The covariance is parametrized with a similar shape to (14) with a different set of parameters. Parameters of these functions are denoted as in the following and will be estimated by maximum likelihood.

This model is fitted by maximum likelihood on the two datasets with respect to the parameters . The negative log likelihood of the model can be decomposed as

(16)

where and are the negative log likelihoods for the marginal distribution of and the conditional distribution , respectively. Since the model decomposes in this way, we will consider analysis of the parameters in and separately. Our dataset consists of 27 days of measurements from August 2012; details may be found in [4]. The parameter sensitivity during the first 13 days for and is analyzed in Subsection 4.2.1 and Subsection 4.2.2, respectively. Inferences are drawn from this analysis and validated using the later 14 days of data in Subsection 4.2.3.

4.2.1 Parameter sensitivity analysis for

In this subsection we apply the proposed method to determine the sensitivity of the marginal model for to its 41 parameters during a 13-day period. The sensitivities being computed are with respect to the parameters in , but for notational simplicity we will denote them by in this subsection.

In order to determine (line 1 of Algorithm 1), the L-BFGS-B algorithm is used to minimize . Visualizing the model predictions for various choices of yields (line 2 of Algorithm 1). Then (4) is estimated with Monte Carlo samples (line 3 of Algorithm 1). It returns an estimate with standard deviation 2; hence is considered to be sufficiently large. These steps complete the preprocessing stage by providing characteristic values for the parameters and the loss function .
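
A sketch of this optimization step, using SciPy's L-BFGS-B interface with placeholder loss and gradient routines, is shown below; the function and argument names are illustrative.

```python
from scipy.optimize import minimize

def find_theta_star(loss, grad_loss, theta_init, bounds=None):
    """Line 1 of Algorithm 1: find a characteristic minimizer of the loss.

    `loss` and `grad_loss` are placeholders for the negative log likelihood and its
    gradient; `theta_init` is any reasonable starting point, and `bounds` (a list of
    (low, high) pairs) can encode simple box constraints from the admissible set.
    """
    result = minimize(loss, theta_init, jac=grad_loss, method="L-BFGS-B", bounds=bounds)
    return result.x, result.fun   # minimizer and the corresponding loss value
```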

The PDF is found to be heavy tailed, so is chosen to reduce this effect. Then the equation is solved with by evaluating and manually updating . The solution is obtained. This converged in very few iterations because of the nice properties of the equation . Any other derivative-free nonlinear solver may be used in our framework; however, manual tuning is preferable in many cases because of the simplicity of the equation and the stochasticity of the function evaluations. Having determined and , the PDF from which we draw samples is now well defined.

Adaptive MCMC [39] is used with a desired acceptance rate of . Five chains of length each are generated independently from overdispersed initial iterates, and the first iterates are discarded as burn-in. The PSRF convergence diagnostic from [46] is used on and separately to assess the convergence of each. The PSRFs for all parameters lie in the intervals and , respectively. Other visual diagnostics are applied as well, along with comparing sensitivity indices from each of the chains. The sensitivity estimation appears to converge.

Figure 4 displays the sensitivity indices estimated from each of the chains. The five different colors represent the five different chains; their comparability demonstrates that MCMC estimation errors are negligible. The intercept terms in the mean and the covariance kernel parameters are the most influential. The terms parameterizing are less influential, particularly the quadratic temporal terms.

Figure 4: Sensitivity indices for . The five colors represent the sensitivity indices computed from each of the five chains.

As discussed in Subsection 3.3.1, the robustness of the sensitivities with respect to errors in may be estimated as a by-product of computing sensitivities. Figure 5 displays (9) plotted for , . Most of the curves are nearly horizontal, and those that are not horizontal display small variation that does not change the resulting inference. Thus the sensitivities are robust with respect to , and hence any errors made when determining are negligible.

Figure 5: Sensitivity index perturbations for . Each line corresponds to a given parameter.

As mentioned in Subsection 3.3.2, parameter correlation information is a useful complement to sensitivity indices. Figure 6 displays the empirical Pearson correlation matrix computed from the MCMC samples retained after removing burn-in and pooling the chains. Strong positive correlations are observed between the three land-use dependent spatial intercepts in the mean. Strong negative correlations are observed between the temporal range and the nugget term parameterizing the land-use specific covariance kernels . This correlation is expected since the nugget term represents the variance of the signal that is not explained by the exponential part.

Figure 6: Pearson correlation coefficients for the parameters of .

4.2.2 Parameter sensitivity analysis for

In this subsection we apply the proposed method to determine the sensitivity of the model for to its 54 parameters during the same 13-day period used in Subsection 4.2.1. The sensitivities being computed are with respect to the parameters in , but for notational simplicity we will denote them by in this subsection.

In a similar fashion to Subsection 4.2.1, the L-BFGS-B algorithm is used to determine (line 1 of Algorithm 1). Visualizing the model predictions for various choices of yields (line 2 of Algorithm 1). Then the threshold (4) is estimated with Monte Carlo samples (line 3 of Algorithm 1). It returns an estimate with standard deviation 2; hence is considered to be sufficiently large. These steps complete the preprocessing stage by providing characteristic values for the parameters and the loss function . One may note that and are significantly smaller for than for . Their difference is unsurprising since and model two different processes.

The PDF is found to be heavy tailed, so is chosen to reduce the tail of . Analogously to Subsection 4.2.1, is solved yielding (line 3 of Algorithm 2).

Adaptive MCMC [39] is used with a desired acceptance rate of . Five chains of length each are generated independently from overdispersed initial iterates, and the first iterates are discarded as burn-in. The convergence diagnostic from [46] is used on and separately to assess the convergence of each. The PSRFs for all parameters lie in the intervals and , respectively. Other visual diagnostics are applied as well, along with comparing sensitivity indices from each of the chains. A few sensitivity indices have not fully converged; however, the remaining uncertainty in their estimation is sufficiently small for our purposes. These uncertain sensitivities are among the largest in magnitude. Since our goal is encouraging model parsimony, precisely ordering the most influential parameters is of secondary importance.

Figure 7 displays the sensitivity indices estimated from each of the chains. The five different colors represent the five different chains. The sensitivities with greatest uncertainties are demonstrated by the differences in their estimated values in each chain; however, these discrepancies are sufficiently small that they do not alter our resulting inference. The longitudinal terms in the mean and are observed to have little influence since their sensitivity indices are nearly zero. The most influential parameters are the spatial weights in the matrix which acts on .

Figure 7: Sensitivity indices for . The five colors represent the sensitivity indices computed from each of the five chains.

As discussed in Subsection 3.3.1, the robustness of the sensitivities with respect to errors in may be estimated as a by-product of computing sensitivities. Figure 8 displays (9) plotted for , . Most of the curves are nearly horizontal, and those that are not horizontal display small variation that does not change the resulting inference.

Figure 8: Sensitivity index perturbations for . Each line corresponds to a given parameter.

As mentioned in Subsection 3.3.2, parameter correlation information is a useful complement to sensitivity indices. Figure 9 displays the empirical Pearson correlation matrix computed from the MCMC samples retained after burn-in and pooling the chains. Strong correlations are observed between the spatial weights in the matrix . Similar to , strong negative correlations are also observed between the temporal range and the nugget term of the land-use specific covariance kernels.

Figure 9: Pearson correlation coefficients for the parameters of .

4.2.3 Inference and validation of results

In some cases with mathematical models one may set a threshold and fix or remove all parameters whose sensitivity is below the threshold. This approach is not suitable for statistical models because the parameters must be understood in light of their contribution to the model structure and their correlation with other parameters. Using the results of Subsection 4.2.1 and considerations of the model structure, we determine that the model is insensitive to the temporal quadratic terms in the parameterization of . Similarly, coupling the results of Subsection 4.2.2 with the model structure, we determine that the model is insensitive to several of the longitude terms, specifically the longitude term in the parameterization of the mean and the nine longitude terms in the parameterization of . This conclusion confirms what one would expect from physics considerations. The flow is predominantly east-west, and thus the north-south correlation is relatively weaker. The east-west information is likely to be well captured by and hence is not needed in .

The 16 insensitive parameters are removed from the model, yielding a more parsimonious model that we refer to as the reduced model. To validate our inferences, we use the original model and the reduced model for prediction on the other 14 days of data we have available but did not use in the sensitivity analysis. Leave-one-out cross-validation is used to fit each of the models and assess their predictive capabilities.

We simulate 1,000 scenarios for each of the 14 days and use two metrics to quantify the predictive capacity of the full and reduced models, namely, the energy score and the root mean square error. The energy score [48, 49] is computed for the joint distribution of all spatial locations at each day; hence, 14 energy scores are computed. The root mean square error is computed as the square root of the time-averaged squared error at each spatial location; hence, there are 11 root mean square errors. Figure 10 displays the energy scores on the left and root mean square errors on the right. The reduced model has slightly smaller (and hence better) energy scores and root mean square errors in the majority of the cases. The sums of energy scores for the full model and reduced model are 121.1 and 120.7, respectively. The sums of root mean square errors for the full model and reduced model are 11.4 and 11.3, respectively.
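
For reference, sample-based versions of the two validation metrics are sketched below. The energy score is written in its standard ensemble form; whether this matches the exact estimator used in [4] is an assumption, and the array shapes are illustrative.

```python
import numpy as np

def energy_score(scenarios, observation):
    """Sample-based energy score [48, 49] for a multivariate forecast (lower is better).

    scenarios   : (m, d) array of simulated scenarios (here d = number of spatial locations)
    observation : (d,) observed vector
    """
    term1 = np.linalg.norm(scenarios - observation, axis=1).mean()
    diffs = scenarios[:, None, :] - scenarios[None, :, :]
    term2 = np.linalg.norm(diffs, axis=2).mean()
    return term1 - 0.5 * term2

def rmse(predictions, observations):
    """Root mean square error over time for a single spatial location."""
    return np.sqrt(np.mean((predictions - observations) ** 2))
```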

Figure 10: Left: energy scores for each of the 14 days being predicted; right: root mean square error for each of the 11 spatial locations being predicted. The full model is red and the reduced model is black.

To further illustrate the difference between the full and reduced models, we display the simulated scenarios for a typical case. Specifically, we take the spatial location with median root mean square error and six days with median energy score and plot the 1,000 scenarios along with the observed data. Figure 11 displays the results with the full model on the left and reduced model on the right.

Figure 11: Predictions using the full model (left) and reduced model (right) for 6 days at a fixed spatial location. The red curve is the observed wind speed and the grey curves are 1000 simulations generated from each model.

We have thus simplified the parameterization of the model from 95 parameters to 79. The reduced model has equal or better predictive capability and is simpler to fit and analyze. Further, the reduced model typically has fewer outlying scenarios, as evidenced in Figure 11.

For this application we observed a strong insensitivity of to some of its terms contributing longitudinal information. This is likely because the model captures longitudinal information well and hence the Gaussian process does not need to fit terms contributing longitudinal information. Thus, inferences may also be made on the data being input to the statistical model through the parameter sensitivities.

These inferences and simplifications are useful for multiple reasons. First, long-term weather prediction is difficult, so the parameters must be optimized frequently to accommodate changing weather patterns. Hence a large optimization problem must be solved frequently; reducing the number of parameters allows for faster and more robust optimization. Second, by removing unimportant parameters the model is more robust and can be more easily integrated into a larger workflow, namely power scheduling. Third, we are able to learn more about the underlying system through these inferences, for instance, the unimportance of longitudinal information.

5 Conclusion

A new method for global sensitivity analysis of statistical model parameters was introduced. It addresses the challenges and exploits the problem structure specific to parameterized statistical models. The method is nearly fully automated; one step depends on the user’s discretion, but this level of user specification is likely necessary for any alternative method. The proposed method also admits perturbation analysis at no additional computational cost, thus yielding sensitivities accompanied with certificates of confidence in them.

The method was motivated by, and applied to, a Gaussian process model aimed at wind simulation. Sensitivities were computed and the model parameterization simplified by removing 17% of the model parameters. This simpler model was validated and shown to provide equal or superior predictive capability compared with the original model.

Our proposed method has two primary limitations. First, it relies heavily on Markov Chain Monte Carlo sampling for which convergence diagnostics are notoriously challenging. Second, regularization may be needed to eliminate heavy-tailed distributions. Determining the regularization constant is simple in principle but may require drawing many samples to resolve. However, these limitations are classical and have been observed in various applications previously.

Global sensitivity analysis has seen much success analyzing parameter uncertainty for mathematical models. The framework presented in this paper provides the necessary tools to extend global sensitivity analysis to parameters of complex statistical models.

Appendix

This appendix contains the proofs for Proposition 1, Proposition 2, Theorem 1, Corollary 1, and Theorem 2.

Proof of Proposition 1

Let and . Then

By using Theorem 6.28 in [50] with dominating , we have that and are differentiable with

is differentiable since , , and applying the quotient rule for derivatives gives

Simple manipulations yield

Writing and using the linearity of the integral completes the proof.

Proof of Proposition 2

From the proof of Proposition 1 we have

Since

we have

Proof of Theorem 1

Let . Define

By Assumption I, is continuous so . Then such that

We want to show that

It is enough to show

If , then

hence it is enough to show that

Since the exponential is nonnegative and , it is equivalent to show

If , then

But

by our construction of .

Proof of Corollary 1

Since is bounded, . Mimicking the argument of Proposition 1, is differentiable and hence continuous at . Since , Theorem 1 gives such that . Then existence holds by the intermediate value theorem. Uniqueness follows from Proposition 2.

Proof of Theorem 2

Let

and

Then we have